[00:00:50] it is interesting that the ratios of hits have changed [00:01:08] from 40% each 200s and 404s to 40% 200s, 20% 404s and 20% 'other' [00:01:49] Jamesofur: heh, i was grepping for more errors and wondering why i get a different number of results just when sorting the output.. until i realized it is exactly 00:00 UTC ..and the cronjobs run fine now :) [00:02:29] LOL [00:03:31] I'm sure it'll be a day or two and then we'll get more random errors ;) [00:05:01] New patchset: Bhartshorne; "changing swift logtailer module to allow for new logging parameters to be appended to the proxy log line (as happened across the 1.4.3 -> 1.5 boundary)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21304 [00:05:03] paravoid: ^^^ [00:05:45] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21304 [00:06:13] ah, trivial enough [00:06:39] so, what's the 20% other? that's peculiar [00:06:53] when writing the thing I bucketed the response codes I was expecting to get. [00:07:01] I'm seeing a bunch of 499s now that I didn't before [00:07:54] "Client Closed Request" according to wikipedia. [00:08:03] http://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_Client_Error [00:08:05] that's only slightly ironic [00:08:52] I'd give you http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html but it doesn't go above 417. [00:09:02] 499 is not an HTTP code [00:09:10] haven't read that page, but I'm pretty sure [00:09:27] apparently it's one that nginx gives. [00:09:30] if it's client closed request it's probably a specific server extension [00:09:32] and swift too, it seems. [00:09:33] ah right [00:09:54] 20% client closed request? isn't that a lot? [00:10:02] Jamesofur: http://meta.wikimedia.org/w/index.php?title=Planet_Wikimedia&diff=4062382&oldid=4062279 [00:12:02] mutante: perfect thanks, I think I'm going to be heading home soon but will go through all of those and commit from there [00:12:03] huh. [00:12:21] Jamesofur: i can do it, just following your example [00:12:28] I wonder if that's caused by the way the rewrite.py hands off the request and the fact that the proxy-logging module is below rewrite. [00:12:29] like commenting instead of removing [00:12:49] paravoid: I did get a suggestion that we invert that order (the pipeline in the proxy config) to put the proxy-logging module before rewrite rather than after. [00:14:01] mutante: ahh perfect thanks, yeah, I'm thinking of commenting them out for at least 5-6 months since we don't know if the site is just having temporary issues etc. [00:14:41] paravoid: a comparison of response codes over 10,000 lines on ms-fe1 and 4: http://pastebin.com/wR3EYSFx [00:14:55] each column is count, code pairs. [00:15:37] (that's the output of tail -n 10000 /var/log/syslog | cut -f 12 -d\ | sort | uniq -c ) [00:25:16] maplebed: btw, I have a meeting I can't postpone on Tuesday morning... [00:25:51] 10:30 your time, so I'll be available before. [00:26:49] ok, np. [00:38:23] New patchset: Dzahn; "remove more broken feed URLs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21306 [00:39:06] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21306 [00:39:09] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21306 [00:41:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:43:12] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [00:46:12] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:46:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [00:51:45] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [00:53:29] New review: Jalexander; "\o/ looks good" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21306 [01:12:27] New patchset: DamianZaremba; "Making gitdir configurable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21307 [01:13:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21307 [01:16:22] New patchset: DamianZaremba; "Making gitdir configurable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21307 [01:17:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21307 [01:17:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:22:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.636 seconds [01:40:48] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 252 seconds [01:40:57] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 261 seconds [01:47:24] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 647s [01:48:29] about to do one last scap per a request from Erik [01:57:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:58:03] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [01:58:57] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 9 seconds [02:00:00] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [02:07:21] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [02:07:21] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [02:07:21] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [02:07:21] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [02:10:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [02:13:21] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [03:07:48] RECOVERY - Puppet freshness on snapshot4 is OK: puppet ran at Fri Aug 24 03:07:33 UTC 2012 [03:28:21] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [03:37:21] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [03:37:21] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [06:24:08] ok so the sql script is broken on bast1001 because of course there is no /home/wikipedia/anything [06:24:41] I assume a lot of crap 
doesn't work over there because of that. what was our approach to that going to be? [06:45:17] New review: Nikerabbit; "Sorry I can't make head or tails from the commit message." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21283 [07:00:15] New patchset: preilly; "BREW Public IP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21313 [07:01:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21313 [07:01:18] paravoid: can you approve and merge https://gerrit.wikimedia.org/r/#/c/21313/ [07:10:42] Ryan_Lane: go to bed [07:10:55] Ryan_Lane: or, approve and merge https://gerrit.wikimedia.org/r/#/c/21313/ [07:41:33] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [07:50:34] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [08:06:26] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [08:06:26] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [08:06:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:06:27] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [08:06:28] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [08:06:28] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [08:06:29] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [08:06:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on cp1002 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [08:38:23] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [08:38:23] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours [08:38:24] PROBLEM - Puppet freshness on mw36 is CRITICAL: Puppet has not run in the last 10 hours [08:38:24] PROBLEM - Puppet freshness on mw57 is CRITICAL: Puppet has not run in the last 10 hours [08:38:25] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours [08:38:25] PROBLEM - Puppet freshness on mw54 is CRITICAL: Puppet has not run in the last 10 hours [08:38:26] PROBLEM - Puppet freshness on srv258 is CRITICAL: Puppet has not run in the last 10 hours [08:38:26] PROBLEM - Puppet freshness on sq62 is CRITICAL: Puppet has not run in the last 
10 hours [08:38:27] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [08:38:27] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [08:38:28] PROBLEM - Puppet freshness on sq54 is CRITICAL: Puppet has not run in the last 10 hours [08:38:28] PROBLEM - Puppet freshness on virt6 is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on cp1012 is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on lvs1 is CRITICAL: Puppet has not run in the last 10 hours [08:39:26] PROBLEM - Puppet freshness on db1026 is CRITICAL: Puppet has not run in the last 10 hours [08:39:26] PROBLEM - Puppet freshness on mw35 is CRITICAL: Puppet has not run in the last 10 hours [08:39:27] PROBLEM - Puppet freshness on sq85 is CRITICAL: Puppet has not run in the last 10 hours [08:39:27] PROBLEM - Puppet freshness on es1001 is CRITICAL: Puppet has not run in the last 10 hours [08:39:28] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [08:39:28] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours [08:39:29] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours [08:39:29] PROBLEM - Puppet freshness on srv269 is CRITICAL: Puppet has not run in the last 10 hours [08:39:30] PROBLEM - Puppet freshness on db1050 is CRITICAL: Puppet has not run in the last 10 hours [08:39:30] PROBLEM - Puppet freshness on virt5 is CRITICAL: Puppet has not run in the last 10 hours [09:13:12] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [10:08:25] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 183 seconds [10:08:25] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 186 seconds [10:08:43] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 198 seconds [10:08:52] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 206 seconds [10:21:55] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [10:22:13] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [10:23:07] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [10:23:25] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [10:41:07] apergos: ping [10:41:22] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21313 [10:41:23] paravoid: poooonnngggg [10:41:35] hey, sorry [10:41:49] oh? what did you do to be sorry for? [10:41:50] I was up and talking with people until 4am :/ [10:41:55] ah no worries [10:42:05] the only thing is that at 6 pm I gotta leave [10:42:14] I can be back on later but I'm not sure when [10:43:13] so this is "merge the proxy specific changes to puppet", "pull ms-fe1/2 from pool", "push puppet changes" ? 
[10:44:19] preilly: merged [10:44:39] paravoid: thanls [10:44:48] tnanks even [10:44:57] thanks damn [10:45:37] apergos: I was thinking about the "merge the proxy specific stuff to puppet" [10:45:57] see, we'd need to push packages to the repo too [10:46:06] and the "swift" package is shared among proxies and backends [10:46:20] grrr [10:46:25] well that's just peachy [10:48:15] the stanza says "ensure present" right now, right? [10:48:24] I mean one could manually do the packages on the proxies [10:48:33] then force a puppet run for the rest [10:49:51] we could [10:51:09] I'm not too excited about any of our options [10:52:04] well what do you prefer? [10:52:58] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [12:03:51] New patchset: Faidon; "vumi: add smpp_enquire_link_interval to TATA SMS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21320 [12:04:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21320 [12:04:55] New review: Jerith; "Looks good." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/21320 [12:05:46] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21320 [12:08:30] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [12:08:30] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [12:08:30] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [12:08:30] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [12:14:30] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:30:53] apergos: so, I say let's do it manually to all of them now [12:30:59] ok [12:31:01] I've recorded all the steps carefully, so I'll apply them to puppet [12:31:32] If I thought we could actually test via puppet in any meaningful way I'd have more to say about it [12:31:53] but given the package thing on top of everything else, it wouldn't be worth much [12:32:02] indeed [12:32:29] !log depooling ms-fe1/2 for the 1.5 upgrade [12:32:36] I'm on to my second round of deletes on ms5 (which will take at least a day) [12:32:39] Logged the message, Master [12:32:48] let's look at some graphs [12:35:15] this feels so wrong [12:35:20] ? [12:35:29] upgrade on a friday [12:35:48] if it was the sf timezone I would agree [12:35:58] but luckily for us we are just a few hours ahead of them:-D [12:40:36] !log on ms5, running from screen session as root: tossing non-standard thumb sizes > 100 px for commons/x/xx to see what space that gives us [12:40:46] Logged the message, Master [12:40:47] shoulda logged that earlier [12:41:13] we never used to care about when doing maintenance or upgrades [12:41:31] i'd just as easily do them on fridays, saturdays, sundays, or whenever I felt like it [12:41:39] heh [12:41:41] s/we/you/ :-P [12:41:52] I don't mind picking up the pieces on the weekend [12:41:52] there was noone else anyway [12:42:01] I'm more worried about paging/worrying everyone else [12:42:13] ariel has been around for some time but doesn't count, always online and watching anyway :P [12:42:25] http://isitreadonlyfriday.com/ [12:42:29] you're in for a surprise this weekend then :-P [12:42:34] i don't mind [12:48:04] still a lot of traffic on ms-fe1/2 [12:53:39] I see none [12:53:41] so, proceeding.
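For reference, a minimal sketch of the manual-upgrade path being discussed (depool, upgrade the packages by hand since puppet's "ensure present" won't pin or downgrade a version, spot-check, then repool); the package names below are assumptions about the Ubuntu swift packaging of the time, not taken from the session:

    # on each depooled proxy (ms-fe1, then ms-fe2):
    apt-get update
    apt-get install swift swift-proxy python-swift   # assumed package set for the 1.5 proxies
    swift-init proxy restart
    # spot-check before repooling, e.g. with lwp-request as used later in the session:
    GET -Used http://ms-fe1.pmtpa.wmnet/wikipedia/en/thumb/0/03/Homelandposter.jpg/220px-Homelandposter.jpg

Because the puppet stanza only says "ensure present", a later puppet run should leave the manually upgraded package alone rather than fight it.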
[12:54:10] well testat shows a lot of established conns [12:54:13] Netstat [12:54:55] that's the swift->memcached ones [12:55:20] if you grep for port 80 you'll see only a few from the LVS servers, which I guess is the pybal idle connection [12:55:39] oh memcached [12:55:40] fine [12:56:04] go to town [12:56:21] sorry? [12:56:31] = feel free, have at [12:56:43] ah [12:59:08] why does ganglia totally lie about cpu load on ms-fe3 (as an example)? that host is bored, I should know cause I'm on it [12:59:28] ah because I can't read, nm [12:59:31] :-/ [12:59:46] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: Connection refused [12:59:55] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [13:00:50] all done [13:01:00] yes, I see the changes on both hosts [13:01:16] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.010 seconds [13:01:25] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.010 seconds [13:01:28] both tested and work too [13:01:40] do a check if you want to, and I'll repool them [13:01:41] that was fast (the testing) [13:02:21] how are you testing them? [13:02:24] doing s/ms-fe4/ms-fe1/ in your .bash_history helps :P [13:02:31] :-D [13:02:39] GET -Used http://ms-fe2.pmtpa.wmnet/wikipedia/en/thumb/0/03/Homelandposter.jpg/220px-Homelandposter.jpg [13:02:42] e.g. [13:02:46] ah [13:03:50] gah, ms-fe4 is .214 but ms-fe1 is .210 [13:04:11] oh noes, they're all wrong [13:04:33] ?? [13:04:39] nevermind me [13:06:04] verified that it logs properly too [13:06:42] so, ack for repooling? [13:07:46] yeah, I didn't do a particularly comprehensive test but a few random things on each [13:07:48] so go ahead [13:08:17] I wonder if we should write a bit more complicated pybal test [13:08:24] as to let pybal do the test for us for free [13:08:36] !log repooling ms-fe1/2 with all new swift [13:08:46] Logged the message, Master [13:09:21] traffic flowing [13:12:18] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [13:12:32] looks ok [13:13:01] these "499"s worry me [13:13:47] interesting aren't they? [13:22:35] apergos: so. [13:23:03] what should we do next I wonder [13:23:07] ms-be upgrade? [13:23:32] well I'd like to wait a half hour to make sure nothing weird crops up [13:23:43] and I am mindful of my deadline of being out the front door at 6 [13:24:02] I don't mind staying [13:24:41] do we think we can get one done in the hour we'll have available? [13:25:16] sure [13:25:38] ok, well we can do that [13:25:44] say at 5? [13:28:08] sure [13:28:10] looks trivial enough [13:28:21] really trivial [13:28:22] no good pattern to these 499s (as far as urls or internal/external requests), which sucks [13:28:52] how do we test those? same way? [13:29:28] not really.
I was thinking grep URLs off the logs and try those [13:29:37] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [13:34:48] guess that should get documented before ben leaves [13:38:37] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [13:38:37] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [13:43:58] okay, while waiting I'm going to tcpdump and find what those 499s are [13:45:13] sure [13:45:20] I mean we see em in the logs on the hosts [14:19:54] ottomata: I am onsite now, going to poke at analytics1023 [14:20:06] well, just got here, getting setup, and going to work on that [14:21:13] mark: Once I finish working on ottomata's analytics servers, we can work on the network stuff [14:21:16] k [14:21:26] mark: or if i hit a wall and it will take too long, i put off analytics a bit while you are still around and working [14:21:38] and return to it once its late your evening [14:21:51] will know shortly. [14:27:59] !log stopping puppet on brewster to do a local nonpuppetized test change [14:28:09] Logged the message, RobH [14:30:20] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:30:51] New patchset: Umherirrender; "(bug 34386) Enable e-mailing password based on e-mail address" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21326 [14:30:57] we need a cmjohnson2 [14:31:41] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [14:31:59] PROBLEM - MySQL disk space on db22 is CRITICAL: DISK CRITICAL - free space: /a 17121 MB (3% inode=99%): [14:32:43] mark: so there are no links going to csw1 from the cwdm [14:33:08] hostway to cwdm, fiber to msw1 and fiber to cr1 [14:33:54] so there are two cross-floor fibers, right [14:34:03] ARGH, where the hell is my usb memory stick. [14:34:08] one is the old one used for production [14:34:12] and one is the one always used for management [14:34:19] * RobH checks every single server to see if he left it someplace plugged in [14:34:24] the production one is between cr2-pmtpa and csw1-sdtpa, correct? [14:34:29] no CWDM in between [14:35:10] no [14:37:03] huh [14:37:14] I see cr2-pmtpa is now connected directly to cr1-sdtpa on xe-0/0/1 [14:39:41] mark: is that going through csw1? [14:39:50] not according to the description [14:39:54] but i have no idea what you guys did [14:40:52] I -think- [14:40:59] RECOVERY - MySQL disk space on db22 is OK: DISK OK [14:41:10] cr2-pmtpa:xe-0/0/1 is connected via the CWDM system to cr1-sdtpa:xe-0/0/1 [14:41:15] so, the hostway fiber is going into the cwdm [14:41:21] "the hostway fiber", which one is that [14:41:23] there are two [14:41:25] or more [14:41:36] depending on if you count the transits [14:41:45] do you have a fiber nr? [14:41:47] morning paravoid, apergos. [14:41:52] looks like you had a good time this morning! [14:42:16] morning [14:42:25] fzr15802100f [14:42:28] yes, a quiet little upgrade [14:42:39] paravoid was looking at the 499s by doing some tcpdumping [14:42:55] cmjohnson1: that's not our cable id format [14:42:55] cmjohnson1: is that ptbthbth in hex? [14:42:58] I had a gander through the 1.4.3 and 1.5.0 code but didn't see much on a quick pass through that would explain the change [14:43:13] I'm curious about them too. [14:43:19] is there no 4 digit nr on it?
[14:43:22] in about 15 mins I'm going to have to take off unfortunately, though I will be back later tonight and will check in [14:43:33] 0004 [14:43:40] did you put that on there? [14:43:43] I asked in the swiftstack channel (as you'll see) in case it was a known issue but no dice [14:43:47] like all our other fibers? [14:43:58] apergos: they're all san francisco time. [14:44:12] ok, well I'll just let the scrollback sit there then [14:44:22] no most of these are not labeled… i know the fpl and xo fibers now because of our outage [14:44:30] oh, you did get a response, just not an answer. [14:44:31] nevermind. [14:44:33] the backend upgrade looks pretty straightforward with one exception [14:44:38] not sure how you test the backends specifically [14:44:44] the other hostway is going to csw1 16/4 [14:44:49] cmjohnson1: can you trace where this fiber is going, and make sure EVERY fiber has a unique cable like all our other cables? [14:44:51] like a given one of them [14:45:09] cmjohnson1: right [14:45:14] they are all unique [14:45:15] apergos: I was able to test on the test clusters because I could upgrade a majority then watch requests succeed. [14:45:25] oh [14:45:32] you can look at some of the objects stored on a specific back end and test with those objects... [14:45:39] hmm well we wanted to upgrade one, not toss it back in yet, and figure out how to test it [14:45:46] cmjohnson1: so can you give me our cable nrs then? [14:45:50] then I can put them in the devices [14:45:58] for example, what nr is on xe-0/0/0 [14:46:02] on cr2-pmtpa [14:46:05] apergos: that you can't do. they're not conveniently behind a load balancer like the proxies. [14:46:22] I mean, I suppose you could make connections directly to port 6000/6001/6002 [14:46:23] that's supposed to be the fiber going to csw1-sdtpa:e16/4 [14:46:37] hmm [14:46:47] but as soon as you start the processes, it'll be 'back in'. [14:47:12] uh [14:47:29] xe 0/0/0 is # 6005 and is going to csw1 14/3 [14:47:29] so there is no way to test it without it getting production requests? [14:47:52] I suppose you could put up a firewall that blocks access from the rest of the cluster... [14:48:07] !log analytics1023-1027 rotating down for removal of extra nic [14:48:16] Logged the message, RobH [14:49:08] mark: xe-0/0/1 is 6001 goes to c2 on the cwdm [14:49:27] xe-0/0/2 is the xo link [14:50:22] i believe it is xe-1/0/1 6001 going to 16/3 [14:50:46] 1/0/2 is fpl [14:51:19] cmjohnson1: ok [14:52:44] um so ... our objects are replicated to two other backends, right? and when a node drops off, there's some sort of replication that happens to account for that node being down, is that manual? [14:53:10] cmjohnson1: there is no xo link in pmtpa [14:53:15] anymore [14:53:25] according to my info, xe-0/0/2 is one of the two hostway transits [14:53:30] can you confirm? and let me know what cable id? [14:53:34] i am in sdtpa [14:53:49] oh [14:53:52] darn [14:53:56] then i'm working on the wrong router [14:54:05] so everything you just mentioned was for the ports on cr1-sdtpa? [14:54:10] yes [14:54:17] ok [14:54:18] sec [14:54:50] huh [14:55:01] you mentioned #6001 twice [14:55:07] on xe-0/0/1 and on xe-1/0/1 [14:55:40] xe-0/0/1 is 6000 [14:55:42] sorry [14:55:44] ok [14:56:38] what's the cable id for the XO link?
[14:56:41] on xe-0/0/2 [14:57:34] we don't have our number system on that link [14:57:39] !log restarted puppet on brewster [14:57:48] Logged the message, RobH [14:57:51] but i do have sr1825716 [14:58:41] please put our own number on, like #6001 [14:58:55] sr1825716 is some other company's id [14:59:08] probably equinix [14:59:26] it is their label [14:59:55] back in a while. [14:59:58] yeah, so put our own on and let me know what the nr is please [15:00:40] 2162 will be the number [15:01:04] thanks [15:01:36] and can you tell me what the nr is on port xe-1/1/0? that's an FPL fiber [15:02:09] that is going to be 2163 [15:03:11] ok [15:03:35] while we are at it… 2164 will be the fpl link on csw1 13/1 [15:04:01] ok [15:04:22] here's the summary for cr1-sdtpa: [15:04:23] Interface Admin Link Description [15:04:23] xe-0/0/0 up up Core: << csw1-sdtpa:e14/3 {#6005} [10Gbps DF] [15:04:23] xe-0/0/1 up up Core: << cr2-pmtpa:xe-0/0/1 {#6000} [10Gbps CWDM] [15:04:23] xe-0/0/2 up up Transit: xe-0/0/3 down down EMPTY [15:04:24] xe-1/0/0 up up Core: << csw1-sdtpa:e16/3 {#6001} [10Gbps DF] [15:04:25] xe-1/1/0 up up Core: << cr2-eqiad:xe-5/2/1 (FPL/Level3, CV71028) {#2163} [10Gbps wave] [15:06:46] matches everything i see [15:06:50] cool [15:06:58] want to check upstairs as well? [15:07:14] you say, the other fpl link is on csw1-sdtpa 13/1 [15:07:17] that port doesn't exist I believe [15:07:32] 14/1 [15:07:32] yeah I want to have this fully correct [15:07:35] these are essential links [15:07:43] and this stuff being incorrect could be disastrous during an outage [15:07:48] fpl is 14/1 on csw1 [15:07:58] right [15:08:02] thanks [15:08:08] yes, you can go upstairs, and we'll audit there as well [15:13:30] New review: Dereckson; "Commit message issue." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/17503 [15:14:29] mark: so 1/1/0 is down but is connected on cr2 it is a hostway fiber [15:14:54] ok, it's not labeled at all... [15:15:00] any idea what it is? [15:15:55] i don't recall.. i remember it going in… but not sure for what now… i will find out [15:25:17] New patchset: Umherirrender; "(bug 34386) Enable e-mailing password based on e-mail address" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21326 [15:28:57] mark: psw1 is connected to cr2 via fiber # 2626 to cr2 5/2/3 [15:29:06] thanks [15:29:08] though the link is orange [15:29:14] so not setup, but it was alrady there [15:29:15] already even [15:30:22] and it's on xe-0/1/2 on the other side? [15:30:57] its the second 10g, so yep, starts on 0/1/0 right? [15:31:05] 0/1/0-3?
[15:31:25] i think that's 0/1/2 yes [15:31:29] also, all the uplinks from cr1/2 to row c show green on the row c side, so must be software stuff [15:31:32] then yep [15:31:39] ok [15:31:45] the row C uplinks don't have cable ids set in the routers [15:31:48] so I'd like to have those as well [15:31:54] i can snag right now, brb [15:35:44] asw-c1-eqiad:0/1/0 1984:0/1/2 2826 asw-c8-eqiad:0/1/0 2808:0/1/2 2827 [15:35:47] mark: ^ [15:36:42] I am going to go ahead and label the new cross-connect 2976 [15:36:44] 0/1/0 doesn't sound right [15:36:51] ah that's 1/1/0 of course [15:37:03] sorry, i thought it was 0 for the addon, its 1 [15:37:13] so s/0/1 for all the first part of each string [15:37:45] you know, that notation really doesn't make it any less confusing ;-) [15:38:01] heh, which notation =P [15:38:10] the incorrect ones, or the correction ;P [15:38:48] the new cross connect is 2953 (not 2976) [15:38:57] define new cross connect [15:39:00] what port should I go to on cr1? [15:39:05] it's in the ticket [15:39:08] the one that you are having me put in for next week, oh [15:39:08] ok [15:39:23] found it. [15:39:41] mark: Need any other networking items I should address before I do this? [15:39:55] I realize its getting later in the day for you there, so I don't wanna keep you about waiting on me. [15:40:01] don't worry [15:40:15] so one uplink for row C is down [15:40:17] disabled [15:40:21] do you remember why? [15:40:24] nope [15:40:25] perhaps it had a dirty fiber or optic or so [15:40:34] which one, i can pull and wipe [15:41:02] huh [15:41:03] now they're up [15:41:18] sec, lemme check [15:42:13] ah I see it [15:42:27] it's a config issue indeed [15:43:33] actually [15:43:40] those ids you gave above are not really readable [15:43:45] which id is which? [15:44:09] so in c1 [15:44:22] the first uplink is 1984 and the second is 2826 [15:44:39] so thats 1/1/0 and 1/1/2 ? [15:44:47] ah right [15:45:13] so then in asw-c8-eqiad in the first upload is 2808 and then 2827, so thats 8/1/0 and 8/1/2 ? [15:45:28] (i forgot the first number changed based on which switch in stack) [15:45:40] uplink even, not upload, sorry [15:46:04] That is how it works, yes? [15:47:11] yeah [15:47:17] the first number is basically the rack number [15:47:19] starts with 1 [15:47:25] only because we don't assign a switch 0 in that case [15:47:30] (but normally it starts from 0) [15:47:59] so asw-c3-eqiad:ge-0/7 is the same as asw-c-eqiad:ge-1/0/7 [15:48:16] and 0 is the 24 or 48 copper ports, 1 is the fiber uplinks [15:49:48] RobH: do you have the id of the fiber (to psw1) on xe-5/2/3 on cr1-eqiad for me? [15:52:11] hi paravoid [15:52:14] bah. [15:53:20] hrmph. well, good enough. [15:54:18] hi [15:54:51] so... 499s and backends! [15:55:08] trying to correlate pcap with logs [15:55:19] mark: psw1:0/1/0 is a 1m fiber with # 2627 [15:55:20] haven't found a match yet. [15:55:34] where's your .cap file? [15:55:52] on my computer [15:56:10] can you throw a copy up somewhere? I'll look too? [15:56:17] RobH: and that's the ex4200-24F, correct? [15:56:32] or I'll just make my own... [15:56:41] it shouldn't take too long to catch a 499, right? [15:56:45] yeah, better to make your own, will increase our chances [15:56:51] you won't see a 499 on the tcpdump [15:56:57] why not? 
[15:57:00] since it's not a real http response code as I was saying yesterday [15:57:06] it's something that's only logged locally [15:57:20] I'm trying to find a logline that contains 499 and correlate it with the request [15:57:24] and see what happened with that request [15:57:28] mark: confirmed, i assume the 24F denotes 24 sfp ports? [15:57:31] cuz thats what it has [15:57:32] yes [15:57:35] gotcha. [15:57:36] and [15:57:37] k. [15:57:42] there's also "psw2-eqiad" [15:57:45] is that an EX4500? [15:58:00] mark: the cross-connect is run, putting the label # in ticket and resolving. that is correct, psw2-eqiad is a 4500 [15:58:07] that needs to have an RMA for some packet loss issue [15:58:09] does it have any ports connected? [15:58:16] if not, we can disconnect it now [15:58:21] yes to 5/3/3 [15:58:22] it's meant to be a spare [15:58:24] yes [15:58:26] but no others right? [15:58:30] thats all it has [15:58:32] ok [15:58:37] you can unplug 5/3/3 then [15:58:37] that and a mgmt link [15:58:44] leave mgmt link right? [15:58:47] yes [15:59:11] maplebed_: "ssh ms-fe1.pmtpa.wmnet tcpdump -i eth0 -pn -c 5000 -U -s 0 -w - host 10.2.1.27 | wireshark -k -i -" is what I do. [15:59:20] mark: done [15:59:31] but I run a linux desktop, so ymmv [15:59:35] xe-5/0/3 up up << pfw1-eqiad:xe-6/0/0 [15:59:39] which cable id is that? [15:59:50] what's the -pn and -c 5000 and -U? I understand the rest. [16:00:07] 2953 [16:00:16] tnx [16:00:20] mark: 2953 in cr1:5/2/2 [16:00:26] -p = non-promiscuous, -n = do not resolve, -c = stop after capturing N packets, -U = unbuffered, needed for the pipeline [16:00:33] wait [16:00:37] 2953 in 5/2/2 [16:00:42] 5/2/2 is the new run you just did [16:00:49] yes [16:00:54] yes, sorry [16:00:58] I was asking for 5/0/3 [16:01:00] 5/0/3 checking [16:01:06] I also see the multiple HEADs [16:01:10] that really really sucks [16:01:20] it's HEAD -> 404, HEAD -> 404, HEAD -> 404, HEAD -> 404 for the same URL [16:01:25] cool. I'm just going to -w /tmp/blah and look locally. [16:01:33] cr1:5/0/3 label is 2952 [16:01:41] thanks [16:01:49] any others you dont have on cr1? [16:01:54] no that's all [16:01:58] need any off cr2? [16:02:05] nope, got all those [16:02:10] except the new ports for the netapps [16:02:14] but those aren't run yet ;) [16:02:21] xe-5/3/2 on both [16:02:29] oh [16:02:36] hrmm, i will poke cdw on that. [16:02:37] could you tell me which ports are actually copper sfp+? [16:02:46] then I won't call those "dark fibers" ;) [16:03:39] on cr1 just 5/0/0 [16:04:05] on cr2 5/2/0 [16:04:09] thats it on the cr1/2 [16:04:21] then all the DAC for memcached and such [16:04:26] but i assume you meant just cr1/2? [16:04:28] yeah don't need those [16:04:29] yep [16:04:39] 5/2/0? really? [16:04:45] that's a link between the two routers [16:05:24] im having a bad day [16:05:26] (tired) [16:05:29] 5/1/0 [16:05:34] sorry =P [16:05:36] ok [16:05:42] that makes sense [16:05:51] its to the access switch in its own rack [16:06:19] I am going to clean this place up and get it organized, then head home and handle the netapp followup and the caching center server quotes [16:06:37] unless you have more networking stuff you need, then i stay [16:06:40] =] [16:06:54] (otherwise im too damned tired to be in here futzing around) [16:07:19] ehm [16:07:26] can you tell me if there's a link from cr2 to frack? [16:07:33] i can, lemme check [16:07:48] maplebed_: I see 7 HEADs in a row for a file that doesn't exist [16:07:50] that's *bad* [16:08:10] I agree, but it's also been that way.
so it's not relevant to what we're doing now. [16:08:10] mark: so frack has the normal access switch, which is just pretty much a waste in it ;] then it has pfw1/2 [16:08:16] each pfw has a single fiber to it [16:08:26] also aaron's been looking at that and I'm confident will find the spot that's doing it. [16:08:28] i recall one to each cr but lemme compare the #s [16:08:38] maplebed_: yes agreed, not relevant for what we do, just noticing. [16:09:10] did you see the pastebin I mailed about this morning? [16:09:27] it's ... illuminating to watch a single file move go by... [16:09:30] no? [16:09:54] http://pastebin.com/anpNn3tZ [16:10:05] mark: pfw1 to cr1:5/0/3 is 2952, pfw2 to cr2:5/0/3 is 2954 [16:10:17] and that's actually a failed move, not a successful one. when it succeeds there are 3 more HEADs to the new location. [16:10:27] yeah [16:10:52] RobH: to which port on pfw2? [16:10:54] mark: its on the line card 6 on the pfw1/2 [16:10:56] hrmm [16:11:09] 6/0? [16:11:14] 6/0/0 i think [16:11:27] seems 0 is shared between a sfp port or a rj45 port [16:11:41] so on pfw1/2 6/0/0 is to cr1/2 [16:11:52] and 6/0/1 is to each other (pfw1/2) [16:12:17] also 0/0/1 is cat6 to one another [16:12:35] ok [16:12:40] thanks [16:12:42] and ports 12-16 on the line card in slots 2/4 [16:12:45] (if it matters) [16:12:54] prolly not but its crazy wired ;] [16:13:28] paravoid: if you're interested, I have a .cap file for an individual GET/499 request on ms-fe1 (and the associated log file) [16:13:48] and I bet I know what it's about. [16:14:19] paravoid: ms-fe1:/tmp/499only.{cap,log} [16:15:20] looking [16:16:15] it's polluted with other stuff because it looks like the squids reuse connections, but the one in the log line is in there. [16:17:53] the get is about the 30th packet in the .cap. [16:18:19] yeah I found it [16:18:29] it's a normal 404, I don't understand why that's logged as 499 [16:19:22] mark: anything else? [16:19:27] nope [16:19:27] that's it [16:19:37] it's not just that the thumb didn't exist; the original also didn't exist. [16:19:37] cool, i just cleaned up the cage, moved a bunch of crap to the storage [16:19:41] having storage is awfully nice. [16:20:15] headed home, back online shortly [16:20:40] I'm going to keep looking through that .cap. if you want this pair, all the entries in 499.log exist within the 499.cap file. [16:20:42] cmjohnson1: so, can we audit the connections on cr2-pmtpa now? [16:20:50] mark: was a link to as13680 but they want to make it a redundant uplink for us [16:21:10] xe 1/1/0 [16:21:17] cr2 [16:21:24] ok [16:21:28] so it will no longer be used? [16:21:39] I think this will need to be plugged in to the CWDM [16:21:43] not right now… so no need to enable it [16:21:45] paravoid: one thing I'm not sure I've told you before - when the thumb doesn't exist but gets generated, swift logs a 404 but sends the generated thumb back to the client with status 200. [16:21:49] on a free channel [16:21:52] can you do that? [16:22:11] so it will be interesting to check to see if all the 404s are thumbs that are generated and the 499s are thumbs with missing originals.
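For reference, a sketch of the capture-and-correlate workflow being used here; the status code sitting in field 12 of the syslog line is an assumption inferred from the "cut -f 12" used earlier, and the paths are illustrative:

    # capture frontend traffic on the proxy while 499s are being logged
    # (same flags maplebed uses above):
    tcpdump -i eth0 -pn -s 0 -w /tmp/499.cap port 80
    # pull out the proxy log lines whose status field is 499:
    awk '$12 == 499' /var/log/syslog > /tmp/499.log
    # then open the capture and chase one of the URLs from 499.log:
    tshark -r /tmp/499.cap -R 'http.request.uri contains "thumb"'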
[16:22:25] also I need cable ids for: [16:22:26] xe-0/0/0 up up Core: << csw1-sdtpa:e16/4 [16:22:26] xe-0/0/1 up up cr1-sdtpa:xe-0/0/1 [16:22:26] xe-0/0/2 up up Transit: xe-1/0/0 down down Core: << csw1-sdtpa:e14/4 [16:22:32] xe-1/2/0 down down Transit: not xe-1/2/0 I think [16:22:48] that's the one we're no longer gonna use [16:22:51] let me deconfigure that one [16:23:06] * maplebed_ just installed wireshark, so the next few checks might be a bit faster. [16:23:26] mark: xe-0/0/1 is a link to cwdm-pmtpa [16:23:38] yes [16:23:41] does it have a cable id? [16:24:03] 2164 [16:24:36] set [16:24:44] and xe-0/0/2? [16:24:57] that's our main hostway transit at the moment [16:25:07] you wanna make another link xe-1/2/0 to the cwdm..correct? [16:25:33] yeah but let's worry about that later [16:25:37] first I want all labeling correct [16:25:40] maplebed_: well, not anymore I guess [16:25:50] ok, time to dig in swift's source. [16:25:52] what's on cr2-pmtpa:xe-0/0/2? [16:26:02] hm? not anymore? [16:26:06] logs a 404 [16:26:09] it seems to log a 499 now. [16:26:33] link to as30217 label is 9002 [16:27:06] ok [16:27:26] xe-0/0/0 label 2166 [16:27:36] paravoid: no, the condition I described is different. [16:27:51] well, I'll dig more before saying that with confidence. [16:28:01] cool [16:28:03] then we have: [16:28:15] Interface Admin Link Description [16:28:15] xe-0/0/0 up up Core: << csw1-sdtpa:e16/4 {#2166} [10Gbps DF] [16:28:15] xe-0/0/1 up up Core: << cr1-sdtpa:xe-0/0/1 {#2164} [10Gbps CWDM] [16:28:15] xe-0/0/2 up up Transit: xe-0/0/3 up up << asw-d-pmtpa:xe-1/1/0 {#6021} [16:28:16] xe-1/0/0 down down Core: << csw1-sdtpa:e14/4 [16:28:16] xe-1/3/0 up up << asw-d-pmtpa:xe-3/1/0 {#6022} [16:28:22] maplebed_: yeah, obviously that's a wild guess of mine :) [16:28:31] but the two conditions are: [16:28:31] 1) thumb doesn't exist but is successfully generated [16:28:31] 2) thumb doesn't exist and generation fails because the original doesn't exist [16:28:39] I think 1) logs a 404 and 2) logs a 499. [16:28:43] right, in the cap you sent it was (2) [16:28:56] whereas both used to log a 404. [16:29:06] mark: that is correct [16:29:11] cool [16:29:24] awesome [16:29:26] on the CWDM [16:29:29] how many channels do we have? [16:29:59] paravoid: also, I think you'll find a clue in the rewrite.py source looking at when it passes through the connection it gets handed and when it makes its own to send to the client. [16:30:10] ok. enough gabbing. wiresharking now. [16:30:12] there are 2 available [16:30:16] for a total of 4 [16:30:24] only 4? :/ [16:30:33] ok [16:30:36] so the ones in use [16:30:39] one is the management link [16:30:42] yes [16:30:48] and the other is the new link between cr1-sdtpa and cr2-pmtpa, right [16:30:58] that last one is not right I think [16:31:03] i wish leslie was here to confirm with her [16:31:07] correct [16:31:18] we need to use the other 2 also [16:31:36] one to bring the FPL link which is now on csw1-sdtpa to cr2-pmtpa instead [16:31:48] and one to bring the 2nd, now unused hostway transit to cr1-sdtpa [16:31:54] then our CWDM system will already be full [16:31:55] meh [16:32:26] but I guess we won't do this now [16:32:31] given that it's friday, and leslie is sick [16:33:42] your call [16:33:54] oh one more question [16:34:05] you said that you looked at csw5-pmtpa, and that those modules couldn't be used [16:34:12] yet it seems you guys inserted a new module into csw1-sdtpa [16:34:15] where did it come from [16:34:16] ? 
[16:35:57] from csw5… i think leslie was confused on what you wanted… but we still need one xfp [16:41:40] so csw5-pmtpa had one 4xXFP module? [16:41:43] and that you took? [16:42:02] yes [16:42:04] ok [16:42:07] are you near it now? [16:42:16] it probably has multiple switch fabrics too [16:42:19] and added a card from csw5 to csw1 [16:42:21] those in the middle, SFM3 [16:42:24] csw5? [16:42:26] yes [16:42:29] yep [16:42:35] can you take one out [16:42:39] we can put that in csw1-sdtpa [16:45:06] mark: i have one with a nortel 8mb flash card… do you want that? [16:46:11] no [16:46:14] that's the management module [16:46:18] the switch fabrics are in the middle [16:46:48] http://www.nedworks.org/~mark/presentations/hd2006/Csw5-pmtpa.jpg [16:46:57] those two with "pwr" and "active" leds [16:48:37] okay..got it [16:48:50] just the one? [16:49:00] there's only space for one more in csw1 [16:49:52] F1: RX-BI-SFM3 Switch Fabric Module OK [16:49:52] F2: RX-BI-SFM3 Switch Fabric Module OK [16:49:52] F3: RX-BI-SFM3 Switch Fabric Module OK [16:49:53] F4: RX-BI-SFM3 Switch Fabric Module not present [16:50:18] makes sense now :P [16:50:34] we just keep csw5-pmtpa for spare parts [16:50:50] so if any remain, we can use those if the ones in csw1-sdtpa go bad [16:51:11] that's why hw couldn't have csw5 [16:51:45] there are no remaining 4x XFP modules in csw5, right? [16:51:50] no [16:51:57] i wonder if there are any in the closets [16:51:59] did you check? [16:52:10] i vaguely recall ordering 2 about a year ago [16:52:23] i checked up here..i will look downstairs when i go [16:52:27] ok [16:52:33] i'm not sure why we would need more than one [16:52:35] except as a spare [16:54:33] cmjohnson1: are you aware btw that csw1-sdtpa has some broken slots? [16:54:43] i have one of these Cisco GLC-SX-MM 1000BASE-SX SFP Modules [16:54:43] if i recall correctly, 3 out of 16 slots have a missing connector or something [16:54:52] no I am not aware [16:54:56] ok [16:54:59] so not all 16 slots are usable [16:55:08] but we never got it replaced as that's nearly impossible without a lot of downtime [16:55:13] and we didn't think we're ever gonna need all 16 [16:55:39] but we use 9 cards now [16:55:48] how can we tell which are bad? [16:55:54] by looking inside ;) [16:56:05] i'm telling you this so you won't be surprised if you ever need to insert a card [16:56:21] rob already went through that at the time hehe [16:56:37] i only remember that slot 13 was bad [16:56:42] and I thought that was funny [16:56:45] but there are 2 more iirc [16:57:17] that is good to know [16:57:18] nice eh, with such an expensive switch [16:57:42] it came like that… i imagine? [16:57:45] yes [16:57:53] we just never noticed until we tried to insert a new card some time [16:58:10] someone seriously bodged it up in production [16:58:16] shit way to find out we bought broken equipment [16:58:17] but unfortunately we were already heavily reliant on it then [16:58:52] going to move back down… brb [17:03:18] New patchset: Ottomata; "Using raid1-250G-1partition.cfg to partition analytics1023-1027." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21334 [17:04:04] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21334 [17:05:58] AaronSchulz: goood morning :) [17:06:19] mark: the 4th switching fabric is in [17:06:24] oh [17:06:30] don't do that without a warning ever again [17:06:41] there's always a risk it goes bad [17:06:47] and if we're not standby :) [17:06:55] but yep, it's up [17:06:56] thanks [17:07:15] … famous last words but… i thought that just as I inserted it [17:07:30] nearly EVERYTHING relies on that box [17:08:07] that is the thought that crossed my mind… it was an "oh shit" kind of moment… but yep… never again [17:08:13] ok ;) [17:08:25] with the 4th switching fabric it's now at full capacity [17:08:42] it has less switching capacity with fewer fabrics [17:08:51] so once you add a certain amount of cards, it can't do the full throughput otherwise [17:08:57] and since we had them available... :) [17:09:29] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21334 [17:09:45] will there be a noticeable change to anything? [17:09:56] no, but there would have been if we added more cards [17:10:10] you guys added one this week [17:10:29] if the box is only half full you need 2-3 [17:11:00] i would have to do the math to check when it matters, but better to be safe than sorry [17:11:22] i agree there. [17:11:37] i see it also has 8 PSUs [17:11:39] so no other xfp modules here… i can put in a procurement ticket for a couple of spares [17:11:43] of which 4 are installed [17:11:53] cmjohnson1: xfp modules or the 4xXFP module line card? [17:11:56] different thing [17:12:06] i was talking about the same kind of card you guys inserted into csw1 [17:12:19] then we have one more of those [17:12:30] ok [17:12:31] i am thinking of the module? [17:12:38] well they're both called modules sometimes [17:12:40] it's a bit confusing [17:12:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20854 [17:12:42] one is a line card [17:12:44] one is not [17:12:48] so let's call them line cards from now on ;) [17:13:03] well I am good now… but thx [17:13:06] leslie and rob were gonna buy 1-2 of those line cards [17:13:12] and then I said, we have some available in csw5 I think [17:13:33] one has been inserted into csw1 already, so all is good [17:13:38] the other one can remain spare wherever it is [17:13:39] good if one dies [17:13:49] just keep track of it [17:14:07] it's the RX-BI-4XG [17:15:06] i am going to leave it in csw5 for now [17:15:10] good [17:15:59] the power supplies in csw5 [17:16:03] they can also be used for csw1 [17:16:07] it seems the system has enough power now [17:16:14] but anyway [17:16:24] pretty much every card that fits in csw5 also fits in csw1 [17:16:46] all of them in fact, I checked [17:18:31] i guess that's all I have for now [17:18:34] we'll probably do more next week [17:24:36] RECOVERY - Puppet freshness on search22 is OK: puppet ran at Fri Aug 24 17:24:31 UTC 2012 [17:25:14] New review: Dzahn; "Nikerabbit: for clarification, this is related to: http://meta.wikimedia.org/wiki/Planet_Wikimedia#R..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21283 [17:26:06] RECOVERY - Puppet freshness on mw74 is OK: puppet ran at Fri Aug 24 17:26:02 UTC 2012 [17:26:59] ottomata: so just to be clear, is there more work to be done than just replacing stat1:/mnt/htdocs by a local filesystem?
[17:27:08] and copying the files obviously [17:28:04] mutante, thanks for merging my netboot change [17:42:45] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [17:46:39] RECOVERY - Puppet freshness on db42 is OK: puppet ran at Fri Aug 24 17:46:17 UTC 2012 [17:47:06] RECOVERY - Puppet freshness on mw10 is OK: puppet ran at Fri Aug 24 17:46:55 UTC 2012 [17:47:42] RECOVERY - Puppet freshness on mw35 is OK: puppet ran at Fri Aug 24 17:47:34 UTC 2012 [17:48:26] hey um, RobH and notpeter [17:48:36] RECOVERY - Puppet freshness on mw36 is OK: puppet ran at Fri Aug 24 17:48:12 UTC 2012 [17:48:42] i'm having trouble with the remaining analytics install again, and I think it is something we might have run into before [17:48:48] whats that? [17:48:52] RobH helped me this morning with the networking problems they were having [17:48:53] that's good [17:48:57] but I just installed [17:49:02] and it PXE booted after it finished [17:49:06] asking me to re-partition again [17:49:12] RECOVERY - Puppet freshness on mw5 is OK: puppet ran at Fri Aug 24 17:48:56 UTC 2012 [17:49:22] boot into the bios and ensure the boot order is set to the disk first [17:49:33] ah right, ok, will try that [17:50:06] RECOVERY - Puppet freshness on mw54 is OK: puppet ran at Fri Aug 24 17:49:44 UTC 2012 [17:50:42] RECOVERY - Puppet freshness on mw57 is OK: puppet ran at Fri Aug 24 17:50:26 UTC 2012 [17:51:36] RECOVERY - Puppet freshness on amssq36 is OK: puppet ran at Fri Aug 24 17:51:07 UTC 2012 [17:51:36] RECOVERY - Puppet freshness on amssq46 is OK: puppet ran at Fri Aug 24 17:51:27 UTC 2012 [17:51:45] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [17:52:12] RECOVERY - Puppet freshness on srv235 is OK: puppet ran at Fri Aug 24 17:52:02 UTC 2012 [17:52:28] hmmmmm RobH [17:52:29] error: grub rescue> [17:52:39] RECOVERY - Puppet freshness on sq62 is OK: puppet ran at Fri Aug 24 17:52:20 UTC 2012 [17:52:52] so are these set to autopart? [17:53:00] sounds like the reboot wiped your partitions [17:53:02] (if so) [17:53:06] RECOVERY - Puppet freshness on srv258 is OK: puppet ran at Fri Aug 24 17:52:47 UTC 2012 [17:53:12] if not, then something else is up [17:53:16] ? [17:53:19] they are…but it prompts me to confirm [17:53:21] and I did not confirm [17:53:33] ok, i will reinstall fully and see what happens [17:53:42] RECOVERY - Puppet freshness on srv269 is OK: puppet ran at Fri Aug 24 17:53:28 UTC 2012 [17:53:42] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Fri Aug 24 17:53:34 UTC 2012 [17:53:44] now that it should boot disk before pxe since I saved it in bios [17:53:44] hrmm, may not matter, it is set to not confirm [17:53:46] but does anyhow [17:53:52] pxe boot should be one time [17:53:56] yep [17:54:04] you can always f12 if you dont wanna run the drac command [17:54:09] RECOVERY - Puppet freshness on analytics1003 is OK: puppet ran at Fri Aug 24 17:53:56 UTC 2012 [17:54:09] RECOVERY - Puppet freshness on virt5 is OK: puppet ran at Fri Aug 24 17:53:57 UTC 2012 [17:54:09] (during post) [17:54:10] yeah, i've heard that is difficult to get the partman autoconfirm stuff to work properly [17:54:23] how do I send F12 without using F12? 
[17:54:27] (my F12 is mapped) [17:54:33] it is, and since its set to not confirm, i have seen it run some disk destructive commands when in auto mode [17:54:36] RECOVERY - Puppet freshness on arsenic is OK: puppet ran at Fri Aug 24 17:54:12 UTC 2012 [17:54:56] Use the <@> key sequence for [17:55:02] hmmmm ok [17:55:09] or you can just use the drac commands [17:55:12] RECOVERY - Puppet freshness on virt6 is OK: puppet ran at Fri Aug 24 17:54:38 UTC 2012 [17:55:21] can I exit out of console back to drac? [17:55:28] yep, ctrl+\ [17:55:29] ah got it [17:55:31] then in drac [17:55:31] danke [17:55:35] racadm config -g cfgServerInfo -o cfgServerBootOnce 1 [17:55:35] racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE [17:55:35] racadm serveraction powercycle [17:55:37] console com2 [17:55:49] make sure you have the first line, or it may change the permanent boot order. [17:55:54] oh [17:55:56] you know [17:55:59] an24-27 all worked! [17:56:06] just an23 that is being annoying [17:56:06] yeah [17:56:06] RECOVERY - Puppet freshness on ssl3002 is OK: puppet ran at Fri Aug 24 17:55:49 UTC 2012 [17:56:15] am copy/pasting from build a new server page [17:56:17] being anal? [17:56:28] anal1023, yup [17:56:46] * RobH ain't touching that. [17:57:06] hehhe, too bad those aren't the real hostnames :) I think it was the two of you who weren't a fan of that :p [17:59:00] i was a fan of that [17:59:42] RECOVERY - Puppet freshness on cp1002 is OK: puppet ran at Fri Aug 24 17:59:20 UTC 2012 [18:01:12] RECOVERY - Puppet freshness on cp1012 is OK: puppet ran at Fri Aug 24 18:00:46 UTC 2012 [18:01:40] RECOVERY - Puppet freshness on db1004 is OK: puppet ran at Fri Aug 24 18:01:35 UTC 2012 [18:01:52] paravoid,maplebed: how are upgrades coming? [18:02:06] RECOVERY - Puppet freshness on sq85 is OK: puppet ran at Fri Aug 24 18:01:59 UTC 2012 [18:02:42] RECOVERY - Puppet freshness on db1026 is OK: puppet ran at Fri Aug 24 18:02:28 UTC 2012 [18:03:40] RECOVERY - Puppet freshness on db1050 is OK: puppet ran at Fri Aug 24 18:03:19 UTC 2012 [18:04:07] RECOVERY - Puppet freshness on marmontel is OK: puppet ran at Fri Aug 24 18:03:46 UTC 2012 [18:04:07] RECOVERY - Puppet freshness on es1001 is OK: puppet ran at Fri Aug 24 18:04:00 UTC 2012 [18:05:10] RECOVERY - Puppet freshness on lvs1 is OK: puppet ran at Fri Aug 24 18:04:38 UTC 2012 [18:05:10] RECOVERY - Puppet freshness on knsq29 is OK: puppet ran at Fri Aug 24 18:05:03 UTC 2012 [18:07:08] RECOVERY - Puppet freshness on sq54 is OK: puppet ran at Fri Aug 24 18:06:41 UTC 2012 [18:07:43] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [18:07:43] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [18:07:43] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [18:07:43] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [18:07:43] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [18:07:44] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [18:07:44] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [18:07:45] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [18:07:45] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [18:07:46] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not 
run in the last 10 hours [18:07:46] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [18:07:47] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [18:07:47] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:09:52] aahhhh maaan [18:09:53] RobH [18:09:57] 250G raid 1 [18:10:02] is filling up the rest of the disk with swap! [18:10:48] or at least, hm [18:10:51] there is way too much swap [18:11:05] ?? 1443G? [18:11:21] ....thats fubar [18:12:14] I had another .cfg do this to me as well [18:12:21] i was working with the analytics-dell.cfg [18:12:26] trying to get mirrored raid to work this way [18:12:29] and had this problem [18:12:32] 4000 4000 4100 linux-swap [18:12:50] ottomata: it should just use 4100 for swap?! weird [18:12:55] i know! [18:13:13] root@analytics1024:~# free -g [18:13:13] total used free shared buffers cached [18:13:13] Mem: 7 1 5 0 0 0 [18:13:13] -/+ buffers/cache: 1 6 [18:13:13] Swap: 1443 0 1443 [18:13:21] do you see the "-1" in the raid definition above? [18:13:27] that is what makes it take "all the rest" [18:13:34] ? [18:13:34] you could turn the -1 into an actual value [18:13:40] hm [18:13:50] but i agree it is strange it does that for swap [18:13:56] but my raid / is only 207G [18:14:02] /dev/md0 207G 1.5G 195G 1% / [18:14:19] which is fine [18:14:29] do you want LVM btw? [18:14:44] i'd be fine with that, but we def want mirrored raid on / for these [18:14:45] i just used a recipe that gives me LVM and raid and worked fine [18:15:05] raid1-lvm.cfg [18:15:05] which one? [18:15:10] oooooo [18:15:48] you know, someone/we/me if I knew more about this [18:15:52] should templatize these partmans [18:16:04] too bad I can't do [18:16:07] that would be ideal yep =P [18:16:33] partman_recipe { … raid => 1, partitions => { "/" => 250G … } … } [18:16:50] so mutante [18:16:52] 64 1000 1000000 raid $primary{ } $lvmignore { } method{ raid } \ [18:16:55] will that do 1TB / [18:16:56] ? [18:17:55] also, is there a reason not to use ext4? [18:18:27] i used that on zirconium. i get a /dev/md0 on / with a size of 9.2G [18:18:46] it does not use all of the diskspace on purpose [18:18:55] what does this meeeaaaan? [18:18:55] 64 1000 1000000 [18:18:56] to ensure there are free "extents" [18:19:07] i'd like to have about a 30G root [18:19:12] 10G is a wee small [18:19:13] so you could extend it and/or take LVM snapshots [18:19:17] yeah [18:19:21] totally [18:19:54] wait, but your / in this recipe is not lvm, right? [18:19:58] 1 2 0 ext4 / /dev/sda1#/dev/sdb1 \ [18:19:58] ? [18:20:00] sorry [18:20:05] 1 2 0 ext3 / /dev/sda1#/dev/sdb1 \ [18:20:22] you just have swap on lvm? [18:21:03] hrmm, you are right. lvdisplay just has an LV /dev/zirconium/swap .. sigh [18:21:04] ottomata: the three numbers are minimum-size, priority, and maximum-size. [18:21:18] priority is only relative between different partitions [18:21:21] right [18:21:26] that's what I thought [18:21:31] maximum size accepts -1 as "just take the rest." [18:21:32] in bytes? [18:21:34] mb? [18:21:52] MB, right? [18:22:21] I think MB. [18:22:30] yeah [18:22:30] so [18:22:39] how did mutante get a 9.2G / out of that? [18:22:52] that looks to me like 64MB - 1TB [18:22:57] 64    1000    1000000 [18:23:00] which recipe are you looking at?
[18:23:07] raid1-lvm.cfg [18:23:11] partman will choose something between minimum and maximum [18:23:18] raid1-lvm.cfg , the results can be seen on host zirconium [18:23:21] and sslXXX [18:24:13] it's true, lvdisplay just shows swap, and / is md0, but we want LVM on top of raid or raid on top of LVM..? hrmm [18:24:21] 9.2G looks like the first line [18:24:21] lvm on top of raid [18:24:40] i don't really like lvm on boot partitions [18:24:42] so, really [18:24:46] either all of / on raid [18:24:52] or, like lvm.cfg does [18:25:00] the highest priority assignment is the first line (8000) between 5G and 10G. Formatted, I could see 10G turning into 9.2G. [18:25:02] a small /boot partition on raw somewhere (hopefully raided) [18:25:24] ahhhh [18:25:27] i see [18:25:28] hm [18:25:42] so there are two raid partitions on sda and sdb? [18:25:46] yeah, 9.2 vs. 10 is just 1000 vs. 1024 i think [18:25:58] aye, md0 md1 [18:25:59] i see them [18:26:01] on zirconium [18:26:04] Change abandoned: DamianZaremba; "IP is wrong" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21297 [18:26:47] ahhh ok cool [18:26:54] yeah mutante that looks pretty good then, I see [18:26:59] you did md0 at 9.2G [18:27:01] and put / on that [18:27:06] and then filled up md1 with the rest of the disk [18:27:27] pvcreate and created a volume group that uses all of md1 [18:27:37] and then created a 1GB swap lv [18:27:38] Alloc PE / Size 238 / 952.00 MiB [18:27:38] Free PE / Size 235844 / 921.27 GiB [18:27:51] aye, 1TB [18:27:53] ok [18:27:55] i like that setup [18:28:09] i'm going to stop puppet on brewster and try a slightly modified one out [18:28:10] yeah, this is what i was primarily looking for, having Free PE, re: the mail from Mark recently about not using all of the space with LVM [18:28:12] ummm, i forget [18:28:22] where is netboot.cfg on brewster? [18:28:26] yeah totally [18:28:29] ottomata: hold on [18:28:32] ok [18:29:15] I thought I saw mail from someone saying something about puppet on brewster. [18:29:30] bwerrrrrr [18:29:45] maplebed_: i admin logged about it earlier [18:29:52] ah, that was it. [18:29:54] but i was just stopping and restarting puppet on it to test a dhcp setting [18:29:55] not mail; admin log. [18:30:02] you're done then? [18:30:13] yea, placed it back and logged it, been back to normal for over an hour [18:30:17] so I see. [18:30:27] (you admin logged when you were done too.) [18:30:29] :) [18:30:32] aye cool, so um, where is that file? i'm just going to manually change for an23 so I can check before committing [18:30:43] ottomata: /srv/autoinstall [18:30:44] maplebed_: there is never too much admin logging. [18:30:46] =] [18:30:48] danke [18:30:52] de rien [18:31:58] usually i just stop puppet using the init script, but an alternative is puppet agent --disable [18:32:21] i used the init script [18:32:25] does agent --disable do the same? [18:32:37] with the latter one you still see a running process but it will not actually run and say "already running" in logs [18:32:40] !log temporarily stopping puppet on brewster to test out partman change for analytics1023-1027 [18:32:48] hm weird [18:32:50] Logged the message, Master [18:33:01] one time i was wondering why Nagios said it would not run on some hosts, when i saw the process [18:33:22] and it looked like somebody had used the --disable, cause i could have --enable make it work again [18:33:58] ok, sounds like that could cause confusion, so I will keep using init script :) [18:42:33] can i use srv193 for testing currently?
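To make the partman discussion above concrete, here is a sketch of a RAID1-plus-LVM recipe fragment in the style raid1-lvm.cfg uses; the sizes and stanzas are illustrative, not the actual file contents. Each stanza is minimum-size, priority, maximum-size (all in MB), then the type; priority only matters relative to the other stanzas, and a maximum of -1 means "take all remaining space", which is what produced the 1443G swap above:

    30000 8000 30000 raid \
        $primary{ } method{ raid } \
    . \
    64 1000 1000000 raid \
        $primary{ } $lvmignore{ } method{ raid } \
    . \
    1000 3000 1000 linux-swap \
        $defaultignore{ } $lvmok{ } method{ swap } format{ } \
    .

The first stanza pins the root array at a fixed 30G (minimum equals maximum) at the highest priority; the second bounds the LVM physical volume at roughly 1TB instead of using -1; the third carves a fixed ~1G swap LV out of the volume group, matching the "Swap: 951" seen on the rebuilt analytics host below.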
[18:45:16] Krinkle: hello :) [18:45:22] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 184 seconds [18:45:31] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 188 seconds [18:46:03] mutante: Hi [18:46:43] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 192 seconds [18:47:01] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 201 seconds [18:47:32] Krinkle: there is a small config change in mw-config that i merged, but i think it has not been pushed out. would you mind checking? i looked at the docs you just changed regarding the sync-scripts and saw some local changes that made me stop [18:47:48] Hm... [18:48:08] I didn't change any scripts on fenari/nfs, I've only been observing and trying to update docs [18:48:35] i see the page redirects to "Wikimedia binaries" now, right [18:48:50] Yeah, because they're in /h/w/bin and not all are sync scripts [18:49:01] (e.g. apache-graceful-all) [18:49:20] they aren't really binaries in a strict sense, but yeah [18:49:32] mutante: I know, most are bash scripts [18:49:46] until proven otherwise, I'd say all are bash scripts [18:49:58] but they're in the PATH and the dir is called ./bin that's historically binaries, right ? [18:50:13] yea [18:51:07] so would you know what the step is after merging in gerrit, and before using sync-file ? [18:51:16] They're all bash scripts [18:51:19] just git pull in /h/w/common/wmf-config ? [18:51:21] And they are NO LONGER in /h/w/bin [18:51:23] mutante: git pull ? [18:51:26] but there are untracked and modified files there [18:51:30] The canonical ones are now in /usr/local/bin via puppet [18:51:39] /h/w/bin is kept around for b/c I think [18:51:50] mutante: Let me take a look [18:51:56] i did not want to pull without knowing about the modified filebackend.php [18:52:00] thanks Roan [18:52:18] perfect, thanks mutante and RobH [18:52:24] I don't know if a simple git pull is correct, I've never done deployment yet. But yeah, I'd say git pull on fenari, maybe make sure there was no merge commit (in case a dinosaur made local changes), and then sync the right files [18:52:25] using modified lvm raid1 works great [18:52:27] filebackend.php has uncommitted modifications [18:52:29] * RoanKattouw blames AaronSchulz [18:52:32] Swap: 951 0 951 [18:52:38] /dev/md0 28G 1.3G 26G 5% / [18:52:45] Yes, you'll want git pull and watch out for merge commits [18:52:49] That part is right [18:52:54] and sync-file-all instead of sync-file, unless you're on an apache and only want it on that apache. [18:52:57] No [18:52:57] (right, Roan?) [18:53:01] There is no sync-file-all [18:53:05] oh, lol [18:53:12] There is documentation on this, I'm not sure if it's up to date [18:53:33] https://wikitech.wikimedia.org/view/Wikimedia_binaries#sync-file [18:53:34] Yep [18:53:36] it's an all script [18:53:46] https://wikitech.wikimedia.org/view/How_to_do_a_configuration_change#Change_wiki_configuration [18:53:58] ah, only sync-common (all of ./common) has a per-apache version [18:54:04] RoanKattouw: what we want to push out is an obvious typo in a wikt. there was "atroller" instead of "patroller"..
a nice typo actually :) hehe [18:54:13] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [18:54:16] lol [18:54:31] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [18:54:55] Running git pull should be fine [18:55:05] New patchset: Ottomata; "Changing partman recipe for analytics1023-1027" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21348 [18:55:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21348 [18:56:03] mutante, could you merge that for me? [18:56:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21348 [18:57:07] !log git pull in /h/w/common/wmf-config [18:57:17] Logged the message, Master [18:57:22] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [18:57:22] maplebed_: is the host hack still on srv193? [18:57:31] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [18:57:41] AaronSchulz: probably. I'll check and remove it if it is. [18:57:59] gone. [18:58:23] New patchset: Ottomata; "misc/statistics.pp - fixing sampled rsync job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21349 [18:58:45] !log sync-file ./wmf-config/InitialiseSettings.php [18:58:47] danke mutante, could you do that one too real quick? [18:58:53] will fix an unrelated puppet error on stat1 [18:58:55] Logged the message, Master [18:59:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21349 [18:59:06] ottomata: after the sync file.. brb [18:59:09] * RoanKattouw has fixed up the docs a bit https://wikitech.wikimedia.org/index.php?title=Heterogeneous_deployment&diff=next&oldid=50040 [18:59:34] ah, spence is also in the list now [18:59:38] Tim just added it [18:59:44] mutante: Did the sync-file run yet? It should log itself in #wikimedia-tech [18:59:51] we needed it to unbreak the job_queue Nagios check [18:59:55] If it finished without logging, the bot is broken again [19:00:14] RoanKattouw: i did, but no log [19:00:23] !log starting puppet back up on brewster [19:00:33] Logged the message, Master [19:00:56] Blegh [19:01:16] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21349 [19:02:06] OK fixed [19:02:17] srv206 and srv266 should be removed from dsh groups [19:02:24] and spence needs /apache/common-local [19:02:27] thanks Roan [19:04:09] danke mutante! [19:05:37] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Fri Aug 24 19:05:23 UTC 2012 [19:05:44] now i want more lines in "Recently closed" in gerrit.. like arrows to go back [19:06:19] RoanKattouw: I've updated https://wikitech.wikimedia.org/view/Wikimedia_binaries#sync-wikiversions and others, maybe you can check it out some time and add other that are important / frequently used. [19:06:49] Any reference to /h/w/bin on that page should be killed with fire [19:06:56] ah, and somebody asked yesterday if there is a way to see who added you as a reviewer.. might be this in preferences? "Display Person Name In Review Category" [19:06:57] how so ? [19:07:02] The scripts live in /usr/local/bin now and are maintained by puppet [19:07:06] aha [19:07:07] The ones in /h/w/bin are stale [19:07:19] RoanKattouw: Link to gitweb? 
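Pulling the pieces together, the config-change flow that was just run looks roughly like this; the paths and filenames are taken from the channel, and the merge-commit caution is Roan's (a sketch, not the canonical procedure):

    # on fenari
    cd /h/w/common/wmf-config
    git status     # look for local or untracked changes first (filebackend.php had uncommitted modifications)
    git pull       # a clean fast-forward is the happy case; a merge commit means someone committed locally
    cd /h/w/common
    sync-file ./wmf-config/InitialiseSettings.php    # push the one file to the whole cluster; it logs itself to #wikimedia-tech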
[19:07:27] The relevant puppet class is misc::scripts and the files live in files/misc/scripts IIRC [19:08:05] /h/w/bin is still in PATH though [19:08:18] Yes, but after /usr/local/bin hopefully [19:08:19] but so is /apache/bin [19:08:24] I want to get rid of it [19:08:32] err, /usr/local/bin rather [19:09:15] RoanKattouw: /h/w/bin is an svn repo with 1 commit [19:09:33] interesting [19:09:51] <^demon> mutante: No, that just changes your search results & dashboards to include the reviewer name who left the highest (or lowest) review. [19:10:09] <^demon> Typically the original e-mail you got from gerrit saying "plz review" would indicate it. [19:10:10] <^demon> I think. [19:10:37] Oh, I see now [19:10:43] Someone renamed it to misc::deployment::scripts [19:10:45] gotcha, yea, true, email tells you of course [19:11:25] RoanKattouw: but the directory is the same ? [19:11:30] `files/misc/scripts` [19:11:35] Yeah should be [19:11:42] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=files/misc/scripts;h=35e9601228e9c802e6cfa298bb0c0bfa54d05cc7;hb=HEAD [19:11:43] https://wikitech.wikimedia.org/index.php?title=Wikimedia_binaries&action=historysubmit&diff=50718&oldid=50711 [19:11:46] done [19:12:00] That line is misleading [19:12:08] It suggests /usr/local/bin is a checkout of that git dir [19:12:10] Which it's not [19:12:55] fixed [19:13:29] Thanks [19:14:07] wow, I see now how outdated /h/w/bin is [19:14:09] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:14:19] lots of old references that didn't make sense to me when I was reading git [19:14:20] it* [19:14:31] ls -l will tell you when the files were last modified [19:15:19] e.g. 10.0.5.8::common/ instead of /h/w/common [19:15:47] what's the reason for that? (the latter is the old variant) [19:16:13] * apergos peeks in here to see if the 7 headed mw dog has been slain or not [19:16:47] New review: Dzahn; "this is now pushed out to the cluster. checked on srv233" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21203 [19:19:05] Krinkle: 10.0.5.8 is nfs-home [19:19:17] So it makes the rsync pull from the NFS server instead of from fenari [19:19:26] I figured but why the IP reference to nfs directly as opposed to the mount of it [19:19:40] Hm.. [19:19:48] isn't that the same thing when running it from fenari? [19:19:52] Or is this faster? [19:20:04] It has the same effect, but it puts the load on a different box and on a different network link [19:20:38] This doesn't actually work *well*, mind you (Ryan is rewriting the deployment system to use git-deploy and salt in his spare time, IIRC), but it works slightly better [19:23:00] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [19:26:02] I'll have time for that again soon! [19:26:05] \o/ [19:26:19] in fact, after self-registration for labsconsole it's my next highest priority [19:26:21] Yay [19:26:36] we should have it in time for eqiad rollout [19:26:57] That would be awesome [19:27:12] I do not want to do cross-colo rsyncs with the current system [19:27:16] we may want to rewrite the git-deploy part in python at some point [19:27:23] it does a fairly simple thing, overall [19:27:31] Sure [19:27:35] What's git-deploy written in? [19:27:38] perl [19:27:43] * RoanKattouw punches hole in wall [19:27:46] hahaha [19:27:58] but really, it just makes tags [19:27:59] <^demon> Could be worse.
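On the 10.0.5.8::common question above: the double-colon form talks to an rsync daemon module served from nfs-home itself, rather than reading the same tree through an NFS mount on the pulling host. A sketch of the difference as it might appear in sync-common (the destination path follows the /apache/common-local mention earlier; the flags are assumptions):

    # old variant: read the tree via the NFS mount of /h/w/common,
    # funnelling every byte through the NFS client path
    rsync -a /h/w/common/ /apache/common-local/

    # newer variant: pull from the "common" rsync daemon module on
    # nfs-home (10.0.5.8), moving the load to that box and its link
    rsync -a 10.0.5.8::common/ /apache/common-local/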
[19:28:00] it's pretty simple [19:28:05] Why does almost every marginally useful thing in operations have to be written in perl or Ruby? [19:28:06] *makes tags and calls hooks [19:28:11] use Roan; Roan::love('perl'); [19:28:27] <^demon> RoanKattouw: svn2git is c++ :) [19:28:34] /$(892368598349543543/g*594859082671kgh9%^&#$%^$% [19:28:43] RoanKattouw: well, seriously, it should be an easy to implement thing :) [19:28:46] I don't know what that was but I'm sure it's valid Perl to slap someone [19:28:54] Yeah, git-deploy is simple enough [19:29:04] git deploy start -> check for lock file, add a tag [19:29:15] It's probably not even that much code [19:29:20] git deploy sync -> make a tag, run hooks, remove lock file [19:29:44] honestly we can likely drop the hook from git deploy start [19:29:50] but then it's more annoying to roll back [19:30:07] git deploy abort -> roll back to start tag, remove lock file [19:30:08] ryan_lane: when you get a chance can you comment on your rt for labsdb1/2 and labsdb1001/1002 rt3374 [19:30:36] wait, what? [19:30:44] we ordered these disks ages ago [19:30:51] they should definitely be on site [19:31:51] i have the sandisk 480's [19:32:25] let me try to find the email [19:33:28] * Ryan_Lane sighs [19:33:37] both ct and asher are on vacation [19:34:42] PROBLEM - NTP on analytics1026 is CRITICAL: NTP CRITICAL: Offset unknown [19:34:51] PROBLEM - NTP on analytics1024 is CRITICAL: NTP CRITICAL: Offset unknown [19:36:11] crap. I wonder if they are in eqiad [19:36:13] RobH: ? [19:36:21] RobH: you know anything about this? [19:36:42] about? [19:36:49] ssds for labs db boxes [19:36:55] do you have an rt #? [19:37:02] I can't find one [19:37:08] I know they were ordered [19:37:12] I have no idea who did it [19:37:13] I ordered a bunch of SSDs for eqiad, I do not recall any for Tampa [19:37:22] Sounds like something I did not handle. [19:37:24] damn [19:37:31] which I argue shouldnt happen since i am just a phone call away [19:37:31] this is killing me [19:37:34] but i am usually ignored. [19:37:44] let me check and see if i have any note [19:38:27] PROBLEM - NTP on analytics1025 is CRITICAL: NTP CRITICAL: Offset unknown [19:38:27] PROBLEM - NTP on analytics1027 is CRITICAL: NTP CRITICAL: Offset unknown [19:38:33] I'm betting they are in eqiad [19:38:38] Ryan_Lane: i have no RT or email pertaining to labsdb other than the one you just created [19:38:49] all my ssd related emails are to other accounted for projects [19:39:05] they may have been ordered for something else [19:39:15] I know Rachael ordered a bunch of Intel 720s for Asher, but I do not recall her ordering any 320s. [19:39:32] As far as I know, we have done all our 320s via dell (and me) or newegg (and me) [19:39:37] but that doesn't mean it didnt happen [19:39:47] it just means whoever did it didnt follow any procedure. [19:39:57] (so it certainly wasnt me, since im the one insisting on the documentation ;) [19:40:13] * Ryan_Lane groans [19:40:18] Ryan_Lane: If you drop a procurement ticket for this now, I can get it quoted for you today. [19:40:26] if its cheap, it takes mark to sign off [19:40:27] that's not a good idea [19:40:34] I'm pretty sure we already have them [19:40:37] well, quoted being newegg and it can arrive next week. [19:40:46] ok, well, then you have to track down who did it =P [19:40:49] yes [19:40:54] and they are both on vacation [19:40:57] cmjohnson1 do you have any unaccounted for intel 320s? [19:41:12] if he doesnt, I only have 720s and the spare 320s I ordered recently.
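Stepping back to Ryan's git-deploy outline above, the three verbs reduce to tags plus a lock file. A minimal sketch under that description only; the lock path, tag names, and hook runner are all hypothetical, not git-deploy's actual internals:

    # git deploy start: bail out if a deploy already holds the lock, then mark a rollback point
    [ -e /srv/deploy.lock ] && { echo "deploy already in progress"; exit 1; }
    touch /srv/deploy.lock
    git tag "deploy-start-$(date +%s)"

    # git deploy sync: tag what is being shipped, run the sync hooks, release the lock
    git tag "deploy-sync-$(date +%s)"
    ./run-deploy-hooks        # placeholder for whatever hooks are configured
    rm -f /srv/deploy.lock

    # git deploy abort: roll back to the most recent start tag and release the lock
    git reset --hard "$(git describe --tags --match 'deploy-start-*' --abbrev=0)"
    rm -f /srv/deploy.lock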
(his reply in rt makes it sound like he doesnt) [19:42:19] i also have no record of us ordering 520s [19:42:43] robh: ryan_lane: the only 320's i received were for the ohm servers (rt2740/41) [19:42:52] yep, those from dell [19:43:25] i received a bunch of sandisk 480's I believe... i don't have a rt# for them [19:43:47] ... what are they? [19:43:48] ssds? [19:43:53] yes [19:44:01] i have no idea what thats all about. [19:44:08] i hate how much shit just shows up. [19:44:27] RECOVERY - NTP on analytics1025 is OK: NTP OK: Offset -0.0126465559 secs [19:44:48] i saved the receipts but they do not have a rt# associated [19:45:05] from what vendor? [19:45:21] RECOVERY - NTP on analytics1026 is OK: NTP OK: Offset -0.02334046364 secs [19:45:21] RECOVERY - NTP on analytics1024 is OK: NTP OK: Offset -0.01556193829 secs [19:45:28] give me a sec... i moved them upstairs... i think they are for labs [19:45:57] RECOVERY - NTP on analytics1027 is OK: NTP OK: Offset -0.01094186306 secs [19:46:11] that would suck, since we dont want sandisk. [19:46:17] we want intel ssds specifically. [19:46:37] of a specific model type where we have known performance metrics =P [19:49:09] 3027 Buy 64 SSDs for use in Parser Cache and DB servers @ Eqiad ? [19:52:41] robh: they purchased from amazon [19:52:58] ahh, was prolly rachael then, but need to find out for what [19:53:02] cuz we dont use sandisk [19:53:06] so we prolly need to return. [19:53:06] in several orders... i have 42 total [19:53:11] .....what?!? [19:53:13] ARGH [19:55:54] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18124 [20:01:04] maplebed: ms-be6 DIMM has been replaced and the error is no longer showing up in post [20:03:32] New patchset: Ottomata; "Relaying AFT udp2logs from emery over to vanadium per Ori's request." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21391 [20:03:48] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:04:19] New review: Ottomata; "waiting til monday to merge this" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/21391 [20:04:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21391 [20:09:21] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20727 [20:37:17] maplebed: for https://bugzilla.wikimedia.org/show_bug.cgi?id=34814 it seems like the only thing to do is rename the user [20:38:04] and kill the copy2 code I guess [20:38:15] it would be nice to streamline rewrite.py to remove the cruft [20:40:12] AaronSchulz: rewrite does still write thumbs in every test cluster. [20:40:25] only in production does it not write thumbs. [20:40:49] and until we can get test instances of mediawiki (in labs?) that are hooked into swift in the same way production is hooked in, we'll have to keep the copy2 stuff in there. [20:40:56] but +1 to that so we can rip it out. [20:41:11] ok, well we can still make a new user [20:41:17] to replace mw:thumb ;) [20:41:48] yes. that's in here: http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_tasks_2012-08-13#to_do_Sometime.28tm.29 [20:48:36] New review: Mdale; "Are these settings now stored as part of the shell script?"
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/17365 [20:53:56] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [20:54:43] New patchset: Krinkle; ".gitignore: Organize in sections; Add Mac .DS_Store" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21392 [21:01:24] Does anyone else have a ghost entry in "git status" on operations/mediawiki-config? [21:01:33] # Untracked files: [21:01:34] # "docroot/foundation/leve\314\201e_de_fonds.html" [21:01:41] the file is in version control, it is just fine [21:01:49] but the weird encoding is messing it up [21:01:58] I can't delete with "git rm" [21:02:06] fatal: pathspec 'docroot/foundation/levée_de_fonds.html' did not match any files [21:02:21] though regular unix "rm" is detecting it just fine and removes it from disk [21:02:44] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21392 [21:03:50] PROBLEM - swift-account-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:04:26] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:04:26] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:04:26] PROBLEM - swift-container-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:04:26] PROBLEM - swift-object-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:04:26] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:04:35] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:04:53] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:05:02] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:05:11] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:05:20] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:05:20] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [21:08:02] RECOVERY - swift-object-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:08:02] RECOVERY - swift-container-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:08:20] RECOVERY - swift-container-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:08:20] RECOVERY - swift-account-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:08:20] RECOVERY - swift-object-auditor on ms-be8 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
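On Krinkle's ghost entry above: git octal-escapes non-ASCII paths by default, and the untracked file on disk uses the decomposed Mac form ("e" plus combining acute U+0301, whose UTF-8 bytes 0xCC 0x81 print as \314\201), so a precomposed "é" typed at the prompt never matches it. Two things that should help, assuming a bash shell; newer git also has core.precomposeunicode for exactly this OS X quirk:

    # print paths raw instead of octal-escaped, so the mismatch is visible
    git config core.quotepath false
    git status

    # spell out the decomposed bytes with bash $'...' quoting to remove
    # the untracked NFD-named duplicate from disk
    rm $'docroot/foundation/leve\314\201e_de_fonds.html'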
[21:08:56] RECOVERY - swift-account-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:09:05] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:09:05] RECOVERY - swift-object-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:09:05] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:09:14] RECOVERY - swift-container-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:09:14] RECOVERY - swift-object-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:09:14] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:11:20] !log upgraded swift backend host ms-be8 [21:11:29] Logged the message, Master [21:22:09] !log breaking test.wp :) ..and rolling back [21:22:18] Logged the message, Master [21:23:30] New review: Dzahn; "Syntax error on line 164 of /etc/apache2/wmf/redirects.conf:" [operations/apache-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/13293 [21:24:33] !log upgrading swift backend host ms-be1 [21:24:37] apergos: you're gone, right? [21:24:42] Logged the message, Master [21:25:08] PROBLEM - swift-object-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:25:26] PROBLEM - swift-container-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:25:35] PROBLEM - swift-container-updater on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:25:53] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:26:02] PROBLEM - swift-object-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [21:26:02] PROBLEM - swift-account-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:26:11] PROBLEM - swift-account-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:26:20] PROBLEM - swift-account-reaper on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:26:20] PROBLEM - swift-container-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:26:29] PROBLEM - swift-object-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:26:38] PROBLEM - swift-account-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:27:23] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:27:32] RECOVERY - swift-object-auditor
on ms-be1 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [21:27:32] RECOVERY - swift-account-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:27:41] RECOVERY - swift-account-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:27:50] RECOVERY - swift-account-reaper on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:27:50] RECOVERY - swift-container-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:27:59] RECOVERY - swift-object-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:28:08] RECOVERY - swift-object-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:28:08] RECOVERY - swift-account-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:28:26] RECOVERY - swift-container-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:28:36] RECOVERY - swift-container-updater on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:30:45] !log removing decom'ed srv206 and srv217 from ALL dsh groups [21:30:54] Logged the message, Master [21:31:59] New patchset: Pyoungmeister; "temporarily pinning lucene version number for pmtpa cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21394 [21:32:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21394 [21:34:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21394 [21:54:34] !log upgrading swift back end ms-be2 [21:54:44] Logged the message, Master [21:56:52] PROBLEM - swift-object-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:56:52] PROBLEM - swift-container-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:57:19] PROBLEM - swift-account-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:58:22] RECOVERY - swift-object-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:58:22] RECOVERY - swift-container-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:58:58] RECOVERY - swift-account-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [22:06:37] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [22:09:37] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [22:09:37] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [22:09:37] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [22:09:37] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [22:11:06] New patchset: Bhartshorne; "fixed a change in the storage log 
format for swift logtailing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21398 [22:11:51] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21398 [22:15:01] !log upgrading swift backend ms-be7 [22:15:12] Logged the message, Master [22:15:37] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:18:37] PROBLEM - swift-account-reaper on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:18:37] PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:18:55] PROBLEM - swift-object-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:18:55] PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:18:55] PROBLEM - swift-container-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:20:07] RECOVERY - swift-account-reaper on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:20:07] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:20:25] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:20:25] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:20:25] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:37:04] New review: Dzahn; "it's actually gone. even though the system is not rebuilt the cert and key have been manually shredded." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/15597 [22:37:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15597 [22:43:40] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [23:02:53] !log upgrading swift backend ms-be9 [23:03:03] Logged the message, Master [23:08:52] !log temp stopping puppet on iron [23:09:02] Logged the message, notpeter [23:40:05] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:40:05] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours