[00:00:50] it is interesting that the ratios of hits have changed [00:01:08] from 40% each 200s and 404s to 40% 200s, 20% 404s and 20% 'other' [00:01:49] Jamesofur: heh, i was grepping for more errors and wondering why i get a different number of results just when sorting the output.. until i realized it is exactly 00:00 UTC ..and the cronjobs run fine now :) [00:02:29] LOL [00:03:31] I'm sure it'll be a day or two and then we'll get more random errors ;) [00:05:01] New patchset: Bhartshorne; "changing swift logtailer module to allow for new logging parameters to be appended to the proxy log line (as happened across the 1.4.3 -> 1.5 boundary)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21304 [00:05:03] paravoid: ^^^ [00:05:45] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21304 [00:06:13] ah, trivial enough [00:06:39] so, what's the 20% other? that's peculiar [00:06:53] when writing the thing I bucketed the response codes I was expecting to get. [00:07:01] I'm seeing a bunch of 499s now that I didn't before [00:07:54] "Client Closed Request" according to wikipedia. [00:08:03] http://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_Client_Error [00:08:05] that's only slightly ironic [00:08:52] I'd give you http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html but it doesn't go above 417. [00:09:02] 499 is not an HTTP code [00:09:10] haven't read that page, but I'm pretty sure [00:09:27] apparently it's one that nginx gives. [00:09:30] if it's client closed request it's probably a specific server extension [00:09:32] and swift too, it seems. [00:09:33] ah right [00:09:54] 20% client closed request? isn't that a lot? [00:10:02] Jamesofur: http://meta.wikimedia.org/w/index.php?title=Planet_Wikimedia&diff=4062382&oldid=4062279 [00:12:02] mutante: perfect thanks, I think I'm going to be heading home soon but will go through all of those and commit from there [00:12:03] huh. [00:12:21] Jamesofur: i can do it, just following your example [00:12:28] I wonder if that's caused by the way the rewrite.py hands off the request and the fact that the proxy-logging module is below rewrite. [00:12:29] like commenting instead of removing [00:12:49] paravoid: I did get a suggestion that we invert that order (the pipeline in the proxy config) to put the proxy-logging module before rewrite rather than after. [00:14:01] mutante: ahh perfect thanks, yeah, I'm thinking of commenting them out for at least 5-6 months since we don't know if the site is just having temporary issues etc. [00:14:41] paravoid: a comparison of response codes over 10,000 lines on ms-fe1 and 4: http://pastebin.com/wR3EYSFx [00:14:55] each column is count, code pairs. [00:15:37] (that's the output of tail -n 10000 /var/log/syslog | cut -f 12 -d\ | sort | uniq -c ) [00:25:16] maplebed: btw, I have a meeting I can't postpone on Tuesday morning... [00:25:51] 10:30 your time, so I'll be available before. [00:26:49] ok, np. [00:38:23] New patchset: Dzahn; "remove more broken feed URLs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21306 [00:39:06] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21306 [00:39:09] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21306 [00:41:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:43:12] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [00:46:12] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:46:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [00:51:45] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [00:53:29] New review: Jalexander; "\o/ looks good" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21306 [01:12:27] New patchset: DamianZaremba; "Making gitdir configurable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21307 [01:13:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21307 [01:16:22] New patchset: DamianZaremba; "Making gitdir configurable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21307 [01:17:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21307 [01:17:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:22:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.636 seconds [01:40:48] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 252 seconds [01:40:57] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 261 seconds [01:47:24] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 647s [01:48:29] about to do one last scap per a request from Erik [01:57:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:58:03] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [01:58:57] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 9 seconds [02:00:00] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [02:07:21] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [02:07:21] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [02:07:21] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [02:07:21] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [02:10:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [02:13:21] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [03:07:48] RECOVERY - Puppet freshness on snapshot4 is OK: puppet ran at Fri Aug 24 03:07:33 UTC 2012 [03:28:21] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [03:37:21] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [03:37:21] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [06:24:08] ok so the sql script is broken on bast1001 because of course there is no /home/wikipedia/anything [06:24:41] I assume a lot of crap 
doesn't work over there because of that. what was our approach to that going to be? [06:45:17] New review: Nikerabbit; "Sorry I can't make head or tails from the commit message." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21283 [07:00:15] New patchset: preilly; "BREW Public IP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21313 [07:01:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21313 [07:01:18] paravoid: can you approve and merge https://gerrit.wikimedia.org/r/#/c/21313/ [07:10:42] Ryan_Lane: go to bed [07:10:55] Ryan_Lane: or, approve and merge https://gerrit.wikimedia.org/r/#/c/21313/ [07:41:33] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [07:50:34] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [08:06:26] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [08:06:26] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [08:06:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:06:27] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [08:06:28] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [08:06:28] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [08:06:29] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [08:06:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on cp1002 is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [08:38:23] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [08:38:23] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours [08:38:24] PROBLEM - Puppet freshness on mw36 is CRITICAL: Puppet has not run in the last 10 hours [08:38:24] PROBLEM - Puppet freshness on mw57 is CRITICAL: Puppet has not run in the last 10 hours [08:38:25] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours [08:38:25] PROBLEM - Puppet freshness on mw54 is CRITICAL: Puppet has not run in the last 10 hours [08:38:26] PROBLEM - Puppet freshness on srv258 is CRITICAL: Puppet has not run in the last 10 hours [08:38:26] PROBLEM - Puppet freshness on sq62 is CRITICAL: Puppet has not run in the last 
10 hours [08:38:27] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [08:38:27] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [08:38:28] PROBLEM - Puppet freshness on sq54 is CRITICAL: Puppet has not run in the last 10 hours [08:38:28] PROBLEM - Puppet freshness on virt6 is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on cp1012 is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [08:39:25] PROBLEM - Puppet freshness on lvs1 is CRITICAL: Puppet has not run in the last 10 hours [08:39:26] PROBLEM - Puppet freshness on db1026 is CRITICAL: Puppet has not run in the last 10 hours [08:39:26] PROBLEM - Puppet freshness on mw35 is CRITICAL: Puppet has not run in the last 10 hours [08:39:27] PROBLEM - Puppet freshness on sq85 is CRITICAL: Puppet has not run in the last 10 hours [08:39:27] PROBLEM - Puppet freshness on es1001 is CRITICAL: Puppet has not run in the last 10 hours [08:39:28] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [08:39:28] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours [08:39:29] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours [08:39:29] PROBLEM - Puppet freshness on srv269 is CRITICAL: Puppet has not run in the last 10 hours [08:39:30] PROBLEM - Puppet freshness on db1050 is CRITICAL: Puppet has not run in the last 10 hours [08:39:30] PROBLEM - Puppet freshness on virt5 is CRITICAL: Puppet has not run in the last 10 hours [09:13:12] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [10:08:25] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 183 seconds [10:08:25] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 186 seconds [10:08:43] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 198 seconds [10:08:52] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 206 seconds [10:21:55] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [10:22:13] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [10:23:07] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [10:23:25] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [10:41:07] apergos: ping [10:41:22] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21313 [10:41:23] paravoid: poooonnngggg [10:41:35] hey, sorry [10:41:49] oh? what did you do to be sorry for? [10:41:50] I was up and talking with people until 4am :/ [10:41:55] ah no worries [10:42:05] the only thing is that at 6 pm I gotta leave [10:42:14] I can be back on later but I'm not sure when [10:43:13] so this is "merge the proxy specific changes to puppet", "pull ms-fe1/2 from pool", "push puppet changes" ? 
[10:44:19] preilly: merged [10:44:39] paravoid: thanls [10:44:48] tnanks even [10:44:57] thanks damn [10:45:37] apergos: I was thinking about the "merge the proxy specific stuff to puppet" [10:45:57] see, we'd need to push packages to the repo too [10:46:06] and the "swift" package is shared among proxies and backends [10:46:20] grrr [10:46:25] well that's just peachy [10:48:15] the stanza says "ensure present" right now, right? [10:48:24] I mean one could manually do the packages on the proxies [10:48:33] then force a puppet run for the rest [10:49:51] we could [10:51:09] I'm not too excited about any of our options [10:52:04] well what do you prefer? [10:52:58] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [12:03:51] New patchset: Faidon; "vumi: add smpp_enquire_link_interval to TATA SMS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21320 [12:04:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21320 [12:04:55] New review: Jerith; "Looks good." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/21320 [12:05:46] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21320 [12:08:30] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [12:08:30] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [12:08:30] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [12:08:30] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [12:14:30] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:30:53] apergos: so, I say let's do it manually to all of them now [12:30:59] ok [12:31:01] I've recorded all the steps carefully, so I'll apply them to puppet [12:31:32] If I thought we could actually test via puppet in any meaningful way I'd have more to say about it [12:31:53] but given the package thing on top of everything else, it wouldn't be worth much [12:32:02] indeed [12:32:29] !log depooling ms-fe1/2 for the 1.5 upgrade [12:32:36] I'm on to my second round of deletes on ms5 (which will take at least a day) [12:32:39] Logged the message, Master [12:32:48] let's look at some graphs [12:35:15] this feels so wrong [12:35:20] ? [12:35:29] upgrade on a friday [12:35:48] if it was the sf timezone I would agree [12:35:58] but luckily for us we are just a few hours ahead of them:-D [12:40:36] !log on ms5, running from screen session as root: tossing non-standard thumb sizes > 100 px for commons/x/xx to see what space that gives us [12:40:46] Logged the message, Master [12:40:47] shoulda logged that earlier [12:41:13] we never used to care about when doing maintenance or upgrades [12:41:31] i'd just as easily do them on fridays, saturdays, sundays, or whenever I felt like it [12:41:39] heh [12:41:41] s/we/you/ :-P [12:41:52] I don't mind picking up the pieces on the weekend [12:41:52] there was noone else anyway [12:42:01] I'm more worried about paging/worrying everyone else [12:42:13] ariel has been around for some time but doesn't count, always online and watching anyway :P [12:42:25] http://isitreadonlyfriday.com/ [12:42:29] you're in for a surprise this weekend then :-P [12:42:34] i don't mind [12:48:04] still a lot of traffic on ms-fe1/2 [12:53:39] I see none [12:53:41] so, proceeding.
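For reference, a minimal sketch of the manual-upgrade path being discussed (depool, upgrade the packages by hand since puppet's "ensure present" won't pin or downgrade a version, spot-check, then repool); the package names below are assumptions about the Ubuntu swift packaging of the time, not taken from the session:

    # on each depooled proxy (ms-fe1, then ms-fe2):
    apt-get update
    apt-get install swift swift-proxy python-swift   # assumed package set for the 1.5 proxies
    swift-init proxy restart
    # spot-check before repooling, e.g. with lwp-request as used later in the session:
    GET -Used http://ms-fe1.pmtpa.wmnet/wikipedia/en/thumb/0/03/Homelandposter.jpg/220px-Homelandposter.jpg

Because the puppet stanza only says "ensure present", a later puppet run should leave the manually upgraded package alone rather than fight it.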
[12:54:10] well testat shows a lot of established conns [12:54:13] Netstat [12:54:55] that's the swift->memcached ones [12:55:20] if you grep for port 80 you'll see only a few from the LVS servers, which I guess is the pybal idle connection [12:55:39] oh memcached [12:55:40] fine [12:56:04] go to town [12:56:21] sorry? [12:56:31] = feel free, have at [12:56:43] ah [12:59:08] why does ganglia totally lie about cpu load on ms-fe3 (as an example)? that host is bored, I should know cause I'm on it [12:59:28] ah because I can't read, nm [12:59:31] :-/ [12:59:46] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: Connection refused [12:59:55] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [13:00:50] all done [13:01:00] yes, I see the changes on both hosts [13:01:16] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.010 seconds [13:01:25] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.010 seconds [13:01:28] both tested and work too [13:01:40] do a check if you want to, and I'll repool them [13:01:41] that was fast (the testing) [13:02:21] how are you testing them? [13:02:24] doing s/ms-fe4/ms-fe1/ in your .bash_history helps :P [13:02:31] :-D [13:02:39] GET -Used http://ms-fe2.pmtpa.wmnet/wikipedia/en/thumb/0/03/Homelandposter.jpg/220px-Homelandposter.jpg [13:02:42] e.g. [13:02:46] ah [13:03:50] gah, ms-fe4 is .214 but ms-fe1 is .210 [13:04:11] oh noes, they're all wrong [13:04:33] ?? [13:04:39] nevermind me [13:06:04] verified that it logs properly too [13:06:42] so, ack for repooling? [13:07:46] yeah, I didn't do a particularly comprehensive test but a few random things on each [13:07:48] so go ahead [13:08:17] I wonder if we should write a bit more complicated pybal test [13:08:24] as to let pybal do the test for us for free [13:08:36] !log repooling ms-fe1/2 with all new swift [13:08:46] Logged the message, Master [13:09:21] traffic flowing [13:12:18] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [13:12:32] looks ok [13:13:01] these "499"s worry me [13:13:47] interesting aren't they? [13:22:35] apergos: so. [13:23:03] what should we do next I wonder [13:23:07] ms-be upgrade? [13:23:32] well I'd like to wait a half hour to make sure nothing weird crops up [13:23:43] and I am mindful of my deadline of being out the front door at 6 [13:24:02] I don't mind staying [13:24:41] do we think we can get one done in the hour we'll have available? [13:25:16] sure [13:25:38] ok, well we can do that [13:25:44] say at 5? [13:28:08] sure [13:28:10] looks trivial enough [13:28:21] really trivial [13:28:22] no good pattern to these 499s (as far as urls or internal/external requests), which sucks [13:28:52] how do we test those? same way? [13:29:28] not really.
I was thinking grep URLs off the logs and try those [13:29:37] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [13:34:48] guess that should get documented before ben leaves [13:38:37] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [13:38:37] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [13:43:58] okay, while waiting I'm going to tcpdump and find what those 499s are [13:45:13] sure [13:45:20] I mean we see em in the logs on the hosts [14:19:54] ottomata: I am onsite now, going to poke at analytics1023 [14:20:06] well, just got here, getting setup, and going to work on that [14:21:13] mark: Once I finish working on ottomata's analytics servers, we can work on the network stuff [14:21:16] k [14:21:26] mark: or if i hit a wall and it will take too long, i put off analytics a bit while you are still around and working [14:21:38] and return to it once its late your evening [14:21:51] will know shortly. [14:27:59] !log stopping puppet on brewster to do a local nonpuppetized test change [14:28:09] Logged the message, RobH [14:30:20] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:30:51] New patchset: Umherirrender; "(bug 34386) Enable e-mailing password based on e-mail address" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21326 [14:30:57] we need a cmjohnson2 [14:31:41] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [14:31:59] PROBLEM - MySQL disk space on db22 is CRITICAL: DISK CRITICAL - free space: /a 17121 MB (3% inode=99%): [14:32:43] mark: so there are no links going to csw1 from the cwdm [14:33:08] hostway to cwdm, fiber to msw1 and fiber to cr1 [14:33:54] so there are two cross-floor fibers, right [14:34:03] ARGH, where the hell is my usb memory stick. [14:34:08] one is the old one used for production [14:34:12] and one is the one always used for management [14:34:19] * RobH checks every single server to see if he left it someplace plugged in [14:34:24] the production one is between cr2-pmtpa and csw1-sdtpa, correct? [14:34:29] no CWDM in between [14:35:10] no [14:37:03] huh [14:37:14] I see cr2-pmtpa is now connected directly to cr1-sdtpa on xe-0/0/1 [14:39:41] mark: is that going through csw1? [14:39:50] not according to the description [14:39:54] but i have no idea what you guys did [14:40:52] I -think- [14:40:59] RECOVERY - MySQL disk space on db22 is OK: DISK OK [14:41:10] cr2-pmtpa:xe-0/0/1 is connected via the CWDM system to cr1-sdtpa:xe-0/0/1 [14:41:15] so, the hostway fiber is going into the cwdm [14:41:21] "the hostway fiber", which one is that [14:41:23] there are two [14:41:25] or more [14:41:36] depending on if you count the transits [14:41:45] do you have a fiber nr? [14:41:47] morning paravoid, apergos. [14:41:52] looks like you had a good time this morning! [14:42:16] morning [14:42:25] fzr15802100f [14:42:28] yes, a quiet little upgrade [14:42:39] paravoid was looking at the 499s by doing some tcpdumping [14:42:55] cmjohnson1: that's not our cable id format [14:42:55] cmjohnson1: is that ptbthbth in hex? [14:42:58] I had a gander through the 1.4.3 and 1.5.0 code but didn't see much on a quick pass through that would explain the change [14:43:13] I'm curious about them too. [14:43:19] is there no 4 digit nr on it?
[14:43:22] in about 15 mins I'm going to have to take off unfortunately, though I will be back later tonight and will check in [14:43:33] 0004 [14:43:40] did you put that on there? [14:43:43] I asked in the swiftstack channel (as you'll see) in case it was a known issue but no dice [14:43:47] like all our other fibers? [14:43:58] apergos: they're all san francisco time. [14:44:12] ok, well I'll just let the scrollback sit there then [14:44:22] no most of these are not labeled… i know the fpl and xo fibers now because of our outage [14:44:30] oh, you did get a response, just not an answer. [14:44:31] nevermind. [14:44:33] the backend upgrade looks pretty straightforward with one exception [14:44:38] not sure how you test the backends specifically [14:44:44] the other hostway is going to csw1 16/4 [14:44:49] cmjohnson1: can you trace where this fiber is going, and make sure EVERY fiber has a unique cable like all our other cables? [14:44:51] like a given one of them [14:45:09] cmjohnson1: right [14:45:14] they are all unique [14:45:15] apergos: I was able to test on the test clusters because I could upgrade a majority then watch requests succeed. [14:45:25] oh [14:45:32] you can look at some of the objects stored on a specific back end and test with those objects... [14:45:39] hmm well we wanted to upgrade one, not toss it back in yet, and figure out how to test it [14:45:46] cmjohnson1: so can you give me our cable nrs then? [14:45:50] then I can put them in the devices [14:45:58] for example, what nr is on xe-0/0/0 [14:46:02] on cr2-pmtpa [14:46:05] apergos: that you can't do. they're not conveniently behind a load balancer like the proxies. [14:46:22] I mean, I suppose you could make connections directly to port 6000/6001/6002 [14:46:23] that's supposed to be the fiber going to csw1-sdtpa:e16/4 [14:46:37] hmm [14:46:47] but as soon as you start the processes, it'll be 'back in'. [14:47:12] uh [14:47:29] xe 0/0/0 is # 6005 and is going to csw1 14/3 [14:47:29] so there is no way to test it without it getting production requests? [14:47:52] I suppose you could put up a firewall that blocks access from the rest of the cluster... [14:48:07] !log analytics1023-1027 rotating down for removal of extra nic [14:48:16] Logged the message, RobH [14:49:08] mark: xe-0/0/1 is 6001 goes to c2 on the cwdm [14:49:27] xe-0/0/2 is the xo link [14:50:22] i believe it is xe-1/0/1 6001 going to 16/3 [14:50:46] 1/0/2 is fpl [14:51:19] cmjohnson1: ok [14:52:44] um so ... our objects are replicated to two other backends, right? and when a node drops off, there's some sort of replication that happens to account for that node being down, is that manual? [14:53:10] cmjohnson1: there is no xo link in pmtpa [14:53:15] anymore [14:53:25] according to my info, xe-0/0/2 is one of the two hostway transits [14:53:30] can you confirm? and let me know what cable id? [14:53:34] i am in sdtpa [14:53:49] oh [14:53:52] darn [14:53:56] then i'm working on the wrong router [14:54:05] so everything you just mentioned was for the ports on cr1-sdtpa? [14:54:10] yes [14:54:17] ok [14:54:18] sec [14:54:50] huh [14:55:01] you mentioned #6001 twice [14:55:07] on xe-0/0/1 and on xe-1/0/1 [14:55:40] xe-0/0/1 is 6000 [14:55:42] sorry [14:55:44] ok [14:56:38] what's the cable id for the XO link?
[14:56:41] on xe-0/0/2 [14:57:34] we don't have our number system on that link [14:57:39] !log restarted puppet on brewster [14:57:48] Logged the message, RobH [14:57:51] but i do have sr1825716 [14:58:41] please put our own number on, like #6001 [14:58:55] sr1825716 is some other company's id [14:59:08] probably equinix [14:59:26] it is their label [14:59:55] back in a while. [14:59:58] yeah, so put our own on and let me know what the nr is please [15:00:40] 2162 will be the number [15:01:04] thanks [15:01:36] and can you tell me what the nr is on port xe-1/1/0? that's an FPL fiber [15:02:09] that is going to be 2163 [15:03:11] ok [15:03:35] while we are at it… 2164 will be the fpl link on csw1 13/1 [15:04:01] ok [15:04:22] here's the summary for cr1-sdtpa: [15:04:23] Interface Admin Link Description [15:04:23] xe-0/0/0 up up Core: << csw1-sdtpa:e14/3 {#6005} [10Gbps DF] [15:04:23] xe-0/0/1 up up Core: << cr2-pmtpa:xe-0/0/1 {#6000} [10Gbps CWDM] [15:04:23] xe-0/0/2 up up Transit: xe-0/0/3 down down EMPTY [15:04:24] xe-1/0/0 up up Core: << csw1-sdtpa:e16/3 {#6001} [10Gbps DF] [15:04:25] xe-1/1/0 up up Core: << cr2-eqiad:xe-5/2/1 (FPL/Level3, CV71028) {#2163} [10Gbps wave] [15:06:46] matches everything i see [15:06:50] cool [15:06:58] want to check upstairs as well? [15:07:14] you say, the other fpl link is on csw1-sdtpa 13/1 [15:07:17] that port doesn't exist I believe [15:07:32] 14/1 [15:07:32] yeah I want to have this fully correct [15:07:35] these are essential links [15:07:43] and this stuff being incorrect could be disastrous during an outage [15:07:48] fpl is 14/1 on csw1 [15:07:58] right [15:08:02] thanks [15:08:08] yes, you can go upstairs, and we'll audit there as well [15:13:30] New review: Dereckson; "Commit message issue." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/17503 [15:14:29] mark: so 1/1/0 is down but is connected on cr2 it is a hostway fiber [15:14:54] ok, it's not labeled at all... [15:15:00] any idea what it is? [15:15:55] i don't recall.. i remember it going in… but not sure for what now… i will find out [15:25:17] New patchset: Umherirrender; "(bug 34386) Enable e-mailing password based on e-mail address" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21326 [15:28:57] mark: psw1 is connected to cr2 via fiber # 2626 to cr2 5/2/3 [15:29:06] thanks [15:29:08] though the link is orange [15:29:14] so not setup, but it was alrady there [15:29:15] already even [15:30:22] and it's on xe-0/1/2 on the other side? [15:30:57] its the second 10g, so yep, starts on 0/1/0 right? [15:31:05] 0/1/0-3?
[15:31:25] i think that's 0/1/2 yes [15:31:29] also, all the uplinks from cr1/2 to row c show green on the row c side, so must be software stuff [15:31:32] then yep [15:31:39] ok [15:31:45] the row C uplinks don't have cable ids set in the routers [15:31:48] so I'd like to have those as well [15:31:54] i can snag right now, brb [15:35:44] asw-c1-eqiad:0/1/0 1984:0/1/2 2826 asw-c8-eqiad:0/1/0 2808:0/1/2 2827 [15:35:47] mark: ^ [15:36:42] I am going to go ahead and label the new cross-connect 2976 [15:36:44] 0/1/0 doesn't sound right [15:36:51] ah that's 1/1/0 of course [15:37:03] sorry, i thought it was 0 for the addon, its 1 [15:37:13] so s/0/1 for all the first part of each string [15:37:45] you know, that notation really doesn't make it any less confusing ;-) [15:38:01] heh, which notation =P [15:38:10] the incorrect ones, or the correction ;P [15:38:48] the new cross connect is 2953 (not 2976) [15:38:57] define new cross connect [15:39:00] what port should I go to on cr1? [15:39:05] it's in the ticket [15:39:08] the one that you are having me put in for next week, oh [15:39:08] ok [15:39:23] found it. [15:39:41] mark: Need any other networking items I should address before I do this? [15:39:55] I realize its getting later in the day for you there, so I don't wanna keep you about waiting on me. [15:40:01] don't worry [15:40:15] so one uplink for row C is down [15:40:17] disabled [15:40:21] do you remember why? [15:40:24] nope [15:40:25] perhaps it had a dirty fiber or optic or so [15:40:34] which one, i can pull and wipe [15:41:02] huh [15:41:03] now they're up [15:41:18] sec, lemme check [15:42:13] ah I see it [15:42:27] it's a config issue indeed [15:43:33] actually [15:43:40] those ids you gave above are not really readable [15:43:45] which id is which? [15:44:09] so in c1 [15:44:22] the first uplink is 1984 and the second is 2826 [15:44:39] so thats 1/1/0 and 1/1/2 ? [15:44:47] ah right [15:45:13] so then in asw-c8-eqiad in the first upload is 2808 and then 2827, so thats 8/1/0 and 8/1/2 ? [15:45:28] (i forgot the first number changed based on which switch in stack) [15:45:40] uplink even, not upload, sorry [15:46:04] That is how it works, yes? [15:47:11] yeah [15:47:17] the first number is basically the rack number [15:47:19] starts with 1 [15:47:25] only because we don't assign a switch 0 in that case [15:47:30] (but normally it starts from 0) [15:47:59] so asw-c3-eqiad:ge-0/7 is the same as asw-c-eqiad:ge-1/0/7 [15:48:16] and 0 is the 24 or 48 copper ports, 1 is the fiber uplinks [15:49:48] RobH: do you have the id of the fiber (to psw1) on xe-5/2/3 on cr1-eqiad for me? [15:52:11] hi paravoid [15:52:14] bah. [15:53:20] hrmph. well, good enough. [15:54:18] hi [15:54:51] so... 499s and backends! [15:55:08] trying to correlate pcap with logs [15:55:19] mark: psw1:0/1/0 is a 1m fiber with # 2627 [15:55:20] haven't found a match yet. [15:55:34] where's your .cap file? [15:55:52] on my computer [15:56:10] can you throw a copy up somewhere? I'll look too? [15:56:17] RobH: and that's the ex4200-24F, correct? [15:56:32] or I'll just make my own... [15:56:41] it shouldn't take too long to catch a 499, right? [15:56:45] yeah, better to make your own, will increase our chances [15:56:51] you won't see a 499 on the tcpdump [15:56:57] why not? 
[15:57:00] since it's not a real http response code as I was saying yesterday [15:57:06] it's something that's only logged locally [15:57:20] I'm trying to find a logline that contains 499 and correlate it with the request [15:57:24] and see what happened with that request [15:57:28] mark: confirmed, i assume the 24F denotes 24 sfp ports? [15:57:31] cuz thats what it has [15:57:32] yes [15:57:35] gotcha. [15:57:36] and [15:57:37] k. [15:57:42] there's also "psw2-eqiad" [15:57:45] is that an EX4500? [15:58:00] mark: the cross-connect is run, putting the label # in ticket and resolving. that is correct, psw2-eqiad is a 4500 [15:58:07] that needs to have an RMA for some packet loss issue [15:58:09] does it have any ports connected? [15:58:16] if not, we can disconnect it now [15:58:21] yes to 5/3/3 [15:58:22] it's meant to be a spare [15:58:24] yes [15:58:26] but no others right? [15:58:30] thats all it has [15:58:32] ok [15:58:37] you can unplug 5/3/3 then [15:58:37] that and a mgmt link [15:58:44] leave mgmt link right? [15:58:47] yes [15:59:11] maplebed_: "ssh ms-fe1.pmtpa.wmnet tcpdump -i eth0 -pn -c 5000 -U -s 0 -w - host 10.2.1.27 | wireshark -k -i -" is what I do. [15:59:20] mark: done [15:59:31] but I run a linux desktop, so ymmv [15:59:35] xe-5/0/3 up up << pfw1-eqiad:xe-6/0/0 [15:59:39] which cable id is that? [15:59:50] what's the -pn and -c 5000 and -U? I understand the rest. [16:00:07] 2953 [16:00:16] tnx [16:00:20] mark: 2953 in cr1:5/2/2 [16:00:26] -p = non-promiscuous, -n = do not resolve, -c = stop after capturing N packets, -U = unbuffered, needed for the pipeline [16:00:33] wait [16:00:37] 2953 in 5/2/2 [16:00:42] 5/2/2 is the new run you just did [16:00:49] yes [16:00:54] yes, sorry [16:00:58] I was asking for 5/0/3 [16:01:00] 5/0/3 checking [16:01:06] I also see the multiple HEADs [16:01:10] that really really sucks [16:01:20] it's HEAD -> 404, HEAD -> 404, HEAD -> 404, HEAD -> 404 for the same URL [16:01:25] cool. I'm just going to -w /tmp/blah and look locally. [16:01:33] cr1:5/0/3 label is 2952 [16:01:41] thanks [16:01:49] any others you dont have on cr1? [16:01:54] no that's all [16:01:58] need any off cr2? [16:02:05] nope, got all those [16:02:10] except the new ports for the netapps [16:02:14] but those aren't run yet ;) [16:02:21] xe-5/3/2 on both [16:02:29] oh [16:02:36] hrmm, i will poke cdw on that. [16:02:37] could you tell me which ports are actually copper sfp+? [16:02:46] then I won't call those "dark fibers" ;) [16:03:39] on cr1 just 5/0/0 [16:04:05] on cr2 5/2/0 [16:04:09] thats it on the cr1/2 [16:04:21] then all the DAC for memcached and such [16:04:26] but i assume you meant just cr1/2? [16:04:28] yeah don't need those [16:04:29] yep [16:04:39] 5/2/0? really? [16:04:45] that's a link between the two routers [16:05:24] im having a bad day [16:05:26] (tired) [16:05:29] 5/1/0 [16:05:34] sorry =P [16:05:36] ok [16:05:42] that makes sense [16:05:51] its to the access switch in its own rack [16:06:19] I am going to clean this place up and get it organized, then head home and handle the netapp followup and the caching center server quotes [16:06:37] unless you have more networking stuff you need, then i stay [16:06:40] =] [16:06:54] (otherwise im too damned tired to be in here futzing around) [16:07:19] ehm [16:07:26] can you tell me if there's a link from cr2 to frack? [16:07:33] i can, lemme check [16:07:48] maplebed_: I see 7 HEADs in a row for a file that doesn't exist [16:07:50] that's *bad* [16:08:10] I agree, but it's also been that way.
so it's not relevant to what we're doing now. [16:08:10] mark: so frack has the normal access switch, which is just pretty much a waste in it ;] then it has pfw1/2 [16:08:16] each pfw has a single fiber to it [16:08:26] also aaron's been looking at that and I'm confident will find the spot that's doing it. [16:08:28] i recall one to each cr but lemme compare the #s [16:08:38] maplebed_: yes agreed, not relevant for what we do, just noticing. [16:09:10] did you see the pastebin I mailed about this morning? [16:09:27] it's ... illuminating to watch a single file move go by... [16:09:30] no? [16:09:54] http://pastebin.com/anpNn3tZ [16:10:05] mark: pfw1 to cr1:5/0/3 is 2952, pfw2 to cr2:5/0/3 is 2954 [16:10:17] and that's actually a failed move, not a successful one. when it succeeds there are 3 more HEADs to the new location. [16:10:27] yeah [16:10:52] RobH: to which port on pfw2? [16:10:54] mark: its on the line card 6 on the pfw1/2 [16:10:56] hrmm [16:11:09] 6/0? [16:11:14] 6/0/0 i think [16:11:27] seems 0 is shared between a sfp port or a rj45 port [16:11:41] so on pfw1/2 6/0/0 is to cr1/2 [16:11:52] and 6/0/1 is to each other (pfw1/2) [16:12:17] also 0/0/1 is cat6 to one another [16:12:35] ok [16:12:40] thanks [16:12:42] and ports 12-16 on the line card in slots 2/4 [16:12:45] (if it matters) [16:12:54] prolly not but its crazy wired ;] [16:13:28] paravoid: if you're interested, I have a .cap file for an individual GET/499 request on ms-fe1 (and the associated log file) [16:13:48] and I bet I know what it's about. [16:14:19] paravoid: ms-fe1:/tmp/499only.{cap,log} [16:15:20] looking [16:16:15] it's polluted with other stuff because it looks like the squids reuse connections, but the one in the log line is in there. [16:17:53] the get is about the 30th packet in the .cap. [16:18:19] yeah I found it [16:18:29] it's a normal 404, I don't understand why that's logged as 499 [16:19:22] mark: anything else? [16:19:27] nope [16:19:27] that's it [16:19:37] it's not just that the thumb didn't exist; the original also didn't exist. [16:19:37] cool, i just cleaned up the cage, moved a bunch of crap to the storage [16:19:41] having storage is awfully nice. [16:20:15] headed home, back online shortly [16:20:40] I'm going to keep looking through that .cap. if you want this pair, all the entries in 499.log exist within the 499.cap file. [16:20:42] cmjohnson1: so, can we audit the connections on cr2-pmtpa now? [16:20:50] mark: was a link to as13680 but they want to make it a redundant uplink for us [16:21:10] xe 1/1/0 [16:21:17] cr2 [16:21:24] ok [16:21:28] so it will no longer be used? [16:21:39] I think this will need to be plugged in to the CWDM [16:21:43] not right now… so no need to enable it [16:21:45] paravoid: one thing I'm not sure I've told you before - when the thumb doesn't exist but gets generated, swift logs a 404 but sends the generated thumb back to the client with status 200. [16:21:49] on a free channel [16:21:52] can you do that? [16:22:11] so it will be interesting to check to see if all the 404s are thumbs that are generated and the 499s are thumbs with missing originals.
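For reference, a sketch of the capture-and-correlate workflow being used here; the status code sitting in field 12 of the syslog line is an assumption inferred from the "cut -f 12" used earlier, and the paths are illustrative:

    # capture frontend traffic on the proxy while 499s are being logged
    # (same flags maplebed uses above):
    tcpdump -i eth0 -pn -s 0 -w /tmp/499.cap port 80
    # pull out the proxy log lines whose status field is 499:
    awk '$12 == 499' /var/log/syslog > /tmp/499.log
    # then open the capture and chase one of the URLs from 499.log:
    tshark -r /tmp/499.cap -R 'http.request.uri contains "thumb"'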
[16:22:25] also I need cable ids for: [16:22:26] xe-0/0/0 up up Core: << csw1-sdtpa:e16/4 [16:22:26] xe-0/0/1 up up cr1-sdtpa:xe-0/0/1 [16:22:26] xe-0/0/2 up up Transit: xe-1/0/0 down down Core: << csw1-sdtpa:e14/4 [16:22:32] xe-1/2/0 down down Transit: not xe-1/2/0 I think [16:22:48] that's the one we're no longer gonna use [16:22:51] let me deconfigure that one [16:23:06] * maplebed_ just installed wireshark, so the next few checks might be a bit faster. [16:23:26] mark: xe-0/0/1 is a link to cwdm-pmtpa [16:23:38] yes [16:23:41] does it have a cable id? [16:24:03] 2164 [16:24:36] set [16:24:44] and xe-0/0/2? [16:24:57] that's our main hostway transit at the moment [16:25:07] you wanna make another link xe-1/2/0 to the cwdm..correct? [16:25:33] yeah but let's worry about that later [16:25:37] first I want all labeling correct [16:25:40] maplebed_: well, not anymore I guess [16:25:50] ok, time to dig in swift's source. [16:25:52] what's on cr2-pmtpa:xe-0/0/2? [16:26:02] hm? not anymore? [16:26:06] logs a 404 [16:26:09] it seems to log a 499 now. [16:26:33] link to as30217 label is 9002 [16:27:06] ok [16:27:26] xe-0/0/0 label 2166 [16:27:36] paravoid: no, the condition I described is different. [16:27:51] well, I'll dig more before saying that with confidence. [16:28:01] cool [16:28:03] then we have: [16:28:15] Interface Admin Link Description [16:28:15] xe-0/0/0 up up Core: << csw1-sdtpa:e16/4 {#2166} [10Gbps DF] [16:28:15] xe-0/0/1 up up Core: << cr1-sdtpa:xe-0/0/1 {#2164} [10Gbps CWDM] [16:28:15] xe-0/0/2 up up Transit: xe-0/0/3 up up << asw-d-pmtpa:xe-1/1/0 {#6021} [16:28:16] xe-1/0/0 down down Core: << csw1-sdtpa:e14/4 [16:28:16] xe-1/3/0 up up << asw-d-pmtpa:xe-3/1/0 {#6022} [16:28:22] maplebed_: yeah, obviously that's a wild guess of mine :) [16:28:31] but the two conditions are: [16:28:31] 1) thumb doesn't exist but is successfully generated [16:28:31] 2) thumb doesn't exist and generation fails because the original doesn't exist [16:28:39] I think 1) logs a 404 and 2) logs a 499. [16:28:43] right, in the cap you sent it was (2) [16:28:56] whereas both used to log a 404. [16:29:06] mark: that is correct [16:29:11] cool [16:29:24] awesome [16:29:26] on the CWDM [16:29:29] how many channels do we have? [16:29:59] paravoid: also, I think you'll find a clue in the rewrite.py source looking at when it passes through the connection it gets handed and when it makes its own to send to the client. [16:30:10] ok. enough gabbing. wiresharking now. [16:30:12] there are 2 available [16:30:16] for a total of 4 [16:30:24] only 4? :/ [16:30:33] ok [16:30:36] so the ones in use [16:30:39] one is the management link [16:30:42] yes [16:30:48] and the other is the new link between cr1-sdtpa and cr2-pmtpa, right [16:30:58] that last one is not right I think [16:31:03] i wish leslie was here to confirm with her [16:31:07] correct [16:31:18] we need to use the other 2 also [16:31:36] one to bring the FPL link which is now on csw1-sdtpa to cr2-pmtpa instead [16:31:48] and one to bring the 2nd, now unused hostway transit to cr1-sdtpa [16:31:54] then our CWDM system will already be full [16:31:55] meh [16:32:26] but I guess we won't do this now [16:32:31] given that it's friday, and leslie is sick [16:33:42] your call [16:33:54] oh one more question [16:34:05] you said that you looked at csw5-pmtpa, and that those modules couldn't be used [16:34:12] yet it seems you guys inserted a new module into csw1-sdtpa [16:34:15] where did it come from [16:34:16] ? 
[16:35:57] from csw5… i think leslie was confused on what you wanted… but we still need one xfp [16:41:40] so csw5-pmtpa had one 4xXFP module? [16:41:43] and that you took? [16:42:02] yes [16:42:04] ok [16:42:07] are you near it now? [16:42:16] it probably has multiple switch fabrics too [16:42:19] and added a card from csw5 to csw1 [16:42:21] those in the middle, SFM3 [16:42:24] csw5? [16:42:26] yes [16:42:29] yep [16:42:35] can you take one out [16:42:39] we can put that in csw1-sdtpa [16:45:06] mark: i have one with a nortel 8mb flash card… do you want that? [16:46:11] no [16:46:14] that's the management module [16:46:18] the switch fabrics are in the middle [16:46:48] http://www.nedworks.org/~mark/presentations/hd2006/Csw5-pmtpa.jpg [16:46:57] those two with "pwr" and "active" leds [16:48:37] okay..got it [16:48:50] just the one? [16:49:00] there's only space for one more in csw1 [16:49:52] F1: RX-BI-SFM3 Switch Fabric Module OK [16:49:52] F2: RX-BI-SFM3 Switch Fabric Module OK [16:49:52] F3: RX-BI-SFM3 Switch Fabric Module OK [16:49:53] F4: RX-BI-SFM3 Switch Fabric Module not present [16:50:18] makes sense now :P [16:50:34] we just keep csw5-pmtpa for spare parts [16:50:50] so if any remain, we can use those if the ones in csw1-sdtpa go bad [16:51:11] that's why hw couldn't have csw5 [16:51:45] there are no remaining 4x XFP modules in csw5, right? [16:51:50] no [16:51:57] i wonder if there are any in the closets [16:51:59] did you check? [16:52:10] i vaguely recall ordering 2 about a year ago [16:52:23] i checked up here..i will look downstairs when i go [16:52:27] ok [16:52:33] i'm not sure why we would need more than one [16:52:35] except as a spare [16:54:33] cmjohnson1: are you aware btw that csw1-sdtpa has some broken slots? [16:54:43] i have one of these Cisco GLC-SX-MM 1000BASE-SX SFP Modules [16:54:43] if i recall correctly, 3 out of 16 slots have a missing connector or something [16:54:52] no I am not aware [16:54:56] ok [16:54:59] so not all 16 slots are usable [16:55:08] but we never got it replaced as that's nearly impossible without a lot of downtime [16:55:13] and we didn't think we're ever gonna need all 16 [16:55:39] but we use 9 cards now [16:55:48] how can we tell which are bad? [16:55:54] by looking inside ;) [16:56:05] i'm telling you this so you won't be surprised if you ever need to insert a card [16:56:21] rob already went through that at the time hehe [16:56:37] i only remember that slot 13 was bad [16:56:42] and I thought that was funny [16:56:45] but there are 2 more iirc [16:57:17] that is good to know [16:57:18] nice eh, with such an expensive switch [16:57:42] it came like that… i imagine? [16:57:45] yes [16:57:53] we just never noticed until we tried to insert a new card some time [16:58:10] someone seriously bodged it up in production [16:58:16] shit way to find out we bought broken equipment [16:58:17] but unfortunately we were already heavily reliant on it then [16:58:52] going to move back down… brb [17:03:18] New patchset: Ottomata; "Using raid1-250G-1partition.cfg to partition analytics1023-1027." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21334 [17:04:04] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21334 [17:05:58] AaronSchulz: goood morning :) [17:06:19] mark: the 4th switching fabric is in [17:06:24] oh [17:06:30] don't do that without a warning ever again [17:06:41] there's always a risk it goes bad [17:06:47] and if we're not standby :) [17:06:55] but yep, it's up [17:06:56] thanks [17:07:15] … famous last words but… i thought that just as I inserted it [17:07:30] nearly EVERYTHING relies on that box [17:08:07] that is the thought that crossed my mind… it was an "oh shit" kind of moment… but yep… never again [17:08:13] ok ;) [17:08:25] with the 4th switching fabric it's now at full capacity [17:08:42] it has less switching capacity with fewer fabrics [17:08:51] so once you add a certain amount of cards, it can't do the full throughput otherwise [17:08:57] and since we had them available... :) [17:09:29] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21334 [17:09:45] will there be a noticeable change to anything? [17:09:56] no, but there would have been if we added more cards [17:10:10] you guys added one this week [17:10:29] if the box is only half full you need 2-3 [17:11:00] i would have to do the math to check when it matters, but better to be safe than sorry [17:11:22] i agree there. [17:11:37] i see it also has 8 PSUs [17:11:39] so no other xfp modules here… i can put in a procurement ticket for a couple of spares [17:11:43] of which 4 are installed [17:11:53] cmjohnson1: xfp modules or the 4xXFP module line card? [17:11:56] different thing [17:12:06] i was talking about the same kind of card you guys inserted into csw1 [17:12:19] then we have one more of those [17:12:30] ok [17:12:31] i am thinking of the module? [17:12:38] well they're both called modules sometimes [17:12:40] it's a bit confusing [17:12:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20854 [17:12:42] one is a line card [17:12:44] one is not [17:12:48] so let's call them line cards from now on ;) [17:13:03] well I am good now… but thx [17:13:06] leslie and rob were gonna buy 1-2 of those line cards [17:13:12] and then I said, we have some available in csw5 I think [17:13:33] one has been inserted into csw1 already, so all is good [17:13:38] the other one can remain spare wherever it is [17:13:39] good if one dies [17:13:49] just keep track of it [17:14:07] it's the RX-BI-4XG [17:15:06] i am going to leave it in csw5 for now [17:15:10] good [17:15:59] the power supplies in csw5 [17:16:03] they can also be used for csw1 [17:16:07] it seems the system has enough power now [17:16:14] but anyway [17:16:24] pretty much every card that fits in csw5 also fits in csw1 [17:16:46] all of them in fact, I checked [17:18:31] i guess that's all I have for now [17:18:34] we'll probably do more next week [17:24:36] RECOVERY - Puppet freshness on search22 is OK: puppet ran at Fri Aug 24 17:24:31 UTC 2012 [17:25:14] New review: Dzahn; "Nikerabbit: for clarification, this is related to: http://meta.wikimedia.org/wiki/Planet_Wikimedia#R..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21283 [17:26:06] RECOVERY - Puppet freshness on mw74 is OK: puppet ran at Fri Aug 24 17:26:02 UTC 2012 [17:26:59] ottomata: so just to be clear, is there more work to be done than just replacing stat1:/mnt/htdocs by a local filesystem?
[17:27:08] and copying the files obviously [17:28:04] mutante, thanks for merging my netboot change [17:42:45] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [17:46:39] RECOVERY - Puppet freshness on db42 is OK: puppet ran at Fri Aug 24 17:46:17 UTC 2012 [17:47:06] RECOVERY - Puppet freshness on mw10 is OK: puppet ran at Fri Aug 24 17:46:55 UTC 2012 [17:47:42] RECOVERY - Puppet freshness on mw35 is OK: puppet ran at Fri Aug 24 17:47:34 UTC 2012 [17:48:26] hey um, RobH and notpeter [17:48:36] RECOVERY - Puppet freshness on mw36 is OK: puppet ran at Fri Aug 24 17:48:12 UTC 2012 [17:48:42] i'm having trouble with the remaining analytics install again, and I think it is something we might have run into before [17:48:48] whats that? [17:48:52] RobH helped me this morning with the networking problems they were having [17:48:53] that's good [17:48:57] but I just installed [17:49:02] and it PXE booted after it finished [17:49:06] asking me to re-partition again [17:49:12] RECOVERY - Puppet freshness on mw5 is OK: puppet ran at Fri Aug 24 17:48:56 UTC 2012 [17:49:22] boot into the bios and ensure the boot order is set to the disk first [17:49:33] ah right, ok, will try that [17:50:06] RECOVERY - Puppet freshness on mw54 is OK: puppet ran at Fri Aug 24 17:49:44 UTC 2012 [17:50:42] RECOVERY - Puppet freshness on mw57 is OK: puppet ran at Fri Aug 24 17:50:26 UTC 2012 [17:51:36] RECOVERY - Puppet freshness on amssq36 is OK: puppet ran at Fri Aug 24 17:51:07 UTC 2012 [17:51:36] RECOVERY - Puppet freshness on amssq46 is OK: puppet ran at Fri Aug 24 17:51:27 UTC 2012 [17:51:45] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [17:52:12] RECOVERY - Puppet freshness on srv235 is OK: puppet ran at Fri Aug 24 17:52:02 UTC 2012 [17:52:28] hmmmmm RobH [17:52:29] error: grub rescue> [17:52:39] RECOVERY - Puppet freshness on sq62 is OK: puppet ran at Fri Aug 24 17:52:20 UTC 2012 [17:52:52] so are these set to autopart? [17:53:00] sounds like the reboot wiped your partitions [17:53:02] (if so) [17:53:06] RECOVERY - Puppet freshness on srv258 is OK: puppet ran at Fri Aug 24 17:52:47 UTC 2012 [17:53:12] if not, then something else is up [17:53:16] ? [17:53:19] they are…but it prompts me to confirm [17:53:21] and I did not confirm [17:53:33] ok, i will reinstall fully and see what happens [17:53:42] RECOVERY - Puppet freshness on srv269 is OK: puppet ran at Fri Aug 24 17:53:28 UTC 2012 [17:53:42] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Fri Aug 24 17:53:34 UTC 2012 [17:53:44] now that it should boot disk before pxe since I saved it in bios [17:53:44] hrmm, may not matter, it is set to not confirm [17:53:46] but does anyhow [17:53:52] pxe boot should be one time [17:53:56] yep [17:54:04] you can always f12 if you dont wanna run the drac command [17:54:09] RECOVERY - Puppet freshness on analytics1003 is OK: puppet ran at Fri Aug 24 17:53:56 UTC 2012 [17:54:09] RECOVERY - Puppet freshness on virt5 is OK: puppet ran at Fri Aug 24 17:53:57 UTC 2012 [17:54:09] (during post) [17:54:10] yeah, i've heard that is difficult to get the partman autoconfirm stuff to work properly [17:54:23] how do I send F12 without using F12? 
[17:54:27] (my F12 is mapped) [17:54:33] it is, and since its set to not confirm, i have seen it run some disk destructive commands when in auto mode [17:54:36] RECOVERY - Puppet freshness on arsenic is OK: puppet ran at Fri Aug 24 17:54:12 UTC 2012 [17:54:56] Use the <@> key sequence for [17:55:02] hmmmm ok [17:55:09] or you can just use the drac commands [17:55:12] RECOVERY - Puppet freshness on virt6 is OK: puppet ran at Fri Aug 24 17:54:38 UTC 2012 [17:55:21] can I exit out of console back to drac? [17:55:28] yep, ctrl+\ [17:55:29] ah got it [17:55:31] then in drac [17:55:31] danke [17:55:35] racadm config -g cfgServerInfo -o cfgServerBootOnce 1 [17:55:35] racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE [17:55:35] racadm serveraction powercycle [17:55:37] console com2 [17:55:49] make sure you have the first line, or it may change the permanent boot order. [17:55:54] oh [17:55:56] you know [17:55:59] an24-27 all worked! [17:56:06] just an23 that is being annoying [17:56:06] yeah [17:56:06] RECOVERY - Puppet freshness on ssl3002 is OK: puppet ran at Fri Aug 24 17:55:49 UTC 2012 [17:56:15] am copy/pasting from build a new server page [17:56:17] being anal? [17:56:28] anal1023, yup [17:56:46] * RobH ain't touching that. [17:57:06] hehhe, too bad those aren't the real hostnames :) I think it was the two of you who weren't a fan of that :p [17:59:00] i was a fan of that [17:59:42] RECOVERY - Puppet freshness on cp1002 is OK: puppet ran at Fri Aug 24 17:59:20 UTC 2012 [18:01:12] RECOVERY - Puppet freshness on cp1012 is OK: puppet ran at Fri Aug 24 18:00:46 UTC 2012 [18:01:40] RECOVERY - Puppet freshness on db1004 is OK: puppet ran at Fri Aug 24 18:01:35 UTC 2012 [18:01:52] paravoid,maplebed: how are upgrades coming? [18:02:06] RECOVERY - Puppet freshness on sq85 is OK: puppet ran at Fri Aug 24 18:01:59 UTC 2012 [18:02:42] RECOVERY - Puppet freshness on db1026 is OK: puppet ran at Fri Aug 24 18:02:28 UTC 2012 [18:03:40] RECOVERY - Puppet freshness on db1050 is OK: puppet ran at Fri Aug 24 18:03:19 UTC 2012 [18:04:07] RECOVERY - Puppet freshness on marmontel is OK: puppet ran at Fri Aug 24 18:03:46 UTC 2012 [18:04:07] RECOVERY - Puppet freshness on es1001 is OK: puppet ran at Fri Aug 24 18:04:00 UTC 2012 [18:05:10] RECOVERY - Puppet freshness on lvs1 is OK: puppet ran at Fri Aug 24 18:04:38 UTC 2012 [18:05:10] RECOVERY - Puppet freshness on knsq29 is OK: puppet ran at Fri Aug 24 18:05:03 UTC 2012 [18:07:08] RECOVERY - Puppet freshness on sq54 is OK: puppet ran at Fri Aug 24 18:06:41 UTC 2012 [18:07:43] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [18:07:43] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [18:07:43] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [18:07:43] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [18:07:43] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [18:07:44] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [18:07:44] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [18:07:45] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [18:07:45] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [18:07:46] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not 
run in the last 10 hours [18:07:46] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [18:07:47] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [18:07:47] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:09:52] aahhhh maaan [18:09:53] RobH [18:09:57] 250G raid 1 [18:10:02] is filling up the rest of the disk with swap! [18:10:48] or at least, hm [18:10:51] there is way too much swap [18:11:05] ?? 1443G? [18:11:21] ....thats fubar [18:12:14] I had another .cfg do this to me as well [18:12:21] i was working with the analytics-dell.cfg [18:12:26] trying to get mirrored raid to work this way [18:12:29] and had this problem [18:12:32] 4000 4000 4100 linux-swap [18:12:50] ottomata: it should just use 4100 for swap?! weird [18:12:55] i know! [18:13:13] root@analytics1024:~# free -g [18:13:13] total used free shared buffers cached [18:13:13] Mem: 7 1 5 0 0 0 [18:13:13] -/+ buffers/cache: 1 6 [18:13:13] Swap: 1443 0 1443 [18:13:21] do you see the "-1" in the raid definition above? [18:13:27] that is what makes it take "all the rest" [18:13:34] ? [18:13:34] you could turn the -1 into an actual value [18:13:40] hm [18:13:50] but i agree it is strange it does that for swap [18:13:56] but my raid / is only 207G [18:14:02] /dev/md0 207G 1.5G 195G 1% / [18:14:19] which is fine [18:14:29] do you want LVM btw? [18:14:44] i'd be fine with that, but we def want mirrored raid on / for these [18:14:45] i just used a recipe that gives me LVM and raid and worked fine [18:15:05] raid1-lvm.cfg [18:15:05] which one? [18:15:10] oooooo [18:15:48] you know, someone/we/me if I knew more about this [18:15:52] should templatize these partmans [18:16:04] too bad I can't do [18:16:07] that would be ideal yep =P [18:16:33] partman_recipe { … raid => 1, partitions => { "/" => 250G … } … } [18:16:50] so mutante [18:16:52] 64 1000 1000000 raid $primary{ } $lvmignore { } method{ raid } \ [18:16:55] will that do 1TB / [18:16:56] ? [18:17:55] also, is there a reason not to use ext4? [18:18:27] i used that on zirconium. i get a /dev/md0 on / with a size of 9.2G [18:18:46] it does not use all of the diskspace on purpose [18:18:55] what does this meeeaaaan? [18:18:55] 64 1000 1000000 [18:18:56] to ensure there are free "extents" [18:19:07] i'd like to have about a 30G root [18:19:12] 10G is a wee small [18:19:13] so you could extend it and/or take LVM snapshots [18:19:17] yeah [18:19:21] totally [18:19:54] wait, but your / in this recipe is not lvm, right? [18:19:58] 1 2 0 ext4 / /dev/sda1#/dev/sdb1 \ [18:19:58] ? [18:20:00] sorry [18:20:05] 1 2 0 ext3 / /dev/sda1#/dev/sdb1 \ [18:20:22] you just have swap on lvm? [18:21:03] hrmm, you are right. lvdisplay just has an LV /dev/zirconium/swap .. sigh [18:21:04] ottomata: the three numbers are minimum-size, priority, and maximum-size. [18:21:18] priority is only relative between different partitions [18:21:21] right [18:21:26] that's what I thought [18:21:31] maximum size accepts -1 as "just take the rest." [18:21:32] in bytes? [18:21:34] mb? [18:21:52] MB, right? [18:22:21] I think MB. [18:22:30] yeah [18:22:30] so [18:22:39] how did mutante get a 9.2G / out of that? [18:22:52] that looks to me like 64MB - 1TB [18:22:57] 64    1000    1000000 [18:23:00] which recipe are you looking at?
[18:23:07] raid1-lvm.cfg [18:23:11] partman will choose something between minimum and maximum [18:23:18] raid1-lvm.cfg , the results can be seen on host zirconium [18:23:21] and sslXXX [18:24:13] it's true, lvdisplay just shows swap, and / is md0, but we want LVM on top of raid or raid on top of LVM..? hrmm [18:24:21] 9.2G looks like the first line [18:24:21] lvm on top of raid [18:24:40] i don't really like lvm on boot partitions [18:24:42] so, really [18:24:46] either all of / on raid [18:24:52] or, like lvm.cfg does [18:25:00] the highest priority assignment is the first line (8000) between 5G and 10G. Formatted, I could see 10G turning into 9.2G. [18:25:02] a small /boot partition on raw somewhere (hopefully raided) [18:25:24] ahhhh [18:25:27] i see [18:25:28] hm [18:25:42] so there are two raid partitions on sda and sdb? [18:25:46] yeah, 9.2 vs. 10 is just 1000 vs. 1024 i think [18:25:58] aye, md0 md1 [18:25:59] i see them [18:26:01] on zirconium [18:26:04] Change abandoned: DamianZaremba; "IP is wrong" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21297 [18:26:47] ahhh ok cool [18:26:54] yeah mutante that looks pretty good then, I see [18:26:59] you did md0 at 9.2G [18:27:01] and put / on that [18:27:06] and then filled up md1 with the rest of the disk [18:27:27] pvcreate and created a volume group that uses all of md1 [18:27:37] and then created a 1GB swap lv [18:27:38] Alloc PE / Size 238 / 952.00 MiB [18:27:38] Free PE / Size 235844 / 921.27 GiB [18:27:51] aye, 1TB [18:27:53] ok [18:27:55] i like that setup [18:28:09] i'm going to stop puppet on brewster and try a slightly modified one out [18:28:10] yeah, this is what i was primarily looking for, having Free PE, re: the mail from Mark recently about not using all of the space with LVM [18:28:12] ummm, i forget [18:28:22] where is netboot.cfg on brewster? [18:28:26] yeah totally [18:28:29] ottomata: hold on [18:28:32] ok [18:29:15] I thought I saw mail from someone saying something about puppet on brewster. [18:29:30] bwerrrrrr [18:29:45] maplebed_: i admin logged about it earlier [18:29:52] ah, that was it. [18:29:54] but i was just stopping and restarting puppet on it to test a dhcp setting [18:29:55] not mail; admin log. [18:30:02] you're done then? [18:30:13] yea, placed it back and logged it, been back to normal for over an hour [18:30:17] so I see. [18:30:27] (you admin logged when you were done too.) [18:30:29] :) [18:30:32] aye cool, so um, where is that file? i'm just going to manually change for an23 so I can check before committing [18:30:43] ottomata: /srv/autoinstall [18:30:44] maplebed_: there is never too much admin logging. [18:30:46] =] [18:30:48] danke [18:30:52] de rien [18:31:58] usually i just stop puppet using the init script, but an alternative is puppet agent --disable [18:32:21] i used the init script [18:32:25] does agent --disable do the same? [18:32:37] with the latter one you still see a running process but it will not actually run and say "already running" in logs [18:32:40] !log temporarily stopping puppet on brewster to test out partman change for analytics1023-1027 [18:32:48] hm weird [18:32:50] Logged the message, Master [18:33:01] one time i was wondering why Nagios said it would not run on some hosts, when i saw the process [18:33:22] and it looked like somebody had used the --disable, cause i could have --enable make it work again [18:33:58] ok, sounds like that could cause confusion, so I will keep using init script :) [18:42:33] can i use srv193 for testing currently?
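To make the partman discussion above concrete, here is a sketch of a RAID1-plus-LVM recipe fragment in the style raid1-lvm.cfg uses; the sizes and stanzas are illustrative, not the actual file contents. Each stanza is minimum-size, priority, maximum-size (all in MB), then the type; priority only matters relative to the other stanzas, and a maximum of -1 means "take all remaining space", which is what produced the 1443G swap above:

    30000 8000 30000 raid \
        $primary{ } method{ raid } \
    . \
    64 1000 1000000 raid \
        $primary{ } $lvmignore{ } method{ raid } \
    . \
    1000 3000 1000 linux-swap \
        $defaultignore{ } $lvmok{ } method{ swap } format{ } \
    .

The first stanza pins the root array at a fixed 30G (minimum equals maximum) at the highest priority; the second bounds the LVM physical volume at roughly 1TB instead of using -1; the third carves a fixed ~1G swap LV out of the volume group, matching the "Swap: 951" seen on the rebuilt analytics host below.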
[18:45:16] Krinkle: hello :) [18:45:22] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 184 seconds [18:45:31] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 188 seconds [18:46:03] mutante: Hi [18:46:43] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 192 seconds [18:47:01] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 201 seconds [18:47:32] Krinkle: there is a small config change in mw-config that i merged, but i think it has not been pushed out. would you mind checking? i looked at the docs you just changed regarding the sync-scripts and saw some local changes that made me stop [18:47:48] Hm... [18:48:08] I didn't change any scripts on fenari/nfs, I've only been observing and trying to update docs [18:48:35] i see the page redirects to "Wikimedia binaries" now, right [18:48:50] Yeah, because they're in /h/w/bin and not all are sync scripts [18:49:01] (e.g. apache-graceful-all) [18:49:20] they aren't really binaries in a strict sense, but yeah [18:49:32] mutante: I know, most are bash scripts [18:49:46] until proven otherwise, I'd say all are bash scripts [18:49:58] but they're in the PATH and the dir is called ./bin that's historically binaries, right ? [18:50:13] yea [18:51:07] so would you know what the step is after merging in gerrit, and before using sync-file ? [18:51:16] They're all bash scripts [18:51:19] just git pull in /h/w/common/wmf-config ? [18:51:21] And they are NO LONGER in /h/w/bin [18:51:23] mutante: git pull ? [18:51:26] but there are untracked and modified files there [18:51:30] The canonical ones are now in /usr/local/bin via puppet [18:51:39] /h/w/bin is kept around for b/c I think [18:51:50] mutante: Let me take a look [18:51:56] i did not want to pull without knowing about the modified filebackend.php [18:52:00] thanks Roan [18:52:18] perfect, thanks mutante and RobH [18:52:24] I don't know if a simple git pull is correct, I've never done deployment yet. But yeah, I'd say git pull on fenari, maybe make sure there was no merge commit (in case a dinosaur made local changes), and then sync the right files [18:52:25] using modified lvm raid1 works great [18:52:27] filebackend.php has uncommitted modifications [18:52:29] * RoanKattouw blames AaronSchulz [18:52:32] Swap: 951 0 951 [18:52:38] /dev/md0 28G 1.3G 26G 5% / [18:52:45] Yes, you'll want git pull and watch out for merge commits [18:52:49] That part is right [18:52:54] and sync-file-all instead of sync-file, unless you're on an apache and only want it on that apache. [18:52:57] No [18:52:57] (right, Roan?) [18:53:01] There is no sync-file-all [18:53:05] oh, lol [18:53:12] There is documentation on this, I'm not sure if it's up to date [18:53:33] https://wikitech.wikimedia.org/view/Wikimedia_binaries#sync-file [18:53:34] Yep [18:53:36] it's an all script [18:53:46] https://wikitech.wikimedia.org/view/How_to_do_a_configuration_change#Change_wiki_configuration [18:53:58] ah, only sync-common (all of ./common) has a per-apache version [18:54:04] RoanKattouw: what we want to push out is an obvious typo in a wikt. there was "atroller" instead of "patroller"..
a nice typo actually :) hehe [18:54:13] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [18:54:16] lol [18:54:31] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [18:54:55] Running git pull should be fine [18:55:05] New patchset: Ottomata; "Changing partman recipe for analytics1023-1027" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21348 [18:55:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21348 [18:56:03] mutante, could you merge that for me? [18:56:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21348 [18:57:07] !log git pull in /h/w/common/wmf-config [18:57:17] Logged the message, Master [18:57:22] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [18:57:22] maplebed_: is the host hack still on srv193? [18:57:31] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [18:57:41] AaronSchulz: probably. I'll check and remove it if it is. [18:57:59] gone. [18:58:23] New patchset: Ottomata; "misc/statistics.pp - fixing sampled rsync job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21349 [18:58:45] !log sync-file ./wmf-config/InitialiseSettings.php [18:58:47] danke mutante, could you do that one too real quick? [18:58:53] will fix an unrelated puppet error on stat1 [18:58:55] Logged the message, Master [18:59:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21349 [18:59:06] ottomata: after the sync file.. brb [18:59:09] * RoanKattouw has fixed up the docs a bit https://wikitech.wikimedia.org/index.php?title=Heterogeneous_deployment&diff=next&oldid=50040 [18:59:34] ah, spence is also in the list now [18:59:38] Tim just added it [18:59:44] mutante: Did the sync-file run yet? It should log itself in #wikimedia-tech [18:59:51] we needed it to unbreak the job_queue Nagios check [18:59:55] If it finished without logging, the bot is broken again [19:00:14] RoanKattouw: i did, but no log [19:00:23] !log starting puppet back up on brewster [19:00:33] Logged the message, Master [19:00:56] Blegh [19:01:16] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21349 [19:02:06] OK fixed [19:02:17] srv206 and srv266 should be removed from dsh groups [19:02:24] and spence needs /apache/common-local [19:02:27] thanks Roan [19:04:09] danke mutante! [19:05:37] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Fri Aug 24 19:05:23 UTC 2012 [19:05:44] now i want more lines in "Recently closed" in gerrit.. like arrows to go back [19:06:19] RoanKattouw: I've updated https://wikitech.wikimedia.org/view/Wikimedia_binaries#sync-wikiversions and others, maybe you can check it out some time and add other that are important / frequently used. [19:06:49] Any reference to /h/w/bin on that page should be killed with fire [19:06:56] ah, and somebody asked yesterday if there is a way to see who added you as a reviewer.. might be this in preferences? "Display Person Name In Review Category" [19:06:57] how so ? [19:07:02] The scripts live in /usr/local/bin now and are maintained by puppet [19:07:06] aha [19:07:07] The ones in /h/w/bin are stale [19:07:19] RoanKattouw: Link to gitweb? 
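Pulling the pieces together, the config-change flow that was just run looks roughly like this; the paths and filenames are taken from the channel, and the merge-commit caution is Roan's (a sketch, not the canonical procedure):

    # on fenari
    cd /h/w/common/wmf-config
    git status     # look for local or untracked changes first (filebackend.php had uncommitted modifications)
    git pull       # a clean fast-forward is the happy case; a merge commit means someone committed locally
    cd /h/w/common
    sync-file ./wmf-config/InitialiseSettings.php    # push the one file to the whole cluster; it logs itself to #wikimedia-tech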
[19:07:27] The relevant puppet class is misc::scripts and the files live in files/misc/scripts IIRC [19:08:05] /h/w/bin is still in PATH though [19:08:18] Yes, but after /usr/local/bin hopefully [19:08:19] but so is /apache/bin [19:08:24] I want to get rid of it [19:08:32] err, /usr/local/bin rather [19:09:15] RoanKattouw: /h/w/bin is an svn repo with 1 commit [19:09:33] interesting [19:09:51] <^demon> mutante: No, that just changes your search results & dashboards to include the reviewer name who left the highest (or lowest) review. [19:10:09] <^demon> Typically the original e-mail you got from gerrit saying "plz review" would indicate it. [19:10:10] <^demon> I think. [19:10:37] Oh, I see now [19:10:43] Someone renamed it to misc::deployment::scripts [19:10:45] gotcha, yea, true, email tells you of course [19:11:25] RoanKattouw: but the directory is the same ? [19:11:30] `files/misc/scripts` [19:11:35] Yeah should be [19:11:42] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=files/misc/scripts;h=35e9601228e9c802e6cfa298bb0c0bfa54d05cc7;hb=HEAD [19:11:43] https://wikitech.wikimedia.org/index.php?title=Wikimedia_binaries&action=historysubmit&diff=50718&oldid=50711 [19:11:46] done [19:12:00] That line is misleading [19:12:08] It suggests /usr/local/bin is a checkout of that git dir [19:12:10] Which it's not [19:12:55] fixed [19:13:29] Thanks [19:14:07] wow, I see now how outdated /h/w/bin is [19:14:09] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:14:19] lots of old references that didn't make sense to me when I was reading git [19:14:20] it* [19:14:31] ls -l will tell you when the files were last modified [19:15:19] e.g. 10.0.5.8::common/ instead of /h/w/common [19:15:47] what's the reason for that? (the latter is the old variant) [19:16:13] * apergos peeks in here to see if the 7 headed mw dog has been slain or not [19:16:47] New review: Dzahn; "this is now pushed out to the cluster. checked on srv233" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21203 [19:19:05] Krinkle: 10.0.5.8 is nfs-home [19:19:17] So it makes the rsync pull from the NFS server instead of from fenari [19:19:26] I figured but why the IP reference to nfs directly as opposed to the mount of it [19:19:40] Hm.. [19:19:48] isn't that the same thing when running it from fenari? [19:19:52] Or is this faster? [19:20:04] It has the same effect, but it puts the load on a different box and on a different network link [19:20:38] This doesn't actually work *well*, mind you (Ryan is rewriting the deployment system to use git-deploy and salt in his spare time, IIRC), but it works slightly better [19:23:00] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [19:26:02] I'll have time for that again soon! [19:26:05] \o/ [19:26:19] in fact, after self-registration for labsconsole it's my next highest priority [19:26:21] Yay [19:26:36] we should have it in time for eqiad rollout [19:26:57] That would be awesome [19:27:12] I do not want to do cross-colo rsyncs with the current system [19:27:16] we may want to rewrite the git-deploy part in python at some point [19:27:23] it does a fairly simple thing, overall [19:27:31] Sure [19:27:35] What's git-deploy written in? [19:27:38] perl [19:27:43] * RoanKattouw punches hole in wall [19:27:46] hahaha [19:27:58] but really, it just makes tags [19:27:59] <^demon> Could be worse.
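On the 10.0.5.8::common question above: the double-colon form talks to an rsync daemon module served from nfs-home itself, rather than reading the same tree through an NFS mount on the pulling host. A sketch of the difference as it might appear in sync-common (the destination path follows the /apache/common-local mention earlier; the flags are assumptions):

    # old variant: read the tree via the NFS mount of /h/w/common,
    # funnelling every byte through the NFS client path
    rsync -a /h/w/common/ /apache/common-local/

    # newer variant: pull from the "common" rsync daemon module on
    # nfs-home (10.0.5.8), moving the load to that box and its link
    rsync -a 10.0.5.8::common/ /apache/common-local/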
[19:28:00] it's pretty simple [19:28:05] Why does almost every marginally useful thing in operations have to be written in perl or Ruby? [19:28:06] *makes tags and calls hooks [19:28:11] use Roan; Roan::love('perl'); [19:28:27] <^demon> RoanKattouw: svn2git is c++ :) [19:28:34] /$(892368598349543543/g*594859082671kgh9%^&#$%^$% [19:28:43] RoanKattouw: well, seriously, it should be an easy to implement thing :) [19:28:46] I don't know what that was but I'm sure it's valid Perl to slap someone [19:28:54] Yeah, git-deploy is simple enough [19:29:04] git deploy start -> check for lock file, add a tag [19:29:15] It's probably not even that much code [19:29:20] git deploy sync -> make a tag, run hooks, remove lock file [19:29:44] honestly we can likely drop the hook from git deploy start [19:29:50] but then it's more annoying to roll back [19:30:07] git deploy abort -> roll back to start tag, remove lock file [19:30:08] ryan_lane: when you get a chance can you comment on your rt for labsdb1/2 and labsdb1001/1002 rt3374 [19:30:36] wait, what? [19:30:44] we ordered these disks ages ago [19:30:51] they should definitely be on site [19:31:51] i have the sandisk 480's [19:32:25] let me try to find the email [19:33:28] * Ryan_Lane sighs [19:33:37] both ct and asher are on vacation [19:34:42] PROBLEM - NTP on analytics1026 is CRITICAL: NTP CRITICAL: Offset unknown [19:34:51] PROBLEM - NTP on analytics1024 is CRITICAL: NTP CRITICAL: Offset unknown [19:36:11] crap. I wonder if they are in eqiad [19:36:13] RobH: ? [19:36:21] RobH: you know anything about this? [19:36:42] about? [19:36:49] ssds for labs db boxes [19:36:55] do you have an rt #? [19:37:02] I can't find one [19:37:08] I know they were ordered [19:37:12] I have no idea who did it [19:37:13] I ordered a bunch of SSDs for eqiad, I do not recall any for Tampa [19:37:22] Sounds like something I did not handle. [19:37:24] damn [19:37:31] which I argue shouldnt happen since i am just a phone call away [19:37:31] this is killing me [19:37:34] but i am usually ignored. [19:37:44] let me check and see if i have any note [19:38:27] PROBLEM - NTP on analytics1025 is CRITICAL: NTP CRITICAL: Offset unknown [19:38:27] PROBLEM - NTP on analytics1027 is CRITICAL: NTP CRITICAL: Offset unknown [19:38:33] I'm betting they are in eqiad [19:38:38] Ryan_Lane: i have no RT or email pertaining to labsdb other than the one you just created [19:38:49] all my ssd related emails are to other accounted for projects [19:39:05] they may have been ordered for something else [19:39:15] I know Rachael ordered a bunch of Intel 720s for Asher, but I do not recall her ordering any 320s. [19:39:32] As far as I know, we have done all our 320s via dell (and me) or newegg (and me) [19:39:37] but that doesn't mean it didnt happen [19:39:47] it just means whoever did it didnt follow any procedure. [19:39:57] (so it certainly wasnt me, since im the one insisting on the documentation ;) [19:40:13] * Ryan_Lane groans [19:40:18] Ryan_Lane: If you drop a procurement ticket for this now, I can get it quoted for you today. [19:40:26] if its cheap, it takes mark to sign off [19:40:27] that's not a good idea [19:40:34] I'm pretty sure we already have them [19:40:37] well, quoted being newegg and it can arrive next week. [19:40:46] ok, well, then you have to track down who did it =P [19:40:49] yes [19:40:54] and they are both on vacation [19:40:57] cmjohnson1 do you have any unaccounted for intel 320s? [19:41:12] if he doesnt, I only have 720s and the spare 320s I ordered recently.
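Stepping back to Ryan's git-deploy outline above, the three verbs reduce to tags plus a lock file. A minimal sketch under that description only; the lock path, tag names, and hook runner are all hypothetical, not git-deploy's actual internals:

    # git deploy start: bail out if a deploy already holds the lock, then mark a rollback point
    [ -e /srv/deploy.lock ] && { echo "deploy already in progress"; exit 1; }
    touch /srv/deploy.lock
    git tag "deploy-start-$(date +%s)"

    # git deploy sync: tag what is being shipped, run the sync hooks, release the lock
    git tag "deploy-sync-$(date +%s)"
    ./run-deploy-hooks        # placeholder for whatever hooks are configured
    rm -f /srv/deploy.lock

    # git deploy abort: roll back to the most recent start tag and release the lock
    git reset --hard "$(git describe --tags --match 'deploy-start-*' --abbrev=0)"
    rm -f /srv/deploy.lock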
(his reply in rt makes it sound like he doesnt) [19:42:19] i also have no record of us ordering 520s [19:42:43] robh: ryan_lane: the only 320's i received were for the ohm servers (rt2740/41) [19:42:52] yep, those from dell [19:43:25] i received a bunch of sandisk 480's I believe... i don't have a rt# for them [19:43:47] ... what are they? [19:43:48] ssds? [19:43:53] yes [19:44:01] i have no idea what thats all about. [19:44:08] i hate how much shit just shows up. [19:44:27] RECOVERY - NTP on analytics1025 is OK: NTP OK: Offset -0.0126465559 secs [19:44:48] i saved the receipts but they do not have a rt# associated [19:45:05] from what vendor? [19:45:21] RECOVERY - NTP on analytics1026 is OK: NTP OK: Offset -0.02334046364 secs [19:45:21] RECOVERY - NTP on analytics1024 is OK: NTP OK: Offset -0.01556193829 secs [19:45:28] give me a sec... i moved them upstairs... i think they are for labs [19:45:57] RECOVERY - NTP on analytics1027 is OK: NTP OK: Offset -0.01094186306 secs [19:46:11] that would suck, since we dont want sandisk. [19:46:17] we want intel ssds specifically. [19:46:37] of a specific model type where we have known performance metrics =P [19:49:09] 3027 Buy 64 SSDs for use in Parser Cache and DB servers @ Eqiad ? [19:52:41] robh: they purchased from amazon [19:52:58] ahh, was prolly rachael then, but need to find out for what [19:53:02] cuz we dont use sandisk [19:53:06] so we prolly need to return. [19:53:06] in several orders... i have 42 total [19:53:11] .....what?!? [19:53:13] ARGH [19:55:54] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18124 [20:01:04] maplebed: ms-be6 DIMM has been replaced and the error is no longer showing up in post [20:03:32] New patchset: Ottomata; "Relaying AFT udp2logs from emery over to vanadium per Ori's request." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21391 [20:03:48] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:04:19] New review: Ottomata; "waiting til monday to merge this" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/21391 [20:04:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21391 [20:09:21] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20727 [20:37:17] maplebed: for https://bugzilla.wikimedia.org/show_bug.cgi?id=34814 it seems like the only thing to do is rename the user [20:38:04] and kill the copy2 code I guess [20:38:15] it would be nice to streamline rewrite.py to remove the cruft [20:40:12] AaronSchulz: rewrite does still write thumbs in every test cluster. [20:40:25] only in production does it not write thumbs. [20:40:49] and until we can get test instances of mediawiki (in labs?) that are hooked into swift in the same way production is hooked in, we'll have to keep the copy2 stuff in there. [20:40:56] but +1 to that so we can rip it out. [20:41:11] ok, well we can still make a new user [20:41:17] to replace mw:thumb ;) [20:41:48] yes. that's in here: http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_tasks_2012-08-13#to_do_Sometime.28tm.29 [20:48:36] New review: Mdale; "Are these settings now stored as part of the shell script?"
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/17365 [20:53:56] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [20:54:43] New patchset: Krinkle; ".gitignore: Organize in sections; Add Mac .DS_Store" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21392 [21:01:24] Does anyone else have a ghost entry in "git status" on operations/mediawiki-config? [21:01:33] # Untracked files: [21:01:34] # "docroot/foundation/leve\314\201e_de_fonds.html" [21:01:41] the file is in version control, it is just fine [21:01:49] but the weird encoding is messing it up [21:01:58] I can't delete with "git rm" [21:02:06] fatal: pathspec 'docroot/foundation/levée_de_fonds.html' did not match any files [21:02:21] though regular unix "rm" is detecting it just fine and removes it from disk [21:02:44] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21392 [21:03:50] PROBLEM - swift-account-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:04:26] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:04:26] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:04:26] PROBLEM - swift-container-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:04:26] PROBLEM - swift-object-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:04:26] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:04:35] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:04:53] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:05:02] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:05:11] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:05:20] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:05:20] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [21:08:02] RECOVERY - swift-object-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:08:02] RECOVERY - swift-container-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:08:20] RECOVERY - swift-container-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:08:20] RECOVERY - swift-account-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:08:20] RECOVERY - swift-object-auditor on ms-be8 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
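On Krinkle's ghost entry above: git octal-escapes non-ASCII paths by default, and the untracked file on disk uses the decomposed Mac form ("e" plus combining acute U+0301, whose UTF-8 bytes 0xCC 0x81 print as \314\201), so a precomposed "é" typed at the prompt never matches it. Two things that should help, assuming a bash shell; newer git also has core.precomposeunicode for exactly this OS X quirk:

    # print paths raw instead of octal-escaped, so the mismatch is visible
    git config core.quotepath false
    git status

    # spell out the decomposed bytes with bash $'...' quoting to remove
    # the untracked NFD-named duplicate from disk
    rm $'docroot/foundation/leve\314\201e_de_fonds.html'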
[21:08:56] RECOVERY - swift-account-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:09:05] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:09:05] RECOVERY - swift-object-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:09:05] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:09:14] RECOVERY - swift-container-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:09:14] RECOVERY - swift-object-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:09:14] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:11:20] !log upgraded swift backend host ms-be8 [21:11:29] Logged the message, Master [21:22:09] !log breaking test.wp :) ..and rolling back [21:22:18] Logged the message, Master [21:23:30] New review: Dzahn; "Syntax error on line 164 of /etc/apache2/wmf/redirects.conf:" [operations/apache-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/13293 [21:24:33] !log upgrading swift backend host ms-be1 [21:24:37] apergos: you're gone, right? [21:24:42] Logged the message, Master [21:25:08] PROBLEM - swift-object-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:25:26] PROBLEM - swift-container-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:25:35] PROBLEM - swift-container-updater on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:25:53] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:26:02] PROBLEM - swift-object-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [21:26:02] PROBLEM - swift-account-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:26:11] PROBLEM - swift-account-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:26:20] PROBLEM - swift-account-reaper on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:26:20] PROBLEM - swift-container-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:26:29] PROBLEM - swift-object-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:26:38] PROBLEM - swift-account-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:27:23] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:27:32] RECOVERY - swift-object-auditor
on ms-be1 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [21:27:32] RECOVERY - swift-account-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:27:41] RECOVERY - swift-account-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:27:50] RECOVERY - swift-account-reaper on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:27:50] RECOVERY - swift-container-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:27:59] RECOVERY - swift-object-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:28:08] RECOVERY - swift-object-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:28:08] RECOVERY - swift-account-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:28:26] RECOVERY - swift-container-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:28:36] RECOVERY - swift-container-updater on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:30:45] !log removing decom'ed srv206 and srv217 from ALL dsh groups [21:30:54] Logged the message, Master [21:31:59] New patchset: Pyoungmeister; "temporarily pinning lucene version number for pmtpa cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21394 [21:32:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21394 [21:34:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21394 [21:54:34] !log upgrading swift back end ms-be2 [21:54:44] Logged the message, Master [21:56:52] PROBLEM - swift-object-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:56:52] PROBLEM - swift-container-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:57:19] PROBLEM - swift-account-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:58:22] RECOVERY - swift-object-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:58:22] RECOVERY - swift-container-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:58:58] RECOVERY - swift-account-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [22:06:37] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [22:09:37] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [22:09:37] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [22:09:37] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [22:09:37] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [22:11:06] New patchset: Bhartshorne; "fixed a change in the storage log 
format for swift logtailing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21398 [22:11:51] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21398 [22:15:01] !log upgrading swift backend ms-be7 [22:15:12] Logged the message, Master [22:15:37] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:18:37] PROBLEM - swift-account-reaper on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:18:37] PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:18:55] PROBLEM - swift-object-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:18:55] PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:18:55] PROBLEM - swift-container-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:20:07] RECOVERY - swift-account-reaper on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:20:07] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:20:25] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:20:25] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:20:25] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:37:04] New review: Dzahn; "it's actually gone. even though the system is not rebuilt the cert and key have been manually shredded." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/15597 [22:37:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15597 [22:43:40] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [23:02:53] !log upgrading swift backend ms-be9 [23:03:03] Logged the message, Master [23:08:52] !log temp stopping puppet on iron [23:09:02] Logged the message, notpeter [23:40:05] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:40:05] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours