[00:05:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.546 seconds [00:30:23] bits seems to be very slow right now [00:30:53] well, randomly [00:30:56] robla: AaronSchulz: I verified that only objects I want to delete match the regex. I moved the date back to Feb 5th for the first run. (I'll run it with a more recent date later.) Anything else before I start a run for an hour or so to see how it does? [00:31:17] not that I know of [00:31:57] the code is at fenari:~ben/swift/delete-old-objects.py if you want a read. [00:32:22] http://geoiplookup.wikimedia.org/ is taking 5-8 seconds to load for me [00:33:22] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:33:49] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:34:09] geoiplookup is served from bits, BTW [00:34:28] maplebed: nothing I can think of [00:35:55] PROBLEM - LVS HTTP on bits-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:36:15] just got a connection reset error from geoiplookup [00:37:12] loaded just now, but took 39 seconds [00:37:34] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 5.112 seconds [00:38:01] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [00:38:17] OK, it's back to 1 second now [00:39:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:40:08] RECOVERY - LVS HTTP on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3925 bytes in 0.054 seconds [00:42:50] LeslieCarr: drdee: are the new filters backed out yet? [00:43:18] * robla needs to bolt momentarily [00:43:30] robla yes [00:43:42] didn't see any actual problems … :( [00:44:22] so there was some syn cookies sending ? [00:44:30] cool...didn't see it in the server admin log [00:44:44] syn cookies sending? in the filters? [00:44:54] * robla shuts down [00:44:57] not in the filters [00:45:02] two different topics [00:46:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.716 seconds [00:46:59] so something happened with palladium [00:47:12] lots of out of socket memory [00:47:14] amslvs4 went down earlier [00:47:17] did really noone do anything? [00:48:04] doh, it appears not … :( [00:48:47] checking that out as well [00:52:00] !log reloading amslvs4 [00:52:02] Logged the message, Mistress of the network gear. [00:54:32] RECOVERY - Host amslvs4 is UP: PING OK - Packet loss = 0%, RTA = 108.80 ms [00:54:41] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [01:17:27] AaronSchulz: robla: it's running on ms-be1 but only deleting about 2 objects per second or so. [01:17:31] I'll send mail in a bit. [01:17:54] New patchset: Lcarr; "upping tcp orphan connection count" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5148 [01:18:11] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5148 [01:18:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:26:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.171 seconds [01:31:10] New patchset: Mark Bergsma; "Automatically restart varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5149 [01:31:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5149 [01:31:30] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5148 [01:31:41] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5149 [01:31:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5149 [01:35:31] New patchset: Mark Bergsma; "Escaping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5150 [01:35:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5150 [01:35:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5150 [01:35:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5150 [01:39:23] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:02] PROBLEM - LVS HTTP on bits-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:38] checking out palladiummark did you just restart varnish ? [01:42:59] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 221 seconds [01:43:35] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 3.255 seconds [01:43:54] now I did [01:44:02] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:45] sigh, niobium is not responding, restarting varnish on that [01:44:50] !log restarting varnish on niobium [01:44:53] Logged the message, Mistress of the network gear. [01:44:56] not responding on port 80, that is [01:45:23] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.053 seconds [01:45:32] RECOVERY - LVS HTTP on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3915 bytes in 0.080 seconds [01:47:11] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 17 seconds [01:59:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:06:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.564 seconds [02:08:12] New patchset: Mark Bergsma; "Cron executes with /bin/sh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5151 [02:08:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5151 [02:09:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5151 [02:09:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5151 [02:16:46] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:46] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.053 seconds [02:29:22] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:49] hrm, ssh on arsenic is not critical [02:30:52] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [02:41:40] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:52] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 631 bytes in 0.053 seconds [02:43:10] !log restarted varnish on arsenic [02:43:12] Logged the message, Mistress of the network gear. [02:53:53] !log dist-upgrade on strontium [02:53:55] Logged the message, Master [02:56:04] PROBLEM - Host strontium is DOWN: PING CRITICAL - Packet loss = 100% [02:57:07] RECOVERY - Host strontium is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [03:02:24] Wait, so is the URL /skins-1.20wmf1/ now? [03:02:28] Or will just /skins/ work? [03:08:58] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:19] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 631 bytes in 0.053 seconds [03:12:56] !log started a script to delete old objects on ms-be1 for swift truncated object cleaning [03:12:59] Logged the message, Master [03:14:13] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100% [03:14:31] RECOVERY - Host arsenic is UP: PING WARNING - Packet loss = 86%, RTA = 26.46 ms [03:17:04] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100% [03:17:49] RECOVERY - Host arsenic is UP: PING WARNING - Packet loss = 73%, RTA = 26.47 ms [03:22:05] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [03:22:05] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [03:22:05] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [03:25:32] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:34] New patchset: Lcarr; "escaping "$2"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5152 [03:26:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5152 [03:27:19] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5152 [03:27:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5152 [03:27:38] PROBLEM - LVS HTTP on bits-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:14] PROBLEM - LVS HTTPS on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [03:29:39] !log restarting varnish on arsenic again [03:29:42] Logged the message, Mistress of the network gear. 
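
For readers following changes 5149/5150/5151 above ("Automatically restart varnish", "Escaping", "Cron executes with /bin/sh"): the general shape of such a watchdog is a cron entry that probes Varnish on the local port and restarts it when the probe times out. The line below is a minimal hypothetical sketch, not the merged puppet change; the probe URL, the 10-second timeout and the restart command are assumptions.

    # hypothetical watchdog entry (user crontab syntax); cron runs it with /bin/sh,
    # so it is kept strictly POSIX
    * * * * * curl -sf -m 10 http://127.0.0.1/ >/dev/null 2>&1 || /etc/init.d/varnish restart
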
[03:30:20] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [03:30:20] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [03:30:56] RECOVERY - LVS HTTPS on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3931 bytes in 0.113 seconds [03:31:23] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 631 bytes in 0.054 seconds [03:34:20] LeslieCarr: what's going on? [03:34:28] we're with Ryan at an openstack event [03:34:31] with a really shitty network [03:34:42] not sure, but varnish is fucked to use the technical term [03:34:48] bits varnish only [03:34:59] ryan asks if you paged Mark [03:34:59] RECOVERY - LVS HTTP on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3924 bytes in 0.053 seconds [03:35:08] restarting it works [03:35:11] mark went back to sleep [03:35:14] he can't figure out [03:35:20] I remember mark saying something about varnish during the weekend [03:35:26] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: Connection refused [03:35:28] oh, okay... [03:37:02] well restarting ti works but you have to restart it constantly ... [03:37:19] just dist-upgraded arsenic and rebooted it [03:37:24] !log dist-upgrade arsenic [03:37:26] Logged the message, Master [03:37:32] Jeff did, that is :) [03:37:36] whee! [03:37:44] let's hope it comes back up :-P [03:39:11] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100% [03:40:32] up [03:41:08] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.055 seconds [03:41:17] RECOVERY - Host arsenic is UP: PING OK - Packet loss = 0%, RTA = 26.36 ms [03:41:48] seems to work... [03:41:51] is the ganglia collector broken for arsenic? [03:42:30] oh there we go, it's picking up some sessions [03:45:09] the network is shitty here [03:45:21] what's responsible for "[varnishstat] " ? ganglia? [03:45:28] paravoid: SF office? [03:45:35] no, openstack event [03:45:39] wifi [03:45:40] oic [03:45:41] in a bar [03:45:51] no idea, i see the defunct varnishstat pretty much everywhere [03:45:55] so, I think I'm going [03:46:07] well, the cron job is now working everywhere ... [03:46:15] LeslieCarr: that's excellent [03:46:53] arsenic still gets piddly sessions compared to its peers [03:48:21] hrm, well so far, so good. .. i'mreally hoping the cron job hack works [03:48:28] want food [03:48:35] me too [03:48:42] want sleep [03:49:07] lol, varnishstat comes back immediately defunct after restarting ganglia-monitor [03:49:10] wtf [03:50:56] hehehe wow [03:51:18] so far, fixed cron job is working [04:02:50] * jeremyb wonders what the cron job does [04:06:37] i guess it's 5149 and derivative [04:06:46] derivative(s) [04:20:31] wow, the hotel's network is even worse [04:23:49] paravoid: mifi? [04:24:10] didn't get one yet [04:24:15] wonder if I should [04:24:19] how expensive are they? [04:24:50] and do they offer a non-binding contract [04:25:08] well teh foundation has several, i thought you might have one ;) [04:25:32] for personal use I'd probably recommend just replicating my own personal setup [04:25:52] which is nexus s and tether. so only one SIM [04:26:10] and only one piece of equipment to pay for [04:26:52] has worked well for me with t-mobile in US and >=5 other SIMs in other countries. 
mexico is the only one of those where I got only edge (the rest were 3g) [04:29:13] of course you can do the same with other phone combos but IMHO, GSM (not CDMA) and being vanilla (untainted) android OOTB (out of the box) (not loaded with extra unremovable+crappy apps and built with hackers in mind) are both important [06:32:07] PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:35:34] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [06:36:55] PROBLEM - Varnish HTTP bits on sq70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:39:55] PROBLEM - LVS HTTP on bits.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:40:49] awesome [06:41:07] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 3.377 seconds [06:42:46] RECOVERY - LVS HTTP on bits.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3975 bytes in 0.010 seconds [06:46:58] PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:48:28] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.006 seconds [06:57:37] PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:58:49] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.016 seconds [07:03:37] PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:11:34] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [07:21:47] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 1.809 seconds [07:23:35] RECOVERY - Varnish HTTP bits on sq70 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 4.736 seconds [07:37:59] !log restarted varnish service manually a bit a go on sq67 and sq70, the cron job didn't seem to have gone off. [07:38:05] hmm [07:38:11] no logging [07:42:32] !log restarted varnish service manually a bit a go on sq67 and sq70, the cron job didn't seem to have gone off. restarted morebots too while I was at it [07:42:52] * apergos taps fingers impatiently [07:43:36] !log morebots test [07:43:50] * apergos is not impressed [07:44:36] Logged the message, Master [07:44:38] Logged the message, Master [07:44:51] slwo bots get beaten. just keep that in mind [07:46:11] afk for a little while, friend from out of town. [09:09:50] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [09:11:02] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [09:14:38] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [09:23:02] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours [09:27:23] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [12:27:10] Change abandoned: Pyoungmeister; "lint check broke, not quite right." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/4923 [13:22:37] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [13:22:37] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [13:22:37] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [13:25:55] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:31:00] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [13:31:00] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:51:06] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:53:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.493 seconds [15:02:46] PROBLEM - Host db1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:20] New patchset: Pyoungmeister; "stop the spammening" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5168 [15:06:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5168 [15:07:56] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5168 [15:08:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5168 [15:12:49] RECOVERY - Host db24 is UP: PING WARNING - Packet loss = 37%, RTA = 0.21 ms [15:28:44] maplebed: can you start mysql on db24...it was down for memory testing but is ok now [15:29:06] I'm on my way out the door. I can start it, but I can't check and make sure it works right. [15:29:21] would you rather wait until someone can verify it stiarts correctly or should I hit go? [15:29:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:42] hit go [15:31:25] RECOVERY - mysqld processes on db24 is OK: PROCS OK: 1 process with command name mysqld [15:31:46] I hope it's happy! [15:31:48] ;) [15:32:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.181 seconds [15:32:19] it does have both slaving threads running, so looks good. [15:32:37] !log started mysqld on db24 with /etc/init.d/mysql start [15:32:47] ok, bye [15:33:31] cool...thx [15:33:40] notpeter: [15:33:43] !rt 2758 [15:33:43] https://rt.wikimedia.org/Ticket/Display.html?id=2758 [15:33:47] sup? [15:33:57] not sure...what he is talking about [15:34:34] nor I! [15:34:50] we get lots of alerts from them [15:35:04] sorry...i meant the sun servers.... [15:35:10] oh [15:35:16] !rt 2798 [15:35:16] https://rt.wikimedia.org/Ticket/Display.html?id=2798 [15:35:18] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 89530 seconds [15:35:45] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 89521 seconds [15:36:03] so, search1-12 are... 
aged [15:36:10] but we want to keep them for parts [15:36:34] so at some point we'll power them all down, and pull them, and put them in the spare parts pile [15:36:51] but none of that should happen until at the earliest new search boxxies come in [15:36:56] and I don't think they have even been ordered yet [15:37:12] so that can be ingnored until new search boxes arrive. probably longer [15:37:28] as we want to upgrade search to precise pangolin [15:37:32] which will take some time [15:37:44] okay [15:37:46] and keeping search1-20 on and humming gives us a hot spare setup [15:37:52] (albeit an aged one) [15:38:05] so yeah, that cna be ignored for at least a month or two [15:38:36] is this any clearer? [15:39:45] yes...but I received 10 of the servers yesterday [15:39:54] oh! [15:39:54] ok [15:39:57] I didn't know that [15:39:58] no worries I believe we have to add the ssd [15:40:09] hhhmmmm [15:40:19] so, rob wants those to physiclaly replace the sun servers [15:40:26] is there, by any chance, enough room for both? [15:40:32] not in that rack [15:40:36] ok [15:40:40] so [15:41:00] I'm not going to rebuild search@pmtpa until ubuntu 12.04 is released [15:41:03] and we have it set up [15:41:05] that will be a bit [15:41:16] i will store for now [15:41:19] awesome! [15:41:20] thank you [15:41:23] until everyone is ready [15:41:29] yep [15:41:53] estimate of 1 month at earliest, probably 2. [15:41:54] cool...thx for background though...helpful [15:41:59] yep [15:43:21] i am having an odd issue when running a query against the arabic wikipedia and not sure whether the problem is mysql or in my console [15:43:45] when i run this query 'SELECT rc_user_text, rc_ip FROM arwiki.recentchanges' [15:44:14] then in the output it will switch column 1 and column 2 if rc_user_text is an arabic name [15:44:52] but when i select the arabic name for copy past i get the actual value of column 2 and vice versa [15:46:13] could it be a right to left thing? [15:47:18] yeah maybe, but not sure if the problem is in my console or in mysql, could you quickly run that query and tell me how your output looks like? [15:49:10] where? [15:49:55] i did it on db1024 [15:51:16] I get a bunch of what I woulod guess are names, in arabic characters and latin characters, and IPs [15:51:24] and some IPs [15:51:40] but everything always in the same column? [15:51:42] yeah [15:51:50] then it must be an osx console issue [15:51:55] although anonymous edits show IP in both [15:52:05] that's correct (i believe) [15:52:08] okay thanks! 
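
On the Arabic column-swap question above: one way to confirm that the columns are not actually swapped in MySQL, and that the swap is only the terminal's right-to-left rendering, is to print the Arabic value in a form the bidi algorithm cannot reorder, e.g. hex. A minimal sketch, assuming access to the same slave; the LIMIT is arbitrary.

    -- HEX(rc_user_text) is plain ASCII, so the terminal cannot reorder it;
    -- if the hex lands in column 1 and the IP in column 2, the data is fine
    -- and the display issue is purely a console bidi artifact.
    SELECT HEX(rc_user_text), rc_ip
    FROM arwiki.recentchanges
    LIMIT 10;
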
[15:52:12] (well, I would guess that's what they are, as it's the same in both columns) [15:52:15] ok [16:05:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.812 seconds [16:36:48] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [16:46:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds [16:59:36] New patchset: ArielGlenn; "run a little piece of php code as maintenance script on all wikis" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/5171 [17:12:36] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [17:27:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.167 seconds [18:08:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:21] PROBLEM - Varnish HTTP bits on sq68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.427 seconds [18:15:53] RECOVERY - Varnish HTTP bits on sq68 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.003 seconds [18:20:59] !log clearing mobile varnish cache [18:21:44] PROBLEM - Varnish HTTP bits on sq68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:46] sq68 is having hte same issue .. [18:23:21] !log resatrting varnish on sq68 [18:23:29] logbot is dead [18:24:26] RECOVERY - Varnish HTTP bits on sq68 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.003 seconds [18:36:53] !log pulling sq68 from pybal for a bit [18:36:55] Logged the message, Master [18:40:27] Hey ops folks, could I get a review&deploy of https://gerrit.wikimedia.org/r/#change,3885 please? This has been blocking localization updates in the post-git-migration world for a while now [18:42:03] RoanKattouw: i can look at it ... [18:42:32] RoanKattouw, shouldn't you also have git submodule init, just in case? [18:42:38] Platonides: In the long run yes [18:42:47] Ideally it is able to set up the clone from scratch [18:43:04] But right now I just need it to be fixed and start working; the clone is set up on fenari so that's not needed Right Now [18:43:07] yay i *heart* fork limit [18:43:19] I heart it too for the simple reason it makes things not break :D [18:43:27] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3885 [18:43:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3885 [18:43:35] s/&& git submodule update/ && git submodule init && git submodule update/ [18:43:42] Servers would time out on their nfs1 fetch and not get updated [18:43:45] Yay thanks LeslieCarr [18:44:28] If this could be deployed to fenari before 02:00 UTC, I'd be very happy [18:46:05] so what command are you going to use to watch the log files? 
;) [18:46:25] (the answer is tail -f ) [18:47:43] !log returning sq68 [18:47:45] Logged the message, Master [18:49:02] hehe [18:49:18] To be fair what I was doing is nontrivial with tail -f [18:49:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.050 seconds [18:59:13] LeslieCarr: tell me more about tcp_max_orphans - why'd you increase it, and did you see any effect after? [18:59:28] did not wind up increasing it [18:59:43] thought that perhaps that was causing the out of socket memory issues [18:59:59] oh yeah, didn't see the change was abandoned [19:00:11] RoanKattouw: your change was merged [19:00:24] Yay thanks [19:00:52] * RoanKattouw promises not to use tail -n 1000 any more [19:01:17] yeah, increasing that would probably increase memory pressure. did you make any tcp tweaks last night? [19:01:36] New patchset: Hashar; "testswarm: set innodb buffer pool size to 256M" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4395 [19:01:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4395 [19:02:04] New review: Hashar; "patchset 3 fix a typo in commit message summary" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4395 [19:14:52] looks like ma rk bumped up varnish session_max from default (100K) to 200K on 4/11 [19:20:01] Jeff_Green: what change is that? i don't think thats live but want to double check [19:22:22] it's live if it was restarted. it's in templates/varnish/varnish-default.erb iirc [19:22:55] well, i guess it's only live if varnish uses /etc/default/varnish :-P [19:23:39] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours [19:24:46] spelunking that makes me wish for a feature: "dear puppet, barf a copy of arsenic:/etc/default/varnish for 4/10/2012" [19:28:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.240 seconds [19:52:59] binasher: sorry was out eating, did not do any tweaks [20:09:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.077 seconds [20:19:36] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [20:22:27] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [20:28:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [20:32:30] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338 [20:49:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:49:48] New patchset: Lcarr; "Updating neon to irc spam and be more complete" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5201 [20:50:04] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/5201 [20:51:53] New patchset: Lcarr; "Updating neon to irc spam and be more complete" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5201 [20:52:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5201 [20:53:21] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5201 [20:53:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5201 [20:55:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.886 seconds [21:07:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 7.023 second response time [21:07:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 8.515 second response time [21:10:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:17:31] yay, icinga-wm is working, now we can get double annoyed :) [21:19:04] we should switch our server responses… https://twitpic.com/1nykf8 [21:20:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 8.205 second response time [21:20:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 7.914 second response time [21:26:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:26:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection timed out [21:29:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.557 second response time [21:29:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:31:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 8.255 second response time [21:36:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:36:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:36:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.742 second response time [21:36:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.828 seconds [21:40:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.149 second response time [21:43:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 8.962 second response time [21:46:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 7.570 second response time [21:49:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 7.374 second response time [22:00:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.879 second response time 
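
Regarding the session_max discussion above: on hosts whose init script sources /etc/default/varnish (rendered from templates/varnish/varnish-default.erb in operations/puppet), a bump like the one described would typically surface as an extra -p parameter in DAEMON_OPTS, and only takes effect after varnishd is restarted. The fragment below is a hypothetical sketch; the listen address, storage backend and sizes are assumptions, not the actual template contents.

    # hypothetical fragment of /etc/default/varnish; the real values live in the
    # puppet template, and a varnish restart is needed before they apply
    DAEMON_OPTS="-a :80 \
                 -T 127.0.0.1:6082 \
                 -s malloc,1G \
                 -p session_max=200000"
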
[22:02:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 9.175 second response time [22:05:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:05:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:09:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:11:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 9.619 second response time [22:12:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 8.541 second response time [22:15:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:15:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.182 seconds [22:16:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.839 second response time [22:16:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 9.931 second response time [22:17:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.125 second response time [22:18:02] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [22:18:02] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds [22:18:34] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds [22:18:52] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [22:18:58] New patchset: Asher; "hit_for_pass should have a positive ttl" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5208 [22:19:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5208 [22:19:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:19:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:19:36] heh, now i know why we just muted those checks on nagios [22:19:41] though we really should fix them sometime... [22:20:58] ACKNOWLEDGEMENT - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds LeslieCarr nobody ever fixes this :( [22:21:37] ACKNOWLEDGEMENT - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds LeslieCarr nobody ever fixes this :( [22:23:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 9.792 second response time [22:23:34] i thought mobile wap wasn't on ekrem any more [22:26:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:29:33] it's not, it's just checking ekrem's http gateway :) [22:29:59] it's dumb [22:32:58] hey, i created a monthly cronjob on bayes using crontab -e, will that work (ie, run the job every month or are there permission restrictions?) [22:33:23] i wonder why the check times out if its not getting traffic tho.. 
maybe ekrem is just overloaded in general [22:34:11] drdee: we strongly prefer to make cronjobs run via puppet [22:34:21] binasher: ekrem is pretty heavily loaded [22:34:38] i think we need to put varnish or something in front of ekrem [22:35:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 9.907 second response time [22:35:26] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5208 [22:35:28] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5208 [22:35:41] Jeff_Green: you around ? [22:37:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 7.467 second response time [22:45:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:12] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.924 second response time [22:46:12] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 0.768 second response time [22:47:12] PROBLEM - mysqld processes on blondel is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:49:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:50:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:52:21] binasher: know what happened around 18:00 UTC ? the stairstepping stopped [22:53:14] oh! that was the mobile cache clearing [22:53:28] yep! [22:54:48] let's just put mobile cache clearing into a cron [22:56:22] * paravoid just read binasher's mail [22:56:28] wow, impressive analysis [22:57:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.849 seconds [22:57:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 335 bytes in 2.621 second response time [23:00:23] thanks :) [23:01:25] binasher: you are el hombre [23:08:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:08:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:09:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 8.450 second response time [23:09:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 8.285 second response time [23:11:18] "icinga"?! 
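
On the crontab-versus-puppet question above: the preferred form is a cron resource in the operations/puppet manifests, which keeps the job reviewable, version-controlled and reproducible if the host is rebuilt. A minimal sketch with a hypothetical resource name, command and schedule; only the cron attributes themselves are standard Puppet.

    # hypothetical monthly job expressed as a puppet cron resource instead of crontab -e
    cron { 'bayes_monthly_report':
        ensure   => present,
        user     => 'drdee',
        command  => '/home/drdee/scripts/monthly-report.sh',
        minute   => 0,
        hour     => 4,
        monthday => 1,
    }
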
[23:15:02] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:15:35] icinga is the new nagios [23:16:23] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:23:03] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:23:35] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:23:44] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [23:23:44] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [23:23:44] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [23:30:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:50] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [23:31:50] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [23:32:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:32:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:22] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.588 second response time [23:33:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 9.523 second response time [23:37:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 335 bytes in 5.728 second response time [23:38:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.042 seconds [23:38:52] oh great, now we can have twice the channel spam :D [23:39:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:41:10] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:50] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 7.999 second response time [23:43:57] well soon enough we can kill nagios-wm [23:43:57] :) [23:44:00] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 6.815 second response time [23:44:22] since we killed mobile wap, i'm going to kill its puppet group [23:44:54] New patchset: Lcarr; "no longer need mobile wap group, it's all in varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5214 [23:45:10] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5214 [23:47:10] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:47:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:50] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 8.717 second response time [23:49:10] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 1642 bytes in 9.987 second response time [23:50:46] New patchset: Catrope; "Shell access for Ian Baker" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5215 [23:51:03] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5214 [23:51:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5215 [23:51:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5214 [23:52:10] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:55:31] New review: Catrope; "Filed https://rt.wikimedia.org/Ticket/Display.html?id=2832" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5215 [23:57:50] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 7.682 second response time
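
On change 5208 ("hit_for_pass should have a positive ttl") merged earlier in the evening: in Varnish 3-style VCL, a hit_for_pass object is only useful if beresp.ttl is positive, since that ttl is how long subsequent clients are allowed to pass directly instead of queueing behind a single backend fetch. The snippet below is a minimal sketch of the pattern, not the merged diff; the Set-Cookie condition and the 60s value are assumptions.

    sub vcl_fetch {
        if (beresp.http.Set-Cookie) {
            # keep the "don't cache this, pass instead" marker around for a while;
            # with a zero or negative ttl it expires immediately and concurrent
            # requests serialize behind one fetch again
            set beresp.ttl = 60s;
            return (hit_for_pass);
        }
    }
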