[00:31:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.090 seconds
[01:14:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:27:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds
[01:41:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds
[01:41:54] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 248 seconds
[01:48:57] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 670s
[01:53:54] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds
[01:54:49] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s
[01:54:57] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds
[01:58:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:08:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.911 seconds
[02:48:30] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[02:49:45] nagios-wm: quiet, 281 is out of rotation
[03:19:06] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[03:32:09] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[03:56:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[04:36:26] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[04:51:26] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:04:20] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[05:19:20] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[06:48:53] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[06:48:53] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[06:48:53] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[07:19:03] New patchset: ArielGlenn; "rsync setup for ms10 (tampa media mirror)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17789
[07:19:42] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/17789
[07:23:59] Change abandoned: ArielGlenn; "ms10 has been set up as an internal host. gotta reinstall, gahhhh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17789
[08:36:33] PROBLEM - Puppet freshness on bayes is CRITICAL: Puppet has not run in the last 10 hours
[08:38:30] PROBLEM - Puppet freshness on srv242 is CRITICAL: Puppet has not run in the last 10 hours
[08:38:30] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours
[08:39:33] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours
[08:39:33] PROBLEM - Puppet freshness on srv190 is CRITICAL: Puppet has not run in the last 10 hours
[08:39:33] PROBLEM - Puppet freshness on srv238 is CRITICAL: Puppet has not run in the last 10 hours
[12:29:59] argh stat1
[12:30:03] sendmail?!?
[12:30:46] (cleanup cronspam monday)
[12:35:59] Cannot chdir to /mnt/htdocs/wikibooks for the other one
[12:36:00] nice
[12:40:23] welcome back :)
[12:44:25] thanks
[12:47:26] Is fenari heavily loaded atm?
[12:47:42] took ages to get a login prompt..
[12:48:37] I dunno
[12:48:40] lemme see
[12:48:57] nope
[12:49:10] how's nfs? :-P
[12:49:36] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[12:49:59] seems ok when logged in
[12:50:04] suggesting my connection or similar
[12:52:53] maybe
[12:53:01] it didn't seem like a long delay for me for the login
[12:54:03] Computers suck!
[12:54:16] Hmm, loading again and it's fine
[12:54:32] yes they do
[13:12:51] PROBLEM - Host ps1-b1-eqiad is DOWN: CRITICAL - Network Unreachable (10.65.0.40)
[13:13:43] from 10.64.0.141 via cp1015.eqiad.wmnet (squid/2.7.STABLE9) to ()
[13:13:43] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 06 Aug 2012 13:13:06 GMT
[13:14:25] Our servers are currently experiencing a technical problem.
[13:14:26] :(
[13:14:29] squid down...
[13:14:40] amssq33.esams.wikimedia.org (squid/2.7.STABLE9) to ()
[13:15:00] amssq34.esams.wikimedia.org (squid/2.7.STABLE9) to ()
[13:15:41] via amssq33.esams.wikimedia.org (squid/2.7.STABLE9) to ()
[13:15:42] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 06 Aug 2012 13:14:02 GMT
[13:17:54] uh oh
[13:18:03] think its network...
[13:19:48] http://en.m.wikipedia.org/
[13:19:49] heh :)
[13:20:31] different set of servers I think
[13:21:31] yep
[13:21:44] if it's an emergency, you can read wikipedia on the mobile site
[13:22:16] grr, was browsing Wikipedia on US Elections 2012...
[13:23:14] Hydriz: that was a *bad* idea :)
[13:23:24] haha :P
[13:35:05] woots
[13:35:30] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:31] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:39:33] PROBLEM - NTP on db1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[13:40:27] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:28] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:21] RECOVERY - MySQL Slave Delay on es2 is OK: OK replication delay 12 seconds
[13:45:25] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:26] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:46:00] PROBLEM - MySQL Slave Delay on es4 is CRITICAL: CRIT replication delay 289 seconds
[13:47:48] RECOVERY - MySQL Slave Delay on es4 is OK: OK replication delay 12 seconds
[13:50:30] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:31] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:48] PROBLEM - LVS on payments.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:52:36] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 300 seconds
[13:53:30] PROBLEM - MySQL Slave Delay on es4 is CRITICAL: CRIT replication delay 353 seconds
[13:55:27] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:28] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:28] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:57:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[13:57:43] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 55.46 ms
[13:57:43] RECOVERY - Host cp3001 is UP: PING OK - Packet loss = 0%, RTA = 123.11 ms
[13:57:43] RECOVERY - Host amslvs2 is UP: PING OK - Packet loss = 0%, RTA = 123.27 ms
[13:57:43] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 122.85 ms
[13:57:43] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 121.71 ms
[13:57:44] RECOVERY - Host amssq53 is UP: PING OK - Packet loss = 0%, RTA = 122.99 ms
[13:57:44] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 123.07 ms
[13:57:45] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 122.96 ms
[13:57:45] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 121.71 ms
[13:57:46] RECOVERY - Host amssq52 is UP: PING OK - Packet loss = 0%, RTA = 122.91 ms
[13:57:46] RECOVERY - Host amssq36 is UP: PING OK - Packet loss = 0%, RTA = 121.62 ms
[13:59:03] RECOVERY - MySQL Slave Delay on es4 is OK: OK replication delay 0 seconds
[13:59:03] RECOVERY - Host bits.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 118.87 ms
[13:59:04] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: (Service Check Timed Out)
[13:59:12] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: (Service Check Timed Out)
[18:02:20] * Damianz pats wm-bot
[18:02:23] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17827
[18:03:13] Could someone merge/push https://gerrit.wikimedia.org/r/#/c/17774/ ?
[18:03:27] I'm not sure what difference it'll make, but either way, in its current form it is wrong
[18:04:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17774
[18:04:11] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 58%, RTA = 3556.09 ms
[18:04:28] Reedy: back if you didn't notice (wmbot)
[18:04:34] Ta
[18:04:54] Reedy: mark recently moved it to internal
[18:05:02] and he's going to do the rest at some point too
[18:09:44] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 118.27 ms
[18:09:48] Ryan_Lane: it's not just mobile servers
[18:10:00] then it definitely wasn't our change :)
[18:10:04] yes it was
[18:10:07] how?
[18:10:22] it's all varnish?
[18:10:22] that fragment is in templates/varnish/wikimedia.vcl.erb
[18:10:37] varnish only serves bits and mobile
[18:10:39] and takes effect if xff_sources is non-empty
[18:10:49] you can't edit through bits
[18:11:15] who talked about edit?
[18:11:26] it's about stats etc. too, isn't it?
[18:11:27] the entire thread is about edits
[18:11:36] no. it's specifically about edits
[18:12:02] in fact, the only carrier this likely doesn't affect is opera mini
[18:12:04] I hijacked the thread and talking about XFF & Opera Mini in general, sorry if I wasn't clear
[18:12:06] because of our change
[18:12:21] right, so our change only affects varnish
[18:12:27] yeah, the opera mini stuff is completely separate from the original purpose of the thread
[18:12:29] so, bits and mobile
[18:12:38] and upload?
[18:12:53] is upload totally on varnish right now?
[18:13:54] either way, this shouldn't really affect stats or edits
[18:14:08] trusting the XFF simply means we don't strip it
[18:14:22] the stats will get the same thing, with the XFF field added
[18:14:36] mediawiki would get the same thing, with XFF added
[18:15:00] which means edits originating from the varnish servers would actually have the correct IP
[18:15:19] the correct meaning "not opera's"?
[18:15:22] as long as mediawiki trusts them for XFF
[18:15:33] opera is likely the only one actually working
[18:15:35] so, mediawiki has another layer of "trusting X-F-F"?
[18:15:39] yes
[18:15:43] okay
[18:15:48] didn't know that.
[18:15:58] Yeah
[18:16:03] it's how the thread started ;)
[18:16:26] In the case of 10.64.169, it wasn’t actually listed for XFF in $wgSquidServersNoPurge. 154.53 and 154.54 were already listed in $wgSquidServersNoPurge.
[18:16:30] so, why do we do that overriding in varnish then?
[18:16:37] quoting from the email ^^
[18:16:39] To determine the IP, MW uses the real IP unless that's in the trusted Squids list, in which case it follows the XFF chain to find the first untrusted IP
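(The trust-walk quoted above, sketched in PHP. This is an illustration of the algorithm only, not MediaWiki's actual wfGetIP() code; the function name and example addresses are hypothetical, and $trusted stands in for the $wgSquidServersNoPurge list.)

    <?php
    // Walk the X-Forwarded-For chain right to left, stepping past trusted
    // proxies, and stop at the first address we do not trust.
    function resolveClientIp( $remoteAddr, $xff, array $trusted ) {
        if ( $xff === '' ) {
            return $remoteAddr;
        }
        $ip = $remoteAddr;
        // XFF reads "origin, proxy1, proxy2, ..."; the nearest hop is rightmost.
        $chain = array_reverse( array_map( 'trim', explode( ',', $xff ) ) );
        foreach ( $chain as $hop ) {
            if ( !in_array( $ip, $trusted, true ) ) {
                break; // current candidate is untrusted: this is our client
            }
            $ip = $hop; // trusted proxy: step one hop further out
        }
        return $ip;
    }

    // An edit arriving via a Squid the wiki trusts:
    // resolveClientIp( '10.0.6.1', '203.0.113.7', array( '10.0.6.1' ) )
    // returns '203.0.113.7'; with an empty trust list it returns '10.0.6.1'.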
[18:17:35] what's the point of doing it in two places? couldn't we just add the SSL proxies to that trusted XFF list?
[18:19:39] the MW one?
[18:19:44] yes
[18:19:46] Hmm yeah why is there a trust list in Varnish?
[18:19:56] Can't Varnish just follow the protocol and append to the XFF, then let MW sort it out?
[18:20:29] possibly
[18:21:10] I know there was some reasoning behind this when we did it
[18:21:20] hell if I can remember off the top of my head right now
[18:21:30] seems like the kind of thing we should have documented :D
[18:22:25] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17650
[18:22:42] it says "Needed for the geoiplookup code later on as it will use xff." there
[18:22:46] perhaps that's the reasoning?
[18:22:59] both varnish and squid strip XFF unless it comes from the HTTPS servers
[18:23:26] RobH: So apparently my Parsoid server in Tampa (#3271) is ready from Chris's end, when can we get it set up?
[18:23:41] it doesn't really matter if people spoof XFF for geoiplookup
[18:23:51] that's your comment :P
[18:24:01] we inherently don't trust XFF, so we added it for geoiplookup, though
[18:26:39] Aaaah right
[18:26:50] geoip needs to recognize the XFF set by the SSL proxies
[18:27:14] It didn't always do this, I remember filing a bug about always getting "San Francisco, CA" when hitting geoiplookup over https
[18:27:36] we don't actually need to strip XFF for that, though
[18:27:48] No
[18:27:50] just need to have a trust list for geoip
[18:27:53] both varnish and squid strip XFF unless it comes from the HTTPS servers
[18:27:58] I'm pretty sure that is false
[18:28:02] is it?
[18:28:08] At least the text Squids should be preserving XFF
[18:28:11] are we just stripping XFP?
[18:28:52] I don't know what happens in practice, I just know what should happen in theory
[18:29:16] Which is that all caching proxies should process XFF per protocol
[18:29:31] except for geoip which should interpret the XFF if the originating IP is an SSL proxy
[18:30:41] just XFP
[18:30:42] header_access X-Forwarded-Proto deny !sslproxy
[18:30:52] which makes sense
[18:31:14] Where "process per protocol" means prepend the originating IP to the XFF header
[18:31:23] * Ryan_Lane nods
[18:33:24] ok, so we are only reassigning XFF in varnish if its coming from the ssl servers or opera mini
[18:33:39] wait
[18:33:54] it's in fact the opposite
[18:34:12] set req.http.X-Forwarded-For = client.ip;
[18:34:30] By reassigning you mean destroying the XFF chain?
[18:34:30] that said, this still wouldn't trigger the bug we're seeing
[18:35:14] are you guys examining the inline C that remaps client.ip in very limited cases?
[18:35:19] yes
[18:35:23] Are you sure that line doesn't overwrite the XFF header and destroy the information in the incoming XFF header?
[18:35:28] Because that would be bad
[18:35:42] RoanKattouw: it does.
[18:35:44] The incoming XFF header could be from a legitimate proxy that MW trusts
[18:35:53] an outside proxy?
[18:35:55] Yes
[18:36:05] MW trusts lots of external proxies
[18:36:31] this should be irrelevant to the issue of how mw sees edits via varnish
[18:36:41] binasher: that's what I said
[18:37:03] Hmm right
[18:37:14] MW would report the IP of the external proxy, not the IP of some internal proxy
[18:37:31] RoanKattouw: either way, if we *do* trust lots of external proxies, we should have this available as a puppet variable so that https and varnish can also have the same trust list
[18:37:38] Wait
[18:37:45] The proxies shouldn't *need* trust lists
[18:37:53] PROBLEM - Puppet freshness on bayes is CRITICAL: Puppet has not run in the last 10 hours
[18:37:58] If they just manipulate XFF per protocol it'll be fine
[18:39:08] New patchset: Catrope; "Add a service class for Parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15856
[18:39:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15856
[18:39:13] looking at https config
[18:39:13] RoanKattouw: If the server is ready for install, are all the puppet manifest changes checked in?
[18:39:13] if so we can review and merge change, and a puppet run on the server will make it live
[18:39:14] binasher: Who and what needs to be done for me to shut down db1047 to replace its bad dimm?
[18:39:14] its one of the two analytics slaves according to rt 3084
[18:39:14] I assume since it is one of two slaves, I can just do a clean shutdown
[18:39:16] it doesn't strip
[18:39:24] proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
[18:39:28] it just adds to the chain, properly
[18:39:35] OK, good
[18:39:46] Then you don't need to trust anything on the proxy's end, do you?
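(The two behaviors being contrasted here, reduced to PHP for clarity. The helper names are hypothetical; the real implementations are nginx's $proxy_add_x_forwarded_for and the quoted VCL assignment.)

    <?php
    // Protocol-correct forwarding: append the address we saw to the chain,
    // which is what $proxy_add_x_forwarded_for does on the SSL terminators.
    function appendXff( $incomingXff, $clientIp ) {
        return $incomingXff === '' ? $clientIp : $incomingXff . ', ' . $clientIp;
    }

    // Destructive forwarding: what "set req.http.X-Forwarded-For = client.ip;"
    // amounts to - any chain built by upstream proxies is discarded.
    function overwriteXff( $incomingXff, $clientIp ) {
        return $clientIp;
    }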
[18:39:47] RobH: the analytics people should be notified first
[18:39:49] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours
[18:39:50] PROBLEM - Puppet freshness on srv242 is CRITICAL: Puppet has not run in the last 10 hours
[18:39:50] so, just varnish is changing the chain
[18:40:01] RobH: https://gerrit.wikimedia.org/r/#/c/15856/
[18:40:02] RobH: if you want to do it at a certain time today, i'll email them
[18:40:42] binasher: I wanna do it as soon as possible, but its not an emergency
[18:40:44] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours
[18:40:44] PROBLEM - Puppet freshness on srv238 is CRITICAL: Puppet has not run in the last 10 hours
[18:40:44] PROBLEM - Puppet freshness on srv190 is CRITICAL: Puppet has not run in the last 10 hours
[18:40:50] RoanKattouw: and we only do it for geoiplookup
[18:40:53] so whatever the minimum leadtime they need would be best
[18:41:01] so, ideally we should check the host header there
[18:41:04] OK that's fine then
[18:41:09] RoanKattouw: well...
[18:41:25] RoanKattouw: what I meant is: we only *need* to do it for geoiplookup
[18:41:29] Right
[18:41:38] RobH: want to do it in 30min?
[18:41:40] Remember that geoiplookup is no longer its own host
[18:41:48] right now we do it for every non-https or opera-mini client
[18:41:55] It's no longer http://geoiplookup.wikimedia.org , it's now http://bits.wikimedia.org/geoiplookup
[18:42:01] damn it
[18:42:08] well, we can strip for bits
[18:42:09] that's fine
[18:42:15] it doesn't take edits
[18:42:18] Yeah
[18:42:24] let me open an rt
[18:42:40] binasher: That would be great, yep!
[18:42:43] this isn't really a problem right now
[18:42:49] but if someone cannot do that, and needs two hours, thats fine.
[18:42:50] it will be when we use varnish for text, though
[18:42:56] I will be here until 7PM EST
[18:42:59] or when we start allowing edits from mobile
[18:45:22] RobH: What's the hostname of my new shiny server?
[18:45:50] RobH: go for it at 3:15PM EST, a normal clean shutdown should be ok
[18:45:58] binasher: great, thank you!
[18:46:58] RoanKattouw: wtp1
[18:47:08] wtp.pmtpa.wmnet
[18:47:12] sorry, wtp1.pmtpa.wmnet
[18:47:13] RoanKattouw: added an rt to fix this
[18:47:22] Thanks
[18:47:35] welcome
[18:47:42] or you thanking ryan ;]
[18:47:47] i take his thanks as mine.
[18:47:59] hahaha
[18:48:01] he didnt send me any of the booze from sysadmin day
[18:48:05] so he owes me anyhow
[18:48:12] there's still two unopened bottles
[18:48:29] hmm?
[18:48:38] wouldnt this get server kitties drunk?
[18:48:58] They're not in the datacentre(s)
[18:50:05] New patchset: Catrope; "Install Parsoid on wtp1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17831
[18:50:14] RobH, Ryan_Lane: ---^^
[18:50:46] New patchset: Catrope; "Add a service class for Parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15856
[18:50:47] argh, when did the theme take effect
[18:50:56] its very bright.
[18:50:59] hahaha
[18:51:10] i dont love it.
[18:51:13] I know you aren't going to say you like the old one better
[18:51:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17831
[18:51:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15856
[18:51:26] meh, honestly, dont recall what it looked like already
[18:51:32] except it was easier on the brightness
[18:51:59] bleh
[18:52:02] it has a dependency
[18:52:21] * RobH goes back to on site stuff
[18:53:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17831
[18:53:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15856
[18:54:14] RoanKattouw: ok. merged all the way in
[18:55:44] Thanks
[18:56:00] * RoanKattouw hopes Parsoid will magically come up on wtp1 in the next hour
[18:56:34] I would run Puppet on the machine manually except it doesn't trust my key for root yet because Puppet hasn't run :)
[18:56:43] heh
[19:01:53] cmjohnson1: 3298 needs someone to install these machines and put them into puppet
[19:04:46] New patchset: MaxSem; "Add user accounts to the WLM host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17623
[19:05:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17623
[19:05:29] paravoid, ^^
[19:07:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17623
[19:09:52] paravoid, thank you
[19:10:39] MaxSem: done and applied on yttrium
[19:11:29] cool, I can log in now
[19:13:02] \o/
[19:13:12] thanks guys :)
[19:14:07] awjr, lol: The program 'git' is currently not installed. To run 'git' please ask your administrator to install the package 'git'
[19:14:15] * MaxSem goes back to puppet
[19:14:50] doh
[19:15:24] !log db1047 mysql and system shutdown per rt 3084 for bad memory swap
[19:15:33] Logged the message, RobH
[19:16:43] * RobH waits for mysql to actually shut down
[19:17:07] cmjohnson1: looks like you reseated dimms in mc1 this morning? hows it look?
[19:17:43] We need to have 2 more sent from DELL
[19:17:57] ah, ok
[19:17:58] i have a few things for them so I will be calling in a few mins
[19:18:01] !log db1047 shutting down
[19:18:07] cool
[19:18:09] Logged the message, RobH
[19:18:45] cmjohnson1: will you have time in the next day or two to try making the 10gb nics in those hosts pxe'able?
[19:19:06] yes
[19:19:17] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100%
[19:19:35] hoping to have them for you NLT Wednesday
[19:19:49] that would be great
[19:20:21] RobH: can you try to get the eqiad mc1-16 hosts pxe bootable from their 10gb nics this week too?
[19:20:38] PROBLEM - Apache HTTP on mw18 is CRITICAL: Connection refused
[19:26:11] binasher: only mc1001-1008 are racked and wired
[19:26:22] the other 8 i need a second person to help me, and that will prolly be next week.
[19:26:40] robh: i have a windows7 iso if you need a windows disk to help you with the mc servers
[19:26:42] ok, good to know. let me know if the time frame changes at all
[19:26:56] will do
[19:27:38] * RobH is watching db1047 post
[19:27:54] RobH: one other thing.. i mentioned wanting to buy a bunch of dbs for a new sharded data store in the ops meeting on monday. do you still need me to RT ticket you for the quote?
[19:28:17] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 35.76 ms
[19:30:12] !log db1047 back online
[19:30:21] Logged the message, RobH
[19:30:28] binasher: if there isnt a ticket in procurement, please create one with hardware details =]
[19:31:13] will do!
[19:31:35] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 649 seconds
[19:32:37] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 642 seconds
[19:35:19] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 302 Found - 0.015 second response time
[19:38:35] Ryan_Lane: there is some confusion
[19:38:46] you have RT 3127 to relabel kypton as virt1000
[19:38:51] but you already have a virt1000 in eqiad.
[19:39:41] assigning ticket 3127 back to you
[19:44:13] !log asw-c8-eqiad PEM1 power reseated, cleared alarm rt 3204
[19:44:22] Logged the message, RobH
[19:44:27] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[19:45:30] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[20:00:20] RobH: Are you somehow able to force a puppet run on wtp1?
[20:03:17] RoanKattouw: I'm getting "[564af014] 2012-08-06 20:02:41: Fatal exception of type MWException" when trying to see recent changes on mediawiki.org
[20:07:02] Just got the same thing
[20:07:05] Jasper_Deng: Yeah, I'm pushing out a prospective fix as we speak, sorry about that
[20:07:26] mediawiki.org is our guinea pig for new code, so this kind of thing happens every second Monday
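(For reference: the opaque "[564af014] ... Fatal exception of type MWException" hides its message and backtrace by default. On a test wiki, two standard MediaWiki/PHP settings in LocalSettings.php expose the details; not something to leave enabled in production.)

    <?php
    // Show the message and stack trace behind "Fatal exception of type MWException"
    // instead of just the bracketed exception ID.
    $wgShowExceptionDetails = true;

    // And for plain PHP errors rather than MediaWiki exceptions:
    error_reporting( -1 );
    ini_set( 'display_errors', 1 );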
[20:07:55] Jasper_Deng, in the mean time you can see irc.wikimedia.org #mediawiki.wikipedia
[20:08:12] RoanKattouw: is it up?
[20:08:19] Krenair: does not appear to be a valid channel
[20:08:19] it doesnt appear to respond to ssh
[20:08:24] RobH: It's responding to ping but not ssh
[20:08:25] RoanKattouw: was the OS installed?
[20:08:31] RobH: I have absolutely no idea
[20:08:35] checking mgmt
[20:08:53] its not installed
[20:08:56] its on the partitioning menu
[20:09:07] RoanKattouw: did the RT handoff say the os was installed?
[20:09:29] Let me check
[20:09:29] Jasper_Deng, what? I'm in it...
[20:09:42] It said "the server is ready to go! "
[20:09:45] Which is ambiguous I suppose
[20:10:15] yea, the os isnt installed
[20:10:15] Krenair: are you sure you typed it right?
[20:10:18] someone needs to do that
[20:10:27] i would kick that ticket back asking about it
[20:10:31] * [Krenair] #meta.wikimedia #mediawiki.wikipedia
[20:10:54] you don't appear to be in it atm
[20:11:53] RobH: I commented on the ticket but I have no idea how RT workflow works so I didn't attempt to give it back to Chris or anything
[20:12:02] It's #
[20:12:05] 3271
[20:12:15] RoanKattouw: if he was about i would ping and ask
[20:12:29] otherwise its a plain install, so asking ct to find an owner is prolly best
[20:12:38] anyone in ops can do it, just someone has to find time and such
[20:12:42] Right
[20:17:46] New patchset: Demon; "(bug 39040) The Wikimedia logo is too close to the change ID" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17888
[20:18:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17888
[20:22:16] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17888
[20:26:32] ex/away
[20:35:27] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms
[20:46:14] binasher: fed up with pmtpa? :)
[20:58:10] paravoid, do we need to puppetize mysql users on yttrium, or just set them up to our liking?
[20:58:34] I'm not sure of who does what on that box tbh
[21:00:11] so I guess it's the latter... it's just a temporary service for our evil needs
[21:00:18] haven't used it, but there is a puppet module for handling mysql users and db's https://github.com/puppetlabs/puppetlabs-mysql
[21:11:38] is daniel around?
[21:12:18] yep
[21:12:31] andrew_wmf: what's up
[21:15:44] New patchset: Catrope; "Parsoid is very intolerant of double slashes right now" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17900
[21:16:01] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17900
[21:19:21] * jeremyb would love a dump of the watchmouse config
[21:20:00] there's ~5 services there that are green for today and listed as 100% uptime. i think it's wrong for all of those (and there's ~3 that are green and 100% and I think are accurate)
[21:20:32] or at least it's plausible for those 3
[21:21:03] s{1,4}{,-uncahced} and mobile are all wrong i think
[21:21:21] New patchset: MaxSem; "Add Git on yttrium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17902
[21:21:23] uncached*
[21:21:36] please review^^
[21:22:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17902
[21:24:01] ^demon: what's the labs LDAP problem you speak of? still broke? [[mw:talk:developer access]]
[21:24:27] Ryan restarted opendj which "fixed" it
[21:25:57] LeslieCarr_afk: ct was asking me about switching back to eqiad.. how do you feel about the current connectivity between colos and fpl in general?
[21:26:29] binasher: pretty good - how about i call up fpl and ask about a freaking post mortem and then we'll see
[21:26:36] <^demon> jeremyb: Yeah, what Damianz said. If it's back up and working, a response on-wiki would be nice :)
[21:26:36] if they're doing more splice work i don't trust them
[21:26:45] since it's too easy to accidentally mess shit up
[21:26:50] ^demon: danke
[21:26:57] there's a lot of d names to tab through ;)
[21:27:05] LeslieCarr_afk: that sounds great
[21:27:33] though talking on a public channel does break my illusion of being away ;)
[21:28:00] quick, throw a smoke bomb
[21:28:24] LeslieCarr_afk: OTRS uses aliases and you can too!
[21:28:36] sneaky...
[21:29:46] notpeter: It looks like the cron job for PageTriage still isn't working. Are you the best person to bug about that?
[21:30:56] binasher: splicing activity is completed
[21:31:01] post mortem coming in 72 hours
[21:31:21] but, should be safe (as it ever is in tampa)
[21:31:29] kaldari: the last message I had from benny was that the cron wouldn't start producing output until september 9th
[21:31:42] also, is it ok to respond to the foundation-l thread about having too much money with "if we do, can we please spend some on a new datacenter?"
[21:31:44] I was under the impression that it was working correctly (albeit not doing anything yet)
[21:31:54] or is that just trolling the community
[21:32:15] ^demon: make me an account?
[21:32:17] they might troll us back about all the money spent on eqiad :)
[21:32:45] ^demon: https://www.mediawiki.org/wiki/Project:Labsconsole_accounts?diff=569529&oldid=569288
[21:33:29] <^demon> jeremyb: We should grant you access to this ;-)
[21:33:29] LeslieCarr_afk: do you want to work some bgp magic to resume routing of traffic bound for tampa via eqiad and that fiber at some point before we actually change dns?
[21:33:31] <^demon> One moment.
[21:33:37] ^demon: heh
[21:33:58] notpeter: I think there might have been some miscommunication. We're up to 18,000+ unreviewed articles because the cron isn't running. This isn't horrible, but we definitely need it resolved before September 9th, since that is the official launch. The extension is already collecting articles and being used right now though.
[21:34:17] * Damianz finds lcarr some cookies
[21:34:29] * binasher finds lcarr some rye
[21:34:40] :)
[21:34:53] notpeter: so ideally we would like the cron to be running now, but September 9th at the very latest.
[21:34:57] mmm dipping cookies in manhattans....
[21:35:02] ok
[21:35:05] this is what is being run
[21:35:06] 55 20 */2 * * /usr/local/bin/mwscript extensions/PageTriage/cron/updatePageTriageQueue.php enwiki > /tmp/updatePageTriageQueue.en.log
[21:35:11] * ^demon finds LeslieCarr_afk some hydrochloric acid.
[21:35:16] is that correct?
[21:35:39] output is
[21:35:39] Started processing...
[21:35:39] processed 0
[21:35:39] Completed
[21:35:47] eh? hmm
[21:35:49] that's weird
[21:35:56] yes
[21:35:57] so
[21:36:00] maybe there's a bug in the cron script
[21:36:00] I mean, I think it's running
[21:36:06] unless I'm invoking it incorrectly
[21:36:15] i somehow think oatmeal raisin cookies would work with a manhattan dip
[21:36:23] but I asked reedy if the syntax was correct, as it's something that I get wrong often
[21:36:27] and he gave it a green light
[21:36:29] i hate the c2100.
[21:36:32] New patchset: MaxSem; "Add Git on yttrium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17902
[21:36:42] binasher: +1 to the oatmeal cookies and manhattans
[21:36:54] notpeter: at this point it should be touching thousands of rows
[21:37:10] I'll double check the script
[21:37:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17902
[21:37:15] kk
[21:38:45] <^demon> jeremyb: Done, he should have an e-mail w/ password.
[21:38:57] ^demon: danke
[21:39:28] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: Connection refused
[21:39:40] ^^^^ that's me and it's ok.
[21:39:49] paravoid, could you merge https://gerrit.wikimedia.org/r/17902 ?
[21:47:03] notpeter: I think I see a problem with the cron script
[21:47:22] notpeter: thanks for checking on it though. sorry to bug you.
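(For context, the "Started processing... / processed 0 / Completed" output above is the usual shape of a MediaWiki maintenance script run via mwscript. A skeletal 1.19-era version is sketched below; it is illustrative only - the real updatePageTriageQueue.php query, and the bug being chased here, are not shown in the log.)

    <?php
    // Skeleton of an mwscript-runnable maintenance job (illustrative).
    require_once( dirname( __FILE__ ) . '/Maintenance.php' );

    class UpdateExampleQueue extends Maintenance {
        public function execute() {
            $this->output( "Started processing...\n" );
            $dbw = wfGetDB( DB_MASTER );
            // The real script selects and refreshes queue rows here;
            // "processed 0" means the selection matched nothing.
            $res = $dbw->select( 'page', array( 'page_id' ),
                array( 'page_namespace' => 0 ), __METHOD__,
                array( 'LIMIT' => 100 ) );
            $processed = 0;
            foreach ( $res as $row ) {
                $processed++;
            }
            $this->output( "processed $processed\nCompleted\n" );
        }
    }

    $maintClass = 'UpdateExampleQueue';
    require_once( RUN_MAINTENANCE_IF_MAIN );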
[21:48:00] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[21:48:12] Thehelpfulone: fyi for the mailing list overview table: travel-l --> travel
[21:48:30] Thehelpfulone: travel-l deleted, config copied to travel, now using travel and done
[21:53:05] maplebed: I am ok to take down ms-be1005 right?
[21:53:13] yes.
[21:53:16] it appears online, the disks you list failing to mount are mounted
[21:53:20] i take it you did that manually?
[21:53:32] lemme check.
[21:53:33] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 35.70 ms
[21:53:39] I see all the dmesg errors as well
[21:53:56] i checked the cables on ms-be1003 and rebooted it as well
[21:54:11] !log saving space on wikitech linode - gzipping old .sql dumps and stuff
[21:54:27] Logged the message, Master
[21:54:45] maplebed: the ms-be1003 will require reinstall later to see if it clears the issue, i have not checked post cable check
[21:54:48] RobH: I checked one drive (sdd) and it appears mounted, but trying to 'touch foo' in the mountpoint yields 'no space left on device'.
[21:54:51] i have both tickets
[21:54:53] mutante: that !log ate up more space on wikitech!
[21:54:55] so... something's broken.
[21:55:01] will shut it down and check connections
[21:55:03] thx
[21:55:05] k.
[21:55:13] jeremyb: heh yeah, but not several GB like the dumps:)
[21:58:41] PROBLEM - swift-container-updater on ms-be1005 is CRITICAL: Connection refused by host
[21:58:48] PROBLEM - swift-object-updater on ms-be1005 is CRITICAL: Connection refused by host
[21:58:48] PROBLEM - swift-account-server on ms-be1005 is CRITICAL: Connection refused by host
[21:58:57] PROBLEM - swift-object-server on ms-be1005 is CRITICAL: Connection refused by host
[21:59:06] PROBLEM - swift-account-auditor on ms-be1005 is CRITICAL: Connection refused by host
[21:59:07] !log rebooting db1026, upgrading to precise
[21:59:07] PROBLEM - swift-container-auditor on ms-be1005 is CRITICAL: Connection refused by host
[21:59:15] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused
[21:59:15] PROBLEM - swift-object-replicator on ms-be1005 is CRITICAL: Connection refused by host
[21:59:15] PROBLEM - swift-container-replicator on ms-be1005 is CRITICAL: Connection refused by host
[21:59:15] PROBLEM - swift-account-reaper on ms-be1005 is CRITICAL: Connection refused by host
[21:59:15] Logged the message, Master
[21:59:24] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: Connection refused by host
[21:59:33] PROBLEM - swift-account-replicator on ms-be1005 is CRITICAL: Connection refused by host
[21:59:51] PROBLEM - MySQL disk space on db1028 is CRITICAL: Connection refused by host
[22:00:01] PROBLEM - swift-container-server on ms-be1005 is CRITICAL: Connection refused by host
[22:00:27] PROBLEM - Host db1026 is DOWN: PING CRITICAL - Packet loss = 100%
[22:01:23] so who do i talk to about watchmouse? maybe i should make an RT
[22:01:30] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[22:02:24] RECOVERY - Host db1026 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms
[22:03:39] PROBLEM - NTP on db1026 is CRITICAL: NTP CRITICAL: Offset unknown
[22:06:30] RECOVERY - NTP on db1026 is OK: NTP OK: Offset -0.04107093811 secs
[22:12:30] RECOVERY - MySQL disk space on db1028 is OK: DISK OK
[22:13:42] !log rebooting db1027, upgrade to precise
[22:13:51] Logged the message, Master
[22:14:50] !log rebooting db1028, upgrading
[22:14:58] Logged the message, Master
[22:15:48] PROBLEM - Host db1027 is DOWN: PING CRITICAL - Packet loss = 100%
[22:17:27] RECOVERY - Host db1027 is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms
[22:17:27] PROBLEM - Host db1028 is DOWN: PING CRITICAL - Packet loss = 100%
[22:18:12] RECOVERY - Host db1028 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms
[22:21:56] maplebed: when do you think we can work on https://bugzilla.wikimedia.org/show_bug.cgi?id=34814?
[22:22:29] AaronSchulz: I've got two things on my plate that are more pressing - upgrading to 1.5 and setting up cross-colo replication.
[22:23:05] though, it's not trouble for me to make you an account to play with, especially in one of the labs clusters.
[22:27:42] !log dist-upgrading wikitech instance
[22:28:06] Logged the message, Master
[22:29:12] mutante: does wikitech get monitored in any way?
[22:33:16] jeremyb: linode sends mail for some things like disk space, disk i/o...
[22:33:19] New patchset: Bhartshorne; "rebuilt swift upgrade cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17915
[22:33:57] mutante: oh, didn't know they do disk space. i was thinking ganglia/nagios
[22:34:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17915
[22:34:32] mutante: for some reason didn't think you might be using the builtin monitoring, okey ;)
[22:35:36] jeremyb: a check on watchmouse wouldn't hurt, right
[22:35:42] jeremyb: but no, not in Nagios
[22:36:10] mutante: idk if there's a limit to what can go in watchmouse and that's not why i was asking about watchmouse
[22:36:25] * jeremyb wouldn't object to adding to watchmouse but maybe is overkill
[22:36:29] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17915
[22:36:29] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17614
[22:36:35] jeremyb: there is a limit but we can ask them to have it raised afaik
[22:36:45] k
[22:37:05] * jeremyb will be writing some watchmouse mail soon
[22:44:36] jeremyb: sure, i'm afraid there is no way to just dump the config though besides taking lots of screenshots or something
[22:44:45] it's all just web ui
[22:45:51] mutante: maybe they use some ajax that can be scraped? /me was worried about that very problem
[22:46:58] they have a 30 day free trial. i can get that and play with it
[22:47:13] will be fragile though
[22:47:35] mutante: do they offer readonly accounts? or everyone can do everything?
[22:47:40] btw, it is now called "Nimsoft Cloud Monitor"
[22:48:27] yeah
[22:48:43] jeremyb: not sure about read-only, please add that to ticket and i can check on it later
[22:48:49] > Sign up for a Nimsoft Cloud Monitor account and be the first to know when your company's website is not performing.
[22:49:05] heh, but " API refused to cooperate. Please try again later. (err 1000)
[22:49:06] did they tell us first this morning? ;)
[22:49:17] mutante: where's that?
[22:49:47] it was here, but it's gone already, very temp. it seems http://www.watchmouse.com/en/checkit.php
[22:49:58] huh
[22:49:59] jeremyb: they woke me up, so kinda yes. likely the europeans already knew though.
[22:50:33] <^demon> mutante: WFM. It says at the top "You can use this 5 times today."
[22:50:38] <^demon> Maybe you already used it 5 times?
[22:50:45] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[22:51:19] oh, yay, someone fixed the duplicate nagios-wm
[22:51:47] hrmmm, why is amsterdam on the list twice when i do checkit?
[22:51:55] ^demon: i think it was just unlucky timing and i caught it in a downtime that lasted a few seconds only, i also get the "5 times" message now
[22:52:06] no, 3x
[22:52:24] <^demon> mutante: Let's keep refreshing ;-)
[22:54:48] interesting, so somebody actually used "bazaar" in the past on wikitech
[22:55:57] <^demon> It was domas, maybe it was for mydumper? :)
[22:58:00] New patchset: Alex Monk; "(bug 34135#c4) Let admins change FlaggedRevs stable settings on cawikinews." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17918
[23:11:20] !log temp. setting wikitech to readonly for maintenance
[23:12:37] Isn't morebots supposed to reply to that?
[23:14:00] !log performing db schema upgrade on wikitech
[23:14:09] morebots: slap
[23:14:17] morebots is on wikitech
[23:14:46] just kill the python process, it's in a restart loop
[23:14:50] heh, ok..:p
[23:15:52] or maybe the bot is fine, maybe it just couldn't log to a read-only wiki
[23:16:16] oh yeah, so true
[23:16:51] lemme wait for the update.php
[23:17:00] it is still doing rev_id changes
[23:17:32] * AaronSchulz snickers
[23:17:43] mutante: sha1? It won't be as bad as enwiki ;)
[23:17:56] Reedy: which is still going btw
[23:18:01] indeed
[23:18:27] ugh, done but upgrade fail
[23:18:32] wikitech wiki is pretty big, because of all those server admin log revisions
[23:18:49] rev_sha1 and ar_sha1 population complete [44548 revision rows, 5284 archive rows].
[23:18:58] wikipages don't make for the most efficient logs ;)
[23:19:33] mutante: what was the error?
[23:19:47] <^demon> AaronSchulz: You mean we're not using an ideal solution?
[23:19:54] Reedy: getting no content on the page anymore
[23:19:56] <^demon> Maybe we could write an extension. Special:SAL :)
[23:19:58] TimStarling: can you comment on https://gerrit.wikimedia.org/r/#/c/17379/ when you get the chance?
[23:20:05] HTTP Error 500 (Internal Server Error): An unexpected condition was encountered while the server was attempting to fulfil the request.
[23:20:15] It looks like that read-only wasn't necessary
[23:20:25] Reedy: very specific
[23:20:47] <^demon> AaronSchulz: Add me to reviewer list. I want to see the seekrit commit ;-)
[23:21:10] I thought no drafts were fool-proof secret
[23:21:21] <^demon> Pfft, I don't wanna dig up a stupid gitweb url.
[23:21:49] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[23:22:36] mutante: need a hand?
[23:22:42] Reedy: yes please
[23:22:59] ^demon: Security by Laziness
[23:23:00] Reedy: well, or i could rollback
[23:23:15] <^demon> AaronSchulz: Hey, it works I guess :p
[23:23:48] We need some debugging:
[23:23:48] error_reporting( -1 );
[23:23:48] ini_set( 'display_errors', 1 );
[23:24:08] I'm surprised drafts are secret to gerrit admins.
[23:24:30] <^demon> Of course they are.
[23:24:42] <^demon> READ permissions are enforced even on admins.
[23:24:52] <^demon> But admins can change permissions ;-)
[23:25:01] heh
[23:25:18] <^demon> 'cept drafts.
[23:25:26] <^demon> I guess I could just watch the stream-events.
[23:25:30] Even more surprised that drafts aren't completely secret. Wouldn't you need to know the commit's hash to find it in gitweb?
[23:25:43] Reedy: ini_set( 'display_errors', 1 ); error_reporting( E_ALL );
[23:25:58] <^demon> Krenair: Either the hash or the refs/changes/whatever it got stashed in.
[23:26:15] <^demon> It doesn't surprise me at all. Gitweb is a flaming pile of dog crap.
[23:27:23] * AaronSchulz always thought of dog crap as not being that flammable
[23:27:42] ^demon: do we monitor temperature for the pile? maybe infrared?
[23:27:59] AaronSchulz: dry it out first
[23:28:13] mutante: that's already in localsettings?
[23:28:20] <^demon> jeremyb: I just added it to Watchmouse earlier.
[23:28:36] ^demon: cool, thanks
[23:28:38] I noticed it seems to be somewhat confused
[23:28:40] The requested URL /w/api.php was not found on this server.
[23:28:41] Reedy: the first line was commented and i uncommented, the other was in as is
[23:28:58] Did you move the source folders around?
[23:29:26] nope, but note it has /view/ instead of /wiki/
[23:29:42] (not that that should matter about /w/ i guess)
[23:30:12] Can't say I ever looked where wikitechs api was
[23:32:41] What's the apache error logs showing?
[23:33:49] File does not exist: /srv/org/wikimedia/wikitech/w/
[23:33:58] but that was just us looking for that manually
[23:34:43] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[23:36:38] Reedy: uhm.. www-data needs write permissions on ./cache/ right
[23:37:46] yeha
[23:40:48] Reedy: is it me or is wikitech throwing an error?
[23:41:08] 500 internal server error
[23:41:25] I mentioned this 20 minutes ago ;)
[23:41:29] they are updating it, i understand
[23:41:56] what's going on?
[23:41:57] !log wikitech currently down - will rollback soon if we can't fix
[23:42:28] there's no place to log things if there's a syntax error somewhere in the PHP, right?
[23:42:43] that was more for you:)
[23:42:54] and adding !log out of habit
[23:42:55] lol
[23:43:23] mutante: are there any error log entries?
[23:43:29] no :/
[23:43:34] php -l is ok?
[23:43:47] PHP Warning: PHP Startup: apc.shm_segments setting ignored in MMAP mode in Unknown on line 0
[23:43:50] but that's it
[23:45:11] it appears someone else tried this a while ago and it also broke
[23:45:26] What version was wikitech on?
[23:45:27] there is a "-broken" directory and an 1.18 tarball
[23:45:29] 1.17
[23:45:36] And you tried to update it to what?
[23:45:39] 1.19
[23:45:42] lol
[23:46:52] mutante: which server?
[23:46:55] rolling back
[23:47:09] i can load it again
[23:47:09] AaronSchulz: wikitech
[23:47:10] :p
[23:47:30] I wonder why it hasn't been switched to a vcs checkout
[23:47:34] ok guys, i rolled back the files
[23:47:41] * AaronSchulz has no access there
[23:47:42] but i did not yet rollback the db schema update
[23:47:57] There shouldn't be anything backwards incompatible
[23:48:05] yeah
[23:48:11] no more errors
[23:48:14] seems to work for me
[23:48:19] ok good
[23:48:40] !log test
[23:48:46] mutante: did you try php maintenance/eval.php ?
[23:48:50] Logged the message, Master
[23:49:06] !log wikitech back to old version after failed upgrade attempt
[23:49:15] Logged the message, Master
[23:49:25] AaronSchulz: no, i just did a maintenance/update.php
[23:51:14] This wiki is powered by [//www.mediawiki.org/ MediaWiki]
[23:51:16] lol
[23:51:42] 1.17wmf1 (r90469)
[23:51:47] Per aaron, putting the newer files back, and using eval.php might yield some errors
[23:51:59] I'm not quite sure if wikitech really needs to run wmf branches
[23:52:21] hmm, let's try again these days. I'd appreciate to work with somebody from dev next time:)
[23:52:33] so we can go to the linode shell together and take a look
[23:53:29] I think there's probably only Tim and Roan (and brion maybe, for old times sake ;)) in dev that'd have access..
[23:54:12] ok, yeah, that's why i said we can do it together at office.. i am here now
[23:54:34] and we can open a shell at my desk
[23:55:24] ahh
[23:55:35] i think Brion put the 1.18 tarball there
[23:55:46] well at least telling from the permissions
[23:56:18] I'll try and ask him
[23:58:01] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[23:58:24] AaronSchulz: that draft looks good in principle
[23:59:40] TimStarling: but...? :)
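(Postscript: the maintenance/eval.php route Aaron suggested above is often the quickest way to see the fatal that a bare HTTP 500 hides, since it bootstraps LocalSettings.php on the command line and prints any startup error straight to the terminal. A hypothetical session against the rolled-back 1.17 install; the actual wikitech error never made it into this log.)

    $ php maintenance/eval.php
    > echo $wgVersion;
    1.17.0
    > echo Title::newMainPage()->getPrefixedText();
    Main Page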