[00:05:24] !log asher synchronized wmf-config/db.php 'returning db32 to normal weight' [00:05:26] Logged the message, Master [00:12:30] New patchset: Asher; "adding db52/53 to s1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1971 [00:12:45] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1971 [00:12:51] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1971 [00:12:52] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1971 [00:16:25] PROBLEM - Squid on brewster is CRITICAL: Connection refused [00:16:26] PROBLEM - Squid on brewster is CRITICAL: Connection refused [00:18:42] !log cleared lighttpd logs on brewster and restarted squid and lighttpd [00:18:44] Logged the message, Master [00:21:05] !log rebuilding virt1 as a nova compute node [00:21:06] Logged the message, Master [00:22:50] !log removing virt1 cname [00:22:52] Logged the message, Master [00:26:15] RECOVERY - Squid on brewster is OK: TCP OK - 0.000 second response time on port 8080 [00:26:16] RECOVERY - Squid on brewster is OK: TCP OK - 0.000 second response time on port 8080 [00:34:37] New patchset: Ryan Lane; "Removing virt1.wikimedia.org and adding virt1.pmtpa.wmnet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1972 [00:34:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1972 [00:40:40] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1972 [00:40:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1972 [00:41:23] People are making themselves on admins on English Wikipedia, as we speak, right now. Is this a hole in the system? http://en.wikipedia.org/wiki/Special:Log/rights [00:43:01] Okay, perhaps not, as one of them did submit themselves for review. [00:43:12] (Administrator review) [00:50:35] RECOVERY - Puppet freshness on virt1 is OK: puppet ran at Thu Jan 19 00:50:23 UTC 2012 [00:50:36] RECOVERY - Puppet freshness on virt1 is OK: puppet ran at Thu Jan 19 00:50:23 UTC 2012 [00:51:59] zanimum: those users were adding or removing other rights, they were already administrators [00:52:28] weird, but phew [01:01:06] !log installed python-argparse on stat1 [01:01:07] Logged the message, Master [01:12:53] New patchset: Asher; "removing extra frontend cache capacity from mobile" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1973 [01:13:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1973 [01:13:09] !log updated zip code/representative data on enwiki to r109465 [01:13:38] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1973 [01:13:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1973 [01:15:21] Logged the message, Master [01:15:35] ... lag ;) [01:17:19] !log Leaving cleanupUploadStash.php running against commonswiki in a screen session as me on hume [01:17:21] Logged the message, Master [01:17:52] will the normal editing mode change be automatic or does someone need to manually change the code back? 
[01:18:15] Manually [01:18:21] But Ryan has planned to be around to do it [01:18:32] okay :) [01:20:11] New patchset: Ryan Lane; "Adding virt1 public cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1974 [01:20:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1974 [01:21:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1974 [01:21:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1974 [01:26:39] RECOVERY - ps1-d2-pmtpa-infeed-load-tower-A-phase-Y on ps1-d2-pmtpa is OK: ps1-d2-pmtpa-infeed-load-tower-A-phase-Y OK - 1200 [01:26:39] RECOVERY - ps1-d2-pmtpa-infeed-load-tower-A-phase-Y on ps1-d2-pmtpa is OK: ps1-d2-pmtpa-infeed-load-tower-A-phase-Y OK - 1200 [01:42:46] PROBLEM - Varnish HTTP mobile-frontend on cp1040 is CRITICAL: Connection refused [01:42:47] PROBLEM - Varnish HTTP mobile-frontend on cp1040 is CRITICAL: Connection refused [01:46:35] PROBLEM - Varnish HTTP mobile-frontend on cp1039 is CRITICAL: Connection refused [01:46:36] PROBLEM - Varnish HTTP mobile-frontend on cp1039 is CRITICAL: Connection refused [01:46:55] PROBLEM - mobile traffic loggers on cp1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [01:46:56] PROBLEM - mobile traffic loggers on cp1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [01:49:15] PROBLEM - Varnish HTTP mobile-backend on cp1039 is CRITICAL: Connection refused [01:49:16] PROBLEM - Varnish HTTP mobile-backend on cp1039 is CRITICAL: Connection refused [01:50:55] PROBLEM - mobile traffic loggers on cp1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [01:50:55] PROBLEM - mobile traffic loggers on cp1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [01:50:55] ACKNOWLEDGEMENT - Varnish HTTP mobile-backend on cp1039 is CRITICAL: Connection refused asher waiting for puppet [01:50:56] ACKNOWLEDGEMENT - Varnish HTTP mobile-backend on cp1039 is CRITICAL: Connection refused asher waiting for puppet [01:52:45] PROBLEM - Varnish HTTP mobile-backend on cp1040 is CRITICAL: Connection refused [01:52:46] PROBLEM - Varnish HTTP mobile-backend on cp1040 is CRITICAL: Connection refused [01:53:45] ACKNOWLEDGEMENT - mobile traffic loggers on cp1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa asher waiting for puppet [01:53:46] ACKNOWLEDGEMENT - mobile traffic loggers on cp1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa asher waiting for puppet [01:55:38] !log added virt1 and virt4 to instance volume for gluster [01:55:40] Logged the message, Master [02:00:25] RECOVERY - mobile traffic loggers on cp1044 is OK: PROCS OK: 2 processes with command name varnishncsa [02:00:25] RECOVERY - mobile traffic loggers on cp1044 is OK: PROCS OK: 2 processes with command name varnishncsa [02:05:13] !log rebalancing instance gluster volume. network may get saturated for a while. 
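The gluster work just logged (adding virt1/virt4 to the instance volume, then rebalancing) maps onto a small set of GlusterFS commands. A minimal sketch, assuming a volume named "instances" and illustrative brick paths; neither the real volume name nor the brick paths appear in the log:

    # Add bricks on the new hosts to the existing volume (brick paths are assumptions)
    gluster volume add-brick instances virt1.pmtpa.wmnet:/a/instances virt4.pmtpa.wmnet:/a/instances
    # Spread existing data across the new bricks; this is the step that saturates the network
    gluster volume rebalance instances start
    gluster volume rebalance instances status

The rebalance runs server-side, so the status command is how the completion that gets !logged shortly afterwards would be confirmed.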
[02:05:16] Logged the message, Master [02:05:51] http://ganglia.wikimedia.org/2.2.0/graph_all_periods.php?c=Virtualization%20cluster%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1326938719&g=network_report&z=large&c=Virtualization%20cluster%20pmtpa [02:05:55] !log LocalisationUpdate completed (1.18) at Thu Jan 19 02:05:55 UTC 2012 [02:05:57] ^^ spike in network stats for rebalance :D [02:05:57] Logged the message, Master [02:06:18] !log rebalance of gluster volume completed [02:06:20] Logged the message, Master [02:17:26] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1321s [02:17:26] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1321s [02:23:35] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1691s [02:23:35] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1691s [02:29:49] !log awjrichards synchronized php/extensions/CongressLookup/CongressLookup.i18n.php 'r109477' [02:29:52] Logged the message, Master [02:30:05] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2488* [02:30:05] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2488* [02:30:31] !log awjrichards synchronized php/extensions/CongressLookup/SpecialCongressLookup.php 'r109477' [02:30:32] Logged the message, Master [02:33:35] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 14s [02:33:35] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 14s [02:37:55] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:37:55] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [03:01:06] !log rebooting virt1 to ensure hardware virtualization is enabled in the bios [03:01:08] Logged the message, Master [03:11:34] !log bringing virt1 back up [03:11:36] Logged the message, Master [03:21:35] RECOVERY - Puppet freshness on db1045 is OK: puppet ran at Thu Jan 19 03:21:14 UTC 2012 [03:21:35] RECOVERY - Puppet freshness on db1045 is OK: puppet ran at Thu Jan 19 03:21:14 UTC 2012 [04:13:33] Are there still blog slowdowns? [04:15:45] RECOVERY - Disk space on es1004 is OK: DISK OK [04:15:46] RECOVERY - Disk space on es1004 is OK: DISK OK [04:16:45] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [04:16:46] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [04:32:06] !log awjrichards synchronizing Wikimedia installation... : Deploying CongressLookup changes for the lifting of the blackout [04:32:08] Logged the message, Master [04:34:09] sync done. [04:41:56] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [04:41:57] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [04:42:06] awjr: is that the last scap? [04:42:45] it should be - preilly has a sync-file left to do and i'll have a configchange to do right at nine [04:42:55] Ryan_Lane ^ [04:43:07] which configchange? 
[04:43:17] I'm doing a bunch of config changes at 9 [04:43:26] heh ok maybe you can do mine too, then [04:43:34] is it going to be in InitialiseSettings? [04:43:55] yeah [04:44:07] wmgCongressLookupBlackOnWhite needs to be set to true for enwiki [04:44:13] ok [04:44:27] last stanza in the file. [04:45:31] ok. I have it set and commented out [04:45:36] what does this do? [04:45:46] cool looks good [04:46:03] awjr: ^^ [04:46:16] Ryan_Lane: you aren't going to SCAP right? [04:46:20] no [04:46:32] it essentially inverts the color scheme on Special:CongressLookup [04:46:33] someone just did scap, though [04:46:37] it was me [04:46:39] oh wait [04:46:46] i mean, i scapped like 1- mins ago [04:46:49] er 10 [04:47:51] !log Preparing InitialiseSettings for renabling Wikipedia. DO NOT SCAP, DO NOT PUSH InitializeSettings [04:47:54] Logged the message, Master [04:47:58] or I will kill you [04:53:36] !log preilly synchronized php-1.18/extensions/MobileFrontend/ 'new sopa banner' [04:53:38] Logged the message, Master [04:57:02] !log flushing mobile varnish caches [04:57:03] Logged the message, Master [05:00:09] !log laner synchronized wmf-config/InitialiseSettings.php 'Removing all SOPA changes, excluding editing for anons, and page creation' [05:00:11] Logged the message, Master [05:00:15] !!!!! [05:00:17] Yay :D [05:03:27] binasher: read-only mode on database! [05:03:41] grr [05:03:44] :D [05:04:41] Ryan_Lane: it shouldn't be.. the line is commented out in db.php #>------'s1'>--- => 'Maintenance in progress, please try again in 5 minutes', [05:04:49] what about the master itself? [05:04:53] but its possible that some apaches didn't get it [05:05:04] no one can edit [05:05:09] so it's not just a few [05:05:16] | read_only | OFF | [05:05:19] hmm [05:05:27] ah. it's working now apparently [05:05:35] awjr has been updating congressional info since the master switch too [05:05:37] \o/ [05:05:48] did anything change? [05:05:50] not sure what happened [05:05:51] it was a auto lock, the slaves were behind [05:06:03] It's fixed [05:06:14] !log preilly synchronized php-1.18/extensions/MobileFrontend/ 'new sopa banner' [05:06:16] Logged the message, Master [05:07:38] binasher: heh. sorry for the scare ;) [05:07:47] flood of writes [05:07:55] ahhhh [05:08:05] preilly: still getting black banners on mobile [05:08:12] is that why it was having read-only issues? [05:08:48] IP editing is still disabled, is that intended? [05:08:57] !log preilly synchronized php-1.18/extensions/MobileFrontend/ 'new sopa banner' [05:08:59] Logged the message, Master [05:09:36] Shirik: Phillippe said apparently on #wikimedia-sopa "one minute" [05:09:41] ok thanks [05:10:12] er, "any minute" to be precise [05:12:13] !log laner synchronized wmf-config/InitialiseSettings.php 'Enabling page creation for users' [05:12:15] Logged the message, Master [05:14:46] !log laner synchronized wmf-config/InitialiseSettings.php 'Enabling anon editing for enwiki' [05:14:48] Logged the message, Master [05:16:20] !log preilly synchronized php-1.18/extensions/MobileFrontend/ 'new sopa banner' [05:16:22] Logged the message, Master [05:16:45] it looks like there are a few straggler apaches that aren't getting syncs but are getting traffic.. 
there are some entries in dberror.log about hosts trying to talk to ms2 which has been out of prod for a month or more [06:31:19] !log asher synchronized php-1.18/extensions/CongressLookup/SpecialCongressLookup.php 'new formatting for congresslookup background graphic' [07:28:07] Isn't the blackout supposed to be finished? I am getting a 'whiteout' now [07:29:42] it is supposed to be over, yes [07:29:49] I was able to load a page and read it [07:31:18] When I go to English Wikipedia, I still get the see-the-page-for-a-fragment-of-a-second-then-it-is-replaced-by-the-blackout effect - but with an empty screen instead of the SOPA protest screen as the blackout screen [07:33:51] Andre_Engels: You might need to bypass your browsers cache http://en.wikipedia.org/wiki/Wikipedia:Bypass_your_cache&banner=no [07:35:47] https://en.wikipedia.org/wiki/Wikipedia:Bypass_your_cache?banner=no [07:39:14] Thanks, but still didn't help :-( [07:39:40] Emptying the cache that is; the banner=no did work [07:40:11] One moment, will try stopping and restarting my browser [07:40:15] ok [07:42:30] any better? [07:42:45] Nope, still doesn't work... Now don't even get to see the page for a few milliseconds any more... [07:43:19] if you use another browser, (chrome if you're using ff, for example) what does that do foryou? [07:44:39] Andre - try CTRL + F5 [07:44:54] Thats the force reload combination for most browsers. [07:45:37] @Excirial: I already tried that before coming here [07:46:29] aspergos: it's indeed in the browser, I am getting the page correctly when using IE instead of FF [07:46:38] ok [07:47:00] Ctrl shift del doesnt work either? [07:47:05] clear your browser cache, cookies, everything... for good measure [07:47:08] well you might... hmm.. close the tab in the misbehaving browser, that has the page loaded, toss all cookies and all cache [07:47:08] (in firefox) [07:47:10] exit [07:47:17] then restart and see if that gets it [07:48:16] (um, it's *apergos* btw. no connection to asperger's :-P) [07:51:51] * Prodego pets apergos, he didn't mean it [07:52:00] heh [07:52:11] Asparagus, i mean aspergers, i mean... Apergos... :p [07:52:12] mister "SOPAOnWheels"... :-P [07:52:45] apergos: yea I can't remember the password for that one... might ask one of you guys to set the email for me so I can recover it if I think of a good use :) [07:52:46] I hope someone has a screenshot of rc for en wp during the blackout [07:52:50] it should all fit on one page [07:53:03] hahahah [07:53:31] 's what you get for crossing the picket lines during a blackout :-P [07:53:33] Andre_Engels: how'd that work out for you? [07:53:55] apergos: :) [07:54:04] Ok guys, found it. [07:54:07] oh? [07:54:15] what was it?? [07:54:28] It was a bad setting I created in an add-on during that day [07:54:34] :-) [07:54:39] ok, glad problem solved! [07:54:42] I had attempted to remove the blackout screen using "Remove it permanently" [07:54:56] ah, you wanted to do this - http://meta.wikimedia.org/w/index.php?action=raw&ctype=text/css&title=User:Prodego/enw.css [07:55:02] ... but instead ended up removing all
s on English Wikipedia pages [07:55:26] I personally thought my idea of including a meta .css file in my enwiki .css file was quite clever [07:55:42] some peoplpe (me) did not try to do some goofy workaround [07:55:46] just sayin.... [07:56:02] Just disabling javascript would have worked fine as well. [07:56:06] ?banner=anything worked nicely too [07:56:24] apergos: but.. but what if I wanted to look something up? [07:56:49] guess I'll just have to use conservapedia [07:56:57] www.happyplace.com/13509/alternative-information-sources-while-wikipedia-is-down [07:57:19] in the news at conservapedia - Why isn't Wikipedia protesting Hollywood's insistence on SOPA/PIPA? [07:58:25] ""Wikipedia editors question site's blackout." [4] Why doesn't Wikipedia instead expose the big liberal money being poured into the Democrat Party to pass the bad bill? [07:58:26] " [07:58:39] ah well, read too much of that and your head explodes [07:58:50] good night apergos, Excirial, Andre_Engels [07:58:55] other people [08:03:05] Excirial: I tried the disabling Javascript, but if I have javascript disabled on Wikipedia, I get to see only the left-side vertical strip thingie [08:03:53] Wait, no, does work now. [08:04:02] Ah well, I managede [09:06:35] PROBLEM - Puppet freshness on mw1096 is CRITICAL: Puppet has not run in the last 10 hours [09:06:36] PROBLEM - Puppet freshness on mw1096 is CRITICAL: Puppet has not run in the last 10 hours [09:26:57] Another correction of "zip code" to "ZIP code" is needed: http://en.wikipedia.org/wiki/Special:CongressLookup [09:35:09] New patchset: ArielGlenn; "add snapshot1001-4 to site.pp and to download exports list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1975 [09:41:30] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1975 [09:41:31] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1975 [09:50:05] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 424992 MB (3% inode=99%): [09:50:05] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 424992 MB (3% inode=99%): [09:50:15] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 424699 MB (3% inode=99%): [09:50:16] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 424699 MB (3% inode=99%): [10:16:35] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [10:16:35] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [10:36:15] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:36:15] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:36:35] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [10:36:35] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [10:41:26] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [10:41:26] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [11:06:20] New review: Dzahn; "looks like this is related to a new puppet problem on fenari:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1972 [11:32:52] New patchset: ArielGlenn; "em.. the new snaps are at equid, add to regexp in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1976 [11:37:26] New patchset: ArielGlenn; "em.. 
the new snaps are at equid, add to regexp in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1976 [11:44:39] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1976 [11:44:40] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1976 [12:08:58] I'm getting a 500 error on the English Wikinews RSS feed... http://en.wikinews.org/w/index.php?title=Special:NewsFeed&feed=rss&categories=Published&notcategories=No%20publish%7CArchived%7CAutoArchived%7Cdisputed&namespace=0&count=128&hourcount=240&ordermethod=categoryadd&stablepages=only [12:12:04] brianmc: you surely do [12:12:12] Because your URL is incorrect [12:12:26] It has XML entities in it [12:12:54] That worked until 17/01/12 22:22 [12:13:54] Well, you are supposed to strip them [12:17:24] The RSS link on the enWN main page, https://en.wikinews.org/w/index.php?title=Special:NewsFeed&feed=atom&categories=Published¬categories=No%20publish|Archived|AutoArchived|disputed&namespace=0&count=30&hourcount=124&ordermethod=categoryadd&stablepages=only, is similarly failing [12:18:24] PHP fatal error in /usr/local/apache/common-local/php-1.18/extensions/GoogleNewsSitemap/FeedSMItem.php line 111 [12:18:30] Access level to FeedSMItem::$title must be public (as in class FeedItem) [12:18:38] Looks like we got a real problem [12:22:20] Thanks, was a bit of head-scratching for me there. [12:40:57] Hello there Masti! After a long time! ;) [12:41:56] hi Tanvir ;) [12:42:28] Okay, whatever broke Wikinews' NewsFeed happened between 22:22 and 22:30UTC on the 17th... [12:44:26] note that shortly the wikitech web site (and therefor the server admin log and the bot that logs to it) will be unavailable, since the hosting site is moving our content off its currently broken instance to a new one [13:01:34] erm, morebots died about 7 1/2 hours ago anyway [13:04:39] good I'm not logging anything then :-/ [13:06:37] !log cleanupUploadStash finished for Commons [13:06:50] probably not [13:07:06] the wikitech instance is being moved or will be soon, by our hosters [13:07:15] ah [13:07:21] besides morebots being dead [13:07:29] Not a big deal [13:07:31] heh [14:24:15] PROBLEM - Apache HTTP on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:15] PROBLEM - Apache HTTP on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:25] PROBLEM - Disk space on srv263 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:30:25] PROBLEM - Disk space on srv263 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:35] PROBLEM - DPKG on srv263 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:35] PROBLEM - RAID on srv263 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:36] PROBLEM - DPKG on srv263 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:36] PROBLEM - RAID on srv263 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:13] !log reedy synchronized php-1.18/extensions/GoogleNewsSitemap/FeedSMItem.php 'r109532' [14:36:26] PROBLEM - SSH on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:36:26] PROBLEM - SSH on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:58] Hi, I'm having an issue on en.wiki with renames. I'm getting an error message: There was a problem with receiving the request. Please go back and try again. 
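For the Special:NewsFeed 500 discussed above, there were two separate problems: unstripped XML entities in the pasted URL, and a PHP fatal in FeedSMItem.php. A rough sketch of how the feed could be re-checked from a shell once a fix is synced, assuming xmllint is available; the query string simply mirrors the one quoted above, with literal & separators rather than &amp; entities:

    # Expect a 200 once the FeedSMItem fatal is gone; a 500 means it is still being hit
    curl -s -o /tmp/newsfeed.xml -w '%{http_code}\n' \
        'http://en.wikinews.org/w/index.php?title=Special:NewsFeed&feed=rss&categories=Published&notcategories=No%20publish|Archived|AutoArchived|disputed&namespace=0&count=30'
    # And confirm the body is well-formed XML rather than an error page
    xmllint --noout /tmp/newsfeed.xml && echo 'feed parses'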
[14:39:02] I am trying to fulfill http://en.wikipedia.org/wiki/Wikipedia:Changing_username/Simple#Pfc432_.E2.86.92_Brenda_Fernandez [14:42:10] was the new username created during the rename? [14:44:29] apergos: no [14:44:46] so it didn't even get that far [14:45:15] PROBLEM - MySQL slave status on es2 is CRITICAL: CRITICAL: Connected threads = 1199 (1000) [14:45:16] PROBLEM - MySQL slave status on es2 is CRITICAL: CRITICAL: Connected threads = 1199 (1000) [14:45:26] apergos: correct [14:45:36] PROBLEM - DPKG on srv259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:45:36] PROBLEM - Disk space on srv275 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:45:36] PROBLEM - DPKG on srv259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:45:37] PROBLEM - Disk space on srv275 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:45:47] PROBLEM - Disk space on srv259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:45:47] PROBLEM - Disk space on srv259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:46:15] PROBLEM - MySQL slave status on es4 is CRITICAL: CRITICAL: Connected threads = 1159 (1000) [14:46:16] PROBLEM - MySQL slave status on es4 is CRITICAL: CRITICAL: Connected threads = 1159 (1000) [14:46:20] uh, things are lagging pretty hard on frwiki [14:47:05] PROBLEM - SSH on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:05] PROBLEM - SSH on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:53] brianmc, about? [14:48:16] Request: POST http://www.mediawiki.org/wiki/Special:Code/MediaWiki/109532, from 208.80.152.72 via sq59.wikimedia.org (squid/2.7.STABLE9) to () [14:48:16] Error: ERR_CANNOT_FORWARD, errno [No Error] at Thu, 19 Jan 2012 14:47:23 GMT [14:48:29] ugh [14:48:30] Had multiple of those in the past 2 mins. [14:48:30] I'm getting Request: GET http://zh.wikipedia.org/wiki/Template:Country, from [ip] via sq65.wikimedia.org (squid/2.7.STABLE9) to () [14:48:31] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Thu, 19 Jan 2012 14:47:24 GMT [14:49:55] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:49:55] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:25] PROBLEM - RAID on srv275 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:26] PROBLEM - RAID on srv275 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:34] hello [14:50:35] http://commons.wikimedia.org/w/index.php?title=Commons:Deletion_requests/All_files_copyrighted_in_the_US_under_the_URAA&curid=18088827&diff=65642425&oldid=65642381 [14:50:45] PROBLEM - RAID on srv259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:48] Request: GET http://commons.wikimedia.org/w/index.php?title=Commons:Deletion_requests/All_files_copyrighted_in_the_US_under_the_URAA&curid=18088827&diff=65642425&oldid=65642381, from 208.80.152.87 via sq66.wikimedia.org (squid/2.7.STABLE9) to () [14:50:48] PROBLEM - RAID on srv259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:48] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Thu, 19 Jan 2012 14:49:27 GMT [14:50:54] are you aware of that? 
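With the squids returning ERR_CANNOT_FORWARD and the es slaves tripping their 1000-connected-threads threshold, the usual first-pass checks are whether a backend apache still answers when squid is bypassed and how many client threads the database is actually holding. A rough sketch, assuming shell access from a bastion, the .pmtpa.wmnet internal names used elsewhere in this log, and working MySQL credentials (all assumptions):

    # Does an individual apache answer if squid is bypassed?
    curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
        -H 'Host: en.wikipedia.org' http://srv263.pmtpa.wmnet/wiki/Main_Page
    # How many threads is an external-storage slave holding? (the alert threshold is 1000)
    mysql -h es2 -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"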
[14:50:56] PROBLEM - SSH on srv275 is CRITICAL: Server answer: [14:50:57] PROBLEM - SSH on srv275 is CRITICAL: Server answer: [14:51:22] yannf: yes [14:51:35] PROBLEM - SSH on srv286 is CRITICAL: Server answer: [14:51:36] PROBLEM - SSH on srv286 is CRITICAL: Server answer: [14:51:37] ok [14:51:43] read appears to work, change/edit is failing. [14:52:05] Still here, Reedy - but might be worth getting out everyone's hair ;) [14:52:15] PROBLEM - DPKG on srv275 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:52:15] PROBLEM - DPKG on srv275 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:52:25] PROBLEM - Disk space on srv286 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:52:25] PROBLEM - Disk space on srv286 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:53:15] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:15] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:23] brianmc, in theory, the rss issue should be fixed... but i'm not familiar with any caching that takes place... Do you know where vvv got the actualy fatal listed? [14:53:44] Oh [14:53:46] I se eanother [14:53:56] Eh, nope. I just posted a bust link, and he dug that out [14:54:18] i just found another fatal [14:54:20] let me fix that [14:55:22] Anyone from ops able to provide some brief info? [14:55:46] I'm looking at it and don't know what's wrong [14:55:57] PROBLEM - DPKG on srv286 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:55:57] PROBLEM - DPKG on srv286 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:56:02] apergos: need more info, or are you able to reproduce? [14:56:19] I'm looking at these servers that are out to lunch, basically [14:56:26] hah [14:57:09] Thanks. [14:57:55] PROBLEM - RAID on srv286 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:55] PROBLEM - RAID on srv286 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:53] Hm. Taking squid ERR_CANNOT_FORWARD, errno (11) on http://wikimediafoundation.org/wiki/SOPA/Blackoutpage [14:58:55] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:56] PROBLEM - Apache HTTP on srv286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:06] I see 15 apaches not working during sync-file.. [14:59:27] !log reedy synchronized php-1.18/extensions/GoogleNewsSitemap/FeedSMItem.php 'r109538' [15:00:06] hi [15:00:10] srv275--had a big jump in cpu wait i/o time [15:00:24] Request: GET http://it.wikipedia.org/wiki/Pagina_principale, from 208.80.152.86 via sq66.wikimedia.org (squid/2.7.STABLE9) to () [15:00:25] Error: ERR_CANNOT_FORWARD, errno (115) Operation now in progress at Thu, 19 Jan 2012 14:59:24 GMT [15:01:01] that sorta feels like nfsfail [15:01:45] RECOVERY - SSH on srv286 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:01:46] RECOVERY - SSH on srv286 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:02:25] RECOVERY - Disk space on srv286 is OK: DISK OK [15:02:25] RECOVERY - Disk space on srv286 is OK: DISK OK [15:02:57] Jeff_Green, a few of the busy apaches have had quite an increase of swap usage [15:03:35] Reedy: yeah, I see that on srv286 for example [15:03:35] PROBLEM - DPKG on srv261 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:03:35] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:36] PROBLEM - DPKG on srv261 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:03:36] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:45] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:46] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:57] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:58] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:12] srv275 too [15:04:28] what just died? [15:04:29] 263, 261 [15:04:42] 279 [15:04:45] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:45] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:45] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:46] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:53] 268 [15:04:54] ? [15:04:59] has this been happening a lot? [15:05:05] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:05] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:13] You get the odd apache do it from time to time [15:05:15] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:15] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:25] PROBLEM - Disk space on srv261 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:05:25] PROBLEM - Disk space on srv261 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:05:38] might be time for a cron to log processes so we can see which bloats [15:05:55] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:55] RECOVERY - DPKG on srv286 is OK: All packages OK [15:05:56] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:56] RECOVERY - DPKG on srv286 is OK: All packages OK [15:05:58] as it stands I can't get a session to investigate [15:06:05] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:05] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:13] 38 srv boxes and 25 mw boxes on nagios complaining [15:06:20] I think the usual process is just to bounce the box [15:06:25] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:25] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:25] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:26] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:26] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:26] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:27] well [15:06:29] those machines are idle [15:06:32] apache threads are locked up [15:06:38] well, doing something [15:06:45] PROBLEM - Apache HTTP on srv271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:46] PROBLEM - Apache HTTP on srv271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:04] i'd happily slay procs if I could actually get a session :-( [15:07:06] #2 php_sock_stream_wait_for_data (stream=0x7fb63da3f0a0, buf=0x7fb63cc8c3b0 "\230\225\f:\266\177", count=8192) at /tmp/buildd/php5-5.3.2/main/streams/xp_socket.c:131 [15:07:06] #3 php_sockop_read (stream=0x7fb63da3f0a0, buf=0x7fb63cc8c3b0 "\230\225\f:\266\177", count=8192) at /tmp/buildd/php5-5.3.2/main/streams/xp_socket.c:154 [15:07:06] #4 0x00007fb63836db9a in php_openssl_sockop_read (stream=0x7fff764f7220, buf=0x7fb63cc8c3b0 "\230\225\f:\266\177", count=500) at /tmp/buildd/php5-5.3.2/ext/openssl/xp_ssl.c:234 [15:07:06] #5 0x00007fb63858a764 in php_stream_fill_read_buffer (stream=0x7fb63da3f0a0, size=) at /tmp/buildd/php5-5.3.2/main/streams/streams.c:562 [15:07:08] #6 0x00007fb63858a910 in _php_stream_get_line (stream=0x7fb63da3f0a0, buf=0x0, maxlen=500, returned_len=0xffffffffffffffff) at /tmp/buildd/php5-5.3.2/main/streams/streams.c:841 [15:07:11] #7 0x00007fb638500c01 in zif_fgets (ht=1, return_value=0x7fb63e75cd60, return_value_ptr=, this_ptr=, return_value_used=) at /tmp/buildd/php5-5.3.2/ext/standard/file.c:1074 [15:07:15] #8 0x00007fb63861e3fa in zend_do_fcall_common_helper_SPEC (execute_data=0x7fb63ea2ce48) at /tmp/buildd/php5-5.3.2/Zend/zend_vm_execute.h:313 [15:07:18] where do we do SSL reads? [15:07:25] PROBLEM - Apache HTTP on srv261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:26] PROBLEM - Apache HTTP on srv261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:30] * domas resolves stack trace manually [15:07:45] RECOVERY - RAID on srv286 is OK: OK: no RAID installed [15:07:45] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:45] PROBLEM - RAID on srv279 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
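The numbered frames pasted above are a gdb backtrace taken from one of the wedged PHP workers. A minimal sketch of how such a trace can be grabbed when the box is still reachable, assuming gdb and strace are installed; the PID mirrors the runaway runJobs process that turns up later in this log:

    # Find the fattest php worker, then take a one-shot backtrace without staying attached
    ps -eo pid,pcpu,rss,args --sort=-rss | grep -m1 '[p]hp'
    gdb --batch --pid 24122 -ex 'bt'
    # strace shows which syscall it is blocked in, per the "stracing would work too" remark below
    strace -p 24122 -f -tt -e trace=network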
[15:07:46] RECOVERY - RAID on srv286 is OK: OK: no RAID installed [15:07:46] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:46] PROBLEM - RAID on srv279 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:07:50] wikipedia is very slow atm. [15:07:54] yes, we know [15:07:55] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:55] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:56] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:56] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:05] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:06] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:07] domas: what's /tmp/buildd/php5-5.3.2/ext/standard/file.c ? [15:08:09] people will think there's a second blakout, haha [15:08:26] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:26] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:29] DarkoNeko: a blackout against blockouts! [15:08:39] a blackout against web scale technologies [15:08:45] PROBLEM - Apache HTTP on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:45] PROBLEM - Apache HTTP on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:54] mmm [15:08:56] PROBLEM - SSH on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:57] PROBLEM - SSH on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:02] something is doing ssl fgets [15:09:07] i guess stracing would work too :) [15:09:19] maybe people will think congress decided to censor wikipedia in retaliation [15:09:25] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [15:09:26] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [15:09:33] oh wait [15:09:39] or stupid me [15:09:55] PROBLEM - DPKG on srv279 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:56] PROBLEM - DPKG on srv279 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:10:07] can be just memcached issue [15:10:15] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:16] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:35] FFFUUUU [15:10:35] php mctest.php enwiki [15:10:35] No MWMultiVersion instance initialized! MWScript.php wrapper not used? [15:10:51] how can people do things like this? [15:10:54] mwscript mctest.php enwiki [15:11:11] heh, interesting [15:11:17] it starts timing out on very first one? [15:12:01] I guess it blew up in memory [15:12:21] there're like 10 bad memcached servers at the moment [15:12:23] or more [15:12:25] PROBLEM - Disk space on srv279 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:12:26] PROBLEM - Disk space on srv279 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
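Once suspicion shifts from SSL to memcached, the quickest confirmation is to poke every memcached instance directly rather than going through mctest.php. A rough sketch, assuming a plain host:port list in a file; port 11000 matches the Memcached service checks elsewhere in this log, while the file path is made up:

    # Flag memcached instances that do not answer a stats request within 2 seconds
    while read hostport; do
        host=${hostport%:*}; port=${hostport#*:}
        printf 'stats\nquit\n' | nc -w 2 "$host" "$port" | grep -q 'STAT uptime' \
            || echo "DEAD: $hostport"
    done < /tmp/memcached-servers.txt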
[15:13:13] beautiful [15:13:13] http://ganglia.wikimedia.org/2.2.0/?c=Application%20servers%20pmtpa&h=srv268.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [15:13:35] RECOVERY - DPKG on srv261 is OK: All packages OK [15:13:36] RECOVERY - DPKG on srv261 is OK: All packages OK [15:13:47] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [15:13:47] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.040 second response time [15:13:48] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [15:13:48] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.040 second response time [15:14:17] what did you do? [15:14:55] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.698 second response time [15:14:56] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.698 second response time [15:15:15] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time [15:15:16] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time [15:15:17] it looks like it bounced toward swapdeath, recovered briefly, and did it again [15:15:18] hmm? [15:15:25] RECOVERY - Disk space on srv261 is OK: DISK OK [15:15:26] RECOVERY - Disk space on srv261 is OK: DISK OK [15:15:45] PROBLEM - MySQL slave status on es2 is CRITICAL: CRITICAL: Connected threads = 1016 (1000) [15:15:46] PROBLEM - MySQL slave status on es2 is CRITICAL: CRITICAL: Connected threads = 1016 (1000) [15:16:25] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [15:16:26] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [15:16:35] PROBLEM - SSH on srv268 is CRITICAL: Server answer: [15:16:36] PROBLEM - SSH on srv268 is CRITICAL: Server answer: [15:17:25] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.037 second response time [15:17:26] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.037 second response time [15:17:45] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [15:17:46] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [15:17:55] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [15:17:56] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [15:18:05] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.401 second response time [15:18:06] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.401 second response time [15:18:25] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [15:18:26] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [15:18:45] PROBLEM - Disk space on srv268 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:18:45] PROBLEM - DPKG on srv268 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:18:46] PROBLEM - Disk space on srv268 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
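Picking up the earlier suggestion of "a cron to log processes so we can see which bloats": a minimal sketch of a snapshot job that would have shown which process was marching toward swapdeath. The path, interval, and filename are all assumptions:

    # /etc/cron.d/proc-top (hypothetical): record the top memory consumers every 5 minutes
    */5 * * * *  root  (date; ps -eo pid,user,rss,vsz,etime,args --sort=-rss | head -15) >> /var/log/proc-top.log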
[15:18:46] PROBLEM - DPKG on srv268 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:19:32] lol [15:19:32] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND [15:19:32] 24122 apache 39 19 1391m 1.2g 3616 R 38 15.4 63:04.28 php [15:20:15] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [15:20:16] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [15:21:45] PROBLEM - RAID on srv268 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:21:46] PROBLEM - RAID on srv268 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:22:03] I suggest stopping all job queue runners [15:22:16] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:16] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:43] or whole cluster will go down [15:22:44] :) [15:23:21] dominoes effect [15:23:56] not dominoes [15:24:01] i was going to say i just got a 502, but it seems i'm not alone [15:24:35] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [15:24:36] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [15:24:55] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [15:24:56] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [15:25:37] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [15:25:38] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [15:25:55] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [15:25:56] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [15:26:06] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.972 second response time [15:26:06] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.972 second response time [15:26:25] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [15:26:26] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [15:26:37] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.616 second response time [15:26:37] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.616 second response time [15:26:45] RECOVERY - Apache HTTP on srv271 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [15:26:46] RECOVERY - Apache HTTP on srv271 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [15:26:55] PROBLEM - MySQL slave status on es4 is CRITICAL: CRITICAL: Connected threads = 1038 (1000) [15:26:56] PROBLEM - MySQL slave status on es4 is CRITICAL: CRITICAL: Connected threads = 1038 (1000) [15:27:03] Reedy now seems fast. [15:27:18] crap [15:27:26] ? 
[15:27:36] kill all job queues please [15:27:55] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [15:27:56] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [15:29:59] what fricking user do I have to run things as now? grrrrr [15:30:08] * domas whistle [15:30:08] dsh -g job-runners pkill -9 -f obs [15:30:31] well if /etc/init.d stop won't do it [15:30:36] I don't know why pkill would [15:30:41] start-stop-daemon: warning: failed to kill 27943: Operation not permitted [15:30:42] etc [15:30:47] what do you mean? [15:30:53] upido it as root? [15:30:54] err [15:30:57] you're doing that as root? [15:31:04] of course I'm doing that as root [15:31:07] how else would I do that? [15:31:08] tried that and it prompted me for the root pwd [15:31:16] domas: i meant apergos [15:31:19] did not prompt for me [15:31:25] * domas is super-master [15:31:29] guess you had better do it then [15:31:45] * domas sighs, bunch of machines in swapdeath [15:31:52] you can reboot them! [15:32:03] four or so! [15:32:26] PROBLEM - Memcached on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:27] PROBLEM - Memcached on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:33] when this is over can we document this here: https://wikitech.wikimedia.org/view/Job_queue b/c I've been sitting here on my thumbs for lack of clue how to kill job queues [15:32:36] see the memory drop? thats me! [15:32:47] jeff_green: pkill -f obs [15:32:48] \o/ [15:32:58] obs? [15:33:03] yeah [15:33:04] it's not better kill -9 ? [15:33:18] obs matches jobs-loop and RunJobs [15:33:20] and few others! [15:33:21] domas: yes, but I don't even know where to do it from the documentation I've been able to find [15:33:21] I see [15:33:31] on the job-runner group [15:34:00] /home/config/others/usr/local/dsh/node_groups [15:34:04] they are in here now [15:34:25] a pretty obscure path but linked to from /etc/dsh/group if you remember that's the original location [15:34:31] lesson for today - overprovisioning hardware doesn't mean that you don't have to do operations :))) [15:35:16] lesson for today: if I need to do something in a hurry on a bunch of machines as root, it will prompt me for root on all of them, so I had better find someone else >_< [15:35:18] (and openssl calls are used to access memcached ;-) [15:35:32] apergos: I have no idea what root password is, by the way [15:35:37] probably thats because I never enter it [15:35:44] I didn't enter it, I backed out [15:35:49] ok [15:36:03] I generally don't have an issue with key forwarding [15:36:09] only when it's an emergency >_< [15:36:55] RECOVERY - DPKG on srv263 is OK: All packages OK [15:36:56] RECOVERY - DPKG on srv263 is OK: All packages OK [15:37:05] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.601 second response time [15:37:06] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.601 second response time [15:37:33] * domas restarted srv268 [15:37:49] I'll look at 275 then [15:37:52] so 259, 253, 268, 275, 279 [15:38:37] I guess that you know the root mgmt password though (domas) [15:38:47] for the record: [15:38:51] apache 24119 0.0 0.3 186036 25472 ? SN 14:01 0:00 php MWScript.php runJobs.php --wiki=ocwiki --procs=5 --maxtime=300 [15:38:51] apache 24122 80.4 15.4 1425252 1260932 ? 
RN 14:01 63:10 php MWScript.php runJobs.php --wiki=ocwiki --procs=5 --maxtime=300 [15:38:51] apache 30245 0.0 0.0 10740 572 ? SN Jan03 4:12 /bin/bash /usr/local/apache/common/php/maintenance/jobs-loop.sh [15:39:27] did we end up needing the -9 in this case, or did a generic pkill do it? [15:39:27] I'll add an ulimit there [15:39:35] jeff_green: doesn't matter much, does it? :) [15:39:36] RECOVERY - SSH on srv263 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:39:36] RECOVERY - SSH on srv263 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:40:39] domas: ? [15:40:51] is your point that it's now ancient history? [15:41:00] !log midom synchronized php/maintenance/jobs-loop.sh 'adding 400M ulimit' [15:41:15] RECOVERY - Disk space on srv263 is OK: DISK OK [15:41:16] RECOVERY - Disk space on srv263 is OK: DISK OK [15:41:19] I mean, you can kill a script as hard as you want [15:41:28] it is not like we trap signals from within our PHP scripts [15:41:38] it will not be graceful whatever you do, so kill -9 is the most direct way [15:41:55] ok good. i will hereby document that. [15:42:14] so, what was the problem, someone doesn't know management password? :( [15:42:17] domas: and how does facebook do it? you trap signals? [15:42:34] we don't write leaking code! [15:42:40] so no need to kill it [15:42:45] we test and monitor [15:42:47] too [15:43:41] no, that wasn't my quesstion [15:43:49] you said you didn't know the (cluster) root password [15:44:05] Reedy, topic status is right ? [15:44:11] but I expect you know the management one (since you were able to powercycle the one host :-P) [15:44:40] 275, 279, 259 done [15:44:44] tegra, usually it's easier to leave it until it's all been dealt with [15:45:37] back in the day I used to install SSH keys into host mgmt interfaces too [15:45:39] very handy [15:45:43] no need to enter passwords [15:45:51] that would be nice [15:45:56] RECOVERY - MySQL slave status on es2 is OK: OK: [15:45:56] RECOVERY - MySQL slave status on es2 is OK: OK: [15:45:59] all Suns used to be set up that way [15:46:05] RECOVERY - Disk space on srv275 is OK: DISK OK [15:46:05] RECOVERY - DPKG on srv259 is OK: All packages OK [15:46:06] RECOVERY - Disk space on srv275 is OK: DISK OK [15:46:06] RECOVERY - DPKG on srv259 is OK: All packages OK [15:46:12] 268 still looks unhappy, you said you got that one? because 263 looks better now [15:46:15] RECOVERY - Disk space on srv259 is OK: DISK OK [15:46:15] RECOVERY - RAID on srv263 is OK: OK: no RAID installed [15:46:16] RECOVERY - Disk space on srv259 is OK: DISK OK [15:46:16] RECOVERY - RAID on srv263 is OK: OK: no RAID installed [15:46:35] apergos: if you want an alternative to key forwarding see [[access]] @ labsconsole [15:46:51] all I want is something that works in an emergency [15:46:56] btw, need to restart job runners on those machines that just came up! [15:47:00] and resync the file [15:47:05] because they came up with old copy [15:47:07] RECOVERY - MySQL slave status on es4 is OK: OK: [15:47:08] RECOVERY - MySQL slave status on es4 is OK: OK: [15:47:09] which doesn't have memory constraints [15:47:14] and will blow up as soon as it gets to ocwiki [15:47:32] might be worth scapping [15:47:36] something wrong with that wiki? [15:47:45] domas, did you see my q about 263 vs 268? 
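The "adding 400M ulimit" sync above puts a hard cap on each job runner from the wrapper script, since PHP's own memory_limit evidently did not contain the runaway ocwiki job. A minimal sketch of what that kind of guard looks like in a jobs-loop style wrapper; this is an illustration, not the actual contents of php/maintenance/jobs-loop.sh:

    #!/bin/bash
    wiki="${1:?usage: jobs-loop.sh <wiki>}"
    # Cap the address space of everything started from this shell (value is in KB, ~400 MB).
    # A runJobs.php that tries to grow past it now dies with an allocation failure
    # instead of dragging the whole apache into swap.
    ulimit -v 400000
    while true; do
        php MWScript.php runJobs.php --wiki="$wiki" --procs=5 --maxtime=300
        sleep 5
    done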
[15:47:46] something wrong with mediawiki code [15:47:55] RECOVERY - RAID on srv279 is OK: OK: no RAID installed [15:47:56] RECOVERY - RAID on srv279 is OK: OK: no RAID installed [15:47:59] https://wikitech.wikimedia.org/view/Job_queue <-- comments welcome [15:48:05] RECOVERY - SSH on srv259 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:48:06] RECOVERY - SSH on srv259 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:48:08] domas: but seen only with that wiki? [15:48:16] that exact job, probably [15:48:28] oh [15:48:46] doing 268 [15:48:56] done [15:49:13] wikimedia hire people from home ? [15:49:15] RECOVERY - SSH on srv279 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:49:16] RECOVERY - SSH on srv279 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:49:36] from home? [15:49:49] remote work :P [15:49:52] Yes [15:49:57] sure ? [15:50:01] Yup [15:50:04] yeah I'm full time telecommuter for example [15:50:09] There's 3 of us here now [15:50:21] up and running again? [15:50:25] should be [15:50:38] ok then I'll update the #wikipedia status [15:50:48] !log reedy synchronizing Wikimedia installation... : Updates post outage [15:50:50] !log midom synchronized php/maintenance/jobs-loop.sh 'adding 400M ulimit' [15:50:56] RECOVERY - DPKG on srv279 is OK: All packages OK [15:50:56] RECOVERY - DPKG on srv279 is OK: All packages OK [15:50:58] eh [15:51:00] good reedy [15:51:05] RECOVERY - RAID on srv275 is OK: OK: no RAID installed [15:51:06] RECOVERY - RAID on srv275 is OK: OK: no RAID installed [15:51:25] RECOVERY - SSH on srv275 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:51:26] RECOVERY - SSH on srv275 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:51:45] RECOVERY - RAID on srv259 is OK: OK: no RAID installed [15:51:46] RECOVERY - RAID on srv259 is OK: OK: no RAID installed [15:51:50] still quite a few apaches more than usual timing out [15:52:05] RECOVERY - RAID on srv268 is OK: OK: no RAID installed [15:52:06] RECOVERY - RAID on srv268 is OK: OK: no RAID installed [15:52:18] one more probably stupid question--did the dsh approach work in this case or did you end up going to individual hosts? [15:52:25] RECOVERY - Memcached on srv279 is OK: TCP OK - 0.002 second response time on port 11000 [15:52:26] RECOVERY - Memcached on srv279 is OK: TCP OK - 0.002 second response time on port 11000 [15:52:34] jeff_green: Restarting the hosts had to be done manually [15:52:37] Neeed to look at tidying up the crap that scap spews with stupid errors [15:52:40] those which were completely screwed up [15:52:42] I had to go to mgmt [15:52:46] RECOVERY - DPKG on srv275 is OK: All packages OK [15:52:47] RECOVERY - DPKG on srv275 is OK: All packages OK [15:52:51] because of course one couldn't ssh in to them [15:52:54] jeff_green: now, killing job queue on all alive ones was easy [15:53:03] I trid that when I first saw there was a problem but they were already unresponsive [15:53:05] RECOVERY - Disk space on srv279 is OK: DISK OK [15:53:06] sync done. [15:53:06] RECOVERY - Disk space on srv279 is OK: DISK OK [15:53:12] so you power cycled and then logged in to slay the runner? [15:53:21] which sync was that? [15:53:31] no, I could rerun the dsh [15:53:36] ah ok [15:53:54] ok, let's start modified jobsloop [15:53:59] Reedy: or anyone, did a second sync of the jobloop file go around? [15:54:43] i'm presuming domas' sync-file did.. [15:55:02] domas: did you sync a second tie? 
if not I will [15:55:04] *tie [15:55:06] *time [15:55:25] * jeremyb puts some glucose in apergos's m [15:55:31] thanks [15:55:37] any time [15:55:49] I did [15:55:53] ok [15:56:05] !log reedy synchronized php-1.18/extensions/GoogleNewsSitemap/FeedSMItem.php 'Debugging for fatal' [15:56:06] * apergos control-c's their command line [15:57:50] there should be increase in job fatals [15:57:54] as now they're limited in memory [15:57:55] heeehe [15:58:03] yayyyyyy! [15:58:09] !log reedy synchronized php-1.18/extensions/GoogleNewsSitemap/FeedSMItem.php 'Debugging for fatal' [15:58:11] I wonder why they were able to grow, memory limit is set to 100M [15:58:24] um you restarted them everywhere? [15:58:45] RECOVERY - DPKG on srv268 is OK: All packages OK [15:58:45] RECOVERY - Disk space on srv268 is OK: DISK OK [15:58:46] RECOVERY - DPKG on srv268 is OK: All packages OK [15:58:46] RECOVERY - Disk space on srv268 is OK: DISK OK [15:58:55] RECOVERY - Apache HTTP on srv279 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [15:58:56] RECOVERY - Apache HTTP on srv279 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [15:59:01] Jeff_Green are you a freelancer ? what's your job ? [15:59:15] RECOVERY - SSH on srv268 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:59:16] RECOVERY - SSH on srv268 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:59:36] I'm a full time employee, ops engineer "special projects" [15:59:39] * domas eyes [15:59:40] public function memoryLimit() { [15:59:40] // Don't eat all memory on the machine if we get a bad job. [15:59:40] return "150M"; [15:59:40] } [15:59:50] sort of a noob though, I started in July [15:59:50] does it work? :) [16:00:01] jeff green needs a pic [16:00:03] tegra: I'm so horrible they never employed me in any capacity [16:00:04] http://wikimediafoundation.org/wiki/Job_openings/Operations_Engineer_-_Special_Projects [16:00:09] domas: we could use ulimit :-P [16:00:16] jeff_green: thats what I did [16:00:18] ulimit -v 400000 [16:00:26] i like it [16:00:29] oh nice :) it's my dream to work for wikimedia from home :P [16:00:40] :-) [16:00:48] meh, working for wikimedia from home is easy [16:00:48] it's a little isolating, but so far so good [16:00:51] now try working from home for... [16:01:03] domas: It's not horribleness, ... WMF just couldn't afford the stabbing-incident insurance required. [16:01:06] !log reedy synchronized php-1.18/extensions/GoogleNewsSitemap/FeedSMItem.php 'Debugging for fatal' [16:01:10] Jeff_Green: get to BOS or NY much? [16:01:15] gmaxwell: I don't stab people [16:01:19] or you mean everyone would be stabbing me? [16:01:23] gmaxwell: hah! [16:01:25] all I do is make them cry [16:01:40] jeremyb: we moved out here in August, haven't been to Boston yet but we were in NYC this weekend! [16:01:46] Stabbing, finger breaking .. kill -9ing .. same insurance plan. [16:02:01] let's alias "stab" to kill -9 [16:02:12] let's not [16:02:17] I want to actually stab dataset1 [16:02:24] and by that I do not mean kill -9 it [16:02:25] Jeff_Green: ohhh! well maybe see you in NYC sometime. unless last weekend was an anomaly! [16:02:26] ok, I would break fingers [16:02:36] !log reedy synchronized php-1.18/extensions/GoogleNewsSitemap/FeedSMItem.php 'Debugging for fatal' [16:02:37] gmaxwell: my better half is already moving to IAD tomorrow [16:02:49] gmaxwell: my flight leaves an hour earlier, but I'm west coast bound this time [16:02:59] apergos: kerosene? 
[16:03:01] FIRST CLASS TRAVEL BABY [16:03:18] hmm. no, not unless that's the last thing [16:03:25] it's too fast, see [16:03:30] At the moment I'm in Kansas city of all places. [16:03:32] jeremyb: I'd imagine we'll make it there a couple times a year, we'll have to plan an east-coast gathering or something [16:03:33] Operations Engineer - Special Projects, only this in remote ? [16:03:48] no [16:04:10] tegra: ~ half of all ppl are remote? (wild guess, don't quote) [16:04:20] nah, less than that [16:04:29] most of hte ops team used to be [16:04:36] but slowly people have been sucked in to moving to sf [16:04:49] Ashame. Made round the clock coverage a bit simpler. [16:05:00] European people still resists though :b [16:05:12] yes we do [16:05:18] I guess it would need a WMF headquarter in Europe to make us move in an office [16:05:23] and I went the other way, left SF for the east coast [16:05:29] i'm from switzerland i think it's very difficult to work for wikimedia. [16:05:30] Jeff_Green: maybe too soon to come back but there's a conference we're doing at NYU in 9 days (wikipedia birthday party) [16:05:58] drdee: ^^^^ are you in toronto? wikipedia bday party + conference in NY on 28th if you're interested [16:06:46] gmaxwell: people being employed made round the clock coverage way more complicated [16:06:52] gmaxwell: we all were working crazy hours as volunteers [16:06:57] jeremyb: might be possible, I'm not sure [16:07:01] once you're employed, you start honoring 5x8 [16:07:13] who does that?? [16:08:03] who does honor 5x8 ? [16:09:01] me probably on average, although I did more during the fundraiser for sure [16:09:29] well, 5*8 is a bit an exaggeration [16:09:37] but again, in early days we were doing crazy hours and were not being paid [16:09:55] I think I did 60+/week on wikipedia at times [16:10:01] yeah, that's not sustainable long term though [16:10:03] maybe even 80 [16:10:11] but how sustainable is that? can you do that for 5 years without burning out? for 10? [16:10:20] cause we've all been through that [16:10:25] you can do that for years and years [16:10:28] imo tech people are most productive if allowed to crank when they're motivated and rest when they're not [16:10:35] hehe, true [16:10:37] my dad been working 60+ hours per week for the last 34 years [16:10:42] ok well you can but it turns out that I can't (and a lot of people I know can't) [16:11:37] i can sit at a desk for 12 hours a day but that doesn't mean my brain will produce useful output that whole time [16:11:41] if you count regular work + wikipedia volunteering + freelance consulting, I have been working roughly 50hours a week for the last 3 years at least [16:11:48] apergos: Meh. What else are you going to do with your time? Play golf? [16:11:52] no it doesn't count [16:12:01] wiki* volunteering is separate [16:12:07] the context shift keeps you motivated [16:12:14] oh, if you count regular work + wikipedia volunteering + gaming, I've been working 160! [16:12:19] hahah [16:12:21] no gaming [16:12:27] that *definitely* doesn't count [16:12:29] :))) [16:12:31] but what if he gets paid for gaming??? [16:12:34] Jeff_Green: sure, and you can have context shifts within a job too. [16:12:35] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [16:12:36] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [16:12:48] what if there's a manager scolding you for failing to meet your gaming hours quota? 
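The memory capping being discussed works at two levels: a job class can override memoryLimit() (the pasted snippet returns "150M" as the per-job PHP limit), but since some jobs still outgrew that, the jobs-loop.sh change adds a second, OS-level cap with ulimit before any PHP starts. A minimal sketch of that shape of wrapper; the 400000 KB value matches the "!log ... adding 400M ulimit" earlier, while the loop body is only a placeholder for the real script:

    #!/bin/bash
    # ulimit -v takes kilobytes, so 400000 is roughly 400M; it caps the virtual
    # memory of this shell and of every PHP process it forks, so a runaway job
    # dies with a fatal instead of exhausting the machine
    ulimit -v 400000
    while true; do
        # placeholder invocation; the real loop picks wikis and job types itself
        php maintenance/runJobs.php --maxjobs 300
        sleep 5
    done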
[16:12:51] I found that I can be extremely annoying, sit in a small room, put a portable radar and meet everyone from random position with SMG gunfire! [16:12:55] Yea. "Domas clicked a cow. 324273827 times." "Tests passed!" [16:13:00] hehe [16:13:07] gmaxwell: didn't they shut down the cow clicker? [16:13:11] there was a great wired article about it [16:13:16] domas: yes. Indeed. [16:13:26] I missed out (not at all) [16:13:35] imagine all those mems I will never miss... [16:13:36] oh, cow clicker was awesome [16:14:27] apergos: http://www.wired.com/magazine/2011/12/ff_cowclicker/all/1 [16:14:28] I just remember worrying that it might not be polite to ask Cary if he knew it was supposted to be ironic. [16:14:45] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [16:14:46] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [16:14:51] gmaxwell: lol [16:14:59] it is amazing that even that formed a community [16:15:02] do I read it? or do I remain in blissfull ignorance? [16:15:12] * apergos compromises with a quick scan for the funny bits [16:15:14] well, I just suggested reading it [16:15:22] it is a great insight into all [a]social gaming [16:16:03] I am still wondering how many years of life are wasted playing Angry Birds [16:16:13] otoh, apergos doesn't seem to be social [16:16:21] I share a great article, and all I get is skepticism [16:16:24] :) [16:16:25] antisocial, that's me [16:16:31] such people end up in wikipedia [16:16:39] rather than social platforms like facebook!!11 [16:16:40] :) [16:16:44] apergos: no way. You are on IRC!!! [16:16:46] * domas eyes at gmaxwell [16:16:47] which is odd cause I don't actually edit wikipedia [16:16:48] ever [16:16:54] hashar: you're allowed to be antisocial on IRC [16:16:56] that is the first social network thing being widely used!! [16:17:28] hehe [16:17:31] BBS! [16:17:38] fidonet! [16:17:39] cowthulhu bwahahahaha [16:17:39] Fido was social network too [16:17:44] ok that made the article worth reading [16:17:45] hehe [16:17:51] BBS were awesomes [16:17:52] Jeff_Green: 2:471/23.213 was me! [16:17:59] awesome [16:18:40] I used fido too. ... and I hope no one ever finds any of the idiotic things I must have been posting when I was 15. [16:18:53] okay,... hands up who was online pre-DNS? :P [16:19:13] used Minitel :b [16:19:28] I remember Minitel - was pretty neat [16:19:35] http://en.wikipedia.org/wiki/Minitel [16:19:36] I started in the days of 1200 baud [16:19:52] here comes woosters [16:20:00] run, everyone [16:20:02] hi Domas! [16:20:08] hmm I was online when you had to dial into the tacacs to get to the stanford node to get to a given arpanet node, does that count? [16:20:21] woosters: are you going to show up on sunday? at the hackathon? [16:20:22] 1200 baud? [16:20:23] sheesh [16:20:25] that'd be about what was current when I got net access Jeff_Green [16:20:31] 110 baud and it was screamingly fast [16:20:34] will be there for a short while [16:20:37] New patchset: Mark Bergsma; "Revert "Google and possibly others are rate limiting our new ip, so use the old server(s) for delayed messages (for now)"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1977 [16:20:39] brianmc, found it [16:20:44] it is Chinese New Yea eve [16:20:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1977 [16:20:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1977 [16:20:52] dial into an edinburgh uni pad, then use bangpaths to send emails [16:20:54] Got me beat.. I would have firsted used a BBS in 1985/1986 and didn't get internet access until 1992 or so (via tymnet). [16:20:54] woosters: I may go there as early as possible, jetlagged kind of [16:20:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1977 [16:20:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1977 [16:21:02] woosters: if you're driving there for the start, pick me up! [16:21:05] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [16:21:06] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [16:21:22] otherwise I'll take the caltrain [16:21:24] toot-toot [16:21:42] will do if I am [16:21:47] ok:) [16:21:51] u will be here on Sat? [16:21:55] or Fri? [16:22:12] I'm landing Fri-noon [16:22:46] I was going through old logs and found copies of the EFF gopher site from like 1993/4 that were pretty amusing. [16:22:55] !log reedy synchronized php-1.18/extensions/GoogleNewsSitemap/FeedSMItem.php 'r109543' [16:22:55] ohmygod [16:22:55] http://technolog.msnbc.msn.com/_news/2012/01/19/10185978-wikipedia-traffic-surged-during-sopa-blackout [16:22:58] brianmc, ^ [16:22:59] * domas got mentioned by MSNBC [16:23:19] Reedy, whilst you're looking at the newsfeed code - how frequently will it be updated? [16:23:21] What did you do? [16:23:34] well, not sure if TV mentioned that [16:25:58] brianmc, it has a 30 minute squid cache [16:27:22] ewww. That's not good if it delays a published page by 29 minutes [16:27:46] Well, that's a completely different issue :p [16:28:49] brianmc: use facebook!!! it is real time!!!1 [16:29:13] FB is only realtime because I post articles to the wikinews page when I publish them! [16:29:36] =) [16:34:52] That's the feed functioning again Reedy, thanks for the quick turnaround (despite other things going a bit wonky) [16:42:21] New patchset: Mark Bergsma; "Update MTA hostnames" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1978 [16:42:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1978 [16:43:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1978 [16:43:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1978 [16:45:35] PROBLEM - Puppet freshness on db22 is CRITICAL: Puppet has not run in the last 10 hours [16:45:35] PROBLEM - Puppet freshness on db22 is CRITICAL: Puppet has not run in the last 10 hours [16:58:06] I have a question about the default editing toolbar on WP. [17:00:05] Does the default editing toolbar include Cite-Template? [17:26:25] <[Haekchen]> Hello! Is there an IRC log for this channel somewhere? [17:26:51] no, it's not publiccally logged [18:07:21] Does the default editing toolbar include Cite-Template? I'm asking because I think new editors would benefit from having access to that instead of constructing refs from scratch. [18:19:35] Does anyone know the answer to my question? [18:32:17] Thanks anyway, someone in -en answered. [19:07:12] New patchset: Bhartshorne; "taking out the SOPA filter now that the blackout is over." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/1979 [19:07:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1979 [19:07:54] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1979 [19:07:55] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1979 [19:16:36] PROBLEM - Puppet freshness on mw1096 is CRITICAL: Puppet has not run in the last 10 hours [19:16:36] PROBLEM - Puppet freshness on mw1096 is CRITICAL: Puppet has not run in the last 10 hours [19:35:32] New patchset: Bhartshorne; "adding rules for the new ms-fe hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1980 [19:35:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1980 [19:36:42] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1980 [19:36:42] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1980 [19:46:39] !log testing [19:46:45] !log come on bot! [19:48:20] LeslieCarr, doesn't it need morebots? [19:48:34] It always gets confusing [19:49:16] gah [19:49:19] i'm not sure [19:49:19] :) [19:49:23] know where morebots lives ? [19:49:46] I think that lives on the server that runs wikitech [19:49:56] Which apparently was being migrated earlier today [19:51:21] ahha on the linode server :) [19:51:22] http://wikitech.wikimedia.org/view/Morebots [19:51:24] yus [19:53:16] not that much earlier [19:53:24] I got email at um [19:54:05] oh. now it's somewhat earlier. anyways about 2 hours ago saying "woops sorry, *now* we're almost ready to migrate the data") [19:54:10] lol [19:54:12] useful [19:57:25] so, do we need to get andrew ? [19:57:27] and which andrew ? [19:57:31] Werdna [19:57:35] And no, you shouldn't need him [19:58:20] i think it's just a case of restarting it when the migration has happened [19:59:58] let's see if it joins :) [20:00:04] ~log testing [20:00:07] !log testing [20:00:09] Logged the message, Mistress of the network gear. [20:00:13] yay! [20:05:16] Anyone here who is somehow involved with developing the WikiMiniAtlas? 
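On the feed delay raised earlier: a 30 minute Squid max-age means a freshly published item can be served from a stale cached copy for up to that long unless the object is purged when the page changes. In a classic Squid setup the purge is an HTTP PURGE request from an allowed client; whether the production caches accept that, and for which URLs, is configuration the discussion does not show, so this is a generic sketch with a placeholder URL rather than the Wikimedia procedure:

    # inspect how long the cached copy is allowed to live
    curl -sI 'http://example.org/wiki/Special:SomeFeed' | grep -i -E '^(cache-control|age):'
    # ask the cache to drop the object so the next request reaches the backend
    curl -X PURGE 'http://example.org/wiki/Special:SomeFeed'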
[20:26:35] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [20:26:36] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [20:46:35] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [20:46:35] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [20:48:14] zzz =_= [20:50:35] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [20:50:35] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [21:26:47] New patchset: Ryan Lane; "Making ipv6 enabled or disabled per domain" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1981 [21:27:10] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1981 [21:27:11] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1981 [22:05:25] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 168 MB (2% inode=60%): /var/lib/ureadahead/debugfs 168 MB (2% inode=60%): [22:05:25] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 168 MB (2% inode=60%): /var/lib/ureadahead/debugfs 168 MB (2% inode=60%): [22:33:58] TimStarling, are you busy? Got a hopefully quick-ish question [22:34:12] fire away [22:34:27] So, bug 33808 suggests the interwikimap is empty on all wikisources [22:35:00] quick bit of digging suggests that the api before 1.19 read from the interwiki table, not the interwiki cache (this is now fixed as of r92528) [22:35:32] Is it worth populating the interwiki tables on the wikisource projects, or just leave it till it's fixed properly with the 1.19 release? [22:35:35] RECOVERY - Disk space on srv223 is OK: DISK OK [22:35:35] RECOVERY - Disk space on srv223 is OK: DISK OK [22:36:22] I don't know if the script we use to generate interwiki links even updates the table anymore [22:36:26] probably it doesn't [22:37:52] yeah, see maintenance/dumpInterwiki.php in 1.18wmf1 [22:38:55] basically when Domas introduced the interwiki cache, he just copied rebuildInterwiki.php to dumpInterwiki.php and changed the SQL bits to write to a CDB file instead [22:39:06] and since then, only dumpInterwiki.php was updated [22:39:23] ah [22:39:30] yay code duplication [22:39:40] rebuildInterwiki.php has got wikisource in it, it's not quite that old [22:39:50] but I doubt anyone has run it in the last 3 years [22:41:02] domas is kind of results-focused, he probably figured someone else would clean it up later [22:41:37] Yeaah [22:41:56] is it worth dragging the script back upto date? 
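To put the table/cache split in one place: dumpInterwiki.php regenerates the CDB file that the 1.19 API reads, while the older rebuildInterwiki.php copy, as the discussion goes on to note, emits SQL text that would repopulate the interwiki tables the pre-1.19 API still consults. A rough sketch of the shape of that operation only; neither script's options nor where its output is meant to go are shown here, and the database name below is a placeholder:

    # rebuild the CDB interwiki cache used by the 1.19 API
    php maintenance/dumpInterwiki.php
    # the older script emits SQL text instead of running queries itself,
    # so repopulating one wiki's interwiki table would look roughly like:
    php maintenance/rebuildInterwiki.php > interwiki.sql
    mysql some_wikisource_db < interwiki.sql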
[22:42:08] I can see it having it's uses for 3 party users [22:42:17] *3rd [22:43:04] maybe merging them would be useful [22:43:47] rebuildInterwiki.php actually output SQL text instead of doing the SQL queries directly [22:43:59] I'm sure I had a good reason for that at the time, but it's a bit out of date now [22:44:18] Yeah, doing a truncate and reinsert isn't going to be that bad for the size of the table [22:47:03] Somewhat annoyingly, it's still WMF specific, referencing files in /h/w/c [22:47:46] Probably worth spending a bit of time to tidy up/merge these scripts, and then update all the interwiki tables [22:48:25] I really hate the interwiki map page on meta [22:48:39] Hah [22:48:49] I just found one pointing to wg.en.wikipedia.org [22:48:59] whenever anyone adds a prefix there, it breaks pages [22:49:06] But there is a lot of random wikis I can't believe we care about [22:49:32] so maybe if you want to run dumpInterwiki for real again, you could use a text file based on a snapshot from meta, rather than the actual meta page [22:49:37] a snapshot from 2008 or so [22:50:38] enwiki has like 660 rows in the interwiki table [22:50:48] New patchset: Asher; "make sure npre.d/* is included" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1982 [22:50:56] Could almost dump that and reimport it.. Obviously is going to be somehwat out of date [22:51:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1982 [22:51:25] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1982 [22:51:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1982 [22:52:37] !log recompiled wikidiff2 and put the new version up on apt.wikimedia.org [22:52:38] Logged the message, Master [22:57:04] New patchset: Ryan Lane; "Fix ipv6 check logic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1983 [22:57:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1983 [22:57:28] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1983 [22:57:29] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1983 [23:15:39] !log rebuilt wikidiff2 with package name php-wikidiff2, removed lucid package php5-wikidiff2 from apt using "reprepro remove" [23:15:40] Logged the message, Master [23:20:15] PROBLEM - DPKG on srv261 is CRITICAL: Connection refused by host [23:20:16] PROBLEM - DPKG on srv261 is CRITICAL: Connection refused by host [23:24:46] PROBLEM - RAID on srv261 is CRITICAL: Connection refused by host [23:24:46] PROBLEM - RAID on srv261 is CRITICAL: Connection refused by host [23:24:56] PROBLEM - DPKG on srv188 is CRITICAL: Connection refused by host [23:24:56] PROBLEM - DPKG on srv188 is CRITICAL: Connection refused by host [23:25:12] New patchset: Asher; "root .my.cnf on all cluster dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1984 [23:25:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1984 [23:26:29] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1984 [23:26:30] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1984 [23:27:35] PROBLEM - RAID on srv267 is CRITICAL: Connection refused by host [23:27:35] PROBLEM - Disk space on srv261 is CRITICAL: Connection refused by host [23:27:36] PROBLEM - RAID on srv267 is CRITICAL: Connection refused by host [23:27:36] PROBLEM - Disk space on srv261 is CRITICAL: Connection refused by host [23:28:16] PROBLEM - Disk space on srv267 is CRITICAL: Connection refused by host [23:28:16] PROBLEM - Disk space on srv267 is CRITICAL: Connection refused by host [23:28:16] PROBLEM - RAID on srv188 is CRITICAL: Connection refused by host [23:28:16] PROBLEM - RAID on srv188 is CRITICAL: Connection refused by host [23:28:25] PROBLEM - Disk space on srv188 is CRITICAL: Connection refused by host [23:28:26] PROBLEM - Disk space on srv188 is CRITICAL: Connection refused by host [23:30:15] RECOVERY - DPKG on srv261 is OK: All packages OK [23:30:16] RECOVERY - DPKG on srv261 is OK: All packages OK [23:33:46] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1471s [23:33:46] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1471s [23:34:05] PROBLEM - DPKG on srv267 is CRITICAL: Connection refused by host [23:34:06] PROBLEM - DPKG on srv267 is CRITICAL: Connection refused by host [23:34:35] RECOVERY - RAID on srv261 is OK: OK: no RAID installed [23:34:36] RECOVERY - RAID on srv261 is OK: OK: no RAID installed [23:34:55] RECOVERY - DPKG on srv188 is OK: All packages OK [23:34:56] RECOVERY - DPKG on srv188 is OK: All packages OK [23:37:25] RECOVERY - RAID on srv267 is OK: OK: no RAID installed [23:37:26] RECOVERY - RAID on srv267 is OK: OK: no RAID installed [23:37:45] RECOVERY - Disk space on srv261 is OK: DISK OK [23:37:46] RECOVERY - Disk space on srv261 is OK: DISK OK [23:38:05] RECOVERY - Disk space on srv267 is OK: DISK OK [23:38:06] RECOVERY - Disk space on srv267 is OK: DISK OK [23:38:15] RECOVERY - RAID on srv188 is OK: OK: no RAID installed [23:38:15] RECOVERY - Disk space on srv188 is OK: DISK OK [23:38:16] RECOVERY - RAID on srv188 is OK: OK: no RAID installed [23:38:16] RECOVERY - Disk space on srv188 is OK: DISK OK [23:38:35] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1761s [23:38:36] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1761s [23:39:25] PROBLEM - RAID on mw1122 is CRITICAL: Connection refused by host [23:39:25] PROBLEM - Disk space on tarin is CRITICAL: Connection refused by host [23:39:26] PROBLEM - RAID on mw1122 is CRITICAL: Connection refused by host [23:39:26] PROBLEM - Disk space on tarin is CRITICAL: Connection refused by host [23:39:35] PROBLEM - DPKG on mw1016 is CRITICAL: Connection refused by host [23:39:36] PROBLEM - DPKG on mw1016 is CRITICAL: Connection refused by host [23:39:45] PROBLEM - DPKG on ms5 is CRITICAL: Connection refused by host [23:39:45] PROBLEM - DPKG on mw1044 is CRITICAL: Connection refused by host [23:39:45] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1831s 
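Returning to the wikidiff2 package work logged just before this monitoring burst: swapping a renamed package in a reprepro-managed apt repository is a remove of the old name from the distribution followed by an includedeb of the new build. A small sketch of those two steps; the repository base directory, distribution codename and .deb filename are assumptions, only the "reprepro remove" itself comes from the log:

    # drop the old lucid package name from the repository
    reprepro -b /srv/wikimedia remove lucid-wikimedia php5-wikidiff2
    # publish the rebuilt package under its new name
    reprepro -b /srv/wikimedia includedeb lucid-wikimedia php-wikidiff2_*_amd64.deb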
[23:39:46] PROBLEM - DPKG on ms5 is CRITICAL: Connection refused by host [23:39:46] PROBLEM - DPKG on mw1044 is CRITICAL: Connection refused by host [23:39:46] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1831s [23:39:55] PROBLEM - Disk space on mw1050 is CRITICAL: Connection refused by host [23:39:55] PROBLEM - RAID on mw44 is CRITICAL: Connection refused by host [23:39:55] PROBLEM - Disk space on cp1043 is CRITICAL: Connection refused by host [23:39:55] PROBLEM - Disk space on mw1084 is CRITICAL: Connection refused by host [23:39:55] PROBLEM - RAID on mw1044 is CRITICAL: Connection refused by host [23:39:56] PROBLEM - Disk space on mw1050 is CRITICAL: Connection refused by host [23:39:56] PROBLEM - RAID on mw44 is CRITICAL: Connection refused by host [23:39:56] PROBLEM - Disk space on cp1043 is CRITICAL: Connection refused by host [23:39:56] PROBLEM - Disk space on mw1084 is CRITICAL: Connection refused by host [23:39:56] PROBLEM - RAID on mw1044 is CRITICAL: Connection refused by host [23:40:05] PROBLEM - RAID on srv192 is CRITICAL: Connection refused by host [23:40:05] PROBLEM - Disk space on mw1044 is CRITICAL: Connection refused by host [23:40:05] PROBLEM - Disk space on db1018 is CRITICAL: Connection refused by host [23:40:05] PROBLEM - DPKG on srv200 is CRITICAL: Connection refused by host [23:40:05] PROBLEM - Disk space on srv243 is CRITICAL: Connection refused by host [23:40:05] PROBLEM - DPKG on mw72 is CRITICAL: Connection refused by host [23:40:06] PROBLEM - DPKG on mw1050 is CRITICAL: Connection refused by host [23:40:06] PROBLEM - RAID on srv192 is CRITICAL: Connection refused by host [23:40:06] PROBLEM - Disk space on mw1044 is CRITICAL: Connection refused by host [23:40:06] PROBLEM - Disk space on db1018 is CRITICAL: Connection refused by host [23:40:06] PROBLEM - DPKG on srv200 is CRITICAL: Connection refused by host [23:40:06] PROBLEM - Disk space on srv243 is CRITICAL: Connection refused by host [23:40:06] PROBLEM - Disk space on srv232 is CRITICAL: Connection refused by host [23:40:07] PROBLEM - RAID on srv190 is CRITICAL: Connection refused by host [23:40:07] PROBLEM - DPKG on mw72 is CRITICAL: Connection refused by host [23:40:07] PROBLEM - DPKG on mw1050 is CRITICAL: Connection refused by host [23:40:07] PROBLEM - Disk space on srv232 is CRITICAL: Connection refused by host [23:40:07] PROBLEM - RAID on srv190 is CRITICAL: Connection refused by host [23:40:15] PROBLEM - RAID on mw1095 is CRITICAL: Connection refused by host [23:40:15] PROBLEM - DPKG on mw1058 is CRITICAL: Connection refused by host [23:40:16] PROBLEM - RAID on mw1095 is CRITICAL: Connection refused by host [23:40:16] PROBLEM - DPKG on mw1058 is CRITICAL: Connection refused by host [23:40:25] PROBLEM - Disk space on db1033 is CRITICAL: Connection refused by host [23:40:25] PROBLEM - RAID on mw1050 is CRITICAL: Connection refused by host [23:40:25] PROBLEM - RAID on srv200 is CRITICAL: Connection refused by host [23:40:25] PROBLEM - RAID on db1033 is CRITICAL: Connection refused by host [23:40:25] PROBLEM - RAID on db1017 is CRITICAL: Connection refused by host [23:40:25] PROBLEM - RAID on mw1041 is CRITICAL: Connection refused by host [23:40:26] PROBLEM - Disk space on db1033 is CRITICAL: Connection refused by host [23:40:26] PROBLEM - RAID on mw1050 is CRITICAL: Connection refused by host [23:40:26] PROBLEM - RAID on srv200 is CRITICAL: Connection refused by host [23:40:26] PROBLEM - RAID on db1033 is CRITICAL: 
Connection refused by host [23:40:26] PROBLEM - RAID on db1017 is CRITICAL: Connection refused by host [23:40:26] PROBLEM - Disk space on srv220 is CRITICAL: Connection refused by host [23:40:27] PROBLEM - DPKG on aluminium is CRITICAL: Connection refused by host [23:40:27] PROBLEM - RAID on mw1041 is CRITICAL: Connection refused by host [23:40:27] PROBLEM - Disk space on srv220 is CRITICAL: Connection refused by host [23:40:27] PROBLEM - DPKG on aluminium is CRITICAL: Connection refused by host [23:40:35] PROBLEM - DPKG on db1006 is CRITICAL: Connection refused by host [23:40:35] PROBLEM - RAID on ms5 is CRITICAL: Connection refused by host [23:40:35] PROBLEM - DPKG on db1005 is CRITICAL: Connection refused by host [23:40:35] PROBLEM - MySQL disk space on db1001 is CRITICAL: Connection refused by host [23:40:35] PROBLEM - MySQL disk space on es4 is CRITICAL: Connection refused by host [23:40:35] PROBLEM - DPKG on srv192 is CRITICAL: Connection refused by host [23:40:35] PROBLEM - RAID on mw1131 is CRITICAL: Connection refused by host [23:40:36] PROBLEM - DPKG on db1006 is CRITICAL: Connection refused by host [23:40:36] PROBLEM - RAID on ms5 is CRITICAL: Connection refused by host [23:40:36] PROBLEM - DPKG on db1005 is CRITICAL: Connection refused by host [23:40:36] PROBLEM - MySQL disk space on db1001 is CRITICAL: Connection refused by host [23:40:36] PROBLEM - MySQL disk space on es4 is CRITICAL: Connection refused by host [23:40:37] PROBLEM - DPKG on db1010 is CRITICAL: Connection refused by host [23:40:37] PROBLEM - MySQL disk space on db1031 is CRITICAL: Connection refused by host [23:40:37] PROBLEM - DPKG on srv192 is CRITICAL: Connection refused by host [23:40:37] PROBLEM - RAID on mw1131 is CRITICAL: Connection refused by host [23:40:37] PROBLEM - DPKG on db1048 is CRITICAL: Connection refused by host [23:40:37] PROBLEM - DPKG on db1010 is CRITICAL: Connection refused by host [23:40:37] PROBLEM - MySQL disk space on db1031 is CRITICAL: Connection refused by host [23:40:38] PROBLEM - DPKG on db1048 is CRITICAL: Connection refused by host [23:40:45] PROBLEM - Disk space on mw1156 is CRITICAL: Connection refused by host [23:40:45] PROBLEM - Disk space on srv227 is CRITICAL: Connection refused by host [23:40:45] PROBLEM - DPKG on mw44 is CRITICAL: Connection refused by host [23:40:45] PROBLEM - DPKG on mw1083 is CRITICAL: Connection refused by host [23:40:45] PROBLEM - RAID on db1008 is CRITICAL: Connection refused by host [23:40:45] PROBLEM - DPKG on db1015 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - RAID on db1010 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - Disk space on mw1156 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - Disk space on srv227 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - DPKG on mw44 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - DPKG on mw1083 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - RAID on db1008 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - MySQL disk space on db1006 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - DPKG on es2 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - DPKG on db1015 is CRITICAL: Connection refused by host [23:40:46] PROBLEM - RAID on db1010 is CRITICAL: Connection refused by host [23:40:47] PROBLEM - Disk space on mw1016 is CRITICAL: Connection refused by host [23:40:47] PROBLEM - Disk space on db1010 is CRITICAL: Connection refused by host [23:40:47] PROBLEM - MySQL disk space on db1006 is CRITICAL: 
Connection refused by host [23:40:47] PROBLEM - DPKG on es2 is CRITICAL: Connection refused by host [23:40:48] PROBLEM - Disk space on mw1016 is CRITICAL: Connection refused by host [23:40:48] PROBLEM - Disk space on db1010 is CRITICAL: Connection refused by host [23:40:55] PROBLEM - DPKG on emery is CRITICAL: Connection refused by host [23:40:55] PROBLEM - RAID on cp1043 is CRITICAL: Connection refused by host [23:40:55] PROBLEM - RAID on srv229 is CRITICAL: Connection refused by host [23:40:55] PROBLEM - DPKG on db1002 is CRITICAL: Connection refused by host [23:40:55] PROBLEM - Disk space on srv200 is CRITICAL: Connection refused by host [23:40:56] PROBLEM - DPKG on emery is CRITICAL: Connection refused by host [23:40:56] PROBLEM - RAID on cp1043 is CRITICAL: Connection refused by host [23:40:56] PROBLEM - RAID on srv229 is CRITICAL: Connection refused by host [23:40:56] PROBLEM - DPKG on db1002 is CRITICAL: Connection refused by host [23:40:56] PROBLEM - Disk space on srv200 is CRITICAL: Connection refused by host [23:41:02] o_0 [23:41:05] PROBLEM - RAID on mw1142 is CRITICAL: Connection refused by host [23:41:05] PROBLEM - DPKG on mw1100 is CRITICAL: Connection refused by host [23:41:05] PROBLEM - RAID on mw1156 is CRITICAL: Connection refused by host [23:41:05] PROBLEM - DPKG on virt3 is CRITICAL: Connection refused by host [23:41:05] PROBLEM - RAID on mw1141 is CRITICAL: Connection refused by host [23:41:06] PROBLEM - RAID on mw1142 is CRITICAL: Connection refused by host [23:41:06] PROBLEM - DPKG on mw1100 is CRITICAL: Connection refused by host [23:41:06] PROBLEM - RAID on mw1156 is CRITICAL: Connection refused by host [23:41:06] PROBLEM - DPKG on virt3 is CRITICAL: Connection refused by host [23:41:06] PROBLEM - RAID on mw1141 is CRITICAL: Connection refused by host [23:41:15] PROBLEM - DPKG on srv225 is CRITICAL: Connection refused by host [23:41:15] PROBLEM - Disk space on db1001 is CRITICAL: Connection refused by host [23:41:15] PROBLEM - RAID on srv286 is CRITICAL: Connection refused by host [23:41:16] PROBLEM - DPKG on srv225 is CRITICAL: Connection refused by host [23:41:16] PROBLEM - Disk space on db1001 is CRITICAL: Connection refused by host [23:41:16] PROBLEM - RAID on srv286 is CRITICAL: Connection refused by host [23:41:25] PROBLEM - DPKG on mw1031 is CRITICAL: Connection refused by host [23:41:25] PROBLEM - RAID on db1031 is CRITICAL: Connection refused by host [23:41:25] PROBLEM - Disk space on mw1095 is CRITICAL: Connection refused by host [23:41:25] PROBLEM - DPKG on mw1028 is CRITICAL: Connection refused by host [23:41:25] PROBLEM - Disk space on mw58 is CRITICAL: Connection refused by host [23:41:25] PROBLEM - Disk space on srv195 is CRITICAL: Connection refused by host [23:41:25] PROBLEM - DPKG on mw1069 is CRITICAL: Connection refused by host [23:41:26] PROBLEM - DPKG on mw1031 is CRITICAL: Connection refused by host [23:41:26] PROBLEM - RAID on db1031 is CRITICAL: Connection refused by host [23:41:26] PROBLEM - Disk space on mw1095 is CRITICAL: Connection refused by host [23:41:26] PROBLEM - DPKG on mw1028 is CRITICAL: Connection refused by host [23:41:26] PROBLEM - Disk space on mw58 is CRITICAL: Connection refused by host [23:41:27] PROBLEM - Disk space on cp1041 is CRITICAL: Connection refused by host [23:41:27] PROBLEM - Disk space on srv195 is CRITICAL: Connection refused by host [23:41:27] PROBLEM - DPKG on mw1069 is CRITICAL: Connection refused by host [23:41:27] PROBLEM - Disk space on cp1041 is CRITICAL: Connection refused by host [23:41:35] PROBLEM 
- RAID on srv218 is CRITICAL: Connection refused by host [23:41:35] PROBLEM - Disk space on db1043 is CRITICAL: Connection refused by host [23:41:36] PROBLEM - RAID on srv218 is CRITICAL: Connection refused by host [23:41:36] PROBLEM - Disk space on db1043 is CRITICAL: Connection refused by host [23:41:46] PROBLEM - Disk space on srv209 is CRITICAL: Connection refused by host [23:41:46] PROBLEM - Disk space on mw72 is CRITICAL: Connection refused by host [23:41:46] PROBLEM - Disk space on searchidx2 is CRITICAL: Connection refused by host [23:41:46] PROBLEM - Disk space on srv209 is CRITICAL: Connection refused by host [23:41:46] PROBLEM - Disk space on mw72 is CRITICAL: Connection refused by host [23:41:46] PROBLEM - Disk space on searchidx2 is CRITICAL: Connection refused by host [23:41:55] PROBLEM - Disk space on srv210 is CRITICAL: Connection refused by host [23:41:55] PROBLEM - MySQL disk space on db1038 is CRITICAL: Connection refused by host [23:41:55] PROBLEM - Disk space on es2 is CRITICAL: Connection refused by host [23:41:55] PROBLEM - DPKG on srv195 is CRITICAL: Connection refused by host [23:41:55] PROBLEM - Disk space on snapshot2 is CRITICAL: Connection refused by host [23:41:56] PROBLEM - Disk space on srv210 is CRITICAL: Connection refused by host [23:41:56] PROBLEM - MySQL disk space on db1038 is CRITICAL: Connection refused by host [23:41:56] PROBLEM - Disk space on es2 is CRITICAL: Connection refused by host [23:41:56] PROBLEM - DPKG on srv195 is CRITICAL: Connection refused by host [23:41:56] PROBLEM - Disk space on snapshot2 is CRITICAL: Connection refused by host [23:42:05] PROBLEM - Disk space on virt4 is CRITICAL: Connection refused by host [23:42:05] PROBLEM - Disk space on es1002 is CRITICAL: Connection refused by host [23:42:05] PROBLEM - Disk space on mw1089 is CRITICAL: Connection refused by host [23:42:05] PROBLEM - RAID on mw72 is CRITICAL: Connection refused by host [23:42:05] PROBLEM - DPKG on srv218 is CRITICAL: Connection refused by host [23:42:06] PROBLEM - Disk space on virt4 is CRITICAL: Connection refused by host [23:42:06] PROBLEM - Disk space on es1002 is CRITICAL: Connection refused by host [23:42:06] PROBLEM - Disk space on mw1089 is CRITICAL: Connection refused by host [23:42:06] PROBLEM - RAID on mw72 is CRITICAL: Connection refused by host [23:42:06] PROBLEM - DPKG on srv218 is CRITICAL: Connection refused by host [23:42:15] PROBLEM - DPKG on virt4 is CRITICAL: Connection refused by host [23:42:15] PROBLEM - MySQL disk space on db1029 is CRITICAL: Connection refused by host [23:42:15] PROBLEM - Disk space on mw1001 is CRITICAL: Connection refused by host [23:42:15] PROBLEM - DPKG on db45 is CRITICAL: Connection refused by host [23:42:15] PROBLEM - Disk space on mw1010 is CRITICAL: Connection refused by host [23:42:16] PROBLEM - DPKG on virt4 is CRITICAL: Connection refused by host [23:42:16] PROBLEM - MySQL disk space on db1029 is CRITICAL: Connection refused by host [23:42:16] PROBLEM - Disk space on mw1001 is CRITICAL: Connection refused by host [23:42:16] PROBLEM - DPKG on db45 is CRITICAL: Connection refused by host [23:42:16] PROBLEM - Disk space on mw1010 is CRITICAL: Connection refused by host [23:42:25] PROBLEM - MySQL disk space on db1035 is CRITICAL: Connection refused by host [23:42:25] PROBLEM - DPKG on srv286 is CRITICAL: Connection refused by host [23:42:25] PROBLEM - RAID on es4 is CRITICAL: Connection refused by host [23:42:25] PROBLEM - Disk space on mw44 is CRITICAL: Connection refused by host [23:42:25] PROBLEM - Disk space 
on mw65 is CRITICAL: Connection refused by host [23:42:25] PROBLEM - DPKG on srv226 is CRITICAL: Connection refused by host [23:42:26] PROBLEM - MySQL disk space on db1035 is CRITICAL: Connection refused by host [23:42:26] PROBLEM - DPKG on srv286 is CRITICAL: Connection refused by host [23:42:26] PROBLEM - RAID on es4 is CRITICAL: Connection refused by host [23:42:26] PROBLEM - Disk space on mw44 is CRITICAL: Connection refused by host [23:42:26] PROBLEM - Disk space on mw65 is CRITICAL: Connection refused by host [23:42:26] PROBLEM - DPKG on srv226 is CRITICAL: Connection refused by host [23:42:35] PROBLEM - DPKG on srv215 is CRITICAL: Connection refused by host [23:42:35] PROBLEM - Disk space on virt2 is CRITICAL: Connection refused by host [23:42:35] PROBLEM - RAID on srv210 is CRITICAL: Connection refused by host [23:42:36] PROBLEM - DPKG on srv215 is CRITICAL: Connection refused by host [23:42:36] PROBLEM - Disk space on virt2 is CRITICAL: Connection refused by host [23:42:36] PROBLEM - RAID on srv210 is CRITICAL: Connection refused by host [23:42:45] PROBLEM - Disk space on mw1037 is CRITICAL: Connection refused by host [23:42:45] PROBLEM - RAID on db1041 is CRITICAL: Connection refused by host [23:42:45] PROBLEM - RAID on srv223 is CRITICAL: Connection refused by host [23:42:45] PROBLEM - Disk space on mw1078 is CRITICAL: Connection refused by host [23:42:45] PROBLEM - Disk space on srv225 is CRITICAL: Connection refused by host [23:42:45] PROBLEM - DPKG on db1043 is CRITICAL: Connection refused by host [23:42:46] PROBLEM - Disk space on mw1037 is CRITICAL: Connection refused by host [23:42:46] PROBLEM - RAID on db1041 is CRITICAL: Connection refused by host [23:42:46] PROBLEM - RAID on srv223 is CRITICAL: Connection refused by host [23:42:46] PROBLEM - Disk space on mw1078 is CRITICAL: Connection refused by host [23:42:46] PROBLEM - Disk space on srv225 is CRITICAL: Connection refused by host [23:42:46] PROBLEM - DPKG on db1043 is CRITICAL: Connection refused by host [23:42:55] PROBLEM - Disk space on db1048 is CRITICAL: Connection refused by host [23:42:55] PROBLEM - MySQL disk space on db1046 is CRITICAL: Connection refused by host [23:42:55] PROBLEM - DPKG on mw67 is CRITICAL: Connection refused by host [23:42:56] PROBLEM - Disk space on db1048 is CRITICAL: Connection refused by host [23:42:56] PROBLEM - MySQL disk space on db1046 is CRITICAL: Connection refused by host [23:42:56] PROBLEM - DPKG on mw67 is CRITICAL: Connection refused by host [23:43:05] PROBLEM - DPKG on mw1149 is CRITICAL: Connection refused by host [23:43:05] PROBLEM - MySQL disk space on db1028 is CRITICAL: Connection refused by host [23:43:05] PROBLEM - Disk space on srv196 is CRITICAL: Connection refused by host [23:43:05] PROBLEM - RAID on srv215 is CRITICAL: Connection refused by host [23:43:06] PROBLEM - DPKG on mw1149 is CRITICAL: Connection refused by host [23:43:06] PROBLEM - MySQL disk space on db1028 is CRITICAL: Connection refused by host [23:43:06] PROBLEM - Disk space on srv196 is CRITICAL: Connection refused by host [23:43:06] PROBLEM - RAID on srv215 is CRITICAL: Connection refused by host [23:43:15] PROBLEM - RAID on db1028 is CRITICAL: Connection refused by host [23:43:15] PROBLEM - DPKG on mw1 is CRITICAL: Connection refused by host [23:43:15] PROBLEM - DPKG on srv190 is CRITICAL: Connection refused by host [23:43:16] PROBLEM - RAID on db1028 is CRITICAL: Connection refused by host [23:43:16] PROBLEM - DPKG on mw1 is CRITICAL: Connection refused by host [23:43:16] PROBLEM - DPKG on srv190 
is CRITICAL: Connection refused by host [23:43:25] PROBLEM - RAID on storage3 is CRITICAL: Connection refused by host [23:43:25] PROBLEM - Disk space on db1029 is CRITICAL: Connection refused by host [23:43:25] PROBLEM - Disk space on mw1012 is CRITICAL: Connection refused by host [23:43:25] PROBLEM - Disk space on db43 is CRITICAL: Connection refused by host [23:43:25] PROBLEM - RAID on es2 is CRITICAL: Connection refused by host [23:43:25] PROBLEM - Disk space on srv189 is CRITICAL: Connection refused by host [23:43:26] PROBLEM - RAID on storage3 is CRITICAL: Connection refused by host [23:43:26] PROBLEM - Disk space on db1029 is CRITICAL: Connection refused by host [23:43:26] PROBLEM - Disk space on mw1012 is CRITICAL: Connection refused by host [23:43:26] PROBLEM - Disk space on db43 is CRITICAL: Connection refused by host [23:43:26] PROBLEM - RAID on es2 is CRITICAL: Connection refused by host [23:43:26] PROBLEM - Disk space on srv189 is CRITICAL: Connection refused by host [23:43:35] PROBLEM - RAID on srv227 is CRITICAL: Connection refused by host [23:43:35] PROBLEM - RAID on db1048 is CRITICAL: Connection refused by host [23:43:35] PROBLEM - DPKG on searchidx2 is CRITICAL: Connection refused by host [23:43:35] PROBLEM - Disk space on db1038 is CRITICAL: Connection refused by host [23:43:35] PROBLEM - Disk space on db45 is CRITICAL: Connection refused by host [23:43:36] PROBLEM - RAID on srv227 is CRITICAL: Connection refused by host [23:43:36] PROBLEM - RAID on db1048 is CRITICAL: Connection refused by host [23:43:36] PROBLEM - DPKG on searchidx2 is CRITICAL: Connection refused by host [23:43:36] PROBLEM - Disk space on db1038 is CRITICAL: Connection refused by host [23:43:36] PROBLEM - Disk space on db45 is CRITICAL: Connection refused by host [23:43:45] PROBLEM - Disk space on aluminium is CRITICAL: Connection refused by host [23:43:45] PROBLEM - Disk space on mw1031 is CRITICAL: Connection refused by host [23:43:45] PROBLEM - RAID on aluminium is CRITICAL: Connection refused by host [23:43:45] PROBLEM - Disk space on db1002 is CRITICAL: Connection refused by host [23:43:45] PROBLEM - Disk space on srv215 is CRITICAL: Connection refused by host [23:43:45] PROBLEM - MySQL disk space on db44 is CRITICAL: Connection refused by host [23:43:45] PROBLEM - Disk space on srv226 is CRITICAL: Connection refused by host [23:43:46] PROBLEM - Disk space on aluminium is CRITICAL: Connection refused by host [23:43:46] PROBLEM - Disk space on mw1031 is CRITICAL: Connection refused by host [23:43:46] PROBLEM - RAID on aluminium is CRITICAL: Connection refused by host [23:43:46] PROBLEM - Disk space on db1002 is CRITICAL: Connection refused by host [23:43:46] PROBLEM - Disk space on srv215 is CRITICAL: Connection refused by host [23:43:46] PROBLEM - Disk space on mw1013 is CRITICAL: Connection refused by host [23:43:46] PROBLEM - Disk space on mw1 is CRITICAL: Connection refused by host [23:43:46] PROBLEM - MySQL disk space on db44 is CRITICAL: Connection refused by host [23:43:46] PROBLEM - Disk space on srv226 is CRITICAL: Connection refused by host [23:43:47] PROBLEM - Disk space on mw1013 is CRITICAL: Connection refused by host [23:43:47] PROBLEM - Disk space on mw1 is CRITICAL: Connection refused by host [23:43:55] PROBLEM - Disk space on mw1058 is CRITICAL: Connection refused by host [23:43:55] PROBLEM - Disk space on mw1069 is CRITICAL: Connection refused by host [23:43:55] PROBLEM - RAID on mw1106 is CRITICAL: Connection refused by host [23:43:56] PROBLEM - Disk space on mw1058 is CRITICAL: 
Connection refused by host [23:43:56] PROBLEM - Disk space on mw1069 is CRITICAL: Connection refused by host [23:43:56] PROBLEM - RAID on mw1106 is CRITICAL: Connection refused by host [23:44:05] PROBLEM - Disk space on mw1076 is CRITICAL: Connection refused by host [23:44:05] RECOVERY - DPKG on srv267 is OK: All packages OK [23:44:05] PROBLEM - DPKG on mw52 is CRITICAL: Connection refused by host [23:44:05] PROBLEM - RAID on mw69 is CRITICAL: Connection refused by host [23:44:05] PROBLEM - DPKG on srv289 is CRITICAL: Connection refused by host [23:44:05] PROBLEM - RAID on sodium is CRITICAL: Connection refused by host [23:44:06] PROBLEM - Disk space on mw1076 is CRITICAL: Connection refused by host [23:44:06] RECOVERY - DPKG on srv267 is OK: All packages OK [23:44:06] PROBLEM - DPKG on mw52 is CRITICAL: Connection refused by host [23:44:06] PROBLEM - RAID on mw69 is CRITICAL: Connection refused by host [23:44:06] PROBLEM - DPKG on srv289 is CRITICAL: Connection refused by host [23:44:06] PROBLEM - RAID on sodium is CRITICAL: Connection refused by host [23:44:15] PROBLEM - DPKG on mw1131 is CRITICAL: Connection refused by host [23:44:15] PROBLEM - mailman on sodium is CRITICAL: Connection refused by host [23:44:15] PROBLEM - DPKG on mw1041 is CRITICAL: Connection refused by host [23:44:15] PROBLEM - RAID on srv237 is CRITICAL: Connection refused by host [23:44:15] PROBLEM - Disk space on srv229 is CRITICAL: Connection refused by host [23:44:15] PROBLEM - Disk space on mw1149 is CRITICAL: Connection refused by host [23:44:16] PROBLEM - DPKG on mw1131 is CRITICAL: Connection refused by host [23:44:16] PROBLEM - mailman on sodium is CRITICAL: Connection refused by host [23:44:16] PROBLEM - DPKG on mw1041 is CRITICAL: Connection refused by host [23:44:16] PROBLEM - RAID on srv237 is CRITICAL: Connection refused by host [23:44:16] PROBLEM - Disk space on srv229 is CRITICAL: Connection refused by host [23:44:16] PROBLEM - Disk space on mw1149 is CRITICAL: Connection refused by host [23:44:25] PROBLEM - DPKG on virt2 is CRITICAL: Connection refused by host [23:44:25] PROBLEM - MySQL disk space on db1010 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - DPKG on mw1125 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - Disk space on db1039 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - Disk space on db1031 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - DPKG on mw1122 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - DPKG on virt2 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - MySQL disk space on db1010 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - DPKG on mw1125 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - Disk space on db1039 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - Disk space on db1031 is CRITICAL: Connection refused by host [23:44:26] PROBLEM - DPKG on mw1122 is CRITICAL: Connection refused by host [23:44:35] PROBLEM - DPKG on db44 is CRITICAL: Connection refused by host [23:44:35] PROBLEM - RAID on mw1058 is CRITICAL: Connection refused by host [23:44:35] PROBLEM - Disk space on mw1060 is CRITICAL: Connection refused by host [23:44:35] PROBLEM - Disk space on srv208 is CRITICAL: Connection refused by host [23:44:35] PROBLEM - RAID on srv220 is CRITICAL: Connection refused by host [23:44:36] PROBLEM - DPKG on srv241 is CRITICAL: Connection refused by host [23:44:36] PROBLEM - DPKG on db44 is CRITICAL: Connection refused by host [23:44:36] PROBLEM - RAID on mw1058 is 
CRITICAL: Connection refused by host [23:44:36] PROBLEM - Disk space on mw1060 is CRITICAL: Connection refused by host [23:44:36] PROBLEM - Disk space on srv208 is CRITICAL: Connection refused by host [23:44:36] PROBLEM - RAID on srv220 is CRITICAL: Connection refused by host [23:44:36] PROBLEM - DPKG on srv241 is CRITICAL: Connection refused by host [23:44:45] PROBLEM - Disk space on virt3 is CRITICAL: Connection refused by host [23:44:45] PROBLEM - RAID on mw1078 is CRITICAL: Connection refused by host [23:44:45] PROBLEM - RAID on virt4 is CRITICAL: Connection refused by host [23:44:45] PROBLEM - Disk space on mw52 is CRITICAL: Connection refused by host [23:44:45] PROBLEM - MySQL disk space on storage3 is CRITICAL: Connection refused by host [23:44:45] PROBLEM - poolcounter on tarin is CRITICAL: Connection refused by host [23:44:45] PROBLEM - Disk space on mw1083 is CRITICAL: Connection refused by host [23:44:46] PROBLEM - Disk space on virt3 is CRITICAL: Connection refused by host [23:44:46] PROBLEM - RAID on mw1078 is CRITICAL: Connection refused by host [23:44:46] PROBLEM - RAID on virt4 is CRITICAL: Connection refused by host [23:44:46] PROBLEM - Disk space on mw52 is CRITICAL: Connection refused by host [23:44:46] PROBLEM - MySQL disk space on storage3 is CRITICAL: Connection refused by host [23:44:47] PROBLEM - Disk space on srv192 is CRITICAL: Connection refused by host [23:44:47] PROBLEM - poolcounter on tarin is CRITICAL: Connection refused by host [23:44:47] PROBLEM - Disk space on mw1083 is CRITICAL: Connection refused by host [23:44:47] PROBLEM - Disk space on srv192 is CRITICAL: Connection refused by host [23:44:56] PROBLEM - RAID on es1002 is CRITICAL: Connection refused by host [23:44:56] PROBLEM - jenkins_service_running on aluminium is CRITICAL: Connection refused by host [23:44:56] PROBLEM - Disk space on db1003 is CRITICAL: Connection refused by host [23:44:57] PROBLEM - RAID on es1002 is CRITICAL: Connection refused by host [23:44:57] PROBLEM - jenkins_service_running on aluminium is CRITICAL: Connection refused by host [23:44:57] PROBLEM - Disk space on db1003 is CRITICAL: Connection refused by host [23:45:05] PROBLEM - DPKG on db1003 is CRITICAL: Connection refused by host [23:45:05] PROBLEM - DPKG on mw1105 is CRITICAL: Connection refused by host [23:45:05] PROBLEM - RAID on mw1110 is CRITICAL: Connection refused by host [23:45:05] PROBLEM - Disk space on mw67 is CRITICAL: Connection refused by host [23:45:05] PROBLEM - Disk space on mw1105 is CRITICAL: Connection refused by host [23:45:06] PROBLEM - DPKG on db1003 is CRITICAL: Connection refused by host [23:45:06] PROBLEM - DPKG on mw1105 is CRITICAL: Connection refused by host [23:45:06] PROBLEM - RAID on mw1110 is CRITICAL: Connection refused by host [23:45:06] PROBLEM - Disk space on mw67 is CRITICAL: Connection refused by host [23:45:06] PROBLEM - Disk space on mw1105 is CRITICAL: Connection refused by host [23:45:15] PROBLEM - DPKG on db1001 is CRITICAL: Connection refused by host [23:45:15] PROBLEM - RAID on db1002 is CRITICAL: Connection refused by host [23:45:15] PROBLEM - RAID on db1006 is CRITICAL: Connection refused by host [23:45:15] PROBLEM - DPKG on es1002 is CRITICAL: Connection refused by host [23:45:15] PROBLEM - DPKG on db1004 is CRITICAL: Connection refused by host [23:45:15] PROBLEM - DPKG on db1038 is CRITICAL: Connection refused by host [23:45:15] PROBLEM - DPKG on mw1110 is CRITICAL: Connection refused by host [23:45:16] PROBLEM - DPKG on db1001 is CRITICAL: Connection refused by host 
[23:45:16] PROBLEM - RAID on db1002 is CRITICAL: Connection refused by host [23:45:16] PROBLEM - RAID on db1006 is CRITICAL: Connection refused by host [23:45:16] PROBLEM - DPKG on es1002 is CRITICAL: Connection refused by host [23:45:16] PROBLEM - DPKG on db1004 is CRITICAL: Connection refused by host [23:45:16] PROBLEM - DPKG on db1017 is CRITICAL: Connection refused by host [23:45:16] PROBLEM - DPKG on db1038 is CRITICAL: Connection refused by host [23:45:16] PROBLEM - DPKG on mw1110 is CRITICAL: Connection refused by host [23:45:17] PROBLEM - DPKG on db1017 is CRITICAL: Connection refused by host [23:45:25] PROBLEM - RAID on srv232 is CRITICAL: Connection refused by host [23:45:25] PROBLEM - Disk space on db44 is CRITICAL: Connection refused by host [23:45:25] PROBLEM - Disk space on mw1127 is CRITICAL: Connection refused by host [23:45:25] PROBLEM - MySQL disk space on db43 is CRITICAL: Connection refused by host [23:45:25] PROBLEM - MySQL disk space on db1043 is CRITICAL: Connection refused by host [23:45:26] PROBLEM - DPKG on srv223 is CRITICAL: Connection refused by host [23:45:26] PROBLEM - RAID on srv232 is CRITICAL: Connection refused by host [23:45:26] PROBLEM - Disk space on db44 is CRITICAL: Connection refused by host [23:45:26] PROBLEM - Disk space on mw1127 is CRITICAL: Connection refused by host [23:45:26] PROBLEM - MySQL disk space on db43 is CRITICAL: Connection refused by host [23:45:26] PROBLEM - MySQL disk space on db1043 is CRITICAL: Connection refused by host [23:45:26] PROBLEM - RAID on srv225 is CRITICAL: Connection refused by host [23:45:26] PROBLEM - DPKG on srv223 is CRITICAL: Connection refused by host [23:45:26] PROBLEM - RAID on srv225 is CRITICAL: Connection refused by host [23:45:35] PROBLEM - RAID on mw58 is CRITICAL: Connection refused by host [23:45:35] PROBLEM - RAID on mw1060 is CRITICAL: Connection refused by host [23:45:35] PROBLEM - DPKG on db1039 is CRITICAL: Connection refused by host [23:45:35] PROBLEM - DPKG on srv237 is CRITICAL: Connection refused by host [23:45:35] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: Connection refused by host [23:45:35] PROBLEM - MySQL disk space on db1004 is CRITICAL: Connection refused by host [23:45:36] PROBLEM - RAID on db1029 is CRITICAL: Connection refused by host [23:45:36] PROBLEM - RAID on mw58 is CRITICAL: Connection refused by host [23:45:36] PROBLEM - RAID on mw1060 is CRITICAL: Connection refused by host [23:45:36] PROBLEM - DPKG on db1039 is CRITICAL: Connection refused by host [23:45:36] PROBLEM - DPKG on srv237 is CRITICAL: Connection refused by host [23:45:36] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: Connection refused by host [23:45:36] PROBLEM - MySQL disk space on db1004 is CRITICAL: Connection refused by host [23:45:36] PROBLEM - RAID on db1029 is CRITICAL: Connection refused by host [23:45:45] PROBLEM - DPKG on db1031 is CRITICAL: Connection refused by host [23:45:45] PROBLEM - RAID on srv289 is CRITICAL: Connection refused by host [23:45:45] PROBLEM - DPKG on mw1141 is CRITICAL: Connection refused by host [23:45:45] PROBLEM - MySQL disk space on db1048 is CRITICAL: Connection refused by host [23:45:45] PROBLEM - MySQL disk space on db1041 is CRITICAL: Connection refused by host [23:45:46] PROBLEM - DPKG on db1031 is CRITICAL: Connection refused by host [23:45:46] PROBLEM - RAID on srv289 is CRITICAL: Connection refused by host [23:45:46] PROBLEM - DPKG on mw1141 is CRITICAL: Connection refused by host [23:45:46] PROBLEM - MySQL disk space on db1048 is CRITICAL: 
Connection refused by host [23:45:46] PROBLEM - MySQL disk space on db1041 is CRITICAL: Connection refused by host [23:45:55] PROBLEM - RAID on virt2 is CRITICAL: Connection refused by host [23:45:55] PROBLEM - RAID on db45 is CRITICAL: Connection refused by host [23:45:55] PROBLEM - DPKG on mw1142 is CRITICAL: Connection refused by host [23:45:55] PROBLEM - Disk space on srv223 is CRITICAL: Connection refused by host [23:45:56] PROBLEM - RAID on virt2 is CRITICAL: Connection refused by host [23:45:56] PROBLEM - RAID on db45 is CRITICAL: Connection refused by host [23:45:56] PROBLEM - DPKG on mw1142 is CRITICAL: Connection refused by host [23:45:56] PROBLEM - Disk space on srv223 is CRITICAL: Connection refused by host [23:46:05] PROBLEM - DPKG on db1033 is CRITICAL: Connection refused by host [23:46:05] PROBLEM - RAID on mw65 is CRITICAL: Connection refused by host [23:46:05] PROBLEM - Disk space on mw1028 is CRITICAL: Connection refused by host [23:46:05] PROBLEM - RAID on mw1069 is CRITICAL: Connection refused by host [23:46:05] PROBLEM - DPKG on tarin is CRITICAL: Connection refused by host [23:46:06] PROBLEM - DPKG on mw69 is CRITICAL: Connection refused by host [23:46:06] PROBLEM - RAID on db1003 is CRITICAL: Connection refused by host [23:46:06] PROBLEM - DPKG on db1033 is CRITICAL: Connection refused by host [23:46:06] PROBLEM - RAID on mw65 is CRITICAL: Connection refused by host [23:46:06] PROBLEM - Disk space on mw1028 is CRITICAL: Connection refused by host [23:46:06] PROBLEM - RAID on mw1069 is CRITICAL: Connection refused by host [23:46:06] PROBLEM - DPKG on tarin is CRITICAL: Connection refused by host [23:46:06] PROBLEM - Disk space on mw1100 is CRITICAL: Connection refused by host [23:46:06] PROBLEM - Disk space on mw1125 is CRITICAL: Connection refused by host [23:46:06] PROBLEM - DPKG on mw69 is CRITICAL: Connection refused by host [23:46:06] PROBLEM - RAID on db1003 is CRITICAL: Connection refused by host [23:46:07] PROBLEM - Disk space on mw1100 is CRITICAL: Connection refused by host [23:46:07] PROBLEM - Disk space on mw1125 is CRITICAL: Connection refused by host [23:46:15] PROBLEM - DPKG on mw1106 is CRITICAL: Connection refused by host [23:46:15] PROBLEM - RAID on db44 is CRITICAL: Connection refused by host [23:46:15] PROBLEM - RAID on mw1012 is CRITICAL: Connection refused by host [23:46:15] PROBLEM - Disk space on mw1041 is CRITICAL: Connection refused by host [23:46:15] PROBLEM - Disk space on srv286 is CRITICAL: Connection refused by host [23:46:15] PROBLEM - RAID on mw1001 is CRITICAL: Connection refused by host [23:46:15] PROBLEM - Disk space on db1017 is CRITICAL: Connection refused by host [23:46:16] PROBLEM - DPKG on mw1106 is CRITICAL: Connection refused by host [23:46:16] PROBLEM - RAID on db44 is CRITICAL: Connection refused by host [23:46:16] PROBLEM - RAID on mw1012 is CRITICAL: Connection refused by host [23:46:16] PROBLEM - Disk space on mw1041 is CRITICAL: Connection refused by host [23:46:16] PROBLEM - Disk space on srv286 is CRITICAL: Connection refused by host [23:46:16] PROBLEM - RAID on mw1001 is CRITICAL: Connection refused by host [23:46:17] PROBLEM - Disk space on db1017 is CRITICAL: Connection refused by host [23:46:25] PROBLEM - RAID on srv209 is CRITICAL: Connection refused by host [23:46:25] PROBLEM - RAID on emery is CRITICAL: Connection refused by host [23:46:26] PROBLEM - RAID on srv209 is CRITICAL: Connection refused by host [23:46:26] PROBLEM - RAID on emery is CRITICAL: Connection refused by host [23:46:35] PROBLEM - DPKG on 
[23:46:35] PROBLEM - MySQL disk space on db1008 is CRITICAL: Connection refused by host [23:46:35] PROBLEM - mobile traffic loggers on cp1041 is CRITICAL: Connection refused by host [23:46:35] PROBLEM - RAID on mw1100 is CRITICAL: Connection refused by host [23:46:35] PROBLEM - Disk space on db1041 is CRITICAL: Connection refused by host [23:46:35] PROBLEM - DPKG on db1008 is CRITICAL: Connection refused by host [23:46:45] PROBLEM - Disk space on mw1142 is CRITICAL: Connection refused by host [23:46:45] PROBLEM - MySQL disk space on db1033 is CRITICAL: Connection refused by host [23:46:45] PROBLEM - RAID on srv226 is CRITICAL: Connection refused by host [23:46:45] PROBLEM - RAID on virt3 is CRITICAL: Connection refused by host [23:46:45] PROBLEM - RAID on srv195 is CRITICAL: Connection refused by host [23:46:45] PROBLEM - DPKG on mw1012 is CRITICAL: Connection refused by host [23:46:55] PROBLEM - spamassassin on sodium is CRITICAL: Connection refused by host [23:46:55] PROBLEM - DPKG on db1029 is CRITICAL: Connection refused by host [23:46:55] PROBLEM - RAID on mw52 is CRITICAL: Connection refused by host [23:46:55] PROBLEM - Disk space on srv289 is CRITICAL: Connection refused by host [23:46:55] PROBLEM - RAID on srv208 is CRITICAL: Connection refused by host [23:46:55] PROBLEM - DPKG on srv243 is CRITICAL: Connection refused by host [23:46:55] PROBLEM - RAID on db1038 is CRITICAL: Connection refused by host [23:46:56] PROBLEM - RAID on mw1031 is CRITICAL: Connection refused by host [23:46:56] PROBLEM - RAID on searchidx2 is CRITICAL: Connection refused by host [23:46:57] PROBLEM - Disk space on db1035 is CRITICAL: Connection refused by host [23:46:57] PROBLEM - DPKG on srv229 is CRITICAL: Connection refused by host [23:46:59] PROBLEM - DPKG on snapshot2 is CRITICAL: Connection refused by host [23:46:59] PROBLEM - MySQL disk space on db1015 is CRITICAL: Connection refused by host
[23:47:05] PROBLEM - DPKG on mw1127 is CRITICAL: Connection refused by host [23:47:05] PROBLEM - DPKG on mw1010 is CRITICAL: Connection refused by host [23:47:05] PROBLEM - DPKG on db1046 is CRITICAL: Connection refused by host [23:47:05] PROBLEM - RAID on tarin is CRITICAL: Connection refused by host [23:47:05] PROBLEM - RAID on mw1016 is CRITICAL: Connection refused by host [23:47:06] PROBLEM - DPKG on mw58 is CRITICAL: Connection refused by host [23:47:06] PROBLEM - Disk space on db1005 is CRITICAL: Connection refused by host [23:47:06] PROBLEM - MySQL disk space on db1003 is CRITICAL: Connection refused by host [23:47:15] PROBLEM - RAID on db1046 is CRITICAL: Connection refused by host [23:47:15] PROBLEM - DPKG on cp1043 is CRITICAL: Connection refused by host [23:47:15] PROBLEM - RAID on mw1013 is CRITICAL: Connection refused by host [23:47:15] PROBLEM - Disk space on mw1106 is CRITICAL: Connection refused by host [23:47:15] PROBLEM - RAID on srv189 is CRITICAL: Connection refused by host [23:47:15] PROBLEM - DPKG on srv220 is CRITICAL: Connection refused by host [23:47:25] PROBLEM - Disk space on db1028 is CRITICAL: Connection refused by host [23:47:25] PROBLEM - Disk space on mw1131 is CRITICAL: Connection refused by host [23:47:25] PROBLEM - DPKG on mw1095 is CRITICAL: Connection refused by host [23:47:25] PROBLEM - Disk space on mw1122 is CRITICAL: Connection refused by host [23:47:25] PROBLEM - DPKG on db1018 is CRITICAL: Connection refused by host [23:47:25] PROBLEM - RAID on db1039 is CRITICAL: Connection refused by host [23:47:35] PROBLEM - DPKG on srv227 is CRITICAL: Connection refused by host
[23:47:35] PROBLEM - DPKG on cp1041 is CRITICAL: Connection refused by host [23:47:35] PROBLEM - Disk space on db1004 is CRITICAL: Connection refused by host [23:47:35] PROBLEM - RAID on db1035 is CRITICAL: Connection refused by host [23:47:35] PROBLEM - DPKG on db1028 is CRITICAL: Connection refused by host [23:47:46] PROBLEM - Disk space on mw1141 is CRITICAL: Connection refused by host [23:47:46] PROBLEM - MySQL disk space on es2 is CRITICAL: Connection refused by host [23:47:46] PROBLEM - RAID on db1005 is CRITICAL: Connection refused by host [23:47:46] PROBLEM - DPKG on storage3 is CRITICAL: Connection refused by host [23:47:55] PROBLEM - Disk space on srv241 is CRITICAL: Connection refused by host [23:47:56] PROBLEM - RAID on mw1010 is CRITICAL: Connection refused by host [23:47:56] PROBLEM - DPKG on srv189 is CRITICAL: Connection refused by host [23:47:56] PROBLEM - DPKG on srv208 is CRITICAL: Connection refused by host [23:48:05] PROBLEM - DPKG on mw1084 is CRITICAL: Connection refused by host [23:48:05] PROBLEM - Disk space on db1008 is CRITICAL: Connection refused by host [23:48:05] PROBLEM - MySQL disk space on db1002 is CRITICAL: Connection refused by host [23:48:16] PROBLEM - DPKG on mw1078 is CRITICAL: Connection refused by host [23:48:16] PROBLEM - RAID on db1018 is CRITICAL: Connection refused by host [23:48:16] PROBLEM - RAID on db1043 is CRITICAL: Connection refused by host [23:48:16] PROBLEM - MySQL disk space on db1017 is CRITICAL: Connection refused by host [23:48:16] PROBLEM - Disk space on mw69 is CRITICAL: Connection refused by host [23:48:16] PROBLEM - DPKG on es4 is CRITICAL: Connection refused by host [23:48:16] PROBLEM - Disk space on mw1110 is CRITICAL: Connection refused by host
[23:48:17] PROBLEM - DPKG on mw1013 is CRITICAL: Connection refused by host [23:48:25] PROBLEM - Disk space on storage3 is CRITICAL: Connection refused by host [23:48:25] PROBLEM - RAID on mw1076 is CRITICAL: Connection refused by host [23:48:25] PROBLEM - DPKG on mw1001 is CRITICAL: Connection refused by host [23:48:25] PROBLEM - DPKG on srv210 is CRITICAL: Connection refused by host [23:48:25] PROBLEM - MySQL disk space on db1039 is CRITICAL: Connection refused by host [23:48:25] PROBLEM - RAID on mw1125 is CRITICAL: Connection refused by host [23:48:35] PROBLEM - DPKG on srv232 is CRITICAL: Connection refused by host [23:48:35] PROBLEM - RAID on db1001 is CRITICAL: Connection refused by host [23:48:35] PROBLEM - MySQL disk space on db45 is CRITICAL: Connection refused by host [23:48:35] PROBLEM - Disk space on db1046 is CRITICAL: Connection refused by host [23:48:35] PROBLEM - Disk space on emery is CRITICAL: Connection refused by host [23:48:35] PROBLEM - RAID on mw1 is CRITICAL: Connection refused by host [23:48:45] PROBLEM - RAID on db1015 is CRITICAL: Connection refused by host [23:48:45] PROBLEM - RAID on mw1089 is CRITICAL: Connection refused by host [23:48:45] PROBLEM - RAID on cp1041 is CRITICAL: Connection refused by host [23:48:45] PROBLEM - Disk space on srv237 is CRITICAL: Connection refused by host [23:48:55] PROBLEM - Disk space on srv190 is CRITICAL: Connection refused by host [23:48:55] PROBLEM - RAID on mw1084 is CRITICAL: Connection refused by host [23:48:55] PROBLEM - RAID on srv243 is CRITICAL: Connection refused by host [23:48:55] PROBLEM - DPKG on mw1089 is CRITICAL: Connection refused by host [23:48:55] PROBLEM - Disk space on db1006 is CRITICAL: Connection refused by host [23:49:05] PROBLEM - Disk space on srv218 is CRITICAL: Connection refused by host
[23:49:15] PROBLEM - DPKG on mw1076 is CRITICAL: Connection refused by host [23:49:15] PROBLEM - RAID on mw1083 is CRITICAL: Connection refused by host [23:49:15] PROBLEM - Disk space on db1015 is CRITICAL: Connection refused by host [23:49:15] PROBLEM - RAID on mw1127 is CRITICAL: Connection refused by host [23:49:15] PROBLEM - DPKG on srv196 is CRITICAL: Connection refused by host [23:49:15] PROBLEM - RAID on mw1028 is CRITICAL: Connection refused by host [23:49:15] PROBLEM - RAID on snapshot2 is CRITICAL: Connection refused by host [23:49:16] PROBLEM - MySQL disk space on db1018 is CRITICAL: Connection refused by host [23:49:25] PROBLEM - MySQL disk space on es1002 is CRITICAL: Connection refused by host [23:49:25] PROBLEM - RAID on mw67 is CRITICAL: Connection refused by host [23:49:25] PROBLEM - MySQL disk space on db1005 is CRITICAL: Connection refused by host [23:49:25] PROBLEM - RAID on mw1149 is CRITICAL: Connection refused by host [23:49:25] PROBLEM - RAID on srv241 is CRITICAL: Connection refused by host [23:49:25] PROBLEM - RAID on db1004 is CRITICAL: Connection refused by host [23:49:35] PROBLEM - Disk space on ms5 is CRITICAL: Connection refused by host [23:49:35] PROBLEM - Disk space on es4 is CRITICAL: Connection refused by host [23:49:35] PROBLEM - DPKG on sodium is CRITICAL: Connection refused by host [23:49:45] PROBLEM - DPKG on mw1037 is CRITICAL: Connection refused by host [23:49:45] PROBLEM - DPKG on mw65 is CRITICAL: Connection refused by host [23:49:55] PROBLEM - RAID on db43 is CRITICAL: Connection refused by host [23:49:55] PROBLEM - DPKG on db1035 is CRITICAL: Connection refused by host [23:49:55] RECOVERY - RAID on srv192 is OK: OK: no RAID installed [23:50:05] PROBLEM - RAID on srv196 is CRITICAL: Connection refused by host
[23:50:15] PROBLEM - DPKG on db1041 is CRITICAL: Connection refused by host [23:50:35] RECOVERY - DPKG on srv192 is OK: All packages OK [23:51:35] RECOVERY - Disk space on srv209 is OK: DISK OK [23:51:55] PROBLEM - RAID on mw1105 is CRITICAL: Connection refused by host [23:52:05] PROBLEM - DPKG on mw1060 is CRITICAL: Connection refused by host [23:52:15] PROBLEM - RAID on mw1037 is CRITICAL: Connection refused by host [23:52:25] RECOVERY - Disk space on virt2 is OK: DISK OK [23:56:14] !log changed global roles netadmins and sysadmins to be virtual static groups in ldap that autopopulate with any user that has objectclass=novauser [23:56:15] Logged the message, Master [23:56:29] no more needing to manually add people to those worthless fucking groups! :) [23:58:12] Evening guys, can you check whether centralauth is down? I'm trying to make an account for the en ACC team and the tools which connect to centralauth db aren't doing so
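For context on the 23:56 !log entry above: "virtual static groups" means membership in netadmins and sysadmins is no longer maintained by hand but derived from a directory query, so any LDAP entry carrying objectclass=novauser automatically counts as a member. The sketch below only illustrates that membership-resolution step with the Python ldap3 library; the hostname, base DN and anonymous bind are illustrative placeholders, not the actual Wikimedia LDAP layout.

    # Minimal sketch: resolve the members of an auto-populated ("virtual static")
    # group by querying for every entry that has objectClass=novauser.
    # LDAP_HOST and PEOPLE_BASE are placeholders, not the real Wikimedia values.
    from ldap3 import Server, Connection, ALL

    LDAP_HOST = "ldap.example.org"
    PEOPLE_BASE = "ou=people,dc=example,dc=org"

    server = Server(LDAP_HOST, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous bind, purely for the example

    # Every entry matching this filter is implicitly a member of the virtual group,
    # so nobody has to add users to netadmins/sysadmins manually.
    conn.search(PEOPLE_BASE, "(objectClass=novauser)", attributes=["uid"])
    members = sorted(str(entry.uid) for entry in conn.entries)
    print(len(members), "members resolved for the virtual group:", members)

In practice the filter would normally live in the directory itself (for example via a dynamic-group mechanism) so that ordinary group lookups see the computed membership; the client-side query above is just the simplest way to show what "autopopulate with any user that has objectclass=novauser" evaluates to.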