[00:02:19] New patchset: Ryan Lane; "Applying LDAP fix to all instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2697 [00:04:04] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [00:10:04] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [00:10:04] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [00:13:16] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2697 [00:13:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2697 [00:14:32] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2611 [00:14:33] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2611 [00:20:03] New patchset: Diederik; "IP range filtering and regular expression now work." [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2698 [00:21:34] New patchset: Ryan Lane; "Adding in nslcd.conf.erb, to avoid awkward cherry-pick" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2699 [00:22:04] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2699 [00:22:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2699 [00:26:18] New patchset: Ryan Lane; "We don't want to give people a shell, except in labs." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2700 [00:27:00] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2700 [00:27:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2700 [00:27:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2700 [00:29:51] New patchset: Ottomata; "Removing launcher.py, moved multiprocessing support to pipeline/__main__.py" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2701 [00:30:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.660 seconds [00:40:59] !log tstarling synchronized wmf-config/CommonSettings.php 'reducing cache expiry for unversioned resources on commons' [00:41:01] Logged the message, Master [00:42:48] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2701 [00:42:49] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2701 [00:49:46] btw: css is broken on commons [00:50:03] confirmed [00:50:07] since at least 4 minutes [00:51:03] !log tstarling synchronized wmf-config/CommonSettings.php 'reducing cache expiry for unversioned resources on commons' [00:51:05] Logged the message, Master [00:51:26] Saibo: thanks, i've yelled at people to fix it :) [00:51:42] thanks, works [00:51:44] !log tstarling synchronized wmf-config/CommonSettings.php 'reducing cache expiry for unversioned resources on commons' [00:51:46] Logged the message, Master [00:52:04] and again broken [00:52:24] hm.. and works again [00:52:31] * Saibo beats the caches [00:53:17] who broke it? 
:) [00:53:27] :) [00:53:31] it's working again [00:53:53] I think it was in an attempt to fix all the issues before the 1.19 launch :) [00:54:05] I wish they'd get on it already [00:54:08] I want lua on my templates [00:54:13] heh [00:54:21] ;) [00:54:23] you know that's not coming in this release, right? :) [00:54:28] right [00:54:29] I do [00:54:33] is there any sort of timeframe? [00:54:39] no clue [00:54:41] because if it's a really long time I will just write it in Javascript [00:54:50] thought you might be WMF and might know ;) [00:54:58] I'm wmf, but I'm in ops [00:55:04] ops doesn't know stuff [00:55:08] ;) [00:55:09] +1 [00:55:12] heh [00:55:25] I feign ignorance *really* well, anyway [00:57:52] New patchset: Ottomata; "Adding __main__.py - meant for this to go with the last commit." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2702 [00:59:07] !log reedy synchronized php-1.19/languages/messages/MessagesEn.php 'r112073' [00:59:09] New patchset: Lcarr; "commenting out aggregator Attempt to make puppet compile the directory before timeout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2703 [00:59:10] Logged the message, Master [01:00:21] !log reedy synchronized php-1.19/includes/ 'r112073' [01:00:23] Logged the message, Master [01:00:24] well who do I have to bother in order to find out when Lua will be implemented? [01:00:54] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2702 [01:00:55] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2702 [01:01:30] New patchset: Lcarr; "commenting out aggregator Attempt to make puppet compile the directory before timeout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2703 [01:01:37] Magog_the_Ogre: wikitech-l ? [01:01:42] or you can implement it :) [01:02:01] I will implement it if you will pay me prevailing wages :D [01:02:37] haha, yeah I'm not ms. 
moneybags [01:02:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2703 [01:03:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2703 [01:03:21] Magog_the_Ogre: https://www.mediawiki.org/wiki/Lua_scripting [01:03:51] looks like it will be a while [01:04:33] Start date is pinned for August... Not sure how accurate it is [01:06:00] yikes [01:06:12] sounds like I'm not getting it for at least a year, probably more like 1.5-2 years [01:06:32] I'll just have to write my tool in Javascript and put it on the site JS for everyone [01:07:18] test [01:07:40] failed [01:08:11] so, i can't talk on channel until i get passes status? [01:08:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:09:19] i was just messing with you, you're obviously talking on the channel right now [01:09:21] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Commonswiki to 1.19wmf1 [01:09:23] Logged the message, Master [01:12:35] PROBLEM - MySQL Idle Transactions on db22 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.023 seconds [01:16:47] PROBLEM - RAID on db22 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [01:17:59] !log reverting to 1.18 on commons due to DB overload [01:18:00] !log tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: [01:18:01] Logged the message, Master [01:18:03] Logged the message, Master [01:19:29] RECOVERY - MySQL Idle Transactions on db22 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:20:13] New patchset: Lcarr; "Only pushing standard package as stafford is overloaded" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2704 [01:20:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2704 [01:21:24] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2704 [01:21:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2704 [01:26:05] RECOVERY - Disk space on neon is OK: DISK OK [01:26:15] New patchset: Lcarr; "Revert "Only pushing standard package as stafford is overloaded"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2705 [01:26:23] RECOVERY - DPKG on neon is OK: All packages OK [01:26:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2705 [01:27:26] RECOVERY - RAID on neon is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [01:28:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2705 [01:29:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2705 [01:30:44] RECOVERY - NTP on neon is OK: NTP OK: Offset 0.009791016579 secs [01:37:15] TimStarling: https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CentralAuth/specials/SpecialMergeAccount.php?r1=104235&r2=104236& [01:37:59] what about it? [01:38:53] break [01:39:48] hehe [01:39:55] don't you like little cleanups? [01:40:23] it convinced me for a minute there [01:44:09] poor MediaWiki, it's like a speeding car whose brakes were cut...it just can't stop! [01:47:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:45] Reedy: ? 
[01:51:25] Shit [01:51:38] :) [01:51:57] Still, the method is *still* undefined [01:52:18] * Reedy fixes [01:53:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.581 seconds [01:53:14] * Damianz gives Reedy a cookie [01:55:47] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 601s [01:56:50] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 663s [01:57:21] !log reedy synchronized php-1.19/extensions/CentralAuth/specials/ 'r112075' [01:57:24] Logged the message, Master [01:58:11] New patchset: Lcarr; "Fixing nagios service to nagios3 in newmonitor class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2706 [01:58:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2706 [01:59:24] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2706 [01:59:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2706 [02:15:10] !log testing new scap script ~tstarling/bin/scap-new [02:15:14] Logged the message, Master [02:16:47] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [02:18:08] !log LocalisationUpdate completed (1.18) at Wed Feb 22 02:18:08 UTC 2012 [02:18:11] Logged the message, Master [02:18:24] !log tstarling synchronizing Wikimedia installation... : [02:18:26] Logged the message, Master [02:21:31] !log tstarling synchronizing Wikimedia installation... : [02:21:33] Logged the message, Master [02:24:44] sync done. 
[02:25:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.977 seconds [02:37:47] PROBLEM - RAID on srv194 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:08] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [02:39:27] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:39:35] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.2.227:11000 (Connection timed out) [02:40:03] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [02:40:21] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [02:40:21] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [02:40:29] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:40:29] PROBLEM - BGP status on cr2-eqiad is CRITICAL: (Service Check Timed Out) [02:40:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: (Service Check Timed Out) [02:40:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:53] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:29] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [02:43:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 [02:44:05] PROBLEM - DPKG on nfs1 is 
CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:44:24] PROBLEM - Router interfaces on br1-knams is CRITICAL: CRITICAL: No response from remote host 91.198.174.245 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [02:44:41] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:50] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 9, down: 0, shutdown: 0 [02:45:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.433 seconds [02:45:26] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:27] RECOVERY - DPKG on nfs1 is OK: All packages OK [02:45:35] PROBLEM - RAID on mw40 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:45:44] RECOVERY - Router interfaces on br1-knams is OK: OK: host 91.198.174.245, interfaces up: 10, down: 0, dormant: 0, excluded: 0, unused: 0 [02:45:44] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:46:20] PROBLEM - Puppetmaster HTTPS on sockpuppet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:14] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [02:47:33] RECOVERY - Puppetmaster HTTPS on sockpuppet is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.433 seconds [02:47:41] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:47:42] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:48:53] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [02:48:53] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions 
up: 9, down: 0, shutdown: 0 [02:49:02] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [02:49:03] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 99, down: 0, dormant: 0, excluded: 0, unused: 0 [02:49:11] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [02:49:11] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 9, down: 0, shutdown: 0 [02:49:20] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 9, down: 0, shutdown: 0 [02:49:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 84, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR [02:49:56] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [02:49:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR [02:49:57] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:50:14] New patchset: Lcarr; "decreasing number of simultaneous checks for nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2707 [02:50:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2707 [02:50:38] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2707 [02:50:38] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2707 [02:50:52] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4252 bytes in 0.005 seconds [02:51:17] New patchset: Ottomata; "Created DygraphLoader for generic transformation of observation aggregations into dygraphs csv format." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2708 [02:51:17] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.015 seconds [02:51:44] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.015 seconds [02:52:20] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.009 seconds [02:54:24] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:54:32] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:54:40] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:54:51] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/2708 [02:55:01] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:55:01] Change merged: Ottomata; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2708 [02:57:50] !log Synced php-1.19/includes/MessageBlobStore.php to disable ::clear() ; where's the logging bot? [02:57:52] Logged the message, Mr. 
Obvious [03:00:50] [12:50] * logmsgbot has quit (Read error: Connection reset by peer) < so ~10 mins ago [03:03:28] Started ircecho [03:03:56] !log Manually started ircecho on fenari ; why doesn't this happen upon boot? Why didn't puppet start it? [03:03:59] Logged the message, Mr. Obvious [03:04:51] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [03:05:01] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [03:05:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197, [03:05:36] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [03:06:04] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [03:06:04] PROBLEM - check_all_memcacheds on spence is CRITICAL: (Service Check Timed Out) [03:06:30] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:39] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: No response from remote host 10.1.2.3 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [03:07:33] PROBLEM - Certificate expiration on nfs1 is CRITICAL: (Service Check Timed Out) [03:09:49] what happens at COM? [03:09:58] many errors on editing [03:10:15] !log on fenari: NFS overload, killed apache and xinetd [03:10:17] Logged the message, Master [03:10:22] Saibo: what's COM? [03:10:24] e.g. Lock wait timeout exceeded; try restarting transaction (10.0.6.32)“. Or API errors when using gadgets. [03:10:26] commons [03:10:37] currently switched? 
[03:10:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [03:10:57] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 9, down: 0, shutdown: 0 [03:11:00] "API request returned code 200 parsererrorError code is SyntaxError: JSON.parse: unexpected non-whitespace character after JSON data" [03:11:06] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [03:11:15] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR [03:11:24] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [03:11:37] yeah, there is a problem [03:11:51] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [03:12:00] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 9, down: 0, shutdown: 0 [03:12:09] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [03:12:25] seems better now [03:12:36] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [03:12:37] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 9, down: 0, shutdown: 0 [03:13:00] hmm, maybe not, another 13s query for site_stats showing up [03:15:27] New patchset: Catrope; "Don't let l10nupdate write to /home directly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2709 [03:15:36] TimStarling: ----^^ [03:15:45] That should help speed up l10nupdate and make it less heavy on NFS [03:15:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2709 [03:18:32] !log tstarling synchronized php-1.19/includes/LocalisationCache.php [03:18:34] Logged the message, Master [03:18:57] !log tstarling synchronized php-1.19/includes/LocalisationCache.php [03:18:59] Logged the message, Master [03:22:28] TimStarling: you think we'll be able to do something today/tonight, or should we just plan to have another window tomorrow? [03:23:15] I'm game to go on a little while longer, but I imagine we'll run out of steam here pretty quickly [03:24:19] we can switch now, Roan's patch should stop it from breaking [03:24:47] alright...let's give it a whirl [03:25:14] !log switching commons back to 1.19 [03:25:19] Logged the message, Master [03:25:24] !log tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: [03:25:26] Logged the message, Master [03:26:26] seems fine [03:27:02] testing upload wizard....some breakage [03:27:21] TimStarling: is the cache timeout still on? [03:27:23] Didn't we forget the RL cache timeout thing? [03:27:30] yeah it's still 30 seconds [03:28:10] ok....probably should clear up quickly [03:28:16] * robla gets back to testing [03:28:40] it's been more than 30 seconds [03:29:19] why is it loading code from prototype.wikimedia.org? [03:29:34] surely that is asking for trouble [03:30:34] mw.user.anonymous is not a function [03:30:34] https://bits.wikimedia.org/commons.wikimedia.org/load.php?debug=false&lang=en&modules=site&only=scripts&skin=vector&* [03:30:34] Line 29 [03:31:22] I got that too for UploadWizard [03:31:24] and then I didn't [03:31:39] it's from https://commons.wikimedia.org/wiki/MediaWiki:Common.js [03:31:49] Oh [03:32:06] Yeah that code is broken [03:32:08] Lemme fix that [03:32:22] (Missing dependency) [03:33:29] so that function doesn't exist by default in 1.19? 
[03:33:38] Site JS should be fixed now [03:33:49] It never reliably existed at the time Common.js is run [03:34:15] I edited Common.js , that error should be fixed now [03:34:55] you might want to get on #wikimedia-commons [03:42:36] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [03:48:30] !log catrope synchronized php-1.19/skins/common/shared.css 'r112081' [03:48:33] Logged the message, Master [03:49:05] * AaronSchulz watches fsockopen() errors for ED [04:02:43] noc down? http://noc.wikimedia.org/ isn't loading. [04:03:28] yeah I killed it [04:03:50] Report of ExtensionDistributor on mediawiki.org being down as well. [04:13:25] TimStarling: we're plotting a rollback here... [04:17:35] !log tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: rolling back commons to 1.18 [04:17:37] Logged the message, Master [04:24:26] I guess it's time for me to have lunch [04:25:39] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours [04:34:55] !log catrope synchronized php-1.19/thumb.php 'Experimental fix for 1.19 UploadWizard thumb issue' [04:34:58] Logged the message, Master [05:14:45] !log started xinetd [05:14:47] Logged the message, Master [05:18:23] !log catrope synchronized php-1.19/thumb.php 'Experimental fix for UploadStash thumbs in 1.19' [05:18:26] Logged the message, Master [05:20:15] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 10 seconds [05:21:45] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [05:51:45] !log tstarling synchronizing Wikimedia installation... : [05:51:47] Logged the message, Master [06:06:56] sync done. [06:12:11] New patchset: Tim Starling; "Support l10n manual recache in scap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2710 [06:12:36] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2710 [06:12:49] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2710 [06:12:50] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2710 [06:15:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:17:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.538 seconds [06:26:41] !log installed new scap manually since puppet is broken [06:26:44] Logged the message, Master [06:33:03] !log tstarling synchronized wmf-config/CommonSettings.php 'enabling manual recache' [06:33:05] Logged the message, Master [06:39:36] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [06:49:58] !log tstarling synchronized php-1.19/languages/messages/MessagesEn.php 'test change for manualRecache' [06:50:02] Logged the message, Master [06:51:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:55:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.619 seconds [07:03:09] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out [07:03:36] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [07:09:36] PROBLEM - Puppet freshness on mw1002 is CRITICAL: Puppet has not run in the last 10 hours [07:09:36] PROBLEM - Puppet freshness on db46 is CRITICAL: Puppet has not run in the last 10 hours [07:11:33] RECOVERY - Lucene on search9 is OK: TCP OK - 8.993 second response time on port 8123 [07:23:42] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [07:31:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 
bytes in 3.452 seconds [07:40:27] Please look at https://bugzilla.wikimedia.org/show_bug.cgi?id=34585 (German Wikiversity offline) [07:44:14] it's not offline [07:44:33] but you can't get there without specifying the main page name or some other page name, it seems [07:45:48] http://de.wikiversity.org/wiki/Hauptseite [07:46:11] the layout is weird but I don't know if that's normal or not, having not visited the site ever before this [07:46:26] I get this page with and without any page title [07:46:41] if you click the above link, what do you see? [07:46:45] I get this page (This wiki does not exist) with and without any page title [07:46:53] The page "This wiki does not exist" [07:46:58] ugh [07:48:02] and now so do I [07:48:11] after a control-f5 [07:48:13] fabulous [07:49:11] yes after control-f5 [07:49:59] I'll add a screenshot to the bug report [07:53:12] it's somehow missing from wikiversions.dat [07:54:23] Something happened between 23:00 UTC and now, because there were no problems before [07:57:28] Thank you for your help, apergos [07:57:33] but I have to go now [07:57:37] lemme see what version it's supposed to be running [07:58:01] I hope you or somebody else can fix the bug :-) [08:01:51] looks like it should still be on 1.18 [08:03:05] ok file's cleaned up [08:07:23] !log ariel rebuilt wikiversions.cdb and synchronized wikiversions files: fix dewikiversity typo in wikiversions file [08:07:26] Logged the message, Master [08:09:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:15:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.026 seconds [08:46:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:51:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.513 seconds [09:02:21] !log tstarling synchronizing Wikimedia installation... 
: [09:02:24] Logged the message, Master [09:05:33] !log tstarling synchronizing Wikimedia installation... : [09:05:35] Logged the message, Master [09:09:16] !log tstarling synchronizing Wikimedia installation... : [09:09:18] Logged the message, Master [09:23:49] sync done. [09:23:52] !log on fenari: started apache [09:23:54] Logged the message, Master [09:24:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:25:39] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4252 bytes in 0.014 seconds [09:28:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.718 seconds [09:34:56] !log tstarling synchronizing Wikimedia installation... : [09:34:58] Logged the message, Master [09:39:36] PROBLEM - HTTP on fenari is CRITICAL: Connection refused [09:45:14] !log on fenari: stopped apache again due to overload. Restarted it with reduced MaxClients [09:45:16] Logged the message, Master [09:45:27] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4252 bytes in 0.020 seconds [09:45:36] RECOVERY - Lucene on search9 is OK: TCP OK - 2.997 second response time on port 8123 [09:47:14] sync done. 
[09:50:44] I wonder why it is unhappy recently [09:57:54] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [10:02:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:05:42] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [10:06:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.382 seconds [10:09:49] !log hashar synchronized php-1.19/extensions/CodeReview/backend/DiffHighlighter.php 'r112098 - (bug 34554) diff chunk fail to parse file add/rm' [10:09:51] Logged the message, Master [10:11:42] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:11:42] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [10:37:30] RECOVERY - Lucene on search9 is OK: TCP OK - 2.995 second response time on port 8123 [10:42:25] New patchset: ArielGlenn; "initial commit: tool for managing dump uploads to archive.org" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/2711 [10:42:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:42:27] New review: gerrit2; "Lint check passed." 
[operations/dumps] (ariel); V: 1 - https://gerrit.wikimedia.org/r/2711 [10:46:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.386 seconds [10:49:39] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [11:00:18] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [11:02:42] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [11:20:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:24:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.023 seconds [11:26:33] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [11:58:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:02:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.055 seconds [12:18:36] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [12:20:33] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:24] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:25] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:25] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:25] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.093 seconds [12:40:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:27] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:27] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:27] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:27] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:34] PROBLEM - 
check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:34] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:34] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:33] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Wed Feb 22 12:53:11 UTC 2012 [12:54:00] PROBLEM - RAID on db40 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:45] RECOVERY - RAID on db40 is OK: OK: 1 logical device(s) checked [13:00:27] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:27] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:27] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:27] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:24] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:24] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:24] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:24] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:12] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [13:10:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [13:10:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.162 seconds [13:20:33] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:09] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [13:21:38] New patchset: Demon; "Adding .gitreview" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2712 [13:21:40] New review: gerrit2; "Lint check passed." 
[test/mediawiki/extensions/examples] (master); V: 1 - https://gerrit.wikimedia.org/r/2712 [13:21:52] New review: Demon; "(no comment)" [test/mediawiki/extensions/examples] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2712 [13:21:52] Change merged: Demon; [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2712 [13:22:17] apergos: Are you there? [13:22:22] yes [13:22:25] marely [13:22:27] barely [13:22:54] heh [13:23:08] I prefer to discuss it on IRC [13:23:14] ok [13:23:19] rather than constant mails, sigh [13:24:15] anyway is dataset1001 available for use? [13:24:36] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [13:25:02] not yet [13:25:26] and you won't be able to then pull from both server [13:25:26] s [13:25:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:31] sorry... [13:25:40] heh its okay :P [13:25:59] but how do you intend to actually get the dumps on IA? [13:26:13] ? [13:26:28] I mean, you contacted them, right? [13:26:34] sure [13:26:55] so, are they doing the downloading, or you will be doing the uploading? 
[13:27:17] I'll do it from here [13:27:33] but you still need to register a collection though [13:27:35] if we don't like the results I guess I'll tell them where the problems are and we'll see what can be done [13:27:36] uh huh [13:27:50] I've said I'll come back to them on that after I talk with you folks [13:28:00] (lol not me) [13:28:26] you folks = people uploading to the wikimedia downloads collection [13:28:30] I am just helping to move things [13:28:41] and sadly, only me doing it :( [13:28:49] ok [13:28:51] brb [13:28:59] ok [13:29:03] (sorry, I did say I'm barely here) [13:30:05] np [13:30:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:31:13] back (for a bit) [13:31:30] it's okay if you don't really have the time now to discuss it [13:31:45] it's no hurry anyway [13:31:49] I have time [13:31:59] oh, okay :) [13:32:01] just feeling sick, so there may be interruptions [13:32:21] heh [13:32:50] as I pointed out, I am okay with you uploading those dumps [13:32:55] so if my idea of our setup is not interfering with your plans, I'll just work the rest of it out with the archive folks [13:33:12] seems good [13:33:24] and yes, the naming conventions [13:33:39] yeah, I'll settle on some nice name [13:33:46] we might put wmf in the name or something [13:33:49] whatever [13:33:53] that stuff is easy [13:34:17] yeah, since you are already contacting the archive [13:34:28] they can settle all of this [13:34:35] yup [13:35:05] But you are uploading at 6-month intervals [13:35:11] seems very long... 
[13:35:24] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:24] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:24] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:24] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:30] not for what we wan [13:35:30] t [13:35:55] so you just want to use it as something like backup? [13:36:02] I don't want to use the archive as a current mirror [13:36:07] in case of failure [13:36:13] no, not that either [13:36:18] hmm? [13:36:22] backup in case of failure would be mirror sites [13:36:26] * Hydriz sense evilness [13:36:28] or dataset1001 [13:36:37] I want it to hold historical material [13:36:51] I see [13:37:21] but is your python script available anywhere? [13:37:29] so that anyone can go look at the state of a project for a given time period (once every 6 months seems like a reasonable time frame for a historical snapshot) [13:37:32] not yet [13:37:33] well [13:37:44] hmm to be specific it is in gerrit waiting for review [13:37:49] I see [13:37:56] and it's quite crappy, this is the first iteration :-P [13:38:04] at least it's in python [13:38:09] but it does upload objects, list items, and so on [13:38:23] auto-create-bucket? [13:38:31] it's quite important [13:38:35] no. I don't do that. [13:38:46] item creation is first [13:38:49] then object uploads [13:38:53] I see [13:39:07] I haven't dealt with multipart upload yet [13:39:16] I want to see what the archive folks have to say about that first [13:39:19] like enwiki? [13:39:27] well dewiki [13:39:41] they now have the largest single file produced by a dump [13:39:45] multipart is the splitting of the dumps, I suppose? 
[13:39:51] no [13:39:55] it's a specific s3 thing [13:39:59] I see [13:40:15] yeah, I had trouble uploading dewiki [13:40:28] 77GB in 2010 [13:40:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:32] you should look at the multipart specs for s3 [13:40:38] it's supported by the archive [13:41:07] split the file into smaller pieces and they each go up separately [13:41:10] with the appropriate headers [13:41:25] I see [13:41:36] * Hydriz might look at that in the future [13:41:43] cool [13:42:08] but the split dumps of enwiki was the issue that I raised before [13:42:23] the naming makes it quite hard to get them on the archive [13:42:42] why is that? 
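The multipart idea described above (split the file into smaller pieces that each go up separately) can be sketched roughly as follows. This is only a planning helper under stated assumptions, not apergos's script and not the archive's API: `plan_parts`, `read_part`, and the 512 MiB part size are hypothetical names and values chosen for illustration, and the actual upload requests and headers are deliberately left out. Part numbers start at 1, as they do in Amazon S3 multipart uploads; whether the archive's S3-like interface applies the same limits is not confirmed in this log.

```python
from typing import Iterator, Tuple


def plan_parts(file_size: int, part_size: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (part_number, offset, length) for each piece of a file.

    Part numbers start at 1. Every part is part_size bytes long
    except possibly the last one, which holds the remainder.
    """
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    part_number = 1
    offset = 0
    while offset < file_size:
        length = min(part_size, file_size - offset)
        yield part_number, offset, length
        offset += length
        part_number += 1


def read_part(path: str, offset: int, length: int) -> bytes:
    """Read one planned piece from disk without loading the whole file."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


# Hypothetical numbers: a 77 GiB file split into 512 MiB pieces.
GIB = 1024 ** 3
MIB = 1024 ** 2
parts = list(plan_parts(77 * GIB, 512 * MIB))
```

Each planned piece would then be sent as its own request, so a failed piece can be retried without resending the whole dump file.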
[13:43:34] the p2413204831 or something is quite weird (and very unique) [13:43:39] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [13:43:52] get the list of names out of the md5sums file [13:44:00] that's flat text, very easy to parse [13:44:02] I was thinking if you could actually run it like it is now [13:44:15] then after checking things, then renaming them [13:44:22] I don't want to rename them [13:44:27] those names have content [13:44:46] the first and last page id contained in the file [13:45:26] this is very useful when we need to rerun something [13:45:35] it's also useful for endusers [13:45:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:39] true that [13:45:55] so seriously, just get the list of those out of mdsums.txt [13:46:13] ok, will make do with it [13:47:46] md5sums.txt [13:48:03] uh huh [13:49:54] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2559 [13:49:55] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2559 [13:50:13] New review: Diederik; "Ok." 
[analytics/udp-filters] (refactoring); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2560 [13:50:13] Change merged: Diederik; [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2560 [13:50:33] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:52] apergos: Is it possible for me to "fork" your code and use it for my own uploading? [13:55:08] it's not set up for what you want [13:55:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:31] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.451 seconds [13:55:32] yeah, I know [13:55:57] basically, I think it will be useless for you right now [13:56:21] once I have it set up to walk a directory tree and do whatever it's going to do [13:56:27] you would be able to adapt it then [13:56:31] okie [13:56:36] and it looks like... [13:56:46] you are using wmf-dumps-%s%s [13:56:53] not yet [13:56:57] I have not settled on any name [13:57:04] I put that in there as a sample only [13:57:08] yeah [13:57:16] test uploads went somewhere else with a different naming scheme :-P [13:58:58] heh [13:59:49] I saw evil [13:59:53] what evil? 
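The earlier suggestion to get the list of file names out of the md5sums file, where the split-dump names carry the first and last page id, could look something like the sketch below. It assumes the file uses the usual `md5sum` layout of a checksum followed by a file name on each line, and that the page range appears in the name as `p<first>p<last>`. The sample lines and checksums are made up for illustration, and `parse_md5sums` and `DumpFile` are hypothetical names, not part of any existing script.

```python
import re
from typing import List, NamedTuple, Optional


class DumpFile(NamedTuple):
    md5: str
    name: str
    first_page: Optional[int]  # None when the name carries no page range
    last_page: Optional[int]


# Assumed naming pattern: "...-p<first page id>p<last page id>..."
_PAGE_RANGE = re.compile(r"p(\d+)p(\d+)")


def parse_md5sums(text: str) -> List[DumpFile]:
    """Parse md5sum-style text into a list of DumpFile entries."""
    files = []
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        md5, name = fields
        # md5sum marks binary-mode entries with a leading "*"
        name = name.strip().lstrip("*")
        match = _PAGE_RANGE.search(name)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
        else:
            first = last = None
        files.append(DumpFile(md5, name, first, last))
    return files


# Made-up sample content, for illustration only.
SAMPLE = """\
0123456789abcdef0123456789abcdef  enwiki-20120211-pages-meta-history1.xml-p000000010p000002171.bz2
fedcba9876543210fedcba9876543210  enwiki-20120211-all-titles-in-ns0.gz
"""

files = parse_md5sums(SAMPLE)
```

Keeping the page range as separate fields means the original names never need to be changed; an uploader can sort or group the pieces by `first_page` instead.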
[14:00:03] your uploads :P [14:00:17] what's wrong with them? [14:00:28] they are great test files, and they don't take much room at all [14:00:30] not bad naming [14:00:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:42] -dumps- [14:00:52] seems to be good [14:00:52] mm hmm [14:00:58] or maybe: [14:01:04] I don't know what names we'll use for the real ones yet [14:01:08] I'll worry about that later [14:01:10] dumps- [14:01:16] okie [14:01:27] the test ones were just to test the script and the s3 interface [14:01:53] they could have been called blurpybloop for all I would care (except then the users would never find them and they might possibly be useful to a few folks) [14:02:47] looks good though [14:02:51] for the first run [14:03:15] The metadata is there [14:03:34] yes. generated from the name of the db and from the sitematrix [14:05:05] looks promising to me :) [14:05:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:34] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:34] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:34] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:40] lots of ways for it to break [14:05:49] sure I'll find them all in the course of running it :-D [14:06:07] yeah, like private wikis? 
[14:06:26] sitematrix lists them [14:06:27] no [14:06:35] sure it lists them [14:06:39] but they won't be uploaded [14:06:48] uploads will only be done from the public wikis [14:07:04] * Hydriz isn't sure if that is a good or bad thing :P [14:07:07] the sitematrix is used to look up information for the thing being created [14:07:22] Hydriz had better get sure [14:07:47] Oh yes another question [14:08:22] is there danger if I request from dumps and get too many 404 errors? [14:08:28] *dumps.wikimedia.org [14:08:58] what do you mean "too many 404s"? [14:09:44] I mean, let's say I write a script to "crawl" the dumps.wikimedia.org server [14:10:13] so first, I hope you will talk to me before you do that [14:10:21] I would much rather you run something like that on a mirror [14:10:28] (which we will soon have I believe) [14:10:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:31] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:33] oh good, then all the more I shouldn't be doing it :P [14:10:58] :-D [14:11:12] * Hydriz isn't evil towards servers [14:11:48] except for those servers that have ~90TB of disk drive [14:12:05] (referring to IA servers :P) [14:12:10] ah hah [14:14:37] uh, did search just die? 
[14:15:23] no results for anything i search for [14:15:30] :-( [14:15:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:56] notpeter: as our current search guru is there really nothing we can do? [14:16:26] closedmouth: what did you do? can you recreate? [14:16:30] what wiki? [14:16:38] http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=foo&fulltext=Search [14:17:17] WORKSFORME [14:17:40] > There were no results matching the query. [14:17:44] closedmouth: that link, exactly, isn't working for me, but the search bar is working for me on the main page [14:18:33] also, apergos, take it back! don't call me our search guru! [14:18:34] =P [14:18:57] credit where credit is due :-P [14:19:16] ah, looks like it affects me now :( [14:20:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:33] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:43] when I remove &fulltext=Search it works [14:21:42] it randomly doesn't work.. while it always works on the backend.. 
see my comment on https://bugzilla.wikimedia.org/show_bug.cgi?id=34518 [14:22:00] also some en.wp servers seem to be just going dead recently [14:22:15] not sure if anything has changed in last ~ week [14:22:55] the comment is about search pool 2, but i think it might be affecting other wikis as well [14:23:14] rainman-sr: could it be caused by a timeout to the search nodes? [14:24:16] notpeter,i dont know.. the funny thing is that as i said, when i tried the query always worked on the backend with times <1s, but when going through wiki it would sometimes wait 10s then fail, or return the results right away [14:24:37] which makes me think that there might be something wrong with how apaches call the search backend [14:24:38] oh, that is really weird... [14:24:44] yeah [14:25:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:31] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:39] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.80473678261 (gt 8.0) [14:28:09] ah, yes, the search pool is unhappy: https://nagios.wikimedia.org/nagios/cgi-bin/history.cgi?host=search-pool1.svc.pmtpa.wmnet&service=LVS+Lucene [14:29:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds [14:33:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.907 seconds [14:35:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:35:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:35:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:35:34] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:34] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:34] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:34] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:34] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [14:55:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:36] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.96843736842 [15:00:27] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:00:27] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:00:27] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:00:27] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:32] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:32] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:32] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:32] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:29] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:29] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.062 seconds [15:15:26] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:26] PROBLEM - check_minfraud1 
on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:26] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:26] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:23] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:23] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:23] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:23] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:29] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:29] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:29] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:29] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:05] RECOVERY - check_minfraud1 on payments3 is OK: OK [15:32:06] RECOVERY - check_minfraud1 on payments2 is OK: OK [15:32:06] RECOVERY - check_minfraud1 on payments1 is OK: OK [15:32:06] RECOVERY - check_minfraud1 on payments4 is OK: OK [15:33:12] RobH: The on/off warnings from nagios for payment* have been going for about 24h now. Anything to worry about? 
[15:33:25] wonder what's going on [15:35:59] PROBLEM - Disk space on srv285 is CRITICAL: DISK CRITICAL - free space: / 277 MB (3% inode=56%): /var/lib/ureadahead/debugfs 277 MB (3% inode=56%): [15:47:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.365 seconds [15:50:50] PROBLEM - Host db1026 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:41] RECOVERY - Lucene on search15 is OK: TCP OK - 0.008 second response time on port 8123 [15:59:14] RECOVERY - Lucene on search3 is OK: TCP OK - 0.012 second response time on port 8123 [16:02:59] RECOVERY - Lucene on search9 is OK: TCP OK - 0.001 second response time on port 8123 [16:04:02] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:04:11] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:05:59] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.863 second response time [16:06:08] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [16:06:26] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:23] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.499 second response time [16:08:32] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:50] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:50] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:17] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:35] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:02] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [16:10:02] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:02] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:38] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [16:10:38] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.838 second response time [16:10:47] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:11:14] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [16:11:32] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [16:11:50] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:11:59] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.980 second response time [16:12:08] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [16:12:17] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.455 second response time [16:12:26] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:44] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.573 second response time [16:12:45] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.691 second response time [16:13:56] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [16:14:23] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 251 MB (3% inode=62%): /var/lib/ureadahead/debugfs 251 MB (3% inode=62%): [16:14:23] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:23] RECOVERY - Apache HTTP on mw53 is 
OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.539 second response time [16:15:53] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [16:16:11] RECOVERY - Apache HTTP on mw9 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.891 second response time [16:17:41] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out [16:18:17] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:20] who pulled the plug? [16:19:29] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [16:19:38] RECOVERY - Disk space on srv285 is OK: DISK OK [16:20:23] RECOVERY - Disk space on srv219 is OK: DISK OK [16:22:02] RECOVERY - Lucene on search15 is OK: TCP OK - 8.997 second response time on port 8123 [16:22:02] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:02] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:02] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:59] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [16:24:00] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.074 second response time [16:24:08] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:08] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:08] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:08] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:35] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.241 second response time [16:24:44] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[16:25:56] PROBLEM - LVS HTTP on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:56] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:05] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.173 second response time [16:26:32] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:08] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.314 seconds [16:27:44] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:53] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [16:27:53] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:53] RECOVERY - LVS HTTP on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 57727 bytes in 4.041 seconds [16:28:02] RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.705 second response time [16:28:11] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.602 second response time [16:28:56] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.592 second response time [16:29:23] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:41] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [16:31:11] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.852 second response time [16:31:20] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.928 second response time [16:32:14] RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.747 second 
response time [16:32:14] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.625 second response time [16:32:23] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.220 second response time [16:32:41] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:53] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:53] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:02] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [16:34:29] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [16:34:38] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:38] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:41] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.815 second response time [16:36:35] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.209 second response time [16:36:35] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [16:36:44] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.403 second response time [16:37:47] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.041 second response time [16:38:14] RECOVERY - Lucene on search3 is OK: TCP OK - 0.001 second response time on port 8123 [16:38:23] RECOVERY - Lucene on search9 is OK: TCP OK - 0.000 second response time on port 8123 [16:38:23] RECOVERY - Lucene on search15 is OK: TCP OK - 0.002 second response time on port 8123 [16:41:05] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [16:45:53] I've just tried to use the search API, and it helpfully informed me that it 
was disabled. [16:48:26] New patchset: Sumanah; "Additional author for test commit" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2713 [16:50:59] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [16:52:35] New review: Sumanah; "I love this change! So rockin'!" [test/mediawiki/extensions/examples] (master) C: 1; - https://gerrit.wikimedia.org/r/2713 [16:55:29] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [16:55:29] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [16:56:32] PROBLEM - check_gcsip on payments1 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [17:00:35] RECOVERY - check_gcsip on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.725 second response time [17:00:35] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.598 second response time [17:00:35] RECOVERY - check_gcsip on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.583 second response time [17:00:47] New review: Sumanah; "Guybrush is so great and a substantive contributor to our community." 
[test/mediawiki/extensions/examples] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2713 [17:00:47] Change merged: Sumanah; [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2713 [17:01:38] RECOVERY - Lucene on search15 is OK: TCP OK - 0.001 second response time on port 8123 [17:02:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.950 seconds [17:08:12] New patchset: Sumanah; "thinking seriously about our future" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2714 [17:11:05] PROBLEM - Puppet freshness on db46 is CRITICAL: Puppet has not run in the last 10 hours [17:11:05] PROBLEM - Puppet freshness on mw1002 is CRITICAL: Puppet has not run in the last 10 hours [17:11:41] New patchset: Demon; "Evil plans!" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2715 [17:12:21] Change abandoned: Sumanah; "I do not like your plans, Evil Chad!" 
[test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2715 [17:35:32] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [17:35:32] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out [17:36:35] PROBLEM - check_gcsip on payments2 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [17:40:29] RECOVERY - check_gcsip on payments4 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 1.328 second response time [17:40:29] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.603 second response time [17:40:29] RECOVERY - check_gcsip on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.586 second response time [17:40:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.009 seconds [17:46:15] New patchset: Lcarr; "removing defunct ganglia1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2716 [17:46:44] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2716 [17:46:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2716 [17:50:32] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [17:50:32] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [17:55:29] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 4.329 second response time [17:55:29] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [18:00:35] RECOVERY - check_gcsip on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.164 second response time [18:01:02] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [18:04:47] RECOVERY - Lucene on search15 is OK: TCP OK - 2.996 second response time on port 8123 [18:17:14] PROBLEM - Lucene on search15 
is CRITICAL: Connection timed out [18:20:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.239 seconds [18:22:46] New patchset: Bhartshorne; "adding in partman configuration for ms-be hosts. also whitespace retabbing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2717 [18:24:03] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2717 [18:24:05] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2717 [18:39:26] PROBLEM - Host labstore1 is DOWN: PING CRITICAL - Packet loss = 100% [18:48:49] !log aaron synchronizing Wikimedia installation... : deploying r112128 [18:48:52] Logged the message, Master [18:58:17] New patchset: Pyoungmeister; "eqiad != pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2718 [18:58:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:59:34] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2718 [18:59:35] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2718 [19:00:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.156 seconds [19:05:19] !log running sync-common on srv256 because Aaron gets key errors for that box [19:05:21] Logged the message, Master [19:05:24] Ryan_Lane: any idea why srv255 gives key errors with SSH for me and no one else? [19:05:39] *256 [19:05:47] I dunno, is it pooled? [19:06:14] hm. it's pooled [19:06:14] sync done. [19:06:16] lemme see [19:06:36] lemme run puppet on it [19:07:25] you haven't got a key for it in your known_hosts have you? 
[19:09:03] It's already fixe [19:09:05] d [19:09:12] Like Reedy said, it was a known_hosts issue [19:13:08] !log aaron synchronized php-1.19/includes/logging/LogFormatter.php 'deployed r112136' [19:13:11] Logged the message, Master [19:15:35] RECOVERY - Puppet freshness on bast1001 is OK: puppet ran at Wed Feb 22 19:15:12 UTC 2012 [19:16:03] !log catrope synchronized php-1.19/thumb.php 'missing global' [19:16:05] Logged the message, Master [19:16:29] RECOVERY - Puppet freshness on spence is OK: puppet ran at Wed Feb 22 19:16:27 UTC 2012 [19:18:30] !log catrope synchronized php-1.19/thumb.php 'missing name key' [19:18:32] Logged the message, Master [19:27:27] !log catrope synchronized php-1.19/thumb.php 'attempt at debugging' [19:27:30] Logged the message, Master [19:27:53] New patchset: Ryan Lane; "Changing smtp host, on Reedy's request" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2719 [19:28:02] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123 [19:28:06] Reedy: ^^ review, ple [19:28:08] *pls [19:28:45] New review: Reedy; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2719 [19:29:08] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2719 [19:29:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2719 [19:30:51] !log catrope synchronized php-1.19/thumb.php 'disable debugging' [19:30:53] Logged the message, Master [19:31:56] !log catrope synchronized php-1.19/thumb.php 'readd debugging for 404s' [19:31:59] Logged the message, Master [19:33:10] New patchset: Bhartshorne; "adding a new partman config for ms-be hosts to create a tiny bios partition for grub" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2720 [19:33:33] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2720 
[19:33:33] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2720 [19:35:00] !log catrope synchronized php-1.19/thumb.php 'use temp path correctly' [19:35:03] Logged the message, Master [19:36:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:29] !log catrope synchronized php-1.19/thumb.php 'Add logging for no path supplied error' [19:39:31] Logged the message, Master [19:39:35] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Wed Feb 22 19:39:07 UTC 2012 [19:41:12] !log catrope synchronized php-1.19/thumb.php 'more logging' [19:41:14] Logged the message, Master [19:42:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.048 seconds [19:45:26] New patchset: Pyoungmeister; "fqdns: not so much. oh well, doesn't really matter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2721 [19:45:44] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [19:46:17] !log catrope synchronized php-1.19/thumb.php 'more logging' [19:46:20] Logged the message, Master [19:48:32] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2721 [19:48:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2721 [19:48:45] what is the current plan for 1.19 deployment to Commons? [19:49:13] !log catrope synchronized php-1.19/thumb.php 'pass in a Title object to UnregisteredLocalFile' [19:49:15] Logged the message, Master [19:49:51] got the answer in commons channel ;) [19:52:57] New patchset: Lcarr; "Making tweaks for nagios3 installation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2722 [19:53:19] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2722 [19:54:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2722 [19:54:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2722 [20:00:35] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:35] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:35] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:35] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:11] RECOVERY - Lucene on search15 is OK: TCP OK - 2.992 second response time on port 8123 [20:01:38] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.027 second response time on port 8123 [20:02:59] RECOVERY - Host labstore1 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:05:23] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:23] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:23] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:23] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:11] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [20:10:29] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:29] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:29] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:29] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:13:11] PROBLEM - Puppet freshness 
on owa1 is CRITICAL: Puppet has not run in the last 10 hours [20:13:11] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [20:14:05] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [20:15:26] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:26] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:27] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:27] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:44] RECOVERY - Lucene on search1004 is OK: TCP OK - 0.034 second response time on port 8123 [20:17:14] RECOVERY - Lucene on search1009 is OK: TCP OK - 0.026 second response time on port 8123 [20:17:23] RECOVERY - Lucene on search1006 is OK: TCP OK - 0.027 second response time on port 8123 [20:17:50] RECOVERY - Lucene on search1005 is OK: TCP OK - 0.031 second response time on port 8123 [20:17:50] RECOVERY - Lucene on search1012 is OK: TCP OK - 0.027 second response time on port 8123 [20:17:59] RECOVERY - Lucene on search1011 is OK: TCP OK - 0.031 second response time on port 8123 [20:18:08] RECOVERY - Lucene on search1010 is OK: TCP OK - 0.027 second response time on port 8123 [20:18:08] RECOVERY - Lucene on search1013 is OK: TCP OK - 0.029 second response time on port 8123 [20:18:35] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.026 second response time on port 8123 [20:19:11] RECOVERY - Lucene on search1018 is OK: TCP OK - 0.032 second response time on port 8123 [20:19:20] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [20:19:29] RECOVERY - Lucene on search1020 is OK: TCP OK - 0.027 second response time on port 8123 [20:19:38] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.026 second response time on port 8123 [20:19:47] RECOVERY - Lucene on search1019 is OK: TCP 
OK - 0.031 second response time on port 8123 [20:20:32] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:32] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:32] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:32] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:32] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123 [20:25:29] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:30] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:30] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:30] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:26] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:27] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:27] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:27] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:33:43] New patchset: Lcarr; "Changing puppet agent timeout to 960 since 480 is sometimes not enough for puppet server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2723 [20:34:13] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2723 [20:34:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2723 [20:35:23] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:23] PROBLEM - check_minfraud2 on payments2 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:23] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:23] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:51] Reedy: could you deploy upload wizard to test2? [20:37:04] sure [20:37:33] Erik tells me there might be schema changes associated with that... [20:38:19] I was thinking the same [20:38:24] I know the uploadstash are now in core [20:38:30] but there's some feedback tables that Jeroen added [20:38:31] IIRC [20:39:28] 2 tables, 3 separate index files [20:39:29] pfffft [20:40:21] Reedy: do you have the db access you need to do that, or do we need someone with root? [20:40:29] RECOVERY - check_minfraud2 on payments2 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.159 second response time [20:40:30] RECOVERY - check_minfraud2 on payments3 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.229 second response time [20:40:30] RECOVERY - check_minfraud2 on payments4 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.158 second response time [20:40:30] RECOVERY - check_minfraud2 on payments1 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.159 second response time [20:40:31] I can do it all :) [20:40:38] xlnt! [20:43:28] !log Created uploadwizard campaign related tables on test2wiki [20:43:31] Logged the message, Master [20:44:05] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable UW on test2wiki' [20:44:08] Logged the message, Master [20:44:29] robla: there's some site specific config in commonsettings for test/commons [20:44:35] Do you know if/how erik wants those settings? 
[20:45:04] * robla asks [20:45:23] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable UW on test2wiki' [20:45:25] Logged the message, Master [20:46:07] Reedy: close as possible to commons [20:46:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:46:57] $wgUploadWizardConfig['feedbackPage'] = 'Commons:Upload_Wizard_feedback'; # Set by neilk, 2011-11-01, per erik [20:46:57] $wgUploadWizardConfig['altUploadForm'] = 'Commons:Upload'; [20:46:57] $wgUploadWizardConfig["missingCategoriesWikiText"] = "{{subst:unc}}"; [20:46:57] $wgUploadWizardConfig['blacklistIssuesPage'] = 'Commons:Upload_Wizard_blacklist_issues'; # Set by neilk, 2011-11-01, per erik [20:48:19] I'll just copy that all and replace Commons with Wikipedia then [20:49:24] !log reedy synchronized wmf-config/CommonSettings.php 'UW config for test2wiki' [20:49:27] Logged the message, Master [20:50:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.454 seconds [20:51:31] robla: enabled, though it's broken for me [20:51:47] ok on testwiki though [20:52:50] https://test2.wikipedia.org/wiki/Special:UploadWizard [20:53:28] Reedy: thumbnails being broken is a known issue. is there other breakage you're seeing? [20:53:54] on Special:UploadWizard i just see a spinner going round and round and round... [20:54:04] no JS errors in the console [20:54:10] unless [20:54:35] !log reedy synchronized php-1.19/resources/startup.js 'touch' [20:54:37] Logged the message, Master [20:55:13] !log catrope synchronized php-1.19/thumb.php 'And let's try that again' [20:55:17] Logged the message, Master [20:55:27] Reedy: Do you know more about the repetitive failure and recovery of "check_minfraud# on payments{1,4}" that's been going for about 2-3 days now? 
[20:55:58] Krinkle: Jeff_Green has said that although it's an issue, it's not a major issue [20:56:18] it's a total nonissue actually [20:56:40] apparently we were notified of IP address changes for that service on the 14th but I wasn't on the distribution list [20:56:59] well, I know nothing about it, but it's screaming and flashing the channel with dozens of line every few hours. So something is wrong, either it's too tightly checking or an issue is ignored. [20:57:29] more the latter than the former [20:57:43] but it should have just totally stopped [20:58:05] robla: seems to load now after the startup.js touch [20:58:06] they phased out the first IP this AM and it failed between the time they did that and when I logged in and saw the IRC blat [20:58:07] The page at https://test2.wikipedia.org/wiki/Special:UploadWizard displayed insecure content from http://upload.wikimedia.org/wikipedia/commons/4/42/Loading.gif. [20:58:09] Naughty [20:58:36] then after I killed that test, they phased out the second IP which caused the second round of blat [21:07:31] Reedy: did you happen to check your console when you got the spinner of doom? 
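Editor's aside on the check_minfraud2 noise discussed above: one rough way to see which checks are flooding the channel is to tally the bot's PROBLEM and RECOVERY lines per service and host. The sketch below is only an illustration. The function names, the threshold, and the regular expression are assumptions based on the message shapes visible in this log, not part of any Nagios tooling.

```python
import re
from collections import Counter

# Matches the Nagios bot lines seen in this log, e.g.
# "PROBLEM - check_minfraud2 on payments1 is CRITICAL: ...".
# The service name may not contain "[", so a match cannot run across
# the "[hh:mm:ss]" timestamp that starts the next message.
ALERT_RE = re.compile(r"(PROBLEM|RECOVERY) - ([^\[]+?) on (\S+) is (\w+)")


def count_alerts(log_text):
    """Return a Counter of (kind, service, host) -> number of notifications."""
    counts = Counter()
    for kind, service, host, _state in ALERT_RE.findall(log_text):
        counts[(kind, service, host)] += 1
    return counts


def noisy_checks(log_text, threshold=3):
    """List (service, host) pairs with at least `threshold` PROBLEM lines."""
    counts = count_alerts(log_text)
    return sorted(
        (service, host)
        for (kind, service, host), n in counts.items()
        if kind == "PROBLEM" and n >= threshold
    )


# A few lines copied from the log above, kept flattened as in the source.
sample = (
    "[20:25:29] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds "
    "[20:30:27] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds "
    "[20:40:30] RECOVERY - check_minfraud2 on payments4 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.158 second response time "
    "[20:46:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds "
)

print(noisy_checks(sample, threshold=2))
```

Host-level lines such as "Host labstore1 is DOWN" have no "on" part and are deliberately skipped by this pattern; a fuller tally would need a second pattern for them.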
[21:07:43] yup, nothing [21:08:05] then after i touched startup.js, hard refresh, saw the http related error above, and it worked [21:08:37] !log updated the payments cluster to r112145 [21:08:39] Logged the message, Master [21:09:04] Worked first time when tried in another browser [21:12:39] * johnduhart wonders who K4-713 is [21:13:57] johnduhart: one of the fundraising developers [21:14:01] Legit :) [21:14:14] Ah [21:18:10] !log catrope synchronized php-1.19/thumb.php 'Cleanup: rename fake repo and add comments' [21:18:12] Logged the message, Master [21:22:16] yep, Special:UploadStash thumb & primary urls used for IE8 and work fine [21:23:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:30:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.028 seconds [21:30:40] AaronSchulz: RoanKattouw_away: Reedy: should make another deployment attempt? [21:31:34] What about the commons master? [21:31:55] Are we likely to see any more issues? Certainly worth getting ops to check the error logs for anything glaringly obvious that might hit us again [21:32:07] that's a good point [21:32:18] Presumably Tim didn't do it due to lack of time [21:32:27] (having to do other stuff too) [21:32:40] RECOVERY - Lucene on search15 is OK: TCP OK - 8.998 second response time on port 8123 [21:32:41] db22 [21:32:51] woosters: ^ [21:32:57] ya [21:33:14] * robla walks over to talk to woosters IRL [21:33:18] ook [21:38:40] Reedy: we believe it should be fine. We will have to do a master swap at some point, but it's not something we need to worry about now [21:43:22] * robla actually remembers to set the banner this time [21:43:58] wow, http://wikimedia.com/ isn't just showing a "wiki doesn't exist" screen. It's showing the one we had yeeeeeears ago. 
[21:44:00] that's ancient [21:44:12] and heavily broken with overlaps and inexisting projects [21:44:18] layout bugs [21:44:18] robla: has anyone at least given a cursory glance over the logs to make sure there's nothing glaringly obvious? [21:44:31] logs schmogs! [21:44:51] fair enough :p [21:44:55] how long do we have to wait? [21:44:57] RECOVERY - DPKG on erzurumi is OK: All packages OK [21:44:59] AaronSchulz: wanna look? [21:45:15] I don't think we've ssh access to the db servers [21:45:38] Reedy: oh, you're talking about logs on db22? [21:45:42] ya [21:45:53] Nagios is still only showing the same disk failure [21:46:00] * robla asks on #wikimedia-operations [21:46:03] which is alright/known [21:46:54] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [21:48:19] oh, robla, we need to change that thing that RoanKattouw told tim to change [21:48:26] for the caching [21:48:48] Hmm right [21:48:51] $wgResourceLoaderMaxage['unversioned'] = array( 'server' => 30, 'client' => 30 ); [21:48:53] you said 5? [21:48:55] did we ever changed it back? [21:48:57] No 30 is fine [21:49:05] Is it set to 30 for commons right now? [21:49:05] ah, right then, can leave it as is [21:49:08] yup [21:49:10] That's fine, cause the default is 300 [21:49:14] in a conditional for commons [21:49:23] Excellent [21:49:28] don't need to worry about that then :) cheers [21:53:09] rumor has it Jeff_Green is looking at the db22 logs now [21:53:19] i am looking yeah [21:53:48] Reedy: any other pre-flight checks? :) [21:54:16] As long as we've deployed all the irc related code, I think we're good [21:54:35] so yeah there's a dead disk, obviously [21:54:57] from #wikimedia-commons: (01:52:12 PM) Saibo: btw: today many edit atttempts did not return after sending [21:55:01] RAID appears to be doing the right thing--I don't see any evidence of filesystem corruption, and so mysql is not complaining [21:55:13] i'd expect performance degradation for sure though [21:55:29] k...thanks! 
[21:55:52] well, qualify that--it depends on the RAID config. trying to get more info on that now [21:56:22] RAID-0! [21:56:24] :) [21:56:39] i'm not going to laugh until i know taht is not the case :-P [21:57:16] raid10 [21:57:21] morning TimStarling [21:57:48] mornin' Tim....we're finally getting around to the commons deploy [21:58:35] morning [21:58:47] UploadWizard is tested and fixed now? [21:59:17] yup...we deployed to test2 and futzed around some [21:59:43] nothing exhaustive, but it's at least more functional than before [21:59:55] while we are at it: is also the multi file selection patch deployed? that would be nice [22:00:03] (for up.wizard) [22:00:06] When was that committed? [22:00:14] a second [22:00:24] https://bugzilla.wikimedia.org/show_bug.cgi?id=34333 2012-02-16 [22:00:36] multi file selection fails for Firefox currently [22:00:38] Nope [22:00:42] :( [22:00:45] ok [22:00:49] I'll merge and push it latert on if the deploy goes ok [22:01:07] no worries [22:03:15] RECOVERY - Lucene on search15 is OK: TCP OK - 8.995 second response time on port 8123 [22:03:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:20] TimStarling: we didn't do the master swap, but that's not a blocker for deployment, right? [22:03:42] right, not a blocker [22:03:42] Jeff_Green: you comfortable with what you're seeing? [22:04:07] well . . . I never feel good about RAID boxes with failed disk, so I don't love it [22:04:18] that said, for the moment the controller appears to be handling it fine [22:04:36] alright.... Reedy, ready to push the button? [22:05:12] * AaronSchulz can do it [22:05:27] Can if you want [22:05:52] do we need to flip a coin? :) [22:06:01] nvm [22:06:09] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: commonswiki -> 1.19 [22:06:11] Logged the message, Master [22:06:52] Looks better from the outset [22:07:21] no JS errors! 
wow [22:07:30] I see it's still pulling JS from prototype.wikimedia.org [22:07:42] breaking HTTPS [22:07:43] ... [22:07:51] to quote brion: "whyyyyyy?" [22:08:04] Ryan_Lane: can you acidentaly prototype? [22:08:08] *accidentally [22:08:16] -_- [22:08:23] reboot? delete? [22:08:31] turn off works [22:08:34] maybe mdale knows [22:08:36] why? [22:08:59] why? [22:09:01] Unknown error: "unknown" [22:09:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.187 seconds [22:09:10] http://prototype.wikimedia.org/mwe-gadget/mwEmbed/ResourceLoader.php?class=mwEmbed,mw.style.mwCommon,mw.RemoteSearchDriver,mw.ClipEdit,mw.style.ClipEdit,$j.ui.progressbar,$j.ui.sortable,$j.ui.datepicker&uselang=en&urid=r192 [22:09:13] unknown error is unknown....heh [22:09:16] does that mean anything to you, mdale? [22:09:19] yea [22:09:36] like a test server being used to serve essential JS over HTTP to a production server on HTTPS? [22:09:36] mwEmbed had its own resource loader php etc... need a php env to do stuff [22:10:03] there was not a lot of gadget infrastructure back then... [22:10:19] I suppose I could have put it on tool server... does tool server have https? [22:10:25] mdale: yes [22:10:28] I assume this is the resource loader that I reviewed and rejected [22:10:34] yep ;) [22:11:20] The primary purpose of that gadget is the video player [22:11:38] if it is code that is not production quality, we should not be using it in production, regardless of where it is running [22:11:43] which in theory will be deployed sometime in the coming months.. in which case we can deprecate the gadget. [22:14:15] is the issue the gadget is on by default on some wiki or something? [22:15:24] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [22:15:42] yes, "some wiki" being commons [22:15:57] about which gadget are you talking? 
[22:16:05] mwEmbed [22:16:34] tagged as " (currently in beta)  " anyway [22:16:34] why is it on-by-default on commons? [22:16:41] and I think you're using the term "gadget" loosely [22:16:55] it is not default on Commons [22:16:58] since the gadget part is just a loader for a thing on prototype.wikimedia.org [22:17:02] sure [22:17:25] "mwEmbed [ResourceLoader]|mwEmbed.js" [22:17:30] so it is not default [22:17:38] hmm [22:17:51] well.. unless it is hacked somewhere in common.js ;) [22:18:12] apparently not [22:19:04] I don't think its on by default. There are links to enable it on videos with captions and things like that... [22:19:09] RECOVERY - Lucene on search15 is OK: TCP OK - 0.006 second response time on port 8123 [22:19:23] !log aaron synchronized wmf-config/swift.php [22:19:25] Logged the message, Master [22:19:44] you can enable it with a click on a link "withJS", yes [22:20:08] https://commons.wikimedia.org/w/index.php?title=File:President_Obama_on_Death_of_Osama_bin_Laden.ogv&withJS=MediaWiki:MwEmbed.js for example [22:20:34] yes, confirmed, then it loads form http://protoype ... [22:20:40] *from [22:21:13] !log aaron synchronized wmf-config/swift.php 'avoid notices' [22:21:16] Logged the message, Master [22:21:30] so it enables itself? [22:21:36] withJS [22:21:41] well, not really - with user interaction [22:21:45] click needed [22:22:39] apparently 810 people have it enabled [22:22:47] we could knock it off for https if that is such a bad problem [22:23:01] but would then be confusing for them [22:23:18] why can't the code get transferred to commons itself? [22:23:44] it would be work.. and we have the new version in an extension where it should be [22:24:04] but prototype is not able to serve via https? [22:24:14] prototype is a test server [22:24:17] yea... [22:24:22] hmm [22:24:25] I agree its not ideal [22:24:30] the code can't run on commons because it can't pass code review [22:24:49] ? .. 
keep in mind this code is 2 years+ old [22:24:50] code review for gadgets?! [22:24:57] and the TMH was reviewed a few times [22:25:16] it's only a gadget because it failed code review as a part of the MW core [22:25:20] * AaronSchulz gets out the popcorn [22:25:28] another question: is this anything related to 1.19?! ;) [22:26:04] presumably mdale moved it all out to an extension and installed the extension on prototype [22:26:40] basically yes.. I mean it was a stand alone video player.. [22:26:41] * saper lights up a cigar [22:26:46] as well [22:27:00] and a stand alone resource loader [22:27:05] * Damianz puts mdale in the firing line [22:27:10] * Saibo scratches his head and thinks that all that also was the case before today.. [22:27:25] * ^demon grabs the sodas and sits next to Aaron and his popcorn [22:27:55] if we have no real problems we invent some? [22:27:59] * mdale is also confused [22:28:09] what exactly is the issue? [22:28:19] ( that has not been an issue for 2 years ) ? [22:28:19] robla: so what is wrong with UW? [22:28:21] yes, it was the same before today, and I knew about the gadget or site JS before today [22:28:34] but I only found out that it was running on prototype yesterday, that's why I brought it up [22:28:48] I'm pretty sure prototype didn't exist 2 years ago [22:29:09] sorry 1.5 years ago [22:29:30] AaronSchulz: "Drop media file to donate here" button sometimes fails. if you hit reload enough times, it will stop bringing up the file open dialog [22:29:38] * robla is filing bug now [22:29:47] that sounds like deja vu [22:30:01] anyway if it's not on by default then it can just be disabled [22:31:08] at most 810 people will mind, which is not so bad compared to millions [22:31:23] but.. eh.. [22:31:25] the reason is the HTTPS breakage [22:31:28] if they want to have it?! 
[22:31:35] Tough [22:32:10] and the fact that prototype.wikimedia.org is a slow obsolete test server that will probably be shut down soon [22:32:12] Stuff running via gadget isn't ever promised to be working, so if it breaks or disabled, so be it [22:32:50] do I understand correctly that this is more or less dead code? [22:32:53] People complain we don't support HTTPS, we do it, then they complain stuff isn't protocol relative, so we fix it. Then they go and include random JS from a remote server via HTTP into their HTTPS session [22:32:56] without updates? [22:32:57] Fuck yeah. Well done you [22:33:27] yes, that sucks [22:33:45] btw .. it is 2 years old http://commons.wikimedia.org/w/index.php?title=MediaWiki:MwEmbed.js&action=history [22:33:57] Doesn't make it right [22:33:58] I didn't ask to have my HTTPS broken, I just clicked some "use new player" link a couple of years ago [22:34:11] what about copying the code to commons? [22:34:17] then it is https? [22:34:32] it can be hosted in the MediaWiki namespace if that's what you mean [22:34:38] yes [22:34:48] sure - like all other gadgets [22:34:51] The code needs fixing/improving/whatever so it will pass code review [22:35:02] not for being a gadget ;) [22:35:42] Saibo: its work to do that.. it has url images, assets, sub modules ... but all that code is old not maintained the code reviewed version is in TimedMediaHandler. [22:36:09] Saibo: Although they don't need CR, If there is dangerous code on a wiki, it [the gadget] can be killed at any time [22:37:20] how is that withJS controlled? is all code in MediaWiki namespace executable with withJS? [22:37:30] Saibo: yes [22:37:47] hrmm.. 
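[editor's note] The HTTPS breakage discussed above comes from an HTTPS page pulling a script over plain HTTP. Below is a minimal sketch of that check, assuming a hypothetical helper named is_mixed_content; it is not code from MediaWiki or the gadget, and it only compares URL schemes the way a browser resolves them.

```python
from urllib.parse import urljoin, urlsplit


def is_mixed_content(page_url: str, resource_url: str) -> bool:
    """Return True when an HTTPS page would load the resource over plain HTTP.

    Hypothetical helper for illustration only.
    """
    # Resolve relative and protocol-relative ("//host/path") references
    # against the page URL, as a browser would.
    resolved = urljoin(page_url, resource_url)
    page_scheme = urlsplit(page_url).scheme
    resource_scheme = urlsplit(resolved).scheme
    return page_scheme == "https" and resource_scheme == "http"


page = "https://commons.wikimedia.org/wiki/Main_Page"

# Hard-coded http:// script on an HTTPS page: flagged.
flagged = is_mixed_content(
    page, "http://prototype.wikimedia.org/mwe-gadget/mwEmbed/ResourceLoader.php"
)

# Protocol-relative reference inherits https from the page: not flagged.
inherited = is_mixed_content(
    page, "//prototype.wikimedia.org/mwe-gadget/mwEmbed/ResourceLoader.php"
)
```

A protocol-relative URL only passes this scheme check; it would still fail in practice if the remote host does not serve HTTPS, which is what the channel suggests about prototype.wikimedia.org.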
[22:38:11] withJS is a [[MediaWiki:Common.js]] hack [22:38:11] afaik the video player does not work anyway :D [22:38:20] *trying* [22:39:23] yup, subtitles don't work [22:39:33] maybe other stuff works [22:40:08] if commons is that desperate to run said code, and not have it break ssl connections, we could probably dump it on a labs instance compared to prototype... Although getting the extension up to scratch would be much more preferable [22:40:20] I searched the mediawiki namespace for prototype.wikimedia.org, nothing else uses it apparently [22:40:33] so we can shut down that server now [22:40:51] I do not know how many people like to use it - I do not like to [22:40:56] <^demon> Nobody from features is still using prototype, right? [22:41:01] TimStarling: Only checked commons? [22:41:08] the only thing I used it for was subtitles - and that is broken since months [22:41:15] TimStarling: Ryan_Lane said people had a fit when it was last restarted... [22:41:18] only commons, I can always check referer logs [22:41:31] wmf people did [22:41:38] because they were actively developing using it [22:42:17] https://en.wikipedia.org/wiki/MediaWiki:Gadget-mwEmbed.js [22:43:24] <^demon> Ryan_Lane: Wmf people should use labs instead :) [22:43:31] yes, they should [22:43:36] but not everyone has switched yet [22:43:47] fr and es wiki as well, btw [22:43:48] I'm going to send an email saying in one month it'll be turned off [22:43:55] and one month after that it'll be deleted [22:43:59] yeah, I see logins for some feature people [22:43:59] <^demon> +1 [22:44:10] we need to kill all the vms on tesla [22:44:13] and take back the hardware [22:44:31] not even a dozen users on labs right now [22:44:48] <^demon> Ryan_Lane: Hardware reclamation :) [22:44:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:18] well, less than a dozen currently logged in [22:45:29] about 120 users [22:45:33] ah [22:45:36] I am preparing a village 
pump notice.. [22:45:38] half of which are active [22:46:28] 46 unique users in last ;) [22:46:40] that's since Feb 1 [22:47:01] <^demon> Saibo: Which one of the 20 or so enwiki village pumps are you going to post to? ;-) [22:47:26] Ryan_Lane: post a list of vms too.. People can pre-emptively can then pre-emptively have their vm destroyed! I'm sure people have stuff they don't use [22:47:39] I actually don't know what they are [22:47:46] I can't log into the management interface. no windows [22:47:49] my vm died [22:47:59] sad [22:48:01] prototype has been used by 8 people in the last 3 weeks [22:48:07] * Ryan_Lane groans [22:48:17] do they just not know about labs yet? heh [22:48:26] Is it ESXi? [22:48:30] andrew, bsitu, catrope, ibaker, kaldari, laner, mdale, omniti [22:48:34] yep, esxi [22:48:44] ah, I logged in to fix kaldari's account [22:48:49] <^demon> TimStarling: To the village stocks with them [22:48:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.742 seconds [22:49:08] New review: Hashar; "Please note the test/mediawiki repo will be destroyed and that commit will be lost :-D" [test/mediawiki/extensions/examples] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2714 [22:49:08] Change merged: Hashar; [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2714 [22:49:18] Why did omiti start using them? :/ [22:49:21] no clue [22:49:32] <^demon> "omniti" sounds like a role account :\ [22:49:33] ^demon: to COM:VP [22:49:36] * ^demon frowns [22:49:45] it probably is [22:49:54] anyhoo....any performance issues with commons? [22:49:56] it's the AFTv5 folk [22:50:05] role accounts are evil [22:50:14] <^demon> I thought AFTv5 was using labs_enwiki? [22:50:19] tstarling cleared profiling data [22:50:22] <^demon> Are they *also* using prototype? 
[22:50:25] * ^demon sighs [22:50:36] https://encrypted.google.com/search?aq=f&ix=hea&sourceid=chrome&ie=UTF-8&q=omniti [22:50:51] OmniTi [22:50:55] must be them [22:51:05] * derpalicious hugs Jyothis  [22:51:10] there is like ~6 omniti people with commit access [22:51:49] o.O [22:54:59] Thehelpfulone told me he had already been banned in a lot of channels [22:55:01] reedy cleared profiling data [22:55:08] Ryan_Lane: I've got the vSphere Client installed if needed and there's someway of getting in.... [22:55:22] well, you'd need to have credentials, which you do not [22:55:23] noc.wikimedia.org is not responding or very slow, so it's hard to get anything useful out of that profiling data [22:57:00] noc seems fine for me [22:57:48] I reduced its MaxClients yesterday since it seemed to be the cause of fenari needing a reboot [22:58:14] so slowness will probably come and go [22:58:20] the main user seems to be pybal [22:58:21] TimStarling: have commented out the code on the MediaWiki page at Commons - otherwise it would still be executable via withJS, wouldn't it? [22:59:03] looks like we've got a backlog of stuff to merge from trunk: [22:59:04] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/tag/1.19wmf1 [22:59:20] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/tag/1.19 [23:00:00] since things seem to be reasonably calm, and we're keeping a close eye on things, maybe now is a good time to get through the backlog? [23:03:39] ok [23:05:19] and review [23:06:14] 111796 needs a followup, so I'll do that now [23:06:14] TimStarling: http://www.mediawiki.org/wiki/Special:Code/MediaWiki/107309 [23:06:48] heh, r109469 is still new [23:07:19] yeah, I guess r107309 needs to be reverted in 1.19wmf1 [23:07:59] where's chad? [23:08:16] left 13 minutes ago... [23:08:34] AaronSchulz: on this code review, https://www.mediawiki.org/wiki/Special:Code/MediaWiki/107906 - does "okay" mean that it will be deployed in 1.19? 
[23:08:39] hey, guys, does any sysadmin have a few minutes to spare? I could really use some help setting a new interlanguage/project link on wikimedia sites (namely, there should be a rswikimedia: prefix directing at http://rs.wikimedia.org, without a need for sidebar placement, just interwiki capabilities) [23:08:51] just wondering about https://www.mediawiki.org/wiki/Special:Code/MediaWiki/111643#c31359 [23:10:42] dungodung|away: add it on the interwiki map on meta [23:10:54] Thehelpfulone: it appears [23:11:30] Reedy: I can do that, but I kinda hoped it would get done now ;) [23:11:36] so it should be working on meta for example AaronSchulz? [23:11:53] ah, actually, there is an entry for it: wmrs [23:11:56] never mind then [23:11:58] lol [23:12:12] That was easy. (tm) [23:12:39] Reedy: stop pressing that button! [23:13:11] New patchset: Lcarr; "Trying to move exported resources in new nagios host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2724 [23:13:19] Reedy: I see that you commented on https://www.mediawiki.org/wiki/Special:Code/MediaWiki/111217 - who can code review it if werdna made it? [23:13:39] Anyone who isn't werdna [23:13:49] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2724 [23:13:50] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2724 [23:14:01] I meant that knows the Abuse Filter ;) [23:15:58] They're all reviewed now... [23:16:02] * Reedy starts merging [23:16:14] New patchset: Pyoungmeister; "need the quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2725 [23:16:51] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2725 [23:16:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2725 [23:17:32] AaronSchulz: did you see my hack attack on FlaggedRevs yesterday? 
[23:17:46] nope [23:18:11] it's pretty obvious now where ExtensionMessages-1.19.php came from, since mergeMessagesFileList.php wouldn't run on 1.19wmf1 [23:18:24] about 5 extensions separately broke it [23:18:32] FlaggedRevs being one of them [23:18:42] I'm sure you'll like my fix involving removing type hinting [23:18:50] "Technical maintenance is underway. Temporary issues may arise but will be resolved shortly" is still on - don't want to shut down? [23:19:53] how would I have seen the hacks, was this logged somewhere? [23:20:13] Saibo: looks like CentralNotice... [23:20:25] Yeah, robla set it [23:20:32] AaronSchulz: it's in Subversion [23:20:44] Saibo: do you want to turn it off? [23:20:53] I would say so - yes [23:20:59] everything seems smooth [23:21:11] just navpopups do not work in Firefox for some reason [23:21:18] it's set to turn off at 06:00 UTC on 23/02/2012 [23:21:24] 112092 [23:21:27] or 02/23/2012 if you're american [23:21:35] ah, it's in /trunk [23:21:46] Thehelpfulone: okay, leave it that way then [23:21:48] usually when I make a commit you mark it OK within about 5 minutes [23:21:55] so I figured you might have seen it already [23:22:53] Saibo: I can shorten the time to get it to turn off at 01:00 UTC for example if you'd prefer, that way there wouldn't be a notice when there's no maintenance [23:23:04] or even 00:00 UTC, 00:30 UTC [23:23:19] any time at all, you just have to pick :) [23:23:52] Thehelpfulone: kill it :-D "Technical maintenance is underway." is wrong [23:24:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:11] should be gone now :) [23:26:17] TimStarling: so those functions are just getting passed garbage so you removed the type hinting? 
[23:27:09] pretty much, they are given null [23:27:26] which actually works just fine in that context, you could argue that it's not garbage [23:27:43] $a = null; $a[] = 1; [23:27:49] won't even give a notice [23:27:50] guys, I have a problem with rswikimedia again. I get Не могу да преименујем датотеку „/tmp/phprUVx2n“ у „public/e/e6/Finansijski_izveštaj_2011.pdf“. [23:28:04] in translation, "Can't rename file $1 to $2" [23:28:23] I would really really appreciate if someone could look into this, tia [23:28:43] (this happened last time I tried to upload a file, but I can't recall how it was fixed) [23:28:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.539 seconds [23:31:16] heh [23:31:28] your upload directory is owned by brion [23:31:59] well, 500, we somehow haven't managed to convince him to take his account name back so we have hundreds of files owned by "500" [23:32:41] !log reedy synchronized php-1.19/includes/SkinTemplate.php 'r112162' [23:32:43] Logged the message, Master [23:32:46] 500 used to own the l10nupdate stuff too [23:32:58] New patchset: Lcarr; "Revert "Trying to move exported resources in new nagios host"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2726 [23:32:59] lol [23:33:15] speaking of l10nupdate: it occurs to me that we will have to rewrite the scripts for manualRecache [23:33:37] it will have to run rebuildLocalisationCache.php and then push out the LocalisationCache files [23:33:41] TimStarling: I'm assuming it can be fixed quickly :) [23:33:50] greetings - I have a question regarding the Wikimedia Downloads site - anyone from that crew around? :) [23:34:05] Thehelpfulone: thanks! 
[23:34:05] and the LU files won't need to be pushed out anymore [23:34:15] no problem :) [23:34:15] dungodung|away: as soon as I work out how to log in to ms7 [23:34:25] New patchset: Lcarr; "Revert "Trying to move exported resources in new nagios host"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2727 [23:34:30] !log reedy synchronized php-1.19/maintenance/language/ 'r112162' [23:34:32] Logged the message, Master [23:35:01] alright [23:35:04] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2726 [23:35:12] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2726 [23:35:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2726 [23:35:17] !log reedy synchronized php-1.19/extensions/UploadWizard/resources/mw.fileApi.js 'r112164' [23:35:20] Logged the message, Master [23:35:25] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2727 [23:35:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2727 [23:35:59] !log reedy synchronized php-1.19/extensions/Vector/Vector.php 'r112164' [23:36:01] Logged the message, Master [23:37:02] !log reedy synchronized php-1.19/extensions/WikiEditor/ 'r112164' [23:37:04] Logged the message, Master [23:37:16] !log fixed ownership on /mnt/upload6/wikimedia/rs [23:37:18] Logged the message, Master [23:37:28] dungodung|away: try now [23:37:40] Saibo: the ff bugfix is live [23:37:58] TimStarling: works now. muchas graciac [23:38:00] *s [23:39:14] New patchset: Lcarr; "fixing reference to nagios3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2728 [23:39:14] oh, great reedy! 
*testing very soon* [23:39:47] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2728 [23:39:48] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2728 [23:40:19] Are we supposed to be deploying elsewhere tonight? [23:44:06] Reedy: UW's multi file selection works now in FF10 - thank you! [23:54:04] okay - sorry - got distracted - so MS outreached to us about inclusion on - http://www.microsoft.com/web/gallery/categories.aspx?category=Wiki - among other things we're working on in preparation - we'll need to be able to host a zip file version of the MW installation package (along with some added xml files) - downloads.wikimedia.org was suggested as a place to house that zip file - anyone know the who/how of doing th [23:54:46] !log reedy synchronized php-1.19/extensions/UploadWizard/resources/mw.UploadWizard.js 'r112167' [23:54:48] Logged the message, Master [23:55:15] varnent: well, doing it once manually is one thing [23:55:21] Making our release script build both versions is another [23:55:35] Getting the files up there is trivial enough if you have it somewhere, anyone in ops can do it [23:55:38] Reedy: right..that's a part of the problem [23:56:05] Which part? :p [23:56:17] how to handle this in future versions [23:56:37] well, if you can detail what's involved/pass on links, I can take a stab at updating the release scripts [23:56:46] will we do it uniquely each time - or do we add the xml to the package and automate zip creation [23:56:56] doing it uniquely everytime is stupid [23:56:58] Reedy: excellent! [23:57:05] the script already does most of the work, we've just got some extra files to copy in, and zip instead of tar [23:57:36] I suspect most of the script changes will be copy paste [23:57:39] isn't there someone who really loves Microsoft who will do it for us? 
[23:57:44] that's how it works with Debian [23:58:07] http://learn.iis.net/page.aspx/578/package-an-application-for-the-windows-web-application-gallery/ [23:58:15] New patchset: Lcarr; "Putting neon in decomissioned (reinstalling as public)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2729 [23:58:37] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2729 [23:58:42] TimStarling: a part of me wonders why MS isn't doing it since they have the interest..but I digress :) it's worth having in there..so.. [23:59:07] Hopefully MS doing it won't decide they want to restructure MW and put stuff in random places and hack core as and when they see fit [23:59:32] fair point - so I suppose there's an advantage in WM doing it vs. MS doing it
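[editor's note] The packaging step described above ("some extra files to copy in, and zip instead of tar") could look roughly like the sketch below. It is not the actual MediaWiki release script. The function name build_zip and the directory and file names in the usage note are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def build_zip(release_dir, extra_files, out_path):
    """Zip a release tree and add extra files at the archive root.

    Hypothetical helper; out_path should live outside release_dir so the
    archive does not try to include itself.
    """
    release_dir = Path(release_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Store the release tree under its own top-level directory name.
        for path in sorted(release_dir.rglob("*")):
            if path.is_file():
                arcname = Path(release_dir.name) / path.relative_to(release_dir)
                archive.write(path, arcname.as_posix())
        # Extra files (for example gallery XML descriptors) go at the root.
        for extra in extra_files:
            archive.write(extra, Path(extra).name)
    return out_path
```

A call might look like build_zip("mediawiki-1.19", ["manifest.xml"], "mediawiki-1.19.zip"), where both the directory and the XML file name are placeholders. Keeping the extra files as a parameter means the same function serves both the plain and the gallery variants of the package.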