[00:02:19] New patchset: Ryan Lane; "Applying LDAP fix to all instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2697 [00:04:04] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [00:10:04] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [00:10:04] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [00:13:16] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2697 [00:13:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2697 [00:14:32] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2611 [00:14:33] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2611 [00:20:03] New patchset: Diederik; "IP range filtering and regular expression now work." [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2698 [00:21:34] New patchset: Ryan Lane; "Adding in nslcd.conf.erb, to avoid awkward cherry-pick" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2699 [00:22:04] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2699 [00:22:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2699 [00:26:18] New patchset: Ryan Lane; "We don't want to give people a shell, except in labs." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2700 [00:27:00] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2700 [00:27:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2700 [00:27:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2700 [00:29:51] New patchset: Ottomata; "Removing launcher.py, moved multiprocessing support to pipeline/__main__.py" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2701 [00:30:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.660 seconds [00:40:59] !log tstarling synchronized wmf-config/CommonSettings.php 'reducing cache expiry for unversioned resources on commons' [00:41:01] Logged the message, Master [00:42:48] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2701 [00:42:49] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2701 [00:49:46] btw: css is broken on commons [00:50:03] confirmed [00:50:07] since at least 4 minutes [00:51:03] !log tstarling synchronized wmf-config/CommonSettings.php 'reducing cache expiry for unversioned resources on commons' [00:51:05] Logged the message, Master [00:51:26] Saibo: thanks, i've yelled at people to fix it :) [00:51:42] thanks, works [00:51:44] !log tstarling synchronized wmf-config/CommonSettings.php 'reducing cache expiry for unversioned resources on commons' [00:51:46] Logged the message, Master [00:52:04] and again broken [00:52:24] hm.. and works again [00:52:31] * Saibo beats the caches [00:53:17] who broke it? 
:) [00:53:27] :) [00:53:31] it's working again [00:53:53] I think it was in an attempt to fix all the issues before the 1.19 launch :) [00:54:05] I wish they'd get on it already [00:54:08] I want lua on my templates [00:54:13] heh [00:54:21] ;) [00:54:23] you know that's not coming in this release, right? :) [00:54:28] right [00:54:29] I do [00:54:33] is there any sort of timeframe? [00:54:39] no clue [00:54:41] because if it's a really long time I will just write it in Javascript [00:54:50] thought you might be WMF and might know ;) [00:54:58] I'm wmf, but I'm in ops [00:55:04] ops doesn't know stuff [00:55:08] ;) [00:55:09] +1 [00:55:12] heh [00:55:25] I feign ignorance *really* well, anyway [00:57:52] New patchset: Ottomata; "Adding __main__.py - meant for this to go with the last commit." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2702 [00:59:07] !log reedy synchronized php-1.19/languages/messages/MessagesEn.php 'r112073' [00:59:09] New patchset: Lcarr; "commenting out aggregator Attempt to make puppet compile the directory before timeout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2703 [00:59:10] Logged the message, Master [01:00:21] !log reedy synchronized php-1.19/includes/ 'r112073' [01:00:23] Logged the message, Master [01:00:24] well who do I have to bother in order to find out when Lua will be implemented? [01:00:54] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2702 [01:00:55] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2702 [01:01:30] New patchset: Lcarr; "commenting out aggregator Attempt to make puppet compile the directory before timeout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2703 [01:01:37] Magog_the_Ogre: wikitech-l ? [01:01:42] or you can implement it :) [01:02:01] I will implement it if you will pay me prevailing wages :D [01:02:37] haha, yeah I'm not ms. 
moneybags [01:02:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2703 [01:03:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2703 [01:03:21] Magog_the_Ogre: https://www.mediawiki.org/wiki/Lua_scripting [01:03:51] looks like it will be a while [01:04:33] Start date is pinned for August... Not sure how accurate it is [01:06:00] yikes [01:06:12] sounds like I'm not getting it for at least a year, probably more like 1.5-2 years [01:06:32] I'll just have to write my tool in Javascript and put it on the site JS for everyone [01:07:18] test [01:07:40] failed [01:08:11] so, i can't talk on channel until i get passes status? [01:08:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:09:19] i was just messing with you, you're obviously talking on the channel right now [01:09:21] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Commonswiki to 1.19wmf1 [01:09:23] Logged the message, Master [01:12:35] PROBLEM - MySQL Idle Transactions on db22 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.023 seconds [01:16:47] PROBLEM - RAID on db22 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [01:17:59] !log reverting to 1.18 on commons due to DB overload [01:18:00] !log tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: [01:18:01] Logged the message, Master [01:18:03] Logged the message, Master [01:19:29] RECOVERY - MySQL Idle Transactions on db22 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:20:13] New patchset: Lcarr; "Only pushing standard package as stafford is overloaded" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2704 [01:20:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2704 [01:21:24] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2704 [01:21:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2704 [01:26:05] RECOVERY - Disk space on neon is OK: DISK OK [01:26:15] New patchset: Lcarr; "Revert "Only pushing standard package as stafford is overloaded"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2705 [01:26:23] RECOVERY - DPKG on neon is OK: All packages OK [01:26:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2705 [01:27:26] RECOVERY - RAID on neon is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [01:28:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2705 [01:29:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2705 [01:30:44] RECOVERY - NTP on neon is OK: NTP OK: Offset 0.009791016579 secs [01:37:15] TimStarling: https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CentralAuth/specials/SpecialMergeAccount.php?r1=104235&r2=104236& [01:37:59] what about it? [01:38:53] break [01:39:48] hehe [01:39:55] don't you like little cleanups? [01:40:23] it convinced me for a minute there [01:44:09] poor MediaWiki, it's like a speeding car whose brakes were cut... it just can't stop! [01:47:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:45] Reedy: ? 
[01:51:25] Shit [01:51:38] :) [01:51:57] Still, the method is *still* undefined [01:52:18] * Reedy fixes [01:53:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.581 seconds [01:53:14] * Damianz gives Reedy a cookie [01:55:47] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 601s [01:56:50] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 663s [01:57:21] !log reedy synchronized php-1.19/extensions/CentralAuth/specials/ 'r112075' [01:57:24] Logged the message, Master [01:58:11] New patchset: Lcarr; "Fixing nagios service to nagios3 in newmonitor class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2706 [01:58:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2706 [01:59:24] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2706 [01:59:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2706 [02:15:10] !log testing new scap script ~tstarling/bin/scap-new [02:15:14] Logged the message, Master [02:16:47] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [02:18:08] !log LocalisationUpdate completed (1.18) at Wed Feb 22 02:18:08 UTC 2012 [02:18:11] Logged the message, Master [02:18:24] !log tstarling synchronizing Wikimedia installation... : [02:18:26] Logged the message, Master [02:21:31] !log tstarling synchronizing Wikimedia installation... : [02:21:33] Logged the message, Master [02:24:44] sync done. 
[02:25:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.977 seconds [02:37:47] PROBLEM - RAID on srv194 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:08] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [02:39:27] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:39:35] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.2.227:11000 (Connection timed out) [02:40:03] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [02:40:21] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [02:40:21] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [02:40:29] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:40:29] PROBLEM - BGP status on cr2-eqiad is CRITICAL: (Service Check Timed Out) [02:40:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: (Service Check Timed Out) [02:40:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:53] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:29] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [02:43:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 [02:44:05] PROBLEM - DPKG on nfs1 is 
CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:44:24] PROBLEM - Router interfaces on br1-knams is CRITICAL: CRITICAL: No response from remote host 91.198.174.245 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [02:44:41] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:50] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 9, down: 0, shutdown: 0 [02:45:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.433 seconds [02:45:26] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:27] RECOVERY - DPKG on nfs1 is OK: All packages OK [02:45:35] PROBLEM - RAID on mw40 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:45:44] RECOVERY - Router interfaces on br1-knams is OK: OK: host 91.198.174.245, interfaces up: 10, down: 0, dormant: 0, excluded: 0, unused: 0 [02:45:44] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:46:20] PROBLEM - Puppetmaster HTTPS on sockpuppet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:14] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [02:47:33] RECOVERY - Puppetmaster HTTPS on sockpuppet is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.433 seconds [02:47:41] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:47:42] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:48:53] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [02:48:53] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions 
up: 9, down: 0, shutdown: 0 [02:49:02] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [02:49:03] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 99, down: 0, dormant: 0, excluded: 0, unused: 0 [02:49:11] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [02:49:11] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 9, down: 0, shutdown: 0 [02:49:20] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 9, down: 0, shutdown: 0 [02:49:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 84, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR [02:49:56] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [02:49:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR [02:49:57] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:50:14] New patchset: Lcarr; "decreasing number of simultaneous checks for nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2707 [02:50:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2707 [02:50:38] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2707 [02:50:38] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2707 [02:50:52] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4252 bytes in 0.005 seconds [02:51:17] New patchset: Ottomata; "Created DygraphLoader for generic transformation of observation aggregations into dygraphs csv format." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2708 [02:51:17] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.015 seconds [02:51:44] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.015 seconds [02:52:20] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.009 seconds [02:54:24] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:54:32] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:54:40] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:54:51] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/2708 [02:55:01] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:55:01] Change merged: Ottomata; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2708 [02:57:50] !log Synced php-1.19/includes/MessageBlobStore.php to disable ::clear() ; where's the logging bot? [02:57:52] Logged the message, Mr. 
Obvious [03:00:50] [12:50] * logmsgbot has quit (Read error: Connection reset by peer) < so ~10 mins ago [03:03:28] Started ircecho [03:03:56] !log Manually started ircecho on fenari ; why doesn't this happen upon boot? Why didn't puppet start it? [03:03:59] Logged the message, Mr. Obvious [03:04:51] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [03:05:01] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [03:05:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197, [03:05:36] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [03:06:04] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [03:06:04] PROBLEM - check_all_memcacheds on spence is CRITICAL: (Service Check Timed Out) [03:06:30] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:39] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: No response from remote host 10.1.2.3 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [03:07:33] PROBLEM - Certificate expiration on nfs1 is CRITICAL: (Service Check Timed Out) [03:09:49] what happens at COM? [03:09:58] many errors on editing [03:10:15] !log on fenari: NFS overload, killed apache and xinetd [03:10:17] Logged the message, Master [03:10:22] Saibo: what's COM? [03:10:24] e.g. Lock wait timeout exceeded; try restarting transaction (10.0.6.32)“. Or API errors when using gadgets. [03:10:26] commons [03:10:37] currently switched? 
[03:10:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [03:10:57] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 9, down: 0, shutdown: 0 [03:11:00] "API request returned code 200 parsererrorError code is SyntaxError: JSON.parse: unexpected non-whitespace character after JSON data" [03:11:06] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [03:11:15] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR [03:11:24] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [03:11:37] yeah, there is a problem [03:11:51] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [03:12:00] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 9, down: 0, shutdown: 0 [03:12:09] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [03:12:25] seems better now [03:12:36] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [03:12:37] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 9, down: 0, shutdown: 0 [03:13:00] hmm, maybe not, another 13s query for site_stats showing up [03:15:27] New patchset: Catrope; "Don't let l10nupdate write to /home directly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2709 [03:15:36] TimStarling: ----^^ [03:15:45] That should help speed up l10nupdate and make it less heavy on NFS [03:15:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2709 [03:18:32] !log tstarling synchronized php-1.19/includes/LocalisationCache.php [03:18:34] Logged the message, Master [03:18:57] !log tstarling synchronized php-1.19/includes/LocalisationCache.php [03:18:59] Logged the message, Master [03:22:28] TimStarling: you think we'll be able to do something today/tonight, or should we just plan to have another window tomorrow? [03:23:15] I'm game to go on a little while longer, but I imagine we'll run out of steam here pretty quickly [03:24:19] we can switch now, Roan's patch should stop it from breaking [03:24:47] alright...let's give it a whirl [03:25:14] !log switching commons back to 1.19 [03:25:19] Logged the message, Master [03:25:24] !log tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: [03:25:26] Logged the message, Master [03:26:26] seems fine [03:27:02] testing upload wizard....some breakage [03:27:21] TimStarling: is the cache timeout still on? [03:27:23] Didn't we forget the RL cache timeout thing? [03:27:30] yeah it's still 30 seconds [03:28:10] ok....probably should clear up quickly [03:28:16] * robla gets back to testing [03:28:40] it's been more than 30 seconds [03:29:19] why is it loading code from prototype.wikimedia.org? [03:29:34] surely that is asking for trouble [03:30:34] mw.user.anonymous is not a function [03:30:34] https://bits.wikimedia.org/commons.wikimedia.org/load.php?debug=false&lang=en&modules=site&only=scripts&skin=vector&* [03:30:34] Line 29 [03:31:22] I got that too for UploadWizard [03:31:24] and then I didn't [03:31:39] it's from https://commons.wikimedia.org/wiki/MediaWiki:Common.js [03:31:49] Oh [03:32:06] Yeah that code is broken [03:32:08] Lemme fix that [03:32:22] (Missing dependency) [03:33:29] so that function doesn't exist by default in 1.19? 
[03:33:38] Site JS should be fixed now [03:33:49] It never reliably existed at the time Common.js is run [03:34:15] I edited Common.js , that error should be fixed now [03:34:55] you might want to get on #wikimedia-commons [03:42:36] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [03:48:30] !log catrope synchronized php-1.19/skins/common/shared.css 'r112081' [03:48:33] Logged the message, Master [03:49:05] * AaronSchulz watches fsockopen() errors for ED [04:02:43] noc down? http://noc.wikimedia.org/ isn't loading. [04:03:28] yeah I killed it [04:03:50] Report of ExtensionDistributor on mediawiki.org being down as well. [04:13:25] TimStarling: we're plotting a rollback here... [04:17:35] !log tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: rolling back commons to 1.18 [04:17:37] Logged the message, Master [04:24:26] I guess it's time for me to have lunch [04:25:39] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours [04:34:55] !log catrope synchronized php-1.19/thumb.php 'Experimental fix for 1.19 UploadWizard thumb issue' [04:34:58] Logged the message, Master [05:14:45] !log started xinetd [05:14:47] Logged the message, Master [05:18:23] !log catrope synchronized php-1.19/thumb.php 'Experimental fix for UploadStash thumbs in 1.19' [05:18:26] Logged the message, Master [05:20:15] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 10 seconds [05:21:45] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [05:51:45] !log tstarling synchronizing Wikimedia installation... : [05:51:47] Logged the message, Master [06:06:56] sync done. [06:12:11] New patchset: Tim Starling; "Support l10n manual recache in scap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2710 [06:12:36] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2710 [06:12:49] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2710 [06:12:50] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2710 [06:15:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:17:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.538 seconds [06:26:41] !log installed new scap manually since puppet is broken [06:26:44] Logged the message, Master [06:33:03] !log tstarling synchronized wmf-config/CommonSettings.php 'enabling manual recache' [06:33:05] Logged the message, Master [06:39:36] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [06:49:58] !log tstarling synchronized php-1.19/languages/messages/MessagesEn.php 'test change for manualRecache' [06:50:02] Logged the message, Master [06:51:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:55:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.619 seconds [07:03:09] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out [07:03:36] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [07:09:36] PROBLEM - Puppet freshness on mw1002 is CRITICAL: Puppet has not run in the last 10 hours [07:09:36] PROBLEM - Puppet freshness on db46 is CRITICAL: Puppet has not run in the last 10 hours [07:11:33] RECOVERY - Lucene on search9 is OK: TCP OK - 8.993 second response time on port 8123 [07:23:42] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [07:31:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 
bytes in 3.452 seconds [07:40:27] Please look at https://bugzilla.wikimedia.org/show_bug.cgi?id=34585 (German Wikiversity offline) [07:44:14] it's not offline [07:44:33] but you can't get there without specifying the main page name or some other page name, it seems [07:45:48] http://de.wikiversity.org/wiki/Hauptseite [07:46:11] the layout is weird but I don't know if that's normal or not, having not visited the site ever before this [07:46:26] I get this page with and without any page title [07:46:41] if you click the above link, what do you see? [07:46:45] I get this page (This wiki does not exist) with and without any page title [07:46:53] The page "This wiki does not exist" [07:46:58] ugh [07:48:02] and now so do I [07:48:11] after a control-f5 [07:48:13] fabulous [07:49:11] yes after control-f5 [07:49:59] I'll add a screenshot to the bug report [07:53:12] it's somehow missing from wikiversions.dat [07:54:23] Something happened between 23:00 UTC and now, because there were no problems before [07:57:28] Thank you for your help, apergos [07:57:33] but I have to go now [07:57:37] lemme see what version it's supposed to be running [07:58:01] I hope you or somebody else can fix the bug :-) [08:01:51] looks like it should still be on 1.18 [08:03:05] ok file's cleaned up [08:07:23] !log ariel rebuilt wikiversions.cdb and synchronized wikiversions files: fix dewikiversity typo in wikiversions file [08:07:26] Logged the message, Master [08:09:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:15:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.026 seconds [08:46:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:51:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.513 seconds [09:02:21] !log tstarling synchronizing Wikimedia installation... 
: [09:02:24] Logged the message, Master [09:05:33] !log tstarling synchronizing Wikimedia installation... : [09:05:35] Logged the message, Master [09:09:16] !log tstarling synchronizing Wikimedia installation... : [09:09:18] Logged the message, Master [09:23:49] sync done. [09:23:52] !log on fenari: started apache [09:23:54] Logged the message, Master [09:24:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:25:39] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4252 bytes in 0.014 seconds [09:28:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.718 seconds [09:34:56] !log tstarling synchronizing Wikimedia installation... : [09:34:58] Logged the message, Master [09:39:36] PROBLEM - HTTP on fenari is CRITICAL: Connection refused [09:45:14] !log on fenari: stopped apache again due to overload. Restarted it with reduced MaxClients [09:45:16] Logged the message, Master [09:45:27] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4252 bytes in 0.020 seconds [09:45:36] RECOVERY - Lucene on search9 is OK: TCP OK - 2.997 second response time on port 8123 [09:47:14] sync done. 
[09:50:44] I wonder why it is unhappy recently [09:57:54] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [10:02:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:05:42] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [10:06:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.382 seconds [10:09:49] !log hashar synchronized php-1.19/extensions/CodeReview/backend/DiffHighlighter.php 'r112098 - (bug 34554) diff chunk fail to parse file add/rm' [10:09:51] Logged the message, Master [10:11:42] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:11:42] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [10:37:30] RECOVERY - Lucene on search9 is OK: TCP OK - 2.995 second response time on port 8123 [10:42:25] New patchset: ArielGlenn; "initial commit: tool for managing dump uploads to archive.org" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/2711 [10:42:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:42:27] New review: gerrit2; "Lint check passed." 
[operations/dumps] (ariel); V: 1 - https://gerrit.wikimedia.org/r/2711 [10:46:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.386 seconds [10:49:39] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [11:00:18] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [11:02:42] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [11:20:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:24:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.023 seconds [11:26:33] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [11:58:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:02:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.055 seconds [12:18:36] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [12:20:33] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:24] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:25] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:25]