[00:31:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.090 seconds
[01:14:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:27:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds
[01:41:46] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds
[01:41:54] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 248 seconds
[01:48:57] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 670s
[01:53:54] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds
[01:54:49] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s
[01:54:57] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds
[01:58:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:08:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.911 seconds
[02:13:49] hmpf can't an op +q wm-bot
[02:27:00] !log LocalisationUpdate completed (1.20wmf8) at Mon Aug 6 02:27:00 UTC 2012
[02:27:12] Logged the message, Master
[02:48:30] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[03:19:07] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[03:32:10] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[03:56:10] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[04:36:26] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[04:51:26] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:04:20] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[05:19:20] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[06:22:34] There is very slow loading ~http://en.wikipedia.org/wiki/Mars_Science_Laboratory
[06:22:37] that page loads extremly slowsly
[06:22:46] could you make it faster
[06:31:05] Anyone?
[06:48:54] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[06:48:54] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[06:48:54] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[08:36:33] PROBLEM - Puppet freshness on bayes is CRITICAL: Puppet has not run in the last 10 hours
[08:38:30] PROBLEM - Puppet freshness on srv242 is CRITICAL: Puppet has not run in the last 10 hours
[08:38:30] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours
[08:39:33] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours
[08:39:33] PROBLEM - Puppet freshness on srv190 is CRITICAL: Puppet has not run in the last 10 hours
[08:39:33] PROBLEM - Puppet freshness on srv238 is CRITICAL: Puppet has not run in the last 10 hours
[11:27:02] mutante: why did you rebuild wikitech-l archives? is it related to those silly unsubjected emails?
[11:27:07] (which are still there)
[11:29:33] ah that's why my link was broken again
[11:30:42] Nikerabbit: omg what link?
[11:33:15] ohnoes all links seem to be broken now https://www.mediawiki.org/w/index.php?title=Special:LinkSearch&limit=500&offset=0&target=http%3A%2F%2F*.lists.wikimedia.org%2Fpipermail%2Fwikitech-l%2F
[11:33:22] mutante what did you doooo
[11:35:16] ah, even in 2004 archives, moved ahead of 2
[12:29:52] Anywhere to dicuss plans for SMIL animation on wikipedia?
[12:45:29] I have seen examples of software that converts SMIL SVG's to APNG and GIF, would that be implementable in MediaWiki software? (thinking of wikipedia specifically)
[12:49:37] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[13:12:52] PROBLEM - Host ps1-b1-eqiad is DOWN: CRITICAL - Network Unreachable (10.65.0.40)
[13:14:11] I'm getting Wikimedia Erros
[13:14:20] any knows they are problems
[13:15:02] What errors?
[13:15:06] Wiki13: yes
[13:15:08] ehm
[13:15:09] Reedy: squid down
[13:15:13] plus bugzilla
[13:15:20] eqaid
[13:15:26] apergos: o
[13:15:29] Request: GET http://nl.wikipedia.org/wiki/Hoofdpagina, from 91.198.174.45 via amssq34.esams.wikimedia.org (squid/2.7.STABLE9) to ()
[13:15:32] paravoid: mark ^
[13:15:33] and then Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 06 Aug 2012 13:13:58 GMT
[13:15:34] uk.wikimedia seems to be down too...
[13:15:47] like 2/3 of the cluster http://ganglia.wikimedia.org/latest/
[13:16:21] wikipedia server is down?
[13:16:29] yes
[13:16:31] Status: up? really?
[13:16:35] no
[13:16:38] ok
[13:16:43] down i think
[13:16:44] this _is_ severe
[13:16:48] Status: DOWN | Technical help for Wikimedia wikis: https://meta.wikimedia.org/wiki/Tech | MediaWiki: #mediawiki | Toolserver: #wikimedia-toolserver | Labs: #wikimedia-labs | Pastebin: http://p.defau.lt/ | Server admin log: http://bit.ly/wikisal | Channel is logged: http://ur1.ca/9lbuj | Bugs: https://bugzilla.wikimedia.org
[13:17:03] !log more or less everything at !Wikimedia / !Wikipedia seems down
[13:17:11] hmm
[13:17:12] Logged the message, Master
[13:17:20] you just need to get a Good gateway :)
[13:17:38] BAAAAAAAD gateway
[13:17:47] huh?
[13:17:56] * aude panicks!
[13:18:38] * Nemo_bis eating apple
[13:18:48] It looks like everything is bad
[13:18:49] Even DNS
[13:18:55] will it help if i hold down the F5 key in internet explorer
[13:18:55] http://status.wikimedia.org/
[13:19:00] :P
[13:19:18] TheCavalry: I don't think so ;-)
[13:19:27] always, remember to do ctrl-f5 so you'r sure it tries the server
[13:19:28] of course nagios is slacking too
[13:20:30] as the topic is still old: this channel is aware of the issues?
[13:20:48] yes :)
[13:20:51] yes
[13:20:57] k
[13:21:26] RichiH: old?
[13:21:27] Request: GET http://de.wikipedia.org/wiki/Spezial:Beobachtungsliste, from 91.198.174.54 via amssq33.esams.wikimedia.org (squid/2.7.STABLE9) to ()
[13:21:27] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 06 Aug 2012 13:21:18 GMT
[13:22:00] hmpf
[13:22:44] 502 Bad Gateway <-- that's what I'm having.
[13:22:51] on both enwiki and nlwiki
[13:22:51] hmmm, twice in a week. that sure brings down the up statistics
[13:22:52] seems to be network issue, being investigted
[13:23:05] apergos: !log ? :)
[13:23:17] I'm not investigating
[13:23:30] I'm just relaying to folks here
[13:23:41] are all sites down?
[13:23:50] yes
[13:23:51] Sarrus: Jep :/
[13:23:52] tampa is inaccessible, folks are looking at it
[13:24:22] ok
[13:24:26] on zhwiki it's ERR_CANNOT_FORWARD
[13:24:27] :-(
[13:24:36] hi
[13:24:44] I have Request: GET http://uk.wikimedia.org/wiki/Talk:Representing_Wikimedia_UK, from 91.198.174.41 via amssq33.esams.wikimedia.org (squid/2.7.STABLE9) to ()
[13:24:44] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 06 Aug 2012 13:24:01 GMT
[13:25:12] TheCavalry: yes, known
[13:25:14] Problem is knon
[13:25:17] here, looking
[13:25:17] known
[13:25:35] any news?
[13:25:42] folks are looking at it
[13:25:48] it's defeinitely network related but
[13:25:52] just looking :D
[13:26:01] is it power? routing? folks are checking.
[13:26:08] bennylin: Join ##hoo please... about JS mess :P
[13:26:27] !log Reedy says: Status: Down - routing issues to Tampa !Wikipedia !Wikimedia
[13:26:30] * Nemo_bis abuses log
[13:26:35] Logged the message, Master
[13:26:41] yeah we don't know that it's routing ssues yet
[13:26:53] anyways it'll get logged when that's figured out
[13:27:12] hopefully its not power outage...
[13:27:16] yeah it's for our microblogging followers
[13:27:34] thanks, Nemo_bis
[13:32:45] ok, now twitter is panicking
[13:33:13] the servers are currenty experiencing a mental breakdown, please be patient whilst the techs attempt to convince it that it is not a chicken or a washing machine
[13:33:56] Lol, link to donate on technical error page, when all sites are down...
[13:35:21] What's matter?
[13:35:30] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:30] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:31] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:35:53] what's the URL for the tech log?
[13:36:08] OlegZH: it's down, it is under investigation.
[13:36:25] malafaya: http://wikitech.wikimedia.org/view/Server_admin_log ?!
[13:36:35] SOPA?
[13:36:47] computers really do love to exaggerate the problems. CRITICAL = missing letter in some small peice of code
[13:36:49] <_9xl> iwasthemuffinman says: the servers are currenty experiencing a mental breakdown, please be patient whilst the techs attempt to convince it that it is not a chicken or a washing machine
[13:36:53] malafaya: down, but you can check twitter: https://twitter.com/wikimediatech
[13:37:03] thanks
[13:37:48] OlegZH: Ohnoe... they got us!
[13:39:33] PROBLEM - NTP on db1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[13:39:52] I am russian wikipedist. Now there is a low concerned with internet filter against "some" information
[13:39:53] thedj[work]: http://identi.ca/wikimediatech
[13:40:27] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:27] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:40:28] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:21] RECOVERY - MySQL Slave Delay on es2 is OK: OK replication delay 12 seconds
[13:41:54] AVRS: i love identi.ca but lets be honest, most people use twitter.
[13:43:06] Matt Dobson @Dobbo25
[13:43:06] Once again wikipedia fucks up, swear that site needs better maintenance
[13:43:20] that's not nice....
[13:43:25] indeed.
[13:43:25] ^ Some users are idiot
[13:43:33] I can't remember the last time it went down
[13:43:44] suprisingly few "how am i supposed to do my homework now" tweets.
[13:43:53] bennylin: last week actually :D
[13:43:53] lol
[13:44:07] aw... :p
[13:44:49] its only monday, all the essays go in on friday perhaps?
[13:44:55] ^^
[13:44:59] once you're done handling the downtime, could you please consider changing the IRC link on the error page to http://webchat.freenode.net/?channels=#wikipedia instead of the current irc:// link
[13:45:00] ?
[13:45:02] How am I supposed to work when my work iis wikipedia :P
[13:45:09] <_9xl> please donate
[13:45:11] It'd make more sense to most people.
[13:45:14] our data center tech is heading in now so we can get a direct report about the gear
[13:45:25] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:25] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:26] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:29] apergos: thanks for the info
[13:45:34] sure
[13:45:42] wctaiwan: I was also thinking about that :P
[13:45:59] More people can reach the channel then
[13:46:00] PROBLEM - MySQL Slave Delay on es4 is CRITICAL: CRIT replication delay 289 seconds
[13:46:05] apergos: yikes, we actually need to be near to the hardware to solve it ?
[13:46:14] in this case we may need to be
[13:46:20] that's very rare of course
[13:46:41] it would be great if someone could explain the technical problems in laymens terms so that everyone knows what is going on
[13:46:43] its early morning there
[13:46:44] because it's not a matter of servers or bad queries
[13:46:58] yes quite so. well it makes the case for eqiad as full backup site.
[13:46:59] as soon as we know what's broke everyone will get updated
[13:47:16] right now we only know that there is a connectivity issue to tampa
[13:47:38] all the school kids with homework due third period are counting on you! :p
[13:47:39] and no failover to eqaid?
[13:47:45] not yet.
[13:47:48] :(
[13:47:48] RECOVERY - MySQL Slave Delay on es4 is OK: OK replication delay 12 seconds
[13:48:01] most services run out of eqiad but we don't have everything in place. soon.
[13:48:03] eqiad
[13:48:10] apergos: right
[13:48:26] Im sure I did it, it happened just as I pressed upload on an SVG :( :P
[13:48:35] :-D
[13:48:48] no eqiad application servers yet http://ganglia.wikimedia.org/latest/
[13:49:02] and other stuff
[13:49:13] yes, the appservers are the big deal
[13:49:43] obviously.... anyway, figuring out what's wrong now.....
[13:49:46] time to open offline Wikipedia! Go Kiwix!
[13:49:55] What the ??
[13:50:04] Did someone play DOOM in the server room?
[13:50:07] XD
[13:50:07] what's an appserver?
[13:50:22] the main apaches, I think
[13:50:24] Nikerabbit: Wikipedia has an artic.... Oh wait
[13:50:24] how long does it take to get rid of 502 Bad Gateway?
[13:50:26] zerodamage: probably that's what caused it ;)
[13:50:30] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:30] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:31] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:48] PROBLEM - LVS on payments.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:51:00] Nikerabbit: yes
[13:51:07] * Juandev can help by redesigning 502 message :-)
[13:51:12] * aude looking at http://ganglia.wikimedia.org/latest/
[13:51:39] aude: ganglia not connecting from here
[13:51:58] hi, just dropping in to see if there's an ETA on the servers being back up, thanks :)
[13:52:05] no eta yet.
[13:52:13] ganglia should work
[13:52:33] T3rminat0r: hmmm.....
[13:52:35] we are working both with a service tech from the data center and with one of our own guys
[13:52:36] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 300 seconds
[13:52:43] (well as soon as he gets on site)
[13:52:58] apergos, thanks, i'll grab a cigarette then :) good luck with everything!
[13:53:09] yw
[13:53:27] fiber cut in tampa
[13:53:30] PROBLEM - MySQL Slave Delay on es4 is CRITICAL: CRIT replication delay 353 seconds
[13:53:32] O_o
[13:53:51] lulz
[13:54:02] how… did that happen?
[13:54:05] apergos: Link?
[13:54:17] don't have more information yet
[13:54:23] first is how the heck we work around this
[13:54:39] Don't you have alternate routing?
[13:55:00] Or can't you swtich over to the European backup?
[13:55:10] it's not a backup
[13:55:17] we can't "switch over to europe"
[13:55:21] it's a caching centre in europe
[13:55:24] it's a cache, it doesn't have the content
[13:55:27] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:27] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:28] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:28] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:55:31] Qcoder02: they probably could if people stopped asking them questions every 30 seconds.
[13:55:35] anyways, more info as soon as we get it
[13:55:52] (yes we havemultiple routes, this is horribly inconvenient however)
[13:56:03] * Qcoder02 ponders if you can have distributed RAID
[13:56:43] mobile site still works for me (served from eqiad)
[13:57:03] Qcoder02: ... there are distributed database servers. but not all services are redundant yet... so patience is needed, while the techies fix the problem :)
[13:57:14] no content though, aude (I think)
[13:57:20] Hydriz: i have content
[13:57:22] Wikipedia Mobile Korean is not working for me
[13:57:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[13:57:26] it's cached?
[13:57:33] yup
[13:57:35] Error 503 Service Unavailable
[13:57:36] Service Unavailable
[13:57:36] Guru Meditation:
[13:57:36] XID: 203550965
[13:57:36] Varnish cache server
[13:57:36] tell me about fiber cuts:-)
[13:57:43] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 55.46 ms
[13:57:43] RECOVERY - Host cp3001 is UP: PING OK - Packet loss = 0%, RTA = 123.11 ms
[13:57:43] RECOVERY - Host amslvs2 is UP: PING OK - Packet loss = 0%, RTA = 123.27 ms
[13:57:43] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 122.85 ms
[13:57:43] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 121.71 ms
[13:57:43] RECOVERY - Host amssq53 is UP: PING OK - Packet loss = 0%, RTA = 122.99 ms
[13:57:44] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 123.07 ms
[13:57:44] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 122.96 ms
[13:57:45] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 121.71 ms
[13:57:45] RECOVERY - Host amssq52 is UP: PING OK - Packet loss = 0%, RTA = 122.91 ms
[13:57:46] RECOVERY - Host amssq36 is UP: PING OK - Packet loss = 0%, RTA = 121.62 ms
[13:57:48] thats fast for news to break out: http://www.nu.nl/internet/2877395/wikipedia-kampt-met-grote-storing.html
[13:57:51] aude: for you. ;) caching.
[13:58:15] and we're back
[13:58:21] :(
[13:58:26] not quite yet.
[13:58:36] i get the page title at least :)
[13:58:44] it's thinking about it.....
[13:59:03] RECOVERY - MySQL Slave Delay on es4 is OK: OK replication delay 0 seconds
[13:59:03] RECOVERY - Host bits.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 118.87 ms
[13:59:04] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: (Service Check Timed Out)
[13:59:12] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: (Service Check Timed Out)
[13:59:17] they are trying to make sure no traffic goes over that link
[13:59:34] which might overload the other link ?
[18:02:18] Everything is done so far, bar the pre-scap
[18:02:20] Had enough time to branch, stage etc, but not run scap
[18:02:21] which is gonna take half an hour or so..
[18:02:28] !log aaron synchronized wmf-config/filebackend.php
[18:02:30] Logged the message, Master
[18:04:11] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 58%, RTA = 3556.09 ms
[18:04:43] AaronSchulz: if your merging code to wmf8, needs to go into wmf9 too ;)
[18:05:02] oh, you are :)
[18:06:21] !log aaron synchronized php-1.20wmf9/includes/filerepo/backend/FileBackendMultiWrite.php 'deployed b55e9652fce3051d621356c5c87c47f44515a367'
[18:06:29] Logged the message, Master
[18:09:44] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 118.27 ms
[18:12:50] !log reedy Started syncing Wikimedia installation... : test2wiki to 1.20wmf9 and scap to rebuild message cache
[18:12:58] Logged the message, Master
[18:16:45] ......................................................................................
[18:16:52] ......................................................................
[18:16:53] .........
[18:16:59] .................................................................
[18:17:09] Reedy: Hey could you ping me when you deploy to mw.org
[18:17:11] ?
[18:18:02] cool
[18:18:11] stupid phone
[18:20:31] !log reedy Started syncing Wikimedia installation... : test2wiki to 1.20wmf9 and scap to rebuild message cache
[18:20:39] Logged the message, Master
[18:37:52] PROBLEM - Puppet freshness on bayes is CRITICAL: Puppet has not run in the last 10 hours
[18:39:49] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours
[18:39:50] PROBLEM - Puppet freshness on srv242 is CRITICAL: Puppet has not run in the last 10 hours
[18:40:43] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours
[18:40:44] PROBLEM - Puppet freshness on srv238 is CRITICAL: Puppet has not run in the last 10 hours
[18:40:44] PROBLEM - Puppet freshness on srv190 is CRITICAL: Puppet has not run in the last 10 hours
[18:45:34] goddamn scap
[18:51:49] Pfft, where's Aaron gone?
[18:58:31] RoanKattouw: Any chance you could finish this deploy off for me please? Need to go AFK for a while, but scap is taking an age and I've just about waited as long as I can..
[18:58:42] Go ahead
[18:58:54] it's just gone passed srv281, so nearly done
[18:59:00] Oh sorry, for me to finish it up
[18:59:02] Sure
[18:59:16] just needs "sync-wikiversions testwiki and mediawikiwiki to 1.20wmf9"
[18:59:29] I've already changed wikiversions.dat, so that'll build the cdb, push it, and all should be hunkdory
[18:59:46] OK thanks
[19:00:00] As usual, if it goes wrong, feel free to revert back to wmf8
[19:00:01] Thanks! :D
[19:00:18] OK
[19:00:22] I'll track down Aaron as well
[19:00:56] srv290
[19:01:01] I need to disconnect my network cable for a minute, but I'm on it
[19:01:12] Yeah I'm watching your processes in top, I can see the progress
[19:11:28] !log reedy Finished syncing Wikimedia installation... : test2wiki to 1.20wmf9 and scap to rebuild message cache
[19:11:37] Logged the message, Master
[19:12:35] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki and mediawikiwiki to 1.20wmf9
[19:12:43] Logged the message, Master
[19:13:41] hmm, probably not just compiling texvc
[19:14:44] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files:
[19:14:51] Logged the message, Master
[19:19:17] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100%
[19:20:38] PROBLEM - Apache HTTP on mw18 is CRITICAL: Connection refused
[19:25:47] I found this article: http://www.bbc.co.uk/news/technology-19148151
[19:25:51] Is this what happened?
[19:26:45] It's ahm, somewhat accurate :)
[19:27:12] RoanKattouw: I wonder whats with test1?
[19:28:17] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 35.76 ms
[19:28:29] AaronSchulz: What /is/ with test1?
[19:28:42] Oh crap
[19:28:45] Error: invalid magic word 'switchlanguage'
[19:29:00] Some extension probably?
[19:29:10] fundraising?
[19:29:14] test2 is fine
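[Editor's note on the sync-wikiversions step discussed at 18:59 above: wikiversions.dat is a plain-text list of "dbname version" pairs, and sync-wikiversions compiles it into wikiversions.cdb and pushes that file to the Apaches, which then use a constant-time CDB lookup to decide which MediaWiki checkout (e.g. php-1.20wmf9) serves a given wiki. Below is a minimal PHP sketch of such a lookup, not the actual multiversion code; the file path and the "ver:" key prefix are illustrative assumptions, not taken from the log.]

<?php
// Hypothetical sketch: resolve which deployed MediaWiki version serves a wiki
// by reading the compiled wikiversions.cdb with PHP's dba "cdb" handler.
// The path and the "ver:" key prefix are assumptions made for illustration.
function getWikiVersion( $dbname, $cdbPath = '/usr/local/apache/common/wikiversions.cdb' ) {
	$handle = dba_open( $cdbPath, 'r', 'cdb' ); // read-only, constant-time hash lookups
	if ( $handle === false ) {
		throw new Exception( "Could not open $cdbPath" );
	}
	$version = dba_fetch( "ver:$dbname", $handle ); // e.g. "php-1.20wmf9", or false if unlisted
	dba_close( $handle );
	return $version;
}

// Usage sketch: route a request for testwiki to the matching checkout.
// $dir = getWikiVersion( 'testwiki' );                       // e.g. "php-1.20wmf9"
// if ( $dir !== false ) { /* dispatch into "$dir/index.php" */ }

[The CDB format matters here because every request needs this lookup, so it has to be a cheap, lock-free read rather than a database query or a parse of the .dat file.]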
[19:29:14] test2 is fine [19:30:07] Hmm [19:30:57] * RoanKattouw updates his checkout of wmf9 [19:31:35] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 649 seconds [19:32:37] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 642 seconds [19:33:39] grepping on an Apache in the meantime [19:33:55] oic FundraiserLandingPage [19:34:00] It's in the backtrace, duh [19:34:22] WARNING: gnome-keyring:: couldn't connect to: /tmp/keyring-JNZS7Q/pkcs11: No such file or directory [19:34:26] * AaronSchulz wtfs at git [19:35:19] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 302 Found - 0.015 second response time [19:35:54] oooh wait [19:35:58] It's not FRLP's fault [19:36:11] They added a .magic.i18n.php file but somehow that didn't make it into ExtensionMessages-1.20wmf9.php [19:36:28] *.i18n.magic.php [19:38:51] RoanKattouw: where is the file registered? [19:39:26] OK I got it [19:39:32] !log catrope synchronized wmf-config/ExtensionMessages-1.20wmf9.php 'Updated' [19:39:36] I rebuilt ExtensionMessages-1.20wmf9.php [19:39:40] Logged the message, Master [19:39:49] ...which doesn't fix it? [19:39:51] wtf? [19:40:03] 'FundraiserLandingPageMagic' => "$IP/extensions/FundraiserLandingPage/FundraiserLandingPage.i18n.magic.php", [19:40:04] ahh, nvm, I see the $wgExtensionMessagesFiles line [19:40:05] It's in there now [19:42:43] looks like we need to turn FundraiserLandingPage extension off on test.wiki [19:42:56] Aha, now it says "" is not a valid magic word for "switchlanguage" [19:43:04] I've informed the fundraising dev-team of the issue [19:43:05] Ooooh d'oh of course [19:43:09] I need to rebuild the l10n cache too [19:43:13] oh [19:43:46] I hope that fixes it [19:44:07] For some reason the ExtensionMessages-1.20wmf9.php build didn't pick up the new i18n file [19:44:14] I'm not quite sure why, it should just have worked [19:44:27] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [19:44:51] Whoo hoo [19:45:01] testwiki was fixed the minute the rebuild got past 'en' [19:45:20] So the FR people didn't break anything, our deployment system is just weird [19:45:24] * RoanKattouw glares at our deployment syste [19:45:25] m [19:45:30] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [19:45:57] !log Rebuilt ExtensionMessages-1.20wmf9.php and then l10n cache for wmf9 to pick up addition of FundraiserLandingPage.i18n.magic.php , somehow wasn't picked up during initial scap [19:46:05] Logged the message, Mr. Obvious [19:47:11] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: mediawikiwiki to 1.20wmf9 [19:47:19] Logged the message, Master [19:50:42] 22:17, 1 August 2012 (diff | hist) . . (0)‎ . . m File:File name with spaces in it 2.0.jpg ‎ (Aaron Schulz moved page File:File name with spaces in it.jpg to File:File name with spaces in it 2.0.jpg: test) (top) [] [19:50:47] RoanKattouw_away: hrm? [19:53:50] wtf [20:02:45] '$disableRollbackEditCountSpecialPage = array( 'Recentchanges', 'Watchlist' );' [20:02:53] ugh [20:03:14] why so much code for such a little feature we can't even use? 
[20:03:31] * AaronSchulz watches pre-commit review fail ;)
[20:06:11] hehe
[20:06:35] This seems to be a glitch in our deployment scripts not picking up the i18n changes
[20:06:46] I rebuilt the l10n cache earlier which fixed testwiki, now syncing that to the cluster
[20:23:46] !log catrope synchronized php-1.20wmf9/cache/l10n/ 'Sync out l10n cache again now that it is properly rebuilt'
[20:23:54] Logged the message, Master
[20:35:27] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms
[20:38:59] Some fun and games have been had I see... RoanKattouw there was also those errors related to creating the l10n cache dir (again)
[20:40:04] Right
[20:40:06] I should fix the persm
[20:40:48] Done
[20:43:07] http://gdash.wikimedia.org/dashboards/jobq/ hmm, jobqueue has turned spikey?
[21:09:18] !log catrope synchronized php-1.20wmf9/extensions/VisualEditor 'Updating VisualEditor'
[21:09:27] Logged the message, Master
[21:18:19] !log catrope synchronized wmf-config/CommonSettings.php 'Fix Parsoid URL'
[21:18:27] Logged the message, Master
[21:32:11] !log catrope synchronized php-1.20wmf9/extensions/VisualEditor/ 'Updating VisualEditor'
[21:32:19] Logged the message, Master
[21:39:28] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: Connection refused
[21:48:00] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[21:53:29] gn8 folks
[21:53:33] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 35.70 ms
[21:58:41] PROBLEM - swift-container-updater on ms-be1005 is CRITICAL: Connection refused by host
[21:58:48] PROBLEM - swift-object-updater on ms-be1005 is CRITICAL: Connection refused by host
[21:58:48] PROBLEM - swift-account-server on ms-be1005 is CRITICAL: Connection refused by host
[21:58:57] PROBLEM - swift-object-server on ms-be1005 is CRITICAL: Connection refused by host
[21:59:06] PROBLEM - swift-account-auditor on ms-be1005 is CRITICAL: Connection refused by host
[21:59:07] PROBLEM - swift-container-auditor on ms-be1005 is CRITICAL: Connection refused by host
[21:59:15] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused
[21:59:15] PROBLEM - swift-object-replicator on ms-be1005 is CRITICAL: Connection refused by host
[21:59:15] PROBLEM - swift-container-replicator on ms-be1005 is CRITICAL: Connection refused by host
[21:59:15] PROBLEM - swift-account-reaper on ms-be1005 is CRITICAL: Connection refused by host
[21:59:24] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: Connection refused by host
[21:59:33] PROBLEM - swift-account-replicator on ms-be1005 is CRITICAL: Connection refused by host
[21:59:51] PROBLEM - MySQL disk space on db1028 is CRITICAL: Connection refused by host
[22:00:01] PROBLEM - swift-container-server on ms-be1005 is CRITICAL: Connection refused by host
[22:00:27] PROBLEM - Host db1026 is DOWN: PING CRITICAL - Packet loss = 100%
[22:01:30] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[22:02:24] RECOVERY - Host db1026 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms
[22:03:39] PROBLEM - NTP on db1026 is CRITICAL: NTP CRITICAL: Offset unknown
[22:06:30] RECOVERY - NTP on db1026 is OK: NTP OK: Offset -0.04107093811 secs
[22:10:17] !log aaron synchronized php-1.20wmf9/includes/filerepo/backend/SwiftFileBackend.php 'deployed 0a5c71d4dd6405502b1d1d0a02b5d0927d519986'
[22:10:33] Logged the message, Master
[22:12:30] RECOVERY - MySQL disk space on db1028 is OK: DISK OK
[22:15:48] PROBLEM - Host db1027 is DOWN: PING CRITICAL - Packet loss = 100%
[22:17:27] RECOVERY - Host db1027 is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms
[22:17:27] PROBLEM - Host db1028 is DOWN: PING CRITICAL - Packet loss = 100%
[22:18:12] RECOVERY - Host db1028 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms
[22:19:26] !log aaron synchronized php-1.20wmf8/includes/filerepo/backend/SwiftFileBackend.php
[22:19:34] Logged the message, Master
[22:35:39] is wikitech slow just for me or everyone?
[22:37:09] "Wikitech uses cookies to log in users. You have cookies disabled. Please enable them and try again." - looks like it's broken
[22:37:11] MaxSem: it is because i am working on it (upgrades, saving disk space)
[22:38:02] MaxSem: it should be done by now though and confirmed it is still slow.. hrmm.. looking
[22:39:23] MaxSem: cookie message confirmed.. arg..
[22:39:44] tmp fail?
[22:42:41] MaxSem: try again, fixed for me
[22:42:58] MaxSem: started memcached which just got upgraded
[22:43:27] mutante, confirmed, thanks
[22:43:34] pheew:)
[22:50:45] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[23:02:33] mmhhh, i cannot see the logo in it.wiki, but others can
[23:03:39] Jaqen, can you browse to https://upload.wikimedia.org/wikipedia/commons/c/c5/Wikipedia-logo-v2-it.png ?
[23:04:38] Krenair, yes
[23:08:39] funny
[23:21:49] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[23:34:43] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[23:58:01] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours