[00:10:46] Tim: Any thoughts on re-purposing this channel a bit to be more tech help-orientated? An IRC complement to https://meta.wikimedia.org/wiki/Tech
[00:13:13] hm
[00:13:29] Joan, do you know that they're going to get the Wikimedia error page to this channel?
[00:13:32] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:13:34] That might be enough to flood it
[00:13:51] s/to this/link to this/
[00:14:08] I hadn't heard anything about that.
[00:14:24] I think it'd be better to remove the IRC link from the error message altogether.
[00:14:37] Joan, I don't think where it's been documented, but it seemed decided.
[00:14:42] *I don't know
[00:14:47] Looks like time to go to bed
[00:14:47] There's a bug somewhere.
[00:14:53] I doubt it's there
[00:15:02] Yes, that'd be too sensible.
[00:15:05] heh
[00:15:15] No, because I'm on cc IIRC
[00:24:11] RECOVERY - Lucene on search15 is OK: TCP OK - 0.007 second response time on port 8123
[00:32:26] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[00:33:20] Joan: maybe
[00:33:28] I'm not sure what this channel is for really
[00:33:51] It had a clearer purpose before everything moved to -operations.
[00:34:11] well yeah, but they're both full of bot spam to the point of making conversation difficult
[00:34:32] it's easier to use a private channel if you actually want a quiet place to talk
[00:35:35] I'd like a generic tech help place. I also considered #wikimedia.
[00:35:40] I think wikimedia-operations was meant to be identical to #wikimedia-tech except a bit smaller
[00:36:03] so I suppose it's successful in those terms
[00:36:14] except for the botspam
[00:36:57] anyway #wikimedia-tech is a sensible name for a tech help channel, isn't it?
[00:37:35] Yes.
[00:37:54] I was considering changing the channel topic in here and maybe updating a few pages on meta-wiki and mw.org.
[00:38:01] RECOVERY - Lucene on search15 is OK: TCP OK - 2.993 second response time on port 8123
[00:38:07] I moved the "141k" to the operations channel topic.
[00:39:21] we could move the bots to #wikimedia-operations exclusively
[00:39:32] nagios alerts tend to be off-putting for newcomers
[00:39:49] I think people (and bots) expect morebots to be in here.
[00:40:10] The bots would be easier to adjust.
[00:40:11] but nagios and gerrit?
[00:40:30] Probably fine to have them in only operations.
[00:42:17] I'm not sure when this channel became publicly logged. Not sure that's necessary.
[00:46:52] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:47:44] Joan, it documents problems with wikis
[00:47:53] also, sometimes bots fail to log things on wikitech
[00:48:20] I agree that it doesn't make any sense to have nagios and gerrit both here and on -operations...
[00:51:40] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[01:31:52] RECOVERY - Lucene on search15 is OK: TCP OK - 8.991 second response time on port 8123
[01:32:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:33:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.929 seconds
[01:37:43] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours
[01:40:07] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[01:48:59] Okay, I cleaned up https://meta.wikimedia.org/wiki/Tech a bit.
[01:56:01] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 600s
[01:56:37] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 639s
[02:01:16] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:02:28] RECOVERY - Lucene on search15 is OK: TCP OK - 0.006 second response time on port 8123
[02:02:28] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[02:06:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:10:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.944 seconds
[02:16:24] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[02:18:03] RECOVERY - Lucene on search15 is OK: TCP OK - 0.001 second response time on port 8123
[02:18:57] !log LocalisationUpdate completed (1.18) at Mon Feb 20 02:18:57 UTC 2012
[02:19:02] Logged the message, Master
[02:19:42] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[02:26:18] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[02:36:36] !log LocalisationUpdate completed (1.19) at Mon Feb 20 02:36:36 UTC 2012
[02:36:39] Logged the message, Master
[03:08:28] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[03:08:55] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[03:11:37] RECOVERY - Lucene on search15 is OK: TCP OK - 0.002 second response time on port 8123
[03:19:52] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[03:30:22] RECOVERY - Lucene on search15 is OK: TCP OK - 0.002 second response time on port 8123
[03:32:57] hi TimStarling, how goes?
[03:33:23] hi, just looking at RL async mode
[03:33:27] bugs
[03:33:48] any idea what's going on there?
[03:34:26] (no need to explain it all to me...just asking if you're making headway)
[03:35:25] it looks like the obvious solutions have been done already, as I guess you'd expect given the number of talented people who have been working on this
[03:35:39] so the issue (if any actually remain) will probably be something subtle
[03:38:19] the debug facilities available in RL are pretty primitive, I've been complaining about that pretty much since it was invented
[03:38:46] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[03:38:49] editing the source is the only way to really see what is going on, so I hope I can reproduce it locally
[03:42:02] it's unfortunate we don't have great ways of testing this stuff. worse, we never know if the problems we're seeing now are representative of what's to come, or if they're problems specific to some wikis, or if they're non-problems
[03:44:58] I had a look at the IRC bug you pinged me about, wrote some recommendations on the bug report
[03:47:46] I'm not sure who we've got to take that on until Tuesday, with it being an office holiday for the US folks
[03:48:45] maybe Nikerabbit will get a run at it, or perhaps Reedy when he's up
[03:50:22] !b 34508
[03:50:22] https://bugzilla.wikimedia.org/show_bug.cgi?id=34508
[03:51:05] * robla wonders if he can actually do it himself
[03:51:45] roan's in transit, right?
[03:52:00] yeah, he is
[03:52:38] I think I found something, I'm now into "how could this possibly not break all the time" territory
[03:52:57] :)
[03:53:12] much better than "this looks so perfect that it could never break"
[03:53:18] for a bug hunt at least :)
[03:53:45] yeah, assuming you know things aren't working to begin with, I suppose the former is better than the latter
[04:05:29] RECOVERY - Lucene on search15 is OK: TCP OK - 2.995 second response time on port 8123
[04:17:56] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[04:29:47] RECOVERY - Lucene on search15 is OK: TCP OK - 2.997 second response time on port 8123
[04:34:35] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[04:38:11] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[04:56:20] RECOVERY - Lucene on search15 is OK: TCP OK - 0.003 second response time on port 8123
[05:04:44] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[06:16:12] RECOVERY - Lucene on search15 is OK: TCP OK - 9.004 second response time on port 8123
[06:24:27] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[06:40:21] RECOVERY - Lucene on search15 is OK: TCP OK - 3.009 second response time on port 8123
[06:48:19] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[07:06:36] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out
[07:06:54] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out
[07:17:42] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours
[07:21:09] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[07:24:36] RECOVERY - Lucene on search4 is OK: TCP OK - 0.009 second response time on port 8123
[07:25:30] RECOVERY - Lucene on search3 is OK: TCP OK - 0.002 second response time on port 8123
[07:26:15] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.004 second response time on port 8123
[07:26:33] RECOVERY - Lucene on search9 is OK: TCP OK - 0.012 second response time on port 8123
[07:59:39] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[08:02:39] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
[08:05:39] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[08:05:39] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[08:25:45] PROBLEM - SSH on db1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:25:45] PROBLEM - MySQL Replication Heartbeat on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:25:54] PROBLEM - Disk space on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:03] PROBLEM - MySQL Slave Delay on db1040 is CRITICAL: CRIT replication delay 200 seconds
[08:26:03] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: CRIT replication delay 201 seconds
[08:26:12] PROBLEM - MySQL Slave Running on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:13] PROBLEM - mysqld processes on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:13] PROBLEM - RAID on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:13] PROBLEM - MySQL disk space on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:30] PROBLEM - MySQL Recent Restart on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:39] PROBLEM - MySQL Idle Transactions on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:39] PROBLEM - DPKG on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:39] PROBLEM - MySQL Slave Delay on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:48] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 246 seconds
[08:26:57] PROBLEM - Full LVS Snapshot on db1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:27:06] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CRIT replication delay 264 seconds
[08:28:45] RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay NULL seconds
[08:30:51] PROBLEM - Host db1006 is DOWN: PING CRITICAL - Packet loss = 100%
[08:33:06] RECOVERY - MySQL Slave Delay on db1040 is OK: OK replication delay NULL seconds
[08:36:24] RECOVERY - Lucene on search15 is OK: TCP OK - 2.999 second response time on port 8123
[08:44:48] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[09:11:30] RECOVERY - Lucene on search15 is OK: TCP OK - 0.003 second response time on port 8123
[09:19:53] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[10:14:20] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours
[10:24:05] RECOVERY - Lucene on search15 is OK: TCP OK - 0.005 second response time on port 8123
[10:26:38] PROBLEM - Disk space on mw48 is CRITICAL: DISK CRITICAL - free space: /tmp 61 MB (3% inode=89%):
[10:32:30] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[10:44:02] RECOVERY - Disk space on mw48 is OK: DISK OK
[10:47:56] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[10:55:03] Where can I find $wgConf->getLocalDatabases()?
[11:28:51] RECOVERY - Lucene on search15 is OK: TCP OK - 8.999 second response time on port 8123
[11:37:06] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[11:38:36] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours
[12:18:39] RECOVERY - Lucene on search15 is OK: TCP OK - 0.008 second response time on port 8123
[12:20:36] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[12:27:03] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[12:42:18] siebrand: did you get an answer/work it out?
[13:01:21] RECOVERY - Lucene on search15 is OK: TCP OK - 8.994 second response time on port 8123
[13:09:36] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[13:36:02] !log hashar synchronized php-1.19/includes/UserMailer.php 'r111925 for bug 34421 duplicate Subject / wrong To: headers in mail'
[13:36:05] Logged the message, Master
[14:16:22] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
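
(For reference, since the 10:55 question above never got an answer in-channel: getLocalDatabases() is a method of MediaWiki's SiteConfiguration class, exposed as the global $wgConf. A minimal sketch of how it behaves, using a hypothetical wiki list rather than the real one Wikimedia builds from its dblist files:)

    <?php
    // Minimal sketch, not the actual Wikimedia configuration: SiteConfiguration
    // keeps the list of wiki database names in its $wikis member, and
    // getLocalDatabases() simply returns that list.
    require_once 'includes/SiteConfiguration.php'; // class location in the 1.19 era

    $wgConf = new SiteConfiguration();
    $wgConf->wikis = array( 'enwiki', 'dewiki', 'commonswiki' ); // hypothetical list

    foreach ( $wgConf->getLocalDatabases() as $dbname ) {
        echo "$dbname\n"; // e.g. run a per-wiki maintenance task here
    }
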
[14:17:34] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[14:21:59] New patchset: Pyoungmeister; "adding mediawiki install to search indexers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2672
[14:23:05] New patchset: Hashar; "send Swift syslogs to their own file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2673
[14:23:30] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2672
[14:23:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2672
[14:23:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2673
[14:35:25] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[14:56:48] !log hashar synchronized php-1.19/skins/simple/main.css 'r111580 for Bug 34397: align footer so that it does not overlap with sidebar in Simple skin'
[14:56:50] Logged the message, Master
[15:09:32] RECOVERY - Lucene on search15 is OK: TCP OK - 2.995 second response time on port 8123
[15:17:56] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[15:32:29] RECOVERY - Lucene on search15 is OK: TCP OK - 2.993 second response time on port 8123
[15:40:53] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[15:50:47] New patchset: Pyoungmeister; "some more indexer bits and pieces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2674
[15:54:24] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2674
[15:54:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2674
[15:56:47] RECOVERY - Lucene on search15 is OK: TCP OK - 8.999 second response time on port 8123
[16:01:35] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:02:38] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:05:11] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[16:05:29] PROBLEM - SSH on search1001 is CRITICAL: Connection refused
[16:06:59] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 30.86 ms
[16:08:02] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms
[16:09:56] PROBLEM - Disk space on search1002 is CRITICAL: Connection refused by host
[16:10:59] PROBLEM - DPKG on search1002 is CRITICAL: Connection refused by host
[16:11:08] PROBLEM - SSH on search1002 is CRITICAL: Connection refused
[16:11:53] PROBLEM - RAID on search1002 is CRITICAL: Connection refused by host
[16:12:11] PROBLEM - SSH on search1003 is CRITICAL: Connection refused
[16:15:20] RECOVERY - Lucene on search15 is OK: TCP OK - 0.002 second response time on port 8123
[16:19:27] New patchset: Hashar; "remove gerrit/nagios bots from #wikimedia-tech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2675
[16:23:35] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[16:28:05] PROBLEM - NTP on search1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[16:42:01] jeremyb: I do not have root, nor does sumanah. I don't even have shell.
[17:12:32] New patchset: Pyoungmeister; "get the password to the file. boop boop" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2676
[17:18:02] What is that weird email that was just sent to the Wikitech mailing list? Some sort of spam?
[17:19:13] looks like
[17:19:20] Platonides: ping :-)
[17:19:41] Platonides: I believe the latest wikitech-l message is written in Spanish, maybe you can help there?
[17:19:59] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:20:00] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:21:10] coming from Cuba
[17:22:52] that is probably the first time I read a mail coming from Cuba :D
[17:23:18] hashar, o.O time travel
[17:23:55] hashar, his/her message's time: 18:!2
[17:23:59] 18:12
[17:24:26] yeah it shows in the future for me
[17:24:45] His mail date is: Date: Mon, 20 Feb 2012 12:12:08 -0600
[17:25:09] which is 18:12 UTC, 19:12 CET
[17:25:23] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 31.01 ms
[17:25:23] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 31.28 ms
[17:25:27] either the user has the wrong time or the wrong timezone setup
[17:34:31] New patchset: Hashar; "update WiktionaryMobile github URL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2677
[17:36:19] New patchset: Pyoungmeister; "that hostname so doesn't exist..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2678
[17:37:27] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2676
[17:37:27] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2676
[17:37:36] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2678
[17:37:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2678
[17:38:45] zzz
[17:45:33] New patchset: Pyoungmeister; "clean up requires" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2679
[17:46:46] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2679
[17:46:46] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2679
[17:48:43] RECOVERY - Lucene on search15 is OK: TCP OK - 0.000 second response time on port 8123
[17:56:53] New patchset: Pyoungmeister; "regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2680
[17:57:21] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2680
[17:57:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2680
[18:01:01] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[18:03:25] RECOVERY - Squid on brewster is OK: TCP OK - 0.004 second response time on port 8080
[18:03:52] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
[18:05:13] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[18:06:52] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[18:06:52] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[18:09:52] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:15:16] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms
[18:16:21] !log nikerabbit synchronized php-1.18/extensions/Narayam/ 'i18ndeploy r111945'
[18:16:23] Logged the message, Master
[18:17:16] !log nikerabbit synchronized php-1.19/extensions/Narayam/ 'i18ndeploy r111946'
[18:17:18] Logged the message, Master
[18:19:58] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'Narayam on mrwiki, mrwikisource; bug 32669, bug 34454'
[18:20:01] Logged the message, Master
[18:26:04] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[18:26:13] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100%
[18:26:52] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'Narayam on knwiki; bug 34516'
[18:26:54] Logged the message, Master
[18:27:43] RECOVERY - SSH on search1002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[18:28:55] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Mon Feb 20 18:28:38 UTC 2012
[18:28:55] RECOVERY - SSH on search1003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[18:29:04] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms
[18:31:02] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:31:28] RECOVERY - RAID on search1001 is OK: OK: no RAID installed
[18:31:46] RECOVERY - DPKG on search1001 is OK: All packages OK
[18:31:46] RECOVERY - Disk space on search1001 is OK: DISK OK
[18:32:13] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[18:33:44] join #wikimedia-dev
[18:34:37] RECOVERY - NTP on search1001 is OK: NTP OK: Offset -0.09592020512 secs
[18:36:07] RECOVERY - Puppet freshness on search1002 is OK: puppet ran at Mon Feb 20 18:35:55 UTC 2012
[18:37:19] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.06784634783 (gt 8.0)
[18:38:22] RECOVERY - DPKG on search1002 is OK: All packages OK
[18:38:28] Zaran, use /join #wikimedia-dev
[18:38:49] RECOVERY - Disk space on search1002 is OK: DISK OK
[18:39:16] RECOVERY - RAID on search1002 is OK: OK: no RAID installed
[18:42:07] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Mon Feb 20 18:41:55 UTC 2012
[18:44:49] RECOVERY - RAID on search1003 is OK: OK: no RAID installed
[18:45:16] RECOVERY - Disk space on search1003 is OK: DISK OK
[18:45:16] RECOVERY - DPKG on search1003 is OK: All packages OK
[18:46:23] !log aaron synchronized php-1.19/includes/RevisionList.php 'deployed r111952'
[18:46:26] Logged the message, Master
[18:46:37] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.26315605263
[18:49:19] RECOVERY - NTP on search1002 is OK: NTP OK: Offset 0.0131098032 secs
[18:54:52] RECOVERY - NTP on search1003 is OK: NTP OK: Offset 0.0687738657 secs
[19:50:47] New patchset: Pyoungmeister; "well, that sure doesn't exist..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2681
[19:51:19] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2681
[19:51:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2681
[20:01:36] PROBLEM - Lucene on search1005 is CRITICAL: Connection refused
[20:01:36] PROBLEM - Lucene on search1004 is CRITICAL: Connection refused
[20:01:36] PROBLEM - Lucene on search1006 is CRITICAL: Connection refused
[20:01:36] PROBLEM - Lucene on searchidx1001 is CRITICAL: Connection refused
[20:03:41] New patchset: Danakim; "Modified the git-setup script to not use "git config --global" when setting the environment up for the user. This changes any .gitconfig file the user might have in his/her home and might not actually be in sync with the data the user might want to use fo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2682
[20:13:54] RECOVERY - Lucene on search15 is OK: TCP OK - 8.998 second response time on port 8123
[20:14:39] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours
[20:22:09] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[20:25:51] hexmode: you have shell on bz. i thought that maybe you did other places but didn't know. sumanah has shell on fenari? or wherever ldap is (she has sudo access to add new committers). likewise didn't know what else she might have
[20:26:41] I do? Woah! you shouldn't have told me
[20:27:09] I think that is probably all sumanah has, but we'd have to ask her
[20:36:51] RECOVERY - Lucene on search15 is OK: TCP OK - 9.004 second response time on port 8123
[20:45:06] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[20:45:32] Oh, thanks nagios, I actually just wanted to report that the search is partially down ;P
[20:47:27] Search on German Wikipedia only works occasionally/randomly, often 0 search results are returned (and if you hit refresh often enough you get the correct results)
[20:48:24] RobH, brion__, or whoever, are you aware of that problem?
[20:48:46] most people are off (wmf holiday)
[20:49:00] the search infrastructure is pretty fragile right now
[20:49:06] I wonder which pool handles de
[20:49:12] oh, okay.
[20:49:44] New patchset: Hashar; "git-setup script no more use "git config --global"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2682
[20:50:53] search2
[20:51:04] figures
[20:51:54] hmmm
[20:52:00] although the backend looks normal
[20:52:03] not sure where the problem is
[20:52:10] I just restarted search15
[20:52:15] we'll see if that does anything
[20:52:27] it's been whining for a bit
[20:53:48] well it's quite strange, when i go via de.wikipedia.org search doesn't work, but the same query on the backend works just fine
[20:53:55] weird
[20:54:00] thanks for being here to look into it
[20:54:16] Church_of_emacs, you might be interested to know that a very nice table about the search cluster has been published on wikitech recently perhaps?
[20:55:57] heh, thanks, I'm just here to watch the magic search box get fixed ;)
[20:59:14] New review: Hashar; "We try to keep tabulations, this way anyone can use their preferred tab size (2,3,4, 8 ...)." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2682
[21:01:16] New review: Hashar; "Finally, you might want to look at git-review https://labsconsole.wikimedia.org/wiki/Git-review it ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2682
[21:02:25] New review: Catrope; "Yeah git-review does make everything simpler... unless you're on Windows :D . Might be worth keeping..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2682
[21:08:39] (that tells us that Windows makes everything harder ;))
[21:10:11] !log shut down lucene on search15, comes up with some strange errors "Connection refused to host: 10.0.3.15"
[21:10:17] Logged the message, Master
[21:35:35] http://commons.wikimedia.org/wiki/Commons:Village_pump#API_timout_while_renaming_a_file we seem to have some success message/callback problems with logged actions at Commons...
[21:35:47] example.. http://commons.wikimedia.org/w/index.php?title=File:Cognac_L.-A._A._Montalivet.jpg&action=history
[21:36:08] the move script still waits for api.php to answer that the move was successful
[21:39:22] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours
[21:48:00] Saibo: does it not get any response at all?
[21:48:54] Reedy: there was a completed http request to api.php in the firebug network view. But the next one was stuck and waiting
[21:49:03] have closed the window in the meantime
[21:49:24] but try it yourself if you like
[21:49:53] https://commons.wikimedia.org/wiki/File:P1000372_dsdssdd_ds.JPG
[21:50:10] this file volunteers to be moved ;)
[21:50:42] you need to use the move&replace tab instead of the mediawiki move tab, of course
[22:14:17] !log hashar synchronized php-1.19/includes/logging/PatrolLog.php 'r111969 - bug 34495 patrol log credit the user patrolled, not the user patrolling'
[22:14:20] Logged the message, Master
[22:25:40] I've got the problem that when I delete files on de.wp, the description page is deleted, but the picture still appears for several days. Only purging can resolve this. Is this bug known?
[22:26:33] (the file itself isn't accessible, only the thumbs seem to be kept)
[22:28:22] who was bored at the DÜP today? 72 files in one day??
[22:28:36] sry, last message was to the wrong channel
[22:33:17] Quedel: please open a bug report on bugzilla.wikimedia.org
[22:33:32] Quedel: it is a holiday in the US so most developers are out
[22:33:38] or sleeping like me :-)
[22:33:41] see you!
[22:36:45] holiday? then, well, sleep well ;)
[22:44:02] I will sleep soon
[22:44:21] as soon as the stupid eurogroup makes up their minds to announce something
[22:44:26] even another delay.
[22:45:18] here, have another $170B
[22:50:18] for now
[22:51:28] Church_of_emacs: timestamp or subject for the tabular wikitech msg on search? was from OrenOf?
[22:53:19] Quedel: I would say that is connected to the other purging bug that is/was around the other day
[22:54:36] apergos: heard of http://battlemesh.org/BattleMeshV5#Where ?
[22:55:27] thx, p858snake. I had this bug once some weeks ago, just that one time. And today for several files. When you say there is a known bug relating to purge, then I must not open a second one. Thanks.
[22:55:35] no I hadn't
[23:15:09] Evening guys, anyone seen LeslieCarr around in the last day or so please?
[23:17:02] Well today is a US holiday
[23:17:19] So she and a lot of other WMF people have the day off
[23:17:44] Ah, ok. I had no idea.
[23:18:04] What holiday is it?
[23:18:10] president's day
[23:18:12] Presidents' Day
[23:19:33] that's fitting: it is Rosenmontag in Germany (that's a part of the carnival) ;)
[23:19:38] Nice. I wanted to catch her and let her know that Virgin's problem is all fixed up. I was talking to her about 2 days ago, cause I was having grief hitting Wikipedia from Virgin Media, one of our interchanges was down.
[23:20:01] Ah, OK
[23:20:22] I was actually gonna call her in a minute anyway, I'll let her know
[23:20:26] We have an interchange in Amsterdam which had gone kaput, it's been sorted out and I'm getting everywhere now. No lag, no waiting, so it's all sorted out
[23:29:11] When documentation says Server Foo is in Tampa and Server Bar is in DC, what is DC?
[23:29:41] eqiad probably
[23:29:58] http://wikitech.wikimedia.org/view/Dumps/Dump_servers
[23:30:02] "Dataset1001 in D.C., rsync/mirrors:"
[23:30:04] Our Equinix datacenter in Ashburn, VA
[23:30:06] DC, District of Columbia, as in Washington DC
[23:30:09] yeah it's actually in ashburn
[23:30:10] If the number is >1000, it's in eqiad
[23:30:10] whatever
[23:30:18] huh?
[23:30:26] mw1 is in Tampa, mw1001 is in Ashburn
[23:30:31] That's the numbering pattern we use
[23:30:35] Listed as DC, and it's in Virginia?
[23:30:40] Ditto for dataset1 and dataset1001
[23:30:42] ok. categorized dataset1001 as eqiad
[23:30:44] 0.o
[23:30:47] it's so close we all talk about it being in D.C.
[23:30:52] shrug
[23:30:54] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[23:31:02] BarkingFish: Ashburn, VA is a suburb of DC. DC itself is tiny, it's like 5 by 10 miles
[23:31:14] So the city itself is DC and the suburbs are all in Virginia and Maryland
[23:31:52] * apergos goes back to half-sleeping half watching "the enigma code" with Greek subs and half watching twitter reports on the eurogroup
[23:36:20] Well, at least now each cluster has its own page with a somewhat descriptive intro line: http://wikitech.wikimedia.org/view/Category:Cluster#mw-pages
[23:36:54] PROBLEM - Lucene on search1007 is CRITICAL: Connection refused
[23:36:54] PROBLEM - Lucene on search1010 is CRITICAL: Connection refused
[23:36:54] PROBLEM - Lucene on search1011 is CRITICAL: Connection refused
[23:36:54] PROBLEM - Lucene on search1009 is CRITICAL: Connection refused
[23:36:54] PROBLEM - Lucene on search1012 is CRITICAL: Connection refused
[23:36:54] PROBLEM - Lucene on search1013 is CRITICAL: Connection refused
[23:36:54] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused
[23:36:55] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused
[23:36:55] PROBLEM - Lucene on search1020 is CRITICAL: Connection refused
[23:36:56] PROBLEM - Lucene on search1017 is CRITICAL: Connection refused
[23:36:56] PROBLEM - Lucene on search1018 is CRITICAL: Connection refused
[23:36:58] PROBLEM - Lucene on search1019 is CRITICAL: Connection refused
[23:38:50] RoanKattouw: 'esams' is the only one without a definition for its abbreviation
[23:38:55] is it "external storage amsterdam" ?
[23:39:18] It might be now, you've probably just named it Krinkle :D
[23:39:19] EvoSwitch
[23:40:57] RoanKattouw: Oh ?
[23:41:16] Name of the hosting company
[23:41:20] Right
[23:41:21] ditto for Equinix
[23:41:28] yeah
[23:42:04] So "Amsterdam cluster" (esams) at EvoSwitch in Amsterdam (The Netherlands)?
[23:42:45] aye
[23:42:55] Amsterdam cluster is a redirect to Kennisnet cluster
[23:43:01] No
[23:43:03] Oh
[23:43:05] Right
[23:43:07] the wiki page
[23:43:09] :D
[23:43:09] Yeah we have knams too
[23:43:34] But that's just for networking IIRC, it's weird
[23:43:58] There is only 1 page(s) linking to [[Amsterdam cluster]] and it actually refers to esams
[23:44:02] so I'll fix that redirect
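
(A hedged illustration of the host-numbering convention mentioned at 23:30 above: servers numbered 1000 and up live in eqiad (Ashburn, VA), lower numbers in pmtpa (Tampa, FL). guessDataCenter() below is a hypothetical helper written for illustration, not an actual Wikimedia function.)

    <?php
    // Hypothetical helper sketching the convention from the discussion above:
    // mw1/dataset1 are pmtpa (Tampa), mw1001/dataset1001 are eqiad (Ashburn).
    function guessDataCenter( $hostname ) {
        if ( preg_match( '/(\d+)$/', $hostname, $m ) ) {
            return (int)$m[1] >= 1000 ? 'eqiad (Ashburn, VA)' : 'pmtpa (Tampa, FL)';
        }
        return 'unknown';
    }

    echo guessDataCenter( 'mw48' ) . "\n";        // pmtpa (Tampa, FL)
    echo guessDataCenter( 'dataset1001' ) . "\n"; // eqiad (Ashburn, VA)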