[00:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:02:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [00:07:01] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [00:07:11] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:08:01] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 00:07:53 UTC 2013 [00:08:11] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:08:21] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 00:08:20 UTC 2013 [00:09:01] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [00:09:01] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 00:08:52 UTC 2013 [00:09:11] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:09:21] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 00:09:16 UTC 2013 [00:09:51] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 00:09:46 UTC 2013 [00:10:01] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [00:10:11] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:10:11] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 00:10:06 UTC 2013 [00:10:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:10:41] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 00:10:32 UTC 2013 [00:11:01] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [00:11:11] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:11] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 00:11:09 UTC 2013 [00:11:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [00:11:31] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 00:11:23 UTC 2013 [00:12:01] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [00:12:11] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:16:41] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 00:16:36 UTC 2013 [00:17:11] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:30:22] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 00:30:12 UTC 2013 [00:31:01] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [00:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [01:01:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:02:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [01:06:26] PROBLEM - Puppet freshness on 
dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [01:06:56] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [01:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:22:56] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [01:22:56] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [01:22:56] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [01:22:56] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [01:22:56] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [01:22:57] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [01:22:57] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [01:22:58] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [01:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [01:36:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [01:40:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [02:06:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [02:06:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [02:09:13] !log LocalisationUpdate completed (1.22wmf9) at Mon Jul 8 02:09:13 UTC 2013 [02:09:25] Logged the message, Master [02:15:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 8 02:15:21 UTC 2013 [02:15:30] Logged the message, Master [03:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [03:06:41] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [03:07:01] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [03:40:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [03:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.161 second response time [04:06:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No 
successful Puppet run in the last 10 hours [04:07:23] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [04:08:03] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 04:08:00 UTC 2013 [04:08:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [04:08:43] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 04:08:35 UTC 2013 [04:09:23] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 04:09:13 UTC 2013 [04:09:23] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [04:09:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [04:09:53] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 04:09:44 UTC 2013 [04:10:23] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [04:10:23] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 04:10:19 UTC 2013 [04:10:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:53] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 04:10:48 UTC 2013 [04:11:23] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [04:11:23] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 04:11:20 UTC 2013 [04:11:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [04:11:53] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 04:11:45 UTC 2013 [04:12:23] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 04:12:13 UTC 2013 [04:12:23] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [04:12:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [04:12:43] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 04:12:36 UTC 2013 [04:13:03] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 04:13:00 UTC 2013 [04:13:23] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [04:13:23] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 04:13:19 UTC 2013 [04:13:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [04:13:43] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 04:13:42 UTC 2013 [04:14:23] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [04:14:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [04:16:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:54] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 04:16:43 UTC 2013 [04:17:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [04:17:43] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [04:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 
second response time [04:29:53] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 04:29:46 UTC 2013 [04:30:23] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [04:42:51] !log on enwiki: discarded refreshLinks2 entries for templates Yesno, If_pagename, Category_handler/numbered, Category_handler, by setting their root job timestamp to a fake new timestamp. See bug 50785 [04:43:01] Logged the message, Master [05:00:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [05:02:32] icinga-wm surprises me [05:02:50] i put it on /ignore, but it flooded itself off irc and came back with a trailing underscore [05:03:25] like skynet [05:04:29] mmm [05:04:36] skynet has more of a character [05:04:51] give icinga-wm a bit more time [05:05:07] * ToAruShiroiNeko gives icinga-wm_ 10 units of time [05:06:27] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [05:06:37] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [05:27:02] Change merged: Tim Starling; [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/72061 [05:40:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:41:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [06:06:30] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [06:06:50] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [06:07:07] TimStarling: was morebots running when you restarted it earlier? did it look the same as last time? [06:07:47] I didn't check [06:08:45] I filed a bug the last time it happened and blamed the lack of a regular keepalive ping, but logmsgbot / tcpircbot doesn't have that either and it manages to stay on IRC just fine. [06:10:17] so there's probably some basic bug in its connection handling somewhere [06:12:50] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [06:24:20] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:25:10] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [06:31:10] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 06:31:07 UTC 2013 [06:31:30] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [06:31:40] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 06:31:32 UTC 2013 [06:31:51] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [06:32:00] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 06:31:58 UTC 2013 [06:32:20] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 06:32:18 UTC 2013 [06:32:30] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [06:32:50] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 06:32:39 UTC 2013 [06:32:50] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [06:33:00] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 06:32:57 UTC 2013 [06:33:30] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [06:33:50] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [06:40:20] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:40:29] is there anything interesting in /var/log/adminbot.log? [06:42:10] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [06:45:10] PROBLEM - Packetloss_Average on analytics1006 is CRITICAL: CRITICAL: packet_loss_average is 52.0613 (gt 8.0) [06:46:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:47:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [06:59:29] ori-l: well it's a different uplink... [06:59:55] morebots lives on the rackspace cloud [06:59:55] I am a logbot running on wikitech-static. [06:59:55] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [06:59:55] To log a message, type !log . [06:59:58] gah [07:00:28] logmsgbot seems to live on neon (iirc that's icinga?) [07:00:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:08] * jeremyb runs away [07:01:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.075 second response time [07:02:45] right [07:02:52] and: [07:02:55] https://bitbucket.org/jaraco/irc/issue/17/client-ping-and-keep-alive-support [07:03:01] "it appears it may be useful to have a configurable timeout for a client-side ping command in order to force connections to stay open in environments where an aggressive TCP idle timeout is in force." [07:03:17] I wonder if rackspace is such an environment [07:04:27] when it disappears, it's usually with a ping timeout quit message [07:04:41] (most recent was a *.net *.split, tho) [07:05:17] i guess that's less dependant on rackspace and more on the distro and it's settings. 
but presumably there's not too much in common between the rackspace node and the normal cluster provisioning [07:05:33] if you look at other issues, there & at sourceforge (which has some legacy data), other people have reported this problem [07:06:19] and a number of improvements made since the version we are running have been released cite fixing some variant of this problem as their aim [07:07:01] logmsgbot doesn't use the library's process_once / process_forever loop; it implements its own select loop [07:07:33] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:12] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [07:08:46] i'm basically wondering if i should just rewrite morebots, or at least the irc client bits, to use a more robust library, like twisted or znc or supybot [07:09:44] i keep hoping someone else will fix it, though. i feel a weird sense of responsibility because i was the last person to touch it, even though i didn't touch the irc-handling code. blah. [07:09:47] morebots has had problems with netsplits for as long as I can remember [07:09:47] I am a logbot running on wikitech-static. [07:09:47] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [07:09:47] To log a message, type !log . [07:10:12] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours [07:10:40] apergos: idk, i kinda lump them all together (morebots, icinga-wm_, etc.) [07:11:09] logmsgbot is just fine :P [07:11:29] "etc." covers a lot of ground :-D [07:11:50] a crude fix to morebots could just recycle the connection if no !log has been made in a certain period of time [07:12:29] hah [07:12:39] but people complained when puppet was restarting too often!! [07:12:49] i admit it's a shitty fi [07:12:50] x [07:13:11] why can't morebots detect a netsplit? [07:13:42] apergos: dunno. [07:14:07] it used to not reconnect at all, but it does now, at least under certain conditions [07:14:16] i tested it by having an instance log in with my credentials and then ghosting it [07:14:16] that wasn't "why isn't it doing it now" but rather "why wouldn't we look for that and react accordingly" for a fix [07:15:19] apergos: the library it uses is pretty awful to debug. the call graph is like: base_class._connect calls child_class.connect which calls grandchild_class.on_connect [07:15:46] you have to keep jumping between classes/files when debugging it [07:16:03] ah I see the real motivation for a rewrite [07:16:10] not that it's a bad motivation [07:16:12] and morebots is a pile of bad on top of that [07:16:54] heh [07:17:12] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [07:20:54] I think I'll settle for adding some plausibly-revealing logging calls to the code [07:22:08] apergos: btw, you saw the thread about img tarballs? [07:22:29] I see it's there [07:23:03] kevin day isn't sure where to sync from? [07:23:04] I haven't read it because then I would have to reply and then I would feel guilty that I need to do that in my off time and yet I need free time for my off time [07:23:41] where "that" is "get out the scripts that work with swift etc and test and debug and deploy" [07:23:41] sounds complicated... [07:24:11] ok, so basically he hasn't sync'd since ceph? or not even since the beginning of swift? 
[07:24:15] where's uor wikimedia coder status [07:24:18] "it's complicated" [07:24:26] hah [07:24:27] since moving off of having an nfs backend [07:24:30] apergos: I'm glad there are others out there with that sort of relationship with their emails :) [07:24:32] which was a fallback [07:24:38] right [07:26:04] andre the giant! [07:26:17] jeremyb, morn. what did I do wrong this time? :P [07:26:44] [[André René Roussimoff]] [07:26:54] andre__: didn't use SSL, for one :) [07:28:22] ok, i can get on board with that complaint [07:29:24] let's switch my settings to SSL for IRC and see how I fail to connect next time :) [07:29:31] andre__: in xchat, change the server to 'irc.freenode.net/7000', and set the server password to 'andre__:yourpassword' [07:30:24] errr? [07:30:36] not just password to yourpassword? [07:30:44] and username to andre__ [07:31:18] I admit I used "msg NickServ identify mypassword" so far. [07:31:38] is this an initiative to make the interwebs more secure? If so, I appreciate it. [07:31:51] it's not really so awful, tbh [07:32:08] but it's nice never to hear from nickserv and rather easy to set up, so i wonder why it isn't done more often [07:32:28] but yes, it's more secure [07:32:30] +1. thanks [07:32:59] haven't we all heard from nickserv a bunch recently? [07:34:03] * jeremyb runs away [07:56:10] hello [07:56:30] morning hashar [07:56:39] arch internet is slow :D [07:56:40] brb [08:00:52] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:02:52] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:06:50] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [08:07:30] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [08:07:50] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 08:07:46 UTC 2013 [08:08:11] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 08:08:03 UTC 2013 [08:08:22] New patchset: ArielGlenn; "activate rsync from dataset1001 to dataset2 of dumps run there" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72487 [08:08:30] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [08:08:40] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 08:08:36 UTC 2013 [08:08:50] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [08:09:30] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [08:12:11] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72487 [08:16:30] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 08:16:29 UTC 2013 [08:16:50] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [08:17:51] New patchset: Hashar; "beta: tweak $wgLoadScript to use the bits cache" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70322 [08:23:36] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70322 [08:27:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [08:30:00] RECOVERY - Puppet freshness on dysprosium is OK: puppet 
ran at Mon Jul 8 08:29:51 UTC 2013 [08:30:30] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [08:32:48] !log Testing out $wgLoadScript beta/prod variance {{gerrit|70322}}, pulled change and running sync-common on srv193 [08:32:58] Logged the message, Master [08:34:48] !log hashar synchronized wmf-config/CommonSettings.php '$wgLoadScript beta/prod variance {{gerrit|70322}}' [08:34:59] Logged the message, Master [08:35:27] !log hashar synchronized wmf-config/InitialiseSettings.php 'touch config for $wgLoadScript beta/prod variance {{gerrit|70322}}' [08:35:36] Logged the message, Master [08:50:48] PROBLEM - Disk space on wtp1021 is CRITICAL: DISK CRITICAL - free space: / 345 MB (3% inode=78%): [08:59:08] hashar: do we have any kind of test environment for scap / sync scripts? [08:59:19] ori-l: beside srv193 no [08:59:37] srv193 is a test environment for scap / sync? [08:59:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:59:41] ori-l: we badly need integration tests on operations/mediawiki-config.git but that needs a lot of refactoring [08:59:58] ori-l: the caches have srv193 has an application server backend [09:00:10] ori-l: so one can run sync-common on srv193 and use test.wikipedia.org [09:00:26] and you need to touch initialisesettings.php as well [09:00:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [09:02:04] would you have time anytime soon to pair on getting scap & co. running on labs? [09:02:10] New patchset: Hashar; "contint: explicitly require php5-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70182 [09:03:19] i wanted to create a proper puppet module for scap, but it'd require a testing environment [09:03:37] ori-l: that would need to pair up with someone from ops [09:03:57] ori-l: I am not really willing to adapt scap for labs and wait weeks for changes to be approved :-] [09:04:25] plus that probably need a good amount of refactoring, ideally we would use some kind of global configuration that all the scripts would use [09:04:47] New patchset: Hashar; "version images/wikimedia-button.png" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71775 [09:05:01] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71775 [09:05:14] hashar: something like https://gerrit.wikimedia.org/r/#/c/57890/ ? :) [09:05:27] ori-l: also back in february we considered using git-deploy, just before switching to eqiad. Seems git-deploy is stalled though [09:06:03] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [09:06:56] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [09:07:17] that change is part of the reason i'm asking. i made a couple of small mistakes -- nothing really crazy -- but it ended up taking a lot of tim's time to revert that patch, fix it, re-apply it, and then commit another fix [09:08:20] !log hashar synchronized images/wikimedia-button.png 'added to git as of {{gerrit|71775}}' [09:08:27] !log srv281 and srv1173 have ssh timeout when synchronizing configuration. [09:08:30] Logged the message, Master [09:08:40] Logged the message, Master [09:09:41] ori-l: so the patch is completely reverted ? or has it been fixed and reapplied ? 
[09:09:49] fixed and re-applied [09:10:03] great [09:10:15] so I guess it might work on beta if we provide the proper values in the config file [09:10:20] see /usr/local/lib/mw-deployment-vars.sh on tin -- all that stuff is parametrized via puppet [09:10:21] right [09:10:38] and /usr/local/lib/mw-deployment-vars.sh is really a bad name [09:10:40] but I am ranting :-] [09:10:51] i agree [09:10:52] would love to have something like /etc/wikimedia/mw-deployment-var.sh instead [09:11:00] anyway [09:11:09] if the change got merged in puppet, it must be on beta already [09:11:18] but then I can't manage to ssh on beta instances :( [09:11:33] oph [09:11:43] connected on deployment-bastion.pmtpa.wmflabs \O/ [09:11:58] which has MW_RSYNC_HOST=tin.eqiad.wmnet [09:11:59] :D [09:12:41] http://mywiki.wooledge.org/BashGuide/CommandsAndArguments argues against using '.sh' because it's misleading [09:12:48] esp since the scripts require bash, they're not posix [09:12:59] * hashar looks at https://gerrit.wikimedia.org/r/#/c/72058/ [09:13:11] ori-l: ah yeah [09:13:25] ori-l: I wish I could detect that 'foobar' is a shell script and run bash linting on it [09:13:44] could use `file` on each of the files not having an extension maybe [09:13:56] mwscript on fenari was totally unmanaged by puppet, just some script with local modifications [09:14:10] or just /etc/deployment-env or something [09:15:58] i don't think git-deploy has a chance of replacing scap if there's no good spec to refer to about what scap does and how it does it, and i think the best and only kind of spec to have is the sort that you get out of puppetizing and tidying it up [09:16:16] so i don't think cleaning up the scap scripts will be effort wasted [09:18:11] indeed [09:18:22] got to fix up the beta bastion now: err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Class[Misc::Deployment::Vars] is already defined; cannot redefine at /etc/puppet/manifests/misc/deployment.pp:164 on node i-00000390.pmtpa.wmflabs [09:18:22] ;) [09:19:05] that was problem #2 [09:19:21] same think happened on terbium and hume [09:19:31] the deployment puppet classes are a rat's nest [09:20:00] never got properly designed and hacked around for the last few years [09:20:03] just like mediawiki hehe [09:22:15] New patchset: Hashar; "beta: do not use git-deploy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72489 [09:22:30] apergos: hi :-]  Could you please merge in the beta related puppet change ' https://gerrit.wikimedia.org/r/72489 ' ? :-] [09:22:38] ηελλο [09:22:40] grrr [09:22:42] sec [09:23:04] apergos: it is not that urgent, take your time to finish up whatever you were doing (unless it is rewriting MediaWiki to python). [09:23:25] the grrr wasn't about that [09:23:25] it was about the keyboard layout [09:23:25] PROBLEM - Disk space on wtp1014 is CRITICAL: DISK CRITICAL - free space: / 328 MB (3% inode=78%): [09:23:29] ori-l: reviewing your scap adaptation ( https://gerrit.wikimedia.org/r/#/c/72058/ ) that is exactly what I was too lazy to handle :-] [09:24:47] my initial motivation was unrelated, incidentally. i just wanted us to try passing --delete-delay to rsync to minimize fatals when we push a change that removes a php file [09:26:53] ohhh [09:27:23] and delete-delay does not exit on mac os :-] [09:28:10] it's a slightly optimized --delete-after, which should exist [09:28:26] hashar: what are we relying on common_scripts for anyways (what things use it in beta)? 
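(For reference, the rsync behaviour being discussed looks roughly like this; a minimal sketch with placeholder paths and host, not the actual sync-common invocation:

    # --delete-delay: deletions are worked out during the transfer but only applied at
    # the end, so a removed PHP file is not deleted before its replacements have arrived
    rsync -a --delete-delay /srv/common-local/ apache-host:/srv/common-local/
    # --delete-after is the closest equivalent on rsync builds that lack --delete-delay
    # (e.g. older OS X): deletions run as a separate pass once the transfer finishes
    rsync -a --delete-after /srv/common-local/ apache-host:/srv/common-local/
)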
[09:28:27] I must have an old version [09:28:33] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [09:28:37] https://bugzilla.wikimedia.org/show_bug.cgi?id=20085#c4 [09:29:24] apergos: a bunch of misc scripts such as the PHP linter, mwscript wrapper around MultiVersion and so on [09:29:44] mwscript? really [09:29:52] yeah scaryy [09:30:13] it sure is [09:30:19] that is a very weird little collction of scripts [09:30:23] anyways... [09:30:42] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72489 [09:30:53] thanks! [09:31:36] yw [09:32:45] that fixed puppet on beta \O/ [09:34:35] excellent [09:35:05] * apergos considers the idea of mediawiki in python, with a bunch of c/c++ under the hood [09:35:30] the s3cr3t plan is to have much of the GUI to rely on web services / API [09:35:46] have them documented and provide a standard [09:35:56] be still my beating heart [09:35:58] then people could reimplement the standard in whatever they want (aka nodejs or c :D ) [09:36:07] a standard, isn't that going a bit overboard?! [09:36:13] :-] [09:36:28] by standard I mean a specification of what the web service should provide [09:36:36] which is more or less our current php based API [09:41:52] yes that's what I mean [09:42:15] is it right to allow the plebes to reimplement in any language they want, and expect it to work? horrors! [09:42:50] New patchset: Hashar; "deployment: abstract out MW_RSYNC_HOST" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72491 [09:46:38] hashar: wooooooooo [09:48:30] PROBLEM - Disk space on wtp1016 is CRITICAL: DISK CRITICAL - free space: / 301 MB (3% inode=78%): [09:48:40] PROBLEM - Disk space on wtp1024 is CRITICAL: DISK CRITICAL - free space: / 331 MB (3% inode=78%): [09:48:51] ori-l: there is definitely a lot more that need to be done [09:49:25] ori-l: like we do not collect dsh hosts on labs iirc [09:49:50] PROBLEM - Disk space on wtp1017 is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=78%): [09:49:57] do we need dsh hosts? [09:50:09] maybe we can get away with salt/minions/something [09:51:01] I was playing a bit with it the other day, seems pretty nice, especially given that one no longer would have to maintain dsh files.... removing something temporarily from a sync list would get harder though, need to think about that [09:51:04] apergos: there is no salt in labs :( [09:51:13] hmm do we want it? [09:52:06] no idea :-] [09:52:12] i barely now what salt is for [09:52:29] and its jargon (minion, pillar) is obscure to me [09:52:43] what's wrong with dsh? [09:53:07] ori-l: we use flat files to list hosts ? 
:-] [09:53:13] which have to be edited by hand [09:53:20] and fall out of sync without warning [09:53:21] though I think Tim wrote a patch to have the host collected by puppet [09:53:28] and no one knows if they are current or not or which ones [09:53:30] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused [09:53:36] as a good idealist, I would have all of that in LDAP as netgroups [09:53:40] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [09:53:44] so, i love flat files, and having them generated by puppet is the obvious thing to do [09:54:22] I don't love flat files when the rest of puppet doesn't rely on having little lists of which hosts have which facts that we can look at [09:54:30] RECOVERY - Disk space on wtp1016 is OK: DISK OK [09:54:36] and let dsh query the netgroup using something like dsh -group @mediawiki-installation [09:54:40] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [09:54:49] then handle all of that in ldap :-D [09:55:08] mm [09:55:14] * apergos is not crazy about ldap having that [09:55:32] i probably did too much deployment based on active directory :-D [09:55:38] :-D [09:55:40] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:56:12] I do agree that whatever happens, it ought to be puppetized (as salt "grains" are) [09:56:30] RECOVERY - Disk space on wtp1014 is OK: DISK OK [09:56:40] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [09:56:46] i revised the grains resource type a few days ago [09:56:50] hmm maybe Tim patch to collect dsh files never landed in prod [09:57:09] so i'm familiar with it, but a lot of salt is still opaque [09:57:54] to me the problem is that we are using both dsh and salt [09:58:09] either salt fulfill the requirements dsh is providing and we should only use salt [09:58:21] or we should only use dsh and work on improving salt [09:58:35] again, i think we won't really know in the detail what the requirements are until we clean things up [09:58:40] I think we're not really using salt yet [09:58:44] the problem is that we can't stop deployments, and we can't bring the cluster down, and we can't replace pieces of scap while it is such a complicated mess of interdependencies [09:59:12] hence why we wanted to use git-deploy [09:59:14] it' sused for a couple of very limited things [09:59:25] but it could be used for more (what would we want to test?) [09:59:27] and later rewrite it to sartoris (yet another tool hehe) [10:00:36] if you generate the deployment flat-file using puppet [10:00:52] and make scap atomic by syncing to a versioned directory name and then updating a symlink to point to it [10:00:58] what else would we need? [10:01:08] what would other platforms offer at that point? [10:01:23] both of these changes seem pretty straightforward [10:01:43] salt lets you do things like wild card match against a pile of names, or execute on hosts which have x but not y [10:01:55] do that with dsh -/ [10:02:15] I want all hosts running precise that have kernel X [10:02:18] for example.... [10:02:20] dsh lets me deploy code to the cluster [10:02:22] ahh https://gerrit.wikimedia.org/r/#/c/56107/ [10:02:23] do that with salt :P [10:02:35] flat dsh files under /files/dsh/group :-D [10:03:12] flat files? 
*sigh* [10:03:13] ori-l: so if we want to use scap / dsh on beta, that needs use a way to have the dsh group adapted for beta [10:04:50] RECOVERY - Disk space on wtp1017 is OK: DISK OK [10:05:18] ah well [10:05:33] I need to go run an errand and will be back in a little while [10:05:43] apergos: have a good lunch! [10:06:11] * ori-l waves [10:06:13] not lunch, this is "get supplies for watering system in backyard" (shops are closed in the evening for this) [10:06:58] * hashar notes ariel has a watering system in his backyard, whatever that maybe, it could be proven useful later on. [10:07:01] #PRISM [10:07:25] PROBLEM - Disk space on wtp1005 is CRITICAL: DISK CRITICAL - free space: / 247 MB (2% inode=78%): [10:08:36] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [10:08:36] RECOVERY - Disk space on wtp1024 is OK: DISK OK [10:08:39] it means I can water the plants in their pots easier [10:08:46] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [10:08:46] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [10:08:50] anyways gone [10:10:39] there's nothing straightforward about generating dsh nodelists with puppet [10:10:46] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:11:26] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [10:11:51] mark: hi, I guess that is why we haven't done it yet :-] [10:12:07] mark: I have noticed you merged my change to adapt upload cache class for beta! thanks! [10:12:13] yup [10:12:36] will let us move out of the squid/lucid instance :-] [10:12:54] I was like "no no I *need* to do this before I go into the weekend!" ;) [10:14:38] I would handle the migration this morning but I can't access to labs instance anymore ehhe [10:14:41] wiating for Coren [10:15:21] hashar: Que pasa? [10:15:36] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:15:43] Coren: have you had your breakfast already and your first coffee? :-] [10:15:51] hashar: No. :-) [10:16:05] so enjoy it and I will ping you after my lunch :-D [10:16:13] Oh, allright. [10:16:42] there is no urgency, just some labs instance can't be sshed into probably because of some /home NFS madness. [10:16:50] mark: how does salt improve it? [10:17:02] ori-l: I didn't say that it does? [10:17:26] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [10:18:26] right. well, i take your point about 'straightforward'. what i should have said is that there is nothing about dsh that makes it especially problematic [10:18:32] at least that i can think of [10:20:23] for deploying mediawiki? it doesn't handle failures particularly well [10:20:46] RECOVERY - Disk space on wtp1021 is OK: DISK OK [10:21:39] bb after lunch [10:22:11] mark: what does it do? 
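(The targeting difference sketched a few messages above — flat dsh group files versus salt matching on host facts — comes down to something like the following; the group and grain names here are illustrative assumptions, not the production setup:

    # dsh: run a command on every host listed in a flat group file
    dsh -g mediawiki-installation -M -c -- uptime
    # salt: target by grains, e.g. "all precise hosts on a given kernel", with no host list to maintain
    salt -C 'G@oscodename:precise and G@kernelrelease:3.2.0-45-generic' cmd.run 'uptime'

salt also returns a result, or an explicit error, per minion, which speaks to the "never know for sure something ran on a box" concern raised just below.)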
[10:24:59] if a box is dead or its sshd doesn't work well, it stalls the connection [10:25:02] and you never kow for sure something ran on a box [10:25:11] which is kinda annoying if you have hundreds [10:26:56] PROBLEM - Host colby is DOWN: PING CRITICAL - Packet loss = 100% [10:27:06] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=78%): [10:27:16] RECOVERY - Disk space on wtp1005 is OK: DISK OK [10:27:26] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [10:27:27] !log (re)installing colby, rubidium, eeden [10:27:37] Logged the message, Master [10:27:46] PROBLEM - Disk space on wtp1020 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=78%): [10:29:06] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [10:29:45] ah. hrm. [10:31:08] so I think the question is more: "what's wrong with salt?" [10:31:55] ideally our deployment system is agnostic to the dispatch mechanism used anyway [10:32:06] RECOVERY - Host colby is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [10:32:41] yes, i think that's a good insight, but how do you get there? [10:33:05] my sense is by gradually simplifying the existing scap scripts to eliminate legacy code paths and cruft [10:33:06] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [10:33:44] at which point the scripts will double as a good description of what any deployment system should be able to do (or do better) [10:34:19] thought you're right that salt vs. dsh isn't especially relevant to that [10:34:26] PROBLEM - SSH on colby is CRITICAL: Connection timed out [10:34:58] *though [10:36:21] RECOVERY - SSH on colby is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [10:39:11] RECOVERY - Disk space on wtp1018 is OK: DISK OK [10:39:11] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [10:45:31] PROBLEM - Disk space on wtp1009 is CRITICAL: DISK CRITICAL - free space: / 346 MB (3% inode=78%): [10:46:11] PROBLEM - NTP on colby is CRITICAL: NTP CRITICAL: No response from NTP server [10:48:18] New patchset: Matthias Mullie; "(bug 50926) Disable feedback via AFTv5 on de.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72496 [10:50:31] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [10:53:11] RECOVERY - Disk space on wtp1020 is OK: DISK OK [10:53:11] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [10:55:00] New patchset: Faidon; "autoinstall: add eeden" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72497 [10:55:36] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72497 [10:59:20] New patchset: Matthias Mullie; "(bug 50926) Disable feedback via AFTv5 on de.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72496 [11:02:31] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [11:02:31] RECOVERY - Disk space on wtp1009 is OK: DISK OK [11:06:33] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [11:06:33] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [11:19:03] RECOVERY - NTP on colby is OK: NTP OK: Offset -0.01593887806 secs [11:23:53] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours 
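(The "sync to a versioned directory, then flip a symlink" idea floated around 10:00 would look roughly like this; a hypothetical sketch with made-up paths, not how scap actually works:

    REL=/srv/mediawiki-releases/$(date +%Y%m%d-%H%M%S)
    rsync -a /srv/mediawiki-staging/ "$REL/"        # stage the new tree under a fresh name
    ln -s "$REL" /srv/mediawiki.new
    mv -Tf /srv/mediawiki.new /srv/mediawiki        # rename(2) is atomic: readers see old or new, never a mix
)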
[11:23:53] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [11:23:53] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [11:23:53] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [11:23:53] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [11:23:54] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [11:23:54] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [11:23:55] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [11:41:27] so what's up with snmptt [11:41:33] i thought they found the culprit? [11:46:08] re [11:47:25] mark: I gave out some investigation vector on ops list [11:47:29] forgot to paste on the RT though [11:50:40] I have copy pasted my reply on https://rt.wikimedia.org/Ticket/Display.html?id=5311 [12:02:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [12:05:04] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: No successful Puppet run in the last 10 hours [12:06:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [12:07:03] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:07:03] PROBLEM - Puppet freshness on es1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:07:43] RECOVERY - Puppet freshness on ms-fe3 is OK: puppet ran at Mon Jul 8 12:07:37 UTC 2013 [12:08:03] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 12:07:56 UTC 2013 [12:08:34] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 12:08:31 UTC 2013 [12:08:53] RECOVERY - Puppet freshness on es1005 is OK: puppet ran at Mon Jul 8 12:08:43 UTC 2013 [12:08:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [12:09:03] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:13] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 12:09:07 UTC 2013 [12:09:43] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 12:09:38 UTC 2013 [12:09:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [12:10:03] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:10:23] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 12:10:14 UTC 2013 [12:10:43] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 12:10:42 UTC 2013 [12:10:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [12:11:03] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:11:23] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 12:11:14 UTC 2013 [12:11:43] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 12:11:39 UTC 2013 [12:11:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [12:12:03] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No 
successful Puppet run in the last 10 hours [12:12:03] RECOVERY - Puppet freshness on ms-be1002 is OK: puppet ran at Mon Jul 8 12:12:02 UTC 2013 [12:12:03] RECOVERY - Puppet freshness on ms-be1001 is OK: puppet ran at Mon Jul 8 12:12:02 UTC 2013 [12:12:13] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 12:12:07 UTC 2013 [12:12:34] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 12:12:29 UTC 2013 [12:12:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [12:13:03] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:13:03] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 12:12:54 UTC 2013 [12:13:13] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 12:13:11 UTC 2013 [12:13:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [12:13:53] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 12:13:48 UTC 2013 [12:14:03] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:14:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [12:16:43] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 12:16:34 UTC 2013 [12:17:03] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:29:55] New patchset: Hashar; "get rid of GlusterFS on deployment-prep labs project" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72504 [12:29:55] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 12:29:46 UTC 2013 [12:30:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [12:31:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [12:43:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:43:29] !log jenkins: reduced number of executors on master from 2 to 0. [12:43:38] Logged the message, Master [12:44:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [12:52:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [13:58:16] New review: coren; "This seems fine to me, but given the number of systems depending on the class I'd rather have a seco..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/72504 [14:02:43] mark: could you cast your voice on a hack I made on base.pp please ? 
It is to get rid of glusterFS for the labs deployment-prep project [14:03:03] that adds a case statement based on $::instanceproject https://gerrit.wikimedia.org/r/#/c/72504/ [14:06:38] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [14:07:08] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [14:16:38] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 14:16:31 UTC 2013 [14:17:08] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [14:24:48] New patchset: ArielGlenn; "remove debugging crap from dumps rsync script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72519 [14:25:42] New review: Mark Bergsma; "I think it's extremely ugly." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72504 [14:28:38] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72519 [14:29:48] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 14:29:46 UTC 2013 [14:30:38] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [14:36:48] New patchset: Mark Bergsma; "Allow persistent connections with vcl_error" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/72520 [14:37:04] New patchset: Faidon; "swift: set memcache_serialization_support to 2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72521 [14:37:09] !log apt: installing gdnsd 1.8.3-1~precise1 [14:37:18] Logged the message, Master [14:37:56] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72521 [14:43:07] New review: Hashar; "Mailed Ryan + Marc & Mark :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72504 [14:44:35] New patchset: Faidon; "packaging scripts update & init script changes" [operations/debs/jmxtrans] (master) - https://gerrit.wikimedia.org/r/72522 [14:46:31] jenkins giving me headaches, I am out again. Will be back later tonight for the monday conf calls [14:46:44] New review: Akosiaris; "LGTM." [operations/debs/buck] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/72050 [14:46:45] Change merged: Akosiaris; [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/72050 [14:50:34] who is on RT duty this week? [14:51:01] i can use some help with 2 simple RT tickes [14:51:28] "on RT duty: Ryan_Lane" [14:52:39] MatmaRex: that was last week [14:53:03] drdee: fixed [14:53:09] thanks paravoid [14:53:13] whoopsie. [14:53:40] New review: Akosiaris; "I really can not reproduce this. This file never got created in any of my tests (ran on 12.04 in lab..." [operations/debs/buck] (master) C: -2; - https://gerrit.wikimedia.org/r/72055 [14:53:43] * Nemo_bis cries at "Can Merge: No" https://gerrit.wikimedia.org/r/#/c/33713/ [14:54:17] Nemo_bis: probably some whitespace changes [14:54:23] paravoid: in case you're available to +2 a change ok'd by Asher, let me know so I can rebase it [14:54:24] Nemo_bis: ever used meld for merging? try it [14:54:47] hmm I used meld in the past but not for this [14:55:10] not today, sorry [14:56:25] smart peter [14:56:29] who's on rt duty? not peter [14:56:32] paravoid: any day [14:56:40] heh [14:56:41] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71106 [14:57:02] of course we can turn that around [14:57:20] who's not on rt duty... 
[14:58:44] RECOVERY - Packetloss_Average on analytics1006 is OK: OK: packet_loss_average is 0.0 [14:59:00] could somebody have a quick look at https://rt.wikimedia.org/Ticket/Display.html?id=5423 (creating an rt account) [15:04:58] New patchset: Petr Onderka; "reorganize files" [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/72525 [15:05:31] Change merged: Petr Onderka; [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/72525 [15:06:59] New patchset: Akosiaris; "Google in debian/copyright" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/72526 [15:07:31] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [15:08:11] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [15:12:15] New review: Akosiaris; "LGTM" [operations/debs/buck] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/72526 [15:12:15] Change merged: Akosiaris; [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/72526 [15:12:31] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:13:31] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:16:31] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:17:31] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:20:31] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:21:31] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:24:55] Elsie: they are calling you Mr. McBride [15:28:43] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71122 [15:29:13] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71123 [15:30:45] New review: Akosiaris; "(1 comment)" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/70673 [15:34:18] Tim-away: AaronSchulz: What is currently being done about the job queue being backlogged for > a week? [15:34:45] This is causing problems with VisualEditor and TemplateData which are being populated quickly right now, but the data is not being updated as it should. Causing many users to think it is broken. [15:34:53] James_F: ^ [15:35:39] And for all intends and purposes, it is broken if it takes a week for things to appear. [15:37:29] Krinkle: ? [15:37:57] Krinkle: Oh, the job queue being grossly under-resourced? Yeah. :-( [15:39:23] New patchset: Reedy; "Remove $wgUrlProtocols overrides" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72471 [15:40:01] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72471 [15:40:50] I thought one guy had done some null edits to some really high use templates on enwiki? [15:41:23] afaik that was after it was already backlogged for a week [15:41:43] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71124 [15:41:44] And people will likely continue to do that, which is unfortunate [15:42:33] what is TemplateData? 
[15:43:00] https://www.mediawiki.org/wiki/Extension:TemplateData [15:43:15] A project spawned out of VisualEditor to store some information about a template in a machine readable way [15:43:29] New patchset: Mark Bergsma; "Add a function vcl_error_keepalive and use it for PURGES" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72530 [15:43:37] ^ [15:43:48] i tested all cases manually, seems to work fine [15:43:49] it's working fine, but (for reasons operations supports) the problem is in the fact that on large wikis we tend to encourage users to add documentation on the /doc subpage instead of the main template [15:44:00] with http 1.1 vs 1.0 and connection: close/keepalive [15:44:21] paravoid: So that you can update the template documentation without invalidating millions of pages that use the template (e.g. it is transcluded in a block on the main template page) [15:44:43] paravoid: due to the job queue being backlogged for a week, it takes a week for an update to Template:Foo/doc to appear on Template:Foo [15:44:52] New patchset: Reedy; "(bug 50561) Add 'Translation' namespace for ukwikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72054 [15:45:23] though this isn't a problem for the html view (that one is re-parsed on demand when viewing or purging the page from the GUI), it is a problem for the API actions that expose the page properties, which are not being updated [15:45:23] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72054 [15:45:33] (e.g. that deferred update takes week) [15:45:55] New patchset: Reedy; "(bug 50156) Set import sources 'w' and 'en' for ml.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72254 [15:46:21] that's a major problem currently hurting VisualEditor because lots of people are updating documetnation of templates to use template data, but none of the data is being updated in the database due to the jobqueue being backlogged. [15:46:23] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72254 [15:46:57] New patchset: Reedy; "(bug 50357) Set $wgSitename for te.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72318 [15:47:02] Krinkle: > 1 week, unless it's improved a lot in the past few days - it was 12 days. 
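(For context, the backlog being discussed can be inspected with something like the following; illustrative commands only, and the API's "jobs" figure is just an estimate:

    # approximate queue size exposed by the web API (siprop=statistics includes a "jobs" count)
    curl 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json'
    # on a maintenance host, via the MultiVersion wrapper mentioned earlier in the log
    mwscript showJobs.php --wiki=enwiki    # total queued jobs; a --group option, where available, breaks it down by type
)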
[15:47:15] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72318 [15:49:26] New patchset: Reedy; "(bug 50802) Create new user groups on sh.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72252 [15:50:20] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72252 [15:50:36] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71125 [15:52:57] New patchset: Reedy; "(bug 50658) Clean up $wgExtraNamespaces in InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72059 [15:53:05] New patchset: Mark Bergsma; "Allow persistent connections for HTTP PURGE (error) responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72530 [15:53:17] New patchset: Reedy; "Fixed import source for testwikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71998 [15:54:20] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71998 [15:57:29] !log reedy synchronized wmf-config/InitialiseSettings.php [15:57:39] Logged the message, Master [16:04:36] New patchset: Krinkle; "Fix various path inflexibilities and inconsistencies" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62923 [16:04:45] New review: Krinkle; "Resolved merge conflict." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62923 [16:05:10] New review: Krinkle; "Per Aaron." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/62923 [16:05:18] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62923 [16:06:05] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [16:06:35] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [16:07:55] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 16:07:51 UTC 2013 [16:08:05] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [16:08:15] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 16:08:13 UTC 2013 [16:08:26] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71126 [16:08:35] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [16:08:45] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 16:08:36 UTC 2013 [16:08:55] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 16:08:54 UTC 2013 [16:09:01] New review: Mark Bergsma; "Varnish overwriting Connection: header was wishful thinking. It just adds another. Best case we have..." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/72530 [16:09:10] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [16:09:13] New review: Krinkle; "What is the rationale for sorting by site family first and then by wiki sub code later? We do it alp..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72059 [16:09:16] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 16:09:14 UTC 2013 [16:09:35] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [16:09:35] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 16:09:28 UTC 2013 [16:10:05] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [16:10:35] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [16:12:46] New review: Odder; "As far as I understand it's because that way it's easier for people to see what configuration other ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72059 [16:13:45] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [16:16:35] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 16:16:26 UTC 2013 [16:17:05] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [16:18:08] New patchset: Krinkle; "Clean up $wgExtraNamespaces in InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72059 [16:30:55] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 16:30:49 UTC 2013 [16:31:35] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [16:33:20] New patchset: Dzahn; "add a script and cron to mail out bugzilla audit log and move bugzilla scripts to files/bugzilla instead of misc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562 [16:35:14] New review: Dzahn; "alias is changed and setup. bugzilla-admin and -admins the plural form work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562 [16:37:22] mark: can I move dysprosium? [16:40:11] New patchset: Dzahn; "add a script and cron to mail out bugzilla audit log and move bugzilla scripts to files/bugzilla instead of misc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562 [16:41:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562 [16:43:22] !log installing package upgrades on kaulen (bugzilla) [16:43:30] cmjohnson1: to where? [16:43:30] Logged the message, Master [16:43:52] c8 with the rest of the 10G servers [16:44:08] no, we're probably gonna do something else with it [16:44:13] if you are using it then it can wait but will need that uplink module soon [16:44:14] but something more urgent first [16:44:17] when you guys swapped the C8 switch [16:44:23] you didn't reconnect the uplinks [16:44:30] and we're now saturating the existing links [16:44:37] so we urgently need to fix that [16:46:08] shit..let me do that now...going to add them to 8/0/38 and 39 [16:46:26] no wait [16:46:45] you can't just plug them in, you can cause loops that way [16:47:10] oh..no not going to put it in..just letting you know..i need to find sfp's first [16:47:15] ok [16:47:22] i'll configure the ports [16:48:30] 8/0/38 should go to cr1-eqiad [16:48:34] 8/0/39 should go to cr2-eqiad [16:48:37] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:49:11] mark: lmk when i can connect [16:49:18] go ahead [16:49:27] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [16:50:32] New patchset: Dzahn; "add misc::bugzilla::auditlog to kaulen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72539 [16:51:08] mark: connected [16:51:44] New patchset: Dzahn; "add misc::bugzilla::auditlog to kaulen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72539 [16:53:09] 8/0/39 doesn't work... [16:53:11] let me check why [16:53:29] can't see why [16:53:34] perhaps you need to cross tx/rx? [16:53:53] or perhaps it's disabled on the other side [16:53:55] i'll check [16:54:46] no don't see why [16:55:11] maybe bad sfp [16:56:31] do you have another? [16:56:45] New patchset: ArielGlenn; "mwbzutils: version to 0.0.4, clean up in prep for debian packaging" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/72005 [16:57:05] i have one that i used for the second set of uplinks in row a that are not up yet [16:57:08] i can borrow [16:57:40] ok [16:57:43] do that I guess [16:57:53] Receiver signal average optical power : 0.7696 mW / -1.14 dBm [16:57:56] that looks fine though [16:58:36] swapped but the port link light is still off [16:58:57] checking cr2 [16:59:35] fortunately cr1 is now master for row C [16:59:50] that link is now 20Gbps [17:00:04] New patchset: Ottomata; "Adding ezachte and spetrea to stats group on stat1002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72541 [17:02:05] fortunate for now [17:03:56] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72539 [17:06:43] New patchset: Dzahn; "move all misc::bugzilla::* to role/bugzilla.pp, less includes in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72543 [17:08:40] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [17:08:40] cmjohnson1: looks like cr2 is not getting light from row C [17:08:45] other direction looks fine [17:08:50] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [17:08:55] perhaps the SFP in cr2 is bad [17:09:12] it is possible...i am going to swap it now [17:09:19] can you swap the optic in cr2-eqiad:xe-5/1/2 to test? [17:09:52] Thehelpfulone: can you confirm you got mail? [17:10:24] New patchset: ArielGlenn; "set appropriate user/group for rsync between dump hosts, for primary server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72545 [17:10:43] cmjohnson1: do you have a light meter? [17:10:55] just got it today [17:11:01] excellent [17:11:02] the lc adapter may have shipped separately =[ [17:11:05] here's your chance to test it [17:11:06] cmjohnson1: got adapter?
[17:11:06] ah [17:11:09] Robh: got mine btw [17:11:10] thanks [17:11:10] New patchset: Ottomata; "Adding ezachte and spetrea to stats group on stat1002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72541 [17:11:18] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72541 [17:11:25] mark: robert miller (admin) is kinda awesome, i hand him things and they show up where i want them to =] [17:11:31] well [17:11:37] i had to pay import duties [17:11:40] that wasn't so awesome [17:11:41] oh =[ [17:11:44] i'll expense it [17:11:52] i also didn't know it was coming [17:11:55] yea but we should handle that on this side if we can [17:12:03] just easier for us to use corporate account to pay that [17:12:05] so I got a note, when I wasn't home, that there was a parcel waiting for me and I had to pay 30 bucks for it [17:12:13] i'll try to make sure we do next time [17:12:13] and had to go to the post office for it ;) [17:12:55] cmjohnson1: still not working [17:13:01] get your light meter out ;) [17:13:09] even without the adapter you can still measure light, but the value won't be accurate [17:13:18] if it sees SOME light, it's at least not totally broken [17:13:32] just hold the fiber in front of the sensor [17:13:37] as close as possible [17:13:46] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72545 [17:14:32] what kind of light meter does chris have? [17:15:07] !log csteipp synchronized php-1.22wmf9/includes/SkinTemplate.php 'Deploy missing function for SUL2' [17:15:07] sorry..mark just got it and had to put the batteries in [17:15:09] :) [17:15:09] Logged the message, Master [17:15:17] simplifiber pro...it sees light [17:15:37] new optic is in...going to connect [17:15:38] hmm [17:15:40] Laser output power : 0.0180 mW / -17.45 dBm [17:15:49] looks like either the optic in asw-c8 is bad [17:15:55] or the switch is not driving the optic [17:16:04] the switch itself says it's only sending -17 dBm [17:16:31] put it in 1310nm btw [17:16:51] that's the wavelength of the signal that is sent [17:17:32] ok...now I am not getting light [17:17:43] what does it say? [17:17:51] and do you have the lc adapter or not? [17:18:10] nothing [17:18:21] no i do not [17:18:31] ok [17:18:45] we can try another port on the switch instead of 8/0/39 [17:18:59] let's try 8/0/32 [17:19:17] lmk when i can connect [17:19:22] go ahead [17:19:28] it's not configured, but we can check light levels [17:20:03] New patchset: Pyoungmeister; "making icinga reload instead of restart on new confs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72546 [17:20:22] now it looks better [17:20:26] notpeter: heh, thanks [17:20:44] notpeter: sorry for suggesting what needs to be done and not doing it, I wanted to finish up some things today [17:20:49] cmjohnson1: weird [17:20:55] can you put it back in 8/0/39 one more time?
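The optical power figures quoted above are the same readings expressed two ways: dBm is just 10·log10 of the power in milliwatts. The 0.7696 mW received signal works out to a healthy -1.14 dBm, while the 0.0180 mW the asw-c8 optic is transmitting works out to -17.45 dBm, well below what a working 1310 nm optic normally launches (roughly -8 dBm or better for 10GBASE-LR), which is why a bad optic, or a switch that is not driving it, is suspected above. A quick sketch of the conversion:

    # Sketch: convert the optical power readings quoted above from mW to dBm
    # (dBm = 10 * log10(power_mW)) to show why -17 dBm points at a bad optic
    # or a switch that is not driving it. Labels are taken from the log.
    import math

    def mw_to_dbm(power_mw):
        return 10 * math.log10(power_mw)

    readings = {
        "receiver signal average optical power": 0.7696,  # reported as -1.14 dBm
        "asw-c8 xe-8/0/39 laser output power": 0.0180,    # reported as -17.45 dBm
    }

    for label, mw in readings.items():
        print(f"{label}: {mw:.4f} mW = {mw_to_dbm(mw):+.2f} dBm")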
[17:21:01] sure [17:21:15] we know it's not the optics or the fiber now [17:21:18] must be the switch [17:21:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72546 [17:22:07] paravoid: I too can write this one line :) [17:22:23] mark: that is odd to have one bad port like that [17:22:23] Laser output power : 0.0150 mW / -18.24 dBm [17:22:25] not working [17:22:26] yes [17:22:47] let's make it permanently use xe-8/0/36 [17:22:49] paravoid: you did the more useful part of actually thinking of what needed to be done :) [17:22:51] assuming that works [17:22:59] ok [17:23:00] hopefully soon we can swap it again for an ex4550 [17:23:07] and then we can investigate whether it's broken or not [17:23:09] and send it back if so [17:23:15] i need to send sbernardin a ticket to send that [17:23:35] okay to move to 0/36? [17:23:36] notpeter: you got a sec to take a look at https://rt.wikimedia.org/Ticket/Display.html?id=5423 [17:23:39] yep [17:24:15] sure [17:24:37] cmjohnson1: works fine now [17:24:39] weird [17:24:47] it's possible it just works after a switch reboot or something [17:24:51] but I don't want to test that now :P [17:25:17] it's not worth the possible chaos [17:25:25] New review: Ryan Lane; "This won't solve your problem and it's incredibly specific. You need to ensure that every instance i..." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/72504 [17:25:46] cmjohnson1: guaranteed chaos [17:25:58] half our traffic is on that switch [17:29:20] mutante: you there? [17:29:59] !log Gave row A and row C upload caches equal share of traffic again [17:30:23] Logged the message, Master [17:31:40] PROBLEM - Disk space on virt5 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 42617 MB (3% inode=99%): [17:32:51] cmjohnson1: you can also start disconnecting the secondary uplinks of all cp* servers in row A [17:32:54] the one with the extra switch [17:33:02] i want to take that switch out soon [17:33:44] ok [17:34:08] those servers are now unused [17:34:17] they're still up but don't serve traffic [17:34:35] so as we reuse them...we should probably rename them [17:34:49] if they won't be varnish [17:34:51] some will [17:35:56] hmm actually [17:36:00] replacing that EX4500 with a 4550 is gonna be a problem ;) [17:36:23] OS compatibility? [17:36:36] no, just the downtime of half those boxes during the swap [17:36:50] i'll think about possibly mitigating that [17:37:05] half the cache gone, etc ;) [17:38:02] I could add the new switch and do one at a time? it is painfully slow but ... [17:38:17] yeah maybe [17:38:38] or do it at a crazy hour [17:38:59] at least it's not in any way urgent [17:39:13] but if we get more EX45xx in row C, it's nice to have them all consistently the same model [17:44:29] New patchset: Mark Bergsma; "Allow persistent connections with vcl_error" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/72550 [17:46:08] New review: Mark Bergsma; "I've fixed this in a newer patch to Varnish, which simply unsets the Connection response header afte..."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/72530 [17:50:51] New patchset: Jgreen; "another try at fundraisingdb my.cnf via mysql module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72553 [17:51:23] cmjohnson1: there's also the OS compatibility - ex4550's in a mixed chassis need 12 [17:51:23] :( [17:51:57] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72553 [17:53:08] lesliecarr: i thought there was something but iirc that all junos is backwards compatible [17:54:22] now i am looking this up to make sure [17:54:41] drdee: done [17:54:53] cmjohnson1: they need 12.2 r1 :( http://www.juniper.net/techpubs/en_US/release-independent/junos/topics/concept/ex-series-software-features-overview.html#first-junos-release-for-ex-platforms-table [17:55:01] and we have 10x/11.x on our switch stacks [17:55:34] well damn! [17:55:54] New patchset: Jgreen; "continuing to fail with mysql module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72555 [18:00:58] thx notpeter! [18:01:14] does this also mean that tnegrin has received his user credentials by email? [18:01:55] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72555 [18:03:55] New patchset: Jgreen; "grrr." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72557 [18:18:24] Reedy: Eh, can you help out for a second? Doing "sql" on tin gives me: [18:18:28] PHP Notice: Undefined variable: wmfCommonDir in /a/common/wmf-config/wgConf.php on line 25 [18:18:37] and more errors resulting from it [18:19:18] Aaron recently approved a change of mine that introduces that variable, but we both confirmed it to work fine. [18:19:26] Does sql execute scripts differently? [18:19:47] No, something is wrong [18:20:01] mwscript eval.php enwiki doesn't work either [18:20:07] PHP Notice: Undefined variable: wmfCommonDir in /a/common/wmf-config/wgConf.php on line 25 [18:20:12] PHP Warning: file(/all.dblist): failed to open stream: No such file or directory in /a/common/wmf-config/wgConf.php on line 25 [18:20:38] yeah [18:20:43] https://gerrit.wikimedia.org/r/#/c/62923/ [18:21:22] Reverting fixes it [18:21:28] So something is something wrong [18:21:48] no all.dblist, that would be bad [18:22:00] "This file is used by commandLine.inc and CommonSettings.php to initialise $wgConf " [18:22:04] from wgConf.php [18:22:11] perhaps it is loaded in the wrong order? [18:22:20] It would look like it's looking for the dblists in wmf-config [18:22:23] not wmf-config/.. [18:22:40] well, that's because wmfCommonDir is unset [18:22:52] I think [18:22:58] 136 -$wmfConfigDir = "$IP/../wmf-config"; [18:22:58] 137 +$wmfCommonDir = dirname( __DIR__ ); [18:22:58] 138 +$wmfConfigDir = __DIR__; [18:23:23] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72557 [18:23:53] It is included after the variable is et [18:23:54] set [18:23:54] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [18:24:03] So that's fine [18:24:14] New patchset: Reedy; "Revert "Fix various path inflexibilities and inconsistencies"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72561 [18:24:51] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72561 [18:25:52] Reedy: Maybe it needs "global" before using the variable? 
[18:26:10] wgConf is included after we set wmfCommonDir and wmfConfigDir [18:27:21] Seems a bit strange if $IP isn't globaled in the same cases [18:27:30] indeed [18:27:58] New patchset: Krinkle; "Fix various path inflexibilities and inconsistencies (attempt 2)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72562 [18:28:34] New review: Krinkle; "Causes the following in maintenance scripts:" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/72562 [18:28:54] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:33:58] !log reedy synchronized php-1.22wmf9/extensions/TimedMediaHandler/ [18:34:08] Logged the message, Master [18:34:14] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [18:35:00] !log reedy synchronized php-1.22wmf8/extensions/TimedMediaHandler/ [18:35:12] Logged the message, Master [18:35:14] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:42:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikibooks, wikimania and wikivoyage to 1.22wmf9 [18:42:53] Logged the message, Master [18:45:31] New patchset: Ottomata; "Using https for cluster_url_format" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72566 [18:46:42] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72566 [18:47:21] New review: Milimetric; "thanks Andrew!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72566 [18:49:32] New review: AzaToth; "looks fine now" [operations/dumps] (ariel) C: 1; - https://gerrit.wikimedia.org/r/72005 [18:50:54] apergos: you might want to add code like the following: http://paste.debian.net/15042/ [18:51:03] loking [18:51:05] +o [18:51:58] it's pretty straightforward [18:52:20] it's pretty much the same thing automake does [18:52:36] I'll check it out (prolly tomorrow) [18:52:52] if I do it will be a separate commit though :-D [18:52:55] heh [18:53:14] thanks for that [18:53:19] that code can be needed if you are not perverted enough to make clean all the time [18:53:21] (both the reviews and the code) [18:53:37] I am perverted enough but other folks may not be. yup [18:54:22] -include $(SRCS:%.c=.deps/%.Po) [18:54:29] that's such an evil construct [18:54:59] it reincludes after the fact the target has come into existence/been edited [18:55:04] eewww [18:55:13] I mean slick but eww [18:55:44] which is good in this case, because the Po files don't exist before you've compiled first [18:57:04] what I oughta do (somedayyyy) is convert this stuff to use automake & co. but no time for that whatsoever...
[18:57:09] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikinews and wikiquote to 1.22wmf9 [18:57:18] Logged the message, Master [19:01:40] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikisource and wiktionary to 1.22wmf9 [19:01:49] Logged the message, Master [19:08:40] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything non 'pedia to 1.22wmf9 [19:08:56] Logged the message, Master [19:10:11] New patchset: Reedy; "Everything non 'pedia to 1.22wmf9" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72610 [19:13:57] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72610 [19:21:42] New patchset: MaxSem; "Fix wikimediafoundation.org mobile URL template" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72613 [19:28:47] New patchset: Aude; "set normalizeItemByTitlePageNames Wikidata setting to true" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72615 [19:32:23] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/72005 [19:33:26] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72615 [19:34:32] !log reedy synchronized wmf-config/CommonSettings.php [19:34:41] Logged the message, Master [19:45:23] hashar: they've merged TAP now [19:45:28] New patchset: Ottomata; "Puppetizing analytics udp2log instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72618 [19:46:34] tap? [19:47:03] New patchset: Ottomata; "Puppetizing analytics udp2log instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72618 [19:47:10] apergos: https://review.openstack.org/#/c/34974/ [19:47:26] it's for debian packages and jenkins and puiparts [19:49:21] oh goos [19:49:22] d [19:53:52] PROBLEM - Disk space on analytics1002 is CRITICAL: DISK CRITICAL - free space: / 692 MB (3% inode=80%): [20:05:29] New review: Faidon; "(as agreed)" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/72618 [20:07:50] heayayyy YuviPanda, you around? [20:07:57] hey [20:07:58] yeah! [20:08:05] i have a pull request I'd like to sync to gerrit [20:08:06] ottomata: ^^ [20:08:09] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 20:08:03 UTC 2013 [20:08:13] oh thanks paravoid [20:08:14] hm [20:08:15] sweet, tell me which repo? [20:08:25] https://github.com/wikimedia/operations-puppet-cdh4/pull/4/files [20:08:31] ninebt [20:08:31] err [20:08:32] moment [20:08:49] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:08:49] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 20:08:41 UTC 2013 [20:09:29] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 20:09:23 UTC 2013 [20:09:39] alright, added it. let me sync manually for the first time [20:09:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [20:09:49] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:09:59] RECOVERY - Disk space on analytics1002 is OK: DISK OK [20:09:59] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 20:09:58 UTC 2013 [20:10:35] oo, can you tell me how this works? 
[20:10:39] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 20:10:29 UTC 2013 [20:10:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [20:10:41] YuviPanda: ? [20:10:47] once I've made it work! I shall! [20:10:49] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:10:53] k [20:10:59] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 20:10:52 UTC 2013 [20:11:06] so, I hope to actually get pull requests on a different repo, but this one is here for now [20:11:29] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 20:11:22 UTC 2013 [20:11:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [20:11:49] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:11:49] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 20:11:43 UTC 2013 [20:12:09] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 20:12:08 UTC 2013 [20:12:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [20:12:49] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:12:49] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 20:12:46 UTC 2013 [20:12:54] ottomata: so, pull requests to that repo in general will now be automatically sync'd, but that particular one is going to have trouble - since it can't seem to be applied cleanly. [20:12:57] i'm investigating [20:13:08] ottomata: but if you test by opening another pull request there, that should automatically open a gerrit patchset [20:13:09] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 20:13:00 UTC 2013 [20:13:19] hmm, ok [20:13:25] while you are at it, can you do this one? [20:13:31] https://github.com/wikimedia/puppet-cdh4 [20:13:37] that's the one I really want to maintain [20:13:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [20:13:40] sure [20:13:49] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:08] yeah, done [20:14:13] you can test by sending them pull requests now [20:16:39] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Mon Jul 8 20:16:29 UTC 2013 [20:16:49] PROBLEM - Puppet freshness on cp3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:17:07] so, YuviPanda, that one pull request just has trouble? [20:17:21] yeah [20:17:39] because for some reason it doesn't apply cleanly [20:17:57] and needs a rebase. now in general this isn't a problem, but in this particular case seems to be [20:20:21] ottomata and re: how it works - so I just added a hook to that repo, so whenever there's a new pull request (or more commits on an existing one), it hits an URL [20:20:26] that URL is running on toollabs [20:20:40] and that starts a small job on toollabs that fetches the patch from github, applies it, and does git review [20:20:55] where's the hook, can I see it? [20:21:04] the code you mean? [20:21:16] uhh, how'd you add it? 
[20:21:28] ah, so I'm admin :P [20:21:29] for wikimedia/* [20:21:34] so I can access 'settings', and under settings you can see the hooks [20:21:40] i've a script that does it for me [20:21:45] i think I am too [20:22:11] ah maybe not [20:22:38] ottomata: you can see that it errored out in the comments of https://github.com/wikimedia/operations-puppet-cdh4/pull/4 [20:22:44] (it mentioned me so I know when something is wrong) [20:23:22] ottomata: https://github.com/yuvipanda/SuchABot/blob/master/suchabot/hooks.py is the script i use to add the hook [20:23:27] other scripts running are on that repo too [20:23:57] ottomata: and the android app is using this bot exclusively (see https://github.com/wikimedia/apps-android-commons/pulls?direction=desc&page=1&sort=created&state=closed) [20:24:03] New review: Ottomata; "Good question. Currently, how does it work, the github repository is only created when it is create..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72130 [20:24:45] hm ok cool [20:24:55] thanks YuviPanda. I'll probably have more repos to add in a week or two [20:24:58] :) [20:25:17] Actually, I'd like to delete operations-puppet-cdh4, but I need to talk to ^demon to see what will happen if I do that [20:25:24] ottomata: does there exist a puppet/cdh4 in gerrit? that is required for this to work (currently). [20:25:31] no [20:25:49] operations/puppet/cdh4 is (currently) both replicated to operations-puppet-cdh4 and puppet-cdh4 [20:25:50] ah, so where should the gerrit ones go to? [20:25:55] oh, that is possible? [20:25:59] it is now! [20:26:08] https://bugzilla.wikimedia.org/show_bug.cgi?id=47274 [20:26:10] okay, then i'll need to add a little bit of custom mapping code then [20:26:16] currently only operations-puppet-cdh4 will work [20:27:05] ottomata: but i'll add that mapping in about 2-3 hours [20:27:30] nice, thank you! [20:27:35] :) [20:27:38] yw! [20:29:59] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 8 20:29:49 UTC 2013 [20:30:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 10 hours [20:44:50] New patchset: Jgreen; "remove the mysql module stuff from fundraising role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72630 [20:45:59] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72630 [20:55:02] gwicke: hey, I'm combing through rt tickets. can this be closed? https://rt.wikimedia.org/Ticket/Display.html?id=5396 [20:58:12] notpeter: awesome @ search for those latest wikis.
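To make the flow YuviPanda describes above concrete: GitHub fires a webhook at a URL on Tool Labs whenever a pull request is opened or updated, and a small job then fetches the patch from GitHub, applies it on top of the Gerrit repository, and runs git-review. The sketch below is not the actual SuchABot code (that lives in the hooks.py repository linked above); the route name, clone path, and branch naming are invented for illustration, and error handling and conflict reporting are omitted:

    # Rough sketch of the GitHub -> Gerrit sync described above. NOT the real
    # SuchABot implementation; paths, the route, and branch names are made up.
    import subprocess
    import requests
    from flask import Flask, request

    app = Flask(__name__)
    CLONE_DIR = "/data/project/suchabot/clones"  # hypothetical working area

    @app.route("/github-hook", methods=["POST"])
    def github_hook():
        payload = request.get_json()
        if not payload or "pull_request" not in payload:
            return "ignored", 200
        pr = payload["pull_request"]
        repo = payload["repository"]["name"]  # e.g. "puppet-cdh4"
        workdir = f"{CLONE_DIR}/{repo}"
        patch = requests.get(pr["patch_url"], timeout=30).text

        # Apply the pull request as patches on top of current master and push
        # it into Gerrit for review; a real bot also handles rebases/conflicts
        # and reports failures back as a comment on the pull request.
        subprocess.run(["git", "-C", workdir, "fetch", "origin"], check=True)
        subprocess.run(["git", "-C", workdir, "checkout", "-B",
                        f"github/pr/{pr['number']}", "origin/master"], check=True)
        subprocess.run(["git", "-C", workdir, "am"],
                       input=patch.encode(), check=True)
        subprocess.run(["git", "-C", workdir, "review", "-R"], check=True)
        return "ok", 200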
[21:24:53] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [21:24:53] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:24:53] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:24:53] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [21:24:53] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [21:24:54] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [21:24:54] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [21:24:55] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [21:35:32] !log updating blog [21:35:42] Logged the message, Master [21:36:12] hrmm [21:36:22] mutante: So when i clone the repo on my local [21:36:25] it shows tag 3.5.2 [21:36:39] but when i update the repo on holmium, tag list doesn't update to show 3.5.2 [21:36:41] odd. [21:37:03] what does "git branch" show on both compared? [21:38:11] well, it merged into head state [21:38:13] with no branch [21:38:21] then i 0b in to new 3.5.1 created branch [21:38:28] -b even [21:38:35] otherwise it's in unattached head state [21:38:37] which is nonideal [21:38:47] lemme walk over to your screen [21:44:18] PROBLEM - HTTP on holmium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 343 bytes in 0.006 second response time [21:54:18] RECOVERY - HTTP on holmium is OK: HTTP OK: HTTP/1.1 302 Found - 380 bytes in 0.076 second response time [22:11:50] paravoid: hallo [22:12:04] New patchset: Krinkle; "Enabling secure login (HTTPS), second attempt" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68937 [22:13:10] New review: Krinkle; "Second version: I17c902ae8d5e6845c938f7d6643b3d46" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52344 [22:13:16] New review: Krinkle; "Second version: I17c902ae8d5e6845c938f7d6643b3d46" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [22:22:23] New patchset: Kaldari; "Enabling Disambiguator extension on all WMF wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/72646 [22:26:54] New review: Hashar; "You can get the feature enabled on beta to test it out. It has SSL nowadays (although with a self s..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68937 [22:31:21] gwicke: you were given access to the parsoid boxes. can this ticket be closed? https://rt.wikimedia.org/Ticket/Display.html?id=5396 [22:32:47] notpeter: yes, I can now log in and see the stats [22:33:02] it would be awesome if I could also flush the caches [22:34:28] gwicke: from what ori/asher just told me, roan has a script on fenari that should let deployers do that?
[22:34:55] No, roots [22:35:03] fenari:/home/catrope/bin/purgeParsoid [22:35:06] cat it, it's very simple [22:35:13] Somewhat embarrassingly simple in fact [22:35:33] It just automates ssh'ing into the two machines as root, stopping the backend Varnish, removing its cache file, and starting it back up again [22:35:55] oh [22:36:02] oh [22:36:09] um [22:36:30] Yeah :( [22:36:35] notpeter: if you were going to make a script, use varnishadm ban.url same as a mobile flush [22:36:47] no start / stop / rm [22:37:07] I'm totally happy for this to be improved by people that actually know the first thing about Varnish :) [22:38:52] why don't you start by checking this into a repository [22:39:10] where is the current script? [22:39:42] omg, we should start a company, purgely, and contract to wmf for all of the purge scripts [22:40:24] varnishadm ban.url . [22:41:01] binasher: I've noticed that ban.url 'foo' doesn't work, I needed ban.url '.*foo.*' [22:41:32] RoanKattouw: isn't the point to flush everything, if it's being restarted and cache files deleted? [22:42:06] varnishadm ban.url . === flush everything [22:42:40] i even documented it, rare for me :) https://wikitech.wikimedia.org/wiki/MobileFrontend#Flushing_the_cache [22:43:01] yeah, ban keeps it in cache [22:43:02] Oh OK [22:44:01] really clearing the cache is better for us [22:44:57] why? what kind of concurrency is involved in sending requests to varnish post flush? [22:46:07] orders of magnitude greater than what m.wikipedia.org receives in actual traffic after the mobile team drops the bass? if not, i wouldn't worry too much. [22:46:20] !log DNS update - point download.mw away from empty dir on kaulen [22:46:30] Logged the message, Master [22:46:47] binasher: concurrency is practically nil [22:46:53] but cache size is large [22:46:54] !log restarting pdns on ns0 [22:47:03] Logged the message, Master [22:47:20] so we mainly want to make space rather than wait for LRU [22:47:26] gwicke: do you see cpu/io issues on varnish servers after a ban . ? [22:47:40] binasher: no, CPU is practically zero [22:47:53] I did not try ban anything since I have no rights to do so [22:48:24] request rates are ~15/second [22:48:55] i wouldn't worry. and i don't think the avg object size for parsoid is any larger than mobile pages [22:50:45] if LRU after ban works fine then that is great too [22:51:32] I read that there are some issues with LRU and req.hash_always_miss [22:51:57] if that works fine then ban would be great too [22:52:32] I mainly care about being able to purge at all, no matter which method [22:53:54] gwicke: how about next time you need to purge, you give the varnishadm method a shot, and if it doesn't work/doesn't suit your needs, we'll continue from there? [22:56:55] why do you want to make space? do we have a space issue ? [22:57:12] RECOVERY - Disk space on virt5 is OK: DISK OK [22:57:56] space issues: http://naomihall.com/uploads/2012/11/space_cat_omgwtfbbq_lolcat_cheezburger.jpg [22:59:49] notpeter: last time I checked I had no rights to call varnishadm [22:59:57] it was the first thing I tried.. [23:00:22] LeslieCarr: one version of all WP pages should take 400G [23:01:09] our cache is slightly larger than that, so having some outdated copies around is fine [23:01:47] but you haven't actually had an issue with the machine running out of space, have you ? [23:02:08] LeslieCarr: no, so far Roan has always cleared the cache [23:03:01] ok, i'm getting angry so i am going to step away from this conversation.
but please, use the correct method, and if that doesn't work, we can figure out what we can do (either with varnish or with hardware specs) to make the correct method work [23:03:15] notpeter: [23:03:17] varnishadm ban.url . [23:03:18] Cannot open "/etc/varnish/secret": Permission denied [23:04:01] do you have the right to restart varnish on that machine? [23:04:07] gwicke: proper functioning under lru pressure should be considered a requirement [23:04:07] nope [23:04:38] so, then it's expected :) [23:05:03] if ban.url . brings up any issues specific to lru functioning, it's good to expose and address them [23:05:06] and you can request a cache flush, just like mobile does [23:05:19] binasher: we normally don't want to rely on LRU at all, but our active purging is not configured to work with backends yet [23:06:22] binasher: https://www.varnish-cache.org/trac/wiki/VCLExampleEnableForceRefresh [23:06:38] The downside of this approach is that it will not free up the older objects until they expire, as of Varnish 3.0.2. This is considered a flaw and a fix is expected. [23:07:02] I'm not 100% sure if this means LRU or s-maxage expiry [23:07:28] gwicke: sounds like whichever comes first [23:07:38] I hope so, yes [23:07:43] in which case ban would be fine [23:17:58] New review: Dzahn; "this is likely in your spam folder (it was for me), but it works" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562 [23:22:17] New patchset: Pyoungmeister; "proposal for allowing gabriel sudo access for varnishadm for parsoid caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72653 [23:24:45] gwicke: I submitted that ^ [23:24:56] notpeter: thanks! [23:24:57] sudo rule to allow you to use varnishadm [23:25:13] added faidon, mark, asher as reviewers [23:25:25] notpeter: gwicke and varnishstat ? [23:25:36] mutante: ?? [23:25:40] mutante: Roan gave me shell access, so that works now [23:25:41] notpeter: i wonder if you should just let wikidev hit that shit [23:26:15] notpeter: afair the request was also for varnishstat [23:26:47] mutante: https://gerrit.wikimedia.org/r/71535 [23:27:11] that gave me shell access as gwicke, which lets me run varnishstat [23:27:19] aha, got it. cool [23:27:35] New review: Helder.wiki; "[off topic] See also bug 50986." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/72646 [23:30:51] New patchset: Dzahn; "add ServerAlias download.mw so that it doesn't show an ugly error page until this new download host has been setup on an eqiad host" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/72656 [23:31:42] New patchset: Dzahn; "move all misc::bugzilla::* to role/bugzilla.pp, less includes in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72543 [23:36:26] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/72656 [23:37:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72543 [23:40:54] New review: Dzahn; "http://download.mediawiki.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/72656 [23:42:18] !log catrope Started syncing Wikimedia installation... : Updating VE and AbuseFilter [23:42:28] Logged the message, Master [23:52:24] !log catrope Finished syncing Wikimedia installation... 
: Updating VE and AbuseFilter [23:52:33] Logged the message, Master [23:53:45] jdlrobson: hi, what's the status of https://rt.wikimedia.org/Ticket/Display.html?id=5267 [23:53:56] is this something that you'll need today? [23:54:00] or can that ticket be closed? [23:54:57] !log graceful'ing Apaches [23:55:07] Logged the message, Master
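To make the varnishadm approach from the purge discussion earlier in the log concrete: instead of stopping the backend Varnish and deleting its cache file, a flush can be issued as a ban, as suggested there ("varnishadm ban.url ."). Below is a minimal sketch of such a wrapper, with invented host names and assuming the kind of sudo rule for varnishadm proposed in r72653; note that banned objects are evicted lazily (on lookup or under LRU pressure), which is what the LRU concerns above were about:

    # Sketch of a ban-based flush for the Parsoid caches, along the lines of
    # "varnishadm ban.url ." suggested above, instead of stop/rm/start.
    # Host names, the secret path, and the admin port are assumptions.
    import subprocess

    PARSOID_CACHES = ["parsoid-cache1.example.wmnet",
                      "parsoid-cache2.example.wmnet"]  # hypothetical hosts

    def ban_all(host, pattern="."):
        # ban.url marks every object whose URL matches the regex as invalid;
        # stale objects are then dropped on lookup or under LRU pressure
        # rather than being deleted from disk up front.
        subprocess.run(
            ["ssh", host, "sudo", "varnishadm",
             "-S", "/etc/varnish/secret", "-T", "localhost:6082",
             "ban.url", pattern],
            check=True,
        )

    if __name__ == "__main__":
        for host in PARSOID_CACHES:
            ban_all(host)
            print("banned all objects on", host)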