[00:00:21] RECOVERY - Puppet freshness on amssq33 is OK: puppet ran at Mon Feb 25 23:59:58 UTC 2013 [00:02:45] RECOVERY - Puppet freshness on amssq44 is OK: puppet ran at Tue Feb 26 00:02:28 UTC 2013 [00:03:22] !log amssq33,amssq44 - fix more puppet runs [00:03:23] Logged the message, Master [00:05:45] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:06:41] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 10 hours [00:11:18] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 141 for key PRIMARY on query. Default dat [00:12:41] RECOVERY - Puppet freshness on mc1001 is OK: puppet ran at Tue Feb 26 00:12:38 UTC 2013 [00:12:48] RECOVERY - Puppet freshness on mc1001 is OK: puppet ran at Tue Feb 26 00:12:33 UTC 2013 [00:12:48] test123 [00:12:57] i started the plague [00:13:12] icinga-wm, O RLY? [00:13:23] yes really [00:13:37] killing all humans is part of my master plan [00:13:59] i am tempted to kick icinga-wm for being evil. [00:14:15] but then whoever is working on it may stop. [00:14:16] cyborgs will be spared in the robot cleansing [00:14:18] Change abandoned: Pyoungmeister; "there is a better way...." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [00:14:19] ok then [00:14:24] binasher: Any idea what's up with db64? [00:14:26] icinga-wm: carry on. [00:14:34] | 186733992 | system user | | heartbeat | Connect | 185590 | flush and sync binlog : fsync | REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_mas | [00:14:36] icinga-wm, while you're locked in that box, the only bad thing you can do is to flood our IRC clients;) [00:14:48] nah, its singularity on our cluster [00:14:51] we are all screwed. [00:15:21] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 181 seconds [00:15:34] "Execution of '/usr/bin/apt-cache policy .. " puppet issues .. hrmm [00:15:46] Reedy: what are you doing with db64? 
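The db35 alerts above flag both the slave thread state (Slave_SQL_Running: No with a duplicate-key error) and replication delay, and the process-list entry shows a pt-heartbeat style REPLACE INTO heartbeat.heartbeat. A minimal shell sketch of the kind of manual check behind those alerts, assuming client credentials come from a defaults file and that the heartbeat table stores UTC timestamps (host name and query details are illustrative, not the production Nagios plugin):

    # thread state, last error and native lag estimate
    mysql -h db35 -e "SHOW SLAVE STATUS\G" | egrep 'Slave_(IO|SQL)_Running|Last_Error|Seconds_Behind_Master'
    # heartbeat-based lag: newest heartbeat timestamp versus the current time
    mysql -h db35 -e "SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds FROM heartbeat.heartbeat"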
[00:16:04] Nothing, I just noticed it was unhappy on the dbtree [00:16:11] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 190 seconds [00:16:21] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 192 seconds [00:16:24] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 191 seconds [00:17:01] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 201 seconds [00:17:11] RECOVERY - Puppet freshness on mw1029 is OK: puppet ran at Tue Feb 26 00:17:10 UTC 2013 [00:17:18] RECOVERY - Puppet freshness on mw1029 is OK: puppet ran at Tue Feb 26 00:17:06 UTC 2013 [00:17:22] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [00:17:37] Reedy: oh, replication was broken over the weekend but i fixed it around an hour ago [00:17:44] it'll take some time to catch up though [00:17:44] Aha [00:18:00] That's fine then, I know it's ok to ignore [00:18:04] Cheers [00:18:41] RECOVERY - Puppet freshness on mw1059 is OK: puppet ran at Tue Feb 26 00:18:34 UTC 2013 [00:18:49] RECOVERY - Puppet freshness on mw1059 is OK: puppet ran at Tue Feb 26 00:18:33 UTC 2013 [00:19:31] RECOVERY - Puppet freshness on mw1070 is OK: puppet ran at Tue Feb 26 00:19:30 UTC 2013 [00:19:51] RECOVERY - Puppet freshness on mw1070 is OK: puppet ran at Tue Feb 26 00:19:26 UTC 2013 [00:20:31] RECOVERY - Puppet freshness on mw1157 is OK: puppet ran at Tue Feb 26 00:20:21 UTC 2013 [00:20:33] New patchset: Pyoungmeister; "setting squid service to reload when notified" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50847 [00:20:38] !log mc1001,ms1004,mw1029,mw1059,mw1070,mw1157 - fixed puppet freshness [00:20:40] Logged the message, Master [00:20:45] RECOVERY - Puppet freshness on mw1157 is OK: puppet ran at Tue Feb 26 00:20:20 UTC 2013 [00:21:02] mark: more like that ^^ [00:21:22] New patchset: Lcarr; "fixing solr plugin location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50848 [00:22:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50847 [00:22:10] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50848 [00:25:02] New patchset: Jgreen; "login def and key for Sahar Massachi (fundraising)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50849 [00:25:08] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 8 seconds [00:25:18] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 4 seconds [00:25:24] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 3 seconds [00:26:00] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 0 seconds [00:26:27] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50849 [00:30:06] New patchset: Lcarr; "fixing location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50851 [00:31:12] New review: Jgreen; "needs hearts? 
❤❤❤❤" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50778 [00:31:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50851 [00:33:33] New patchset: Jgreen; "aluminium/grosley: add user saher, remove pcoombe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50854 [00:34:06] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [00:34:06] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [00:34:27] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50854 [00:35:06] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [00:35:07] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [00:36:20] New patchset: Dzahn; "improving tone of nagios alerts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50778 [00:37:42] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [00:37:42] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [00:37:42] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [00:37:42] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [00:37:50] New review: Tim Starling; "Yes, that is an error, and another error is the assumption that we are using ImageMagick Q8. In fact..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/50149 [00:37:51] ..... [00:38:06] did something just tick over someplace or do we really have this many items not calling into puppet? [00:38:40] New patchset: Ryan Lane; "Redirect view links" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50856 [00:41:14] New patchset: Pyoungmeister; "improving tone of nagios alerts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50778 [00:41:34] New patchset: Ryan Lane; "Redirect view links" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50856 [00:41:45] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [00:41:57] The last Puppet run was at Sun Feb 24 08:29:33 UTC 2013 (2412 minutes ago). [00:42:04] ^ gallium [00:42:07] Tim-away: hey [00:42:38] mutante: so, my reasoning on my new patchset for that change is to save characters on sms's [00:42:41] does that sound reasonable? [00:42:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50856 [00:43:13] Reedy: script still running for files? [00:43:16] RobH: yes, we actually do. but it's reporting them all at once due to work on Icinga [00:43:29] Yeah [00:44:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50778 [00:44:07] AaronSchulz: it would be nice if it had some amount of progress notification [00:44:49] mutante: ok, so just duplicate notification, cool, thx. [00:45:21] notpeter: yea:) [00:46:05] RobH: we had like 10 more a little while ago :p [00:46:20] paravoid and I got the list down to half its size last week [00:46:26] then in the course of a weekend it doubled back up. [00:46:30] it happened again [00:46:33] not really [00:46:43] ? [00:46:44] are we talking about nagios? 
[00:46:49] puppetchecks [00:46:53] oh [00:46:54] puppet freshness [00:47:02] then yeah, haven't checked [00:47:14] i commented cuz new icinga instance notified on them all again [00:49:02] New patchset: Lcarr; "adding in extra checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50858 [00:49:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50858 [00:55:37] New patchset: Ottomata; "Initial commit of Kafka Module." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [01:01:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [01:03:39] New patchset: Lcarr; "more check commands!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50861 [01:04:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50861 [01:09:20] New review: Dzahn; "this is ok, doesn't break things, but also doesn't really fix it. you still get a 302 somehow" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [01:15:21] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [01:16:06] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [01:21:29] New patchset: Lcarr; "removing dupe definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50863 [01:22:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50863 [01:23:14] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 185 seconds [01:24:53] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 217 seconds [01:32:14] RECOVERY - MySQL disk space on neon is OK: DISK OK [01:32:59] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [01:33:28] New patchset: Lcarr; "inserting forgotten "}"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50866 [01:33:40] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15113 bytes in 0.001 second response time [01:34:10] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [01:35:21] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 393 seconds [01:35:23] RECOVERY - Puppet freshness on tola is OK: puppet ran at Tue Feb 26 01:35:08 UTC 2013 [01:35:59] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 198 seconds [01:36:10] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 203 seconds [01:36:10] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 406 seconds [01:36:20] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 206 seconds [01:37:02] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 197 seconds [01:37:20] PROBLEM - Host foundation-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:37:50] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:00] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:01] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:01] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:10] PROBLEM - Host 
wikiquote-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:10] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:11] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:20] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:20] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::8 [01:38:40] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.82 ms [01:38:41] RECOVERY - Host foundation-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.12 ms [01:38:42] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.97 ms [01:38:42] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.86 ms [01:38:42] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.96 ms [01:38:43] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.89 ms [01:38:43] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.00 ms [01:38:43] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.72 ms [01:38:50] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.94 ms [01:38:50] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.93 ms [01:40:10] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [01:40:20] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [01:40:38] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [01:41:24] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [01:45:05] Uhh [01:47:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50866 [01:47:25] yeah, there was some ipv6 connectivity issue [01:58:06] !log test [01:58:07] Logged the message, Master [01:59:01] !log test [01:59:03] Logged the message, Master [01:59:45] !log test [01:59:47] Logged the message, Master [01:59:48] Logged the message, Master [02:02:56] LeslieCarr: Would you have time to do that LVS thing I've been talking about tomorrow (or any other day this week that's not Wednesday)? 
[02:03:02] i think so [02:04:04] OK good [02:04:11] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:04:16] I'll hit you up about it tomorrow [02:06:02] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [02:06:11] !log test [02:06:14] Logged the message, Master [02:06:36] !log testing twitter [02:06:38] Logged the message, Master [02:06:53] That works :) [02:07:00] it's not working [02:07:06] oh [02:07:07] it is [02:07:08] :) [02:07:11] sweet [02:08:32] I think that was the last thing needed before migrating wikitech [02:10:51] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds [02:11:04] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram [02:11:07] Logged the message, Master [02:11:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 193 seconds [02:11:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 194 seconds [02:12:08] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 204 seconds [02:28:06] !log LocalisationUpdate completed (1.21wmf10) at Tue Feb 26 02:28:06 UTC 2013 [02:28:09] Logged the message, Master [02:29:03] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram [02:29:04] Logged the message, Master [02:35:00] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [02:35:40] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:36:07] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:36:34] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [02:37:03] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram [02:37:05] Logged the message, Master [02:38:40] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours [02:39:43] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:40:26] New patchset: Brion VIBBER; "Add Vimpelcom Beeline to Zero carriers list (testing range)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47520 [02:41:28] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram [02:41:30] Logged the message, Master [02:44:51] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [02:44:53] Logged the message, Master [02:46:09] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [02:46:12] Logged the message, Master [02:47:45] New review: RobH; "brion said to ;]" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47520 [02:47:57] \o/ [02:47:59] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47520 [02:48:40] New review: RobH; "Ok, how about a serious comment since most of ops isn't about. This looks sane and is in the same f..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47520 [02:49:37] brion: do we need to clear any cache or anything crazy for that? [02:49:47] or just simply let it sit and when puppet runs on varnish it'll take effect? [02:49:54] i… don't know :D [02:49:58] shouldn't need anything cleared [02:50:09] do we know how long it'll take for puppet to update? 
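A change like this only takes effect once the agent next applies the catalog on the affected hosts, so the alternative to waiting out the run interval is to kick a run by hand. A minimal sketch, using the puppet 2.x command spelling current at the time (the host name is a placeholder, and this is not a claim about how runs are actually fanned out across the cluster):

    # force an immediate catalog run on one cache host instead of waiting
    ssh some-varnish-host 'sudo puppetd --test'
    # newer agents spell the same thing: sudo puppet agent --test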
[02:50:15] it'd be awesome if this went live in the next hour [02:50:18] if not, we'll live [02:50:22] heh, it runs every 30 [02:50:26] days [02:50:29] haha [02:50:30] so if it can wait for that, we'll be ok [02:50:33] hah [02:50:36] 30 mins is fine :) [02:50:37] otherwise i have to ddsh a puppet run [02:50:48] You meant the run /takes/ 30 mins? [02:50:48] cool, its merged on sockpuppet, so live [02:50:52] * RoanKattouw stops trolling puppet [02:50:57] RoanKattouw: if its on spence, a run takes a fwe hours ;] [02:50:59] dash ? aren't you guys using salt nowadays? [02:51:05] no, ryan is using salt. [02:51:13] i have no fucking idea how to use it [02:51:13] we have no docs yet. [02:51:13] Only for git-deploy [02:51:13] ahh [02:51:15] easy [02:51:21] No one has any idea how to use it other than through the git-deploy wrapper [02:51:24] i asked ryan, who said 'well, for now its my beta so i guess you can just tell me to do things' [02:51:28] echo "your command there" |  mail ryan@wikimedia.org --subject 'please salt this' [02:51:30] I know about half the things I need to do to set up new machines [02:51:33] hehe [02:51:43] yea the new machine setup thing is super annoying. [02:51:53] if it stays this way for longer than another week im gonna have to learn how to do it all. [02:51:58] cuz i hate having to ask someone to do it for me. [02:52:02] !log LocalisationUpdate completed (1.21wmf9) at Tue Feb 26 02:52:01 UTC 2013 [02:52:04] Logged the message, Master [02:52:20] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:52:33] * RobH only sets up an average of a dozen or so new machines a week [02:52:52] Well so the thing is [02:52:55] I can bring up salt on a new box [02:53:01] Once Ryan accepts its key on the master [02:53:06] And he hasn't shown me how to do the latter [02:53:17] i think this should be on ops agenda this week [02:53:21] Yes [02:53:21] i'll ask ryan about it tomorrow morning. [02:53:28] Also, regarding Parsoid [02:53:34] or you can get in your bashrc something like: alias salt='ddsh -m all "$@"' [02:54:00] we should probably ask ryan to give a presentation about salt :-] [02:54:07] Leslie and I are probably going to do the Varnish LVS thing tomorrow, once that's done you can start replacing those if you want [02:54:13] and by ask tomorrow [02:54:16] i mean shoot an email to him now [02:54:32] RobH: What's going on with wtp1001 now? [02:54:37] You were gonna replace it? [02:54:49] Well, as soon as you can take the existing wtp1001 offline [02:54:59] I can wipe its existence off the map and move the name to new hardware [02:55:06] It's good to go [02:55:11] oh, its not pooled? [02:55:16] Yeah it's depooled [02:55:18] * RoanKattouw finds URL [02:55:32] http://noc.wikimedia.org/pybal/eqiad/parsoid [02:55:39] dinner time :-) [02:55:47] I didn't bother stopping the services on it because puppet will just restart them [02:55:56] RoanKattouw: Ok, I am going to add it to decomissioning.pp so it pulls out of nagios. [02:55:58] But it's not getting any traffic [02:56:00] OK [02:56:17] once nagios stops monitoring it, I'll kill the box and redo dns for wtp1001 [02:56:29] BTW, since you're ops [02:56:37] Who do we bug about databases with off-the-charts replag? Asher? [02:56:53] OK good [02:56:55] I would say Peter/Asher [02:57:51] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [02:59:02] New patchset: Aaron Schulz; "Added a second pipeline for more immediate jobs." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/50877 [02:59:07] OK it looks like the borked slave is in Tampa [02:59:11] So it only impacts testwiki [02:59:40] New review: Aaron Schulz; "Tested the loop script on mw1010" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50877 [03:02:15] New patchset: RobH; "wtp1001 migration hosts, so decom to remove from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50878 [03:03:05] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [03:03:06] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50878 [03:03:07] Logged the message, Master [03:03:55] RoanKattouw: coolness, i'll get the new one setup prolly tomorrow [03:04:10] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [03:04:12] Logged the message, Master [03:04:14] (i realize this week you may or may not have time to mess with it with all the other folks in town ;) [03:06:26] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [03:06:27] Logged the message, Master [03:07:01] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:07:01] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [03:07:20] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [03:07:20] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:10:44] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [03:10:45] Logged the message, Master [03:17:14] New review: Jeremyb; "It's one less redirect now than before (302 -> 200 instead of 302 -> 302 -> 200)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [03:17:40] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [04:06:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:07:55] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:07:48 UTC 2013 [04:08:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:08:15] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:08:12 UTC 2013 [04:08:35] RECOVERY - Puppet freshness on tola is OK: puppet ran at Tue Feb 26 04:08:30 UTC 2013 [04:09:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:09:25] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:09:18 UTC 2013 [04:10:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:18:25] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 184 seconds [04:18:46] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds [04:18:47] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:18:39 UTC 2013 [04:19:02] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 192 seconds [04:19:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:19:29] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 202 seconds [04:45:25] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:45:15 UTC 2013 [04:46:15] PROBLEM - Puppet 
freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:52:59] New patchset: Pyoungmeister; "switching from unicode heart to asci heart for sms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50881 [04:54:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50881 [04:56:40] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [04:57:40] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.000 second response time on port 8123 [05:09:53] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [05:10:00] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [05:10:11] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [05:10:30] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [05:34:28] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15113 bytes in 0.003 second response time [05:44:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [05:44:50] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds [05:45:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 194 seconds [05:45:28] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds [05:46:49] PROBLEM - SSH on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:46:58] PROBLEM - Varnish HTTP bits on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:47:48] RECOVERY - SSH on strontium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:48:48] RECOVERY - Varnish HTTP bits on strontium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [05:59:13] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [06:02:13] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [06:02:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [06:02:41] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [06:02:50] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [06:03:53] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 0 seconds [06:04:11] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 0 seconds [06:08:53] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours [06:21:56] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:21:59] Logged the message, Master [06:26:38] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:26:40] Logged the message, Master [06:28:36] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:28:38] Logged the message, Master [06:29:23] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 06:29:21 UTC 2013 [06:29:52] Reedy has switched time zones on me. [06:30:13] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [06:34:09] Susan: more than a few people did that [06:34:35] e.g. 
sumanah and ottomata [06:35:34] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [06:36:04] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [06:36:53] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [06:37:20] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [06:40:54] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:40:56] Logged the message, Master [06:40:56] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [06:41:59] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [06:43:37] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:43:38] Logged the message, Master [06:53:32] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/ [06:53:33] Logged the message, Master [07:05:34] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 187 seconds [07:06:04] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 197 seconds [07:06:16] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 199 seconds [07:06:34] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 204 seconds [07:06:35] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (23619), Total (30206) [07:07:37] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (17917), Total (24751) [07:09:40] RECOVERY - MySQL disk space on neon is OK: DISK OK [07:10:00] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [07:10:12] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [07:10:19] RECOVERY - MySQL disk space on neon is OK: DISK OK [07:10:46] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [07:18:30] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [07:18:52] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [07:19:00] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:19:19] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:22:19] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [07:23:22] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [07:39:40] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:41:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:41:30] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [07:43:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.475 second response time [07:50:31] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , svwiki (13156), Total (24951) [07:51:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds [07:51:26] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 196 seconds [07:51:27] PROBLEM - 
MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 200 seconds [07:51:34] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 200 seconds [08:00:37] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 18 seconds [08:01:01] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 0 seconds [08:01:07] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [08:01:17] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [08:01:27] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [08:01:46] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [08:02:13] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [08:02:24] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [08:05:22] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [08:16:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:18:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:42] New patchset: ArielGlenn; "larger root partition size, correct recipe name for snapshots" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50890 [08:26:40] Which is the tech channel that normally has status in the topic? [08:27:37] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [08:27:46] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50890 [08:27:50] wikimedia-tech [08:28:15] apergos: /a -> /local ? [08:28:23] yes [08:28:26] my preference would be /srv/snapshots [08:28:27] we don't use /a over there [08:28:30] or something under /srv [08:28:35] we won't use /local for anything either [08:28:38] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [08:28:43] this is just to keep peopel from trying to put stuff in it [08:28:56] ? [08:29:01] I might use it for a temp job and then throw stuff away [08:29:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.324 second response time [08:29:11] once in a great while. [08:29:38] still [08:29:40] apergos, thanks. Is the Wikitech and Labs merger in progress? [08:29:57] superm401: I don't know about the merger, ryan or paravoid would likely know [08:30:14] not me [08:30:18] ryan is working on it [08:30:47] ah while you are there paravoid [08:31:02] what is the status of things re ms-be11 and 12 [08:31:10] are you asking me? [08:31:15] can we put the new 12 in? can I move forward on pulling other hosts out? [08:31:22] yes I'm asking you [08:31:35] how would I know? [08:31:44] you asked me to pull those two hosts out [08:32:01] ooh [08:32:09] completely forgot about that :-) [08:32:17] yes, about the juniper upgrade [08:32:20] ah well good I asked then :-) [08:32:24] I'll tell mark or leslie tomorrow [08:32:28] ok great [08:32:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [08:33:03] uh a possibly ignorant question but [08:33:14] how exactly is this going to work? I mean, [08:33:22] swift is all in tampa afaik [08:33:31] and we'll lose tampa for that time right? 
[08:33:44] no [08:33:52] don't remember the details [08:33:57] some row was it [08:34:49] ah that's it [08:35:05] ok then, just keep me in the loop :-) [08:35:13] yeah [08:35:23] I think so far you're the only one in the loop :P [08:35:42] since I completely forgot about that [08:36:28] :-D [08:51:33] New patchset: ArielGlenn; "snapshot recipe use the standard /srv for extra partion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50893 [08:52:07] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50893 [08:55:36] :D [08:59:58] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [09:06:58] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [09:12:43] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Tue Feb 26 09:12:17 UTC 2013 [09:19:10] RECOVERY - Puppet freshness on db1024 is OK: puppet ran at Tue Feb 26 09:18:48 UTC 2013 [09:21:15] RECOVERY - Puppet freshness on nescio is OK: puppet ran at Tue Feb 26 09:20:59 UTC 2013 [09:28:49] New patchset: ArielGlenn; "make sure /srv exists and is dir, not symlink for new snapshots" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50894 [09:29:30] RECOVERY - Puppet freshness on db35 is OK: puppet ran at Tue Feb 26 09:29:14 UTC 2013 [09:35:08] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15113 bytes in 0.004 second response time [09:49:27] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (75032), skwiki (14443), Total (160035) [09:50:21] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (76682), skwiki (14477), Total (157812) [10:10:05] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [10:12:24] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [10:22:17] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 181 seconds [10:22:45] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 186 seconds [10:22:48] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 188 seconds [10:23:48] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 204 seconds [10:25:00] RECOVERY - LDAP on sanger is OK: TCP OK - 0.003 second response time on port 389 [10:25:07] RECOVERY - LDAP on sanger is OK: TCP OK - 0.027 second response time on port 389 [10:26:17] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:26:47] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:27:07] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:28:28] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:33:47] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 12 seconds [10:34:07] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [10:34:07] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [10:34:17] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 8 seconds [10:34:28] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 4 seconds [10:34:55] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds 
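The recurring "Puppet freshness" alerts in this log boil down to one question: has the agent completed a run within the last 10 hours? A hypothetical stand-alone probe in that spirit, reading the puppet 2.x default state file; the production check is driven through Nagios/Icinga and is not shown here:

    # age of the last completed agent run, from the state file's mtime
    last=$(stat -c %Y /var/lib/puppet/state/last_run_summary.yaml)
    age=$(( $(date +%s) - last ))
    if [ "$age" -gt $(( 10 * 3600 )) ]; then
        echo "CRITICAL: Puppet has not run in the last 10 hours"
        exit 2
    fi
    echo "OK: puppet ran $(( age / 60 )) minutes ago"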
[10:35:07] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [10:35:08] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [10:42:07] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 1.740 second response time [10:42:37] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:42:43] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:43:10] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [11:15:16] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [11:16:10] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [11:18:07] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [11:19:37] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [11:20:44] !log mlitn synchronized php-1.21wmf10/extensions/ArticleFeedback/api/ApiArticleFeedback.php 'Update ArticleFeedback to master' [11:20:47] Logged the message, Master [11:46:28] New review: Silke Meyer; "I think the definition of mw-extension is missing (it was in the old mediawiki.pp file). Doesn't wor..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50451 [12:01:27] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [12:08:07] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 12:07:59 UTC 2013 [12:08:18] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Tue Feb 26 12:08:09 UTC 2013 [12:08:27] RECOVERY - Puppet freshness on db1024 is OK: puppet ran at Tue Feb 26 12:08:18 UTC 2013 [12:08:27] RECOVERY - Puppet freshness on nescio is OK: puppet ran at Tue Feb 26 12:08:19 UTC 2013 [12:08:27] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [12:08:37] RECOVERY - Puppet freshness on db35 is OK: puppet ran at Tue Feb 26 12:08:27 UTC 2013 [12:08:37] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 12:08:28 UTC 2013 [12:09:27] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [12:09:37] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 12:09:32 UTC 2013 [12:10:27] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [12:26:25] !log temp blacklist live hack on sodium for exim4 to prevent some spam, see /etc/exim4/blacklist for that [12:26:26] Logged the message, Master [12:35:28] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 12:35:23 UTC 2013 [12:40:02] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours [12:41:05] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [12:52:45] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [12:58:45] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:00:43] Reedy: is the script done or still moving transcoded files? 
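The "temp blacklist live hack" logged above for sodium is only identified by its file path, /etc/exim4/blacklist. A minimal sketch of how a file-based sender blacklist is commonly wired into Exim; the ACL condition shown is an assumption, since the actual hack is not in the log:

    # one blocked sender address or pattern per line
    echo 'spammer@example.com' >> /etc/exim4/blacklist
    # assumed ACL wiring, e.g. in the RCPT ACL:  deny senders = /etc/exim4/blacklist
    # a fake SMTP session lets you watch the ACL fire without sending real mail
    exim4 -bh 192.0.2.25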
[13:05:48] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [13:07:08] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [13:07:18] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [13:08:32] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [13:09:17] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [13:14:23] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (29804), Total (30320) [13:14:41] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (27735), Total (28293) [13:19:02] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:30:08] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [13:30:26] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [13:35:18] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15112 bytes in 0.003 second response time [13:39:17] RECOVERY - MySQL disk space on neon is OK: DISK OK [13:39:18] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [13:40:05] RECOVERY - MySQL disk space on neon is OK: DISK OK [13:40:11] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [14:08:05] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [14:08:06] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours [14:10:35] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours [14:10:35] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [14:13:29] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , cawiki (13628), enwiki (676256), jawiki (36100), plwiki (34316), Total (765021) [14:13:47] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , cawiki (13079), enwiki (675613), jawiki (34965), plwiki (33480), Total (761557) [14:16:16] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:17:59] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:17:59] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:35] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:44] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [14:25:16] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds [14:25:25] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds [14:26:05] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 202 seconds [14:29:55] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:45] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [14:32:14] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 24 seconds [14:32:16] RECOVERY - MySQL Replication Heartbeat 
on db53 is OK: OK replication delay 0 seconds [14:32:17] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:33:17] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:34:11] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:34:25] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:35:59] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.372 seconds [14:38:05] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.008 second response time [14:42:35] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:43:25] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [14:55:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.095 second response time [15:05:29] !log maxsem synchronized php-1.21wmf10/extensions/GeoData/solrupdate.php [15:05:31] Logged the message, Master [15:06:14] RECOVERY - LDAP on sanger is OK: TCP OK - 0.027 second response time on port 389 [15:06:55] !log maxsem synchronized php-1.21wmf9/extensions/GeoData/solrupdate.php [15:06:57] Logged the message, Master [15:07:15] Ohai. Where exactly is the labs' puppet config? In operations/puppet? [15:07:29] RECOVERY - LDAP on sanger is OK: TCP OK - 0.002 second response time on port 389 [15:08:25] Coren, depends on what you want. the virt cluster is there, as well beta [15:09:43] MaxSem: The manifests used to configure the instances. [15:09:58] yes [15:10:17] MaxSem: kk, thanks. [15:11:14] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [15:12:41] MaxSem: Oh, while you're there. :-) Do we have a naming convention for project-specific classes? [15:12:53] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [15:13:21] have you tried asking in #-labs? [15:13:33] they should know better than me [15:14:01] !log suspended i-000001d7 and i-0000039e for sending out malicious emails [15:14:02] Logged the message, Master [15:14:52] MaxSem: Wait, you don't know everything? :-P [15:15:14] yes, I'm incompetent [15:16:58] I'm not sure it's actually a binary choice. 
:-) [15:29:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [16:00:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:04:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:01] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 16:07:53 UTC 2013 [16:08:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:08:21] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 16:08:19 UTC 2013 [16:09:20] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours [16:09:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:09:40] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 16:09:30 UTC 2013 [16:10:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:11:06] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 16:10:55 UTC 2013 [16:11:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:13:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.292 second response time [16:19:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [16:21:31] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:50] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:23] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:32] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:20] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [16:25:30] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 189 seconds [16:25:56] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds [16:26:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 219 seconds [16:33:21] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:34:11] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.053 seconds [16:34:20] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:35:33] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 194 seconds [16:35:33] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.001 second response time [16:37:33] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 7 seconds [16:41:23] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [16:41:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [16:41:23] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [16:42:26] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [16:45:21] 
PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [16:45:51] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [16:47:06] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [16:47:32] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [17:02:48] New patchset: coren; "Add ssh_hba variable to turn on HBA for sshd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50913 [17:17:58] RECOVERY - MySQL disk space on neon is OK: DISK OK [17:17:59] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [17:18:18] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [17:18:19] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [17:18:26] RECOVERY - MySQL disk space on neon is OK: DISK OK [17:35:18] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15113 bytes in 0.003 second response time [17:37:23] Could it be there's something wrong with our e-mail? [17:37:38] I think I'm not getting all Gerrit e-mails, and on interactive actions, I'm getting 400 errors. [17:37:48] (not all the time, but happened to me 4 times today) [17:38:36] RobH (or another ops jock :P): ^^ [17:38:52] I am not allowed to do anythign not sprint related or i get in trouble [17:38:53] so sorry dude [17:38:59] i would email ops-requests@rt.wikimedia.org [17:39:59] siebrand: what do you mean interactive actions email issue? [17:40:04] (soudns like two different things) [17:40:28] RobH: When I +1/+2/-1 something in Gerrit, I've gotten errors from Gerrit about not being able to send e-mail. [17:40:45] RobH: I think I'm also missing e-mails on gerrit actions to mediawiki-commits. [17:41:18] RobH: The two are most likely related (at least in my lay man's thinking) [17:41:51] can you test if it's still an issue now, siebrand? [17:42:26] apergos: Can't reproduce. All I can do is watch #mediawiki to see if someone does something, and then see if I get mail. [17:42:41] ok (maybe ask her to do somethign as a test?) [17:42:45] apergos: Also hard to "just make changes". But can you confirm something was wrong before and shouldn't be any longer or something? [17:43:05] well we had limited email issues earlier, but I don't know if it would have affected you [17:43:25] that's why the best thing to do is chck now to see whether you still have the problem [17:43:54] wow, I read 'someone' as sumanah :-D [17:43:59] ugh [17:44:55] siebrand: Ok, it seems that we had an issue in labs [17:45:01] !log Copying math files into ceph [17:45:02] where there was some super spammy instances that are now frozen [17:45:03] Logged the message, Master [17:45:12] but its whats caused the masssssssiiiivvveee email slowdowns [17:45:49] RobH: Ah, so in theory there may be a pile in the queue? 
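Confirming a "pile in the queue" on the relay is a matter of asking Exim directly; a minimal sketch, to be run on the mail relay itself (which this part of the log does not name):

    exim4 -bpc          # just the number of messages currently queued
    exim4 -bp | tail    # spot-check a few entries (age, size, sender, recipients)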
[17:46:08] thats my understanding, but I just asked mark and got a quick reply [17:46:15] so i dont wanna misrepresent, but thats my understanding [17:46:53] (mark is looking into it now) [17:48:18] opendj is defunct again over on sanger [17:52:35] RECOVERY - LDAP on sanger is OK: TCP OK - 0.031 second response time on port 389 [17:52:38] RECOVERY - LDAP on sanger is OK: TCP OK - 0.002 second response time on port 389 [17:54:32] !log restarting opendj on sanger [17:54:34] Logged the message, Master [17:54:43] where is Ryan :) [18:00:05] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [18:00:19] right. [18:00:22] hah, paravoid ^ [18:00:29] sigh [18:00:42] opendj is a mess there [18:01:03] doesn't even have logs [18:01:17] I'll restart it until Ryan comes in [18:01:38] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [18:02:05] RECOVERY - LDAP on sanger is OK: TCP OK - 0.028 second response time on port 389 [18:03:57] RECOVERY - LDAP on sanger is OK: TCP OK - 0.010 second response time on port 389 [18:07:17] New patchset: MaxSem; "Fix replication checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50829 [18:24:55] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:45] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [18:30:47] mutante: seen the latest on rt 4378 ? [18:31:25] (aka fax machine) [18:31:54] jeremyb_: thanks, just saw it now, will handle it, but we are in meeting [18:32:04] mutante: right [18:34:22] hackathon [18:52:39] New patchset: Reedy; "Remove temp hack for bug 31187" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47519 [18:52:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47519 [18:54:12] New patchset: Reedy; "(bug 40759) Let Proofread Page setup namespaces for fi.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [18:54:30] New review: Reedy; "Don't the pages need moving as the namespace number is saved in the database?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [18:55:35] New patchset: Reedy; "Remove disabled ReaderFeedback" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47532 [18:55:43] New patchset: Reedy; "For bug 44570 - Make the parser cache expire at 30 days. Stops us having stale resource links" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47202 [18:56:56] New patchset: Jeremyb; "For bug 44570 - Make the parser cache expire at 30 days" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47202 [18:58:20] !log reedy synchronized wmf-config/CommonSettings.php [18:58:22] Logged the message, Master [19:01:15] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [19:02:33] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [19:13:00] !log freezing all exim4 bounces to restore normal email delivery [19:13:01] Logged the message, Master [19:20:15] RobH: cmjohnson1 do you know the brand of cabinet in equinix? [19:20:19] or have any closeup pics that might have it ? 
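The bounce freeze logged a few lines up ("freezing all exim4 bounces to restore normal email delivery") is typically done by selecting the null-sender messages in the queue and freezing them; the exact command used is not in the log, so this is a sketch of the usual approach:

    # bounce messages carry the empty envelope sender <>; select their IDs and freeze them
    exiqgrep -f '^<>$' -i | xargs exim4 -Mf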
[19:22:40] lesliecarr: i dont know the brand [19:24:10] * jeremyb_ hands cmjohnson1 a tab key [19:27:59] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50757 [19:32:33] PROBLEM - Host payments4 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:21] !log package updates & reboots of payments hosts [19:34:23] Logged the message, Master [19:35:35] PROBLEM - check_apache2 on payments4 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [19:35:43] RECOVERY - Host payments4 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [19:39:33] PROBLEM - Host payments1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:40:21] New review: Tpt; "Yes, of course, the pages need moving. See https://bugzilla.wikimedia.org/show_bug.cgi?id=40759#c25" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [19:40:32] RECOVERY - check_apache2 on payments4 is OK: PROCS OK: 7 processes with command name apache2 [19:44:15] New patchset: Ryan Lane; "Disable reactors for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50942 [19:44:42] RECOVERY - Host payments1001 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [19:45:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50942 [19:54:32] PROBLEM - Host payments1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:38] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [19:55:59] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [19:59:42] RECOVERY - Host payments1003 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [20:02:42] PROBLEM - Host payments1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:04:32] RECOVERY - Puppet freshness on db1039 is OK: puppet ran at Tue Feb 26 20:04:19 UTC 2013 [20:05:22] RECOVERY - Host payments1004 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:05:35] RECOVERY - Puppet freshness on ssl1002 is OK: puppet ran at Tue Feb 26 20:05:23 UTC 2013 [20:09:11] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [20:09:22] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:09:42] RECOVERY - Puppet freshness on ssl1002 is OK: puppet ran at Tue Feb 26 20:09:34 UTC 2013 [20:09:43] RECOVERY - Puppet freshness on db1039 is OK: puppet ran at Tue Feb 26 20:09:34 UTC 2013 [20:10:40] PROBLEM - NTP on mw1041 is CRITICAL: NTP CRITICAL: Offset unknown [20:11:32] PROBLEM - Apache HTTP on mw1041 is CRITICAL: Connection refused [20:12:02] PROBLEM - Apache HTTP on mw1041 is CRITICAL: Connection refused [20:12:28] RECOVERY - NTP on mw1041 is OK: NTP OK: Offset 0.00252366066 secs [20:12:55] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 186 seconds [20:13:13] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [20:13:14] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [20:13:34] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [20:13:44] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 210 seconds [20:15:07] New patchset: Lcarr; "fixing contint public ip ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50945 [20:15:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50945 [20:15:55] New patchset: RobH; "adding dan andreescu per rt4561" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/50948 [20:16:04] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (390929), Total (399119) [20:20:44] New patchset: Jeremyb; "adding dan andreescu per RT 4561" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50948 [20:21:54] !log DNS update - add wikiovyage.eu [20:21:55] Logged the message, Master [20:22:11] meh, wikivoyage.eu of course [20:22:49] ah [20:22:52] hah* [20:23:12] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50948 [20:23:26] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (372745), svwiki (15197), Total (395452) [20:23:34] mutante: social media's broken so you don't have typos propagated as far :P [20:23:50] New patchset: Jgreen; "change payments URL used in pybal/nagios checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50952 [20:24:16] jeremyb_: why/what did you do to my commit? [20:24:33] (just wondering why you even have a comment on it) [20:24:43] RobH: changed the commit msg so that it would linkify the RT ref [20:24:47] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50952 [20:25:18] i wasnt expecting another patchset, so surprised me, heh [20:25:28] heh [20:29:14] New patchset: Dzahn; "add wikivoyage.eu Apache redirect (RT-4378)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50954 [20:29:41] apergos: mark and leslie say they won't update junos at this point, so feel free to continue as you were with ms-be11/12 [20:33:14] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [20:33:24] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 187 seconds [20:33:27] ottomata: So did the ldap admins.pp conflict thing get resolved? [20:33:28] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 187 seconds [20:33:35] cuz im having issues adding a user to stat1, which i think was involved in that [20:33:38] but i am no longer certain [20:34:07] oh god damn it, nevermind. [20:34:25] i missed another unrelated error [20:34:49] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 213 seconds [20:37:01] stat1, naw [20:37:09] New patchset: RobH; "duplicate key entry for dan's account, but duplicate entry isnt listed in any includes, so simply deleting it." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50955 [20:37:11] the analytics* nodes were using ldap nss, but i've removed that from puppet [20:37:13] yea, someone added him into key entires, but not includes, so it was borking it up. 
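The stat1 breakage being debugged here is a common accounts-in-Puppet failure mode: an SSH key entry for the user exists (in this case duplicating one already present) without the matching include that declares the account, so the puppet run on the host breaks. The snippet below only illustrates that pattern with invented class and resource names; it is not the admins.pp of the time:

    # Hypothetical account class: the key is only usable if the user it
    # belongs to is declared in the same catalog.
    class accounts::dandreescu {
        user { 'dandreescu':
            ensure     => present,
            managehome => true,
        }
        ssh_authorized_key { 'dandreescu key':
            ensure => present,
            user   => 'dandreescu',
            type   => 'ssh-rsa',
            key    => 'AAAA...placeholder...',
        }
    }

    # Declaring the same ssh_authorized_key title a second time elsewhere
    # breaks catalog compilation outright; declaring a key for a user that
    # no included class creates leaves the resource failing on every run.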
[20:37:23] it was unrelated to you, merely had the same initial error throw [20:37:30] and i jumped to conclusions ;] [20:37:38] sorry about that [20:38:09] no proooobbbs, yeah [20:38:24] but ja it got resolved, in that the resolution is: don't mix [20:38:24] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50955 [20:38:27] lunchtiiime [20:39:55] New patchset: RobH; "Revert "duplicate key entry for dan's account, but duplicate entry isnt listed in any includes, so simply deleting it."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50956 [20:41:32] nested quotes :( [20:43:38] New patchset: MaxSem; "Actually disable MobileFrontend->EventLogging integration in Labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50957 [20:43:50] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [20:44:49] New patchset: RobH; "disabling duplicate account, seems it was included" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50959 [20:48:10] RECOVERY - Host db27 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [20:48:47] New patchset: RobH; "disabling duplicate account, seems it was included" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50959 [20:49:08] Change abandoned: RobH; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50956 [20:49:31] RECOVERY - Host db27 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:49:44] New review: RobH; "once more with feeling" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50959 [20:49:59] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50959 [20:51:22] PROBLEM - Host db27 is DOWN: PING CRITICAL - Packet loss = 100% [20:54:10] RECOVERY - Host db27 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [20:54:39] apache conf change needs ops or is shell sufficient? [20:57:36] RobH: FYI: Delayed e-mail seems to be dripping on. New mail is delivered instantly, I think things are catching up from 3-4 hours ago now. [20:58:24] !log stat1 puppet runs fail due to dan account error, i am working on it, but i also want to run out and get something to eat, so leaving in broken state for 10 minutes until i return to work on it [20:58:26] Logged the message, RobH [20:59:02] Change abandoned: Tim Starling; "Can wait for vips" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50149 [21:00:49] all right, mobile deployment [21:01:07] everyone be in fear or something... [21:03:09] New patchset: MaxSem; "Actually disable MobileFrontend->EventLogging integration in Labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50957 [21:09:10] 26 20:54:39 < jeremyb_> apache conf change needs ops or is shell sufficient? [21:09:28] hashar converted the repo, he should know :) [21:10:50] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50957 [21:14:59] jeremyb_, I think you need to get it into puppet [21:15:07] so I think you need ops for that [21:15:26] everyone can submit the change, though :) [21:20:49] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50829 [21:21:09] Platonides: no, apache conf is it's own separate repo [21:21:31] Platonides: btw, did you see the bug i commented on? 
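Stepping back to the commit-message amendment earlier ("adding dan andreescu per rt4561" reworded to "per RT 4561"): Gerrit turns ticket references in commit messages and comments into links via its commentlink configuration, which applies a regex and rewrites the match into a URL. The block below shows the general shape of such a rule; the exact pattern and ticket URL in the production gerrit.config of the time are assumptions, though a case-sensitive rule like this would also explain why the lowercase "rt4561" was not linkified:

    # gerrit.config (illustrative)
    [commentlink "rt"]
        match = "\\bRT[ -]?(\\d+)\\b"
        link = "https://rt.wikimedia.org/Ticket/Display.html?id=$1"

With a rule along these lines, "RT 4561" in the amended message becomes a clickable link in the Gerrit UI, while the original form would not match.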
[21:22:16] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [21:22:26] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [21:24:01] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [21:24:10] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [21:25:19] LeslieCarr: ^^^ [21:36:05] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15162 bytes in 0.004 second response time [21:36:42] MaxSem: ^^^ that last alert has been going on for a while [21:36:56] New patchset: RobH; "Revert "disabling duplicate account, seems it was included"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51031 [21:37:39] jeremyb_, http://en.m.wikipedia.org WFM - so there's something with monitoring [21:38:13] will poke at it after the deployment and security fixage, but that check isn't mine:) [21:38:14] MaxSem: well it's looking for a pattern. probably you changed the output on that page but didn't change the check to match [21:38:57] anyway, it's a problem that it's been alerting for hours and no one's bothered to investigate yet [21:39:44] New patchset: RobH; "correcting Dan Andreescu account on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51032 [21:40:51] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51032 [21:40:55] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.073 second response time [21:41:13] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.114 second response time [21:43:49] robla: So Dan already had an account on stat1 'dandreescu' [21:43:54] with his key [21:44:00] so resolved rt4561 [21:46:24] paravoid: thanks [21:46:43] jeremyb_: yeah the monitoring is broken - [21:48:06] i'll remove that monitor [21:49:44] though hrm, then we won't be checking mobile [21:49:55] hehe we're not checking mobile on spence [21:51:52] RECOVERY - MySQL disk space on neon is OK: DISK OK [21:52:15] RECOVERY - MySQL disk space on neon is OK: DISK OK [21:52:25] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [21:52:37] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [21:57:34] * MaxSem scaps [21:59:44] New review: Dzahn; "apache-fast-test wikivoyage.eu.url mw1044" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/50954 [21:59:44] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50954 [22:00:58] * mutante syncs-apache [22:04:14] MaxSem: i would have to graceful apaches some time soon, should i wait until you're done scapping [22:05:20] mutante, does apache gracefulling cause a MW resync? [22:05:37] nope, just reloads Apache [22:05:51] then in principle it shouldn't break anything [22:06:06] (correct me if I'm wrong) [22:06:46] yea, i thought so too, just making sure [22:06:55] to not cause confusion or so [22:15:02] dzahn is doing a graceful restart of all apaches [22:15:38] !log maxsem Started syncing Wikimedia installation... 
: Weekly mobile deployment [22:15:40] Logged the message, Master [22:15:41] !log dzahn gracefulled all apaches [22:15:43] Logged the message, Master [22:16:36] !log gracefulling eqiad Apaches to push redirect for wikivoyage.eu [22:16:38] Logged the message, Master [22:20:52] paravoid, thanks for the info [22:23:20] PROBLEM - Puppet freshness on db27 is CRITICAL: Puppet has not run in the last 10 hours [22:24:48] jeremyb_: ^ once they transfer wikivoyage.eu it's done now [22:31:12] !log nagios is dead! spence.wikimedia.org if you want to access it directly [22:31:13] Logged the message, Mistress of the network gear. [22:31:18] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [22:31:20] Logged the message, Master [22:31:45] LeslieCarr: wooot [22:32:04] :) [22:32:06] so happy [22:32:12] LeslieCarr, forever? [22:32:18] forever... [22:32:33] * MaxSem performs some grave-dancing [22:33:27] New patchset: coren; "Add ssh_hba variable to turn on HBA for sshd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50913 [22:33:53] RECOVERY - Solr on vanadium is OK: All OK [22:34:57] Change abandoned: coren; "Erroneously sent the wrong changeset a second time" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50913 [22:35:25] !log installing package upgrades on fenari [22:35:27] Logged the message, Master [22:35:53] New patchset: MaxSem; "Sync Icinga's check_http_mobile with the late Nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51046 [22:36:03] jeremyb_, LeslieCarr ^^^ [22:37:41] thanks MaxSem [22:37:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51046 [22:40:05] New patchset: Lcarr; "making nagios no longer page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51047 [22:41:23] mutante: yeah, i figured :) (read the diff) [22:42:08] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:46:20] New patchset: Dzahn; "add class nrpe to node gallium for jenkins/zuul monitoring via nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51048 [22:47:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51048 [22:50:16] LeslieCarr: did you find the nrpe neon bug? [22:50:23] hashar: ^ remember this being removed at some point? readding it because it's needed for the jenkins/zuul process monitoring [22:50:46] mutante: no idea sorry :( [22:51:10] hashar: no worries, should fix the monitoring (now on Icinga) [22:51:17] mutante: I think I used something like nagios_monitor_service( process => zuul ); but that is it [22:51:25] the strange part is why it did not show as broken on spence [22:51:35] it could not work [22:51:46] and now we don't need to wonder about spence anymore :p [22:51:50] why do you have to include nrpe at the site.pp level anyway ? [22:51:56] paravoid: didn't find any bugs :( [22:51:59] shouldn't the nagios_monitoring define takes care of it already ? [22:52:04] restarting made it happy [22:52:18] it dies all the time. [22:52:25] :-/ [22:52:46] hashar: hmm. maybe. but it's not needed on all hosts, just where we want to execute plugins locally [22:52:46] I don't understand why we're switching to neon when we still have known problems that spence doesn't have, that's what I've been saying from yesterday [22:52:47] honestly, i'm not sure where to start looking, any thoughts? 
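On the recurring "LVS HTTP IPv4 on m.wikimedia.org ... pattern not found" alerts and the "Sync Icinga's check_http_mobile with the late Nagios" change: check_http can be told to require a string in the response body (-s), so the check goes CRITICAL on a perfectly healthy 200 response whenever the page no longer contains the expected text, which matches jeremyb_'s guess that the page output changed but the check did not. A rough sketch of what such a command definition looks like; the host header, URI and pattern below are placeholders, not the real check_http_mobile arguments:

    define command {
        command_name    check_http_mobile
        command_line    $USER1$/check_http -I $HOSTADDRESS$ -H en.m.wikipedia.org -u / -s 'Random article'
    }

Whenever the mobile frontend's HTML changes, the -s pattern has to be kept in step with what the page actually serves, or the content check keeps alerting even though the service is up.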
[22:53:23] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:54:44] hashar: works again [22:54:46] root@neon:~# /usr/lib/nagios/plugins/check_nrpe -H gallium.wikimedia.org -c check_jenkins [22:54:49] PROCS OK: 3 processes with args 'jenkins' [22:56:48] New patchset: coren; "(Preliminary) gridengine module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51051 [22:57:13] paravoid: do you have time to look at it with me ? [22:57:53] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:58:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:58:46] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:58:46] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:59:14] I just got the whine that nagios is gone [22:59:23] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [22:59:37] (watchmouse) [22:59:39] hrmm now there is just no checkcommand for zuul.. [23:00:06] mutante: it is just about making sure the 'zuul-server' is around [23:00:29] monitor_service { "zuul": description => "zuul_service_running", check_command => "check_procs_generic!1!3!1!20!zuul-server" } [23:00:35] from manifests/zuul.ppp [23:00:39] mutante: ^^ [23:01:28] so yeah [23:01:43] I guess whenever using monitor_service, that should include the "nrpe" puppet class [23:02:33] hashar: actually, only if you are using a check command that needs to be executed on the remove server (vs. a check that just connects from monitoring host to the monitored host) [23:03:09] ohh [23:03:29] hashar: ok, so check_procs_generic is just check_procs.. how is it being executed on gallium [23:03:43] that would check locally on spence or neon [23:03:51] I am such a noob [23:03:52] ahah [23:03:54] that's why we need to do it via NRPE [23:04:03] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15136 bytes in 0.003 second response time [23:04:04] and add a command to checkcommands [23:04:13] i'll add that.. hold on [23:04:16] so probably nagios has been complaining about zuul not running on spence/neon ? [23:04:20] yes:) [23:04:30] i think so [23:04:48] it doesnt explain why spence said it was running [23:04:50] unless the monitor_service() wrapper make it so it is always executed via nrp [23:04:51] e [23:04:52] while neon did not [23:05:15] if you look at ./puppet/templates/icinga/checkcommands.cfg.erb [23:05:32] you can see command_name check_procs_generic [23:05:41] command_line $USER1$/check_procs -w $ARG1$:$ARG2$ -c $ARG3$:$ARG4$ -a $ARG5$ [23:05:50] New patchset: Lcarr; "fixing robh's username to lowercase" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51054 [23:06:17] so that is just using check_procs locally [23:06:56] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51047 [23:06:56] while an NRPE check looks like this: nrpe_check_jenkins command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c check_jenkins [23:07:14] mutante: which room are you in ? [23:07:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51054 [23:07:34] hashar: R37 [23:12:21] New patchset: Pyoungmeister; "moving mysql nagios group resources to coredb.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51056 [23:12:43] Change restored: coren; "I think gerrit hates me*. 
First it tells me I send the same change twice, then it turns out I didn'..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50913 [23:13:20] There is now a level 0 [23:13:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51056 [23:13:57] New patchset: Dzahn; "add NRPE check commands for zuul-server process monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51058 [23:14:08] hashar: ^ [23:14:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51058 [23:17:08] New patchset: Pyoungmeister; "removing a whole bunch of unused db code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51059 [23:17:38] New patchset: Dzahn; "move ./templates/nagios/nrpe_local_icinga.cfg.erb to ./templates/icinga/nrpe_local.cfg.erb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51060 [23:18:34] New patchset: Dzahn; "move ./templates/nagios/nrpe_local_icinga.cfg.erb to ./templates/icinga/nrpe_local.cfg.erb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51060 [23:19:13] New review: Dzahn; "clean up Icinga template location" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/51060 [23:20:04] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [23:20:10] are we overloading jenkins now :p [23:20:50] New patchset: Lcarr; "switching icinga with https by default to not log you in" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51063 [23:20:56] Ryan_Lane: can you check this out ? ^^ [23:21:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51060 [23:21:22] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [23:22:31] New review: Ryan Lane; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/51063 [23:22:32] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [23:23:53] New patchset: Lcarr; "switching icinga with https by default to not log you in" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51063 [23:24:05] Ryan_Lane: look better? 
^^ [23:24:30] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [23:24:34] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [23:25:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51063 [23:25:44] New patchset: Dzahn; "fix jenkins and zuul process monitoring (turn into NRPE checks)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51064 [23:27:18] New patchset: Dzahn; "fix jenkins and zuul process monitoring (turn into NRPE checks)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51064 [23:27:48] !log maxsem synchronized php-1.21wmf10/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/51053 https://gerrit.wikimedia.org/r/51057' [23:27:50] Logged the message, Master [23:28:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51064 [23:29:18] !log maxsem synchronized php-1.21wmf9/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/51053 https://gerrit.wikimedia.org/r/51057' [23:29:20] Logged the message, Master [23:33:12] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 23:33:09 UTC 2013 [23:33:42] RECOVERY - Puppet freshness on db27 is OK: puppet ran at Tue Feb 26 23:33:41 UTC 2013 [23:33:43] RECOVERY - MySQL disk space on neon is OK: DISK OK [23:34:30] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [23:34:40] RECOVERY - MySQL disk space on neon is OK: DISK OK [23:35:22] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [23:35:48] New patchset: Lcarr; "nrpe-server is taking too long to stop and puppet times out - restart should hopefully work better" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51067 [23:37:02] paravoid: i found out why nrpe is dying - it is taking too long to stop, and so it's not restarting since puppet does a stop/start -- hasrestart should fix this [23:37:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51059 [23:39:39] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 14 seconds [23:39:40] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [23:39:50] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [23:40:15] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [23:41:36] Oh god - they're duplicating [23:44:10] Damianz: yeah mutante / LeslieCarr are working on nagios/icinga right now [23:44:39] hashar: :D At least the bots aren't taking over [23:46:10] but that shouldn't be causing anything false [23:46:20] oh hehe the duplicating [23:46:34] New patchset: Dzahn; "add check_zuul NRPE command also to Nagios nrpe_local.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51068 [23:46:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51067
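Two of the fixes that land in this stretch are worth spelling out. First, the zuul/jenkins monitoring: a check command built on check_procs runs on the monitoring server itself, so a process check for zuul-server defined that way could only ever inspect spence or neon; checking the process on gallium means going through NRPE, with the plugin invocation defined on the monitored host and the monitoring server only calling check_nrpe. The sketch below mirrors the command_line quoted from checkcommands.cfg.erb and the 1:3 / 1:20 thresholds from the monitor_service call above, but the exact contents of changes 51058/51064/51068 may differ in detail:

    # Monitoring server, local check -- wrong for a process on another host:
    define command {
        command_name    check_procs_generic
        command_line    $USER1$/check_procs -w $ARG1$:$ARG2$ -c $ARG3$:$ARG4$ -a $ARG5$
    }

    # Monitoring server, delegating to the remote NRPE daemon instead:
    define command {
        command_name    nrpe_check_zuul
        command_line    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c check_zuul
    }

    # gallium, NRPE daemon config (e.g. nrpe_local.cfg): the command that
    # actually looks at the local process table.
    command[check_zuul]=/usr/lib/nagios/plugins/check_procs -w 1:3 -c 1:20 -a 'zuul-server'

Second, the nrpe-server flapping on neon: the diagnosis above is that Puppet's default way of restarting a service is stop-then-start, and the stop takes long enough that the run times out with the daemon left down. Puppet's service resource has a hasrestart parameter for exactly this case; a minimal sketch of the shape of that fix, with the resource title assumed:

    service { 'nagios-nrpe-server':
        ensure     => running,
        enable     => true,
        # use the init script's own restart action rather than stop + start
        hasrestart => true,
    }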