[00:00:21] RECOVERY - Puppet freshness on amssq33 is OK: puppet ran at Mon Feb 25 23:59:58 UTC 2013 [00:02:45] RECOVERY - Puppet freshness on amssq44 is OK: puppet ran at Tue Feb 26 00:02:28 UTC 2013 [00:03:22] !log amssq33,amssq44 - fix more puppet runs [00:03:23] Logged the message, Master [00:05:45] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:06:41] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 10 hours [00:11:18] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 141 for key PRIMARY on query. Default dat [00:12:41] RECOVERY - Puppet freshness on mc1001 is OK: puppet ran at Tue Feb 26 00:12:38 UTC 2013 [00:12:48] RECOVERY - Puppet freshness on mc1001 is OK: puppet ran at Tue Feb 26 00:12:33 UTC 2013 [00:12:48] test123 [00:12:57] i started the plague [00:13:12] icinga-wm, O RLY? [00:13:23] yes really [00:13:37] killing all humans is part of my master plan [00:13:59] i am tempted to kick icinga-wm for being evil. [00:14:15] but then whoever is working on it may stop. [00:14:16] cyborgs will be spared in the robot cleansing [00:14:18] Change abandoned: Pyoungmeister; "there is a better way...." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [00:14:19] ok then [00:14:24] binasher: Any idea what's up with db64? [00:14:26] icinga-wm: carry on. [00:14:34] | 186733992 | system user | | heartbeat | Connect | 185590 | flush and sync binlog : fsync | REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_mas | [00:14:36] icinga-wm, while you're locked in that box, the only bad thing you can do is to flood our IRC clients;) [00:14:48] nah, its singularity on our cluster [00:14:51] we are all screwed. [00:15:21] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 181 seconds [00:15:34] "Execution of '/usr/bin/apt-cache policy .. " puppet issues .. hrmm [00:15:46] Reedy: what are you doing with db64? 
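The db35 alerts above flag both the slave thread state (Slave_SQL_Running: No with a duplicate-key error) and replication delay, and the process-list entry shows a pt-heartbeat style REPLACE INTO heartbeat.heartbeat. A minimal shell sketch of the kind of manual check behind those alerts, assuming client credentials come from a defaults file and that the heartbeat table stores UTC timestamps (host name and query details are illustrative, not the production Nagios plugin):

    # thread state, last error and native lag estimate
    mysql -h db35 -e "SHOW SLAVE STATUS\G" | egrep 'Slave_(IO|SQL)_Running|Last_Error|Seconds_Behind_Master'
    # heartbeat-based lag: newest heartbeat timestamp versus the current time
    mysql -h db35 -e "SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds FROM heartbeat.heartbeat"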
[00:16:04] Nothing, I just noticed it was unhappy on the dbtree [00:16:11] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 190 seconds [00:16:21] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 192 seconds [00:16:24] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 191 seconds [00:17:01] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 201 seconds [00:17:11] RECOVERY - Puppet freshness on mw1029 is OK: puppet ran at Tue Feb 26 00:17:10 UTC 2013 [00:17:18] RECOVERY - Puppet freshness on mw1029 is OK: puppet ran at Tue Feb 26 00:17:06 UTC 2013 [00:17:22] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [00:17:37] Reedy: oh, replication was broken over the weekend but i fixed it around an hour ago [00:17:44] it'll take some time to catch up though [00:17:44] Aha [00:18:00] That's fine then, I know it's ok to ignore [00:18:04] Cheers [00:18:41] RECOVERY - Puppet freshness on mw1059 is OK: puppet ran at Tue Feb 26 00:18:34 UTC 2013 [00:18:49] RECOVERY - Puppet freshness on mw1059 is OK: puppet ran at Tue Feb 26 00:18:33 UTC 2013 [00:19:31] RECOVERY - Puppet freshness on mw1070 is OK: puppet ran at Tue Feb 26 00:19:30 UTC 2013 [00:19:51] RECOVERY - Puppet freshness on mw1070 is OK: puppet ran at Tue Feb 26 00:19:26 UTC 2013 [00:20:31] RECOVERY - Puppet freshness on mw1157 is OK: puppet ran at Tue Feb 26 00:20:21 UTC 2013 [00:20:33] New patchset: Pyoungmeister; "setting squid service to reload when notified" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50847 [00:20:38] !log mc1001,ms1004,mw1029,mw1059,mw1070,mw1157 - fixed puppet freshness [00:20:40] Logged the message, Master [00:20:45] RECOVERY - Puppet freshness on mw1157 is OK: puppet ran at Tue Feb 26 00:20:20 UTC 2013 [00:21:02] mark: more like that ^^ [00:21:22] New patchset: Lcarr; "fixing solr plugin location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50848 [00:22:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50847 [00:22:10] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50848 [00:25:02] New patchset: Jgreen; "login def and key for Sahar Massachi (fundraising)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50849 [00:25:08] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 8 seconds [00:25:18] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 4 seconds [00:25:24] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 3 seconds [00:26:00] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 0 seconds [00:26:27] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50849 [00:30:06] New patchset: Lcarr; "fixing location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50851 [00:31:12] New review: Jgreen; "needs hearts? 
❤❤❤❤" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50778 [00:31:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50851 [00:33:33] New patchset: Jgreen; "aluminium/grosley: add user saher, remove pcoombe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50854 [00:34:06] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [00:34:06] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [00:34:27] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50854 [00:35:06] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [00:35:07] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [00:36:20] New patchset: Dzahn; "improving tone of nagios alerts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50778 [00:37:42] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [00:37:42] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [00:37:42] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [00:37:42] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [00:37:50] New review: Tim Starling; "Yes, that is an error, and another error is the assumption that we are using ImageMagick Q8. In fact..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/50149 [00:37:51] ..... [00:38:06] did something just tick over someplace or do we really have this many items not calling into puppet? [00:38:40] New patchset: Ryan Lane; "Redirect view links" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50856 [00:41:14] New patchset: Pyoungmeister; "improving tone of nagios alerts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50778 [00:41:34] New patchset: Ryan Lane; "Redirect view links" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50856 [00:41:45] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [00:41:57] The last Puppet run was at Sun Feb 24 08:29:33 UTC 2013 (2412 minutes ago). [00:42:04] ^ gallium [00:42:07] Tim-away: hey [00:42:38] mutante: so, my reasoning on my new patchset for that change is to save characters on sms's [00:42:41] does that sound reasonable? [00:42:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50856 [00:43:13] Reedy: script still running for files? [00:43:16] RobH: yes, we actually do. but it's reporting them all at once due to work on Icinga [00:43:29] Yeah [00:44:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50778 [00:44:07] AaronSchulz: it would be nice if it had some amount of progress notification [00:44:49] mutante: ok, so just duplicate notification, cool, thx. [00:45:21] notpeter: yea:) [00:46:05] RobH: we had like 10 more a little while ago :p [00:46:20] paravoid and I got the list down to half its size last week [00:46:26] then in the course of a weekend it doubled back up. [00:46:30] it happened again [00:46:33] not really [00:46:43] ? [00:46:44] are we talking about nagios? 
[00:46:49] puppetchecks [00:46:53] oh [00:46:54] puppet freshness [00:47:02] then yeah, haven't checked [00:47:14] i commented cuz new icinga instance notified on them all again [00:49:02] New patchset: Lcarr; "adding in extra checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50858 [00:49:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50858 [00:55:37] New patchset: Ottomata; "Initial commit of Kafka Module." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [01:01:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [01:03:39] New patchset: Lcarr; "more check commands!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50861 [01:04:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50861 [01:09:20] New review: Dzahn; "this is ok, doesn't break things, but also doesn't really fix it. you still get a 302 somehow" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [01:15:21] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [01:16:06] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [01:21:29] New patchset: Lcarr; "removing dupe definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50863 [01:22:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50863 [01:23:14] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 185 seconds [01:24:53] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 217 seconds [01:32:14] RECOVERY - MySQL disk space on neon is OK: DISK OK [01:32:59] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [01:33:28] New patchset: Lcarr; "inserting forgotten "}"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50866 [01:33:40] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15113 bytes in 0.001 second response time [01:34:10] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [01:35:21] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 393 seconds [01:35:23] RECOVERY - Puppet freshness on tola is OK: puppet ran at Tue Feb 26 01:35:08 UTC 2013 [01:35:59] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 198 seconds [01:36:10] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 203 seconds [01:36:10] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 406 seconds [01:36:20] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 206 seconds [01:37:02] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 197 seconds [01:37:20] PROBLEM - Host foundation-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:37:50] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:00] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:01] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:01] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:10] PROBLEM - Host 
wikiquote-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:10] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:11] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:20] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:20] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::8 [01:38:40] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.82 ms [01:38:41] RECOVERY - Host foundation-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.12 ms [01:38:42] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.97 ms [01:38:42] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.86 ms [01:38:42] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.96 ms [01:38:43] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.89 ms [01:38:43] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.00 ms [01:38:43] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.72 ms [01:38:50] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.94 ms [01:38:50] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.93 ms [01:40:10] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [01:40:20] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [01:40:38] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [01:41:24] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [01:45:05] Uhh [01:47:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50866 [01:47:25] yeah, there was some ipv6 connectivity issue [01:58:06] !log test [01:58:07] Logged the message, Master [01:59:01] !log test [01:59:03] Logged the message, Master [01:59:45] !log test [01:59:47] Logged the message, Master [01:59:48] Logged the message, Master [02:02:56] LeslieCarr: Would you have time to do that LVS thing I've been talking about tomorrow (or any other day this week that's not Wednesday)? 
[02:03:02] i think so [02:04:04] OK good [02:04:11] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:04:16] I'll hit you up about it tomorrow [02:06:02] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [02:06:11] !log test [02:06:14] Logged the message, Master [02:06:36] !log testing twitter [02:06:38] Logged the message, Master [02:06:53] That works :) [02:07:00] it's not working [02:07:06] oh [02:07:07] it is [02:07:08] :) [02:07:11] sweet [02:08:32] I think that was the last thing needed before migrating wikitech [02:10:51] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds [02:11:04] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram [02:11:07] Logged the message, Master [02:11:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 193 seconds [02:11:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 194 seconds [02:12:08] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 204 seconds [02:28:06] !log LocalisationUpdate completed (1.21wmf10) at Tue Feb 26 02:28:06 UTC 2013 [02:28:09] Logged the message, Master [02:29:03] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram [02:29:04] Logged the message, Master [02:35:00] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [02:35:40] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:36:07] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:36:34] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [02:37:03] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram [02:37:05] Logged the message, Master [02:38:40] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours [02:39:43] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:40:26] New patchset: Brion VIBBER; "Add Vimpelcom Beeline to Zero carriers list (testing range)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47520 [02:41:28] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram [02:41:30] Logged the message, Master [02:44:51] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [02:44:53] Logged the message, Master [02:46:09] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [02:46:12] Logged the message, Master [02:47:45] New review: RobH; "brion said to ;]" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47520 [02:47:57] \o/ [02:47:59] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47520 [02:48:40] New review: RobH; "Ok, how about a serious comment since most of ops isn't about. This looks sane and is in the same f..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47520 [02:49:37] brion: do we need to clear any cache or anything crazy for that? [02:49:47] or just simply let it sit and when puppet runs on varnish it'll take effect? [02:49:54] i… don't know :D [02:49:58] shouldn't need anything cleared [02:50:09] do we know how long it'll take for puppet to update? 
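A change like this only takes effect once the agent next applies the catalog on the affected hosts, so the alternative to waiting out the run interval is to kick a run by hand. A minimal sketch, using the puppet 2.x command spelling current at the time (the host name is a placeholder, and this is not a claim about how runs are actually fanned out across the cluster):

    # force an immediate catalog run on one cache host instead of waiting
    ssh some-varnish-host 'sudo puppetd --test'
    # newer agents spell the same thing: sudo puppet agent --test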
[02:50:15] it'd be awesome if this went live in the next hour [02:50:18] if not, we'll live [02:50:22] heh, it runs every 30 [02:50:26] days [02:50:29] haha [02:50:30] so if it can wait for that, we'll be ok [02:50:33] hah [02:50:36] 30 mins is fine :) [02:50:37] otherwise i have to ddsh a puppet run [02:50:48] You meant the run /takes/ 30 mins? [02:50:48] cool, its merged on sockpuppet, so live [02:50:52] * RoanKattouw stops trolling puppet [02:50:57] RoanKattouw: if its on spence, a run takes a fwe hours ;] [02:50:59] dash ? aren't you guys using salt nowadays? [02:51:05] no, ryan is using salt. [02:51:13] i have no fucking idea how to use it [02:51:13] we have no docs yet. [02:51:13] Only for git-deploy [02:51:13] ahh [02:51:15] easy [02:51:21] No one has any idea how to use it other than through the git-deploy wrapper [02:51:24] i asked ryan, who said 'well, for now its my beta so i guess you can just tell me to do things' [02:51:28] echo "your command there" |  mail ryan@wikimedia.org --subject 'please salt this' [02:51:30] I know about half the things I need to do to set up new machines [02:51:33] hehe [02:51:43] yea the new machine setup thing is super annoying. [02:51:53] if it stays this way for longer than another week im gonna have to learn how to do it all. [02:51:58] cuz i hate having to ask someone to do it for me. [02:52:02] !log LocalisationUpdate completed (1.21wmf9) at Tue Feb 26 02:52:01 UTC 2013 [02:52:04] Logged the message, Master [02:52:20] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:52:33] * RobH only sets up an average of a dozen or so new machines a week [02:52:52] Well so the thing is [02:52:55] I can bring up salt on a new box [02:53:01] Once Ryan accepts its key on the master [02:53:06] And he hasn't shown me how to do the latter [02:53:17] i think this should be on ops agenda this week [02:53:21] Yes [02:53:21] i'll ask ryan about it tomorrow morning. [02:53:28] Also, regarding Parsoid [02:53:34] or you can get in your bashrc something like: alias salt='ddsh -m all "$@"' [02:54:00] we should probably ask ryan to give a presentation about salt :-] [02:54:07] Leslie and I are probably going to do the Varnish LVS thing tomorrow, once that's done you can start replacing those if you want [02:54:13] and by ask tomorrow [02:54:16] i mean shoot an email to him now [02:54:32] RobH: What's going on with wtp1001 now? [02:54:37] You were gonna replace it? [02:54:49] Well, as soon as you can take the existing wtp1001 offline [02:54:59] I can wipe its existence off the map and move the name to new hardware [02:55:06] It's good to go [02:55:11] oh, its not pooled? [02:55:16] Yeah it's depooled [02:55:18] * RoanKattouw finds URL [02:55:32] http://noc.wikimedia.org/pybal/eqiad/parsoid [02:55:39] dinner time :-) [02:55:47] I didn't bother stopping the services on it because puppet will just restart them [02:55:56] RoanKattouw: Ok, I am going to add it to decomissioning.pp so it pulls out of nagios. [02:55:58] But it's not getting any traffic [02:56:00] OK [02:56:17] once nagios stops monitoring it, I'll kill the box and redo dns for wtp1001 [02:56:29] BTW, since you're ops [02:56:37] Who do we bug about databases with off-the-charts replag? Asher? [02:56:53] OK good [02:56:55] I would say Peter/Asher [02:57:51] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [02:59:02] New patchset: Aaron Schulz; "Added a second pipeline for more immediate jobs." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/50877 [02:59:07] OK it looks like the borked slave is in Tampa [02:59:11] So it only impacts testwiki [02:59:40] New review: Aaron Schulz; "Tested the loop script on mw1010" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50877 [03:02:15] New patchset: RobH; "wtp1001 migration hosts, so decom to remove from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50878 [03:03:05] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [03:03:06] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50878 [03:03:07] Logged the message, Master [03:03:55] RoanKattouw: coolness, i'll get the new one setup prolly tomorrow [03:04:10] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [03:04:12] Logged the message, Master [03:04:14] (i realize this week you may or may not have time to mess with it with all the other folks in town ;) [03:06:26] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [03:06:27] Logged the message, Master [03:07:01] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:07:01] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [03:07:20] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [03:07:20] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:10:44] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [03:10:45] Logged the message, Master [03:17:14] New review: Jeremyb; "It's one less redirect now than before (302 -> 200 instead of 302 -> 302 -> 200)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49200 [03:17:40] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [04:06:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:07:55] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:07:48 UTC 2013 [04:08:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:08:15] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:08:12 UTC 2013 [04:08:35] RECOVERY - Puppet freshness on tola is OK: puppet ran at Tue Feb 26 04:08:30 UTC 2013 [04:09:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:09:25] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:09:18 UTC 2013 [04:10:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:18:25] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 184 seconds [04:18:46] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds [04:18:47] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:18:39 UTC 2013 [04:19:02] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 192 seconds [04:19:15] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:19:29] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 202 seconds [04:45:25] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 04:45:15 UTC 2013 [04:46:15] PROBLEM - Puppet 
freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [04:52:59] New patchset: Pyoungmeister; "switching from unicode heart to asci heart for sms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50881 [04:54:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50881 [04:56:40] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [04:57:40] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.000 second response time on port 8123 [05:09:53] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [05:10:00] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [05:10:11] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [05:10:30] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [05:34:28] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15113 bytes in 0.003 second response time [05:44:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [05:44:50] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds [05:45:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 194 seconds [05:45:28] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds [05:46:49] PROBLEM - SSH on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:46:58] PROBLEM - Varnish HTTP bits on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:47:48] RECOVERY - SSH on strontium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:48:48] RECOVERY - Varnish HTTP bits on strontium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [05:59:13] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [06:02:13] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [06:02:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [06:02:41] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [06:02:50] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [06:03:53] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 0 seconds [06:04:11] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 0 seconds [06:08:53] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours [06:21:56] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:21:59] Logged the message, Master [06:26:38] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:26:40] Logged the message, Master [06:28:36] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:28:38] Logged the message, Master [06:29:23] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 06:29:21 UTC 2013 [06:29:52] Reedy has switched time zones on me. [06:30:13] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [06:34:09] Susan: more than a few people did that [06:34:35] e.g. 
sumanah and ottomata [06:35:34] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [06:36:04] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [06:36:53] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [06:37:20] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [06:40:54] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:40:56] Logged the message, Master [06:40:56] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [06:41:59] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [06:43:37] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/includes/pagers/ArticleTable.php [06:43:38] Logged the message, Master [06:53:32] !log reedy synchronized php-1.21wmf10/extensions/EducationProgram/ [06:53:33] Logged the message, Master [07:05:34] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 187 seconds [07:06:04] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 197 seconds [07:06:16] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 199 seconds [07:06:34] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 204 seconds [07:06:35] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (23619), Total (30206) [07:07:37] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (17917), Total (24751) [07:09:40] RECOVERY - MySQL disk space on neon is OK: DISK OK [07:10:00] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [07:10:12] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [07:10:19] RECOVERY - MySQL disk space on neon is OK: DISK OK [07:10:46] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [07:18:30] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [07:18:52] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [07:19:00] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:19:19] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:22:19] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [07:23:22] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [07:39:40] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:41:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:41:30] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [07:43:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.475 second response time [07:50:31] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , svwiki (13156), Total (24951) [07:51:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds [07:51:26] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 196 seconds [07:51:27] PROBLEM - 
MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 200 seconds [07:51:34] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 200 seconds [08:00:37] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 18 seconds [08:01:01] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 0 seconds [08:01:07] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [08:01:17] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [08:01:27] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [08:01:46] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [08:02:13] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [08:02:24] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [08:05:22] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [08:16:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:18:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:42] New patchset: ArielGlenn; "larger root partition size, correct recipe name for snapshots" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50890 [08:26:40] Which is the tech channel that normally has status in the topic? [08:27:37] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [08:27:46] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50890 [08:27:50] wikimedia-tech [08:28:15] apergos: /a -> /local ? [08:28:23] yes [08:28:26] my preference would be /srv/snapshots [08:28:27] we don't use /a over there [08:28:30] or something under /srv [08:28:35] we won't use /local for anything either [08:28:38] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [08:28:43] this is just to keep peopel from trying to put stuff in it [08:28:56] ? [08:29:01] I might use it for a temp job and then throw stuff away [08:29:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.324 second response time [08:29:11] once in a great while. [08:29:38] still [08:29:40] apergos, thanks. Is the Wikitech and Labs merger in progress? [08:29:57] superm401: I don't know about the merger, ryan or paravoid would likely know [08:30:14] not me [08:30:18] ryan is working on it [08:30:47] ah while you are there paravoid [08:31:02] what is the status of things re ms-be11 and 12 [08:31:10] are you asking me? [08:31:15] can we put the new 12 in? can I move forward on pulling other hosts out? [08:31:22] yes I'm asking you [08:31:35] how would I know? [08:31:44] you asked me to pull those two hosts out [08:32:01] ooh [08:32:09] completely forgot about that :-) [08:32:17] yes, about the juniper upgrade [08:32:20] ah well good I asked then :-) [08:32:24] I'll tell mark or leslie tomorrow [08:32:28] ok great [08:32:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [08:33:03] uh a possibly ignorant question but [08:33:14] how exactly is this going to work? I mean, [08:33:22] swift is all in tampa afaik [08:33:31] and we'll lose tampa for that time right? 
[08:33:44] no [08:33:52] don't remember the details [08:33:57] some row was it [08:34:49] ah that's it [08:35:05] ok then, just keep me in the loop :-) [08:35:13] yeah [08:35:23] I think so far you're the only one in the loop :P [08:35:42] since I completely forgot about that [08:36:28] :-D [08:51:33] New patchset: ArielGlenn; "snapshot recipe use the standard /srv for extra partion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50893 [08:52:07] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50893 [08:55:36] :D [08:59:58] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [09:06:58] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [09:12:43] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Tue Feb 26 09:12:17 UTC 2013 [09:19:10] RECOVERY - Puppet freshness on db1024 is OK: puppet ran at Tue Feb 26 09:18:48 UTC 2013 [09:21:15] RECOVERY - Puppet freshness on nescio is OK: puppet ran at Tue Feb 26 09:20:59 UTC 2013 [09:28:49] New patchset: ArielGlenn; "make sure /srv exists and is dir, not symlink for new snapshots" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50894 [09:29:30] RECOVERY - Puppet freshness on db35 is OK: puppet ran at Tue Feb 26 09:29:14 UTC 2013 [09:35:08] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15113 bytes in 0.004 second response time [09:49:27] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (75032), skwiki (14443), Total (160035) [09:50:21] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (76682), skwiki (14477), Total (157812) [10:10:05] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [10:12:24] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [10:22:17] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 181 seconds [10:22:45] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 186 seconds [10:22:48] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 188 seconds [10:23:48] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 204 seconds [10:25:00] RECOVERY - LDAP on sanger is OK: TCP OK - 0.003 second response time on port 389 [10:25:07] RECOVERY - LDAP on sanger is OK: TCP OK - 0.027 second response time on port 389 [10:26:17] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:26:47] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:27:07] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:28:28] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:33:47] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 12 seconds [10:34:07] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [10:34:07] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [10:34:17] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 8 seconds [10:34:28] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 4 seconds [10:34:55] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds 
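The recurring "Puppet freshness" alerts in this log boil down to one question: has the agent completed a run within the last 10 hours? A hypothetical stand-alone probe in that spirit, reading the puppet 2.x default state file; the production check is driven through Nagios/Icinga and is not shown here:

    # age of the last completed agent run, from the state file's mtime
    last=$(stat -c %Y /var/lib/puppet/state/last_run_summary.yaml)
    age=$(( $(date +%s) - last ))
    if [ "$age" -gt $(( 10 * 3600 )) ]; then
        echo "CRITICAL: Puppet has not run in the last 10 hours"
        exit 2
    fi
    echo "OK: puppet ran $(( age / 60 )) minutes ago"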
[10:35:07] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [10:35:08] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [10:42:07] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 1.740 second response time [10:42:37] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:42:43] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:43:10] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [11:15:16] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [11:16:10] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [11:18:07] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [11:19:37] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [11:20:44] !log mlitn synchronized php-1.21wmf10/extensions/ArticleFeedback/api/ApiArticleFeedback.php 'Update ArticleFeedback to master' [11:20:47] Logged the message, Master [11:46:28] New review: Silke Meyer; "I think the definition of mw-extension is missing (it was in the old mediawiki.pp file). Doesn't wor..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50451 [12:01:27] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [12:08:07] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 12:07:59 UTC 2013 [12:08:18] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Tue Feb 26 12:08:09 UTC 2013 [12:08:27] RECOVERY - Puppet freshness on db1024 is OK: puppet ran at Tue Feb 26 12:08:18 UTC 2013 [12:08:27] RECOVERY - Puppet freshness on nescio is OK: puppet ran at Tue Feb 26 12:08:19 UTC 2013 [12:08:27] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [12:08:37] RECOVERY - Puppet freshness on db35 is OK: puppet ran at Tue Feb 26 12:08:27 UTC 2013 [12:08:37] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 12:08:28 UTC 2013 [12:09:27] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [12:09:37] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 12:09:32 UTC 2013 [12:10:27] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [12:26:25] !log temp blacklist live hack on sodium for exim4 to prevent some spam, see /etc/exim4/blacklist for that [12:26:26] Logged the message, Master [12:35:28] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 12:35:23 UTC 2013 [12:40:02] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours [12:41:05] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [12:52:45] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [12:58:45] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:00:43] Reedy: is the script done or still moving transcoded files? 
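The "temp blacklist live hack" logged above for sodium is only identified by its file path, /etc/exim4/blacklist. A minimal sketch of how a file-based sender blacklist is commonly wired into Exim; the ACL condition shown is an assumption, since the actual hack is not in the log:

    # one blocked sender address or pattern per line
    echo 'spammer@example.com' >> /etc/exim4/blacklist
    # assumed ACL wiring, e.g. in the RCPT ACL:  deny senders = /etc/exim4/blacklist
    # a fake SMTP session lets you watch the ACL fire without sending real mail
    exim4 -bh 192.0.2.25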
[13:05:48] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [13:07:08] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [13:07:18] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [13:08:32] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [13:09:17] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [13:14:23] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (29804), Total (30320) [13:14:41] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (27735), Total (28293) [13:19:02] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:30:08] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [13:30:26] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [13:35:18] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15112 bytes in 0.003 second response time [13:39:17] RECOVERY - MySQL disk space on neon is OK: DISK OK [13:39:18] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [13:40:05] RECOVERY - MySQL disk space on neon is OK: DISK OK [13:40:11] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [14:08:05] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [14:08:06] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours [14:10:35] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours [14:10:35] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [14:13:29] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , cawiki (13628), enwiki (676256), jawiki (36100), plwiki (34316), Total (765021) [14:13:47] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , cawiki (13079), enwiki (675613), jawiki (34965), plwiki (33480), Total (761557) [14:16:16] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:17:59] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:17:59] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:35] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:44] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [14:25:16] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds [14:25:25] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds [14:26:05] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 202 seconds [14:29:55] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:45] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [14:32:14] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 24 seconds [14:32:16] RECOVERY - MySQL Replication Heartbeat 
on db53 is OK: OK replication delay 0 seconds [14:32:17] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:33:17] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:34:11] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:34:25] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:35:59] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.372 seconds [14:38:05] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.008 second response time [14:42:35] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:43:25] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [14:55:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.095 second response time [15:05:29] !log maxsem synchronized php-1.21wmf10/extensions/GeoData/solrupdate.php [15:05:31] Logged the message, Master [15:06:14] RECOVERY - LDAP on sanger is OK: TCP OK - 0.027 second response time on port 389 [15:06:55] !log maxsem synchronized php-1.21wmf9/extensions/GeoData/solrupdate.php [15:06:57] Logged the message, Master [15:07:15] Ohai. Where exactly is the labs' puppet config? In operations/puppet? [15:07:29] RECOVERY - LDAP on sanger is OK: TCP OK - 0.002 second response time on port 389 [15:08:25] Coren, depends on what you want. the virt cluster is there, as well beta [15:09:43] MaxSem: The manifests used to configure the instances. [15:09:58] yes [15:10:17] MaxSem: kk, thanks. [15:11:14] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [15:12:41] MaxSem: Oh, while you're there. :-) Do we have a naming convention for project-specific classes? [15:12:53] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [15:13:21] have you tried asking in #-labs? [15:13:33] they should know better than me [15:14:01] !log suspended i-000001d7 and i-0000039e for sending out malicious emails [15:14:02] Logged the message, Master [15:14:52] MaxSem: Wait, you don't know everything? :-P [15:15:14] yes, I'm incompetent [15:16:58] I'm not sure it's actually a binary choice. 
:-) [15:29:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [16:00:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:04:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:01] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 16:07:53 UTC 2013 [16:08:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:08:21] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 16:08:19 UTC 2013 [16:09:20] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours [16:09:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:09:40] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 16:09:30 UTC 2013 [16:10:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:11:06] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 16:10:55 UTC 2013 [16:11:20] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:13:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.292 second response time [16:19:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [16:21:31] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:50] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:23] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:32] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:20] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [16:25:30] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 189 seconds [16:25:56] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds [16:26:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 219 seconds [16:33:21] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:34:11] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.053 seconds [16:34:20] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:35:33] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 194 seconds [16:35:33] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.001 second response time [16:37:33] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 7 seconds [16:41:23] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [16:41:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [16:41:23] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [16:42:26] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [16:45:21] 
PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [16:45:51] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [16:47:06] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [16:47:32] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [17:02:48] New patchset: coren; "Add ssh_hba variable to turn on HBA for sshd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50913 [17:17:58] RECOVERY - MySQL disk space on neon is OK: DISK OK [17:17:59] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [17:18:18] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [17:18:19] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [17:18:26] RECOVERY - MySQL disk space on neon is OK: DISK OK [17:35:18] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15113 bytes in 0.003 second response time [17:37:23] Could it be there's something wrong with our e-mail? [17:37:38] I think I'm not getting all Gerrit e-mails, and on interactive actions, I'm getting 400 errors. [17:37:48] (not all the time, but happened to me 4 times today) [17:38:36] RobH (or another ops jock :P): ^^ [17:38:52] I am not allowed to do anythign not sprint related or i get in trouble [17:38:53] so sorry dude [17:38:59] i would email ops-requests@rt.wikimedia.org [17:39:59] siebrand: what do you mean interactive actions email issue? [17:40:04] (soudns like two different things) [17:40:28] RobH: When I +1/+2/-1 something in Gerrit, I've gotten errors from Gerrit about not being able to send e-mail. [17:40:45] RobH: I think I'm also missing e-mails on gerrit actions to mediawiki-commits. [17:41:18] RobH: The two are most likely related (at least in my lay man's thinking) [17:41:51] can you test if it's still an issue now, siebrand? [17:42:26] apergos: Can't reproduce. All I can do is watch #mediawiki to see if someone does something, and then see if I get mail. [17:42:41] ok (maybe ask her to do somethign as a test?) [17:42:45] apergos: Also hard to "just make changes". But can you confirm something was wrong before and shouldn't be any longer or something? [17:43:05] well we had limited email issues earlier, but I don't know if it would have affected you [17:43:25] that's why the best thing to do is chck now to see whether you still have the problem [17:43:54] wow, I read 'someone' as sumanah :-D [17:43:59] ugh [17:44:55] siebrand: Ok, it seems that we had an issue in labs [17:45:01] !log Copying math files into ceph [17:45:02] where there was some super spammy instances that are now frozen [17:45:03] Logged the message, Master [17:45:12] but its whats caused the masssssssiiiivvveee email slowdowns [17:45:49] RobH: Ah, so in theory there may be a pile in the queue? 
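Confirming a "pile in the queue" on the relay is a matter of asking Exim directly; a minimal sketch, to be run on the mail relay itself (which this part of the log does not name):

    exim4 -bpc          # just the number of messages currently queued
    exim4 -bp | tail    # spot-check a few entries (age, size, sender, recipients)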
[17:46:08] thats my understanding, but I just asked mark and got a quick reply [17:46:15] so i dont wanna misrepresent, but thats my understanding [17:46:53] (mark is looking into it now) [17:48:18] opendj is defunct again over on sanger [17:52:35] RECOVERY - LDAP on sanger is OK: TCP OK - 0.031 second response time on port 389 [17:52:38] RECOVERY - LDAP on sanger is OK: TCP OK - 0.002 second response time on port 389 [17:54:32] !log restarting opendj on sanger [17:54:34] Logged the message, Master [17:54:43] where is Ryan :) [18:00:05] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [18:00:19] right. [18:00:22] hah, paravoid ^ [18:00:29] sigh [18:00:42] opendj is a mess there [18:01:03] doesn't even have logs [18:01:17] I'll restart it until Ryan comes in [18:01:38] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [18:02:05] RECOVERY - LDAP on sanger is OK: TCP OK - 0.028 second response time on port 389 [18:03:57] RECOVERY - LDAP on sanger is OK: TCP OK - 0.010 second response time on port 389 [18:07:17] New patchset: MaxSem; "Fix replication checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50829 [18:24:55] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:45] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [18:30:47] mutante: seen the latest on rt 4378 ? [18:31:25] (aka fax machine) [18:31:54] jeremyb_: thanks, just saw it now, will handle it, but we are in meeting [18:32:04] mutante: right [18:34:22] hackathon [18:52:39] New patchset: Reedy; "Remove temp hack for bug 31187" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47519 [18:52:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47519 [18:54:12] New patchset: Reedy; "(bug 40759) Let Proofread Page setup namespaces for fi.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [18:54:30] New review: Reedy; "Don't the pages need moving as the namespace number is saved in the database?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [18:55:35] New patchset: Reedy; "Remove disabled ReaderFeedback" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47532 [18:55:43] New patchset: Reedy; "For bug 44570 - Make the parser cache expire at 30 days. Stops us having stale resource links" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47202 [18:56:56] New patchset: Jeremyb; "For bug 44570 - Make the parser cache expire at 30 days" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47202 [18:58:20] !log reedy synchronized wmf-config/CommonSettings.php [18:58:22] Logged the message, Master [19:01:15] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [19:02:33] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [19:13:00] !log freezing all exim4 bounces to restore normal email delivery [19:13:01] Logged the message, Master [19:20:15] RobH: cmjohnson1 do you know the brand of cabinet in equinix? [19:20:19] or have any closeup pics that might have it ? 
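The bounce freeze logged a few lines up ("freezing all exim4 bounces to restore normal email delivery") is typically done by selecting the null-sender messages in the queue and freezing them; the exact command used is not in the log, so this is a sketch of the usual approach:

    # bounce messages carry the empty envelope sender <>; select their IDs and freeze them
    exiqgrep -f '^<>$' -i | xargs exim4 -Mf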
[19:22:40] lesliecarr: i dont know the brand [19:24:10] * jeremyb_ hands cmjohnson1 a tab key [19:27:59] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50757 [19:32:33] PROBLEM - Host payments4 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:21] !log package updates & reboots of payments hosts [19:34:23] Logged the message, Master [19:35:35] PROBLEM - check_apache2 on payments4 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [19:35:43] RECOVERY - Host payments4 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [19:39:33] PROBLEM - Host payments1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:40:21] New review: Tpt; "Yes, of course, the pages need moving. See https://bugzilla.wikimedia.org/show_bug.cgi?id=40759#c25" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [19:40:32] RECOVERY - check_apache2 on payments4 is OK: PROCS OK: 7 processes with command name apache2 [19:44:15] New patchset: Ryan Lane; "Disable reactors for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50942 [19:44:42] RECOVERY - Host payments1001 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [19:45:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50942 [19:54:32] PROBLEM - Host payments1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:38] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [19:55:59] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [19:59:42] RECOVERY - Host payments1003 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [20:02:42] PROBLEM - Host payments1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:04:32] RECOVERY - Puppet freshness on db1039 is OK: puppet ran at Tue Feb 26 20:04:19 UTC 2013 [20:05:22] RECOVERY - Host payments1004 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:05:35] RECOVERY - Puppet freshness on ssl1002 is OK: puppet ran at Tue Feb 26 20:05:23 UTC 2013 [20:09:11] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [20:09:22] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:09:42] RECOVERY - Puppet freshness on ssl1002 is OK: puppet ran at Tue Feb 26 20:09:34 UTC 2013 [20:09:43] RECOVERY - Puppet freshness on db1039 is OK: puppet ran at Tue Feb 26 20:09:34 UTC 2013 [20:10:40] PROBLEM - NTP on mw1041 is CRITICAL: NTP CRITICAL: Offset unknown [20:11:32] PROBLEM - Apache HTTP on mw1041 is CRITICAL: Connection refused [20:12:02] PROBLEM - Apache HTTP on mw1041 is CRITICAL: Connection refused [20:12:28] RECOVERY - NTP on mw1041 is OK: NTP OK: Offset 0.00252366066 secs [20:12:55] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 186 seconds [20:13:13] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [20:13:14] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [20:13:34] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [20:13:44] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 210 seconds [20:15:07] New patchset: Lcarr; "fixing contint public ip ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50945 [20:15:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50945 [20:15:55] New patchset: RobH; "adding dan andreescu per rt4561" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/50948 [20:16:04] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (390929), Total (399119) [20:20:44] New patchset: Jeremyb; "adding dan andreescu per RT 4561" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50948 [20:21:54] !log DNS update - add wikiovyage.eu [20:21:55] Logged the message, Master [20:22:11] meh, wikivoyage.eu of course [20:22:49] ah [20:22:52] hah* [20:23:12] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50948 [20:23:26] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (372745), svwiki (15197), Total (395452) [20:23:34] mutante: social media's broken so you don't have typos propagated as far :P [20:23:50] New patchset: Jgreen; "change payments URL used in pybal/nagios checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50952 [20:24:16] jeremyb_: why/what did you do to my commit? [20:24:33] (just wondering why you even have a comment on it) [20:24:43] RobH: changed the commit msg so that it would linkify the RT ref [20:24:47] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50952 [20:25:18] i wasnt expecting another patchset, so surprised me, heh [20:25:28] heh [20:29:14] New patchset: Dzahn; "add wikivoyage.eu Apache redirect (RT-4378)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50954 [20:29:41] apergos: mark and leslie say they won't update junos at this point, so feel free to continue as you were with ms-be11/12 [20:33:14] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [20:33:24] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 187 seconds [20:33:27] ottomata: So did the ldap admins.pp conflict thing get resolved? [20:33:28] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 187 seconds [20:33:35] cuz im having issues adding a user to stat1, which i think was involved in that [20:33:38] but i am no longer certain [20:34:07] oh god damn it, nevermind. [20:34:25] i missed another unrelated error [20:34:49] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 213 seconds [20:37:01] stat1, naw [20:37:09] New patchset: RobH; "duplicate key entry for dan's account, but duplicate entry isnt listed in any includes, so simply deleting it." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50955 [20:37:11] the analytics* nodes were using ldap nss, but i've removed that from puppet [20:37:13] yea, someone added him into key entires, but not includes, so it was borking it up. 
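The stat1 breakage being debugged here is a common accounts-in-Puppet failure mode: an SSH key entry for the user exists (in this case duplicating one already present) without the matching include that declares the account, so the puppet run on the host breaks. The snippet below only illustrates that pattern with invented class and resource names; it is not the admins.pp of the time:

    # Hypothetical account class: the key is only usable if the user it
    # belongs to is declared in the same catalog.
    class accounts::dandreescu {
        user { 'dandreescu':
            ensure     => present,
            managehome => true,
        }
        ssh_authorized_key { 'dandreescu key':
            ensure => present,
            user   => 'dandreescu',
            type   => 'ssh-rsa',
            key    => 'AAAA...placeholder...',
        }
    }

    # Declaring the same ssh_authorized_key title a second time elsewhere
    # breaks catalog compilation outright; declaring a key for a user that
    # no included class creates leaves the resource failing on every run.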
[20:37:23] it was unrelated to you, merely had the same initial error throw [20:37:30] and i jumped to conclusions ;] [20:37:38] sorry about that [20:38:09] no proooobbbs, yeah [20:38:24] but ja it got resolved, in that the resolution is: don't mix [20:38:24] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50955 [20:38:27] lunchtiiime [20:39:55] New patchset: RobH; "Revert "duplicate key entry for dan's account, but duplicate entry isnt listed in any includes, so simply deleting it."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50956 [20:41:32] nested quotes :( [20:43:38] New patchset: MaxSem; "Actually disable MobileFrontend->EventLogging integration in Labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50957 [20:43:50] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [20:44:49] New patchset: RobH; "disabling duplicate account, seems it was included" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50959 [20:48:10] RECOVERY - Host db27 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [20:48:47] New patchset: RobH; "disabling duplicate account, seems it was included" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50959 [20:49:08] Change abandoned: RobH; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50956 [20:49:31] RECOVERY - Host db27 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:49:44] New review: RobH; "once more with feeling" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/50959 [20:49:59] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50959 [20:51:22] PROBLEM - Host db27 is DOWN: PING CRITICAL - Packet loss = 100% [20:54:10] RECOVERY - Host db27 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [20:54:39] apache conf change needs ops or is shell sufficient? [20:57:36] RobH: FYI: Delayed e-mail seems to be dripping on. New mail is delivered instantly, I think things are catching up from 3-4 hours ago now. [20:58:24] !log stat1 puppet runs fail due to dan account error, i am working on it, but i also want to run out and get something to eat, so leaving in broken state for 10 minutes until i return to work on it [20:58:26] Logged the message, RobH [20:59:02] Change abandoned: Tim Starling; "Can wait for vips" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50149 [21:00:49] all right, mobile deployment [21:01:07] everyone be in fear or something... [21:03:09] New patchset: MaxSem; "Actually disable MobileFrontend->EventLogging integration in Labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50957 [21:09:10] 26 20:54:39 < jeremyb_> apache conf change needs ops or is shell sufficient? [21:09:28] hashar converted the repo, he should know :) [21:10:50] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50957 [21:14:59] jeremyb_, I think you need to get it into puppet [21:15:07] so I think you need ops for that [21:15:26] everyone can submit the change, though :) [21:20:49] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50829 [21:21:09] Platonides: no, apache conf is it's own separate repo [21:21:31] Platonides: btw, did you see the bug i commented on? 
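Stepping back to the commit-message amendment earlier ("adding dan andreescu per rt4561" reworded to "per RT 4561"): Gerrit turns ticket references in commit messages and comments into links via its commentlink configuration, which applies a regex and rewrites the match into a URL. The block below shows the general shape of such a rule; the exact pattern and ticket URL in the production gerrit.config of the time are assumptions, though a case-sensitive rule like this would also explain why the lowercase "rt4561" was not linkified:

    # gerrit.config (illustrative)
    [commentlink "rt"]
        match = "\\bRT[ -]?(\\d+)\\b"
        link = "https://rt.wikimedia.org/Ticket/Display.html?id=$1"

With a rule along these lines, "RT 4561" in the amended message becomes a clickable link in the Gerrit UI, while the original form would not match.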
[21:22:16] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [21:22:26] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [21:24:01] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [21:24:10] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [21:25:19] LeslieCarr: ^^^ [21:36:05] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 15162 bytes in 0.004 second response time [21:36:42] MaxSem: ^^^ that last alert has been going on for a while [21:36:56] New patchset: RobH; "Revert "disabling duplicate account, seems it was included"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51031 [21:37:39] jeremyb_, http://en.m.wikipedia.org WFM - so there's something with monitoring [21:38:13] will poke at it after the deployment and security fixage, but that check isn't mine:) [21:38:14] MaxSem: well it's looking for a pattern. probably you changed the output on that page but didn't change the check to match [21:38:57] anyway, it's a problem that it's been alerting for hours and no one's bothered to investigate yet [21:39:44] New patchset: RobH; "correcting Dan Andreescu account on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51032 [21:40:51] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51032 [21:40:55] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.073 second response time [21:41:13] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.114 second response time [21:43:49] robla: So Dan already had an account on stat1 'dandreescu' [21:43:54] with his key [21:44:00] so resolved rt4561 [21:46:24] paravoid: thanks [21:46:43] jeremyb_: yeah the monitoring is broken - [21:48:06] i'll remove that monitor [21:49:44] though hrm, then we won't be checking mobile [21:49:55] hehe we're not checking mobile on spence [21:51:52] RECOVERY - MySQL disk space on neon is OK: DISK OK [21:52:15] RECOVERY - MySQL disk space on neon is OK: DISK OK [21:52:25] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [21:52:37] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [21:57:34] * MaxSem scaps [21:59:44] New review: Dzahn; "apache-fast-test wikivoyage.eu.url mw1044" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/50954 [21:59:44] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50954 [22:00:58] * mutante syncs-apache [22:04:14] MaxSem: i would have to graceful apaches some time soon, should i wait until you're done scapping [22:05:20] mutante, does apache gracefulling cause a MW resync? [22:05:37] nope, just reloads Apache [22:05:51] then in principle it shouldn't break anything [22:06:06] (correct me if I'm wrong) [22:06:46] yea, i thought so too, just making sure [22:06:55] to not cause confusion or so [22:15:02] dzahn is doing a graceful restart of all apaches [22:15:38] !log maxsem Started syncing Wikimedia installation... 
: Weekly mobile deployment [22:15:40] Logged the message, Master [22:15:41] !log dzahn gracefulled all apaches [22:15:43] Logged the message, Master [22:16:36] !log gracefulling eqiad Apaches to push redirect for wikivoyage.eu [22:16:38] Logged the message, Master [22:20:52] paravoid, thanks for the info [22:23:20] PROBLEM - Puppet freshness on db27 is CRITICAL: Puppet has not run in the last 10 hours [22:24:48] jeremyb_: ^ once they transfer wikivoyage.eu it's done now [22:31:12] !log nagios is dead! spence.wikimedia.org if you want to access it directly [22:31:13] Logged the message, Mistress of the network gear. [22:31:18] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [22:31:20] Logged the message, Master [22:31:45] LeslieCarr: wooot [22:32:04] :) [22:32:06] so happy [22:32:12] LeslieCarr, forever? [22:32:18] forever... [22:32:33] * MaxSem performs some grave-dancing [22:33:27] New patchset: coren; "Add ssh_hba variable to turn on HBA for sshd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50913 [22:33:53] RECOVERY - Solr on vanadium is OK: All OK [22:34:57] Change abandoned: coren; "Erroneously sent the wrong changeset a second time" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50913 [22:35:25] !log installing package upgrades on fenari [22:35:27] Logged the message, Master [22:35:53] New patchset: MaxSem; "Sync Icinga's check_http_mobile with the late Nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51046 [22:36:03] jeremyb_, LeslieCarr ^^^ [22:37:41] thanks MaxSem [22:37:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51046 [22:40:05] New patchset: Lcarr; "making nagios no longer page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51047 [22:41:23] mutante: yeah, i figured :) (read the diff) [22:42:08] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:46:20] New patchset: Dzahn; "add class nrpe to node gallium for jenkins/zuul monitoring via nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51048 [22:47:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51048 [22:50:16] LeslieCarr: did you find the nrpe neon bug? [22:50:23] hashar: ^ remember this being removed at some point? readding it because it's needed for the jenkins/zuul process monitoring [22:50:46] mutante: no idea sorry :( [22:51:10] hashar: no worries, should fix the monitoring (now on Icinga) [22:51:17] mutante: I think I used something like nagios_monitor_service( process => zuul ); but that is it [22:51:25] the strange part is why it did not show as broken on spence [22:51:35] it could not work [22:51:46] and now we don't need to wonder about spence anymore :p [22:51:50] why do you have to include nrpe at the site.pp level anyway ? [22:51:56] paravoid: didn't find any bugs :( [22:51:59] shouldn't the nagios_monitoring define takes care of it already ? [22:52:04] restarting made it happy [22:52:18] it dies all the time. [22:52:25] :-/ [22:52:46] hashar: hmm. maybe. but it's not needed on all hosts, just where we want to execute plugins locally [22:52:46] I don't understand why we're switching to neon when we still have known problems that spence doesn't have, that's what I've been saying from yesterday [22:52:47] honestly, i'm not sure where to start looking, any thoughts? 
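On the recurring "LVS HTTP IPv4 on m.wikimedia.org ... pattern not found" alerts and the "Sync Icinga's check_http_mobile with the late Nagios" change: check_http can be told to require a string in the response body (-s), so the check goes CRITICAL on a perfectly healthy 200 response whenever the page no longer contains the expected text, which matches jeremyb_'s guess that the page output changed but the check did not. A rough sketch of what such a command definition looks like; the host header, URI and pattern below are placeholders, not the real check_http_mobile arguments:

    define command {
        command_name    check_http_mobile
        command_line    $USER1$/check_http -I $HOSTADDRESS$ -H en.m.wikipedia.org -u / -s 'Random article'
    }

Whenever the mobile frontend's HTML changes, the -s pattern has to be kept in step with what the page actually serves, or the content check keeps alerting even though the service is up.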
[22:53:23] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:54:44] hashar: works again [22:54:46] root@neon:~# /usr/lib/nagios/plugins/check_nrpe -H gallium.wikimedia.org -c check_jenkins [22:54:49] PROCS OK: 3 processes with args 'jenkins' [22:56:48] New patchset: coren; "(Preliminary) gridengine module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51051 [22:57:13] paravoid: do you have time to look at it with me ? [22:57:53] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:58:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:58:46] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:58:46] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:59:14] I just got the whine that nagios is gone [22:59:23] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [22:59:37] (watchmouse) [22:59:39] hrmm now there is just no checkcommand for zuul.. [23:00:06] mutante: it is just about making sure the 'zuul-server' is around [23:00:29] monitor_service { "zuul": description => "zuul_service_running", check_command => "check_procs_generic!1!3!1!20!zuul-server" } [23:00:35] from manifests/zuul.ppp [23:00:39] mutante: ^^ [23:01:28] so yeah [23:01:43] I guess whenever using monitor_service, that should include the "nrpe" puppet class [23:02:33] hashar: actually, only if you are using a check command that needs to be executed on the remove server (vs. a check that just connects from monitoring host to the monitored host) [23:03:09] ohh [23:03:29] hashar: ok, so check_procs_generic is just check_procs.. how is it being executed on gallium [23:03:43] that would check locally on spence or neon [23:03:51] I am such a noob [23:03:52] ahah [23:03:54] that's why we need to do it via NRPE [23:04:03] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15136 bytes in 0.003 second response time [23:04:04] and add a command to checkcommands [23:04:13] i'll add that.. hold on [23:04:16] so probably nagios has been complaining about zuul not running on spence/neon ? [23:04:20] yes:) [23:04:30] i think so [23:04:48] it doesnt explain why spence said it was running [23:04:50] unless the monitor_service() wrapper make it so it is always executed via nrp [23:04:51] e [23:04:52] while neon did not [23:05:15] if you look at ./puppet/templates/icinga/checkcommands.cfg.erb [23:05:32] you can see command_name check_procs_generic [23:05:41] command_line $USER1$/check_procs -w $ARG1$:$ARG2$ -c $ARG3$:$ARG4$ -a $ARG5$ [23:05:50] New patchset: Lcarr; "fixing robh's username to lowercase" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51054 [23:06:17] so that is just using check_procs locally [23:06:56] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51047 [23:06:56] while an NRPE check looks like this: nrpe_check_jenkins command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c check_jenkins [23:07:14] mutante: which room are you in ? [23:07:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51054 [23:07:34] hashar: R37 [23:12:21] New patchset: Pyoungmeister; "moving mysql nagios group resources to coredb.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51056 [23:12:43] Change restored: coren; "I think gerrit hates me*. 
First it tells me I send the same change twice, then it turns out I didn'..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50913 [23:13:20] There is now a level 0 [23:13:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51056 [23:13:57] New patchset: Dzahn; "add NRPE check commands for zuul-server process monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51058 [23:14:08] hashar: ^ [23:14:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51058 [23:17:08] New patchset: Pyoungmeister; "removing a whole bunch of unused db code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51059 [23:17:38] New patchset: Dzahn; "move ./templates/nagios/nrpe_local_icinga.cfg.erb to ./templates/icinga/nrpe_local.cfg.erb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51060 [23:18:34] New patchset: Dzahn; "move ./templates/nagios/nrpe_local_icinga.cfg.erb to ./templates/icinga/nrpe_local.cfg.erb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51060 [23:19:13] New review: Dzahn; "clean up Icinga template location" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/51060 [23:20:04] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [23:20:10] are we overloading jenkins now :p [23:20:50] New patchset: Lcarr; "switching icinga with https by default to not log you in" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51063 [23:20:56] Ryan_Lane: can you check this out ? ^^ [23:21:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51060 [23:21:22] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [23:22:31] New review: Ryan Lane; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/51063 [23:22:32] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [23:23:53] New patchset: Lcarr; "switching icinga with https by default to not log you in" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51063 [23:24:05] Ryan_Lane: look better? 
^^ [23:24:30] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [23:24:34] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [23:25:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51063 [23:25:44] New patchset: Dzahn; "fix jenkins and zuul process monitoring (turn into NRPE checks)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51064 [23:27:18] New patchset: Dzahn; "fix jenkins and zuul process monitoring (turn into NRPE checks)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51064 [23:27:48] !log maxsem synchronized php-1.21wmf10/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/51053 https://gerrit.wikimedia.org/r/51057' [23:27:50] Logged the message, Master [23:28:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51064 [23:29:18] !log maxsem synchronized php-1.21wmf9/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/51053 https://gerrit.wikimedia.org/r/51057' [23:29:20] Logged the message, Master [23:33:12] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Tue Feb 26 23:33:09 UTC 2013 [23:33:42] RECOVERY - Puppet freshness on db27 is OK: puppet ran at Tue Feb 26 23:33:41 UTC 2013 [23:33:43] RECOVERY - MySQL disk space on neon is OK: DISK OK [23:34:30] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [23:34:40] RECOVERY - MySQL disk space on neon is OK: DISK OK [23:35:22] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [23:35:48] New patchset: Lcarr; "nrpe-server is taking too long to stop and puppet times out - restart should hopefully work better" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51067 [23:37:02] paravoid: i found out why nrpe is dying - it is taking too long to stop, and so it's not restarting since puppet does a stop/start -- hasrestart should fix this [23:37:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51059 [23:39:39] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 14 seconds [23:39:40] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [23:39:50] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [23:40:15] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [23:41:36] Oh god - they're duplicating [23:44:10] Damianz: yeah mutante / LeslieCarr are working on nagios/icinga right now [23:44:39] hashar: :D At least the bots aren't taking over [23:46:10] but that shouldn't be causing anything false [23:46:20] oh hehe the duplicating [23:46:34] New patchset: Dzahn; "add check_zuul NRPE command also to Nagios nrpe_local.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51068 [23:46:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51067
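Two of the fixes that land in this stretch are worth spelling out. First, the zuul/jenkins monitoring: a check command built on check_procs runs on the monitoring server itself, so a process check for zuul-server defined that way could only ever inspect spence or neon; checking the process on gallium means going through NRPE, with the plugin invocation defined on the monitored host and the monitoring server only calling check_nrpe. The sketch below mirrors the command_line quoted from checkcommands.cfg.erb and the 1:3 / 1:20 thresholds from the monitor_service call above, but the exact contents of changes 51058/51064/51068 may differ in detail:

    # Monitoring server, local check -- wrong for a process on another host:
    define command {
        command_name    check_procs_generic
        command_line    $USER1$/check_procs -w $ARG1$:$ARG2$ -c $ARG3$:$ARG4$ -a $ARG5$
    }

    # Monitoring server, delegating to the remote NRPE daemon instead:
    define command {
        command_name    nrpe_check_zuul
        command_line    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c check_zuul
    }

    # gallium, NRPE daemon config (e.g. nrpe_local.cfg): the command that
    # actually looks at the local process table.
    command[check_zuul]=/usr/lib/nagios/plugins/check_procs -w 1:3 -c 1:20 -a 'zuul-server'

Second, the nrpe-server flapping on neon: the diagnosis above is that Puppet's default way of restarting a service is stop-then-start, and the stop takes long enough that the run times out with the daemon left down. Puppet's service resource has a hasrestart parameter for exactly this case; a minimal sketch of the shape of that fix, with the resource title assumed:

    service { 'nagios-nrpe-server':
        ensure     => running,
        enable     => true,
        # use the init script's own restart action rather than stop + start
        hasrestart => true,
    }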