[00:01:49] New patchset: Asher; "labsdb: set innodb_locks_unsafe_for_binlog for s4-5, set high slave_transaction_retries for all shards" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68602
[00:02:23] Sorry for the deployment spam, folks. Complicated deployment. Thanks very much anomie|away, greg-g, RoanKattouw & marktraceur.
[00:02:47] Is that the end of the lightning?
[00:03:00] ...30 minutes late? :)
[00:03:16] Yeah. I was bad.
[00:03:36] *nod* K
[00:07:54] RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Fri Jun 14 00:07:48 UTC 2013
[00:07:54] RECOVERY - Puppet freshness on labstore1 is OK: puppet ran at Fri Jun 14 00:07:48 UTC 2013
[00:07:54] RECOVERY - Puppet freshness on mw1129 is OK: puppet ran at Fri Jun 14 00:07:48 UTC 2013
[00:07:54] RECOVERY - Puppet freshness on db1027 is OK: puppet ran at Fri Jun 14 00:07:49 UTC 2013
[00:07:54] RECOVERY - Puppet freshness on ms-fe1002 is OK: puppet ran at Fri Jun 14 00:07:49 UTC 2013
[00:07:55] RECOVERY - Puppet freshness on ms6 is OK: puppet ran at Fri Jun 14 00:07:49 UTC 2013
[00:07:55] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Fri Jun 14 00:07:50 UTC 2013
[00:07:56] RECOVERY - Puppet freshness on mw1047 is OK: puppet ran at Fri Jun 14 00:07:52 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on mc11 is OK: puppet ran at Fri Jun 14 00:07:52 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on db69 is OK: puppet ran at Fri Jun 14 00:07:53 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on grosley is OK: puppet ran at Fri Jun 14 00:07:54 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on labstore2 is OK: puppet ran at Fri Jun 14 00:07:54 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on capella is OK: puppet ran at Fri Jun 14 00:07:54 UTC 2013
[00:08:04] RECOVERY - Puppet freshness on ms-be12 is OK: puppet ran at Fri Jun 14 00:07:54 UTC 2013
[00:08:04] RECOVERY - Puppet freshness on ssl1 is OK: puppet ran at Fri Jun 14 00:07:56 UTC 2013
[00:08:05] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Fri Jun 14 00:07:57 UTC 2013
[00:08:05] RECOVERY - Puppet freshness on solr1001 is OK: puppet ran at Fri Jun 14 00:07:57 UTC 2013
[00:08:06] RECOVERY - Puppet freshness on db67 is OK: puppet ran at Fri Jun 14 00:07:58 UTC 2013
[00:08:06] RECOVERY - Puppet freshness on db1005 is OK: puppet ran at Fri Jun 14 00:07:58 UTC 2013
[00:08:07] RECOVERY - Puppet freshness on mw55 is OK: puppet ran at Fri Jun 14 00:07:59 UTC 2013
[00:08:07] RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Fri Jun 14 00:07:59 UTC 2013
[00:08:08] RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Fri Jun 14 00:07:59 UTC 2013
[00:08:08] RECOVERY - Puppet freshness on analytics1002 is OK: puppet ran at Fri Jun 14 00:08:00 UTC 2013
[00:08:09] RECOVERY - Puppet freshness on sanger is OK: puppet ran at Fri Jun 14 00:08:01 UTC 2013
[00:08:09] RECOVERY - Puppet freshness on mw1181 is OK: puppet ran at Fri Jun 14 00:08:02 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on cp1016 is OK: puppet ran at Fri Jun 14 00:08:02 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on mw1179 is OK: puppet ran at Fri Jun 14 00:08:03 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on palladium is OK: puppet ran at Fri Jun 14 00:08:03 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on mc12 is OK: puppet ran at Fri Jun 14 00:08:04 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on mw1165 is OK: puppet ran at Fri Jun 14 00:08:04 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on mw65 is OK: puppet ran at Fri Jun 14 00:09:13 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on analytics1018 is OK: puppet ran at Fri Jun 14 00:09:13 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on wtp1008 is OK: puppet ran at Fri Jun 14 00:09:13 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on mw1121 is OK: puppet ran at Fri Jun 14 00:09:13 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on mw1141 is OK: puppet ran at Fri Jun 14 00:09:14 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on search1004 is OK: puppet ran at Fri Jun 14 00:09:14 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on amssq50 is OK: puppet ran at Fri Jun 14 00:09:15 UTC 2013
[00:09:24] RECOVERY - Puppet freshness on sq77 is OK: puppet ran at Fri Jun 14 00:09:17 UTC 2013
[00:09:24] RECOVERY - Puppet freshness on sq49 is OK: puppet ran at Fri Jun 14 00:09:17 UTC 2013
[00:09:25] RECOVERY - Puppet freshness on db1056 is OK: puppet ran at Fri Jun 14 00:09:17 UTC 2013
[00:09:25] RECOVERY - Puppet freshness on mw1090 is OK: puppet ran at Fri Jun 14 00:09:17 UTC 2013
[00:09:26] RECOVERY - Puppet freshness on es2 is OK: puppet ran at Fri Jun 14 00:09:18 UTC 2013
[00:09:26] RECOVERY - Puppet freshness on srv268 is OK: puppet ran at Fri Jun 14 00:09:18 UTC 2013
[00:09:27] New patchset: Asher; "labsdb: set innodb_locks_unsafe_for_binlog for s4-5, set high slave_transaction_retries for all shards" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68602
[00:09:27] RECOVERY - Puppet freshness on ms1004 is OK: puppet ran at Fri Jun 14 00:09:18 UTC 2013
[00:09:27] RECOVERY - Puppet freshness on cp1001 is OK: puppet ran at Fri Jun 14 00:09:19 UTC 2013
[00:09:28] RECOVERY - Puppet freshness on mc13 is OK: puppet ran at Fri Jun 14 00:09:19 UTC 2013
[00:09:28] RECOVERY - Puppet freshness on srv298 is OK: puppet ran at Fri Jun 14 00:09:19 UTC 2013
[00:09:29] RECOVERY - Puppet freshness on lvs4 is OK: puppet ran at Fri Jun 14 00:09:20 UTC 2013
[00:09:29] RECOVERY - Puppet freshness on virt1005 is OK: puppet ran at Fri Jun 14 00:09:21 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on mw91 is OK: puppet ran at Fri Jun 14 00:09:22 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on mw92 is OK: puppet ran at Fri Jun 14 00:09:23 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on ms-be1010 is OK: puppet ran at Fri Jun 14 00:09:23 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on cp3019 is OK: puppet ran at Fri Jun 14 00:09:23 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on harmon is OK: puppet ran at Fri Jun 14 00:09:24 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on mc1002 is OK: puppet ran at Fri Jun 14 00:09:24 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on virt8 is OK: puppet ran at Fri Jun 14 00:09:25 UTC 2013
[00:09:34] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Fri Jun 14 00:09:28 UTC 2013
[00:09:34] RECOVERY - Puppet freshness on mw1134 is OK: puppet ran at Fri Jun 14 00:09:29 UTC 2013
[00:09:35] RECOVERY - Puppet freshness on srv243 is OK: puppet ran at Fri Jun 14 00:09:29 UTC 2013
[00:09:35] RECOVERY - Puppet freshness on cp3021 is OK: puppet ran at Fri Jun 14 00:09:30 UTC 2013
[00:09:36] RECOVERY - Puppet freshness on mw7 is OK: puppet ran at Fri Jun 14 00:09:32 UTC 2013
[00:09:36] RECOVERY - Puppet freshness on mw3 is OK: puppet ran at Fri Jun 14 00:09:32 UTC 2013
[00:09:37] RECOVERY - Puppet freshness on search23 is OK: puppet ran at Fri Jun 14 00:09:32 UTC 2013
[00:09:38] To make that lightning deploy a bit worse, I am now going to scap
[00:09:43] RECOVERY - Puppet freshness on srv251 is OK: puppet ran at Fri Jun 14 00:09:32 UTC 2013
[00:09:43] RECOVERY - Puppet freshness on mw25 is OK: puppet ran at Fri Jun 14 00:09:33 UTC 2013
[00:09:43] RECOVERY - Puppet freshness on mw1199 is OK: puppet ran at Fri Jun 14 00:09:33 UTC 2013
[00:09:43] RECOVERY - Puppet freshness on srv293 is OK: puppet ran at Fri Jun 14 00:09:34 UTC 2013
[00:09:43] RECOVERY - Puppet freshness on mw1120 is OK: puppet ran at Fri Jun 14 00:09:34 UTC 2013
[00:09:48] !log Scapping to fix up VE deployment
[00:09:57] Logged the message, Mr. Obvious
[00:10:23] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Fri Jun 14 00:10:12 UTC 2013
[00:10:23] RECOVERY - Puppet freshness on srv257 is OK: puppet ran at Fri Jun 14 00:10:13 UTC 2013
[00:10:23] RECOVERY - Puppet freshness on sq53 is OK: puppet ran at Fri Jun 14 00:10:13 UTC 2013
[00:10:23] RECOVERY - Puppet freshness on mc1010 is OK: puppet ran at Fri Jun 14 00:10:14 UTC 2013
[00:10:23] RECOVERY - Puppet freshness on es3 is OK: puppet ran at Fri Jun 14 00:10:14 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on mw1064 is OK: puppet ran at Fri Jun 14 00:11:13 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on mw19 is OK: puppet ran at Fri Jun 14 00:11:13 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on mw31 is OK: puppet ran at Fri Jun 14 00:11:12 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on es1001 is OK: puppet ran at Fri Jun 14 00:11:13 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on cp1014 is OK: puppet ran at Fri Jun 14 00:11:13 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on analytics1005 is OK: puppet ran at Fri Jun 14 00:11:22 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on snapshot1003 is OK: puppet ran at Fri Jun 14 00:11:22 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on mw1187 is OK: puppet ran at Fri Jun 14 00:11:23 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on cp3005 is OK: puppet ran at Fri Jun 14 00:11:23 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on mw105 is OK: puppet ran at Fri Jun 14 00:11:23 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on search28 is OK: puppet ran at Fri Jun 14 00:11:23 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on mw94 is OK: puppet ran at Fri Jun 14 00:11:24 UTC 2013
[00:11:34] RECOVERY - Puppet freshness on mw101 is OK: puppet ran at Fri Jun 14 00:11:24 UTC 2013
[00:11:35] RECOVERY - Puppet freshness on mw54 is OK: puppet ran at Fri Jun 14 00:11:24 UTC 2013
[00:11:35] RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Fri Jun 14 00:11:25 UTC 2013
[00:11:35] RECOVERY - Puppet freshness on mw1037 is OK: puppet ran at Fri Jun 14 00:11:25 UTC 2013
[00:11:36] RECOVERY - Puppet freshness on mw1113 is OK: puppet ran at Fri Jun 14 00:11:25 UTC 2013
[00:11:36] RECOVERY - Puppet freshness on mw29 is OK: puppet ran at Fri Jun 14 00:11:25 UTC 2013
[00:11:37] RECOVERY - Puppet freshness on analytics1025 is OK: puppet ran at Fri Jun 14 00:11:27 UTC 2013
[00:11:37] RECOVERY - Puppet freshness on strontium is OK: puppet ran at Fri Jun 14 00:11:28 UTC 2013
[00:11:38] RECOVERY - Puppet freshness on srv295 is OK: puppet ran at Fri Jun 14 00:11:29 UTC 2013
[00:11:38] RECOVERY - Puppet freshness on mw1077 is OK: puppet ran at Fri Jun 14 00:11:30 UTC 2013
[00:11:39] RECOVERY - Puppet freshness on pc2 is OK: puppet ran at Fri Jun 14 00:11:30 UTC 2013
[00:11:39] RECOVERY - Puppet freshness on mw1160 is OK: puppet ran at Fri Jun 14 00:11:31 UTC 2013
[00:11:40] RECOVERY - Puppet freshness on mw108 is OK: puppet ran at Fri Jun 14 00:11:32 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on sq67 is OK: puppet ran at Fri Jun 14 00:11:33 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on mw103 is OK: puppet ran at Fri Jun 14 00:11:33 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on mc1015 is OK: puppet ran at Fri Jun 14 00:11:33 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on mw1207 is OK: puppet ran at Fri Jun 14 00:11:33 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on mc1003 is OK: puppet ran at Fri Jun 14 00:11:34 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on es1 is OK: puppet ran at Fri Jun 14 00:11:42 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Fri Jun 14 00:11:42 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on mw1018 is OK: puppet ran at Fri Jun 14 00:11:43 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on cp1007 is OK: puppet ran at Fri Jun 14 00:11:43 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on labsdb1002 is OK: puppet ran at Fri Jun 14 00:11:44 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on labsdb1003 is OK: puppet ran at Fri Jun 14 00:11:44 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Fri Jun 14 00:11:44 UTC 2013
[00:11:55] RECOVERY - Puppet freshness on db1054 is OK: puppet ran at Fri Jun 14 00:11:45 UTC 2013
[00:11:55] RECOVERY - Puppet freshness on cerium is OK: puppet ran at Fri Jun 14 00:11:45 UTC 2013
[00:11:56] RECOVERY - Puppet freshness on locke is OK: puppet ran at Fri Jun 14 00:11:45 UTC 2013
[00:11:56] RECOVERY - Puppet freshness on mw122 is OK: puppet ran at Fri Jun 14 00:11:46 UTC 2013
[00:11:57] RECOVERY - Puppet freshness on srv238 is OK: puppet ran at Fri Jun 14 00:11:48 UTC 2013
[00:11:57] RECOVERY - Puppet freshness on cp3007 is OK: puppet ran at Fri Jun 14 00:11:49 UTC 2013
[00:11:58] RECOVERY - Puppet freshness on mw34 is OK: puppet ran at Fri Jun 14 00:11:49 UTC 2013
[00:11:59] RECOVERY - Puppet freshness on mw20 is OK: puppet ran at Fri Jun 14 00:11:50 UTC 2013
[00:11:59] RECOVERY - Puppet freshness on mw66 is OK: puppet ran at Fri Jun 14 00:11:50 UTC 2013
[00:11:59] RECOVERY - Puppet freshness on srv297 is OK: puppet ran at Fri Jun 14 00:11:50 UTC 2013
[00:12:00] RECOVERY - Puppet freshness on mw1017 is OK: puppet ran at Fri Jun 14 00:11:52 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1015 is OK: puppet ran at Fri Jun 14 00:11:53 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1065 is OK: puppet ran at Fri Jun 14 00:11:53 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1042 is OK: puppet ran at Fri Jun 14 00:11:53 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on search1018 is OK: puppet ran at Fri Jun 14 00:11:53 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1144 is OK: puppet ran at Fri Jun 14 00:11:54 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1052 is OK: puppet ran at Fri Jun 14 00:11:54 UTC 2013
[00:12:04] RECOVERY - Puppet freshness on mw1101 is OK: puppet ran at Fri Jun 14 00:11:55 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on mw100 is OK: puppet ran at Fri Jun 14 00:12:12 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on sq65 is OK: puppet ran at Fri Jun 14 00:12:13 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on search14 is OK: puppet ran at Fri Jun 14 00:12:13 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on mw1162 is OK: puppet ran at Fri Jun 14 00:12:14 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on search1014 is OK: puppet ran at Fri Jun 14 00:12:14 UTC 2013
[00:12:24] RECOVERY - Puppet freshness on lvs1 is OK: puppet ran at Fri Jun 14 00:12:15 UTC 2013
[00:12:24] RECOVERY - Puppet freshness on mw46 is OK: puppet ran at Fri Jun 14 00:12:15 UTC 2013
[00:12:25] RECOVERY - Puppet freshness on ms5 is OK: puppet ran at Fri Jun 14 00:12:16 UTC 2013
[00:12:25] RECOVERY - Puppet freshness on cp3010 is OK: puppet ran at Fri Jun 14 00:12:17 UTC 2013
[00:12:26] RECOVERY - Puppet freshness on pc1002 is OK: puppet ran at Fri Jun 14 00:12:17 UTC 2013
[00:12:26] RECOVERY - Puppet freshness on mw1044 is OK: puppet ran at Fri Jun 14 00:12:17 UTC 2013
[00:12:27] RECOVERY - Puppet freshness on search21 is OK: puppet ran at Fri Jun 14 00:12:18 UTC 2013
[00:12:27] RECOVERY - Puppet freshness on zirconium is OK: puppet ran at Fri Jun 14 00:12:18 UTC 2013
[00:12:28] RECOVERY - Puppet freshness on lvs5 is OK: puppet ran at Fri Jun 14 00:12:18 UTC 2013
[00:12:28] RECOVERY - Puppet freshness on db1048 is OK: puppet ran at Fri Jun 14 00:12:19 UTC 2013
[00:12:33] RECOVERY - Puppet freshness on sq73 is OK: puppet ran at Fri Jun 14 00:12:23 UTC 2013
[00:12:43] RECOVERY - Puppet freshness on mw5 is OK: puppet ran at Fri Jun 14 00:12:36 UTC 2013
[00:12:44] RECOVERY - Puppet freshness on mw41 is OK: puppet ran at Fri Jun 14 00:12:38 UTC 2013
[00:12:54] RECOVERY - Puppet freshness on srv294 is OK: puppet ran at Fri Jun 14 00:12:50 UTC 2013
[00:13:40] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68602
[00:14:03] RECOVERY - Puppet freshness on sodium is OK: puppet ran at Fri Jun 14 00:13:57 UTC 2013
[00:14:03] RECOVERY - Puppet freshness on snapshot1002 is OK: puppet ran at Fri Jun 14 00:14:01 UTC 2013
[00:19:23] !log catrope Started syncing Wikimedia installation... : Fixup for VE deployment
[00:19:26] Logged the message, Master
[00:21:33] RECOVERY - Puppet freshness on mw1135 is OK: puppet ran at Fri Jun 14 00:21:23 UTC 2013
[00:28:03] !log catrope Finished syncing Wikimedia installation... : Fixup for VE deployment
[00:28:11] Logged the message, Master
[00:39:40] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused
[00:44:30] New patchset: Reedy; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713
[00:44:40] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[00:47:19] New patchset: Reedy; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713
[00:47:26] did anybody restart Parsoid on wtp1015?
[00:47:51] I did not
[00:48:07] RoanKattouw: is anything automatically restarting Parsoid?
[00:48:25] No
[00:49:28] the node process on wtp1015 was restarted at 0:44 UTC
[00:50:00] just before icinga listed it as up again
[00:50:21] I wonder if there is anything in syslog
[00:52:18] RoanKattouw, Ryan_Lane: could you have a look at syslog on wtp1015 around 0:44 UTC?
[00:52:22] there was a puppet run at the time
[00:52:32] aha
[00:52:39] Looking
[00:52:47] well, that shouldn't have caused a problem
[00:52:52] and that restarts Parsoid?
[00:52:52] I'm running puppet now
[00:53:11] Jun 14 00:44:21 wtp1015 puppet-agent[12807]: (/Stage[main]/Misc::Parsoid/Service[parsoid]/ensure) ensure changed 'stopped' to 'running'
[00:53:17] For some reason puppet believed the service wasn't up
[00:53:24] no
[00:53:27] Jun 14 00:44:21 wtp1015 puppet-agent[12807]: (/Stage[main]/Misc::Parsoid/Service[parsoid]/ensure) ensure changed 'stopped' to 'running'
[00:53:30] the DOWN came 5 minutes earlier
[00:53:40] yep
[00:53:48] I'm thinking the puppet run fixed it
[00:54:04] the question is what stopped it :)
[00:54:08] Nothing in syslog for that
[00:54:13] I see, so there is some kind of monitoring in puppet
[00:54:21] Not really
[00:54:25] gwicke: well, puppet has ensure => running
[00:54:28] We just got lucky that puppet ran 5 mins after the process died
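Editor's note: the syslog line above implies a Puppet service resource of roughly this shape (a sketch inferred from the /Stage[main]/Misc::Parsoid/Service[parsoid] resource path, not the exact manifest in operations/puppet):

    class misc::parsoid {
        # on every agent run, puppet starts the service again if it finds it
        # stopped -- which is what happened on wtp1015 at 00:44:21 above
        service { 'parsoid':
            ensure => running,
            enable => true,
        }
    }
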
[00:54:41] root 13496 0.0 0.0 36092 1568 ? S 00:44 0:00 sudo -E -u parsoid nohup node /var/lib/parsoid/Parsoid/js/api/server.js
[00:54:44] uugh
[00:54:57] Yeah it's a terrible mess
[00:54:58] paravoid: yeah. I've complained about this already ;)
[00:55:02] I take full blame
[00:55:19] And I'm hoping someone in ops is willing to write a proper init script at some point :)
[00:55:26] I don't care about blame, I care about fixes :)
[00:55:29] As I'm not good at that (evidently)
[00:55:51] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused
[00:55:55] hah
[00:55:56] I'm not sure why an upstart for this couldn't just run node /var/lib/parsoid/Parsoid/js/api/server.js
[00:56:21] so, no node on wtp1004
[00:56:25] you'd need to put the forking (or non-forking) count in the upstart
[00:56:36] nothing on syslog either
[00:56:54] Oh, upstart handles this nicely?
[00:57:05] yes
[00:57:12] * RoanKattouw has never worked with upstart and couldn't find an example offhand when he wrote this init script
[00:57:16] I think parsoid or node might just die
[00:57:25] with no logs whatsoever
[00:57:50] We know we have issues with child processes not being respawned
[00:58:00] The children die every now and then, but usually leave an exception/error message in the log
[00:58:18] copied the log off 1004
[00:58:21] does it log somewhere else than syslog?
[00:58:24] However, the logs get wiped out every time the process restarts, so we don't know why they die
[00:58:28] paravoid: nohup.out
[00:58:31] * RoanKattouw hides quickly
[00:58:34] haha
[00:58:35] * RoanKattouw is embarrassed
[00:58:46] upstart is generally simple, assuming your application forks (or doesn't fork) consistently
[00:58:48] (/var/lib/parsoid/nohup.out is our "log")
[00:59:20] Ryan_Lane: So what we start is a master process that forks a bunch of child processes (15 of them), and reforks them when they die (sometimes it doesn't respawn them, we don't know why yet)
[00:59:31] there are some backtraces there
[00:59:38] are these relevant?
[01:01:42] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.01261508465 secs
[01:01:45] They can be
[01:01:51] But they could be in the child processes
[01:02:03] paravoid: I am thinking that the http error is relevant to the master
[01:02:10] or could be at least
[01:03:11] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.004909038544 secs
[01:10:23] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time
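Editor's note: a minimal upstart job of the kind Ryan_Lane suggests above might look as follows. The file name, log path and setuid stanza are assumptions, and the master would still be responsible for respawning its 15 workers itself:

    # /etc/init/parsoid.conf -- hypothetical sketch, not a deployed config
    description "Parsoid HTTP service"

    start on runlevel [2345]
    stop on runlevel [!2345]

    # upstart re-spawns the master when it dies, replacing the nohup hack
    respawn
    respawn limit 10 5

    setuid parsoid                 # requires upstart >= 1.4
    chdir /var/lib/parsoid

    exec /usr/bin/node Parsoid/js/api/server.js >> /var/lib/parsoid/parsoid.log 2>&1

Appending to a persistent log file would also address the complaint above that nohup.out gets wiped out on every restart.
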
[01:24:55] !log starting swift->ceph scripts on a screen on ms-fe1002
[01:25:14] Logged the message, Master
[01:30:21] paravoid: ?
[01:30:28] swiftrepl
[01:30:30] thumbs
[01:30:44] ok
[01:30:44] I'm running them periodically to keep the delta small
[01:30:57] so can I run the copy scripts I've been wanting to?
[01:31:01] oh yeah sure
[01:31:05] * AaronSchulz was waiting for things to be more stable
[01:31:17] it is stable, until it breaks again :P
[01:31:35] well as long as it is not broken *right now* :)
[01:31:45] more seriously, we have one serious bug right now and that manifests when we restart osds or machines etc.
[01:32:22] or when the network decides to split brain everything, like the other day :)
[01:32:54] so it works but it's just nasty to have this timebomb in production
[01:35:42] AaronSchulz: so I'm syncing thumbs, timeline, transcoded, math & score
[01:35:52] are any of these journaled?
[01:36:12] oh and captcha I guess
[01:36:17] I'm doing global-* basically
[01:38:25] does "syncing" include deleting?
[01:38:29] yes
[01:38:43] ok
[01:38:45] deletes, new files, etag mismatches
[01:39:01] just don't do that for originals :)
[01:39:13] I'm not :)
[01:39:30] I modified an existing script to handle that later
[01:39:46] to handle what?
[01:39:48] well at least those initiated from action=delete
[01:39:57] where they were out of sync for some reason
[01:40:38] so there are like 4 scripts I'd need to run (the first is not needed by luck, no updated files were out of sync)
[01:40:48] okay
[01:40:55] * AaronSchulz will start on that tomorrow
[01:41:18] copy script in both directions, sync script from the time I did failover, and purgeDeletedFiles.php (in that order) :)
[01:41:30] RoanKattouw: having upstart manage N processes for some job isn't too hard. Do the node workers share any kind of state, or is it trivial to make them independent processes?
[01:41:40] * AaronSchulz hates all this multiwrite stuff
[01:41:41] both directions?
[01:41:54] Ryan_Lane: did you e-mail the folks in the new group you created? (I can if you haven't already.)
[01:41:54] I know...
[01:42:04] ori-l: There is a master process listening on port 80 that dispatches tasks to the workers. Not sure about shared state, ask gwicke_away
[01:42:18] most of this is only for the few files already out-of-sync before the failover
[01:42:21] fun fun fun
[01:42:39] ori-l: I didn't
[01:42:45] Ryan_Lane: K, will do then.
[01:42:46] ouch
[01:42:51] ori-l: awesome. thanks!
[01:42:56] hopefully won't take more than a week though
[01:43:03] probably not
[01:43:06] having had to deal with this crap so long things have been optimized :)
[01:43:14] the thumb ones take a day or two
[01:43:22] for a week's delta or so
[01:43:39] thumbs are smaller of course, but they're also usually more
[01:44:28] so supposedly dumpling, i.e. the August release, will have georeplication
[01:44:35] (and of course that slow peering fix)
[01:45:00] but my guess based on their track record is that it won't be ready for consumption until at least their next release
[01:45:06] which is Aug+3 months
[01:45:15] = November
[01:45:49] New patchset: Cmjohnson; "adding cp1045-55 to netboot and dhcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68611
[01:45:59] swift is also still working on georepl afaik
[01:46:39] so, we're stuck with multiwrite for a little while longer
[01:47:03] you don't trust ceph by itself? :)
[01:47:24] um
[01:47:26] I guess not
[01:47:42] maybe we should just use the python scripts in the background and have the sync scripts in a tight loop and not use multiwrite?
[01:47:43] paravoid..can you look at my patchset ...i consolidated the netboot cfg for cp10xx and want a 2nd set of eyes plz.
[01:47:54] that might honestly be simpler (and faster)
[01:48:03] one ceph cluster at least
[01:49:11] an alternative would be to journal anything that isn't generated on demand
[01:49:35] i.e. everything but thumbs, unless we can add other categories as autogenerated
[01:50:39] and then ignore the small delta and have something periodically sync that (the python scripts as you say) in a not-so-tight loop
[01:50:49] as long as deletes are propagated it should be fine
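Editor's note: the "sync scripts in a not-so-tight loop" idea, sketched as a shell wrapper; sync_container.py is a placeholder for the existing python sync scripts, whose real invocation is not shown in the log:

    #!/bin/sh
    # hypothetical wrapper: keep the swift->ceph delta small by periodically
    # re-syncing the auto-generated containers; deletes must be propagated,
    # and originals are deliberately excluded
    while true; do
        for c in thumbs timeline transcoded math score captcha; do
            python sync_container.py "global-${c}"   # new files, deletes, etag mismatches
        done
        sleep 3600
    done
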
[01:51:53] cmjohnson1: um, it adds up to 1065
[01:52:01] cmjohnson1: the commit msg says up to 1055
[01:52:18] cp102[1-9]|cp10[3-5][0-9]|cp106[0-5]|...
[01:52:43] that's 1021-1029, 1030-1059, 1060-1065
[01:52:45] paravoid...yeah..i have the additional servers that will be racked tomorrow or monday..since I was making the change i added them...should i remove or change msg
[01:52:52] no that's fine
[01:52:57] as long as you're aware of it
[01:53:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68611
[01:53:19] i am..thx for reviewing
[01:53:41] merged
[01:53:51] oh..great thank you!
[01:54:43] ran puppet on brewster too, so you're ready to go
[01:54:53] and I'm ready to go to bed too, good night/evening :)
[01:55:12] yep...either way...good night
[01:58:02] has anyone been watching memcached-serious.log since twemproxy was deployed?
[01:59:47] there's a lot of "bad key" errors
[02:05:58] !log LocalisationUpdate completed (1.22wmf6) at Fri Jun 14 02:05:57 UTC 2013
[02:06:07] Logged the message, Master
[02:11:02] !log LocalisationUpdate completed (1.22wmf7) at Fri Jun 14 02:11:02 UTC 2013
[02:11:10] Logged the message, Master
[02:11:18] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused
[02:17:44] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jun 14 02:17:44 UTC 2013
[02:17:53] Logged the message, Master
[02:19:35] !log dns update
[02:19:43] Logged the message, Master
[02:24:19] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[02:25:01] TimStarling: twemproxy was deployed yesterday, right? the number of BAD KEYs doesn't seem too unusual. It's at 3,393 at the moment (the log is about twenty hours old). it's been unusually high since late may: https://dpaste.de/60vSu/raw/
[02:37:50] TimStarling: they're also almost all from commonswiki
[02:38:39] 68,849 'BAD KEY' errors in may & june, 63,465 from commonswiki.
[02:47:59] I think User:Fæ on commonswiki has been uploading lots of pictures from some DoD trove that has very long filenames
[02:49:00] TimStarling, pretty sure that's it.
[02:50:49] ori-l I gotta take off, the ganglia errors graph looks fine and the server-side events are kenny Loggin' away
[02:52:51] spagewmf: cool, ciao!
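Editor's note: memcached rejects keys longer than 250 bytes (or containing spaces/control characters), which is how keys derived from very long Commons file names end up as "BAD KEY" errors. A common workaround is to hash over-long keys — a sketch, not what MediaWiki actually does:

    import hashlib

    MAX_MEMCACHED_KEY = 250  # memcached's hard limit on key length, in bytes

    def safe_key(raw_key):
        # memcached keys may not contain whitespace
        key = raw_key.replace(' ', '_')
        if len(key.encode('utf-8')) <= MAX_MEMCACHED_KEY:
            return key
        # fall back to a fixed-length digest for keys built from long titles
        return 'sha1:' + hashlib.sha1(key.encode('utf-8')).hexdigest()
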
[03:13:27] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused
[03:15:47] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused
[03:15:57] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused
[03:22:43] * RoanKattouw looks at those Parsoid boxes
[03:26:47] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time
[03:26:57] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[03:27:28] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[03:30:28] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:31:17] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time
[03:33:07] PROBLEM - Parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:51] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[03:43:41] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:41] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:41] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:41] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:41] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:42] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:42] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:43] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:43] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:44] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:44] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:45] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[04:32:12] !log updating Parsoid dependencies from config repository
[04:32:20] Logged the message, Master
[04:45:06] !log updated Parsoid to 64921430b1
[04:45:16] Logged the message, Master
[05:00:33] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:02:03] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:10:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:11:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time
[05:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.186 second response time
[05:24:46] https://gerrit.wikimedia.org/r/#/c/33713/ : What Tampa host should call that job?
[05:27:20] Ah, the good old hume. :)
[05:30:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:32:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[05:53:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:54:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[06:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[06:40:51] New patchset: Tim Starling; "Sync w at the same time as docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64449
[06:40:59] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64449
[07:14:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:15:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[07:27:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:28:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[08:01:50] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.002644777298 secs
[08:12:10] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[08:17:10] New patchset: Nemo bis; "Add Klaus Graf to the German Planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68621
[08:32:39] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.008009552956 secs
[09:38:38] New review: Akosiaris; "I too think we are almost ok. Just the architecture and the two dependencies and we are good to go." [operations/debs/buck] (master) C: -1; - https://gerrit.wikimedia.org/r/67999
[11:23:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428
[11:48:18] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused
[11:55:42] New patchset: Mark Bergsma; "Move monitor_group statements to the role manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68633
[11:55:42] New patchset: Mark Bergsma; "Remove resources for the obsoleted Perl HTCP purger daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68634
[11:56:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68633
[11:57:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68634
[12:00:38] New patchset: Mark Bergsma; "Remove old dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68635
[12:01:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68635
[12:09:43] New patchset: Mark Bergsma; "Remove unused upstart version of varnishncsa udploggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68637
[12:10:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68637
[12:12:16] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[12:17:07] New patchset: Mark Bergsma; "Remove default instance instantiation, we don't use it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68638
[12:19:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68638
[12:21:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:22:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.280 second response time
[12:27:25] New patchset: Mark Bergsma; "Use quoted tabs instead of tab characters in the format string" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68640
[12:33:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68640
[12:44:50] New patchset: Mark Bergsma; "Remove embedding of classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68643
[12:45:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68643
[12:53:59] New patchset: Mark Bergsma; "Convert tab indent to 4-spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68644
[12:55:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68644
[12:59:57] New patchset: Mark Bergsma; "Rename resources with dashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68646
[13:00:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68646
[13:07:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:10:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[13:12:11] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused
[13:15:11] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time
[13:17:13] New patchset: Nikerabbit; "Narayam and WebFonts were replaced with ULS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68648
[13:17:44] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused
[13:18:54] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused
[13:19:24] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused
[13:23:54] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[13:24:14] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused
[13:32:24] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.038 second response time
[13:34:59] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[13:36:33] pep8 :-P
[13:39:27] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time
[13:40:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:07] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[13:41:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time
[13:43:46] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:48] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:49] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:49] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:49] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:50] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[13:46:15] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[13:47:07] hi MaxSem
[13:47:12] yo
[13:47:45] woke up?
[13:49:06] haha
[13:51:19] MaxSem: funny?
[13:51:43] I'm +2 hours from your TZ
[13:51:57] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:52:02] MaxSem: but you usually appear US time :)
[13:52:16] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time
[13:55:16] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused
[13:57:16] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused
[13:57:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:58:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[14:00:23] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[14:09:19] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[14:09:19] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[14:18:29] New patchset: Odder; "(bug 49575) Set up $wgImportSources for vec.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68655
[14:23:42] New patchset: Diederik; "Simplify regex by removing duplicate language code for 'ar'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68657
[14:24:49] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[14:26:49] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused
[14:28:55] * AzaToth made a popo and might need some help
[14:30:05] Tested blocking a test account for 100 millennia, and can't unblock
[14:30:07] http://en.wikipedia.org/w/index.php?title=Special:Log/block&page=User%3ADeskanaTest
[14:30:18] Error: Block ID DeskanaTest not found. It may have been unblocked already.
[14:30:43] * AzaToth hides in shame
[14:31:49] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time
[14:34:40] Nikerabbit: ping
[14:37:33] Does anyone know if it's possible to have a 'require' in the private puppet repo that refers to a class in the public repo?
[14:39:33] (NEW) blocking for 100000 years doesn't create a block but fills block log - https://bugzilla.wikimedia.org/49580 normal; MediaWiki: User blocking; ()
[14:39:35] ツ
[14:40:20] if the public class or whatever is included/defined/whatever by the time your private repo piece sees it, I don't see why not
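Editor's note: the answer at 14:40:20, spelled out — a private-repo class can require a class from the public repo as long as the public manifests are on the master's modulepath. The names below are made up for illustration:

    # in the private puppet repo
    class private::rt_mail {
        # class defined in the public operations/puppet tree
        require exim::rt

        file { '/etc/exim4/rt-private.conf':
            ensure => present,
            mode   => '0440',
        }
    }
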
[14:42:38] AzaToth: ?
[14:43:13] Nikerabbit: our issue was solved when we realized that blocks had never been made and only the block log had been filled
[14:43:21] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused
[14:43:24] filed a bug for it
[14:43:48] ou
[14:44:09] 3 millennia, 1 century, 6 decades, 8 years, 319 days, 4 hours, 1 minute and 3 seconds (99999999999 seconds) works
[14:44:55] ou
[14:44:59] and what is that in unix timestamp?
[14:46:00] New patchset: Jforrester; "Temporarily disable VisualEditor completely in prod" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68667
[14:46:11] apergos: ^^^ Please +2 and deploy.
[14:46:17] azatoth@azaboxen:~$ date -d "now + 99999999999 seconds" +%s
[14:46:18] 101371221169
[14:46:20] looking
[14:46:30] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused
[14:46:39] ah so you referenced the bug number, I was about to ask, goo
[14:46:40] d
[14:46:54] Nikerabbit: ↑
[14:47:30] New review: ArielGlenn; "temporary emergency measure" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/68667
[14:47:30] Change merged: ArielGlenn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68667
[14:47:55] Nikerabbit: it's Fri Apr 30 02:34:22 CEST 5182 by the way
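Editor's note: the arithmetic above, checked in Python — 99999999999 seconds lands in the year 5182, while 100,000 years overflows a four-digit year, which is presumably why the 100-millennia block was logged but never created (bug 49580):

    from datetime import datetime, timedelta

    expiry = datetime(2013, 6, 14) + timedelta(seconds=99999999999)
    print(expiry)  # lands in the year 5182, as Nikerabbit was told above

    try:
        datetime(2013, 6, 14) + timedelta(days=100000 * 365)
    except OverflowError as err:
        # Python's datetime tops out at year 9999; MediaWiki's 14-digit
        # YYYYMMDDHHMMSS timestamps hit a comparable wall (an assumption here)
        print('out of range:', err)
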
[14:53:12] and we're off
[14:53:31] !log ariel synchronized wmf-config/InitialiseSettings.php 'emergency disabling of Visual Editor, see bug 49577'
[14:53:39] copied...
[14:53:39] Logged the message, Master
[14:53:59] New patchset: Andrew Bogott; "Don't redefine sudo-ldap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68670
[14:57:06] can someone who knows how to check, check if ve is actually disabled?
[15:01:20] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time
[15:01:28] ah nm, I see in the 'usability features' that it doesn't show up
[15:02:43] so *cough* hate to deploy and run but in fact... I have to run. I will be back in about 1 hour, but am reachable via sms/phone/page if something is wrong and can run back pretty fast (~ 10 min)
[15:08:35] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[15:10:19] apergos: No worries at all; we're not getting this fixed in the next hour anyway.
[15:13:16] I'm rolling back the dependency update
[15:13:25] should be fixed after that
[15:13:53] !log working mw1041 updating bios
[15:13:55] !log rolled back config update to investigate issues
[15:14:01] Logged the message, Master
[15:14:09] Logged the message, Master
[15:16:37] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100%
[15:21:45] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[15:25:15] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:38] !log carbon going offline to run h/w test at Dell support request.
[15:25:45] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[15:25:46] Logged the message, Master
[15:28:45] PROBLEM - Host carbon is DOWN: CRITICAL - Host Unreachable (208.80.154.10)
[15:39:09] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[15:41:29] PROBLEM - SSH on carbon is CRITICAL: Connection refused
[15:42:54] gwicke: fyi i am going to be swapping cpu1 with cpu2 on wtp1008
[15:43:11] !log powering down wtp1008 to swap cpu's
[15:43:14] cmjohnson1: ok
[15:43:20] Logged the message, Master
[15:45:30] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:33] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[15:53:09] PROBLEM - NTP on carbon is CRITICAL: NTP CRITICAL: No response from NTP server
[15:55:29] RECOVERY - SSH on carbon is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[15:55:59] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[16:00:22] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[16:01:55] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[16:04:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[16:07:17] New patchset: Mark Bergsma; "Move htcppurger.pp to the right dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68681
[16:07:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68681
[16:10:07] RECOVERY - NTP on carbon is OK: NTP OK: Offset 0.0121307373 secs
[16:12:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68657
[16:12:59] back
[16:15:40] apergos: btw thanks much for helping James_F last night with the parsoid, er, meltdown
[16:15:57] well that was just now actually (an hour ago)
[16:16:10] and yw, I only wish I could have figured it out when I saw the bug report come in
[16:16:24] * greg-g nods
[16:16:33] I hate when stuff is critically broken and I don't know enough to fix it
[16:17:05] you must be trained in everything from now on.
[16:17:12] apergos: We want Parsoid to be so boring we don't need to document this, but yes, we should document it much more.
[16:17:32] I did look at the page already there, which is how I at least knew which servers to stare at
[16:17:44] so it's better than some subsystems which have or had basically zero docs
[16:18:05] paravoid: ping
[16:18:26] apergos: Sure, but we should expect new-build systems to be much better documented than the crap we're replacing. :-)
[16:18:32] ah I have been wondering what was up with wtp1008... ( cmjohnson1 )
[16:18:44] I'm all for that James_F :-)
[16:19:30] apergos: A few of the wtp1...s are flaky, it seems.
[16:20:16] well (see my comment on whatever ticket) the cpu step stuff might be not very informative, there's a linux kernel bug about that
[16:20:28] but that host in particular looks oddly behaved to me based on ganglia
[16:20:54] apergos: wtp1008 issue seems to be h/w related
[16:22:06] apergos: you can also always mail Roan & me
[16:22:34] New patchset: AzaToth; "Initial debian build" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999
[16:23:15] what tz are you, gwicke?
[16:23:34] cmjohnson1: ok I will be curious to see what it turns out to be in the end
[16:23:48] I'm in SF, but would have read the mail when I woke up at 4am ;)
[16:24:08] or before I went to bed
[16:24:53] ewww 4am
[16:24:56] sorry to hear it
[16:26:46] cmjohnson1: are those issues isolated to wtp1008, or is this the general thermal warning issue we saw on the other machines too?
[16:27:15] this issue is only on wtp1008
[16:27:26] k
[16:27:54] i swapped the cpu's now waiting to see if the error presents itself again
[16:47:06] lol, why " < wikibugs_> (mod) blocking for 100000 years doesn't create a block but fills block log "
[16:52:21] New patchset: Mark Bergsma; "Remove unused file thumbs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68686
[16:52:21] New patchset: Mark Bergsma; "Remove obsolete Perl HTCP purger daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68687
[16:53:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68686
[16:53:27] mutante: I thought first I couldn't unblock, but realized that it actually didn't do a block
[16:53:46] it only filled the block log
[16:53:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68687
[16:54:23] AzaToth: ok:) i was just wondering why one would want to block for 100000 years
[16:56:29] mutante: which is the maximum you can actually block someone for?
[16:57:07] eternity
[16:57:23] MaxSem: non-indef
[16:58:31] mutante: 3 millennia, 1 century, 6 decades, 8 years, 319 days, 4 hours, 1 minute and 3 seconds (99999999999 seconds) works
[16:59:09] let the next millennium people care about it ?:)
[17:00:16] mutante: well, the block log still says "blocked for 100 millennias"
[17:01:07] you should change it to the "99999999999 seconds" :)
[17:01:37] mutante: what I meant is that it records the block log without verifying that it actually made a block
[17:01:41] they will all calculate how much it actually is
[17:02:06] ah
[17:04:47] New patchset: Odder; "(bug 49312) Add a 'Programs' namespace to metawiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68690
[17:28:55] New patchset: Catrope; "Re-enable VisualEditor on mediawiki.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68694
[17:31:17] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68694
[17:31:58] !log catrope synchronized wmf-config/InitialiseSettings.php 'Re-enable VE on mw.org'
[17:32:06] Logged the message, Master
[17:36:15] ^demon: how do you setup your VM?
[17:36:28] <^demon> I just fired up a new VM in labs.
[17:36:33] <^demon> It's called gerrit-build.
[17:36:41] oh
[17:37:01] could you throw one up for me so I can test the build?
[17:37:17] as I'm using debian here, some stuff could be different
[17:37:21] <^demon> Yeah, lemme give you access to it.
[17:37:25] k
[17:38:03] can a gerrit patchset be made for a different branch than master ( what I mean is, can the gerrit patchset, be merged into a different branch than master) ?
[17:38:21] average_drifter: yea
[17:38:31] average_drifter: git review
[17:42:26] AzaToth: thanks
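Editor's note: for the record, git-review takes the target branch as a positional argument; the repository and branch names below are only examples:

    # work on a topic branch based on the target branch
    git checkout -b bugfix origin/REL1_22
    # ...edit and git commit...
    git review REL1_22    # pushes to refs/for/REL1_22 instead of refs/for/master
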
[17:50:01] I would like more permissions on the analytics/dclass.git repo please
[17:50:14] I'm finding myself often in the situation of needing to delete branches or rename them
[17:58:05] ^demon: /etc/default/gerrit or /etc/default/gerritcodereview?
[18:00:25] !log aaron synchronized php-1.22wmf7/maintenance/purgeDeletedFiles.php '9e2ffededde6e7752a8bd64d2ae791af768213c0'
[18:00:35] <^demon> AzaToth: Whichever gerrit's init script wants. I think it used to be one but switched to the other.
[18:00:36] Logged the message, Master
[18:03:48] apergos: Are you around to help us out one more time?
[18:03:54] yes
[18:03:56] what's up?
[18:04:03] I'd ask Les lie as she's on RT duty but she's away today
[18:04:13] New patchset: GWicke; "Revert "Use pass to avoid caching in frontend, refactoring, explanatory comments"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68707
[18:04:13] So we think the problem isn't actually with Parsoid or VE, but with https://gerrit.wikimedia.org/r/#/c/68428/6
[18:04:24] ....which Gabriel submitted a revert for right there
[18:04:34] Can we get that revert merged and deployed?
[18:04:45] RoanKattouw: https://gerrit.wikimedia.org/r/68707
[18:04:50] oh, is this what went live at around... 11:37 or so utc?
[18:05:02] Basis for suspicion: Mark merged the change at 4:23 PDT, the last good VE diff was at 4:26 PDT, the first bad diff at 4:37 PDT
[18:05:06] So yeah exactly
[18:05:10] Merged at 11:23 UTC
[18:05:26] haha this is the one thing I could add to the bug (but no idea how to determine if it was the cause)
[18:05:29] We blamed Parsoid changes but they went live many hours before
[18:05:56] so wait has the library dependency etc all been fixed and tested and yet ve is still not working right?
[18:06:15] Yeah we rolled back Parsoid and things are still broken
[18:06:15] ugh
[18:06:15] ok
[18:06:36] There was an initial issue with the revert operation being broken in git-deploy, but we've now verified that the right code is deployed and all the processes have been restarted and still no dice
[18:06:39] the code and libraries we deployed yesterday all work fine
[18:07:33] so let me ask, was there anything else happening to parsoid varnish along with this change to the vcls?
[18:07:44] apergos: no
[18:08:04] the issue is that the patch disables default processing, which also makes sure that POST requests are not cached
[18:08:04] so this is an isolated change?
[18:08:21] so now POST requests get a cached result corresponding to a GET
[18:08:35] apergos: yes
[18:08:40] depends on nothing else
[18:08:44] gwicke: That's kind of what I suspected but I'm disappointed that that's actually what it is
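Editor's note: in vcl_recv terms, the safety net that the offending patch bypassed is roughly the following (a sketch of Varnish 3's builtin behaviour, not the actual WMF VCL):

    sub vcl_recv {
        # Varnish's builtin VCL only considers GET and HEAD cacheable;
        # everything else -- POST included -- must go to the backend.
        # Skipping this check is what served cached GET bodies to POSTs.
        if (req.request != "GET" && req.request != "HEAD") {
            return (pass);
        }
        return (lookup);
    }
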
[18:09:55] I will be around until about 5pm PDT with a quick break to go to the post office [18:10:21] all right I'll push this out now [18:10:22] I have root so I can theoretically deploy things myself and bypass ops altogether, but I really don't want to if it's not absolutely necessary [18:10:23] sec [18:10:45] no, it's better for one of use to be around etc, you made the right call [18:10:56] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68707 [18:11:05] Thanks man [18:11:14] Let me know when that's on sockpuppet and I'll run puppet on the affected hosts [18:11:15] sec [18:12:28] grr unmerged stuff [18:12:57] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [18:13:27] mark, I'm pushing out your 'remove thumbs.pp' change [18:14:38] ok, should be there RoanKattouw [18:16:56] RoanKattouw: I still get HTML in a diff on MW.org VE - would it take some time? [18:17:13] he has to run puppet first [18:18:48] Running now [18:19:05] Ah, good excuse. :-) [18:19:16] isn't it though :-) [18:20:40] and we are back [18:20:42] mark, paravoid, i have posted a new, simplified (and short) proposal re zero architecture to wikitech-l. Please comment :) [18:22:29] New review: Catrope; "This change completely broke VisualEditor because it caused Varnish to return cached responses for P..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [18:22:43] so how does it look? [18:23:05] apergos: all is fine again [18:23:12] RoanKattouw: :) [18:23:15] Yeah. [18:23:16] excelleent [18:23:26] Yay [18:23:48] Hmm. Regressions in other things, though. [18:24:02] RoanKattouw, gwicke: Can we re-update Parsoid to master then? [18:24:15] I guess we should ask folks in wm-tech to watch edits with VE and see hwo they are, juuuust in case [18:24:16] James_F: yes, will do [18:24:21] also back to latest libraries [18:24:27] gwicke: Brilliant. :-) [18:24:41] apergos: Well, and switch VE on for the rest of the cluster as before. :-) [18:24:46] ah yes :-D [18:26:49] will you be wanting me for that? [18:27:33] I'll do it [18:27:49] apergos: Thanks for your help, and enjoy your evening :) [18:27:59] yw [18:28:12] I might wander off in a little while then (not right away) [18:28:19] have a good rest of your day! [18:28:41] New patchset: Catrope; "Re-enable VisualEditor everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68716 [18:29:02] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68716 [18:29:04] apergos: thanks much, again [18:30:22] !log catrope synchronized wmf-config/InitialiseSettings.php 'Re-enable VisualEditor on the rest of the wikis' [18:30:30] Logged the message, Master [18:34:01] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [18:34:02] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [18:34:41] paravoid: ping [18:37:08] paravoid: got problem building buck on a vm: http://paste.debian.net/10425/ [18:37:25] have never encountered such issue before and I've no clue what's wrong [18:37:57] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68670 [18:42:04] ^demon: you got same issue? [18:42:42] <^demon> I didn't have that problem with buck, no. 
The latest patch I tried last night *seems* to have worked, minus the args4j thing I mentioned. [18:42:59] I've no idea what's funking [18:43:12] looks like the directory is modified when tar tried to read it [18:43:34] I'm building it in my hope btw [18:43:37] home [18:44:31] mutante, can you review/test https://gerrit.wikimedia.org/r/#/c/68011/ when you have a moment? It makes the http/https change that you made earlier; I'm not clear on whether that will break things on the current rt host. [18:44:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68621 [18:45:22] ^demon: I think I've seen that on sufficiently broken filesystems. [18:45:39] <^demon> Oh, hrm, I did hit that originally yes. [18:45:51] !log updated Parsoid config and code back to latest after finding and fixing Varnish config bug [18:45:53] <^demon> When I was working on /home and not local storage. [18:45:57] oh [18:45:59] Logged the message, Master [18:46:02] <^demon> /home is NFS, I believe. [18:46:03] home is fubar? [18:46:11] <^demon> Or something [18:46:21] projectstorage.pmtpa.wmnet:/gerrit-home on /home type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072) [18:46:32] andrewbogott: it will work after the upgrade but not before [18:46:39] uh glusterfs? [18:46:42] is this labs? [18:46:53] andrewbogott: regarding the https link for rt-mailgate [18:46:57] <^demon> apergos: Yup, labs [18:47:02] so don't do that [18:47:06] you want to apply um [18:47:12] glusterfuck [18:47:18] mutante: ok, hm, maybe I should make two versions then [18:47:51] role::labsnfs::client [18:47:52] um? [18:47:53] that. [18:48:02] andrewbogott: brb, looking for my own notes [18:48:05] <^demon> apergos: I just ended up building on local instance storage :) [18:48:05] then run puppet, then you will have to reboot the instance afterwards [18:48:12] and you will have nfs instead of gluster for home [18:48:22] that works too ^demon [18:48:24] apergos: gluster is bad? [18:48:25] mutante, no need to interrupt current work for this, just trying to get it on your radar [18:48:28] but this is how you can convert them [18:48:39] <^demon> AzaToth: Gluster is many things. Bad is one of them :) [18:48:44] oh [18:48:53] why do you use gluster then? [18:48:53] gluster is sometimes problematic, so it's good if you run into issues to move off of it [18:49:11] because back in the day we didn't know it was going to present these issues [18:49:17] oh [18:49:39] ^demon: you do as apergos said you should do? [18:49:47] <^demon> No, I didn't. [18:49:54] you will do? [18:50:27] <^demon> Lemme finish my lunch :) [18:50:30] hehe [18:51:09] afaik, gluster usually broke when there were version updates of gluster itself [18:51:40] there have been performance problems too iirc [18:54:54] mutante: OK, switched back to http for now… you will probably have to remind me to change it to https when we do the upgrade. [18:55:44] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [18:55:44] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [18:59:02] andrewbogott: so, quite a few related tickets to this: RT-5167 was when we tested https rt-mailgate with the new version and it worked.
RT-5169 will be resolved after we use https, RT-714 was a way older one where i tried a fix but it couldn't support https yet, and abandoned in gerrit change 2446. then there is mail thread "[Ops] redirecting all RT traffic to https" from 2012 where Asher fixed redirects for rt-mailgate <-> REST API .. an [18:59:35] andrewbogott: so short version, we confirmed it on 4.x and it looked like it would also on 3.8.x but it didn't [19:00:43] andrewbogott: will do, just took "#5169: make rt-mailgate use https instead of http" [19:01:23] paravoid: hi, when you have time please consider looking at https://gerrit.wikimedia.org/r/68711 [19:07:12] andrewbogott: fyi, see "Links" section on #2240 now [19:07:37] trying to find them all [19:09:19] New review: Dzahn; "comment just regarding the rt-mailgate change to use https instead of http. This should work after R..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [19:22:17] mutante, sorry, was making lunch… but that all sounds good. I'm not sure if it makes sense to merge my patch before we migrate, but I'm hoping you'll test it on magnesium to make sure that it works there at least. [19:24:51] hey mutante, any news from an20? [19:27:47] !log rebooting es7 to deal with messed up lvm snapshot [19:27:55] Logged the message, notpeter [19:29:56] !log installing updates on es7 while I'm at it [19:30:05] Logged the message, notpeter [19:30:16] drdee: not really, i couldn't get it to boot to an installer, progress is that DHCP issues can be ruled out, it does get a DHCPACK, but then it hangs [19:31:35] PROBLEM - mysqld processes on es7 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:31:58] andrewbogott: we already did for the rt-mailgate protocol part and i see Mark gave the exim part a +1 , sounds good so far [19:32:20] !log deploying small changes to bugzilla login screen [19:32:28] Logged the message, Master [19:32:37] andre__: ^ it is "email address" now [19:32:53] mutante, \o/ [19:33:55] PROBLEM - DPKG on es7 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:35:50] drdee: it might be the same thing again .. that the udp2log stuff causes this, because it was so inconsistent, one time i saw an installer and then i couldn't repeat it ever [19:36:33] ugghh [19:36:35] if there is a large firehose of data going at it [19:36:36] it's likely [19:36:44] is there? [19:37:18] maybe…..but not from udp2log, maybe it is still receiving traffic from hadoop [19:37:29] sure, any firehose could mess this up ;) [19:37:50] it's a hadoop worker node, udp2log is not aimed at that machine AFAIK [19:38:03] PROBLEM - Host es7 is DOWN: PING CRITICAL - Packet loss = 100% [19:38:05] but is anything shooting a lot of data at it?
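A host that gets a DHCPACK and then hangs during the install is hard to debug blind, so the firehose theory above can be tested by watching what actually arrives at the box. A sketch only — the interface name and target IP are hypothetical, and it has to run somewhere that can see the traffic (a mirror port, or the host itself once it has a shell):

    # Sample a few thousand packets headed at the stuck host and count them
    # per source; a single source dominating the tally is the firehose.
    TARGET=10.64.0.1   # hypothetical IP of the stuck install target
    tcpdump -n -c 5000 -i eth0 dst host "$TARGET" 2>/dev/null \
        | awk '{print $3}' | sort | uniq -c | sort -rn | head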
[19:38:16] wouldn't necessarily need to be udp2log traffic [19:38:35] it shouldn't be the case but ottomata is not around to ask [19:38:43] gotcha [19:43:13] RECOVERY - Host es7 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [19:45:23] PROBLEM - RAID on es7 is CRITICAL: Timeout while attempting connection [19:45:23] PROBLEM - Full LVS Snapshot on es7 is CRITICAL: Timeout while attempting connection [19:45:24] PROBLEM - MySQL Slave Running on es7 is CRITICAL: Timeout while attempting connection [19:45:33] PROBLEM - MySQL disk space on es7 is CRITICAL: Timeout while attempting connection [19:45:34] PROBLEM - MySQL Idle Transactions on es7 is CRITICAL: Timeout while attempting connection [19:45:34] PROBLEM - Disk space on es7 is CRITICAL: Timeout while attempting connection [19:45:34] PROBLEM - MySQL Recent Restart on es7 is CRITICAL: Timeout while attempting connection [19:45:34] PROBLEM - SSH on es7 is CRITICAL: Connection timed out [19:46:04] PROBLEM - MySQL Replication Heartbeat on es7 is CRITICAL: Timeout while attempting connection [19:46:15] PROBLEM - MySQL Slave Delay on es7 is CRITICAL: Timeout while attempting connection [19:47:18] notpeter: es7 wants lunch :) [19:48:14] RECOVERY - Full LVS Snapshot on es7 is OK: OK no full LVM snapshot volumes [19:48:15] RECOVERY - RAID on es7 is OK: OK: State is Optimal, checked 2 logical device(s) [19:48:24] RECOVERY - MySQL Slave Running on es7 is OK: OK replication [19:48:33] RECOVERY - SSH on es7 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:48:33] RECOVERY - MySQL Idle Transactions on es7 is OK: OK longest blocking idle transaction sleeps for seconds [19:48:33] RECOVERY - MySQL Recent Restart on es7 is OK: OK seconds since restart [19:48:33] RECOVERY - Disk space on es7 is OK: DISK OK [19:48:33] RECOVERY - MySQL disk space on es7 is OK: DISK OK [19:49:03] RECOVERY - MySQL Replication Heartbeat on es7 is OK: OK replication delay seconds [19:49:13] RECOVERY - MySQL Slave Delay on es7 is OK: OK replication delay seconds [19:50:03] RECOVERY - mysqld processes on es7 is OK: PROCS OK: 1 process with command name mysqld [19:51:53] RECOVERY - DPKG on es7 is OK: All packages OK [19:55:14] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [19:57:23] PROBLEM - SSH on labstore4 is CRITICAL: Connection refused [19:57:43] PROBLEM - RAID on labstore4 is CRITICAL: Connection refused by host [20:07:23] ^demon: done? [20:08:51] <^demon> Ah yes, I got sucked into hiphop again :) [20:09:04] hehe [20:09:12] building in var now then [20:09:33] 1000% faster [20:09:34] <^demon> Gotta package hiphop too, but shouldn't be too hard actually. [20:09:46] hiphop? [20:09:49] <^demon> hhvm. [20:10:01] <^demon> https://github.com/facebook/hiphop-php [20:10:09] faceboook.... [20:10:12] again [20:10:29] <^demon> As of ubuntu 13.04, all the dependencies are in apt except 1. [20:11:06] <^demon> AzaToth: Well, buck wasn't a choice ;-) [20:12:34] ^demon: fwiw, facebook also released new selenium-webdriver bindings this week that they claim don't suck nearly as bad as the last time they released webdriver for PHP https://github.com/facebook/php-webdriver [20:12:59] <^demon> Heh, selenium. [20:13:17] selenium is great! just not in PHP [20:14:35] ok, now my java is a bit funky [20:14:48] <^demon> chrismcmahon: So, I've been meaning to ask you...is the selenium stuff in tests/selenium/* in core actually of any use?
It's a ton of messy code that violates tons of coding conventions (and pollutes my grepping ;-)) and if nobody's using it anymore I wouldn't mind getting rid of it. [20:14:50] Exception in thread "main" java.lang.UnsupportedClassVersionError: com/facebook/buck/cli/Main : Unsupported major.minor version 51.0 [20:15:02] how can I specify to use java7? [20:15:11] <^demon> AzaToth: Sounds like you're trying to compile buck with java6. [20:15:25] I'm using "java" [20:15:25] <^demon> Yeah, I basically just only install jdk7. [20:15:33] ^demon: I vote to get rid of it. I don't think anyone will ever use it, all our active Se stuff is in /qa/browsertests [20:15:55] ^demon: it has to be terribly out of date and as I understand it, never really worked in the first place [20:16:18] <^demon> Indeed. [20:16:30] before my time [20:17:49] ^demon: gerrit must be run using java6? [20:18:07] <^demon> At the moment it's running on the java6 jre yeah. [20:18:18] <^demon> I figured we should probably move to java7 since our build chain requires it now. [20:18:48] I would suggest that [20:19:01] sudo update-java-alternatives -s java-1.7.0-openjdk-amd64 [20:20:46] wh00t wh00t — http://www.itwire.com/business-it-news/open-source/60292-red-hat-ditches-mysql-switches-to-mariadb [20:21:14] <^demon> chrismcmahon: https://gerrit.wikimedia.org/r/#/c/68729/ [20:21:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [20:28:13] ^demon: I commented and added Zeljko as reviewer. He presented at SeleniumConf in Boston this week, he should be back at work on Monday [20:36:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [20:37:20] ^demon: works mostly now, just needs a fix in the init file for it to install properly [20:37:48] <^demon> Lemme purge buck from the system and try [20:37:59] ok [20:38:12] on gerrit-build-fresh or gerrit-build? [20:38:29] gonna change gerrit to jdk7 as well [20:38:58] done [20:39:18] ^demon: why is gerrit not spammy in here via icinga-wm? [20:39:44] <^demon> I never got around to setting up any icinga alerts for it. [20:39:44] <^demon> Would be nice [20:40:02] ok, so it's not on purpose [20:40:17] <^demon> Oh no, it's just "I dunno how and never had time to ask" [20:40:18] <^demon> :) [20:40:42] * AzaToth so stupid, I meant via gerrit-wm [20:40:54] * AzaToth goes hiding in the largest hole [20:41:10] <^demon> :) Oh, I never configured it to point to here. It should. [20:41:14] <^demon> Default is #-dev.
[20:41:22] ツ [20:43:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:44:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [21:02:15] !log rebooting es10 to deal with fucked up snapshot volume [21:02:24] Logged the message, notpeter [21:02:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:03:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [21:04:29] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [21:07:51] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 26.95 ms [21:24:17] New review: Dzahn; "in PS6 an unrelated change to a planet config sneaked in" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [21:24:29] New patchset: Pyoungmeister; "temporarily removing es1007 and es1010 for reboots" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68735 [21:24:38] AaronSchulz: ^ [21:24:53] do you know of any magical, archaic reason why that would break shit? [21:25:22] or does it work just the same as the regular db boxxies? [21:27:16] I don't see any problem AFAIK [21:27:34] !log mholmquist Started syncing Wikimedia installation... : Fixing UploadWizard in Chrome [21:27:37] cool. thanks! [21:27:43] Logged the message, Master [21:28:04] notpeter: you can't do 0 load either? [21:28:23] oh, that would probably work too [21:29:04] * AaronSchulz wonders if 0 load hosts are included in slave lag checks for readerIndex() [21:29:33] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68735 [21:29:49] ^demon: success? [21:30:28] !log py synchronized wmf-config/db-eqiad.php 'temp depooling es1007 and es1010 for reboots' [21:30:38] Logged the message, Master [21:30:39] huh, seems like the errors are ignored though, which is OKish [21:31:06] plus those checks should be rare now...being mutexed lately [21:32:10] !log reboot es1007 to deal with messed up lvm snapshot [21:32:19] Logged the message, notpeter [21:32:43] I really hate it that I've spent many many many hours trying to figure out how to fix those without rebooting, and have thus far come up empty handed [21:34:11] PROBLEM - Host es1007 is DOWN: CRITICAL - Plugin timed out after 15 seconds [21:35:01] !log mholmquist Finished syncing Wikimedia installation... : Fixing UploadWizard in Chrome [21:35:15] Woo [21:35:16] Logged the message, Master [21:35:19] * marktraceur tests [21:36:01] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [21:36:37] notpeter: hello [21:36:49] AaronSchulz: hey [21:39:39] <^demon> AzaToth: Tried with master, got Exception in thread "main" java.lang.UnsupportedClassVersionError: com/facebook/buck/cli/Main : Unsupported major.minor version 51.0 again.
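For context on the es reboots being logged above: a wedged LVM snapshot can sometimes be cleared without a reboot, though notpeter notes that on these hosts every such attempt came up empty. The usual sequence, with hypothetical volume group and snapshot names:

    lvs -a                           # a snapshot at 100% usage has been invalidated
    lvremove /dev/es-vg/db-snap      # hypothetical names; enough for a healthy snapshot
    # If lvremove hangs, inspect device-mapper state and, as a last resort,
    # drop the dm devices directly (hyphens in VG/LV names are doubled here):
    dmsetup info -c | grep -i snap
    dmsetup remove es--vg-db--snap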
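The "Unsupported major.minor version 51.0" error ^demon keeps hitting means the class files were built for Java 7 (class-file format 51.0) but are being loaded by a Java 6 JVM; the update-java-alternatives command AzaToth repeats below is the standard cure on Ubuntu of that era. A quick sanity check around it:

    java -version                      # confirm which JVM "java" resolves to
    update-java-alternatives -l        # list the installed alternatives
    sudo update-java-alternatives -s java-1.7.0-openjdk-amd64   # as quoted in the log
    java -version                      # should now report a 1.7.x runtime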
[21:39:47] <^demon> I'm thinking of calling it a week on this :) [21:40:14] !log reboot es1010 to deal with messed up lvm snapshot [21:40:22] Logged the message, notpeter [21:41:19] New review: Andrew Bogott; "Patch six produces: https://dpaste.de/UPdw2/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [21:42:02] PROBLEM - Host es1010 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:32] RECOVERY - Host es1010 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [21:44:56] ^demon: change java version on the system [21:45:22] java doesn't handle multiple concurrently available java versions [21:45:36] ^demon: sudo update-java-alternatives -s java-1.7.0-openjdk-amd64 [21:48:54] <^demon> Sweet, works :) [21:51:32] New patchset: Pyoungmeister; "Revert "temporarily removing es1007 and es1010 for reboots"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68772 [21:53:39] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68772 [21:54:51] <^demon> AzaToth: So, buck looks good to go for me. Still testing gerrit. [21:55:15] !log reedy synchronized php-1.22wmf7/extensions/SecurePoll/ [21:55:18] !log py synchronized wmf-config/db-eqiad.php 'repooling es1007 and es1010' [21:55:23] Logged the message, Master [21:55:31] Logged the message, Master [21:56:18] !log reedy synchronized php-1.22wmf6/extensions/SecurePoll/ [21:56:27] Logged the message, Master [21:59:49] New review: Andrew Bogott; "Sorry, that should be https://dpaste.de/3UCC1/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [22:01:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.147 second response time [22:03:25] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [22:04:55] New review: Andrew Bogott; "I've verified that the exim.conf generated by exim::role::simple-mail-sender is the same as that gen..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [22:05:39] !log reedy synchronized php-1.22wmf7/extensions/SecurePoll/ [22:05:48] Logged the message, Master [22:13:15] Change abandoned: Demon; "I'm tired of seeing this on my queue. Feel free to restore if you plan on doing this again." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8120 [22:25:33] !log reedy synchronized php-1.22wmf7/extensions/SecurePoll/ [22:25:41] Logged the message, Master [22:26:09] !log reedy synchronized php-1.22wmf6/extensions/SecurePoll/ [22:26:18] Logged the message, Master [22:37:51] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:44:40] New review: Dzahn; "i put https://dpaste.de/3UCC1/raw/ on magnesium and tried it. the rewriting to general@rt appears t..."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [22:46:39] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.183 second response time [23:10:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.139 second response time [23:20:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:01] New patchset: Ori.livneh; "Add user for jforrester and grant access to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68821 [23:22:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [23:22:46] New review: Dzahn; "2013-06-14 23:13:33 1UndBV-0002tt-LQ Error in smart_route router: unknown routing option or transpor..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [23:24:49] !log reedy synchronized php-1.22wmf6/extensions/SecurePoll/ [23:24:58] Logged the message, Master [23:31:19] New patchset: Hashar; "php-fatal-error.html is now tracked in git" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68825 [23:35:12] New review: Dzahn; "2013-06-14 23:34:20 1UndVb-0003Fx-Ui => dzahn@wikimedia.org R=smart_route T=remote_smtp S=995 H=mche..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [23:40:32] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [23:42:44] New review: Jforrester; "Confirm that this is about me, and is my key. 
:-)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68821 [23:44:02] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [23:44:02] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:02] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:02] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:03] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:03] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:04] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:04] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [23:44:05] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours [23:44:05] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:06] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:06] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [23:49:00] New review: Dzahn; "yep, with route_list = * mchenry.wikimedia.org:lists.wikimedia.org it worked right away when we tes..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68011