[00:01:49] New patchset: Asher; "labsdb: set innodb_locks_unsafe_for_binlog for s4-5, set high slave_transaction_retries for all shards" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68602
[00:02:23] Sorry for the deployment spam, folks. Complicated deployment. Thanks very much anomie|away, greg-g, RoanKattouw & marktraceur.
[00:02:47] Is that the end of the lightning?
[00:03:00] ...30 minutes late? :)
[00:03:16] Yeah. I was bad.
[00:03:36] *nod* K
[00:07:54] RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Fri Jun 14 00:07:48 UTC 2013
[00:07:54] RECOVERY - Puppet freshness on labstore1 is OK: puppet ran at Fri Jun 14 00:07:48 UTC 2013
[00:07:54] RECOVERY - Puppet freshness on mw1129 is OK: puppet ran at Fri Jun 14 00:07:48 UTC 2013
[00:07:54] RECOVERY - Puppet freshness on db1027 is OK: puppet ran at Fri Jun 14 00:07:49 UTC 2013
[00:07:54] RECOVERY - Puppet freshness on ms-fe1002 is OK: puppet ran at Fri Jun 14 00:07:49 UTC 2013
[00:07:55] RECOVERY - Puppet freshness on ms6 is OK: puppet ran at Fri Jun 14 00:07:49 UTC 2013
[00:07:55] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Fri Jun 14 00:07:50 UTC 2013
[00:07:56] RECOVERY - Puppet freshness on mw1047 is OK: puppet ran at Fri Jun 14 00:07:52 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on mc11 is OK: puppet ran at Fri Jun 14 00:07:52 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on db69 is OK: puppet ran at Fri Jun 14 00:07:53 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on grosley is OK: puppet ran at Fri Jun 14 00:07:54 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on labstore2 is OK: puppet ran at Fri Jun 14 00:07:54 UTC 2013
[00:08:03] RECOVERY - Puppet freshness on capella is OK: puppet ran at Fri Jun 14 00:07:54 UTC 2013
[00:08:04] RECOVERY - Puppet freshness on ms-be12 is OK: puppet ran at Fri Jun 14 00:07:54 UTC 2013
[00:08:04] RECOVERY - Puppet freshness on ssl1 is OK: puppet ran at Fri Jun 14 00:07:56 UTC 2013
[00:08:05] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Fri Jun 14 00:07:57 UTC 2013
[00:08:05] RECOVERY - Puppet freshness on solr1001 is OK: puppet ran at Fri Jun 14 00:07:57 UTC 2013
[00:08:06] RECOVERY - Puppet freshness on db67 is OK: puppet ran at Fri Jun 14 00:07:58 UTC 2013
[00:08:06] RECOVERY - Puppet freshness on db1005 is OK: puppet ran at Fri Jun 14 00:07:58 UTC 2013
[00:08:07] RECOVERY - Puppet freshness on mw55 is OK: puppet ran at Fri Jun 14 00:07:59 UTC 2013
[00:08:07] RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Fri Jun 14 00:07:59 UTC 2013
[00:08:08] RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Fri Jun 14 00:07:59 UTC 2013
[00:08:08] RECOVERY - Puppet freshness on analytics1002 is OK: puppet ran at Fri Jun 14 00:08:00 UTC 2013
[00:08:09] RECOVERY - Puppet freshness on sanger is OK: puppet ran at Fri Jun 14 00:08:01 UTC 2013
[00:08:09] RECOVERY - Puppet freshness on mw1181 is OK: puppet ran at Fri Jun 14 00:08:02 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on cp1016 is OK: puppet ran at Fri Jun 14 00:08:02 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on mw1179 is OK: puppet ran at Fri Jun 14 00:08:03 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on palladium is OK: puppet ran at Fri Jun 14 00:08:03 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on mc12 is OK: puppet ran at Fri Jun 14 00:08:04 UTC 2013
[00:08:13] RECOVERY - Puppet freshness on mw1165 is OK: puppet ran at Fri Jun 14 00:08:04 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on mw65 is OK: puppet ran at Fri Jun 14 00:09:13 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on analytics1018 is OK: puppet ran at Fri Jun 14 00:09:13 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on wtp1008 is OK: puppet ran at Fri Jun 14 00:09:13 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on mw1121 is OK: puppet ran at Fri Jun 14 00:09:13 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on mw1141 is OK: puppet ran at Fri Jun 14 00:09:14 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on search1004 is OK: puppet ran at Fri Jun 14 00:09:14 UTC 2013
[00:09:23] RECOVERY - Puppet freshness on amssq50 is OK: puppet ran at Fri Jun 14 00:09:15 UTC 2013
[00:09:24] RECOVERY - Puppet freshness on sq77 is OK: puppet ran at Fri Jun 14 00:09:17 UTC 2013
[00:09:24] RECOVERY - Puppet freshness on sq49 is OK: puppet ran at Fri Jun 14 00:09:17 UTC 2013
[00:09:25] RECOVERY - Puppet freshness on db1056 is OK: puppet ran at Fri Jun 14 00:09:17 UTC 2013
[00:09:25] RECOVERY - Puppet freshness on mw1090 is OK: puppet ran at Fri Jun 14 00:09:17 UTC 2013
[00:09:26] RECOVERY - Puppet freshness on es2 is OK: puppet ran at Fri Jun 14 00:09:18 UTC 2013
[00:09:26] RECOVERY - Puppet freshness on srv268 is OK: puppet ran at Fri Jun 14 00:09:18 UTC 2013
[00:09:27] New patchset: Asher; "labsdb: set innodb_locks_unsafe_for_binlog for s4-5, set high slave_transaction_retries for all shards" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68602
[00:09:27] RECOVERY - Puppet freshness on ms1004 is OK: puppet ran at Fri Jun 14 00:09:18 UTC 2013
[00:09:27] RECOVERY - Puppet freshness on cp1001 is OK: puppet ran at Fri Jun 14 00:09:19 UTC 2013
[00:09:28] RECOVERY - Puppet freshness on mc13 is OK: puppet ran at Fri Jun 14 00:09:19 UTC 2013
[00:09:28] RECOVERY - Puppet freshness on srv298 is OK: puppet ran at Fri Jun 14 00:09:19 UTC 2013
[00:09:29] RECOVERY - Puppet freshness on lvs4 is OK: puppet ran at Fri Jun 14 00:09:20 UTC 2013
[00:09:29] RECOVERY - Puppet freshness on virt1005 is OK: puppet ran at Fri Jun 14 00:09:21 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on mw91 is OK: puppet ran at Fri Jun 14 00:09:22 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on mw92 is OK: puppet ran at Fri Jun 14 00:09:23 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on ms-be1010 is OK: puppet ran at Fri Jun 14 00:09:23 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on cp3019 is OK: puppet ran at Fri Jun 14 00:09:23 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on harmon is OK: puppet ran at Fri Jun 14 00:09:24 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on mc1002 is OK: puppet ran at Fri Jun 14 00:09:24 UTC 2013
[00:09:33] RECOVERY - Puppet freshness on virt8 is OK: puppet ran at Fri Jun 14 00:09:25 UTC 2013
[00:09:34] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Fri Jun 14 00:09:28 UTC 2013
[00:09:34] RECOVERY - Puppet freshness on mw1134 is OK: puppet ran at Fri Jun 14 00:09:29 UTC 2013
[00:09:35] RECOVERY - Puppet freshness on srv243 is OK: puppet ran at Fri Jun 14 00:09:29 UTC 2013
[00:09:35] RECOVERY - Puppet freshness on cp3021 is OK: puppet ran at Fri Jun 14 00:09:30 UTC 2013
[00:09:36] RECOVERY - Puppet freshness on mw7 is OK: puppet ran at Fri Jun 14 00:09:32 UTC 2013
[00:09:36] RECOVERY - Puppet freshness on mw3 is OK: puppet ran at Fri Jun 14 00:09:32 UTC 2013
[00:09:37] RECOVERY - Puppet freshness on search23 is OK: puppet ran at Fri Jun 14 00:09:32 UTC 2013
[00:09:38] To make that lightning deploy a bit worse, I am now going to scap
[00:09:43] RECOVERY - Puppet freshness on srv251 is OK: puppet ran at Fri Jun 14 00:09:32 UTC 2013
[00:09:43] RECOVERY - Puppet freshness on mw25 is OK: puppet ran at Fri Jun 14 00:09:33 UTC 2013
[00:09:43] RECOVERY - Puppet freshness on mw1199 is OK: puppet ran at Fri Jun 14 00:09:33 UTC 2013
[00:09:43] RECOVERY - Puppet freshness on srv293 is OK: puppet ran at Fri Jun 14 00:09:34 UTC 2013
[00:09:43] RECOVERY - Puppet freshness on mw1120 is OK: puppet ran at Fri Jun 14 00:09:34 UTC 2013
[00:09:48] !log Scapping to fix up VE deployment
[00:09:57] Logged the message, Mr. Obvious
[00:10:23] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Fri Jun 14 00:10:12 UTC 2013
[00:10:23] RECOVERY - Puppet freshness on srv257 is OK: puppet ran at Fri Jun 14 00:10:13 UTC 2013
[00:10:23] RECOVERY - Puppet freshness on sq53 is OK: puppet ran at Fri Jun 14 00:10:13 UTC 2013
[00:10:23] RECOVERY - Puppet freshness on mc1010 is OK: puppet ran at Fri Jun 14 00:10:14 UTC 2013
[00:10:23] RECOVERY - Puppet freshness on es3 is OK: puppet ran at Fri Jun 14 00:10:14 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on mw1064 is OK: puppet ran at Fri Jun 14 00:11:13 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on mw19 is OK: puppet ran at Fri Jun 14 00:11:13 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on mw31 is OK: puppet ran at Fri Jun 14 00:11:12 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on es1001 is OK: puppet ran at Fri Jun 14 00:11:13 UTC 2013
[00:11:23] RECOVERY - Puppet freshness on cp1014 is OK: puppet ran at Fri Jun 14 00:11:13 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on analytics1005 is OK: puppet ran at Fri Jun 14 00:11:22 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on snapshot1003 is OK: puppet ran at Fri Jun 14 00:11:22 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on mw1187 is OK: puppet ran at Fri Jun 14 00:11:23 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on cp3005 is OK: puppet ran at Fri Jun 14 00:11:23 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on mw105 is OK: puppet ran at Fri Jun 14 00:11:23 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on search28 is OK: puppet ran at Fri Jun 14 00:11:23 UTC 2013
[00:11:33] RECOVERY - Puppet freshness on mw94 is OK: puppet ran at Fri Jun 14 00:11:24 UTC 2013
[00:11:34] RECOVERY - Puppet freshness on mw101 is OK: puppet ran at Fri Jun 14 00:11:24 UTC 2013
[00:11:35] RECOVERY - Puppet freshness on mw54 is OK: puppet ran at Fri Jun 14 00:11:24 UTC 2013
[00:11:35] RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Fri Jun 14 00:11:25 UTC 2013
[00:11:35] RECOVERY - Puppet freshness on mw1037 is OK: puppet ran at Fri Jun 14 00:11:25 UTC 2013
[00:11:36] RECOVERY - Puppet freshness on mw1113 is OK: puppet ran at Fri Jun 14 00:11:25 UTC 2013
[00:11:36] RECOVERY - Puppet freshness on mw29 is OK: puppet ran at Fri Jun 14 00:11:25 UTC 2013
[00:11:37] RECOVERY - Puppet freshness on analytics1025 is OK: puppet ran at Fri Jun 14 00:11:27 UTC 2013
[00:11:37] RECOVERY - Puppet freshness on strontium is OK: puppet ran at Fri Jun 14 00:11:28 UTC 2013
[00:11:38] RECOVERY - Puppet freshness on srv295 is OK: puppet ran at Fri Jun 14 00:11:29 UTC 2013
[00:11:38] RECOVERY - Puppet freshness on mw1077 is OK: puppet ran at Fri Jun 14 00:11:30 UTC 2013
[00:11:39] RECOVERY - Puppet freshness on pc2 is OK: puppet ran at Fri Jun 14 00:11:30 UTC 2013
[00:11:39] RECOVERY - Puppet freshness on mw1160 is OK: puppet ran at Fri Jun 14 00:11:31 UTC 2013
[00:11:40] RECOVERY - Puppet freshness on mw108 is OK: puppet ran at Fri Jun 14 00:11:32 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on sq67 is OK: puppet ran at Fri Jun 14 00:11:33 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on mw103 is OK: puppet ran at Fri Jun 14 00:11:33 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on mc1015 is OK: puppet ran at Fri Jun 14 00:11:33 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on mw1207 is OK: puppet ran at Fri Jun 14 00:11:33 UTC 2013
[00:11:43] RECOVERY - Puppet freshness on mc1003 is OK: puppet ran at Fri Jun 14 00:11:34 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on es1 is OK: puppet ran at Fri Jun 14 00:11:42 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Fri Jun 14 00:11:42 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on mw1018 is OK: puppet ran at Fri Jun 14 00:11:43 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on cp1007 is OK: puppet ran at Fri Jun 14 00:11:43 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on labsdb1002 is OK: puppet ran at Fri Jun 14 00:11:44 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on labsdb1003 is OK: puppet ran at Fri Jun 14 00:11:44 UTC 2013
[00:11:54] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Fri Jun 14 00:11:44 UTC 2013
[00:11:55] RECOVERY - Puppet freshness on db1054 is OK: puppet ran at Fri Jun 14 00:11:45 UTC 2013
[00:11:55] RECOVERY - Puppet freshness on cerium is OK: puppet ran at Fri Jun 14 00:11:45 UTC 2013
[00:11:56] RECOVERY - Puppet freshness on locke is OK: puppet ran at Fri Jun 14 00:11:45 UTC 2013
[00:11:56] RECOVERY - Puppet freshness on mw122 is OK: puppet ran at Fri Jun 14 00:11:46 UTC 2013
[00:11:57] RECOVERY - Puppet freshness on srv238 is OK: puppet ran at Fri Jun 14 00:11:48 UTC 2013
[00:11:57] RECOVERY - Puppet freshness on cp3007 is OK: puppet ran at Fri Jun 14 00:11:49 UTC 2013
[00:11:58] RECOVERY - Puppet freshness on mw34 is OK: puppet ran at Fri Jun 14 00:11:49 UTC 2013
[00:11:59] RECOVERY - Puppet freshness on mw20 is OK: puppet ran at Fri Jun 14 00:11:50 UTC 2013
[00:11:59] RECOVERY - Puppet freshness on mw66 is OK: puppet ran at Fri Jun 14 00:11:50 UTC 2013
[00:11:59] RECOVERY - Puppet freshness on srv297 is OK: puppet ran at Fri Jun 14 00:11:50 UTC 2013
[00:12:00] RECOVERY - Puppet freshness on mw1017 is OK: puppet ran at Fri Jun 14 00:11:52 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1015 is OK: puppet ran at Fri Jun 14 00:11:53 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1065 is OK: puppet ran at Fri Jun 14 00:11:53 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1042 is OK: puppet ran at Fri Jun 14 00:11:53 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on search1018 is OK: puppet ran at Fri Jun 14 00:11:53 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1144 is OK: puppet ran at Fri Jun 14 00:11:54 UTC 2013
[00:12:03] RECOVERY - Puppet freshness on mw1052 is OK: puppet ran at Fri Jun 14 00:11:54 UTC 2013
[00:12:04] RECOVERY - Puppet freshness on mw1101 is OK: puppet ran at Fri Jun 14 00:11:55 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on mw100 is OK: puppet ran at Fri Jun 14 00:12:12 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on sq65 is OK: puppet ran at Fri Jun 14 00:12:13 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on search14 is OK: puppet ran at Fri Jun 14 00:12:13 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on mw1162 is OK: puppet ran at Fri Jun 14 00:12:14 UTC 2013
[00:12:23] RECOVERY - Puppet freshness on search1014 is OK: puppet ran at Fri Jun 14 00:12:14 UTC 2013
[00:12:24] RECOVERY - Puppet freshness on lvs1 is OK: puppet ran at Fri Jun 14 00:12:15 UTC 2013
[00:12:24] RECOVERY - Puppet freshness on mw46 is OK: puppet ran at Fri Jun 14 00:12:15 UTC 2013
[00:12:25] RECOVERY - Puppet freshness on ms5 is OK: puppet ran at Fri Jun 14 00:12:16 UTC 2013
[00:12:25] RECOVERY - Puppet freshness on cp3010 is OK: puppet ran at Fri Jun 14 00:12:17 UTC 2013
[00:12:26] RECOVERY - Puppet freshness on pc1002 is OK: puppet ran at Fri Jun 14 00:12:17 UTC 2013
[00:12:26] RECOVERY - Puppet freshness on mw1044 is OK: puppet ran at Fri Jun 14 00:12:17 UTC 2013
[00:12:27] RECOVERY - Puppet freshness on search21 is OK: puppet ran at Fri Jun 14 00:12:18 UTC 2013
[00:12:27] RECOVERY - Puppet freshness on zirconium is OK: puppet ran at Fri Jun 14 00:12:18 UTC 2013
[00:12:28] RECOVERY - Puppet freshness on lvs5 is OK: puppet ran at Fri Jun 14 00:12:18 UTC 2013
[00:12:28] RECOVERY - Puppet freshness on db1048 is OK: puppet ran at Fri Jun 14 00:12:19 UTC 2013
[00:12:33] RECOVERY - Puppet freshness on sq73 is OK: puppet ran at Fri Jun 14 00:12:23 UTC 2013
[00:12:43] RECOVERY - Puppet freshness on mw5 is OK: puppet ran at Fri Jun 14 00:12:36 UTC 2013
[00:12:44] RECOVERY - Puppet freshness on mw41 is OK: puppet ran at Fri Jun 14 00:12:38 UTC 2013
[00:12:54] RECOVERY - Puppet freshness on srv294 is OK: puppet ran at Fri Jun 14 00:12:50 UTC 2013
[00:13:40] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68602
[00:14:03] RECOVERY - Puppet freshness on sodium is OK: puppet ran at Fri Jun 14 00:13:57 UTC 2013
[00:14:03] RECOVERY - Puppet freshness on snapshot1002 is OK: puppet ran at Fri Jun 14 00:14:01 UTC 2013
[00:19:23] !log catrope Started syncing Wikimedia installation... : Fixup for VE deployment
[00:19:26] Logged the message, Master
[00:21:33] RECOVERY - Puppet freshness on mw1135 is OK: puppet ran at Fri Jun 14 00:21:23 UTC 2013
[00:28:03] !log catrope Finished syncing Wikimedia installation... : Fixup for VE deployment
[00:28:11] Logged the message, Master
[00:39:40] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused
[00:44:30] New patchset: Reedy; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713
[00:44:40] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[00:47:19] New patchset: Reedy; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713
[00:47:26] did anybody restart Parsoid on wtp1015?
[00:47:51] I did not
[00:48:07] RoanKattouw: is anything automatically restarting Parsoid?
[00:48:25] No
[00:49:28] the node process on wtp1015 was restarted at 0:44 UTC
[00:50:00] just before icinga listed it as up again
[00:50:21] I wonder if there is anything in syslog
[00:52:18] RoanKattouw, Ryan_Lane: could you have a look at syslog on wtp1015 around 0:44 UTC?
[00:52:22] there was a puppet run at the time
[00:52:32] aha
[00:52:39] Looking
[00:52:47] well, that shouldn't have caused a problem
[00:52:52] and that restarts Parsoid?
[00:52:52] I'm running puppet now
[00:53:11] Jun 14 00:44:21 wtp1015 puppet-agent[12807]: (/Stage[main]/Misc::Parsoid/Service[parsoid]/ensure) ensure changed 'stopped' to 'running'
[00:53:17] For some reason puppet believed the service wasn't up
[00:53:24] no
[00:53:27] Jun 14 00:44:21 wtp1015 puppet-agent[12807]: (/Stage[main]/Misc::Parsoid/Service[parsoid]/ensure) ensure changed 'stopped' to 'running'
[00:53:30] the DOWN came 5 minutes earlier
[00:53:40] yep
[00:53:48] I'm thinking the puppet run fixed it
[00:54:04] the question is what stopped it :)
[00:54:08] Nothing in syslog for that
[00:54:13] I see, so there is some kind of monitoring in puppet
[00:54:21] Not really
[00:54:25] gwicke: well, puppet has ensure => running
[00:54:28] We just got lucky that puppet ran 5 mins after the process died
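Editor's note: the syslog line above implies a Puppet service resource of roughly this shape (a sketch inferred from the /Stage[main]/Misc::Parsoid/Service[parsoid] resource path, not the exact manifest in operations/puppet):

    class misc::parsoid {
        # on every agent run, puppet starts the service again if it finds it
        # stopped -- which is what happened on wtp1015 at 00:44:21 above
        service { 'parsoid':
            ensure => running,
            enable => true,
        }
    }
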
[00:54:41] root 13496 0.0 0.0 36092 1568 ? S 00:44 0:00 sudo -E -u parsoid nohup node /var/lib/parsoid/Parsoid/js/api/server.js
[00:54:44] uugh
[00:54:57] Yeah it's a terrible mess
[00:54:58] paravoid: yeah. I've complained about this already ;)
[00:55:02] I take full blame
[00:55:19] And I'm hoping someone in ops is willing to write a proper init script at some point :)
[00:55:26] I don't care about blame, I care about fixes :)
[00:55:29] As I'm not good at that (evidently)
[00:55:51] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused
[00:55:55] hah
[00:55:56] I'm not sure why an upstart for this couldn't just run node /var/lib/parsoid/Parsoid/js/api/server.js
[00:56:21] so, no node on wtp1004
[00:56:25] you'd need to put the forking (or non-forking) count in the upstart
[00:56:36] nothing on syslog either
[00:56:54] Oh, upstart handles this nicely?
[00:57:05] yes
[00:57:12] * RoanKattouw has never worked with upstart and couldn't find an example offhand when he wrote this init script
[00:57:16] I think parsoid or node might just die
[00:57:25] with no logs whatsoever
[00:57:50] We know we have issues with child processes not being respawned
[00:58:00] The children die every now and then, but usually leave an exception/error message in the log
[00:58:18] copied the log off 1004
[00:58:21] does it log somewhere else than syslog?
[00:58:24] However, the logs get wiped out every time the process restarts, so we don't know why they die
[00:58:28] paravoid: nohup.out
[00:58:31] * RoanKattouw hides quickly
[00:58:34] haha
[00:58:35] * RoanKattouw is embarrassed
[00:58:46] upstart is generally simple, assuming your application forks (or doesn't fork) consistently
[00:58:48] (/var/lib/parsoid/nohup.out is our "log")
[00:59:20] Ryan_Lane: So what we start is a master process that forks a bunch of child processes (15 of them), and reforks them when they die (sometimes it doesn't respawn them, we don't know why yet)
[00:59:31] there are some backtraces there
[00:59:38] are these relevant?
[01:01:42] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.01261508465 secs
[01:01:45] They can be
[01:01:51] But they could be in the child processes
[01:02:03] paravoid: I am thinking that the http error is relevant to the master
[01:02:10] or could be at least
[01:03:11] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.004909038544 secs
[01:10:23] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time
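Editor's note: a minimal upstart job of the kind Ryan_Lane suggests above might look as follows. The file name, log path and setuid stanza are assumptions, and the master would still be responsible for respawning its 15 workers itself:

    # /etc/init/parsoid.conf -- hypothetical sketch, not a deployed config
    description "Parsoid HTTP service"

    start on runlevel [2345]
    stop on runlevel [!2345]

    # upstart re-spawns the master when it dies, replacing the nohup hack
    respawn
    respawn limit 10 5

    setuid parsoid                 # requires upstart >= 1.4
    chdir /var/lib/parsoid

    exec /usr/bin/node Parsoid/js/api/server.js >> /var/lib/parsoid/parsoid.log 2>&1

Appending to a persistent log file would also address the complaint above that nohup.out gets wiped out on every restart.
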
[01:24:55] !log starting swift->ceph scripts on a screen on ms-fe1002
[01:25:14] Logged the message, Master
[01:30:21] paravoid: ?
[01:30:28] swiftrepl
[01:30:30] thumbs
[01:30:44] ok
[01:30:44] I'm running them periodically to keep the delta small
[01:30:57] so can I run the copy scripts I've been wanting to?
[01:31:01] oh yeah sure
[01:31:05] * AaronSchulz was waiting for things to be more stable
[01:31:17] it is stable, until it breaks again :P
[01:31:35] well as long as it is not broken *right now* :)
[01:31:45] more seriously, we have one serious bug right now and that manifests when we restart osds or machines etc.
[01:32:22] or when the network decides to split brain everything, like the other day :)
[01:32:54] so it works but it's just nasty to have this timebomb in production
[01:35:42] AaronSchulz: so I'm syncing thumbs, timeline, transcoded, math & score
[01:35:52] are any of these journaled?
[01:36:12] oh and captcha I guess
[01:36:17] I'm doing global-* basically
[01:38:25] does "syncing" include deleting?
[01:38:29] yes
[01:38:43] ok
[01:38:45] deletes, new files, etag mismatches
[01:39:01] just don't do that for originals :)
[01:39:13] I'm not :)
[01:39:30] I modified an existing script to handle that later
[01:39:46] to handle what?
[01:39:48] well at least those initiated from action=delete
[01:39:57] where they were out of sync for some reason
[01:40:38] so there are like 4 scripts I'd need to run (the first is not needed by luck, no updated files were out of sync)
[01:40:48] okay
[01:40:55] * AaronSchulz will start on that tomorrow
[01:41:18] copy script in both directions, sync script from the time I did failover, and purgeDeletedFiles.php (in that order) :)
[01:41:30] RoanKattouw: having upstart manage N processes for some job isn't too hard. Do the node workers share any kind of state, or is it trivial to make them independent processes?
[01:41:40] * AaronSchulz hates all this multiwrite stuff
[01:41:41] both directions?
[01:41:54] Ryan_Lane: did you e-mail the folks in the new group you created? (I can if you haven't already.)
[01:41:54] I know...
[01:42:04] ori-l: There is a master process listening on port 80 that dispatches tasks to the workers. Not sure about shared state, ask gwicke_away
[01:42:18] most of this is only for the few files already out-of-sync before the failover
[01:42:21] fun fun fun
[01:42:39] ori-l: I didn't
[01:42:45] Ryan_Lane: K, will do then.
[01:42:46] ouch
[01:42:51] ori-l: awesome. thanks!
[01:42:56] hopefully won't take more than a week though
[01:43:03] probably not
[01:43:06] having had to deal with this crap so long things have been optimized :)
[01:43:14] the thumb ones take a day or two
[01:43:22] for a week's delta or so
[01:43:39] thumbs are smaller of course, but they're also usually more
[01:44:28] so supposedly dumpling, i.e. the August release, will have georeplication
[01:44:35] (and of course that slow peering fix)
[01:45:00] but my guess based on their track record is that it won't be ready for consumption until at least their next release
[01:45:06] which is Aug+3 months
[01:45:15] = November
[01:45:49] New patchset: Cmjohnson; "adding cp1045-55 to netboot and dhcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68611
[01:45:59] swift is also still working on georepl afaik
[01:46:39] so, we're stuck with multiwrite for a little while longer
[01:47:03] you don't trust ceph by itself? :)
[01:47:24] um
[01:47:26] I guess not
[01:47:42] maybe we should just use the python scripts in the background and have the sync scripts in a tight loop and not use multiwrite?
[01:47:43] paravoid..can you look at my patchset ...i consolidated the netboot cfg for cp10xx and want a 2nd set of eyes plz.
[01:47:54] that might honestly be simpler (and faster)
[01:48:03] one ceph cluster at least
[01:49:11] an alternative would be to journal anything that isn't generated on demand
[01:49:35] i.e. everything but thumbs, unless we can add other categories as autogenerated
[01:50:39] and then ignore the small delta and have something periodically sync that (the python scripts as you say) in a not-so-tight loop
[01:50:49] as long as deletes are propagated it should be fine
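Editor's note: the "sync scripts in a not-so-tight loop" idea, sketched as a shell wrapper; sync_container.py is a placeholder for the existing python sync scripts, whose real invocation is not shown in the log:

    #!/bin/sh
    # hypothetical wrapper: keep the swift->ceph delta small by periodically
    # re-syncing the auto-generated containers; deletes must be propagated,
    # and originals are deliberately excluded
    while true; do
        for c in thumbs timeline transcoded math score captcha; do
            python sync_container.py "global-${c}"   # new files, deletes, etag mismatches
        done
        sleep 3600
    done
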
[01:51:53] cmjohnson1: um, it adds up to 1065
[01:52:01] cmjohnson1: the commit msg says up to 1055
[01:52:18] cp102[1-9]|cp10[3-5][0-9]|cp106[0-5]|...
[01:52:43] that's 1021-1029, 1030-1059, 1060-1065
[01:52:45] paravoid...yeah..i have the additional servers that will be racked tomorrow or monday..since I was making the change i added them...should i remove or change msg
[01:52:52] no that's fine
[01:52:57] as long as you're aware of it
[01:53:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68611
[01:53:19] i am..thx for reviewing
[01:53:41] merged
[01:53:51] oh..great thank you!
[01:54:43] ran puppet on brewster too, so you're ready to go
[01:54:53] and I'm ready to go to bed too, good night/evening :)
[01:55:12] yep...either way...good night
[01:58:02] has anyone been watching memcached-serious.log since twemproxy was deployed?
[01:59:47] there's a lot of "bad key" errors
[02:05:58] !log LocalisationUpdate completed (1.22wmf6) at Fri Jun 14 02:05:57 UTC 2013
[02:06:07] Logged the message, Master
[02:11:02] !log LocalisationUpdate completed (1.22wmf7) at Fri Jun 14 02:11:02 UTC 2013
[02:11:10] Logged the message, Master
[02:11:18] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused
[02:17:44] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jun 14 02:17:44 UTC 2013
[02:17:53] Logged the message, Master
[02:19:35] !log dns update
[02:19:43] Logged the message, Master
[02:24:19] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[02:25:01] TimStarling: twemproxy was deployed yesterday, right? the number of BAD KEYs doesn't seem too unusual. It's at 3,393 at the moment (the log is about twenty hours old). it's been unusually high since late may: https://dpaste.de/60vSu/raw/
[02:37:50] TimStarling: they're also almost all from commonswiki
[02:38:39] 68,849 'BAD KEY' errors in may & june, 63,465 from commonswiki.
[02:47:59] I think User:Fæ on commonswiki has been uploading lots of pictures from some DoD trove that has very long filenames
[02:49:00] TimStarling, pretty sure that's it.
[02:50:49] ori-l I gotta take off, the ganglia errors graph looks fine and the server-side events are kenny Loggin' away
[02:52:51] spagewmf: cool, ciao!
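Editor's note: memcached rejects keys longer than 250 bytes (or containing spaces/control characters), which is how keys derived from very long Commons file names end up as "BAD KEY" errors. A common workaround is to hash over-long keys — a sketch, not what MediaWiki actually does:

    import hashlib

    MAX_MEMCACHED_KEY = 250  # memcached's hard limit on key length, in bytes

    def safe_key(raw_key):
        # memcached keys may not contain whitespace
        key = raw_key.replace(' ', '_')
        if len(key.encode('utf-8')) <= MAX_MEMCACHED_KEY:
            return key
        # fall back to a fixed-length digest for keys built from long titles
        return 'sha1:' + hashlib.sha1(key.encode('utf-8')).hexdigest()
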
[03:13:27] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused
[03:15:47] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused
[03:15:57] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused
[03:22:43] * RoanKattouw looks at those Parsoid boxes
[03:26:47] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time
[03:26:57] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[03:27:28] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[03:30:28] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:31:17] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time
[03:33:07] PROBLEM - Parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:37:51] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[03:43:41] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:41] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:41] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:41] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:41] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:42] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:42] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:43] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:43] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:44] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:44] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[03:43:45] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[04:32:12] !log updating Parsoid dependencies from config repository
[04:32:20] Logged the message, Master
[04:45:06] !log updated Parsoid to 64921430b1
[04:45:16] Logged the message, Master
[05:00:33] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:02:03] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:10:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:11:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time
[05:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.186 second response time
[05:24:46] https://gerrit.wikimedia.org/r/#/c/33713/ : What Tampa host should call that job?
[05:27:20] Ah, the good old hume. :)
[05:30:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:32:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[05:53:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:54:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[06:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[06:40:51] New patchset: Tim Starling; "Sync w at the same time as docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64449
[06:40:59] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64449
[07:14:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:15:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[07:27:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:28:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[08:01:50] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.002644777298 secs
[08:12:10] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[08:17:10] New patchset: Nemo bis; "Add Klaus Graf to the German Planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68621
[08:32:39] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.008009552956 secs
[09:38:38] New review: Akosiaris; "I too think we are almost ok. Just the architecture and the two dependencies and we are good to go." [operations/debs/buck] (master) C: -1; - https://gerrit.wikimedia.org/r/67999
[11:23:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428
[11:48:18] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused
[11:55:42] New patchset: Mark Bergsma; "Move monitor_group statements to the role manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68633
[11:55:42] New patchset: Mark Bergsma; "Remove resources for the obsoleted Perl HTCP purger daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68634
[11:56:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68633
[11:57:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68634
[12:00:38] New patchset: Mark Bergsma; "Remove old dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68635
[12:01:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68635
[12:09:43] New patchset: Mark Bergsma; "Remove unused upstart version of varnishncsa udploggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68637
[12:10:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68637
[12:12:16] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[12:17:07] New patchset: Mark Bergsma; "Remove default instance instantiation, we don't use it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68638
[12:19:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68638
[12:21:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:22:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.280 second response time
[12:27:25] New patchset: Mark Bergsma; "Use quoted tabs instead of tab characters in the format string" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68640
[12:33:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68640
[12:44:50] New patchset: Mark Bergsma; "Remove embedding of classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68643
[12:45:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68643
[12:53:59] New patchset: Mark Bergsma; "Convert tab indent to 4-spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68644
[12:55:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68644
[12:59:57] New patchset: Mark Bergsma; "Rename resources with dashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68646
[13:00:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68646
[13:07:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:10:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[13:12:11] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused
[13:15:11] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time
[13:17:13] New patchset: Nikerabbit; "Narayam and WebFonts were replaced with ULS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68648
[13:17:44] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused
[13:18:54] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused
[13:19:24] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused
[13:23:54] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[13:24:14] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused
[13:32:24] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.038 second response time
[13:34:59] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[13:36:33] pep8 :-P
[13:39:27] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time
[13:40:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:07] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[13:41:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time
[13:43:46] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:47] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:48] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:49] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:49] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:49] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:43:50] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[13:46:15] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[13:47:07] hi MaxSem
[13:47:12] yo
[13:47:45] woke up?
[13:49:06] haha
[13:51:19] MaxSem: funny?
[13:51:43] I'm +2 hours from your TZ
[13:51:57] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:52:02] MaxSem: but you usually appear US time :)
[13:52:16] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time
[13:55:16] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused
[13:57:16] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused
[13:57:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:58:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[14:00:23] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[14:09:19] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[14:09:19] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time
[14:18:29] New patchset: Odder; "(bug 49575) Set up $wgImportSources for vec.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68655
[14:23:42] New patchset: Diederik; "Simplify regex by removing duplicate language code for 'ar'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68657
[14:24:49] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[14:26:49] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused
[14:28:55] * AzaToth made a popo and might need some help
[14:30:05] Tested blocking a test account for 100 millennia, and can't unblock
[14:30:07] http://en.wikipedia.org/w/index.php?title=Special:Log/block&page=User%3ADeskanaTest
[14:30:18] Error: Block ID DeskanaTest not found. It may have been unblocked already.
[14:30:43] * AzaToth hides in shame
[14:31:49] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time
[14:34:40] Nikerabbit: ping
[14:37:33] Does anyone know if it's possible to have a 'require' in the private puppet repo that refers to a class in the public repo?
[14:39:33] (NEW) blocking for 100000 years doesn't create a block but fills block log - https://bugzilla.wikimedia.org/49580 normal; MediaWiki: User blocking; ()
[14:39:35] ツ
[14:40:20] if the public class or whatever is included/defined/whatever by the time your private repo piece sees it, I don't see why not
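Editor's note: the answer at 14:40:20, spelled out — a private-repo class can require a class from the public repo as long as the public manifests are on the master's modulepath. The names below are made up for illustration:

    # in the private puppet repo
    class private::rt_mail {
        # class defined in the public operations/puppet tree
        require exim::rt

        file { '/etc/exim4/rt-private.conf':
            ensure => present,
            mode   => '0440',
        }
    }
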
[14:42:38] AzaToth: ?
[14:43:13] Nikerabbit: our issue was solved when we realized that blocks had never been made and only the block log had been filled
[14:43:21] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused
[14:43:24] filed a bug for it
[14:43:48] ou
[14:44:09] 3 millennia, 1 century, 6 decades, 8 years, 319 days, 4 hours, 1 minute and 3 seconds (99999999999 seconds) works
[14:44:55] ou
[14:44:59] and what is that in unix timestamp?
[14:46:00] New patchset: Jforrester; "Temporarily disable VisualEditor completely in prod" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68667
[14:46:11] apergos: ^^^ Please +2 and deploy.
[14:46:17] azatoth@azaboxen:~$ date -d "now + 99999999999 seconds" +%s
[14:46:18] 101371221169
[14:46:20] looking
[14:46:30] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused
[14:46:39] ah so you referenced the bug number, I was about to ask, goo
[14:46:40] d
[14:46:54] Nikerabbit: ↑
[14:47:30] New review: ArielGlenn; "temporary emergency measure" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/68667
[14:47:30] Change merged: ArielGlenn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68667
[14:47:55] Nikerabbit: it's Fri Apr 30 02:34:22 CEST 5182 by the way
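Editor's note: the arithmetic above, checked in Python — 99999999999 seconds lands in the year 5182, while 100,000 years overflows a four-digit year, which is presumably why the 100-millennia block was logged but never created (bug 49580):

    from datetime import datetime, timedelta

    expiry = datetime(2013, 6, 14) + timedelta(seconds=99999999999)
    print(expiry)  # lands in the year 5182, as Nikerabbit was told above

    try:
        datetime(2013, 6, 14) + timedelta(days=100000 * 365)
    except OverflowError as err:
        # Python's datetime tops out at year 9999; MediaWiki's 14-digit
        # YYYYMMDDHHMMSS timestamps hit a comparable wall (an assumption here)
        print('out of range:', err)
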
[14:53:12] and we're off
[14:53:31] !log ariel synchronized wmf-config/InitialiseSettings.php 'emergency disabling of Visual Editor, see bug 49577'
[14:53:39] copied...
[14:53:39] Logged the message, Master
[14:53:59] New patchset: Andrew Bogott; "Don't redefine sudo-ldap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68670
[14:57:06] can someone who knows how to check, check if ve is actually disabled?
[15:01:20] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time
[15:01:28] ah nm, I see in the 'usability features' that it doesn't show up
[15:02:43] so *cough* hate to deploy and run but in fact... I have to run. I will be back in about 1 hour, but am reachable via sms/phone/page if something is wrong and can run back pretty fast (~ 10 min)
[15:08:35] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[15:10:19] apergos: No worries at all; we're not getting this fixed in the next hour anyway.
[15:13:16] I'm rolling back the dependency update
[15:13:25] should be fixed after that
[15:13:53] !log working mw1041 updating bios
[15:13:55] !log rolled back config update to investigate issues
[15:14:01] Logged the message, Master
[15:14:09] Logged the message, Master
[15:16:37] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100%
[15:21:45] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[15:25:15] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:38] !log carbon going offline to run h/w test at Dell support request.
[15:25:45] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[15:25:46] Logged the message, Master
[15:28:45] PROBLEM - Host carbon is DOWN: CRITICAL - Host Unreachable (208.80.154.10)
[15:39:09] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[15:41:29] PROBLEM - SSH on carbon is CRITICAL: Connection refused
[15:42:54] gwicke: fyi i am going to be swapping cpu1 with cpu2 on wtp1008
[15:43:11] !log powering down wtp1008 to swap cpu's
[15:43:14] cmjohnson1: ok
[15:43:20] Logged the message, Master
[15:45:30] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:33] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[15:53:09] PROBLEM - NTP on carbon is CRITICAL: NTP CRITICAL: No response from NTP server
[15:55:29] RECOVERY - SSH on carbon is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[15:55:59] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[16:00:22] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[16:01:55] New patchset: Mark Bergsma; "Move manifests/varnish.pp to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[16:04:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68653
[16:07:17] New patchset: Mark Bergsma; "Move htcppurger.pp to the right dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68681
[16:07:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68681
[16:10:07] RECOVERY - NTP on carbon is OK: NTP OK: Offset 0.0121307373 secs
[16:12:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68657
[16:12:59] back
[16:15:40] apergos: btw thanks much for helping James_F last night with the parsoid, er, meltdown
[16:15:57] well that was just now actually (an hour ago)
[16:16:10] and yw, I only wish I could have figured it out when I saw the bug report come in
[16:16:24] * greg-g nods
[16:16:33] I hate when stuff is critically broken and I don't know enough to fix it
[16:17:05] you must be trained in everything from now on.
[16:17:12] apergos: We want Parsoid to be so boring we don't need to document this, but yes, we should document it much more.
[16:17:32] I did look at the page already there, which is how I at least knew which servers to stare at
[16:17:44] so it's better than some subsystems which have or had basically zero docs
[16:18:05] paravoid: ping
[16:18:26] apergos: Sure, but we should expect new-build systems to be much better documented than the crap we're replacing. :-)
[16:18:32] ah I have been wondering what was up with wtp1008... ( cmjohnson1 )
[16:18:44] I'm all for that James_F :-)
[16:19:30] apergos: A few of the wtp1...s are flaky, it seems.
[16:20:16] well (see my comment on whatever ticket) the cpu step stuff might be not very informative, there's a linux kernel bug about that
[16:20:28] but that host in particular looks oddly behaved to me based on ganglia
[16:20:54] apergos: wtp1008 issue seems to be h/w related
[16:22:06] apergos: you can also always mail Roan & me
[16:22:34] New patchset: AzaToth; "Initial debian build" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/67999
[16:23:15] what tz are you, gwicke?
[16:23:34] cmjohnson1: ok I will be curious to see what it turns out to be in the end
[16:23:48] I'm in SF, but would have read the mail when I woke up at 4am ;)
[16:24:08] or before I went to bed
[16:24:53] ewww 4am
[16:24:56] sorry to hear it
[16:26:46] cmjohnson1: are those issues isolated to wtp1008, or is this the general thermal warning issue we saw on the other machines too?
[16:27:15] this issue is only on wtp1008
[16:27:26] k
[16:27:54] i swapped the cpu's now waiting to see if the error presents itself again
[16:47:06] lol, why " < wikibugs_> (mod) blocking for 100000 years doesn't create a block but fills block log "
[16:52:21] New patchset: Mark Bergsma; "Remove unused file thumbs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68686
[16:52:21] New patchset: Mark Bergsma; "Remove obsolete Perl HTCP purger daemon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68687
[16:53:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68686
[16:53:27] mutante: I thought first I couldn't unblock, but realized that it actually didn't do a block
[16:53:46] it only filled the block log
[16:53:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68687
[16:54:23] AzaToth: ok:) i was just wondering why one would want to block for 100000 years
[16:56:29] mutante: which is the maximum you can actually block someone for?
[16:57:07] eternity
[16:57:23] MaxSem: non-indef
[16:58:31] mutante: 3 millennia, 1 century, 6 decades, 8 years, 319 days, 4 hours, 1 minute and 3 seconds (99999999999 seconds) works
[16:59:09] let the next millennium people care about it ?:)
[17:00:16] mutante: well, the block log still says "blocked for 100 millennias"
[17:01:07] you should change it to the "99999999999 seconds" :)
[17:01:37] mutante: what I meant is that it records the block log without verifying that it actually made a block
[17:01:41] they will all calculate how much it actually is
[17:02:06] ah
[17:04:47] New patchset: Odder; "(bug 49312) Add a 'Programs' namespace to metawiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68690
[17:28:55] New patchset: Catrope; "Re-enable VisualEditor on mediawiki.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68694
[17:31:17] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68694
[17:31:58] !log catrope synchronized wmf-config/InitialiseSettings.php 'Re-enable VE on mw.org'
[17:32:06] Logged the message, Master
[17:36:15] ^demon: how do you setup your VM?
[17:36:28] <^demon> I just fired up a new VM in labs.
[17:36:33] <^demon> It's called gerrit-build.
[17:36:41] oh
[17:37:01] could you throw one up for me so I can test the build?
[17:37:17] as I'm using debian here, some stuff could be different
[17:37:21] <^demon> Yeah, lemme give you access to it.
[17:37:25] k
[17:38:03] can a gerrit patchset be made for a different branch than master ( what I mean is, can the gerrit patchset, be merged into a different branch than master) ?
[17:38:21] average_drifter: yea
[17:38:31] average_drifter: git review
[17:42:26] AzaToth: thanks
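Editor's note: for the record, git-review takes the target branch as a positional argument; the repository and branch names below are only examples:

    # work on a topic branch based on the target branch
    git checkout -b bugfix origin/REL1_22
    # ...edit and git commit...
    git review REL1_22    # pushes to refs/for/REL1_22 instead of refs/for/master
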
[17:50:01] I would like more permissions on the analytics/dclass.git repo please
[17:50:14] I'm finding myself often in the situation of needing to delete branches or rename them
[17:58:05] ^demon: /etc/default/gerrit or /etc/default/gerritcodereview?
[18:00:25] !log aaron synchronized php-1.22wmf7/maintenance/purgeDeletedFiles.php '9e2ffededde6e7752a8bd64d2ae791af768213c0'
[18:00:35] <^demon> AzaToth: Whichever gerrit's init script wants. I think it used to be one but switched to the other.
[18:00:36] Logged the message, Master
[18:03:48] apergos: Are you around to help us out one more time?
[18:03:54] yes
[18:03:56] what's up?
[18:04:03] I'd ask Les lie as she's on RT duty but she's away today
[18:04:13] New patchset: GWicke; "Revert "Use pass to avoid caching in frontend, refactoring, explanatory comments"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68707
[18:04:13] So we think the problem isn't actually with Parsoid or VE, but with https://gerrit.wikimedia.org/r/#/c/68428/6
[18:04:24] ....which Gabriel submitted a revert for right there
[18:04:34] Can we get that revert merged and deployed?
[18:04:45] RoanKattouw: https://gerrit.wikimedia.org/r/68707
[18:04:50] oh, is this what went live at around... 11:37 or so utc?
[18:05:02] Basis for suspicion: Mark merged the change at 4:23 PDT, the last good VE diff was at 4:26 PDT, the first bad diff at 4:37 PDT
[18:05:06] So yeah exactly
[18:05:10] Merged at 11:23 UTC
[18:05:26] haha this is the one thing I could add to the bug (but no idea how to determine if it was the cause)
[18:05:29] We blamed Parsoid changes but they went live many hours before
[18:05:56] so wait has the library dependency etc all been fixed and tested and yet ve is still not working right?
[18:06:15] Yeah we rolled back Parsoid and things are still broken
[18:06:15] ugh
[18:06:15] ok
[18:06:36] There was an initial issue with the revert operation being broken in git-deploy, but we've now verified that the right code is deployed and all the processes have been restarted and still no dice
[18:06:39] the code and libraries we deployed yesterday all work fine
[18:07:33] so let me ask, was there anything else happening to parsoid varnish along with this change to the vcls?
[18:07:44] apergos: no
[18:08:04] the issue is that the patch disables default processing, which also makes sure that POST requests are not cached
[18:08:04] so this is an isolated change?
[18:08:21] so now POST requests get a cached result corresponding to a GET
[18:08:35] apergos: yes
[18:08:40] depends on nothing else
[18:08:44] gwicke: That's kind of what I suspected but I'm disappointed that that's actually what it is
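Editor's note: in vcl_recv terms, the safety net that the offending patch bypassed is roughly the following (a sketch of Varnish 3's builtin behaviour, not the actual WMF VCL):

    sub vcl_recv {
        # Varnish's builtin VCL only considers GET and HEAD cacheable;
        # everything else -- POST included -- must go to the backend.
        # Skipping this check is what served cached GET bodies to POSTs.
        if (req.request != "GET" && req.request != "HEAD") {
            return (pass);
        }
        return (lookup);
    }
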
[18:09:55] I will be around until about 5pm PDT with a quick break to go to the post office [18:10:21] all right I'll push this out now [18:10:22] I have root so I can theoretically deploy things myself and bypass ops altogether, but I really don't want to if it's not absolutely necessary [18:10:23] sec [18:10:45] no, it's better for one of use to be around etc, you made the right call [18:10:56] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68707 [18:11:05] Thanks man [18:11:14] Let me know when that's on sockpuppet and I'll run puppet on the affected hosts [18:11:15] sec [18:12:28] grr unmerged stuff [18:12:57] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [18:13:27] mark, I'm pushing out your 'remove thumbs.pp' change [18:14:38] ok, should be there RoanKattouw [18:16:56] RoanKattouw: I still get HTML in a diff on MW.org VE - would it take some time? [18:17:13] he has to run puppet first [18:18:48] Running now [18:19:05] Ah, good excuse. :-) [18:19:16] isn't it though :-) [18:20:40] and we are back [18:20:42] mark, paravoid, i have posted a new, simplified (and short) proposal re zero architecture to wikitech-l. Please comment :) [18:22:29] New review: Catrope; "This change completely broke VisualEditor because it caused Varnish to return cached responses for P..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68428 [18:22:43] so how does it look? [18:23:05] apergos: all is fine again [18:23:12] RoanKattouw: :) [18:23:15] Yeah. [18:23:16] excelleent [18:23:26] Yay [18:23:48] Hmm. Regressions in other things, though. [18:24:02] RoanKattouw, gwicke: Can we re-update Parsoid to master then? [18:24:15] I guess we should ask folks in wm-tech to watch edits with VE and see hwo they are, juuuust in case [18:24:16] James_F: yes, will do [18:24:21] also back to latest libraries [18:24:27] gwicke: Brilliant. :-) [18:24:41] apergos: Well, and switch VE on for the rest of the cluster as before. :-) [18:24:46] ah yes :-D [18:26:49] will you be wanting me for that? [18:27:33] I'll do it [18:27:49] apergos: Thanks for your help, and enjoy your evening :) [18:27:59] yw [18:28:12] I might wander off in a little while then (not right away) [18:28:19] have a good rest of your day! [18:28:41] New patchset: Catrope; "Re-enable VisualEditor everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68716 [18:29:02] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68716 [18:29:04] apergos: thanks much, again [18:30:22] !log catrope synchronized wmf-config/InitialiseSettings.php 'Re-enable VisualEditor on the rest of the wikis' [18:30:30] Logged the message, Master [18:34:01] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [18:34:02] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [18:34:41] paravoid: ping [18:37:08] paravoid: got problem building buck on a vm: http://paste.debian.net/10425/ [18:37:25] have never encountered such issue before and I've no clue what's wrong [18:37:57] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68670 [18:42:04] ^demon: you got same issue? [18:42:42] <^demon> I didn't have that problem with buck, no. 
The latest patch I tried last night *seems* to have worked, minus the args4j thing I mentioned. [18:42:59] I've no idea what's funking [18:43:12] looks like the directory is modified when tar tried to read it [18:43:34] I'm building it in my hope btw [18:43:37] home [18:44:31] mutante, can you review/test https://gerrit.wikimedia.org/r/#/c/68011/ when you have a moment? It makes the http/https change that you made earlier; I'm not clear on whether that will break things on the current rt host. [18:44:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68621 [18:45:22] ^demon: I think I've seen that on sufficiently broken filesystems. [18:45:39] <^demon> Oh, hrm, I did hit that originally yes. [18:45:51] !log updated Parsoid config and code back to latest after finding and fixing Varnish config bug [18:45:53] <^demon> When I was working on /home and not local storage. [18:45:57] oh [18:45:59] Logged the message, Master [18:46:02] <^demon> /home is NFS, I believe. [18:46:03] home is fubar? [18:46:11] <^demon> Or something [18:46:21] projectstorage.pmtpa.wmnet:/gerrit-home on /home type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072) [18:46:32] andrewbogott: it will work after the upgrade but not before [18:46:39] uh glusterfs? [18:46:42] is this labs? [18:46:53] andrewbogott: regarding the https link for rt-mailgate [18:46:57] <^demon> apergos: Yup, labs [18:47:02] so don't do that [18:47:06] you want to apply um [18:47:12] glusterfuck [18:47:18] mutante: ok, hm, maybe I should make two versions then [18:47:51] role::labsnfs::client [18:47:52] um? [18:47:53] that. [18:48:02] andrewbogott: brb, looking for my own notes [18:48:05] <^demon> apergos: I just ended up building on local instance storage :) [18:48:05] then run puppet, then you will have to reboot the instance afterwards [18:48:12] and you will have nfs instead of gluster for home [18:48:22] that works too ^demon [18:48:24] apergos: gluster is bad? [18:48:25] mutante, no need to interrupt current work for this, just trying to get it on your radar [18:48:28] but this is how you can convert them [18:48:39] <^demon> AzaToth: Gluster is many things. Bad is one of them :) [18:48:44] oh [18:48:53] why do you use gluster then? [18:48:53] gluster is sometimes problematic, so it's good if you run into issues to move off of it [18:49:11] because back in the day we didn't know it was going to present these issues [18:49:17] oh [18:49:39] ^demon: you do as apergos said you should do? [18:49:47] <^demon> No, I didn't. [18:49:54] you will do? [18:50:27] <^demon> Lemme finish my lunch :) [18:50:30] hehe [18:51:09] afaik, gluster usually broke when there were version updates of gluster itself [18:51:40] there have been performance problems too iirc [18:54:54] mutante: OK, switched back to http for now… you will probably have to remind me to change it to https when we do the upgrade. [18:55:44] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [18:55:44] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [18:59:02] andrewbogott: so, quite a few related tickets to this: RT-5167 was when we tested https rt-mailgate with the new version and it worked.
RT-5169 will be resolved after we use https, RT-714 was a way older one where i tried a fix but it couldn't support https yet, and abandoned in gerrit change 2446. then there is mail thread "[Ops] redirecting all RT traffic to https" from 2012 where Asher fixed redirects for rt-mailgate <-> REST API .. an [18:59:35] andrewbogott: so short version, we confirmed it on 4.x and it looked like it would also on 3.8.x but it didn't [19:00:43] andrewbogott: will do, just took "#5169: make rt-mailgate use https instead of http" [19:01:23] paravoid: hi, when you have time please consider looking at https://gerrit.wikimedia.org/r/68711 [19:07:12] andrewbogott: fyi, see "Links" section on #2240 now [19:07:37] trying to find them all [19:09:19] New review: Dzahn; "comment just regarding the rt-mailgate change to use https instead of http. This should work after R..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [19:22:17] mutante, sorry, was making lunch… but that all sounds good. I'm not sure if it makes sense to merge my patch before we migrate, but I'm hoping you'll test it on magnesium to make sure that it works there at least. [19:24:51] hey mutante, any news from an20? [19:27:47] !log rebooting es7 to deal with messed up lvm snapshot [19:27:55] Logged the message, notpeter [19:29:56] !log installing updates on es7 while I'm at it [19:30:05] Logged the message, notpeter [19:30:16] drdee: not really, i couldn't get it to boot to an installer, progress is that DHCP issues can be ruled out, it does get a DHCPACK, but then it hangs [19:31:35] PROBLEM - mysqld processes on es7 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:31:58] andrewbogott: we already did for the rt-mailgate protocol part and i see Mark gave the exim part a +1 , sounds good so far [19:32:20] !log deploying small changes to bugzilla login screen [19:32:28] Logged the message, Master [19:32:37] andre__: ^ it is "email address" now [19:32:53] mutante, \o/ [19:33:55] PROBLEM - DPKG on es7 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:35:50] drdee: it might be the same thing again .. that the udp2log stuff causes this, because it was so inconsistent, one time i saw an installer and then i couldn't repeat it ever [19:36:33] ugghh [19:36:35] if there is a large firehose of data going at it [19:36:36] it's likely [19:36:44] is there? [19:37:18] maybe…..but not from udp2log, maybe it is still receiving traffic from hadoop [19:37:29] sure, any firehose could mess this up ;) [19:37:50] it's a hadoop worker node, udp2log is not aimed at that machine AFAIK [19:38:03] PROBLEM - Host es7 is DOWN: PING CRITICAL - Packet loss = 100% [19:38:05] but is anything shooting a lot of data at it?
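A host that gets a DHCPACK and then hangs during the install is hard to debug blind, so the firehose theory above can be tested by watching what actually arrives at the box. A sketch only — the interface name and target IP are hypothetical, and it has to run somewhere that can see the traffic (a mirror port, or the host itself once it has a shell):

    # Sample a few thousand packets headed at the stuck host and count them
    # per source; a single source dominating the tally is the firehose.
    TARGET=10.64.0.1   # hypothetical IP of the stuck install target
    tcpdump -n -c 5000 -i eth0 dst host "$TARGET" 2>/dev/null \
        | awk '{print $3}' | sort | uniq -c | sort -rn | head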
[19:38:16] wouldn't necessarily need to be udp2log traffic [19:38:35] it shouldn't be the case but ottomata is not around to ask [19:38:43] gotcha [19:43:13] RECOVERY - Host es7 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [19:45:23] PROBLEM - RAID on es7 is CRITICAL: Timeout while attempting connection [19:45:23] PROBLEM - Full LVS Snapshot on es7 is CRITICAL: Timeout while attempting connection [19:45:24] PROBLEM - MySQL Slave Running on es7 is CRITICAL: Timeout while attempting connection [19:45:33] PROBLEM - MySQL disk space on es7 is CRITICAL: Timeout while attempting connection [19:45:34] PROBLEM - MySQL Idle Transactions on es7 is CRITICAL: Timeout while attempting connection [19:45:34] PROBLEM - Disk space on es7 is CRITICAL: Timeout while attempting connection [19:45:34] PROBLEM - MySQL Recent Restart on es7 is CRITICAL: Timeout while attempting connection [19:45:34] PROBLEM - SSH on es7 is CRITICAL: Connection timed out [19:46:04] PROBLEM - MySQL Replication Heartbeat on es7 is CRITICAL: Timeout while attempting connection [19:46:15] PROBLEM - MySQL Slave Delay on es7 is CRITICAL: Timeout while attempting connection [19:47:18] notpeter: es7 wants lunch :) [19:48:14] RECOVERY - Full LVS Snapshot on es7 is OK: OK no full LVM snapshot volumes [19:48:15] RECOVERY - RAID on es7 is OK: OK: State is Optimal, checked 2 logical device(s) [19:48:24] RECOVERY - MySQL Slave Running on es7 is OK: OK replication [19:48:33] RECOVERY - SSH on es7 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:48:33] RECOVERY - MySQL Idle Transactions on es7 is OK: OK longest blocking idle transaction sleeps for seconds [19:48:33] RECOVERY - MySQL Recent Restart on es7 is OK: OK seconds since restart [19:48:33] RECOVERY - Disk space on es7 is OK: DISK OK [19:48:33] RECOVERY - MySQL disk space on es7 is OK: DISK OK [19:49:03] RECOVERY - MySQL Replication Heartbeat on es7 is OK: OK replication delay seconds [19:49:13] RECOVERY - MySQL Slave Delay on es7 is OK: OK replication delay seconds [19:50:03] RECOVERY - mysqld processes on es7 is OK: PROCS OK: 1 process with command name mysqld [19:51:53] RECOVERY - DPKG on es7 is OK: All packages OK [19:55:14] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [19:57:23] PROBLEM - SSH on labstore4 is CRITICAL: Connection refused [19:57:43] PROBLEM - RAID on labstore4 is CRITICAL: Connection refused by host [20:07:23] ^demon: done? [20:08:51] <^demon> Ah yes, I got sucked into hiphop again :) [20:09:04] hehe [20:09:12] building in var now then [20:09:33] 1000% faster [20:09:34] <^demon> Gotta package hiphop too, but shouldn't be too hard actually. [20:09:46] hiphop? [20:09:49] <^demon> hhvm. [20:10:01] <^demon> https://github.com/facebook/hiphop-php [20:10:09] faceboook.... [20:10:12] again [20:10:29] <^demon> As of ubuntu 13.04, all the dependencies are in apt except 1. [20:11:06] <^demon> AzaToth: Well, buck wasn't a choice ;-) [20:12:34] ^demon: fwiw, facebook also released new selenium-webdriver bindings this week that they claim don't suck nearly as bad as the last time they released webdriver for PHP https://github.com/facebook/php-webdriver [20:12:59] <^demon> Heh, selenium. [20:13:17] selenium is great! just not in PHP [20:14:35] ok, now my java is a bit funky [20:14:48] <^demon> chrismcmahon: So, I've been meaning to ask you...is the selenium stuff in tests/selenium/* in core actually of any use?
It's a ton of messy code that violates tons of coding conventions (and pollutes my grepping ;-)) and if nobody's using it anymore I wouldn't mind getting rid of it. [20:14:50] Exception in thread "main" java.lang.UnsupportedClassVersionError: com/facebook/buck/cli/Main : Unsupported major.minor version 51.0 [20:15:02] how can I specify to use java7? [20:15:11] <^demon> AzaToth: Sounds like you're trying to compile buck with java6. [20:15:25] I'm using "java" [20:15:25] <^demon> Yeah, I basically just only install jdk7. [20:15:33] ^demon: I vote to get rid of it. I don't think anyone will ever use it, all our active Se stuff is in /qa/browsertests [20:15:55] ^demon: it has to be terribly out of date and as I understand it, never really worked in the first place [20:16:18] <^demon> Indeed. [20:16:30] before my time [20:17:49] ^demon: gerrit must be run using java6? [20:18:07] <^demon> At the moment it's running on the java6 jre yeah. [20:18:18] <^demon> I figured we should probably move to java7 since our build chain requires it now. [20:18:48] I would suggest that [20:19:01] sudo update-java-alternatives -s java-1.7.0-openjdk-amd64 [20:20:46] wh00t wh00t — http://www.itwire.com/business-it-news/open-source/60292-red-hat-ditches-mysql-switches-to-mariadb [20:21:14] <^demon> chrismcmahon: https://gerrit.wikimedia.org/r/#/c/68729/ [20:21:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [20:28:13] ^demon: I commented and added Zeljko as reviewer. He presented at SeleniumConf in Boston this week, he should be back at work on Monday [20:36:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [20:37:20] ^demon: works mostly now, just needs a fix in the init file for it to install properly [20:37:48] <^demon> Lemme purge buck from the system and try [20:37:59] ok [20:38:12] on gerrit-build-fresh or gerrit-build? [20:38:29] gonna change gerrit to jdk7 as well [20:38:58] done [20:39:18] ^demon: why is gerrit not spammy in here via icinga-wm? [20:39:44] <^demon> I never got around to setting up any icinga alerts for it. [20:39:44] <^demon> Would be nice [20:40:02] ok, so it's not on purpose [20:40:17] <^demon> Oh no, it's just "I dunno how and never had time to ask" [20:40:18] <^demon> :) [20:40:42] * AzaToth so stupid, I meant via gerrit-wm [20:40:54] * AzaToth goes hiding in the largest hole [20:41:10] <^demon> :) Oh, I never configured it to point to here. It should. [20:41:14] <^demon> Default is #-dev.
[20:41:22] ツ [20:43:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:44:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [21:02:15] !log rebooting es10 to deal with fucked up snapshot volume [21:02:24] Logged the message, notpeter [21:02:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:03:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [21:04:29] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [21:07:51] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 26.95 ms [21:24:17] New review: Dzahn; "in PS6 an unrelated change to a planet config sneaked in" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [21:24:29] New patchset: Pyoungmeister; "temporarily removing es1007 and es1010 for reboots" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68735 [21:24:38] AaronSchulz: ^ [21:24:53] do you know of any magical, archaic reason why that would break shit? [21:25:22] or does it work just the same as the regular db boxxies? [21:27:16] I don't see any problem AFAIK [21:27:34] !log mholmquist Started syncing Wikimedia installation... : Fixing UploadWizard in Chrome [21:27:37] cool. thanks! [21:27:43] Logged the message, Master [21:28:04] notpeter: you can't do 0 load either? [21:28:23] oh, that would probably work too [21:29:04] * AaronSchulz wonders if 0 load hosts are included in slave lag checks for readerIndex() [21:29:33] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68735 [21:29:49] ^demon: success? [21:30:28] !log py synchronized wmf-config/db-eqiad.php 'temp depooling es1007 and es1010 for reboots' [21:30:38] Logged the message, Master [21:30:39] huh, seems like the errors are ignored though, which is OKish [21:31:06] plus those checks should be rare now...being mutexed lately [21:32:10] !log reboot es1007 to deal with messed up lvm snapshot [21:32:19] Logged the message, notpeter [21:32:43] I really hate it that I've spent many many many hours trying to figure out how to fix those without rebooting, and have thus far come up empty handed [21:34:11] PROBLEM - Host es1007 is DOWN: CRITICAL - Plugin timed out after 15 seconds [21:35:01] !log mholmquist Finished syncing Wikimedia installation... : Fixing UploadWizard in Chrome [21:35:15] Woo [21:35:16] Logged the message, Master [21:35:19] * marktraceur tests [21:36:01] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [21:36:37] notpeter: hello [21:36:49] AaronSchulz: hey [21:39:39] <^demon> AzaToth: Tried with master, got Exception in thread "main" java.lang.UnsupportedClassVersionError: com/facebook/buck/cli/Main : Unsupported major.minor version 51.0 again.
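For context on the es reboots being logged above: a wedged LVM snapshot can sometimes be cleared without a reboot, though notpeter notes that on these hosts every such attempt came up empty. The usual sequence, with hypothetical volume group and snapshot names:

    lvs -a                           # a snapshot at 100% usage has been invalidated
    lvremove /dev/es-vg/db-snap      # hypothetical names; enough for a healthy snapshot
    # If lvremove hangs, inspect device-mapper state and, as a last resort,
    # drop the dm devices directly (hyphens in VG/LV names are doubled here):
    dmsetup info -c | grep -i snap
    dmsetup remove es--vg-db--snap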
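The "Unsupported major.minor version 51.0" error ^demon keeps hitting means the class files were built for Java 7 (class-file format 51.0) but are being loaded by a Java 6 JVM; the update-java-alternatives command AzaToth repeats below is the standard cure on Ubuntu of that era. A quick sanity check around it:

    java -version                      # confirm which JVM "java" resolves to
    update-java-alternatives -l        # list the installed alternatives
    sudo update-java-alternatives -s java-1.7.0-openjdk-amd64   # as quoted in the log
    java -version                      # should now report a 1.7.x runtime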
[21:39:47] <^demon> I'm thinking of calling it a week on this :) [21:40:14] !log reboot es1010 to deal with messed up lvm snapshot [21:40:22] Logged the message, notpeter [21:41:19] New review: Andrew Bogott; "Patch six produces: https://dpaste.de/UPdw2/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [21:42:02] PROBLEM - Host es1010 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:32] RECOVERY - Host es1010 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [21:44:56] ^demon: change java version on the system [21:45:22] java doesn't handle multiple concurrently available java versions [21:45:36] ^demon: sudo update-java-alternatives -s java-1.7.0-openjdk-amd64 [21:48:54] <^demon> Sweet, works :) [21:51:32] New patchset: Pyoungmeister; "Revert "temporarily removing es1007 and es1010 for reboots"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68772 [21:53:39] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68772 [21:54:51] <^demon> AzaToth: So, buck looks good to go for me. Still testing gerrit. [21:55:15] !log reedy synchronized php-1.22wmf7/extensions/SecurePoll/ [21:55:18] !log py synchronized wmf-config/db-eqiad.php 'repooling es1007 and es1010' [21:55:23] Logged the message, Master [21:55:31] Logged the message, Master [21:56:18] !log reedy synchronized php-1.22wmf6/extensions/SecurePoll/ [21:56:27] Logged the message, Master [21:59:49] New review: Andrew Bogott; "Sorry, that should be https://dpaste.de/3UCC1/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [22:01:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.147 second response time [22:03:25] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [22:04:55] New review: Andrew Bogott; "I've verified that the exim.conf generated by exim::role::simple-mail-sender is the same as that gen..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [22:05:39] !log reedy synchronized php-1.22wmf7/extensions/SecurePoll/ [22:05:48] Logged the message, Master [22:13:15] Change abandoned: Demon; "I'm tired of seeing this on my queue. Feel free to restore if you plan on doing this again." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8120 [22:25:33] !log reedy synchronized php-1.22wmf7/extensions/SecurePoll/ [22:25:41] Logged the message, Master [22:26:09] !log reedy synchronized php-1.22wmf6/extensions/SecurePoll/ [22:26:18] Logged the message, Master [22:37:51] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:44:40] New review: Dzahn; "i put https://dpaste.de/3UCC1/raw/ on magnesium and tried it. the rewriting to general@rt appears t..."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [22:46:39] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.183 second response time [23:10:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.139 second response time [23:20:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:01] New patchset: Ori.livneh; "Add user for jforrester and grant access to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68821 [23:22:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [23:22:46] New review: Dzahn; "2013-06-14 23:13:33 1UndBV-0002tt-LQ Error in smart_route router: unknown routing option or transpor..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [23:24:49] !log reedy synchronized php-1.22wmf6/extensions/SecurePoll/ [23:24:58] Logged the message, Master [23:31:19] New patchset: Hashar; "php-fatal-error.html is now tracked in git" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68825 [23:35:12] New review: Dzahn; "2013-06-14 23:34:20 1UndVb-0003Fx-Ui => dzahn@wikimedia.org R=smart_route T=remote_smtp S=995 H=mche..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [23:40:32] New patchset: Andrew Bogott; "Refactor exim::rt to use the new exim template." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68011 [23:42:44] New review: Jforrester; "Confirm that this is about me, and is my key. 
:-)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68821 [23:44:02] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [23:44:02] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:02] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:02] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:03] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:03] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:04] PROBLEM - Puppet freshness on mw1020 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:04] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [23:44:05] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours [23:44:05] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:06] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [23:44:06] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [23:49:00] New review: Dzahn; "yep, with route_list = * mchenry.wikimedia.org:lists.wikimedia.org it worked right away when we tes..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68011