[00:03:55] btw, I love not noticing that SWAT happened :)
[00:57:22] (PS1) MaxSem: Remove live-1.5 and skins-1.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/162520
[00:57:30] Reedy, mutante ^^ :P
[00:57:33] * MaxSem runs
[01:03:05] (CR) Chad: "Sure, will amend." [puppet] - https://gerrit.wikimedia.org/r/162161 (owner: Chad)
[01:05:43] (PS1) Ori.livneh: grafana: correct the path to the document root [puppet] - https://gerrit.wikimedia.org/r/162523
[01:05:57] mutante: ^ is super-simple, if you have a sec
[01:06:00] (PS3) Chad: T458: Rename ext_ref description and hide it from users [puppet] - https://gerrit.wikimedia.org/r/162161
[01:06:12] errrr
[01:06:15] it's wrong, hang on
[01:06:42] ori: i was about to say, deploy vs. deployment
[01:06:54] yeah.
[01:07:01] (PS2) Ori.livneh: grafana: correct the path to the document root [puppet] - https://gerrit.wikimedia.org/r/162523
[01:12:34] (CR) Dzahn: [C: 1] grafana: correct the path to the document root [puppet] - https://gerrit.wikimedia.org/r/162523 (owner: Ori.livneh)
[01:12:46] thanks
[01:12:53] (PS3) Ori.livneh: grafana: correct the path to the document root [puppet] - https://gerrit.wikimedia.org/r/162523
[01:13:01] (CR) Ori.livneh: [C: 2 V: 2] grafana: correct the path to the document root [puppet] - https://gerrit.wikimedia.org/r/162523 (owner: Ori.livneh)
[01:20:24] (PS2) Dzahn: remove cron that rsynced nfs home [puppet] - https://gerrit.wikimedia.org/r/162189
[01:21:07] (CR) jenkins-bot: [V: -1] remove cron that rsynced nfs home [puppet] - https://gerrit.wikimedia.org/r/162189 (owner: Dzahn)
[01:21:38] (CR) Dzahn: "done. did you mean this or just removing the entire class?
i tend to think it's not even worth setting stuff to absent and then removing i" [puppet] - https://gerrit.wikimedia.org/r/162189 (owner: Dzahn)
[01:22:36] (PS3) Dzahn: remove cron that rsynced nfs home [puppet] - https://gerrit.wikimedia.org/r/162189
[01:25:22] !log tridge - shutting down
[01:25:29] mutante: \o/
[01:25:29] Logged the message, Master
[01:25:35] another one bites the dust
[01:25:48] yes:)
[01:27:05] (CR) Dzahn: [C: 2] "tridge is actually shutdown now" [puppet] - https://gerrit.wikimedia.org/r/162189 (owner: Dzahn)
[01:29:24] (PS1) Dzahn: decom - remove tridge [dns] - https://gerrit.wikimedia.org/r/162526
[01:33:31] (PS1) Springle: repool db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/162527
[01:51:28] (PS1) Legoktm: Only use the RSS proxy on foundationwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/162532
[01:51:34] quiddity: ^
[01:51:44] ty!
[02:05:34] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3608 MB (3% inode=99%):
[02:09:15] (CR) Springle: [C: 2] repool db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/162527 (owner: Springle)
[02:09:19] (Merged) jenkins-bot: repool db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/162527 (owner: Springle)
[02:10:16] !log springle Synchronized wmf-config/db-eqiad.php: repool db1062 (duration: 00m 06s)
[02:10:21] Logged the message, Master
[02:11:59] (PS1) Dzahn: phabricator - redirect/enforce http->https [puppet] - https://gerrit.wikimedia.org/r/162534
[02:21:14] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:30:02] (CR) Dzahn: "anyone? time is running out to figure out Tampa stuff soon" [puppet] - https://gerrit.wikimedia.org/r/159442 (owner: Dzahn)
[02:30:04] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:30:40] (PS1) Dzahn: terbium - include misc::noc-wikimedia [puppet] - https://gerrit.wikimedia.org/r/162536
[02:38:06] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[02:38:14] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[02:39:39] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-24 02:39:39+00:00
[02:39:44] Logged the message, Master
[03:01:14] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:03:47] (PS1) Jhobs: Reduce file size of wikipedia favicon [mediawiki-config] - https://gerrit.wikimedia.org/r/162538
[03:07:34] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:08:44] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Epic puppet fail
[03:10:35] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[03:12:54] !log LocalisationUpdate completed (1.24wmf22) at 2014-09-24 03:12:54+00:00
[03:13:00] Logged the message, Master
[03:21:49] !log tstarling scap failed: RuntimeError scap requires SSH agent forwarding (duration: 00m 00s)
[03:21:53] Logged the message, Master
[03:22:08] !log tstarling Started scap: (no message)
[03:22:15] Logged the message, Master
[03:27:45] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[03:28:45] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:29:04] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:30:35] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed
[03:30:54] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 1187 seconds ago with 0 failures
[03:34:17] !log tstarling Finished scap: (no message) (duration: 12m 09s)
[03:34:21] Logged the message, Master
[04:29:53] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Sep 24 04:29:53 UTC 2014 (duration 29m 52s)
[04:29:59] Logged the message, Master
[04:34:54] PROBLEM - Host ps1-c3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100%
[04:35:05] PROBLEM - Host ps1-c2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100%
[04:35:05] PROBLEM - Host ps1-d3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100%
[04:35:05] PROBLEM - Host ps1-c1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100%
[04:35:05] PROBLEM - Host ps1-d2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100%
[04:35:05] PROBLEM - Host ps1-d1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100%
[04:40:24] RECOVERY - Host ps1-c2-pmtpa is UP: PING WARNING - Packet loss = 50%, RTA = 33.47 ms
[04:40:24] (PS1) Jeremyb: import LogFormat s from apache2 package [puppet] - https://gerrit.wikimedia.org/r/162541
[04:40:24] RECOVERY - Host ps1-d2-pmtpa is UP: PING WARNING - Packet loss = 50%, RTA = 33.96 ms
[04:40:24] RECOVERY - Host ps1-d3-pmtpa is UP: PING WARNING - Packet loss = 50%, RTA = 37.46 ms
[04:40:24] RECOVERY - Host ps1-c1-pmtpa is UP: PING WARNING - Packet loss = 50%, RTA = 38.12 ms
[04:40:24] RECOVERY - Host ps1-d1-pmtpa is UP: PING WARNING - Packet loss = 50%, RTA = 33.05 ms
[04:40:25] RECOVERY - Host ps1-c3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 34.21 ms
[04:48:05] (CR) Jeremyb: "What formats do we want to use where? Where did these come from to begin with? What do we use these logs for?" [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[04:56:04] wow, a time before bits!
https://bugzilla.wikimedia.org/8926
[04:56:21] i was around then but much less clueful/observant
[04:56:32] i wonder if i even knew what an etag was? :)
[05:41:20] !log ran script to back populate bug 70620 on metawiki (/home/legoktm/ca/populateBug70620.php on terbium)
[05:41:26] Logged the message, Master
[05:42:21] * jeremyb glares at security bug :D
[05:46:59] (CR) Tnegrin: [C: -1] "Thanks Jeremy." [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[05:53:36] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:04] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:04] PROBLEM - check if dhclient is running on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:04] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:14] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:54:14] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:14] PROBLEM - Disk space on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:14] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:14] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:54] (PS1) Legoktm: Add "viewdeletedfile" userright for global deleted image review [mediawiki-config] - https://gerrit.wikimedia.org/r/162546 (https://bugzilla.wikimedia.org/14801)
[05:56:00] (CR) Jeremyb: "> know that we depend on the log format pretty closely" [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[05:56:37] (CR) Ori.livneh: "All requests pass through Varnish, but only a small subset are forwarded to Apache.
Most requests are served entirely by Varnish out of th" [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[05:57:21] ohai ori
[05:57:32] * ori tips hat
[05:59:43] (CR) Tnegrin: [C: 1] "Thanks folks -- better now." [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[06:00:16] (PS2) Nemo bis: Add "viewdeletedfile" userright for global deleted image review [mediawiki-config] - https://gerrit.wikimedia.org/r/162546 (https://bugzilla.wikimedia.org/14801) (owner: Legoktm)
[06:00:31] (CR) Nemo bis: "Fixed typo in comment" [mediawiki-config] - https://gerrit.wikimedia.org/r/162546 (https://bugzilla.wikimedia.org/14801) (owner: Legoktm)
[06:00:54] oops, thanks Nemo_bis
[06:01:12] :)
[06:10:45] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[06:11:04] RECOVERY - check if dhclient is running on mw1053 is OK: PROCS OK: 0 processes with command name dhclient
[06:11:04] RECOVERY - DPKG on mw1053 is OK: All packages OK
[06:11:05] RECOVERY - Disk space on mw1053 is OK: DISK OK
[06:11:05] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212
[06:11:14] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[06:11:24] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output
[06:12:14] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed
[06:17:15] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:20:09] (CR) Jeremyb: "actually, here are some stats. but some of these are defined in the same file as they are used in (e.g.
combined_time) and some of these a" [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[06:28:51] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Epic puppet fail
[06:29:00] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Epic puppet fail
[06:29:09] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Epic puppet fail
[06:30:30] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:50] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:59] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:09] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:19] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:19] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:19] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:29] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:39] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:39] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:49] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:50] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:51] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:59] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:30] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet
last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[06:45:52] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:10] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:46:40] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:47:09] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:47:19] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:47:23] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:47:28] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 52
seconds ago with 0 failures
[06:47:28] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:47:40] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:48:13] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:14] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:14] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:40] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:59] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:59] PROBLEM - check if dhclient is running on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:59] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:48:59] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:59] PROBLEM - Disk space on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:52:05] Notice: Finished catalog run in 4117.95 seconds
[06:52:10] wow, that's crazy...
[06:52:22] do new prod hosts take that long?
[06:52:51] <_joe_> no
[06:52:59] <_joe_> not even remotely
[06:53:25] i meant first run.
to be clear
[06:53:41] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[06:53:50] RECOVERY - check if dhclient is running on mw1053 is OK: PROCS OK: 0 processes with command name dhclient
[06:53:50] RECOVERY - DPKG on mw1053 is OK: All packages OK
[06:53:50] RECOVERY - Disk space on mw1053 is OK: DISK OK
[06:53:50] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212
[06:53:50] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[06:54:10] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:54:10] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output
[06:54:10] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 1333 seconds ago with 0 failures
[06:54:10] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed
[06:56:02] <_joe_> jeremyb: which kind of server?
[06:56:09] <_joe_> that may be realistic in that case
[06:56:24] <_joe_> given it needs to install a ton of packages
[07:00:16] _joe_: mw
[07:00:28] <_joe_> jeremyb: exactly :)
[07:00:41] it wasn't really first run. but it was first run after enabling appserver role
[07:01:09] so... i wonder if we could reduce that :)
[07:01:43] <_joe_> jeremyb: create an apt repo on your local network
[07:01:58] <_joe_> that's the best you can do
[07:02:14] <_joe_> next in line is rewrite puppet and the way it manages packages
[07:13:35] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:20:29] _joe_: that was on labs. so apt should be local?
[07:22:10] (PS1) Giuseppe Lavagetto: Add Tim's PR #3834 as a debian patch [debs/hhvm] - https://gerrit.wikimedia.org/r/162551
[07:24:20] <_joe_> /win 34
[07:24:29] you win some, you lose some
[07:36:05] {{fact}}
[07:36:26] <_joe_> it should be about time I start to win sometimes
[07:36:56] hah
[07:42:21] (PS2) Giuseppe Lavagetto: Add Tim's PR #3834 as a debian patch [debs/hhvm] - https://gerrit.wikimedia.org/r/162551
[07:43:09] (PS1) Ori.livneh: grafana: qualify graphite URL [puppet] - https://gerrit.wikimedia.org/r/162554
[07:43:35] <_joe_> jeremyb: btw the long time of the first appserver puppet run is the thing that really got me thinking of the way packages are handled by puppet
[07:46:56] (CR) Ori.livneh: [C: 2] "two-character commit!" [puppet] - https://gerrit.wikimedia.org/r/162554 (owner: Ori.livneh)
[07:51:53] _joe_: when?
[07:52:13] we could make our own tasks (what's the name exactly?) for tasksel
[07:52:19] or pseudopackages
[07:52:43] <_joe_> jeremyb: well, that is circumventing the problem creating another
[07:52:49] hehehe
[07:53:06] well the issue i was thinking is that puppet does one at a time?
[07:53:42] <_joe_> yes
[07:53:59] <_joe_> with a big pile of ruby around to make it more efficient
[07:56:37] :D
[07:56:49] do the first run with --noop
[07:57:04] grep the result for Package[.*]: present
[07:57:12] construct one apt-get invocation
[07:57:24] then do the real puppet run
[08:07:16] seems like ~96 packages
[08:08:36] all purged -> present
[08:08:38] jeremyb: thanks for the apache2 code review! that was quick :)
[08:08:48] godog: code review?
[08:08:54] * jeremyb was the author :)
[08:09:06] (except that debian package was the real author)
[08:09:40] or are we talking about different changes?
[08:09:40] hehe
[08:09:50] I'm talking about this guy https://gerrit.wikimedia.org/r/#/c/162541/
[08:10:08] right, ok :)
[08:19:54] godog: any known swift issues?
I'm getting many internal error: bad token when uploading files
[08:20:21] matanya: no known issues to me, checking
[08:21:14] <_joe_> godog: I didn't maintain the debian logformats on purpose
[08:21:25] <_joe_> I'm pretty sure we do have things relying on those formats
[08:22:09] _joe_: indeed, mw "other_vhost_access.log" is filled with "vhost_combined"
[08:22:47] _joe_: what do you think relies on them?
[08:23:03] <_joe_> godog: eh.
[08:23:15] <_joe_> godog: adding the other lines is good
[08:23:28] <_joe_> jeremyb: no idea; if I knew, I'd have changed it already
[08:23:40] <_joe_> when I ported mw configs to a debian-like structure
[08:23:50] <_joe_> I decided not to change too many things
[08:24:00] <_joe_> I'd +1 your change jeremyb
[08:24:27] <_joe_> just lemme take a look around if something would need correcting
[08:24:39] matanya: does it give you that consistently?
[08:24:47] mostly godog
[08:26:30] (PS5) Giuseppe Lavagetto: backport of https://github.com/facebook/hhvm/pull/3811/ [debs/hhvm] - https://gerrit.wikimedia.org/r/161936
[08:29:20] (CR) Giuseppe Lavagetto: [C: 2] "Updated with the latest additions by Tim, it compiles and passes both relevant tests." [debs/hhvm] - https://gerrit.wikimedia.org/r/161936 (owner: Giuseppe Lavagetto)
[08:30:29] (PS1) Steinsplitter: Adding *.nijmegen.nl to wgCopyUploadsDomains. [mediawiki-config] - https://gerrit.wikimedia.org/r/162556
[08:37:04] (PS2) Steinsplitter: Adding *.nijmegen.nl to wgCopyUploadsDomains. [mediawiki-config] - https://gerrit.wikimedia.org/r/162556
[08:38:14] matanya: nope I can't find anything obviously wrong, do you know of anyone else having the same problem?
[08:38:37] (CR) Alexandros Kosiaris: "Yeah, this way is fine IMHO.
As a general rule I prefer having puppet absent things but you are right that in some cases like this it migh" [puppet] - https://gerrit.wikimedia.org/r/162189 (owner: Dzahn)
[08:45:25] (CR) Filippo Giunchedi: [C: 1] NTP service aliases, switch eqiad, add esams [dns] - https://gerrit.wikimedia.org/r/162496 (owner: Dzahn)
[08:46:18] (CR) Filippo Giunchedi: [C: 1] NTP client config - use rubidium/eeden as servers [puppet] - https://gerrit.wikimedia.org/r/162175 (owner: Dzahn)
[08:53:24] (CR) Filippo Giunchedi: [C: 1] "doesn't look like ldap::role::server::production is used anywhere, which makes sense. afaik our ldap server usage is either for labs or th" [puppet] - https://gerrit.wikimedia.org/r/159442 (owner: Dzahn)
[08:58:08] (CR) Filippo Giunchedi: "LGTM, perhaps there are other apache configs elsewhere in puppet that suffer from the same problem" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/162541 (owner: Jeremyb)
[08:59:08] godog: no, i can ask, thanks for looking
[08:59:42] btw akosiaris _joe_ I realize https://gerrit.wikimedia.org/r/#/c/162291/ is a bit big but even just eyeballing it would be nice, it isn't going to affect production but just codfw
[09:00:10] matanya: cool thanks! feel free to cc me in BZ too
[09:00:49] <_joe_> godog: bikeshedding - swift_new ? /me sad
[09:01:17] _joe_: I'm open to suggestions :)
[09:02:18] swift²
[09:02:20] <_joe_> also: the role/swift.pp is a place that could really make use of hiera :)
[09:02:53] ori: only if I can use that in the filesystem path too
[09:12:01] (CR) Jeremyb: [C: -1] Adding *.nijmegen.nl to wgCopyUploadsDomains.
(2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/162556 (owner: Steinsplitter)
[09:19:21] !log Upgrading Zuul to f0e3688 Cherry pick https://review.openstack.org/#/c/123437/1 which fix {{bug|71133}} ''Zuul cloner: fails on extension jobs against a wmf branch''
[09:19:27] Logged the message, Master
[09:24:45] Jeremyb: why you -1 it?
[09:25:34] (PS1) Ori.livneh: Set up CORS for Grafana [puppet] - https://gerrit.wikimedia.org/r/162559
[09:25:43] (CR) Steinsplitter: "@Jeremyb: Reason for -1'ing it?" [mediawiki-config] - https://gerrit.wikimedia.org/r/162556 (owner: Steinsplitter)
[09:26:10] godog: the CORS piece is the last bit needed for grafana, AFAICT ^
[09:27:07] ori: sweet, I'll take a look
[09:27:21] ori: aren't you supposed to be sleeping btw? :)
[09:27:30] I know that's probably what you tell your son, anyways
[09:27:35] (PS3) Steinsplitter: Adding *.nijmegen.nl to wgCopyUploadsDomains. [mediawiki-config] - https://gerrit.wikimedia.org/r/162556 (https://bugzilla.wikimedia.org/71191)
[09:27:36] heheh
[09:27:38] touche!
[09:27:58] yes, i'm off. the patch involves varnish so maybe you can talk bblack or mark into sanity-checking it
[09:28:07] haha will do
[09:29:49] ori: any reason to pick varnish over apache btw?
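[editor's sketch] The batching idea floated at 07:56-07:57 (dry run with --noop, pull the packages out of the result, build one apt-get invocation, then do the real run) could look roughly like the Python below. This is only an illustration, not anything that was run in the channel: the helper names are made up, the sample lines and package names are invented, and the regex assumes noop notices of the form "Package[name]/ensure: current_value purged, should be present (noop)", which may differ between Puppet versions.

```python
import re
import shlex

# Assumed shape of a Puppet --noop notice for a package that would be installed.
NOOP_PACKAGE = re.compile(
    r"Package\[([^\]]+)\]/ensure: current_value \S+, "
    r"should be (?:present|installed|latest) \(noop\)"
)


def packages_from_noop(output):
    """Return package names the dry run would install, in first-seen order."""
    seen = []
    for match in NOOP_PACKAGE.finditer(output):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


def apt_get_command(packages):
    """Build a single apt-get invocation; empty string if nothing to install."""
    if not packages:
        return ""
    return "apt-get install -y " + " ".join(shlex.quote(p) for p in packages)


# Invented sample output, only to exercise the parser.
SAMPLE = """\
Notice: /Stage[main]/Example/Package[php5-cli]/ensure: current_value purged, should be present (noop)
Notice: /Stage[main]/Example/File[/etc/example.conf]/ensure: current_value absent, should be file (noop)
Notice: /Stage[main]/Example/Package[imagemagick]/ensure: current_value purged, should be present (noop)
Notice: /Stage[main]/Example/Package[php5-cli]/ensure: current_value purged, should be present (noop)
"""

if __name__ == "__main__":
    print(apt_get_command(packages_from_noop(SAMPLE)))
```

The command is only built and printed, not executed, so it can be reviewed before the real puppet run; whether this actually shortens the first run on an appserver is untested here.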
[09:30:22] iirc apache doesn't know if the request was http or https, since that is handled by nginx
[09:30:38] and you need to know the proto for the allow-origin header
[09:30:48] also, i don't think it would get passed through by varnish
[09:30:54] so we'd need to tweak the varnish config anyhow
[09:31:42] the way to do it in apache would be to ProxyPass /render to graphite.wikimedia.org
[09:31:58] which would be fine by me if you wanted to do it that way
[09:32:23] oh ok, re: https I thought we'd be setting x-forwarded-proto in nginx but have never checked
[09:32:40] !log restarting zuul
[09:32:45] Logged the message, Master
[09:33:08] !log restarting zuul-merger
[09:33:30] ye I asked for that reason but if we have to change varnish anyway, I don't know what's the general preference afaict from puppet it is used only for upload
[09:33:37] it == cors
[09:33:41] i thought we did too, but then why do we set them for git.wikimedia.org in misc.inc.vcl.erb
[09:33:50] (see top of that file)
[09:33:50] !log Jenkins switched mwext-UploadWizard-qunit back to Zuul cloner by applying pending change {{gerrit|161459}}
[09:33:56] Logged the message, Master
[09:34:14] godog: another option would be to move grafana to graphite.wikimedia.org/grafana
[09:34:40] but then we'd also have to move it to tungsten
[09:35:31] i'm totally fine with all of those options, happy to go with whatever you think is best
[09:35:31] we could make that assumption alright, graphite frontend includes grafana too
[09:36:14] yeah that makes more sense now
[09:37:29] i meant to reply to ryan re: salt, i guess i'll do that tomorrow
[09:37:32] good night!
[09:37:36] ori: good night!
[09:42:46] godog: we got swift_new ???
[09:43:01] I was looking at the change and I just noticed...
[09:43:06] <_joe_> :)
[09:43:11] <_joe_> we're going to
[09:43:35] <_joe_> btw godog - I see a lot of data in the role manifest
[09:43:51] <_joe_> cant we move that to hiera directly given you're creating a new module?
[09:44:16] ah yes... why ?
[09:44:27] I hate ganglia_new way too much already
[09:44:50] please tell me we are not going to have the same issue
[09:46:52] nah, the rationale is in the commit message, tl;dr is "transitional"
[09:47:32] _joe_: yes we can!
[09:49:33] _joe_: I can propose another PS, how would it look like?
[09:49:46] <_joe_> yes please
[09:50:05] <_joe_> godog: if you need help understanding where to put things in hiera, ask
[09:50:33] _joe_: yep having an example or sth like that in the code review would be nice!
[09:51:24] <_joe_> ok, on it!
[09:51:44] thanks
[09:51:46] <_joe_> it will take me some time
[10:03:37] jeremyb i am waiting for a reply.
[10:05:07] okay, maybe it is just trolling. away now.
[10:49:05] (CR) Alexandros Kosiaris: [C: 2] "LGTM. Just noting that normally we don't remove mgmt entries as well but this pmtpa so it makes sense." [dns] - https://gerrit.wikimedia.org/r/162526 (owner: Dzahn)
[11:09:45] (PS1) Reza: Add delete right to fawiki Image-reviewer [mediawiki-config] - https://gerrit.wikimedia.org/r/162565 (https://bugzilla.wikimedia.org/71229)
[11:27:17] 11:44:25 I hate ganglia_new way too much already
[11:27:21] in what respect? data in the manifests?
[11:28:10] oh, two modules in use for years ;p
[11:28:25] yes
[11:28:36] sorry for that ;)
[11:28:37] I am always wondering which module does what
[11:29:03] * YuviPanda wonders if over time we'll phase out ganglia and go with a graphite+dashboards solution
[11:29:59] if we get a dashboard that's equally good or better, yes
[11:30:18] until that day, definitely not :)
[11:31:46] indeed
[11:32:14] * YuviPanda should get on that once shinken/icinga is done, since labs doesn't have a working ganglia
[11:33:38] (CR) Alexandros Kosiaris: [C: 2] "That LDAP config is used nowhere. In fact nfs1 does not run an LDAP server anyway.
It has been removed since I78d69a5ef345f50f3e8c2b099734" [puppet] - https://gerrit.wikimedia.org/r/159442 (owner: Dzahn)
[11:37:56] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet last ran 188766 seconds ago, expected 14400
[11:38:06] (CR) Alexandros Kosiaris: [C: -1] NTP service aliases, switch eqiad, add esams (2 comments) [dns] - https://gerrit.wikimedia.org/r/162496 (owner: Dzahn)
[11:39:30] (PS1) Calak: Add 'unwatchedpages' right to 'patroller' user group on he.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/162568 (https://bugzilla.wikimedia.org/71193)
[11:41:57] (CR) Alexandros Kosiaris: [C: 2] "LGTM but let's not merge before I8c53ca3e48c63 is merged" [puppet] - https://gerrit.wikimedia.org/r/162175 (owner: Dzahn)
[11:42:56] <_joe_> godog: I do have a patch ready - it's pretty radical though
[11:43:12] <_joe_> sure you want me to submit it as a PS?
[11:44:55] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[11:47:41] (CR) Alexandros Kosiaris: "Gasp. getent ? Yes this needs fixing. Got a pointer to an RT or BZ (or should I say Phab) ticket/task ?" [puppet] - https://gerrit.wikimedia.org/r/160628 (owner: Matanya)
[11:48:35] hm
[11:48:41] so people want to close down the direct ssh root login
[11:48:56] how am I supposed to copy files out that my personal account can't access then ;p
[11:50:07] <_joe_> mark: anonymous ftp
[11:50:12] oh of course
[11:50:21] that's why we have that daemon on every box
[11:50:32] inetd isn't it
[11:50:43] i have a shareware copy of ws_ftp if you need it
[11:51:01] <_joe_> kill -1 ori
[11:51:38] <_joe_> mark: systemd-inetd you mean
[11:52:48] _joe_: sure go ahead!
[11:53:19] I thought you started by chown -Ring / to your account...
[11:53:24] and then chown -Ring it back to root
[11:53:56] <_joe_> YuviPanda: that's what puppet would do if it tried to solve that problem
[11:54:03] ah, of course
[11:54:13] 'you want the system to be in a state where you can copy all files? sure!'
[11:54:35] _joe_: that's what puppet would do by generating a separate File resource for each recursive file to do the chown
[11:56:11] <_joe_> mark: right
[11:56:32] sad thing is, i'm not even kidding :(
[11:56:46] <_joe_> eh.
[12:01:31] (PS2) Giuseppe Lavagetto: swift: refactor into module, add codfw [puppet] - https://gerrit.wikimedia.org/r/162291 (owner: Filippo Giunchedi)
[12:02:11] <_joe_> godog: ^^; don't run screaming
[12:02:17] <_joe_> it should mostly make sense
[12:06:10] so much refactoring
[12:18:54] (PS1) Yuvipanda: dsh: Move dsh related code into a module [puppet] - https://gerrit.wikimedia.org/r/162570
[12:18:59] moving more things into modules
[12:23:32] (PS5) Yuvipanda: Remove all vumi related code [puppet] - https://gerrit.wikimedia.org/r/158365
[12:23:48] _joe_: think you'll have time for either of ^ today? I can bug others if don't :)
[12:25:23] _joe_: latest PS on swift looks good to me, what would be an easy way to test it in labs?
[12:40:48] (PS1) Alexandros Kosiaris: openstreetmap syncing should use less memory [puppet] - https://gerrit.wikimedia.org/r/162574
[12:50:12] (CR) Mschon: [C: 1] "looks ok for me since code does not belong into a doc-dir" [mediawiki-config] - https://gerrit.wikimedia.org/r/162505 (owner: Dzahn)
[12:52:33] (PS2) Alexandros Kosiaris: openstreetmap syncing should use less memory [puppet] - https://gerrit.wikimedia.org/r/162574
[12:56:28] PROBLEM - puppet last run on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:59:17] !log Zuul / Jenkins stuck [12:59:24] Logged the message, Master [13:00:09] hashar: I was about to ask [13:00:28] !log Jenkins: disconnecting Gearman client from Zuul and reconnecting [13:00:35] Logged the message, Master [13:01:16] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Epic puppet fail [13:01:32] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] openstreetmap syncing should use less memory [puppet] - 10https://gerrit.wikimedia.org/r/162574 (owner: 10Alexandros Kosiaris) [13:04:11] !log Zuul proceeding queue again [13:04:18] Logged the message, Master [13:12:27] <_joe_> godog: mmmh creating the corresponding entries in labs.yaml I guess [13:12:46] <_joe_> where you can also put credentials in directly [13:15:25] (03CR) 10Dan-nl: [C: 04-1] "* after a bit of further investigation, it looks like this web server may have been hacked: http://www.soumaya.com.mx/imagenes/2003/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161192 (https://bugzilla.wikimedia.org/70986) (owner: 10Jeremyb) [13:17:53] !log disable row awareness on Cirrus's elasticsearch cluster - might help balance load better. too much load was on one row [13:17:59] Logged the message, Master [13:18:27] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [13:19:13] <_joe_> manybubbles: this log line sounds like a review of the feature [13:19:25] !log *disabled* [13:19:32] Logged the message, Master [13:19:59] _joe_: well, trying to keep one of the elasticsearch nodes from getting overloaded - because why wouldn't you put tons of load on just one node? 
[13:20:11] <_joe_> :) [13:21:32] _joe_: cool I'll give it a try [13:31:14] (03PS1) 10Alexandros Kosiaris: osm: Export an expired tiles list to $expire_dir [puppet] - 10https://gerrit.wikimedia.org/r/162578 [13:32:45] (03CR) 10Alexandros Kosiaris: [C: 032] osm: Export an expired tiles list to $expire_dir [puppet] - 10https://gerrit.wikimedia.org/r/162578 (owner: 10Alexandros Kosiaris) [13:37:29] (03PS1) 10Yuvipanda: nagios_common: Move notification commands into module [puppet] - 10https://gerrit.wikimedia.org/r/162582 [13:45:54] (03PS1) 10Yuvipanda: nagios_common: Move timeperiods definition into module [puppet] - 10https://gerrit.wikimedia.org/r/162583 [13:46:37] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move timeperiods definition into module [puppet] - 10https://gerrit.wikimedia.org/r/162583 (owner: 10Yuvipanda) [13:47:32] (03PS2) 10Yuvipanda: nagios_common: Move timeperiods definition into module [puppet] - 10https://gerrit.wikimedia.org/r/162583 [13:51:12] (03PS1) 10Hashar: contint: configuration files renaming [puppet] - 10https://gerrit.wikimedia.org/r/162584 [13:51:36] hmm, nice [13:52:04] not too many files to move off icinga/files [13:52:05] err [13:52:06] files/icinga [13:53:00] (03CR) 10Hashar: "Whenever someone as time, please poke me so I can apply the change on gallium and restart Zuul." [puppet] - 10https://gerrit.wikimedia.org/r/162584 (owner: 10Hashar) [13:55:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Various points after having a quick look. One more. I don't like the _new pattern cause it seems that the old one never goes away. Happy t" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/162291 (owner: 10Filippo Giunchedi) [13:58:26] (03PS1) 10Manybubbles: More cirrus shard assignment hints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162590 [13:59:47] PROBLEM - puppet last run on lanthanum is CRITICAL: CRITICAL: Puppet has 1 failures [14:00:15] (03CR) 10Chad: [C: 031] "Assume this is already live, merge whenever." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/162590 (owner: 10Manybubbles) [14:05:18] (03PS2) 10Manybubbles: More cirrus shard assignment hints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162590 [14:05:49] any objection to me comendeering a deployment slot right now? [14:07:59] coren: labsdb1001 needs a cable swap.okay to offline for about a minute [14:08:07] seeing no objections - here goes [14:08:21] !log starting deployment to lower cirrus load spikes [14:08:27] Logged the message, Master [14:08:38] (03CR) 10Manybubbles: [C: 032] More cirrus shard assignment hints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162590 (owner: 10Manybubbles) [14:09:00] (03Merged) 10jenkins-bot: More cirrus shard assignment hints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162590 (owner: 10Manybubbles) [14:11:08] (03PS1) 10Manybubbles: Switch throttle to new job [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162593 [14:11:48] (03PS2) 10Manybubbles: Switch throttle to new job [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162593 [14:12:06] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 1 failures [14:14:07] !log manybubbles Synchronized php-1.24wmf22/extensions/CirrusSearch/: Switch implementation of Cirrus link counting jobs to hopefully lower overall load. 
(duration: 00m 06s) [14:14:14] Logged the message, Master [14:16:43] (03CR) 10Manybubbles: [C: 032] Switch throttle to new job [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162593 (owner: 10Manybubbles) [14:16:47] (03Merged) 10jenkins-bot: Switch throttle to new job [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162593 (owner: 10Manybubbles) [14:17:22] !log manybubbles Synchronized wmf-config: Cirrus config to lower load (duration: 00m 04s) [14:17:28] Logged the message, Master [14:17:57] RECOVERY - puppet last run on lanthanum is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:19:50] (03PS1) 10Manybubbles: Enable delay for new Cirrus link counting job [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162594 [14:20:04] (03CR) 10Manybubbles: [C: 032] Enable delay for new Cirrus link counting job [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162594 (owner: 10Manybubbles) [14:20:06] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 805470B: /srv/deployment/ocg/output 6177439711B: /srv/deployment/ocg/postmortem 1283347B: ocg_job_status 30007 msg (=30000 critical): ocg_render_job_queue 0 msg [14:20:06] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 581336B: /srv/deployment/ocg/output 5769391110B: /srv/deployment/ocg/postmortem 4079090B: ocg_job_status 30014 msg (=30000 critical): ocg_render_job_queue 0 msg [14:20:09] (03Merged) 10jenkins-bot: Enable delay for new Cirrus link counting job [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162594 (owner: 10Manybubbles) [14:20:56] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 567B: /srv/deployment/ocg/output 4322121029B: /srv/deployment/ocg/postmortem 1356173B: ocg_job_status 30060 msg (=30000 critical): ocg_render_job_queue 0 msg [14:20:58] !log manybubbles Synchronized wmf-config: More cirrus config to lower load (duration: 00m 04s) [14:21:05] Logged the message, Master [14:21:30] (03PS1) 10Andrew Bogott: 
Make neptunium an ldap and dns server [puppet] - 10https://gerrit.wikimedia.org/r/162595 [14:22:08] !log manybubbles Synchronized php-1.24wmf21/extensions/CirrusSearch/: Switch implementation of Cirrus link counting jobs to hopefully lower overall load. (duration: 00m 04s) [14:22:13] Logged the message, Master [14:26:06] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 212084B: /srv/deployment/ocg/output 5752506578B: /srv/deployment/ocg/postmortem 4079666B: ocg_job_status 30429 msg (=30000 critical): ocg_render_job_queue 0 msg [14:26:57] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 5378377B: /srv/deployment/ocg/output 4033109105B: /srv/deployment/ocg/postmortem 1356173B: ocg_job_status 30485 msg (=30000 critical): ocg_render_job_queue 0 msg [14:30:09] (03PS1) 10Aude: Add blacklisted properties for suggester on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162596 (https://bugzilla.wikimedia.org/70346) [14:30:26] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:31:36] cmjohnson1: You had an old cat3 lying around that was used for labsdb1001? [14:31:39] :-) [14:32:06] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 2417505B: /srv/deployment/ocg/output 4051930964B: /srv/deployment/ocg/postmortem 1356848B: ocg_job_status 30843 msg (=30000 critical): ocg_render_job_queue 0 msg [14:32:38] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Epic puppet fail [14:32:41] coren: we do not have cat3 [14:33:27] That was a silly joke, cmjohnson1. You'd have a hard time doing 100mb over more than a feet or two with cat3. And I don't remember having /seen/ any in actual use since the early 90s. :-) [14:35:14] coren: hah, yeah...i figured! Anyway, okay to swap? [14:35:40] If you don't expect anything over a minute, just go ahead. 
[14:35:58] nope...will be real quick like [14:36:03] kk [14:36:16] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 2799B: /srv/deployment/ocg/output 4062422084B: /srv/deployment/ocg/postmortem 1358729B: ocg_job_status 31151 msg (=30000 critical): ocg_render_job_queue 0 msg [14:36:47] is anyone looking at ocg yet? [14:37:32] bblack: I just saw it [14:37:41] hmmm the same bug as last time ? [14:37:46] PROBLEM - check configured eth on labsdb1001 is CRITICAL: Timeout while attempting connection [14:37:47] I'm not sure what ocg_job_status 30485 msg (=30000 critical) [14:37:48] not cleaning up the temporary folder ? [14:37:51] means [14:38:08] there doesn't seem to be a lack of disk space, it's just this check apparently reports several things together [14:38:22] <_joe_> the checks I love [14:38:47] RECOVERY - check configured eth on labsdb1001 is OK: NRPE: Unable to read output [14:39:06] o_O [14:39:08] hey, I saw /mnt/tmpfs at 3KB... [14:39:24] yeah I donno about that part [14:39:27] I assumed 3KB is not enough for a PDF service. I also assumed B means bytes [14:39:36] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 720312B: /srv/deployment/ocg/output 5829743202B: /srv/deployment/ocg/postmortem 4101774B: ocg_job_status 31361 msg (=30000 critical): ocg_render_job_queue 0 msg [14:39:46] uh oh [14:39:58] <_joe_> bblack: looking at syslog on the server I don't see anything fishy [14:40:04] <_joe_> but it's very very noisy [14:40:14] akosiaris: yep, that reports bytes [14:40:33] <_joe_> tmpfs 32G 804K 32G 1% /mnt/tmpfs [14:40:39] <_joe_> on ocg1002 [14:40:48] <_joe_> so IDK what it's reporting [14:40:49] !log finished deployment - load spikes look to be gone. 
yay [14:40:55] Logged the message, Master [14:40:56] <_joe_> maybe hungarian bytes [14:41:04] cmjohnson1: [432571.453898] bnx2 0000:01:00.0: eth0: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON [14:41:06] it said /mnt/tmpfs 720312B [14:41:07] Jeff_Green: I assume you're looking at it then, lemme know if you need anything [14:41:17] <_joe_> PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 2799B [14:41:22] it's job status that alerted [14:41:27] yeah it refreshed [14:41:27] "ocg_job_status 31361 msg (=30000 critical)" [14:41:44] it was hard to figure out a sane way to report all that info in one blob [14:42:07] <_joe_> akosiaris: I guess it's the number of msg of type ocg_job_status that overflowed their threshold [14:42:16] yes [14:42:27] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 404742B: /srv/deployment/ocg/output 4085575226B: /srv/deployment/ocg/postmortem 1358729B: ocg_job_status 31588 msg (=30000 critical): ocg_render_job_queue 0 msg [14:42:30] (=30000 critical) [14:42:31] <_joe_> Jeff_Green: what does that mean operationally? [14:43:07] the ocg handler is failing to keep up with processing the job queue [14:43:26] now we have to look for why [14:44:10] the queue thresholds are reported per cluster, so this is not specific to ocg1001 [14:44:50] damn, I still need to write a collector for ganglia [14:44:57] <_joe_> and so... why on all servers? why one check for so many things at once? but we'll answer that later I guess [14:45:19] what do you mean "on all servers" ? 
[14:45:26] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 637208B: /srv/deployment/ocg/output 4101331345B: /srv/deployment/ocg/postmortem 1359332B: ocg_job_status 31828 msg (=30000 critical): ocg_render_job_queue 0 msg [14:45:36] oic what you mean :-P [14:45:44] well, it does cut down on icinga-wm spam when everything goes belly up at the same time, to condense the health report lines :) [14:45:50] well [14:45:53] <_joe_> Jeff_Green: the nodes are not running at full cpu or even at full ram [14:46:00] the real reason: all this data is reported by the ocg server itself [14:46:18] the nagios collector polls it by http and gets back an overall health report [14:46:40] i just translated that into a single nagios ocg server health metric [14:47:04] (then again cutting icinga-wm spam may not be a good goal. in general the tenacity of our response seems to be proportional to how much the icinga spam disrupts conversation here) [14:47:09] <_joe_> anyways, either jobs are failing repeatedly [14:47:10] ha [14:47:26] <_joe_> (which I should see in some sort of error log) [14:48:18] there's nothing in /var/log/ocg.log for today [14:48:26] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 206667B: /srv/deployment/ocg/output 4111076701B: /srv/deployment/ocg/postmortem 1361120B: ocg_job_status 32082 msg (=30000 critical): ocg_render_job_queue 0 msg [14:48:27] <_joe_> exactly. 
[14:48:28] on ocg1001 [14:48:53] there's a lot in /var/log/syslog though [14:49:02] <_joe_> even too much [14:49:17] i wonder why that stopped writing to the other log [14:50:26] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 225405B: /srv/deployment/ocg/output 4119148797B: /srv/deployment/ocg/postmortem 1361120B: ocg_job_status 32242 msg (=30000 critical): ocg_render_job_queue 0 msg [14:50:37] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 2150491B: /srv/deployment/ocg/output 6015006837B: /srv/deployment/ocg/postmortem 1289095B: ocg_job_status 32253 msg (=30000 critical): ocg_render_job_queue 0 msg [14:50:52] /usr/bin/nodejs-ocg[38376]: Could not find size of '/mnt/tmpfs/f7424cc46dcd3fd7788a08a65c91b8d208de811c.rdf2latex/mw-ocg-latexer5064ckot28g/bundle/images/tmp-5024fmvoay85064hjz8rts.pdf': %s: {"errno":34,"code":"ENOENT","path":"/mnt/tmpfs/f7424cc46dcd3fd7788a08a65c91b8d208de811c.rdf2latex/mw-ocg-latexer5064ckot28g/bundle/images/tmp-5024fmvoay85064hjz8rts.pdf [14:51:08] fair amount of that in the log [14:51:11] manybubbles, marktraceur, ^demon|away: So who wants to SWAT today? [14:51:39] anomie: I was syncing code all morning fighting cirrus fires. I want a rest [14:52:03] my patch is easy :) [14:52:13] legoktm: Ping for SWAT in about 8 minutes [14:52:37] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6024618006B: /srv/deployment/ocg/postmortem 1290289B: ocg_job_status 32398 msg (=30000 critical): ocg_render_job_queue 0 msg [14:53:49] <^demon|away> I can do it. [14:54:42] K, good luck ^demon|away [14:55:01] ^demon|away: ok [14:55:32] <^demon|away> aude: yours looks ok, will do that first. 
[14:55:46] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 3474B: /srv/deployment/ocg/output 6035569378B: /srv/deployment/ocg/postmortem 1291459B: ocg_job_status 32649 msg (=30000 critical): ocg_render_job_queue 0 msg [14:56:48] <^demon|away> If lego's around I'll do his too, looks ok. [14:57:48] !log restarted service ocg on ocg1001 [14:57:55] Logged the message, Master [14:58:36] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 447634B: /srv/deployment/ocg/output 4153386364B: /srv/deployment/ocg/postmortem 1364147B: ocg_job_status 32860 msg (=30000 critical): ocg_render_job_queue 0 msg [14:58:46] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6021540315B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32866 msg (=30000 critical): ocg_render_job_queue 0 msg [14:59:52] reedy: the external hdd for wikimania videos is attached to terbium [15:00:09] wheeee [15:00:12] * Reedy looks at the damage [15:00:12] hah [15:00:17] heh [15:00:24] cmjohnson1: thanks [15:00:28] * aude was about to say wheeee [15:00:35] (03CR) 10Chad: [C: 032] Add blacklisted properties for suggester on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162596 (https://bugzilla.wikimedia.org/70346) (owner: 10Aude) [15:00:43] (03Merged) 10jenkins-bot: Add blacklisted properties for suggester on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162596 (https://bugzilla.wikimedia.org/70346) (owner: 10Aude) [15:01:04] cmjohnson1: Is it mounted yet? [15:01:19] Doesn't obviously look to be... [15:01:21] nope..lemme do that [15:01:25] !log demon Synchronized wmf-config/Wikibase.php: (no message) (duration: 00m 06s) [15:01:28] <^demon|away> aude: ^ [15:01:32] Logged the message, Master [15:01:33] Bus 002 Device 003: ID 1058:1144 Western Digital Technologies, Inc. 
[15:01:36] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4153630172B: /srv/deployment/ocg/postmortem 1365275B: ocg_job_status 32868 msg (=30000 critical): ocg_render_job_queue 0 msg [15:02:48] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6021540315B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32869 msg (=30000 critical): ocg_render_job_queue 0 msg [15:03:08] ^demon|away: looks fine although difficult to verify completely [15:03:21] nothing obvious broken :) [15:03:25] <^demon|away> okie dokie :) [15:04:51] <^demon|away> Well legoktm's still idle for ~7h. [15:04:56] <^demon|away> I'll give him a bit longer. [15:05:49] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6021540315B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32870 msg (=30000 critical): ocg_render_job_queue 0 msg [15:13:31] ^demon|away: going to sneak one in: https://gerrit.wikimedia.org/r/#/c/162602/ [15:14:36] ^demon|away: thanks! [15:14:41] <^demon|away> yw [15:18:54] sorry I'm a bit late, my IRC client is being silly right now [15:19:12] <^demon|away> It's k [15:21:07] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6021540315B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32873 msg (=30000 critical): ocg_render_job_queue 0 msg [15:21:32] !log demon Synchronized php-1.24wmf22/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php: (no message) (duration: 00m 05s) [15:21:38] Logged the message, Master [15:21:46] ^demon|away: so specific! [15:22:17] that worked. thanks [15:22:35] cmjohnson1: Did the mount work? /dev/sdb1 on /media/wikimania2014 type vfat (rw) [15:22:39] It shows as being empty :( [15:22:55] it didn't work. 
[15:23:06] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6021540315B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32873 msg (=30000 critical): ocg_render_job_queue 0 msg [15:23:28] Do I dare ask what fs is used? Something mac specific? [15:24:01] reedy: File System: Journaled HFS+ [15:24:18] * Reedy facepalms [15:24:27] ok, I think this thing is working now. [15:25:17] cmjohnson1: hfsprogs? [15:26:07] prolly not instaleld [15:26:11] <^demon|away> legoktm: Do you have a change for core wmf22 for https://gerrit.wikimedia.org/r/#/c/162550/? [15:26:44] cmjohnson1: Right. It seems the "fix" is to just install that, and remount [15:26:47] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4178870862B: /srv/deployment/ocg/postmortem 1382288B: ocg_job_status 32873 msg (=30000 critical): ocg_render_job_queue 0 msg [15:27:03] ^demon|away: no, do you want me to create it? [15:27:07] (03PS1) 10Hashar: contint: labs slaves +mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/162604 (https://bugzilla.wikimedia.org/69535) [15:27:15] <^demon|away> legoktm: If you would :) [15:29:47] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4178870862B: /srv/deployment/ocg/postmortem 1396892B: ocg_job_status 32874 msg (=30000 critical): ocg_render_job_queue 0 msg [15:30:43] reedy: can you get out of the dir [15:30:56] yup, out. sorry :) [15:32:00] grr....[27251322.491805] hfs: unable to find HFS+ superblock [15:32:47] Mebbe a reasonable first step would be to plug it into a macos box and check that it works in the first place? [15:32:49] brb [15:33:39] lol [15:34:12] hello friendly ops people [15:34:33] Coren: I guess we're lucky cmjohnson1 has a mac [15:34:36] i'm working on the OCG health issue above. i'll let you know when i know what's going on. [15:35:10] cscott: What about unfriendly ones? No salutations? 
Maybe that's /why/ they're unfriendly. :-) [15:35:34] <^demon|away> Coren: Could you mark it ack'd in icinga so it'll stop spamming at least? [15:35:46] <^demon|away> Since cscott is on it :) [15:37:23] !log demon Synchronized php-1.24wmf22/extensions/CentralAuth: (no message) (duration: 00m 05s) [15:37:28] <^demon|away> legoktm: ^ [15:37:29] Logged the message, Master [15:37:31] ^demon|away: I'm apparently not cool enough. [15:37:35] cscott: it's useful to use !log for such communications [15:37:46] ^demon|away: thanks [15:37:52] (03CR) 10BryanDavis: "Comments inline suggesting a couple of filename changes." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [15:38:06] <^demon|away> Coren: Boo :( ok thx for trying [15:38:35] !log cscott> i'm working on the OCG health issue above. i'll let you know when i know what's going on. icinga-wm> PROBLEM - OCG health on ocg1002 is CRITICAL [15:38:41] Logged the message, Master [15:39:03] (03PS3) 10Filippo Giunchedi: swift: refactor into module, add codfw [puppet] - 10https://gerrit.wikimedia.org/r/162291 [15:39:35] (03CR) 10Filippo Giunchedi: "fixed a few things, thanks Alex and Giuseppe for the help and the reviews!" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/162291 (owner: 10Filippo Giunchedi) [15:39:48] I do love software that will let you do things, then apparently only does the permissions check when you try to submit it [15:41:07] Reedy: Poor affordances. Something the MacOS designers understood in the days of 1.0. "Never let the user try something and then tell them you can't; gray out the option/menu/etc in the first place" [15:41:08] (03CR) 10Filippo Giunchedi: "Alex: yeah I agree swift_new isn't great but I couldn't find a better way to transition incrementally, perhaps there are better options no" [puppet] - 10https://gerrit.wikimedia.org/r/162291 (owner: 10Filippo Giunchedi) [15:42:06] But also it seems to me that I /should/ be able to do this. 
[15:42:17] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4198564658B: /srv/deployment/ocg/postmortem 1402796B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [15:43:56] (03CR) 10Hashar: [C: 031] "Cherry picked on integration puppet master. That installed a bunch of fonts package on both Precise and Trusty instances. Would probably" [puppet] - 10https://gerrit.wikimedia.org/r/162604 (https://bugzilla.wikimedia.org/69535) (owner: 10Hashar) [15:44:53] (03PS1) 10Giuseppe Lavagetto: puppet::self::master: symlink the hieradata directory [puppet] - 10https://gerrit.wikimedia.org/r/162613 [15:45:03] <_joe_> godog: ^^ [15:45:27] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6021540315B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [15:45:36] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6178897927B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [15:46:12] fyi c scott is looking at the OCG issue, and doesn't think it is in any imminent danger of falling over [15:46:21] (03CR) 10Filippo Giunchedi: [C: 031] puppet::self::master: symlink the hieradata directory [puppet] - 10https://gerrit.wikimedia.org/r/162613 (owner: 10Giuseppe Lavagetto) [15:46:24] _joe_: looks good! [15:47:27] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4198564658B: /srv/deployment/ocg/postmortem 1402796B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [15:47:38] any op around to quickly run "dpkg -s tidy" on virt1000? 
[15:47:39] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet::self::master: symlink the hieradata directory [puppet] - 10https://gerrit.wikimedia.org/r/162613 (owner: 10Giuseppe Lavagetto) [15:48:03] <_joe_> hoo: what is the problem? [15:48:13] _joe_: I suspect it's not installed [15:48:41] Hi getting a lot of timeout errors [15:48:52] <_joe_> hoo: perl-tidy? [15:49:07] just tidy [15:49:16] yeah [15:49:19] we shell out to it AFAIR [15:49:30] <_joe_> hoo: not installed [15:49:33] :S [15:49:34] ok [15:49:36] Hello? [15:49:57] <_joe_> qcoder00: timeout errors on what? [15:50:04] Trying to load Wikipedia [15:50:19] It loads fine when i turn off a firewall [15:50:36] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6178897927B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [15:50:37] But not when it's on so are you doing anything "clever"? [15:50:40] bd808: hmm, I'm not *too* much a fan of splitting things that literally [15:50:59] YuviPanda: Well... puppet lint is :) [15:51:01] <_joe_> qcoder00: I would ask myself if my firewall is doing something clever [15:51:08] gah [15:51:13] * YuviPanda wonders if we can fix puppetlint :) [15:51:14] yes well i thought that as well [15:51:36] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4198564658B: /srv/deployment/ocg/postmortem 1402796B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [15:51:46] But I can't find a rule in it's dataset that would cause unexpected time outs only on one specifc site [15:51:51] YuviPanda: Personally, I like it. I hate hunting for the random file somebody stuffed multiple things into. [15:52:07] bd808: git grep yo! :) [15:52:25] <_joe_> qcoder00: what firewalling software are you using? 
[15:52:36] Comodo [15:52:55] <_joe_> mmmh I guess it does more than just packet filtering [15:53:03] bd808: but yeah, I'll move it around [15:53:08] I'll try a restart [15:53:11] Moment [15:53:27] bd808: the only reason the config class is even needed is because icinga includes just the config classes, and not the package [15:53:46] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6021540315B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [15:53:46] I'm wondering if it is worth separating them out for that, or if I should just have the dsh class and not care about the extra dsh package on neon [15:54:41] * bd808 shrugs [15:54:57] bd808: Am I missing something or is MW on virt1000 entirely unpuppetized? [15:55:09] (03PS1) 10Hoo man: Don't declare puppet_version='3' for virt* [puppet] - 10https://gerrit.wikimedia.org/r/162618 [15:55:18] (poking you, as you were involved there) [15:55:40] It's puppetised [15:55:47] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6178897927B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [15:55:59] Reedy: Where? [15:56:05] openstack controller [15:56:11] hoo: I'm not sure what parts andrew.bogott setup in puppet. It is not automatically doing syncs [15:57:33] openstack::openstack-manager [15:58:10] manifests/role/nova.pp [15:58:24] I seem to recall andrew having to fix a package conflict in puppet [15:58:35] got it [15:58:53] https://gerrit.wikimedia.org/r/#/c/158018/1/manifests/role/nova.pp,unified [15:59:30] (03PS1) 10Hoo man: Install 'tidy' for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/162619 [15:59:43] yeah, found it meanwhile... 
a little messy [16:00:00] * Reedy is still waiting for repos to update [16:00:44] I wonder if there's a better include we can use [16:01:15] You mean for the packages? [16:02:36] yeah, we have include ::mediawiki::packages::php5 [16:02:48] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6178897927B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [16:03:09] (03PS1) 10KartikMistry: Added initial Debian packaging [debs/lttoolbox] - 10https://gerrit.wikimedia.org/r/162620 [16:04:47] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6021540315B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32876 msg (=30000 critical): ocg_render_job_queue 0 msg [16:05:47] Reedy: Declaring it there will certainly cause trouble [16:05:58] because puppet is a huge fan of duplicate packages [16:06:04] mark: time to take a quick look at https://gerrit.wikimedia.org/r/#/c/161679/ ? [16:06:11] Right, but I was wondering if there's grouping [16:06:18] Jeff_Green: so the default lifetime of job status objects is 5 days [16:06:22] <_joe_> hoo: require_packages() in wmflib [16:06:31] <_joe_> or whatever it' [16:06:35] <_joe_> s called now [16:06:36] not sure yet why were are creating 3 per task, looking at that [16:07:20] Reedy: mediawiki::packages [16:07:57] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6202096490B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32877 msg (=30000 critical): ocg_render_job_queue 0 msg [16:08:52] Jeff_Green: so i don't think that 30k alert is reasonable, since IIRC we expect about 10k pdf jobs/day. so 50k would be the *expected* status queue size. [16:09:14] (03CR) 10Hoo man: "Alternative would be to include mediawiki::packages, but I guess there was a reason to not do that. Andrew?" 
[puppet] - 10https://gerrit.wikimedia.org/r/162619 (owner: 10Hoo man) [16:09:40] Jeff_Green: i'm trying to figure out if I should add a size bound to the garbage collector, so that if there were > NNN jobs it would start cleaning up the oldest ones until it fell back down to NNN jobs [16:09:47] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4198564658B: /srv/deployment/ocg/postmortem 1402796B: ocg_job_status 32877 msg (=30000 critical): ocg_render_job_queue 0 msg [16:11:06] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6021540315B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32877 msg (=30000 critical): ocg_render_job_queue 0 msg [16:12:00] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6202096490B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32878 msg (=30000 critical): ocg_render_job_queue 0 msg [16:12:34] <_joe_> cscott: what about job erors like the ones we're seeing? [16:12:39] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:12:49] PROBLEM - Disk space on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:12:49] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:12:51] PROBLEM - check if dhclient is running on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:12:54] <_joe_> also, I don't get why "since IIRC we expect about 10k pdf jobs/day. so 50k would be the *expected* status queue size" [16:12:55] _joe_: job errors? [16:12:59] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:59] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:12:59] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:13:09] <_joe_> cscott: I thought Jeff_Green reported them [16:13:12] _joe_: the default status queue time out is 5 days. [16:13:19] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:13:27] <_joe_> so a job is in the queue even if it's done? [16:13:31] _joe_: yes, there are some warning about cleaning up temp files. i'm working on that, but it's a separate issue, not related to the icinga alerts. [16:13:32] <_joe_> for 5 days? [16:13:54] _joe_: there are two queues. the status queue just says whether this job completed successfully or not. that's what's being kept for 5 days. [16:13:54] <_joe_> oh ok, so this is just a case of "badly choosen threshold" [16:14:47] i think so. i could be convinced that the 5 day expiration is a bit too long, too, but the idea is to keep stuff around long enough that we've got a chance at reusing the cached results for another request. [16:14:54] (03PS1) 10Giuseppe Lavagetto: mediawiki: move memcached servers list to a hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/162622 [16:15:00] <_joe_> cscott: not your one [16:15:05] <_joe_> the one used in icinga [16:15:09] also we ran 20K extra jobs in the past 24H right? [16:15:28] Jeff_Green: yes, but that's roughly 2 days "normal usage" where normal is "after this monday" [16:16:05] so although that's a factor here, if i just cleared redis right now, i'd expect us to start alerting sometime around next wednesday. 
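The reasoning behind "50k would be the *expected* status queue size" is just arrival rate times retention. A small sketch of that arithmetic follows, assuming one status entry per job and a uniform rate of about 10k jobs/day (both simplifications; the log itself notes that 3 entries per task were being created for reasons still under investigation). The function names are made up for illustration.

```python
def steady_state_entries(jobs_per_day, retention_days):
    """Expected number of status entries once arrivals and expiries balance."""
    return jobs_per_day * retention_days


def exceeds_threshold(jobs_per_day, retention_days, critical_threshold):
    """True if normal traffic alone would keep the queue at or above the threshold."""
    return steady_state_entries(jobs_per_day, retention_days) >= critical_threshold


# Figures quoted in the discussion: ~10k pdf jobs/day, 5-day expiry.
expected = steady_state_entries(10_000, 5)
alerts_at_30k = exceeds_threshold(10_000, 5, 30_000)
alerts_at_100k = exceeds_threshold(10_000, 5, 100_000)
```

Under these assumptions a 30k critical threshold sits below the normal steady state, which is why it reads as a badly chosen threshold rather than a real problem.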
[16:16:09] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6202096490B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32878 msg (=30000 critical): ocg_render_job_queue 0 msg [16:16:09] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6023849208B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32878 msg (=30000 critical): ocg_render_job_queue 0 msg [16:16:17] yah [16:16:43] the alert thresholds are easy to adjust, they're just puppet class parameters [16:16:44] i mean, icinga did it's job in raising a concern that we humans had to think about. [16:17:14] (03CR) 10Giuseppe Lavagetto: [V: 032] backport of https://github.com/facebook/hhvm/pull/3811/ [debs/hhvm] - 10https://gerrit.wikimedia.org/r/161936 (owner: 10Giuseppe Lavagetto) [16:17:22] but i think i'd raise the limit to 60k-100k for now. by the end of next week we should have a better idea of what "typical" is. [16:17:29] (03PS3) 10Giuseppe Lavagetto: Add Tim's PR #3834 as a debian patch [debs/hhvm] - 10https://gerrit.wikimedia.org/r/162551 [16:17:36] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add Tim's PR #3834 as a debian patch [debs/hhvm] - 10https://gerrit.wikimedia.org/r/162551 (owner: 10Giuseppe Lavagetto) [16:18:24] fwiw, the garbage collector appears to be running successfully. it's just that almost all of those 30k jobs are ones I triggered during testing w/in the past five days. [16:19:01] ok. 
i'll bump the threshold up to 100k so we stop getting harassed :-P [16:19:19] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6202096490B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32879 msg (=30000 critical): ocg_render_job_queue 0 msg [16:19:46] Jeff_Green: thanks [16:20:14] Jeff_Green: i'm going to file a bug to remind me to look at this limit next week and see if it needs further tweaking, if 5 days is a reasonable expiry, etc. [16:20:45] while I'm in there, are there any other thresholds I should adjust? [16:20:50] we've got: [16:20:56] warn output dir 40GB [16:21:02] critical output dir 50GB [16:21:29] postmortem dir warn 1G, critical 2G [16:22:01] render jobs queue warn 100, critical 500 [16:22:29] er.. diggin through puppet [16:22:51] temp size warn 1G, critical 5G [16:22:54] that's it [16:23:20] Jeff_Green: well, we've got 6G in output dir after ~2 days simulated load. So i'd expect that to go up to 15G in production. So I guess that looks good. [16:23:29] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4198564658B: /srv/deployment/ocg/postmortem 1402796B: ocg_job_status 32879 msg (=30000 critical): ocg_render_job_queue 0 msg [16:23:29] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6023921588B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32879 msg (=30000 critical): ocg_render_job_queue 0 msg [16:23:29] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6202096490B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32879 msg (=30000 critical): ocg_render_job_queue 0 msg [16:23:32] I think those numbers are god. 
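For reference, the warn/critical pairs listed above can be checked against the values icinga was reporting with a few lines of Python. This is a sketch only: the real thresholds are puppet class parameters, the metric names here are invented, GB is assumed to mean 10^9 bytes, and "at or above" is an assumed comparison.

```python
GB = 10**9  # assumption: decimal gigabytes

# (warn, critical) pairs as listed in the discussion; key names are hypothetical.
THRESHOLDS = {
    "output_dir_bytes": (40 * GB, 50 * GB),
    "postmortem_dir_bytes": (1 * GB, 2 * GB),
    "render_job_queue_msgs": (100, 500),
    "temp_dir_bytes": (1 * GB, 5 * GB),
}


def classify(metric, value):
    """Return OK, WARNING or CRITICAL for a metric value (>= comparison assumed)."""
    warn, critical = THRESHOLDS[metric]
    if value >= critical:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"


# Values taken from the ocg1003 alert text at 16:23:29.
ocg1003 = {
    "output_dir_bytes": 6202096490,
    "postmortem_dir_bytes": 1316855,
    "render_job_queue_msgs": 0,
}
states = {name: classify(name, value) for name, value in ocg1003.items()}
```

With these numbers every listed metric on ocg1003 comes out OK, consistent with the job status queue being the only thing tripping the alert.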
[16:23:34] ok [16:24:25] (03PS1) 10Jgreen: adjust ocg job status queue warn/critical thresholds for expected normal range [puppet] - 10https://gerrit.wikimedia.org/r/162623 [16:26:29] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6023921588B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32879 msg (=30000 critical): ocg_render_job_queue 0 msg [16:26:29] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6202096490B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32879 msg (=30000 critical): ocg_render_job_queue 0 msg [16:26:31] Jeff_Green: filed https://bugzilla.wikimedia.org/show_bug.cgi?id=71239 to remind me to look at this again. [16:27:15] cscott: ok [16:27:58] (03CR) 10Jgreen: [C: 032 V: 031] adjust ocg job status queue warn/critical thresholds for expected normal range [puppet] - 10https://gerrit.wikimedia.org/r/162623 (owner: 10Jgreen) [16:27:59] PROBLEM - NTP on mw1053 is CRITICAL: NTP CRITICAL: No response from NTP server [16:29:29] RECOVERY - OCG health on ocg1001 is OK: OK: /mnt/tmpfs 539577B: /srv/deployment/ocg/output 6023921588B: /srv/deployment/ocg/postmortem 4110910B: ocg_job_status 32879 msg: ocg_render_job_queue 0 msg [16:29:34] wooo [16:29:55] (03PS1) 10BBlack: NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 [16:30:10] (03CR) 10Dzahn: "Mschon, btw, i added you because you appear in the list of authors (AUTHORS file in here), did you know ?:)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162505 (owner: 10Dzahn) [16:30:29] RECOVERY - OCG health on ocg1002 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4198564658B: /srv/deployment/ocg/postmortem 1402796B: ocg_job_status 32879 msg: ocg_render_job_queue 0 msg [16:30:34] (03CR) 10jenkins-bot: [V: 04-1] NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 (owner: 10BBlack) [16:32:30] RECOVERY - OCG 
health on ocg1003 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 6202096490B: /srv/deployment/ocg/postmortem 1316855B: ocg_job_status 32879 msg: ocg_render_job_queue 0 msg [16:35:00] RECOVERY - NTP on mw1053 is OK: NTP OK: Offset 0.001859068871 secs [16:35:00] RECOVERY - Disk space on mw1053 is OK: DISK OK [16:35:01] RECOVERY - DPKG on mw1053 is OK: All packages OK [16:35:03] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [16:35:19] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:35:19] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [16:35:19] RECOVERY - check if dhclient is running on mw1053 is OK: PROCS OK: 0 processes with command name dhclient [16:35:31] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [16:36:02] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output [16:36:56] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:38:21] (03PS3) 10Dzahn: NTP service aliases, switch eqiad, add esams [dns] - 10https://gerrit.wikimedia.org/r/162496 [16:38:44] (03CR) 10Dzahn: NTP service aliases, switch eqiad, add esams (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/162496 (owner: 10Dzahn) [16:45:27] (03CR) 10Andrew Bogott: [C: 032] Don't declare puppet_version='3' for virt* [puppet] - 10https://gerrit.wikimedia.org/r/162618 (owner: 10Hoo man) [16:49:27] (03PS2) 10Andrew Bogott: Make neptunium an ldap and dns server [puppet] - 10https://gerrit.wikimedia.org/r/162595 [16:51:26] (03PS3) 10Andrew Bogott: Make neptunium an ldap and dns server [puppet] - 10https://gerrit.wikimedia.org/r/162595 [16:52:33] (03PS1) 10Andrew Bogott: Move ldap-eqiad to neptunium for the time being. 
[dns] - 10https://gerrit.wikimedia.org/r/162632 [16:53:42] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Epic puppet fail [16:54:19] (03CR) 10Dzahn: "yea, had the discussion before about the mgmt entries. Chris confirmed in the Tampa case they can be removed" [dns] - 10https://gerrit.wikimedia.org/r/162526 (owner: 10Dzahn) [16:54:21] (03CR) 10Andrew Bogott: [C: 032] Make neptunium an ldap and dns server [puppet] - 10https://gerrit.wikimedia.org/r/162595 (owner: 10Andrew Bogott) [16:54:39] (03CR) 10Andrew Bogott: [C: 032] Move ldap-eqiad to neptunium for the time being. [dns] - 10https://gerrit.wikimedia.org/r/162632 (owner: 10Andrew Bogott) [16:55:36] (03CR) 10Filippo Giunchedi: [C: 031] NTP service aliases, switch eqiad, add esams [dns] - 10https://gerrit.wikimedia.org/r/162496 (owner: 10Dzahn) [16:56:38] (03CR) 10Filippo Giunchedi: [C: 031] contint: labs slaves +mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/162604 (https://bugzilla.wikimedia.org/69535) (owner: 10Hashar) [16:57:21] (03PS1) 10RobH: setting dns entries for server plutonium [dns] - 10https://gerrit.wikimedia.org/r/162634 [17:00:41] (03CR) 10Filippo Giunchedi: [C: 031] contint: configuration files renaming [puppet] - 10https://gerrit.wikimedia.org/r/162584 (owner: 10Hashar) [17:01:27] chasemp: are you Mr. Phabricator? [17:01:28] (03CR) 10RobH: [C: 032] setting dns entries for server plutonium [dns] - 10https://gerrit.wikimedia.org/r/162634 (owner: 10RobH) [17:01:42] dear god I hope not, but maybe for the moment yes [17:01:47] what's up? [17:02:01] chasemp: i'm going to order you a trucker cap with that printed on it. [17:02:10] but shorted to Mr. Phab [17:02:13] I have a question about everyone's favorite topic, naming! [17:02:15] prefer speedos [17:02:18] Yeah, Mr. Phab is better [17:02:21] heh [17:02:24] chasemp: noted! [17:02:32] how about both! [17:02:37] Mr on the cap [17:02:38] sure man, what about naming? 
[17:02:40] Phab on the speedo [17:02:41] a hot topic to be sure [17:02:45] chasemp: Is phabricator going to use ldap for account management? The same ldap as labs? [17:02:49] (done distracting, sorry ;) [17:02:51] on the back should say "that's mr. phab to you" [17:02:56] andrewbogott: yes [17:03:05] there will be two sources of truth, SUL and labs LDAP [17:03:12] which is becoming more than just labs ldap as time goes on [17:03:22] and you can link your two accounts from those two to a single phab account [17:03:27] it already is, we use it for icinga/graphite/etc all those logins [17:03:31] so you can use both or either [17:03:34] ok [17:03:35] ^that too [17:03:58] is that a concern? [17:03:59] So… my question is somethingsomething staff don't use (WMF) usernames on wikitech and renaming is more or less impossible. [17:04:15] there has been some discussion [17:04:15] i don't use (wmf) on wikitech [17:04:18] wait, they really asked about WMF names on wikitech now? [17:04:18] cuz we weren't forced to [17:04:19] haha [17:04:19] on this point [17:04:25] that was a joke until now:p [17:04:30] phab doesn't respect ( in a username or ) [17:04:32] or spaces [17:04:33] I think [17:04:34] mutante, robh, no, exactly, no one does. [17:04:38] so some people have done -WMF [17:04:38] good :) [17:04:40] There is no standardization of staff usernames on wikitech (nor are there plans to be if I can stop it ;) [17:04:45] But on phab we'll probably want (WMF) names. [17:04:45] but as far as I know there is no "do this" statement [17:04:53] andrewbogott: noooooo [17:04:57] until told to do it I'm using my irc name [17:05:01] I've tried making them different [17:05:04] and it's a nightmare [17:05:14] andrewbogott: we don't want that at all ;] [17:05:18] ok, but then… SUL on phab? That's the same SUL as on wikipedia, right?
[17:05:24] but I think that's not my final say [17:05:25] yes [17:05:35] or the overall general login stuffs is my understanding [17:05:48] So, maybe you just answered this, but… how does that work? If phab draws from ldap /and/ SUL? [17:05:57] phab is modular for auth providers [17:06:01] Given that there is no standardization of names between the two [17:06:04] and can link them to a local username [17:06:06] yeah [17:06:08] wild west [17:06:30] I have no good answer other than, I was going to let ops in to claim their names early to avoid confusion [17:06:37] and then for an overall standard that is legal driven [17:06:40] because root = win =] [17:06:45] i hope nobody registers my LDAP user on a wiki :) [17:06:46] I have no wisdom and have seen none that is definitive [17:07:00] chasemp: ok, so this sounds like a convincing case for me continuing to ignore this issue. [17:07:23] andrewbogott: I have special platinum blinders so i can ignore stuff in style. [17:07:24] the use (WMF) was proposed before [17:07:29] and didn't make it per technical ability [17:07:46] the -WMF was then proposed but afaik it's not a standard as of now [17:08:06] I think it's kinda short sighted atm and names matching in irc is the best practical bet [17:08:12] but hey not my call maybe [17:08:25] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3954 MB (3% inode=99%): [17:09:02] chasemp: so, for the record: Currently when anyone asks me to rename their labs account, I tell them I can't and also won't. [17:09:12] Changing that answer will be a pretty major effort. [17:09:29] won't need to, the phab username can be anything and link to any username in ldap [17:09:31] it's the creds that matter [17:09:37] ok, excellent. Thanks!
[17:09:38] so technically could do -WMF and same ldap name [17:09:40] (03PS1) 10RobH: setting install params for server plutonium [puppet] - 10https://gerrit.wikimedia.org/r/162637 [17:09:53] recently created a new user for somebody who wanted to rename.. and they said they'll just abandon the old name [17:11:05] (03PS1) 10Glaisher: Add wikidatawiki to wgAppleTouchIcon and add wikidata.png to bits Change-Id: I7db24cfe2a03c5e343869923f1432de3820bcd9b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 [17:11:11] yeah, new name and abandoning is fine [17:11:33] (03CR) 10RobH: [C: 032] setting install params for server plutonium [puppet] - 10https://gerrit.wikimedia.org/r/162637 (owner: 10RobH) [17:11:52] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:12:34] bah, they should have to live with their username for eternity [17:12:52] teach folks to make good username choices [17:12:53] like 'robh' [17:12:59] !log lowered throttling on Elasticsearch index transfer from one node to another speed because I hate excitement [17:13:02] ;] [17:13:06] Logged the message, Master [17:13:19] manybubbles: Oh, not sure if you noticed, but the SSDs should arrive this week for the search servers [17:13:29] oh sweet! [17:13:35] so i imagine we'll start depooling and reinstalling individual hosts [17:13:36] slowly. [17:13:38] new servers will take more time I imagine [17:13:48] new server quote is on my plate to review today and escalate for approvals [17:13:54] ah cool [17:13:59] i have it back, just havent given it a detailed proof yet [17:14:08] (03PS2) 10Glaisher: Add wikidatawiki to wgAppleTouchIcon and add wikidata.png to bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) [17:14:11] but once we order, its usually like 2 weeks or so from order to delivery half the time [17:14:15] 2-3 weeks [17:14:22] usually on the shorter side. 
[17:14:28] cool [17:14:40] i imagine for the S3500 upgrades, its worth starting those immediately right? [17:14:52] take one offline, reinstall, let it go back to full load, continue to next.... [17:15:33] RECOVERY - Disk space on stat1002 is OK: DISK OK [17:15:41] mutante: I think I got the redirect stuff to work (the changset you put up wasn't working for me?) [17:15:46] can you sanity check me? [17:15:56] bd808: can you explain about https://gerrit.wikimedia.org/r/#/c/162619/ to me? Or at least tell me if you want me to merge it? [17:17:31] Ah. If tidy is installed then MW will use it to make generated html less crappy. hoo must have noticed something ugly that it would fix. We run it on the cluster servers. [17:17:42] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I can't see how this can fit with the openstack manifests;" [puppet] - 10https://gerrit.wikimedia.org/r/162619 (owner: 10Hoo man) [17:18:11] He was asking earlier where virt1000 was configured as a MW server. [17:18:40] <_joe_> you should probably include mediawiki::packages [17:18:51] <_joe_> but that would cause duplicate package defs maybe? [17:19:08] probably [17:19:14] will the puppet compiler tell us? [17:19:21] or is it broken? (still?) [17:19:39] <_joe_> Reedy: it's broken since I broke it yesterday [17:19:44] <_joe_> had no time to look at it [17:19:57] <_joe_> I manually broke it with some ruby mumbo-jumbo [17:20:17] <_joe_> so I'll probably just rebuild the server [17:20:27] <_joe_> (cattle, not pets) [17:20:54] Just switch all packages everywhere to use require_package() :) [17:21:26] <_joe_> bd808: or just on a need-to basis; it's still good [17:21:51] chasemp: hi.. yea.. so it was weird, the change i uploaded _seemed_ to work for me in firefox [17:22:18] chasemp: but then.. 
when checking other ways i also got "redirect without a target" ?:p [17:23:00] mutante: I think is the best best http://www.mediacollege.com/internet/server/apache/mod-rewrite/last.html [17:23:03] (03PS1) 10Andrew Bogott: Add ldap cert to neptunium [puppet] - 10https://gerrit.wikimedia.org/r/162645 [17:23:06] I'll amend your changset here [17:23:10] literal two char change [17:24:24] (03PS2) 10Rush: phabricator - redirect/enforce http->https [puppet] - 10https://gerrit.wikimedia.org/r/162534 (owner: 10Dzahn) [17:25:00] that redirect w/o target thing is because it was matching rules incorrectly bascially [17:25:13] phab does some fancy stuff-ish [17:25:34] so doing a hard stop when it's just a proto redirect to prevent other rewrite rules seems to jive [17:26:16] chasemp: oh. L flag.. that makes sense [17:27:52] (03CR) 10Andrew Bogott: [C: 032] Add ldap cert to neptunium [puppet] - 10https://gerrit.wikimedia.org/r/162645 (owner: 10Andrew Bogott) [17:28:50] (03CR) 10Dzahn: [C: 031] "using L flag makes sense to me. 
we want the first rewrite part to END and the other part that is for phab itself should exist separately" [puppet] - 10https://gerrit.wikimedia.org/r/162534 (owner: 10Dzahn) [17:29:10] (03PS3) 10Rush: phabricator - redirect/enforce http->https [puppet] - 10https://gerrit.wikimedia.org/r/162534 (owner: 10Dzahn) [17:29:26] (03CR) 10Rush: [C: 032] phabricator - redirect/enforce http->https [puppet] - 10https://gerrit.wikimedia.org/r/162534 (owner: 10Dzahn) [17:29:33] (03CR) 10Rush: [V: 032] phabricator - redirect/enforce http->https [puppet] - 10https://gerrit.wikimedia.org/r/162534 (owner: 10Dzahn) [17:40:43] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3093 MB (3% inode=99%): [17:46:40] (03PS1) 10RobH: seting dns entries for server pollux in codfw [dns] - 10https://gerrit.wikimedia.org/r/162649 [17:48:43] RECOVERY - Disk space on stat1002 is OK: DISK OK [17:49:40] (03CR) 10RobH: [C: 032] seting dns entries for server pollux in codfw [dns] - 10https://gerrit.wikimedia.org/r/162649 (owner: 10RobH) [17:52:42] godog: marktraceur has some swift questions, you around? [17:52:53] will it blend? [17:53:21] godog: Basically I'm just wonderig if we can extend the lifespan of the logs, we only have about a week currently AIUI [17:56:56] (03PS1) 10Andrew Bogott: Add ALL the ldap servers to the replication firewall def [puppet] - 10https://gerrit.wikimedia.org/r/162654 [17:58:16] (03CR) 10Andrew Bogott: [C: 032] Add ALL the ldap servers to the replication firewall def [puppet] - 10https://gerrit.wikimedia.org/r/162654 (owner: 10Andrew Bogott) [17:59:34] _joe_: scap-recompile! [17:59:41] people are complaining math is broken on hhvm [17:59:54] <_joe_> Reedy: uh? 
[18:00:02] https://bugzilla.wikimedia.org/71224 [18:00:04] (03PS1) 10MaxSem: Disable gadgets caching in labs for investigation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162655 [18:00:14] /usr/local/apache/uncommon texvc [18:00:21] <_joe_> oooh ok [18:00:29] <_joe_> Reedy: I'd be off by now [18:00:41] <_joe_> I decided to stop at the 10 hour mark for this week [18:01:04] 10 hours work in one week?! [18:01:20] <_joe_> eheh [18:01:21] ugh, why do we still recompile texvc manually? [18:01:52] (03CR) 10MaxSem: [C: 032] Disable gadgets caching in labs for investigation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162655 (owner: 10MaxSem) [18:01:58] (03Merged) 10jenkins-bot: Disable gadgets caching in labs for investigation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162655 (owner: 10MaxSem) [18:04:43] YuviPanda: about? [18:04:48] MaxSem: because legacy [18:05:12] _joe_: I wonder if just running scap-recompile on the hhvm servers will work [18:05:22] <_joe_> no idea [18:05:31] Reedy, {{sotryit}} [18:05:49] which servers? [18:06:15] <_joe_> Reedy: mw1017, testwiki [18:06:35] <_joe_> but I suspect hte problem is more evident on mw1021 or mw1018 [18:06:41] <_joe_> so try there [18:06:52] I was going to say, mw1017 wasn't reinstalled, was it? [18:07:13] reedy@mw1018:~$ scap-recompile [18:07:13] /srv/deployment/scap/scap/bin/scap-recompile: line 7: mwversionsinuse: command not found [18:07:13] Unable to read wikiversions.json or it is empty [18:07:34] bah. Casualty of not being ported to python [18:07:51] we can replace mwversionsinuse [18:08:17] uh, qualify the path [18:08:24] Reedy: Just needs the right path [18:08:25] yeah [18:08:25] $PATHs again? [18:08:45] It's at /srv/deployment/scap/scap/bin/mwversionsinuse [18:08:57] is that not being added to $PATH again? [18:08:58] reedy@tin:/srv/deployment/scap/scap/bin$ ./scap-recompile [18:08:58] MediaWiki: Compiling texvc... 
[18:09:05] install: cannot remove `/usr/local/apache/uncommon/bin/texvc': Permission denied [18:09:07] wheee :) [18:09:08] and if so... for which user? [18:09:27] uh, wrong server [18:09:50] install: cannot create directory ‘/usr/local/apache’: Permission denied [18:09:50] install: cannot create regular file ‘/usr/local/apache/uncommon/bin’: No such file or directory [18:09:57] that's more like it [18:10:06] blame ori ;) [18:10:13] heh [18:10:21] Should we move it? Or just create the directory on those servers? [18:11:05] ugh. Where does that belong now? Are the other servers only working because the old dir structure hasn't been removed? [18:11:16] It's the reinstalled servers [18:11:26] They don't have the bins, cause they've not been built [18:11:34] They can't be built, as the target dir(s) don't exist [18:12:35] I wonder if we should just build to /srv/mediawiki/something [18:12:53] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3211 MB (3% inode=99%): [18:13:30] https://github.com/wikimedia/operations-mediawiki-config/blob/a32cb9ce216f928c3047e3047cf162ac05fec6d5/wmf-config/CommonSettings.php#L2135 [18:13:42] yeah [18:13:51] we can move it everywhere, or vary [18:13:55] "move" [18:13:55] well if was uncommon to not assume servers have the same arch and let them compile locally [18:14:03] not sure that really matters for us [18:14:18] heh [18:14:27] we do have some 14.04 but mostly 12.04 now [18:14:39] hhvm boxen are all 14.04 [18:14:48] php5 are all 12.04 [18:15:07] we can always use /srv/mediawiki/bin and add an exclude clause to the rsync if we cared about that [18:15:29] That seems probably sensible [18:15:55] might aswell make the dir in mediawiki-staging, and gitignore bin/* [18:16:19] but we wouldn't be building there anywhere [18:16:20] do non-tin hosts even have -staging? 
[18:16:24] nope [18:16:27] right [18:18:07] If we put something in /srv/mediawiki/bin that needs to vary by host, we will need to exclude that dir from rsync or bad things will happen [18:18:22] Can we not just put it into /usr/local/bin? [18:18:50] This dumb thing should really be a deb package and not in the extension anyway. [18:19:08] I think there's an RT ticket about packaging it [18:19:24] https://rt.wikimedia.org/Ticket/Display.html?id=5270 [18:19:25] (03CR) 10Aude: "does this need some sort of background, like http://bits.wikimedia.org/apple-touch/wikipedia.png has?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) (owner: 10Glaisher) [18:19:29] Created: Tue Jun 11 16:55:46 2013 [18:19:30] physkerwelt is the only person who fixes it and he's betting on mathoid now [18:20:00] yeah [18:20:15] scap at one point used to rebuild the binary on every server on every scap :) [18:20:28] o_0 [18:21:25] "I've uploaded a version of texvc and texvccheck (mediawiki-math-texvc) in reprepro taken from git, updating the source package from collab-maint." [18:21:38] So we have a local package I guess? 
[18:21:47] (03CR) 10Jgreen: [C: 032 V: 031] Corrected the exim regex expression and POST url [puppet] - 10https://gerrit.wikimedia.org/r/161679 (owner: 1001tonythomas) [18:22:13] (03PS1) 10Manybubbles: Less exciting Elasticsearch configuration [puppet] - 10https://gerrit.wikimedia.org/r/162661 [18:22:15] (03CR) 10Reedy: [C: 04-1] "Needs rebasing" [puppet] - 10https://gerrit.wikimedia.org/r/135544 (owner: 10Filippo Giunchedi) [18:22:17] Ah [18:22:20] RECOVERY - Disk space on stat1002 is OK: DISK OK [18:22:47] reedy@tin:/usr/local/apache/uncommon$ apt-cache search mediawiki-math-texvc [18:22:47] mediawiki-math-texvc - math rendering plugin for MediaWiki (texvc binary files) [18:23:25] It has a recommends that Faidon made a note about blocking :mediawiki-extensions-math [18:23:51] is that the package that depends on mediawiki and mysql and random stuff? [18:24:05] * AaronS remembers using something like that once [18:24:43] yeah. It recommends mediawiki-extensions-base which recommends mediawiki [18:24:44] Doesn't look to [18:24:57] http://p.defau.lt/?kNcoaxht5TJW8bpFPsuaow [18:25:25] So not required but recommended and that by default would install unless you tell apt otherwise [18:25:35] at least on a stock ubuntu box [18:25:40] (03PS2) 10Reedy: ship {texvc,texvccheck} via mediawiki-math-texvc [puppet] - 10https://gerrit.wikimedia.org/r/135544 (owner: 10Filippo Giunchedi) [18:27:16] In MediaWiki-Vagrant we added apt config to turn off recommends by default -- https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/files/apt/01no-recommended [18:27:18] The following packages have unmet dependencies: [18:27:18] mediawiki-math : Depends: mediawiki-extensions-math but it is not going to be installed [18:28:15] apt-get --no-install-recommends ... [18:28:31] or aptitude --without-recommends ... [18:29:42] that's trying `apt-get install --no-install-recommends -s mediawiki-math` on tin [18:30:17] yuck. 
[18:30:44] (03CR) 10Reedy: "reedy@tin:/usr/local/apache/uncommon$ apt-get install --no-install-recommends -s mediawiki-math" [puppet] - 10https://gerrit.wikimedia.org/r/135544 (owner: 10Filippo Giunchedi) [18:38:13] greg-g, hi, what happened to the zero depl window? don't see it on depl page [18:41:36] yurikR: it wasn't being used regularly, so I removed it. If you'll use it regularly and update what went out when you do use it :) [18:42:13] greg-g ? i used it last week, and the week before - i don't think i skipped that many? [18:42:27] oh, yeah, i wasn't updating the page, sorry :( [18:43:04] yurikR: re-add it and log if you do use it [18:43:49] greg-g, i just updated 24wmf21 & 22 [18:44:56] yurikR: dont' tell me, tell the wiki page [18:45:06] (or, do both) ;) [18:45:13] Coren: hey, i'd add your icinga user, but we should match the labs login user [18:45:26] Coren: and i cant find you in LDAP as "Coren" somehow? [18:45:45] but your wikitech user is that [18:46:42] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3883 MB (3% inode=99%): [18:52:12] Coren: ah.. 
found it..nvm [18:53:42] RECOVERY - Disk space on stat1002 is OK: DISK OK [18:53:46] !log reedy Synchronized php-1.24wmf22/extensions/WikimediaMaintenance: (no message) (duration: 00m 14s) [18:53:53] Logged the message, Master [18:55:21] !log reedy Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 14s) [18:55:27] Logged the message, Master [18:55:35] (03PS1) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162676 [18:56:35] (03CR) 10Reedy: [C: 032] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162676 (owner: 10Reedy) [18:56:40] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162676 (owner: 10Reedy) [18:58:16] greg-g, updated, deploying [18:58:24] (03PS1) 10Dzahn: give Coren Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/162677 [18:58:29] greg-g: just fyi, ocg is stealing parsoid's deploy window today [19:00:58] cscott: sounds good to me [19:03:52] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3960 MB (3% inode=99%): [19:07:08] !log yurik Started scap: updating Graph, JsonConfig, ZeroBanner & ZeroPortal to master for 21 & 22 [19:07:15] Logged the message, Master [19:10:13] cscott: :) [19:10:19] (03CR) 10RobH: [C: 031] give Coren Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/162677 (owner: 10Dzahn) [19:11:53] RECOVERY - Disk space on stat1002 is OK: DISK OK [19:12:43] (03CR) 10Dzahn: [C: 032] give Coren Icinga permissions [puppet] - 10https://gerrit.wikimedia.org/r/162677 (owner: 10Dzahn) [19:14:55] !log yurik Finished scap: updating Graph, JsonConfig, ZeroBanner & ZeroPortal to master for 21 & 22 (duration: 07m 46s) [19:15:02] Logged the message, Master [19:20:47] (03PS2) 10BBlack: NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 [19:20:49] (03PS1) 10BBlack: New wmflib parser function array_concat() [puppet] - 
10https://gerrit.wikimedia.org/r/162686 [19:22:26] (03CR) 10jenkins-bot: [V: 04-1] NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 (owner: 10BBlack) [19:22:28] (03CR) 10jenkins-bot: [V: 04-1] New wmflib parser function array_concat() [puppet] - 10https://gerrit.wikimedia.org/r/162686 (owner: 10BBlack) [19:23:32] hi, who added labs-ns1 recently? [19:23:45] bblack: that NTP refactoring sounds pretty cool [19:24:19] mutante: yeah I think with this structure we don't need any other changes, even in DNS. It keeps all the data on our global NTP setup in one file in puppet. [19:24:34] I need to get it actually working first, though :) [19:24:43] :) [19:26:30] (03PS1) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [19:26:43] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 26 MB (5% inode=99%): [19:27:32] (03CR) 10jenkins-bot: [V: 04-1] Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 (owner: 10Andrew Bogott) [19:28:43] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 23 MB (4% inode=99%): [19:29:08] (03Abandoned) 10Andrew Bogott: Switch ldap servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/162139 (owner: 10Andrew Bogott) [19:30:29] (03PS2) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [19:30:43] PROBLEM - Disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 15 MB (3% inode=99%): [19:31:03] (03CR) 10Andrew Bogott: [C: 04-2] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/162689 (owner: 10Andrew Bogott) [19:31:10] (03CR) 10jenkins-bot: [V: 04-1] Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 (owner: 10Andrew Bogott) [19:32:10] 17G [19:32:16] heh [19:32:30] (that was go to line 17, and this isn't vi) [19:33:24] (03PS3) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [19:33:32] i thought it was the size of the tmpfs on jenkins :p [19:34:23] icinga is funny, it tells me that [19:34:30] "***> The name of the main configuration file looks suspicious... 
[19:34:51] and then goes on about it "typically" being in /usr/local/icinga/ [19:35:08] yea, no, we use /etc for config files [19:36:03] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3506 MB (3% inode=99%): [19:39:36] (03PS3) 10BBlack: NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 [19:39:38] (03PS2) 10BBlack: New wmflib parser function array_concat() [puppet] - 10https://gerrit.wikimedia.org/r/162686 [19:40:49] (03CR) 10jenkins-bot: [V: 04-1] NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 (owner: 10BBlack) [19:41:03] (03CR) 10jenkins-bot: [V: 04-1] New wmflib parser function array_concat() [puppet] - 10https://gerrit.wikimedia.org/r/162686 (owner: 10BBlack) [19:41:47] ^ damn you jenkins :p [19:44:13] RECOVERY - Disk space on stat1002 is OK: DISK OK [19:45:31] (03PS4) 10BBlack: NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 [19:45:33] (03PS3) 10BBlack: New wmflib parser function array_concat() [puppet] - 10https://gerrit.wikimedia.org/r/162686 [19:46:13] (03CR) 10jenkins-bot: [V: 04-1] NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 (owner: 10BBlack) [19:46:18] !log yurik Synchronized php-1.24wmf21/extensions/ZeroBanner/: Updating to master (duration: 01m 07s) [19:46:24] Logged the message, Master [19:47:10] (03PS5) 10BBlack: NTP config refactoring + updates [puppet] - 10https://gerrit.wikimedia.org/r/162625 [19:47:33] !log yurik Synchronized php-1.24wmf22/extensions/ZeroBanner/: Updating to master (duration: 01m 10s) [19:47:40] Logged the message, Master [19:48:54] cscott, i'm done [19:56:31] bblack: you can have your editor to run puppet parser validate on save and whine on issue. 
If you are using vim, you can get https://github.com/scrooloose/syntastic which does the job :] [19:56:56] yeah but I tend to edit locally on osx where I don't even have our environment [19:57:03] I know, I should fix that problem :) [19:58:08] ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" && brew install puppet [19:58:29] (ok you obviously want to review that script before executing it :] [19:59:16] (03PS2) 10Reedy: Bump wgMaxImageArea to 75MP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162492 [19:59:22] (03CR) 10Reedy: [C: 032] Bump wgMaxImageArea to 75MP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162492 (owner: 10Reedy) [19:59:35] (03Merged) 10jenkins-bot: Bump wgMaxImageArea to 75MP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162492 (owner: 10Reedy) [19:59:37] heh [19:59:59] I have brew actually, but still I'll end up getting stuck on version differences from our hosts and such [20:00:05] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 15s) [20:00:11] Logged the message, Master [20:05:08] (03PS1) 10Nemo bis: [Planet Wikimedia] Add User:Husky to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/162700 [20:09:22] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3622 MB (3% inode=99%): [20:10:03] RECOVERY - Disk space on lanthanum is OK: DISK OK [20:13:15] (03CR) 10BBlack: [C: 032] New wmflib parser function array_concat() [puppet] - 10https://gerrit.wikimedia.org/r/162686 (owner: 10BBlack) [20:16:23] RECOVERY - Disk space on stat1002 is OK: DISK OK [20:23:14] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [20:26:35] (03PS2) 10Dzahn: add dduvall to group 'statistics-users' [puppet] - 10https://gerrit.wikimedia.org/r/162166 [20:26:58] ori: did you ever make a bz report for that memcached problem? [20:27:24] AaronS: no. i figured out what was happening, though. i'll be in the office in one hour; wanna discuss it then? 
[20:27:33] sure [20:29:20] (03CR) 10Dzahn: "if you have office wiki access you could review me and give me a +1 for using the right key, you don't even have to be ops" [puppet] - 10https://gerrit.wikimedia.org/r/162156 (owner: 10Dzahn) [20:31:18] (03PS1) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162747 [20:31:20] (03PS1) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162748 [20:31:22] (03PS1) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162749 [20:31:24] (03PS1) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162750 [20:31:26] (03PS1) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162751 [20:31:28] (03PS1) 10BBlack: Move remaining traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162752 [20:31:39] (03CR) 10Reedy: [C: 031] "Key matches" [puppet] - 10https://gerrit.wikimedia.org/r/162156 (owner: 10Dzahn) [20:32:04] Reedy: :) [20:33:32] (03CR) 10Dduvall: [C: 031] "The key looks good." [puppet] - 10https://gerrit.wikimedia.org/r/162156 (owner: 10Dzahn) [20:33:53] (03CR) 10Dzahn: [C: 032] create shell user for Dan Duvall [puppet] - 10https://gerrit.wikimedia.org/r/162156 (owner: 10Dzahn) [20:34:12] 1 + 1 == 2! 
[20:34:26] (03PS2) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162747 [20:34:28] (03PS2) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162749 [20:34:30] (03PS2) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162748 [20:34:32] (03PS2) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162751 [20:34:34] (03PS2) 10BBlack: Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162750 [20:34:36] (03PS2) 10BBlack: Move remaining traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162752 [20:35:13] (03PS3) 10Dzahn: add dduvall to group 'statistics-users' and also bastion access to get to stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/162166 [20:35:25] (03PS4) 10Dzahn: add dduvall to group 'statistics-users' [puppet] - 10https://gerrit.wikimedia.org/r/162166 [20:35:32] (03CR) 10BBlack: [C: 032] Move partial traffic back to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/162747 (owner: 10BBlack) [20:36:07] (03CR) 10Greg Grossmeier: [C: 031] "Yep, for his access needed to help the Multimedia Team with some perf metrics." [puppet] - 10https://gerrit.wikimedia.org/r/162166 (owner: 10Dzahn) [20:36:43] (03CR) 10Dzahn: [C: 032] add dduvall to group 'statistics-users' [puppet] - 10https://gerrit.wikimedia.org/r/162166 (owner: 10Dzahn) [20:38:49] marxarelli: i just saw puppet create your home dir and key on stat1003 [20:39:00] marxarelli: hold on for the bastion host to jump to first [20:39:19] as ottomata said on the gerrit it has public IP but is firewalled [20:39:32] so still need to go via another host, which is bast1001.wikimedia.org [20:40:06] mutante: will he have fenari access as well? (does that come with bastion?) [20:41:30] greg-g: hah! just checked.. 
no, in the current configuration , no fenari [20:41:55] (03PS4) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [20:42:13] greg-g: if it was i would have to mail him right away to move his stuff :) [20:42:47] so the account has been created on bast1001 as well now [20:43:17] !log aaron Synchronized php-1.24wmf22/includes/cache/bloom: ad8a7a761d5f3bd086bbd6c88870e83c701e59e3 (duration: 00m 04s) [20:43:23] mutante: I was just thinking re people.wikimedia.org/~user/ access [20:43:28] Logged the message, Master [20:43:39] (03PS2) 10Yurik: Reduce file size of wikipedia favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162538 (owner: 10Jhobs) [20:44:05] greg-g: people is on terbium, but good point, currently only deployers get it automatically from their role. [20:44:19] terbium! right. names are hard. [20:44:47] (03CR) 10Chad: [C: 031] Less exciting Elasticsearch configuration [puppet] - 10https://gerrit.wikimedia.org/r/162661 (owner: 10Manybubbles) [20:44:58] yea, not included yet [20:45:17] because it was just for stat1003 [20:45:36] but can request ? [20:45:59] sure, I don't think he needs it now, I was just thinking generally. [20:46:13] PROBLEM - Host achernar is DOWN: PING CRITICAL - Packet loss = 100% [20:46:33] PROBLEM - Host 208.80.153.42 is DOWN: PING CRITICAL - Packet loss = 100% [20:46:56] my thought was "it'd be good for devs to have a space like that they could use for quick things instead of having them use a labs instance (since they'd have to set it up just to share something quickly)" [20:47:02] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [20:47:22] RECOVERY - Host achernar is UP: PING OK - Packet loss = 0%, RTA = 43.08 ms [20:47:32] greg-g: yea, we should rethink which admin groups we put on terbium. 
currently it comes with being a software deployer only [20:47:38] * greg-g nods [20:47:51] I don't mean to make this complicated, it was just my thought process :) [20:48:02] RECOVERY - Host 2620:0:860:2:d6ae:52ff:fead:5610 is UP: PING OK - Packet loss = 0%, RTA = 45.79 ms [20:48:02] RECOVERY - Host 208.80.153.42 is UP: PING OK - Packet loss = 0%, RTA = 43.80 ms [20:48:04] take it or leave it, it's not worth the time for me or my team right now :) [20:48:43] (the use-case we thought of for it is probably better served by a labs instance since we'll probably run Limn or something as well) [20:48:44] _joe_: I wish there was a way to fix the random swift 401s... [20:50:15] (03CR) 10Reedy: "This seems to reduce them from 32bit to 4bit, which makes sense as the rest of the colour palette isn't actually used." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162538 (owner: 10Jhobs) [20:50:40] greg-g: admins::restricted also does it. ok [20:55:52] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 0 MB (0% inode=99%): [20:56:03] PROBLEM - DPKG on tungsten is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:56:48] !log updated OCG to version 48acb8a2031863e35fad9960e48af60a3618def9 [20:56:54] Logged the message, Master [20:58:03] RECOVERY - DPKG on tungsten is OK: All packages OK [20:59:20] (03CR) 10MaxSem: [C: 031] Reduce file size of wikipedia favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162538 (owner: 10Jhobs) [20:59:44] (03CR) 10Dzahn: [C: 032] "first headline is about "cats" & "wikidata" at the same time. win" [puppet] - 10https://gerrit.wikimedia.org/r/162700 (owner: 10Nemo bis) [21:00:33] http://www.haykranen.nl/2014/04/03/feline-mayors-of-the-world-or-why-wikidata-is-awesome/ [21:02:04] ok, i'm done with the ocg deploy. looks good. thanks, all. 
[21:08:25] greg-g et al, any thoughts on https://gerrit.wikimedia.org/r/#/c/162538/ [21:08:39] it shrinks our icon image 5 times [21:09:53] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:19] RECOVERY - Disk space on stat1002 is OK: DISK OK [21:11:06] bblack, btw, we might need to check varnish - why it does not use gzip compression for http://en.wikipedia.org/favicon.ico (it could be backend side though) [21:11:08] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 9.795 second response time [21:12:36] (03CR) 10Hashar: "Zeljkof confirmed its provides firefox with Japanese/Chinese fonts :D" [puppet] - 10https://gerrit.wikimedia.org/r/162604 (https://bugzilla.wikimedia.org/69535) (owner: 10Hashar) [21:15:33] yurikR1: I just added jared z. I trust his opinion. [21:16:30] (03CR) 10Yurik: [C: 031] Reduce file size of wikipedia favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162538 (owner: 10Jhobs) [21:18:27] (03CR) 10Yurik: "Reedy, according to Jeff - imagemagick, the actual icons are still stored in the same order, but with different palette they show differen" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162538 (owner: 10Jhobs) [21:19:20] yurikR1: I was using VisualStudio to inspect the ico files ;) [21:19:58] Reedy, funny, i was about to plugin in my old machine with VS to do exactly that :) [21:24:38] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:25:47] (03PS1) 10RobH: setting server pollux install parameters [puppet] - 10https://gerrit.wikimedia.org/r/162762 [21:26:46] (03CR) 10RobH: [C: 032] setting server pollux install parameters [puppet] - 10https://gerrit.wikimedia.org/r/162762 (owner: 10RobH) [21:27:36] cscott: ping: ^ [21:27:50] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 459 bytes in 9.661 second response time [21:29:29] bblack: hm. 
odd. [21:29:44] (03CR) 10Jaredzimmerman: "I'm not sure how Gerrit handles the rendering of these things, but someone with the ability to test the patch on a retina device as a favi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162538 (owner: 10Jhobs) [21:30:02] it's bounced twice since the update, they're short bounces of the overall service status [21:30:05] not sure what's up [21:30:47] bblack: it might be related to some load testing i'm finishing up. i pushed ~300 jobs through OCG in short order, maybe that's enough to make response time lag beyond 10s? [21:30:58] (03CR) 10Jhobs: "Sorry Yuri, I think my wording was misleading. What I actually found was simply that the order of the images was not consistent across pro" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162538 (owner: 10Jhobs) [21:31:07] I have no idea :) [21:34:43] mutante: thanks again for the access! it looks like i may also need to access the research mysql account on one of the analytics slaves. would that need to be another rt request? [21:35:00] isn't that a shared mysql password? [21:35:05] * Reedy whinces [21:35:05] mutante: i'm reading here (https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_slaves) [21:35:12] seems like it [21:35:33] * marxarelli tries "password" [21:37:01] * marxarelli is penetrating the firewall [21:37:07] * marxarelli hax0rs the mainframe [21:42:35] marxarelli: you need membership in a group called "researchers" [21:42:53] marxarelli: that lets you read the file with the password [21:43:40] marxarelli: i would say we should have ticket but we can re-open the existing one, which i just did [21:44:03] getting a list of who is researcher is open ticket [21:44:57] mutante: oh, cool. thanks for doing that. 
sorry for the runaround [21:45:24] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:29] Reedy: 7105 Sharing credentials for the internal SQL slaves :p [21:45:54] i think maybe we should just ask springle to make one user per .. user [21:46:22] but easier would be just get list of people who need it and all of them having stat1003 as well [21:46:25] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 459 bytes in 7.336 second response time [21:46:25] up to analytics [21:47:16] <_joe_> is someone looking into ^^ [21:47:25] <_joe_> look at pybal checks [21:47:37] <_joe_> I suspect they are failing on some hosts [21:47:43] <_joe_> more than can be depooled [21:48:43] _joe_: good question [21:48:45] cscott: ^ ? [21:48:47] 14:04 < cscott> ok, i'm done with the ocg deploy. looks good. thanks, all. [21:48:52] i was about to do what mutante just did [21:48:56] cuz it keeps flapping [21:49:14] <_joe_> take. a. look. at. pybal. logs. [21:50:14] <_joe_> we probably have more servers down than can be depooled. i guess 2 out of 3 [21:50:35] they're showing partially up yes [21:50:58] so yea, it cannot depool anymore, but i honestly dunno the ocg service... checking on the hosts if its something obvious [21:51:02] <_joe_> all? [21:51:15] <_joe_> all partially up? 
[21:51:40] all partially up [21:51:42] ocg1001-1003 [21:51:45] none fully [21:51:53] so there is something wrong in whatever was recently done [21:52:02] cscott: this isnt your load testing, since you said it finished up last time we got a page [21:53:18] and logging for /var/log/ocg.log on 1002 ends back on the 9th [21:53:36] logging is all in logstash [21:53:42] there is no local logging on these machines [21:53:44] ahh [21:53:54] well, not since the 9th ;] [21:54:01] <_joe_> there is, in syslog [21:54:10] <_joe_> there is an error in the rsyslog line [21:54:21] afaik the request logs are all in logstash these days [21:54:23] <_joe_> see backlog [21:54:31] <_joe_> gwicke: and in syslog. [21:54:57] well, these started right when cscott finished doing stuff, can we just revert his changes? (I suppose i can text him) [21:55:03] all 3 are saying they try to connect to localhost:8000 but that times out [21:55:11] robh: cscott should know [21:55:27] <_joe_> which I guess is the node proxy [21:55:29] I pinged him in the parsoid channel too, but he doesn't seem to be around right now [21:55:35] <_joe_> so, the node app is down [21:55:40] <_joe_> let me log in [21:56:12] nodejs-ocg running , a bunch of processes actually [21:56:23] too many? [21:56:26] yea, a bunch on all of them [21:56:32] <_joe_> no [21:56:38] <_joe_> not too many [21:56:45] less than two dozen. [21:56:53] <_joe_> it probably reached its connection or open file limits I guess [21:56:55] actually, sorry, about 30ish [21:57:17] 0.0.0.0:8000 0.0.0.0:* LISTEN [21:57:28] vs. connection error: HTTPConnectionPool(host='localhost', port=8000): Read timed out. 
[21:57:46] <_joe_> not really [21:58:11] (03PS5) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [21:59:01] _joe_, robh: this service is not yet in production [21:59:09] so I'd just wait for cscott to come back & look into it [21:59:14] urgh, then why is it sending pages to ops ;P [21:59:25] <_joe_> gwicke: AFAIK, it is in prod now [21:59:30] because the LVS check is paging [21:59:30] and i thought it was [21:59:33] * cscott pokes his head up [21:59:34] i thought folks could opt in to ocg [21:59:38] well, alpha level prod testing [21:59:39] <_joe_> I'll have to fetch the relevant RT [21:59:46] well, if its in production we have to care =P [21:59:52] <_joe_> cscott: something is very wrong with the latest ocg deployment [22:00:00] cscott: So its giving errors and pinging the crap out of everyone ;] [22:00:04] are we still looking at PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds ? [22:00:09] yes [22:00:21] where is that coming from? [22:00:33] You mean the check alert? [22:00:36] it's the icinga check [22:00:38] yeah. [22:00:42] <_joe_> well, right now I don't see errors [22:00:56] modules/lvs/manifests/monitor.pp: lvs::monitor_service_http { 'ocg.svc.eqiad.wmnet': ip_address => $ip['ocg']['eqiad'], check_command => "check_http_lvs_on_port!ocg.svc.eqiad.wmnet!8000!/?command=health" } [22:01:10] there are checks on the 3 individual servers, but they are just WARN [22:01:18] started 2:45 clear 2:46 PDT [22:01:19] but the check for the LVS went crit [22:01:25] so its flapping an awful lot [22:01:28] because all 3 in the pool timed out [22:01:35] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ocg [22:01:51] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:01:52] well, i can hit it fine from here. 
[22:02:24] I can't right now [22:02:39] well it did finish, but it's slow [22:02:56] <_joe_> cscott: http://ocg.svc.eqiad.wmnet/?command=health gives a timeout after 5 seconds to pybal [22:03:04] bblack@palladium:~$ time curl 'http://ocg.svc.eqiad.wmnet:8000/?command=health' [22:03:07] {"host":"ocg1003","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":6592206845},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":1371560}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":32953},"time":1411596176856} [22:03:12] real 0m11.090s [22:03:14] 11s for me from palladium [22:04:03] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 459 bytes in 7.673 second response time [22:04:07] <_joe_> curl -H 'Host: ocg.svc.eqiad.wmnet' http://ocg1001.eqiad.wmnet:8000/?command=health from the pybal host takes more than 5 seconds [22:04:12] Read timed out. (read timeout=5) .. when it gets over 5s [22:04:25] <_joe_> mutante: that's the pybal timeout [22:04:35] <_joe_> so, either we need to raise the pybal timeout [22:04:40] ok, can we just crank up the timeout to, say, 30s? [22:04:44] <_joe_> or something very wrong is going on there [22:04:50] <_joe_> cscott: seems foolish [22:04:53] why would it take that long to get a healthcheck? [22:05:03] <_joe_> why does it take so long for an healthcheck [22:05:05] <_joe_> :) [22:05:07] because it iterates over the entire redis queue, and the queue is growing larger. [22:05:25] <_joe_> cscott: then change it [22:05:31] it actually does a pretty extensive 'du -s' as well, and that is also growing larger. [22:05:36] <_joe_> or add a "quick" parameter [22:05:38] (03PS1) 10MaxSem: Kill old (skins|live)-1.5 stuff [puppet] - 10https://gerrit.wikimedia.org/r/162768 [22:05:41] <_joe_> oh my [22:06:07] well, the service is going to enter production on monday and all the queues are going to grow by a lot. 
so better to see it now than later. [22:06:13] <_joe_> ok [22:06:24] <_joe_> cscott: you need to fix this [22:06:38] <_joe_> but in the meanwhile, let's raise that limit [22:06:44] yes, i think that's best. [22:06:48] <_joe_> as a temporary countermeasure [22:06:55] can you file a bugzilla for the ?quick parameter? [22:07:02] <_joe_> cscott: you need to fix this before we go in production [22:07:06] and i'll tackle that tomorrow. [22:07:22] <_joe_> cscott: it's midnight here, so, tomorrow maybe. Or someone else does [22:07:53] it's 6:07pm here and i have to pick up my kid by 6:30pm. but i can file the bugzilla when i get back. [22:08:02] <_joe_> thanks :) [22:08:19] i don't think it's related to the deploy at all, i think it's related to the fact that i've been testing the service and growing the queues to a size more like they will be in production. [22:09:31] (03PS1) 10Yurik: Moved 470-01 to unified, zero-only, https support [puppet] - 10https://gerrit.wikimedia.org/r/162770 [22:09:42] bblack, ^ [22:09:53] !log icinga VS HTTP IPv4 on ocg.svc.eqiad.wmnet test is most likely due to `du -s` of a 6G cache directory, not critical. timeouts can be increased to quiet it. i will look into adding a -quick parameter or some such tomorrow to make the health check faster. [22:10:00] Logged the message, Master [22:15:19] (03CR) 10BBlack: [C: 032] Moved 470-01 to unified, zero-only, https support [puppet] - 10https://gerrit.wikimedia.org/r/162770 (owner: 10Yurik) [22:15:26] thx [22:23:01] mutante: sorry to bug, but is getting added to the researchers group going to need another review? (i didn't see anything from rt) [22:23:55] marxarelli: i don't know the actual answer to that, but i'll upload a patch soon [22:27:02] mutante: right on, i appreciate it. i'll cease with the nagging now [22:44:06] !log salted a bash update on labs instances, which turned out to be updated already. 
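Note on the OCG health check thread above (22:04 to 22:09): the check was slow because every request walked the output directory (the `du -s`) and iterated the whole redis queue, which pushed it past pybal's 5-second read timeout and icinga's 10-second socket timeout. A minimal sketch of the "quick" parameter _joe_ suggested could look like the following. It is written in Python purely for illustration and is not OCG's actual code; `directory_size`, `health`, `count_queue` and the `quick` flag are all hypothetical names.

```python
import os
import time


def directory_size(path):
    """Sum file sizes under path in bytes (roughly what `du -s` measures)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file disappeared between listing and stat
    return total


def health(host, directories, count_queue, quick=False):
    """Build a health report.

    directories: mapping of label -> path to measure (hypothetical layout).
    count_queue: callable returning the job queue length (assumed expensive).
    quick:       when True, skip the disk walk and the queue scan so the
                 check answers well inside a short load-balancer timeout.
    """
    report = {"host": host, "time": int(time.time() * 1000)}
    if quick:
        return report
    report["directories"] = {
        label: {"path": path, "size": directory_size(path)}
        for label, path in directories.items()
    }
    report["JobQueue"] = {"length": count_queue()}
    return report
```

With something like this, the load balancer and icinga could poll the quick form frequently, while the full report stays available for occasional, longer-timeout inspection.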
[22:44:13] Logged the message, Master [22:55:39] (03PS6) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [22:58:04] (03PS1) 10Ori.livneh: Install mediawiki-math-texvc on application servers [puppet] - 10https://gerrit.wikimedia.org/r/162783 [22:59:39] (03PS2) 10Ori.livneh: Install mediawiki-math-texvc on application servers [puppet] - 10https://gerrit.wikimedia.org/r/162783 [22:59:59] !log OCG - scheduled downtime/disabled notifications for LVS check [23:00:06] Logged the message, Master [23:01:33] (03PS7) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [23:01:54] (03CR) 10Ori.livneh: [C: 032] Install mediawiki-math-texvc on application servers [puppet] - 10https://gerrit.wikimedia.org/r/162783 (owner: 10Ori.livneh) [23:02:11] (03CR) 10jenkins-bot: [V: 04-1] Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 (owner: 10Andrew Bogott) [23:08:02] OK, so I guess it's SWAT time and the bot is broken? [23:08:04] I'll take it [23:08:23] _joe_, mutante: filed https://bugzilla.wikimedia.org/show_bug.cgi?id=71260 to fix the OCG check 'properly'. [23:09:56] jouncebot: die [23:09:59] :) [23:10:06] (03PS1) 10Dzahn: add dduvall to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/162790 [23:10:20] (03CR) 10Ori.livneh: "Relevant IRC discussion:" [puppet] - 10https://gerrit.wikimedia.org/r/162559 (owner: 10Ori.livneh) [23:10:23] now, to make it come back... [23:11:02] * chrismcmahon sacrifices another goat. it worked yesterday for Jenkins... [23:11:16] * greg-g waits.... 
[23:11:18] yay [23:11:41] (03CR) 10Dzahn: "also see https://rt.wikimedia.org/Ticket/Display.html?id=7105" [puppet] - 10https://gerrit.wikimedia.org/r/162790 (owner: 10Dzahn) [23:11:48] jouncebot: next [23:11:49] In 13 hour(s) and 48 minute(s): Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140925T1300) [23:12:17] !log restarted jouncebot, he wasn't announcing deploy windows [23:12:23] Logged the message, Master [23:13:17] http://www.catb.org/jargon/html/V/voodoo-programming.html [23:14:08] actually this http://www.catb.org/jargon/html/R/rain-dance.html [23:14:16] "We'll have to wait for Greg to do his rain dance.” " *g* [23:14:46] !log catrope Synchronized php-1.24wmf22/resources/lib/oojs-ui/: SWAT (duration: 00m 05s) [23:14:51] Logged the message, Master [23:16:46] (03PS2) 10Dzahn: add dduvall to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/162790 [23:17:21] (03CR) 10Dzahn: [C: 032] "still part of the same access request we already got ACKs for..." [puppet] - 10https://gerrit.wikimedia.org/r/162790 (owner: 10Dzahn) [23:17:55] !log catrope Synchronized php-1.24wmf22/extensions/VisualEditor: SWAT (duration: 00m 04s) [23:18:01] Logged the message, Master [23:19:09] mutante: thanks! [23:19:33] marxarelli: MMV = multi-media viewer, right? 
[23:19:36] in this context [23:19:44] yeah [23:20:05] i think that's what the cool kids say (i am not one of them) [23:21:10] marxarelli: cat /srv/passwords/researchdb on stat1003 [23:21:22] perfect [23:21:28] :) k [23:25:05] (03PS8) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [23:25:45] (03CR) 10jenkins-bot: [V: 04-1] Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 (owner: 10Andrew Bogott) [23:32:53] csteipp_afk, ping [23:38:14] (03PS9) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [23:51:49] (03CR) 10Dzahn: [C: 031] "begging on IRC to get reviews doesn't scale" [puppet] - 10https://gerrit.wikimedia.org/r/162192 (owner: 10Dzahn) [23:53:02] (03CR) 10Ori.livneh: [C: 031] create shell account for nettrom [puppet] - 10https://gerrit.wikimedia.org/r/162192 (owner: 10Dzahn) [23:53:14] (03CR) 10Dzahn: [C: 032] "key per https://rt.wikimedia.org/Ticket/Display.html?id=8343" [puppet] - 10https://gerrit.wikimedia.org/r/162192 (owner: 10Dzahn) [23:54:33] ori: thanks! [23:56:12] error: The following untracked working tree files ... modules/mariadb .. :/ [23:57:46] it blows my mind that git submodules are so awful [23:57:55] because the problem they purport to solve is so common [23:58:30] so i have to imagine that so many other software communities are suffering from their limitations