[00:01:26] YuviPanda|zzz: you conflict with toollabs? [00:01:39] i thought quarry is a separate project [00:02:06] (CR) Dzahn: [C: 2] "Unicode to 8-bit charset transliteration codec" [operations/puppet] - https://gerrit.wikimedia.org/r/156483 (owner: Yuvipanda) [00:02:22] ^d we can wait...good with me it's late here [00:02:40] <^d> Yeah no big deal, we'll figure it out tomorrow. Have a good evening. [00:03:03] YuviPanda|zzz: eh, ignore me, confused [00:03:13] it's the androidsdk thing of course [00:04:56] YuviPanda|zzz: tell AndrewB about the other place it's defined at [00:05:04] he asked [00:07:27] (CR) Dzahn: "then just don't install the package here because you are already getting it from toollabs class? or move out of tool labs and have your ow" [operations/puppet] - https://gerrit.wikimedia.org/r/153600 (owner: Yuvipanda) [00:10:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [00:11:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [00:26:48] (CR) Dzahn: [C: -1] "the entire require webserver::apache should be removed now, asked Ori" [operations/puppet] - https://gerrit.wikimedia.org/r/153849 (owner: Dzahn) [00:28:11] mutante: will do :) [00:38:26] RECOVERY - Disk space on elastic1009 is OK: DISK OK [00:39:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:42:16] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [00:47:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2493 MB (3% inode=84%): [00:48:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:49:26] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [00:55:36] RECOVERY - Puppet freshness on mw1053 is OK: puppet ran at Wed Aug 27 00:55:32 UTC 2014 [00:58:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:01:27] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:16] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [01:02:17] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [01:07:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:56] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:08:06] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:08:46] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [01:08:56] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output [01:10:26] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [01:23:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:26:26] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [01:39:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2480 MB (3% inode=84%): [01:42:33] are we going to worry about virt1000? [01:42:35] disk [01:46:29] jeremyb: I doubt it has its real data on / so probably not an issue [01:46:46] (as / probably won't grow fast / soonish) [01:47:03] but ofc. somebody (sh * should/could [01:47:17] interesting point [01:47:20] let's see [01:59:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:01:17] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [02:07:37] so, actually that box is not even in ganglia at all??? 
[02:07:47] and https://rt.wikimedia.org/Ticket/Display.html?id=7836 is still open [02:19:57] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:20:56] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:24:16] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [02:27:27] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [02:29:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2494 MB (3% inode=84%): [02:30:07] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner [02:30:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [02:33:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [02:35:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 15 processes with UID = 38 (list), regex args /mailman/bin/qrunner [02:36:17] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [02:36:56] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:37:46] RECOVERY - DPKG on mw1053 is OK: All packages OK [02:42:12] !log LocalisationUpdate completed (1.24wmf17) at 2014-08-27 02:41:08+00:00 [02:42:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:44:56] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:26] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [02:46:06] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:17] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [02:46:56] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:47:06] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output [02:49:07] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [02:49:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:49:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [02:50:16] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [02:52:17] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [02:54:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [02:55:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [02:59:37] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: Puppet has 1 failures [03:10:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [03:10:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [03:14:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [03:14:23] !log Ran a slightly modified version of legoktm's removeOldManualUserPages.php for 
myself [03:14:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [03:14:31] helderwiki: ^ that's what I meant [03:15:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:15:26] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:16:16] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [03:18:38] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [03:19:07] !log LocalisationUpdate completed (1.24wmf18) at 2014-08-27 03:18:04+00:00 [03:20:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2493 MB (3% inode=84%): [03:28:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [03:28:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [03:30:25] I see [03:32:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [03:32:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [03:36:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:39:26] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [03:48:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [03:48:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [03:52:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [03:52:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [03:56:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:57:10] (CR) KartikMistry: "Ping?" [operations/puppet] - https://gerrit.wikimedia.org/r/156079 (owner: KartikMistry) [03:57:30] (CR) KartikMistry: "Ping!" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155741 (https://bugzilla.wikimedia.org/69860) (owner: KartikMistry) [03:58:56] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:59:47] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [04:01:26] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [04:06:18] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Aug 27 04:05:12 UTC 2014 (duration 5m 11s) [04:08:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [04:08:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [04:08:54] (CR) Santhosh: [C: -1] Fix Parsoid API variable for ContentTranslation (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/156079 (owner: KartikMistry) [04:12:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [04:12:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [04:14:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2493 MB (3% inode=84%): [04:16:56] (PS3) KartikMistry: Fix Parsoid API variable for ContentTranslation [operations/puppet] - https://gerrit.wikimedia.org/r/156079 [04:22:46] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667 [04:27:37] (CR) Hoo man: [C: -1] "Looks mostly ok now, but this is still unresolved:" [operations/puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas) [04:28:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [04:28:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [04:29:36] hoo: commenting them out ? 
[04:30:11] tonythomas: Yep [04:30:27] maybe you can also just keep that and build a realm switch into the template [04:30:43] if production keep the exact old code, if beta, use the new version [04:30:46] PROBLEM - puppet last run on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:30:46] PROBLEM - RAID on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:30:55] actually, we built a switch here https://gerrit.wikimedia.org/r/#/c/155753/15/manifests/mail.pp [04:31:10] that can work ? [04:31:14] I know [04:31:17] it will later on [04:31:18] but not now [04:31:23] please re-read my comment [04:31:30] k :) [04:31:36] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1308 seconds ago with 0 failures [04:31:39] will comment that out [04:31:46] RECOVERY - RAID on stat1002 is OK: OK: optimal, 1 logical, 12 physical [04:32:04] tonythomas: Ok [04:32:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [04:32:16] most important point is that nothing should change for production, yet [04:32:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [04:33:35] tonythomas: ^ [04:33:58] hoo: true. in that case, commenting out entire changes, so that someone can pick that and apply to beta easily [04:34:13] and restore the eat router ? [04:35:09] tonythomas: I guess you can just make a switch by realm in the template [04:35:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:35:46] oh. in the exim template too. true. and prod = eat, beta = bouncehandler right ? 
[04:35:56] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:24] tonythomas: https://ask.puppetlabs.com/question/3420/if-else-statement-in-erb-template/?answer=3421#post-id-3421 [04:36:32] just do that with $::realm [04:37:10] if production use the old code as is used currently [04:37:18] if in beta, use your code with the curl [04:38:10] I hope you get what I mean [04:38:18] k. let me code that out :) [04:38:23] if not, you might want to ping someone who's better at explaining :P [04:38:27] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:40:17] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [04:40:17] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [04:43:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:44:41] (PS16) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - https://gerrit.wikimedia.org/r/155753 [04:48:26] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [04:48:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [04:48:56] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [04:49:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [04:51:06] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:51:26] PROBLEM - check if dhclient is running on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:51:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:51:27] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:51:56] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:51:56] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:51:56] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [04:52:16] RECOVERY - check if dhclient is running on mw1053 is OK: PROCS OK: 0 processes with command name dhclient [04:52:26] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [04:52:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [04:52:56] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [04:52:56] RECOVERY - DPKG on mw1053 is OK: All packages OK [04:52:57] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [04:52:57] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output [04:54:16] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [04:54:16] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [04:55:19] ori: online? 
mw1053 seems to be having issues ^ (tell me I'm dumb if I misremembered the hhvm host number again) [05:03:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2493 MB (3% inode=84%): [05:06:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2493 MB (3% inode=84%): [05:08:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [05:08:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [05:12:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [05:12:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [05:28:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [05:28:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [05:32:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [05:32:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [05:36:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:37:17] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [05:38:16] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures [05:49:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [05:49:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [05:53:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [05:53:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [05:54:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2486 MB (3% inode=84%): [05:56:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:59:17] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [06:00:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2494 MB (3% inode=84%): [06:09:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [06:09:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [06:12:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:12:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [06:13:06] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:13:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [06:13:07] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:13:57] RECOVERY - DPKG on mw1053 is OK: All packages OK [06:14:06] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output [06:18:16] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [06:27:16] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Epic puppet fail [06:28:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [06:28:16] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:27] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [06:28:27] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:36] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:06] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:16] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [06:32:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [06:34:56] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 2491 MB (3% inode=84%): [06:45:16] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:45:27] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:45:36] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:45:36] PROBLEM - puppet last run on ssl1002 is CRITICAL: CRITICAL: Puppet has 2 failures 
[06:46:06] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:46:27] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:47:16] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:48:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [06:48:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [06:49:01] (PS1) Aude: Add other-projects beta feature to whitelist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156501 [06:52:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [06:52:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [06:55:26] RECOVERY - Disk space on ms1004 is OK: DISK OK [06:55:36] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:26] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:57:27] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:58:26] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 179 MB (1% inode=94%): /var/lib/ureadahead/debugfs 179 MB (1% inode=94%): [06:59:26] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [07:03:37] RECOVERY - puppet last run on ssl1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:08:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [07:09:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [07:12:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [07:13:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [07:13:36] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:18:27] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:20:26] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [07:28:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [07:28:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [07:32:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [07:32:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [07:48:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [07:48:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [07:48:33] <_joe_> something bad is happening on sodium [07:48:42] <_joe_> can someone take a look? [07:48:48] <_joe_> or I'll do in a few [07:48:49] (CR) Filippo Giunchedi: "you are correct, interestingly enough git was confused about .builder files and I forced it via .gitattributes (http://git-scm.com/docs/gi" [operations/software/swift-ring] - https://gerrit.wikimedia.org/r/156252 (owner: Filippo Giunchedi) [07:52:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [07:52:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [08:08:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [08:08:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [08:12:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [08:12:27] PROBLEM - 
mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [08:15:32] So, with SUL, if a user is unified, does that mean they have confirmed they are who they say they are on every wiki? [08:15:56] "No idea about the technical error (other than the idea that the guys who are paid to keep the engines running smoothly for us are instead busy creating crap nobody wants and then comploting to get us silenced)." [08:16:18] my god. [08:18:41] LFaraone: yes, either via confirmed email address or password [08:19:27] legoktm: so I can write code that assumes that if a user is authorized on loginwiki that there do not exist any other users on other wikis who are not them? [08:19:46] errrr, no [08:19:53] it's possible there are unattached accounts [08:20:08] how can I query that via the API/ [08:20:28] https://en.wikipedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser=Catrope&guiprop=merged&format=jsonfm [08:21:17] which is a list of all the wikis "Catrope" is attached on [08:22:46] Sure. Do you know if there is a "blesséd" Python library for interacting with that API? 
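For readers following the SUL question above, here is a minimal Python sketch of the `globaluserinfo` query from the pasted link. It only builds the request URL and reads an already-decoded JSON reply; it does no network I/O. The helper names (`globaluserinfo_url`, `attached_wikis`, `unattached_wikis`) are hypothetical, and the reply layout (a `merged` and an `unattached` list of entries with a `wiki` key) is an assumption about the CentralAuth API output, not something confirmed in this log.

```python
from urllib.parse import urlencode

# Endpoint taken from the link pasted in the channel.
API_URL = "https://en.wikipedia.org/w/api.php"


def globaluserinfo_url(user, props=("merged", "unattached")):
    """Build a meta=globaluserinfo query URL for one user name.

    format=json is used instead of the jsonfm pretty-printer from the
    pasted link, so the reply can be parsed by a program.
    """
    params = {
        "action": "query",
        "meta": "globaluserinfo",
        "guiuser": user,
        "guiprop": "|".join(props),
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def _wikis(response, key):
    # Assumed reply layout: query -> globaluserinfo -> <key> -> [{"wiki": ...}]
    info = response.get("query", {}).get("globaluserinfo", {})
    return [entry["wiki"] for entry in info.get(key, [])]


def attached_wikis(response):
    """Wikis where the local account is attached to the global account."""
    return _wikis(response, "merged")


def unattached_wikis(response):
    """Wikis with a same-named local account that is not attached."""
    return _wikis(response, "unattached")
```

A caller that wants to treat a global user as one person everywhere could refuse to do so whenever `unattached_wikis(...)` is non-empty, which matches the caveat raised above about unattached accounts.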
[08:23:14] <_joe_> LFaraone: requests + cjson ;) [08:23:47] lol [08:24:06] https://www.mediawiki.org/wiki/API:Client_code#Python there's a list of clients, I'd personally recommend using pywikibot, but that's since I wrote a bunch of it :P [08:25:21] <_joe_> legoktm: I was obviously joking [08:25:45] <_joe_> wow how many libraries [08:26:00] <_joe_> doesn't look like people tend to avoid fragmentation :P [08:27:47] python used to be really good about people all supporting one major library (pywikibot) and then people experienced in python realized it was crap because it had hacks to support windows well, so they forked and wrote their own libraries [08:28:22] "the python wiki(p|m)edia community*" I guess [08:29:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [08:29:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [08:32:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [08:33:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes with UID = 38 (list), regex args /mailman/bin/qrunner [08:42:56] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC [08:43:04] (PS7) Giuseppe Lavagetto: Add mediawiki::packages::php5 [operations/puppet] - https://gerrit.wikimedia.org/r/153772 [08:49:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [08:49:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [08:52:27] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [08:53:06] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 16 processes 
with UID = 38 (list), regex args /mailman/bin/qrunner [08:54:01] (PS8) Giuseppe Lavagetto: Add mediawiki::packages::php5 [operations/puppet] - https://gerrit.wikimedia.org/r/153772 [08:56:28] (CR) Giuseppe Lavagetto: [C: 2] "Re-introduced libmemcached10, if we want to remove it we need to do that in a separate change." [operations/puppet] - https://gerrit.wikimedia.org/r/153772 (owner: Giuseppe Lavagetto) [08:56:51] (CR) Alexandros Kosiaris: [C: 2] Fix Parsoid API variable for ContentTranslation [operations/puppet] - https://gerrit.wikimedia.org/r/156079 (owner: KartikMistry) [08:57:58] <_joe_> do I merge yours or you merge mine? [08:58:26] <_joe_> akosiaris: I merged both changes [09:00:02] ok [09:00:03] thanks [09:01:27] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [09:01:55] <_joe_> !log restarted mailman on sodium, after having killed the two concurrent instances running [09:02:06] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [09:31:37] PROBLEM - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:17] PROBLEM - SSH on rhenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:17] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:17] PROBLEM - check configured eth on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:26] PROBLEM - puppet last run on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:27] PROBLEM - check if dhclient is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:27] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:35:27] RECOVERY - DPKG on rhenium is OK: All packages OK [09:36:06] RECOVERY - Disk space on rhenium is OK: DISK OK [09:36:06] RECOVERY - SSH on rhenium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:36:06] RECOVERY - check configured eth on rhenium is OK: NRPE: Unable to read output [09:36:16] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:36:17] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient [09:36:17] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [09:42:46] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [09:46:40] _joe_: good morning :-D I had a question about providing the hhvm build dependency on contint slaves ( https://gerrit.wikimedia.org/r/#/c/150813/ ) [09:47:10] the patch provides a list of -dev packages and that is better done with " apt-get build-dep hhvm" [09:47:20] but I have no clue how to have puppet run it :/ [09:50:18] <_joe_> hashar: what's wrong with a correctly crafted exec? [09:50:27] <_joe_> the alternative is to create a type provider [09:50:36] <_joe_> but it's ruby, I don't recommend it [09:50:55] <_joe_> unless you want to make thousands of debian sysadmins happy :) [09:50:56] we could write it in python using rpython :D [09:51:02] so simply an exec statement? [09:51:10] that would run apt-get build-dep hhvm on each run of puppet? 
[09:51:26] that seems to easy to me [09:51:28] <_joe_> not if you set an onlyif or unless condition [09:52:02] <_joe_> btw, I decided to go the gem2deb way for packaging rubocop [09:52:25] <_joe_> so far it's been awful but I should end by lunchtime or a wee bit later [09:53:00] I am not sure what kind of onlyif / unless condition we should use though [09:53:08] potentially verify whether hhvm package is a newer version [09:55:17] _joe_: so finally you are packaging rubocop ? :D [09:55:29] as I said yesterday, feel free to dismiss the idea and let us use ruby gems [09:55:31] it is fine tome [09:56:13] <_joe_> well, you can have the exec run every time in fact [09:56:29] <_joe_> it just slows the whole thing down but in contint that's ok [09:57:08] great thanks [09:57:11] <_joe_> but [09:57:12] will do the exec {} thing so [09:57:21] <_joe_> look at the apt-get manual [09:57:28] <_joe_> maybe -s may help you [09:57:51] <_joe_> and you can play with exit codes from it, but I honestly don't know [09:58:44] also while you are around, maybe on contint we should have hhvm to ensure => latest ? [09:58:56] cause I am pretty sure package don't magically upgrade on labs [09:59:38] i am liking systemd more and more [10:00:09] me too akosiaris, using it over a year and a half, and it works really nice [10:00:24] the /etc/init.d/ handling stuff kind of breaks though. I could not disable puppet through systemctl [10:00:41] <_joe_> akosiaris: I /hate/ systemd [10:00:51] you can write sysvcomap script and it would work (or should) [10:00:52] <_joe_> it has a bunch of good and a bunch of terrible ideas [10:01:00] <_joe_> like binary logs [10:01:08] journalctl ? 
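On the exec question above: Puppet's `exec` resource accepts `unless` and `onlyif` guards, and apt-get's `-s` (simulate) flag was suggested as a possible check. A minimal sketch of that guard logic, assuming simulate mode prints one `Inst <package> ...` line per package it would install. The function name and the sample outputs are illustrative and are not taken from the patch under review.

```python
def build_deps_missing(simulated_output: str) -> bool:
    """Return True if simulated apt-get output lists packages to install.

    Assumption: `apt-get -s build-dep <pkg>` prints an "Inst ..." line for
    every package it would install, and none when nothing is pending.
    """
    return any(line.startswith("Inst ") for line in simulated_output.splitlines())


# Illustrative sample outputs (assumed, not captured from a real host).
NOTHING_TO_DO = (
    "Reading package lists...\n"
    "0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.\n"
)
WORK_PENDING = (
    "Reading package lists...\n"
    "Inst libfoo-dev (1.0 Ubuntu:14.04/trusty [amd64])\n"
    "Conf libfoo-dev (1.0 Ubuntu:14.04/trusty [amd64])\n"
)

if __name__ == "__main__":
    # A guard script would exit 0 when nothing is missing, so that an
    # exec with `unless` pointing at it skips the real install.
    print(build_deps_missing(NOTHING_TO_DO), build_deps_missing(WORK_PENDING))
```

A guard built this way would let the exec skip the real `apt-get build-dep hhvm` on runs where nothing is pending, at the cost of one simulated run each time.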
[10:01:09] <_joe_> but that's well beyond my point [10:01:18] yeah I disliked that as well at first [10:01:40] <_joe_> again, it's basically irrelevant [10:01:45] then I realized that most of the times I am using that only to get the early boot messages which are more or less otherwise lost [10:01:52] <_joe_> I hate how unsimple it is [10:02:17] <_joe_> how it tries to funnel a lot of things into an init system [10:02:28] <_joe_> that IMO should be as barebones as possible [10:02:44] well each binary is more or less barebones [10:03:11] the entirety of the package is kind of complex though [10:03:17] <_joe_> akosiaris: let's see if in 5 years we won't be forced to use systemd's ntp [10:03:22] <_joe_> or systemd's dhcp [10:03:29] <_joe_> because everyone is [10:03:32] yeah, that thing I hate [10:03:56] (03PS3) 10Hashar: hhvm: create module + list all dev dependencies [operations/puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) [10:04:12] but each single systemd compartment is not complex per se [10:04:49] <_joe_> systemd is a dangerous way to go down, and a spit in the face of everyone using linux on a server. [10:05:16] not so sure about that anymore either [10:05:24] <_joe_> and I'm still not using it extensively. Wait for the hate and frustration coming from real-world usage :D [10:05:40] <_joe_> akosiaris: there are tons of tools for job supervision [10:05:46] anyway I am still evaluating it in depth... not disliking much what I have seen yet [10:05:57] <_joe_> it's not *terrible* [10:06:07] <_joe_> it has quite a few good ideas [10:06:27] <_joe_> but it is a very dangerous path [10:06:33] meh... ideas are ideas... 
its implementations that matter [10:07:10] matanya_: i was referring to the fact that puppet 2.7 in debian wheezy has both a systemd .service and an /etc/init.d/puppet file [10:07:31] ah, that :) [10:07:43] the result on 204-14~bpo70+1 systemd was that systemctl disable puppet would disable the init.d stanzas [10:07:56] but would not delete the /etc/systemd/system/puppet.service symlink [10:08:12] i'm on 216, i can check if it is the same [10:08:15] (03PS4) 10Hashar: hhvm: create module + list all dev dependencies [operations/puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) [10:08:32] would be interesting to see if it is systemd's fault [10:08:50] (03CR) 10Hashar: "And it now uses /usr/bin/apt-get build-dep hhvm" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [10:10:18] matanya_: see this for example http://paste.debian.net/hidden/51c41f83/ [10:11:17] puppet will be started on next boot [10:11:23] akosiaris: i see what you mean [10:11:45] but due to init.d backwards compat and not systemd's unit stanza [10:12:10] which is kind the inverse of what lennart is swearing that should happen [10:12:28] <_joe_> wow [10:12:41] but I am not sure who to blame yet ... 
[10:13:24] lunch time for me [10:22:27] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3957 MB (3% inode=99%): [10:35:27] RECOVERY - Disk space on stat1002 is OK: DISK OK [10:43:56] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC [11:01:10] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 10:58:23 UTC [11:03:10] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 10:58:23 UTC [11:05:10] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 10:58:23 UTC [11:07:10] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 10:58:23 UTC [11:09:10] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 10:58:23 UTC [11:11:10] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 10:58:23 UTC [11:13:10] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 10:58:23 UTC [11:15:10] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 10:58:23 UTC [11:17:10] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 10:58:23 UTC [11:17:50] RECOVERY - Puppet freshness on db1019 is OK: puppet ran at Wed Aug 27 11:17:44 UTC 2014 [11:49:00] (03PS1) 10MaxSem: Disable mobile uploads [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156523 (https://bugzilla.wikimedia.org/62598) [12:15:23] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 39 data above and 0 below the confidence bounds [12:15:23] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 39 data above and 0 below the confidence bounds [12:18:11] 
(03CR) 10Florianschmidtwelzow: [C: 031] "From commit message:" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156523 (https://bugzilla.wikimedia.org/62598) (owner: 10MaxSem) [12:31:51] !log restarting Zuul. I have added a bunch of jobs in Jenkins and want to make sure everything is settled properly. [12:44:03] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [12:44:13] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC [13:24:54] (03CR) 10Hashar: [C: 031] "Seems this is ready to land. Puppet will notify the Gerrit service on ytterbium to have the new conf applied." [operations/puppet] - 10https://gerrit.wikimedia.org/r/156103 (owner: 10Hashar) [13:39:53] manybubbles: hm, so elastic1018 came back oline yesterday, but kinda without supervision [13:40:05] chris had trouble reinstalling it, so he booted it back up [13:40:10] which, I think, means it joined the cluster [13:40:20] but, raid was kinda borky on it, because he re-added one of the SSDs [13:40:23] so I don't really know what state it is in [13:40:31] ottomata: I saw it show up - we've got elasticsearch configured not to send it shards [13:40:36] ok phew [13:40:50] can I turn off elasticsearch and reformat the partition? [13:40:57] we need to upgrade elasticsearch there too [13:41:32] ottomata: you can do whatever you want to with it [13:41:49] I've confirmed we won't allocate shards there [13:42:23] ok cool [13:46:37] <_joe_> ottomata: thanks a bunch (re: hhvm + analytics) [13:47:12] yup, don't htink I did much, but you are welcome! [13:47:34] manybubbles: ok, elastic1018 shoudl be ok to join cluster now [13:47:37] double check for me if you will [13:47:49] <_joe_> and after teasing you... hiera! https://gerrit.wikimedia.org/r/#/c/151869/ [13:47:52] v 1.3.2 with plugins deployed [13:48:31] _joe_, OK! with that poke I will take on some reviewer responsibility! 
I haven't worked with hiera at all yet though, so i'm going to have to do some self educating! [13:48:52] ottomata: cool - looks like you also made sure it was upgraded to 1.3.2 [13:48:55] thanks [13:50:27] <_joe_> ottomata: whenever you've doubts, feel free to ask [13:56:32] (03PS1) 10Manybubbles: Fix spelling making config not properly apply [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156532 [13:56:52] James_F|Away or greg-g: Do you OK https://gerrit.wikimedia.org/r/#/c/156501/ ? [13:58:07] anomie: it's nothing new [13:58:52] aude: The comment says "DO NOT add entries here without OK from Greg Grossmeier or James Forrester" [13:58:57] ok :) [13:59:11] i should have added it last night but didn't know about the whitelist setting [14:00:00] anyway, we can wait for James_F|Away or greg-g [14:00:24] akosiaris: I wanted to start with the development of the role for puppet in production... The templated used for beta was cxserver https://git.wikimedia.org/blob/operations%2Fpuppet.git/ea789e2f0bfb965d95fc7818b986c75dcfc6221e/manifests%2Frole%2Fcxserver.pp ... but cxserver has only a beta role.. I had a look at parsoid https://git.wikimedia.org/blob/operations%2Fpuppet.git/ea789e2f0bfb965d95fc7818b986c75dcfc6221e/manifests%2Frole%2Fpa [14:01:07] anomie: I can do swat today _but_ I have a meeting from 11-11:30 [14:01:37] manybubbles: Since you have a meeting, I'll do it [14:11:17] <^d> ottomata: 1018 looks much better now :) [14:11:25] <^d> I tried debugging it last night but got a little over my head [14:11:52] <^d> Well, I figured out what was wrong but didn't know how to fix it :p [14:15:20] cool, ja, basically, it didn't have a raid 0 partition for elasticsearch, and it was running 1.2 [14:15:36] physikerwelt: avoid using the parsoid roles as a template. [14:15:37] <^d> Yep, mdadm was complaining last night. [14:15:45] <^d> It looked like it was misconfigured for the old drive. 
[14:16:02] akosiaris: OK [14:16:25] <^d> ottomata: But I see the partition on /dev/md2 and it's mounted at the right place so \o/ [14:16:25] that being said beta is probably pretty close to what production should have, plus some monitoring [14:16:52] yup, now its cool [14:16:57] i manually recreated it [14:16:58] (03PS2) 10Giuseppe Lavagetto: mediawiki: use HHVM everywhere on HAT appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/156303 [14:17:23] <_joe_> mmm I forgot the -D :/ [14:20:06] (03PS3) 10Giuseppe Lavagetto: mediawiki: use HHVM everywhere on HAT appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/156303 [14:37:48] <^d> manybubbles: Do we want to repool 18 now? [14:38:04] ^d and ottomata: it wouldn't hurt to repool it [14:38:13] yes [14:38:15] its ready [14:38:18] sweet [14:38:22] <^d> Yeah, if he can do pybal I can do the shard allocation. [14:38:27] i can [14:38:31] I did the shard allocation [14:38:31] is that ready to do? [14:39:10] This is interesting… I got a new phone yesterday and pulled the sim out of my old phone. Now I'm trying to use authenticator on the old phone, and my keys don't work. [14:39:16] <^d> manybubbles: Ah ok, missed that thx! [14:39:21] <^d> ottomata: Go for it :) [14:39:26] Can it be that just pulling the sim made the clock drift so much that the keys no longer match up? [14:39:44] !log started shards reallocating on elastic1018 [14:40:06] manybubbles: can I depool 1019 while we're at it? I'm going to try to get chris to swap some disks so we can get it back online [14:40:15] its better if it is depooled while we do that, right? [14:40:16] andrewbogott: have you tried doing the clock sync within the authenticator app? [14:40:16] ottomata: sure! [14:40:26] ottomata: doesn't make a big difference, really [14:40:32] greg-g: No, I didn't know that was a thing. 
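On the clock drift question: authenticator apps typically generate time-based one-time passwords (TOTP, RFC 6238), where each code is derived from an HMAC over the current 30-second time step, so a phone clock that is off by more than the server tolerates produces codes that never match. A minimal sketch, assuming standard SHA-1 TOTP; the secret below is the RFC's published test key, not a real one.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA-1) for the given Unix time."""
    counter = unix_time // step                      # index of the time step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                          # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


RFC_TEST_SECRET = b"12345678901234567890"

if __name__ == "__main__":
    now = 1111111109
    # Two seconds later the clock crosses into the next 30-second step,
    # so the code changes even though the secret is the same.
    print(totp(RFC_TEST_SECRET, now, digits=8))
    print(totp(RFC_TEST_SECRET, now + 2, digits=8))
```

Two clocks that disagree about which 30-second step they are in will compute different codes from the same secret, which fits the symptom described here.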
I'll try [14:40:54] !log re-pooled elastic1018 [14:41:01] greg-g: Yeah, it fails; apparently it won't sync over wifi [14:41:02] ottomata: pybal will keep it out of the actual pool while elasticsearch isn't up and if it comes up and goes down again cirrus will just retry [14:41:16] that has worked really well in the past so I'm confident in it [14:41:26] ok, well, i took it out anyway [14:41:31] never hurts to have more safeguards i guess [14:41:37] <^d> Nice thing about disabling it in pybal means it won't even try though :) [14:41:42] <^d> But yes, it won't fail if we don't. [14:42:17] (03PS1) 10Physikerwelt: WIP:mathoid role for production [operations/puppet] - 10https://gerrit.wikimedia.org/r/156542 [14:43:48] andrewbogott: er? huh weird [14:44:04] greg-g: that was it! I reset my clock by hand and now things work. [14:44:05] akosiaris: I have no what to change for the production role... do you think a common role would make sense here [14:44:06] thx [14:44:13] andrewbogott: :) [14:45:13] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC [14:45:14] physikerwelt: niah avoid it. it has proven to be a less flexible solution that is appears to be [14:45:59] they will deviate anyway [14:46:29] greg-g: are you ok with https://gerrit.wikimedia.org/r/#/c/156501 for swat? [14:46:40] * anomie was just about to ask the same question [14:46:43] i forgot (e.g. unaware) of that setting last night [14:47:00] so opting currently has no effect [14:47:22] (03CR) 10Alexandros Kosiaris: updating install-server module for new codfw rows and install2001 params second ps: fixing two ip address mistakes/typos from bblack's revie (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/156210 (owner: 10RobH) [14:48:49] akosiaris: I can imagine. However, can you recommend a good template for production... 
otherwise I have no idea how firewall settings and jenkins settings translate [14:49:03] (03PS1) 10Alexandros Kosiaris: Fix a couple of errors with codfw private DHCP stanzas [operations/puppet] - 10https://gerrit.wikimedia.org/r/156547 [14:49:14] physikerwelt: jenkins settings don't translate.. just remove them [14:49:46] also contint stuff does not translate remove it [14:50:18] the ferm::service rule is just fine as is, just rename ferm::service { 'http' to ferm::service { 'mathoid' [14:50:33] please do that on both beta and production btw [14:50:49] akosiaris: ok got it [14:51:01] system::role also needs a small update. remove (on beta) [14:51:11] the rest seems fine [14:51:15] James_F|Away, aude, MaxSem, Krinkle: Ping for SWAT in about 9 minutes. [14:51:17] aude: anomie yeah [14:51:29] greg-g: thanks [14:51:37] aude: anomie that ones a "oops, needed this yesterday" patch :) [14:51:47] akosiaris: what's about @monitor_group { 'mathoid_eqiad': description => 'eqiad mathoid servers' } does that work [14:51:48] next time i know [14:51:56] aude: :) yep, thanks! [14:51:56] greg-g: Since the comment said it needed approval, I made sure the approval was there (: [14:52:09] by default, the setting is not there in beta features and then it works [14:52:12] so worked for me [14:52:13] anomie: :) [14:52:28] * aude bit confused last night [14:52:39] physikerwelt: yes uncomment that as well [14:53:08] but don't put it inside any class or it won't be evaluated by the icinga server [14:53:29] confusing, I know but... [14:53:35] (03PS2) 10Physikerwelt: WIP:mathoid role for production [operations/puppet] - 10https://gerrit.wikimedia.org/r/156542 [14:55:06] akosiaris: does mathoid_eqiad refers to an entity that has be created elsewhere [14:55:45] (refer) [14:56:18] in the contrary. it creates the entity. 
But due to the @ in front and some import rules in site.pp it works [14:56:57] and the icinga server realizes the creation via collection in its own manifest [14:58:17] akosiaris: OK I trust in good code review for this change [14:58:41] (03PS1) 10BBlack: fixup install2001 rev dns [operations/dns] - 10https://gerrit.wikimedia.org/r/156550 [14:59:09] (03CR) 10BBlack: [C: 032] fixup install2001 rev dns [operations/dns] - 10https://gerrit.wikimedia.org/r/156550 (owner: 10BBlack) [14:59:14] bblack: I assume you are working on that too [14:59:29] on install2001? [14:59:31] yes [15:00:15] not really, I thought robh was. I just was checking on whether it was reachable, etc and noticed the DNS issue [15:00:28] ah cool [15:00:39] aude: Since you're the only one who responded since my ping, you get to go first for SWAT. (: [15:00:51] (03PS2) 10Anomie: Add other-projects beta feature to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156501 (owner: 10Aude) [15:00:54] yeah I am looking at it to help robh. Thanks for the DNS issue then :-) [15:01:05] anomie: Hey. [15:01:09] anomie: I responded! ;-) [15:01:11] anomie: On the Beta Feature, let me check it out quickly. [15:01:12] (03CR) 10Anomie: [C: 032] "SWAT" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156501 (owner: 10Aude) [15:01:15] (03Merged) 10jenkins-bot: Add other-projects beta feature to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156501 (owner: 10Aude) [15:01:16] ok [15:01:20] yea i have to adjust the isntall server apt params today [15:01:21] … ah. [15:01:26] James_F: too late [15:01:27] James_F: greg already responded about the beta feature. Thanks though. [15:01:27] so it'll serve the image [15:01:38] sorry, serve the apt repo [15:01:40] anomie: Aha. OK, it's still missing a few things. [15:01:45] manybubbles: poke? 
:D [15:01:47] aude: There's no link in https://www.mediawiki.org/wiki/Beta_Features#Current_Beta_Features [15:01:47] (thx for noticing the missing . bblack) [15:01:54] can do [15:02:15] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Add other-projects beta feature to whitelist [[gerrit:156501]] (duration: 00m 27s) [15:02:23] aude: ^ test [15:02:27] doing [15:02:38] pleas [15:02:38] e [15:02:44] \o/ [15:02:45] aude: Otherwise looks good – you've even got https://www.mediawiki.org/wiki/Talk:Beta_Features/Other_projects_sidebar Flow-ified. [15:02:51] heh [15:03:01] * aude tries opting out and verify [15:03:10] James_F: Ready for the VE patches? [15:03:16] Yeah. [15:03:26] Doing the wmf18 one first [15:03:38] Cool. [15:04:01] looks good [15:05:09] anomie, pong [15:05:47] MaxSem: I'll do yours after the two VE patches [15:06:33] bah, https://www.mediawiki.org/w/index.php?title=Beta_Features&action=edit&section=3 cannot find section [15:06:47] aude: Use VE! ;-) [15:06:50] doing [15:07:02] omg, scary with translate tags [15:07:06] aude: (The Translate Extension breaks sections. Known bug.) [15:07:55] (03PS3) 10Physikerwelt: Mathoid role for production [operations/puppet] - 10https://gerrit.wikimedia.org/r/156542 (https://bugzilla.wikimedia.org/69989) [15:09:05] !log anomie Synchronized php-1.24wmf18/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWSaveDialog.js: SWAT: Make the VE Resolve conflict button actually appear [[gerrit:156493]] (duration: 00m 09s) [15:09:13] James_F: ^ test please [15:09:18] Will do. [15:10:35] $site = $main_ipaddress ? { [15:10:35] /^208\.80\.15[23]\./ => 'pmtpa', [15:10:43] bblack: yes was is breaking now... [15:10:49] s/was/what/ [15:11:14] guess :-( [15:11:26] * James_F patients waits for bits to clear. [15:11:45] anomie: Sorry for the delay in testing. :-( [15:14:50] anomie: Yup, looks good. Thanks!
[15:16:40] (03CR) 10Alexandros Kosiaris: [C: 032] Fix a couple of errors with codfw private DHCP stanzas [operations/puppet] - 10https://gerrit.wikimedia.org/r/156547 (owner: 10Alexandros Kosiaris) [15:17:55] <_joe_> 4~/win 33 [15:18:07] <_joe_> meh [15:18:23] James_F: Working? [15:18:31] James_F: Doing wmf17 now [15:18:45] anomie: Thanks! [15:19:09] !log anomie Synchronized php-1.24wmf17/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWSaveDialog.js: SWAT: Make the VE Resolve conflict button actually appear [[gerrit:156494]] (duration: 00m 09s) [15:19:10] James_F: ^ test please [15:19:25] Doing. [15:19:27] MaxSem: You're next [15:19:41] yup [15:21:39] anomie: Confirmed working in wmf17 too. Thank you very much! [15:22:15] James_F: You're welcome [15:23:32] (03PS1) 10Alexandros Kosiaris: Add codfw in various places [operations/puppet] - 10https://gerrit.wikimedia.org/r/156553 [15:24:04] !log anomie Synchronized php-1.24wmf18/extensions/ApiSandbox/SpecialApiSandbox.php: SWAT: Fix retrieval of query modules in ApiSandbox [[gerrit:156500]] (duration: 00m 09s) [15:24:09] MaxSem: ^ test please [15:24:33] Krinkle: Around for SWAT? 
[15:24:35] anomie, works - thanks [15:24:37] yes [15:25:00] Krinkle: Ok, doing yours now [15:25:04] (03CR) 10BBlack: [C: 031] Add codfw in various places [operations/puppet] - 10https://gerrit.wikimedia.org/r/156553 (owner: 10Alexandros Kosiaris) [15:26:49] !log anomie Synchronized php-1.24wmf18/extensions/GlobalCssJs: SWAT: Maintenance script for deleting manual loading of global CSS/JS [[gerrit:156533]] [[gerrit:156534]] (duration: 00m 09s) [15:26:50] Krinkle: ^ Test please [15:27:41] (03PS2) 10Anomie: Fix spelling making config not properly apply [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156532 (owner: 10Manybubbles) [15:27:51] anomie: thanks [15:28:07] (03PS2) 10Alexandros Kosiaris: Add codfw in various places [operations/puppet] - 10https://gerrit.wikimedia.org/r/156553 [15:28:17] manybubbles: Since I'm already logged in, I'll merge your config change too [15:28:33] (03CR) 10Anomie: [C: 032] Fix spelling making config not properly apply [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156532 (owner: 10Manybubbles) [15:28:36] anomie: thanks [15:28:37] (03Merged) 10jenkins-bot: Fix spelling making config not properly apply [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156532 (owner: 10Manybubbles) [15:29:15] anomie: confirmed, script exists and works. [15:29:21] Krinkle: Thanks [15:29:22] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Fix typo in InitialiseSettings.php [[gerrit:156532]] (duration: 00m 10s) [15:29:24] manybubbles: ^ test please [15:30:03] anomie: works thanks [15:30:22] * anomie is done with SWAT! [15:32:57] !log restarted webstats-collector on gadolinium [15:34:56] Hm.. pushing to gerrit is throwing: [15:35:00] 404 Not Found [15:35:01]

The requested URL /tools/hooks/commit-msg was not found on this server.

[15:36:01] worked around by re-initialising git review -s and no-op amending commit. still weird though [15:36:12] (03PS1) 10Alexandros Kosiaris: Add puppet CNAME for codfw [operations/dns] - 10https://gerrit.wikimedia.org/r/156558 [15:45:54] cmjohnson1: good morning! [15:46:06] <^d> Is something up? [15:46:06] <^d> Reports of slowness on Gerrit. SSH sessions for myself and bd808 have been flakey last hour or so. [15:46:12] ottomata: hi [15:47:43] (03PS2) 10Alexandros Kosiaris: Add puppet/webproxy CNAMEs for codfw [operations/dns] - 10https://gerrit.wikimedia.org/r/156558 [15:48:00] akosiaris: Let me know if there something I can do to improve https://gerrit.wikimedia.org/r/#/c/156542/ ... for now I'll wait for code review [15:49:16] physikerwelt: you can add some monitoring [15:49:39] the rest is pretty much outside this changes scope (which is pretty good overall [15:50:25] like who will be doing deployments (you? someone else?) , a point of contact for the maintenance of the software/service (I suppose you?) [15:50:56] cmjohnson1: hiiii [15:50:57] ja so, 1018 is up and online [15:51:01] how's 1019? do you have spare (old) SSD to put back in it? [15:51:03] SSDs* [15:51:21] ottomata: 1019 is still getting the install error...has 2 new ssds in it [15:51:27] not the same [15:51:30] as b4 [15:51:38] ...new being the new ones we wnat to test? [15:51:50] or you mean, just the same ones [15:52:03] akosiaris: do you have an example for that? [15:52:05] we just want to have new ones available to test on 1016 once we get 1019 back up [15:52:11] new models [15:53:08] ottomata: the old ssds are back in it...the new 'test' ssds are waiting to go into another box once we identify it. My suggestion is to wait for new controller and test with that for max performance [15:55:01] !log (a few minutes ago) Shutdown GTT/TiNet transit link on cr2-eqiad [15:55:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is pretty good overall. 
A comment about monitoring and this is pretty much ready. A followup commit should assign the role to some ho" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/156542 (https://bugzilla.wikimedia.org/69989) (owner: 10Physikerwelt) [15:55:39] ottomataa: if you recall we were getting the the no target information msg yesterday on both 1018 and 1019 when we tried to install. I still don't know why. I thought it was the disk but since we didnt' replace any disk on 1018 it's a mystery still [15:55:40] physikerwelt: something better :-) [15:55:42] ok ja [15:55:46] I'm attached to dickson and have freenode lag too [15:55:46] we will probably put that in 1016 [15:56:00] cmjohnson1: oo, just making sure, did we confirm that elastic1001-1016 have the newer controllers? [15:56:23] cmjohnson1: you did replace one disk on 1018, right? [15:57:22] no, they don't have separate raid controllers...they're using the SATA controller [15:57:38] and yes, the disk was replaced and I rebuilt the raid yesterday [15:58:00] akosiaris: Thank you. Now I understand what you mean. [15:59:27] ok ja [15:59:28] oh [15:59:33] they dont' have separate raid controllers [15:59:36] ok..... [15:59:47] wait so, can or can't we use elastic1016 to test the new model SSDs? [16:00:43] we can test it on 1016 [16:01:04] i think if we wait for the new controller card that robh ordered we'll get better performance [16:01:36] either way...we have 16 of the older r410's that we would want to use...so works either wya [16:02:11] (03PS4) 10Giuseppe Lavagetto: add twentyafterfour to deployment group [operations/puppet] - 10https://gerrit.wikimedia.org/r/156177 (owner: 1020after4) [16:02:29] akosiaris: check_command => 'check_http_on_port!10042' only checks if there is a service running on 10042 mathoid needs a few seconds to start is that a problem here? 
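On the startup-delay question: a port check only fails while nothing is answering, and Nagios-style monitors such as Icinga can be configured to retry a failing check several times (`max_check_attempts`) before treating it as a hard failure. A small sketch of that retry-before-alert idea, with a hypothetical `probe` callable standing in for the HTTP check; the attempt count and delay are made-up values, not the production settings.

```python
import time
from typing import Callable


def check_with_retries(
    probe: Callable[[], bool],
    max_attempts: int = 3,
    retry_delay: float = 0.0,
) -> str:
    """Return "OK" if any attempt succeeds, "CRITICAL" once all attempts fail."""
    for attempt in range(1, max_attempts + 1):
        if probe():
            return "OK"
        if attempt < max_attempts:
            time.sleep(retry_delay)  # wait before the next attempt
    return "CRITICAL"


def make_slow_starter(failures_before_up: int) -> Callable[[], bool]:
    """Hypothetical probe that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_up]

    def probe() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return probe


if __name__ == "__main__":
    # A service that is down for the first two probes still ends up OK.
    print(check_with_retries(make_slow_starter(2)))
```

Under that assumption, a service that comes up within the retry window never reaches the alerting state; whether that holds here depends on how the check is configured.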
[16:02:37] well,i guess comparing ssd performance would be better if the system was the same, i think [16:02:46] unless, we are going to order the new nodes with r410 controllersd, and thikn there could be some performance change (good or bad) because of that [16:03:17] cmjohnson1: basically, we just want to make sure that ordering a bunch of these newer SSDs is not a bad idea [16:03:18] so we want to see them in operation before we place an order [16:03:31] ok, so either way, the next step is to get 1019 up, and we are stuck on the netboot problem [16:03:37] yep..than adding to 1016 will be fine [16:05:45] (03CR) 10Giuseppe Lavagetto: [C: 032] add twentyafterfour to deployment group [operations/puppet] - 10https://gerrit.wikimedia.org/r/156177 (owner: 1020after4) [16:09:32] so, cmjohnson1, since we are stuck on 1019......Uhh, how do we get unstuck? [16:17:06] (03PS4) 10Physikerwelt: Mathoid role for production [operations/puppet] - 10https://gerrit.wikimedia.org/r/156542 (https://bugzilla.wikimedia.org/69989) [16:19:32] Hi, is there a version of semanticmediawiki that works against wmf/1.24wmf ? 
[16:24:49] renoirb: We are running the versions listed at https://wikitech.wikimedia.org/wiki/Special:Version [16:25:21] gosh, thanks, i havent thought of searching there [16:25:41] right, SMW 1.9 is problematic with dependencies [16:25:42] renoirb: The wmf/1.24wmfX branches actually include SMW in the git submodules [16:26:03] oh, so i should not include them myself then [16:26:05] good to know [16:26:10] cmjohnson1: we should move to test in the older box [16:26:20] cuz itll be a few days before we have the new controllers [16:26:21] i dont see them in extension/ from my last mediawiki-vagrant build [16:26:48] robh: agree and 16 of the elastic boxes are the R410's so we need to determine if they work in them as wll [16:26:48] only SemanticForms SemanticInternalObjects [16:26:51] brb [16:27:43] renoirb: They would need to be loaded via mediawiki::extension or hand built config in mw-vagrant [16:28:30] i was adding manually with a puppet class https://github.com/webplatform/mediawiki/blob/201408-upgrade/VAGRANT.md [16:28:33] gotta rework that part [16:28:49] 'composer require "mediawiki/semantic-media-wiki" "~2.0"', [16:28:59] and clearly, it broke our SMW heavy pages :/ [16:29:58] renoirb: It would be awesome to have a role that did all the right things to install via composer [16:30:38] renoirb: There is a role at https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/modules/role/manifests/wikitech.pp that tries to make a test version of wikitech. I haven't tried it myself, but it may give you some ideas. [16:33:07] good ! [16:33:20] bd808, I can work on that and make a patchset if nothing is successful [16:33:41] uhm. I mean, if nothing succesful exists, I can try to have one working then give a patchset [16:34:38] renoirb: andrewbogott is the "owner" of that role. A generic SMW role would be a nice addition. [16:35:53] I can't be much help here. Wikitech uses that ancient version of SMW which works fine... 
[16:36:02] there's a wikitech vagrant role, but it doesn't include SMW [16:36:37] renoirb: Sounds like you have a project then. :) [16:44:56] I had all my MW servers down, now it works. I feel better. [16:45:05] (yesterday) [16:46:13] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC [16:50:28] (03PS1) 10Physikerwelt: WIP: assign mathoid production hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) [16:51:25] bd808, i see role::wikitech, i can clearly not call that role myself though [16:51:27] let me see [16:51:32] cmjohnson1: hey Chris, is RT #7728 confirmed? [16:52:22] godog: yep that will work...updating ticket now [16:52:52] cool, thanks! [16:57:15] (03PS1) 10Chad: Adjust swift backup config for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156578 [16:58:06] (03CR) 10Chad: "This is what's already live." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156578 (owner: 10Chad) [16:58:52] (03CR) 10Filippo Giunchedi: [C: 031] Adjust swift backup config for Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156578 (owner: 10Chad) [17:01:33] (03PS2) 10Physikerwelt: WIP: assign mathoid production hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) [17:06:44] cmjohnson1: ja soooo hm, any ideas on how we get unstuck? [17:07:12] there is no semanticresultformats without smw 2 or @dev [17:07:13] https://github.com/SemanticMediaWiki/SemanticResultFormats/blob/master/composer.json [17:07:21] ottomata: not yet [17:07:40] I don't know what the issue si [17:07:41] is [17:07:44] aye [17:12:13] ^d: so, 1018 is up, right? 
[17:12:25] we are having trouble reinstalling 1019, chris is going to try some different things [17:12:31] but, in order to help move this whole thing along [17:12:48] would it be possible to reinstall 1016 with the newer model SSDs, even if we don't yet have 1019 up? [17:12:48] <^d> Yeah, far as I can tell. [17:13:15] chris probably won't have time to install the new ssds until tomorrow anyway [17:13:19] do we need to start moving shards off of 1016 now? [17:14:15] <^d> If it's not until tomorrow we could wait until this afternoon. [17:14:27] <^d> Don't need a full 24h. [17:14:44] ok [17:17:26] (CR) BBlack: [C: 1] Add codfw in various places [operations/puppet] - https://gerrit.wikimedia.org/r/156553 (owner: Alexandros Kosiaris) [17:18:53] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:39] Coren: mark picked up the RT for labmon1001 network access, btw. [17:21:18] * Coren nods. [17:21:37] YuviPanda: We'll sit down sometime next week and discuss alerting. [17:22:41] (PS1) RobH: adding codfw subnets into preseed file [operations/puppet] - https://gerrit.wikimedia.org/r/156579 [17:23:23] akosiaris: ^ you asked if i needed help, pls review ;] [17:23:45] * RobH has not self merged any changes other than mac address updates [17:23:51] solo reviewed i mean. [17:25:07] Coren: yeah, I realized we can very quickly setup icinga checks for toollabs and deployment-prep similar to how we do them for swift and mediawiki error rates.
[17:27:26] (CR) Alexandros Kosiaris: [C: 2] adding codfw subnets into preseed file [operations/puppet] - https://gerrit.wikimedia.org/r/156579 (owner: RobH) [17:28:02] akosiaris: thank you =] [17:29:18] (CR) Alexandros Kosiaris: [C: 2] Add codfw in various places [operations/puppet] - https://gerrit.wikimedia.org/r/156553 (owner: Alexandros Kosiaris) [17:29:59] (CR) Alexandros Kosiaris: [C: 2] Add puppet/webproxy CNAMEs for codfw [operations/dns] - https://gerrit.wikimedia.org/r/156558 (owner: Alexandros Kosiaris) [17:37:38] jouncebot: hi [17:37:43] jouncebot: next [17:37:43] In 0 hour(s) and 22 minute(s): Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140827T1800) [17:37:53] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:37:54] bot is back :) [17:42:09] (PS1) RobH: install2001 needs trusty [operations/puppet] - https://gerrit.wikimedia.org/r/156583 [17:42:38] (CR) RobH: [C: 2] install2001 needs trusty [operations/puppet] - https://gerrit.wikimedia.org/r/156583 (owner: RobH) [17:43:37] (CR) MaxSem: [C: -2] Disable mobile uploads [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156523 (https://bugzilla.wikimedia.org/62598) (owner: MaxSem) [17:46:45] (PS1) RobH: Trusty is now the default for installs [operations/puppet] - https://gerrit.wikimedia.org/r/156584 [17:48:46] (CR) RobH: [C: 2] "This decision was made in IRC discussion with Alex, Mark, and myself a few minutes ago.
Since all new installs have been trusty for testi" [operations/puppet] - https://gerrit.wikimedia.org/r/156584 (owner: RobH) [17:49:33] (PS1) RobH: Revert "install2001 needs trusty" [operations/puppet] - https://gerrit.wikimedia.org/r/156586 [17:50:05] (CR) RobH: [C: 2] "trusty is to be the default, reverting this to test the 'default' install behavior" [operations/puppet] - https://gerrit.wikimedia.org/r/156586 (owner: RobH) [17:50:58] (PS2) Yuvipanda: stats: Setup rsync for stat1002 as well [operations/puppet] - https://gerrit.wikimedia.org/r/156324 [17:52:09] (Abandoned) RobH: Revert "install2001 needs trusty" [operations/puppet] - https://gerrit.wikimedia.org/r/156586 (owner: RobH) [17:54:11] andre__: i wanted to talk to you but got "andre_" with a single underscore.. confusing much ;) [17:54:19] mutante, sorry [17:54:42] in my next life I will come up with a better nickname and will be a creative person. [17:54:53] haha, i told him all this stuff re: Bugzilla and then he is "do i know you?" [17:57:43] (PS2) RobH: Trusty is now the default for installs [operations/puppet] - https://gerrit.wikimedia.org/r/156584 [17:57:45] (PS1) RobH: Merge "Add codfw in various places" into production [operations/puppet] - https://gerrit.wikimedia.org/r/156589 [17:57:52] oh, hell, thats not intended [17:58:10] i fucked my local git branch [17:58:12] arghhh [17:58:22] andre__: use andre_wmf :D [17:58:38] (Abandoned) RobH: Merge "Add codfw in various places" into production [operations/puppet] - https://gerrit.wikimedia.org/r/156589 (owner: RobH) [17:59:12] JohnLewis: hmm, but that's weird for all the non-wikimedia channels I'm in on Freenode which don't care (and don't need to know) who I work for :-/ [18:00:05] yurik: Dear anthropoid, the time has come. Please deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140827T1800).
[18:00:21] andre__two_not_1 :p [18:01:03] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 1625 MB (2% inode=84%): [18:01:10] eek [18:01:47] (PS3) RobH: Trusty is now the default for installs [operations/puppet] - https://gerrit.wikimedia.org/r/156584 [18:03:15] (CR) Dzahn: [C: 2] git,svn - remove duplicate NameVirtualHost *:80 [operations/puppet] - https://gerrit.wikimedia.org/r/156191 (owner: Dzahn) [18:05:25] hmm, lots of memcached errors atm [18:05:27] (PS2) coren: Tool Labs: make crontab paranoid about empty files [operations/puppet] - https://gerrit.wikimedia.org/r/156294 (https://bugzilla.wikimedia.org/69355) [18:05:34] whatsup! [18:05:42] all mw1163 [18:06:23] (CR) Dzahn: "root@antimony:/etc/apache2# grep -r "NameVirtualHost \*:80" *" [operations/puppet] - https://gerrit.wikimedia.org/r/156191 (owner: Dzahn) [18:08:10] (PS4) RobH: Trusty is now the default for installs [operations/puppet] - https://gerrit.wikimedia.org/r/156584 [18:08:34] (CR) Dzahn: "if Aaron is blocked by this i recommend merging this and then just making a follow-up patch for the other remaining users separately (RT " [operations/puppet] - https://gerrit.wikimedia.org/r/155452 (owner: Dzahn) [18:09:17] (CR) RobH: [C: 2] Trusty is now the default for installs [operations/puppet] - https://gerrit.wikimedia.org/r/156584 (owner: RobH) [18:12:01] (CR) coren: [C: 2] Tool Labs: make crontab paranoid about empty files [operations/puppet] - https://gerrit.wikimedia.org/r/156294 (https://bugzilla.wikimedia.org/69355) (owner: coren) [18:13:43] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Epic puppet fail [18:14:35] mutante: maybe restarting nutcracker on mw1163 would help?
[18:16:03] AaronSchulz: can do that [18:16:33] !log restarted nutcracker on mw1163 [18:16:47] nutcracker start/running, process 10214 [18:20:11] hmm still seeing the errors [18:21:53] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:22:53] RECOVERY - Disk space on virt1000 is OK: DISK OK [18:23:18] AaronSchulz: where do you see them? [18:23:35] mw1163 appears normal afaict [18:24:10] fluorine logs [18:27:21] is this a right place to poke about lack of domain? [18:27:23] i.e. pl.m.wikimedia.org [18:28:31] AaronSchulz: nutcracker listens on 11212 and 22222 but in the fluorine logs it says it is trying :11211 ?? [18:29:13] 11211 is memcached [18:29:50] there is no memcached there though [18:29:57] but something seems to expect one [18:29:59] lazowik: kinda, but a bug report would be good. (Product: Wikimedia, Component: DNS, probably) [18:30:11] (CR) Manybubbles: [C: 1] Adjust swift backup config for Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156578 (owner: Chad) [18:30:21] lazowik: the best would be if you can describe it in a short mail to the "Requests" address in the topic [18:30:33] heh, or what greg-g said [18:30:41] :) [18:30:53] jgage: so the logs say on server "127.0.0.1:11211": SERVER HAS FAILED [18:31:04] but mw1163 memcached: unrecognized service [18:31:27] mw1163 enwiki: Memcached error for key "enwiki ... and only this one box ? [18:32:47] lazowik: Do any of the chapter type wikis have mobile subdomains? [18:32:51] hmm..
14:54 _joe_: ran scap-rebuild-cdb on mw1163 [18:33:19] 16:22 cmjohnson1: shutting down mw1163 to replace DIMM [18:33:41] 09:26 mutante: disabling mw1163 in pybal [18:33:45] all this stuff in SAL :p [18:33:59] Reedy: dunno [18:34:07] but we have MobileFrontend enabled [18:34:12] and no domain… [18:35:11] You shouldn't *need* a domain [18:36:16] yeah, all the mobile entries are specifically added [18:36:41] https://github.com/wikimedia/operations-dns/blob/master/templates/wikimedia.org [18:36:44] That's a fail github [18:37:25] Reedy: hmm, okay, mobile opens a desktop version normally [18:37:42] so only the link at the bottom links to m. [18:37:45] lazowik: I'm not saying they shouldn't be added, just fyi [18:38:26] OK, I'll ask what we want first and then file a bug if needed [18:38:39] we have control over the link to mobile version? [18:38:40] https://pl.wikimedia.org/wiki/Strona_g%C5%82%C3%B3wna [18:39:02] "Wersja dla urządzeń mobilnych" links to m. [18:39:38] fuchsia.ipmi [18:39:39] ffs [18:39:40] AaronSchulz: the server is disabled in pybal, can we disable it elsewhere so it's not even trying for now? [18:39:46] 'wmgMobileUrlTemplate' => array( [18:39:46] 'default' => '%h0.m.%h1.%h2', [18:39:51] Reedy: that's what you get for github :) http://git.wikimedia.org/blob/operations%2Fdns.git/HEAD/templates%2Fwikimedia.org [18:40:00] there, that is much more readable [18:40:09] lol [18:40:15] github 100, gitblit 1 [18:40:28] :p [18:40:44] (PS2) Aaron Schulz: Use the hash preprocessor when using HHVM [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155594 [18:41:03] mutante: how is it getting any traffic? health checks or something?
[18:41:14] !log deleting glance images that have no labs VMs associated with them: 167350c0-0410-4336-9a94-9c8da55f26a3 | ubuntu-11.04-natty; 7239b0b3-0a02-474b-a7b8-0216f60bf341 | ubuntu-10.04-lucid (deprecated); a3ee8fe3-b9f6-4a96-bad2-8bac64affde0 | ubuntu-11.10-oneiric [18:42:21] meh, now I'm just confused [18:42:45] lazowik: We should probably just add the dns entries ;) [18:43:07] ie all of https://noc.wikimedia.org/conf/highlight.php?file=wikimedia.dblist [18:43:07] Reedy: if we want mobile => subdomain needed, don't want => not needed? [18:43:33] or we can have mobile without subdomain? [18:43:36] ok [18:43:36] AaronSchulz: i don't know, seems too often for health checks, i just see that it's always "enwiki:resourceloader:filter:minify-js" [18:43:39] file a bug? [18:46:01] mutante: is there pybal conf for both load.php and regular apache requests? [18:46:13] I thought those were merged...or maybe they were just set to be the same [18:46:31] AaronSchulz: i just see one line with mw1163 [18:46:33] eqiad/apaches:{'host': 'mw1163.eqiad.wmnet', 'weight': 12, 'enabled': False } [18:46:48] that disables it for regular apache [18:46:51] dunno about load.php [18:47:13] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC [18:48:44] (PS1) Reedy: Add mobile subdomains for Wikimedia chapter wikis [operations/dns] - https://gerrit.wikimedia.org/r/156596 [18:48:47] lazowik: ^^ [18:48:53] (PS1) Dzahn: add pl.m CNAME for PL chapter mobile frontend [operations/dns] - https://gerrit.wikimedia.org/r/156597 [18:48:56] (CR) jenkins-bot: [V: -1] Add mobile subdomains for Wikimedia chapter wikis [operations/dns] - https://gerrit.wikimedia.org/r/156596 (owner: Reedy) [18:48:56] hrmm [18:48:58] Reedy: thanks [18:49:47] Reedy: they should all be .m [18:49:47] oh, fail [18:49:50] (PS3) Aaron Schulz: Use the hash preprocessor when using HHVM [operations/mediawiki-config] - 
https://gerrit.wikimedia.org/r/155594 [18:49:53] mutante: yup, just noticed :P [18:50:06] so i made one for just pl .. shrug [18:50:18] do we want all or just the ones needed ? not sure [18:51:46] CNAME not allowed alongside other data at domainname 'pl.wikimedia.org.' [18:52:03] (PS2) Reedy: Add mobile subdomains for Wikimedia chapter wikis [operations/dns] - https://gerrit.wikimedia.org/r/156596 [18:52:17] mutante: I just thought it made sense to do the lot... [18:52:26] One, we're currently giving broken mobile links [18:52:36] and 2, if one wants it now, others are likely to... [18:53:09] (PS1) coren: Tool Labs: apply /etc/iptables.conf on boot [operations/puppet] - https://gerrit.wikimedia.org/r/156599 (https://bugzilla.wikimedia.org/53181) [18:53:31] noboard_chapters makes it ugly :) [18:53:33] PROBLEM - check if dhclient is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:53:33] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:53:39] (PS3) Reedy: Add mobile subdomains for Wikimedia chapter wikis [operations/dns] - https://gerrit.wikimedia.org/r/156596 [18:53:43] PROBLEM - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:53:46] that, and I still missed .m on it :P [18:54:24] PROBLEM - check configured eth on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:24] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:24] PROBLEM - puppet last run on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:54:58] cajoel: ^ rhenium has an issue, that's flow [18:55:13] mutante: yeah, lets use n tabs to make everything nice and smooth :p [18:55:23] mutante: haven't logged in to that in a long while [18:55:33] PROBLEM - SSH on rhenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:34] RECOVERY - DPKG on rhenium is OK: All packages OK [18:56:08] load 6 [18:56:13] no reason for that [18:56:14] RECOVERY - check configured eth on rhenium is OK: NRPE: Unable to read output [18:56:14] RECOVERY - Disk space on rhenium is OK: DISK OK [18:56:14] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:56:20] self healing [18:56:22] (CR) Michał Łazowik: [C: 1] "Not like you need that :p" [operations/dns] - https://gerrit.wikimedia.org/r/156596 (owner: Reedy) [18:56:23] RECOVERY - SSH on rhenium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [18:56:24] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient [18:56:24] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:56:30] (CR) Andrew Bogott: [C: 1] Tool Labs: apply /etc/iptables.conf on boot [operations/puppet] - https://gerrit.wikimedia.org/r/156599 (https://bugzilla.wikimedia.org/53181) (owner: coren) [18:56:31] mutante: but those processes shouldn't be doing anything [18:56:43] mutante: can you help me find ganglia page for this host? [18:58:02] killed them all off [18:58:02] cajoel: hit the Search tab and search for host name [18:58:06] cajoel: https://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=rhenium.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [18:58:07] on ganglia i mean [18:58:11] that [18:58:34] lazowik: thanks [18:58:42] looks like something sent it off in to lala land [18:58:48] it's not supposed to be doing any active work.
[18:59:08] I'd be happy to have it rebuilt as 14.04 [18:59:12] how hard is that? [18:59:37] load average: 223.98, 236.89, 101.88 [18:59:44] something is very funky [18:59:45] well, Robj just made Trust the default [18:59:50] Robh [18:59:55] Trusty, arg, cant type [19:00:09] it was OOM killing [19:00:15] so if it is completely puppetized.. it should in theory.. be easy [19:00:38] ? [19:00:52] (CR) Chad: [C: 2] Adjust swift backup config for Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156578 (owner: Chad) [19:00:58] (Merged) jenkins-bot: Adjust swift backup config for Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156578 (owner: Chad) [19:01:03] RobH: i just said you made Trust the default installer earlier today, because cajoel asked about installing 14.04 [19:01:07] RobH: need any volunteer machines for trusty-ification? [19:01:07] ahh [19:01:11] and i keep making typing mistakes [19:01:46] cajoel: are you asking if we have any systems that need to get trusty? not new requests at the moment (other than all of codfw which is halted behind my install2001 woes) [19:01:50] update-manager -d ? [19:02:08] im not sure what you are asking about. [19:02:11] !log demon Synchronized wmf-config/CirrusSearch-production.php: swift is sweet now (duration: 00m 05s) [19:02:23] any thumbs down about SW upgrades direct from 12.04 to 14.04 (vs re-install) [19:02:23] or what you mean by volunteer machines [19:02:54] RobH: I have a machine that I would volunteer to upgrade to 14.04 -- not sure how that would/should happen [19:03:32] Well, have to decide if its reinstall or upgrade in place. the former is just basically rebuilding from scratch [19:03:52] upgrading in place to trusty could be trickier, i think the opsen dealing with analytics boxen had to recently do that [19:04:05] mutante: can you run 'service hhvm stop' on osmium?
[19:04:48] so the first question is does the box you want to upgrade have data that we have to keep? [19:04:58] no data [19:05:02] perfect time to do it [19:05:43] just a heads up, we're going to start a large global rename (100k+ edits), I don't expect any issues, but I'll be monitoring just in case [19:05:50] AaronSchulz: i can, why do you want it to stop though? [19:06:02] so I can run it with memory tracing/profiling stuff instead [19:06:07] gotcha, ok [19:06:13] nothing is using it...it's just being annoying with auto-restart [19:06:16] AaronSchulz: done [19:06:36] AaronSchulz: that did not stop it yet [19:06:48] or you are quick to restart :) [19:08:04] I started it [19:08:20] in any case, it should probably just be ensured to not be running on osmium [19:08:37] it's only ever needed for debugging anyway [19:11:30] AaronSchulz: site.pp says "HHVM staging" [19:11:40] (CR) Ottomata: puppet: hiera backend for the WMF (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/151869 (owner: Giuseppe Lavagetto) [19:14:53] (PS1) Dzahn: ensure hhvm is stopped on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/156604 [19:15:04] AaronSchulz: ^ but then you are going to stop puppet when you debug?
[19:17:02] (PS1) Gage: Logstash: GELF filter: additional field: Server [operations/puppet] - https://gerrit.wikimedia.org/r/156606 [19:17:12] (PS2) Dzahn: ensure hhvm is stopped on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/156604 [19:18:13] (CR) Gage: [C: 2] "trivial" [operations/puppet] - https://gerrit.wikimedia.org/r/156606 (owner: Gage) [19:18:56] (PS4) Dzahn: Add mobile subdomains for Wikimedia chapter wikis [operations/dns] - https://gerrit.wikimedia.org/r/156596 (owner: Reedy) [19:20:25] (CR) Aaron Schulz: [C: 1] ensure hhvm is stopped on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/156604 (owner: Dzahn) [19:20:33] mutante: I might need https://gerrit.wikimedia.org/r/#/c/154378/1 too [19:21:39] (PS3) Dzahn: ensure hhvm is stopped on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/156604 [19:21:55] (CR) Dzahn: [C: 2] ensure hhvm is stopped on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/156604 (owner: Dzahn) [19:22:08] (PS2) Dzahn: Provision role::jobrunner on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/154378 (owner: Ori.livneh) [19:22:53] mutante: my only issue with that patch is the jobrunner service spamming its error log since hhvm isn't running (and leak if it was anyway) [19:23:09] I guess we can assert that the redisJobRunner isn't running either [19:23:39] ok, let's do it [19:24:17] fixes quoting :p [19:24:27] (PS4) Dzahn: ensure hhvm is stopped on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/156604 [19:25:28] AaronSchulz: should we also ensure redis-server is stopped? [19:25:53] redis-server shouldn't matter, is that even running there?
[19:26:07] it is [19:26:46] (PS5) Dzahn: ensure hhvm is stopped on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/156604 [19:27:04] still could not submit without another rebase..hrmm [19:27:13] (CR) Dzahn: [C: 2] ensure hhvm is stopped on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/156604 (owner: Dzahn) [19:28:18] (CR) Dzahn: [C: 2] Provision role::jobrunner on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/154378 (owner: Ori.livneh) [19:28:25] (PS3) Dzahn: Provision role::jobrunner on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/154378 (owner: Ori.livneh) [19:28:48] <> should not be running [19:28:54] heh, yeah redis-server is running...funny [19:29:01] (CR) Dzahn: [C: 2] Provision role::jobrunner on osmium [operations/puppet] - https://gerrit.wikimedia.org/r/154378 (owner: Ori.livneh) [19:29:03] no idea what is using that...probably nothing [19:29:52] jobrunner: unrecognized service [19:29:56] so that's that [19:30:14] we will get it soon i suppose [19:30:30] applies the puppet [19:30:51] hopefully this will get apache set up run for running jobs [19:31:31] puppet disabled by you? [19:32:05] I never touched puppet [19:35:04] AaronSchulz: soo, we cant just let puppet apply that change :/ [19:35:20] puppet was disabled and when i enabled it [19:35:29] Duplicate declaration: Service[hhvm] [19:35:33] grr.. yea [19:36:12] * AaronSchulz goes afk for a while [19:36:49] <> should at least give a 500 if apache is working (and a json back if hhvm is running) [19:38:28] mutante: eh [19:38:32] don't enable it please [19:38:36] i'm in the middle of debugging something there [19:38:51] sorry, it's been a couple of days, didn't want to !log every time [19:39:10] !log re-disabled puppet on osmium [19:39:11] ori-mtng: but .. how am i going to apply the change you suggested :) [19:39:22] to let Aaron have jobrunner role [19:39:44] you want the service running or stopped?
[19:39:47] hhvm service that is [19:39:55] because Aaron wants it stopped [19:40:07] the jobrunner service shouldn't be running [19:40:09] hhvm should be [19:40:18] AaronSchulz: can you hold off for a few? [19:40:39] sorry to pop out of nowhere like that [19:40:54] (PS3) Yuvipanda: stats: Setup rsync for stat1002 as well [operations/puppet] - https://gerrit.wikimedia.org/r/156324 [19:40:55] ottomata: ^ [19:40:56] meeting ?:) [19:41:21] mutante: not a very interesting one ;) [19:42:41] (PS1) Dzahn: Revert "ensure hhvm is stopped on osmium" [operations/puppet] - https://gerrit.wikimedia.org/r/156665 [19:43:02] (PS1) Gage: Logastash: Hadoop: resolve source IP into hostname [operations/puppet] - https://gerrit.wikimedia.org/r/156666 [19:43:37] (CR) Ottomata: stats: Setup rsync for stat1002 as well (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/156324 (owner: Yuvipanda) [19:44:24] (PS2) Gage: Logastash: Hadoop: resolve source IP into hostname [operations/puppet] - https://gerrit.wikimedia.org/r/156666 [19:44:36] (CR) Yuvipanda: stats: Setup rsync for stat1002 as well (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/156324 (owner: Yuvipanda) [19:44:42] (PS2) Dzahn: Revert "ensure hhvm is stopped on osmium" [operations/puppet] - https://gerrit.wikimedia.org/r/156665 [19:45:06] (PS4) Yuvipanda: stats: Setup rsync for stat1002 as well [operations/puppet] - https://gerrit.wikimedia.org/r/156324 [19:45:08] (PS1) Dzahn: Revert "Provision role::jobrunner on osmium" [operations/puppet] - https://gerrit.wikimedia.org/r/156667 [19:45:21] (CR) Gage: [C: 2] Logastash: Hadoop: resolve source IP into hostname [operations/puppet] - https://gerrit.wikimedia.org/r/156666 (owner: Gage) [19:46:39] ottomata: ^ updated, and doesn't the symlink fix things? [19:47:32] (CR) Dzahn: [C: 2] "Ori is debugging one thing, Aaron is debugging another thing.
conflicting interests here about the service running or not running. also, c" [operations/puppet] - https://gerrit.wikimedia.org/r/156665 (owner: Dzahn) [19:47:36] <_joe_> ori-mtng: don't cheat, you often work during meetings :) [19:48:28] !log restarted webstatscollector on gadolinium with berkley db in tmpfs at /run/shm/webstats [19:49:06] oh i missed the symlink YuviPanda [19:49:08] :) [19:49:16] sure. [19:49:35] what a mess! :) [19:49:50] ottomata: indeed [19:50:27] (CR) Ottomata: [C: 2 V: 2] stats: Setup rsync for stat1002 as well [operations/puppet] - https://gerrit.wikimedia.org/r/156324 (owner: Yuvipanda) [19:51:15] mutante: i'm puppet-merging a change of yours [19:51:19] ensure hhvm is stopped on osmium [19:51:20] s'ok? [19:51:34] (CR) Dzahn: [C: 2] Revert "Provision role::jobrunner on osmium" [operations/puppet] - https://gerrit.wikimedia.org/r/156667 (owner: Dzahn) [19:51:42] (PS2) Dzahn: Revert "Provision role::jobrunner on osmium" [operations/puppet] - https://gerrit.wikimedia.org/r/156667 [19:51:46] ottomata: I think you've to mkdir /srv/aggregate-data on stat1002 [19:52:15] ottomata: yes please, already had the script waiting for my YES [19:54:27] YuviPanda: done [19:54:46] (CR) Dzahn: [C: 2] "not yet per Ori, because he is debugging something.
Aaron, please talk to ori when to re-revert :)" [operations/puppet] - https://gerrit.wikimedia.org/r/156667 (owner: Dzahn) [19:55:47] ottomata: cool [19:55:57] (PS1) Gage: Hadoop: Logstash: send to logstash1002 [operations/puppet] - https://gerrit.wikimedia.org/r/156668 [19:56:25] (PS2) Gage: Hadoop: Logstash: send to logstash1002 [operations/puppet] - https://gerrit.wikimedia.org/r/156668 [19:57:13] (PS1) Chad: Install elasticsearch plugins to jenkins slaves [operations/puppet] - https://gerrit.wikimedia.org/r/156669 [19:57:33] (CR) Gage: [C: 2] Hadoop: Logstash: send to logstash1002 [operations/puppet] - https://gerrit.wikimedia.org/r/156668 (owner: Gage) [19:57:40] (PS1) Rush: Revert: Phabricator files domain [operations/puppet] - https://gerrit.wikimedia.org/r/156670 [19:58:08] (PS2) Rush: Revert: Phabricator files domain [operations/puppet] - https://gerrit.wikimedia.org/r/156670 [19:58:14] (CR) Rush: [C: 2 V: 2] Revert: Phabricator files domain [operations/puppet] - https://gerrit.wikimedia.org/r/156670 (owner: Rush) [19:58:23] (CR) Chad: "Want to get browser tests passing again. Having up to date plugins would help :)" [operations/puppet] - https://gerrit.wikimedia.org/r/156669 (owner: Chad) [20:00:05] gwicke, subbu, cscott: Respected human, time to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140827T2000). Please do the needful.
[20:00:53] Respected jouncebot, we are not going to be deploying Parsoid today /cc cscott gwicke [20:01:02] This one is so long "/Stage[main]/Mediawiki::Jobrunner/File[/etc/default/jobrunner]" [20:01:25] the job runner while building the first boot, not a good idea :/ [20:10:29] (PS1) Ottomata: Run webstats collector in /run/shm to reduce disk io for temp files [operations/puppet] - https://gerrit.wikimedia.org/r/156673 [20:11:33] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3333 MB (3% inode=99%): [20:13:16] (CR) Manybubbles: [C: 1] Install elasticsearch plugins to jenkins slaves [operations/puppet] - https://gerrit.wikimedia.org/r/156669 (owner: Chad) [20:14:01] (PS2) Ottomata: Run webstats collector in /run/shm to reduce disk io for temp files [operations/puppet] - https://gerrit.wikimedia.org/r/156673 [20:14:05] bd808, it breaks at git clone --recurse-submodules: "Error: git clone --recurse-submodules --branch 'wmf/1.24wmf16' https://gerrit.wikimedia.org/r/p/mediawiki/core.git /vagrant/mediawiki returned 1 instead of one of [0] [20:14:05] " [20:14:33] RECOVERY - Disk space on stat1002 is OK: DISK OK [20:15:13] mutante: is puppet still broken on osmium? [20:15:21] Hi AaronSchulz :) [20:15:24] (PS1) RobH: adding in codfw realm in ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/156677 [20:15:41] AaronSchulz: ori asked us to leave it disabled, i reverted the puppet changes [20:15:56] AaronSchulz: please talk to ori when he is done debugging his stuff, conflict of interest here [20:16:20] renoirb: so debug it? Maybe a timeout; maybe something worse. If you run that command as root inside the vm you will probably get better output.
Or run `PUPPET_DEBUG=1 vagrant provision` [20:16:26] (CR) QChris: [C: 1] Run webstats collector in /run/shm to reduce disk io for temp files [operations/puppet] - https://gerrit.wikimedia.org/r/156673 (owner: Ottomata) [20:16:55] (PS3) Ottomata: Run webstats collector in /run/shm to reduce disk io for temp files [operations/puppet] - https://gerrit.wikimedia.org/r/156673 [20:16:57] oh! thanks, didnt know about the PUPPET_DEBUG=1 shell variable for vagrant [20:17:01] (CR) RobH: [C: 2] "'lets see if this does it' -said every sysadmin ever" [operations/puppet] - https://gerrit.wikimedia.org/r/156677 (owner: RobH) [20:17:03] (CR) Ottomata: [C: 2 V: 2] Run webstats collector in /run/shm to reduce disk io for temp files [operations/puppet] - https://gerrit.wikimedia.org/r/156673 (owner: Ottomata) [20:17:21] RobH, ok to merge? [20:17:27] adding in codfw realm in ganglia [20:17:38] renoirb: It's secret magic :) Search in Vagrantfile to see what it does [20:18:03] ottomata: yes [20:18:12] i was about to ask you same [20:18:13] heh [20:18:21] done :) [20:18:25] thx [20:19:07] FYI https://gist.github.com/renoirb/27a911fd67118f640e51 [20:19:35] bd808, I ran manually in the VM /vagrant/mediawiki the `git submodule update --init --recursive` then ran `vagrant provision` and all went well. [20:19:48] the gist I provided is the error message for posterity [20:19:53] along with the config I put [20:20:46] "returned 1 instead of one of [0]" is mostly unhelpful. Thanks puppet :/ [20:21:11] The rest makes sense, no mw-core clone so I give up on the rest [20:23:34] (PS1) QChris: Correct webstatcollector's upstart name [operations/puppet] - https://gerrit.wikimedia.org/r/156680 [20:25:41] ottomata: ^ [20:26:06] oo thanks [20:26:21] (PS1) RobH: install2001 site.pp entry [operations/puppet] - https://gerrit.wikimedia.org/r/156681 [20:26:23] Sorry for missing it in CR.
[20:26:42] (CR) Ottomata: [C: 2 V: 2] Correct webstatcollector's upstart name [operations/puppet] - https://gerrit.wikimedia.org/r/156680 (owner: QChris) [20:26:58] sorry for submitting it in the C! [20:27:04] :-P [20:28:49] hashar: If you are around, can we catch up regarding 'labswiki'? [20:28:56] (CR) RobH: [C: 2] install2001 site.pp entry [operations/puppet] - https://gerrit.wikimedia.org/r/156681 (owner: RobH) [20:30:03] PROBLEM - puppet last run on gadolinium is CRITICAL: CRITICAL: Epic puppet fail [20:30:37] (PS1) Ottomata: Fix dependency on webstats-collector.conf init file [operations/puppet] - https://gerrit.wikimedia.org/r/156683 [20:30:37] Epic! [20:30:42] hardly epic. [20:30:58] (CR) Ottomata: [C: 2 V: 2] Fix dependency on webstats-collector.conf init file [operations/puppet] - https://gerrit.wikimedia.org/r/156683 (owner: Ottomata) [20:31:25] andrewbogott: yeah I have seen some comment stating that we should rename 'labswiki' on beta cluster [20:31:37] andrewbogott: maybe deploymentwiki . Not sure how to migrate / rename the db though [20:31:45] bd808, it's strange SemanticForms says it's using 3.0-alpha in the Special:Version, but in the log we see "Thu Jul 31 16:29:45 2014 +0000 e4a2b35 (HEAD, origin/wmf/1.24wmf16) Creating new wmf/1.24wmf16 branch [Reedy]" [20:32:20] hashar, https://stackoverflow.com/questions/12190000/rename-mysql-database item 38 seems promising. [20:32:49] renoirb: branch names and version numbers of extensions seldom have any correlation. [20:32:49] 38 is the vote count! :D [20:32:58] oh, you're right. [20:33:01] well anyway, that one [20:33:03] RECOVERY - puppet last run on gadolinium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:33:39] andrewbogott: beta has a master / slave replication. Not sure how the replication system will handle that [20:33:50] Oh, hm.
[20:34:38] I'm not entirely unwilling to move the wikitech db, only I'm wary about breaking things. [20:34:45] "Note that this only handles the tables. Views and stored procedures have to be done separately." <- is it safe to assume that we don't have those? [20:34:57] akosiaris: If you have a minute to have another look at https://gerrit.wikimedia.org/r/#/c/156542/ that would be great [20:35:04] andrewbogott: also it is used on beta as the OAuth database, which might have some interesting side effects if we rename the db :D at worst we can just fix it post rename [20:35:48] andrewbogott: one sure thing: it is safer to rename the beta labswiki. At the same time I would love the wikitech db to be named wikitechwiki [20:35:49] hashar, andrewbogott: mysqldump and restore to a new db? -- https://dba.stackexchange.com/questions/8869/restore-mysql-database-with-different-name [20:35:57] bd808: yeah :] [20:38:22] can haz merge? https://gerrit.wikimedia.org/r/#/c/156596 [20:38:23] (03PS1) 10QChris: Drop write and execute on webstatscollector's upstart configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/156685 [20:38:43] hashar: I think my only real objection to doing this in wikitech is that it requires scheduling downtime, and I'm reluctant to add that task to the list of things to do during the switch-to-deployment-train window. [20:39:03] I assume (perhaps wrongly) that you can just break the db on beta for a few minutes without really messing with anyone.
[20:39:32] or we can phase out labswiki entirely from beta [20:39:37] I am not sure what it is meant for [20:39:51] apparently used for OAuth and maybe central logging [20:40:26] (03CR) 10QChris: Drop write and execute on webstatscollector's upstart configuration (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/156685 (owner: 10QChris) [20:40:51] (03PS3) 10RobH: public1-a-codfw - also use install2001 not carbon [operations/puppet] - 10https://gerrit.wikimedia.org/r/156228 (owner: 10Dzahn) [20:41:07] andrewbogott: the easiest would probably be to rename beta labswiki to deploymentwiki [20:41:39] hashar: Works for me :) If you determine that renaming it on beta is going to be a bigger hassle than renaming wikitech, though, I'm not inflexible. [20:42:20] (03CR) 10Ottomata: [C: 032 V: 032] Drop write and execute on webstatscollector's upstart configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/156685 (owner: 10QChris) [20:42:35] (03CR) 10RobH: [C: 032] "install2001 is online, so this can now merge" [operations/puppet] - 10https://gerrit.wikimedia.org/r/156228 (owner: 10Dzahn) [20:43:40] ottomata: i just merged yer stuff [20:43:46] fyi [20:44:04] (the drop write and execute patch) [20:45:26] danke, huh, you must have just beat me [20:45:29] by like a second [20:45:46] (03PS1) 10Chad: Comment updates from default config [operations/puppet] - 10https://gerrit.wikimedia.org/r/156689 [20:46:41] andrewbogott: the only thing is that I am unlikely to rename it myself :D [20:47:00] andrewbogott: busy with CI and will be absent for a couple weeks soon [20:47:05] hashar: Do you know who created it and/or cares about it?
[20:48:13] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC [20:48:26] (03PS2) 10MaxSem: Disable mobile uploads [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156523 (https://bugzilla.wikimedia.org/62598) [20:52:23] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [20:52:54] hashar: Can you suggest someone for me to bug about the rename? Do you know whose project that is? [20:54:17] andrewbogott: file a bug under Wikimedia Labs > deployment-prep (beta) [20:54:30] then gotta find some SQL guru to migrate the DB on whatever the database instances are [20:54:37] hashar: ok [20:54:40] and check that replication still works (I have no clue how to handle that) [20:54:44] maybe sean can help there [20:54:58] or just the good old mysqldump :-D [20:55:13] then, in operations/mediawiki-config.git replace labswiki -> deploymentwiki [20:55:28] have it reviewed by usual deployer / cabal of mediawiki-config.git [20:55:39] on +2, the change will be automatically deployed on beta and break stuff :D [20:55:47] then we can iterate from there and fix up beta if needed. [20:55:57] sorry it is late there :-/ [20:56:18] hashar: No worries, I'll cc: you on the bug [20:58:36] andrewbogott: thx!
[20:59:36] Oh, you get all beta bugs automatically :) Anyway, https://bugzilla.wikimedia.org/show_bug.cgi?id=70108 [21:10:18] (03PS1) 10Andrew Bogott: Rename labswiki to deploymentwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156695 [21:10:42] (03CR) 10jenkins-bot: [V: 04-1] Rename labswiki to deploymentwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156695 (owner: 10Andrew Bogott) [21:10:51] (03PS2) 10Krinkle: CommonSettings.php: Remove some dated cruft [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154442 (owner: 10PleaseStand) [21:10:57] (03PS2) 10Andrew Bogott: Rename labswiki to deploymentwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156695 [21:11:03] (03CR) 10jenkins-bot: [V: 04-1] Rename labswiki to deploymentwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156695 (owner: 10Andrew Bogott) [21:12:34] (03CR) 10Krinkle: "Maybe a bit too much for one commit / one person to review. You might get lucky though." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154442 (owner: 10PleaseStand) [21:12:55] (03PS3) 10Krinkle: Rename labswiki to deploymentwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156695 (https://bugzilla.wikimedia.org/70108) (owner: 10Andrew Bogott) [21:13:00] (03CR) 10jenkins-bot: [V: 04-1] Rename labswiki to deploymentwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156695 (https://bugzilla.wikimedia.org/70108) (owner: 10Andrew Bogott) [21:14:05] !log powercycling frozen ms-be1009 [21:16:31] (03PS4) 10Andrew Bogott: Rename labswiki to deploymentwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156695 [21:16:36] (03CR) 10jenkins-bot: [V: 04-1] Rename labswiki to deploymentwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156695 (owner: 10Andrew Bogott) [21:16:53] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [21:17:04] (03CR) 10Andrew Bogott: "I believe this will always fail tests until the actual live database is renamed." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156695 (owner: 10Andrew Bogott) [21:48:29] (03PS1) 10Aaron Schulz: Use __DIR__ instead of the CWD [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156704 [21:52:44] (03PS1) 10Aaron Schulz: Recognize ::1 as loopback too [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156706 [21:56:09] (03CR) 10Chad: [C: 031] Use __DIR__ instead of the CWD [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156704 (owner: 10Aaron Schulz) [21:58:49] (03CR) 10Tim Landscheidt: "The setup as part of the toollabs module makes it difficult to reuse in other projects, for example to solve bug #69042." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/156599 (https://bugzilla.wikimedia.org/53181) (owner: 10coren) [22:08:34] andrewbogott_afk, are you aware of SMW_rebuildData errors such as "Notice: Array to string conversion in SemanticMediaWiki/includes/storage/SQLStore/SMW_SQLStore3_Writers.php on line 383" [22:14:03] (03PS1) 10RobH: setting ip info for bast2001, install2001, & lvs2001-2006 [operations/dns] - 10https://gerrit.wikimedia.org/r/156713 [22:17:09] (03PS2) 10RobH: setting ip info for bast2001, install2001, & lvs2001-2006 [operations/dns] - 10https://gerrit.wikimedia.org/r/156713 [22:19:00] (03CR) 10RobH: [C: 032] setting ip info for bast2001, install2001, & lvs2001-2006 [operations/dns] - 10https://gerrit.wikimedia.org/r/156713 (owner: 10RobH) [22:22:12] (03CR) 10QChris: [C: 031] Tool Labs: apply /etc/iptables.conf on boot [operations/puppet] - 10https://gerrit.wikimedia.org/r/156599 (https://bugzilla.wikimedia.org/53181) (owner: 10coren) [22:37:40] how do i finish my global account migration? [22:37:47] Your account is active on 843 project sites. [22:37:48] Unconfirmed accounts with your name remain on 2 projects. [22:37:53] it's been like this since forever [22:38:03] the 2 remaining projects are other users but they dont do things [22:38:22] but that way i'm in status "In migration" until the end of time [22:38:22] good question [22:38:36] well, when your name is mutante (WMF) it'll migrate fine [22:38:37] ;] [22:38:55] it won't be [22:39:00] (or you fixing that stuff in advance to keep your accounts? [22:39:02] i need to do that. [22:39:11] i did not associate a wikimedia email with it ever [22:39:15] nice [22:39:30] mutante: when we do SUL finalization, those two users will get renamed to other things and yours will finally say that everything is ok [22:39:41] legoktm: can i do anything to speed that up or help? [22:40:08] like.. 
show how these other users are not active [22:41:45] it doesn't matter that they're inactive, they'll get renamed regardless. Keegan or Deskana|Away know the timetable of when things will happen, but I don't think it's set in stone yet [22:42:28] alright, thanks, i'll just wait for now [22:45:09] (03CR) 10Yuvipanda: "Should be ported to ferm at some point, I think." [operations/puppet] - 10https://gerrit.wikimedia.org/r/156599 (https://bugzilla.wikimedia.org/53181) (owner: 10coren) [22:47:05] mutante: I'm in the same position myself; there are three other Coren not SUL'ed, two of them are even active (one did, however, voluntarily renamed himself) [22:47:29] There's a YuviPanda on enwiki [22:47:31] which was me [22:47:33] when I got it renamed [22:47:35] years ago [22:47:40] before I realized I needed a cross wiki rename [22:47:43] (I am otherwise Yuvipanda) [22:49:07] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Wed 27 Aug 2014 06:41:47 UTC [22:53:06] (03CR) 10EBernhardson: [C: 031] "looks good to deploy after the train rolls out" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155169 (owner: 10Bsitu) [22:58:52] (03CR) 10Michał Łazowik: "Already covered by I88022a0c12cab2483e94652dcd66b69b7bd058db" [operations/dns] - 10https://gerrit.wikimedia.org/r/156597 (owner: 10Dzahn) [23:00:04] RoanKattouw, ori, MaxSem: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140827T2300). Please do the needful. 
[23:00:39] I'll do it [23:01:20] thanks [23:03:18] (03CR) 10Catrope: [C: 032] Disable mobile uploads [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156523 (https://bugzilla.wikimedia.org/62598) (owner: 10MaxSem) [23:03:26] (03Merged) 10jenkins-bot: Disable mobile uploads [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/156523 (https://bugzilla.wikimedia.org/62598) (owner: 10MaxSem) [23:05:02] mutante, shouldn't harm [23:06:52] MaxSem: to add all of them? ok [23:07:19] !log catrope Synchronized wmf-config/: Disable mobile uploads (duration: 00m 06s) [23:20:28] (03PS5) 10Dzahn: gerrit - use apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/153849 [23:22:29] (03CR) 10Dzahn: [C: 032] gerrit - use apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/153849 (owner: 10Dzahn) [23:23:28] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:25:59] ^d: ehm.. do you remember adding File[/etc/sysctl.d/70-high-http-performance.conf] or something? [23:26:18] <^d> Where? On gerrit? [23:26:21] yea [23:26:22] <^d> Not that I can recall, no. [23:26:26] i dont see it in puppet either [23:26:49] !log catrope Synchronized php-1.24wmf18/extensions/UploadWizard/: (no message) (duration: 00m 05s) [23:27:05] Thanks catrope [23:32:09] (03CR) 10Dzahn: "ping, why is role::download::mediawiki unused?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/153817 (owner: 10Dzahn) [23:38:15] (03CR) 10Dzahn: "wtf, is the config class from the role not used in the module?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/153986 (owner: 10Dzahn) [23:41:07] PROBLEM - Puppet freshness on install2001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 00:00:00 UTC [23:41:29] ^ hear hear, codfw monitoring :) [23:42:08] well, likely not getting the snmptraps [23:43:28] it's not =[ [23:43:44] not showing in ganglia either [23:43:52] (03PS5) 10Dzahn: puppetmaster - use ssl_ciphersuite [operations/puppet] - 10https://gerrit.wikimedia.org/r/153986 [23:43:53] but still, it's something =] [23:44:04] yea, the host made it into icinga [23:44:14] took neon a while, but it's doing puppet runs [23:44:32] so shows in icinga due to puppetstoreddb i suppose [23:46:07] PROBLEM - puppet last run on analytics1020 is CRITICAL: CRITICAL: Puppet last ran 19387 seconds ago, expected 14400 [23:47:07] RECOVERY - puppet last run on analytics1020 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [23:54:19] (03PS6) 10Dzahn: puppetmaster - use ssl_ciphersuite [operations/puppet] - 10https://gerrit.wikimedia.org/r/153986 [23:56:36] (03PS7) 10Dzahn: puppetmaster - use ssl_ciphersuite [operations/puppet] - 10https://gerrit.wikimedia.org/r/153986