[00:03:36] (03PS2) 10BBlack: update zeroconfig URL for netmapper [operations/puppet] - 10https://gerrit.wikimedia.org/r/86205 [00:09:49] Ryan_Lane: does the salt minion change (above) fix things? [00:10:57] ori-l: fix things that we just broke? :) [00:11:00] hopefully [00:11:35] hm. shit [00:11:49] actually, the puppet change won't fix things [00:12:49] because there's no ldap entry for that. [00:12:59] well, salt doesn't depend on that, so I can salt a fix to them :D [00:47:38] (03CR) 10Yurik: [C: 031] update zeroconfig URL for netmapper [operations/puppet] - 10https://gerrit.wikimedia.org/r/86205 (owner: 10BBlack) [01:07:53] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [01:07:53] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [01:08:43] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [01:08:43] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [01:09:46] is there a log to view warnings? fatal log doesn't seem to have them [01:10:33] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 01:10:32 UTC 2013 [01:10:43] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [01:10:43] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 01:10:42 UTC 2013 [01:10:53] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [01:16:06] yurik, the one used by fatalmonitor has plenty of warnings [01:16:34] MaxSem, fatalmonitor looks at apache.log which has no stacktraces [01:16:58] and you waon't have them unless you install xdebug [01:17:08] bleh [01:17:27] i've been looking at the fatamonitor - shows lots of strcmp issues [01:17:52] all of the issues right now are array instead of string is passed [01:17:58] no idea who does it [01:18:03] don't want it to be my fault [01:19:25] I wonder if there's a possibility to always have one appserver with xdebug ready and direct traffic to it when someone needs stacktraces/other debugging [01:32:44] Those strcmp warnings are definitely new in 1.19 [01:32:46] uh [01:32:47] 1.22wmf19 [01:33:09] Must find the cause before deploying further, otherwise it's just going to get really noisy [01:34:03] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 01:34:00 UTC 2013 [01:34:53] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [01:38:03] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 01:37:53 UTC 2013 [01:38:43] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [01:40:53] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 01:40:49 UTC 2013 [01:41:03] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 01:40:59 UTC 2013 [01:41:43] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [01:41:53] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [02:03:23] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 02:03:22 UTC 2013 [02:03:43] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [02:03:53] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 02:03:52 UTC 2013 [02:04:53] PROBLEM - 
Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [02:10:43] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 02:10:41 UTC 2013 [02:10:53] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 02:10:46 UTC 2013 [02:11:43] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [02:11:53] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [02:14:34] !log LocalisationUpdate completed (1.22wmf18) at Fri Sep 27 02:14:34 UTC 2013 [02:14:53] Logged the message, Master [02:33:03] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 02:32:59 UTC 2013 [02:33:20] !log LocalisationUpdate completed (1.22wmf19) at Fri Sep 27 02:33:19 UTC 2013 [02:33:34] Logged the message, Master [02:33:43] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [02:34:13] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 02:34:09 UTC 2013 [02:34:53] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [02:41:03] (03CR) 10BBlack: [C: 032] update zeroconfig URL for netmapper [operations/puppet] - 10https://gerrit.wikimedia.org/r/86205 (owner: 10BBlack) [02:55:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Sep 27 02:55:41 UTC 2013 [02:55:53] Logged the message, Master [03:01:54] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 10 hours [03:46:43] (03PS1) 10Yuvipanda: Explicitly reference labsvagrant class from the module [operations/puppet] - 10https://gerrit.wikimedia.org/r/86213 [03:46:44] Ryan_Lane: merge? ^ [03:47:18] waiting for jenkins [03:47:42] (03CR) 10Ryan Lane: [C: 032] Explicitly reference labsvagrant class from the module [operations/puppet] - 10https://gerrit.wikimedia.org/r/86213 (owner: 10Yuvipanda) [03:47:58] YuviPanda: ^^ [03:48:05] ty, Ryan_Lane [03:48:14] ok. 
gone for the night [04:13:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [04:13:30] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [04:13:50] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [04:13:50] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [04:26:50] PROBLEM - Puppet freshness on mw1072 is CRITICAL: No successful Puppet run in the last 10 hours [04:33:31] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 04:33:29 UTC 2013 [04:33:50] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [04:34:31] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 04:34:29 UTC 2013 [04:35:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [04:41:40] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 04:41:37 UTC 2013 [04:41:50] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [04:42:00] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 04:41:58 UTC 2013 [04:42:30] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [05:04:30] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 05:04:22 UTC 2013 [05:04:50] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [05:05:30] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 05:05:23 UTC 2013 [05:06:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [05:13:10] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 05:13:07 UTC 2013 [05:13:10] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 05:13:07 UTC 2013 [05:13:30] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [05:13:50] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [05:22:40] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [05:23:40] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [05:35:00] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 05:34:50 UTC 2013 [05:35:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [05:35:40] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:37:40] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [05:41:30] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 05:41:23 UTC 2013 [05:41:31] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 05:41:28 UTC 2013 [05:41:50] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [05:41:50] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [05:42:00] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 05:41:59 UTC 2013 [05:42:30] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [06:01:40] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
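
A note on the strcmp warnings discussed above (01:16-01:33): the log only says an array is being passed where strcmp() expects a string and never identifies the caller, so the following is a minimal PHP 5-era reproduction of that class of warning, not the actual MediaWiki code path; the variable names and values are illustrative only.

    <?php
    // Hypothetical reproduction of the warning class fatalmonitor was showing.
    $title = array( 'Foo' );            // a value that unexpectedly became an array
    $result = strcmp( $title, 'Foo' );  // PHP 5: "Warning: strcmp() expects parameter 1
                                        //         to be string, array given"; returns NULL

    // A defensive check while hunting for the real caller:
    if ( !is_string( $title ) ) {
        trigger_error( 'strcmp() called with ' . gettype( $title ), E_USER_WARNING );
    }
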
[06:03:40] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:04:00] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 06:03:50 UTC 2013 [06:04:40] RECOVERY - Disk space on snapshot3 is OK: DISK OK [06:04:40] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [06:04:50] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [06:05:00] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:06:50] RECOVERY - DPKG on snapshot3 is OK: All packages OK [06:09:24] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 06:09:17 UTC 2013 [06:09:24] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [06:11:24] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 06:11:17 UTC 2013 [06:11:24] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [06:12:14] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 06:12:13 UTC 2013 [06:12:44] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [06:32:34] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:34:04] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 06:34:02 UTC 2013 [06:34:44] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [06:35:04] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 06:35:03 UTC 2013 [06:35:24] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [06:37:34] RECOVERY - RAID on snapshot1002 is OK: OK: no RAID installed [06:40:34] PROBLEM - RAID on snapshot1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:14] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 06:41:07 UTC 2013 [06:41:44] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [06:42:04] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 06:42:03 UTC 2013 [06:42:24] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [06:42:34] RECOVERY - RAID on snapshot1002 is OK: OK: no RAID installed [07:03:44] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 07:03:36 UTC 2013 [07:03:44] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [07:04:34] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 07:04:32 UTC 2013 [07:05:24] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:44] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:44] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [07:11:14] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 07:11:07 UTC 2013 [07:11:44] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [07:12:34] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 07:12:28 UTC 2013 [07:13:24] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [07:20:04] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:20:44] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:20:54] RECOVERY - DPKG on snapshot3 is OK: All packages OK [07:21:44] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [07:25:44] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:44] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [07:30:17] hello [07:31:04] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:31:55] RECOVERY - DPKG on snapshot3 is OK: All packages OK [07:33:04] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 07:32:59 UTC 2013 [07:33:44] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [07:33:54] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 07:33:50 UTC 2013 [07:34:24] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:46] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:38:46] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:40:36] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [07:40:36] RECOVERY - DPKG on snapshot3 is OK: All packages OK [07:49:35] (03PS3) 10JanZerebecki: replace SSLCACertificatePath with SSLCertificateChainFile in Apache templates [operations/puppet] - 10https://gerrit.wikimedia.org/r/84901 (owner: 10Dzahn) [07:50:46] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:53:36] RECOVERY - DPKG on snapshot3 is OK: All packages OK [07:53:42] (03CR) 10JanZerebecki: [C: 031] "Using CACertificatePath may be a performance problem because apache will send a list of all certificates in that path as acceptable for cl" [operations/puppet] - 10https://gerrit.wikimedia.org/r/84901 (owner: 10Dzahn) [07:56:46] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:57:46] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [07:58:46] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:00:46] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:00:46] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:01:36] RECOVERY - Disk space on snapshot3 is OK: DISK OK [08:01:36] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [08:01:36] RECOVERY - DPKG on snapshot3 is OK: All packages OK [08:05:46] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:46] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:46] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
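
On the change reviewed above (gerrit 84901), which swaps SSLCACertificatePath for SSLCertificateChainFile: JanZerebecki's comment notes that the directory form also advertises every CA under that path as acceptable for client certificates. A minimal sketch of the two Apache directives follows; the file paths are placeholders, not the actual Wikimedia vhost configuration.

    <VirtualHost *:443>
        SSLEngine on
        SSLCertificateFile      /etc/ssl/certs/star.wikimedia.org.pem     # placeholder path
        SSLCertificateKeyFile   /etc/ssl/private/star.wikimedia.org.key   # placeholder path

        # Before: per the review comment, every certificate under this path is
        # also announced to clients as an acceptable client-certificate CA.
        #SSLCACertificatePath   /etc/ssl/certs/

        # After: only the intermediate chain for the server certificate is sent.
        SSLCertificateChainFile /etc/ssl/certs/intermediate-chain.pem     # placeholder path
    </VirtualHost>
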
[08:06:05] PROBLEM - SSH on snapshot3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:06:15] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [08:06:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [08:06:35] RECOVERY - SSH on snapshot3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:06:45] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [08:07:05] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [08:09:35] RECOVERY - Disk space on snapshot3 is OK: DISK OK [08:09:45] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [08:09:45] RECOVERY - DPKG on snapshot3 is OK: All packages OK [08:10:45] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 08:10:35 UTC 2013 [08:10:55] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 08:10:46 UTC 2013 [08:11:05] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [08:11:15] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [08:12:45] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:12:45] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:35] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [08:13:35] RECOVERY - DPKG on snapshot3 is OK: All packages OK [08:29:20] Ryan_Lane: european hours ? [08:34:05] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 08:33:54 UTC 2013 [08:34:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [08:36:15] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 08:36:06 UTC 2013 [08:36:45] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [08:56:42] paravoid, around? [08:57:18] i deployed the redirection just like you wanted :) [08:58:44] mark, the backend now returns "Enable-ESI: 1" header when it wants the result to be processed via ESI (also per paravoid suggestion) - do we now need to vary based on that header too? [08:59:05] also, please enable it in VCL [09:06:11] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [09:06:21] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [09:06:51] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [09:07:01] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [09:10:51] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 09:10:42 UTC 2013 [09:10:51] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 09:10:47 UTC 2013 [09:11:01] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [09:11:11] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [09:17:42] finally... root@lvs4003.... 
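
On the Enable-ESI exchange above (08:56-08:59): a minimal sketch, in Varnish 3 VCL, of how a backend-supplied "Enable-ESI: 1" response header could switch on ESI processing. This only illustrates the idea; the actual change (gerrit 86258, merged later in this log) scopes it to the Testing carrier range and is not reproduced here.

    sub vcl_fetch {
        if (beresp.http.Enable-ESI == "1") {
            # Parse and execute <esi:include> / <esi:remove> in this response body.
            set beresp.do_esi = true;
        }
    }
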
[09:27:38] (03PS1) 10Akosiaris: Allow from ulsfo.wmnet for puppetmasters [operations/puppet] - 10https://gerrit.wikimedia.org/r/86222 [09:29:45] (03PS1) 10ArielGlenn: wikiretriever: retrieve recent changes, pass extra params, bugfixes [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/86223 [09:29:48] (03CR) 10Akosiaris: [C: 032] Allow from ulsfo.wmnet for puppetmasters [operations/puppet] - 10https://gerrit.wikimedia.org/r/86222 (owner: 10Akosiaris) [09:35:01] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 09:34:52 UTC 2013 [09:35:31] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 09:35:28 UTC 2013 [09:35:51] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [09:36:21] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [09:41:14] (03CR) 10ArielGlenn: [C: 032] wikiretriever: retrieve recent changes, pass extra params, bugfixes [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/86223 (owner: 10ArielGlenn) [09:42:12] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 09:42:06 UTC 2013 [09:42:31] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 09:42:22 UTC 2013 [09:43:01] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [09:43:11] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [10:03:31] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 10:03:26 UTC 2013 [10:03:41] RECOVERY - DPKG on stafford is OK: All packages OK [10:03:51] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [10:04:01] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 10:03:52 UTC 2013 [10:04:21] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [10:08:41] PROBLEM - DPKG on stafford is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:11:22] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 10:11:17 UTC 2013 [10:11:32] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 10:11:22 UTC 2013 [10:11:52] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [10:12:02] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [10:12:52] (03PS1) 10Akosiaris: Adding puppet CNAME for ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86232 [10:13:23] (03CR) 10Akosiaris: [C: 032] Adding puppet CNAME for ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86232 (owner: 10Akosiaris) [10:33:52] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 10:33:42 UTC 2013 [10:34:02] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 10:33:52 UTC 2013 [10:34:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [10:34:42] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [10:42:25] PROBLEM - Disk space on lvs4002 is CRITICAL: Connection refused by host [10:42:25] PROBLEM - RAID on lvs4004 is CRITICAL: Connection refused by host [10:42:25] PROBLEM - DPKG on lvs4001 is CRITICAL: Connection refused by host [10:42:35] PROBLEM - Disk space on lvs4001 is CRITICAL: Connection refused by host [10:42:35] PROBLEM - RAID on lvs4003 is 
CRITICAL: Connection refused by host [10:42:45] PROBLEM - RAID on lvs4002 is CRITICAL: Connection refused by host [10:42:55] PROBLEM - DPKG on lvs4004 is CRITICAL: Connection refused by host [10:42:55] PROBLEM - RAID on lvs4001 is CRITICAL: Connection refused by host [10:43:05] PROBLEM - DPKG on lvs4003 is CRITICAL: Connection refused by host [10:43:05] PROBLEM - Disk space on lvs4004 is CRITICAL: Connection refused by host [10:43:15] PROBLEM - Disk space on lvs4003 is CRITICAL: Connection refused by host [10:43:15] PROBLEM - DPKG on lvs4002 is CRITICAL: Connection refused by host [10:54:25] PROBLEM - NTP on lvs4003 is CRITICAL: NTP CRITICAL: Offset unknown [10:54:36] PROBLEM - NTP on lvs4002 is CRITICAL: NTP CRITICAL: Offset unknown [10:54:45] PROBLEM - NTP on lvs4001 is CRITICAL: NTP CRITICAL: Offset unknown [10:55:15] PROBLEM - NTP on lvs4004 is CRITICAL: NTP CRITICAL: Offset unknown [10:57:55] RECOVERY - DPKG on lvs4004 is OK: All packages OK [10:58:05] RECOVERY - Disk space on lvs4004 is OK: DISK OK [10:58:05] RECOVERY - DPKG on lvs4003 is OK: All packages OK [10:58:15] RECOVERY - Disk space on lvs4003 is OK: DISK OK [10:58:15] RECOVERY - DPKG on lvs4002 is OK: All packages OK [10:58:18] (03PS1) 10Akosiaris: Add ulsfo to enable_proxy for apt [operations/puppet] - 10https://gerrit.wikimedia.org/r/86236 [10:58:25] RECOVERY - Disk space on lvs4002 is OK: DISK OK [10:58:25] RECOVERY - RAID on lvs4004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:58:35] RECOVERY - RAID on lvs4003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:58:45] RECOVERY - RAID on lvs4002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:59:24] (03CR) 10Akosiaris: [C: 032] Add ulsfo to enable_proxy for apt [operations/puppet] - 10https://gerrit.wikimedia.org/r/86236 (owner: 10Akosiaris) [10:59:25] RECOVERY - DPKG on lvs4001 is OK: All packages OK [10:59:35] RECOVERY - Disk space on lvs4001 is OK: DISK OK [10:59:55] RECOVERY - RAID on lvs4001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [11:13:29] (03PS1) 10Faidon Liambotis: Switch eqiad's ms-fe & ms-be to Swift [operations/puppet] - 10https://gerrit.wikimedia.org/r/86238 [11:13:39] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [11:13:49] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [11:13:49] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [11:13:49] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [11:14:09] RECOVERY - NTP on lvs4004 is OK: NTP OK: Offset -0.01887404919 secs [11:14:19] RECOVERY - NTP on lvs4003 is OK: NTP OK: Offset -0.02353930473 secs [11:14:39] RECOVERY - NTP on lvs4002 is OK: NTP OK: Offset -0.02709567547 secs [11:15:19] RECOVERY - NTP on lvs4001 is OK: NTP OK: Offset -0.02019965649 secs [11:24:13] (03PS1) 10Faidon Liambotis: Remove role::ceph::*, unused now [operations/puppet] - 10https://gerrit.wikimedia.org/r/86241 [11:24:34] (03CR) 10Faidon Liambotis: [C: 032] Switch eqiad's ms-fe & ms-be to Swift [operations/puppet] - 10https://gerrit.wikimedia.org/r/86238 (owner: 10Faidon Liambotis) [11:24:44] (03CR) 10Faidon Liambotis: [C: 032] Remove role::ceph::*, unused now [operations/puppet] - 10https://gerrit.wikimedia.org/r/86241 (owner: 10Faidon Liambotis) [11:28:59] Ceph is dead, long live Swift? [11:29:42] yeah... 
[11:29:45] * paravoid sad [11:32:59] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 11:32:57 UTC 2013 [11:33:39] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [11:33:59] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 11:33:53 UTC 2013 [11:34:49] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [12:11:07] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [12:11:17] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [12:11:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:11:27] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [12:15:32] paravoid, will there be a post/email/wiki page describing the Ceph fail? [12:18:02] I didn't think anyone else was interested [12:18:19] but if you are, I guess I can, yes [12:26:00] mark: rt5848...the site.pp entries have been removed...that is where bonding would be set...correct? [12:33:17] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 12:33:09 UTC 2013 [12:33:17] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:34:16] (03PS2) 10Dzahn: retab misc/planet.pp from tabs to 4 spaces, do the cleanup before next attempt to turn into module [operations/puppet] - 10https://gerrit.wikimedia.org/r/86126 [12:34:40] (03CR) 10Dzahn: [C: 032] retab misc/planet.pp from tabs to 4 spaces, do the cleanup before next attempt to turn into module [operations/puppet] - 10https://gerrit.wikimedia.org/r/86126 (owner: 10Dzahn) [12:35:17] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 12:35:10 UTC 2013 [12:35:28] (03PS2) 10Dzahn: planet.pp - wrong quoting, aligned arrows, ensure first and other puppet lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/86130 [12:36:17] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [12:37:25] (03CR) 10Dzahn: [C: 032] planet.pp - wrong quoting, aligned arrows, ensure first and other puppet lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/86130 (owner: 10Dzahn) [12:38:51] (03PS1) 10Krinkle: gerrit: Don't include '.' in the match for adjecent separator [operations/puppet] - 10https://gerrit.wikimedia.org/r/86250 [12:40:09] (03PS2) 10Krinkle: gerrit: Don't include '.' 
in the match for adjecent separator [operations/puppet] - 10https://gerrit.wikimedia.org/r/86250 [12:40:40] (03PS2) 10Dzahn: bugzilla.pp - fix unquoted resource titles and file modes (puppet-lint) [operations/puppet] - 10https://gerrit.wikimedia.org/r/86124 [12:42:37] (03CR) 10Dzahn: [C: 032] bugzilla.pp - fix unquoted resource titles and file modes (puppet-lint) [operations/puppet] - 10https://gerrit.wikimedia.org/r/86124 (owner: 10Dzahn) [12:51:40] (03CR) 10Dzahn: [C: 031] replace SSLCACertificatePath with SSLCertificateChainFile in Apache templates [operations/puppet] - 10https://gerrit.wikimedia.org/r/84901 (owner: 10Dzahn) [12:55:06] PROBLEM - Host ms-fe1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:56:06] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN 1 mons down, quorum 1,2 ms-fe1003,ms-fe1004 [12:56:06] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN 1 mons down, quorum 1,2 ms-fe1003,ms-fe1004 [12:59:46] PROBLEM - Host ms-fe1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:06] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:16] RECOVERY - Host ms-fe1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [13:01:16] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:02:26] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Connection refused by host [13:02:26] PROBLEM - SSH on ms-fe1001 is CRITICAL: Connection refused [13:02:36] PROBLEM - DPKG on ms-fe1001 is CRITICAL: Connection refused by host [13:02:36] PROBLEM - HTTP Apache on ms-fe1001 is CRITICAL: Connection refused [13:02:46] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 10 hours [13:02:48] !log dismantling & repurposing ceph cluster [13:02:56] PROBLEM - HTTP radosgw on ms-fe1001 is CRITICAL: Connection refused [13:03:01] Logged the message, Master [13:03:06] PROBLEM - Disk space on ms-fe1001 is CRITICAL: Connection refused by host [13:03:06] PROBLEM - RAID on ms-fe1001 is CRITICAL: Connection refused by host [13:03:19] wait, why is ms-fe @ pmtpa getting an load increase [13:03:48] (03CR) 10Dzahn: [C: 04-1] "so we have at least 3 different users running maint. crons, "apache", "mwdeploy" and "l10nupdate" and the latter have comments # which use" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83574 (owner: 10Reedy) [13:04:56] RECOVERY - Host ms-fe1002 is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [13:05:06] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:26] wait, what [13:06:05] friendly autocompletion suggestion: .. 
the fuck [13:06:26] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [13:06:56] PROBLEM - HTTP radosgw on ms-fe1002 is CRITICAL: Connection refused [13:07:06] PROBLEM - RAID on ms-fe1002 is CRITICAL: Connection refused by host [13:07:24] PROBLEM - DPKG on ms-be1001 is CRITICAL: Connection refused by host [13:07:34] PROBLEM - DPKG on ms-fe1002 is CRITICAL: Connection refused by host [13:07:34] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [13:07:44] PROBLEM - Disk space on ms-fe1002 is CRITICAL: Connection refused by host [13:07:54] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [13:07:54] PROBLEM - SSH on ms-fe1002 is CRITICAL: Connection refused [13:07:54] PROBLEM - HTTP Apache on ms-fe1002 is CRITICAL: Connection refused [13:08:04] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [13:08:14] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:08:34] PROBLEM - Disk space on ms-be1001 is CRITICAL: Connection refused by host [13:08:54] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [13:09:04] PROBLEM - RAID on ms-be1001 is CRITICAL: Connection refused by host [13:09:54] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.059 second response time [13:09:54] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.059 second response time [13:10:44] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 13:10:42 UTC 2013 [13:10:46] is this a joke [13:11:14] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:11:34] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [13:12:00] holy crap [13:12:04] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.062 second response time [13:12:13] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&g=cpu_report&h=ms-be10.pmtpa.wmnet&c=Swift+pmtpa [13:12:34] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 13:12:27 UTC 2013 [13:12:54] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [13:14:13] 13:14:08 up 309 days, 4:11, 1 user, load average: 118.86, 47.69, 30.51 [13:14:17] that's the spirit [13:14:41] paravoid: ms-be1001 is down since then, is that also expected or just the fe- hosts [13:14:44] PROBLEM - NTP on ms-fe1001 is CRITICAL: NTP CRITICAL: No response from NTP server [13:14:53] mutante: it's everything [13:15:00] I'm going to do all of ms-fe10xx / ms-be10xx [13:15:07] but in the meantime, the pmtpa cluster is acting up [13:15:10] at exactly the same time [13:15:17] without me touching it [13:15:45] nod.. ah .. hm [13:15:54] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:17:14] PROBLEM - Host ms-fe1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:24] RECOVERY - SSH on ms-fe1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:18:34] RECOVERY - Host ms-fe1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:19:04] PROBLEM - NTP on ms-fe1002 is CRITICAL: NTP CRITICAL: No response from NTP server [13:20:02] sounds like fail-over somehow because of the timing? 
paravoid, i have no idea, but it went down again on the ganglia graph [13:20:14] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:22] yeah found part of the cause and fixing [13:20:26] cool [13:20:54] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: No response from NTP server [13:21:04] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.061 second response time [13:21:54] RECOVERY - SSH on ms-fe1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:34:34] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 13:34:31 UTC 2013 [13:35:04] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [13:35:44] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 13:35:42 UTC 2013 [13:36:14] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:42:04] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 13:41:54 UTC 2013 [13:42:44] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 13:42:34 UTC 2013 [13:42:54] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [13:43:30] (03PS1) 10Akosiaris: Remove jfsutils from base::standard-packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/86252 [13:43:34] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [13:56:00] (03CR) 10Faidon Liambotis: [C: 031] "preseed.cfg says that apache.cfg used to use JFS -but that's unused and replaced by mw.cfg now, also needs a cleanup." [operations/puppet] - 10https://gerrit.wikimedia.org/r/86252 (owner: 10Akosiaris) [13:58:12] that was when apaches contained external storage [13:59:45] heh [14:01:03] YOoooo akosiaris! [14:01:55] hey ottomata. [14:01:59] wassup ? [14:02:03] ohhh just checking in! [14:02:10] cps almost done? you've just done base installs, right? [14:02:23] no puppet? [14:02:24] yes. [14:02:29] and puppet [14:02:50] hm, looking in site.pp [14:03:04] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 14:02:57 UTC 2013 [14:03:07] you should be able to log in at lvs and most cp4xxx [14:03:28] i just run puppet... no extra configuration [14:03:45] oh ok, not even a site entry, got it [14:03:45] ok [14:04:00] i suppose these will be done on the next step [14:04:04] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [14:04:18] when varnish and pybal and all that will be installed/configured etc [14:04:34] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 14:04:28 UTC 2013 [14:05:09] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [14:05:30] well, usually I lke to make site.pp entries without the functional stuff [14:05:32] So I suppose we are almost done on that front [14:05:32] and apply that first [14:05:35] basically just base module [14:05:43] that sets up all the usual monitoring and account stuff [14:05:52] it seems like that happens anyway :P [14:05:56] really? [14:06:26] seems like it... they are in icinga [14:06:28] note that even a base puppet install may perhaps fail as puppet doesn't know about ulsfo at all yet ;) [14:06:39] listed as pmtpa probably? [14:06:46] hm [14:06:53] maybe... [14:07:01] but indeed puppet is running [14:07:04] akosiaris: you login with new install key though, right? 
[14:07:06] not your ssh key?
[14:07:18] now both
[14:07:35] when i ran the first puppet run by hand the new_install key
[14:08:09] node default {
[14:08:09] include standard
[14:08:10] }
[14:08:13] at the end of site.pp
[14:08:14] ah interesting
[14:08:18] i suppose that explains it ?
[14:09:26] [12598.184329] CPU15: Package power limit notification (total events = 140)
[14:09:30] [12598.294217] CPU1: Package power limit notification (total events = 136)
[14:09:33] meh
[14:09:42] bios setting ?
[14:09:51] that weird C state thing ?
[14:10:14] ahhh got it
[14:10:32] my ssh proxy wasn't using root@, and I don't have an otto account on bast4001
[14:10:34] k cool
[14:10:49] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 14:10:45 UTC 2013
[14:10:59] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 14:10:56 UTC 2013
[14:11:08] ok, akosiaris, I'm going to go ahead ane make site entires for these, and then we can fill them in with whatever
[14:11:23] sure, go ahead
[14:11:29] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours
[14:11:45] you should also find out why you don't have an account in bast4001
[14:11:49] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours
[14:12:23] got to go.. c ya later
[14:12:28] hm, yeah maybe I'm not in default roots
[14:12:55] mark: should I have an account on all new nodes by default?
[14:13:39] no
[14:13:53] no caching or lvs servers have local accounts
[14:13:56] just root
[14:16:20] right but bast4001?
[14:17:11] also, i'm looking at the existing lvs* and cp* nodes in site.pp
[14:17:31] mark, should we let you configure those :p or should I copy lvs100* or maybe amslvs*?
[14:17:54] not sure if I should jsut ad lvs[14]00[1-6] to the regex
[14:18:09] and add entries from lvs_service_ips in your your lvs_balancer_ips config
[14:18:41] you should certainly not copy anything
[14:18:51] copy + edit
[14:18:57] if you don't understand the current manifests you should probably not do anything ;)
[14:19:17] aye yeah that's what I was wondering, wasn't sure if you wanted us to try and get you to review and teach us, or just do it
[14:19:21] i'm fine with either
[14:19:57] hmm
[14:20:02] perhaps, go over all the lvs manifests
[14:20:05] try to understand what it all does
[14:20:16] give me the summary, and we'll discuss?
[14:20:18] k
[14:20:29] i can of course tell it but I think it works better if you dive in yourself
[14:20:39] (and if I do it myself it doesn't work at all :)
[14:20:42] aye
[14:20:59] looks a little complex but I'll at least get the gist and you can fix up what I don't get
[14:22:16] there's some documentation on lvs and pybal on wikitech, but not a lot on the puppet bits of it I think...
[14:23:07] k thanks, reading...
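
The "node default { include standard }" snippet akosiaris quotes at 14:08 is why freshly installed hosts already show up with the usual base monitoring before any role is assigned. A sketch of how that tail of manifests/site.pp relates to a later explicit entry; the ulsfo hostname regex and domain are illustrative, not taken from the repository.

    # Tail of manifests/site.pp, as quoted above: hosts with no explicit node
    # definition still match this and get the standard base classes (the usual
    # accounts and monitoring discussed at 14:05-14:08).
    node default {
        include standard
    }

    # A later, explicit entry simply takes over from the default -- the hostname
    # pattern shown here is illustrative only:
    node /^cp40(0[1-9]|1[0-9]|20)\.ulsfo\.wmnet$/ {
        include standard
        # role classes (e.g. role::cache::bits) added once the cluster is configured
    }
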
[14:24:50] it's not so difficult, having it all in puppet helps a lot [14:25:01] and a working setup to run commands :) [14:25:12] I can help you too, leslie perhaps too [14:27:09] PROBLEM - Puppet freshness on mw1072 is CRITICAL: No successful Puppet run in the last 10 hours [14:32:42] (03PS1) 10Dzahn: add nrpe to node zirconium so we can monitor etherpad and other processes [operations/puppet] - 10https://gerrit.wikimedia.org/r/86253 [14:33:49] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 14:33:41 UTC 2013 [14:34:39] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [14:35:19] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 14:35:12 UTC 2013 [14:36:09] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [14:39:32] paravoid, mark, i'm reading into the lvs stuff, should we setup the varnishes first anyway? [14:39:42] (03CR) 10Dzahn: [C: 032] add nrpe to node zirconium so we can monitor etherpad and other processes [operations/puppet] - 10https://gerrit.wikimedia.org/r/86253 (owner: 10Dzahn) [14:39:49] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:49] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:49] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:49] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:50] do you guys have specific plans for them, what they will serve, etc? [14:40:14] similar to esams really [14:40:36] upload, mobile, bits? [14:41:11] yes [14:41:23] text [14:41:23] :) [14:41:33] eventually [14:42:00] hmm, yeah looks like also 20 nodes in ams [14:42:08] do you want the same layout in ulsfo? [14:42:18] although, are cp300[12] doing anythign? 
[14:42:22] (i'm just looking at puppet) [14:43:19] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 14:43:11 UTC 2013 [14:43:29] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [14:43:39] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 14:43:31 UTC 2013 [14:43:49] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [14:44:59] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:44:59] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:44:59] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:44:59] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:47:09] PROBLEM - SSH on ms-be1004 is CRITICAL: Connection refused [14:47:09] PROBLEM - SSH on ms-be1002 is CRITICAL: Connection refused [14:47:09] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused [14:47:09] PROBLEM - RAID on ms-be1005 is CRITICAL: Connection refused by host [14:47:19] PROBLEM - DPKG on ms-be1005 is CRITICAL: Connection refused by host [14:47:19] PROBLEM - Disk space on ms-be1002 is CRITICAL: Connection refused by host [14:47:29] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused [14:47:29] PROBLEM - Disk space on ms-be1004 is CRITICAL: Connection refused by host [14:47:39] PROBLEM - RAID on ms-be1003 is CRITICAL: Connection refused by host [14:47:39] PROBLEM - RAID on ms-be1002 is CRITICAL: Connection refused by host [14:47:49] PROBLEM - DPKG on ms-be1004 is CRITICAL: Connection refused by host [14:47:49] PROBLEM - Disk space on ms-be1005 is CRITICAL: Connection refused by host [14:47:49] PROBLEM - DPKG on ms-be1003 is CRITICAL: Connection refused by host [14:47:59] PROBLEM - RAID on ms-be1004 is CRITICAL: Connection refused by host [14:47:59] PROBLEM - DPKG on ms-be1002 is CRITICAL: Connection refused by host [14:47:59] PROBLEM - Disk space on ms-be1003 is CRITICAL: Connection refused by host [14:48:24] * paravoid cries [14:48:59] haha, awww [14:50:09] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:09] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:09] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:09] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:49] PROBLEM - HTTP radosgw on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:59] PROBLEM - HTTP radosgw on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:14] ottomata: what do you mean by 'layout'? [14:52:53] how many do we have again? [14:52:55] i mean, which hosts servce what content [14:53:00] 20 in ulsof too [14:53:07] cp4001-cp4020 [14:53:11] that's... not enough [14:53:25] 8 upload, 8 text, 4 bits, 4 mobile [14:53:29] osm [14:53:39] but anyway [14:53:41] start with bits I'd say [14:53:54] i believe they have identical configuration [14:53:59] if they're also in the same rack, it doesn't matter [14:54:10] if they're split across racks, might make sense to split them according to function too [14:54:18] ok 4 bits [14:54:29] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:54:31] like, balance them across racks, or keep the same functions in the same racks? [14:54:39] balance across [14:54:44] I don't know the rack setup, cmjohnson1? 
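
Relating to the NRPE changes above (nrpe on zirconium at 14:32, the etherpad-lite process check at 15:10, RT #5790): the exact check command appears in the code-review comment later in this log (15:23:20). A sketch of what the NRPE side of that could look like; the command name and file path are assumptions, and only the check_procs invocation itself is taken from that review comment.

    # e.g. /etc/nagios/nrpe.d/check_etherpad_lite.cfg on zirconium
    # (path and command name are assumptions):
    command[check_etherpad_lite]=/usr/lib/nagios/plugins/check_procs -c 1:2 --ereg-argument-array='^/bin/sh /usr/share/etherpad-lite/bin/safeRun.sh'
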
[14:54:46] you know, switch failure, power failure [14:54:48] ja [14:54:53] racktables [14:54:55] you don't have access to racktables? [14:55:09] RECOVERY - SSH on ms-be1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:55:09] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:55:09] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:55:13] oo ja i do [14:55:13] looking [14:55:19] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:55:19] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:55:19] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:55:19] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:55:19] have only looked at it like once before [14:55:57] ah cool, 2 racks [14:56:07] you should be able to figure it out from the IPs [14:56:15] we're doing the separate L3 per rack again, right? [14:56:20] no [14:56:22] never did [14:56:26] separate L3 per row [14:57:04] ok, so 4001,4002 are in 3.02, and 4003,4004 in 3.03 [14:57:10] could do bits on each of those? [14:57:18] er, yes, per row I meant [14:57:19] PROBLEM - DPKG on ms-be1006 is CRITICAL: Connection refused by host [14:57:26] but in this case it's the same l3 domain? [14:57:29] PROBLEM - Disk space on ms-be1007 is CRITICAL: Connection refused by host [14:57:29] PROBLEM - Disk space on ms-be1009 is CRITICAL: Connection refused by host [14:57:29] PROBLEM - SSH on ms-be1007 is CRITICAL: Connection refused [14:57:29] PROBLEM - Disk space on ms-be1006 is CRITICAL: Connection refused by host [14:57:36] since it's "same row"? [14:57:39] PROBLEM - RAID on ms-be1007 is CRITICAL: Connection refused by host [14:57:39] PROBLEM - SSH on ms-be1009 is CRITICAL: Connection refused [14:57:49] PROBLEM - RAID on ms-be1009 is CRITICAL: Connection refused by host [14:57:49] PROBLEM - DPKG on ms-be1008 is CRITICAL: Connection refused by host [14:57:49] PROBLEM - SSH on ms-be1008 is CRITICAL: Connection refused [14:57:49] PROBLEM - RAID on ms-be1006 is CRITICAL: Connection refused by host [14:57:52] yes [14:57:59] PROBLEM - Disk space on ms-be1008 is CRITICAL: Connection refused by host [14:57:59] PROBLEM - DPKG on ms-be1007 is CRITICAL: Connection refused by host [14:58:09] PROBLEM - RAID on ms-be1008 is CRITICAL: Connection refused by host [14:58:09] PROBLEM - SSH on ms-be1006 is CRITICAL: Connection refused [14:58:11] (03PS1) 10Mark Bergsma: Enable ESI processing for the Testing carrier range [operations/puppet] - 10https://gerrit.wikimedia.org/r/86258 [14:58:19] PROBLEM - DPKG on ms-be1009 is CRITICAL: Connection refused by host [14:58:32] (03PS1) 10Yuvipanda: Rename labsvagrant to labs_vagrant [operations/puppet] - 10https://gerrit.wikimedia.org/r/86260 [14:58:39] Coren: ^, trivial [14:58:59] PROBLEM - NTP on ms-be1005 is CRITICAL: NTP CRITICAL: No response from NTP server [14:59:19] PROBLEM - NTP on ms-be1004 is CRITICAL: NTP CRITICAL: No response from NTP server [14:59:59] PROBLEM - NTP on ms-be1003 is CRITICAL: NTP CRITICAL: No response from NTP server [14:59:59] PROBLEM - NTP on ms-be1002 is CRITICAL: NTP CRITICAL: No response from NTP server [15:00:14] (03PS2) 10coren: Rename labsvagrant to labs_vagrant [operations/puppet] - 10https://gerrit.wikimedia.org/r/86260 (owner: 10Yuvipanda) [15:01:12] Coren: wait! found a typo [15:01:27] Coren: pushing. [15:01:28] Still needed to rebase. 
:-) [15:01:35] (03PS3) 10Yuvipanda: Rename labsvagrant to labs_vagrant [operations/puppet] - 10https://gerrit.wikimedia.org/r/86260 [15:01:50] oh sure :D [15:02:00] (03PS4) 10Yuvipanda: Rename labsvagrant to labs_vagrant [operations/puppet] - 10https://gerrit.wikimedia.org/r/86260 [15:02:03] rebased [15:03:47] Coren: can you also modify the role name in the wikitech UI once you merge this? [15:03:57] (I'll poke again when you're done with the meeting on -labs) [15:04:09] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 15:04:02 UTC 2013 [15:04:09] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:04:29] RECOVERY - SSH on ms-be1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:04:39] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 15:04:32 UTC 2013 [15:04:39] RECOVERY - SSH on ms-be1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:04:49] RECOVERY - SSH on ms-be1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:05:10] RECOVERY - SSH on ms-be1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:05:39] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [15:05:59] (03CR) 10coren: [C: 032] Rename labsvagrant to labs_vagrant [operations/puppet] - 10https://gerrit.wikimedia.org/r/86260 (owner: 10Yuvipanda) [15:06:41] YuviPanda: You /must/ remove the checkmark from any project that has it before I rename it at the UI though. [15:06:46] Tell me when that's done. [15:06:55] Coren: ok ok, they're in two and i've admin on both [15:06:56] doing [15:07:52] Coren: done [15:09:11] {{done}} [15:09:17] wooo [15:09:29] PROBLEM - NTP on ms-be1008 is CRITICAL: NTP CRITICAL: No response from NTP server [15:09:29] PROBLEM - NTP on ms-be1009 is CRITICAL: NTP CRITICAL: No response from NTP server [15:09:39] PROBLEM - NTP on ms-be1007 is CRITICAL: NTP CRITICAL: No response from NTP server [15:09:49] PROBLEM - NTP on ms-be1006 is CRITICAL: NTP CRITICAL: No response from NTP server [15:09:50] ty Coren [15:10:11] (03PS1) 10Dzahn: add etherpad-lite process monitoring via NRPE, RT #5790 [operations/puppet] - 10https://gerrit.wikimedia.org/r/86261 [15:12:19] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 15:12:11 UTC 2013 [15:12:29] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [15:13:29] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 15:13:26 UTC 2013 [15:13:49] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [15:19:46] (03PS2) 10Dzahn: add etherpad-lite process monitoring via NRPE, RT #5790 [operations/puppet] - 10https://gerrit.wikimedia.org/r/86261 [15:20:09] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [15:20:09] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [15:20:09] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:03] ok, paravoid, mark, been reading cache and varnish puppet stuff for bits [15:22:08] here's what I understand so far [15:22:15] if I include role::cache::bits [15:22:29] a buncha configs will be selected out of lvs and cache configuration classes based on $::site and $cluste [15:22:32] $cluster [15:22:46] yes [15:22:47] so I need to add the IPs of bits varnishes to those configs, as well as lvs's for ulsfo [15:23:09] it looks like for this varnish 
directors will be empty, is that ok?
[15:23:20] (03CR) 10Dzahn: [C: 032] "/usr/lib/nagios/plugins/check_procs -c 1:2 --ereg-argument-array='^/bin/sh /usr/share/etherpad-lite/bin/safeRun.sh'" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86261 (owner: 10Dzahn)
[15:23:28] it will fail then
[15:23:37] how does ams work?
[15:23:47] thae $backends config doesn't have an entiry for bits ams
[15:24:17] can i merge labs-vagrant changes on sockpuppet?
[15:24:32] Coren: ^?
[15:24:47] $varnish_directors is set based on $cluster_tier
[15:24:51] which is 1 for pmtpa/eqiad, 2 for esams
[15:24:55] and should become 2 for ulsfo as well
[15:25:01] right
[15:25:05] "backend" => flatten(values($role::cache::configuration::backends[$::realm]['bits']))
[15:25:12] 'bits' => {
[15:25:12] 'pmtpa' => flatten([$lvs::configuration::lvs_service_ips['production']['bits']['pmtpa']['bitslb']]),
[15:25:12] 'eqiad' => flatten([$lvs::configuration::lvs_service_ips['production']['bits']['eqiad']['bitslb']]),
[15:25:12] },
[15:25:19] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[15:25:19] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[15:25:19] no entry for esams there
[15:25:19] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[15:25:26] yes
[15:25:32] so, what does that mean
[15:25:39] the backens of esams and ulsfo are... pmtpa and eqiad
[15:25:47] ohhh ahhhhhh
[15:25:49] ok ok ok
[15:25:50] got it
[15:25:50] danke
[15:25:54] cool.
[15:25:59] ah makes sense ja
[15:26:04] ottomata: All good then?
[15:26:09] basically, the upper tier :)
[15:26:19] mutante was asking if he could merge labs_vagrant on sockpuppet Coren
[15:26:42] mutante: Oh, yes! Sorry, forgot to do it.
[15:26:48] * Coren slaps self.
[15:26:50] alright, doing so. np
[15:26:58] That'll teach me to +2 stuff while in a meeting. :-)
[15:26:59] so mark, we have multiple layers of varnishes?
[15:27:08] cache dcs -> varnish in main dcs -> apaches?
[15:27:12] yes
[15:27:15] (or whatever backend)
[15:27:16] aye cool
[15:27:19] PROBLEM - RAID on ms-be1011 is CRITICAL: Connection refused by host
[15:27:25] just curious, why not cache dcs -> apaches?
[15:27:29] PROBLEM - DPKG on ms-be1012 is CRITICAL: Connection refused by host
[15:27:29] PROBLEM - Disk space on ms-be1012 is CRITICAL: Connection refused by host
[15:27:39] PROBLEM - DPKG on ms-be1010 is CRITICAL: Connection refused by host
[15:27:49] PROBLEM - SSH on ms-be1010 is CRITICAL: Connection refused
[15:27:49] PROBLEM - SSH on ms-be1011 is CRITICAL: Connection refused
[15:27:49] PROBLEM - Disk space on ms-be1010 is CRITICAL: Connection refused by host
[15:27:49] PROBLEM - RAID on ms-be1012 is CRITICAL: Connection refused by host
[15:27:50] are apaches all private internal?
[15:27:59] PROBLEM - RAID on ms-be1010 is CRITICAL: Connection refused by host
[15:28:01] ottomata/akosiaris_away how'd you get the machines to start installing ?
[15:28:08] ha, I don't know!
[15:28:09] PROBLEM - SSH on ms-be1012 is CRITICAL: Connection refused
[15:28:09] PROBLEM - DPKG on ms-be1011 is CRITICAL: Connection refused by host
[15:28:15] i woke up and akosiaris had done them
[15:28:59] cool
[15:29:34] mark, i think I need to add an entry to lvs.pp $lvs_service_ips for bits
[15:29:37] but I don't know what to put there
[15:29:51] it will have a public service IP, I assume?
[15:29:56] 'ulsfo' => { 'bitslb' => ???
} [15:30:58] yes, you need to allocate LVS service IPs just like eqiad, pmtpa, esams have them [15:31:01] in DNS [15:31:24] I'd say, copy the ones from eqiad, in 154.80.208.in-addr.arpa [15:31:26] and ipv6 also [15:33:20] hmm k [15:33:29] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 15:33:27 UTC 2013 [15:33:39] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [15:34:39] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 15:34:32 UTC 2013 [15:34:49] RECOVERY - SSH on ms-be1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:34:49] RECOVERY - SSH on ms-be1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:35:09] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:35:09] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [15:39:29] PROBLEM - NTP on ms-be1011 is CRITICAL: NTP CRITICAL: No response from NTP server [15:39:29] PROBLEM - NTP on ms-be1010 is CRITICAL: NTP CRITICAL: No response from NTP server [15:39:31] haha [15:39:39] PROBLEM - NTP on ms-be1012 is CRITICAL: NTP CRITICAL: No response from NTP server [15:40:10] etherpad.wm is giving me 503s on a pretty consistent basis right now [15:40:19] PROBLEM - Host ms-fe1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:19] PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:47] greg-g: upgrade is coming soon, hope that helps [15:41:34] ottomata: i feel you know, waiting for icinga run ,, Could not retrieve catalog from remote server: execution expired [15:41:39] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 15:41:35 UTC 2013 [15:41:39] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 15:41:35 UTC 2013 [15:41:48] icinga run? [15:41:49] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [15:42:26] mutante, i kinda gave up mostly, and edited the files manually, i was just removing hosts [15:42:29] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [15:42:38] greg-g: RT #5789 and RT #5841 [15:42:40] but, LeslieCarr recommended playing puppet process tower defense in htop on stafford :p [15:42:48] ottomata: ..gotcha.. [15:42:54] kill puppet processes til you get some procs free, and then try to run yours [15:42:56] hehe, isee [15:42:56] heheh [15:42:58] yea [15:44:04] :) [15:45:29] RECOVERY - Host ms-fe1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [15:45:29] RECOVERY - Host ms-fe1004 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:47:29] PROBLEM - DPKG on ms-fe1003 is CRITICAL: Connection refused by host [15:47:29] PROBLEM - HTTP Apache on ms-fe1003 is CRITICAL: Connection refused [15:47:30] PROBLEM - SSH on ms-fe1004 is CRITICAL: Connection refused [15:47:39] PROBLEM - RAID on ms-fe1004 is CRITICAL: Connection refused by host [15:47:39] PROBLEM - HTTP Apache on ms-fe1004 is CRITICAL: Connection refused [15:47:59] PROBLEM - SSH on ms-fe1003 is CRITICAL: Connection refused [15:48:09] PROBLEM - Disk space on ms-fe1004 is CRITICAL: Connection refused by host [15:48:09] PROBLEM - RAID on ms-fe1003 is CRITICAL: Connection refused by host [15:48:19] PROBLEM - DPKG on ms-fe1004 is CRITICAL: Connection refused by host [15:48:23] mutante: thanks! 
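
Pulling together the exchange above: tier-2 sites (esams, and now ulsfo) use the tier-1 LVS service addresses as their varnish backends, so the missing piece ottomata asks about is an ulsfo block in $lvs_service_ips. A hedged sketch of that entry, following the ['bits'][site]['bitslb'] nesting implied by the lookup quoted at 15:25; the address is the bits-lb.ulsfo record from the grep output at 15:49 and may still move given the subnet change proposed just below, and the exact value shape (plain string vs. a hash that also carries IPv6) should be copied from the existing eqiad entry rather than from this sketch.

    # Hypothetical fragment of $lvs_service_ips['production'] in manifests/lvs.pp:
    'bits' => {
        # ... existing pmtpa / eqiad / esams entries ...
        'ulsfo' => {
            'bitslb' => '198.35.26.225',   # bits-lb.ulsfo per the grep output above; may change
            # IPv6 omitted: no ulsfo v6 service address is allocated anywhere in this log
        },
    },
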
[15:48:29] PROBLEM - Disk space on ms-fe1003 is CRITICAL: Connection refused by host [15:48:42] heya mark [15:48:58] should there be an entry in templates/wikimedia.org for bits-lb.esams.wikimedia.org? [15:49:13] yes [15:49:14] isn't there? [15:49:17] don't see it [15:49:55] grep bits-lb wikimedia.org [15:49:55] bits-lb.ulsfo 1H IN A 198.35.26.225 [15:49:55] bits-lb.pmtpa 1H IN A 208.80.152.210 [15:49:55] bits-lb.eqiad 1H IN A 208.80.154.234 [15:49:55] bits-lb 1H IN A 91.198.174.233 [15:50:04] that last one…is management? [15:50:17] maybe not...? [15:50:39] what do you mean? [15:50:40] it's esams [15:50:58] that's the esams one? [15:51:08] just trying to grok everything... [15:51:09] yes [15:51:29] hmm yeah ip matches hm [15:51:33] yep [15:52:23] ottomata: check the $ORIGIN [15:52:48] AHHHHhhh was looking for that, was way higher than I thought, was thrown off by the mgmt stuff ok [15:52:51] hm ok [15:53:34] so we just don't have eqiad and ulsfo and pmtpa consistent in this file? [15:53:48] esams servers have .esams. as subdomain [15:53:56] eqiad/pmtpa never had [15:54:05] ahhhh [15:54:08] legacy [15:54:08] so they are explicit, got it [15:54:19] i don't know what makes sense long term [15:54:22] so for ulsfo there should be a .ulsfo. origin? [15:54:22] hm [15:55:12] mark: well, moving a bunch of those servers under .esams.wmnet will help :) [15:55:35] ? [15:56:21] what? [15:56:39] which servers under .esams.? [15:56:39] will help what? [15:58:16] having less stuff under .esams.wikimedia.org [15:58:33] oh [15:58:45] gotcha, ok but i'm not going to do that (right now) :p [15:58:58] no no [15:59:13] so i'm adding bits-lb, which i think should be bits-lb.ulsfo.wikimedia.org, right? [15:59:27] correct [15:59:29] PROBLEM - NTP on ms-fe1004 is CRITICAL: NTP CRITICAL: No response from NTP server [15:59:30] afaict, there is no ORIGIN in this file for ulsfo.wikimedia.org [15:59:51] should I add one, or should I just add an entry for bits-lb.ulsfo under wikimedia.org ORIGIN? [15:59:59] PROBLEM - NTP on ms-fe1003 is CRITICAL: NTP CRITICAL: No response from NTP server [16:01:09] no, just do the same as pmtpa/eqiad I'd say [16:01:29] RECOVERY - SSH on ms-fe1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:01:42] ok, but mark was saying that that is kind of legacy? [16:01:56] no, servers under .esams.wikimedia.org is legacy [16:01:59] RECOVERY - SSH on ms-fe1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:02:01] actually can we make the lvs service ip's down in the bottom /25 ? [16:02:02] oh, k [16:02:17] mark / ottomata [16:02:21] so, for example, bastion is hooft.esams.wikimedia.org instead of hooft.wikimedia.org [16:02:21] LeslieCarr: I reserved a /27 for it yesterday [16:02:33] in dns [16:02:44] when I had to allocate ips for the logical systems [16:02:47] yeah, but can we move that to the lower /25 - so that all of ulsfo's specific stuff is in the lower /25 ? [16:02:50] any objection ? [16:02:56] oh [16:03:04] no objection [16:03:06] cool [16:03:25] .96/27 ? [16:03:39] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 16:03:29 UTC 2013 [16:03:39] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:04:09] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 16:04:04 UTC 2013 [16:04:56] (03PS1) 10Lcarr: moved the LVS subnet for ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86266 [16:04:58] LeslieCarr: cool w me, that isn't actually set anywhere yet, right mark?
it was just your comment? [16:05:09] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:05:10] correct [16:05:12] k [16:05:16] also, man i love having a git repo for dns :) [16:05:27] what do you mean by ulsfo specific stuff? [16:06:20] (03PS2) 10Lcarr: moved the LVS subnet for ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86266 [16:06:44] i would like the ip's that are ulsfo specific to be in the lower /25 of that range (like my change) [16:07:08] i consider the infrastructure block to not necessarily be ulsfo specific, just west coast region specific [16:07:30] well and possibly not even that, with backbone links [16:07:58] ok, i think i've got everything for the dns change, the only thing i really don't understand is the reverse ip6.arpa file (asking this with the risk of looking dumb…:) ) [16:08:32] why are the entires there just incrementing? [16:08:38] hmmmmmmMmmm masked or something? [16:08:42] what do you mean? [16:09:14] paravoid: was the what do you mean to me or otto ? [16:09:15] ohoh, are the full addies inferred from the filename [16:09:20] oh, to otto [16:09:23] sorry [16:09:24] right? [16:09:36] I don't understand your question ottomata [16:09:37] (03CR) 10Lcarr: [C: 032] moved the LVS subnet for ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86266 (owner: 10Lcarr) [16:09:40] i was wondering how the reverse stuff worked without seeing the full IP in the file [16:09:48] the entries just look like this [16:09:48] 3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR bits-lb.ulsfo.wikimedia.org. [16:09:57] but i guess the full IP is inferred from the file name? [16:10:00] 3.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa [16:10:01] ? [16:10:05] it's standard DNS [16:10:27] it it isn't terminated by dot, the origin is appended [16:10:38] so, it's also inferred from the origin line - $ORIGIN 1.0.0.0.{{ zonename }}. [16:10:39] so you say brewster IN A, and .wikimedia.org is appended [16:11:15] (03PS2) 10Dzahn: add salt grains for applicationserver roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/83768 [16:11:29] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 16:11:25 UTC 2013 [16:11:29] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 16:11:26 UTC 2013 [16:11:49] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [16:12:04] csteipp: out of curiosity, what is #p-personal? [16:12:29] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [16:12:54] (03CR) 10Dzahn: "Ryan, you said it would be better to use a class instead of the definition right away as in PS1 vs. PS2. Is that what you had in mind?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83768 (owner: 10Dzahn) [16:12:59] The name of the div that holds your toolbar in mediawiki [16:13:25] That we slide in if we centrally log you in, even though you hit the sight anonymously. [16:13:41] hm, ok LeslieCarr where does zonename get set? 
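To make the $ORIGIN mechanics above concrete: the nibble-format name in the reverse zone is relative, so the zone's origin is appended to it, exactly like the brewster example. A hedged illustration using the record and origin quoted above (the address that falls out is simply what those nibbles expand to, not necessarily the final allocation):

    # relative entry plus origin...
    #   $ORIGIN 1.0.0.0.3.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa.
    #   3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0  1H IN PTR  bits-lb.ulsfo.wikimedia.org.
    # ...expands to the fully qualified name
    #   3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0.0.3.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa.
    # which is the reverse entry for 2620:0:863:1::3; once live it can be checked with
    dig -x 2620:0:863:1::3 +short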
[16:14:02] that's the filename [16:14:09] ahhhhhhh ok got it [16:14:12] makes sense [16:14:13] thanks [16:14:15] paravoid is a better leslie than me :) [16:14:20] authdns-gen-zones does this [16:14:21] i'll get him a pink wig [16:14:48] context['zonename'] = filename [16:14:52] (03PS1) 10Mark Bergsma: Normalize and Vary on the forceHTTPS cookie [operations/puppet] - 10https://gerrit.wikimedia.org/r/86268 [16:15:30] hehe [16:15:30] hahaha [16:15:33] paravoid in a pink wig [16:15:34] aahahhhahah [16:15:46] lol [16:15:57] here is another dumb question: do you have a halloweenish celebration in greece? [16:16:09] and if so, are you dressing paravoid? and if so? what? [16:16:22] because if you don't have an idea, it sounds like you were just given a really good one: LeslieCarr [16:16:24] yes ... in february/march (its a movable one) [16:16:30] hehe [16:16:46] but no trick or treat and mostly no witches/skeletons etc [16:16:55] just costumes ? [16:16:57] it is dress as you like [16:17:06] yeah [16:17:09] great, dns grokking i think complete! [16:17:18] it's the carnival [16:17:37] trick or treat is really just a conspiracy of the liberal dentist + insulin manufacturers cabal to ruin children's teeth and make them diabetic [16:18:11] (03CR) 10Dzahn: [C: 031] "ack, the benefit seems worth a bit more than the downside." [operations/puppet] - 10https://gerrit.wikimedia.org/r/86250 (owner: 10Krinkle) [16:18:15] https://en.wikipedia.org/wiki/Carnival#Greece fwiw :) [16:18:22] and santa claues in green not red :P [16:18:23] akosiaris: how did you get the installations to happen? last night bast4001 started sending the installation images but would just time out, going very very slowly ? what magic did you do ? [16:18:32] greek magic [16:18:33] :P [16:18:35] lol [16:18:45] a sacrifice to the gods and such [16:18:55] i wore a ancient greek's armour first [16:19:03] then the helmet and the spear [16:19:11] and started dancing around the fire [16:19:18] rofl [16:19:22] anyway [16:19:32] its seems you guys had everything correct [16:19:55] but tftpd was configured to a different directory than /srv/tftpboot [16:20:02] oh [16:20:02] ahh [16:20:03] hahaha [16:20:06] nice [16:20:19] hehe, find docs in https://en.wikipedia.org/wiki/Greek_Magical_Papyri [16:20:26] hm, LeslieCarr should the v6 service IP just start from the next IP? [16:20:27] ::3? [16:20:36] I fixed it manually (hating myself in the process), and i will puppetize it later [16:20:38] for bits-lb? [16:20:57] guys? leslie? [16:21:08] nope, the f:f:'s are loopback only [16:21:09] I know we keep saying guys but this time you went too far :P [16:21:17] haha [16:21:31] ohohhh [16:21:33] ottomata: i always cheat and check out an existing zone to steal their config [16:21:59] like 1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa [16:22:08] did I ? i was implicitly referring to ottomata, LeslieCarr and RobH ... [16:22:25] so ... guys and lady ?? [16:22:25] yeah i'm looking, grokking allmoooost complete. [16:22:35] guys and girl ? [16:22:46] haha [16:22:54] akosiaris: hrmm, it shouldnt have been, i wonder why it didnt copy down cofig... 
oh well [16:22:58] LeslieCarr: actually i was curious about that, sense Ken mentioned several times, i like using guys as gender neutral, but I've heard that some folks don't like that [16:23:00] guys + awesomest person ever [16:23:06] nice find, i just found the error and wasnt willing to deal with it more at 6pm, heh [16:23:13] for the most part i don't care and don't even notice [16:23:22] akosiaris: fyi, if puppet would run on neon, etherpad would have process monitoring [16:23:24] i say folks a lot. [16:23:29] its a cheat, gender neutral [16:23:30] i think i've gotten used to it ;) [16:23:32] yep [16:23:35] notice I used folks in that last statement :p [16:23:38] hehe [16:23:39] I'm trying to use THON more [16:23:44] THON ? [16:23:50] that sounds like's thor's little brother ? [16:23:53] mutante: yeah i noticed. Good job. [16:24:00] http://www.qwantz.com/index.php?comic=2079 [16:24:06] that and the next 2 or 3 comics [16:24:13] altough i am not sure if it best to monitor safeRun.sh of the node process itself... [16:24:17] or* [16:25:28] ok, yea, we can adjust the regex easily, but this works currently [16:25:54] (03CR) 10CSteipp: [C: 031] "This looks like it will do what we need. If the cookie is present, we want the user to get the 302 redirect that MediaWiki will send. Othe" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86268 (owner: 10Mark Bergsma) [16:26:26] mutante: #1247... merge or reject/resolve/mark as irrelevant [16:26:32] ok, after reading this Thon sounds awesome [16:26:33] whatever you feel like :-) [16:27:07] okk:) [16:27:56] ok, from now on, no more "guys" just "thons" or "bitches" [16:28:30] merged [16:28:36] heheh yeah! how's it going thons?! [16:29:00] ok um LeslieCarr, having one more grokking problem [16:29:24] :) [16:29:26] ok ? [16:29:46] ok i was going to make bits-lb.ulsfo [16:29:50] 198.35.26.97 [16:29:56] in wikimedia.org then: [16:30:00] bits-lb.ulsfo 1H IN A 198.35.26.97 [16:30:00] 1H IN AAAA 2620:0:863:1:198.35.26.97 [16:30:02] is that correct? [16:30:03] hm [16:30:04] for IPv6? [16:30:09] oh wait no [16:30:12] so we're going to have to shuffle all our lvs service IPs around soon for zero [16:30:31] that's a bit unfortunate now with ulsfo [16:30:35] but we need to think/discuss about that first [16:30:35] oh? [16:30:44] and no, that's not right for ipv6 [16:30:53] ipv6 is all the 2620:0:86x:ed1a:: addresses [16:30:59] ok, didn't look right, but was copying other entries there [16:31:11] which you can copy exactly, but use 2620:0:863 instead of 2620:0:861 [16:31:29] anyway, i'm feeling sick, i'm going offline for a bit [16:31:38] ok, thanks for the help mark, feel bettaaahh [16:31:51] yeah, mark's advice of just copying and s//'ing is best [16:31:57] oh, real quick, should I hold off with this ip allocation stuff then, if we ahve to discuss zero stuff? [16:32:14] ok cool [16:33:00] we can discuss zero on email ? shouldn't be too hard to move around some ip's before they are in production [16:33:06] ok [16:33:29] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 16:33:25 UTC 2013 [16:33:39] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:34:09] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 16:34:05 UTC 2013 [16:34:50] ottomata, LeslieCarr, mark: my latest message to the ops list with subject line containing 'Request for Confirmation on LBs' contains my thoughts on IPs. 
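Pulling together what mark said before signing off, the forward records ottomata is after would look roughly like the eqiad ones with the ulsfo prefix substituted. The A address is the one proposed above and the ed1a subnet is what mark describes; the AAAA host part below is a placeholder, since the actual allocation is not confirmed in this log:

    # illustrative record shape only, not the final allocation
    #   bits-lb.ulsfo  1H  IN A     198.35.26.97
    #   bits-lb.ulsfo  1H  IN AAAA  2620:0:863:ed1a::1
    # once merged and deployed, confirm with
    dig +short bits-lb.ulsfo.wikimedia.org A
    dig +short bits-lb.ulsfo.wikimedia.org AAAA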
[16:35:09] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:35:14] ottomata, LeslieCarr, mark: i understand you may have already seen it, just wanted to note it in case it was muted or buried [16:35:37] ottomata, LeslieCarr, mark: and i understand it may take some time to, um, digest. [16:36:06] It's _hir_ network gear. https://en.wiktionary.org/wiki/hir#English [16:38:01] Can somebody take a look at the squid/varnish stats for the servers in front of dewiki and see if they look sick? See https://bugzilla.wikimedia.org/show_bug.cgi?id=54647 [16:38:29] !log mw1072 replacing hard drive [16:38:40] Logged the message, Master [16:40:19] (03PS1) 10Ottomata: Adding entry for bits-lb.ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86272 [16:40:26] (03CR) 10jenkins-bot: [V: 04-1] Adding entry for bits-lb.ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86272 (owner: 10Ottomata) [16:41:34] heya paravoid, should I modify config-geo for this too? [16:41:51] eventually [16:41:52] but not yet [16:41:53] ok [16:42:06] (03PS2) 10Ottomata: Adding entry for bits-lb.ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86272 [16:42:23] yes, in the end we should add ulsfo in geodns and start throwing traffic at it [16:42:30] but let's set it up first :) [16:42:35] cool, ok [16:42:58] can't you add all the -lbs? [16:43:01] since you're doing this [16:43:36] ha suppose so! but i was just doing one at a time atm, to make sure i knew what I was doing [16:43:47] trying to just do bits stuff atm [16:43:50] with varnishes as well [16:43:51] is it possible to go to city level and just point, say San Francisco at ulsfo? just curious [16:43:51] PROBLEM - Host mw1074 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:51] also wrong place in wikimedia.org [16:44:03] you added it on the servers section [16:44:08] there's a "round-robin" section below [16:44:13] which has first pmtpa, then eqiad [16:44:19] just add a ulsfo section below [16:44:25] and add all the allocations imho [16:44:39] mutante: yes, although we need to switch to the paid-for geoip database [16:44:47] but yes, gdnsd supports that, and even supports hierarchies [16:44:50] regions etc. [16:44:56] ah, i see, yea, you pay for more details of course [16:45:01] PROBLEM - Host mw1072 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:05] so we can say us eqiad, california ulsfo, san francisco pmtpa [16:45:11] that's stupid, but it's possible :) [16:45:28] use Oakland to test it :p [16:45:31] RECOVERY - Host mw1074 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:46:51] oook paravoid, will do [16:47:20] paravoid, here? [16:47:20] https://gerrit.wikimedia.org/r/#/c/86272/3/templates/wikimedia.org [16:47:54] yep [16:47:56] that's it [16:48:06] feel free to add ;pmtpa ;eqiad comments too if it helps you [16:51:53] k that will help, will do [16:54:10] paravoid: how does this do round robin, it looks like there is only one ip per name? [16:54:50] lvs [16:54:54] this is the lvs service IP [16:55:13] ah ok [16:55:14] so it goes from the router to the lvs boxes and then lvs rounds-robin to the varnish servers [16:55:36] ahh k, so not round robin dns [16:55:40] no [16:56:10] ahh ok i think that's why i was unsure, can I change comment to Round Robin LVS Services? [16:56:24] Round Robin LVS Service records [16:56:24] ?
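The answer to the round-robin question above is that the zone file only carries the single LVS service IP; the fan-out to the individual varnish frontends happens in LVS/pybal behind that address. A rough illustration of what that looks like on an LVS host, where the backend addresses, weights, and scheduler shown are made up and only the shape matters:

    ipvsadm -L -n
    #   TCP  208.80.154.234:80 wrr        <- the single bits-lb.eqiad service IP from the zone file
    #     -> 10.64.0.101:80  Route  10    <- pybal-managed varnish frontends behind it
    #     -> 10.64.0.102:80  Route  10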
[16:56:26] it's the pybal config [17:01:37] (03PS4) 10Ottomata: Adding entries for bits-lb.ulsfo, mobile-lb.ulsfo, and upload-lb.ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86272 [17:02:33] (03CR) 10Dzahn: [C: 032] since we just touched statistics.pp anyways, sneak in the retabbing as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/86110 (owner: 10Dzahn) [17:04:07] (03CR) 10Dzahn: [C: 032] statistics.pp - fix unquoted file modes and resource titles. puppet-lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/86112 (owner: 10Dzahn) [17:06:11] RECOVERY - Host mw1072 is UP: PING OK - Packet loss = 16%, RTA = 0.31 ms [17:08:31] PROBLEM - RAID on mw1072 is CRITICAL: Timeout while attempting connection [17:08:31] PROBLEM - DPKG on mw1072 is CRITICAL: Timeout while attempting connection [17:08:31] PROBLEM - SSH on mw1072 is CRITICAL: Connection timed out [17:08:41] PROBLEM - twemproxy process on mw1072 is CRITICAL: Connection refused by host [17:08:51] PROBLEM - Apache HTTP on mw1072 is CRITICAL: Connection refused [17:08:51] PROBLEM - Disk space on mw1072 is CRITICAL: Connection refused by host [17:12:49] paravoid, i replied to your comments, hope it answered your concern. More work on carrier tagging is coming :) [17:13:15] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:13:25] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [17:13:44] (03CR) 10Dzahn: [C: 032] statistics.pp, puppet-lint, fix WARNINGs: string containing only a variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/86113 (owner: 10Dzahn) [17:13:50] Anyone about? Need a file touching as udp2log:udp2log on fluorine please [17:13:55] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [17:13:55] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [17:14:56] Reedy: yo, where [17:14:58] mark, around? I am not very sure I understand your patch [17:15:15] ottomata: neon finished catalog run, heh [17:15:16] /a/mw-log/temp-debug.log [17:15:56] woot! [17:16:07] yurik: mark signed off, wasn't feeling well [17:16:10] Reedy: 0 -rw-r--r-- 1 udp2log udp2log 0 Sep 27 17:15 temp-debug.log [17:16:18] thanks [17:16:21] yw [17:16:24] that ESI patch really did him in :( [17:16:32] thx ottomata [17:17:17] heh [17:17:21] ottomata: i merged the puppet lint fixes on statistics.pp, watching one more run to make sure nothing went wrong.. [17:17:40] k danke [17:19:06] (03CR) 10Yurik: "Looks ok, but I am not sure why ESI should be enabled on both front and back?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86258 (owner: 10Mark Bergsma) [17:20:15] PROBLEM - NTP on mw1072 is CRITICAL: NTP CRITICAL: No response from NTP server [17:20:20] Ryan_Lane: 'fatal: Unknown commit none/master', I forgot how we resolved this last time -- doing a --force sync now [17:21:00] did you do an initial commit? 
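The 'fatal: Unknown commit none/master' error here just means the deployment repo had no commits yet, so there was no master to diff against. A minimal sketch of the workaround being discussed; the forced sync itself is whatever ori ran above, and its exact invocation is not shown in this log:

    # give the repo something for the deploy tooling to reference
    git commit --allow-empty -m 'initial commit'
    # ...then re-run the --force sync mentioned above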
[17:21:37] bootstrapping is surely the biggest pain in the ass [17:22:22] i didn't, but it worked now [17:22:46] * Ryan_Lane nods [17:22:50] so maybe instead of requiring an initial commit, 'fatal: Unknown commit none/master' should be swallowed on --force [17:22:57] since it's not fatal anyway [17:23:00] * Ryan_Lane nods [17:23:18] I need to switch the frontend out with sartoris [17:23:24] so that we can have more control [17:23:51] I guess I can just modify the perl some more but, bleh [17:27:09] !log powering down analytics1021 replacing disk1 (sdb) [17:27:20] Logged the message, Master [17:29:53] lunchtime, back in a bit [17:30:15] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [17:31:15] !log reedy synchronized php-1.22wmf19/includes/GlobalFunctions.php 'all of the debugging' [17:31:26] Logged the message, Master [17:33:05] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:33:26] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 17:33:24 UTC 2013 [17:34:05] RECOVERY - Puppet freshness on analytics1021 is OK: puppet ran at Fri Sep 27 17:33:59 UTC 2013 [17:34:25] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [17:34:35] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 17:34:29 UTC 2013 [17:35:15] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:36:26] RECOVERY - SSH on mw1072 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:39:36] !log reedy synchronized php-1.22wmf19/includes/GlobalFunctions.php 'all of the debugging' [17:39:48] Logged the message, Master [17:41:35] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 17:41:32 UTC 2013 [17:41:35] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 17:41:32 UTC 2013 [17:41:55] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [17:41:55] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [17:42:34] !log reedy synchronized php-1.22wmf19/includes/GlobalFunctions.php 'all of the debugging' [17:42:47] Logged the message, Master [17:43:12] (03CR) 10Yurik: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86258 (owner: 10Mark Bergsma) [17:44:35] (03CR) 10Chad: [C: 031] gerrit: Don't include '.' in the match for adjecent separator [operations/puppet] - 10https://gerrit.wikimedia.org/r/86250 (owner: 10Krinkle) [17:48:37] mark; if you're there and have a few moments; I'd love to chat about your comments on my RfC [17:51:04] mwalker: he wasn't feeling well and signed off [17:51:22] fair enough; I'll respond in the sloooow way then :) [17:53:39] 24 Fatal error: Nesting level too deep - recursive dependency? 
in /usr/local/apache/common-local/php-1.22wmf19/includes/GlobalFunctions.php on line 147 [17:53:39] ffs [17:53:54] !log reedy synchronized php-1.22wmf19/includes/GlobalFunctions.php 'all of the debugging' [17:54:07] Logged the message, Master [18:00:43] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.006 second response time [18:02:43] RECOVERY - Puppet freshness on mw1072 is OK: puppet ran at Fri Sep 27 18:02:40 UTC 2013 [18:04:04] !log reedy synchronized php-1.22wmf19/includes/GlobalFunctions.php 'all of the debugging' [18:04:15] Logged the message, Master [18:06:33] (03PS1) 10Reedy: Repoint php at 1.22wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86280 [18:07:15] (03CR) 10Reedy: [C: 032] Repoint php at 1.22wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86280 (owner: 10Reedy) [18:07:47] (03Merged) 10jenkins-bot: Repoint php at 1.22wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86280 (owner: 10Reedy) [18:09:23] RECOVERY - RAID on mw1072 is OK: OK: no RAID installed [18:09:23] RECOVERY - DPKG on mw1072 is OK: All packages OK [18:09:53] RECOVERY - Disk space on mw1072 is OK: DISK OK [18:10:14] RECOVERY - NTP on mw1072 is OK: NTP OK: Offset -0.0657916069 secs [18:11:43] PROBLEM - Apache HTTP on mw1072 is CRITICAL: Connection refused [18:12:32] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [18:12:42] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [18:13:11] cmjohnson1: can you wipe the decom'ed servers so they don't keep trying to check into puppet ? [18:13:12] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [18:13:12] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [18:13:15] actually decom? :) [18:14:48] lesliecarr: they're in the decom.pp list...that should of stopped the puppet checks [18:15:54] lesliecarr: in rt5848.....the bonded ports were id'd in site.pp right? and on the switch? [18:16:13] well they check into puppet and then they get removed by the decom.pp [18:16:23] so that causes the problem puppet freshness and then recovery [18:17:01] cool, that is defined in both site.pp and the switch [18:17:09] cmjohnson1: for rt5846 [18:17:22] cool..just want to make sure [18:17:43] will get them after i finish getting mw1072 back up [18:18:01] cool [18:18:19] !log reedy synchronized php-1.22wmf19/includes/GlobalFunctions.php 'Revert' [18:18:30] Logged the message, Master [18:22:42] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.077 second response time [18:24:23] (03PS1) 10Andrew Bogott: Move generic::pythonpip into the stats role. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86281 [18:24:47] (03CR) 10jenkins-bot: [V: 04-1] Move generic::pythonpip into the stats role. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/86281 (owner: 10Andrew Bogott) [18:25:42] RECOVERY - twemproxy process on mw1072 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [18:33:22] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Sep 27 18:33:20 UTC 2013 [18:34:12] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [18:34:33] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 18:34:31 UTC 2013 [18:35:12] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [18:37:29] ottomata: any reason you didn't put the other lvs ips in dns ? [18:38:04] aren't they already in? [18:38:33] lvs400*.ulsfo.wmnet? or do you mean something else? [18:38:59] like wikimedia-lb.ulsfo.wikimedia.org [18:39:37] hmm, i guess because I didn't know we wanted those? i think we're just doing those three types of varnishes with the 20 cps [18:39:48] for now anyway, right? [18:41:12] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Fri Sep 27 18:41:04 UTC 2013 [18:41:32] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [18:41:32] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Fri Sep 27 18:41:24 UTC 2013 [18:41:42] PROBLEM - Puppet freshness on praseodymium is CRITICAL: No successful Puppet run in the last 10 hours [18:43:22] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [18:45:19] LeslieCarr: ? [18:45:52] oh okay, i thought we were going to do all of the frontends, not just bits, mobile, upload [18:46:00] how many of the cp's are dedicated to those 3 ? [18:46:02] PROBLEM - Host titanium is DOWN: PING CRITICAL - Packet loss = 100% [18:46:06] i think eventually, but mark was saying that 20 wasn't enough [18:46:12] so to start with bits mobile upload [18:46:25] so, i guess i could add the other service ips still [18:46:25] (03PS2) 10Andrew Bogott: Move generic::pythonpip into the stats role. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86281 [18:46:27] ok [18:46:28] just don't know what they are [18:46:32] PROBLEM - Host xenon is DOWN: PING CRITICAL - Packet loss = 100% [18:46:37] bascially, what, copy esams? [18:47:01] or eqiad there? and give them all uslfo net ips? [18:49:20] (03PS1) 10Yuvipanda: Add php5-sqlite module [operations/puppet] - 10https://gerrit.wikimedia.org/r/86288 [18:50:20] (03PS1) 10Manybubbles: Fix lograte for elasticsearch. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86289 [18:54:11] yeah, average, i see ./perl5/lib/perl5/local/lib.pm in your homedir [18:54:23] that needs to be globally installed, right? [18:54:27] what is lib.pm? where is it from? [18:55:20] I have Moritz here who is working on math support, and we have a question about running a separate service vs. shelling out from PHP [18:55:54] he can basically implement both, but the complexity for ops would differ, so it would be good to get your input on this [18:57:01] the web service would need to be monitored and load balanced, but on the plus side we'd avoid the need to install texlive etc on all apaches [18:58:25] parsoid page? [18:58:39] something up with parsoid? [18:58:47] I also just got that [18:59:00] cerium.wikimedia.org? [18:59:03] (03PS2) 10Andrew Bogott: Add php5-sqlite package to toollabs instances. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/86288 (owner: 10Yuvipanda) [18:59:17] kind of weird that a hostname like that is paging for parsoid [18:59:32] gwicke: ? [18:59:32] ganglia looks fine [18:59:33] there's not any report from it in here that I can see [18:59:36] is that legit? [18:59:51] (03CR) 10Ottomata: [C: 032 V: 032] Fix lograte for elasticsearch. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86289 (owner: 10Manybubbles) [18:59:54] this looks like a watchmouse page [19:00:08] cerium is one of the Varnishes [19:00:11] https://wikitech.wikimedia.org/wiki/Parsoid [19:00:50] ^^ RoanKattouw [19:00:59] manybubbles: merged [19:01:07] manual testing is fine [19:01:09] Oh, yeah [19:01:12] Sorry about this guys [19:01:18] Parsoid monitoring in Watchmouse should be killed [19:01:32] It's no longer a publicly accessible service, so Watchmouse cannot monitor it [19:02:03] Unless we decide that it should be publicly accessible in which case we should ask ops for a public IP [19:02:10] to hook up to the LVS VIP [19:02:30] Also I should fix that documentation page [19:02:38] gwicke: i was given cerium to decom...is that not true? [19:02:44] cerium and titanium haven't been serving Parsoid for a while [19:02:47] They're being decommissioned [19:03:02] cmjohnson1: No you're right. And gwicke read the docs correctly, it's just that the docs are wrong [19:03:18] ah..okay..cool thx [19:03:39] * RoanKattouw fixes wikitech page [19:03:59] ottomata: thanks! I'll go clean up the files. [19:04:21] (03CR) 10Andrew Bogott: [C: 032] Add php5-sqlite package to toollabs instances. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86288 (owner: 10Yuvipanda) [19:04:52] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 19:04:49 UTC 2013 [19:05:02] yup! [19:05:12] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:08:20] haha [19:08:25] folks getting watchmouse for cerium [19:08:26] its ok [19:08:29] its a bad monitor! [19:08:35] RoanKattouw: Did you set that monitor or did someone in ops? [19:08:54] mutante did [19:08:56] It should die [19:09:14] I'm sorry I overlooked this when cleaning up our Parsoid setup months ago [19:09:26] We only noticed it now because the boxes I freed up then are only being decommissioned now [19:10:03] no worries, i can fix it [19:10:06] yep! [19:10:09] this is my fault =] [19:10:21] or maybe cmjohnson1 [19:10:24] If we make Parsoid a public service, then we should set up a public IP that ends up going to the Parsoid LVS VIP, and reinstate that monitor pointing to that public IP [19:10:50] But until such time, we should not monitor it in Watchmouse as it's not a public service and so Watchmouse cannot possibly monitor it [19:12:42] deleting it now [19:12:46] So take a blowtorch or a soldering iron or whatever your tool-turned-weapon of choice is and make that monitor die in a fire [19:12:47] Thanks [19:13:29] deleted =] [19:13:52] gonna email ops list just for fyi [19:13:55] !log reedy synchronized php-1.22wmf19/extensions/TorBlock/ [19:14:08] Logged the message, Master [19:18:20] !log restarting apache on professor (graphite) [19:18:32] Logged the message, Master [19:20:42] LeslieCarr: do I need to add $ORIGIN svc.ulsfo.wmnet. at the bottom of wmnet [19:20:43] ? [19:20:54] 10.2.4.x? not sure here... 
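If a ulsfo stanza were added at the bottom of the wmnet template, the shape would presumably mirror the existing sites, but note that LeslieCarr says later in this log that ulsfo may not need internal service VIPs at all since they will keep going to eqiad. The name and the 10.2.4.x prefix below are placeholders, not an allocation:

    #   $ORIGIN svc.ulsfo.wmnet.
    #   bits  1H  IN A  10.2.4.1    ; hypothetical -- only needed if ulsfo ever fronts internal services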
[19:24:08] (03PS1) 10ArielGlenn: ukwikimedia is moving, set all namespaces to read-only except user talk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86303 [19:28:00] (03PS2) 10ArielGlenn: ukwikimedia is moving, set all namespaces to read-only except user talk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86303 [19:29:35] (03CR) 10ArielGlenn: "second patchset fixes some indentation issues on first patchset" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86303 (owner: 10ArielGlenn) [19:30:39] (03PS1) 10Andrew Bogott: Abolish generic::packages::locales. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86306 [19:31:43] paravoid, more dns questions if you are around [19:32:29] https://wikitech.wikimedia.org/w/index.php?title=Parsoid&diff=84625&oldid=75866 fixes mentions of cerium/titanium [19:33:03] (03CR) 10Andrew Bogott: [C: 032] Abolish generic::packages::locales. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86306 (owner: 10Andrew Bogott) [19:33:52] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 19:33:48 UTC 2013 [19:34:12] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:36:40] manybubbles: hey. let's talk in here, rather than email [19:36:54] manybubbles: since I'm still not totally getting what you're trying to do [19:37:14] Ryan_Lane: sure! [19:37:17] you have a set of systems: sys1, sys2, sys3, sys4, etc.. [19:37:31] not really [19:37:37] you want to do: sys1: forceSearchIndex.php --wiki test2wiki --fromId 0 --toId 1091; sys2: forceSearchIndex.php --wiki test2wiki --fromId 1091 --toId 2182 [19:37:39] I have a set of jobs [19:37:39] etc. etc.? [19:37:54] yeah, that [19:38:13] well, except your sys* aren't really predefined, right? [19:38:26] how would parallel split these up across the systems? [19:38:26] you've got a set of jobs that can pretty much run on any mediawiki apache, right? [19:38:31] in which way would it target then? [19:38:34] *them [19:38:44] wait, this is going to run on the apaches? [19:38:53] i may have no idea what i'm talking about :/ [19:38:55] heh. why not just inject jobs into the job queue? [19:39:03] ottomata: I'd prefer to just list some systems or a pool name or something and let the runner just deal them randomly [19:39:14] yeah, this is the kind of thing the job queue is for [19:39:32] right, but the basic idea is, that you need to run jobs in parallel on mediawiki servers, rigth? [19:39:33] ideally with a less shitty job queue system in the future, like gearman or something else [19:39:37] doesn't matter which jobs run where? [19:39:42] they all need to happen at the same exact time? [19:40:07] I'd prefer they run with a lower priority, but yeah [19:40:33] well, a separate job queue is still doable [19:40:33] the problem with the job queue is that we have something that is already a command line script and it seems silly to wrap it in a job queue [19:40:36] well, not the problem [19:40:50] but that is why I've been resisting it [19:41:14] where is this going to run? the app servers? [19:41:21] is it going to run manually? [19:41:35] Ryan_Lane: I haven't a clue, really. Api, app servers, terbium like machines [19:41:41] kicked off manually, yes [19:41:54] it's possible to do via salt, assuming we made a runner [19:42:25] where the runner would note the number of systems and split the jobs up accordingly, then sent them to the minions [19:42:38] salt can deal the jobs out, then? 
I was just trying to find the path of least resistance [19:43:20] it can... but I still think the job queue is better for this [19:43:49] you're attempting to do this in a push way, and it doesn't take some things into consideration [19:44:05] for instance. the hosts could already be running relatively intense jobs [19:44:20] or, the host may not actually respond [19:44:39] and if it doesn't, then you need to re-schedule that chunk of work and only that chunk of work [19:46:22] so to do this in the job queue, do I have to create a separate queue and tell some nodes to read from it and other not? I'm sure that is all on the documentation page. [19:46:47] If I do this push style I'll just jam a bunch of jobs on some queue and let them drain in whatever amount of time it takes [19:46:54] sorry, pull style [19:48:33] you'd push them into a separate queue [19:48:41] and then have a cron that runs specific for that queue [19:48:45] at least I believe that's how it works [19:48:47] AaronSchulz: ^^ ? [19:51:07] manybubbles: well, if you push you have the possibility of portions of it not running [19:51:22] then you need to put in retries, or you need to be able to re-run those portions manually [19:51:38] and you may hit systems that are already overloaded, causing them to OOM [19:53:54] so gnu parallel handles stuff like retries and central job queueing - but it would deal the jobs out indescriminantly, potentially overwhelming machines [19:55:41] well, it also adds another mechanism for doing the same things we're already doing, too ;) [19:56:51] salt will not do things like retry [19:57:03] the job queue picks up jobs as they are fed into the queue [19:57:27] one thing you could do, is to feed them into the queue [19:57:40] then have salt tell the systems to pick up jobs from the queue [19:58:09] it still has the problem of possibly overwhelming the servers [19:58:10] Ryan_Lane: that is certainly possible, though it doesn't solve the overwhelming machines problem. [19:58:49] though that seems like a better approach than introducing a queue/remote execution framework [19:59:32] <^d> Does salt at least report failures? [19:59:41] <^d> Auto retrying isn't so bad, as long as we know which ones exploded. [19:59:55] yes, it can [20:00:06] <^d> Plus this is only for disaster recovery and initial buildout of indexes, not a weekly thing or anything like that. [20:00:27] the hard thing with salt would be splitting the jobs across the systems [20:00:30] it'll feel weekly for a while [20:00:40] when we are adding new wikis quickly [20:00:43] which is why I think job queue is best [20:01:11] ^d: is it possible to prioritize our jobs or something? [20:01:13] then rather than having a cron pick up from the queue, have salt call it [20:01:29] <^d> manybubbles: Sort of. You can kick off runners for specific job types. [20:01:47] <^d> But you can't say "This job is more important than that one" afaik. [20:01:50] ah [20:02:05] right, so don't have anything pick up the jobs [20:02:16] our jobs wouldn't really the same from a performance standpoint as most jobs [20:02:25] but have salt tell the system "pick up this job type now" [20:02:30] unless we just made a job per page [20:02:59] Ryan_Lane: now that is an idea, we could swing some number of servers over from consuming normal jobs to consuming our jobs [20:03:07] then swing them back [20:03:19] <^d> Or just kick off additional runners on those machines. 
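A hedged sketch of the hybrid being floated here: put the work on the job queue, then use salt to tell a bounded set of servers to drain it at low priority, rather than pushing the script at hosts directly. The grain used for targeting and the runJobs.php arguments are illustrative, not commands anyone actually ran in this log:

    salt -G 'applicationserver:jobrunner' cmd.run \
        'nice -n 19 mwscript runJobs.php --wiki=test2wiki --maxjobs 1000'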
[20:03:30] ^d: the worry is overwhelming them [20:03:35] yeah [20:04:30] <^d> I'm afraid of the queue getting backed up if we pull runners out of rotation though. [20:04:48] yeah [20:04:53] its pretty moot if we don't have machines to throw at the problem [20:05:00] well, we can also nice the processes [20:05:17] (03PS3) 10ArielGlenn: ukwikimedia is moving, set all namespaces to read-only except user talk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86303 [20:05:22] assuming they don't eat up shitloads of ram it should be fine [20:05:39] we mostly eat cpu [20:05:46] <^d> Yeah ram shouldn't be a big problem [20:05:49] then it should be fine just nicing them [20:06:10] <^d> Only time it becomes an issue is if we kept parsing again and again since the parser slowly leaks memory. [20:06:20] at that point we come back to just dealing them out again, just niced. [20:06:21] <^d> So --fromId 0 --toId 1000000 has a possibility of OOM :) [20:06:28] yeah [20:06:29] we have the ability to check jobs on the queue, right/ [20:06:34] if so, we don't need to worry about reporting [20:06:38] just execution [20:06:42] which makes it really simple [20:06:44] (03PS1) 10Ottomata: Ensuring package pigz is installed for wikistats parallel gzip processing. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86310 [20:07:09] we can monitor the length of the queues, I believe. [20:07:14] <^d> Yes [20:07:23] (03CR) 10Ottomata: [C: 032 V: 032] Ensuring package pigz is installed for wikistats parallel gzip processing. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86310 (owner: 10Ottomata) [20:07:26] and we can assume if a job is popped from the queue that it ran [20:07:45] if not, then we need to fix our job queue system :) [20:08:13] <^d> pop() just marks it as claimed, not that it's been run or successful yet :) [20:08:16] Ryan_Lane: it'll log output and that output'll be shunted back home [20:08:21] <^d> There's other black magic that I should know. [20:08:25] <^d> I did review Aaron's work on this. [20:08:34] manybubbles: shunted back where, though? [20:08:52] we should really switch to something like gearman :) [20:08:55] Ryan_Lane: if we log it like we log everything else then to fluorine [20:09:00] ah. right [20:09:12] unless jobs are different [20:09:13] and we need logstash, so that we can tag logs like this [20:09:28] I've heard we're looking at it [20:09:31] yep [20:09:36] it's elasticsearch backed [20:09:48] well, by default it's ES backed. [20:09:58] Reedy: Do we have any way to purge varnish/squids in front of dewiki (& maybe others)? See https://bugzilla.wikimedia.org/show_bug.cgi?id=54647 and paravoid's explanation of multicast/htcp outage that has now been resolved. [20:10:11] <^d> Ryan_Lane: I considered gearman. [20:10:20] <^d> Since it's already got php bindings and everything. [20:10:22] +1 for logstash. Hoping to help on work for that. [20:10:33] <^d> But Nik like gnu parallel and other people mentioned salt so I dropped it :) [20:10:36] bd808: echo "http://en.wikipedia.org/wiki/Foobar" | mwscript purgeList.php enwiki [20:10:42] ^d: I think hashar is looking at gearman for jenkins [20:10:49] gearman is different from salt and parallel [20:10:53] logstash yeah! [20:11:01] i've done gearman stuff before [20:11:03] <^d> Ryan_Lane: Indeed. 
But it seems like what I wanted here :) [20:11:07] I just want something that works and I didn't want to get stuck waiting on some infrastructure project [20:11:16] wrote an email campaign mailer with it [20:11:19] I mean, you can turn salt into gearman, but that's development effort ;) [20:11:44] manybubbles: yeah. I think it's easily doable with salt + job queue, though. [20:11:52] Reedy: followup question - how do I/we get a list of pages that changes during outage window? [20:12:14] bd808: cirrussearch runs a sql query for that! [20:12:19] ottomata: for any of the internal vip's but i don't believe that we will have any internal vip's in ulsfo - i believe they will be going to eqiad [20:12:19] Either revision or recentchanges table I guess [20:12:21] also, parallel uses ssh and I'm actively trying to kill all of our uses of ssh ;) [20:12:50] Ryan_Lane: fair enough. it is also very very very nice for what it does. [20:12:55] * Ryan_Lane nods [20:13:06] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:16] ok, I'll send an email about the job queue [20:13:24] I'll add aaron. [20:13:24] ok, cool [20:13:38] Reedy, manybubbles: should I know how to run such queries? I haven't been shown/found out how to do that kind of stuff yet. [20:14:40] bd808: I'm not sure who has access to production mysql. but you can steal the query from forceSearchIndex.php in cirrussearch if/when you get the access [20:14:48] you could also query a labs slave [20:14:53] which you should have access too [20:15:07] accessing the labs slaves is easy [20:15:18] just make a labs account and request access to tools project [20:15:19] (03CR) 10Reedy: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86303 (owner: 10ArielGlenn) [20:15:24] if you haven't already [20:16:11] Ryan_Lane: I have labs account; not sure about tools access [20:16:35] bd808: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request [20:16:53] andrewbogott: you know, i don't think that pythonpip thing works anyway [20:17:01] aannnnd I *think* its not being used [20:17:44] yeah you know, i'd say remove it. [20:17:55] if someone complains later ( I don't think they will) I will deal with it [20:18:03] Ryan_Lane: Thanks. {{done}} [20:18:36] bd808: ah, you're already a member [20:19:06] So I just log into bastion and …? do stuff *grin* [20:19:13] (03PS4) 10Reedy: ukwikimedia is moving, set all namespaces to read-only except user talk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86303 (owner: 10ArielGlenn) [20:19:18] bd808: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Database_access [20:19:31] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Accessing_Tool_Labs_and_managing_your_files [20:19:41] Ryan_Lane: Thank you for teaching me to fish. 
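A sketch of the kind of query bd808 needs, run against the Tool Labs replicas Ryan points at. The replica host and database names are assumptions about the labs setup, not taken from this log; the table and column names are standard MediaWiki core, and the window and namespace match the purge run described further down:

    # host/database naming below is assumed, not confirmed here
    mysql -h enwiki.labsdb enwiki_p -e "
        SELECT DISTINCT page_namespace, page_title
        FROM revision
        JOIN page ON rev_page = page_id
        WHERE rev_timestamp BETWEEN '20130922000000' AND '20130926000000'
          AND page_namespace = 0;"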
[20:19:56] yw [20:20:07] * Ryan_Lane has never actually used the database access [20:20:36] production ftw [20:20:43] Reedy: tsk tsk [20:20:44] :) [20:21:09] (03CR) 10Reedy: [C: 032] ukwikimedia is moving, set all namespaces to read-only except user talk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86303 (owner: 10ArielGlenn) [20:21:17] (03Merged) 10jenkins-bot: ukwikimedia is moving, set all namespaces to read-only except user talk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86303 (owner: 10ArielGlenn) [20:21:28] idle pmtpa slaves are great for silly sql queries [20:23:04] ottomata, remove it and remove the references in role::statistics as well? [20:27:42] (03PS3) 10Andrew Bogott: Remove generic::pythonpip [operations/puppet] - 10https://gerrit.wikimedia.org/r/86281 [20:28:01] yup [20:28:28] (03CR) 10Ottomata: [C: 032] Remove generic::pythonpip [operations/puppet] - 10https://gerrit.wikimedia.org/r/86281 (owner: 10Andrew Bogott) [20:29:24] (03PS1) 10Ori.livneh: Add Gdash module; remove Gdash source tree from Puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/86312 [20:29:46] paravoid / Ryan_Lane: ^ dat diff: +100, -4519 [20:30:27] (03CR) 10Ori.livneh: [C: 04-1] "I'd like to be around when this is merged and I have to run right now, so -1ing." [operations/puppet] - 10https://gerrit.wikimedia.org/r/86312 (owner: 10Ori.livneh) [20:30:34] !log ariel synchronized wmf-config/InitialiseSettings.php 'ukwikimedia to read-only except for user talk' [20:30:45] Logged the message, Master [20:30:54] (03PS5) 10Ottomata: Adding entries for bits-lb.ulsfo, mobile-lb.ulsfo, and upload-lb.ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/86272 [20:32:01] ori-l: :D [20:32:02] \o/ [20:33:18] ori-l: wtf is ordered_json($settings) ?? [20:33:36] does it take a hash and output it as ordered json? [20:33:41] if it does I'm going to shit a brick [20:34:42] yep :P [20:34:49] AaronSchulz: http://www.planetcassandra.org/Learn/FAQ#arch-16 [20:34:52] dude. I needed this a week ago [20:34:57] ori-l: can you put this somewhere generic? [20:35:01] I still need it [20:35:25] I can finish reviewing this as is, of course [20:35:28] AaronSchulz: and http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 [20:35:31] can be a followup or something :) [20:35:31] Ryan_Lane: Ok, gotta run right now but will do so as a separate patch later [20:35:34] yep [20:35:35] manybubbles: can you translate the db->select() from forceSearchIndex.php into "real" sql for me? [20:35:37] \o/ [20:35:39] <3 [20:35:56] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 20:35:49 UTC 2013 [20:35:59] bd808: selectSQLText() [20:36:01] <^d> bd808: Swap select() for selectSQLText() with the same params and you can get it. [20:36:06] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [20:36:22] Reedy. ^d: thx [20:36:27] ori-l: /opt? [20:36:32] don't we generally use /srv? [20:36:41] Ryan_Lane: it's where it's currently deployed [20:36:44] trying not to break things [20:36:57] don't look at the state of professor if you want to keep your lunch [20:37:03] (03PS1) 10Ottomata: Puppetizing bits ulsfo varnishes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86313 [20:37:10] i'll clean it up further, but one step at a time [20:37:21] hm. 
well, git deploy deploys to/from the same location [20:37:23] * Ryan_Lane nods [20:37:46] git deploy does that for a sense of sanity, so that you don't need to chase down things on target systems [20:38:00] (03CR) 10Ottomata: [C: 04-1] "I still feel like I only have about an 85% grasp on all the moving pieces here, so a full review of this and https://gerrit.wikimedia.org/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86313 (owner: 10Ottomata) [20:38:04] AaronSchulz: paxos seems to be quorum-based [20:38:22] anyway, yeah, doing so in steps sounds fine [20:38:27] ok, that's more in line with what I remember [20:39:26] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: No successful Puppet run in the last 10 hours [20:39:43] AaronSchulz: more detail in https://github.com/apache/cassandra/blob/cassandra-2.0.0-beta1/src/java/org/apache/cassandra/service/StorageProxy.java#L204 [20:39:59] (03CR) 10Ryan Lane: [C: 032] Add Gdash module; remove Gdash source tree from Puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/86312 (owner: 10Ori.livneh) [20:40:20] ori-l: I +2'd, but I'll wait till you are around for merge [20:41:26] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: No successful Puppet run in the last 10 hours [20:42:26] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [20:45:26] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [20:48:26] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: No successful Puppet run in the last 10 hours [20:48:26] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: No successful Puppet run in the last 10 hours [20:48:33] (03CR) 10Ottomata: "Am I correct in adding the *.svc.ulsfo.wmnet entries? I was pretty unsure about." 
[operations/dns] - 10https://gerrit.wikimedia.org/r/86272 (owner: 10Ottomata) [20:51:26] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [20:53:26] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours [20:53:26] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: No successful Puppet run in the last 10 hours [20:54:26] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: No successful Puppet run in the last 10 hours [20:55:26] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [20:55:29] (03PS1) 10ArielGlenn: wgNamespaceProtection NS_MODULE not recognized so putting in the numbers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86316 [20:56:38] (03CR) 10Reedy: [C: 032] wgNamespaceProtection NS_MODULE not recognized so putting in the numbers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86316 (owner: 10ArielGlenn) [20:56:53] (03Merged) 10jenkins-bot: wgNamespaceProtection NS_MODULE not recognized so putting in the numbers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86316 (owner: 10ArielGlenn) [20:57:26] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: No successful Puppet run in the last 10 hours [20:57:34] brb, migrating to coffee shop [20:57:57] !log ariel synchronized wmf-config/InitialiseSettings.php 'fix up ns refs for ukwikimedia' [20:58:08] Logged the message, Master [20:59:26] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: No successful Puppet run in the last 10 hours [21:00:26] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:01:36] Reedy: Select for enwiki tells me that there are 215,968 pages to purge from 2013-09-22T00:00:00Z to 2013-09-26T00:00:00Z. Does that sound reasonable? [21:02:26] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: No successful Puppet run in the last 10 hours [21:02:26] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [21:03:57] Ryan_Lane: around now [21:04:06] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 21:04:02 UTC 2013 [21:04:43] bd808: No idea... Make sure you only purge each page once [21:05:06] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [21:05:51] Reedy: I'm getting the list now. Will need further advice/instruction on how and where to actually run the purgeList.php script [21:08:24] (03PS1) 10Odder: (bug 54680) Set $wgCategoryCollation for the French Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86320 [21:08:42] https://gerrit.wikimedia.org/r/86320 [21:08:46] Oups, sorry. [21:12:18] Reedy: so I had a look at https://bugzilla.wikimedia.org/show_bug.cgi?id=54368 [21:12:58] bd808: On terbium, run mwscript purgeList.php --wiki=enwiki --help [21:13:05] (or --wiki=commonswiki or whatever as appropriate) [21:13:54] RoanKattouw: Thanks. I've never logged into anything other than labs boxes. o_O [21:14:26] terbium is internal so you'll have to log into bast1001 first [21:14:35] Assuming you actually have shell access to the main cluster [21:14:48] RoanKattouw: I'm guessing I don't [21:15:00] If you don't know that you do, you probably don't [21:15:08] In that case get Reedy to run the script for you [21:15:26] And if you're gonna be doing tasks like these more often, get your manager (robla?) 
to sign off on a shell access request [21:15:51] RoanKattouw: good ideas all. Mostly I'm trying to help :) [21:20:17] (03CR) 10Bartosz Dziewoński: [C: 031] (bug 54680) Set $wgCategoryCollation for the French Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86320 (owner: 10Odder) [21:25:37] !log reedy synchronized php-1.22wmf19/extensions/WikimediaMessages/ [21:25:49] Logged the message, Master [21:33:49] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 21:33:41 UTC 2013 [21:33:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [21:34:10] (03PS1) 10Ori.livneh: Update NavigationTiming StatsD reporter to schema rev. 5832704 [operations/puppet] - 10https://gerrit.wikimedia.org/r/86322 [21:35:01] Ryan_Lane: any chance you can merge the gdash patch and this one ^^ ? I got the OK to deploy the concomitant JS change [21:35:20] oh man, you told an ops person? [21:35:28] last time I let you deploy on a Friday [21:36:35] :P [21:41:01] Reedy: I have 2 text files on tools-login that have urls of pages in enwiki and dewiki that are likely in need of purge. [21:41:11] Reedy: seeing advice on what to do next [21:41:37] run for the hills normally works [21:41:57] p858snake|l: easy. I'm in the hills already. [21:43:26] * bd808 stays put [21:43:46] (03CR) 10Hashar: [C: 04-1] "I am not sure we need a class to install maven, lets just add the maven2 package to the long array of packages? Ie next to ant (a java bu" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86136 (owner: 10Andrew Bogott) [21:45:00] greg-g: who should I bother besides Sam? World clock tells me he should be sleeping. [21:45:02] I think those files can just be cat and piped to purgeList [21:45:22] bd808: Reedy is not a normal sleeper ;) [21:45:23] Yeah, works with stdin [21:45:47] basically, assume Reedy is on SF-time [21:45:52] ish [21:45:55] Reedy: I don't think I have access to anywhere that I can run that against prod [21:46:16] To my knowledge I only have labs shell access [21:46:27] Hmm [21:46:34] What's the easiest way for me to grab them.. [21:46:47] I can add you to the tool [21:47:04] Or I can pastebin [21:47:22] Probably a bit big for pastebin [21:50:15] probably yeah [21:50:15] transfering files between computers, still a hard problem [21:51:14] I just don't have anywhere outside my house to land them [21:51:14] Probably easiest to add me to the tool [21:51:14] <^d> e-mail! [21:51:21] Reedy: what's your username on toolslab? [21:51:38] reedy [21:51:39] hmmm… ' No results match "reedy"' [21:51:53] who runs an anonymous ftp server? send details to bd808 [21:51:58] lol [21:52:08] dropbox! [21:52:12] seems overkill [21:52:14] [21:52:17] err [21:52:58] Try it as Reedy [21:53:14] easiest would be to publish this file in a apache-accessible place, no? [21:53:49] Ryan_Lane: ping when you're back, plz. [21:54:55] Reedy: no joy there either. Hang on I'm pushing them to a host I control that has network access. [21:59:48] <^d> bd808, Reedy, greg-g: We'll make a new git repo with full push access, call it dropbox. [21:59:56] Reedy: Did you get URLs via PM? [21:59:57] <^d> :) [22:01:22] * greg-g commits every linus install iso he has into said repo [22:01:26] heh, linux [22:01:55] * bd808 fills it with pictures of dogs and cheeseburgers [22:02:37] * ^d rewrites history to get rid of silly commits ;-) [22:02:46] reedy@tin:/tmp$ cat dewiki-misses.txt | mwscript purgeList.php dewiki [22:02:46] Purging 34447 urls [22:02:46] Done! 
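One detail worth calling out about the purge runs above: purgeList.php purges exactly the URLs it is fed, and, as bd808 discovers at the end of this log, entries without the /wiki/ path segment will not match the canonical URLs varnish has cached. A sketch of building the file in the canonical form used in the earlier echo example; the titles file name and the bare prefixing are illustrative, and real titles would also need URL-encoding:

    sed 's|^|http://de.wikipedia.org/wiki/|' dewiki-titles.txt > dewiki-misses.txt
    cat dewiki-misses.txt | mwscript purgeList.php dewiki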
[22:02:49] gotta pull it first!
[22:03:01] w00t
[22:03:03] reedy@tin:/tmp$ cat enwiki-misses.txt | mwscript purgeList.php enwiki
[22:03:03] Purging 146760 urls
[22:03:03] Done!
[22:03:09] eek, that's a lot
[22:03:22] <^d> That's over 9000 urls!
[22:03:59] It was all edits to namespace 0 from 2013-09-22T00:00:00Z to 2013-09-26T00:00:00Z
[22:04:11] there's more out there but this should help
[22:04:24] eg other namespaces, other wikis
[22:06:35] Reedy: Thanks much
[22:08:02] ori-l: I've added a diagram to https://wikitech.wikimedia.org/wiki/Sartoris/Design hopefully it makes things a little clearer
[22:08:21] ooo, purty
[22:08:24] Ryan_Lane: nice -- what did you use to draw it?
[22:08:30] omnigraffle
[22:08:37] <3 omnigraffle
[22:08:57] +1 for omnigraffle
[22:08:59] i think this helps, yeah
[22:09:05] the only downside is that it's proprietary
[22:09:16] happy 30th birthday, GNU
[22:09:20] :D
[22:09:22] * greg-g times things well
[22:09:26] ;)
[22:10:11] Ryan_Lane: if i read it now something important will fall out of my head, but i'll read it carefully this weekend
[22:10:28] yep. no worries. today is a good day for documentation, so I thought I'd do that :)
[22:10:38] <^d> greg-g: Shouldn't that be an indication you need a weekend project writing gomnigraffle?
[22:10:43] any chance you could merge the gdash change & 86322?
[22:10:55] yep, if you'd like
[22:11:16] ^d: my weekend project is writing a human ;)
[22:11:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[22:11:26] Ryan_Lane: yes plz
[22:11:29] one down
[22:11:34] <^d> greg-g: Yes, but is the human licensed under GPLv2?
[22:11:54] ^d: AGPLv3, of course
[22:11:59] <^d> Haha
[22:12:19] (03CR) 10Ryan Lane: [C: 032] Update NavigationTiming StatsD reporter to schema rev. 5832704 [operations/puppet] - 10https://gerrit.wikimedia.org/r/86322 (owner: 10Ori.livneh)
[22:12:35] ori-l: done
[22:12:50] weee, thanks!
[22:12:57] I wish there was an open source diagram app as good as omnigraffle
[22:13:04] I'd use it in a heartbeat
[22:13:18] I've tried a number of open source ones and they make the ugliest diagrams ever
[22:13:25] LibreOffice Draw! ... oh wait... :(
[22:13:38] <^d> Ryan_Lane: I would suggest Dia, but I hate it.
[22:13:44] <^d> So I can't in good faith suggest it.
[22:14:58] I'm opening myself up to mockery here, but I think that using good closed tools beats using almost-works-most-of-the-time free tools every day of the week
[22:15:32] I'd love to be able to tweak everything but I don't always have time to do that
[22:15:32] * ^d hides from the oncoming horde who will eat bd808 alive
[22:16:33] I'm proprietary software friendly and I don't care who knows it
[22:16:40] * bd808 hides from marktraceur
[22:16:58] * marktraceur gets nerf guns and plane tickets to Boise
[22:17:17] * bd808 changes the linens in the guest room
[22:17:48] I use yed for diagrams now, since dia started to suck (my gtk setup on windows is possibly broken). and Balsamiq for mockups (not os)
[22:18:32] oh, so one of those "pragmatic" people, eh?
[22:18:46] we have a name for you kinds of people around these here parts
[22:19:19] I forget what it is and I saved it in a FLOSS note taking program, but it changed file formats recently and I lost all my history...
[22:19:47] <^d> It could be worse. We could all be using Rational Rose.
[22:20:01] <^d> Because *that's* a fun tool.
[22:22:11] * bd808 chases ^d with a stick for mentioning RUP
[22:22:42] Model driven money extraction. Fun for consultants the world over
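One plausible way to build the list bd808 describes (all namespace-0 pages edited in that window) on a labs replica; this is not necessarily the query from his own write-up linked further down, and the `sql` helper name, replica schema, and URL prefix are assumptions. Titles with special characters would still need percent-encoding.

    # List canonical URLs for NS0 pages edited between the two timestamps,
    # querying a labs replica of enwiki (helper name and schema assumed)
    sql enwiki_p <<'EOF' | grep '^https://' | sort -u > enwiki-misses.txt
    SELECT DISTINCT CONCAT('https://en.wikipedia.org/wiki/', page_title)
    FROM revision
    JOIN page ON rev_page = page_id
    WHERE page_namespace = 0
      AND rev_timestamp BETWEEN '20130922000000' AND '20130926000000';
    EOF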
[22:23:04] <^d> :)
[22:23:39] I have no idea what you're talking about, and I think I'm happier for it.
[22:23:43] <^d> bd808: I had a whole class centered on Rose and that RUP crap.
[22:23:50] <^d> It was...awful...to say the least.
[22:24:02] <^d> Only time I've ever felt bad for the *computer* for having to run software.
[22:24:16] <^d> I could feel the computer's pain. "WHY ARE YOU RUNNING THIS ON ME? IT BURNS!"
[22:24:27] I've tried to rehabilitate former RUP users, but never been successful. It ruins good minds.
[22:24:54] greg-g: https://en.wikipedia.org/wiki/IBM_Rational_Unified_Process
[22:25:29] <^d> bd808: It basically went against everything I knew to be a Good Thing. So I purposefully unlearned it.
[22:25:41] RUP is waterfall on steroids with code generation as the end result.
[22:25:54] <^d> Now it's kind of a maxim. WWRUPD? If you find yourself doing what RUP would do, do the opposite.
[22:26:13] * bd808 LOL'd for reals
[22:27:17] <^d> Waterfall on steroids is an understatement. It's freaking Niagara Falls, but frozen over so you don't actually progress.
[22:27:21] For those who are curious, here's my write up of making those files Reedy ran: https://www.mediawiki.org/wiki/User:BDavis_(WMF)/Notes/Finding_Files_To_Purge
[22:27:31] bd808: tl;dr
[22:28:35] (03CR) 10Lcarr: "honestly i'm unsure if we need the internal .svc addresses as well." [operations/dns] - 10https://gerrit.wikimedia.org/r/86272 (owner: 10Ottomata)
[22:28:42] My question on bz:54647 now is "what next"
[22:29:03] bd808: d..d... documentation?
[22:29:03] Reedy: there are updates to TorBlock and WikimediaMessages queued, should I sync them?
[22:29:11] I'm about to sync a NavigationTiming change
[22:29:13] Uhh
[22:29:17] I thought I already had...
[22:31:28] greg-g: document what? A process for recovering from a major HTCP outage?
[22:31:40] Reedy: yes; I misread git's output. Sorry.
[22:32:06] bd808: that's what you did, and I was impressed :)
[22:32:54] I can't remember things. That's what the wiki is for!
[22:34:02] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 22:33:53 UTC 2013
[22:34:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[22:35:28] !log olivneh synchronized php-1.22wmf18/extensions/NavigationTiming 'Updating NavigationTiming: schema update 5336845 -> 5832704 (1/2)'
[22:35:41] Logged the message, Master
[22:35:45] !log olivneh synchronized php-1.22wmf19/extensions/NavigationTiming 'Updating NavigationTiming: schema update 5336845 -> 5832704 (2/2)'
[22:35:54] Logged the message, Master
[22:39:09] (03PS1) 10Ori.livneh: Disable broken Varnish monitoring for professor graphite host [operations/puppet] - 10https://gerrit.wikimedia.org/r/86329
[22:46:09] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: No successful Puppet run in the last 10 hours
[22:57:23] (03PS1) 10Ori.livneh: Fix-ups for Gdash module [operations/puppet] - 10https://gerrit.wikimedia.org/r/86330
[22:57:24] (03PS1) 10Ori.livneh: Re-introduce 'rendering' metric to navtiming.py [operations/puppet] - 10https://gerrit.wikimedia.org/r/86331
[22:57:50] Ryan_Lane: ^ fix-ups, already tested on target. And that's it.™
[23:05:42] Reedy: I think the data files I made had a bug. They don't have /wiki/ in the url.
[23:06:20] Reedy: so the purges wouldn't match varnish canonical urls I don't think
[23:09:17] uh oh
[23:09:53] It wouldn't hurt anything but it wouldn't fix anything either
[23:10:29] I've got new files. Pushing them to my transfer server now
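A hedged sketch of repairing the first batch of files, assuming they contained URLs of the form https://en.wikipedia.org/Page_Title rather than the canonical /wiki/ form; the sed expression and file names are illustrative, and sort -u covers Reedy's earlier "purge each page once" advice.

    # Insert the missing /wiki/ path segment and drop duplicates before
    # re-running the purge (assumes bare https://host/Title style URLs)
    sed 's#^\(https://[^/]*\)/#\1/wiki/#' enwiki-misses.txt | sort -u > enwiki-misses-fixed.txt
    cat enwiki-misses-fixed.txt | mwscript purgeList.php enwiki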
[23:10:36] * greg-g nods
[23:11:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[23:14:56] Done
[23:15:49] * bd808 needs to find a transfer server inside the network
[23:16:09] Any server I SSH into I can get the files...
[23:17:16] I don't know why I couldn't find you in the wikitech management interface
[23:33:53] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Fri Sep 27 23:33:46 UTC 2013
[23:34:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[23:47:17] Ryan_Lane: about?
[23:55:58] !log Manually synchronized changes 86329, 86330, 86331 to hafnium & professor & temporarily disabled puppet to prevent it from clobbering the changes. I'll re-enable once they're merged.
[23:56:10] Logged the message, Master
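The "temporarily disabled puppet" step from the final !log, written out as commands; puppet agent --disable/--enable is standard Puppet CLI, but running it via sudo on those particular hosts is an assumption.

    # On each affected host (hafnium, professor): keep the agent from
    # reverting the hand-applied change, then re-enable after the merge
    sudo puppet agent --disable
    # ... apply and verify the manual change ...
    sudo puppet agent --enable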