[00:03:22] (PS1) GWicke: RESTBase caching: Force clients to revalidate purged end points [puppet] - https://gerrit.wikimedia.org/r/277056
[00:09:10] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail
[00:36:31] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[00:37:41] PROBLEM - Host mc1003 is DOWN: PING CRITICAL - Packet loss = 100%
[00:41:50] RECOVERY - Host mc1003 is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms
[00:50:19] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0]
[01:00:30] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:04:27] (CR) Alex Monk: [C: -1] "Lets figure out use of Let's Encrypt certs. This change is still live in beta for the time being because it needs to have some sort of cer" [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T70387) (owner: Alex Monk)
[02:17:06] Operations, Beta-Cluster-Infrastructure, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2115233 (Krenair) I'm going around a few tasks on this subject trying to merge everything together (this one, T70387, T75919, T...
[02:17:18] Operations, Beta-Cluster-Infrastructure, Patch-For-Review: Beta Cluster no longer listens for HTTPS - https://phabricator.wikimedia.org/T70387#2115235 (Krenair)
[02:17:26] Operations, Beta-Cluster-Infrastructure, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2115236 (Krenair)
[02:30:31] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.16) (duration: 13m 24s)
[02:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:39:06] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Mar 13 02:39:06 UTC 2016 (duration 8m 35s)
[02:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:05] Operations, Mail, OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2115257 (Peachey88)
[03:39:49] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: puppet fail
[03:49:27] (PS1) Alex Monk: varnish: Fix puppet in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/277058
[04:03:43] (CR) Dereckson: "This patch is ready for deployment. Next step is to review https://wikitech.wikimedia.org/wiki/SWAT_deploys and add it to a deployment wi" [mediawiki-config] - https://gerrit.wikimedia.org/r/276934 (https://phabricator.wikimedia.org/T129728) (owner: Pmlineditor)
[04:05:22] (CR) Alex Monk: "Cherry-picked on deployment-puppetmaster, of course. I checked puppet-compiler for a few production cache machines (all upload, I think) a" [puppet] - https://gerrit.wikimedia.org/r/277058 (owner: Alex Monk)
[04:07:11] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:23:55] (CR) Dereckson: "There is a consensus for the configuration requested in T121853, which discuss with names with one another user, and notify of a trivial (" [mediawiki-config] - https://gerrit.wikimedia.org/r/263614 (https://phabricator.wikimedia.org/T121853) (owner: Mdann52)
[05:00:38] Operations, Traffic: Fix puppet on deployment-cache* hosts in beta labs - https://phabricator.wikimedia.org/T129270#2115324 (Krenair) a:BBlack>Krenair
[05:04:15] Operations, Puppet, Beta-Cluster-Infrastructure, Traffic: Fix puppet on deployment-cache* hosts in beta labs - https://phabricator.wikimedia.org/T129270#2115336 (Krenair) I spent about an hour of volunteer time trying to figure out exactly what was going on here, and I put a patch in gerrit for it
[05:04:23] (PS2) Alex Monk: varnish: Fix puppet in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/277058 (https://phabricator.wikimedia.org/T129270)
[05:07:34] (CR) 20after4: [C: 1] varnish: Fix puppet in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/277058 (https://phabricator.wikimedia.org/T129270) (owner: Alex Monk)
[05:27:57] (PS1) Ori.livneh: HHVM: Enable translation cache garbage-collection on canary app servers [puppet] - https://gerrit.wikimedia.org/r/277061
[06:30:40] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:00] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:11] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:00] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:00] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:00] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:21] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:29] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:54:32] (PS2) Ori.livneh: HHVM: Enable translation cache garbage-collection on canary app servers [puppet] - https://gerrit.wikimedia.org/r/277061 (https://phabricator.wikimedia.org/T277061)
[06:56:29] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:57:40] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:57:51] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:29] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:51] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:59] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:23:20] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: puppet fail
[07:24:00] PROBLEM - puppet last run on mw2048 is CRITICAL: CRITICAL: puppet fail
[07:52:00] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[07:52:29] (PS4) Giuseppe Lavagetto: jobqueue_redis: set up encryption and cross-dc replication [puppet] - https://gerrit.wikimedia.org/r/276980 (https://phabricator.wikimedia.org/T124672)
[07:52:41] RECOVERY - puppet last run on mw2048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:11:23] (PS5) Giuseppe Lavagetto: jobqueue_redis: set up encryption and cross-dc replication [puppet] - https://gerrit.wikimedia.org/r/276980 (https://phabricator.wikimedia.org/T124672)
[08:17:43] (CR) Giuseppe Lavagetto: [C: -1] "The patch does the right thing, the puppet compiler shows, but we need to wait that all eqiad redises are upgraded to jessie." [puppet] - https://gerrit.wikimedia.org/r/276980 (https://phabricator.wikimedia.org/T124672) (owner: Giuseppe Lavagetto)
[08:36:01] (PS2) Giuseppe Lavagetto: jobrunner: monitor the HHVM server health [puppet] - https://gerrit.wikimedia.org/r/276710
[08:36:18] (CR) Giuseppe Lavagetto: [C: 2 V: 2] jobrunner: monitor the HHVM server health [puppet] - https://gerrit.wikimedia.org/r/276710 (owner: Giuseppe Lavagetto)
[08:47:23] <_joe_> the check works, cool
[08:47:49] great
[09:19:28] (PS2) Giuseppe Lavagetto: mediawiki::maintenance: add codfw host, multidc support [puppet] - https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987)
[10:16:38] Operations, Wikimedia-General-or-Unknown, WorkType-NewFunctionality: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#2115667 (Nemo_bis)
[10:17:45] Operations, Wikimedia-General-or-Unknown, WorkType-NewFunctionality: Run our own Tor client for Tor block - https://phabricator.wikimedia.org/T32716#2115713 (Nemo_bis)
[10:30:20] Operations, Wikimedia-General-or-Unknown, WorkType-NewFunctionality: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#426408 (Peachey88) >>! In T40860#1726133, @Dzahn wrote: > The people having access to this key will have...
[12:27:24] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: puppet fail
[12:56:55] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[13:08:31] Puppet, Labs: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2116118 (dschwen) Just got another one ``` Received: from root by maps-wma1.maps.eqiad.wmflabs with local (Exim 4.76) (envelope-from ) id 1af...
[13:58:48] Puppet, Labs: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2104571 (valhallasw) The emails are sent when puppet has not run for 24 hours. Specifically, the code checks the 'last_run' parameter in /var/lib/puppet/state/las...
[14:06:44] Puppet, Labs: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2116149 (valhallasw) ``` root@maps-wma1:~# bash -x /usr/local/sbin/puppet-run + set -e + touch /var/log/puppet.log + chmod 600 /var/log/puppet.log ++ puppet agent...
[15:48:25] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: puppet fail
[16:14:35] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[16:38:54] (CR) Giuseppe Lavagetto: [C: -1] "the hiera trick I tried didn't work, per the puppet compiler." [puppet] - https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987) (owner: Giuseppe Lavagetto)
[16:44:56] (PS3) Giuseppe Lavagetto: mediawiki::maintenance: add codfw host, multidc support [puppet] - https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987)
[16:51:28] (PS4) Giuseppe Lavagetto: mediawiki::maintenance: add codfw host, multidc support [puppet] - https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987)
[18:44:34] Puppet, Labs: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2116323 (dschwen) Thanks, I upgraded puppet. Let's see if that makes me compliant again :-)
[18:56:45] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:23:24] Operations, LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2116331 (Krenair)
[19:24:05] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:26:31] (PS5) Giuseppe Lavagetto: mediawiki::maintenance: add codfw host, multidc support [puppet] - https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987)
[19:53:30] Operations, LDAP: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788#2116378 (Krenair)
[20:04:23] Operations, LDAP: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788#2116409 (Krenair) Ugh, found this: https://integration.wikimedia.org/ci/configureSecurity/ wmf gets some special rights there This stuff *needs*...
[20:06:58] Operations, LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2116424 (Krenair)
[21:03:50] (PS1) GWicke: Increase purged entry point s-maxage from 12 to 48 hours [puppet] - https://gerrit.wikimedia.org/r/277112
[21:12:45] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: puppet fail
[21:39:55] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[22:14:40] Operations, LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2116331 (Legoktm) +1, sounds good. I don't see any reason why we wouldn't allow `nda` users to do stuff in jenkins.
[22:47:25] (PS3) Ori.livneh: HHVM: Enable translation cache garbage-collection on canary app servers [puppet] - https://gerrit.wikimedia.org/r/277061 (https://phabricator.wikimedia.org/T277061)
[23:26:16] PROBLEM - RAID on db1053 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)