[00:00:10] gong! [00:00:11] welp [00:00:24] what am i going to do with all of these batteries now [00:00:27] HAPPY NEW YEAR!!! [00:00:36] yay [00:00:41] ori: lick them [00:00:44] we should do this more often [00:00:49] why is everybody still alive? [00:01:11] great team building exercise guys, go collect your ribbon from the "we're awesome" bucket [00:01:47] !log we're still here [00:01:51] Logged the message, Master [00:01:52] <_joe_> greg-g: please [00:02:07] <_joe_> wait until it's time to chant :) [00:02:16] * greg-g had to [00:02:58] rendering may bee in the processing of dying [00:03:22] CRITICAL - Socket timeout after 10 seconds for some render nodes since just after leap [00:03:39] greg-g: ok to deploy now? [00:03:43] nah, that's normal [00:03:49] No greg-g, the end hath come [00:03:50] really? [00:03:55] legoktm: give it another 10 [00:03:55] no [00:03:59] ok [00:03:59] it's not normal [00:04:16] where do you see the alerts? [00:04:29] _joe_: the C* boxes definitely skipped the leap second so far [00:04:46] <_joe_> bblack: rendering? [00:04:53] <_joe_> I've seen a few codfw boxes [00:05:16] well... maybe you're alive, but my VPS isn't :< [00:05:26] they varnish [00:05:29] err varnished [00:05:39] <_joe_> vanished? [00:05:54] the 1/3 soft-fails in rendering icinga [00:06:12] that's a great new verb, 'varnished' [00:06:14] <_joe_> yeah, you wrote "varnish" instead of "vanish"? [00:06:20] <_joe_> twice in a row [00:06:22] the start time on them was shortly after leap, but it could be that they're constantly intermittently in that state [00:06:26] <_joe_> you need to stop working on it [00:06:39] * gwicke imagines what it means [00:06:39] yeah I can't not type that word once "va" starts [00:06:58] there, you have been varnished! 
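Context for the leap-second chatter above: going by the 2015-07-01 dates logged later in this channel, the leap second in question was inserted as 2015-06-30 23:59:60 UTC. POSIX-style timestamps have no slot for that second, which is one reason hosts can appear to "skip" it. The sketch below only illustrates that mismatch with Python's standard library; it is an assumption-laden illustration, not a description of how any host mentioned here keeps time, and the helper name `leap_second_representable` is made up for the example.

```python
from datetime import datetime, timezone

# Last ordinary second of June 2015 and first second of July 2015 (UTC).
before = datetime(2015, 6, 30, 23, 59, 59, tzinfo=timezone.utc)
after = datetime(2015, 7, 1, 0, 0, 0, tzinfo=timezone.utc)

# POSIX time treats every day as 86400 seconds, so these two instants are
# one tick apart even though the leap second 23:59:60 sat between them.
posix_gap = after.timestamp() - before.timestamp()


def leap_second_representable() -> bool:
    """Hypothetical helper: can datetime hold second=60 at all?"""
    try:
        datetime(2015, 6, 30, 23, 59, 60, tzinfo=timezone.utc)
    except ValueError:
        return False
    return True


print(posix_gap, leap_second_representable())
```

Because the extra second cannot be expressed in this representation, systems have to step, repeat, or slew the clock around it.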
[00:07:09] https://en.m.wiktionary.org/wiki/varnished <- it is a word :P [00:07:31] haha [00:08:00] different meaning in this context, though [00:08:15] less brush & more bit mangling [00:08:48] <_joe_> bblack: I see a network peak in rendering [00:08:55] <_joe_> like we had a burst of imagescaling [00:09:04] <_joe_> around midnight UTC [00:09:13] <_joe_> so that may explain what you saw [00:10:55] ganglia's down! [00:11:02] oh wait, that's just how long it always takes to load :P [00:11:57] i was just debating whether icinga slowness is because we all have it open, or because it has to render a thousand-row table :P [00:12:10] both [00:12:33] analytics cluster in eqiad is red too, but what's new there? [00:12:37] love how column sorting is a server side operation [00:12:59] yeah that's normal for analytics [00:13:32] So how long until we declare that everything is fine, log off, and then rush back when our pagers start beeping? [00:13:38] (03CR) 10CSteipp: Log privileged users with short passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [00:13:49] right about now [00:14:06] * greg-g waves [00:14:08] but the pagers won't go off because of leap seconds. they'll go off because people start making changes again [00:14:20] change is the #1 cause of outages :P [00:14:26] <_joe_> yes [00:14:36] <_joe_> change and hardware failures I'd say [00:14:48] hardware failures are a very distant second [00:15:03] <_joe_> so you change in order to prevent hardware failures to cause an outage, and you cause more actual outages :) [00:15:24] <_joe_> bblack: I always said I am paid to say no to change [00:15:31] _joe_: especially those changes from 'up' to 'down' are highly correlated with failure ;) [00:15:58] freeze all gerrit commits for 3 days and watch how stable we are :P [00:16:06] I suspect that if we fired all of our operations and development staff, the site would be very stable! 
And then, all of a sudden, no longer stable. [00:16:12] (and all user logins to production shells too) [00:17:56] greg-g: now? [00:18:58] we should change to static HTML, users can then edit the wikis by uploading Gerrit changes to HTML.. fixed [00:20:08] so now we're left with on the difficult to track bugs related to leap second [00:22:29] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1416235 (10Seb35) With HTTPS mandatory for InstantCommons, the php5-curl package / curl PHP extension becomes mandatory to us... [00:22:44] mutante: then we save all the funding as we don't need most of engineering! [00:23:45] !log starting rolling restart of cassandra nodes to apply new config [00:23:48] Logged the message, Master [00:23:54] (03CR) 10BryanDavis: Log privileged users with short passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [00:26:51] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1416249 (10Jdlrobson) Looks like the patch got merged :) Can we get this... 
[00:26:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [00:28:12] (03CR) 10Alex Monk: Log privileged users with short passwords (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [00:33:34] !log rolling cassandra restart done [00:33:38] Logged the message, Master [00:34:52] !log pooled mw1152 (HHVM rendering) at weight 10 for testing [00:34:55] Logged the message, Master [00:38:20] (03CR) 10Ori.livneh: "Is the added overhead of logging (under load) enough to make timing attacks possible?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [00:40:50] (03PS1) 10Ori.livneh: Double $wgMaxShellMemory on HHVM scalers (512 Mb => 1024 Mb) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222044 [00:41:12] (03CR) 10Ori.livneh: [C: 032] Double $wgMaxShellMemory on HHVM scalers (512 Mb => 1024 Mb) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222044 (owner: 10Ori.livneh) [00:41:18] (03Merged) 10jenkins-bot: Double $wgMaxShellMemory on HHVM scalers (512 Mb => 1024 Mb) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222044 (owner: 10Ori.livneh) [00:42:45] !log ori Synchronized wmf-config/CommonSettings.php: I9a8018981: Double $wgMaxShellMemory on HHVM scalers (512 Mb => 1024 Mb) (duration: 00m 12s) [00:42:49] Logged the message, Master [00:45:37] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [00:51:48] (03PS1) 10Dzahn: use https link to fetch stats for wmf projects [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222045 (https://phabricator.wikimedia.org/T104367) [00:53:48] PROBLEM - Apache HTTP on mw1152 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.021 second response time [00:56:19] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 440 bytes in 0.088 second response time [00:57:39] mw1152 is me [01:03:13] !log Disabling Puppet on mw1152 for 12h to hack apache config to log locally [01:03:18] Logged the message, Master [01:03:48] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:24:09] (03PS1) 10Dzahn: add config option to use https-only for certain projects [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222048 (https://phabricator.wikimedia.org/T104367) [01:25:05] (03CR) 10Dzahn: [C: 032] add config option to use https-only for certain projects [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222048 (https://phabricator.wikimedia.org/T104367) (owner: 10Dzahn) [01:25:13] (03Merged) 10jenkins-bot: add config option to use https-only for certain projects [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222048 (https://phabricator.wikimedia.org/T104367) (owner: 10Dzahn) [01:25:26] (03Abandoned) 10Dzahn: use https link to fetch stats for wmf projects [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222045 (https://phabricator.wikimedia.org/T104367) (owner: 10Dzahn) [01:37:51] !log Depooled mw1152. Req error dashboard shows elevated 5xx rates correlating with the server getting pooled, but the logs don't appear to corroborate it. Odd. [01:37:56] Logged the message, Master [01:40:06] (03PS5) 10EBernhardson: Patch to uniqify filename of eval()'d code [debs/hhvm] - 10https://gerrit.wikimedia.org/r/219125 (https://phabricator.wikimedia.org/T102937) [01:41:06] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1416320 (10EBernhardson) I've submitted a patch (previously, just forgot... 
[01:41:09] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 60 failures [01:56:08] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [02:12:28] !log upgrade db1034 trusty [02:12:33] Logged the message, Master [02:13:44] (03PS1) 10Dzahn: fix wrong URL structure for https_only wikis [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222049 (https://phabricator.wikimedia.org/T104411) [02:14:37] (03CR) 10Dzahn: [C: 032] fix wrong URL structure for https_only wikis [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222049 (https://phabricator.wikimedia.org/T104411) (owner: 10Dzahn) [02:14:45] (03Merged) 10jenkins-bot: fix wrong URL structure for https_only wikis [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222049 (https://phabricator.wikimedia.org/T104411) (owner: 10Dzahn) [02:19:50] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [02:23:19] !log l10nupdate Synchronized php-1.26wmf11/cache/l10n: (no message) (duration: 06m 50s) [02:23:25] Logged the message, Master [02:23:56] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1416357 (10Dzahn) that said, "misc naming style" would mean to use element names in eqiad and star names in codfw :) [02:26:55] !log LocalisationUpdate completed (1.26wmf11) at 2015-07-01 02:26:55+00:00 [02:26:59] Logged the message, Master [02:33:37] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 762.763812578 [02:40:20] (03PS1) 10Springle: upgrade db1034 [puppet] - 10https://gerrit.wikimedia.org/r/222053 [02:42:43] (03CR) 10Springle: [C: 032] upgrade db1034 [puppet] - 10https://gerrit.wikimedia.org/r/222053 (owner: 10Springle) [02:53:29] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 10m 12s) [02:53:39] Logged the message, Master 
[03:00:22] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-01 03:00:21+00:00 [03:00:26] Logged the message, Master [03:07:44] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [03:08:37] oh [03:10:37] springle: fixing it [03:10:43] !log git pull on strontium [03:10:48] Logged the message, Master [03:10:54] manifests/site.pp | 4 ++-- [03:10:58] tnx [03:11:32] icinga-wm: please [03:12:54] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [03:15:42] (03CR) 10Dzahn: [C: 031] "has approval on ticket, 3 days seem over, go !?" [puppet] - 10https://gerrit.wikimedia.org/r/220989 (owner: 10Matanya) [03:15:57] (03PS3) 10Dzahn: access: grant niedzielski access to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/220989 (owner: 10Matanya) [03:20:36] (03CR) 10Dzahn: [C: 032] "approval was added last Thursday - therefore waiting period over" [puppet] - 10https://gerrit.wikimedia.org/r/220989 (owner: 10Matanya) [03:24:53] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: stat1002 access requested for sniedzielski - https://phabricator.wikimedia.org/T103871#1416402 (10Dzahn) approval was here, waiting period was over, so i merged it. ---- [root@stat1002 03:24 /root] # id niedzielski uid=11833(niedzielski) gid=500(wikide... [03:25:12] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: stat1002 access requested for sniedzielski - https://phabricator.wikimedia.org/T103871#1416403 (10Dzahn) 5Open>3Resolved [03:25:20] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: stat1002 access requested for sniedzielski - https://phabricator.wikimedia.org/T103871#1416404 (10Niedzielski) Thanks! I'm in! 
[03:46:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 6 below the confidence bounds [04:07:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 6 below the confidence bounds [04:37:55] 6operations, 6Analytics-Engineering, 7network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1416460 (10BBlack) [04:41:40] !log krinkle Synchronized php-1.26wmf12/includes/resourceloader/ResourceLoader.php: Iee884208c5c4b minify cache key (duration: 00m 11s) [04:41:45] Logged the message, Master [04:57:13] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1416475 (10BBlack) [04:57:15] 6operations, 6Analytics-Engineering, 7network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1416472 (10BBlack) 5Open>3Resolved a:3BBlack logstash1*.eqiad.wmnet appear to be in the normal private vlans rather than the a... [05:09:45] (03PS2) 10BBlack: tlsproxy: enable DHE-2048 FS for Android 2.x, etc. [puppet] - 10https://gerrit.wikimedia.org/r/222023 (https://phabricator.wikimedia.org/T104281) [05:09:47] (03PS2) 10BBlack: ciphersuites: refactor further, add compat-dhe option [puppet] - 10https://gerrit.wikimedia.org/r/222022 [05:09:56] (03PS1) 10CSteipp: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) [05:26:26] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [05:27:53] !log deployed patch for T103765 [05:27:57] Logged the message, Master [05:28:05] (03PS3) 10BBlack: tlsproxy: enable DHE-2048 FS for Android 2.x, etc. 
[puppet] - 10https://gerrit.wikimedia.org/r/222023 (https://phabricator.wikimedia.org/T104281) [05:28:07] (03PS3) 10BBlack: ciphersuites: refactor further, add compat-dhe option [puppet] - 10https://gerrit.wikimedia.org/r/222022 [05:28:38] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 1 05:28:38 UTC 2015 (duration 28m 37s) [05:28:42] Logged the message, Master [05:32:00] (03CR) 10Glaisher: Log privileged users with short passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [06:02:02] (03CR) 10CSteipp: "For timing attacks, since we're only logging successes, I'm assuming you mean that an attacker could watch for admins with slightly longer" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [06:23:05] 6operations, 7Graphite: Insecure XHR for 'http://tessera.wikimedia.org/api/preferences/' has been blocked - https://phabricator.wikimedia.org/T104424#1416557 (10Krinkle) 3NEW [06:32:34] PROBLEM - puppet last run on cp2014 is CRITICAL Puppet has 1 failures [06:32:34] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 1 failures [06:34:15] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 1 failures [06:35:25] PROBLEM - puppet last run on db1028 is CRITICAL Puppet has 1 failures [06:36:34] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures [06:36:35] PROBLEM - puppet last run on db2036 is CRITICAL Puppet has 1 failures [06:37:14] PROBLEM - puppet last run on labcontrol2001 is CRITICAL Puppet has 1 failures [06:37:35] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures [06:37:46] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures [06:38:05] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures [06:38:06] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures 
[06:38:06] RECOVERY - puppet last run on db1028 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:39:05] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [06:39:14] RECOVERY - puppet last run on db2036 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:47:08] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:47:35] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [06:47:56] RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:56] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:25] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:49:25] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:50:35] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:45] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:05] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:05] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:53:19] Do we no longer use per-domain certificates via SNI, and just use the unified alt name cert for everything? 
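To make the question above concrete: with a unified certificate, a single list of subject alternative names has to cover every hostname served, instead of one certificate being selected per domain via SNI. The following is a minimal sketch of that coverage check, assuming simplified wildcard rules (a `*` only as the whole left-most label, matching exactly one label). The names `name_matches`, `cert_covers` and `EXAMPLE_SANS` are hypothetical, the SAN list is invented, and none of this is taken from the actual production configuration or from MediaWiki's code.

```python
from typing import Iterable


def name_matches(pattern: str, hostname: str) -> bool:
    """Simplified match of one certificate name against a hostname.

    A "*" is honoured only as the entire left-most label and covers
    exactly one label, so "*.example.org" matches "en.example.org" but
    neither "example.org" nor "a.b.example.org".
    """
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


def cert_covers(san_names: Iterable[str], hostname: str) -> bool:
    """True if any name in the SAN list matches the hostname."""
    return any(name_matches(name, hostname) for name in san_names)


# Hypothetical SAN list, for illustration only.
EXAMPLE_SANS = ["*.example.org", "example.org", "*.example.net"]

for host in ("en.example.org", "example.org", "en.m.example.org"):
    print(host, cert_covers(EXAMPLE_SANS, host))
```

A client that ignores the alternative-name list and checks only the certificate's common name would reject most hostnames served this way.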
[06:56:08] Oh, I just found c02fab71422a490dbdcf [06:56:16] So I guess I answered my own question [07:01:30] <_joe_> bawolff: :) [07:02:23] _joe_: fun fact, Mediawiki's piece of crap HTTPs implementations allegedly does not look at SubjectAltName [07:02:31] but that's hardly the only thing wrong with it [07:02:42] https://phabricator.wikimedia.org/T75199 [07:02:50] <_joe_> bawolff: I didn't even know it had one :) [07:03:05] Well, I mean how it uses php f_open [07:03:09] *fopen [07:03:16] <_joe_> ewww [07:03:35] <_joe_> doesn't curl support TLS correctly in php? [07:03:35] yeah [07:03:46] We try to support the case where curl is not installed [07:04:03] <_joe_> we should not. [07:04:14] <_joe_> honestly, it is a bad idea [07:04:18] well if it makes you feel better, we don't do a very good job [07:05:59] <_joe_> eheh [07:06:20] bug seems to indicate that it might be an easy fix by tweaking the options we set [07:06:23] php sucks [07:07:45] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 8.60599445688e-06 [07:09:24] (03CR) 1020after4: [C: 031] add unibas.ch to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221600 (owner: 10Matanya) [07:18:25] (03PS4) 10BBlack: tlsproxy: enable DHE-2048 FS for Android 2.x, etc. 
[puppet] - 10https://gerrit.wikimedia.org/r/222023 (https://phabricator.wikimedia.org/T104281) [07:18:27] (03PS4) 10BBlack: ciphersuites: refactor further, add compat-dhe option [puppet] - 10https://gerrit.wikimedia.org/r/222022 [07:18:29] (03PS2) 10BBlack: tlsproxy: add 2048-bit dhparam file to nginx [puppet] - 10https://gerrit.wikimedia.org/r/222016 [07:18:31] (03PS2) 10BBlack: tlsproxy: rename protoproxy to tlsproxy globally [puppet] - 10https://gerrit.wikimedia.org/r/222001 [07:18:33] (03PS2) 10BBlack: tlsproxy: fold ssl::beta::common into ssl::beta [puppet] - 10https://gerrit.wikimedia.org/r/222002 [07:18:35] (03PS2) 10BBlack: tlsproxy: move role::tlsproxy::ssl::common to auto-required tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/222003 [07:18:37] (03PS2) 10BBlack: tlsproxy: move sslcert stuff inside of tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/222004 [07:18:39] (03PS2) 10BBlack: tlsproxy: remove unused ganglia/localhost stuff [puppet] - 10https://gerrit.wikimedia.org/r/222005 [07:18:41] (03PS2) 10BBlack: tlsproxy: rename beta-only things to betassl for clarity [puppet] - 10https://gerrit.wikimedia.org/r/222006 [07:18:43] (03PS2) 10BBlack: tlsproxy: remove remaining ipv6 hacks from beta [puppet] - 10https://gerrit.wikimedia.org/r/222007 [07:18:45] (03PS2) 10BBlack: tlsproxy: gut esams cases from beta-only template [puppet] - 10https://gerrit.wikimedia.org/r/222008 [07:18:47] (03PS2) 10BBlack: tlsproxy: move template into module (only user) [puppet] - 10https://gerrit.wikimedia.org/r/222009 [07:18:49] (03PS2) 10BBlack: tlsproxy: move logrotate into module (only user) [puppet] - 10https://gerrit.wikimedia.org/r/222010 [07:18:51] (03PS2) 10BBlack: tlsproxy: remove pointless use_ssl + jessie conditionals [puppet] - 10https://gerrit.wikimedia.org/r/222011 [07:18:53] (03PS2) 10BBlack: tlsproxy: remove dead udplog comments [puppet] - 10https://gerrit.wikimedia.org/r/222012 [07:18:55] (03PS1) 10BBlack: tlsproxy: kill $ssl_protos 
var [puppet] - 10https://gerrit.wikimedia.org/r/222065 [07:18:58] (03PS1) 10BBlack: sslcert: add sslcert::std_cert for easier arrays [puppet] - 10https://gerrit.wikimedia.org/r/222066 [07:18:59] (03PS1) 10BBlack: tlsproxy: multi-cert support, including ocsp [puppet] - 10https://gerrit.wikimedia.org/r/222067 [07:20:45] <_joe_> bblack: a few commits, right/ [07:22:37] wtf, wow [07:22:52] it's mostly tiny commits, I just like to split up refactoring work to make it easy to follow [07:23:02] yeah gerrit sucks that way [07:23:14] when you depend on reviews it encourages you to construct mega-commits [07:23:24] github PRs are nicer [07:23:32] oh I still depend on reviews. poor faidon will have to review most of it lol [07:23:36] (he does this to me too) [07:24:00] heh [07:25:15] the old protoproxy was a mess because it had a convoluted history from being used for other things [07:25:33] I like that with github, you merge a single logical patch, but you review it in small bits anyway [07:25:40] * valhallasw`cloud is wondering how this will work with Differential [07:25:46] in practice, the only thing it's doing succesfully today is the standard cache-cluster nginx TLS proxying, so this refactors it all around that and gets rid of cruft. [07:26:29] (which seemed prudent before the last few commits in that chain, which add some new complexities and such) [07:29:00] (valhallasw`cloud: yes, that) [07:30:05] (03CR) 10Merlijn van Deen: [C: 04-1] "Conceptually, I'd say nginx.conf.erb belongs to the nginx module, as it doesn't seem to contain anything specific to the dynamicproxy?" [puppet] - 10https://gerrit.wikimedia.org/r/222009 (owner: 10BBlack) [07:30:13] bblack: ^ I'm confused :-p [07:31:24] valhallasw`cloud: well, dynamicproxy isn't a part of this at all. But in general, our /templates/nginx/nginx.conf.erb was a file that was only ever being used by modules/protoproxy (now modules/tlsproxy) and nothing else. 
[07:31:43] so I moved it into the module, to avoid others using it in the future, so I can keep hacking on it only in the context of that module. [07:32:09] other users of e.g. modules/nginx have their own separate templates [07:32:16] ah! ok, that makes sense [07:33:42] (03CR) 10Merlijn van Deen: [C: 031] "...that's because I misread. It /is/ conceptually part of tlsproxy." [puppet] - 10https://gerrit.wikimedia.org/r/222009 (owner: 10BBlack) [07:34:40] the whole reason that thing was call "protoproxy" was it used to do a combination job of proxying HTTPS->HTTP *and* IPv6->IPv4 heh. The IPv6 part fell out of use a long long time ago, but cruft remained. [07:36:05] (03PS2) 10BBlack: tlsproxy: multi-cert support, including ocsp [puppet] - 10https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654) [07:38:52] oh I see the original source of confusion, it auto-tagged some reviewers because I touched dynamicproxy config (to update a comment line about where their local config was originally copied from, since the source moved) [07:48:01] !log restbase restarting cassandra on rb1005 [07:48:10] Logged the message, Master [07:50:49] 6operations, 10MediaWiki-Sites, 10SEO, 5HTTPS-by-default, and 4 others: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#1416602 (10Nemo_bis) [07:59:11] bblack: thanks for the explanation :-) [08:00:08] (03CR) 10Alexandros Kosiaris: [C: 032] "I ran a catalog diff. The only thing expected to change is the motd about the role. Merging. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/221787 (owner: 10John F. Lewis) [08:01:38] (03PS9) 10Alexandros Kosiaris: install-server: rename module to install_server [puppet] - 10https://gerrit.wikimedia.org/r/221787 (owner: 10John F. 
Lewis) [08:02:06] 6operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering, 7HHVM: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#1416628 (10mmodell) @bd808: Note that #releng is working on the next-generation of deployment tooling, and I think we... [08:02:06] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] install-server: rename module to install_server [puppet] - 10https://gerrit.wikimedia.org/r/221787 (owner: 10John F. Lewis) [08:05:57] <_joe_> akosiaris: thanks! [08:06:16] <_joe_> I hate dashed class names and modules [08:06:36] <_joe_> (I'd thank john as well, but he's not here) [08:08:21] 6operations, 10Traffic, 7discovery-system, 5services-tooling: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1416636 (10Joe) [08:08:24] 6operations, 10Traffic, 5Patch-For-Review, 7discovery-system, 5services-tooling: Integrate confd into the varnish configuration to generate the list of active backends - https://phabricator.wikimedia.org/T97975#1416635 (10Joe) 5Open>3Resolved [08:09:06] 6operations, 7HHVM, 7Wikimedia-log-errors: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1416638 (10Joe) 5Open>3Resolved [08:15:14] (03PS1) 10Muehlenhoff: Re-enable NTP on restbase, the "smearing option" (-x) has been configured via salt [puppet] - 10https://gerrit.wikimedia.org/r/222073 [08:21:40] (03CR) 10Alexandros Kosiaris: [C: 031] Re-enable NTP on restbase, the "smearing option" (-x) has been configured via salt [puppet] - 10https://gerrit.wikimedia.org/r/222073 (owner: 10Muehlenhoff) [08:27:04] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [08:31:34] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 
- https://phabricator.wikimedia.org/T102937#1377848 (10Jhernandez) @ebernhardson Awesome thanks! @phuedx I've creat... [08:31:45] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1416673 (10Jhernandez) [08:32:51] (03CR) 10Muehlenhoff: [C: 032 V: 032] Re-enable NTP on restbase, the "smearing option" (-x) has been configured via salt [puppet] - 10https://gerrit.wikimedia.org/r/222073 (owner: 10Muehlenhoff) [08:34:24] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [08:35:42] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1416684 (10Joe) Sorry guys, I still didn't build the package including t... [08:52:06] RECOVERY - NTP on restbase1004 is OK: NTP OK: Offset -0.1747348309 secs [08:52:21] (03CR) 1020after4: "I don't understand why this won't rebase..." 
[puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [08:52:45] RECOVERY - NTP on restbase1002 is OK: NTP OK: Offset -0.157012701 secs [08:53:34] RECOVERY - NTP on restbase1005 is OK: NTP OK: Offset -0.1867246628 secs [08:56:06] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 6.31891554612e-09 [09:00:54] RECOVERY - NTP on restbase1001 is OK: NTP OK: Offset -0.1349617243 secs [09:01:34] RECOVERY - NTP on restbase1003 is OK: NTP OK: Offset -0.1584032774 secs [09:03:25] RECOVERY - NTP on restbase1006 is OK: NTP OK: Offset -0.07200491428 secs [09:09:09] _joe_: good morning :-} [09:09:13] I am going to attempt to fix puppet-compiler02.puppet3-diffs.eqiad.wmflabs [09:09:18] can't ssh to it anymore :-( [09:09:56] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 2.52661619695e-09 [09:12:05] (03PS4) 1020after4: Add Phragile module. [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [09:12:53] (03PS1) 10KartikMistry: Beta: Test Restbase in ContentTranslation [puppet] - 10https://gerrit.wikimedia.org/r/222084 [09:15:52] (03CR) 1020after4: [C: 031] Add Phragile module. [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [09:17:10] (03CR) 1020after4: "I manually rebased this patch which was a nightmare because of some weirdness with the cassandra submodule. 
I have no idea why that caused" [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [09:24:52] (03Abandoned) 1020after4: Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) (owner: 1020after4) [09:26:46] (03CR) 1020after4: [C: 031] Rename all main WikimediaIncubator settings to have a wg prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 (owner: 10Paladox) [09:36:22] <_joe_> hashar: I think we'd better start from scrathc [09:36:43] <_joe_> I don't like the way it works at all [09:36:59] _joe_: we can still bring the instance back though :-} [09:37:02] <_joe_> it's been a mess for a long time, I'll fix it this quarter [09:37:07] <_joe_> hashar: maybe [09:37:22] <_joe_> not now though, I'm pretty involved in something else [09:37:23] <_joe_> :) [09:38:14] _joe_: I will get labs ops to fix the instance [09:41:32] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1416784 (10Bawolff) [09:42:44] <_joe_> hashar: and they will circle back to me [09:42:50] <_joe_> not now means "later today" [09:43:23] _joe_: na I think the instance is broken because it hasn't been migrated to the DNS or some weird ssh issue. They should be able to handle it :} [09:45:16] <_joe_> hashar: nope it's dead since well before that [09:45:26] <_joe_> I think it's mostly unusable since a few months [10:03:34] (03Abandoned) 10Giuseppe Lavagetto: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221643 (owner: 10Giuseppe Lavagetto) [10:11:15] (03CR) 10Alexandros Kosiaris: [C: 031] Allow optional firejail containment for nodejs services. 
[puppet] - 10https://gerrit.wikimedia.org/r/219177 (https://phabricator.wikimedia.org/T101870) (owner: 10Muehlenhoff) [10:22:17] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures [10:23:17] _joe_: the last success build was on June 19th though [10:23:54] PROBLEM - puppet last run on mw2055 is CRITICAL Puppet has 1 failures [10:26:32] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Test Restbase in ContentTranslation [puppet] - 10https://gerrit.wikimedia.org/r/222084 (owner: 10KartikMistry) [10:28:36] <_joe_> hashar: sshing into that server doesn't work [10:28:54] <_joe_> meaning its ssh daemon doesn't respond [10:29:00] <_joe_> so I'd just kill it [10:29:48] _joe_: then we lost the compiler :-( [10:30:22] YuviPanda: seems the instance is deadlocked somehow :-/ no ssh answer there [10:30:41] and I am not sure sure whether we can get a console access on it [10:32:12] <_joe_> hashar: lemme work on it. The compiler was unusable anyways [10:32:21] <_joe_> I have to work on it sooner or later [10:32:29] <_joe_> it seems I'd have to do it sooner :) [10:34:56] (03PS1) 10Giuseppe Lavagetto: varnish: activate dynamic lookup on one esams host [puppet] - 10https://gerrit.wikimedia.org/r/222091 [10:36:16] _joe_: would you like me to kill NFS on it as well? :) [10:38:56] RECOVERY - puppet last run on mw2055 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:40:04] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:41:59] !log restarting Jenkins: upgrading Jenkins gearman plugin from 0.1.1-8-gf2024bd to 0.1.1-9-g08e9c42-change_192429_2 https://phabricator.wikimedia.org/T72597#1416913 [10:42:03] Logged the message, Master [10:44:14] (03PS2) 10Giuseppe Lavagetto: varnish: activate dynamic lookup on one esams host [puppet] - 10https://gerrit.wikimedia.org/r/222091 [10:44:52] _joe_: oh, did you delete the instances in puppet3-diffs already? 
[10:45:18] <_joe_> YuviPanda: no [10:45:28] hmm, it just reported 0 instances there? [10:45:29] * YuviPanda checks [10:45:39] <_joe_> log out and log in again? [10:45:48] https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niproject=puppet3-diffs&niregion=eqiad&format=json [10:45:52] that doesn't lie [10:45:58] puppet3-diffs project has no instances... [10:46:00] let me verify [10:47:23] !log installed patch security updates on 862 hosts [10:47:27] Logged the message, Master [10:48:36] <_joe_> YuviPanda: Special:NovaInstance reports it [10:48:48] _joe_: yeah, I saw that [10:48:52] something strange is going on [10:49:20] my suspicion is that novadmin is somehow not projectadmin anymore [10:50:18] _joe_: yup, that's what happened [10:50:27] and https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niproject=puppet3-diffs&niregion=eqiad&format=json is accurate now [10:50:44] <_joe_> YuviPanda: I'm pretty sure I didn't do that [10:50:51] yeah, not sure how that happened. [10:50:56] let me file a bug. [10:51:48] filed [11:14:38] YuviPanda/_joe_ I am going to rename: https://de.wikipedia.org/wiki/Spezial:Verwaltung_Benutzerkonten-Zusammenf%C3%BChrung/M%28e%29ister_Eiskalt [11:14:54] +20000 edits. policy says I need to notify root [11:17:16] Steinsplitter: I suggest you wait until there's someone more familiar with mediawiki's rename handling around before you do that. [11:17:28] ok [11:17:29] I think it'll flood the job queue but I'm not entirely sure. 
[11:18:36] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [11:20:24] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [11:20:49] did't we put in a usergroup permission to stop it happening… [11:21:45] PROBLEM - puppet last run on mw1052 is CRITICAL puppet fail [11:29:15] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:31:15] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [11:32:29] !log rsync on labstore1002 finished, restarting to see what was skipped + errors [11:32:33] Logged the message, Master [11:32:46] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.002 second response time on port 9042 [11:37:09] (03PS3) 10Yuvipanda: [WIP] labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [11:37:26] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1416982 (10Jhernandez) No worries @joe, thanks. [11:37:42] (03PS1) 10KartikMistry: Revert "Beta: Test Restbase in ContentTranslation" [puppet] - 10https://gerrit.wikimedia.org/r/222096 [11:37:55] akosiaris: now, https://gerrit.wikimedia.org/r/#/c/222096/ :) [11:38:09] akosiaris: I should've add 'not to merge'! 
[11:41:07] PROBLEM - puppet last run on cp4012 is CRITICAL puppet fail [11:41:48] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Beta: Test Restbase in ContentTranslation" [puppet] - 10https://gerrit.wikimedia.org/r/222096 (owner: 10KartikMistry) [11:42:14] kart_: to my defence, it did seem absolutely logical to merge [11:42:42] akosiaris: :) (It wasn't working too) [11:43:12] akosiaris: how about fixing cxserver.yaml at hackathon? [11:43:25] (if you don't have other plan) [11:44:09] kart_: I suppose we can do that. [11:49:14] (03PS4) 10Yuvipanda: [WIP] labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [11:49:17] paravoid: ^ [11:49:26] I think I'm ready to get rid of the WIP tag there [11:49:38] I have tested it, results in the /tmp/tmp folder on labstore1002 [11:49:49] I need to comment it a lot more heavily and wirte a systemd unit file tho [11:49:53] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [11:50:03] hah! i knew it [11:50:53] Hi YuviPanda, sorry to bother you again about the NFS problem for math [11:50:56] (03PS5) 10Yuvipanda: [WIP] labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [11:51:14] physikerwelt: ah! apologies it's taking so long - swamped in the middle of other things :( [11:51:22] physikerwelt: let me read the email again, I'll do the things now [11:51:42] but it turned out that the /data/scratch and /data/project mounts are extremly helpful [11:51:45] /tmp/tmp? 
[11:51:58] that's ok [11:52:09] it would be great to re-enable them [11:52:40] on the other hand is sharing of the home directory nothing we are looking for [11:52:45] paravoid: /tmp/tmp is where I was testing the output - should be same as /etc/exports.d [11:53:02] physikerwelt: yeah, let me enable /data/scratch and /data/project for you now. [11:55:39] (03PS1) 10Yuvipanda: math: Enable /data/project and /data/scratch [puppet] - 10https://gerrit.wikimedia.org/r/222098 [11:55:50] (03CR) 10Yuvipanda: [C: 032 V: 032] math: Enable /data/project and /data/scratch [puppet] - 10https://gerrit.wikimedia.org/r/222098 (owner: 10Yuvipanda) [11:56:16] RECOVERY - puppet last run on cp4012 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:56:34] physikerwelt: that'll enable /data/project and /data/scratch when you run puppet. [11:56:39] physikerwelt: give me an instance name so I can try it out? [11:57:06] mlp [11:57:07] paravoid: I also want to get https://gerrit.wikimedia.org/r/#/c/221856/4 merged asap to avoid rebase hell (I am in it already now!) - should be a noop [11:57:13] or math [11:57:37] YuviPanda: We also have a new math contributor https://gerrit.wikimedia.org/r/#/c/222098/ [11:57:38] sorry [11:57:54] https://wikitech.wikimedia.org/wiki/Shell_Request/Nmeuschke [11:58:35] physikerwelt: shell requests are no longer necessary - if you add them to your project they will automatically get shell [11:59:20] physikerwelt: is the mlp instance a new instance? [11:59:31] if so NFS is still broken for new instances, I'm afraid - that's what I've been working on fixing. [11:59:41] physikerwelt: however, on older instances /data/project should be back. 
[11:59:50] no mlp is a very old instance [12:00:31] (03PS1) 10Alexandros Kosiaris: Add lvs::configuration::service_ips to beta [puppet] - 10https://gerrit.wikimedia.org/r/222099 (https://phabricator.wikimedia.org/T104076) [12:00:54] physikerwelt: hmm, strange [12:01:06] physikerwelt: mind if I reboot it? [12:02:15] no pleas go ahead [12:02:24] I was going to propose that [12:02:44] ok [12:04:11] physikerwelt: yeah, that seems to have worked [12:04:50] great thank you soo much [12:05:44] physikerwelt: yw. I'm copying your old homedirs into /data/project/home so you can recover what you want [12:23:11] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review, 7Pybal: pybal DNS lookup issues causing outage risks - https://phabricator.wikimedia.org/T103921#1417015 (10BBlack) 5Open>3Resolved a:3BBlack Closing this as the patches were merged. Will open a separate ticket for future work re: dnspq k... [12:24:06] RECOVERY - NTP on lvs3003 is OK: NTP OK: Offset -0.002709031105 secs [12:24:43] 6operations: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442#1417023 (10BBlack) 3NEW [12:27:21] (03PS6) 10Yuvipanda: [WIP] labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [12:27:26] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [12:33:12] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review, 7Pybal: pybal DNS lookup issues causing outage risks - https://phabricator.wikimedia.org/T103921#1417034 (10mark) PyBal resolves IPs once at startup for managing the IPVS state. Did we neglect to extend that to IdleConnection? 
[12:33:16] (03PS1) 10Giuseppe Lavagetto: etcd::ssl: do not restart the server upon changes [puppet] - 10https://gerrit.wikimedia.org/r/222102 [12:35:13] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd::ssl: do not restart the server upon changes [puppet] - 10https://gerrit.wikimedia.org/r/222102 (owner: 10Giuseppe Lavagetto) [12:37:28] (03PS7) 10Yuvipanda: labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [12:37:35] (03CR) 10jenkins-bot: [V: 04-1] labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [12:39:00] (03PS1) 10BBlack: Add backports and thirdparty to jessie-wikimedia udebcomponents [puppet] - 10https://gerrit.wikimedia.org/r/222104 [12:39:46] (03CR) 10Faidon Liambotis: [C: 031] Add backports and thirdparty to jessie-wikimedia udebcomponents [puppet] - 10https://gerrit.wikimedia.org/r/222104 (owner: 10BBlack) [12:40:03] 6operations, 7Database, 5WMF-NDA: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#1417045 (10jcrespo) [12:40:13] (03CR) 10BBlack: [C: 032] Add backports and thirdparty to jessie-wikimedia udebcomponents [puppet] - 10https://gerrit.wikimedia.org/r/222104 (owner: 10BBlack) [12:44:56] (03PS2) 10Alexandros Kosiaris: Add lvs::configuration::service_ips to beta [puppet] - 10https://gerrit.wikimedia.org/r/222099 (https://phabricator.wikimedia.org/T104076) [12:46:45] RECOVERY - NTP on virt1003 is OK: NTP OK: Offset 0.0008275508881 secs [12:48:22] (03PS5) 10Yuvipanda: labstore: Simplify (and expand!) projects-config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/221856 [12:48:24] (03PS8) 10Yuvipanda: labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [12:48:38] paravoid: ^ I have removed the WIP tag. do review when you can. 
[12:48:44] I'm off to eat some food, beback [12:48:49] (03PS3) 10Alexandros Kosiaris: Add lvs::configuration::service_ips to beta [puppet] - 10https://gerrit.wikimedia.org/r/222099 (https://phabricator.wikimedia.org/T104076) [12:49:03] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1417074 (10Tau) I have installed the php5-curl now but still Instantcommons isn't working properly. What next? [12:50:58] RECOVERY - NTP on wtp1013 is OK: NTP OK: Offset -0.01958489418 secs [12:54:57] 6operations, 7Database, 5WMF-NDA: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#1417078 (10jcrespo) Issues to fix: * **Partman/install server/autoinstall**: Installation is not fully unattended- it requires to confirm changes when deleting partitions and when creating... [13:05:28] PROBLEM - puppet last run on mw1152 is CRITICAL Puppet last ran 12 hours ago [13:12:58] (03PS4) 10Alexandros Kosiaris: Add lvs::configuration::service_ips to beta [puppet] - 10https://gerrit.wikimedia.org/r/222099 (https://phabricator.wikimedia.org/T104076) [13:13:07] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add lvs::configuration::service_ips to beta [puppet] - 10https://gerrit.wikimedia.org/r/222099 (https://phabricator.wikimedia.org/T104076) (owner: 10Alexandros Kosiaris) [13:15:35] RECOVERY - puppet last run on mw1152 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:48] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1417144 (10BBlack) The necessary packages are now in jessie-wikimedia repo: (openssl-1.0.2c-1 in backports, nginx-1.9.2-1+wmf2 in main). We're not deployin... 
[13:41:45] PROBLEM - puppet last run on ganeti1003 is CRITICAL Puppet has 1 failures [13:54:57] PROBLEM - puppet last run on ms-fe3002 is CRITICAL puppet fail [13:56:55] RECOVERY - puppet last run on ganeti1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:57:30] !log rebooting cp2001 (test kernel update) [13:57:35] Logged the message, Master [14:08:06] (03PS1) 10Giuseppe Lavagetto: varnish: always generate the dynamic directors lists [puppet] - 10https://gerrit.wikimedia.org/r/222117 [14:10:27] RECOVERY - puppet last run on ms-fe3002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:11:19] (03PS3) 10Phuedx: Enable Gather flagging on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221754 (https://phabricator.wikimedia.org/T97704) (owner: 10Jdlrobson) [14:13:39] (03CR) 10Phuedx: [C: 04-1] Enable Gather flagging on labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221754 (https://phabricator.wikimedia.org/T97704) (owner: 10Jdlrobson) [14:14:18] (03PS1) 10Alexandros Kosiaris: Enable ntpd on ulsfo and esams [puppet] - 10https://gerrit.wikimedia.org/r/222118 [14:14:48] (03CR) 10BBlack: [C: 031] varnish: always generate the dynamic directors lists [puppet] - 10https://gerrit.wikimedia.org/r/222117 (owner: 10Giuseppe Lavagetto) [14:14:56] 6operations, 7Graphite, 7HTTPS: Insecure XHR for 'http://tessera.wikimedia.org/api/preferences/' has been blocked - https://phabricator.wikimedia.org/T104424#1417265 (10Krenair) [14:16:54] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1417270 (10jcrespo) [14:18:13] (03CR) 10Muehlenhoff: [C: 031] "LTGM" [puppet] - 10https://gerrit.wikimedia.org/r/222118 (owner: 10Alexandros Kosiaris) [14:20:10] (03CR) 10Alexandros Kosiaris: [C: 032] Enable ntpd on ulsfo and esams [puppet] - 10https://gerrit.wikimedia.org/r/222118 (owner: 10Alexandros Kosiaris) 
[14:24:40] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1417295 (10akosiaris) I am fine with either approach. Wanna propose a name ? [14:41:05] RECOVERY - NTP on cp3044 is OK: NTP OK: Offset -0.002349615097 secs [14:41:15] RECOVERY - NTP on cp3006 is OK: NTP OK: Offset -0.002671599388 secs [14:41:15] RECOVERY - NTP on ms-be3003 is OK: NTP OK: Offset -0.001475811005 secs [14:41:15] RECOVERY - NTP on cp3015 is OK: NTP OK: Offset -0.0005130767822 secs [14:41:15] RECOVERY - NTP on cp3049 is OK: NTP OK: Offset -0.00145816803 secs [14:41:37] RECOVERY - NTP on cp3013 is OK: NTP OK: Offset -0.001623630524 secs [14:41:44] RECOVERY - NTP on cp3017 is OK: NTP OK: Offset -0.001567959785 secs [14:41:44] RECOVERY - NTP on cp3030 is OK: NTP OK: Offset -0.001729011536 secs [14:42:06] RECOVERY - NTP on cp3041 is OK: NTP OK: Offset -0.0031478405 secs [14:42:06] RECOVERY - NTP on cp3019 is OK: NTP OK: Offset -0.001456975937 secs [14:42:24] RECOVERY - NTP on ms-fe3001 is OK: NTP OK: Offset -0.001640439034 secs [14:42:25] RECOVERY - NTP on cp3046 is OK: NTP OK: Offset -0.000480890274 secs [14:43:05] RECOVERY - NTP on cp4006 is OK: NTP OK: Offset -0.0003497600555 secs [14:43:05] RECOVERY - NTP on cp3045 is OK: NTP OK: Offset -0.003185510635 secs [14:43:06] RECOVERY - NTP on cp3010 is OK: NTP OK: Offset -0.00247502327 secs [14:43:06] RECOVERY - NTP on cp3035 is OK: NTP OK: Offset -0.002175450325 secs [14:43:15] RECOVERY - NTP on cp3040 is OK: NTP OK: Offset -0.004250645638 secs [14:43:15] RECOVERY - NTP on cp3008 is OK: NTP OK: Offset -6.437301636e-05 secs [14:43:25] RECOVERY - NTP on cp3020 is OK: NTP OK: Offset -0.001423239708 secs [14:43:25] RECOVERY - NTP on cp3021 is OK: NTP OK: Offset -0.001824259758 secs [14:43:44] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1783.30373344 [14:43:45] RECOVERY - NTP on cp4008 is OK: NTP OK: 
Offset -0.001568675041 secs [14:43:45] RECOVERY - NTP on cp3042 is OK: NTP OK: Offset -2.062320709e-05 secs [14:43:46] RECOVERY - NTP on cp3014 is OK: NTP OK: Offset -0.0008828639984 secs [14:44:05] RECOVERY - NTP on lvs4002 is OK: NTP OK: Offset 0.001346349716 secs [14:44:05] RECOVERY - NTP on lvs3001 is OK: NTP OK: Offset -0.001976132393 secs [14:44:15] RECOVERY - NTP on cp3048 is OK: NTP OK: Offset -0.001420736313 secs [14:44:15] RECOVERY - NTP on cp3022 is OK: NTP OK: Offset -0.00152349472 secs [14:44:15] RECOVERY - NTP on cp3039 is OK: NTP OK: Offset 0.0004806518555 secs [14:44:15] RECOVERY - NTP on cp3012 is OK: NTP OK: Offset -0.0002645254135 secs [14:44:15] RECOVERY - NTP on ms-be3002 is OK: NTP OK: Offset -0.001163959503 secs [14:44:15] RECOVERY - NTP on cp3037 is OK: NTP OK: Offset -0.002552270889 secs [14:44:15] RECOVERY - NTP on cp3016 is OK: NTP OK: Offset -0.003399014473 secs [14:44:16] RECOVERY - NTP on ms-be3001 is OK: NTP OK: Offset -0.002299904823 secs [14:44:24] the e-notation makes some of those confusing heh [14:44:37] RECOVERY - NTP on cp3009 is OK: NTP OK: Offset -0.0005856752396 secs [14:44:37] "Offset -6.4" -> "wtf?" 
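On the e-notation confusion above: a value such as "Offset -6.437301636e-05 secs" is about -0.064 ms, not -6.4 s. A minimal sketch of one way to make these alert lines easier to read, assuming the "Offset <number> secs" wording seen in this log; the names OFFSET_RE, offset_ms and describe are hypothetical helpers for illustration, not part of icinga or the NTP check itself:

```python
import re

# Matches the offset in lines such as "NTP OK: Offset -6.437301636e-05 secs".
# The pattern is an assumption based on the alert wording in this log.
OFFSET_RE = re.compile(r"Offset\s+(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+secs")


def offset_ms(line):
    """Return the NTP offset found in an alert line, in milliseconds, or None."""
    match = OFFSET_RE.search(line)
    if match is None:
        return None
    # float() accepts both plain decimals and e-notation.
    return float(match.group(1)) * 1000.0


def describe(line):
    """Render the offset as fixed-point milliseconds so no exponent is shown."""
    ms = offset_ms(line)
    return "no offset found" if ms is None else f"{ms:+.3f} ms"


print(describe("RECOVERY - NTP on cp3008 is OK: NTP OK: Offset -6.437301636e-05 secs"))
print(describe("RECOVERY - NTP on cp3040 is OK: NTP OK: Offset -0.004250645638 secs"))
```

Printing in fixed-point milliseconds keeps sub-millisecond and multi-millisecond offsets on the same scale, so the two lines above are directly comparable at a glance.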
[14:44:45] RECOVERY - NTP on cp3004 is OK: NTP OK: Offset 0.0057117939 secs [14:44:45] RECOVERY - NTP on cp3043 is OK: NTP OK: Offset -0.002085208893 secs [14:44:55] RECOVERY - NTP on lvs3004 is OK: NTP OK: Offset -0.003157258034 secs [14:45:28] RECOVERY - NTP on cp4003 is OK: NTP OK: Offset 0.001105666161 secs [14:45:45] RECOVERY - NTP on cp4014 is OK: NTP OK: Offset -9.894371033e-05 secs [14:45:46] RECOVERY - NTP on cp3007 is OK: NTP OK: Offset 0.002432107925 secs [14:45:55] RECOVERY - NTP on cp3034 is OK: NTP OK: Offset -0.002057671547 secs [14:45:55] RECOVERY - NTP on cp3003 is OK: NTP OK: Offset -0.00150001049 secs [14:46:14] RECOVERY - NTP on lvs3002 is OK: NTP OK: Offset -0.001467466354 secs [14:46:15] RECOVERY - NTP on cp4004 is OK: NTP OK: Offset -0.0003101825714 secs [14:46:24] RECOVERY - NTP on ms-be3004 is OK: NTP OK: Offset -0.001046180725 secs [14:46:44] RECOVERY - NTP on cp3038 is OK: NTP OK: Offset -0.001832008362 secs [14:46:45] RECOVERY - NTP on bast4001 is OK: NTP OK: Offset -0.003821253777 secs [14:46:45] RECOVERY - NTP on cp3032 is OK: NTP OK: Offset -0.003973603249 secs [14:46:45] RECOVERY - NTP on cp3005 is OK: NTP OK: Offset 0.00024330616 secs [14:46:45] RECOVERY - NTP on cp3031 is OK: NTP OK: Offset -0.002740263939 secs [14:47:07] RECOVERY - NTP on cp3018 is OK: NTP OK: Offset -0.000960111618 secs [14:47:07] RECOVERY - NTP on cp4005 is OK: NTP OK: Offset -0.0002368688583 secs [14:47:14] RECOVERY - NTP on cp3036 is OK: NTP OK: Offset -0.002277493477 secs [14:47:15] RECOVERY - NTP on cp3033 is OK: NTP OK: Offset -0.001938939095 secs [14:47:26] RECOVERY - NTP on ms-fe3002 is OK: NTP OK: Offset 3.695487976e-05 secs [14:47:26] RECOVERY - NTP on multatuli is OK: NTP OK: Offset -0.001314878464 secs [14:47:51] cp3008 seems legit now, also cp4014 [14:47:56] RECOVERY - NTP on cp4001 is OK: NTP OK: Offset -6.282329559e-05 secs [14:48:05] RECOVERY - NTP on cp4018 is OK: NTP OK: Offset -0.0004807710648 secs [14:48:16] RECOVERY - NTP on cp4019 is OK: NTP 
OK: Offset -0.0007411241531 secs [14:48:17] RECOVERY - NTP on hooft is OK: NTP OK: Offset -0.001455307007 secs [14:49:05] RECOVERY - NTP on lvs4003 is OK: NTP OK: Offset -0.002859830856 secs [14:49:15] RECOVERY - NTP on cp3047 is OK: NTP OK: Offset -0.002445220947 secs [14:49:19] (03Restored) 10Hoo man: Add a dedicated Wikibase job runner [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [14:50:25] (03PS2) 10Hoo man: Add a dedicated Wikibase job runner [puppet] - 10https://gerrit.wikimedia.org/r/208397 [14:50:32] (03CR) 10jenkins-bot: [V: 04-1] Add a dedicated Wikibase job runner [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [14:50:40] * anomie sees nothing for SWAT this morning [14:53:16] RECOVERY - NTP on cp4020 is OK: NTP OK: Offset -0.000387430191 secs [14:54:05] * hoo will hijack today's SWAT to push a Wikibase update to test [14:54:44] RECOVERY - NTP on cp4002 is OK: NTP OK: Offset -0.0009986162186 secs [14:54:45] RECOVERY - NTP on cp4012 is OK: NTP OK: Offset 0.001273036003 secs [14:55:01] hi, hoo I think I didn't have the change to say hi in real time [14:55:27] hi jynus :) [14:55:30] (03PS3) 10Ottomata: Add Pageviews/LegacyPageviews to metrics website [puppet] - 10https://gerrit.wikimedia.org/r/221611 (https://phabricator.wikimedia.org/T104003) (owner: 10Joal) [14:55:32] s/change/chance [14:55:47] you are like a dream come true for me [14:55:47] RECOVERY - NTP on lvs4004 is OK: NTP OK: Offset 0.001770615578 secs [14:55:54] RECOVERY - NTP on cp4010 is OK: NTP OK: Offset 0.001293778419 secs [14:55:55] RECOVERY - NTP on cp4009 is OK: NTP OK: Offset 0.001118659973 secs [14:56:01] with your help [14:56:45] RECOVERY - NTP on cp4016 is OK: NTP OK: Offset -0.001809597015 secs [14:56:52] (03CR) 10Ottomata: [C: 032] Add Pageviews/LegacyPageviews to metrics website [puppet] - 10https://gerrit.wikimedia.org/r/221611 (https://phabricator.wikimedia.org/T104003) (owner: 10Joal) [14:57:10] nice... but what did I do? 
[14:57:15] RECOVERY - NTP on cp4011 is OK: NTP OK: Offset 0.001112103462 secs [14:57:25] RECOVERY - NTP on cp4015 is OK: NTP OK: Offset -0.003883600235 secs [14:57:32] Didn't really have the time to poke at much recently [14:57:59] well, every time I run over you you have a very helpful ticket comment [14:58:39] just, thanks [14:58:45] RECOVERY - NTP on cp4013 is OK: NTP OK: Offset 0.0002037286758 secs [14:58:45] (03PS5) 10BBlack: tlsproxy: enable DHE-2048 FS for Android 2.x, etc. [puppet] - 10https://gerrit.wikimedia.org/r/222023 (https://phabricator.wikimedia.org/T104281) [14:58:47] (03PS5) 10BBlack: ciphersuites: refactor further, add compat-dhe option [puppet] - 10https://gerrit.wikimedia.org/r/222022 [14:58:49] (03PS2) 10BBlack: sslcert: add sslcert::std_cert for easier arrays [puppet] - 10https://gerrit.wikimedia.org/r/222066 [14:58:52] (03PS3) 10BBlack: tlsproxy: multi-cert support, including ocsp [puppet] - 10https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654) [14:58:54] (03PS3) 10BBlack: tlsproxy: add 2048-bit dhparam file to nginx [puppet] - 10https://gerrit.wikimedia.org/r/222016 [14:58:56] (03PS1) 10BBlack: protoproxy/tlsproxy: big refactor commit [puppet] - 10https://gerrit.wikimedia.org/r/222124 [14:59:15] RECOVERY - NTP on cp4017 is OK: NTP OK: Offset 0.001001000404 secs [14:59:15] RECOVERY - NTP on lvs4001 is OK: NTP OK: Offset 0.001164793968 secs [14:59:15] RECOVERY - NTP on cp4007 is OK: NTP OK: Offset -0.003648996353 secs [14:59:38] !log disabling puppet on caches (because puppet always breaks when you move files/modules around...) [14:59:42] Logged the message, Master [14:59:56] You're welcome :) [15:00:04] manybubbles anomie ostriches marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150701T1500). 
[15:00:20] (03CR) 10BBlack: [C: 032 V: 032] protoproxy/tlsproxy: big refactor commit [puppet] - 10https://gerrit.wikimedia.org/r/222124 (owner: 10BBlack) [15:00:26] puppet has been broken on beta caches for ages apparently :( [15:00:44] yup [15:00:56] well, off and on. every time we fix it, it gets broken again a few days later [15:00:56] RECOVERY - NTP on eeden is OK: NTP OK: Offset 5.722045898e-06 secs [15:01:39] anomie, if you'd like to sanity-check https://gerrit.wikimedia.org/r/#/c/221808/ that'd be appreciated [15:01:54] it seems https://gerrit.wikimedia.org/r/#/c/221808/2/wmf-config/CommonSettings.php makes it disable the extension [15:02:09] * anomie looks [15:02:16] because clearly that's desired when you're actually changing it to unconditionally include the extension... [15:03:25] I don't get it. This should be a really simple case of a redundant wmgUse variable [15:04:28] anomie: you are doing the swat? we still need a few more minutes to finish the build and stuff for our stuff if you don't mind? [15:05:01] !log re-enabling puppet on caches [15:05:05] Logged the message, Master [15:05:22] <_joe_> bblack: should I wait before I merge my change, right? [15:05:31] jzerebecki: Wasn't planning on it, but I maybe could if no one else wants to and the patches are simple enough [15:05:40] _joe_: it's ok to go ahead [15:05:51] anomie: it is a small backport to group0 [15:05:59] <_joe_> I checked it with the compiler and seems nice [15:05:59] most likely, assuming puppetmaster doesn't cause a bunch of pointless puppetfails in the next few minutes [15:06:17] hoo^^? 
[15:06:53] !log restbase1002: PWD=/home/eevans/restbase-mod-table-cassandra/maintenance; node thin_out_key_rev_value_data.js `hostname -i` local_group_wikimedia_T_parsoid_html 2>&1 | pv --line-mode | gzip -c > wikimedia_T_parsoid_html.log.gz [15:06:56] Logged the message, Master [15:06:58] Krenair: That change looks sane [15:07:04] (what's ugly is that puppet repo update -> master -> agent-run is not transactional. when you push a single commit that, for example, renames a file source path, a bunch of agents break that were mid-run or run right after) [15:07:50] anomie, and yet when I tested it on tin, that extensions' entries (e.g. in wgHooks['APIAfterExecute']) disappeared from eval.php [15:08:20] (03PS3) 10Thiemo Mättig (WMDE): Add a dedicated Wikibase job runner [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [15:08:57] (03Abandoned) 10BBlack: tlsproxy: rename protoproxy to tlsproxy globally [puppet] - 10https://gerrit.wikimedia.org/r/222001 (owner: 10BBlack) [15:09:04] (03CR) 10Thiemo Mättig (WMDE): "PS3 is a rebase only." 
[puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [15:09:06] (03Abandoned) 10BBlack: tlsproxy: fold ssl::beta::common into ssl::beta [puppet] - 10https://gerrit.wikimedia.org/r/222002 (owner: 10BBlack) [15:09:13] (03Abandoned) 10BBlack: tlsproxy: move role::tlsproxy::ssl::common to auto-required tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/222003 (owner: 10BBlack) [15:09:20] (03Abandoned) 10BBlack: tlsproxy: move sslcert stuff inside of tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/222004 (owner: 10BBlack) [15:09:32] (03Abandoned) 10BBlack: tlsproxy: remove unused ganglia/localhost stuff [puppet] - 10https://gerrit.wikimedia.org/r/222005 (owner: 10BBlack) [15:09:45] (03Abandoned) 10BBlack: tlsproxy: rename beta-only things to betassl for clarity [puppet] - 10https://gerrit.wikimedia.org/r/222006 (owner: 10BBlack) [15:09:52] (03Abandoned) 10BBlack: tlsproxy: remove remaining ipv6 hacks from beta [puppet] - 10https://gerrit.wikimedia.org/r/222007 (owner: 10BBlack) [15:09:58] (03Abandoned) 10BBlack: tlsproxy: gut esams cases from beta-only template [puppet] - 10https://gerrit.wikimedia.org/r/222008 (owner: 10BBlack) [15:10:06] (03Abandoned) 10BBlack: tlsproxy: move template into module (only user) [puppet] - 10https://gerrit.wikimedia.org/r/222009 (owner: 10BBlack) [15:10:14] (03Abandoned) 10BBlack: tlsproxy: move logrotate into module (only user) [puppet] - 10https://gerrit.wikimedia.org/r/222010 (owner: 10BBlack) [15:10:20] (03Abandoned) 10BBlack: tlsproxy: remove pointless use_ssl + jessie conditionals [puppet] - 10https://gerrit.wikimedia.org/r/222011 (owner: 10BBlack) [15:10:26] (03Abandoned) 10BBlack: tlsproxy: remove dead udplog comments [puppet] - 10https://gerrit.wikimedia.org/r/222012 (owner: 10BBlack) [15:10:37] (03Abandoned) 10BBlack: tlsproxy: kill $ssl_protos var [puppet] - 10https://gerrit.wikimedia.org/r/222065 (owner: 10BBlack) [15:10:56] (03PS2) 10Giuseppe Lavagetto: varnish: always generate the 
dynamic directors lists [puppet] - 10https://gerrit.wikimedia.org/r/222117 [15:12:08] (03PS6) 10Rush: confd: track per template run error state files [puppet] - 10https://gerrit.wikimedia.org/r/222035 [15:13:48] (03PS3) 10BBlack: sslcert: add sslcert::std_cert for easier arrays [puppet] - 10https://gerrit.wikimedia.org/r/222066 [15:13:50] (03PS4) 10BBlack: tlsproxy: multi-cert support, including ocsp [puppet] - 10https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654) [15:14:45] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [15:15:55] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [15:16:02] (03CR) 10Giuseppe Lavagetto: [C: 032] "looks good in the compiler" [puppet] - 10https://gerrit.wikimedia.org/r/222117 (owner: 10Giuseppe Lavagetto) [15:16:56] (03PS1) 10Joal: Correct error in wikimetrics projectview simlink [puppet] - 10https://gerrit.wikimedia.org/r/222127 [15:18:17] akosiaris: hi, re. the vm, can we go with just staticbugs.eqiad.wmnet? [15:18:26] (03PS7) 10Rush: confd: track per template run error state files [puppet] - 10https://gerrit.wikimedia.org/r/222035 [15:18:32] (03PS1) 10Muehlenhoff: Reenable ntp by default [puppet] - 10https://gerrit.wikimedia.org/r/222129 [15:18:52] (03PS2) 10Ottomata: Correct error in wikimetrics projectview simlink [puppet] - 10https://gerrit.wikimedia.org/r/222127 (owner: 10Joal) [15:19:00] (03CR) 10Ottomata: [C: 032 V: 032] Correct error in wikimetrics projectview simlink [puppet] - 10https://gerrit.wikimedia.org/r/222127 (owner: 10Joal) [15:19:27] (03PS8) 10Rush: confd: track per template run error state files [puppet] - 10https://gerrit.wikimedia.org/r/222035 [15:23:47] hoo: now that it is merged will you take the swat slot? 
[15:24:05] Yeah [15:24:12] waiting for jenkins on core now [15:28:01] JohnFLewis: well, if it's not a misc (elements or starts - depending on eqiad/codfw respectively) name, it has to have numbers, so staticbugs1001.eqiad.wmnet [15:28:01] otherwise, an element name [15:28:15] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [15:28:38] akosiaris: if we can assign element names to vm, then that will probably be best as it would fall under 'misc' [15:29:28] JohnFLewis: fine by me [15:29:36] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.004 second response time on port 9042 [15:30:14] !log hoo Synchronized php-1.26wmf12/extensions/Wikidata/: Remove alias uniqueness constraints (duration: 00m 21s) [15:30:18] Logged the message, Master [15:31:16] akosiaris: this is why migrating over misc hardware is awkward because some things are services like planet and etherpad but some are one offs which takes up the misc pool :/ [15:32:17] But anyway - it's a case-by-case decision :) [15:36:37] JohnFLewis: I know ;-) [15:39:51] paravoid: think you'll have time to review the rewrite today? :) [15:40:07] maybe, in 1-2hrs [15:40:30] paravoid: ok! [15:41:40] (03PS7) 10Alexandros Kosiaris: Add new_wmf_service.py and examples [puppet] - 10https://gerrit.wikimedia.org/r/217548 (https://phabricator.wikimedia.org/T97036) [15:42:11] (03CR) 10Addshore: [C: 031] Add a dedicated Wikibase job runner [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [15:45:06] (03CR) 10BBlack: [C: 031] "Improvement on previous insanity!" 
[puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [15:46:50] (03PS8) 10Yuvipanda: Labs: small race condition fix in replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [15:50:05] (03CR) 10JanZerebecki: [C: 031] Add a dedicated Wikibase job runner [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [15:53:30] (03CR) 10Yuvipanda: [C: 032] Labs: small race condition fix in replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [16:00:28] legoktm, twentyafterfour, YuviPanda: can someone start wikibugs? [16:00:30] it seems to be broken [16:00:39] alternatively give me the ability to restart it and I won't bother you next time [16:00:46] I can do the latter, yes. [16:01:00] I explicitly removed myself from that list of people as well, didn't I? :) [16:01:12] or valhallasw`cloud [16:01:13] adding you tho [16:01:27] (03CR) 10Alexandros Kosiaris: [C: 031] Reenable ntp by default [puppet] - 10https://gerrit.wikimedia.org/r/222129 (owner: 10Muehlenhoff) [16:01:33] whenever this thing decides to work [16:05:25] (03PS2) 10Alexandros Kosiaris: Reenable ntp by default [puppet] - 10https://gerrit.wikimedia.org/r/222129 (owner: 10Muehlenhoff) [16:05:32] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Reenable ntp by default [puppet] - 10https://gerrit.wikimedia.org/r/222129 (owner: 10Muehlenhoff) [16:05:49] !log re-enabling ntp everywhere [16:05:53] Logged the message, Master [16:11:05] RECOVERY - NTP on elastic1024 is OK: NTP OK: Offset 0.002243757248 secs [16:11:34] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [16:11:35] RECOVERY - NTP on mc1012 is OK: NTP OK: Offset 0.002692461014 secs [16:12:05] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), 
command name java, args CassandraDaemon [16:12:26] RECOVERY - NTP on elastic1019 is OK: NTP OK: Offset 0.002202868462 secs [16:13:50] YuviPanda, has it not decided to work? [16:14:25] RECOVERY - NTP on elastic1006 is OK: NTP OK: Offset 0.0008374452591 secs [16:14:32] Krenair: whoops, it was taking long and I tabbed out and forgot. added you now [16:14:34] RECOVERY - NTP on elastic1015 is OK: NTP OK: Offset -0.003218531609 secs [16:14:54] RECOVERY - NTP on elastic1005 is OK: NTP OK: Offset -0.001771807671 secs [16:14:54] RECOVERY - NTP on elastic1011 is OK: NTP OK: Offset 0.001108765602 secs [16:15:11] (03CR) 10Giuseppe Lavagetto: [C: 031] "Great work. It has some rough edges but it's a great starting point." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/217548 (https://phabricator.wikimedia.org/T97036) (owner: 10Alexandros Kosiaris) [16:15:33] RECOVERY - NTP on snapshot1004 is OK: NTP OK: Offset -0.01075041294 secs [16:15:33] RECOVERY - NTP on analytics1011 is OK: NTP OK: Offset -0.006717920303 secs [16:15:45] RECOVERY - NTP on elastic1014 is OK: NTP OK: Offset -0.01224899292 secs [16:15:45] (03PS1) 10Dzahn: Revert "Revert "analytics_kafka: switch to ganglia_new"" [puppet] - 10https://gerrit.wikimedia.org/r/222138 [16:15:51] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "analytics_kafka: switch to ganglia_new"" [puppet] - 10https://gerrit.wikimedia.org/r/222138 (owner: 10Dzahn) [16:16:13] RECOVERY - NTP on elastic1002 is OK: NTP OK: Offset -0.001994252205 secs [16:16:53] RECOVERY - NTP on elastic1017 is OK: NTP OK: Offset 0.0008696317673 secs [16:17:13] RECOVERY - NTP on elastic1023 is OK: NTP OK: Offset 0.001240849495 secs [16:17:13] RECOVERY - NTP on elastic1003 is OK: NTP OK: Offset 0.0007504224777 secs [16:17:14] RECOVERY - NTP on ocg1003 is OK: NTP OK: Offset 0.001040220261 secs [16:17:23] RECOVERY - NTP on cp1044 is OK: NTP OK: Offset 0.0006556510925 secs [16:18:05] RECOVERY - NTP on elastic1020 is OK: NTP OK: Offset -0.0006422996521 secs 
[16:18:09] _joe_: Moritz thought I should run https://gerrit.wikimedia.org/r/#/c/218380/ past you before merge. Any concerns? (We want to do the same for salt.) [16:20:04] RECOVERY - NTP on ganeti1003 is OK: NTP OK: Offset 0.002232789993 secs [16:21:33] RECOVERY - NTP on mw1154 is OK: NTP OK: Offset -0.002393364906 secs [16:21:54] RECOVERY - NTP on mw1019 is OK: NTP OK: Offset -0.003203868866 secs [16:22:23] RECOVERY - NTP on mw1015 is OK: NTP OK: Offset 0.0005525350571 secs [16:22:44] Krenair: is it still down? [16:23:02] I'm trying to figure out how to make it start [16:23:13] so yes [16:23:45] RECOVERY - NTP on es1006 is OK: NTP OK: Offset 0.0009577274323 secs [16:24:25] RECOVERY - NTP on elastic1031 is OK: NTP OK: Offset -0.0002068281174 secs [16:24:35] RECOVERY - NTP on elastic1026 is OK: NTP OK: Offset -0.001788258553 secs [16:24:40] OSError: Multiple exceptions: [Errno 110] Connect call failed ('148.251.187.147', 6667), [Errno 101] Network is unreachable [16:24:44] RECOVERY - NTP on db1007 is OK: NTP OK: Offset -0.0005015134811 secs [16:25:04] RECOVERY - NTP on labvirt1007 is OK: NTP OK: Offset -0.002866387367 secs [16:25:13] oh heh [16:25:14] RECOVERY - NTP on elastic1028 is OK: NTP OK: Offset 0.001387953758 secs [16:25:15] I know why [16:25:34] RECOVERY - NTP on db1043 is OK: NTP OK: Offset -0.00153529644 secs [16:25:34] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset 0.0001419782639 secs [16:25:45] RECOVERY - NTP on cp1071 is OK: NTP OK: Offset -0.0001133680344 secs [16:26:04] RECOVERY - NTP on mw1195 is OK: NTP OK: Offset -0.0005921125412 secs [16:26:14] RECOVERY - NTP on labnet1001 is OK: NTP OK: Offset -0.002748250961 secs [16:26:14] RECOVERY - NTP on snapshot1001 is OK: NTP OK: Offset 0.000684261322 secs [16:26:23] RECOVERY - NTP on conf1001 is OK: NTP OK: Offset -0.005385518074 secs [16:26:24] RECOVERY - NTP on mw1206 is OK: NTP OK: Offset -0.006228804588 secs [16:26:24] RECOVERY - NTP on db1052 is OK: NTP OK: Offset 0.0006399154663 secs 
[16:26:34] RECOVERY - NTP on stat1003 is OK: NTP OK: Offset 0.001839876175 secs [16:26:34] RECOVERY - NTP on mw1149 is OK: NTP OK: Offset -0.001811146736 secs [16:26:35] RECOVERY - NTP on mw1076 is OK: NTP OK: Offset -0.000497341156 secs [16:26:44] RECOVERY - NTP on mw1208 is OK: NTP OK: Offset -0.002260327339 secs [16:26:44] RECOVERY - NTP on es1007 is OK: NTP OK: Offset -0.0004492998123 secs [16:26:54] RECOVERY - NTP on analytics1038 is OK: NTP OK: Offset -0.004551529884 secs [16:26:54] RECOVERY - NTP on polonium is OK: NTP OK: Offset -0.009288907051 secs [16:27:03] RECOVERY - NTP on silver is OK: NTP OK: Offset 0.0008475780487 secs [16:27:03] RECOVERY - NTP on antimony is OK: NTP OK: Offset -0.001176357269 secs [16:27:03] RECOVERY - NTP on mw1237 is OK: NTP OK: Offset 0.001049280167 secs [16:27:03] RECOVERY - NTP on elastic1008 is OK: NTP OK: Offset -6.198883057e-06 secs [16:27:14] Krenair: it's back [16:27:14] RECOVERY - NTP on wtp1005 is OK: NTP OK: Offset -0.0007543563843 secs [16:27:14] RECOVERY - NTP on cp1058 is OK: NTP OK: Offset -0.004805922508 secs [16:27:14] RECOVERY - NTP on wtp1012 is OK: NTP OK: Offset 0.0003634691238 secs [16:27:14] RECOVERY - NTP on lithium is OK: NTP OK: Offset -0.001853227615 secs [16:27:15] RECOVERY - NTP on mc1014 is OK: NTP OK: Offset -0.0005037784576 secs [16:27:15] RECOVERY - NTP on mw1249 is OK: NTP OK: Offset -0.0005707740784 secs [16:27:15] RECOVERY - NTP on mw1238 is OK: NTP OK: Offset -0.001354932785 secs [16:27:23] RECOVERY - NTP on virt1001 is OK: NTP OK: Offset -0.0009245872498 secs [16:27:24] RECOVERY - NTP on plutonium is OK: NTP OK: Offset -0.003599643707 secs [16:27:24] RECOVERY - NTP on cp1046 is OK: NTP OK: Offset -0.001534700394 secs [16:27:24] RECOVERY - NTP on analytics1002 is OK: NTP OK: Offset -0.01609075069 secs [16:27:34] RECOVERY - NTP on dataset1001 is OK: NTP OK: Offset -0.0008971691132 secs [16:27:34] RECOVERY - NTP on mw1162 is OK: NTP OK: Offset 0.0003156661987 secs [16:27:35] RECOVERY - NTP on 
elastic1007 is OK: NTP OK: Offset -0.001901745796 secs [16:27:35] RECOVERY - NTP on elastic1001 is OK: NTP OK: Offset -0.003170609474 secs [16:27:35] RECOVERY - NTP on elastic1021 is OK: NTP OK: Offset 4.971027374e-05 secs [16:27:35] RECOVERY - NTP on mw1247 is OK: NTP OK: Offset 0.0004059076309 secs [16:27:35] RECOVERY - NTP on pc1002 is OK: NTP OK: Offset -0.0001844167709 secs [16:27:43] RECOVERY - NTP on db1016 is OK: NTP OK: Offset -0.005051136017 secs [16:27:44] RECOVERY - NTP on db1060 is OK: NTP OK: Offset -0.00278198719 secs [16:27:44] RECOVERY - NTP on mw1202 is OK: NTP OK: Offset -0.001884460449 secs [16:27:53] RECOVERY - NTP on virt1004 is OK: NTP OK: Offset 0.0002028942108 secs [16:27:53] RECOVERY - NTP on snapshot1002 is OK: NTP OK: Offset -0.0002391338348 secs [16:27:54] RECOVERY - NTP on db1036 is OK: NTP OK: Offset -0.001797437668 secs [16:27:54] RECOVERY - NTP on mw1049 is OK: NTP OK: Offset -0.005429148674 secs [16:27:56] RECOVERY - NTP on labmon1001 is OK: NTP OK: Offset -0.01508796215 secs [16:28:03] RECOVERY - NTP on elastic1004 is OK: NTP OK: Offset -0.002102851868 secs [16:28:04] RECOVERY - NTP on db1048 is OK: NTP OK: Offset -0.001130342484 secs [16:28:13] RECOVERY - NTP on cp1050 is OK: NTP OK: Offset -0.005946397781 secs [16:28:14] RECOVERY - NTP on analytics1013 is OK: NTP OK: Offset -0.002531409264 secs [16:28:14] RECOVERY - NTP on analytics1026 is OK: NTP OK: Offset -0.0007712841034 secs [16:28:14] RECOVERY - NTP on db1026 is OK: NTP OK: Offset -0.004920601845 secs [16:28:23] RECOVERY - NTP on mc1005 is OK: NTP OK: Offset 6.914138794e-05 secs [16:28:23] RECOVERY - NTP on gadolinium is OK: NTP OK: Offset 0.0005996227264 secs [16:28:23] legoktm, I found https://www.mediawiki.org/wiki/Wikibugs#Deploying_changes [16:28:24] RECOVERY - NTP on etcd1002 is OK: NTP OK: Offset 5.900859833e-05 secs [16:28:24] RECOVERY - NTP on mw1051 is OK: NTP OK: Offset 0.0003777742386 secs [16:28:33] RECOVERY - NTP on mw1111 is OK: NTP OK: Offset 
-0.001017689705 secs [16:28:33] RECOVERY - NTP on mw1044 is OK: NTP OK: Offset -9.417533875e-05 secs [16:28:34] RECOVERY - NTP on analytics1022 is OK: NTP OK: Offset -0.001609802246 secs [16:28:34] RECOVERY - NTP on db1069 is OK: NTP OK: Offset -0.005012989044 secs [16:28:34] RECOVERY - NTP on argon is OK: NTP OK: Offset -0.002904772758 secs [16:28:34] RECOVERY - NTP on cp1063 is OK: NTP OK: Offset -0.002573013306 secs [16:28:34] RECOVERY - NTP on mw1133 is OK: NTP OK: Offset -0.002162098885 secs [16:28:35] RECOVERY - NTP on elastic1012 is OK: NTP OK: Offset -0.001248836517 secs [16:28:35] RECOVERY - NTP on mw1156 is OK: NTP OK: Offset -0.0001980066299 secs [16:28:36] RECOVERY - NTP on db1039 is OK: NTP OK: Offset 0.0002094507217 secs [16:28:37] is the idea that you run those commands locally and you can't actually restart it from tools-login? [16:28:44] RECOVERY - NTP on ms-be1007 is OK: NTP OK: Offset -0.003082752228 secs [16:28:53] RECOVERY - NTP on dbproxy1004 is OK: NTP OK: Offset -0.001790761948 secs [16:28:53] RECOVERY - NTP on mc1001 is OK: NTP OK: Offset -0.002902269363 secs [16:28:54] RECOVERY - NTP on mw1189 is OK: NTP OK: Offset -0.001187562943 secs [16:28:54] RECOVERY - NTP on mw1014 is OK: NTP OK: Offset -0.001257300377 secs [16:28:54] RECOVERY - NTP on conf1003 is OK: NTP OK: Offset 0.000124335289 secs [16:28:54] RECOVERY - NTP on rhenium is OK: NTP OK: Offset -1.549720764e-05 secs [16:28:54] RECOVERY - NTP on mw1125 is OK: NTP OK: Offset -0.0009781122208 secs [16:28:55] RECOVERY - NTP on cp1048 is OK: NTP OK: Offset -0.004184365273 secs [16:28:55] RECOVERY - NTP on osmium is OK: NTP OK: Offset -0.002081036568 secs [16:29:03] RECOVERY - NTP on mw1053 is OK: NTP OK: Offset -0.005102276802 secs [16:29:03] RECOVERY - NTP on mw1055 is OK: NTP OK: Offset -0.0001393556595 secs [16:29:03] RECOVERY - NTP on mw1084 is OK: NTP OK: Offset -0.0001171827316 secs [16:29:04] RECOVERY - NTP on mw1050 is OK: NTP OK: Offset -0.003386974335 secs [16:29:04] RECOVERY - 
NTP on mw1168 is OK: NTP OK: Offset -0.004619240761 secs [16:29:13] RECOVERY - NTP on elastic1018 is OK: NTP OK: Offset 0.001440525055 secs [16:29:13] RECOVERY - NTP on rdb1001 is OK: NTP OK: Offset -0.0004514455795 secs [16:29:13] RECOVERY - NTP on labsdb1006 is OK: NTP OK: Offset -0.002314329147 secs [16:29:25] RECOVERY - NTP on mw1004 is OK: NTP OK: Offset -0.05386936665 secs [16:29:25] RECOVERY - NTP on db1001 is OK: NTP OK: Offset -0.07569134235 secs [16:29:25] RECOVERY - NTP on elastic1027 is OK: NTP OK: Offset -0.01649141312 secs [16:29:25] RECOVERY - NTP on ms-be1012 is OK: NTP OK: Offset -0.006895184517 secs [16:29:25] RECOVERY - NTP on mw1151 is OK: NTP OK: Offset -0.002140045166 secs [16:29:26] RECOVERY - NTP on etcd1001 is OK: NTP OK: Offset -0.005708694458 secs [16:29:26] RECOVERY - NTP on mw1212 is OK: NTP OK: Offset -0.03931987286 secs [16:29:29] Krenair: lets move to -labs [16:29:32] (03PS8) 10Alexandros Kosiaris: Add new_wmf_service.py and examples [puppet] - 10https://gerrit.wikimedia.org/r/217548 (https://phabricator.wikimedia.org/T97036) [16:29:33] RECOVERY - NTP on berkelium is OK: NTP OK: Offset -0.01850712299 secs [16:29:33] RECOVERY - NTP on db1004 is OK: NTP OK: Offset -5.412101746e-05 secs [16:29:33] RECOVERY - NTP on ms-be1008 is OK: NTP OK: Offset -0.00129878521 secs [16:29:33] RECOVERY - NTP on mw1183 is OK: NTP OK: Offset -0.000385761261 secs [16:29:34] RECOVERY - NTP on db1020 is OK: NTP OK: Offset -0.001871228218 secs [16:29:43] RECOVERY - NTP on wtp1023 is OK: NTP OK: Offset -0.0007914304733 secs [16:29:43] RECOVERY - NTP on cp1062 is OK: NTP OK: Offset -0.003393530846 secs [16:29:44] RECOVERY - NTP on wtp1018 is OK: NTP OK: Offset -0.006498813629 secs [16:29:44] RECOVERY - NTP on stat1002 is OK: NTP OK: Offset -0.006909966469 secs [16:29:53] RECOVERY - NTP on db1034 is OK: NTP OK: Offset 0.04354918003 secs [16:29:53] RECOVERY - NTP on mw1098 is OK: NTP OK: Offset -0.002477407455 secs [16:29:53] RECOVERY - NTP on mw1057 is OK: NTP 
OK: Offset -0.002779603004 secs [16:29:54] RECOVERY - NTP on mw1258 is OK: NTP OK: Offset 0.0005168914795 secs [16:29:54] RECOVERY - NTP on wtp1004 is OK: NTP OK: Offset -0.00207400322 secs [16:29:54] RECOVERY - NTP on analytics1032 is OK: NTP OK: Offset -0.01094019413 secs [16:30:03] RECOVERY - NTP on francium is OK: NTP OK: Offset -0.001114726067 secs [16:30:03] RECOVERY - NTP on protactinium is OK: NTP OK: Offset -4.887580872e-05 secs [16:30:04] RECOVERY - NTP on mw1181 is OK: NTP OK: Offset -0.0008516311646 secs [16:30:04] RECOVERY - NTP on mw1146 is OK: NTP OK: Offset -0.006646990776 secs [16:30:05] RECOVERY - NTP on wtp1011 is OK: NTP OK: Offset -0.001239061356 secs [16:30:13] RECOVERY - NTP on mc1013 is OK: NTP OK: Offset -2.658367157e-05 secs [16:30:13] RECOVERY - NTP on db1062 is OK: NTP OK: Offset -0.003253102303 secs [16:30:14] RECOVERY - NTP on db1071 is OK: NTP OK: Offset -0.001491904259 secs [16:30:14] RECOVERY - NTP on db1054 is OK: NTP OK: Offset -0.004831314087 secs [16:30:14] RECOVERY - NTP on ms-be1009 is OK: NTP OK: Offset 0.0008399486542 secs [16:30:23] RECOVERY - NTP on mw1097 is OK: NTP OK: Offset -0.0002864599228 secs [16:30:23] RECOVERY - NTP on mw1190 is OK: NTP OK: Offset -0.0008950233459 secs [16:30:23] RECOVERY - NTP on elastic1022 is OK: NTP OK: Offset -0.02175056934 secs [16:30:24] RECOVERY - NTP on wtp1022 is OK: NTP OK: Offset -0.006782650948 secs [16:30:24] RECOVERY - NTP on titanium is OK: NTP OK: Offset -0.007627606392 secs [16:30:24] RECOVERY - NTP on db1055 is OK: NTP OK: Offset -0.002088546753 secs [16:30:24] RECOVERY - NTP on labcontrol1002 is OK: NTP OK: Offset 0.0008490085602 secs [16:30:25] RECOVERY - NTP on rdb1002 is OK: NTP OK: Offset -0.004882931709 secs [16:30:25] RECOVERY - NTP on db1063 is OK: NTP OK: Offset 0.0004245042801 secs [16:30:26] RECOVERY - NTP on mw1165 is OK: NTP OK: Offset -0.0004497766495 secs [16:30:26] RECOVERY - NTP on mw1081 is OK: NTP OK: Offset -0.003248810768 secs [16:30:34] RECOVERY - NTP on 
mw1163 is OK: NTP OK: Offset -0.0007094144821 secs [16:30:34] RECOVERY - NTP on mw1087 is OK: NTP OK: Offset -0.000523686409 secs [16:30:34] RECOVERY - NTP on mw1114 is OK: NTP OK: Offset -0.001848340034 secs [16:30:34] RECOVERY - NTP on mw1032 is OK: NTP OK: Offset -0.003158569336 secs [16:30:35] RECOVERY - NTP on elastic1030 is OK: NTP OK: Offset -0.007767558098 secs [16:30:43] RECOVERY - NTP on mw1210 is OK: NTP OK: Offset -0.0002615451813 secs [16:30:44] RECOVERY - NTP on pc1003 is OK: NTP OK: Offset -0.003076434135 secs [16:30:53] RECOVERY - NTP on mw1180 is OK: NTP OK: Offset -0.00382912159 secs [16:30:54] RECOVERY - NTP on cp1060 is OK: NTP OK: Offset -0.01035785675 secs [16:30:54] RECOVERY - NTP on wtp1015 is OK: NTP OK: Offset -0.001754999161 secs [16:30:54] RECOVERY - NTP on wtp1002 is OK: NTP OK: Offset -0.0009828805923 secs [16:30:54] RECOVERY - NTP on rdb1003 is OK: NTP OK: Offset -0.008142709732 secs [16:30:55] RECOVERY - NTP on db1057 is OK: NTP OK: Offset -0.0009818077087 secs [16:30:55] RECOVERY - NTP on mw1227 is OK: NTP OK: Offset -0.0003873109818 secs [16:30:55] RECOVERY - NTP on mw1198 is OK: NTP OK: Offset -0.002222418785 secs [16:30:55] RECOVERY - NTP on mw1056 is OK: NTP OK: Offset 0.0004869699478 secs [16:30:56] RECOVERY - NTP on mw1030 is OK: NTP OK: Offset -0.003450274467 secs [16:30:56] RECOVERY - NTP on mw1148 is OK: NTP OK: Offset -0.00618314743 secs [16:30:56] * akosiaris happy to see all the NTP recoveries [16:30:57] RECOVERY - NTP on mw1159 is OK: NTP OK: Offset -0.01023924351 secs [16:31:04] RECOVERY - NTP on rcs1002 is OK: NTP OK: Offset -0.001373648643 secs [16:31:04] RECOVERY - NTP on ganeti1001 is OK: NTP OK: Offset -0.001591086388 secs [16:31:04] RECOVERY - NTP on analytics1037 is OK: NTP OK: Offset -0.001442551613 secs [16:31:04] RECOVERY - NTP on mc1007 is OK: NTP OK: Offset -0.001565217972 secs [16:31:05] RECOVERY - NTP on mw1239 is OK: NTP OK: Offset -0.0006972551346 secs [16:31:05] RECOVERY - NTP on mw1116 is OK: NTP OK: 
Offset -0.002708792686 secs [16:31:13] RECOVERY - NTP on mw1243 is OK: NTP OK: Offset -0.002601742744 secs [16:31:13] RECOVERY - NTP on nitrogen is OK: NTP OK: Offset -0.0003271102905 secs [16:31:13] RECOVERY - NTP on es1002 is OK: NTP OK: Offset -0.001298546791 secs [16:31:14] RECOVERY - NTP on mw1023 is OK: NTP OK: Offset -0.004323363304 secs [16:31:14] RECOVERY - NTP on graphite1001 is OK: NTP OK: Offset -0.0006091594696 secs [16:31:14] RECOVERY - NTP on mw1079 is OK: NTP OK: Offset -0.002100467682 secs [16:31:14] RECOVERY - NTP on mw1029 is OK: NTP OK: Offset -0.0032954216 secs [16:31:23] RECOVERY - NTP on mw1248 is OK: NTP OK: Offset -0.001348614693 secs [16:31:24] RECOVERY - NTP on mw1034 is OK: NTP OK: Offset -0.004468798637 secs [16:31:24] RECOVERY - NTP on mw1074 is OK: NTP OK: Offset -0.001759290695 secs [16:31:24] RECOVERY - NTP on ms-be1018 is OK: NTP OK: Offset -0.0004646778107 secs [16:31:24] RECOVERY - NTP on mw1122 is OK: NTP OK: Offset -0.005771875381 secs [16:31:24] RECOVERY - NTP on mw1171 is OK: NTP OK: Offset -0.001945614815 secs [16:31:29] akosiaris: i don't see anything in load that stands out at a glance [16:31:33] RECOVERY - NTP on analytics1001 is OK: NTP OK: Offset -0.001927018166 secs [16:31:33] RECOVERY - NTP on db1070 is OK: NTP OK: Offset -0.003462910652 secs [16:31:34] RECOVERY - NTP on wtp1003 is OK: NTP OK: Offset -0.005591154099 secs [16:31:34] RECOVERY - NTP on cp1052 is OK: NTP OK: Offset -0.002894043922 secs [16:31:34] RECOVERY - NTP on mw1121 is OK: NTP OK: Offset -0.009603977203 secs [16:31:38] (03PS2) 10Andrew Bogott: Tidy up firewall rules for puppetmaster and salt [puppet] - 10https://gerrit.wikimedia.org/r/214085 [16:31:43] RECOVERY - NTP on db1038 is OK: NTP OK: Offset -0.0004702806473 secs [16:31:44] RECOVERY - NTP on db1030 is OK: NTP OK: Offset -0.003389239311 secs [16:31:44] RECOVERY - NTP on db1044 is OK: NTP OK: Offset -0.002178311348 secs [16:31:44] RECOVERY - NTP on logstash1005 is OK: NTP OK: Offset 
-0.0007911920547 secs [16:31:44] RECOVERY - NTP on ms-be1011 is OK: NTP OK: Offset -0.007014989853 secs [16:31:45] RECOVERY - NTP on mw1185 is OK: NTP OK: Offset -0.003058075905 secs [16:31:45] RECOVERY - NTP on mw1188 is OK: NTP OK: Offset -0.0008232593536 secs [16:31:54] RECOVERY - NTP on db1064 is OK: NTP OK: Offset -2.253055573e-05 secs [16:32:04] RECOVERY - NTP on planet1001 is OK: NTP OK: Offset -0.004374623299 secs [16:32:04] RECOVERY - NTP on rcs1001 is OK: NTP OK: Offset -0.001067042351 secs [16:32:04] RECOVERY - NTP on mw1077 is OK: NTP OK: Offset -0.001524090767 secs [16:32:14] RECOVERY - NTP on mw1043 is OK: NTP OK: Offset -0.004419565201 secs [16:32:14] RECOVERY - NTP on mw1024 is OK: NTP OK: Offset -0.003511667252 secs [16:32:14] RECOVERY - NTP on mw1139 is OK: NTP OK: Offset -0.0008141994476 secs [16:32:14] RECOVERY - NTP on analytics1014 is OK: NTP OK: Offset -0.001261234283 secs [16:32:19] ottomata: there was a spike at 16:10 but it subsided now [16:32:24] RECOVERY - NTP on mw1010 is OK: NTP OK: Offset -0.001039743423 secs [16:32:24] RECOVERY - NTP on sodium is OK: NTP OK: Offset -0.005532026291 secs [16:32:24] RECOVERY - NTP on mw1201 is OK: NTP OK: Offset -0.0009055137634 secs [16:32:24] RECOVERY - NTP on db1006 is OK: NTP OK: Offset -0.001312375069 secs [16:32:24] 6operations, 10Analytics-Cluster, 10hardware-requests: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1417892 (10RobH) 3NEW a:3RobH [16:32:24] RECOVERY - NTP on mw1167 is OK: NTP OK: Offset -0.000550866127 secs [16:32:24] RECOVERY - NTP on mw1152 is OK: NTP OK: Offset -0.0007272958755 secs [16:32:25] RECOVERY - NTP on mw1223 is OK: NTP OK: Offset -0.0006641149521 secs [16:32:25] RECOVERY - NTP on mw1229 is OK: NTP OK: Offset -0.01036345959 secs [16:32:33] RECOVERY - NTP on mw1219 is OK: NTP OK: Offset -0.0001889467239 secs [16:32:33] RECOVERY - NTP on ganeti1004 is OK: NTP OK: Offset -0.003890156746 secs [16:32:33] RECOVERY - NTP on iridium is OK: NTP OK: 
Offset -0.002086281776 secs [16:32:33] RECOVERY - NTP on mw1001 is OK: NTP OK: Offset -0.0004976987839 secs [16:32:33] RECOVERY - NTP on mw1186 is OK: NTP OK: Offset -0.0006396770477 secs [16:32:34] RECOVERY - NTP on mw1241 is OK: NTP OK: Offset -0.002270936966 secs [16:32:34] RECOVERY - NTP on mw1225 is OK: NTP OK: Offset -0.01249217987 secs [16:32:35] RECOVERY - NTP on mw1016 is OK: NTP OK: Offset -0.001110434532 secs [16:32:35] RECOVERY - NTP on db1033 is OK: NTP OK: Offset -0.002055168152 secs [16:32:36] 6operations, 10ops-eqiad, 10Analytics-Cluster: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1417892 (10RobH) [16:32:39] ottomata: probably some job that just happened to run right while I was enabling ntpds everywhere [16:32:43] RECOVERY - NTP on analytics1027 is OK: NTP OK: Offset -0.01194453239 secs [16:32:44] RECOVERY - NTP on calcium is OK: NTP OK: Offset -0.0004185438156 secs [16:32:49] (03PS3) 10Andrew Bogott: Tidy up firewall rules for puppetmaster and salt [puppet] - 10https://gerrit.wikimedia.org/r/214085 [16:32:53] RECOVERY - NTP on mw1108 is OK: NTP OK: Offset -0.0004431009293 secs [16:32:53] RECOVERY - NTP on mw1022 is OK: NTP OK: Offset -0.002146363258 secs [16:32:53] RECOVERY - NTP on dbstore1002 is OK: NTP OK: Offset -0.0009336471558 secs [16:32:54] RECOVERY - NTP on wtp1007 is OK: NTP OK: Offset -0.00341117382 secs [16:33:03] RECOVERY - NTP on cp1067 is OK: NTP OK: Offset -0.002996683121 secs [16:33:03] RECOVERY - NTP on labvirt1002 is OK: NTP OK: Offset -0.0008926391602 secs [16:33:13] RECOVERY - NTP on mw1033 is OK: NTP OK: Offset -0.008304476738 secs [16:33:13] RECOVERY - NTP on db1061 is OK: NTP OK: Offset -0.001038908958 secs [16:33:14] RECOVERY - NTP on mw1105 is OK: NTP OK: Offset -0.001437544823 secs [16:33:14] RECOVERY - NTP on ms-be1002 is OK: NTP OK: Offset -0.0002417564392 secs [16:33:14] RECOVERY - NTP on mw1142 is OK: NTP OK: Offset -0.00308740139 secs [16:33:14] RECOVERY - NTP on mc1018 is OK: NTP OK: 
Offset -0.0009223222733 secs [16:33:23] RECOVERY - NTP on labvirt1008 is OK: NTP OK: Offset -0.0004470348358 secs [16:33:23] RECOVERY - NTP on analytics1018 is OK: NTP OK: Offset -0.0006833076477 secs [16:33:23] RECOVERY - NTP on db1011 is OK: NTP OK: Offset -0.001519560814 secs [16:33:24] RECOVERY - NTP on mw1220 is OK: NTP OK: Offset -0.002863168716 secs [16:33:25] RECOVERY - NTP on es1010 is OK: NTP OK: Offset -0.002200961113 secs [16:33:25] RECOVERY - NTP on mw1071 is OK: NTP OK: Offset -0.008047699928 secs [16:33:25] RECOVERY - NTP on iodine is OK: NTP OK: Offset -0.002402067184 secs [16:33:28] 6operations, 10ops-eqiad, 10Analytics-Cluster: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1417892 (10RobH) Last instruction from @ottomata is to place 4 in rack d2-eqiad for now, the location of the remainder still need to be determined. [16:33:33] RECOVERY - NTP on oxygen is OK: NTP OK: Offset -0.009523749352 secs [16:33:33] RECOVERY - NTP on terbium is OK: NTP OK: Offset -0.01392459869 secs [16:33:44] RECOVERY - NTP on fluorine is OK: NTP OK: Offset -0.005782723427 secs [16:33:44] RECOVERY - NTP on mw1107 is OK: NTP OK: Offset -0.002712249756 secs [16:33:44] RECOVERY - NTP on mw1093 is OK: NTP OK: Offset -0.009152770042 secs [16:33:45] RECOVERY - NTP on mw1209 is OK: NTP OK: Offset -0.0009900331497 secs [16:33:45] RECOVERY - NTP on cp1053 is OK: NTP OK: Offset -0.001731872559 secs [16:33:45] RECOVERY - NTP on cp1068 is OK: NTP OK: Offset -0.007169485092 secs [16:33:53] RECOVERY - NTP on db1037 is OK: NTP OK: Offset -0.001579999924 secs [16:33:53] RECOVERY - NTP on db1045 is OK: NTP OK: Offset -0.002391219139 secs [16:33:53] RECOVERY - NTP on radon is OK: NTP OK: Offset -0.002947449684 secs [16:33:54] RECOVERY - NTP on wtp1010 is OK: NTP OK: Offset -0.0003652572632 secs [16:33:54] RECOVERY - NTP on mw1086 is OK: NTP OK: Offset -0.0005185604095 secs [16:33:54] RECOVERY - NTP on db1072 is OK: NTP OK: Offset -0.005231142044 secs [16:34:03] 
RECOVERY - NTP on lvs1003 is OK: NTP OK: Offset -0.001084804535 secs [16:34:03] RECOVERY - NTP on cp1054 is OK: NTP OK: Offset -0.0004869699478 secs [16:34:04] RECOVERY - NTP on mw1231 is OK: NTP OK: Offset -0.0005114078522 secs [16:34:04] RECOVERY - NTP on labvirt1006 is OK: NTP OK: Offset -0.000660777092 secs [16:34:05] RECOVERY - NTP on mc1017 is OK: NTP OK: Offset -0.01558017731 secs [16:34:14] RECOVERY - NTP on mw1204 is OK: NTP OK: Offset -0.0005584955215 secs [16:34:14] RECOVERY - NTP on mw1110 is OK: NTP OK: Offset -0.003643870354 secs [16:34:14] RECOVERY - NTP on copper is OK: NTP OK: Offset -0.001644492149 secs [16:34:14] RECOVERY - NTP on mw1090 is OK: NTP OK: Offset -0.001386880875 secs [16:34:14] RECOVERY - NTP on ms-fe1002 is OK: NTP OK: Offset -0.002024650574 secs [16:34:23] RECOVERY - NTP on mw1131 is OK: NTP OK: Offset -0.04274296761 secs [16:34:24] RECOVERY - NTP on analytics1028 is OK: NTP OK: Offset -0.003916144371 secs [16:34:24] RECOVERY - NTP on mw1091 is OK: NTP OK: Offset -0.03667151928 secs [16:34:24] RECOVERY - NTP on mw1112 is OK: NTP OK: Offset -0.02269566059 secs [16:34:24] RECOVERY - NTP on mw1143 is OK: NTP OK: Offset -0.006956219673 secs [16:34:24] RECOVERY - NTP on mw1066 is OK: NTP OK: Offset -0.001264452934 secs [16:34:25] RECOVERY - NTP on mw1203 is OK: NTP OK: Offset -0.004212617874 secs [16:34:34] RECOVERY - NTP on labcontrol1001 is OK: NTP OK: Offset -0.00557410717 secs [16:34:34] RECOVERY - NTP on ganeti1002 is OK: NTP OK: Offset -0.001026988029 secs [16:34:34] RECOVERY - NTP on mw1027 is OK: NTP OK: Offset -0.00250184536 secs [16:34:39] (03PS9) 10Alexandros Kosiaris: Add new_wmf_service.py and examples [puppet] - 10https://gerrit.wikimedia.org/r/217548 (https://phabricator.wikimedia.org/T97036) [16:34:43] RECOVERY - NTP on virt1008 is OK: NTP OK: Offset -0.001516819 secs [16:34:43] RECOVERY - NTP on bast1001 is OK: NTP OK: Offset -0.00209748745 secs [16:34:43] RECOVERY - NTP on wtp1001 is OK: NTP OK: Offset -0.001080393791 
secs [16:34:44] RECOVERY - NTP on db1035 is OK: NTP OK: Offset -0.001782774925 secs [16:34:45] RECOVERY - NTP on labvirt1004 is OK: NTP OK: Offset -0.002964615822 secs [16:34:47] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add new_wmf_service.py and examples [puppet] - 10https://gerrit.wikimedia.org/r/217548 (https://phabricator.wikimedia.org/T97036) (owner: 10Alexandros Kosiaris) [16:34:54] RECOVERY - NTP on mw1236 is OK: NTP OK: Offset -0.004598855972 secs [16:34:54] RECOVERY - NTP on strontium is OK: NTP OK: Offset -0.004472017288 secs [16:34:55] RECOVERY - NTP on mw1021 is OK: NTP OK: Offset -0.008628249168 secs [16:34:55] RECOVERY - NTP on mw1215 is OK: NTP OK: Offset -0.008789539337 secs [16:34:55] RECOVERY - NTP on mw1064 is OK: NTP OK: Offset -0.01474893093 secs [16:34:57] (03CR) 10Alexandros Kosiaris: "Great! thanks. merging" [puppet] - 10https://gerrit.wikimedia.org/r/217548 (https://phabricator.wikimedia.org/T97036) (owner: 10Alexandros Kosiaris) [16:35:03] RECOVERY - NTP on mw1253 is OK: NTP OK: Offset -0.003025889397 secs [16:35:04] RECOVERY - NTP on elastic1025 is OK: NTP OK: Offset -0.0006932020187 secs [16:35:04] RECOVERY - NTP on palladium is OK: NTP OK: Offset -0.005411982536 secs [16:35:04] RECOVERY - NTP on wtp1008 is OK: NTP OK: Offset -0.001429200172 secs [16:35:14] 6operations, 10Analytics-Cluster, 10hardware-requests: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1417920 (10RobH) 5Open>3Resolved I've created T104463 for the racking of these systems. As this hardware request has been completed, I'm resolving this task. 
[16:35:14] RECOVERY - NTP on mw1113 is OK: NTP OK: Offset -0.001735448837 secs [16:35:23] RECOVERY - NTP on radium is OK: NTP OK: Offset -0.008670330048 secs [16:35:23] RECOVERY - NTP on sca1001 is OK: NTP OK: Offset -0.009868383408 secs [16:35:24] RECOVERY - NTP on mw1207 is OK: NTP OK: Offset -0.003614664078 secs [16:35:24] RECOVERY - NTP on db1049 is OK: NTP OK: Offset -0.004130005836 secs [16:35:24] RECOVERY - NTP on db1027 is OK: NTP OK: Offset -0.01614308357 secs [16:35:24] RECOVERY - NTP on elastic1029 is OK: NTP OK: Offset -0.009181499481 secs [16:35:24] RECOVERY - NTP on db1056 is OK: NTP OK: Offset -0.001588225365 secs [16:35:25] RECOVERY - NTP on tmh1001 is OK: NTP OK: Offset -0.003202319145 secs [16:35:25] RECOVERY - NTP on mw1193 is OK: NTP OK: Offset -0.001408219337 secs [16:35:26] RECOVERY - NTP on analytics1012 is OK: NTP OK: Offset -0.002819657326 secs [16:35:26] RECOVERY - NTP on cp1064 is OK: NTP OK: Offset -0.01925110817 secs [16:35:33] RECOVERY - NTP on cp1066 is OK: NTP OK: Offset -0.001397252083 secs [16:35:34] RECOVERY - NTP on mw1104 is OK: NTP OK: Offset -0.002663254738 secs [16:35:34] RECOVERY - NTP on mw1073 is OK: NTP OK: Offset -0.001150608063 secs [16:35:43] RECOVERY - NTP on mw1158 is OK: NTP OK: Offset -0.007160186768 secs [16:35:44] RECOVERY - NTP on cp1045 is OK: NTP OK: Offset -0.001378178596 secs [16:35:45] RECOVERY - NTP on mw1037 is OK: NTP OK: Offset -0.0008081197739 secs [16:35:45] RECOVERY - NTP on mw1255 is OK: NTP OK: Offset -0.000669836998 secs [16:35:55] RECOVERY - NTP on labvirt1009 is OK: NTP OK: Offset -0.001486897469 secs [16:36:04] RECOVERY - NTP on mw1135 is OK: NTP OK: Offset -0.002450227737 secs [16:36:04] RECOVERY - NTP on mw1047 is OK: NTP OK: Offset -0.001754760742 secs [16:36:04] RECOVERY - NTP on labsdb1002 is OK: NTP OK: Offset -0.00227022171 secs [16:36:04] RECOVERY - NTP on ms-be1010 is OK: NTP OK: Offset -0.0001248121262 secs [16:36:05] RECOVERY - NTP on mw1018 is OK: NTP OK: Offset -0.00380551815 secs 
[16:36:05] RECOVERY - NTP on mw1199 is OK: NTP OK: Offset -0.002362966537 secs [16:36:13] RECOVERY - NTP on es1004 is OK: NTP OK: Offset -0.001679182053 secs [16:36:14] RECOVERY - NTP on mc1015 is OK: NTP OK: Offset -0.003388166428 secs [16:36:15] RECOVERY - NTP on mw1058 is OK: NTP OK: Offset -0.006834149361 secs [16:36:23] RECOVERY - NTP on db1005 is OK: NTP OK: Offset -0.002541422844 secs [16:36:24] RECOVERY - NTP on mw1128 is OK: NTP OK: Offset -0.0008891820908 secs [16:36:33] RECOVERY - NTP on ms-be1014 is OK: NTP OK: Offset -0.00207901001 secs [16:36:34] RECOVERY - NTP on analytics1021 is OK: NTP OK: Offset -0.002482414246 secs [16:36:34] RECOVERY - NTP on mw1155 is OK: NTP OK: Offset -0.001065969467 secs [16:36:44] RECOVERY - NTP on db1068 is OK: NTP OK: Offset -0.001064419746 secs [16:36:44] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [16:36:53] RECOVERY - NTP on mw1179 is OK: NTP OK: Offset -0.002332091331 secs [16:36:53] RECOVERY - NTP on conf1002 is OK: NTP OK: Offset -0.001418232918 secs [16:36:54] RECOVERY - NTP on cp1057 is OK: NTP OK: Offset -0.005301594734 secs [16:36:54] RECOVERY - NTP on mw1103 is OK: NTP OK: Offset -0.001061797142 secs [16:36:56] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1417935 (10RobH) Can you guys paste the actual memory speed and capacity per stick? @papaul: please advise. 
[16:37:04] RECOVERY - NTP on elastic1010 is OK: NTP OK: Offset -0.002372264862 secs [16:37:04] RECOVERY - NTP on db1009 is OK: NTP OK: Offset -0.003207325935 secs [16:37:04] RECOVERY - NTP on mw1194 is OK: NTP OK: Offset -0.001133918762 secs [16:37:13] RECOVERY - NTP on etherpad1001 is OK: NTP OK: Offset -0.002820849419 secs [16:37:13] RECOVERY - NTP on virt1009 is OK: NTP OK: Offset -0.005522489548 secs [16:37:13] RECOVERY - NTP on wtp1014 is OK: NTP OK: Offset -0.004742026329 secs [16:37:13] RECOVERY - NTP on mw1020 is OK: NTP OK: Offset -0.004944801331 secs [16:37:13] RECOVERY - NTP on elastic1009 is OK: NTP OK: Offset -0.004372239113 secs [16:37:14] RECOVERY - NTP on elastic1013 is OK: NTP OK: Offset -0.03459250927 secs [16:37:14] RECOVERY - NTP on cp1051 is OK: NTP OK: Offset -0.02574539185 secs [16:37:23] RECOVERY - NTP on lvs1006 is OK: NTP OK: Offset -0.001153230667 secs [16:37:23] RECOVERY - NTP on es1003 is OK: NTP OK: Offset -0.004630208015 secs [16:37:23] RECOVERY - NTP on mw1095 is OK: NTP OK: Offset -0.0130366087 secs [16:37:24] RECOVERY - NTP on analytics1031 is OK: NTP OK: Offset 0.003938913345 secs [16:37:35] RECOVERY - NTP on mw1230 is OK: NTP OK: Offset -0.002185106277 secs [16:37:44] RECOVERY - NTP on mw1078 is OK: NTP OK: Offset -0.005714297295 secs [16:37:44] RECOVERY - NTP on cp1099 is OK: NTP OK: Offset -0.004661917686 secs [16:37:53] RECOVERY - NTP on dbproxy1003 is OK: NTP OK: Offset -0.003268361092 secs [16:37:54] RECOVERY - NTP on db1019 is OK: NTP OK: Offset -0.07040667534 secs [16:37:56] (03PS4) 10Andrew Bogott: Tidy up firewall rules for puppetmaster and salt [puppet] - 10https://gerrit.wikimedia.org/r/214085 [16:38:04] RECOVERY - NTP on labvirt1001 is OK: NTP OK: Offset -0.007595777512 secs [16:38:04] RECOVERY - NTP on mw1094 is OK: NTP OK: Offset -0.003930330276 secs [16:38:05] RECOVERY - NTP on elastic1016 is OK: NTP OK: Offset -0.0009027719498 secs [16:38:05] RECOVERY - NTP on mc1008 is OK: NTP OK: Offset -0.002230644226 secs 
[16:38:13] RECOVERY - NTP on analytics1034 is OK: NTP OK: Offset -0.003306031227 secs [16:38:14] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.012 second response time on port 9042 [16:38:14] RECOVERY - NTP on ocg1002 is OK: NTP OK: Offset -0.005418300629 secs [16:38:14] RECOVERY - NTP on mw1136 is OK: NTP OK: Offset -0.004929304123 secs [16:38:14] RECOVERY - NTP on mw1137 is OK: NTP OK: Offset -0.001686453819 secs [16:38:14] RECOVERY - NTP on mw1157 is OK: NTP OK: Offset -0.003100752831 secs [16:38:15] RECOVERY - NTP on db1024 is OK: NTP OK: Offset -0.003222346306 secs [16:38:15] RECOVERY - NTP on ms-be1005 is OK: NTP OK: Offset -0.01620960236 secs [16:38:23] RECOVERY - NTP on erbium is OK: NTP OK: Offset -0.002608060837 secs [16:38:24] RECOVERY - NTP on mw1075 is OK: NTP OK: Offset -0.0022149086 secs [16:38:24] RECOVERY - NTP on mw1070 is OK: NTP OK: Offset -0.001188635826 secs [16:38:24] RECOVERY - NTP on mw1191 is OK: NTP OK: Offset -0.0008674860001 secs [16:38:33] RECOVERY - NTP on tmh1002 is OK: NTP OK: Offset -0.006461143494 secs [16:38:33] RECOVERY - NTP on mw1085 is OK: NTP OK: Offset -0.00485432148 secs [16:38:33] RECOVERY - NTP on stat1001 is OK: NTP OK: Offset -0.001200318336 secs [16:38:34] RECOVERY - NTP on mw1214 is OK: NTP OK: Offset -0.001250743866 secs [16:38:34] RECOVERY - NTP on mw1169 is OK: NTP OK: Offset -0.001501083374 secs [16:38:34] RECOVERY - NTP on cp1073 is OK: NTP OK: Offset -0.001848101616 secs [16:38:34] RECOVERY - NTP on db1041 is OK: NTP OK: Offset -0.001279473305 secs [16:38:43] RECOVERY - NTP on db1058 is OK: NTP OK: Offset -0.003944158554 secs [16:38:43] RECOVERY - NTP on mw1096 is OK: NTP OK: Offset -0.003859400749 secs [16:38:43] RECOVERY - NTP on labsdb1007 is OK: NTP OK: Offset -0.001236319542 secs [16:38:44] RECOVERY - NTP on mw1257 is OK: NTP OK: Offset -0.00105202198 secs [16:38:44] RECOVERY - NTP on mw1232 is OK: NTP OK: Offset -0.003422021866 secs [16:38:53] RECOVERY - NTP on zirconium is OK: 
NTP OK: Offset -0.001768827438 secs [16:38:54] RECOVERY - NTP on cp1069 is OK: NTP OK: Offset -0.001316785812 secs [16:38:54] RECOVERY - NTP on mw1182 is OK: NTP OK: Offset -0.00179541111 secs [16:38:54] RECOVERY - NTP on ms-be1016 is OK: NTP OK: Offset -0.001677513123 secs [16:38:54] RECOVERY - NTP on mw1234 is OK: NTP OK: Offset -0.00249004364 secs [16:38:54] RECOVERY - NTP on mw1083 is OK: NTP OK: Offset -0.009271502495 secs [16:38:55] RECOVERY - NTP on mw1101 is OK: NTP OK: Offset -0.000922203064 secs [16:38:55] RECOVERY - NTP on mw1184 is OK: NTP OK: Offset -0.0009409189224 secs [16:39:03] RECOVERY - NTP on cp1074 is OK: NTP OK: Offset -0.00698018074 secs [16:39:03] RECOVERY - NTP on mw1127 is OK: NTP OK: Offset -0.002390265465 secs [16:39:03] RECOVERY - NTP on mw1017 is OK: NTP OK: Offset -0.0007381439209 secs [16:39:04] RECOVERY - NTP on cp1065 is OK: NTP OK: Offset -0.001659870148 secs [16:39:04] RECOVERY - NTP on virt1002 is OK: NTP OK: Offset -0.005825638771 secs [16:39:14] RECOVERY - NTP on analytics1039 is OK: NTP OK: Offset 0.00656080246 secs [16:39:23] RECOVERY - NTP on cp1043 is OK: NTP OK: Offset -0.0007544755936 secs [16:39:24] RECOVERY - NTP on es1005 is OK: NTP OK: Offset -0.00164270401 secs [16:39:34] RECOVERY - NTP on mw1036 is OK: NTP OK: Offset -0.001263618469 secs [16:39:34] RECOVERY - NTP on mc1010 is OK: NTP OK: Offset -0.001508593559 secs [16:39:34] RECOVERY - NTP on mw1102 is OK: NTP OK: Offset -0.002230286598 secs [16:39:34] RECOVERY - NTP on mw1252 is OK: NTP OK: Offset -0.01206839085 secs [16:39:34] RECOVERY - NTP on mw1216 is OK: NTP OK: Offset -0.003001332283 secs [16:39:35] RECOVERY - NTP on mw1161 is OK: NTP OK: Offset -0.007133603096 secs [16:39:35] RECOVERY - NTP on ytterbium is OK: NTP OK: Offset -0.04030907154 secs [16:39:39] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1417943 (10RobH) Nevermind, system is online so I can 
poll that via software: description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns) product... [16:39:43] RECOVERY - NTP on mw1013 is OK: NTP OK: Offset -0.001437425613 secs [16:39:44] RECOVERY - NTP on db1010 is OK: NTP OK: Offset -0.004597783089 secs [16:39:44] RECOVERY - NTP on cp1059 is OK: NTP OK: Offset -0.00511610508 secs [16:39:53] RECOVERY - NTP on cp1072 is OK: NTP OK: Offset -0.002369046211 secs [16:39:54] RECOVERY - NTP on es1009 is OK: NTP OK: Offset -0.002643346786 secs [16:39:54] RECOVERY - NTP on mw1246 is OK: NTP OK: Offset -0.01255702972 secs [16:40:03] 6operations, 7Service-Architecture: Create a nagios check script that can monitor multiple endpoints based on what the service exposes - https://phabricator.wikimedia.org/T94831#1417947 (10akosiaris) [16:40:04] RECOVERY - NTP on labsdb1001 is OK: NTP OK: Offset -0.004688620567 secs [16:40:04] RECOVERY - NTP on mw1218 is OK: NTP OK: Offset -0.003974556923 secs [16:40:06] 6operations, 5Patch-For-Review: Define and implement an automated process to ease the introduction of a new service into production - https://phabricator.wikimedia.org/T97036#1417944 (10akosiaris) 5Open>3Resolved a:3akosiaris With https://gerrit.wikimedia.org/r/217548 merged, the steps outlined in https:... 
[16:40:09] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1417949 (10akosiaris) [16:40:13] RECOVERY - NTP on mw1062 is OK: NTP OK: Offset -0.005212306976 secs [16:40:13] RECOVERY - NTP on analytics1019 is OK: NTP OK: Offset -0.001150250435 secs [16:40:14] RECOVERY - NTP on cp1049 is OK: NTP OK: Offset -0.001236557961 secs [16:40:14] RECOVERY - NTP on mw1138 is OK: NTP OK: Offset -0.002618074417 secs [16:40:23] RECOVERY - NTP on mw1109 is OK: NTP OK: Offset -0.0008511543274 secs [16:40:23] RECOVERY - NTP on mw1196 is OK: NTP OK: Offset -0.005522489548 secs [16:40:23] RECOVERY - NTP on mw1147 is OK: NTP OK: Offset -0.06647348404 secs [16:40:24] RECOVERY - NTP on mw1130 is OK: NTP OK: Offset -0.06053757668 secs [16:40:25] RECOVERY - NTP on mw1035 is OK: NTP OK: Offset -0.002254962921 secs [16:40:25] RECOVERY - NTP on mw1124 is OK: NTP OK: Offset -0.01219832897 secs [16:40:33] RECOVERY - NTP on mw1244 is OK: NTP OK: Offset -0.0002129077911 secs [16:40:33] RECOVERY - NTP on mw1221 is OK: NTP OK: Offset -0.001354813576 secs [16:40:33] RECOVERY - NTP on mw1245 is OK: NTP OK: Offset -0.001972794533 secs [16:40:34] RECOVERY - NTP on mw1240 is OK: NTP OK: Offset -0.002767562866 secs [16:40:34] RECOVERY - NTP on wtp1019 is OK: NTP OK: Offset -0.0009609460831 secs [16:40:34] RECOVERY - NTP on wtp1017 is OK: NTP OK: Offset -0.004622578621 secs [16:40:43] RECOVERY - NTP on wtp1009 is OK: NTP OK: Offset -0.00214600563 secs [16:40:43] RECOVERY - NTP on pc1001 is OK: NTP OK: Offset -0.005127429962 secs [16:40:44] RECOVERY - NTP on logstash1003 is OK: NTP OK: Offset -0.001406908035 secs [16:40:44] RECOVERY - NTP on dbstore1001 is OK: NTP OK: Offset -0.002054095268 secs [16:40:44] RECOVERY - NTP on wtp1021 is OK: NTP OK: Offset -0.001634597778 secs [16:40:44] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.001941919327 secs [16:40:53] RECOVERY - NTP on mc1004 
is OK: NTP OK: Offset -0.007036685944 secs [16:40:53] RECOVERY - NTP on wtp1024 is OK: NTP OK: Offset -0.006436347961 secs [16:40:54] RECOVERY - NTP on analytics1015 is OK: NTP OK: Offset -0.003606319427 secs [16:40:54] RECOVERY - NTP on mw1132 is OK: NTP OK: Offset -0.007244467735 secs [16:41:04] RECOVERY - NTP on uranium is OK: NTP OK: Offset -0.01427316666 secs [16:41:04] RECOVERY - NTP on mw1028 is OK: NTP OK: Offset -0.002489209175 secs [16:41:04] RECOVERY - NTP on mw1072 is OK: NTP OK: Offset -0.0001975297928 secs [16:41:18] RECOVERY - NTP on ocg1001 is OK: NTP OK: Offset -0.0002876520157 secs [16:41:18] RECOVERY - NTP on ms-be1017 is OK: NTP OK: Offset -0.001435637474 secs [16:41:18] RECOVERY - NTP on ms-fe1003 is OK: NTP OK: Offset -0.007218241692 secs [16:41:18] RECOVERY - NTP on mc1009 is OK: NTP OK: Offset -0.001282930374 secs [16:41:24] RECOVERY - NTP on cp1070 is OK: NTP OK: Offset -0.006636857986 secs [16:41:24] RECOVERY - NTP on db1065 is OK: NTP OK: Offset -0.0001963376999 secs [16:41:24] RECOVERY - NTP on mw1256 is OK: NTP OK: Offset -0.00636446476 secs [16:41:25] RECOVERY - NTP on logstash1001 is OK: NTP OK: Offset -0.002305150032 secs [16:41:25] RECOVERY - NTP on rdb1004 is OK: NTP OK: Offset -0.001422166824 secs [16:41:25] RECOVERY - NTP on mw1038 is OK: NTP OK: Offset -0.003637433052 secs [16:41:33] RECOVERY - NTP on caesium is OK: NTP OK: Offset -0.001136422157 secs [16:41:33] RECOVERY - NTP on analytics1036 is OK: NTP OK: Offset -0.00104367733 secs [16:41:34] RECOVERY - NTP on mw1178 is OK: NTP OK: Offset -0.0006648302078 secs [16:41:34] RECOVERY - NTP on mw1040 is OK: NTP OK: Offset -0.004734992981 secs [16:41:34] RECOVERY - NTP on mw1063 is OK: NTP OK: Offset -0.003010034561 secs [16:41:34] RECOVERY - NTP on mw1067 is OK: NTP OK: Offset -0.01349496841 secs [16:41:35] RECOVERY - NTP on mw1045 is OK: NTP OK: Offset -0.002891182899 secs [16:41:43] RECOVERY - NTP on mw1134 is OK: NTP OK: Offset -0.001237511635 secs [16:41:44] RECOVERY - NTP on 
mw1233 is OK: NTP OK: Offset -0.0010201931 secs [16:41:44] RECOVERY - NTP on californium is OK: NTP OK: Offset -0.0138181448 secs [16:41:44] RECOVERY - NTP on dbproxy1002 is OK: NTP OK: Offset -0.001616001129 secs [16:41:44] RECOVERY - NTP on mc1011 is OK: NTP OK: Offset -0.003567218781 secs [16:41:45] RECOVERY - NTP on mw1006 is OK: NTP OK: Offset -0.00709092617 secs [16:41:53] RECOVERY - NTP on analytics1033 is OK: NTP OK: Offset -0.0006150007248 secs [16:41:54] RECOVERY - NTP on wtp1006 is OK: NTP OK: Offset -0.00935781002 secs [16:41:54] RECOVERY - NTP on carbon is OK: NTP OK: Offset -0.002880573273 secs [16:41:54] RECOVERY - NTP on magnesium is OK: NTP OK: Offset -0.003981947899 secs [16:41:55] RECOVERY - NTP on ms-be1004 is OK: NTP OK: Offset -0.005969405174 secs [16:41:55] RECOVERY - NTP on db1029 is OK: NTP OK: Offset -0.00770843029 secs [16:42:03] RECOVERY - NTP on ms-be1013 is OK: NTP OK: Offset -0.001005649567 secs [16:42:04] RECOVERY - NTP on curium is OK: NTP OK: Offset -0.002894043922 secs [16:42:04] RECOVERY - NTP on eventlog1001 is OK: NTP OK: Offset -0.006514310837 secs [16:42:04] RECOVERY - NTP on mw1005 is OK: NTP OK: Offset -0.004467010498 secs [16:42:04] RECOVERY - NTP on mw1048 is OK: NTP OK: Offset -0.0007736682892 secs [16:42:04] RECOVERY - NTP on mw1031 is OK: NTP OK: Offset -0.001048922539 secs [16:42:05] RECOVERY - NTP on mw1200 is OK: NTP OK: Offset -0.007073521614 secs [16:42:05] RECOVERY - NTP on analytics1017 is OK: NTP OK: Offset -0.00159907341 secs [16:42:13] RECOVERY - NTP on es1001 is OK: NTP OK: Offset -0.01061105728 secs [16:42:13] RECOVERY - NTP on mw1145 is OK: NTP OK: Offset -0.001974344254 secs [16:42:14] RECOVERY - NTP on es1008 is OK: NTP OK: Offset -0.0007764101028 secs [16:42:14] RECOVERY - NTP on mw1187 is OK: NTP OK: Offset -0.001888990402 secs [16:42:14] RECOVERY - NTP on labnodepool1001 is OK: NTP OK: Offset -0.002555966377 secs [16:42:23] RECOVERY - NTP on netmon1001 is OK: NTP OK: Offset -0.0007112026215 secs 
[16:42:23] RECOVERY - NTP on analytics1041 is OK: NTP OK: Offset -0.000804901123 secs [16:42:23] RECOVERY - NTP on mw1089 is OK: NTP OK: Offset -0.0007537603378 secs [16:42:23] RECOVERY - NTP on mw1140 is OK: NTP OK: Offset -0.002093195915 secs [16:42:23] RECOVERY - NTP on mw1197 is OK: NTP OK: Offset -0.003682017326 secs [16:42:24] RECOVERY - NTP on lead is OK: NTP OK: Offset -0.002997517586 secs [16:42:24] RECOVERY - NTP on mw1115 is OK: NTP OK: Offset -0.001136541367 secs [16:42:25] RECOVERY - NTP on potassium is OK: NTP OK: Offset -0.005947828293 secs [16:42:25] RECOVERY - NTP on wtp1020 is OK: NTP OK: Offset -0.008866429329 secs [16:42:26] RECOVERY - NTP on neptunium is OK: NTP OK: Offset -0.01096439362 secs [16:42:26] RECOVERY - NTP on mw1080 is OK: NTP OK: Offset -0.005418300629 secs [16:42:33] RECOVERY - NTP on mw1007 is OK: NTP OK: Offset -0.00135076046 secs [16:42:33] RECOVERY - NTP on db1031 is OK: NTP OK: Offset -0.002834439278 secs [16:42:34] RECOVERY - NTP on mw1012 is OK: NTP OK: Offset -0.001402378082 secs [16:42:34] RECOVERY - NTP on mw1174 is OK: NTP OK: Offset -0.002866268158 secs [16:42:44] RECOVERY - NTP on mw1041 is OK: NTP OK: Offset -0.003098845482 secs [16:42:44] RECOVERY - NTP on helium is OK: NTP OK: Offset -0.001351833344 secs [16:42:44] RECOVERY - NTP on mw1026 is OK: NTP OK: Offset -0.002313137054 secs [16:42:54] RECOVERY - NTP on ms-fe1001 is OK: NTP OK: Offset -0.002969503403 secs [16:42:54] RECOVERY - NTP on mc1016 is OK: NTP OK: Offset -0.0006060600281 secs [16:42:54] RECOVERY - NTP on mw1141 is OK: NTP OK: Offset -0.002474546432 secs [16:43:04] RECOVERY - NTP on dbproxy1008 is OK: NTP OK: Offset -0.001216888428 secs [16:43:04] RECOVERY - NTP on mw1059 is OK: NTP OK: Offset -0.0225982666 secs [16:43:12] stupid time [16:43:13] RECOVERY - NTP on db1053 is OK: NTP OK: Offset -0.001372098923 secs [16:43:13] RECOVERY - NTP on mw1224 is OK: NTP OK: Offset -0.004678845406 secs [16:43:14] RECOVERY - NTP on cp1047 is OK: NTP OK: Offset 
-0.001550316811 secs [16:43:14] RECOVERY - NTP on cp1055 is OK: NTP OK: Offset -0.008867025375 secs [16:43:20] (03PS1) 10Yuvipanda: labstore: Move replica_addusers into its own puppet class [puppet] - 10https://gerrit.wikimedia.org/r/222140 [16:43:23] RECOVERY - NTP on ms-be1006 is OK: NTP OK: Offset -0.006878733635 secs [16:43:23] RECOVERY - NTP on db1073 is OK: NTP OK: Offset -0.002737760544 secs [16:43:23] RECOVERY - NTP on ms-be1003 is OK: NTP OK: Offset -0.002052664757 secs [16:43:24] RECOVERY - NTP on analytics1020 is OK: NTP OK: Offset -0.001741170883 secs [16:43:24] RECOVERY - NTP on mw1242 is OK: NTP OK: Offset -0.005863070488 secs [16:43:34] RECOVERY - NTP on db1059 is OK: NTP OK: Offset -0.007017731667 secs [16:43:34] RECOVERY - NTP on mw1106 is OK: NTP OK: Offset -0.00150179863 secs [16:43:43] RECOVERY - NTP on mw1254 is OK: NTP OK: Offset -0.0009307861328 secs [16:43:43] RECOVERY - NTP on mw1226 is OK: NTP OK: Offset -0.003175616264 secs [16:43:43] RECOVERY - NTP on labstore1003 is OK: NTP OK: Offset -0.00209069252 secs [16:43:43] (03PS2) 10Yuvipanda: labstore: Move replica_addusers into its own puppet class [puppet] - 10https://gerrit.wikimedia.org/r/222140 [16:43:44] RECOVERY - NTP on mc1006 is OK: NTP OK: Offset -0.001480221748 secs [16:43:44] RECOVERY - NTP on mw1205 is OK: NTP OK: Offset -0.0004824399948 secs [16:43:44] RECOVERY - NTP on mw1082 is OK: NTP OK: Offset -0.001441478729 secs [16:43:44] RECOVERY - NTP on mw1176 is OK: NTP OK: Offset -0.004004001617 secs [16:43:53] RECOVERY - NTP on lvs1002 is OK: NTP OK: Offset -0.01368641853 secs [16:43:54] RECOVERY - NTP on mw1153 is OK: NTP OK: Offset -0.001013159752 secs [16:43:54] RECOVERY - NTP on mw1099 is OK: NTP OK: Offset -0.005381464958 secs [16:44:03] RECOVERY - NTP on logstash1004 is OK: NTP OK: Offset -0.002185463905 secs [16:44:03] RECOVERY - NTP on graphite1002 is OK: NTP OK: Offset -0.002325415611 secs [16:44:03] RECOVERY - NTP on iron is OK: NTP OK: Offset -0.003488779068 secs 
[16:44:03] RECOVERY - NTP on mw1088 is OK: NTP OK: Offset -0.00358068943 secs [16:44:04] RECOVERY - NTP on mw1060 is OK: NTP OK: Offset -0.001657605171 secs [16:44:04] RECOVERY - NTP on mw1250 is OK: NTP OK: Offset -0.01038694382 secs [16:44:04] RECOVERY - NTP on mw1222 is OK: NTP OK: Offset -0.0005540847778 secs [16:44:14] RECOVERY - NTP on mw1069 is OK: NTP OK: Offset -0.0008407831192 secs [16:44:14] RECOVERY - NTP on db1046 is OK: NTP OK: Offset -0.001643180847 secs [16:44:14] RECOVERY - NTP on mw1117 is OK: NTP OK: Offset -0.004632353783 secs [16:44:14] RECOVERY - NTP on mw1228 is OK: NTP OK: Offset -0.02103376389 secs [16:44:14] RECOVERY - NTP on analytics1035 is OK: NTP OK: Offset -0.002432227135 secs [16:44:15] RECOVERY - NTP on mw1120 is OK: NTP OK: Offset -0.001443624496 secs [16:44:23] RECOVERY - NTP on mw1046 is OK: NTP OK: Offset -0.002529263496 secs [16:44:23] RECOVERY - NTP on mc1003 is OK: NTP OK: Offset -0.001053929329 secs [16:44:23] RECOVERY - NTP on mw1160 is OK: NTP OK: Offset -0.001102089882 secs [16:44:24] RECOVERY - NTP on db1002 is OK: NTP OK: Offset -0.0111374855 secs [16:44:32] (03CR) 10Yuvipanda: [C: 032] labstore: Move replica_addusers into its own puppet class [puppet] - 10https://gerrit.wikimedia.org/r/222140 (owner: 10Yuvipanda) [16:44:33] RECOVERY - NTP on db1050 is OK: NTP OK: Offset -0.0006308555603 secs [16:44:33] RECOVERY - NTP on ruthenium is OK: NTP OK: Offset 0.0001995563507 secs [16:44:34] RECOVERY - NTP on sca1002 is OK: NTP OK: Offset -0.001475811005 secs [16:44:34] RECOVERY - NTP on ms-fe1004 is OK: NTP OK: Offset -0.007134914398 secs [16:44:34] RECOVERY - NTP on analytics1040 is OK: NTP OK: Offset 0.006186485291 secs [16:44:34] RECOVERY - NTP on wtp1016 is OK: NTP OK: Offset -0.004908204079 secs [16:44:34] RECOVERY - NTP on cp1061 is OK: NTP OK: Offset -0.002898335457 secs [16:44:35] RECOVERY - NTP on snapshot1003 is OK: NTP OK: Offset -0.003803372383 secs [16:44:43] RECOVERY - NTP on db1066 is OK: NTP OK: Offset 
-0.01090800762 secs [16:44:44] RECOVERY - NTP on holmium is OK: NTP OK: Offset -0.002668023109 secs [16:44:44] RECOVERY - NTP on mw1009 is OK: NTP OK: Offset -0.0004818439484 secs [16:44:44] RECOVERY - NTP on mw1173 is OK: NTP OK: Offset -0.0394308567 secs [16:44:54] RECOVERY - NTP on lvs1005 is OK: NTP OK: Offset -0.00146651268 secs [16:44:54] RECOVERY - NTP on mw1217 is OK: NTP OK: Offset 7.700920105e-05 secs [16:45:00] (03PS1) 10Andrew Bogott: Make ferm list into an actual list. [puppet] - 10https://gerrit.wikimedia.org/r/222141 [16:45:03] RECOVERY - NTP on mc1002 is OK: NTP OK: Offset -0.003002285957 secs [16:45:13] RECOVERY - NTP on db1018 is OK: NTP OK: Offset -0.004452824593 secs [16:45:14] RECOVERY - NTP on rhodium is OK: NTP OK: Offset -0.001888751984 secs [16:45:15] RECOVERY - NTP on mw1068 is OK: NTP OK: Offset -0.002890467644 secs [16:45:15] RECOVERY - NTP on db1021 is OK: NTP OK: Offset -0.006146907806 secs [16:45:15] RECOVERY - NTP on mw1164 is OK: NTP OK: Offset -0.009780526161 secs [16:45:21] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1417965 (10RobH) Chris checked his spares stock and has: 4GB 1333MHz ECC Reg Memory 2Rx8 Hynix HMT351R7BFR8A-H9 * 4 [16:45:24] RECOVERY - NTP on mw1150 is OK: NTP OK: Offset -0.001768112183 secs [16:45:24] RECOVERY - NTP on mw1100 is OK: NTP OK: Offset -0.00621509552 secs [16:45:34] RECOVERY - NTP on db1028 is OK: NTP OK: Offset -0.001672029495 secs [16:45:34] RECOVERY - NTP on mw1092 is OK: NTP OK: Offset -0.001489162445 secs [16:45:34] RECOVERY - NTP on mw1118 is OK: NTP OK: Offset -0.001044154167 secs [16:45:34] RECOVERY - NTP on mw1123 is OK: NTP OK: Offset -0.006405115128 secs [16:45:34] RECOVERY - NTP on logstash1002 is OK: NTP OK: Offset -0.003506422043 secs [16:45:35] RECOVERY - NTP on mw1003 is OK: NTP OK: Offset -0.002251029015 secs [16:45:43] RECOVERY - NTP on labsdb1003 is OK: NTP OK: Offset 
-0.00149667263 secs [16:45:44] RECOVERY - NTP on db1067 is OK: NTP OK: Offset -0.00716817379 secs [16:45:44] RECOVERY - NTP on mw1008 is OK: NTP OK: Offset -0.008796811104 secs [16:45:44] morning folks. i'm prepping for puppet CA cert replacement, trying to figure out how to schedule downtime or at least avoid alerts. icinga UI won't allow me to do a range select on that service list, it won't let me do the whole list of 1000+ at once, and i don't want to click 1000 times. i looked at using the icinga-downtime CLI tool, but it only schedules downtime for entire hosts rather than single services. so icinga-wm may be as noisy about "puppet [16:45:44] RECOVERY - NTP on mw1235 is OK: NTP OK: Offset -0.0003613233566 secs [16:45:44] RECOVERY - NTP on analytics1030 is OK: NTP OK: Offset -0.003328442574 secs [16:45:45] RECOVERY - NTP on tin is OK: NTP OK: Offset -0.004762768745 secs [16:45:54] RECOVERY - NTP on mw1166 is OK: NTP OK: Offset -0.005560517311 secs [16:45:54] RECOVERY - NTP on db1040 is OK: NTP OK: Offset -0.003084897995 secs [16:45:54] RECOVERY - NTP on mw1170 is OK: NTP OK: Offset -0.001259088516 secs [16:45:54] RECOVERY - NTP on mw1175 is OK: NTP OK: Offset -0.0009546279907 secs [16:45:54] RECOVERY - NTP on mw1211 is OK: NTP OK: Offset -0.00106549263 secs [16:45:54] RECOVERY - NTP on logstash1006 is OK: NTP OK: Offset -0.0009192228317 secs [16:46:03] RECOVERY - NTP on mw1129 is OK: NTP OK: Offset -0.0005985498428 secs [16:46:03] RECOVERY - NTP on cp1056 is OK: NTP OK: Offset -0.0006116628647 secs [16:46:03] RECOVERY - NTP on mw1119 is OK: NTP OK: Offset -0.007730841637 secs [16:46:04] RECOVERY - NTP on mw1213 is OK: NTP OK: Offset -0.005434036255 secs [16:46:04] RECOVERY - NTP on mw1251 is OK: NTP OK: Offset -0.0003247261047 secs [16:46:04] RECOVERY - NTP on labvirt1003 is OK: NTP OK: Offset -0.0004297494888 secs [16:46:13] RECOVERY - NTP on mw1025 is OK: NTP OK: Offset -0.009835243225 secs [16:46:13] RECOVERY - NTP on mw1065 is OK: NTP OK: Offset 
-0.006102085114 secs [16:46:14] RECOVERY - NTP on db1015 is OK: NTP OK: Offset -0.03541469574 secs [16:46:14] RECOVERY - NTP on db1023 is OK: NTP OK: Offset -0.04152476788 secs [16:46:14] RECOVERY - NTP on db1042 is OK: NTP OK: Offset -0.03635597229 secs [16:46:14] RECOVERY - NTP on mw1042 is OK: NTP OK: Offset -0.03528618813 secs [16:46:14] RECOVERY - NTP on mw1061 is OK: NTP OK: Offset -0.0654040575 secs [16:46:17] jgage: icinga? noisy? [16:46:24] RECOVERY - NTP on mw1172 is OK: NTP OK: Offset -0.002964735031 secs [16:46:24] RECOVERY - NTP on dbproxy1001 is OK: NTP OK: Offset -0.0006375312805 secs [16:46:31] also, you cut off at "may be as noisy about "puppet" [16:46:33] RECOVERY - NTP on mw1052 is OK: NTP OK: Offset -0.0007737874985 secs [16:46:33] RECOVERY - NTP on mw1144 is OK: NTP OK: Offset -0.001039624214 secs [16:46:34] RECOVERY - NTP on db1051 is OK: NTP OK: Offset -0.003520965576 secs [16:46:44] RECOVERY - NTP on mw1177 is OK: NTP OK: Offset -0.006212234497 secs [16:46:45] RECOVERY - NTP on mw1039 is OK: NTP OK: Offset -0.0007020235062 secs [16:46:50] (03PS2) 10Andrew Bogott: Make ferm list into an actual list. [puppet] - 10https://gerrit.wikimedia.org/r/222141 [16:46:54] RECOVERY - NTP on mw1002 is OK: NTP OK: Offset -0.003620147705 secs [16:46:55] heh [16:47:04] RECOVERY - NTP on mw1054 is OK: NTP OK: Offset -0.004254937172 secs [16:47:04] RECOVERY - NTP on mw1011 is OK: NTP OK: Offset -0.002440810204 secs [16:47:15] RECOVERY - NTP on mw1126 is OK: NTP OK: Offset -0.005571961403 secs [16:47:33] and people wonder why we don't hold all conversations in this channel :P [16:48:39] oh cool check_puppetrun only warns after 2 hours. maybe icinga will say nothing! [16:48:46] (03CR) 10Andrew Bogott: [C: 032] Make ferm list into an actual list. 
[puppet] - 10https://gerrit.wikimedia.org/r/222141 (owner: 10Andrew Bogott) [16:48:49] <_joe_> jgage: the puppet runs will fail [16:48:54] <_joe_> the ones initiated by cron [16:48:59] yes [16:49:00] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1417974 (10jcrespo) This is what I can get from the OS of existing memories on the host: `DIMM DDR3 1333 MHz 8GB "Hynix Semiconductor"`, but I would wait for p... [16:49:05] <_joe_> so those will alarm [16:49:14] (03PS1) 10Yuvipanda: labstore: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/222142 [16:49:55] (03PS2) 10Yuvipanda: labstore: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/222142 [16:50:16] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/222142 (owner: 10Yuvipanda) [16:51:43] PROBLEM - puppet last run on labstore1001 is CRITICAL puppet fail [16:51:54] ^ fixed [16:52:23] Root help needed. nutcracker is broken on mw1077 and needs to be restarted [16:52:41] hm why don't those respect the -w 7200 [16:52:43] command[check_puppet_checkpuppetrun]=/usr/bin/sudo /usr/local/lib/nagios/plugins/check_puppetrun -w 7200 -c 14400 [16:53:00] bd808: i'll do it [16:53:07] thx jgage [16:53:52] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1417985 (10jcrespo) My main concern is having the same amount of memory on hosts within the same service and datacenter. If we can achieve that somehow, no mat... 
[16:53:53] bd808_: done [16:53:54] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [16:54:24] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [16:54:41] <_joe_> jgage: whenever there is a hard failure, we alarm [16:55:00] <_joe_> I mean we alarm if a) puppet didn't run b) puppet runs and fails [16:55:12] <_joe_> so if you run puppet agent --disable across the fleet [16:55:17] <_joe_> you might avoid errors [16:55:21] <_joe_> err, alerts [16:55:53] <_joe_> jgage: ping me when certs have been re-signed, I'll need to do an etcd rolling restart [16:56:06] _joe_ ok [16:56:25] disabling the agent is the first step, so hopefully that will avoid alerts [16:56:28] bd808_: is that nutcracker issue tracked anywhere btw? [16:56:41] godog: not that I know of [16:57:13] _joe_ shouldn't etcd be restarted because it will get service notifications when the files change? or did you disable that? [16:57:21] <_joe_> I did [16:57:23] k [16:57:25] <_joe_> disable that [16:57:40] <_joe_> anyways, you'll be running your commands with -b, right? [16:57:59] <_joe_> (I'm not sure all hosts have salt running correctly btw) [16:59:34] yeah i am using -b 200 [16:59:46] oh salt not running correctly will be fun [16:59:52] any idea which hosts those are? 
[17:00:00] <_joe_> jgage: nope [17:00:07] i'm about to find out :D [17:00:24] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [17:00:48] <_joe_> jgage: trying to find out now [17:00:56] thanks [17:01:44] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.007 second response time on port 9042 [17:02:03] <_joe_> jgage: you might get lucky and it will work [17:02:05] <_joe_> :) [17:02:09] bd808: kk, yeah it doesn't seem to happen that often, still annoying :( poke for some gdb/etc if you see it again! [17:02:14] luck! a sysadmin's best friend. [17:02:16] <_joe_> jgage: ping me when you're done [17:02:25] _joe_ ok [17:02:27] godog: *nod* [17:02:28] <_joe_> or if you need help [17:02:37] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1417995 (10RobH) Crucial sells the following: http://www.crucial.com/usa/en/poweredge-r510/CT2276949 [17:02:41] thanks. 
i've got godog here, i will make him help me if needed [17:03:02] !log beginning puppet CA replacement procedure [17:03:07] Logged the message, Master [17:03:36] jgage: if you salt in batch mode it will see all hosts, only a simple salt '*' will have incomplete results (less than 1000) [17:03:55] <_joe_> moritzm: yes, noted that [17:04:01] <_joe_> pretty "funny" indeed [17:04:13] moritzm thanks, luckily someone else figured that out the hard way before me [17:04:43] i saw ori use -b 200 so i'll use that as my chunk size, hopefully that's not too much parallelization [17:05:28] <_joe_> well, it might be for the puppetmasters when you require to sign the certs [17:06:01] for that, i will enable the agents and then they will check in over the 20 min spread thanks to their crontabs [17:06:06] so that should avoid thundering herd [17:07:09] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1418002 (10RobH) If this is a VM, and not bare metal, it shouldn't use up element names. The element names are allocated for bare metal misc servers. This was an ongoing discussion on 'how do we name misc... [17:09:14] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1418019 (10RobH) If it's a VM, I think it should be named by its role and all VMs should have some designation in their hostnames to denote they aren't on bare metal. [17:09:56] hmm, bblack are you around? i need to depool cp1065 + cp3030 [17:09:57] apergos: Are hosts within the cluster also rate limited on dumps.wm.o? [17:11:40] ostriches: yes, all of them. [17:11:46] oh ok [17:11:47] hmm [17:12:02] eta: 1h52m. [17:12:04] * ostriches makes coffee [17:12:09] what are you trying to get? 
[17:12:22] dumps.wikimedia.org/other/misc/svn-mediawiki.gz to iridium [17:12:40] mirror might have it [17:13:07] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1418037 (10RobH) I'd call this vm something like: staticbugzilla, rather than staticbugs, since bugs can be ANY bug tracker. [17:13:13] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 317 MB (3% inode=84%) [17:13:18] they do! [17:14:04] http://dumps.wikimedia.your.org/other/misc/ [17:14:07] yep [17:16:03] Hmmm, timing out :\ [17:16:07] From iridium [17:16:53] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir [17:16:54] PROBLEM - bacula sd process on heze is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (bacula), command name bacula-sd [17:17:08] ah yes, gotta silence those too [17:17:22] <_joe_> don't [17:17:45] why? [17:17:50] (03PS2) 10Dzahn: analytics_kafka: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/222138 [17:18:01] <_joe_> so we see when they happen [17:18:11] does iridum have public ip? [17:18:16] *iridium [17:18:20] Oh dur, probably not [17:18:26] _joe_ i don't see the point of that, i disabled bacula as part of my procedure [17:18:31] Goes through misc-lb [17:18:36] <_joe_> jgage: oh, right [17:18:45] <_joe_> I thought it was a consequence [17:18:57] heh no i don't silence alerts like that ;) [17:19:14] i need to depool cp1065 & cp3030 and i'm not finding a gerrit change to copy [17:19:19] apergos: hehe https://phabricator.wikimedia.org/P862 [17:19:56] <_joe_> jgage: why depool? [17:20:34] because they're running ipsec, which will break as part of this upgrade [17:20:44] <_joe_> can't we disable ipsec? 
[17:20:45] <_joe_> :) [17:20:48] <_joe_> that [17:20:50] ostriches: it's internal, you need to go through a proxy [17:20:56] <_joe_> *that's easier and less destructive [17:21:13] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [17:21:18] yeah ok, i'll just have to bring them back up manually before reenabling the agents [17:21:21] * jgage does that [17:21:34] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [17:21:37] https://wikitech.wikimedia.org/wiki/Http_proxy ostriches [17:21:40] 6operations, 6Discovery, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1418070 (10MaxSem) [17:21:52] 6operations, 6Discovery, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1418072 (10MaxSem) 5Open>3declined [17:22:03] 6operations, 10RESTBase-Cassandra, 6Services: Alert the services team mailing list when Cassandra dies - https://phabricator.wikimedia.org/T104467#1418073 (10mobrovac) 3NEW a:3fgiunchedi [17:23:34] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [17:24:03] (03PS3) 10Dzahn: analytics_kafka: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/222138 [17:24:44] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=84%) [17:24:48] Oh snap [17:24:52] Ok, I'll stop that idea [17:25:49] Or at least, do it on the right partition. 
[17:26:12] bah i have issued puppet agent --disable a few times via salt but some agents are still enabled [17:26:16] DIE [17:26:24] /home is mounted with /, only had 9.1G space :p [17:26:46] iridium disk space fixed, icinga should notice in a sec. [17:26:49] perhaps some of these codfw nodes are not included in salt's idea of '*' [17:26:54] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.009 second response time on port 9042 [17:27:45] <_joe_> jgage: where? [17:28:04] ostriches: there is /srv/phab/dumps I think [17:28:15] (03PS4) 10Jdlrobson: Enable Gather flagging on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221754 (https://phabricator.wikimedia.org/T97704) [17:28:17] Yeah, I'm using /srv [17:28:44] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [17:29:04] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [17:29:16] wha? [17:29:26] I mean /srv/phab/dumps [17:29:29] Just not / [17:29:30] :) [17:29:31] hah we had the same last week IIRC (google) [17:30:49] maybe there are more malicious fiber cuts happening [17:31:38] hmm my key is not accepted on mw2208.codfw.wmnet [17:32:08] and it's not in icinga [17:33:19] (03CR) 10Phuedx: [C: 032] Enable Gather flagging on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221754 (https://phabricator.wikimedia.org/T97704) (owner: 10Jdlrobson) [17:33:45] (03Merged) 10jenkins-bot: Enable Gather flagging on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221754 (https://phabricator.wikimedia.org/T97704) (owner: 10Jdlrobson) [17:34:46] <_joe_> jgage: mw2208 might be still to install after an hardware failure [17:35:24] its agent is hitting the puppetmaster, but oh well [17:35:31] other than that the puppetmasters are quiet [17:35:35] apergos: 36M/s :) [17:36:17] ah, mw2208's cert is not signed [17:39:35] jgage: use the "new_install" key to ssh to it [17:39:49] <_joe_> mutante: I 
don't think jgage wants to [17:40:06] <_joe_> jgage: go on with your plan and just don't care about mw2208 [17:40:10] <_joe_> I can take care of it [17:40:42] <_joe_> {{done}} [17:42:34] RECOVERY - Host google is UPING OK - Packet loss = 0%, RTA = 9.22 ms [17:43:34] thank you, icinga-wm [17:43:44] still like the "UPING" ? [17:44:00] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1418173 (10RobH) Infrastructure naming convention discussions are the worst. There is no really nice way to name these in regards to misc bare-metal versus misc vm, considering their use cases will overl... [17:44:40] re: ganglia_new: i needed separate regexes for regular analytics and analytics_kafka [17:51:01] fighting with apache on palladium to disable just the puppetmaster part, we may briefly see a pybal alert [17:51:30] <_joe_> jgage: can I do that? [17:51:36] <_joe_> disable just the puppetmaster part [17:52:54] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [17:53:04] _joe_ too late ;) [17:54:08] phuedx: ^ [17:54:12] You forgot to sync to prod [17:57:43] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1418183 (10MoritzMuehlenhoff) 5Open>3Resolved All done. [18:00:04] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150701T1800). Please do the needful. [18:02:02] !log restarted keystone on labcontrol1001 [18:02:09] Logged the message, Master [18:02:31] Who's doing the train today? [18:08:27] hoo: that'd be me [18:08:45] ah nice [18:08:58] I was about to ask if there is anything I should be aware of before I go ahead with it? 
[18:09:05] (03PS1) 10Alexandros Kosiaris: Introducing bromine.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/222149 (https://phabricator.wikimedia.org/T103604) [18:09:06] First of all, be aware that there are unsynched changes from phuedx [18:09:27] And I'd like to push a Wikidata update, preferably before the train [18:09:31] ok so sync-wikiversions isn't all that I need to do then? [18:09:42] hoo: no rush on my part if you wanna go ahead [18:09:42] dammit -- that was mibad [18:09:54] revert my change and i'll get it synced later [18:10:08] my apologies :/ [18:10:10] (03PS1) 10Gage: puppetmaster: fix puppet.conf for new CA cert [puppet] - 10https://gerrit.wikimedia.org/r/222151 [18:10:14] hoo: I can wait to do the train until wikidata update is done [18:10:32] ok, nice [18:10:42] Shouldn't take to long [18:10:45] just let me know when you're ready [18:10:50] just the jenkins waiting is time consuming [18:11:00] yeah, understood. [18:11:04] sometimes that can be pretty slow [18:11:34] (03PS1) 10Dzahn: switch analytics and analytics_kafka to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/222153 [18:11:56] (03PS2) 10Gage: puppetmaster: fix puppet.conf for new CA cert [puppet] - 10https://gerrit.wikimedia.org/r/222151 [18:12:22] (03Abandoned) 10Dzahn: WIP: analytics: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/219078 (owner: 10Dzahn) [18:13:39] (03Abandoned) 10Dzahn: analytics_kafka: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/222138 (owner: 10Dzahn) [18:14:23] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:14:41] (03CR) 10Gage: [C: 032] puppetmaster: fix puppet.conf for new CA cert [puppet] - 10https://gerrit.wikimedia.org/r/222151 (owner: 10Gage) [18:14:45] hoo, twentyafterfour: should i revert my change or is someone else on it? [18:14:54] phuedx: Is it beta only? 
[18:15:03] hoo: it is beta only [18:15:10] changes only to -beta.php files [18:15:22] In that case it can just be pulled to tin and synched out for consistency [18:15:29] I can do that when I push my Wikidata update [18:15:33] don't worry [18:15:36] hoo: ok, thanks [18:15:52] well, still, sorry for the minor inconvenience [18:26:34] RECOVERY - Disk space on iridium is OK: DISK OK [18:26:55] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [18:29:56] 6operations, 5Patch-For-Review: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1418271 (10demon) Had to import the full history from the dumps since they're now locally hosted (duh?!). Pywikipedia and Mysql repos imported and all working just fine... [18:30:06] 6operations, 5Patch-For-Review: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1418272 (10demon) 5Open>3Resolved [18:36:39] Did anyone look at the FancyCaptcha fatals on wmf12? [18:38:03] Apparently there's not even a bug [18:39:06] https://phabricator.wikimedia.org/T104477 [18:41:33] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [18:41:35] !log hoo Synchronized wmf-config/InitialiseSettings-labs.php: consistency (duration: 00m 31s) [18:41:39] Logged the message, Master [18:41:47] hoo, that'd be caused by Florian's change [18:42:02] https://gerrit.wikimedia.org/r/#/c/179793/ [18:42:04] !log hoo Synchronized wmf-config/mobile-labs.php: consistency (duration: 00m 12s) [18:42:09] Logged the message, Master [18:43:26] Krenair: Yeah, looks like it [18:43:52] We have that in the logs 302 times [18:43:59] probably worth a hot fix [18:44:27] or even better a revert before the train [18:44:31] twentyafterfour: Krenair ^ [18:57:45] hoo: so revert that change on the branch? 
[18:58:31] Seems sensible to me [18:58:45] especially given that this doesn't touch Wikimedia used functionality [19:00:36] if we don't use it why is it fataling? something must be calling the method ... [19:00:58] I mean the functionality introduced by that hcange is not used [19:01:02] isn't it bad form to change a method signature and not the call sites? [19:01:22] yeah the change doesn't look useful on teh release branch [19:01:22] It is [19:01:44] k I'll revert it on the branch [19:03:32] !log hoo Synchronized php-1.26wmf12/extensions/Wikidata/: Update DataModel to fix SnakList (duration: 00m 20s) [19:03:36] Logged the message, Master [19:09:06] https://gerrit.wikimedia.org/r/#/c/222162/ [19:09:26] (03CR) 10Alexandros Kosiaris: [C: 032] Introducing bromine.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/222149 (https://phabricator.wikimedia.org/T103604) (owner: 10Alexandros Kosiaris) [19:09:36] (03CR) 10Ori.livneh: "I'll merge once Aaron +1s" [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [19:09:48] twentyafterfour: +2ed [19:12:03] PROBLEM - puppetmaster backend https on palladium is CRITICAL: Connection refused [19:12:14] PROBLEM - puppetmaster https on palladium is CRITICAL: Connection refused [19:13:24] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [19:15:03] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [19:15:54] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [19:17:34] PROBLEM - bacula sd process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-sd [19:17:44] PROBLEM - puppet last run on palladium is CRITICAL puppet fail [19:19:14] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [19:20:53] RECOVERY - 
Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.013 second response time on port 9042 [19:23:34] hoo: all done with deploying? [19:24:20] I'm done, yes [19:25:02] hmm I found the bug in that captcha commit [19:25:12] but I'm gonna go ahead and sync with it reverted [19:25:20] Yeah, fixing on master should be neough [19:26:16] 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint: EventLogging access for maxsem - https://phabricator.wikimedia.org/T104482#1418401 (10MaxSem) 3NEW [19:27:56] akosiaris: could you post the mac on the ticket re. staticbugzilla/bromine? [19:27:59] !log Repooling mw1152 for further testing of HHVM scaler [19:28:04] Logged the message, Master [19:28:38] 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint: EventLogging access for maxsem - https://phabricator.wikimedia.org/T104482#1418415 (10Krenair) [19:31:07] !log from restbase1002; node thin_out_key_rev_value_data.js `hostname -i` local_group_wikipedia_T_parsoid_html 2>&1 | pv --line-mode | gzip -c > wikipedia_T_parsoid_html.log.gz [19:31:11] Logged the message, Master [19:31:35] !log replication issues for shard s7 on dbstore2001 and dbstore2002, production applications *not* affected [19:31:39] Logged the message, Master [19:32:33] ^this is probably going to take hours to fix, so I will wait until Springle or tomorrow, because it is not breaking production [19:36:35] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1418447 (10akosiaris) bromine.eqiad.wmnet is ready on ganeti01.svc.eqiad.wmnet with MAC: aa:00:00:f1:36:3a. DHCP configuration and installation are not done yet. 
[19:36:42] JohnFLewis: done [19:36:44] !log twentyafterfour Synchronized php-1.26wmf12: sync 1.26wmf12 branch revert of "Implement support for Google reCAPTCHA 2.0" 90665a737bc25ff3c859044755d662c6cd700573 (duration: 02m 04s) [19:36:48] Logged the message, Master [19:36:54] akosiaris: thanks :) [19:38:12] deploying 1.26wmf12 to group1 [19:38:35] (03PS1) 1020after4: group1 wikis to 1.26wmf12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222171 [19:40:56] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222171 (owner: 1020after4) [19:41:01] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222171 (owner: 1020after4) [19:41:21] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf12 [19:41:28] Logged the message, Master [19:42:58] (03PS1) 10John F. Lewis: add bromine to dhcp [puppet] - 10https://gerrit.wikimedia.org/r/222172 [19:45:26] 6operations, 10ops-eqiad: Rack and Setup New LVS servers - https://phabricator.wikimedia.org/T104484#1418473 (10Cmjohnson) 3NEW a:3BBlack [19:47:29] git.wikimedia.org seems down. known issue? [19:49:18] !log mw1152 not actually re-pooled because of ongoing work on palladium. I'm undoing the change and hanging back now. [19:49:24] Logged the message, Master [19:49:29] someone else's turn to restart gitblit on antimony [19:50:47] !log restarted gitblit on antimony [19:50:52] Logged the message, Master [19:52:24] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [19:52:34] someone remind me why we even need gitblit? 
[19:52:44] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60536 bytes in 0.948 second response time [19:52:44] it's always felt a bit redundant to me [19:54:15] Krenair: we don't need it, but we can't pull the plug without someone taking responsibility for making the decision and making sure it is communicated [19:54:42] and no one wants to do that at the moment. Getting a discussion going again on ops@ or on Phabricator would be good. [19:55:24] ori: Krenair: we have been trying to get all the repositories browseable on phabricator before killing gitblit [19:55:41] most of them are in phab but afaik still not all of them [19:55:51] there is a list of stragglers in a ticket [19:55:58] ostriches knows more about the subject [19:56:03] oh cool [19:56:10] so this is in much better shape than i thought [19:56:13] I stand corrected! [19:56:31] very happy to hear that. [19:56:47] (03PS1) 10RobH: EventLogging access for maxsem [puppet] - 10https://gerrit.wikimedia.org/r/222177 (https://phabricator.wikimedia.org/T104482) [19:58:10] 6operations, 6Mobile-Apps, 6Services, 3Mobile Content Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1418530 (10bearND) [20:00:04] gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150701T2000). [20:01:03] no parsoid deploy today [20:01:15] 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, 5Patch-For-Review: EventLogging access for maxsem - https://phabricator.wikimedia.org/T104482#1418538 (10RobH) Please note this task requires the manager approval (which @maxsem already requested in his task creation) before the merge of the patch... 
[20:01:32] (03PS19) 10Paladox: Rename all main WikimediaIncubator settings to have a wg prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [20:01:49] mobrovac: Anything to deploy from your end? [20:02:25] James_F: not today [20:02:56] OK. [20:05:29] 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, 5Patch-For-Review: EventLogging access for maxsem - https://phabricator.wikimedia.org/T104482#1418541 (10Manybubbles) As Tomasz's delegate I approve. If that is good enough for you then its good enough for me. Max is working on making sure the da... [20:18:56] 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, 5Patch-For-Review: EventLogging access for maxsem - https://phabricator.wikimedia.org/T104482#1418603 (10RobH) 5Open>3stalled I'm setting this to stalled until 2015-07-07. I'll be on ops clinic duty that week, so I'll go ahead and claim this... [20:19:05] 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, 5Patch-For-Review: EventLogging access for maxsem - https://phabricator.wikimedia.org/T104482#1418605 (10RobH) a:3RobH [20:19:18] 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, 5Patch-For-Review: EventLogging access for maxsem - https://phabricator.wikimedia.org/T104482#1418401 (10RobH) p:5Triage>3Normal [20:22:25] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1418622 (10hashar) @MoritzMuehlenhoff Danke Schoen! 
Thank you a ton to have taken extra care :-} [20:37:05] (03PS1) 10Gage: Revert "puppetmaster: fix puppet.conf for new CA cert" [puppet] - 10https://gerrit.wikimedia.org/r/222184 [20:37:14] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [20:37:46] i guess i'll restart gitblit after all [20:38:20] !log restarted gitblit on antimony [20:38:27] Logged the message, Master [20:39:10] (03CR) 10Gage: [C: 032] Revert "puppetmaster: fix puppet.conf for new CA cert" [puppet] - 10https://gerrit.wikimedia.org/r/222184 (owner: 10Gage) [20:40:54] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60575 bytes in 0.052 second response time [20:41:28] This is getting silly. [20:42:19] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1418668 (10BBlack) [20:42:22] 6operations, 10Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1418666 (10BBlack) 5Open>3Resolved a:3BBlack [20:42:43] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.026 second response time [20:44:33] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:46:23] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [20:47:49] (03PS4) 10BBlack: tlsproxy: add 2048-bit dhparam file to nginx [puppet] - 10https://gerrit.wikimedia.org/r/222016 [20:48:03] PROBLEM - puppet last run on sodium is CRITICAL Puppet last ran 4 hours ago [20:50:59] hi tgr. Do you mind checking out https://meta.wikimedia.org/wiki/Research_talk:Increasing_article_coverage#Usage_of_user_database and respond in the bottom of the thread if you have thoughts to share about Chricho's questions? 
This is a follow up on your response https://lists.wikimedia.org/pipermail/analytics/2015-June/004103.html [20:55:34] PROBLEM - puppet last run on strontium is CRITICAL Puppet last ran 4 hours ago [20:56:23] PROBLEM - puppetmaster backend https on palladium is CRITICAL - Socket timeout after 10 seconds [21:00:14] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [21:00:44] PROBLEM - puppetmaster backend https on strontium is CRITICAL: Connection refused [21:02:08] currently palladium has high load because it's signing certs [21:15:33] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 7.349 second response time [21:19:33] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 6.746 second response time [21:21:30] (03PS1) 10Eevans: increase size of key cache to 400MB [puppet] - 10https://gerrit.wikimedia.org/r/222189 [21:24:02] (03CR) 10Filippo Giunchedi: [C: 031] "note puppet is still disabled, so we'd have to either wait or apply manually" [puppet] - 10https://gerrit.wikimedia.org/r/222189 (owner: 10Eevans) [21:25:14] PROBLEM - puppetmaster backend https on palladium is CRITICAL - Socket timeout after 10 seconds [21:25:14] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [21:26:32] palladium puppetmaster is still running, it's just heavily loaded [21:26:41] (03CR) 10Mobrovac: [C: 031] increase size of key cache to 400MB [puppet] - 10https://gerrit.wikimedia.org/r/222189 (owner: 10Eevans) [21:26:49] i'm surprised the condition is persisting for so long, but i'm being patient [21:27:06] the http log looks good, i'm tailing it and watching load [21:28:59] leila: responded [21:29:06] thanks a lot tgr. 
[21:30:24] PROBLEM - puppet last run on palladium is CRITICAL puppet fail [21:31:35] <_joe_> jgage: ping me when you enable puppet on the fleet [21:32:37] _joe_ ok [21:32:54] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 9.497 second response time [21:37:27] 21:00 < icinga-wm> PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [21:37:32] can we kick it again? [21:38:08] alright I'll do it [21:38:14] pfft, it barely stayed up for 20 minutes that time [21:38:22] diee gitblit [21:38:44] PROBLEM - puppetmaster backend https on palladium is CRITICAL - Socket timeout after 10 seconds [21:39:15] !log bounce gitblit [21:39:21] Logged the message, Master [21:40:32] while true ; do service restart gitblit ; sleep 1500 ; done [21:40:35] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60548 bytes in 0.500 second response time [21:40:49] jgage: pretty mcu [21:40:52] gah, typing [21:41:06] tgr: so I know I broke stuff yesterday [21:41:26] https://commons.wikimedia.org/w/index.php?title=Special:UploadWizard&campaign=tos-rs&action=purge any ideas why the header doesn't appear despite the template existing? 
[21:42:16] You can't purge a special page for one thing [21:42:33] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.095 second response time [21:42:34] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.079 second response time [21:43:04] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.995 second response time [21:43:09] Not that I know how to purge a special page, mind you, but that's why you can't fix it that way [21:43:24] RECOVERY - puppet last run on strontium is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [21:43:54] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [21:46:19] marktraceur: oh yeah, I know, meant to delete that bit from the URL :) [21:46:57] And now magically it works [21:47:59] https://commons.wikimedia.org/w/index.php?title=Special:UploadWizard&campaign=tos-rs&uselang=sr not here though [21:48:20] did the template not exist at some point? [21:48:22] Is the template internationalized? [21:49:57] In English and Serbian as far as I know [21:50:10] Well then I'd bet there's another cache you need to wait for [21:50:54] tgr: it didn't, it was only created 10 minutes ago. I guess it's cache then? [21:51:18] some sort of parser cache I guess? 
[21:52:14] PROBLEM - puppetmaster backend https on palladium is CRITICAL - Socket timeout after 10 seconds [21:52:14] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [21:53:16] 10Ops-Access-Requests, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 5Patch-For-Review: Grant access to HTTP request logs - https://phabricator.wikimedia.org/T103872#1418912 (10Wwes) Approved [21:53:30] 10Ops-Access-Requests, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint, 5Patch-For-Review: Grant access to HTTP request logs - https://phabricator.wikimedia.org/T103872#1418913 (10Wwes) a:5Wwes>3None [21:53:49] (03PS2) 10Dzahn: add bromine to dhcp [puppet] - 10https://gerrit.wikimedia.org/r/222172 (https://phabricator.wikimedia.org/T103604) (owner: 10John F. Lewis) [21:54:04] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.078 second response time [21:54:04] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.094 second response time [21:54:53] tgr: Parser for what? Oh god are we still parsing messages in UW [21:55:04] Of course we are, because templates [21:55:33] also, the result of the parse is stored in memcache [21:55:45] so one of those [21:56:05] (03CR) 10Aaron Schulz: "Are they delayed for too long or just not picked up for too long? This won't help the former." [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [21:57:27] Christ. [21:57:47] Well, the moral of the story is that it will start working Soon™ [21:57:54] uh, seems like there is no cache expiration time? [21:58:02] orly? [21:58:08] That...is bad. 
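Side note on "no cache expiration time": an entry written without a TTL is not necessarily stuck forever, because it can still be invalidated on demand. The sketch below is a toy in-memory cache, not MediaWiki's actual cache API; the class `CheckKeyCache` and the method `touch_check_key` are hypothetical names. It assumes each value records when it was written and is dropped on read if a shared "check" key has been touched since.

```python
import time


class CheckKeyCache:
    """Toy cache: values never expire, but can be mass-invalidated via a check key."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable clock, handy for testing
        self._values = {}     # key -> (written_at, value)
        self._checks = {}     # check key -> time it was last touched

    def set(self, key, value):
        # No TTL: the entry stays until a check key marks it stale.
        self._values[key] = (self._clock(), value)

    def touch_check_key(self, check_key):
        # Everything written up to now that is read against this
        # check key will be treated as stale.
        self._checks[check_key] = self._clock()

    def get(self, key, check_key):
        entry = self._values.get(key)
        if entry is None:
            return None
        written_at, value = entry
        touched_at = self._checks.get(check_key)
        if touched_at is not None and written_at <= touched_at:
            # Written before the last invalidation: drop it and report a miss.
            del self._values[key]
            return None
        return value


if __name__ == "__main__":
    cache = CheckKeyCache()
    cache.set("campaign:en", "parsed config (en)")
    cache.set("campaign:sr", "parsed config (sr)")
    cache.touch_check_key("campaign")  # e.g. after the config page is edited
    print(cache.get("campaign:en", "campaign"))
    print(cache.get("campaign:sr", "campaign"))
```

Touching one check key invalidates every per-language entry that is read against it, which is why a single edit to the underlying page can be enough to refresh all variants.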
[21:58:12] (03CR) 10Hoo man: "We only delay for a few seconds (unless the replag is high, but that didn't happen in the past weeks), so the problem is picking up the jo" [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [21:58:14] (03PS1) 1020after4: Bump phabricator release tags [puppet] - 10https://gerrit.wikimedia.org/r/222194 [21:58:20] $cache->set( $memKey, array( 'timestamp' => time(), 'config' => $parsedConfig ) ); [21:58:25] ... [21:58:31] (03PS3) 10Dzahn: add bromine to dhcp [puppet] - 10https://gerrit.wikimedia.org/r/222172 (https://phabricator.wikimedia.org/T103604) (owner: 10John F. Lewis) [21:58:39] God damn it [21:58:44] not sure I am looking at the right place, not familiar with the post-multidatacenter cache stuff [21:58:58] I think that's the right reading of it, at least that's what I would read it as [21:59:15] Well if it's still borked tomorrow I'll try to fix it [21:59:24] But I have to run now [21:59:25] (03CR) 10Dzahn: [C: 032] add bromine to dhcp [puppet] - 10https://gerrit.wikimedia.org/r/222172 (https://phabricator.wikimedia.org/T103604) (owner: 10John F. Lewis) [21:59:33] (03PS2) 1020after4: Bump phabricator release tags refs T104047 [puppet] - 10https://gerrit.wikimedia.org/r/222194 (https://phabricator.wikimedia.org/T104047) [21:59:50] (03PS3) 1020after4: Bump phabricator release tags refs T104047 [puppet] - 10https://gerrit.wikimedia.org/r/222194 (https://phabricator.wikimedia.org/T104047) [21:59:56] twentyafterfour: do you want that now? [22:00:15] mutante: yeah I'll do the update in 2 hours [22:00:26] but it can be merged anytime? [22:00:28] the puppet change can happen any time before that [22:00:32] ok, doing so [22:00:38] mutante: thanks! 
[22:00:49] (03CR) 10Dzahn: [C: 032] Bump phabricator release tags refs T104047 [puppet] - 10https://gerrit.wikimedia.org/r/222194 (https://phabricator.wikimedia.org/T104047) (owner: 1020after4) [22:01:34] that gives me time to run through the upgrade and test process on phab-01 [22:02:05] PROBLEM - puppet last run on cp3030 is CRITICAL Puppet last ran 5 hours ago [22:02:44] (03CR) 10Andrew Bogott: [C: 04-1] "Let's use nova_client rather than hitting the wikitech api. Details inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [22:03:54] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1418961 (10Dzahn) merged DHCP config (thanks John) tried to get console: [ganeti1003:~] $ sudo gnt-instance console bromine.eqiad.wmnet 2015/07/01 22:03:13 socat[47275] E connect(5,... [22:04:04] RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:04:33] PROBLEM - puppet last run on virt1002 is CRITICAL Puppet last ran 5 hours ago [22:04:34] PROBLEM - puppet last run on wtp2005 is CRITICAL Puppet last ran 5 hours ago [22:04:34] PROBLEM - puppet last run on lvs2001 is CRITICAL Puppet last ran 4 hours ago [22:04:34] PROBLEM - puppet last run on es2006 is CRITICAL Puppet last ran 5 hours ago [22:04:34] PROBLEM - puppet last run on analytics1039 is CRITICAL Puppet last ran 5 hours ago [22:04:34] PROBLEM - puppet last run on mw1037 is CRITICAL Puppet last ran 5 hours ago [22:04:35] PROBLEM - puppet last run on cp1069 is CRITICAL Puppet last ran 5 hours ago [22:04:35] PROBLEM - puppet last run on es1005 is CRITICAL Puppet last ran 5 hours ago [22:04:36] PROBLEM - puppet last run on wtp1016 is CRITICAL Puppet last ran 5 hours ago [22:04:36] PROBLEM - puppet last run on mw2025 is CRITICAL Puppet last ran 5 hours ago [22:04:37] PROBLEM - puppet last run on mw1049 is 
CRITICAL Puppet last ran 5 hours ago [22:04:37] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet last ran 5 hours ago [22:04:38] PROBLEM - puppet last run on etcd1001 is CRITICAL Puppet last ran 5 hours ago [22:04:38] PROBLEM - puppet last run on mw2111 is CRITICAL Puppet last ran 5 hours ago [22:08:04] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [22:08:04] RECOVERY - puppet last run on db1023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:08:04] RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [22:08:14] o yes [22:08:24] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:08:24] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:08:43] RECOVERY - puppet last run on labstore1001 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [22:08:44] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:08:54] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [22:08:54] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:09:23] RECOVERY - puppet last run on mc1012 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:09:47] (03CR) 10Aaron Schulz: "The nominal delay is low but is the actual time it takes to be undelayed too high? 
A delay of a few seconds will result in a delay of minu" [puppet] - 10https://gerrit.wikimedia.org/r/208397 (owner: 10Hoo man) [22:10:03] RECOVERY - puppet last run on cp4001 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [22:10:03] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:10:03] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:10:04] RECOVERY - puppet last run on analytics1010 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [22:10:13] RECOVERY - puppet last run on cp1048 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [22:10:13] RECOVERY - puppet last run on cp1046 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:10:14] RECOVERY - puppet last run on cp1050 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:10:14] RECOVERY - puppet last run on cp2020 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [22:10:20] looks like puppetmaster crashed on strontium [22:10:23] but was already restarted [22:10:25] RECOVERY - puppet last run on restbase1004 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [22:10:33] RECOVERY - puppet last run on mc2006 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:10:43] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [22:10:43] RECOVERY - puppet last run on labnet1001 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:10:44] RECOVERY - puppet last run on cp1062 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [22:10:44] RECOVERY - puppet last run on elastic1030 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [22:10:44] RECOVERY - puppet last run on 
virt1004 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:10:44] RECOVERY - puppet last run on cp1063 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:10:44] RECOVERY - puppet last run on db1016 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [22:10:50] where did you see that, mutante? [22:10:53] RECOVERY - puppet last run on rhenium is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [22:10:54] marktraceur, odder: so apparently there is on-demand invalidation using some clever mechanism on the new WAN cache class, where the key itself has an infinite TTL but on $cache->get it automatically checks another key which can be used to mass invalidate all the keys which depend on it (all languages in this case) [22:11:27] jgage: tail -f /var/log/apache2/error.log [22:11:31] and the cache should be invalidated when the campaign config is updated or any linked page changes [22:11:48] tgr: so a dummy edit should help? [22:11:52] hm so it did [22:12:02] I'm guessing "linked page" doesn't work properly for JSON pages [22:12:02] jgage: on both servers, palladium first, told me that the issue is strontium.. then on strontium same thing [22:12:06] odder: I think so [22:13:41] operations, Deployment-Systems, Performance-Team, Release-Engineering, HHVM: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#1418983 (mmodell) [22:14:11] tgr: and it did \o/ [22:14:22] jgage: re: the MaxClients setting, that's "normal" https://phabricator.wikimedia.org/T97466 [22:14:28] so another question which I can't find the answer to... [22:14:37] https://commons.wikimedia.org/wiki/Campaign:tos-rs <= how do I make the count !=0 ? [22:15:05] thanks mutante [22:19:59] ottomata: remember the switch to ganglia_new for analytics cluster? we got the ACLs adjusted now..
so i would try again [22:20:17] (PS2) RobH: access: grant Jdouglas access toanalytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/220990 (owner: Matanya) [22:20:25] ottomata: but i dont get why logstash failed.. because that is apparently in the normal VLAN, not analytics [22:22:09] (CR) RobH: [C: 2] access: grant Jdouglas access toanalytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/220990 (owner: Matanya) [22:22:29] logstash? [22:22:35] (CR) Krinkle: "In order to reliably deploy this we need to either backport the other patch to both branches and deploy alongside wmf-config in a single s" [mediawiki-config] - https://gerrit.wikimedia.org/r/207909 (owner: Paladox) [22:23:33] RECOVERY - puppet last run on mw2089 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:23:34] RECOVERY - puppet last run on mw2160 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:23:34] RECOVERY - puppet last run on rdb1004 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:23:35] RECOVERY - puppet last run on ms-be1003 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [22:23:35] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:23:36] RECOVERY - puppet last run on mw2100 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:23:36] RECOVERY - puppet last run on mw1005 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:23:37] RECOVERY - puppet last run on mw1040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:23:37] RECOVERY - puppet last run on mw1080 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:23:38] RECOVERY - puppet last run on mw1145 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [22:23:38] RECOVERY -
puppet last run on mw1197 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [22:23:39] RECOVERY - puppet last run on mw1240 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:23:41] RECOVERY - puppet last run on mw2112 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [22:23:41] RECOVERY - puppet last run on mw2108 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:23:41] RECOVERY - puppet last run on mw1063 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [22:23:41] RECOVERY - puppet last run on ms-be2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:23:42] RECOVERY - puppet last run on mw2032 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:23:42] RECOVERY - puppet last run on mw2074 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [22:23:51] Ops-Access-Requests, operations, Discovery, Discovery-Cirrus-Sprint, Patch-For-Review: Grant access to HTTP request logs - https://phabricator.wikimedia.org/T103872#1419046 (RobH) Open>Resolved a:RobH With Wes's approval of rights escalation, I've gone ahead and merged the patchset. @Jd...
[22:23:53] RECOVERY - IPsec on berkelium is OK: Strongswan OK - Security Associations: 2 ESP transports installed [22:23:53] RECOVERY - puppet last run on cerium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:23:59] Ops-Access-Requests, operations, Discovery, Discovery-Cirrus-Sprint: Grant access to HTTP request logs - https://phabricator.wikimedia.org/T103872#1419050 (RobH) [22:24:02] RECOVERY - puppet last run on mw1031 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:02] RECOVERY - puppet last run on mw2005 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:11] RECOVERY - puppet last run on xenon is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:12] RECOVERY - puppet last run on mw1006 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [22:24:12] RECOVERY - puppet last run on mw1041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:12] RECOVERY - puppet last run on cp2007 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:12] RECOVERY - puppet last run on db1029 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:12] RECOVERY - puppet last run on db1053 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:12] RECOVERY - puppet last run on mw2124 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [22:24:13] RECOVERY - puppet last run on cp1047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:21] RECOVERY - puppet last run on mw2028 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:21] RECOVERY - puppet last run on mw2179 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:21] RECOVERY - puppet last run on analytics1041 is OK Puppet is currently enabled, last run 2 seconds ago with 0
failures [22:24:22] RECOVERY - puppet last run on snapshot1003 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [22:24:22] RECOVERY - puppet last run on ms-be1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:22] RECOVERY - puppet last run on db1073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:22] RECOVERY - puppet last run on elastic1018 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:24:23] RECOVERY - puppet last run on cp2009 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:24:23] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir [22:24:24] RECOVERY - puppet last run on mc2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:24] RECOVERY - puppet last run on mw1007 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [22:24:31] RECOVERY - puppet last run on mw1036 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:32] RECOVERY - puppet last run on ms-be2013 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:32] RECOVERY - puppet last run on mw1082 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [22:24:32] RECOVERY - puppet last run on mw1138 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:24:32] RECOVERY - puppet last run on mw1141 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:32] RECOVERY - puppet last run on mw1059 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:32] RECOVERY - puppet last run on mw2183 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:33] RECOVERY - puppet last run on mw2103 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:33] 
RECOVERY - puppet last run on wtp1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:34] RECOVERY - puppet last run on wtp1021 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:24:34] RECOVERY - puppet last run on mw2161 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:24:35] RECOVERY - puppet last run on mw2170 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:24:35] RECOVERY - puppet last run on elastic1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:41] RECOVERY - bacula sd process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-sd [22:24:41] RECOVERY - puppet last run on magnesium is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:42] RECOVERY - puppet last run on mw1012 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [22:24:42] RECOVERY - puppet last run on mw1089 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:42] RECOVERY - puppet last run on mw1124 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:42] RECOVERY - puppet last run on graphite1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:51] RECOVERY - puppet last run on cp2025 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:52] RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [22:24:52] RECOVERY - puppet last run on mw1187 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:24:52] RECOVERY - puppet last run on cp4006 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:52] RECOVERY - puppet last run on helium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:52] RECOVERY - puppet last run 
on es2008 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:24:52] RECOVERY - puppet last run on ms-be2002 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:24:53] RECOVERY - puppet last run on mw1178 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:24:54] RECOVERY - puppet last run on mw2046 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:28:58] odder: do you have a link at hand? [22:29:17] to an uploaded file I mean [22:30:54] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1419061 (Bawolff) >>! In T102566#1417074, @Tau wrote: > I have installed the php5-curl now but still Instantcommons isn't w... [22:31:51] (PS1) Dzahn: add bromine as a misc-web backend [puppet] - https://gerrit.wikimedia.org/r/222198 (https://phabricator.wikimedia.org/T101734) [22:34:13] (PS1) Dzahn: switch static-bugzilla to backend bromine [puppet] - https://gerrit.wikimedia.org/r/222200 (https://phabricator.wikimedia.org/T101734) [22:35:41] (PS2) Filippo Giunchedi: increase size of key cache to 400MB [puppet] - https://gerrit.wikimedia.org/r/222189 (owner: Eevans) [22:35:43] (PS1) Filippo Giunchedi: cassandra: add team-services for cql failure [puppet] - https://gerrit.wikimedia.org/r/222201 (https://phabricator.wikimedia.org/T104467) [22:35:46] (CR) Filippo Giunchedi: [C: 2 V: 2] increase size of key cache to 400MB [puppet] - https://gerrit.wikimedia.org/r/222189 (owner: Eevans) [22:37:31] (PS1) Catrope: Allow bureaucrats to grant and remove flow-bot on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/222202 [22:38:26] operations, vm-requests, Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1419079 (Dzahn) after waiting a bit and
restarting it i got console and saw the installer. then, as with planet1001 at first: ``` ┌────────┤ [!!] Install the GRUB boot loader... [22:39:01] (PS1) Dzahn: add new node bromine, add bz-static role [puppet] - https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) [22:39:22] (CR) Dzahn: "bro, (it's) mine" [puppet] - https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) (owner: Dzahn) [22:40:30] operations: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442#1419086 (fgiunchedi) re: dnspq upstream is responsive and open to improvements, I've sent some myself for `carbon-c-relay`. re: the general topic, would a local full-fletched caching resolver help? (and 127.0... [22:41:09] (PS2) Dzahn: add new node bromine, add bz-static role [puppet] - https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) [22:41:47] tgr: Hang on a second, I have a link indeed [22:42:09] https://commons.wikimedia.org/wiki/File:Na_Drini_%C4%87uprija_Vi%C5%A1egrad.JPG [22:43:42] PROBLEM - puppet last run on cp2024 is CRITICAL Puppet last ran 5 hours ago [22:46:19] (PS1) Filippo Giunchedi: monitoring: validate service description for illegal characters [puppet] - https://gerrit.wikimedia.org/r/222205 (https://phabricator.wikimedia.org/T101799) [22:47:13] (PS2) Catrope: Grant flow-create-bot to sysops on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/222202 (https://phabricator.wikimedia.org/T101663) [22:47:33] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:47:51] (PS5) BBlack: tlsproxy: add 2048-bit dhparam file to nginx [puppet] - https://gerrit.wikimedia.org/r/222016 [22:47:59] (CR) BBlack: [C: 2 V: 2] tlsproxy: add 2048-bit dhparam file to nginx [puppet] - https://gerrit.wikimedia.org/r/222016 (owner: BBlack) [22:55:43] odder: so there
should be an 'Uploaded via Campaign:toc-rs' category on that page [22:55:55] looks like adding that is broken [22:56:10] as a workaround you can just put it into the template for now [22:58:14] * odder nods [22:58:22] RECOVERY - puppet last run on bast1001 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:59:19] (CR) John F. Lewis: [C: 1] add new node bromine, add bz-static role [puppet] - https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) (owner: Dzahn) [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150701T2300). [23:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:52] I'm here for that patch; who can deploy? [23:01:09] I think RoanKattouw is here to deploy it? [23:01:15] No, he's in a meeting. [23:01:17] Oh. [23:01:20] Okay, I'll do it then. [23:01:25] Awesome, thanks Krenair. [23:01:30] I probably should've checked his calendar [23:01:43] :-) [23:02:45] Hm. Have we not granted that right to any other groups before? [23:03:13] It's completely changed from PS1 [23:03:48] Oh, it's granted via extensions defaults to the 'flow-bot' group. Great. [23:04:10] (CR) Alex Monk: [C: 2] Grant flow-create-bot to sysops on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/222202 (https://phabricator.wikimedia.org/T101663) (owner: Catrope) [23:04:16] (Merged) jenkins-bot: Grant flow-create-bot to sysops on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/222202 (https://phabricator.wikimedia.org/T101663) (owner: Catrope) [23:04:36] Why do we have a non-readonly remote called 'readonly'?
[23:04:43] Krenair: Of course I can't usefully test… [23:04:53] RECOVERY - puppet last run on cp2024 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:06:12] RECOVERY - puppet last run on db1022 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [23:06:26] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/222202/ (duration: 00m 11s) [23:06:32] Logged the message, Master [23:06:33] Never mind, quiddity can test [23:06:44] OK. [23:06:45] Thanks! [23:06:53] (PS1) Filippo Giunchedi: tessera: force uwsgi scheme to https [puppet] - https://gerrit.wikimedia.org/r/222210 (https://phabricator.wikimedia.org/T104424) [23:07:16] (CR) Paladox: "Ok done here https://gerrit.wikimedia.org/r/#/c/222208/ and https://gerrit.wikimedia.org/r/#/c/222209/ here. backported to wmf 12 and wmf1" [mediawiki-config] - https://gerrit.wikimedia.org/r/207909 (owner: Paladox) [23:07:37] All good, quiddity? [23:08:28] Tested, all good. [23:08:31] thanks :) [23:08:51] yvw [23:10:03] RECOVERY - puppet last run on ms-fe2002 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [23:12:21] (PS6) BBlack: ciphersuites: refactor further, add compat-dhe option [puppet] - https://gerrit.wikimedia.org/r/222022 [23:12:39] (CR) BBlack: [C: 2 V: 2] "Verified in catalog compiler, no-op." [puppet] - https://gerrit.wikimedia.org/r/222022 (owner: BBlack) [23:12:41] PROBLEM - puppet last run on mw2030 is CRITICAL Puppet last ran 6 hours ago [23:13:11] (PS6) BBlack: tlsproxy: enable DHE-2048 FS for Android 2.x, etc.
[puppet] - https://gerrit.wikimedia.org/r/222023 (https://phabricator.wikimedia.org/T104281) [23:13:36] (CR) BBlack: [C: -1] "On hold pending java6 situation: further analysis and/or pre-warning to mailing list" [puppet] - https://gerrit.wikimedia.org/r/222023 (https://phabricator.wikimedia.org/T104281) (owner: BBlack) [23:13:41] PROBLEM - puppet last run on planet1001 is CRITICAL puppet fail [23:14:59] (CR) BBlack: [C: -1] "On hold pending further impact analysis and/or pre-warning email" [puppet] - https://gerrit.wikimedia.org/r/221974 (owner: BBlack) [23:18:23] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [23:21:31] RECOVERY - puppet last run on planet1001 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [23:25:31] RECOVERY - puppet last run on stat1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:30:01] (PS4) BBlack: sslcert: replace install_certificate with sslcert::std_cert [puppet] - https://gerrit.wikimedia.org/r/222066 [23:30:03] (PS5) BBlack: tlsproxy: multi-cert support, including ocsp [puppet] - https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654) [23:30:28] !log restart mysqld dbstore2002 T104471 [23:30:34] Logged the message, Master [23:36:06] (PS2) Ori.livneh: tessera: force uwsgi scheme to https [puppet] - https://gerrit.wikimedia.org/r/222210 (https://phabricator.wikimedia.org/T104424) (owner: Filippo Giunchedi) [23:36:23] (CR) Ori.livneh: [C: 2 V: 2] tessera: force uwsgi scheme to https [puppet] - https://gerrit.wikimedia.org/r/222210 (https://phabricator.wikimedia.org/T104424) (owner: Filippo Giunchedi) [23:36:25] (PS5) BBlack: sslcert: replace install_certificate with sslcert::std_cert [puppet] - https://gerrit.wikimedia.org/r/222066 [23:36:27] (PS6) BBlack: tlsproxy: multi-cert support, including ocsp [puppet] -
https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654) [23:38:48] (PS6) BBlack: sslcert: replace install_certificate with sslcert::std_cert [puppet] - https://gerrit.wikimedia.org/r/222066 [23:38:56] (CR) BBlack: [C: 2 V: 2] sslcert: replace install_certificate with sslcert::std_cert [puppet] - https://gerrit.wikimedia.org/r/222066 (owner: BBlack) [23:40:13] (PS2) Ori.livneh: Make Coal's whisper files accessible to Graphite front-ends. [puppet] - https://gerrit.wikimedia.org/r/222020 [23:40:15] (PS1) Ori.livneh: Set `uWSGIForceWSGIScheme https` for all mod_uwsgi webapps [puppet] - https://gerrit.wikimedia.org/r/222216 [23:40:18] (PS3) Ori.livneh: Make Coal's whisper files accessible to Graphite front-ends. [puppet] - https://gerrit.wikimedia.org/r/222020 [23:40:27] (CR) Ori.livneh: [C: 2 V: 2] Make Coal's whisper files accessible to Graphite front-ends. [puppet] - https://gerrit.wikimedia.org/r/222020 (owner: Ori.livneh) [23:44:41] PROBLEM - puppet last run on labvirt1007 is CRITICAL puppet fail [23:45:54] PROBLEM - puppet last run on virt1002 is CRITICAL puppet fail [23:46:32] PROBLEM - puppet last run on netmon1001 is CRITICAL puppet fail [23:46:32] PROBLEM - puppet last run on labvirt1005 is CRITICAL puppet fail [23:48:42] (PS1) BBlack: bugfixes for b13b9157 (dependencies) [puppet] - https://gerrit.wikimedia.org/r/222217 [23:48:45] those are all me ^ [23:49:37] (CR) BBlack: [C: 2] bugfixes for b13b9157 (dependencies) [puppet] - https://gerrit.wikimedia.org/r/222217 (owner: BBlack) [23:50:22] PROBLEM - puppet last run on labvirt1003 is CRITICAL puppet fail [23:51:23] fixed now, they'll catch it on the next run [23:51:52] PROBLEM - puppet last run on virt1003 is CRITICAL puppet fail [23:52:12] RECOVERY - puppet last run on labvirt1007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:52:21] PROBLEM - puppet last run on virt1004 is CRITICAL puppet
fail [23:52:21] PROBLEM - puppet last run on virt1001 is CRITICAL puppet fail [23:54:15] (PS7) BBlack: tlsproxy: multi-cert support, including ocsp [puppet] - https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654) [23:56:11] RECOVERY - puppet last run on virt1001 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures