[00:05:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:08:16] !log catrope Synchronized php-1.26wmf8/vendor/oojs/oojs-ui/php/Tag.php: Fix OOUI fatals (T99210) (duration: 00m 13s) [00:08:25] Logged the message, Master [00:10:11] ori: Hmm, so how do I determine if this has stopped fatals? fatal.log on fluorine? fatalmonitor on fluorine? What's the cool thing these days? [00:10:20] Oh look at that fluorine finally let me ssh in [00:11:01] Wow 1346 Notice: JobQueueGroup::__destruct: 1 buffered job(s) never inserted. in /srv/mediawiki/php-1.26wmf8/includes/jobqueue/JobQueueGroup.php on line 419 is huge on fatalmonitor [00:11:16] RoanKattouw: Can you only ssh in when the 500s are below a certain level? [00:11:27] I hope not :S [00:13:05] Whoa wtf fatal.log contains errors from the future [00:13:18] 2015-06-05 01:23:32 mw1079 rowiki fatal INFO: [b8279a07] /w/index.php?title=Special%3ACarte&bookcmd=book_creator&referer=Categorie%3A1009 ErrorException from line 361 of /srv/mediawiki/php-1.26wmf8/vendor/oojs/oojs-ui/php/Tag.php: PHP Error: exception 'OOUI\Exception' with message 'Potentially unsafe 'href' attribute value. Scheme: ''; value: '/wiki/Categorie:1009'.' in /srv/mediawiki/php-1.26wm [00:13:19] f8/vendor/oojs/oojs-ui/php/Tag.php:317 [00:13:26] That timestamp is an hour and 10 minutes from now [00:13:45] RoanKattouw: it's a localized timestamp [00:13:51] there's a bug for that somewhere [00:13:52] legoktm: On a server? [00:13:59] legoktm: What? [00:14:07] localized to rowiki's timezone [00:14:17] Ha. [00:14:25] https://phabricator.wikimedia.org/T99581 [00:14:48] OK so I basically cannot find out what the historical volume of these errors was? [00:15:15] they should go into the right file [00:15:31] Oh you mean log rotation [00:15:44] Yeah I guess that works on a day basis [00:16:57] RoanKattouw: just visit one of the old fataling urls and see if it still causes new entries to be logged? 
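Note on the "errors from the future" exchange above: the fatal.log timestamp is said to be localized to the wiki's timezone rather than UTC. A minimal sketch of normalizing such a stamp back to UTC follows. The function name and the +3 hour offset are assumptions for illustration only; the log does not state which offset was actually applied.

```python
from datetime import datetime, timedelta, timezone


def localized_to_utc(stamp: str, utc_offset_hours: float) -> datetime:
    """Read a naive 'YYYY-MM-DD HH:MM:SS' stamp as local time at the
    given UTC offset and return the same instant expressed in UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=local_tz)
    return local.astimezone(timezone.utc)


# Hypothetical offset of +3 hours; the real wiki offset is not given in the log.
print(localized_to_utc("2015-06-05 01:23:32", 3).isoformat())
# prints 2015-06-04T22:23:32+00:00
```

Comparing normalized stamps like this against the server clock is one way to tell a genuinely new entry from an older one that only looks recent.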
[00:20:12] 6operations, 7HTTPS: Ganglia server doesn't send intermediary certificates - https://phabricator.wikimedia.org/T72326#1339835 (10Chmarkine) 5Open>3Resolved a:3Chmarkine This has been fixed in T100825. [00:20:51] 6operations, 7HTTPS: Ganglia server doesn't send intermediary certificates - https://phabricator.wikimedia.org/T72326#1339840 (10Chmarkine) a:5Chmarkine>3None [00:21:05] I visited a few of them and got no new fatals [00:21:07] So it looks fixed [00:21:25] thanks roan [00:21:32] And legoktm. :-) [00:22:10] 6operations, 7HTTPS: Ganglia server doesn't send intermediary certificates - https://phabricator.wikimedia.org/T72326#1339846 (10Dzahn) ah, thanks for the reminder @Chmarkine :) [00:22:19] yeah [00:25:20] greg-g: do I need a deploy window for https://phabricator.wikimedia.org/T98490 / https://phabricator.wikimedia.org/T101460 or is it OK to just do them now? the extension is mobile beta only, the patches only add a new column/index and the affected tables have a couple thousand rows only [00:29:34] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 25 hours old. [00:30:19] Do we still require DBA review for schema changes? [00:30:36] yes? [00:31:15] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [00:35:50] springle, ^ [00:39:46] YuviPanda: You awake? [00:40:59] RoanKattouw: wikitech says " RoanKattouw_away: And if anyone runs update.php I'll be on the first flight to SFO to slap them in the face " :) [00:41:44] Yeah :) [00:41:47] Guess when that was written [00:42:17] 2011 ? [00:43:59] Before RoanKattouw was in SF, I imagine. [00:45:40] But everyone else was in SF? [00:47:51] Yeah I think it was 2010 or so [00:47:51] 6operations, 6Release-Engineering, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1339936 (10Dzahn) >>! 
In T82319#1335241, @hashar wrote: > Does #releng has anything to do there? Seems like some infrastructure tech debt. I added sinc... [00:48:02] 6operations: Need Kaldari2 account disabled - https://phabricator.wikimedia.org/T101477#1339937 (10kaldari) 3NEW [00:48:27] kaldari: we hardly knew ye [00:48:59] so long cruel world! [00:51:36] 6operations: Need Kaldari account disabled - https://phabricator.wikimedia.org/T101479#1339961 (10kaldari2) 3NEW [00:53:54] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 25 hours old. [00:54:30] kaldari: done [00:54:39] it did not have an ssh key, right [00:54:46] 6operations: Need Kaldari account disabled - https://phabricator.wikimedia.org/T101479#1339986 (10kaldari) Don't delete me! [00:54:48] 6operations: Need Kaldari account disabled - https://phabricator.wikimedia.org/T101479#1339987 (10kaldari2) 5Open>3Invalid a:3kaldari2 (This was a joke.) [00:55:35] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [00:56:15] wtf [00:56:38] 6operations: Need Kaldari account disabled - https://phabricator.wikimedia.org/T101479#1339990 (10Dzahn) 5Invalid>3Open [00:57:05] 6operations: Need Kaldari account disabled - https://phabricator.wikimedia.org/T101479#1339961 (10Dzahn) 5Open>3Resolved gerrit> select account_id from accounts where full_name="Kaldari2"; account_id ---------- 2239 gerrit> update accounts set inactive = 'Y' where account_id=2239; UPDATE 1; 3 ms gerrit>... [00:58:25] 6operations: Need Kaldari2 account disabled - https://phabricator.wikimedia.org/T101477#1339994 (10kaldari) 5Open>3Resolved a:3kaldari Per the joke ticket: T101479 [00:58:29] mutante: thanks! 
[01:01:52] 6operations: Need Kaldari account disabled - https://phabricator.wikimedia.org/T101479#1340011 (10Dzahn) {F174916} [01:08:05] 6operations, 6Release-Engineering, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340016 (10Dzahn) >>! In T82319#1335241, @hashar wrote: > Does #releng has anything to do there? Seems like some infrastructure tech debt. Actually, yo... [01:10:22] 6operations, 6Release-Engineering, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340017 (10Dzahn) 5Open>3Resolved a:3Dzahn we use "SSLCACertificatePath /etc/ssl/certs/" in the Gerrit config (meanwhile) and that is ok too [01:13:01] 6operations, 10Traffic: Package/backport openssl 1.0.2 + nginx 1.7.x or higher - https://phabricator.wikimedia.org/T96850#1340021 (10BBlack) @faidon just pointed out the [[ http://nginx.com/blog/socket-sharding-nginx-release-1-9-1/ | 1.9.1 release has added `SO_REUSEPORT` ]], which would be a really huge win... [01:13:18] 6operations, 6Release-Engineering, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340022 (10Dzahn) https://www.ssllabs.com/ssltest/analyze.html?d=gerrit.wikimedia.org the "-" in "A-" is because we are not supporting PFS which is beca... [01:14:15] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1340024 (10BBlack) Looks like someone may have already done the hard work, I just didn't google hard enough: http://forum.nginx.org/read.php?2,253440,257332#msg-257332 . [[ htt... [01:18:30] tgr: imo T101460 does not need a window, just a !log [01:20:27] springle: how about the other one? that just adds an index for <10K rows [01:23:51] yeah i just saw T98490. commenting there (also saying just do it) [01:25:56] springle: thanks! 
[01:26:32] (03PS3) 10Dzahn: add AAAA record for silver.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/214489 (https://phabricator.wikimedia.org/T73218) [01:27:53] !log deploying schema changes for Gather on enwiki, enwikivoyage, hewiki (T98490, T101460) [01:27:59] Logged the message, Master [01:28:01] nice [01:28:02] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1340042 (10matmarex) Let me just note that at least some of Cloudflare's ECDSA certificates are incompatible with Opera 12.x (which represents a bit more than 0.5% of our page v... [01:28:35] (03PS4) 10Dzahn: add IPv6 for silver (wikitech web) [puppet] - 10https://gerrit.wikimedia.org/r/214430 (https://phabricator.wikimedia.org/T73218) [01:34:49] (03CR) 10Dzahn: [C: 032] add IPv6 for silver (wikitech web) [puppet] - 10https://gerrit.wikimedia.org/r/214430 (https://phabricator.wikimedia.org/T73218) (owner: 10Dzahn) [01:36:19] 6operations, 6Labs, 10Labs-Infrastructure, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1340069 (10Dzahn) a:3Dzahn [01:38:32] 6operations, 6Labs, 10Labs-Infrastructure, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1340082 (10Dzahn) pre-up /sbin/ip token set ::208:80:154:136 dev eth0 + up ip addr add 2620:0:861:2:208:80:154:136/64 dev eth0 Notice: /... 
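Note on the address in the task comment above: the interface token ::208:80:154:136 appears to reuse the decimal octets of the host's IPv4 address (208.80.154.136) as the last four groups of the IPv6 address inside the /64. A minimal sketch of that mapping, using only the Python standard library, follows. The helper name is hypothetical, and that this is the general addressing convention is an assumption drawn from this one address pair.

```python
import ipaddress


def embed_ipv4_in_ipv6(prefix64: str, ipv4: str) -> ipaddress.IPv6Address:
    """Build an IPv6 address whose last four groups spell out the
    decimal octets of an IPv4 address (hypothetical helper)."""
    network = ipaddress.IPv6Network(prefix64)
    if network.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    # Validate the IPv4 address, then split it into its decimal octets.
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    # Each octet is written digit-for-digit into one 16-bit group,
    # so "208" becomes the hex group 0x208, not the value 208.
    interface_id = int(ipaddress.IPv6Address("::" + ":".join(octets)))
    return ipaddress.IPv6Address(int(network.network_address) | interface_id)


print(embed_ipv4_in_ipv6("2620:0:861:2::/64", "208.80.154.136"))
# prints 2620:0:861:2:208:80:154:136
```

Because every decimal octet (0 to 255) is also a valid string of hex digits, the result always parses, and the IPv4 address stays readable at a glance in the IPv6 form.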
[01:39:49] (03CR) 10Dzahn: [C: 032] add AAAA record for silver.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/214489 (https://phabricator.wikimedia.org/T73218) (owner: 10Dzahn) [01:44:31] 6operations, 10hardware-requests: Replace rubidium with radon for authdns (allocate radon, deallocate rubidium) - https://phabricator.wikimedia.org/T101256#1340098 (10Dzahn) 5Resolved>3Open radon is not in site.pp because it was moved back to spare pool in T88818 (https://phabricator.wikimedia.org/rOPUP26f... [01:44:33] 6operations, 10ops-eqiad: rubidium - wipe and reclaim to spares - investigate hdd issue - https://phabricator.wikimedia.org/T101279#1340102 (10Dzahn) [01:47:29] 6operations, 10ops-eqiad: rubidium - wipe and reclaim to spares - investigate hdd issue - https://phabricator.wikimedia.org/T101279#1340110 (10Dzahn) [01:49:53] 6operations, 6Labs, 10Labs-Infrastructure, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1340118 (10Dzahn) silver.wikimedia.org has address 208.80.154.136 silver.wikimedia.org has IPv6 address 2620:0:861:2:208:80:154:136 --- ;; ANS... [01:52:49] 6operations, 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1340120 (10scfc) 5Resolved>3declined (AFAIUI, the underlying issue has not been researched or r... 
[01:55:52] (03PS1) 10Dzahn: wikitech: make it a CNAME for silver [dns] - 10https://gerrit.wikimedia.org/r/216021 (https://phabricator.wikimedia.org/T73218) [01:56:55] (03PS2) 10Dzahn: wikitech: make it a CNAME for silver [dns] - 10https://gerrit.wikimedia.org/r/216021 (https://phabricator.wikimedia.org/T73218) [01:59:41] (03PS1) 10Dzahn: shop/store: set TTL back to 1H [dns] - 10https://gerrit.wikimedia.org/r/216022 [02:00:46] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1340134 (10intracer) I'm a Board Secretary of Wikimedia Ukraine, we've been using Redmine for a year, and while we have about 330 tasks and 18... [02:06:47] (03PS1) 10Dzahn: remove civicrm*.frdev.wikimedia.org entries [dns] - 10https://gerrit.wikimedia.org/r/216023 [02:11:51] (03PS1) 10Dzahn: fix indentation in server section [dns] - 10https://gerrit.wikimedia.org/r/216024 [02:14:09] (03Abandoned) 10Dzahn: add IPv6 for caesium (releases) [puppet] - 10https://gerrit.wikimedia.org/r/214435 (owner: 10Dzahn) [02:18:25] (03PS1) 10Dzahn: add an empty template for domain parking [dns] - 10https://gerrit.wikimedia.org/r/216025 [02:19:43] (03PS2) 10Dzahn: add an empty template for domain parking [dns] - 10https://gerrit.wikimedia.org/r/216025 [02:21:32] !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 07m 09s) [02:21:41] Logged the message, Master [02:22:24] 6operations, 6Search-and-Discovery: sitemap.wikimedia.org ? - https://phabricator.wikimedia.org/T101486#1340168 (10Dzahn) [02:23:32] 6operations, 6Search-and-Discovery: sitemap.wikimedia.org ? - https://phabricator.wikimedia.org/T101486#1340162 (10Dzahn) by "actual" sitemap i mean https://en.wikipedia.org/wiki/Site_map#XML_Sitemaps [02:24:24] 6operations, 6Search-and-Discovery: sitemap.wikimedia.org ? 
- https://phabricator.wikimedia.org/T101486#1340171 (10Dzahn) p:5Triage>3Low [02:25:14] 6operations: change notification options in CyrusOne customer portal - https://phabricator.wikimedia.org/T100481#1340172 (10Dzahn) a:3RobH @RobH do you have the login for CyrusOne to make those changes? [02:26:23] !log LocalisationUpdate completed (1.26wmf8) at 2015-06-05 02:25:20+00:00 [02:26:27] Logged the message, Master [02:26:49] 6operations, 6Labs, 10Labs-Infrastructure, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1340174 (10Dzahn) silver has an IPv6 address on eth0 now and also the AAAA record in DNS. that would have resolved it if wikitech was a CNAME... [02:33:57] 6operations, 6Search-and-Discovery: sitemap.wikimedia.org ? - https://phabricator.wikimedia.org/T101486#1340178 (10Mattflaschen) See https://www.mediawiki.org/wiki/Manual:GenerateSitemap.php . Don't know if we really need one, though. [02:38:00] 10Ops-Access-Requests, 6operations: Requesting access to analytics cluster for AndyRussG - https://phabricator.wikimedia.org/T101443#1340179 (10AndyRussG) [02:45:10] 10Ops-Access-Requests, 6operations: Requesting access to analytics cluster for AndyRussG - https://phabricator.wikimedia.org/T101443#1340198 (10AndyRussG) I put my shell username and full name in the task description. I'm not sure which analytics group need... Thanks again! 
[02:58:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0] [03:10:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [03:17:45] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [03:34:54] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [03:36:15] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [03:39:44] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60563 bytes in 0.096 second response time [03:50:57] (03PS3) 10Ori.livneh: webperf: Remove JQMigrateUsage deprecate handler [puppet] - 10https://gerrit.wikimedia.org/r/210263 (owner: 10Krinkle) [03:51:07] (03CR) 10Ori.livneh: [C: 032 V: 032] webperf: Remove JQMigrateUsage deprecate handler [puppet] - 10https://gerrit.wikimedia.org/r/210263 (owner: 10Krinkle) [03:52:14] (03CR) 10Ori.livneh: [C: 031] RT: adjust Apache config to be behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/215972 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [03:59:16] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1340241 (10BBlack) Yeah, that's why we're blocking on dual-cert support in nginx (or other solution to the problem). There's similar issues with e.g. Chrome on WinXP, so we won... [04:04:16] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1340242 (10BBlack) One more general question: is it possible/usable to put the SSH part on a separate IP/hostname from the websockets... 
[04:07:58] (03PS1) 10Springle: depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216027 [04:08:23] (03CR) 10Springle: [C: 032] depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216027 (owner: 10Springle) [04:08:29] (03Merged) 10jenkins-bot: depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216027 (owner: 10Springle) [04:10:21] !log springle Synchronized wmf-config/db-eqiad.php: depool db1073 (duration: 01m 08s) [04:10:25] Logged the message, Master [04:23:06] (03PS1) 10BBlack: Delete clicktracking-session cookies more thoroughly [puppet] - 10https://gerrit.wikimedia.org/r/216029 [04:23:27] (03CR) 10BBlack: [C: 032 V: 032] Delete clicktracking-session cookies more thoroughly [puppet] - 10https://gerrit.wikimedia.org/r/216029 (owner: 10BBlack) [04:24:45] (03PS3) 10BBlack: Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) [04:24:59] (03CR) 10BBlack: [C: 04-1] "Still on hold, just checking rebase" [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [04:37:50] (03CR) 1020after4: "Gueseppe: I really did want to give quilt a fair shot but I spent some more time with the quilt documentation (now far exceeding the amoun" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [04:45:04] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (100129s 100000s) [05:02:20] (03CR) 10Ori.livneh: [C: 04-1] "mwpatch seems fine. mwcli should be replaced with something off-the-shelf. 
Surely one of these can do the trick:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [05:08:41] (03CR) 10Ori.livneh: "(And don't be discouraged, I think it'd be good to have a tool for this.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [05:18:45] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds [05:20:35] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [05:23:30] (03CR) 1020after4: "ori: how about http://climate.thephpleague.com" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [05:23:53] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jun 5 05:22:50 UTC 2015 (duration 22m 49s) [05:24:01] Logged the message, Master [05:24:37] (03CR) 10Ori.livneh: "Why not? Looks nice, good test coverage, clean code." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [05:38:47] <_joe_> good morning everyone [05:39:24] <_joe_> springle: it seems we had a double-fuckup with the mysql.connect timeout being removed from one place and wrongly added everywhere else [05:40:25] PROBLEM - puppet last run on mw1158 is CRITICAL Puppet has 1 failures [05:53:50] _joe_: :) [05:55:11] _joe_: at leat the fatal noise if much reduced. now we can have some hope of figuring out where the remaining glitches are occuring. 
thanks for that [05:55:17] *is much [05:57:15] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [05:58:04] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [06:13:25] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:16:45] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (15209 100000s) [06:24:40] !log added redis_2.8.4-2+wmf1 to trusty-wikimedia on apt.wikimedia.org [06:24:44] Logged the message, Master [06:30:35] PROBLEM - puppet last run on logstash1006 is CRITICAL Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on mw1235 is CRITICAL Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on cp1058 is CRITICAL puppet fail [06:32:15] PROBLEM - puppet last run on wtp2015 is CRITICAL Puppet has 1 failures [06:32:24] PROBLEM - puppet last run on cp4004 is CRITICAL Puppet has 1 failures [06:32:25] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 1 failures [06:32:55] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:33:15] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures [06:33:29] _joe_, what do you think about T98489 now? 
[06:34:16] I'm asking about "have discussed with @Joe the possibility of just patching hhvm's hardcoded mysqlExtension::ConnectTimeout to 3000ms in our next build" [06:34:34] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures [06:34:34] PROBLEM - puppet last run on gallium is CRITICAL Puppet has 1 failures [06:34:35] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures [06:34:44] PROBLEM - puppet last run on tin is CRITICAL Puppet has 2 failures [06:34:45] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [06:34:45] PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures [06:34:45] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures [06:34:46] PROBLEM - puppet last run on mw1172 is CRITICAL Puppet has 1 failures [06:35:04] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures [06:35:04] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures [06:35:25] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures [06:35:44] PROBLEM - puppet last run on mw2136 is CRITICAL Puppet has 2 failures [06:37:19] <_joe_> jynus: 1 sec sorry [06:37:27] not urgent [06:39:47] 6operations, 7HHVM, 5Patch-For-Review: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1340363 (10Joe) So I assumed the original change didn't work because of something not working in that ini setting. In fact it turned out that the patch for the canaries (which I b... [06:40:23] oh, I see [06:41:12] so, I will leave it open as normal priority, ok? 
[06:46:35] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:46:55] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:47:25] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:47:45] RECOVERY - puppet last run on logstash1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:45] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:46] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:46] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:48:15] RECOVERY - puppet last run on gallium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:15] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:16] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:16] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:48:17] (03CR) 10Nemo bis: "Ori, can you push the amended commit to gerrit?" 
[puppet] - 10https://gerrit.wikimedia.org/r/213579 (https://phabricator.wikimedia.org/T94807) (owner: 10Nemo bis) [06:48:24] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:25] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:34] RECOVERY - puppet last run on mw1172 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:34] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:34] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:34] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:35] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:45] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:45] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:22] <_joe_> jynus: yes, and assign it to me [06:49:25] RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:05] ok [06:50:46] I said thank you to all people involved, didn't know if you show it [06:51:02] you may my work now way easier [06:51:06] *make [06:53:55] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 31 hours old. [06:54:46] PROBLEM - statsite backend instances on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:54:46] PROBLEM - configured eth on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:54:46] PROBLEM - DPKG on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:55:04] 6operations, 7HHVM: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1340381 (10jcrespo) p:5High>3Normal a:5jcrespo>3Joe [06:55:05] PROBLEM - SSH on graphite2001 is CRITICAL - Socket timeout after 10 seconds [06:55:35] PROBLEM - salt-minion processes on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:55:35] PROBLEM - dhclient process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:55:35] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [06:55:44] PROBLEM - statsdlb process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:55:45] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:55:55] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:55:55] PROBLEM - puppet last run on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:56:05] PROBLEM - RAID on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:56:34] PROBLEM - Disk space on graphite2001 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [07:02:04] RECOVERY - SSH on graphite2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [07:02:24] RECOVERY - salt-minion processes on graphite2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:02:24] RECOVERY - dhclient process on graphite2001 is OK: PROCS OK: 0 processes with command name dhclient [07:02:35] RECOVERY - statsdlb process on graphite2001 is OK: PROCS OK: 1 process with command name statsdlb [07:02:36] RECOVERY - uWSGI web apps on graphite2001 is OK All defined uWSGI apps are runnning. 
[07:02:44] RECOVERY - puppet last run on graphite2001 is OK Puppet is currently enabled, last run 24 minutes ago with 0 failures [07:02:45] RECOVERY - Graphite Carbon on graphite2001 is OK All defined Carbon jobs are runnning. [07:02:55] RECOVERY - RAID on graphite2001 is OK Active: 8, Working: 8, Failed: 0, Spare: 0 [07:03:24] RECOVERY - DPKG on graphite2001 is OK: All packages OK [07:03:24] RECOVERY - Disk space on graphite2001 is OK: DISK OK [07:03:25] RECOVERY - statsite backend instances on graphite2001 is OK All defined statsite jobs are runnning. [07:03:25] RECOVERY - configured eth on graphite2001 is OK - interfaces up [07:12:45] RECOVERY - MySQL Replication Heartbeat on es1009 is OK replication delay 0 seconds [07:16:48] 6operations: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#1340396 (10C933103) I don't think unifont should be use instead as according to http://wiki.debian.org.hk/w/Fonts it look like a wqy's font is an improved version over unibit in term of Chinese support. [07:17:31] (03PS27) 10KartikMistry: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) [07:19:19] 6operations, 10vm-requests: EQIAD: 1 VM request for etherpad - https://phabricator.wikimedia.org/T101492#1340397 (10akosiaris) 3NEW [07:19:41] 6operations, 10vm-requests: EQIAD: 1 VM request for etherpad - https://phabricator.wikimedia.org/T101492#1340404 (10akosiaris) p:5Triage>3Normal [07:20:12] 6operations, 10vm-requests: EQIAD: 1 VM request for etherpad - https://phabricator.wikimedia.org/T101492#1340406 (10akosiaris) a:3akosiaris [07:20:54] 6operations, 10vm-requests: EQIAD: 1 VM request for etherpad - https://phabricator.wikimedia.org/T101492#1340407 (10akosiaris) 5Open>3Resolved VM has been created on the ganeti01.svc.eqiad.wmnet cluster with the required attributes. 
Resolving [07:20:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [07:21:04] talking about the fastest task ever :P [07:26:33] (03CR) 10Mobrovac: "@Kartik, no need to rebase the patch, it'll be done right before merging." [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: 10KartikMistry) [07:40:44] <_joe_> '/win 25 [07:55:03] 6operations, 7database: es[12]00[123] maintenance and upgrade - https://phabricator.wikimedia.org/T101084#1340442 (10jcrespo) I've just realized that we do not have any 10s as masters in production ("pc" are the closest thing, but they are not real "masters"), and we do not have things like read-write and good... [08:08:39] (03PS3) 10Faidon Liambotis: base: kill the wmf-ca CA [puppet] - 10https://gerrit.wikimedia.org/r/215821 [08:08:47] (03CR) 10Faidon Liambotis: [C: 032] base: kill the wmf-ca CA [puppet] - 10https://gerrit.wikimedia.org/r/215821 (owner: 10Faidon Liambotis) [08:13:44] PROBLEM - puppet last run on cp3035 is CRITICAL Puppet has 1 failures [08:13:55] PROBLEM - puppet last run on wtp1023 is CRITICAL Puppet has 1 failures [08:14:05] PROBLEM - puppet last run on lvs1001 is CRITICAL Puppet has 1 failures [08:14:15] PROBLEM - puppet last run on wtp2009 is CRITICAL Puppet has 1 failures [08:14:15] PROBLEM - puppet last run on wtp2007 is CRITICAL Puppet has 1 failures [08:14:16] PROBLEM - puppet last run on mw1171 is CRITICAL Puppet has 1 failures [08:14:24] PROBLEM - puppet last run on mw1057 is CRITICAL Puppet has 1 failures [08:25:05] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1340457 (10Qgil) Hi @intracer, in order to request a new project you need to create a new task prividing the information detailed here: https:... 
[08:29:28] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1340462 (10intracer) >>! In T706#1340457, @Qgil wrote: > Hi @intracer, in order to request a new project you need to create a new task prividi... [08:29:55] RECOVERY - puppet last run on mw1057 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [08:30:09] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1340463 (10Qgil) The policy of a space should determine the policy of all objects living in that space, files just as much as tasks. [08:30:25] RECOVERY - puppet last run on mw1183 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:30:44] RECOVERY - puppet last run on wtp1002 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:30:45] RECOVERY - puppet last run on es2007 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [08:30:45] RECOVERY - puppet last run on acamar is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:30:54] RECOVERY - puppet last run on ms-be1009 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [08:31:05] RECOVERY - puppet last run on mw1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:31:15] RECOVERY - puppet last run on wtp1023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:31:25] RECOVERY - puppet last run on lvs1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:31:35] RECOVERY - puppet last run on wtp2007 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:31:36] RECOVERY - puppet last run on wtp2009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:31:36] RECOVERY - puppet last run on mw1171 is OK Puppet is 
currently enabled, last run 53 seconds ago with 0 failures [08:31:45] RECOVERY - puppet last run on mw2054 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [08:31:54] RECOVERY - puppet last run on mw2051 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [08:32:05] RECOVERY - puppet last run on mw1212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:32:05] RECOVERY - puppet last run on mw1023 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [08:32:05] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:32:06] RECOVERY - puppet last run on mw2042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:32:14] RECOVERY - puppet last run on mw2094 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:32:14] RECOVERY - puppet last run on mw2065 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [08:32:40] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1340475 (10Qgil) [08:32:44] RECOVERY - puppet last run on mw1029 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:32:44] RECOVERY - puppet last run on mw2027 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:35:17] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [08:42:15] PROBLEM - puppet last run on pollux is CRITICAL puppet fail [08:47:25] RECOVERY - puppet last run on pollux is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [08:48:39] (03PS2) 10Faidon Liambotis: base: remove wmf-ca ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/215829 [08:48:53] (03CR) 10Faidon Liambotis: [C: 032] base: remove wmf-ca ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/215829 (owner: 10Faidon Liambotis) 
[08:54:12] (03PS8) 10Faidon Liambotis: certs: replace require by collector ordering [puppet] - 10https://gerrit.wikimedia.org/r/215352 [09:00:31] 6operations: Need Kaldari account on Gerrit disabled - https://phabricator.wikimedia.org/T101479#1340519 (10Aklapper) [09:00:54] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [09:03:11] 6operations, 7HHVM, 7Wikimedia-log-errors: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1340532 (10hashar) [09:06:45] PROBLEM - puppet last run on ms-be1017 is CRITICAL Puppet has 1 failures [09:08:26] 6operations, 7HHVM, 7Wikimedia-log-errors: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1340548 (10hashar) Thank you everyone, that largely removed the spam we were seeing in logstash! Kudos to everyone involved! [09:11:14] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [09:17:33] !log added redis_2.6.13-1+wmf1 to precise-wikimedia on apt.wikimedia.org [09:17:37] Logged the message, Master [09:22:25] RECOVERY - puppet last run on ms-be1017 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [09:35:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 660 [09:37:01] this shouldn't happen^ on it [09:38:44] oh, db1008, not es1008 [09:40:00] 6operations, 6Release-Engineering, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340615 (10hashar) Thanks @Dzahn :-) [09:40:01] (03PS1) 10Faidon Liambotis: authdns: switch to new ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/216051 [09:40:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 4308875 Threads: 2 Questions: 14707123 Slow queries: 28687 Opens: 45461 Flush tables: 2 Open tables: 64 Queries per second avg: 3.413 Slave IO: Yes 
Slave SQL: Yes Seconds Behind Master: 0 [09:40:27] (03CR) 10Faidon Liambotis: [C: 032] authdns: switch to new ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/216051 (owner: 10Faidon Liambotis) [09:46:47] (03PS1) 10Faidon Liambotis: authdns: switch authdns-update to reference new key [puppet] - 10https://gerrit.wikimedia.org/r/216052 [09:47:05] (03CR) 10Faidon Liambotis: [C: 032 V: 032] authdns: switch authdns-update to reference new key [puppet] - 10https://gerrit.wikimedia.org/r/216052 (owner: 10Faidon Liambotis) [10:17:48] (03PS1) 10ArielGlenn: dumps: tweak 7z compression arg so compression runs much faster [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/216058 [10:19:42] (03CR) 10ArielGlenn: [C: 032] dumps: tweak 7z compression arg so compression runs much faster [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/216058 (owner: 10ArielGlenn) [10:20:47] ^relevant: http://dbahire.com/which-compression-tool-should-i-use-for-my-database-backups/ [10:26:22] (03PS1) 10Jcrespo: Repool db1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216059 [10:27:29] (03CR) 10Jcrespo: [C: 032] Repool db1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216059 (owner: 10Jcrespo) [10:44:06] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [10:46:19] ^ori, I will merge my repool (shouldn't affect your work) [10:54:48] <_joe_> I hope or.i is sleeping [10:55:01] <_joe_> but one can never be sure :P [10:56:30] I will stash his changes, then [10:58:18] <_joe_> mh [10:58:27] <_joe_> wait a sec, lemme see [10:58:56] too late [10:59:53] <_joe_> I think that has probably been already synced to prod [11:01:20] <_joe_> lemme check that [11:03:56] <_joe_> yes it was, apparently [11:04:01] pop whatever you want later, I will sync head [11:04:11] (my head) [11:04:25] <_joe_> jynus: you're just doing sync-file, right? 
[11:04:31] yep [11:04:45] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [11:05:22] I meant that I will put that branch on head^, not that I will sync everything [11:05:33] <_joe_> oh sorry ok [11:05:46] <_joe_> just keep an eye out when someone wants to do SWAT [11:05:50] <_joe_> in the afternoon [11:05:58] <_joe_> uhm, nevermind, it's friday [11:06:06] <_joe_> no deploy :) [11:09:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 13.33% of data above the critical threshold [500.0] [11:10:05] PROBLEM - DPKG on stat1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:12:25] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail [11:18:52] i am about to have lunch, so I will repool later [11:20:37] apergos: you around? [11:22:51] kaldari: am awake now [11:22:56] I hope you aren't? [11:23:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:25:29] 6operations, 10vm-requests, 7discovery-system: eqiad: 3 VM request for ETCD - https://phabricator.wikimedia.org/T101506#1340815 (10Joe) 3NEW [11:26:08] 6operations, 10vm-requests, 7discovery-system: eqiad: 3 VM request for ETCD - https://phabricator.wikimedia.org/T101506#1340825 (10Joe) [11:27:55] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [11:30:38] <_joe_> !log uploaded new HHVM package, installing on mw1025 for testing [11:30:42] Logged the message, Master [11:31:24] PROBLEM - DPKG on dataset1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:40:45] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [12:00:55] _joe_: have you picked Stas patch for gmp_init() ? 
https://phabricator.wikimedia.org/T98882#1322422 [12:01:15] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [12:01:23] CI instances are still at 3.3.1+dfsg1-1+wm3 [12:02:07] <_joe_> hashar: nope, I will as soon as possible [12:02:12] <_joe_> this was a security release [12:02:19] okk [12:08:32] !log jynus Synchronized wmf-config/db-eqiad.php: Repool es1009 (duration: 01m 08s) [12:08:36] Logged the message, Master [12:16:03] gwicke: yes? [12:23:35] PROBLEM - Redis on mc1005 is CRITICAL: Connection refused [12:25:15] RECOVERY - Redis on mc1005 is OK: TCP OK - 0.002 second response time on port 6379 [12:26:14] (03PS1) 10Jcrespo: Depool es1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216070 [12:32:29] (03CR) 10Jcrespo: [C: 032] Depool es1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216070 (owner: 10Jcrespo) [12:37:25] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [12:39:45] PROBLEM - puppet last run on mw1236 is CRITICAL Puppet has 1 failures [12:42:25] (03CR) 10Jcrespo: "es1009, not db1009, title is wrong" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216059 (owner: 10Jcrespo) [12:44:29] hello! is this the place where could i ask about why don't english wikipedia dumps include pages-meta-history anymore? [12:44:35] !log jynus Synchronized wmf-config/db-eqiad.php: Depool es1007 (duration: 01m 08s) [12:44:39] Logged the message, Master [12:45:15] hi, d33tah, there has been some issue with the dumps lately [12:45:36] jynus: i have a project that kind of relies on latest dumps. is there any schedule on fixing the issue? 
[12:46:22] I know there has been some work, I am looking if it has been fixed [12:46:27] (not my field) [12:47:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [12:47:55] jynus: please let me know when you find some information or where i should look, i'd love to get an as late enwiki dump as possible [12:48:22] i'm starting to be desperate enough to get .bz2 pages-meta-history from 20150515 [12:50:01] d33tah: why is that one bad? [12:50:24] first, it's .bz2, which means it'd take quite a few days to download given the bandwidth limits wikimedia puts [12:50:36] I can tell you that enwiki dumps are in progress, there was an recent issue on phabricator about dumps, but I cannot find it now [12:50:46] d33tah: you usually use 7z? [12:50:50] Nemo_bis: yup [12:50:55] ok [12:50:57] jynus: the current dump doesn't seem to have pages-meta-history scheduled [12:51:28] Nemo_bis: secondly, it's from a month ago, which means i'd have to write some code that would also integrate my IRC archives from the wikipedia notification server [12:51:35] and it's not perfectly reliable [12:52:37] (my script, i mean) [12:53:38] (03PS2) 10Andrew Bogott: novaconfig: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215769 (owner: 10Matanya) [12:53:45] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 37 hours old. [12:54:46] (03PS2) 10Andrew Bogott: salt_reactor_options: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215785 (owner: 10Matanya) [12:54:55] RECOVERY - puppet last run on mw1236 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:54:58] (03CR) 10Andrew Bogott: [C: 032] novaconfig: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215769 (owner: 10Matanya) [12:55:24] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. 
[12:56:02] (03PS3) 10Andrew Bogott: salt_reactor_options: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215785 (owner: 10Matanya) [12:57:00] (03CR) 10Andrew Bogott: [C: 031] puppet_certname: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215772 (owner: 10Matanya) [12:57:08] (03CR) 10Andrew Bogott: [C: 032] salt_reactor_options: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215785 (owner: 10Matanya) [12:57:36] d33tah, could this be related? https://phabricator.wikimedia.org/T98585 [12:58:12] (03PS2) 10Andrew Bogott: site: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215788 (owner: 10Matanya) [12:58:23] jynus: no idea, maybe [12:59:21] (03CR) 10Andrew Bogott: [C: 032] site: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215788 (owner: 10Matanya) [13:00:45] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:02:06] (03CR) 10Andrew Bogott: [C: 032] wikitech: make it a CNAME for silver [dns] - 10https://gerrit.wikimedia.org/r/216021 (https://phabricator.wikimedia.org/T73218) (owner: 10Dzahn) [13:04:00] This could be the answer, on the dumps list: https://lists.wikimedia.org/pipermail/xmldatadumps-l/2015-May/001114.html [13:04:42] "Do expect some delays in the dump generation, probably the first few ones will be out in June" [13:05:39] (03PS5) 10Andrew Bogott: labs_lvm: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211346 (owner: 10Dzahn) [13:06:22] if not, that list would be the best place to ask, as the right people may not be available right now, d33tah [13:06:32] (03CR) 10Andrew Bogott: [C: 032] labs_lvm: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211346 (owner: 10Dzahn) [13:07:30] jynus: thanks! 
[13:14:12] blocked until the dumps finish for the current wikis :-( [13:14:36] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:22:14] (03PS1) 10Yuvipanda: toollabs: Specify redis IP manually [puppet] - 10https://gerrit.wikimedia.org/r/216080 [13:22:15] Coren: ^ [13:23:42] (03PS2) 10Yuvipanda: toollabs: Specify redis IP manually [puppet] - 10https://gerrit.wikimedia.org/r/216080 [13:23:47] (03CR) 10coren: [C: 031] toollabs: Specify redis IP manually [puppet] - 10https://gerrit.wikimedia.org/r/216080 (owner: 10Yuvipanda) [13:23:59] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Specify redis IP manually [puppet] - 10https://gerrit.wikimedia.org/r/216080 (owner: 10Yuvipanda) [13:25:15] YuviPanda: tbh, I'm not entirely clear what the ultimate point of ipresolve() is; isn't baking the IP in the manifest a fundamentally inferior method to just using the name or was the issue caused by bits of config where it was necessary to specify the numerical IP? [13:25:28] Coren: it's in cases when you do need an IP. like /etc/hosts [13:25:40] yeah [13:25:41] Ah, right. [13:28:09] 6operations, 10vm-requests, 7discovery-system: eqiad: 3 VM request for ETCD - https://phabricator.wikimedia.org/T101506#1341022 (10akosiaris) Sounds reasonable. Moving on with the next steps. The are described here: https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM The first step is already done: http... [13:33:29] (03CR) 10Faidon Liambotis: [C: 031] "base-installer/kernel/image does not set the d-i kernel, just the target's system kernel." [puppet] - 10https://gerrit.wikimedia.org/r/211688 (https://phabricator.wikimedia.org/T100773) (owner: 10Muehlenhoff) [13:34:56] jynus: ping? [13:35:13] paravoid, yes [13:36:02] do you know why db1022 runs a very old kernel? [13:36:06] 2.6.32(!) 
[13:36:23] despite being a precise box [13:36:30] let me check what that box is [13:37:26] it is also an old mysql, but a slave, which usually means "do not touch" [13:37:43] I'm looking [13:38:50] snapshot, vslow, dump" [13:39:17] (03CR) 10Alexandros Kosiaris: [C: 032] "Seems like docker is messing up with the networking, creating virtual interfaces and handling IP assignment on those. Since this is labs a" [puppet] - 10https://gerrit.wikimedia.org/r/213530 (https://phabricator.wikimedia.org/T99564) (owner: 10Mobrovac) [13:39:50] well, it is certainly on the list to be upgraded [13:40:00] (03PS2) 10Alexandros Kosiaris: Cassandra: deployment-prep: Set the correct listen IP [puppet] - 10https://gerrit.wikimedia.org/r/213530 (https://phabricator.wikimedia.org/T99564) (owner: 10Mobrovac) [13:40:05] ok [13:40:14] but those particular mysqls are tricky, because of a special partitioning [13:40:19] it's one of two servers running a < 3.2 kernel [13:40:19] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "Rebased to merge based on ff-only policy" [puppet] - 10https://gerrit.wikimedia.org/r/213530 (https://phabricator.wikimedia.org/T99564) (owner: 10Mobrovac) [13:40:26] the other one is a really old box :) [13:40:28] (sodium) [13:40:33] what we can do [13:40:42] is put it on a higher priority [13:40:50] if it is a problem [13:40:59] is that ok? [13:41:31] sure, although to be fair it's not a huge problem obviously :) [13:41:44] moritzm: ^ btw :) [13:41:58] there are still some "single points of slowdown", which we are slowly fixing [13:42:25] that sounds awesome :) [13:43:14] with that I mean that if they are depooled/crash and bing starts hitting us, it will be a problem [13:43:46] 6operations, 5Patch-For-Review: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1341278 (10faidon) Besides switching the default (which is about to be committed), we should probably upgrade the existing installed systems to 3.19 as well, for uniformity. 
Looking at s... [13:44:10] plus upgrading 5.5 is being queued until the benefits overcome the risks of the failover [13:44:16] but we are on it [13:44:17] :-) [13:44:55] I have upgraded 20 machines or so since I enrolled, only 150 to go :-) [13:48:58] (03CR) 10Andrew Bogott: [C: 032] remote_cert_cleaner: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215777 (owner: 10Matanya) [13:49:02] (03PS2) 10Andrew Bogott: remote_cert_cleaner: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215777 (owner: 10Matanya) [13:56:35] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 814.204342262 [13:57:54] 6operations, 6Labs, 10Labs-Infrastructure, 10wikitech.wikimedia.org, and 2 others: Enable IPv6 on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T73218#1341313 (10wpmirrordev) Let us test this from an IPv6 only network: (shell) ping6 -c 1 wikitech.wikimedia.org PING wikitech.wikimedia.org(silv... 
[14:00:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [14:07:25] !log added ubuntu-meta-1.267.1+wmf1 for precise-wikimedia to apt.wikimedia.org (T100004) [14:07:30] Logged the message, Master [14:07:44] !log added ubuntu-meta-1.325+wmf1 for trusty-wikimedia to apt.wikimedia.org (T100004) [14:07:48] Logged the message, Master [14:10:53] (03PS1) 10Yuvipanda: ores: Introduce staging role, make caching lb optional [puppet] - 10https://gerrit.wikimedia.org/r/216093 [14:11:35] (03CR) 10jenkins-bot: [V: 04-1] ores: Introduce staging role, make caching lb optional [puppet] - 10https://gerrit.wikimedia.org/r/216093 (owner: 10Yuvipanda) [14:12:36] (03PS2) 10Yuvipanda: ores: Introduce staging role, make caching lb optional [puppet] - 10https://gerrit.wikimedia.org/r/216093 [14:13:49] (03CR) 10Yuvipanda: [C: 032] ores: Introduce staging role, make caching lb optional [puppet] - 10https://gerrit.wikimedia.org/r/216093 (owner: 10Yuvipanda) [14:15:34] (03PS1) 10Yuvipanda: ores: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/216094 [14:15:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [14:15:50] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/216094 (owner: 10Yuvipanda) [14:16:15] (03PS1) 10Giuseppe Lavagetto: etcd: add SRV record for autodiscovery [dns] - 10https://gerrit.wikimedia.org/r/216095 [14:19:39] (03PS1) 10ArielGlenn: dumps: fix exception raised in stub phase on failure [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/216096 [14:22:07] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM technically, question inline though" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/216095 (owner: 10Giuseppe Lavagetto) [14:24:33] (03CR) 10ArielGlenn: [C: 032] dumps: fix exception raised in stub phase on failure [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/216096 (owner: 10ArielGlenn) [14:24:35] (03CR) 10Giuseppe 
Lavagetto: etcd: add SRV record for autodiscovery (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/216095 (owner: 10Giuseppe Lavagetto) [14:34:43] is there any upcoming SWAT? [14:35:11] no deploys on Friday [14:35:15] so not today, no [14:35:18] meh [14:36:56] lag on one of the slave, my fault [14:37:03] (03PS1) 10Giuseppe Lavagetto: ganglia: configure the ganglia aggregator for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/216098 [14:37:04] should be gone in a second [14:37:05] (03PS1) 10Giuseppe Lavagetto: etcd: setup servers/ganglia stuff [puppet] - 10https://gerrit.wikimedia.org/r/216099 [14:37:20] (03CR) 10Muehlenhoff: [C: 032 V: 032] autoinstall: install linux-meta (3.19) on jessie [puppet] - 10https://gerrit.wikimedia.org/r/211688 (https://phabricator.wikimedia.org/T100773) (owner: 10Muehlenhoff) [14:37:55] Should I admin-log it, if it only took 10 seconds? [14:38:40] might as well [14:39:36] (03PS7) 10Muehlenhoff: autoinstall: install linux-meta (3.19) on jessie [puppet] - 10https://gerrit.wikimedia.org/r/211688 (https://phabricator.wikimedia.org/T100773) [14:41:11] (03PS1) 10Andrew Bogott: Prep labcontrol1001 to be the new labs controller [puppet] - 10https://gerrit.wikimedia.org/r/216101 [14:41:50] (03CR) 10Andrew Bogott: [C: 032] Prep labcontrol1001 to be the new labs controller [puppet] - 10https://gerrit.wikimedia.org/r/216101 (owner: 10Andrew Bogott) [14:43:51] !log short lag period on db1049, traffic automatically redirected to other slave and back to normal [14:43:56] Logged the message, Master [14:44:12] <_joe_> andrewbogott: which hosts you don't see in ganglia? [14:44:19] <_joe_> I think I got what's the problem [14:44:22] labvirt100x [14:44:26] <_joe_> ok [14:44:45] also labnet100x [14:46:36] <_joe_> actually, no. [14:46:45] <_joe_> they're using ganglia, not ganglia_new [14:46:52] <_joe_> so no, I have no idea [14:47:38] :( [14:48:05] Is it possible that labcontrol1001 is ‘stealing’ the data because it’s a competing aggregator? 
[14:56:10] (03PS2) 10Giuseppe Lavagetto: etcd: setup servers/ganglia stuff [puppet] - 10https://gerrit.wikimedia.org/r/216099 [14:56:12] (03PS2) 10Giuseppe Lavagetto: ganglia: configure the ganglia aggregator for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/216098 [14:56:57] (03CR) 10Muehlenhoff: [V: 032] autoinstall: install linux-meta (3.19) on jessie [puppet] - 10https://gerrit.wikimedia.org/r/211688 (https://phabricator.wikimedia.org/T100773) (owner: 10Muehlenhoff) [15:10:17] greetings [15:10:38] (03PS8) 10Muehlenhoff: autoinstall: install linux-meta (3.19) on jessie [puppet] - 10https://gerrit.wikimedia.org/r/211688 (https://phabricator.wikimedia.org/T100773) [15:11:09] (03CR) 10Muehlenhoff: [V: 032] autoinstall: install linux-meta (3.19) on jessie [puppet] - 10https://gerrit.wikimedia.org/r/211688 (https://phabricator.wikimedia.org/T100773) (owner: 10Muehlenhoff) [15:11:32] g'morn [15:17:57] * YuviPanda waves at godog from across his desk [15:19:52] * godog waves back [15:20:35] YuviPanda: somebody was indeed wondering what's the deal with me here and you in the UK [15:20:39] (03PS2) 10Gage: puppet_certname: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215772 (owner: 10Matanya) [15:21:25] godog: :D we'll swap next month [15:22:10] hehe true, shame the timing didn't work out [15:23:03] (03CR) 10Gage: [C: 032] puppet_certname: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/215772 (owner: 10Matanya) [15:29:41] (03CR) 10Alexandros Kosiaris: [C: 031] ganglia: configure the ganglia aggregator for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/216098 (owner: 10Giuseppe Lavagetto) [15:32:18] moritzm, thank you for letting us know about that beforehand. 
[15:37:06] (03CR) 10BBlack: [C: 031] "existing comment aside of course" [dns] - 10https://gerrit.wikimedia.org/r/216095 (owner: 10Giuseppe Lavagetto) [15:42:58] Krenair: you're welcome [15:45:35] (03PS2) 10Giuseppe Lavagetto: etcd: add SRV record for autodiscovery [dns] - 10https://gerrit.wikimedia.org/r/216095 [15:47:38] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: add SRV record for autodiscovery [dns] - 10https://gerrit.wikimedia.org/r/216095 (owner: 10Giuseppe Lavagetto) [16:00:23] (03PS3) 10Giuseppe Lavagetto: etcd: setup servers/ganglia stuff [puppet] - 10https://gerrit.wikimedia.org/r/216099 [16:00:25] (03PS3) 10Giuseppe Lavagetto: ganglia: configure the ganglia aggregator for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/216098 [16:02:00] (03CR) 10Giuseppe Lavagetto: [C: 032] ganglia: configure the ganglia aggregator for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/216098 (owner: 10Giuseppe Lavagetto) [16:17:06] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [16:25:54] <_joe_> ottomata: take a look at this puppet failure ^^ [16:26:11] <_joe_> it has to do with kafkatee.PY [16:26:53] (03PS1) 10Andrew Bogott: Open up firewall to include a spare openstack controller [puppet] - 10https://gerrit.wikimedia.org/r/216116 [16:27:11] (03PS1) 10Filippo Giunchedi: install-server: provision new restbase machines [puppet] - 10https://gerrit.wikimedia.org/r/216118 (https://phabricator.wikimedia.org/T101112) [16:28:51] (03PS2) 10Andrew Bogott: Open up firewall to include a spare openstack controller [puppet] - 10https://gerrit.wikimedia.org/r/216116 [16:29:15] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 5982.39459617 [16:30:53] (03PS1) 10Andrew Bogott: Rename openstack::firewall to openstack::controller_firewall [puppet] - 10https://gerrit.wikimedia.org/r/216119 [16:31:27] (03CR) 10Andrew Bogott: [C: 032] Open up firewall to include a spare 
openstack controller [puppet] - 10https://gerrit.wikimedia.org/r/216116 (owner: 10Andrew Bogott) [16:33:05] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:33:26] looking [16:33:28] oh, fine now? [16:34:23] huh, weird, some gmond pyconf file failure, but now fine [16:34:24] strange [16:38:15] <_joe_> ottomata: I changed something in ganglia earlier, if you see any issue, please revert [16:39:08] (03PS1) 10Andrew Bogott: Allow labvirt1001 to talk to keystone and nova and such on virt1000 [puppet] - 10https://gerrit.wikimedia.org/r/216121 [16:40:19] (03CR) 10Andrew Bogott: [C: 032] Rename openstack::firewall to openstack::controller_firewall [puppet] - 10https://gerrit.wikimedia.org/r/216119 (owner: 10Andrew Bogott) [16:41:37] _joe_: this one? [16:41:37] https://gerrit.wikimedia.org/r/#/c/216098/ [16:41:59] (03CR) 10Andrew Bogott: [C: 032] Allow labvirt1001 to talk to keystone and nova and such on virt1000 [puppet] - 10https://gerrit.wikimedia.org/r/216121 (owner: 10Andrew Bogott) [16:44:05] (03PS1) 10Andrew Bogott: Fixed ferm syntax error [puppet] - 10https://gerrit.wikimedia.org/r/216124 [16:45:09] (03CR) 10Andrew Bogott: [C: 032] Fixed ferm syntax error [puppet] - 10https://gerrit.wikimedia.org/r/216124 (owner: 10Andrew Bogott) [16:46:09] <_joe_> ottomata: yeah it should have zero effect, but who knows [16:46:49] yeah, looks unrelated, maybe there was some temporary puppet hiccup it caused with generating the pyconf file [16:46:53] seems ok now though [17:01:48] (03PS2) 10Filippo Giunchedi: install-server: provision new restbase machines [puppet] - 10https://gerrit.wikimedia.org/r/216118 (https://phabricator.wikimedia.org/T101112) [17:01:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install-server: provision new restbase machines [puppet] - 10https://gerrit.wikimedia.org/r/216118 (https://phabricator.wikimedia.org/T101112) (owner: 10Filippo Giunchedi) [17:03:59] cmjohnson1: ^ can you try 
again? [17:04:19] will do in a bit [17:04:34] cmjohnson1: kk, wait to sign puppet keys tho [17:11:14] (03PS2) 10Rush: RT: move behind misc-web [dns] - 10https://gerrit.wikimedia.org/r/215974 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [17:11:30] (03CR) 10Rush: [C: 031] "seems ok to me in tandem with other relevant changes" [dns] - 10https://gerrit.wikimedia.org/r/215974 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [17:21:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [17:27:03] ori, did we ever disable the hhvm tag? I assume it's pointless now [17:27:27] disabled it, yeah. legoktm wrote a maintenance job to remove it from old edits [17:28:06] https://gerrit.wikimedia.org/r/#/c/203970/ <-- could use review, I ran PS3 which was really really slow [17:28:20] well, I started to run it on testwiki and it was really slow. [17:30:05] that script would have been better if it had the tag as a parameter, ah well [17:33:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [17:51:46] (03PS1) 10Andrew Bogott: Move the glance image datadir into a puppet var. [puppet] - 10https://gerrit.wikimedia.org/r/216135 [17:55:56] I want to set up a cron that rsyncs between two hosts — one is a hot spare and I want it to have an up-to-date copy of some data. Is there an established puppet pattern for this? I see an rsync module… [18:01:43] !log restarted gerrit on ytterbium for java update [18:01:47] Logged the message, Master [18:01:54] gerrit.wikimedia.org can be used again [18:04:32] Yay. 
[18:11:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0] [18:21:44] PROBLEM - HHVM rendering on mw1041 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [18:21:55] PROBLEM - Apache HTTP on mw1041 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.016 second response time [18:22:43] so, ocg is throwing sporadic 500s, can't find it ganglia, looking in logstash cc cscott [18:23:14] godog: from the above, it seems like graphite is also throwing 500s? [18:23:25] RECOVERY - HHVM rendering on mw1041 is OK: HTTP OK: HTTP/1.1 200 OK - 70462 bytes in 0.154 second response time [18:23:35] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [18:23:47] and mw1041 is throwing 500s as well, it might be possible for 500s from core or graphite to propagate into ocg [18:24:04] cscott: no, the graphite 500s are the website's [18:24:32] godog: ocg's ganglia is linked from https://wikitech.wikimedia.org/wiki/OCG#Monitoring [18:25:38] cscott: thanks, however as I said ocg100* don't show up in ganglia (check the linked ganglia page) [18:26:28] godog: well *that's* weird. 
[18:27:10] but yeah the sporadic spikes of 500s on wikis might be related to FSS crashes we've been seeing, not sure yet about ocg tho [18:27:53] cscott: yup, that's strange but likely related to ganglia changes, needs to be investigated separately [18:33:26] cscott: mhh indeed searching for *error* on the ocg dashboard in logstash shows some redis-related errors [18:35:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:40:45] !log spike in redis network starting at ~15.00 UTC, correlates with ocg failures [18:40:49] Logged the message, Master [18:42:39] ah yeah redis has been upgraded at around that time cc moritzm [18:44:36] ori, Katie, aude, YuviPanda: can someone attempt https://wikitech.wikimedia.org/wiki/Grrrit-wm#Debugging_stuck_stream please? [18:44:54] ok, on it [18:46:58] right, any redis experts around? rdb1001 / rdb1003 show outbound bandwidth and 1002/1004 inbound, so looks like synchronization [18:47:02] http://ganglia.wikimedia.org/latest/?r=day&cs=6%2F5%2F2015+13%3A15&ce=6%2F5%2F2015+18%3A39&c=Redis+eqiad&h=&tab=m&vn=&hide-hf=false&m=network_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [18:47:15] AaronSchulz: ^ [18:49:19] Krenair: didn't work. I restarted both the stream and the bot. [18:49:36] :/ [18:50:50] (03Abandoned) 10Ori.livneh: VCL: Add 'maybe_use_random_scheduler' subroutine [puppet] - 10https://gerrit.wikimedia.org/r/184547 (owner: 10Ori.livneh) [18:50:52] ok [18:50:54] worked [18:50:58] great [18:51:01] what did you do to it? [18:51:40] i used ./lolrrit-wm/src/kick.bash [18:51:55] then per the instructions [18:51:56] "wait until `qstat` doesn't show lolrrit-wm anymore, then run kick.bash again" [18:51:58] so i did [18:52:05] so it worked the second time? [18:52:10] yes [18:52:44] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 43 hours old. [18:52:49] ugh.. why would it work the second time but not the first? 
[18:54:25] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [18:55:06] (03PS2) 10Dzahn: varnish: add RT on magnesium to misc-web config [puppet] - 10https://gerrit.wikimedia.org/r/215973 (https://phabricator.wikimedia.org/T101432) [18:55:14] TypeError: Cannot read property 'length' of undefined at Promise.resolve.then.then.then.then.then.then.then.then.then.then.then.finally.then.then.then.err.exitCode [18:57:26] (03CR) 10Dzahn: [C: 032] varnish: add RT on magnesium to misc-web config [puppet] - 10https://gerrit.wikimedia.org/r/215973 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [18:59:27] (03PS1) 10Dzahn: Revert "ganglia: set ocg1001 as aggregator for ocg hosts" [puppet] - 10https://gerrit.wikimedia.org/r/216198 [18:59:52] (03PS1) 10Dzahn: Revert "ganglia: switch ocg servers to ganglia_new" [puppet] - 10https://gerrit.wikimedia.org/r/216201 [19:05:57] ori, Krenair: I don't think the bot uses gerrit-to-redis anymore. I'm pretty sure YuviPanda got rid of that dependency [19:05:58] also, please !log in -labs in the future [19:06:12] oh [19:06:18] would be good to update the docs [19:06:45] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0] [19:07:27] !log redis master logs shows periodic 'cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.' 
indicating the slave fails to sync [19:07:31] Logged the message, Master [19:08:35] (03PS1) 10Dzahn: RT: lower TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/216240 (https://phabricator.wikimedia.org/T101432) [19:09:08] Krenair: the docs are updated [19:09:31] (03CR) 10Dzahn: [C: 032] RT: lower TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/216240 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [19:09:33] so it looks like the redis slaves periodically fail to sync from master due to output buffer limits [19:16:10] (03PS3) 10Dzahn: setting up user dpatrick with deploy access [puppet] - 10https://gerrit.wikimedia.org/r/215537 (https://phabricator.wikimedia.org/T101170) (owner: 10RobH) [19:17:31] (03PS4) 10Dzahn: setting up user dpatrick with deploy access [puppet] - 10https://gerrit.wikimedia.org/r/215537 (https://phabricator.wikimedia.org/T101170) (owner: 10RobH) [19:17:41] !log bounce redis on rdb1003 after bumping slave limits [19:17:45] Logged the message, Master [19:18:55] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:19:57] (03CR) 10Dzahn: [C: 032] setting up user dpatrick with deploy access [puppet] - 10https://gerrit.wikimedia.org/r/215537 (https://phabricator.wikimedia.org/T101170) (owner: 10RobH) [19:26:03] (03PS3) 10Andrew Bogott: Rsync glance image files to the controller spare [puppet] - 10https://gerrit.wikimedia.org/r/216154 [19:28:35] (03CR) 10Andrew Bogott: [C: 032] Rsync glance image files to the controller spare [puppet] - 10https://gerrit.wikimedia.org/r/216154 (owner: 10Andrew Bogott) [19:29:11] !log bounce redis again on rdb1003 after increasing the slave limits more [19:29:17] Logged the message, Master [19:34:41] <_joe_> godog: whatsup with redis? 
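The redis master message quoted in the 19:07:27 !log entry can be spotted mechanically. What follows is only a minimal sketch in Python, not anything that was run during this incident: the function name `overrun_commands` and the sample log lines are hypothetical, and only the quoted message text comes from the log above. The limit being bumped is presumably the `client-output-buffer-limit` directive in redis.conf for the slave client class; the values actually deployed are not stated in this log.

```python
import re

# Message text as quoted in the !log entry; the rest of each line's
# layout (timestamps, client details) is an assumption for illustration.
OVERRUN_RE = re.compile(
    r"cmd=(\w+) scheduled to be closed ASAP for overcoming of output buffer limits"
)


def overrun_commands(log_lines):
    """Return the cmd= value from every line that reports an output-buffer overrun."""
    found = []
    for line in log_lines:
        match = OVERRUN_RE.search(line)
        if match:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    # Hypothetical sample lines, made up for this sketch.
    sample = [
        "Client addr=10.0.0.1:1234 cmd=sync scheduled to be closed ASAP "
        "for overcoming of output buffer limits.",
        "Slave asks for synchronization",
    ]
    print(overrun_commands(sample))
```

A repeated `sync` entry in the result would match the pattern described above, where a slave keeps restarting a full sync that the master cuts off each time.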
[19:35:01] <_joe_> AFAIK moritz updated those today [19:38:21] (03PS1) 10Filippo Giunchedi: redis: bump slave buffer output limit [puppet] - 10https://gerrit.wikimedia.org/r/216293 [19:38:25] _joe_: output buffer limits were too low, see ^ [19:39:05] ori: ^ [19:43:35] right I'm going ahead, on rdb1003 that settings achieve a full sync [19:43:48] (03PS2) 10Filippo Giunchedi: redis: bump slave buffer output limit [puppet] - 10https://gerrit.wikimedia.org/r/216293 [19:43:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] redis: bump slave buffer output limit [puppet] - 10https://gerrit.wikimedia.org/r/216293 (owner: 10Filippo Giunchedi) [19:46:50] godog: \o/ thanks [19:46:55] looks sane [19:47:02] curious, la.wiki DB was locked for a few seconds [19:47:11] s3 I guess [19:48:58] latin wikipedia? where you do think it is? :p [19:49:27] ori: cool, afaict redis is also deployed on mc boxes but practically unused [19:50:27] !log bounce redis on rdb1002/rdb1004 to pick up new slave limits [19:50:31] Logged the message, Master [19:51:21] !log chown root:root / on terbium [19:51:25] Logged the message, Master [19:52:41] !log bounce redis on rdb1001/rdb1003 to pick up new slave limits [19:52:45] Logged the message, Master [19:54:17] [77a201e7] 2015-06-05 19:53:30: Fatal exception of type "JobQueueError" [19:54:25] upon saving [19:55:21] heh that's possibly related to restarting redis [19:55:21] edits succeeded though [19:56:16] PROBLEM - puppet last run on virt1000 is CRITICAL Puppet has 2 failures [19:58:10] godog: I'm guessing that also explains why MW is yelling super hard about "Failed connecting to redis server at 10.64.0.201: Connection timed out" [19:58:48] ostriches: yup, in other news we're back [19:59:05] redis is much slower than I expected when starting back [19:59:06] (03PS1) 10Andrew Bogott: Fixup the glance image cron [puppet] - 10https://gerrit.wikimedia.org/r/216296 [19:59:31] Ah yeah, it's going down [19:59:49] (03CR) 10jenkins-bot: [V: 04-1] Fixup the glance 
image cron [puppet] - 10https://gerrit.wikimedia.org/r/216296 (owner: 10Andrew Bogott) [20:01:04] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 2 failures [20:01:38] (03PS2) 10Andrew Bogott: Fixup the glance image cron [puppet] - 10https://gerrit.wikimedia.org/r/216296 [20:02:21] (03CR) 10jenkins-bot: [V: 04-1] Fixup the glance image cron [puppet] - 10https://gerrit.wikimedia.org/r/216296 (owner: 10Andrew Bogott) [20:02:30] ostriches: yep jobq is recovering too [20:02:49] Maybe we'll see the jobqueue destructor error go away soon [20:02:54] (03PS3) 10Andrew Bogott: Fixup the glance image cron [puppet] - 10https://gerrit.wikimedia.org/r/216296 [20:03:04] not sure if there's anything else to do when redis is bounced? [20:03:53] No this isn't that [20:04:24] nah I don't think so, this has been going on for ~6h [20:10:29] !log apt-get upgrade on terbium [20:10:33] Logged the message, Master [20:10:51] cscott: ocg should be happier now [20:11:19] the stupid samba packages.. i want to nuke them all :) [20:11:30] oh joy: [20:11:35] Errors were encountered while processing: php-luasandbox [20:13:53] (03CR) 10coren: [C: 031] "Fundamentally sound, but given the possible security implication it's probably wise to get Moritz's input." [puppet] - 10https://gerrit.wikimedia.org/r/216296 (owner: 10Andrew Bogott) [20:14:14] PROBLEM - DPKG on terbium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:14:37] !log demon Synchronized php-1.26wmf8: live hack (duration: 02m 32s) [20:14:41] Logged the message, Master [20:15:10] moritzm: If you’re still working, your thoughts on https://gerrit.wikimedia.org/r/#/c/216296/? [20:15:28] csteipp: ^ there he is [20:15:44] (There’s an existing non-working version of that already applied… I can roll back if you prefer.) 
[20:18:15] (03PS1) 10Andrew Bogott: Revert "Rsync glance image files to the controller spare" [puppet] - 10https://gerrit.wikimedia.org/r/216299 [20:18:22] /var/lib/dpkg/info/php-luasandbox.postinst: 6: /var/lib/dpkg/info/php-luasandbox.postinst: php5enmod: not found [20:18:45] twentyafterfour: I cherry-picked in the change that fixes the JobQueueGroup spam on fatalmonitor. [20:18:56] It should start going away now [20:19:35] (03PS2) 10Andrew Bogott: Revert "Rsync glance image files to the controller spare" [puppet] - 10https://gerrit.wikimedia.org/r/216299 [20:20:15] PROBLEM - puppet last run on terbium is CRITICAL Puppet has 1 failures [20:20:32] (03CR) 10Andrew Bogott: [C: 032] Revert "Rsync glance image files to the controller spare" [puppet] - 10https://gerrit.wikimedia.org/r/216299 (owner: 10Andrew Bogott) [20:21:16] PROBLEM - puppetmaster https on virt1000 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [20:21:29] I’m working on ^, will be fixed shortly [20:23:13] i upgraded terbium, php-luasandbox fails because it uses php5enmod and php5dismod in pre/post install scripts [20:23:16] "php5enmod was introduced in Debian in version 5.4.0~rc6-2 of the package php5" [20:23:31] would have to upgrade php5 to a 5.4 version .. [20:23:42] Oh wait, no I’m not, I’m working on a different thing :( [20:23:42] but don't have it here [20:23:55] RECOVERY - puppet last run on virt1000 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:26:02] (03PS4) 10Andrew Bogott: Rsync glance image files to the controller spare [puppet] - 10https://gerrit.wikimedia.org/r/216296 [20:26:06] andrewbogott: can you add me to reviewers, I'll look into on Monday [20:26:14] moritzm: yep! [20:28:56] mutante, how about doing the hhvm migration for terbium instead? 
:p [20:30:17] (03PS1) 10BBlack: Add authdns::testns to cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/216303 [20:30:42] (I have no idea how much work that actually is, so..) [20:31:13] (03CR) 10BBlack: [C: 032] Add authdns::testns to cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/216303 (owner: 10BBlack) [20:31:42] * godog lunch, bbl [20:37:04] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:38:40] (03PS1) 10BBlack: make ipresolve failures more informative [puppet] - 10https://gerrit.wikimedia.org/r/216304 [20:39:20] (03CR) 10Gage: [C: 031] make ipresolve failures more informative [puppet] - 10https://gerrit.wikimedia.org/r/216304 (owner: 10BBlack) [20:40:18] (03CR) 10BBlack: [C: 032] make ipresolve failures more informative [puppet] - 10https://gerrit.wikimedia.org/r/216304 (owner: 10BBlack) [20:47:59] (03PS1) 10Hashar: contint: PIL 1.1.7 expects libs in /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/216307 (https://phabricator.wikimedia.org/T101550) [20:49:12] !log Upgrading hhvm-fss on application servers to 1.1.7; expect brief 5xx spike. [20:49:17] Logged the message, Master [20:50:54] PROBLEM - puppet last run on mw1114 is CRITICAL Puppet has 1 failures [20:51:13] godog: cluster upgraded, things look fine [20:57:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [20:59:35] (03PS2) 10Hashar: contint: PIL 1.1.7 expects libs in /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/216307 (https://phabricator.wikimedia.org/T101550) [21:01:08] <_joe_> ori: found one more occurrence? 
[21:01:14] yep [21:01:31] <_joe_> thanks a lot [21:05:23] (03PS3) 10Hashar: contint: PIL 1.1.7 expects libs in /usr/lib [puppet] - 10https://gerrit.wikimedia.org/r/216307 (https://phabricator.wikimedia.org/T101550) [21:06:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [21:06:45] (03CR) 10Hashar: [C: 031 V: 032] "That got the job done and managed to fix T101550 :-}" [puppet] - 10https://gerrit.wikimedia.org/r/216307 (https://phabricator.wikimedia.org/T101550) (owner: 10Hashar) [21:08:15] RECOVERY - puppet last run on mw1114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:10:15] (03PS1) 10Odder: Add another source to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216316 (https://phabricator.wikimedia.org/T101513) [21:11:14] PROBLEM - Router interfaces on cr2-knams is CRITICAL host 91.198.174.246, interfaces up: 65, down: 1, dormant: 0, excluded: 1, unused: 0BRge-1/2/0: down - Transit: ! Init7 {#14009} [1Gbps DF]BR [21:11:15] (03PS1) 10BBlack: remove ipsec role from cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/216317 [21:11:28] (03CR) 10BBlack: [C: 032 V: 032] remove ipsec role from cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/216317 (owner: 10BBlack) [21:18:47] (03CR) 10John F. Lewis: [C: 031] RT: adjust Apache config to be behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/215972 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [21:21:27] (03CR) 10John F. Lewis: [C: 031] "Maybe worth adding the IPv6 address as well to enable IPv6 on RT?" [dns] - 10https://gerrit.wikimedia.org/r/215974 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [21:35:40] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1342460 (10mbinder) Hello! I would like to join Project-Creators. 
I'm a new Scrum Master supporting the Mobile Apps team. :) Thanks! Max [21:53:32] 6operations, 10RESTBase-Cassandra: configure less aggressive cassandra log rotation / send cassandra logs to logstash - https://phabricator.wikimedia.org/T100970#1342513 (10Eevans) {F175304} {F175305} To test, I looped indefinitely, logging messages ranging in size from 1 to 1000 words. The code I used is a... [21:58:34] RECOVERY - Router interfaces on cr2-knams is OK host 91.198.174.246, interfaces up: 67, down: 0, dormant: 0, excluded: 1, unused: 0 [22:02:55] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [22:04:55] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.054 second response time [22:06:10] !log restarted apache on virt1000 [22:06:15] Logged the message, Master [22:13:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:15:25] Hi ops! Anyone have any guesses about this apparent massive CentralNotice outage that lasted exactly 1 day? https://phabricator.wikimedia.org/T101265 [22:15:54] My first guess has been a GeoIP outage, but for now I don't have a way of checking that [22:17:42] chasemp: ^^ [22:19:32] AndyRussG: I think there's a lot of debugging steps missing in digging down the stack. Where do the low impressions numbers source from. Can we correlate that with analytics logs and show that the impression really wasn't hit/served? etc... [22:20:46] bblack: thanks! They come from udp2log which fills a special database (sampled at 1/100). They correlate exactly with a dropoff in donations in the same window [22:21:05] So it's reasonable to think it was a real dropoff in banners shown [22:21:16] so, what decides to show them? [22:21:25] CentralNotice [22:21:40] the extension [22:21:43] yeah [22:21:56] no changes in the extension code I guess? [22:22:00] or config? 
[22:22:01] (which is what I mainly work on these days ;) ) [22:22:02] correct [22:22:12] or at least none that I was told about! [22:22:16] the GeoIP comes from the cookies, right? [22:22:57] (I don't know - we have several different geoip lookup mechanisms in places in different parts of our stack) [22:22:57] bblack: it's put in cookies I think at the varnish level? and then the cookie is read in the client via JS, and that's how we decide if a campaign that's geotargeted is right for a user [22:23:11] ok so it's the cookies [22:23:24] that cookie code hasn't changed in a long time [22:23:45] we do sync the underlying binary databases every so often. it's possible there could've been bad data for a day. [22:23:46] Though maybe a burp in the service (?) we rely on to set the cookies? [22:24:00] The cookies only last a session [22:24:00] the service is C code directly in varnish, and varnish was still working for sure in general [22:24:19] there's no real "service" here to go down, just varnish referencing a database loaded off of disk, which does see updates [22:24:36] bblack: ah hmmm [22:25:09] awight was just suggesting a dive into the hive logs of our impression-recording URL [22:25:29] the database I'm getting the data out doesn't record everything available in Hive [22:25:49] Sooo expensive, unfortunately. [22:26:23] it looks like the commercial GeoIP data files varnish sources from were last updated May 31, somewhere in the ballpark of 03:00 UTC [22:26:53] so that kinda correlates, vaguely, with the start of this, if we allow some lag time for them to really take effect from some kind of thread restarts or whatever [22:27:00] awight: bblack: I'm just waiting on my Hive permissions, so if it seems appropriate and someone has a few minutes [22:27:04] bblack: GeoIP has done some creepy things, we don't understand why but there have been some glitches where everyone is being geolocated to the center of the EU, to RU, etc. 
It's very repeatable but only seems to happen the day before we launch a big campaign! [22:27:07] but it wouldn't explain the outage later ending, as there's been no change since [22:27:47] * awight scrambles for a Phab task so as to not look completely crazy [22:28:19] https://phabricator.wikimedia.org/T87677 [22:28:44] Well, now we all look crazy. [22:30:27] (03PS1) 10BryanDavis: elasticsearch: allow control of dynamic scripting [puppet] - 10https://gerrit.wikimedia.org/r/216325 [22:31:03] heh [22:31:32] awight: it's entirely possible there are persistent bugs in the varnish->GeoIP C code and such. But there's no good reproducible reports in those. [22:31:55] it would be nice to have a consistent testcase that can show that a given IP is really in the database matching a country, and then later fails to match a country when it should. [22:31:56] ori: nice! [22:32:57] as it is, it sounds like the "sporadic" part could be users deciding to complain they weren't identified. It could be that the IP they were on just wasn't in the database and the misidentification was legitimate from a data/code perspective. [22:33:04] (GeoIP is not perfect, after all) [22:33:11] bblack: awight: the suspicious thing is that it starts and ends exactly at 00 UTC, or so it seems [22:33:15] PROBLEM - configured eth on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:16] PROBLEM - dhclient process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:24] PROBLEM - DPKG on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:26] well I'm talking about past GeoIP issues in general now [22:33:35] PROBLEM - Disk space on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:43] the recent June 1 thing almost ends on a day barrier, it's very close in any case. [22:33:44] PROBLEM - puppet last run on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:33:45] (I'm gonna dig into the data I have from udp2log to see how exactly...) [22:33:55] PROBLEM - salt-minion processes on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:56] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:05] PROBLEM - statsdlb process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:06] PROBLEM - High load average on labstore1001 is CRITICAL 71.43% of data above the critical threshold [24.0] [22:34:15] PROBLEM - statsite backend instances on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:25] PROBLEM - RAID on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:26] mhh, I'll take a look [22:34:30] bblack: awight: worst case is it's a CentralNotice bug [22:34:31] (03PS2) 10Dzahn: RT: adjust Apache config to be behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/215972 (https://phabricator.wikimedia.org/T101432) [22:34:34] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:35] PROBLEM - SSH on graphite2001 is CRITICAL - Socket timeout after 10 seconds [22:34:48] well worst case it could really be anything, our stack is rather complicated :) [22:34:49] well no, worst case is we don't find out what it is... and it recurs... 
[22:35:13] you kinda have to break down cause and effect very carefully to find anything, usually [22:35:17] layer by layer [22:35:48] (03PS3) 10Dzahn: RT: move behind misc-web [dns] - 10https://gerrit.wikimedia.org/r/215974 (https://phabricator.wikimedia.org/T101432) [22:35:53] We should re-write the whole site in vcl [22:36:36] that's exactly what we're trying to do the opposite of lol :) [22:36:51] the less VCL the better [22:37:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [22:39:58] bblack: so in any case I guess you can confirm there were no known outages for that exact time window... [22:41:15] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [22:42:05] not really, no. it depends on your definition of "outage" :) [22:42:07] !log powercycle graphite2001, no console no ssh [22:42:12] Logged the message, Master [22:42:33] I don't think we have any strong correlation to an obvious known issue in that window, yet, or other massive site-affecting thing. [22:43:08] that doesn't mean there wasn't, for example, a failure of setting GeoIP cookies properly for that entire day. CentralNotice might've been the only way we'd know that was happening reliably. [22:43:22] (03CR) 10Dzahn: [C: 032] RT: adjust Apache config to be behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/215972 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [22:43:58] there could be other complex subtle things in play that weren't noticed in other ways, too [22:44:09] bblack: Ah OK that's good to know. Can you think of _any_ other way of verifying whether that happened by chance? [22:44:15] RECOVERY - salt-minion processes on graphite2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:44:15] RECOVERY - Graphite Carbon on graphite2001 is OK All defined Carbon jobs are runnning. 
[22:44:26] RECOVERY - statsdlb process on graphite2001 is OK: PROCS OK: 1 process with command name statsdlb [22:44:34] well I don't think we know that GeoIP failure was the cause, yet [22:44:34] (03CR) 10Dzahn: [C: 032] "@John maybe, but one change at a time" [dns] - 10https://gerrit.wikimedia.org/r/215974 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [22:44:36] RECOVERY - statsite backend instances on graphite2001 is OK All defined statsite jobs are runnning. [22:44:45] RECOVERY - RAID on graphite2001 is OK Active: 8, Working: 8, Failed: 0, Spare: 0 [22:44:54] it's certainly a candidate, though. [22:44:54] RECOVERY - uWSGI web apps on graphite2001 is OK All defined uWSGI apps are runnning. [22:44:55] RECOVERY - SSH on graphite2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [22:44:58] True! But if we found that there was a GeoIP failure, it's smoking gun-ish [22:45:17] bblack: All the CentralNotice campaigns that were up during that time were geo-targeted, so I don't have any non-geo ones (that wouldn't rely on the geo cookie) to compare the numbers for [22:45:24] RECOVERY - configured eth on graphite2001 is OK - interfaces up [22:45:25] RECOVERY - dhclient process on graphite2001 is OK: PROCS OK: 0 processes with command name dhclient [22:45:26] RECOVERY - DPKG on graphite2001 is OK: All packages OK [22:45:35] RECOVERY - Disk space on graphite2001 is OK: DISK OK [22:45:40] analytics might know, if they're logging cookies for all requests, whether the cookies looked dysfunctional for all other requests that day [22:45:45] the geoip cookies, I mean [22:45:45] RECOVERY - puppet last run on graphite2001 is OK Puppet is currently enabled, last run 26 minutes ago with 0 failures [22:46:10] bblack: good point! hmmm [22:46:30] K I'm gonna pester them before they head out ;) [22:46:37] ok [22:46:55] bblack: thanks much!!! 
pls LMK if you think of anything else 8p [22:47:04] np [22:48:11] 6operations, 5Patch-For-Review: move RT behind misc-web - https://phabricator.wikimedia.org/T101432#1342636 (10Dzahn) done and switched rt.wikimedia.org has address 208.80.154.241 rt.wikimedia.org mail is handled by 10 polonium.wikimedia.org. rt.wikimedia.org mail is handled by 50 lead.wikimedia.org. 241.154... [22:48:41] 6operations: move RT behind misc-web - https://phabricator.wikimedia.org/T101432#1342637 (10Dzahn) [22:49:15] PROBLEM - RT-HTTPS on magnesium is CRITICAL - Cannot make SSL connection [22:49:55] ah, icinga-wm is right now of course, will fix [22:51:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [22:51:34] ACKNOWLEDGEMENT - RT-HTTPS on magnesium is CRITICAL - Cannot make SSL connection daniel_zahn moved it behind misc-web [22:52:19] (03PS1) 10Ottomata: Disable eventlogging kafka client-side event processor [puppet] - 10https://gerrit.wikimedia.org/r/216328 [22:54:01] 6operations: delete and revoke SSL certificate rt.wikimedia.org - https://phabricator.wikimedia.org/T101571#1342677 (10Dzahn) 3NEW [22:54:27] 6operations: move RT behind misc-web - https://phabricator.wikimedia.org/T101432#1342687 (10Dzahn) [22:54:29] 6operations: delete and revoke SSL certificate rt.wikimedia.org - https://phabricator.wikimedia.org/T101571#1342686 (10Dzahn) [22:55:07] AndyRussG: another slight hint/hit, but I think it's pretty weak. Maybe someone else would see the link though. https://gerrit.wikimedia.org/r/#/c/214741/ <- this was some kind of bump of extensions in general on June 1, + changes to the thing that loads extensions. [22:55:19] AndyRussG: that didn't happen until halfway through the UTC day or more, though [22:55:31] (03PS1) 10Dzahn: delete RT's SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/216332 (https://bugzilla.wikimedia.org/101571) [22:57:18] AndyRussG: but perhaps it points at something else related. 
maybe there was some out-of-syncness there on something that happened before/after with CN? [22:58:07] (03PS1) 10Dzahn: RT: remove HTTPS monitoring [puppet] - 10https://gerrit.wikimedia.org/r/216334 (https://phabricator.wikimedia.org/T101432) [22:58:29] AndyRussG: also, just looking at https://github.com/wikimedia/mediawiki-extensions-CentralNotice/commits/master , there were some real code changes 3-4 days before ( https://github.com/wikimedia/mediawiki-extensions-CentralNotice/commit/bbad430c1168663281eea2353a5c19a28e91f3e0 ) [22:58:43] maybe see when those actually got deployed (as opposed to committed to master)? [22:58:47] (03PS2) 10Dzahn: RT: remove HTTPS monitoring [puppet] - 10https://gerrit.wikimedia.org/r/216334 (https://phabricator.wikimedia.org/T101432) [22:59:32] (03CR) 10Dzahn: [C: 032] RT: remove HTTPS monitoring [puppet] - 10https://gerrit.wikimedia.org/r/216334 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [23:00:33] bblack: not as far as I know. We have a separate deploy branch... [23:00:58] ottomata: i'm just meeting with toby now, gimme another 15 [23:01:16] 6operations: delete and revoke SSL certificate rt.wikimedia.org - https://phabricator.wikimedia.org/T101571#1342706 (10Dzahn) [23:01:18] 6operations, 5Patch-For-Review: move RT behind misc-web - https://phabricator.wikimedia.org/T101432#1342705 (10Dzahn) 5Open>3Resolved [23:01:32] 6operations, 7Graphite: graphite2001 OOM and unresponsive - https://phabricator.wikimedia.org/T101572#1342707 (10fgiunchedi) 3NEW a:3fgiunchedi [23:02:11] 6operations, 7HTTPS: delete and revoke SSL certificate rt.wikimedia.org - https://phabricator.wikimedia.org/T101571#1342677 (10Dzahn) [23:02:22] np [23:02:32] AndyRussG: how do I correlate the CN repo changes with what actually deployed to live wikis? 
[23:02:47] (I mean as far as a real effective deploy date for a given commit) [23:03:27] bblack: whenever we deploy in CN we check out a slot on the deployments page on wikitech [23:03:57] it doesn't even roll into some broader change, as in 1.NNwmfX of core includes ABC release of CN? [23:04:00] Other than that, check the CN submodule sha for versions of MW that go out with the train [23:04:21] 6operations: switch magnesium to private IP - https://phabricator.wikimedia.org/T101574#1342728 (10Dzahn) 3NEW [23:05:03] there were some further back changes that at least moved the parameters for things like country to Elsewhere. They're pretty old relative to the June 1 issue, though, but I don't know when it would've gone live + other correlated changes to mediawiki-config or whatever [23:05:09] e.g. https://github.com/wikimedia/mediawiki-extensions-CentralNotice/commit/42a9f5c1846129518942e411c4a402b110656928 [23:05:09] bblack: for CentralNotice, master isn't automatically rolled into the train. 
We'd have to merge master to wmf_deploy and then update the submodule in core [23:05:24] (03PS2) 10BryanDavis: elasticsearch: allow control of dynamic scripting [puppet] - 10https://gerrit.wikimedia.org/r/216325 [23:05:26] (03PS1) 10BryanDavis: [WIP] logstash: jessie support and beta cluster cluster [puppet] - 10https://gerrit.wikimedia.org/r/216337 (https://phabricator.wikimedia.org/T101541) [23:05:30] but that link is a May 1 commit to wmf_deploy branch [23:05:46] and I don't know where the correlated commits are for where the country parameter moved to elsewhere, etc [23:06:49] bblack: https://git.wikimedia.org/log/mediawiki%2Fextensions%2FCentralNotice/refs%2Fheads%2Fwmf_deploy [23:06:59] 6operations: switch magnesium to private IP - https://phabricator.wikimedia.org/T101574#1342738 (10Dzahn) This needs changes in: - DNS - DHCP - modules/role/manifests/cache/misc.pp (because the backend definiton includes .wikimedia.org) [23:07:09] I'm pretty sure that merge to master was just before our last deploy... Hmm checking the date... [23:07:15] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [23:07:24] 6operations: switch magnesium to private IP - https://phabricator.wikimedia.org/T101574#1342739 (10Dzahn) [23:07:26] 6operations, 5Patch-For-Review: move RT behind misc-web - https://phabricator.wikimedia.org/T101432#1342740 (10Dzahn) [23:08:03] I'm just wondering if, even if that wasn't deployed at the right time for this outage, there was a correlating change in resourceloader javascript or mediawiki-config, that ended up out of sync on two bits of code or code+data for day [23:09:08] bblack: the last CN deploy was may 14 at 14:00 UTC. CN code running on production shouldn't have changed since then [23:09:27] hmmm [23:09:31] bblack: definitely wasn't deployed at the time of the outage. 
I did the deploy myself [23:10:14] 6operations: irc bots should send NOTICE not PRIVMSG - https://phabricator.wikimedia.org/T101575#1342741 (10fgiunchedi) 3NEW [23:11:45] (03PS1) 10Dzahn: RT: stop installing SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/216339 (https://phabricator.wikimedia.org/T101571) [23:12:22] Maybe more details about the exact time and way the numbers dropped off and then came back will tell us more [23:12:27] (03PS2) 10Dzahn: RT: stop installing SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/216339 (https://phabricator.wikimedia.org/T101571) [23:13:24] (03CR) 10Dzahn: [C: 032] RT: stop installing SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/216339 (https://phabricator.wikimedia.org/T101571) (owner: 10Dzahn) [23:16:23] 6operations, 7HTTPS, 5Patch-For-Review: delete and revoke SSL certificate rt.wikimedia.org - https://phabricator.wikimedia.org/T101571#1342774 (10Dzahn) - deleted on server magnesium: shred'ed key in /etc/ssl/private/ and rm'ed cert and chained file in /etc/ssl/localcerts/ - removed install_certificate from... [23:16:48] 6operations, 7HTTPS, 5Patch-For-Review: delete and revoke SSL certificate rt.wikimedia.org - https://phabricator.wikimedia.org/T101571#1342775 (10Dzahn) p:5Triage>3Normal [23:17:06] 6operations, 7HTTPS, 5Patch-For-Review: delete and revoke SSL certificate rt.wikimedia.org - https://phabricator.wikimedia.org/T101571#1342777 (10Dzahn) a:3RobH [23:17:24] 6operations: switch magnesium to private IP - https://phabricator.wikimedia.org/T101574#1342778 (10Dzahn) p:5Triage>3Normal [23:18:03] (bblack: sorry, I meant, the deploy didn't take place at the time of the outage. But it was indeed on production at that time, having been deployed May 14) [23:20:54] What outage is this? 
[23:21:41] operations, Datasets-General-or-Unknown: dataset1001 - dpkg reports broken packages - https://phabricator.wikimedia.org/T101579#1342810 (Dzahn) [23:23:21] operations, Analytics-Cluster: stat1002 - dpkg reports broken packages - https://phabricator.wikimedia.org/T101582#1342821 (Dzahn) NEW [23:23:51] Krinkle: https://phabricator.wikimedia.org/T101265 [23:24:08] basically, loss of donate banner impressions on June 1, no identified cause yet [23:24:21] just for 1 day then fixed itself [23:25:13] (well, I should say severe dropoff. there were still impressions, just far fewer than the day before or the day after) [23:27:29] AndyRussG: I suspect RL indeed. [23:27:54] I'm busy at the moment, but I can investigate in 30 minutes. [23:28:06] operations: terbium - dpkg reports broken packages - https://phabricator.wikimedia.org/T101583#1342828 (Dzahn) NEW [23:28:23] I need to know which parts of this run on main 1.26wmfX branches and what has custom branches [23:28:49] and when/how those custom branches are updated from master [23:29:55] Krinkle: thanks!
[23:30:32] operations, Wikimedia-Logstash, Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie - https://phabricator.wikimedia.org/T98042#1342835 (bd808) [23:35:00] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [23:39:22] operations: improve redis master/slave monitoring - https://phabricator.wikimedia.org/T101584#1342840 (fgiunchedi) NEW [23:39:24] (CR) Dan-nl: [C: 1] Add another source to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/216316 (https://phabricator.wikimedia.org/T101513) (owner: Odder) [23:43:00] operations: document redis upgrade/restart procedures - https://phabricator.wikimedia.org/T101585#1342848 (fgiunchedi) NEW [23:43:07] operations: improve redis master/slave monitoring - https://phabricator.wikimedia.org/T101584#1342854 (fgiunchedi) [23:46:35] Ops-Access-Requests, operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1342861 (Dzahn) Did it work out? [23:47:19] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]