[00:03:12] PROBLEM - jmxtrans on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [00:03:44] PROBLEM - jmxtrans on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [00:04:13] PROBLEM - jmxtrans on analytics1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [00:04:14] PROBLEM - jmxtrans on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [00:05:41] !log krinkle Synchronized php-1.26wmf11/extensions/SyntaxHighlight_GeSHi/modules/pygments.wrapper.css: I5d1510dc80d6d4712ca8411 (duration: 00m 12s) [00:05:47] Logged the message, Master [00:07:03] PROBLEM - puppet last run on planet1001 is CRITICAL Puppet has 1 failures [00:07:52] (03CR) 10Jdouglas: [C: 031] access: grant Jdouglas access toanalytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/220990 (owner: 10Matanya) [00:09:05] RECOVERY - jmxtrans on analytics1022 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [00:10:46] (03PS1) 10Dzahn: planet: remove feedparser.py [puppet] - 10https://gerrit.wikimedia.org/r/221013 (https://phabricator.wikimedia.org/T101730) [00:12:21] (03PS2) 10Krinkle: Add high-resolution logos for the Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219012 (https://phabricator.wikimedia.org/T102852) (owner: 10Odder) [00:12:59] (03CR) 10Krinkle: [C: 032] Add high-resolution logos for the Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219012 (https://phabricator.wikimedia.org/T102852) (owner: 10Odder) [00:13:05] (03Merged) 10jenkins-bot: Add high-resolution logos for the Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219012 (https://phabricator.wikimedia.org/T102852) (owner: 10Odder) [00:14:19] !log krinkle Synchronized 
w/static/images/project-logos/zhwiki-1.5x.png: T102852 (duration: 00m 12s) [00:14:25] Logged the message, Master [00:14:53] RECOVERY - jmxtrans on analytics1018 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [00:15:08] !log krinkle Synchronized w/static/images/project-logos/zhwiki-2x.png: T102852 (duration: 00m 13s) [00:15:13] Logged the message, Master [00:15:41] 6operations, 10Wikimedia-Git-or-Gerrit: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1403249 (10Dzahn) a:3Dzahn [00:16:15] problems, Krinkle? [00:16:20] Krenair: no? [00:16:22] !log krinkle Synchronized wmf-config/InitialiseSettings.php: T102852 (duration: 00m 12s) [00:16:27] Logged the message, Master [00:16:32] I noticed you re-synced one of those syntaxhighlight files [00:16:59] Krenair: Same file, different bug. [00:17:08] ah, heh. [00:17:10] 6operations, 10Wikimedia-DNS, 7Mail: DNS Change for GreenHouse - https://phabricator.wikimedia.org/T103893#1403252 (10faidon) These settings would effectively allow Greenhouse (and the service they use, mailgun, plus partially any other users of that service), effectively be trusted to send emails from the w... [00:17:23] RECOVERY - jmxtrans on analytics1021 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [00:17:35] Krenair: fixed font-size issue (font-family: monospace;) and nested bug [00:18:33] RECOVERY - jmxtrans on analytics1012 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [00:20:09] Krinkle: https://gerrit.wikimedia.org/r/221015 [00:20:32] ori: https://zh.wikipedia.org/wiki/MediaWiki:Common.css [00:20:34] same there [00:20:55] could be an encoding issue [00:21:22] ori: do we need the span qualifier? [00:21:32] yes, to trump the generated stylesheet [00:21:38] it comes after, no? 
[00:21:44] it cascades naturally [00:21:55] if not, let's swap the order [00:22:04] in the case of identical selectors, do rules at the top take precedence? [00:22:09] (yes, I really didn't know that) [00:22:11] last one wins [00:22:19] ah [00:22:25] fundamental in CSS, so something we should make use of. [00:22:26] ok i'll amend [00:22:30] kk [00:24:17] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/221013 (https://phabricator.wikimedia.org/T101730) (owner: 10Dzahn) [00:24:40] ori: btw, zhwiki is an interesting case [00:24:49] they have two logos depending on language variant [00:24:56] :/ [00:25:25] so they need the override in common.css [00:25:28] though we can still host it [00:25:49] (03PS1) 10Dzahn: planet: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/221016 (https://phabricator.wikimedia.org/T101730) [00:25:55] Krinkle: amended: https://gerrit.wikimedia.org/r/#/c/221015/ [00:26:09] (03CR) 10Dzahn: [C: 032] planet: remove feedparser.py [puppet] - 10https://gerrit.wikimedia.org/r/221013 (https://phabricator.wikimedia.org/T101730) (owner: 10Dzahn) [00:27:01] Krinkle: I went on a bit of a spree :) https://en.wikipedia.org/wiki/Special:Contributions/Ori_Livneh [00:28:43] ori: Ah, nice. 
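Note on the cascade exchange above: the rule being discussed is that a more specific selector (such as one with an extra `span` qualifier) outranks a less specific one, and that between identical selectors the later rule wins. A minimal sketch of that tie-break follows, assuming a simplified model with no `!important` and no origin layers. The `.hl` selectors and the `Rule` type are hypothetical illustrations, not taken from the patch under review.

```python
from typing import NamedTuple, Sequence, Tuple


class Rule(NamedTuple):
    selector: str
    specificity: Tuple[int, int, int]  # (ids, classes, type selectors)
    value: str


def winning_value(rules: Sequence[Rule]) -> str:
    """Return the value that wins among rules listed in source order.

    Higher specificity wins; on a specificity tie the later rule wins.
    """
    best = max(range(len(rules)), key=lambda i: (rules[i].specificity, i))
    return rules[best].value


# Identical selectors: the one that comes last wins.
identical = [
    Rule(".hl", (0, 1, 0), "serif"),
    Rule(".hl", (0, 1, 0), "monospace"),
]

# A "span" qualifier raises specificity, so it wins even when it comes first.
qualified = [
    Rule("span.hl", (0, 1, 1), "monospace"),
    Rule(".hl", (0, 1, 0), "serif"),
]

print(winning_value(identical))
print(winning_value(qualified))
```

This is why swapping stylesheet order is enough when the selectors are identical, and why the qualifier is only needed when the override cannot be loaded last.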
[00:28:44] RECOVERY - puppet last run on planet1001 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [00:29:25] ori: is https://en.wikipedia.org/wiki/Brainfuck#Hello_World.21 good already ?:) [00:29:50] lang="bf" [00:29:51] supported :) [00:29:54] haha [00:30:00] :) [00:33:36] (03CR) 10Dzahn: [C: 032] planet: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/221016 (https://phabricator.wikimedia.org/T101730) (owner: 10Dzahn) [00:36:44] !log ori Synchronized php-1.26wmf11/extensions/SyntaxHighlight_GeSHi: I0e5f2d3b2: Updated mediawiki/core Project: mediawiki/extensions/SyntaxHighlight_GeSHi (duration: 00m 11s) [00:36:50] Logged the message, Master [00:37:23] (03PS1) 10Dzahn: cache/misc: add planet1001 as a backend [puppet] - 10https://gerrit.wikimedia.org/r/221027 (https://phabricator.wikimedia.org/T101730) [00:39:14] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [00:40:03] (03CR) 10Dzahn: [C: 032] cache/misc: add planet1001 as a backend [puppet] - 10https://gerrit.wikimedia.org/r/221027 (https://phabricator.wikimedia.org/T101730) (owner: 10Dzahn) [00:43:14] (03PS1) 10Dzahn: cache/misc: switch planet over to planet1001 [puppet] - 10https://gerrit.wikimedia.org/r/221032 (https://phabricator.wikimedia.org/T101730) [00:51:12] !log reverted restbase1001 canary to 90817c2a [00:51:18] Logged the message, Master [00:53:57] (03CR) 10Dzahn: [C: 032] cache/misc: switch planet over to planet1001 [puppet] - 10https://gerrit.wikimedia.org/r/221032 (https://phabricator.wikimedia.org/T101730) (owner: 10Dzahn) [01:03:04] (03PS1) 10Dzahn: planet: install xslt-proc [puppet] - 10https://gerrit.wikimedia.org/r/221033 (https://phabricator.wikimedia.org/T101730) [01:03:47] (03CR) 10Dzahn: [C: 032] planet: install xslt-proc [puppet] - 10https://gerrit.wikimedia.org/r/221033 (https://phabricator.wikimedia.org/T101730) (owner: 10Dzahn) [01:14:25] bah, switching planet to a VM and 
jessie - all the puppet parts fine now.. but "error 500" for each and every feed it is supposed to read [01:14:33] same config, same feeds as before.. grmbl [01:14:47] -D debug also doesn't say more [01:18:07] it can't talk to external http servers, even though: [01:18:13] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [01:18:13] ACCEPT tcp -- anywhere anywhere tcp dpt:http [01:19:33] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 45.00 ms [01:19:51] mutante: I think you'll need to proxy through url-downloader [01:22:42] PROBLEM - puppet last run on cp3032 is CRITICAL puppet fail [01:22:46] godog: thanks! that works [01:23:02] export http_proxy="http://url-downloader.wikimedia.org:8080" [01:23:34] cool! [01:23:41] ehmm.. or not :p [01:23:45] it makes manual curl work [01:23:50] but not the planet update.. bah [01:30:41] 6operations, 5Patch-For-Review: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1403352 (10PleaseStand) The recent changes seem to have broken even Phabricator's Subversion viewer: > Unhandled Exception ("CommandException") > Command failed with e... 
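Note on the proxy exchange above: by convention, proxy environment variables are looked up per URL scheme, so exporting only `http_proxy` leaves `https://` fetches without a proxy. That may be one reason a manual curl of an http URL works while other fetches still fail, and a later patch in this log adds `https_proxy` as well. A minimal sketch of the lookup, assuming the conventional lowercase variable names; `proxy_for` and the example.org feed URLs are hypothetical.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit


def proxy_for(url: str, env: Mapping[str, str]) -> Optional[str]:
    """Pick the proxy for `url` from an environment-style mapping.

    The variable consulted is "<scheme>_proxy", e.g. http_proxy or https_proxy.
    """
    scheme = urlsplit(url).scheme.lower()
    return env.get(f"{scheme}_proxy")


# Only http_proxy is exported, as in the shell command quoted in the log.
env = {"http_proxy": "http://url-downloader.wikimedia.org:8080"}

print(proxy_for("http://example.org/feed.xml", env))   # proxied
print(proxy_for("https://example.org/feed.xml", env))  # no proxy configured
```

Passing the environment in as a mapping keeps the sketch deterministic; a real client would read `os.environ` and may also honour uppercase variants and `no_proxy`.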
[01:33:50] (03PS1) 10Gergő Tisza: Set CORS headers on etherpad static files [puppet] - 10https://gerrit.wikimedia.org/r/221035 (https://phabricator.wikimedia.org/T103940) [01:40:42] RECOVERY - puppet last run on cp3032 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [01:40:43] RECOVERY - Incoming network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [01:46:44] (03PS1) 10Dzahn: planet: proxy through url-downloader to fetch URLs [puppet] - 10https://gerrit.wikimedia.org/r/221036 (https://phabricator.wikimedia.org/T101730) [01:48:13] (03CR) 10Dzahn: [C: 032] planet: proxy through url-downloader to fetch URLs [puppet] - 10https://gerrit.wikimedia.org/r/221036 (https://phabricator.wikimedia.org/T101730) (owner: 10Dzahn) [01:55:11] (03PS1) 10Dzahn: planet: fix to variable name for proxy [puppet] - 10https://gerrit.wikimedia.org/r/221037 [01:56:01] (03CR) 10Dzahn: [C: 032] planet: fix to variable name for proxy [puppet] - 10https://gerrit.wikimedia.org/r/221037 (owner: 10Dzahn) [02:03:53] (03PS1) 10Dzahn: planet: fix feed URLs (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/221039 [02:12:09] (03PS1) 10Dzahn: planet: besides http_proxy need https_proxy too [puppet] - 10https://gerrit.wikimedia.org/r/221040 (https://phabricator.wikimedia.org/T101730) [02:12:41] (03CR) 10Dzahn: [C: 032] planet: fix feed URLs (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/221039 (owner: 10Dzahn) [02:13:31] (03CR) 10Dzahn: [C: 032] planet: besides http_proxy need https_proxy too [puppet] - 10https://gerrit.wikimedia.org/r/221040 (https://phabricator.wikimedia.org/T101730) (owner: 10Dzahn) [02:18:31] (03PS2) 10coren: Puppetize toolserver.org legacy server [puppet] - 10https://gerrit.wikimedia.org/r/220134 (https://phabricator.wikimedia.org/T85165) [02:19:56] (03CR) 10coren: [C: 032] "Carrying over Yuvi's +1 (whitespace fixes only)" [puppet] - 10https://gerrit.wikimedia.org/r/220134 
(https://phabricator.wikimedia.org/T85165) (owner: 10coren) [02:28:52] (03PS3) 10Alex Monk: Tools: Add database alias for wikimania2016wiki [puppet] - 10https://gerrit.wikimedia.org/r/214718 (https://phabricator.wikimedia.org/T96638) (owner: 10Tim Landscheidt) [02:30:42] !log l10nupdate Synchronized php-1.26wmf11/cache/l10n: (no message) (duration: 05m 36s) [02:30:51] Logged the message, Master [02:30:54] 6operations, 10Traffic, 7Pybal: pybal idleconn ipv6 monitors actually check ipv4 - https://phabricator.wikimedia.org/T103880#1403450 (10BBlack) [02:30:57] 6operations, 7Pybal: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747#1403451 (10BBlack) [02:31:11] 6operations, 10Traffic, 7Pybal: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747#904929 (10BBlack) [02:32:11] 6operations, 10Traffic, 10Wikimedia-DNS, 7Pybal: pybal DNS lookup issues causing outage risks - https://phabricator.wikimedia.org/T103921#1403456 (10BBlack) Probably interrelated with: T83662 [02:33:33] !log LocalisationUpdate completed (1.26wmf11) at 2015-06-26 02:33:33+00:00 [02:33:39] Logged the message, Master [02:36:05] Coren|AFK, what are https://gerrit.wikimedia.org/r/#/c/214718/ and https://gerrit.wikimedia.org/r/#/c/214995/3 waiting for? [02:36:59] Krenair: I should say "my being aware of the bug, which I wasn't copied on" :-) Lemme see if I can do this real quick. [02:37:25] You were already a reviewer on the changes [02:38:07] (03CR) 10coren: [C: 032] Tools: Add database alias for wikimania2016wiki [puppet] - 10https://gerrit.wikimedia.org/r/214718 (https://phabricator.wikimedia.org/T96638) (owner: 10Tim Landscheidt) [02:38:54] Krenair: Being added as reviewer is generally not enough to reliably ping me - I get over 300 gerrit email per day and they get autosorted. 
[02:40:36] (03PS4) 10coren: Add cnwikimedia to the list of wikis on labs [puppet] - 10https://gerrit.wikimedia.org/r/214995 (owner: 10Jcrespo) [02:42:18] (03CR) 10coren: [C: 032] "Trivial host addition" [puppet] - 10https://gerrit.wikimedia.org/r/214995 (owner: 10Jcrespo) [02:43:04] Krenair: {{done}} for both [02:43:07] ty [02:43:32] I note that maintain-replicas already picked both up so the views are there too. [02:45:35] Coren, where does the meta_p database come from? [02:46:01] is the data there automatically generated? [02:46:05] Krenair: Generated from the dblist in mediawiki-config [02:46:12] where's the script? [02:46:33] operations/software in maintain-replicas [02:52:53] (03PS1) 10Alex Monk: maintain-replicas: Do not record centralauth in meta_p.wiki [software] - 10https://gerrit.wikimedia.org/r/221042 (https://phabricator.wikimedia.org/T101750) [02:56:56] (03PS1) 10Dzahn: planet: fix feed URLS - use https for wordpress [puppet] - 10https://gerrit.wikimedia.org/r/221043 [02:59:56] 6operations, 5Patch-For-Review: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655#1403520 (10demon) Swapped them to locally hosted, which is better since it's not trying to hit svn.wm.o. Similar failure though. [03:02:37] (03PS2) 10Dzahn: planet: fix feed URLS - use https for wordpress [puppet] - 10https://gerrit.wikimedia.org/r/221043 [03:02:41] 6operations, 6Labs, 7Database: Santitize recent wikis: wikimania 2016 and cn.wikimedia.org at labs dbs - https://phabricator.wikimedia.org/T100441#1403521 (10Krenair) 5Open>3Resolved Looks like this was done [03:03:12] (03CR) 10Dzahn: [C: 032] planet: fix feed URLS - use https for wordpress [puppet] - 10https://gerrit.wikimedia.org/r/221043 (owner: 10Dzahn) [03:04:13] Coren, I have a couple more host file additions [03:05:20] Krenair: Point ye me at it? [03:05:31] haven't uploaded it yet [03:05:36] Now is the time, I'm on vacation starting tomorrow. 
[03:06:07] Although, to be fair, pretty much any ops can +2 that. :-) [03:06:46] (03PS1) 10Alex Monk: toollabs: Add gomwiki and lrcwiki db hosts file entries [puppet] - 10https://gerrit.wikimedia.org/r/221045 (https://phabricator.wikimedia.org/T102647) [03:06:54] Coren, ^ [03:07:06] Coren, in my experience it is quite difficult to get ops to +2 puppet changes [03:08:46] (03CR) 10coren: [C: 032] "Moar projects." [puppet] - 10https://gerrit.wikimedia.org/r/221045 (https://phabricator.wikimedia.org/T102647) (owner: 10Alex Monk) [03:09:35] So according to my list... that leaves labswiki (wikitech)? [03:10:05] (03PS1) 10Dzahn: planet: fix feed URLs - non-wordpress redirects [puppet] - 10https://gerrit.wikimedia.org/r/221046 [03:10:25] (03PS2) 10Dzahn: planet: fix feed URLs - non-wordpress redirects [puppet] - 10https://gerrit.wikimedia.org/r/221046 [03:10:30] Weren't there blockers for that one? [03:11:20] I don't think so? [03:11:22] nothing on https://phabricator.wikimedia.org/T89548 [03:11:33] (03CR) 10Dzahn: [C: 032] "planet.en.log:WARNING:planet.runner:Feed has moved from to Krenair: Oh, right, I was thinging phabricator. [03:14:19] 6operations, 5Patch-For-Review: move planet from zirconium to a ganeti VM - https://phabricator.wikimedia.org/T101730#1403565 (10Dzahn) all done. planet is running out of planet1001 in ganeti now and is switched. see changes above also feed URL fixes: https://gerrit.wikimedia.org/r/#/c/221043/ https://gerri... [03:14:43] 6operations, 5Patch-For-Review: move planet from zirconium to a ganeti VM - https://phabricator.wikimedia.org/T101730#1403566 (10Dzahn) p:5Normal>3Low [03:21:52] 6operations, 5Patch-For-Review: move planet from zirconium to a ganeti VM - https://phabricator.wikimedia.org/T101730#1403589 (10Dzahn) good things: - planet-venus 0~git9de2109-3 instead of planet-venus 0~bzr116-1 - jessie instead of precise - one less virtual host on zirconium - now a VM, separate from othe... 
[03:22:21] 6operations: move planet from zirconium to a ganeti VM - https://phabricator.wikimedia.org/T101730#1403590 (10Dzahn) [03:30:31] (03CR) 10Gergő Tisza: "Created T103958 so there is a more permanent place for the Commons/Wikidata discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220970 (https://phabricator.wikimedia.org/T74469) (owner: 10Gergő Tisza) [03:32:06] Coren, what dbuser does maintain-replicas run as? [03:32:50] root? [03:43:11] Krenair: The labs-specific 'dbmanager' [04:00:56] (03CR) 10Hydriz: [C: 031] Rename all main WikimediaIncubator settings to have a wg prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 (owner: 10Paladox) [04:18:19] (03PS1) 10KartikMistry: CX: Add eswiki-recommender campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221047 [04:33:23] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 10.71% of data above the critical threshold [100000000.0] [05:11:10] (03PS1) 10Dzahn: planet: remove broken feeds [puppet] - 10https://gerrit.wikimedia.org/r/221050 [05:11:14] (03CR) 10jenkins-bot: [V: 04-1] planet: remove broken feeds [puppet] - 10https://gerrit.wikimedia.org/r/221050 (owner: 10Dzahn) [05:11:22] (03PS2) 10Dzahn: planet: remove broken feeds [puppet] - 10https://gerrit.wikimedia.org/r/221050 [05:16:09] (03CR) 10Dzahn: [C: 031] "inline comments and these are ERRORS in planet logs" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/221050 (owner: 10Dzahn) [05:22:39] !log restarted apache on iridium to fix phabricator fatal [05:22:45] Logged the message, Master [05:24:43] (03CR) 10Dzahn: "because it didn't seem to work when i tested it. additional testing / amending is appreciated" [puppet] - 10https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425) (owner: 10Dzahn) [05:25:21] twentyafterfour: wfm [05:25:51] thx! [05:37:12] (03CR) 10Dzahn: [C: 04-1] "if anything it should be a symlink to "parking" now. 
see Change-Id: Idc4dccff197c16d and the comments on it" [dns] - 10https://gerrit.wikimedia.org/r/197361 (owner: 10Dzahn) [05:44:19] (03PS3) 10Dzahn: park wikiartpedia domains [dns] - 10https://gerrit.wikimedia.org/r/197361 [05:47:42] (03PS2) 10Dzahn: park visualwikipedia domains [dns] - 10https://gerrit.wikimedia.org/r/197362 [06:04:14] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jun 26 06:04:14 UTC 2015 (duration 4m 13s) [06:04:20] Logged the message, Master [06:20:41] (03PS8) 10Dzahn: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) [06:21:30] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [06:22:23] (03PS9) 10Dzahn: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) [06:23:03] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [06:26:40] (03PS10) 10Dzahn: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) [06:27:19] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [06:29:44] (03PS11) 10Dzahn: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) [06:31:45] (03PS4) 10Giuseppe Lavagetto: confctl: allow regex expression and a global "all" [software/conftool] - 10https://gerrit.wikimedia.org/r/220536 [06:32:11] <_joe_> icinga-wm: where are the puppet failures? 
[06:32:25] * _joe_ feels disoriented [06:32:36] * _joe_ blames ori for breaking the good ole way [06:34:03] (03CR) 10Dzahn: [C: 04-1] "arr, still WIP, fonts-crosextra-carlito etc would miss in trusty" [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [07:01:25] 6operations, 7Database: db1002-db1007 - decom or repurpose? - https://phabricator.wikimedia.org/T103005#1403727 (10jcrespo) [07:10:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I doubt this is needed. See comment in phabricator task." [puppet] - 10https://gerrit.wikimedia.org/r/221035 (https://phabricator.wikimedia.org/T103940) (owner: 10Gergő Tisza) [07:16:53] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 10.34% of data above the critical threshold [100000000.0] [07:30:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] confctl: allow regex expression and a global "all" [software/conftool] - 10https://gerrit.wikimedia.org/r/220536 (owner: 10Giuseppe Lavagetto) [07:35:40] (03PS8) 10Alexandros Kosiaris: lvs::configuration: use hiera for lvs_service_ips [puppet] - 10https://gerrit.wikimedia.org/r/217289 [07:37:28] (03CR) 10Alexandros Kosiaris: [C: 032] lvs::configuration: use hiera for lvs_service_ips [puppet] - 10https://gerrit.wikimedia.org/r/217289 (owner: 10Alexandros Kosiaris) [07:44:43] (03CR) 10Muehlenhoff: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/220761 (owner: 10Alexandros Kosiaris) [07:46:32] PROBLEM - puppet last run on lvs2003 is CRITICAL Puppet last ran 11 hours ago [07:48:14] RECOVERY - puppet last run on lvs2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:44] (03CR) 10Mobrovac: [C: 031] Fix Shinken Mathoid probe [puppet] - 10https://gerrit.wikimedia.org/r/220954 (https://phabricator.wikimedia.org/T103595) (owner: 10Hashar) [07:52:29] (03CR) 10Muehlenhoff: "The current set LGTM, but let's also include the three additional hosts Andrew offered for testing?" 
[puppet] - 10https://gerrit.wikimedia.org/r/220772 (owner: 10Alexandros Kosiaris) [07:56:48] (03CR) 10Faidon Liambotis: [C: 031] ciphersuites: re-order ECDSA ahead of RSA [puppet] - 10https://gerrit.wikimedia.org/r/220377 (owner: 10BBlack) [07:57:05] !log krinkle Synchronized php-1.26wmf11/extensions/Popups: T103610 (duration: 00m 11s) [07:57:12] Logged the message, Master [08:00:50] (03PS1) 10Ricordisamoa: Redirect dartar's cite-o-meter to Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/221063 [08:02:19] (03PS1) 10Faidon Liambotis: HTTPS: raise production's HSTS to 6 months [puppet] - 10https://gerrit.wikimedia.org/r/221064 [08:06:51] (03PS1) 10Alexandros Kosiaris: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [08:06:53] RECOVERY - Incoming network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [08:12:13] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:14:15] (03PS1) 10Ricordisamoa: Make relic Toolserver files HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/221067 [08:17:32] (03CR) 10Alexandros Kosiaris: "why not ruby-dev and ruby respectively ? Both packages exist in trusty and jessie and depend on the packages specified in this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) (owner: 10Hashar) [08:18:54] (03PS1) 10Giuseppe Lavagetto: conftool: fixup for regex matching [software/conftool] - 10https://gerrit.wikimedia.org/r/221068 [08:19:20] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] conftool: fixup for regex matching [software/conftool] - 10https://gerrit.wikimedia.org/r/221068 (owner: 10Giuseppe Lavagetto) [08:22:13] PROBLEM - puppet last run on lvs1001 is CRITICAL Puppet last ran 14 hours ago [08:24:03] RECOVERY - puppet last run on lvs1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:30:35] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, 5Patch-For-Review: Backport sshd with AuthorizedKeysCommand support to Ubuntu precise - https://phabricator.wikimedia.org/T102401#1403906 (10MoritzMuehlenhoff) 5Open>3Resolved The SSH backport has been installed across the fleet and all precise... [08:31:26] (03PS1) 10Giuseppe Lavagetto: confd: do not declare rsyslog/logrotate rules for upstart-based distros. [puppet] - 10https://gerrit.wikimedia.org/r/221069 [08:32:50] (03PS5) 10Muehlenhoff: Enable firejail containment for zotero [puppet] - 10https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) [08:34:51] (03CR) 10Giuseppe Lavagetto: [C: 032] confd: do not declare rsyslog/logrotate rules for upstart-based distros. [puppet] - 10https://gerrit.wikimedia.org/r/221069 (owner: 10Giuseppe Lavagetto) [08:38:09] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1403926 (10jcrespo) [08:39:49] (03PS2) 10Alexandros Kosiaris: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [08:40:25] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "See comments, of course lgtm in general." 
[puppet] - 10https://gerrit.wikimedia.org/r/221065 (owner: 10Alexandros Kosiaris) [08:40:54] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 17.24% of data above the critical threshold [100000000.0] [08:45:32] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1403939 (10jcrespo) [08:46:18] (03PS1) 10Giuseppe Lavagetto: conftool: add service "varnish-be-rand" to cache text nodes [puppet] - 10https://gerrit.wikimedia.org/r/221071 [08:52:13] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add service "varnish-be-rand" to cache text nodes [puppet] - 10https://gerrit.wikimedia.org/r/221071 (owner: 10Giuseppe Lavagetto) [08:54:31] (03PS1) 10Faidon Liambotis: wmflib/ssl_ciphersuite: use an array of ciphers [puppet] - 10https://gerrit.wikimedia.org/r/221073 [08:54:36] _joe_: ^ :) [08:55:38] <_joe_> paravoid: yeah at the time I just wanted to make changes be unified instead of seeing 200 puppet commits, but it was admittedly horrendously crude [08:55:42] <_joe_> lemme take a look [08:56:46] (03CR) 10Giuseppe Lavagetto: [C: 031] "This is obviously much better." [puppet] - 10https://gerrit.wikimedia.org/r/221073 (owner: 10Faidon Liambotis) [08:57:44] (03PS3) 10Alexandros Kosiaris: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [08:58:08] <_joe_> at least conftool works well :) [08:58:15] since this is high impact, would be nice to run it through the compiler -- is that working now? [08:58:46] paravoid: yes [08:58:52] well... which compiler ? [08:59:05] there is at least one that is working [08:59:28] or at least another review? 
[08:59:30] *cough* akosiaris :) [08:59:36] <_joe_> paravoid: or, you could write some test /before/ this patch [08:59:41] <_joe_> and see it doesn't break :P [08:59:50] * _joe_ hides [08:59:56] ah yes, paravoid loves RSpec [09:00:04] he will do it gladly :P [09:00:07] <_joe_> akosiaris: I know, this was pure trolling [09:00:34] btw, hiera interpolation in a hiera interpolation [09:00:34] I would actually love it if our CI would be able to detect noop changes [09:00:41] is it gonna work ? [09:00:42] I wonder [09:00:47] but so far it doesn't even catch 90% of the errors :/ [09:01:00] <_joe_> akosiaris: it should, lemme see [09:01:06] (03PS1) 10Giuseppe Lavagetto: conftool: Revert default for varnish-be-rand service [puppet] - 10https://gerrit.wikimedia.org/r/221075 [09:01:22] <_joe_> paravoid: that is pretty hard to do if we don't want to invest like 3/4 machines at least for that [09:01:51] (03CR) 10Alexandros Kosiaris: [C: 031] wmflib/ssl_ciphersuite: use an array of ciphers [puppet] - 10https://gerrit.wikimedia.org/r/221073 (owner: 10Faidon Liambotis) [09:01:58] <_joe_> paravoid: as we don't really know where a class will be called if we don't compile the whole catalog [09:02:01] nah, we could be smarter at detecting affected hosts [09:02:06] <_joe_> paravoid: how? 
[09:02:15] <_joe_> it's not that easy you know [09:02:26] <_joe_> not that I didn't think about that [09:02:26] I didn't say it was easy :) [09:02:47] <_joe_> paravoid: it means parsing the whole puppet tree and compiling all the different node stanzas [09:02:59] sort of [09:03:17] we already have the *existing* applied classes on a host [09:03:19] <_joe_> paravoid: I had a function to do that, but it still took ~ 40 minutes [09:03:24] we can pre-cache that [09:03:33] <_joe_> on a single vm, though [09:03:48] and if there are no include/class statements in the diff, you can be relatively sure this set didn't change [09:03:53] <_joe_> and if we use the new puppet diff approach, it could be easier [09:04:32] <_joe_> as we can run compilations in parallel on two different remote puppetmasters [09:05:12] <_joe_> so we may just run two puppetmasters, on decent VMs and be done in a smaller amount of time [09:05:42] <_joe_> but reducing that to what we think is acceptable in general, uhm [09:05:53] _joe_: nope. interpolation in interpolation does not work [09:05:58] 6operations, 6Research-and-Data, 7Database: Test and fix db1047 BBU - https://phabricator.wikimedia.org/T103345#1403973 (10jcrespo) p:5Triage>3Low Low priority because for now the hosts works as intended. 
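Note on the affected-hosts idea above: the proposal is to cache the classes already applied on each host, and to trust that cache whenever a diff adds or removes no include/class statements. A rough sketch of both halves follows, under loud assumptions: the regex is a crude approximation of Puppet syntax, and the host-to-class mapping is made up for illustration (the hostnames appear in this log, the class assignments do not).

```python
import re
from typing import Dict, Iterable, List, Set

# Crude match for "include foo" or "class { 'foo': }" on a changed line.
CLASS_GRAPH_RE = re.compile(r"^\s*include\s+\S|\bclass\s*\{")


def touches_class_graph(diff_lines: Iterable[str]) -> bool:
    """True if any added or removed line looks like an include/class statement."""
    for line in diff_lines:
        if line.startswith(("+++", "---")):
            continue  # file headers, not content
        if line.startswith(("+", "-")) and CLASS_GRAPH_RE.search(line[1:]):
            return True
    return False


def affected_hosts(changed: Set[str], applied: Dict[str, Set[str]]) -> List[str]:
    """Hosts whose cached class set intersects the changed classes."""
    return sorted(host for host, classes in applied.items() if classes & changed)


# Hypothetical pre-cached mapping of host -> applied classes.
applied = {
    "planet1001": {"planet", "base"},
    "lvs1001": {"lvs::configuration", "base"},
}

value_only_diff = ["--- a/x.pp", "+++ b/x.pp", "-  $ip = 'a'", "+  $ip = 'b'"]
print(touches_class_graph(value_only_diff))
print(affected_hosts({"planet"}, applied))
```

When `touches_class_graph` returns True the cache cannot be trusted and a full compile is still needed, which is the expensive case described in the discussion.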
[09:06:10] <_joe_> akosiaris: of course, shitty to the last bit [09:07:10] (03PS2) 10Faidon Liambotis: ssl_ciphersuite: make cipherlist more readable [puppet] - 10https://gerrit.wikimedia.org/r/221073 [09:07:12] (03PS2) 10Faidon Liambotis: ssl_ciphersuite: re-order ECDSA ahead of RSA [puppet] - 10https://gerrit.wikimedia.org/r/220377 (owner: 10BBlack) [09:07:24] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: Revert default for varnish-be-rand service [puppet] - 10https://gerrit.wikimedia.org/r/221075 (owner: 10Giuseppe Lavagetto) [09:07:31] argh [09:07:32] :) [09:07:52] <_joe_> eheh [09:07:59] (03PS3) 10Faidon Liambotis: ssl_ciphersuite: make cipherlist more readable [puppet] - 10https://gerrit.wikimedia.org/r/221073 [09:08:01] <_joe_> this is part of my trolling effort as well [09:08:02] <_joe_> :P [09:08:07] (03CR) 10Faidon Liambotis: [C: 032 V: 032] ssl_ciphersuite: make cipherlist more readable [puppet] - 10https://gerrit.wikimedia.org/r/221073 (owner: 10Faidon Liambotis) [09:08:43] -ip = 198.35.26.106 [09:08:44] +ip = } [09:08:58] wow... I could really wreck everything if I merge this patch [09:09:09] ...don't? :P [09:09:22] which.. isn't big news... most patches have that potential now that I think about it [09:10:07] (03CR) 10Mobrovac: [C: 04-1] Enable firejail containment for zotero (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) (owner: 10Muehlenhoff) [09:11:51] (03CR) 10Faidon Liambotis: "Updated for a cleaner version of ciphersuites that uses an array :)" [puppet] - 10https://gerrit.wikimedia.org/r/220377 (owner: 10BBlack) [09:11:57] (03PS4) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [09:12:29] look at how cleaner this diff is :) [09:12:39] you can actually see what's changing! 
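Note on the "cleaner diff" remark above: OpenSSL cipher lists are colon-separated strings, so keeping the ciphers as an array with one entry per line and joining them at the end lets a reorder show up as a small line-level diff. The sketch below illustrates that effect with three real OpenSSL cipher names; the selection and the helper names are assumptions, not the contents of the ssl_ciphersuite patch.

```python
import difflib
from typing import List, Sequence


def cipher_string(ciphers: Sequence[str]) -> str:
    """Join an ordered list of ciphers into an OpenSSL-style cipher list."""
    return ":".join(ciphers)


def changed_lines(before: Sequence[str], after: Sequence[str]) -> List[str]:
    """Added/removed entries from a line-by-line diff, without headers."""
    diff = difflib.unified_diff(list(before), list(after), lineterm="", n=0)
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


before = [
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "AES128-GCM-SHA256",
]
# Re-order ECDSA ahead of RSA.
after = [before[1], before[0], before[2]]

print(cipher_string(after))
for line in changed_lines(before, after):
    print(line)
```

With the single-string form, the same reorder would replace the whole line, which hides which cipher actually moved.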
[09:13:08] <_joe_> paravoid: eheh, true [09:13:50] (03Abandoned) 10Giuseppe Lavagetto: varnish: allow picking which director is dynamic [puppet] - 10https://gerrit.wikimedia.org/r/220492 (owner: 10Giuseppe Lavagetto) [09:14:55] (03PS8) 10Addshore: rsync wikidata json dumps to labs /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) [09:25:57] (03PS5) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [09:30:14] (03PS3) 10Joal: Add new projectview to projectcounts aggregation [puppet] - 10https://gerrit.wikimedia.org/r/220752 (https://phabricator.wikimedia.org/T101118) [09:30:26] (03CR) 10Joal: Add new projectview to projectcounts aggregation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/220752 (https://phabricator.wikimedia.org/T101118) (owner: 10Joal) [09:35:38] (03PS6) 10Muehlenhoff: Enable firejail containment for zotero [puppet] - 10https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) [09:36:24] RECOVERY - Incoming network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [09:45:08] Reedy: hey :) [09:46:10] Reedy: looking @ https://github.com/EFForg/https-everywhere/blob/master/src/chrome/content/rules/Wikimedia.xml [10:03:27] (03PS1) 10Hashar: Setup tox for easy venv [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 [10:04:25] (03CR) 10Hashar: "That is a start. The integration tests fails for me because etcd is not shutdown between tests and that complains with 'ERROR can't bind t" [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (owner: 10Hashar) [10:07:42] (03PS2) 10Hashar: Setup tox for easy venv [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (https://phabricator.wikimedia.org/T103972) [10:11:31] (03CR) 10Yuvipanda: "Needs rebase?" 
[puppet] - 10https://gerrit.wikimedia.org/r/220954 (https://phabricator.wikimedia.org/T103595) (owner: 10Hashar) [10:14:13] (03CR) 10Hashar: "check experimental" [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (https://phabricator.wikimedia.org/T103972) (owner: 10Hashar) [10:17:56] (03PS3) 10Hashar: Fix Shinken Mathoid probe [puppet] - 10https://gerrit.wikimedia.org/r/220954 (https://phabricator.wikimedia.org/T103595) [10:18:11] (03CR) 10Hashar: "@Yuvipanda rebased :}" [puppet] - 10https://gerrit.wikimedia.org/r/220954 (https://phabricator.wikimedia.org/T103595) (owner: 10Hashar) [10:21:05] 7Puppet, 6Phabricator: Local config file contains escape characters - https://phabricator.wikimedia.org/T103924#1404151 (10Aklapper) p:5Triage>3Low [10:21:16] (03CR) 10Yuvipanda: [C: 032] Fix Shinken Mathoid probe [puppet] - 10https://gerrit.wikimedia.org/r/220954 (https://phabricator.wikimedia.org/T103595) (owner: 10Hashar) [10:23:38] YuviPanda: if you can take care of updating the labs Shinken, that will be ideal :-D [10:23:40] http://shinken.wmflabs.org/problems?search=hg:deployment-prep [10:24:02] it doesn't run its own puppetmaster, so it shouldn't need any special updating [10:24:14] should update when puppet runs next [10:32:41] 7Puppet, 3Reading-Web, 3Reading-Web-Sprint-50-The-X-Files: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1404213 (10phuedx) [10:37:45] YuviPanda: great, will check this afternoon. Thanks! [10:41:13] PROBLEM - puppet last run on es1004 is CRITICAL Puppet has 1 failures [10:41:41] 6operations, 10ops-codfw, 7Database: Faulty memory on es2004 - https://phabricator.wikimedia.org/T103843#1404278 (10jcrespo) p:5Triage>3Normal [10:46:33] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 10.34% of data above the critical threshold [100000000.0] [10:50:27] 6operations, 7Database: db1002-db1007 - decom or repurpose? 
- https://phabricator.wikimedia.org/T103005#1404309 (10jcrespo) p:5Triage>3Low [10:55:56] 6operations, 7Database: mysql boxes not in ganglia - https://phabricator.wikimedia.org/T87209#1404321 (10jcrespo) p:5Triage>3Normal [10:57:12] RECOVERY - puppet last run on es1004 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:07:02] (03PS2) 10Giuseppe Lavagetto: varnish: add service to the directors options [puppet] - 10https://gerrit.wikimedia.org/r/220815 [11:07:51] (03CR) 10jenkins-bot: [V: 04-1] varnish: add service to the directors options [puppet] - 10https://gerrit.wikimedia.org/r/220815 (owner: 10Giuseppe Lavagetto) [11:09:25] (03PS7) 10Muehlenhoff: Enable firejail containment for zotero [puppet] - 10https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) [11:11:01] (03CR) 10Mobrovac: [C: 031] Enable firejail containment for zotero [puppet] - 10https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) (owner: 10Muehlenhoff) [11:13:34] 6operations, 10Gather, 7Database, 7Schema-change: Update Gather DB schema for flagging backend - https://phabricator.wikimedia.org/T103611#1404396 (10jcrespo) In order to apply a schema change, we also need: * The wikis where this will be applied (I assume all where the Gather extension is, but it should... 
[11:15:46] 6operations, 10Gather, 7Database, 7Schema-change: Update Gather DB schema for flagging backend - https://phabricator.wikimedia.org/T103611#1404410 (10jcrespo) p:5Triage>3Normal [11:16:19] (03PS3) 10Giuseppe Lavagetto: varnish: add service to the directors options [puppet] - 10https://gerrit.wikimedia.org/r/220815 [11:33:32] (03PS1) 10Giuseppe Lavagetto: confd1001: join the etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/221099 [11:36:26] (03PS1) 10Giuseppe Lavagetto: etcd: add conf1001 to the servers [dns] - 10https://gerrit.wikimedia.org/r/221100 [11:39:11] (03PS3) 10BBlack: ssl_ciphersuite: re-order ECDSA ahead of RSA [puppet] - 10https://gerrit.wikimedia.org/r/220377 [11:40:01] (03PS2) 10BBlack: HTTPS: raise production's HSTS to 6 months [puppet] - 10https://gerrit.wikimedia.org/r/221064 (owner: 10Faidon Liambotis) [11:41:06] (03CR) 10BBlack: [C: 032] HTTPS: raise production's HSTS to 6 months [puppet] - 10https://gerrit.wikimedia.org/r/221064 (owner: 10Faidon Liambotis) [11:44:18] bblack: morning :) [11:44:39] bblack is awake all times... [11:45:33] hi :) [11:45:57] I'm only slightly awake, I fully intend to go back to sleep in a bit! [11:55:43] (03PS2) 10Giuseppe Lavagetto: confd1001: join the etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/221099 [11:55:45] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] etcd: add conf1001 to the servers [dns] - 10https://gerrit.wikimedia.org/r/221100 (owner: 10Giuseppe Lavagetto) [11:57:01] 6operations, 7Database: review eqiad database server quantities / warranties / service(s) - https://phabricator.wikimedia.org/T103936#1404542 (10jcrespo) I suppose very related to T103005. [12:02:01] (03CR) 10BBlack: [C: 031] varnish: add service to the directors options [puppet] - 10https://gerrit.wikimedia.org/r/220815 (owner: 10Giuseppe Lavagetto) [12:08:11] hmm, no csteipp ? [12:08:30] is it urgent thedj? [12:08:39] it's like 5 in the morning in SF [12:09:34] ill send a mail... 
[12:10:53] RECOVERY - salt-minion processes on conf1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:12:24] (03PS3) 10Giuseppe Lavagetto: confd1001: join the etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/221099 [12:13:36] (03CR) 10Giuseppe Lavagetto: [C: 032] confd1001: join the etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/221099 (owner: 10Giuseppe Lavagetto) [12:18:49] <_joe_> !log added conf1001 to the etcd cluster [12:18:55] Logged the message, Master [12:19:20] Krenair, want to check that we understand "blocked external/not db team" in the same way? [12:20:01] Krenair: i think someone is spoofiing our phab. But i send an email. [12:22:07] (03PS1) 10Alex Monk: wikitech: Re-add wgCookieDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221105 (https://phabricator.wikimedia.org/T103939) [12:22:08] jynus, sigh. what bug is this about? [12:22:17] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [12:22:44] (03CR) 10Alex Monk: [C: 032] wikitech: Re-add wgCookieDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221105 (https://phabricator.wikimedia.org/T103939) (owner: 10Alex Monk) [12:22:47] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 10.34% of data above the critical threshold [100000000.0] [12:22:48] Krenair, T101750 [12:22:50] (03Merged) 10jenkins-bot: wikitech: Re-add wgCookieDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221105 (https://phabricator.wikimedia.org/T103939) (owner: 10Alex Monk) [12:23:37] RECOVERY - Host mw2031 is UPING OK - Packet loss = 0%, RTA = 43.87 ms [12:23:38] !log krenair Synchronized wmf-config/wikitech.php: https://gerrit.wikimedia.org/r/#/c/221105/ (duration: 00m 12s) [12:23:45] Logged the message, Master [12:24:39] (03PS1) 10Giuseppe Lavagetto: conf1001: remove startup parameters [puppet] - 10https://gerrit.wikimedia.org/r/221106 [12:25:16] jynus, I figured that'd be a labs team thing [12:25:27] ok, 
then we are on the same page [12:25:38] :-) [12:25:46] great :) [12:26:19] just wanted to make sure that you didn't need something from us and you put it on the "but is is not for you" panel [12:26:52] it's not a need [12:26:58] (03CR) 10Glaisher: [C: 031] Autocreate accounts on meta, mediawiki.org, loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220970 (https://phabricator.wikimedia.org/T74469) (owner: 10Gergő Tisza) [12:27:11] I just think it's silly that centralauth is considered a wiki by this table :p [12:27:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] conf1001: remove startup parameters [puppet] - 10https://gerrit.wikimedia.org/r/221106 (owner: 10Giuseppe Lavagetto) [12:27:26] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 885.161964265 [12:27:50] lol [12:28:10] Krenair, nothing against the idea of the patch although I havn't check it properly [12:30:51] our security@ bounces for jmm :) [12:31:28] thedj, you email security@ and it bounces for someone? [12:31:56] jmm would be.. moritzm [12:32:06] Krenair: yup [12:32:20] https://phabricator.wikimedia.org/T103987 [12:39:27] thedj: interesring, I only added myself an hour ago, so something's wrong there. could you please bounce me the error to moritz@wikimedia.org ? 
[12:40:57] I had thought that commit was you :) [12:41:30] you probably added jmm instead of moritzm or something [12:43:46] PROBLEM - puppet last run on mw2072 is CRITICAL puppet fail [12:46:07] RECOVERY - Incoming network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [12:50:54] moritzm: fw'ed [12:51:21] thedj: thanks, fixed now [12:51:23] (03PS1) 10Yuvipanda: base: Sort list of packages in base [puppet] - 10https://gerrit.wikimedia.org/r/221110 [12:52:48] (03PS1) 10Yuvipanda: base: Install molly-guard everywhere [puppet] - 10https://gerrit.wikimedia.org/r/221111 (https://phabricator.wikimedia.org/T103873) [12:53:17] thedj: fwiw, it's not just phab, it's everything [12:54:30] paravoid: right... [12:54:37] (03PS2) 10Yuvipanda: base: Sort list of packages in base [puppet] - 10https://gerrit.wikimedia.org/r/221110 [12:54:39] (03PS2) 10Yuvipanda: base: Install molly-guard everywhere [puppet] - 10https://gerrit.wikimedia.org/r/221111 (https://phabricator.wikimedia.org/T103873) [12:55:25] (03CR) 10Yuvipanda: [C: 032] base: Sort list of packages in base [puppet] - 10https://gerrit.wikimedia.org/r/221110 (owner: 10Yuvipanda) [12:55:36] PROBLEM - RAID on graphite1002 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [12:55:37] PROBLEM - SSH on graphite1002 is CRITICAL: Server answer [12:55:47] PROBLEM - dhclient process on graphite1002 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [12:55:48] PROBLEM - configured eth on graphite1002 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [12:56:06] PROBLEM - puppet last run on graphite1002 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [12:56:36] PROBLEM - DPKG on graphite1002 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [12:56:47] PROBLEM - Disk space on graphite1002 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. 
[12:57:07] PROBLEM - salt-minion processes on graphite1002 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake. [13:01:37] RECOVERY - puppet last run on mw2072 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:21:53] (03PS1) 10Odder: More high-resolution logos for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221113 (https://phabricator.wikimedia.org/T102852) [13:22:50] moritzm: I wrote a hacky script with paramiko to measure RTT's over ssh, and I'll do some testing at home tonight. [13:23:02] (03PS2) 10Odder: More high-resolution logos for Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221113 (https://phabricator.wikimedia.org/T102852) [13:23:34] wired at work I get 80 \pm 4 ms rtt [13:23:39] (03CR) 10Odder: "See I78966ce for the follow-up." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219012 (https://phabricator.wikimedia.org/T102852) (owner: 10Odder) [13:25:17] valhallasw: what's the problem? [13:25:25] (03CR) 10Ottomata: [C: 031] Add new projectview to projectcounts aggregation [puppet] - 10https://gerrit.wikimedia.org/r/220752 (https://phabricator.wikimedia.org/T101118) (owner: 10Joal) [13:25:38] paravoid: moritzm was wondering whether ssh rtt's were so big that people needed to use mosh [13:25:48] for labs? [13:25:50] yeah [13:26:13] I didn't have much problems with just using ssh from India... [13:26:20] it's ~145ms from over here, so depends on your tolerance levels [13:26:38] even over 3G in remote-ish places. [13:26:59] it could possibly just be that I have insanely high tolerance levels :) [13:27:09] YuviPanda: do you still have remote access to any of those places to measure it? 
[13:27:33] valhallasw: not personally, but I can find people who can run tests if need be [13:42:02] 6operations, 5Patch-For-Review: Install molly-guard on production hosts - https://phabricator.wikimedia.org/T103873#1404747 (10Negative24) So this molly-guard package, is it going to be installed everywhere as in Labs and prod? A few emails and docs should be written because I, and I guarantee a few others, wo... [13:42:46] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 5976.91307256 [13:45:08] 6operations, 5Patch-For-Review: Install molly-guard on production hosts - https://phabricator.wikimedia.org/T103873#1404756 (10Krenair) Looking at http://www.ubuntugeek.com/molly-guard-protects-machines-from-accidental-shutdownsreboots.html it seems super-simple to use. Literally just asks you for the target h... [13:45:12] 6operations, 5Patch-For-Review: Install molly-guard on production hosts - https://phabricator.wikimedia.org/T103873#1404758 (10MoritzMuehlenhoff) You still can reboot your instances, this is intended to prevent accidental reboots/shutdowns when logged into the wrong system: $ apt-cache show molly-guard (..)... 
[13:45:52] 6operations, 5Patch-For-Review: Install molly-guard on production hosts - https://phabricator.wikimedia.org/T103873#1404759 (10yuvipanda) They can reboot their instance - it just asks you to type in the name of the hostname before rebooting http://manpages.ubuntu.com/manpages/lucid/man8/molly-guard.8.html [13:46:19] (03CR) 10Nemo bis: [C: 031] "AFAIK the only feed for the en.wiki signpost is http://en.wikipedia.org/w/index.php?title=Wikipedia:Wikipedia_Signpost/Issue&feed=atom&act" [puppet] - 10https://gerrit.wikimedia.org/r/221050 (owner: 10Dzahn) [13:50:34] 6operations, 5Patch-For-Review: Install molly-guard on production hosts - https://phabricator.wikimedia.org/T103873#1404760 (10Negative24) Like I said I didn't know how to use it when I posted that. Now I do. [14:01:35] any sweet person mind running a git test for me please ? Would need the output of: git --version && GIT_TRACE=1 GIT_TRACE_PACKET=1 git fetch 2>&1|head -n20 [14:01:43] trying to debug out a git slowness with Gerrit [14:02:27] hashar: On which repo? [14:02:35] Over ssh or over https? [14:02:56] ssh [14:03:05] well if you get https I can take it as well :D [14:03:22] upload-pack executed on Gerrit side sends me all refs/changes/ :-( [14:04:42] hashar: http://fpaste.org/236980/14353274/ [14:05:18] yeah same on your setup bah [14:05:53] hoo: thank you! 
[14:06:19] Happy to help :) [14:17:43] !log Deployed patch for T103391 [14:17:50] Logged the message, Master [14:24:58] (03PS13) 10Ottomata: Refactor eventlogging role classes to make it easier to include different processes on different hosts [puppet] - 10https://gerrit.wikimedia.org/r/220912 (https://phabricator.wikimedia.org/T102831) [14:26:19] (03CR) 10Ottomata: [C: 032] Refactor eventlogging role classes to make it easier to include different processes on different hosts [puppet] - 10https://gerrit.wikimedia.org/r/220912 (https://phabricator.wikimedia.org/T102831) (owner: 10Ottomata) [14:30:46] (03CR) 10Addshore: [C: 031] contint: Create symlink for composer in /usr/local/bin/ [puppet] - 10https://gerrit.wikimedia.org/r/220658 (owner: 10Legoktm) [14:31:18] 6operations, 7discovery-system: Install etcd in multiple rows/racks - https://phabricator.wikimedia.org/T101713#1404860 (10Joe) I added conf1001 to the cluster this morning. I will add the remainder in the next couple of hours. [14:33:47] PROBLEM - puppet last run on hafnium is CRITICAL puppet fail [14:34:22] ottomata: you broke puppet! [14:35:09] I DI!? [14:35:21] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not parse for environment production: invalid byte sequence in US-ASCII at /etc/puppet/manifests/role/eventlogging.pp:1 on node tools-checker-01.tools.eqiad.wmflabs [14:35:33] !? [14:35:35] weird [14:35:46] i just applied that change on one prod server [14:35:59] ottomata: yeah but it broke the labs puppetmaster... [14:36:02] werid [14:36:04] because encoding, I guess. [14:36:07] not sure what I put there [14:36:11] i did the same thign i always do! 
[14:36:53] YuviPanda: I can't log in there [14:36:55] to check [14:37:02] ottomata: labcontrol1001.wikimedia.org [14:37:05] (03PS6) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [14:37:39] ottomata: it also failed on hafnium but with a different error [14:37:41] <_joe_> ottomata: trusty vs precise [14:37:46] <_joe_> YuviPanda: ^^ [14:37:50] ESC[1;31mError: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Class[Eventlogging::Monitoring] is already declared in file /etc/puppet/manifests/role/eventlogging.pp:221; cannot redeclare at /etc/puppet/manifests/role/eventlogging.pp:266 on node hafnium.wikimedia.orgESC[0m [14:37:51] yeah [14:38:00] but how to find the offending character? [14:38:18] <_joe_> YuviPanda: akosiaris surely knows [14:38:37] <_joe_> but... files in ASCII? didn't we overcome that? [14:38:37] checking [14:39:24] ? [14:39:50] (03PS1) 10Giuseppe Lavagetto: etcd: add conf1001 to the SRV record for clients [dns] - 10https://gerrit.wikimedia.org/r/221122 [14:39:50] <_joe_> Error 400 on SERVER: Could not parse for environment production: invalid byte sequence in US-ASCII at /etc/puppet/manifests/role/eventlogging.pp:1 [14:40:03] <_joe_> akosiaris: wasn't this one of the problems with trusty vs precise? [14:40:06] (03PS1) 10Ottomata: Use include to avoid class conflict for eventlogging::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/221123 [14:40:13] <_joe_> err ruby 1.8 vs 1.9 [14:40:36] yes, gimme a sec [14:40:48] (03CR) 10Ottomata: [C: 032] Use include to avoid class conflict for eventlogging::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/221123 (owner: 10Ottomata) [14:41:33] wut? can't I include a class twice? [14:41:36] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: add conf1001 to the SRV record for clients [dns] - 10https://gerrit.wikimedia.org/r/221122 (owner: 10Giuseppe Lavagetto) [14:41:38] since when can I not do that? 
[14:42:39] <_joe_> ottomata: include twice with params? [14:42:43] <_joe_> since forever [14:43:03] no not with params [14:43:04] just include [14:43:13] oh opps [14:43:16] didn't puppet merge :p [14:43:21] <_joe_> ahah [14:44:37] PROBLEM - Check status of defined EventLogging jobs on analytics1010 is CRITICAL Stopped EventLogging jobs: reporter/statsd [14:44:50] cool! [14:44:54] not critical. [14:45:03] interesting though. [14:47:15] (03PS1) 10Ottomata: Name nrpe monitor services differently for graphite consumer and main monitor [puppet] - 10https://gerrit.wikimedia.org/r/221125 [14:47:49] (03CR) 10Ottomata: [C: 032 V: 032] Name nrpe monitor services differently for graphite consumer and main monitor [puppet] - 10https://gerrit.wikimedia.org/r/221125 (owner: 10Ottomata) [14:50:00] (03PS1) 10Ottomata: Use require to include eventlogging::monitoring so it is evaluated first [puppet] - 10https://gerrit.wikimedia.org/r/221126 [14:50:15] (03CR) 10Ottomata: [C: 032 V: 032] Use require to include eventlogging::monitoring so it is evaluated first [puppet] - 10https://gerrit.wikimedia.org/r/221126 (owner: 10Ottomata) [14:51:18] PROBLEM - puppet last run on analytics1010 is CRITICAL puppet fail [14:51:34] (03PS1) 10Ottomata: Depend on eventlogging::monitoring class, not resource inside the class [puppet] - 10https://gerrit.wikimedia.org/r/221127 [14:51:53] (03CR) 10Ottomata: [C: 032 V: 032] Depend on eventlogging::monitoring class, not resource inside the class [puppet] - 10https://gerrit.wikimedia.org/r/221127 (owner: 10Ottomata) [14:52:38] RECOVERY - puppet last run on hafnium is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:53:07] RECOVERY - puppet last run on analytics1010 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:53:22] so, did you guys solve the ASCII issue ? 
[14:54:01] (03PS1) 10Ottomata: Fully qualify references to ::eventlogging::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/221129 [14:54:13] akosiaris: i don't think so? [14:54:17] its happenings on a labs box [14:54:30] YuviPanda: ? [14:54:35] which one ? [14:54:41] all of them [14:54:51] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not parse for environment production: invalid byte sequence in US-ASCII at /etc/puppet/manifests/role/eventlogging.pp:1 on node tools-checker-01.tools.eqiad.wmflabs [14:54:52] labs puppetmaster is trusty, prod's is precise, so that's where suspect is [14:55:24] started happening right after https://gerrit.wikimedia.org/r/#/c/220912/ [14:55:49] (03CR) 10Ottomata: [C: 032] Fully qualify references to ::eventlogging::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/221129 (owner: 10Ottomata) [14:55:53] YuviPanda: utf-8 BOM? [14:55:55] ti's on line 1 [14:56:13] oh? do you see it? [14:56:15] I don't... [14:56:19] no, I'm just guessing [14:56:24] yeah, that's possible... [14:56:39] valhallasw: wait, utf8 has no BOM [14:56:42] does it? [14:56:45] utf-16 does... [14:56:53] niah, it does not have a bom [14:56:56] some editors put an utf-8 'bom' at the start of the file [14:57:03] right, ignore me, it does [14:57:04] i.e. u+FEFF [14:57:10] but this probably isn't it [14:57:16] git show should've showed that, I presume. [14:57:38] hexdump doesn't show it, so it's probably not there [14:58:40] also nothing above hex 80 in the first line :/ [14:59:06] python open('eventlogging.pp').read().decode('ascii') gives me UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3059: ordinal not in range(128) [15:00:15] '# Responsible for forwarding\xc2\xa0incoming raw events from UDP\n' [15:00:33] YuviPanda, ottomata, are y’all on top of this already? 
[15:00:41] andrewbogott: yeah, valhallasw found the spot [15:00:54] 'k [15:00:56] I only see a space there [15:01:31] (03PS1) 10Yuvipanda: eventlogging: HAHAHAHA [puppet] - 10https://gerrit.wikimedia.org/r/221130 [15:01:38] akosiaris: a non-breaking space, to be precise [15:01:48] http://www.fileformat.info/info/unicode/char/00a0/index.htm => \xc2\xa0 [15:01:49] PROBLEM - Check status of defined EventLogging jobs on graphite consumer on hafnium is CRITICAL Stopped EventLogging jobs: reporter/statsd [15:01:49] (03PS2) 10Yuvipanda: eventlogging: HAHAHAHA [puppet] - 10https://gerrit.wikimedia.org/r/221130 [15:02:05] i found it [15:02:07] no way this could be detected visually [15:02:09] oh you found it too! [15:02:18] (03PS3) 10Yuvipanda: eventlogging: HAHAHAHA [puppet] - 10https://gerrit.wikimedia.org/r/221130 [15:02:19] valhallasw: well done [15:02:21] where did that ocome fro!M/ [15:02:23] thanks valhallasw [15:02:28] how did I even type that? [15:02:47] maybe copied the text from somewhere? I'm not sure [15:02:52] why does it not break older ruby? [15:02:57] (03CR) 10Yuvipanda: [C: 032 V: 032] eventlogging: HAHAHAHA [puppet] - 10https://gerrit.wikimedia.org/r/221130 (owner: 10Yuvipanda) [15:02:58] but more importantly, how did pplint not catch this? [15:03:20] YuviPanda: ruby 1.9 strings are unicode, ruby 1.8 are not [15:03:23]  maybe gerrit's pplint is running on older ruby? [15:03:38] my puppet parser validate did not catch it either [15:03:48] akosiaris: ah, I see. so 1.8 is just passing it through as a raw sequence of bytes [15:03:52] while 1.9 actually tries to do unicode [15:04:02] YuviPanda: yup [15:04:59] shall I open a task for hashar to upgrade the puppet version used for CI tests? [15:05:17] so, why did that not break in production ? never puppet-merged it ? [15:05:18] PROBLEM - Check status of defined EventLogging jobs on hafnium is CRITICAL Stopped EventLogging jobs: reporter/statsd [15:05:25] akosiaris: palladium is precise? 
[15:05:31] ottomata: yes [15:05:37] older ruby there? not sure [15:05:43] yes, 1.8 [15:05:44] i puppet parser validated it on pallaidum to check [15:05:44] ah yes [15:05:51] and didn't find it [15:05:53] so ja [15:05:57] that is the reason. it's precise and ruby1.8 [15:05:59] aye [15:06:06] precise does not necessarily mean ruby 1.8 btw. [15:06:10] aye [15:06:11] it's just the default [15:06:14] but more likely :) [15:06:15] thanks guys, sorry about that. [15:06:29] no worries. I wonder how you managed to do it though [15:06:31] haha [15:06:33] no idea! [15:06:40] • [15:06:40] have you guys migrated to a new puppet ? [15:06:42] is that this char? [15:06:46] option-8 on the mac [15:06:50] ? [15:07:06] dunno. [15:07:08] hashar: labs us using puppet on trusty. Prod is using puppet on Precise but will soon move to Jessie. [15:07:16] ottomata: it's this char: http://www.fileformat.info/info/unicode/char/00a0/index.htm [15:07:22] just to keep you on your toes :) [15:07:26] 6operations, 7Database: mysql boxes not in ganglia - https://phabricator.wikimedia.org/T87209#1404998 (10Dzahn) http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=MySQL%2520eqiad&tab=m&vn=&hide-hf=false ? [15:07:31] andrewbogott: the CI puppet master is Precise stil [15:07:37] <_joe_> ottomata: LOL [15:07:46] if mac os x decided to somehow change the representation to make it visible... well... [15:07:51] hashar: right, which is why the cursed patch made it through the tests. [15:08:28] https://integration.wikimedia.org/ci/job/pplint-HEAD/ is tied to Precise [15:08:33] then [15:08:37] we have no Jessie slaves for CI [15:08:49] RECOVERY - Check status of defined EventLogging jobs on graphite consumer on hafnium is OK All defined EventLogging jobs are runnning. [15:08:51] (well we have a small one for dns linting) [15:08:58] RECOVERY - Check status of defined EventLogging jobs on hafnium is OK All defined EventLogging jobs are runnning. 
[15:09:36] akosiaris: my editor actually did show it as visible with show invisibles on [15:09:40] except [15:09:46] my editor also shows spaces as visible [15:09:53] as little tiny grey dots [15:09:55] Option-space, apparently [15:10:00] that character was a slightly whiter tiny grey dot [15:10:10] uh, yeah! [15:10:12] hahah [15:10:19] must have had a heavy pinky [15:10:22] iunnoooo [15:10:40] lol [15:11:18] ottomata: I head YuviPanda's moose also likes pinkies >:) [15:11:28] It's not my moose [15:11:32] i was just fed to it [15:11:40] you have a moose? [15:11:59] fed to a moose ? [15:12:04] sounds fun! [15:15:00] (03CR) 10Alexandros Kosiaris: "while the internal puppet structures passed around change a bit with PS6, the result is a noop. Still not mergeable quality though" [puppet] - 10https://gerrit.wikimedia.org/r/221065 (owner: 10Alexandros Kosiaris) [15:15:50] that reminds me of a fun misunderstanding I had with somebody from Alaska, who I though was talking about mousse au chocolat; of course she was talking about the animal, which we finally clarified with 'ah, a moooouuuuuussse' ;) [15:16:08] <_joe_> lol [15:16:19] RECOVERY - Check status of defined EventLogging jobs on analytics1010 is OK All defined EventLogging jobs are runnning. [15:16:43] <_joe_> gwicke: that sounds like an interesting misunderstanding. [15:17:36] _joe_: it all didn't make sense to me, having this picture of mousse au chocolat in mind.. [15:18:00] <_joe_> gwicke: eheh I bet [15:18:03] hah, so 1. do not feed Yuvi to moose, but 2. feed a Mousse to Yuvi! 
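[editor's note] The eventlogging.pp hunt above came down to a single UTF-8 non-breaking space (bytes \xc2\xa0) in a comment. Puppet's error pointed at line 1, while the Python decode reported byte offset 3059, so neither message gave a usable line and column. Below is a minimal sketch of a scanner that reports every non-ASCII byte by line and column and also checks for a UTF-8 BOM (bytes EF BB BF). The function names and the sample line are illustrative only; the sample is the comment quoted in the log, not the full manifest, and this is not the tooling anyone in the channel actually ran.

```python
# Sketch: locate non-ASCII bytes in a file's raw contents.
# Names (find_non_ascii, has_utf8_bom) are hypothetical helpers for illustration.

def find_non_ascii(data: bytes):
    """Yield (line_number, column, byte_value) for every byte >= 0x80.

    Line and column numbers are 1-based; columns count bytes, not characters.
    """
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte >= 0x80:
                yield lineno, col, byte


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 encoding of U+FEFF."""
    return data.startswith(b"\xef\xbb\xbf")


# The offending comment line as quoted in the log.
sample = b"# Responsible for forwarding\xc2\xa0incoming raw events from UDP\n"

hits = list(find_non_ascii(sample))
for lineno, col, byte in hits:
    print(f"line {lineno}, column {col}: byte 0x{byte:02x}")
print("UTF-8 BOM present:", has_utf8_bom(sample))
```

To scan a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to find_non_ascii. Working on bytes rather than decoded text sidesteps the decoding difference discussed above, where one Ruby version passes the bytes through and the other tries to interpret them.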
[15:18:47] ;) [15:21:10] (03PS1) 10Ottomata: Use multiplexer_host for graphite consumer [puppet] - 10https://gerrit.wikimedia.org/r/221133 [15:22:59] (03CR) 10Ottomata: [C: 032] Use multiplexer_host for graphite consumer [puppet] - 10https://gerrit.wikimedia.org/r/221133 (owner: 10Ottomata) [15:29:36] PROBLEM - High load average on ms-be1015 is CRITICAL - load average: 239.32, 151.33, 73.33 [15:33:11] (03CR) 10Hashar: "> why not ruby-dev and ruby respectively ?" [puppet] - 10https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) (owner: 10Hashar) [15:34:18] (03PS1) 10Alex Monk: shinken: Check hosts over IPv4 only [puppet] - 10https://gerrit.wikimedia.org/r/221134 (https://phabricator.wikimedia.org/T101517) [15:35:39] (03CR) 10JanZerebecki: "Thank you for using an array!" [puppet] - 10https://gerrit.wikimedia.org/r/220377 (owner: 10BBlack) [15:36:56] (03PS2) 10Yuvipanda: shinken: Check hosts over IPv4 only [puppet] - 10https://gerrit.wikimedia.org/r/221134 (https://phabricator.wikimedia.org/T101517) (owner: 10Alex Monk) [15:38:15] (03CR) 10Yuvipanda: [C: 032] shinken: Check hosts over IPv4 only [puppet] - 10https://gerrit.wikimedia.org/r/221134 (https://phabricator.wikimedia.org/T101517) (owner: 10Alex Monk) [15:38:29] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1405142 (10Eevans) The bootstrap attempt from 1004 to 1008, both running 2.1.7, failed as well early this morning. From 1004: ``` INFO... [15:41:59] “Host UP alert for wikitech-static!” Thanks YuviPanda, Krenair [15:42:03] \o/ [15:42:08] all Krenair that one [15:42:12] you guys get alerts for hosts being up? 
[15:42:33] not for wikitech yet [15:42:35] just -static [15:43:05] PROBLEM - Host ms-be1015 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:39] (03PS1) 10DCausse: Upgrade extra and experimental-highlighter to 1.6.0 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/221136 (https://phabricator.wikimedia.org/T103598) [15:45:10] Krenair: ok, there it is. Both reporting as up now. [15:45:47] wheee! [15:46:49] 6operations, 7Database: mysql boxes not in ganglia - https://phabricator.wikimedia.org/T87209#1405192 (10jcrespo) Yes, but http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=MySQL+eqiad&h=&tab=m&vn=&hide-hf=false&m=mysql_com_select&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name Needs more investigation. [15:59:41] (03PS7) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [16:03:05] (03PS8) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [16:06:37] Does PrivateSettings.php.example on mediawiki-config contain all the wgs in the actual file? [16:08:22] no [16:08:41] 6operations, 10Analytics-Cluster: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1405240 (10Halfak) 3NEW [16:08:54] Glaisher, why do you ask? [16:09:22] wondering what the purpose of that public file is [16:09:39] doesn't really have anything except the obvious [16:09:44] no idea, YuviPanda added it [16:09:57] (03PS1) 10Ottomata: Don't cache datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221139 (https://phabricator.wikimedia.org/T104004) [16:10:09] wah [16:10:11] what did I do? 
[16:10:24] https://github.com/wikimedia/operations-mediawiki-config/commit/f42cf6f5fc28797a4469a854fea42e17bdb4b125 [16:11:24] (03PS18) 10Paladox: Rename all main WikimediaIncubator settings to have a wg prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [16:16:31] Glaisher: oh, I think I was trying to setup a vagrant setup for OpenStackManager... [16:16:39] Glaisher: and needed a simple privatesettings.php? [16:16:44] with just the absolute required bits... [16:16:48] feel free to delete it [16:16:51] ah [16:16:54] no, it's fine [16:18:40] (03PS1) 10Giuseppe Lavagetto: confd: monitor template failures and file removals [puppet] - 10https://gerrit.wikimedia.org/r/221140 (https://phabricator.wikimedia.org/T103360) [16:18:57] <_joe_> chasemp: ^^ [16:19:07] kk [16:24:32] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Deployment Access to tin for Ellery Wulczyn - https://phabricator.wikimedia.org/T103782#1405302 (10RobH) [16:25:21] <_joe_> chasemp: I'm not sure it works at all, it's probably full of bashisms I still have to weed out [16:25:54] sure thing, the intention is this woudl cover unconsumable templates by confd and missing generated files [16:25:57] right? [16:26:36] (03CR) 10RobH: [C: 04-2] "This patchset looks good, but cannot be merged until Ellery's manager approves on T103782." 
[puppet] - 10https://gerrit.wikimedia.org/r/221006 (owner: 10Matanya) [16:27:58] <_joe_> chasemp: yes [16:28:11] cool [16:28:18] <_joe_> so it's orthogonal to checking if the generated template passed the sanity check I think [16:28:28] cool I'll pursue that then [16:28:33] <_joe_> but verify the spurious file doesn't get generated in that case [16:28:40] <_joe_> if it does, we're done :) [16:28:54] right [16:29:06] <_joe_> or well, your wrapper would still be needed for the reload script, where present [16:29:12] so post initial generation on any invalid key / value change we would be silent still [16:29:17] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Deployment Access to tin for Ellery Wulczyn - https://phabricator.wikimedia.org/T103782#1405312 (10RobH) I've commented that Matanya's patchset https://gerrit.wikimedia.org/r/#/c/221006/ looks good for this and can be merged once the following conditions... [16:29:20] <_joe_> chasemp: nope [16:29:31] invalid key we woudl catch [16:29:35] <_joe_> chasemp: if confd runs and fails it generates a spurious file [16:30:04] (03PS9) 10Alexandros Kosiaris: WIP: lvs: hieraize lvs_services variable [puppet] - 10https://gerrit.wikimedia.org/r/221065 [16:30:07] <_joe_> even for spurious as in invalid for the template [16:30:22] really I thought that was removed on lint failure [16:30:49] <_joe_> you mean if the check_script fails? [16:30:58] <_joe_> I'm not sure about that, it must be verified [16:31:14] <_joe_> if it does, then we need a wrapper around check scripts :) [16:31:28] right that's what I was talking about [16:31:41] I'm pretty sure the "file exists but new file content linting is failing" [16:31:56] just creates teh temp, lints it, removes and does nothing [16:32:01] <_joe_> ok [16:32:10] even a noop doesn't report with exit 1 [16:32:14] <_joe_> if you're 100% sure then we need said wrapper [16:32:23] I'll make sure [16:32:36] hey could you update the master in etcd project? 
can't ssh into precise box atm [16:32:54] some changes went out and yuvi said our master is behind etc [16:33:11] no hurry tho [16:38:29] <_joe_> chasemp: you may need assistance from someone else, I'm a bit busy right now [16:38:46] is it cool w/ you to update the master there then? [16:38:54] I can do it but I didn't know if you had local things going on [16:40:18] <_joe_> chasemp: yeah and even if it has, screw them :P [16:40:27] k [16:41:42] ACKNOWLEDGEMENT - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 37 unmerged changes in puppet (dir /var/lib/git/operations/puppet). alexandros kosiaris ignore [16:47:58] (03PS1) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [16:48:36] (03CR) 10jenkins-bot: [V: 04-1] Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) (owner: 10Ottomata) [16:48:52] (03PS2) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [16:49:05] (03PS3) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [16:49:35] (03PS4) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [16:50:25] (03CR) 10jenkins-bot: [V: 04-1] Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) (owner: 10Ottomata) [16:53:42] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1405366 (10BBlack) @janzerebecki mentioned in 
https://gerrit.wikimedia.org/r/#/c/220377 : > For the explanation of the current ordering see https://wiki.m... [16:53:44] (03PS3) 10Dzahn: planet: remove broken feeds [puppet] - 10https://gerrit.wikimedia.org/r/221050 [16:54:01] (03CR) 10BBlack: "Either way, until we actually deploy ECDSA certs on our end, the relative order of ECDSA-vs-RSA just doesn't matter, as ECDSA is not an op" [puppet] - 10https://gerrit.wikimedia.org/r/220377 (owner: 10BBlack) [16:55:39] (03PS5) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [16:56:49] (03PS1) 10Ottomata: Need to escape % in cron command for hdfs balancer [puppet] - 10https://gerrit.wikimedia.org/r/221147 [16:57:34] (03CR) 10Dzahn: [C: 032] planet: remove broken feeds [puppet] - 10https://gerrit.wikimedia.org/r/221050 (owner: 10Dzahn) [16:58:07] (03PS2) 10Ottomata: Need to escape % in cron command for hdfs balancer [puppet] - 10https://gerrit.wikimedia.org/r/221147 [16:58:13] (03CR) 10Ottomata: [C: 032 V: 032] Need to escape % in cron command for hdfs balancer [puppet] - 10https://gerrit.wikimedia.org/r/221147 (owner: 10Ottomata) [16:58:38] ottomata: FWIW that command would be better in a separate script called from cron [16:59:58] yeahhhh probalby so [17:00:15] (03PS1) 10Glaisher: Redirect wikipedia.is to is.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221148 (https://phabricator.wikimedia.org/T103915) [17:01:29] (03PS1) 10Filippo Giunchedi: Revert "Use cronolog and logrotate to avoid Puppetmaster Apache reloads" [puppet] - 10https://gerrit.wikimedia.org/r/221149 [17:01:36] 6operations, 10Analytics, 10Traffic: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1405400 (10Krenair) So you need to get statistics on downloads from Gerrit, Gitblit, Github (not in our infrastructure...), and releases.wikimedia.org? 
[17:02:05] (03CR) 10Glaisher: "I don't know why the bugzilla one is getting modified. Just added the entry to redirects.dat and ran refreshDomainRedirects" [puppet] - 10https://gerrit.wikimedia.org/r/221148 (https://phabricator.wikimedia.org/T103915) (owner: 10Glaisher) [17:02:34] (03PS2) 10Filippo Giunchedi: Revert "Use cronolog and logrotate to avoid Puppetmaster Apache reloads" [puppet] - 10https://gerrit.wikimedia.org/r/221149 [17:02:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Use cronolog and logrotate to avoid Puppetmaster Apache reloads" [puppet] - 10https://gerrit.wikimedia.org/r/221149 (owner: 10Filippo Giunchedi) [17:03:07] Glaisher, I guess you're going to upload another commit to operations/dns for that task? [17:03:20] Krenair: It's already there. [17:03:33] apache is not configured [17:04:13] !log reverted cronolog puppetmaster patch, restarting apache [17:04:17] alex@alex-laptop:~$ host wikipedia.ls [17:04:17] Host wikipedia.ls not found: 3(NXDOMAIN) [17:04:19] Logged the message, Master [17:04:22] oh it was is [17:04:23] right [17:04:52] ori: ^ [17:05:51] !log zirconium - manual cleanup, removing planet [17:05:57] Logged the message, Master [17:11:28] 6operations: move planet from zirconium to a ganeti VM - https://phabricator.wikimedia.org/T101730#1405427 (10Dzahn) 5Open>3Resolved cleaned up on zirconium (remove docroot, delete systemuser, ...) 
done [17:11:42] akosiaris: https://phabricator.wikimedia.org/T101730#1403589 [17:13:14] (03PS6) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [17:15:48] (03PS7) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [17:16:05] PROBLEM - puppet last run on wtp2019 is CRITICAL Puppet has 1 failures [17:16:46] PROBLEM - puppet last run on dataset1001 is CRITICAL Puppet has 1 failures [17:16:57] (03PS8) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [17:17:36] PROBLEM - puppet last run on mw1150 is CRITICAL Puppet has 1 failures [17:17:47] (03PS1) 10Andrew Bogott: Wait for a minute for NFS exports before trying to mount requested volumes. [puppet] - 10https://gerrit.wikimedia.org/r/221150 (https://phabricator.wikimedia.org/T102544) [17:17:49] (03PS1) 10Andrew Bogott: Remove the wait-on-NFS code from labs instance firstboot. [puppet] - 10https://gerrit.wikimedia.org/r/221151 (https://phabricator.wikimedia.org/T102544) [17:17:57] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures [17:17:57] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures [17:18:35] PROBLEM - puppet last run on mw1118 is CRITICAL Puppet has 1 failures [17:18:45] PROBLEM - puppet last run on mw2082 is CRITICAL Puppet has 1 failures [17:20:02] (03PS9) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [17:20:42] mutante: yay! [17:25:08] Can someone in ops please review https://gerrit.wikimedia.org/r/#/c/139581/ ? 
[17:25:56] RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [17:26:07] RECOVERY - puppet last run on mw2082 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:26:56] RECOVERY - puppet last run on mw1150 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:27:12] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1405484 (10BBlack) So, to answer some of the above and recap where IRC conversations about this have gone lately: - Yes, we're limited by TLS protocol supp... [17:27:16] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:27:16] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:27:55] RECOVERY - puppet last run on dataset1001 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:28:56] RECOVERY - puppet last run on wtp2019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:36:06] PROBLEM - Host graphite1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:37:15] RECOVERY - dhclient process on graphite1002 is OK: PROCS OK: 0 processes with command name dhclient [17:37:26] RECOVERY - Host graphite1002 is UPING OK - Packet loss = 0%, RTA = 1.14 ms [17:37:26] RECOVERY - DPKG on graphite1002 is OK: All packages OK [17:37:46] RECOVERY - Disk space on graphite1002 is OK: DISK OK [17:38:16] RECOVERY - SSH on graphite1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [17:38:26] RECOVERY - salt-minion processes on graphite1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:38:27] RECOVERY - RAID on graphite1002 is OK optimal, 2 logical, 4 physical [17:38:35] RECOVERY - configured eth on graphite1002 is OK - interfaces up [17:38:56] RECOVERY 
- puppet last run on graphite1002 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:42:11] 6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1405554 (10Dzahn) added sanitized dump with emails and static HTML files to Bacula backup. client zirconium. verified in bconsole on helium the bac... [17:42:35] (03PS10) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [17:44:16] 6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: remove Bugzilla installation remnants from zirconium and repos - https://phabricator.wikimedia.org/T103193#1405561 (10Dzahn) 5Open>3Resolved [17:44:20] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 5Patch-For-Review, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1405562 (10Dzahn) [17:45:37] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1182562 (10Dzahn) [17:46:46] 6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1405581 (10Dzahn) i think my redirect rule doesn't work as it should yet, amending appreciated [17:48:14] 6operations, 10Wikimedia-Bugzilla: Show an error message when trying to view dynamic pages like buglist.cgi in static bugzilla - https://phabricator.wikimedia.org/T102579#1405585 (10Dzahn) p:5Lowest>3Low [17:58:03] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1405617 (10fgiunchedi) [17:58:06] 7Blocked-on-Operations, 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: move cassandra submodule into puppet repo - 
https://phabricator.wikimedia.org/T92560#1405615 (10fgiunchedi) 5Open>3Resolved submodule merged, resolving [18:05:22] (03CR) 10Andrew Bogott: "The alternative to that silly regexp is including a json parser in the fact. I don't know if json parsing is part of the standard puppet/" [puppet] - 10https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: 10Andrew Bogott) [18:05:50] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1405645 (10RobH) [18:06:22] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1405647 (10RobH) [18:07:41] 6operations, 10RESTBase, 6Services, 10Traffic, 7Service-Architecture: Proxying new services through RESTBase - https://phabricator.wikimedia.org/T96688#1405656 (10GWicke) [18:11:07] 6operations, 10Gather, 7Database, 7Schema-change: Update Gather DB schema for flagging backend - https://phabricator.wikimedia.org/T103611#1405681 (10Tgr) >>! In T103611#1404396, @jcrespo wrote: > * The wikis where this will be applied (I assume all where the Gather extension is, but it should be explicitl... [18:11:26] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1405686 (10RobH) [18:13:07] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [18:18:35] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60550 bytes in 8.766 second response time [18:23:14] (03PS3) 10Andrew Bogott: Have git-sync-upstream error out if there are local changes. [puppet] - 10https://gerrit.wikimedia.org/r/220157 [18:24:58] (03CR) 10Andrew Bogott: [C: 032] Have git-sync-upstream error out if there are local changes. [puppet] - 10https://gerrit.wikimedia.org/r/220157 (owner: 10Andrew Bogott) [18:26:55] (03PS4) 10Andrew Bogott: Turn on autoupdate_master by default. 
[puppet] - 10https://gerrit.wikimedia.org/r/220147 [18:40:50] 6operations, 6Labs: update star.wmflabs.org cert from sha1 to sha256 - https://phabricator.wikimedia.org/T104017#1405726 (10RobH) 3NEW [18:41:16] 6operations, 6Labs: update star.wmflabs.org cert from sha1 to sha256 - https://phabricator.wikimedia.org/T104017#1405734 (10RobH) [18:41:19] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1405733 (10RobH) [18:43:15] 6operations, 6Labs: update star.wmflabs.org cert from sha1 to sha256 - https://phabricator.wikimedia.org/T104017#1405726 (10RobH) there also seems to be two wmflabs certificate files in the repo: star.wmflabs.crt star.wmflabs.org.crt Now, rapidssl also happens to have two different SHA1 hashed certificates f... [18:48:14] (03PS11) 10Ottomata: Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) [18:49:48] (03PS1) 10RobH: changing *.wmflabs.org from sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/221167 (https://phabricator.wikimedia.org/T104017) [18:50:37] (03CR) 10RobH: [C: 04-2] "just to make it clear, submitting my initial patchset 1 will break things. (once further config changes are added and someone will babysi" [puppet] - 10https://gerrit.wikimedia.org/r/221167 (https://phabricator.wikimedia.org/T104017) (owner: 10RobH) [18:51:08] damn all the labs folks are away! [18:51:24] YuviPanda: hey yer not away! who in labs team is best to handle sha256 cert update? [18:51:29] https://phabricator.wikimedia.org/T104017 [18:52:13] 6operations, 6Labs, 5Patch-For-Review: update star.wmflabs.org cert from sha1 to sha256 - https://phabricator.wikimedia.org/T104017#1405767 (10RobH) once the cert is updated and in place, someone should kick this task back to me for rapidssl cert revocation of the older sha1 certs. 
[18:54:47] 6operations, 10Traffic, 10Wikimedia-DNS, 7Pybal: pybal DNS lookup issues causing outage risks - https://phabricator.wikimedia.org/T103921#1405771 (10BBlack) What I'm testing on lvs1004 today is basically: 1. Switching to the lvs100[25]-style nameservers list (that is: the 2x local recdns machines directly... [18:57:29] (03CR) 10Ottomata: [C: 032] Configure eventlogging kafka forwarder and processor [puppet] - 10https://gerrit.wikimedia.org/r/221145 (https://phabricator.wikimedia.org/T102831) (owner: 10Ottomata) [19:00:30] (03CR) 10Hashar: Wait for a minute for NFS exports before trying to mount requested volumes. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/221150 (https://phabricator.wikimedia.org/T102544) (owner: 10Andrew Bogott) [19:00:48] (03PS1) 10Dzahn: static-bugzilla: redirects from .cgi to Maniphest [puppet] - 10https://gerrit.wikimedia.org/r/221168 (https://phabricator.wikimedia.org/T102579) [19:01:36] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL - elasticsearch inactive shards 42 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 42, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 84, initializing_shards: 0, number_of_data_nodes: 2 [19:02:11] robh: that's probably me as well - but I'm not here atm [19:02:22] (03CR) 10Dzahn: "if mod_alias is loaded, untested" [puppet] - 10https://gerrit.wikimedia.org/r/221168 (https://phabricator.wikimedia.org/T102579) (owner: 10Dzahn) [19:02:24] no worries, i just didnt wanna leave it sitting [19:02:29] ok if i assign to you to tackle next week? 
[19:02:46] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL - elasticsearch inactive shards 28 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 20, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 98, initializing_shards: 8, number_of_data_nodes: 3 [19:02:57] 6operations, 6Labs, 5Patch-For-Review: update star.wmflabs.org cert from sha1 to sha256 - https://phabricator.wikimedia.org/T104017#1405807 (10RobH) p:5Triage>3High [19:03:06] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL - elasticsearch inactive shards 28 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 20, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 98, initializing_shards: 8, number_of_data_nodes: 3 [19:03:06] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL - elasticsearch inactive shards 28 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 20, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 98, initializing_shards: 8, number_of_data_nodes: 3 [19:03:37] (03PS2) 10Ori.livneh: protoproxy: Set X-Connection-Properties header on proxied requests [puppet] - 10https://gerrit.wikimedia.org/r/221000 [19:03:39] bblack: ^ [19:03:40] 6operations, 6Labs, 5Patch-For-Review: update star.wmflabs.org cert from sha1 to sha256 - https://phabricator.wikimedia.org/T104017#1405808 (10RobH) a:3yuvipanda Chatted with Yuvi in IRC. This seems like it would be something he would handle, so I'll assign it to him for now. If incorrect, he can just ki... 
[19:03:40] (if wrong, sorry ;) [19:03:45] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL - elasticsearch inactive shards 28 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 20, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 98, initializing_shards: 8, number_of_data_nodes: 3 [19:03:46] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL - elasticsearch inactive shards 28 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 20, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 98, initializing_shards: 8, number_of_data_nodes: 3 [19:03:48] robh: yeah that is ok [19:03:55] good cuz i totally just did ;] [19:05:16] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1405810 (10BBlack) FYI: I did some force-pushing and branch-deleting to fix up previously-malformed branches in the operations/software/nginx repo: - The `... [19:06:01] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1405812 (10fgiunchedi) update from T102015, we were able to reproduce the bootstrapping failures even with intel disks which however haven't timeout/failed so far so it indeed loo... [19:06:58] o/ bblack [19:07:11] Do you have a minute to look at the patch attached to https://phabricator.wikimedia.org/T104004 [19:07:12] ? [19:07:38] Not critical, but it's blocking some work I'd like to do today. [19:07:41] I can give that a lookover [19:07:52] Woot. 
Thanks ori :) [19:09:13] (03CR) 10Ori.livneh: [C: 031] Don't cache datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221139 (https://phabricator.wikimedia.org/T104004) (owner: 10Ottomata) [19:10:29] (03PS1) 10Dzahn: tendril: git clone from wmf repo via puppet [puppet] - 10https://gerrit.wikimedia.org/r/221172 (https://phabricator.wikimedia.org/T98816) [19:10:34] lgtm, i'll leave it up to b.black to merge / deploy [19:11:11] (03CR) 10jenkins-bot: [V: 04-1] tendril: git clone from wmf repo via puppet [puppet] - 10https://gerrit.wikimedia.org/r/221172 (https://phabricator.wikimedia.org/T98816) (owner: 10Dzahn) [19:11:56] (03PS2) 10Dzahn: tendril: git clone from wmf repo via puppet [puppet] - 10https://gerrit.wikimedia.org/r/221172 (https://phabricator.wikimedia.org/T98816) [19:12:07] looking at the logstash issue, seems to be slow/down [19:12:46] ok, thanks ori [19:13:10] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1405828 (10GWicke) @fgiunchedi, I think with 4/6 disks explicitly broken it's pretty clear that the entire batch of Samsung disks is DOA. I'd RMA them all, and perhaps order a new... [19:13:29] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1405829 (10RobH) [19:16:19] (03CR) 10Dzahn: "there is a missing } at the beginning of line 46, but otherwise this is good, we do the exact same thing for other backends" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/221139 (https://phabricator.wikimedia.org/T104004) (owner: 10Ottomata) [19:17:17] AndyRussG: what is the volume of events here again? 
[19:17:21] oops wrong chat [19:17:55] (03PS2) 10Dzahn: Don't cache datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221139 (https://phabricator.wikimedia.org/T104004) (owner: 10Ottomata) [19:19:01] (03PS3) 10Dzahn: Don't cache datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221139 (https://phabricator.wikimedia.org/T104004) (owner: 10Ottomata) [19:20:38] (03CR) 10Dzahn: [C: 032] Don't cache datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221139 (https://phabricator.wikimedia.org/T104004) (owner: 10Ottomata) [19:21:24] (03CR) 10BBlack: [C: 031] protoproxy: Set X-Connection-Properties header on proxied requests [puppet] - 10https://gerrit.wikimedia.org/r/221000 (owner: 10Ori.livneh) [19:21:25] PROBLEM - Host restbase1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:21:50] thanks mutante! [19:22:06] PROBLEM - Cassandra database on restbase1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [19:22:31] (03PS3) 10Ori.livneh: protoproxy: Set X-Connection-Properties header on proxied requests [puppet] - 10https://gerrit.wikimedia.org/r/221000 [19:23:13] ottomata: arr. 
there might be an issue though [19:23:36] (03CR) 10Ori.livneh: [C: 032] protoproxy: Set X-Connection-Properties header on proxied requests [puppet] - 10https://gerrit.wikimedia.org/r/221000 (owner: 10Ori.livneh) [19:23:37] looking [19:24:15] ottomata: a missing ) [19:24:18] fixing [19:26:37] (03PS1) 10Dzahn: varnish-misc: fix syntax error for datasets config [puppet] - 10https://gerrit.wikimedia.org/r/221177 [19:26:56] PROBLEM - puppet last run on cp1044 is CRITICAL Puppet has 1 failures [19:27:07] 6operations, 10Fundraising Dash: Create sandbox site for Dash - https://phabricator.wikimedia.org/T87809#1405858 (10awight) [19:27:26] PROBLEM - puppet last run on cp1043 is CRITICAL Puppet has 1 failures [19:27:35] (03PS2) 10Dzahn: varnish-misc: fix syntax error for datasets config [puppet] - 10https://gerrit.wikimedia.org/r/221177 (https://phabricator.wikimedia.org/T104004) [19:27:50] re: icinga, that's what i'm fixing [19:28:13] (03CR) 10Dzahn: [C: 032 V: 032] varnish-misc: fix syntax error for datasets config [puppet] - 10https://gerrit.wikimedia.org/r/221177 (https://phabricator.wikimedia.org/T104004) (owner: 10Dzahn) [19:29:25] oops. thanks. [19:29:27] bd808: I'm looking at some failed shards on logstash, some shard corruption messages in the logs, did we see this before? [19:30:36] RECOVERY - puppet last run on cp1044 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:30:41] halfak: ottomata: ^ now it should not be cached anymore [19:30:57] RECOVERY - puppet last run on cp1043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:31:11] (03PS2) 10Andrew Bogott: Wait for a minute for NFS exports before trying to mount requested volumes. [puppet] - 10https://gerrit.wikimedia.org/r/221150 (https://phabricator.wikimedia.org/T102544) [19:31:13] (03PS2) 10Andrew Bogott: Remove the wait-on-NFS code from labs instance firstboot. 
[puppet] - 10https://gerrit.wikimedia.org/r/221151 (https://phabricator.wikimedia.org/T102544) [19:31:22] thanks mutante and ori :) [19:31:59] (03CR) 10Dzahn: [C: 04-1] "i think we need to sync the github repo to the wmf repo one last time before merging this" [puppet] - 10https://gerrit.wikimedia.org/r/221172 (https://phabricator.wikimedia.org/T98816) (owner: 10Dzahn) [19:32:23] mutante, just to make sure -- I shouldn't have to wait any longer for the change to propagate, right? [19:32:34] I'm running a test now and it doesn't seem to have solved the problem. [19:32:36] godog: I haven't seem much of that, no [19:33:00] halfak: you should not, i saw it apply the changes on the 2 servers [19:33:10] cool. Thanks. [19:33:21] ottomata, we might need to brainstorm some more [19:33:44] godog: woah. I see what you mean. Did we have node restarts? [19:34:51] bd808: it looks like logstash1004 freaked out at the top of the hour and it went downhill from there [19:35:07] bd808: also seems like it temporarily elected 1005 as master and switched back to 1004 I think [19:36:10] the good news is that all initializing/unassigned shards seem replicas and not primary [19:37:14] godog: I think I see what happened on the heap_used graph for logstash1004 [19:37:18] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Logstash%20cluster%20eqiad&h=logstash1004.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1435347356&v=19830090608&m=es_heap_used&vl=bytes&ti=es_heap_used&z=large [19:37:48] it hit a gc thrash and probably fell behind on cluster heartbeat messages [19:38:01] we haven't seen that since we got the new bigger hardware [19:38:15] hah indeed, heap maxed out [19:39:05] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.fjRJEAB3/mnt is not accessible: Permission denied [19:39:37] we only have one "live" index at a time so recovery can be a bit slow but is generally not a problem [19:39:51] new docs only go into today's index [19:39:52] (03PS1) 
10Dzahn: tendril: sync changes from github repo [software/tendril] - 10https://gerrit.wikimedia.org/r/221184 (https://phabricator.wikimedia.org/T98816) [19:40:02] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1405886 (10Halfak) The patch should have been merged by now, but the problem persists. [19:40:56] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [19:41:36] bd808: yup so all the ones stuck in initializing are on 1004 and 1005, I'll try moving them out of the way and kick a reroute [19:42:06] godog: 1004 is building 3 replicas right now [19:42:11] looks like it is doing fine [19:42:33] it will just take several hours for it to heal [19:43:08] I have a script on logstash1001 ~bd808/recover.sh that makes watching it pretty easy [19:43:11] (03CR) 10Dzahn: "only after https://gerrit.wikimedia.org/r/221184" [puppet] - 10https://gerrit.wikimedia.org/r/221172 (https://phabricator.wikimedia.org/T98816) (owner: 10Dzahn) [19:43:18] *recovery.sh [19:43:49] (03PS3) 10Dzahn: tendril: git clone from wmf repo via puppet [puppet] - 10https://gerrit.wikimedia.org/r/221172 (https://phabricator.wikimedia.org/T98816) [19:43:58] the only thing that will cause a mess is if 1006 goes down [19:44:10] bd808: ooh fancy, we should integrate that in es-tool [19:46:01] bd808: will it recover the corrupted shards by itself though? namely logstash-2015.06.03 and logstash-2015.05.30 seem to show up a lot in logs [19:46:52] if the locals are corrupt then, no. The only fix for that before has been to down the node, wipe the corrupt index and bring the node back up [19:47:20] in theory they can be fixed manually but that hasn't worked for me or Nik before [19:47:46] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [19:48:49] what do you mean manually? [19:48:54] (03CR) 10Dzahn: [C: 031] "It's actually correct that the Bugzilla link is changed as well. 
That's a fix. It's because refreshDomainRedirects recently changed and wh" [puppet] - 10https://gerrit.wikimedia.org/r/221148 (https://phabricator.wikimedia.org/T103915) (owner: 10Glaisher) [19:48:56] RECOVERY - Host mw2027 is UPING WARNING - Packet loss = 66%, RTA = 79.52 ms [19:49:11] there are lucene recovery tools [19:49:38] but they are not fun to work with if you can just slurp a good copy over instead [19:49:47] ah, likely not worth it [19:50:35] (03PS2) 10Dzahn: Redirect wikipedia.is to is.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221148 (https://phabricator.wikimedia.org/T103915) (owner: 10Glaisher) [19:52:39] 1004 is not happy [19:55:39] bd808: the shower of index corruption messages I guess [19:55:56] yeah. [19:56:32] so I think maybe the thing to do here is stop es on 1004, clean up 05.30 and 06.03 and then start it back up [19:56:49] 6operations, 10Gather, 7Database, 7Schema-change: Update Gather DB schema for flagging backend - https://phabricator.wikimedia.org/T103611#1405911 (10Tgr) I guess the easiest would be to split the update in two parts: run `UPDATE gather_list SET gl_perm_override = 1 WHERE gl_perm = 2` as part of the schema... [19:56:58] I don't think that will make anything worse than it already is [19:57:05] godog: thoughts? [19:57:36] bd808: I agree, we'll need to wipe those anyway even if it finishes with the rest of the recovery [19:57:45] *nod* [19:57:54] ok, I'll bring it down [19:58:16] ack [19:58:34] !log stopping elasticsearch on logstash1004 to cleanup corrupt shards [19:58:36] (03CR) 10Dzahn: [C: 032] Redirect wikipedia.is to is.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221148 (https://phabricator.wikimedia.org/T103915) (owner: 10Glaisher) [19:58:40] Logged the message, Master [20:01:25] godog: yuck. 
came back up and just started listing other shards as corrupt [20:01:34] I brought it back down [20:02:10] bd808: yeah, saw that :( [20:03:11] bd808: looks like logstash-2015.05.30 logstash-2015.05.31 logstash-2015.06.03 logstash-2015.06.06 [20:04:47] * bd808 tries again [20:05:51] (03PS1) 10Andrew Bogott: Avoid nesting some dirs into themselves due to cp -r behavior [puppet] - 10https://gerrit.wikimedia.org/r/221268 (https://phabricator.wikimedia.org/T104019) [20:06:29] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1405940 (10BBlack) pinkunicorn is now running the proposed setup: - openssl 1.0.2c - nginx rebuilt against openssl 1.0.2c, with multi-cert patches, from b... [20:08:55] bd808: looks happier [20:09:15] (03CR) 10Gage: [C: 031] Avoid nesting some dirs into themselves due to cp -r behavior [puppet] - 10https://gerrit.wikimedia.org/r/221268 (https://phabricator.wikimedia.org/T104019) (owner: 10Andrew Bogott) [20:09:24] it at least stopped yelling [20:09:52] ACKNOWLEDGEMENT - ElasticSearch health check for shards on logstash1001 is CRITICAL - elasticsearch inactive shards 52 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 36, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 74, initializing_shards: 16, number_of_data_nodes: 3 Filippo Giunchedi recovering [20:09:52] ACKNOWLEDGEMENT - ElasticSearch health check for shards on logstash1002 is CRITICAL - elasticsearch inactive shards 52 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 36, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 74, initializing_shards: 16, number_of_data_nodes: 3 Filippo Giunchedi recovering [20:09:52] ACKNOWLEDGEMENT - ElasticSearch health check for shards on logstash1003 is 
CRITICAL - elasticsearch inactive shards 52 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 36, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 74, initializing_shards: 16, number_of_data_nodes: 3 Filippo Giunchedi recovering [20:09:52] ACKNOWLEDGEMENT - ElasticSearch health check for shards on logstash1004 is CRITICAL - elasticsearch inactive shards 52 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 36, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 74, initializing_shards: 16, number_of_data_nodes: 3 Filippo Giunchedi recovering [20:09:52] ACKNOWLEDGEMENT - ElasticSearch health check for shards on logstash1005 is CRITICAL - elasticsearch inactive shards 52 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 36, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 74, initializing_shards: 16, number_of_data_nodes: 3 Filippo Giunchedi recovering [20:09:52] ACKNOWLEDGEMENT - ElasticSearch health check for shards on logstash1006 is CRITICAL - elasticsearch inactive shards 52 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 36, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 74, initializing_shards: 16, number_of_data_nodes: 3 Filippo Giunchedi recovering [20:10:07] sigh, I should have unchecked 'send notification' [20:10:17] !log Deleted 4 corrupt indices (logstash-2015.05.30 logstash-2015.05.31 logstash-2015.06.03 logstash-2015.06.06) on logstash1004 [20:10:23] Logged the message, Master [20:14:01] bd808: cool, looks like we'll be waiting for fully recovery [20:14:19] yeah... 
it will take several hours :/ [20:14:42] at least I tuned up all the recovery settings during the last planned restart [20:15:07] yeah 100MB/s [20:17:05] 6operations, 6Discovery, 7Elasticsearch: show recovery status/stats in es-tool - https://phabricator.wikimedia.org/T104022#1405953 (10fgiunchedi) 3NEW [20:17:33] I'll go for lunch, bbl [20:18:28] (03PS2) 10Andrew Bogott: Avoid nesting some dirs into themselves due to cp -r behavior [puppet] - 10https://gerrit.wikimedia.org/r/221268 (https://phabricator.wikimedia.org/T104019) [20:19:49] (03CR) 10Andrew Bogott: [C: 032] Avoid nesting some dirs into themselves due to cp -r behavior [puppet] - 10https://gerrit.wikimedia.org/r/221268 (https://phabricator.wikimedia.org/T104019) (owner: 10Andrew Bogott) [20:20:02] 6operations, 10ops-eqiad: analytics1016 down due to power issue(?) - https://phabricator.wikimedia.org/T103544#1405962 (10Cmjohnson) Dell sent a new board but it's bad. The new board spits out a QPI 1 error. [20:27:40] (03PS1) 10Ottomata: Refactor eventlogging monitoring classes [puppet] - 10https://gerrit.wikimedia.org/r/221277 [20:27:45] (03CR) 10jenkins-bot: [V: 04-1] Refactor eventlogging monitoring classes [puppet] - 10https://gerrit.wikimedia.org/r/221277 (owner: 10Ottomata) [20:29:13] cmjohnson: ahha, yay! [20:29:17] (03PS2) 10Ottomata: Refactor eventlogging monitoring classes [puppet] - 10https://gerrit.wikimedia.org/r/221277 [20:29:19] (03PS5) 10Andrew Bogott: Turn on autoupdate_master by default. [puppet] - 10https://gerrit.wikimedia.org/r/220147 [20:30:01] (03CR) 10jenkins-bot: [V: 04-1] Refactor eventlogging monitoring classes [puppet] - 10https://gerrit.wikimedia.org/r/221277 (owner: 10Ottomata) [20:30:28] ottomata: those 1st gens never seem to do well and of course Dell sends a refurbished board and it's borked [20:30:43] (03CR) 10Andrew Bogott: [C: 032] Turn on autoupdate_master by default. 
[puppet] - 10https://gerrit.wikimedia.org/r/220147 (owner: 10Andrew Bogott) [20:36:30] 6operations, 10Gather, 7Database, 7Schema-change: Update Gather DB schema for flagging backend - https://phabricator.wikimedia.org/T103611#1406031 (10Tgr) Even easier, just use `UPDATE gather_list SET gl_perm = 1, gl_perm_override = 1 WHERE gl_perm = 2` the first time. That pretends hidden lists are privat... [20:37:09] (03PS3) 10Ottomata: Refactor eventlogging monitoring classes [puppet] - 10https://gerrit.wikimedia.org/r/221277 [20:37:23] (03PS1) 10Dzahn: Revert "Redirect wikipedia.is to is.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/221282 [20:38:25] PROBLEM - puppet last run on ms-be1011 is CRITICAL Puppet has 1 failures [20:38:37] (03CR) 10Dzahn: "for unknown reasons:" [puppet] - 10https://gerrit.wikimedia.org/r/221282 (owner: 10Dzahn) [20:38:51] (03PS2) 10Dzahn: Revert "Redirect wikipedia.is to is.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/221282 [20:39:44] (03CR) 10Dzahn: [C: 032] Revert "Redirect wikipedia.is to is.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/221282 (owner: 10Dzahn) [20:42:07] 6operations, 10Gather, 7Database, 7Schema-change: Update Gather DB schema for flagging backend - https://phabricator.wikimedia.org/T103611#1406037 (10Tgr) [20:47:36] (03PS1) 10Dzahn: apache redirects: update .conf for static-bz redir [puppet] - 10https://gerrit.wikimedia.org/r/221287 [20:54:35] RECOVERY - puppet last run on ms-be1011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:09:59] !log taking xenon down to be rebootstrapped [21:10:05] Logged the message, Master [21:14:16] PROBLEM - Cassanda CQL query interface on xenon is CRITICAL: Connection refused [21:15:15] (03CR) 10BBlack: [C: 031] apache redirects: update .conf for static-bz redir [puppet] - 10https://gerrit.wikimedia.org/r/221287 (owner: 10Dzahn) [21:17:40] (03PS2) 10Dzahn: apache redirects: update .conf for static-bz redir [puppet] - 
10https://gerrit.wikimedia.org/r/221287 [21:18:32] (03CR) 10Dzahn: [C: 032] apache redirects: update .conf for static-bz redir [puppet] - 10https://gerrit.wikimedia.org/r/221287 (owner: 10Dzahn) [21:22:41] (03PS1) 10BBlack: redirects: use separate ServerAlias directives for each alias [puppet] - 10https://gerrit.wikimedia.org/r/221291 [21:34:03] (03CR) 10Rush: [C: 031] "We spent a good amount of time narrowing down this weird behavior. This patch makes sense to me as a pragmatic approach." [puppet] - 10https://gerrit.wikimedia.org/r/221291 (owner: 10BBlack) [21:35:16] PROBLEM - Check status of defined EventLogging jobs on analytics1010 is CRITICAL Stopped EventLogging jobs: reporter/statsd [21:38:26] ACKNOWLEDGEMENT - Check status of defined EventLogging jobs on analytics1010 is CRITICAL Stopped EventLogging jobs: reporter/statsd ottomata Not a real problem! [21:43:38] (03CR) 10Dzahn: [C: 031] "tested on mw1033. reloaded fine. tested some random server aliases." [puppet] - 10https://gerrit.wikimedia.org/r/221291 (owner: 10BBlack) [21:43:53] bd808: heh more corruption looks like? logstash-2015.06.02 logstash-2015.06.03 [21:44:14] grumble [21:46:21] godog: recovery status looks like it is moving past that blip. 
06.03 is continuing to recover [21:46:39] I really don't understand how it gets this messed up [21:46:51] those files should all be read-only in practice [21:47:19] I think Nik told me that there is some new index state for 1.6.0 that should help with this [21:47:39] yeah I was reading up on github issues about the same thing and they mentioned some fixes [22:07:36] (03CR) 10Dzahn: "[terbium:~] $ apache-fast-test staticbz zirconium.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/221168 (https://phabricator.wikimedia.org/T102579) (owner: 10Dzahn) [22:07:41] (03PS2) 10Dzahn: static-bugzilla: redirects from .cgi to Maniphest [puppet] - 10https://gerrit.wikimedia.org/r/221168 (https://phabricator.wikimedia.org/T102579) [22:09:35] !log restarted logstash on logstash1001 [22:09:41] Logged the message, Master [22:11:20] (03CR) 10Dzahn: [C: 032] "no worries, this redirect isn't on the cluster" [puppet] - 10https://gerrit.wikimedia.org/r/221168 (https://phabricator.wikimedia.org/T102579) (owner: 10Dzahn) [22:18:06] godog: blerg. All 3 frontend logstash nodes are getting timeouts from the backend nodes when trying to add new log events [22:18:28] my recovery tuneup may be too high? [22:18:46] "observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]" [22:21:53] bd808: mhh yeah let me tune recovery to 50MB/s and see if it helps [22:22:08] sounds good [22:24:54] possibly this fixed bug too -- https://www.elastic.co/blog/elasticsearch-1-6-0-released#async-shard-info [22:25:00] 6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: Show an error message when trying to view dynamic pages like buglist.cgi in static bugzilla - https://phabricator.wikimedia.org/T102579#1406253 (10Dzahn) buglist.cgi and index.cgi redirect into phabricator now instead of 404: https://static-bugzilla.org/... 
[22:25:18] !log Reset email address of User:Chwms identity verified in person at editathon [22:25:20] !log set indices.recovery.max_bytes_per_sec to 50mb on logstash ES cluster [22:25:25] Logged the message, Master [22:25:31] Logged the message, Master [22:26:33] 6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: Show an error message when trying to view dynamic pages like buglist.cgi in static bugzilla - https://phabricator.wikimedia.org/T102579#1406257 (10Dzahn) see above, buglist.cgi?title=Special%3ASearch .. will actually get you to Search in Phab https://pha... [22:27:29] 6operations, 10Wikimedia-Bugzilla, 5Patch-For-Review: Show an error message when trying to view dynamic pages like buglist.cgi in static bugzilla - https://phabricator.wikimedia.org/T102579#1406259 (10Dzahn) 5Open>3Resolved [22:29:49] (03CR) 10John F. Lewis: [C: 031] "Looks valid" [software/tendril] - 10https://gerrit.wikimedia.org/r/221184 (https://phabricator.wikimedia.org/T98816) (owner: 10Dzahn) [22:30:08] bd808: still seeing those but at level DEBUG and each minute [22:30:37] (03CR) 10John F. Lewis: [C: 031] tendril: git clone from wmf repo via puppet [puppet] - 10https://gerrit.wikimedia.org/r/221172 (https://phabricator.wikimedia.org/T98816) (owner: 10Dzahn) [22:31:07] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1406266 (10Dzahn) a:3akosiaris @akosiaris next VM for next zirconium service please :) do we really have to follow the "1001" kind of naming for stuff like this? static-bugzilla1001 is a bit odd but i... 
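Editorial sketch, not part of the channel log: the recovery throttle logged at 22:25:20 corresponds to a cluster settings update in Elasticsearch. The snippet below only builds such a request and does not send it. The base URL and the choice of a transient (rather than persistent) setting are assumptions; the log does not say how the change was actually applied.

```python
import json
import urllib.request

# Assumption: Elasticsearch HTTP API reachable locally on the default port.
ES_URL = "http://localhost:9200"


def recovery_throttle_request(max_bytes_per_sec, base_url=ES_URL):
    """Build (but do not send) a PUT /_cluster/settings request that
    sets indices.recovery.max_bytes_per_sec as a transient setting."""
    body = {
        "transient": {
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
        }
    }
    return urllib.request.Request(
        url=base_url + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Value taken from the !log entry above.
req = recovery_throttle_request("50mb")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would apply it; that step is left out here so the sketch has no side effects.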
[22:32:06] godog: sadly we haven't added any significant data to the index since 19:21:30 [22:33:42] so we can either (a) ignore that and let recovery continue, (b) stop 04 and 05 and just limp on with 06 and then recover later, (c) figure out how to lower recovery to the point that indexing works again [22:33:57] I'm thinking (c) is the sane option [22:33:58] (03PS1) 10BBlack: Use overridden direct DNS for all LVS [puppet] - 10https://gerrit.wikimedia.org/r/221303 (https://phabricator.wikimedia.org/T103921) [22:34:00] (03PS1) 10BBlack: resolv.conf: lower timeout from 3s to 1s, ++attempts [puppet] - 10https://gerrit.wikimedia.org/r/221304 (https://phabricator.wikimedia.org/T103921) [22:35:25] greg-g: Can we deploy https://gerrit.wikimedia.org/r/#/c/221011/2 today (on a Friday) to unbreak officewiki? [22:36:10] it's Friday? sweet! [22:36:14] * greg-g kids... and looks [22:36:14] bd808: yup, I've lowered further to 10mb [22:36:28] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review, 7Pybal: pybal DNS lookup issues causing outage risks - https://phabricator.wikimedia.org/T103921#1406277 (10BBlack) Testing on lvs1004 (with comparison against logs on lvs1001) is still looking great. The above two patches would spread equivale... [22:36:30] RoanKattouw: sure [22:36:35] !log set indices.recovery.max_bytes_per_sec to 10mb on logstash ES cluster [22:36:40] !log set indices.recovery.concurrent_streams to 4 on logstash ES cluster [22:36:40] Logged the message, Master [22:36:46] Logged the message, Master [22:37:11] bd808: meanwhile trying to understand what logstash is exactly unhappy about [22:37:36] greg-g: Thanks. 
Commit was just +2ed so it's making its way through Jenkins things, but I'll deploy it once it makes it through there with another heads-up to you before I pull the trigger [22:37:44] godog: that error message is not well documented :/ [22:37:57] lots of search hits but little information [22:42:40] greg-g: That went faster than I thought, pulling the trigger now-ish [22:43:17] gj jenkins [22:43:56] bd808: heh also zero logging on the logstash side it seems [22:43:57] !log catrope Synchronized php-1.26wmf11/extensions/Flow: Temporarily make subpages in Flow-occupied namespaces non-Flow again (duration: 00m 14s) [22:44:03] Logged the message, Master [22:44:18] logging inception [22:44:38] godog: yeah. logstash hardly says anything ever to the logs. [22:45:12] bd808: I guess out of fear that people might be piping logstash logs back into logstash [22:45:42] it's posting from the logstash process to the elasticsearch process on the same machine and things are set up so it doesn't wait for a response [22:45:57] it will log if the post fails but that's not what's happening [22:46:04] greg-g: yes, a wonderful log spiral! [22:46:13] es takes the post but then is failing to actually store the data [22:47:07] yo dawg? [22:47:26] anywho, I apparently need to refocus after this 3:30 on Friday mental slump [22:48:39] godog: here's my next thought, take 04 and 05 down and see if logging starts working again. If it does, bring just one of them back up and see what happens [22:48:51] if it doesn't then... cry? [22:49:27] 04 looks messed up on ganglia too. [22:49:36] all the graphs went to 0 [22:49:51] bd808: how often should logstash flush to es btw? I was looking with tcpdump but don't see logs going to port 9200 [22:50:51] godog: at least once a second [22:52:03] so...
yeah what's up with that [22:52:38] 6operations, 10Traffic, 10Wikimedia-DNS, 5Patch-For-Review, 7Pybal: pybal DNS lookup issues causing outage risks - https://phabricator.wikimedia.org/T103921#1406314 (10BBlack) Also, just for reference re testing lvs100[14] differential (lvs1004 has proposed changes, lvs1001 has current config): I've been... [22:53:10] bd808: heh no I might be misreading, there is flushing going on every now and then [22:54:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [22:56:32] yeah, logstash is posting to es there [22:56:41] but then it's a blackhole [22:57:47] !log restarted gitblit [22:57:53] Logged the message, Master [22:58:07] 1001 and 1005 are talking to each other on 9300 a lot which would be the logs being sent to 1005 (today's index master) [22:59:52] (03CR) 10Dzahn: "the bugzilla thing is fixed in https://gerrit.wikimedia.org/r/#/c/221287/" [puppet] - 10https://gerrit.wikimedia.org/r/221148 (https://phabricator.wikimedia.org/T103915) (owner: 10Glaisher) [23:01:06] bd808: I was looking at the es settings and it requires at least two master eligible nodes, if we shut 4 and 5 then the frontends might freak out [23:05:11] git.wm.o dead again! 
[23:05:39] !log restarted gitblit on antimony, AGAIN [23:05:45] Logged the message, Master [23:06:23] bblack: every time it happens it reminds me of https://gerrit.wikimedia.org/r/#/c/188480/ [23:07:05] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60544 bytes in 0.373 second response time [23:07:24] heh [23:10:45] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1406346 (10Dzahn) statbugs1001 or just staticbugs or something [23:11:20] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1406347 (10Dzahn) p:5Triage>3Normal [23:11:20] bd808: since nothing is making it to ES we might as well bounce the cluster I think [23:12:05] (03PS1) 10BBlack: Release 1.9.2-1+wmf2 (multicert, libssl1.0.2) [software/nginx] (wmf-1.9.2-1) - 10https://gerrit.wikimedia.org/r/221307 [23:12:46] (03CR) 10BBlack: [C: 032 V: 032] Release 1.9.2-1+wmf2 (multicert, libssl1.0.2) [software/nginx] (wmf-1.9.2-1) - 10https://gerrit.wikimedia.org/r/221307 (owner: 10BBlack) [23:12:51] I think the problem may be that today's index had no replicas. if we bounce I want to be sure that logstash-2015.06.26 is recovered first [23:13:47] if we disable allocation, bounce 04 and 05 and then explicitly assign logstash-2015.06.26 before enabling allocation in general that may do it [23:15:18] bd808: yep we can try that, right so ES will refuse updates to logstash-2015.06.26 even though the primary is online and two other replicas initializing [23:15:50] I just got them started initializing too. they were both unassigned [23:16:06] I think it's worth a try [23:16:23] as long as we don't touch 06 we are no worse off than we are already [23:16:40] I'll do it [23:16:52] bd808: ok!
[23:21:34] (03CR) 10Dzahn: "needs manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425) (owner: 10Dzahn) [23:23:15] (03CR) 10Dzahn: "so now it's just gitblit since svn is gone. any reason NOT to give gitblit IPv6?" [puppet] - 10https://gerrit.wikimedia.org/r/214432 (owner: 10Dzahn) [23:23:35] (03PS2) 10Dzahn: add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 [23:23:41] (03CR) 10jenkins-bot: [V: 04-1] add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 (owner: 10Dzahn) [23:25:10] (03PS3) 10Dzahn: add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 [23:26:14] (03PS4) 10Dzahn: add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 [23:28:24] (03PS3) 10Dzahn: add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 [23:29:28] (03PS4) 10Dzahn: add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 [23:30:02] (03CR) 10Dzahn: "same here. are there reasons NOT to enable IPv6 on gerrit?" [puppet] - 10https://gerrit.wikimedia.org/r/214437 (owner: 10Dzahn) [23:30:26] wth, all of beta cluster is complaining about puppet failures http://shinken.wmflabs.org/problems?search=hg:deployment-prep [23:30:33] just happened [23:31:23] (03PS5) 10Dzahn: add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 (https://phabricator.wikimedia.org/T37540) [23:31:50] (03CR) 10Paladox: "Nope." 
[puppet] - 10https://gerrit.wikimedia.org/r/214437 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [23:31:52] (03PS2) 10Dzahn: add AAAA record for antimony [dns] - 10https://gerrit.wikimedia.org/r/214504 (https://phabricator.wikimedia.org/T37540) [23:31:57] (03PS3) 10Dzahn: add AAAA record for antimony [dns] - 10https://gerrit.wikimedia.org/r/214504 (https://phabricator.wikimedia.org/T37540) [23:31:59] (03CR) 10Paladox: [C: 031] add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [23:32:22] (03PS2) 10Dzahn: add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T37540) [23:32:24] (03CR) 10jenkins-bot: [V: 04-1] add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [23:32:30] (03PS2) 10Dzahn: add AAAA record for ytterbium (gerrit) [dns] - 10https://gerrit.wikimedia.org/r/214507 (https://phabricator.wikimedia.org/T37540) [23:32:42] (03PS3) 10Dzahn: add IPv6 for argon (irc,mw-rc streams) [puppet] - 10https://gerrit.wikimedia.org/r/214434 (https://phabricator.wikimedia.org/T37540) [23:32:55] (03PS5) 10Dzahn: add IPv6 for antimony (git web) [puppet] - 10https://gerrit.wikimedia.org/r/214432 (https://phabricator.wikimedia.org/T37540) [23:33:34] (03CR) 10Paladox: "The" [puppet] - 10https://gerrit.wikimedia.org/r/214432 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [23:33:47] (03CR) 10Paladox: [C: 031] "Nope" [puppet] - 10https://gerrit.wikimedia.org/r/214432 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [23:36:43] (03PS3) 10Dzahn: add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T37540) [23:38:52] bd808: mhh today's index is about to get recovered on 1005 btw, 50% done [23:39:08] *nod* [23:39:20] I 
cranked a bunch of the limits back up [23:39:44] it's not going anywhere on 04 though [23:40:29] why is that host so sick? [23:40:57] (03PS1) 10Ori.livneh: Accumulate X-Connection-Properties stats and report to StatsD [puppet] - 10https://gerrit.wikimedia.org/r/221314 [23:42:14] now it's moving, raised cluster.routing.allocation.cluster_concurrent_rebalance [23:44:39] bd808: yep, also looks like the logs are back [23:45:18] yea! [23:46:04] imma going to let 04 finish replicating today and then turn normal recovery back on [23:48:12] (03CR) 10BBlack: [C: 031] Accumulate X-Connection-Properties stats and report to StatsD [puppet] - 10https://gerrit.wikimedia.org/r/221314 (owner: 10Ori.livneh) [23:48:19] imma let you finish, but elasticsearch had some of the best replication of ALL TIME [23:54:17] (03PS2) 10Ori.livneh: Accumulate X-Connection-Properties stats and report to StatsD [puppet] - 10https://gerrit.wikimedia.org/r/221314 [23:54:32] !log re-enabled allocation on logstash elasticsearch cluster [23:54:38] Logged the message, Master [23:55:36] (03CR) 10Ori.livneh: [C: 032] Accumulate X-Connection-Properties stats and report to StatsD [puppet] - 10https://gerrit.wikimedia.org/r/221314 (owner: 10Ori.livneh) [23:57:44] !log Logstash log ingestion working again after forcing recovery of replicas for logstash-2015.06.26; new logs were being rejected with only a primary shard available [23:57:50] Logged the message, Master
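Editorial sketch, not part of the channel log: the plan discussed at 23:13:47 (disable allocation, bounce the nodes, explicitly assign logstash-2015.06.26, re-enable allocation) can be written down as a sequence of Elasticsearch API request bodies. The exact commands run that evening are not recorded here, so everything below is an assumption: the shard number, the target node name, the use of transient settings, and the 1.x-era `allocate` reroute command are illustrative only. Whether an explicit reroute is honoured while allocation is disabled should be checked against the documentation for the version in use.

```python
import json

# Values accepted by cluster.routing.allocation.enable.
VALID_MODES = ("all", "primaries", "new_primaries", "none")


def allocation_setting(mode):
    """Body for PUT /_cluster/settings toggling shard allocation."""
    if mode not in VALID_MODES:
        raise ValueError("unknown allocation mode: %r" % (mode,))
    return {"transient": {"cluster.routing.allocation.enable": mode}}


def allocate_replica_command(index, shard, node):
    """Body for POST /_cluster/reroute using the 1.x-style 'allocate'
    command; allow_primary=False so only a replica copy is requested."""
    return {
        "commands": [
            {
                "allocate": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "allow_primary": False,
                }
            }
        ]
    }


def recovery_plan(index, shard, node):
    """Ordered (method, path, body) steps. The node restarts happen
    outside the API, between the first and second step."""
    return [
        ("PUT", "/_cluster/settings", allocation_setting("none")),
        ("POST", "/_cluster/reroute", allocate_replica_command(index, shard, node)),
        ("PUT", "/_cluster/settings", allocation_setting("all")),
    ]


# Hypothetical shard number and node name, for illustration only.
plan = recovery_plan("logstash-2015.06.26", 0, "logstash1005")
for method, path, body in plan:
    print(method, path, json.dumps(body))
```

Keeping the steps as data rather than sending them makes the order easy to review before anything touches the cluster.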