[00:04:06] <Dereckson>	 code sync'ed everywhere
[00:04:47] <Danny_B>	 it doesn't work properly
[00:05:22] <Danny_B>	 https://cs.wiktionary.org/wiki/Speci%C3%A1ln%C3%AD:Co_odkazuje_na?target=Modul%3AQuote%2Ftools&namespace=10&title=Speci%C3%A1ln%C3%AD%3ACo_odkazuje_na
[00:05:52] <Danny_B>	 "skrýt" (hide) isn't link which toggles to "zobrazit" (show)
[00:07:16] <logmsgbot>	 !log dereckson@tin Finished scap: Revert "Convert Special:WhatLinksHere from XML form to OOUI form" ([[Gerrit:289772]], T135773) (duration: 34m 12s)
[00:07:17] <stashbot>	 T135773: [Regression] Special:WhatLinksHere is unusable - https://phabricator.wikimedia.org/T135773
[00:07:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:09:07] <Dereckson>	 Danny_B: and now?
[00:10:07] <Danny_B>	 still nothing - can messages be reverted as well?
[00:10:23] <thedj>	 Dereckson: hmm, seems that message rebuilt there failed or something...
[00:10:31] <Danny_B>	 (and whoever made the patch be taught that repurposing messages is evil?)
[00:10:31] * thedj not sure
[00:10:38] <legoktm>	 we only reverted the english translations right...?
[00:10:45] <thedj>	 oh...
[00:10:56] <thedj>	 good point
[00:11:00] <Danny_B>	 00:43:29 < MatmaRex> i wonder how many translations were "updated" in the meantime
[00:11:05] <Danny_B>	 00:52:37 < Danny_B> MatmaRex: at least 28 langs
[00:11:05] <Danny_B>	 00:52:56 < Danny_B> (based on last 5000 edits on twn)
[00:11:13] <thedj>	 so all the translations will be dumped
[00:11:17] <thedj>	 ffs 
[00:11:59] <MatmaRex>	 oh hell
[00:11:59] <thedj>	 well it's 2:11 am for me and i wouldn't even know how to do that either :/
[00:12:17] <MatmaRex>	 just revert a bunch of localisation updates in core
[00:12:22] <MatmaRex>	 "just"
[00:12:34] <MaxSem>	 all the TWN ppl seem  to be in sleeping TZs too
[00:13:00] <MatmaRex>	 but i think we have LocalisationUpdate? which pulls messages from core or TWN, can't remember
[00:13:22] <MatmaRex>	 so this should fix itself entirely in a couple hours/days, in the worst case, when people update the translations again
[00:13:48] <MatmaRex>	 (or am i misremembering how this works?)
[00:14:19] <Dereckson>	 We could include in the next tech news this update is urgent?
[00:14:34] <Dereckson>	 So people will check on TranslateWiki if it has been translated for them?
[00:14:56] <Dereckson>	 Danny_B: Ajayrahulp
[00:15:17] <bd808>	 MatmaRex: l10nupdate runs around 03:00z each day and copies in all the messages from master
[00:16:02] <MatmaRex>	 i could probably revert all the changes on translatewiki, and then manually pull in translated messages into core
[00:16:17] <MatmaRex>	 (i wrote a little script for the latter just a few days ago)
[00:16:42] <Dereckson>	 You want to do that now?
[00:16:44] <MatmaRex>	 if it'll be updated from master, like bd808 says, i think we could skip deploying it manually
[00:21:50] <MatmaRex>	 anyway, yeah, i'll try doing that
[00:22:07] <Dereckson>	 k
[00:25:39] <MatmaRex>	 man, it's pretty amazing how quickly this got so many translations
[00:26:33] <MatmaRex>	 for future reference, i'm looking at https://translatewiki.net/w/i.php?title=Special%3ATranslations&message=whatlinkshere-hidetrans&namespace=8 and reverting all the changes made last English message update (light blue background)
[00:31:32] <legoktm>	 that should have created new message keys instead of re-using the old ones -.-
[00:32:02] <MatmaRex>	 yep
[00:33:48] <Danny_B>	 MatmaRex: it's highly used page so no doubt it was quick
[00:34:02] <Danny_B>	 also it is three or four messages to be reverted
[00:38:26] <Dereckson>	 I've notifified the patch author @ https://phabricator.wikimedia.org/T135773#2311312
[00:38:44] <grrrit-wm>	 (03PS2) 10Dzahn: RT: do not ensure=>latest,install perldoc [puppet] - 10https://gerrit.wikimedia.org/r/289796 (https://phabricator.wikimedia.org/T119112) 
[00:39:28] <Danny_B>	 Dereckson: thanks!
[00:40:15] <Danny_B>	 Dereckson: could that task be actually closed?
[00:41:06] <Danny_B>	 and the notification put rather to T117754
[00:41:06] <stashbot>	 T117754: Convert Special:WhatLinksHere to OOUI - https://phabricator.wikimedia.org/T117754
[00:41:36] <Dereckson>	 perhaps MatmaRex wants to attach the l10n revert to T135773 too
[00:41:36] <stashbot>	 T135773: [Regression] Special:WhatLinksHere is unusable - https://phabricator.wikimedia.org/T135773
[00:43:34] <MatmaRex>	 (i'm about 70% done with reverting, then i'll have to import them)
[00:44:07] <Danny_B>	 i've copied thenotification  comment
[00:44:44] <MatmaRex>	 can you folks take a screenshot of the current broken state, for future reference?
[00:44:58] <MatmaRex>	 (so that i can bash people over the head with it if anyone tried to do this again ;) )
[00:45:01] <wikibugs>	 06Operations, 10Wikimedia-Mailing-lists: Reset Mailman List Creator password - https://phabricator.wikimedia.org/T135776#2311323 (10Dzahn) I reset the list creator password on the server, fermium.  docs:  There is the **mmsitepass** command (/usr/sbin/  linked to /var/lib/mailman/bin/)   **mmsitepass -c** to r...
[00:47:41] <Dereckson>	 MatmaRex: https://s3.amazonaws.com/upload.screenshot.co/93ac7370e0
[00:48:04] <Danny_B>	 MatmaRex: https://cs.wiktionary.org/w/index.php?title=Speci%C3%A1ln%C3%AD%3ACo+odkazuje+na&target=Hlavn%C3%AD+strana&namespace=10&uselang=cs
[00:48:28] <grrrit-wm>	 (03PS3) 10Dzahn: RT: do not ensure=>latest,install perldoc [puppet] - 10https://gerrit.wikimedia.org/r/289796 (https://phabricator.wikimedia.org/T119112) 
[00:48:58] <Dereckson>	 MatmaRex: {F4032805}
[00:49:04] <grrrit-wm>	 (03PS4) 10Dzahn: RT: do not ensure=>latest,install perldoc [puppet] - 10https://gerrit.wikimedia.org/r/289796 (https://phabricator.wikimedia.org/T119112) 
[00:49:09] <Dereckson>	 (I've uploaded it on Phab)
[00:49:40] <Danny_B>	 F4032805
[00:50:03] <Danny_B>	 oh, stashbot, where art thou?
[00:51:18] <Dereckson>	 Danny_B: https://phabricator.wikimedia.org/F4032805
[00:51:32] <Dereckson>	 "{F4032805}" is what to use to have it displayed in a task
[00:51:41] <MatmaRex>	 ok, i'm done
[00:51:48] <MatmaRex>	 oh wait, someone reverted four of my reverts. lol
[00:51:53] <Dereckson>	 xD
[00:52:07] <mutante>	 !log stashbot test (T122690)
[00:52:08] <Danny_B>	 i know... stashbot links tasks, should link other phab stuff as well
[00:52:08] <stashbot>	 T122690: Move stashbot tool to k8s - https://phabricator.wikimedia.org/T122690
[00:52:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:53:16] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] RT: do not ensure=>latest,install perldoc [puppet] - 10https://gerrit.wikimedia.org/r/289796 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn)
[00:54:07] <Dereckson>	 stashbot: because you think everybody will `wget -O cswiktionary-msg-repurposed-l10n-issue.png https://s3.amazonaws.com/upload.screenshot.co/93ac7370e0`, `arc upload cswiktionary-msg-repurposed-l10n-issue.png` and then says "{F...}" on the channel?
[00:54:12] <Dereckson>	 Danny_B: ^
[00:55:10] <Dereckson>	 Bots should handle most general cases, not every corner case used once per year.
[00:55:38] <Danny_B>	 that's incorrect case
[00:57:05] <Danny_B>	 when saying F123456 i expect stashbot to return "F123456: Image name - https://phabricator.wikimedia.orgF123456" - similar to what it does for tasks
[00:57:35] <Dereckson>	 yes, but are people saying F123456?
[00:58:42] <mutante>	 this came up before, afair we agreed some are nice to have, like P(astebins) but probably not F's
[00:58:43] <Danny_B>	 sometimes when discussing some screenshots? same with pastes i guess. it isn't harmful and it is user friendly since it saves work...
[00:59:12] <mutante>	 P666
[00:59:13] <stashbot>	 P666 salt wtf? - https://phabricator.wikimedia.org/P666
[00:59:18] <mutante>	 there you go
[00:59:21] <bd808>	 It does DMPT now -- https://github.com/bd808/tools-stashbot/blob/master/stashbot/bot.py#L33
[00:59:47] <mutante>	 :)
[00:59:48] <bd808>	 pull requests welcome :)
[01:00:41] <mutante>	 and i wonder .. why does this server have both Apache modules, mod-fastcgi AND mod-fcgid  
[01:00:59] <Danny_B>	 out of curiosity: why don't we host such bot tools on our vcs? (gerrit/diffusion/...)
[01:01:51] <bd808>	 beacuse I wrote it, I run it, and I didn't bother to go through the hassle of requesting gerrit hosting
[01:02:19] <enterprisey>	 Danny_B: are you asking about bots in general, or this specific bot?
[01:02:26] <bd808>	 I'm working on making diffusion hosting of tools easier right now though
[01:03:03] <Danny_B>	 it was rather general question since bunch of other tools/bots are somewhere in outer space, but we do bugtracking for them in phab (formerly in bz)
[01:03:22] <Danny_B>	 bd808: <3 for that!
[01:03:58] <bd808>	 Danny_B: you can follow along at T133252
[01:03:58] <stashbot>	 T133252: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252
[01:05:35] <bd808>	 The choice of hosting is a nuanced thing. Almost everyone wants it to be easy and fast to get a repo
[01:05:54] <bd808>	 some also want to build a "resume" of sorts on github or bitbucket
[01:06:42] <bd808>	 we require hosting or at least mirroring in gerrit/diffusion for production deploys
[01:06:57] <mutante>	 gerrit projects get cloned over to github. i think for diffusion that is in the works?
[01:07:06] <bd808>	 for things in Tool Labs I'm just really happy if source is published *anywhere*
[01:07:08] <enterprisey>	 I just have all my stuff on GitHub out of habit
[01:07:21] <enterprisey>	 and my stuff is split between GitHub Pages and Tool Labs
[01:09:20] <MatmaRex>	 okayyyy. i think i've got this
[01:10:03] <Dereckson>	 bd808: Danny_B avoid F5 please :p
[01:10:23] <mutante>	 github is a very centralized thing while git itself is all about not being centralized
[01:11:12] <bd808>	 github has been a pretty amazing replacement for prior attempts like sourceforge
[01:11:26] <Danny_B>	 Dereckson: ???
[01:11:28] <mutante>	 in a way its good when things are split across different tools and not all in the same place
[01:11:44] <bd808>	 but I don't like the current assumption by many that it is the *only* place to find FLOSS software
[01:12:08] <bd808>	 (notbody who says FLOSS believes that but ...)
[01:12:51] <Dereckson>	 Danny_B: <user> try to press F5 <bot> https://<phabricator instance>/F5
[01:13:17] <bd808>	 I read a disturbing essay recently that postulated that we are in a post-open source world where licensing doesn't matter
[01:13:25] <MatmaRex>	 hmm. interesting
[01:13:30] <MatmaRex>	 Dereckson: bd808: Danny_B: https://gerrit.wikimedia.org/r/289802
[01:13:49] <MatmaRex>	 a few languages apparently have massive diffs in my patch, because… their JSON files were indented with spaces rather than tabs?
[01:14:17] <MatmaRex>	 i think these might not have been exported from translatewiki for a long time, probably intentionally?
[01:14:28] <mutante>	 we should  focus on making it easy to request new repos on our own tools , i dont know about diffusion yet
[01:14:40] <mutante>	 bd808: heh, wanna close the tickets about licensing with that ?:)
[01:14:54] <bd808>	 mutante: lol. hell no
[01:16:29] <Dereckson>	 MatmaRex: use /paste for scripts, you gain syntax highlighting
[01:17:21] <Dereckson>	 MatmaRex: https://phabricator.wikimedia.org/P2868 if you wish a command to do arc-paste-file load-translations2.rb
[01:20:07] <bd808>	 MatmaRex: this one looks funky -- https://gerrit.wikimedia.org/r/#/c/289802/1/languages/i18n/kk-cyrl.json
[01:21:40] <MatmaRex>	 bd808: hmmm
[01:21:47] <grrrit-wm>	 (03PS2) 10Dzahn: RT: loading mod_fastcgi wasnt puppetized [puppet] - 10https://gerrit.wikimedia.org/r/289795 (https://phabricator.wikimedia.org/T119112) 
[01:22:03] <MatmaRex>	 bd808: that language has LanguageConverter… blergh
[01:26:14] <MatmaRex>	 bd808: yeah, something went weird in my script / the API…
[01:26:34] <MatmaRex>	 i have had enough of this, i think. that one language change should probably be undone in the commit (it's okay on translatewiki)
[01:27:02] <MatmaRex>	 bd808: Dereckson: feel free to tweak that commit and +2, or ignore it - the next l10n-bot run will include these fixes anyway
[01:27:38] <MatmaRex>	 i'm going to go sleep. see you :)
[01:27:43] <Dereckson>	 I suggest we remove problematic languages files and wait fo the bot.
[01:27:48] <Dereckson>	 Good night.
[01:28:22] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db1040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 377.86 seconds
[01:31:15] <Dereckson>	 (strange, for other *-* that works)
[01:32:49] <Danny_B>	 let me know if you need some tests of that message revert
[01:38:41] <grrrit-wm>	 (03PS3) 10Dzahn: RT: loading mod_fastcgi wasnt puppetized [puppet] - 10https://gerrit.wikimedia.org/r/289795 (https://phabricator.wikimedia.org/T119112) 
[01:43:04] <Dereckson>	 bd808: finally I removed gan-hant, kk-cyrl, kk-latn and zh-hant
[01:44:09] <bd808>	 Dereckson: +2'd
[01:44:43] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[01:44:50] <Dereckson>	 Okay, so we wait 3:00 UTC for l10nupdate or we sync the change now?
[01:44:53] <icinga-wm>	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[01:45:18] <bd808>	 I think we can just wait for l10nupdate to fix things
[01:45:52] <bd808>	 If it's still messed up tomorrow we can backport and scap again
[01:46:00] <bd808>	 s/we/someone/
[01:46:02] <bd808>	 :)
[01:47:47] <Dereckson>	 That's fine with me.
[01:57:05] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] RT: loading mod_fastcgi wasnt puppetized [puppet] - 10https://gerrit.wikimedia.org/r/289795 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn)
[01:57:13] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:57:23] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:00:23] <Danny_B>	 Dereckson: if you'll be waiting for l10n update, then assure that no matmarex's changes got reverted on twn
[02:06:16] <Dereckson>	 Danny_B: hmmm?
[02:06:58] <Dereckson>	 Danny_B > according https://wikitech.wikimedia.org/wiki/LocalisationUpdate "translatewiki.net staff commit translations to trunk"
[02:07:57] <Dereckson>	 so there won't be a commit with new translations to MediaWiki core in the MediaWiki master branch
[02:12:14] <Dereckson>	 Danny_B: what the job do is create a commit to the current wmf branches to backport the master l10n files changes
[02:12:37] <Dereckson>	 so we'll get 5bd7b72 
[02:13:09] <Dereckson>	 (well we'll get a changes with the difference between wmf branch and core branch for l10n files)
[02:13:27] <Dereckson>	 s/core/master
[02:22:31] <wikibugs>	 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2311397 (10Dzahn) command for db schema upgrades:  rt-setup-database-4 --dba rt --action upgrade --upgrade-from 4.0.4 --upgrade-to 4.2.8  it will ask for credentials to the db
[02:27:13] <wikibugs>	 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2311402 (10Dzahn) there is also upgrade-mysql-schema.pl in /usr/share/request-tracker4/etc/upgrade/  which i tried before the former command and it created queries that i executed on the db, but basically jus...
[03:19:00] <wikibugs>	 06Operations, 10Traffic, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2311448 (10Thibaut120094) >>! In T134869#2281369, @BBlack wrote: > As a random experiment, perhaps some of those reporting could try this in FF 46.0.1? >  > 1. Type 'abo...
[04:20:22] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:21:57] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Update to 4.4.11 [debs/linux44] - 10https://gerrit.wikimedia.org/r/289813 
[04:24:12] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1014 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.007 second response time
[04:24:33] <moritzm>	 !log restarted hhvm on mw1014 (got stuck, output of hhvm-dump-debug available)
[04:24:41] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:42:47] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Update to 4.4.11 [debs/linux44] - 10https://gerrit.wikimedia.org/r/289813 
[04:50:24] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Update to 4.4.11 [debs/linux44] - 10https://gerrit.wikimedia.org/r/289813 
[04:53:18] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 4.4.11 [debs/linux44] - 10https://gerrit.wikimedia.org/r/289813 (owner: 10Muehlenhoff)
[05:24:49] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Add missing CVE ID to changelog [debs/linux44] - 10https://gerrit.wikimedia.org/r/289816 
[05:25:27] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Add missing CVE ID to changelog [debs/linux44] - 10https://gerrit.wikimedia.org/r/289816 (owner: 10Muehlenhoff)
[05:46:54] <wikibugs>	 06Operations: rebase librsvg security fixes - https://phabricator.wikimedia.org/T135804#2311589 (10MoritzMuehlenhoff)
[05:50:13] <icinga-wm>	 PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:13] <icinga-wm>	 PROBLEM - Host cp3035 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:13] <icinga-wm>	 PROBLEM - Host cp3036 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:22] <Nikerabbit>	 cannot access meta
[05:50:33] <icinga-wm>	 PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:33] <icinga-wm>	 PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:33] <icinga-wm>	 PROBLEM - Host cp3039 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:43] <icinga-wm>	 PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:43] <icinga-wm>	 PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:43] <icinga-wm>	 PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:44] <icinga-wm>	 PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:44] <icinga-wm>	 PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:44] <icinga-wm>	 PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:50:44] <icinga-wm>	 PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:02] <icinga-wm>	 PROBLEM - Host 2620:0:862:1:91:198:174:122 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:1:91:198:174:122
[05:51:03] <icinga-wm>	 PROBLEM - Host 2620:0:862:1:91:198:174:106 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:1:91:198:174:106
[05:51:03] <icinga-wm>	 PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:03] <icinga-wm>	 PROBLEM - Host cp3031 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:03] <icinga-wm>	 PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:14] <icinga-wm>	 PROBLEM - Host 91.198.174.106 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:16] <icinga-wm>	 PROBLEM - Host misc-web-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:22] <icinga-wm>	 PROBLEM - Host cp3043 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:22] <icinga-wm>	 PROBLEM - Host ns2-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::e
[05:51:22] <icinga-wm>	 PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:22] <icinga-wm>	 PROBLEM - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:22] <icinga-wm>	 PROBLEM - Host cp3046 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:22] <icinga-wm>	 PROBLEM - Host cp3047 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:22] <icinga-wm>	 PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:23] <icinga-wm>	 PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:34] <icinga-wm>	 PROBLEM - Host cp3044 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:42] <icinga-wm>	 PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:42] <icinga-wm>	 PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:42] <icinga-wm>	 PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:42] <icinga-wm>	 PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:42] <icinga-wm>	 PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:42] <icinga-wm>	 PROBLEM - Host lvs3004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:42] <icinga-wm>	 PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:43] <icinga-wm>	 PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:43] <icinga-wm>	 PROBLEM - Host cp3041 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:56] <icinga-wm>	 PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1
[05:51:57] <icinga-wm>	 PROBLEM - Host bast3001 is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:57] <icinga-wm>	 PROBLEM - Host asw-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[05:51:57] <icinga-wm>	 PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:02] <Nikerabbit>	 cannot access phabricator, but can access gerrit :o
[05:52:02] <icinga-wm>	 PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:03] <icinga-wm>	 PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:16] <icinga-wm>	 PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::2:b
[05:52:17] <icinga-wm>	 PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:17] <icinga-wm>	 PROBLEM - Host wikidata is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:17] <icinga-wm>	 PROBLEM - Host cp3005 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:22] <icinga-wm>	 PROBLEM - Host cp3021 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:22] <icinga-wm>	 PROBLEM - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:22] <icinga-wm>	 PROBLEM - Host cp3018 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:22] <icinga-wm>	 PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:22] <icinga-wm>	 PROBLEM - Host cp3013 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:22] <icinga-wm>	 PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:22] <icinga-wm>	 PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:23] <icinga-wm>	 PROBLEM - Host cp3017 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:23] <icinga-wm>	 PROBLEM - Host cr2-knams is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:24] <icinga-wm>	 PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:24] <icinga-wm>	 PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:25] <icinga-wm>	 PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:45] <icinga-wm>	 PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:49] <icinga-wm>	 PROBLEM - Host misc-web-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::3:d
[05:53:03] <icinga-wm>	 PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[05:53:29] <kart_>	 Nikerabbit: phabricator is up for me.
[05:53:52] <icinga-wm>	 PROBLEM - Host 91.198.174.122 is DOWN: PING CRITICAL - Packet loss = 100%
[05:53:53] <icinga-wm>	 PROBLEM - Host csw2-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[05:55:26] <bblack>	 I'm here
[05:55:33] <icinga-wm>	 PROBLEM - Host mr1-esams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::1
[05:55:35] <_joe_>	 bblack: I'm taking out esams ASAP
[05:55:37] <apergos>	 me too. I guess joe is 
[05:55:49] <grrrit-wm>	 (03PS1) 10BBlack: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/289817 
[05:55:49] <apergos>	 working on it, I was gonna say but he said it himself
[05:55:51] <_joe_>	 shit gerrit doesn't work
[05:56:03] <_joe_>	 bblack: thanks, I am out of gerrit
[05:56:09] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Depool esams [dns] - 10https://gerrit.wikimedia.org/r/289817 (owner: 10BBlack)
[05:56:09] <volans>	 !log Killed transaction 3262258 on db1040 (alter table stuck in "Waiting for table metadata lock" blocking the replica) T130692
[05:56:10] <stashbot>	 T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692
[05:56:12] <icinga-wm>	 PROBLEM - Host cr2-knams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::4
[05:56:12] <icinga-wm>	 PROBLEM - Host cr1-esams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::5
[05:56:12] <icinga-wm>	 PROBLEM - Host cr2-esams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::3
[05:56:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:57:26] <bblack>	 I'm trying to get into esams direct now, to kill dns there too
[05:57:54] <_joe_>	 bblack: I can get in I guess
[05:57:56] <_joe_>	 let me try
[05:58:13] <bblack>	 I'm in now
[05:58:36] <bblack>	 !log gdnsd stopped on eeden.esams, puppet disabled
[05:58:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:58:50] <_joe_>	 bblack: lol I did the same things :P
[06:01:38] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: "Hey... who do you think you're talking to, people? :)" [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto)
[06:02:13] <icinga-wm>	 PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6
[06:02:13] <icinga-wm>	 PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6
[06:02:13] <icinga-wm>	 PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:02:14] <icinga-wm>	 PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6
[06:02:14] <icinga-wm>	 PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:02:22] <icinga-wm>	 PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:02:22] <icinga-wm>	 PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:02:23] <icinga-wm>	 PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3
[06:02:33] <icinga-wm>	 PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:02:34] <icinga-wm>	 PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:02:45] <bblack>	 and now we get the related ipsec spam :)
[06:02:52] <icinga-wm>	 PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:02:52] <icinga-wm>	 PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:02:53] <icinga-wm>	 PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3
[06:02:54] <robh>	 odd to see it and not be my fault.
[06:03:01] <icinga-wm>	 PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6
[06:03:02] <icinga-wm>	 PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:03:02] <icinga-wm>	 PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6
[06:03:02] <icinga-wm>	 PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6
[06:03:03] <icinga-wm>	 PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:03:11] <icinga-wm>	 PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3
[06:03:11] <icinga-wm>	 PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:03:12] <icinga-wm>	 PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:03:21] <icinga-wm>	 PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3
[06:03:21] <icinga-wm>	 PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6
[06:03:23] <icinga-wm>	 PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6
[06:03:31] <icinga-wm>	 PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:03:31] <icinga-wm>	 PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:03:42] <icinga-wm>	 PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6
[06:03:42] <icinga-wm>	 PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:03:42] <icinga-wm>	 PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:03:42] <icinga-wm>	 PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:03:42] <icinga-wm>	 PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6
[06:03:53] <icinga-wm>	 PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:03:53] <icinga-wm>	 PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6
[06:03:53] <icinga-wm>	 PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:03:53] <icinga-wm>	 PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6
[06:04:02] <icinga-wm>	 PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:04:02] <icinga-wm>	 PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:04:02] <icinga-wm>	 PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3
[06:04:22] <icinga-wm>	 PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:04:22] <icinga-wm>	 PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:04:22] <icinga-wm>	 PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:04:33] <icinga-wm>	 PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:04:33] <icinga-wm>	 PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:04:33] <icinga-wm>	 PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:04:33] <icinga-wm>	 PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6
[06:04:51] <icinga-wm>	 PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:04:51] <icinga-wm>	 PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:05:12] <icinga-wm>	 PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3
[06:05:22] <icinga-wm>	 PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:05:22] <icinga-wm>	 PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:05:22] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db1040 is OK: OK slave_sql_lag Replication lag: 0.73 seconds
[06:05:41] <icinga-wm>	 PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6
[06:05:41] <icinga-wm>	 PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6
[06:05:41] <icinga-wm>	 PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6
[06:05:41] <icinga-wm>	 PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:05:41] <icinga-wm>	 PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:05:41] <icinga-wm>	 PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:06:11] <icinga-wm>	 PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6
[06:06:11] <icinga-wm>	 PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:06:11] <icinga-wm>	 PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6
[06:06:53] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 Brandon Black esams link down
[06:06:53] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 Brandon Black esams link down
[06:06:53] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 Brandon Black esams link down
[06:06:53] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 Brandon Black esams link down
[06:06:53] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 Brandon Black esams link down
[06:08:06] <icinga-wm>	 RECOVERY - Host cp3010 is UP: PING WARNING - Packet loss = 80%, RTA = 86.18 ms
[06:08:06] <icinga-wm>	 RECOVERY - Host cp3038 is UP: PING WARNING - Packet loss = 80%, RTA = 83.94 ms
[06:08:06] <icinga-wm>	 RECOVERY - Host cp3034 is UP: PING OK - Packet loss = 16%, RTA = 84.17 ms
[06:08:06] <icinga-wm>	 RECOVERY - Host cp3041 is UP: PING OK - Packet loss = 16%, RTA = 83.73 ms
[06:08:06] <icinga-wm>	 RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK
[06:08:11] <icinga-wm>	 RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK
[06:08:11] <icinga-wm>	 RECOVERY - Host cp3017 is UP: PING OK - Packet loss = 0%, RTA = 84.83 ms
[06:08:12] <icinga-wm>	 RECOVERY - Host cp3033 is UP: PING OK - Packet loss = 0%, RTA = 83.17 ms
[06:08:12] <icinga-wm>	 RECOVERY - Host cp3008 is UP: PING OK - Packet loss = 0%, RTA = 83.67 ms
[06:08:12] <icinga-wm>	 RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 84.31 ms
[06:08:12] <icinga-wm>	 RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 84.56 ms
[06:08:26] <bblack>	 heh
[06:08:36] <apergos>	 lol
[06:09:12] <icinga-wm>	 RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 24 ESP OK
[06:09:12] <icinga-wm>	 RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK
[06:09:13] <icinga-wm>	 RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 24 ESP OK
[06:09:13] <icinga-wm>	 RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.67 ms
[06:09:14] <icinga-wm>	 RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK
[06:09:14] <icinga-wm>	 RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK
[06:09:15] <icinga-wm>	 RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK
[06:09:15] <icinga-wm>	 RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK
[06:09:16] <icinga-wm>	 RECOVERY - Host 91.198.174.122 is UP: PING OK - Packet loss = 0%, RTA = 83.35 ms
[06:09:18] <icinga-wm>	 RECOVERY - Host upload-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 86.70 ms
[06:09:22] <icinga-wm>	 RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.96 ms
[06:09:22] <icinga-wm>	 RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK
[06:09:22] <icinga-wm>	 RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK
[06:09:22] <icinga-wm>	 RECOVERY - Host 2620:0:862:1:91:198:174:106 is UP: PING OK - Packet loss = 0%, RTA = 84.42 ms
[06:09:23] <icinga-wm>	 RECOVERY - Host wikidata is UP: PING OK - Packet loss = 0%, RTA = 83.17 ms
[06:09:32] <icinga-wm>	 RECOVERY - Host asw-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 89.42 ms
[06:09:33] <icinga-wm>	 RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK
[06:09:33] <icinga-wm>	 RECOVERY - Host 91.198.174.106 is UP: PING OK - Packet loss = 0%, RTA = 83.60 ms
[06:09:44] <icinga-wm>	 RECOVERY - Host csw2-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 85.21 ms
[06:09:44] <icinga-wm>	 RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK
[06:09:44] <icinga-wm>	 RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK
[06:09:44] <icinga-wm>	 RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK
[06:09:44] <icinga-wm>	 RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK
[06:09:44] <icinga-wm>	 RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK
[06:09:45] <icinga-wm>	 RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK
[06:09:52] <icinga-wm>	 RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK
[06:09:52] <icinga-wm>	 RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK
[06:10:03] <icinga-wm>	 RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK
[06:10:07] <icinga-wm>	 RECOVERY - Host misc-web-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.39 ms
[06:10:07] <icinga-wm>	 RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK
[06:10:07] <icinga-wm>	 RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 24 ESP OK
[06:10:07] <icinga-wm>	 RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK
[06:10:07] <icinga-wm>	 RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK
[06:10:07] <icinga-wm>	 RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK
[06:10:08] <icinga-wm>	 RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK
[06:10:08] <icinga-wm>	 RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK
[06:10:09] <icinga-wm>	 RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK
[06:10:09] <icinga-wm>	 RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 36 ESP OK
[06:10:10] <icinga-wm>	 RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 36 ESP OK
[06:10:10] <icinga-wm>	 RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK
[06:10:12] <icinga-wm>	 RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 84.42 ms
[06:10:23] <icinga-wm>	 RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK
[06:10:23] <icinga-wm>	 RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK
[06:10:23] <icinga-wm>	 RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK
[06:10:24] <_joe_>	 Wikimedia Platform operations, serious stuff | Status: partial outage in Amsterdam DC | Log: https://bit.ly/wikitech | Channel logs: http://ur1.ca/edq22 | Ops Clinic Duty: _joe_
[06:10:32] <icinga-wm>	 RECOVERY - Host ns2-v6 is UP: PING OK - Packet loss = 0%, RTA = 83.67 ms
[06:11:36] <bblack>	 !log restarted gdnsd on eeden.esams (with new config, esams marked down)
[06:11:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:12:52] <icinga-wm>	 PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: puppet fail
[06:12:52] <icinga-wm>	 PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail
[06:12:52] <icinga-wm>	 PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail
[06:12:53] <icinga-wm>	 PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail
[06:12:53] <icinga-wm>	 PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail
[06:12:53] <icinga-wm>	 PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 3 failures
[06:12:53] <icinga-wm>	 PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail
[06:13:02] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[06:13:03] <icinga-wm>	 PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail
[06:13:03] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:13:03] <icinga-wm>	 PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail
[06:13:13] <icinga-wm>	 PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail
[06:13:33] <icinga-wm>	 PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail
[06:13:33] <icinga-wm>	 PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail
[06:13:42] <icinga-wm>	 PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail
[06:13:43] <icinga-wm>	 PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail
[06:13:43] <icinga-wm>	 PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail
[06:13:43] <icinga-wm>	 PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:13:43] <icinga-wm>	 PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: puppet fail
[06:13:52] <icinga-wm>	 PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail
[06:13:52] <icinga-wm>	 PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail
[06:13:52] <icinga-wm>	 PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail
[06:13:52] <icinga-wm>	 PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:13:53] <icinga-wm>	 PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail
[06:14:03] <icinga-wm>	 PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: puppet fail
[06:14:03] <icinga-wm>	 RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.49 ms
[06:14:03] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:14:14] <icinga-wm>	 PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:14:14] <icinga-wm>	 PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail
[06:14:24] <icinga-wm>	 PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:14:24] <icinga-wm>	 PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail
[06:14:25] <bblack>	 the puppetfails are just stacked up from the outage, they'll recover on their own
[06:14:32] <_joe_>	 yes
[06:14:34] <icinga-wm>	 PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[06:14:42] <icinga-wm>	 PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail
[06:14:42] <icinga-wm>	 PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: puppet fail
[06:14:48] <_joe_>	 it's pretty obvious they happen given we don't have a local puppetmaster
[06:15:00] <_joe_>	 (which, IMO, we should have)
[06:15:22] <icinga-wm>	 PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:15:34] <icinga-wm>	 PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:15:52] <icinga-wm>	 PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Puppet has 3 failures
[06:19:22] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:19:23] <icinga-wm>	 RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:20:23] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:21:13] <icinga-wm>	 RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:21:32] <icinga-wm>	 RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:22:13] <icinga-wm>	 RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:22:43] <icinga-wm>	 RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:24:22] <icinga-wm>	 RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:24:32] <icinga-wm>	 RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:24:42] <icinga-wm>	 RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:25:03] <icinga-wm>	 RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:25:04] <icinga-wm>	 RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:25:32] <icinga-wm>	 RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:26:53] <icinga-wm>	 RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:27:09] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2311616 (10Joe) a:05Joe>03None
[06:27:44] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:28:24] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:29:44] <icinga-wm>	 RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:30:34] <icinga-wm>	 RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:30:43] <icinga-wm>	 RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:31:22] <icinga-wm>	 PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:33] <icinga-wm>	 PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] <icinga-wm>	 RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:31:54] <icinga-wm>	 RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:32:44] <icinga-wm>	 RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:32:52] <icinga-wm>	 RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:33:20] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add systemd support for the jessie build [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 (owner: 10Giuseppe Lavagetto)
[06:34:01] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:02] <icinga-wm>	 PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:53] <icinga-wm>	 RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:35:12] <icinga-wm>	 PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:13] <wikibugs>	 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2311660 (10Joe) @hashar that was exactly my plan
[06:36:12] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:22] <icinga-wm>	 PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:23] <icinga-wm>	 RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:36:52] <icinga-wm>	 RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:37:31] <icinga-wm>	 RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:38:22] <icinga-wm>	 RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:38:23] <icinga-wm>	 RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:38:32] <icinga-wm>	 RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:39:02] <icinga-wm>	 RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:39:03] <icinga-wm>	 RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:39:11] <icinga-wm>	 RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:40:01] <icinga-wm>	 RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:40:42] <icinga-wm>	 RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:12] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:57:12] <icinga-wm>	 RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:57:22] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:32] <icinga-wm>	 RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:52] <icinga-wm>	 RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] <icinga-wm>	 RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:12] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:22] <icinga-wm>	 PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail
[07:18:34] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] "Liking this, it's exactly what we discussed yesterday." [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto)
[07:24:47] <icinga-wm>	 RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:27:33] <grrrit-wm>	 (03PS1) 10Jcrespo: Reduce max table lock to identify metadata locks and abort [software] - 10https://gerrit.wikimedia.org/r/289820 (https://phabricator.wikimedia.org/T135809) 
[07:31:29] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] "I am going to be bold and break a host to test this." [software] - 10https://gerrit.wikimedia.org/r/289820 (https://phabricator.wikimedia.org/T135809) (owner: 10Jcrespo)
[07:36:38] <jynus>	 !log testing medata lock detectiom on db1069
[07:36:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:43:09] <jynus>	 the test was successful
[07:59:54] <grrrit-wm>	 (03PS1) 10Jcrespo: Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 
[08:00:14] <grrrit-wm>	 (03PS1) 10Jcrespo: Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 
[08:00:16] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (owner: 10Jcrespo)
[08:00:28] <grrrit-wm>	 (03PS2) 10Jcrespo: Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 
[08:02:14] <grrrit-wm>	 (03PS3) 10Jcrespo: Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 (https://phabricator.wikimedia.org/T112473) 
[08:02:31] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 (https://phabricator.wikimedia.org/T112473) (owner: 10Jcrespo)
[08:02:39] <grrrit-wm>	 (03CR) 10Jcrespo: [V: 032] Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 (https://phabricator.wikimedia.org/T112473) (owner: 10Jcrespo)
[08:03:00] <grrrit-wm>	 (03PS2) 10Jcrespo: Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (https://phabricator.wikimedia.org/T112473) 
[08:03:10] <grrrit-wm>	 (03PS3) 10Jcrespo: Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (https://phabricator.wikimedia.org/T112473) 
[08:07:18] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Add a new backup set to backup openldap databases [puppet] - 10https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919) 
[08:08:04] <mobrovac>	 !log mathoid deploying 243a530
[08:08:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:08:30] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add a new backup set to backup openldap databases [puppet] - 10https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919) (owner: 10Muehlenhoff)
[08:09:54] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Add a new backup set to backup openldap databases [puppet] - 10https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919) 
[08:10:55] <wikibugs>	 06Operations, 13Patch-For-Review: Add openldap/labs servers to backup - https://phabricator.wikimedia.org/T120919#2311788 (10MoritzMuehlenhoff)
[08:10:57] <wikibugs>	 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2311787 (10MoritzMuehlenhoff)
[08:11:44] <elukey>	 !log upgrading cassandra from 2.1.12 to 2.1.13 on aqs1002.eqiad.mwnet
[08:11:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:12:40] <grrrit-wm>	 (03PS1) 10Jcrespo: Increase retries to 10 to avoid small bumps to alert [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/289825 (https://phabricator.wikimedia.org/T112473) 
[08:13:17] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Increase retries to 10 to avoid small bumps to alert [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/289825 (https://phabricator.wikimedia.org/T112473) (owner: 10Jcrespo)
[08:14:39] <grrrit-wm>	 (03PS4) 10Jcrespo: Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (https://phabricator.wikimedia.org/T112473) 
[08:16:43] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (https://phabricator.wikimedia.org/T112473) (owner: 10Jcrespo)
[08:17:21] <icinga-wm>	 PROBLEM - cassandra CQL 10.64.32.175:9042 on aqs1002 is CRITICAL: Connection refused
[08:17:35] <elukey>	 this is me, should resolve in a bit
[08:18:30] <icinga-wm>	 PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection refused
[08:23:00] <icinga-wm>	 RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.004 second response time on port 9042
[08:23:18] <elukey>	 gooood
[08:23:22] <elukey>	 one more to go
[08:23:26] <elukey>	 (in a bit)
[08:24:00] <icinga-wm>	 RECOVERY - cassandra CQL 10.64.32.175:9042 on aqs1002 is OK: TCP OK - 0.000 second response time on port 9042
[08:50:35] <elukey>	 !log upgrading cassandra from 2.1.12 to 2.1.13 on aqs1003.eqiad.mwnet
[08:50:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:55:11] <wikibugs>	 06Operations, 07Graphite: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385#2311841 (10fgiunchedi) it looks like metric sending from `cassandra-metrics-collector` keep piling up as they get stalled and never time out the sending, myself and @Eevans have be...
[08:57:19] <icinga-wm>	 PROBLEM - cassandra CQL 10.64.48.117:9042 on aqs1003 is CRITICAL: Connection refused
[08:58:03] <elukey>	 node is already up and running, will clear in a sec
[08:59:29] <icinga-wm>	 RECOVERY - cassandra CQL 10.64.48.117:9042 on aqs1003 is OK: TCP OK - 0.002 second response time on port 9042
[09:00:19] <jynus>	 !log altering db1040 commonswiki.categorylinks
[09:00:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:10:12] <icinga-wm>	 PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 686
[09:11:15] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2311859 (10fgiunchedi) +1 to what @faidon said, thanks @Gilles indeed for all the effort you've put in this! I'll start with python-statsd as the lowest hanging fruit....
[09:20:12] <icinga-wm>	 RECOVERY - check_mysql on lutetium is OK: Uptime: 821598 Threads: 1 Questions: 15018862 Slow queries: 14258 Opens: 92966 Flush tables: 2 Open tables: 64 Queries per second avg: 18.280 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:20:44] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[09:21:23] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[09:22:03] <icinga-wm>	 PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: puppet fail
[09:30:22] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:32:02] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:43:05] <grrrit-wm>	 (03PS1) 10Gehel: Maps - make redis server configureable [puppet] - 10https://gerrit.wikimedia.org/r/289829 (https://phabricator.wikimedia.org/T134901) 
[09:45:54] <wikibugs>	 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2311913 (10elukey) @Eevans aqs100[123] upgraded to 2.1.13 today!
[09:50:23] <icinga-wm>	 RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:53:37] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Allow CQL access for multi-instance AQS Cassandra setup [puppet] - 10https://gerrit.wikimedia.org/r/289830 
[09:58:31] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Allow CQL access for multi-instance AQS Cassandra setup [puppet] - 10https://gerrit.wikimedia.org/r/289830 
[10:01:17] <wikibugs>	 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2311944 (10elukey) Reporting a conversation with dormando on the #memcached Freenode channel: https://phabricator.wikimedia.org/P3153  Comments are about the la...
[10:06:58] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[10:08:25] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] Maps - make redis server configureable [puppet] - 10https://gerrit.wikimedia.org/r/289829 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel)
[10:08:44] <grrrit-wm>	 (03CR) 10Elukey: [C: 031] "http://puppet-compiler.wmflabs.org/2856" [puppet] - 10https://gerrit.wikimedia.org/r/289830 (owner: 10Muehlenhoff)
[10:10:39] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[10:15:33] <jynus>	 they are upload errors, cannot see a pattern
[10:17:09] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:17:49] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:36:35] <grrrit-wm>	 (03PS1) 10Jcrespo: Increase db1029 and db1033 weight back to normal after upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289838 (https://phabricator.wikimedia.org/T112079) 
[10:36:47] <wikibugs>	 06Operations, 07HHVM, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2312032 (10Joe) I am now building a package linked to libicu52 for trusty, as the preparation work seems to be done.
[10:37:36] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Increase db1029 and db1033 weight back to normal after upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289838 (https://phabricator.wikimedia.org/T112079) (owner: 10Jcrespo)
[10:40:20] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase db1029 and db1033 weight back to normal after upgrade (duration: 01m 52s)
[10:40:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:26:44] <grrrit-wm>	 (03CR) 10Mobrovac: "Given that Cass instances do not use 9042 for inter-node communication and the fact that the plan is to retire aqqs100[123] soon(TM), I th" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289830 (owner: 10Muehlenhoff)
[11:27:05] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Revert "Depool esams" [dns] - 10https://gerrit.wikimedia.org/r/289844 
[11:28:26] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Revert "Depool esams" [dns] - 10https://gerrit.wikimedia.org/r/289844 (owner: 10Faidon Liambotis)
[11:45:05] <wikibugs>	 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2312240 (10elukey) Compared chunk size vs number of chunks for the hosts under testing to get a visual  difference (I tried to combine the graphs but my spreads...
[11:49:49] <moritzm>	 !log rolling restart of nginx in ulsfo to pick up expat update
[11:49:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:52:07] <kart_>	 !Updated cxserver to 4c5738c
[11:54:39] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.037 second response time on port 9042
[12:02:44] <moritzm>	 !log rolling restart of nginx in esams to pick up expat update
[12:02:52] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:13:51] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2312310 (10jcrespo)
[12:13:53] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: db1033 (old s7 master) needs backup and reimage - https://phabricator.wikimedia.org/T134555#2312309 (10jcrespo) 05Open>03Resolved
[12:15:34] <wikibugs>	 06Operations, 10DBA, 13Patch-For-Review: Upgrade x1 cluster - https://phabricator.wikimedia.org/T112079#2312315 (10jcrespo) 05Open>03Resolved After the increase of weight of the slave, all x1 servers should be on jessie and a recent mariadb version. Only regular maintenance would be needed as usual.
[12:16:27] <elukey>	 !log restarting cassandra on aqs100[123] for Java upgrades
[12:16:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:22:29] <wikibugs>	 06Operations, 10Traffic, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2312346 (10Thibaut120094) OTRS members have the same issue with Firefox 46.0.1 on https://ticket.wikimedia.org/  https://lists.wikimedia.org/mailman/private/otrs-fr/2016...
[12:22:49] <moritzm>	 !log rolling restart of nginx in codfw to pick up expat update
[12:22:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:27:19] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Allow CQL access for multi-instance AQS Cassandra setup [puppet] - 10https://gerrit.wikimedia.org/r/289830 
[12:30:00] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[12:31:38] <moritzm>	 !log rolling restart of nginx in eqiad to pick up expat update
[12:31:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:34:48] <wikibugs>	 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2312370 (10Steinsplitter) New error when deleting (related to this?): ``` API reque...
[12:37:04] <wikibugs>	 06Operations, 10Traffic, 07Browser-Support-Firefox, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2312371 (10Danny_B) Adding #browser-support-firefox as it seems to be its issue only ATM (no other browser verified yet - in such case, remo...
[12:39:27] <grrrit-wm>	 (03CR) 10Mobrovac: [C: 031] Allow CQL access for multi-instance AQS Cassandra setup [puppet] - 10https://gerrit.wikimedia.org/r/289830 (owner: 10Muehlenhoff)
[12:51:07] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312406 (10phuedx) @BBlack: If Varnish is the part of the stack that this is to be done, have you taken a look at [libvmod-abtest](https://github.com/Destination/libvmod-abtest)...
[12:51:44] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2312407 (10Gilles)
[12:54:02] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2312414 (10Gilles) So far I'm not running into major issues packaging any of the dependencies with their own tests running during the build.  The install dependencies a...
[12:54:54] <grrrit-wm>	 (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/2857/ - Marko docet" [puppet] - 10https://gerrit.wikimedia.org/r/289830 (owner: 10Muehlenhoff)
[12:58:12] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:05:50] <jynus>	 !log freeing up space on db1038 by defragmenting its tables
[13:05:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:06:34] <wikibugs>	 06Operations, 06Discovery, 10Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2312419 (10Gehel) Log of osm2pgsql run: {F4034592}  Import was done according to [[ https://wikitech.wikimedia.org/wiki/Maps#Importing_...
[13:07:38] <mark>	 !log Performing acupuncture on cr2-codfw:ae4.2020 (Lowered VRRP priority from 100 to 50, inet/inet6)
[13:07:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:12:08] <elukey>	 ahahahahahahah
[13:14:26] <mark>	 !log Lowering VRRP priority to 50 on all VRRP groups on cr2-codfw to drain FPC0
[13:14:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:15:45] <wikibugs>	 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2312454 (10elukey) 05Open>03Resolved
[13:24:55] <mark>	 !log Disabling OSPF on all cr2-codfw row subnets to drain FPC0
[13:25:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:31:05] <grrrit-wm>	 (03PS3) 10Rush: labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 
[13:31:21] <mark>	 !log Disabling cr2-codfw et-0/* interfaces
[13:31:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:35:10] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 84, down: 4, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae2BRae2: down - Core: asw-b-codfw:ae2BRae3: down - Core: asw-c-codfw:ae2BRae4: down - Core: asw-d-codfw:ae2BR
[13:35:37] <jynus>	 !log changing dbstore1001 to be a direct slave of db1075
[13:35:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:37:00] <mark>	 !log Offlining cr2-codfw FPC 0
[13:37:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:37:48] <moritzm>	 !log upgraded java on xenon/praseodymium/cerium and restbase2001 to latest openjdk-8 release (along with restarts of Cassandra)
[13:37:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:38:03] <mark>	 sleep...
[13:38:43] <mark>	 !log Bringing cr2-codfw FPC 0 back up
[13:38:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:41:49] <jynus>	 db1056 lag is flapping (in a lagging- depooled- up to date cycle)
[13:41:59] <jynus>	 but schema change on s4 finished already
[13:45:01] <MatmaRex>	 bd808: so it looks like the message changes from https://gerrit.wikimedia.org/r/289802 did not get automagically deployed… filters on e.g. https://pl.wikipedia.org/wiki/Specjalna:Linkujące/A are still broken (no links are generated)
[13:45:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[13:45:55] <mark>	 !log Enabled cr2-codfw et-0/* interfaces, reenabling OSPF/OSPF3
[13:46:03] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:46:55] <MatmaRex>	 Dereckson: ^^^^
[13:47:24] <MatmaRex>	 Dereckson: looking at https://wikitech.wikimedia.org/wiki/Server_Admin_Log , i don't see any evidence that LocalisationUpdate actually ran yesterday
[13:47:57] <MatmaRex>	 apparently the last time it happened was on 2016-05-18? " 02:59 logmsgbot: mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 11m 15s)"
[13:48:18] <jynus>	 disk is ok
[13:50:40] <Dereckson>	 ostriches: hi. Yesterday, we synced a Revert "Convert Special:WhatLinksHere from XML form to OOUI form" change, which broked that special page. As the change repurposed messages, we've merged in master a l10n change too, but it hasn't been picked by l10n task. The non English messages aren't consistent with the version. We don't have SWAT windows available today. Could I at 8:00 SF do a sca
[13:50:46] <Dereckson>	 p to deploy a cherry pick of https://gerrit.wikimedia.org/r/289802 to wmf.2?
[13:51:18] <Dereckson>	 MatmaRex: if I've a green light, I can backport it to wmf/1.28.0-wmf.2 and sync
[13:53:07] <jynus>	 lots of "Title::invalidateCache"
[13:53:59] <jynus>	 probably someone doing it from the api
[13:55:18] <grrrit-wm>	 (03PS4) 10Rush: labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 
[13:55:35] <grrrit-wm>	 (03PS5) 10Rush: labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 
[13:56:43] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312587 (10BBlack) >>! In T135762#2312406, @phuedx wrote: > @BBlack: If Varnish is the part of the stack that this is to be done, have you taken a look at [libvmod-abtest](https...
[14:00:20] <wikibugs>	 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2312610 (10fgiunchedi) >>! In T95253#2306337, @Eevans wrote: > Cassandra has now been downgraded to 2.1.13 on...
[14:01:33] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: cassandra: pin cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) 
[14:01:58] <_joe_>	 elukey, urandom, gehel ^^
[14:02:52] <_joe_>	 actually, I think some of the things I put there are redundant
[14:03:29] <grrrit-wm>	 (03PS3) 10Giuseppe Lavagetto: cassandra: pin cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) 
[14:03:48] <_joe_>	 please review :)
[14:03:59] <elukey>	 ack!
[14:04:23] <_joe_>	 !log removing libicu48 from trusty archives, kept a copy of the packages in my homedir on carbon
[14:04:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:06:31] <grrrit-wm>	 (03CR) 10Gehel: [C: 031] cassandra: pin cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto)
[14:06:38] <wikibugs>	 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2312644 (10fgiunchedi) all instances have been bootstrapped, left to do: * deploy restbase on restbase200[789] if not already * add restbase200[789] to conftool and pool them in lvs
[14:08:12] <grrrit-wm>	 (03CR) 10Gehel: cassandra: pin cassandra version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto)
[14:12:07] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: cassandra: pin cassandra version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto)
[14:20:14] <grrrit-wm>	 (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/2859/" [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto)
[14:22:31] <gehel>	 !log cassandra downgraded on maps2*.codfw.wmnet
[14:22:34] <gehel>	 _joe_: ^
[14:22:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:23:16] <_joe_>	 gehel: thanks :)
[14:23:50] <_joe_>	 ok merging this
[14:26:01] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] cassandra: pin cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto)
[14:30:10] <wikibugs>	 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2312715 (10elukey) Finally the Snapshots:  mc1007 - growth factor 1.15 - memcached 1.4.21  [[ https://phabricator.wikimedia.org/P3129 | mc1007_stats_1463649014...
[14:36:19] <matt_flaschen>	 jynus, do we do anything in production to preserve the MySQL AUTO_INCREMENT counter when the database/database server restarts?  See https://phabricator.wikimedia.org/T122262#2310857 .
[14:37:40] <grrrit-wm>	 (03PS1) 10Gergő Tisza: Enable AuthManager in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289866 (https://phabricator.wikimedia.org/T135498) 
[14:37:48] <jynus>	 matt_flaschen, no
[14:38:00] <jynus>	 matt_flaschen, does it affect you?
[14:39:33] <matt_flaschen>	 jynus, in production, probably not, unless we're amazingly unlucky.
[14:39:39] <jynus>	 your problem is deleting rows, who in full sanity does delete rows or move them around?
[14:40:00] <jynus>	 the whole "archival" of rows is broken
[14:40:09] <jynus>	 and should never be done
[14:40:37] <jynus>	 deleting rows in unefficient, if you can just mark them as deleted
[14:41:36] <jynus>	 is that your model, similar to archive?
[14:41:39] <matt_flaschen>	 jynus, Flow doesn't specifically (we mark stuff as deleted), but as you know core does for page and revision.
[14:41:48] <jynus>	 I know
[14:41:57] <jynus>	 and that is a very broken behaviour
[14:42:05] <jynus>	 that causes slaves to desync
[14:42:18] <jynus>	 INSERT...SELECT is the worst thing ever
[14:42:28] <matt_flaschen>	 Also, I didn't realize before yesterday the same auto_increment key could be handed out twice in certain scenarios.
[14:42:31] <matt_flaschen>	 That's why I asked.
[14:42:55] <jynus>	 yeah, a master failover
[14:43:59] <jynus>	 I mean, we could workaround it
[14:44:13] <jynus>	 but I would prefer to fix the code to not depend on it
[14:45:38] <matt_flaschen>	 I agree, but I don't have time to fix that in core right now.  I'm not asking you to work on it in the operations side either (since I think it's super-unlikely to cause problems in practice), I was just wondering.
[14:45:45] <matt_flaschen>	 I will file a bug just to track it, though.
[14:46:08] <jynus>	 matt_flaschen, sure, I wasn't suggesting that you did :-)
[14:46:27] <jynus>	 but it is something that worries me
[14:47:01] <jynus>	 and can break replication, too
[14:48:29] <wikibugs>	 06Operations, 10cassandra: Downgrade Cassandra on apt.wikimedia.org to 2.1.13 - https://phabricator.wikimedia.org/T135673#2312806 (10Joe)
[14:50:32] <gehel>	 !log shutting down kartotherian on maps-test2001 (accidental data deletion)
[14:50:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:51:59] <icinga-wm>	 PROBLEM - cassandra service on maps-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[14:52:28] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: Connection refused
[14:54:15] <bd808>	 MatmaRex: crud. Checked the logs and found failures. I opened T135849 to investigate
[14:54:15] <stashbot>	 T135849: l10nupdate failing due to sudo rights - https://phabricator.wikimedia.org/T135849
[14:54:50] <bd808>	 thcipriani: ^ do you have time to figure out how to fix l10nupdate?
[14:55:09] * thcipriani looks
[14:55:40] <bd808>	 It looks like at least the sudo grant is messed up
[14:56:11] <bd808>	 also I thought I had fixed it at some point so it would !log on failure but apparently not
[14:56:11] <thcipriani>	 bd808: yup. I can take a look.
[14:56:34] <bd808>	 <3 thcipriani. Shout if you need help
[14:56:55] <thcipriani>	 will do (really most likely will do :))
[14:58:17] <grrrit-wm>	 (03PS5) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) 
[14:58:26] <wikibugs>	 06Operations, 10ops-codfw, 10cassandra, 13Patch-For-Review: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2312843 (10Eevans)
[14:58:31] <thcipriani>	 ah, crap. I think I know what's happening.
[14:59:42] <godog>	 urandom: I'm going to roll-restart cmcd to test the theory behind T135385 and see if that makes it recover
[14:59:42] <stashbot>	 T135385: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385
[15:00:00] <urandom>	 godog: go for it
[15:01:31] <grrrit-wm>	 (03PS1) 10Alex Monk: Attempt to fix dynamicproxy-api service [puppet] - 10https://gerrit.wikimedia.org/r/289870 
[15:02:42] <moritzm>	 !log uploaded librsvg 2.40.5-1+deb8u2+wmf1 for jessie-wikimedia to carbon (rebase of locally patched package on top of latest security update)
[15:02:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:02:53] <grrrit-wm>	 (03CR) 10Alex Monk: "Yuvi, is proxy-eqiad.wmflabs.org working perhaps because of an unpuppetised version of that file?" [puppet] - 10https://gerrit.wikimedia.org/r/289870 (owner: 10Alex Monk)
[15:03:05] <godog>	 !log roll-restart cassandra-metrics-collector in codfw for T135385
[15:03:07] <stashbot>	 T135385: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385
[15:03:11] <icinga-wm>	 RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active
[15:03:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:05:43] <gehel>	 !log starting cluster rejoining for cassandra onmaps-test2001
[15:05:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:05:57] <wikibugs>	 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312863 (10Mattflaschen-WMF)
[15:06:18] <wikibugs>	 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312876 (10Mattflaschen-WMF) p:05Triage>03Low
[15:07:31] <wikibugs>	 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312863 (10Mattflaschen-WMF)
[15:10:40] <MatmaRex>	 bd808: :/
[15:10:55] <grrrit-wm>	 (03PS6) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) 
[15:11:32] <bd808>	 poor l10nupdate gets broken for weeks at a time too often.
[15:11:55] <bd808>	 a sign that either a) we don't care or b) the weekly train makes it less useful than it once was
[15:12:01] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Remove labvirt1003 from the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/289871 (https://phabricator.wikimedia.org/T135850) 
[15:12:03] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Nova: Decrease disk_allocation_ratio [puppet] - 10https://gerrit.wikimedia.org/r/289872 
[15:12:05] <grrrit-wm>	 (03CR) 1020after4: keyholder key cleanup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4)
[15:13:20] <MatmaRex>	 bd808: if you and thciprian.i manage to fix it today, it'd be good to run it. that's probably a nicer thing to do than backport and deploy the patch on friday. :)
[15:13:31] <grrrit-wm>	 (03PS7) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) 
[15:13:48] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Remove labvirt1003 from the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/289871 (https://phabricator.wikimedia.org/T135850) (owner: 10Andrew Bogott)
[15:14:27] <wikibugs>	 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312900 (10Mattflaschen-WMF)
[15:16:07] <godog>	 !log roll-restart cassandra-metrics-collector in eqiad for T135385
[15:16:08] <stashbot>	 T135385: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385
[15:16:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:19:43] <grrrit-wm>	 (03PS1) 10Thcipriani: Fix suoders permissions for l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/289874 
[15:19:55] <thcipriani>	 ^ bd808 ought to fix it.
[15:21:15] <grrrit-wm>	 (03PS2) 10BryanDavis: Fix suoders permissions for l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/289874 (https://phabricator.wikimedia.org/T135849) (owner: 10Thcipriani)
[15:21:30] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 031] Fix suoders permissions for l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/289874 (https://phabricator.wikimedia.org/T135849) (owner: 10Thcipriani)
[15:22:30] <_joe_>	 bd808: need me to take a look?
[15:22:40] <bd808>	 _joe_: that would be awesome
[15:22:59] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Fix suoders permissions for l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/289874 (https://phabricator.wikimedia.org/T135849) (owner: 10Thcipriani)
[15:23:02] <bd808>	 I was just trying to decide if it was too late in your day for a ping :)
[15:23:34] <bd808>	 If you could force a run of that on tin it would be greatly appreciated
[15:23:42] <_joe_>	 it will take ~ half an hour to complete.
[15:23:49] <bd808>	 *nod*
[15:23:54] <paladox>	 It seems https://integration.wikimedia.org/zuul/ is backed up
[15:24:04] <paladox>	 because https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/43601/console has frozen
[15:24:13] <Dereckson>	 Once completed, is there a way to manually trigger the l10nupdate task?
[15:24:39] <bd808>	 Dereckson: yeah, any deployer can run the script on tin
[15:24:45] <paladox>	 Could someone abort https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/43601/console please.
[15:25:09] <_joe_>	 paladox: why?
[15:25:09] <bd808>	 paladox: probably a better conversation for #wikimedia-releng
[15:25:19] <_joe_>	 yeah, that too
[15:25:19] <paladox>	 _joe_ because it is frozen
[15:25:25] <urandom>	 godog: verdict?
[15:25:37] <paladox>	 Ok
[15:26:04] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312919 (10Jdlrobson) This all sounds great and I love that its generic and can be reused again!  A few clarifications - if I'm understanding correctly experiments would be conf...
[15:27:53] <wikibugs>	 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312924 (10jcrespo) -1 disagreeing with the solution.  This can happen also on master failover-which your solution will not protect against. The right fix is not a complex server-side...
[15:28:16] <wikibugs>	 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312929 (10jcrespo)
[15:32:07] <godog>	 urandom: yeah it looks like stall/drops decrease but likely the pause between stop/start isn't long enough to fully drain the queue, was worth a try!
[15:33:02] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312939 (10dr0ptp4kt) Quick question: how does this guarantee bucketing across browser restart?
[15:34:09] <wikibugs>	 06Operations, 07HHVM, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2312962 (10Joe) >>! In T86096#2186298, @Joe wrote: > Since we are at the point where there are no precise machines left running php, we should really build HHVM with libicu52 and m...
[15:34:20] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312967 (10Nuria) Non session cookies are kept after browser restarts, with an expiration set of 30 days (like last access cookie) the cookie is available.
[15:35:40] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312971 (10dr0ptp4kt) I should note in the concrete cases: persistence across browser restart is probably not as important for lazy loaded images, whereas persistence across bro...
[15:36:13] <urandom>	 godog: seems like it would take a stall to have them start piling up in the first place, though
[15:36:41] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312972 (10dr0ptp4kt) @nuria, should we add that to the Description as acceptance criteria?
[15:37:10] <urandom>	 godog: granted, that will on exacerbate matters, but it sounds like there is still something going on carbon-side
[15:38:27] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: Connection refused Gehel data import in progress
[15:38:33] <urandom>	 s/will on/will only/
[15:38:42] <godog>	 urandom: yeah I'm not 100% convinced either there isn't something else going on, the periodicity is oddly precise
[15:39:10] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312984 (10BBlack) >>! In T135762#2312919, @Jdlrobson wrote: > A few clarifications - if I'm understanding correctly experiments would be configured in puppet?  That would be th...
[15:44:19] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312986 (10BBlack) >>! In T135762#2312939, @dr0ptp4kt wrote: > Quick question: how does this guarantee bucketing across browser restart?  >>! In T135762#2312967, @Nuria wrote: >...
[15:51:50] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313004 (10Nuria) >Since the binning is done independently of actual experiments (the binning is live all the time for all cookie-enabled agents), this actually is a problem, I...
[15:52:54] <jynus>	 !log performing schema change on s5 T130692
[15:52:55] <stashbot>	 T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692
[15:53:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:09:23] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Nova: Decrease disk_allocation_ratio [puppet] - 10https://gerrit.wikimedia.org/r/289872 
[16:11:28] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.23) (duration: 20990m 58s)
[16:11:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:11:36] <bd808>	 ^ lol
[16:11:43] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Nova: Decrease disk_allocation_ratio [puppet] - 10https://gerrit.wikimedia.org/r/289872 (owner: 10Andrew Bogott)
[16:11:57] <bd808>	 that was a hung ssh for a l10nupdate that I jsut killed on tin
[16:12:56] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313016 (10BBlack) Right, but if the user deletes cookies or goes incognito, that's probably a rare event for most, and possibly associated with not re-using browser cache acros...
[16:14:31] <Dereckson>	 bd808: that's a lot of hours ago
[16:14:43] <bd808>	 1.27.0-wmf.23 is a clue :)
[16:15:33] <wikibugs>	 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2307853 (10madhuvishy) Noting that analytics-privatedata-users also gives Hadoop access...
[16:15:39] <wikibugs_>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313020 (10Krinkle) I think 1-100 might be a bit small. Especially considering our scale and considering most of our experiments will not have been load tested very much.  For i...
[16:18:36] <bd808>	 !log kicking off manual l10nupdate run on tin
[16:18:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:20:36] <grrrit-wm>	 (03PS3) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) 
[16:22:02] <icinga-wm>	 PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[16:22:13] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria)
[16:27:03] <grrrit-wm>	 (03CR) 10Ottomata: Cloning analytics.wikimedia.org repo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria)
[16:27:24] <wikibugs_>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313080 (10BBlack) >>! In T135762#2313020, @Krinkle wrote: > [...] > An experiment could start at 1 bucket (0.01%) and work its way up to 10 (0.1%). And if the experiment no lon...
[16:31:14] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313083 (10Nuria) >For comparison, our entire Navigation Timing data used to be based on 0.01% sampling. It is now tuned up to 0.1% (1:1000 sample; >$wgNavigationTimingSamplingF...
[16:32:32] <icinga-wm>	 PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[16:38:29] <wikibugs>	 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313093 (10BBlack) >>! In T135762#2313083, @Nuria wrote:. > I know you know this but just clarifying that we do not have these restrictions here though,  the restrictions come f...
[16:38:44] <grrrit-wm>	 (03PS4) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) 
[16:40:47] <logmsgbot>	 !log bd808@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 10m 06s)
[16:40:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:41:14] <bd808>	 Dereckson: ^ do you know which messages to check?
[16:41:43] * Dereckson is looking.
[16:41:46] <grrrit-wm>	 (03CR) 10Thcipriani: "Looking good, moves in the direction I think we'd like to go." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4)
[16:42:05] <Dereckson>	 Danny_B: ping?
[16:42:30] <Dereckson>	 Danny_B: does https://cs.wiktionary.org/wiki/Speci%C3%A1ln%C3%AD:Co_odkazuje_na?target=Modul%3AQuote%2Ftools&namespace=10&title=Speci%C3%A1ln%C3%AD%3ACo_odkazuje_na look good to you?
[16:45:29] <bd808>	 there are at least links there (so the $1 is back in the message)
[16:45:34] <bd808>	 https://pl.wikipedia.org/wiki/Specjalna:Linkuj%C4%85ce/A has them too
[16:46:25] <Dereckson>	 bd808: https://gerrit.wikimedia.org/r/#/c/289802/ is well put into consideration, yes
[16:46:51] <icinga-wm>	 RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[16:46:51] <grrrit-wm>	 (03PS5) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) 
[16:49:41] <icinga-wm>	 PROBLEM - Getent speed check on labstore1001 is CRITICAL: CRITICAL: getent group tools.admin failed
[16:50:26] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Fri May 20 16:50:25 UTC 2016 (duration 9m 38s)
[16:50:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:53:33] <Danny_B>	 Dereckson: yes
[16:54:01] <grrrit-wm>	 (03PS8) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) 
[16:54:18] <nuria_>	 hello! can anyone here tell me how client connections and varnish return codes are reported to graphana? https://grafana.wikimedia.org/dashboard/db/client-connections 
[16:54:47] <bd808>	 !log Cleaned up /tmp/mw-cache-1.27.0-wmf.2* cache files on tin
[16:54:53] <nuria_>	 and https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes
[16:54:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:55:08] <nuria_>	 are those reported directly from varnish into statsd?
[16:56:32] <icinga-wm>	 RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[16:57:18] <bblack>	 nuria_: those and several similar stats flow from varnishd -> VSM -> (various python scripts) -> statsd
[16:57:52] <bblack>	 (well, in the client-connections and tls-ciphers cases, from nginx -> varnishd -> ...)
[16:57:59] <nuria_>	 bblack: aham ..'vsm' stands out for ...?
[16:58:16] <bblack>	 shared memory log, which varnishd writes to and the various daemons read from
[16:58:55] <bblack>	 the relatively-new https://grafana-admin.wikimedia.org/dashboard/db/varnish-caching comes from that sort of pipeline, too
[17:00:37] <nuria_>	 bblack: i see, as far as i can see there are no varnish stats per endpoint as in 'restbase apis' versus 'english wikipedia', correct?
[17:00:45] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria)
[17:00:59] <mutante>	 does anyone know about exim mail routers. specifically, in something like "templates/exim/exim4.conf.mx.erb:route_list = * magnesium.wikimedia.org byname", i know that can be a list of hosts and not just one, but can we add it in a way that doesnt just make it a backup or randomizes but makes it always route it to BOTH at the same time?  i have looked at docs but it's all about trying one from the list until one works and then stopping it s
[17:01:04] <bblack>	 nuria_: no, we don't log sort of detail, via these sorts of pipelines
[17:01:24] <bblack>	 nuria_: we do have some per-backend stats that can filter RB-vs-MW and such, but that's on the backside of the caches, not the front.
[17:01:38] <grrrit-wm>	 (03PS6) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) 
[17:01:43] <bblack>	 nuria_: and obviously, upload.wm.o and maps.wm.o have distinct clusters from the primary wikis, so they're inherently separated
[17:01:54] <nuria_>	 bblack: got it thank you
[17:02:00] <mutante>	 if that cut off, somebody said there is an "unseen" router option that makes it possible to continue with a second router as if the first never happened or so?
[17:05:46] <moritzm>	 !log uploaded librsvg 2.40.2-1+wm2 for trusty-wikimedia to carbon (backported patches from librsvg DSA to our custom trusty build)
[17:05:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:06:50] <wikibugs>	 06Operations: rebase librsvg security fixes - https://phabricator.wikimedia.org/T135804#2313125 (10MoritzMuehlenhoff) 05Open>03Resolved Uploaded librsvg 2.40.2-1+wm2 for trusty-wikimedia to carbon (backported patches from librsvg DSA to our custom trusty build) Uploaded librsvg 2.40.5-1+deb8u2+wmf1 for jessi...
[17:09:19] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.037 second response time on port 9042
[17:21:59] <grrrit-wm>	 (03PS3) 10Dzahn: exim: route mail for RT to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/288721 (https://phabricator.wikimedia.org/T119112) 
[17:23:56] <grrrit-wm>	 (03PS4) 10Dzahn: exim: route mail for RT to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/288721 (https://phabricator.wikimedia.org/T119112) 
[17:27:58] <grrrit-wm>	 (03PS1) 10BryanDavis: l10nupdate: Stop using deprecated refreshCdbJsonFiles script [puppet] - 10https://gerrit.wikimedia.org/r/289886 
[17:30:02] <grrrit-wm>	 (03CR) 10BryanDavis: "I looked and didn't see any explicit sudo grants related to refreshCdbJsonFiles that would also need to be changed. The current script wil" [puppet] - 10https://gerrit.wikimedia.org/r/289886 (owner: 10BryanDavis)
[17:32:10] <grrrit-wm>	 (03PS5) 10Dzahn: exim: route mail for RT to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/288721 (https://phabricator.wikimedia.org/T119112) 
[17:44:27] <grrrit-wm>	 (03PS7) 10Ottomata: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria)
[17:48:45] <grrrit-wm>	 (03CR) 10Rush: [C: 032] labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 (owner: 10Rush)
[17:49:28] <grrrit-wm>	 (03PS8) 10Ottomata: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria)
[17:49:35] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032 V: 032] Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria)
[17:50:51] <grrrit-wm>	 (03PS3) 10Dzahn: install_server: split out reprepro role [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) 
[17:51:24] <grrrit-wm>	 (03CR) 10Dzahn: install_server: split out reprepro role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)
[17:53:34] <grrrit-wm>	 (03CR) 10Dzahn: "yes, doing that and changing the module name, just for the role class, if it stays "role foo::bar" instead of "role foo" it can be moved t" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)
[17:59:39] <grrrit-wm>	 (03PS9) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 
[18:03:56] <grrrit-wm>	 (03PS4) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) 
[18:05:38] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[18:07:22] <grrrit-wm>	 (03PS5) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) 
[18:07:47] <mutante>	 re: icinga-wm: it cant find host labstore2004
[18:07:55] <mutante>	 i would say normal if that is a fresh install, otherwise not
[18:08:00] <mutante>	 runs puppet again
[18:08:08] <grrrit-wm>	 (03PS10) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 
[18:09:28] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[18:11:00] <grrrit-wm>	 (03CR) 10Rush: [C: 032] labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 (owner: 10Rush)
[18:11:58] <chasemp>	 mutante: that host was installed ...a month or so ago
[18:12:03] <chasemp>	 not sure what the deal is but it's at least not new
[18:13:32] <mutante>	 chasemp: hmm, ok. i'm looking for changes 
[18:15:49] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[18:29:14] <grrrit-wm>	 (03PS1) 10Rush: labstore200[1-4] add standard in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/289891 
[18:30:14] <chasemp>	 mutante: labstore2* were getting standard from a role which is not ideal as role changes can then cause this, I had added to eqiad things to compensate but didn't realize fileserver was already on labstore2
[18:30:37] <YuviPanda>	 it shouldn't be (the same role on both, that is) but that's for another day
[18:31:31] <chasemp>	 agreed I'll deal w/ it later
[18:31:43] <chasemp>	 it's all badly arranged
[18:32:38] <mutante>	 chasemp: gotcha! yea, this all makes sense then. standard adds to icinga. it removed them from icinga. that doesnt happen in a single run.. thats why errors.. 
[18:32:57] <mutante>	 thanks, yep
[18:33:57] <grrrit-wm>	 (03CR) 10Rush: [C: 032] labstore200[1-4] add standard in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/289891 (owner: 10Rush)
[18:35:49] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Don't include labsdb hosts in the 'labs' group [puppet] - 10https://gerrit.wikimedia.org/r/289892 
[18:39:42] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] ":) yea, that explains what i saw on neon! thanks" [puppet] - 10https://gerrit.wikimedia.org/r/289892 (owner: 10Andrew Bogott)
[18:40:29] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Don't include labsdb hosts in the 'labs' group [puppet] - 10https://gerrit.wikimedia.org/r/289892 (owner: 10Andrew Bogott)
[18:54:54] <grrrit-wm>	 (03PS1) 10Rush: labstore::fileserver remove use_ldap option [puppet] - 10https://gerrit.wikimedia.org/r/289898 (https://phabricator.wikimedia.org/T126083) 
[18:55:38] <grrrit-wm>	 (03PS2) 10Rush: labstore::fileserver remove use_ldap option [puppet] - 10https://gerrit.wikimedia.org/r/289898 (https://phabricator.wikimedia.org/T126083) 
[19:03:47] <grrrit-wm>	 (03PS6) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) 
[19:11:34] <grrrit-wm>	 (03PS7) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) 
[19:12:42] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[19:13:22] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1676 bytes in 0.217 second response time
[19:14:30] <grrrit-wm>	 (03CR) 10Rush: [C: 032] labstore::fileserver remove use_ldap option [puppet] - 10https://gerrit.wikimedia.org/r/289898 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush)
[19:19:23] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1670 bytes in 0.205 second response time
[19:22:17] <grrrit-wm>	 (03PS1) 10Rush: labs nfs scratch and dumps to soft mode [puppet] - 10https://gerrit.wikimedia.org/r/289903 (https://phabricator.wikimedia.org/T126083) 
[19:22:32] <grrrit-wm>	 (03PS2) 10Rush: labs nfs scratch and dumps to soft mode [puppet] - 10https://gerrit.wikimedia.org/r/289903 (https://phabricator.wikimedia.org/T126083) 
[19:30:46] <chasemp>	 jenkins on vaca?
[19:33:59] <grrrit-wm>	 (03CR) 10Rush: [C: 032 V: 032] labs nfs scratch and dumps to soft mode [puppet] - 10https://gerrit.wikimedia.org/r/289903 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush)
[19:46:57] <icinga-wm>	 PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:50:44] <wikibugs>	 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2313704 (10Mattflaschen-WMF) Yeah, in principle I support fixing it in core as described.  Your proposed solution is similar to RevisionDelete (which is already in core), but...
[19:52:31] <wikibugs>	 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2313718 (10Mattflaschen-WMF)
[19:52:45] <wikibugs>	 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312863 (10Mattflaschen-WMF)
[19:59:17] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 031] "Good catch—thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/289886 (owner: 10BryanDavis)
[20:11:36] <icinga-wm>	 RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[20:15:35] <wikibugs>	 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2313733 (10scfc)
[20:36:02] <chasemp>	 !log restart rabbitmq on labcontrol1001
[20:36:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:58:57] <grrrit-wm>	 (03PS9) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) 
[21:00:16] <icinga-wm>	 PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[21:05:16] <icinga-wm>	 PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[21:10:16] <icinga-wm>	 RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[21:17:59] <grrrit-wm>	 (03PS1) 10Eevans: Update collector version (both branches) [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/289963 
[21:24:57] <grrrit-wm>	 (03PS1) 10Rush: wip: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 
[21:28:44] <grrrit-wm>	 (03PS1) 10Eevans: Updated cassandra-metrics-collector version(s) [puppet] - 10https://gerrit.wikimedia.org/r/289965 
[21:31:23] <grrrit-wm>	 (03PS8) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) 
[21:34:13] <grrrit-wm>	 (03PS2) 10Rush: wip: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 
[22:25:23] <wikibugs>	 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2314347 (10bd808) @Steinsplitter reported to me on irc that > for protocol relative urls in mwclient, scheme='https' must be set in the config to enable https o...
[22:51:31] <wikibugs>	 06Operations, 10Mail: administrative rights for GLAM@ - https://phabricator.wikimedia.org/T135874#2314374 (10Krenair) I believe ops have a cron set up on the mail servers to send a copy of the (private) @wikimedia.org aliases file to officeit@wikimedia.org every week.
[23:02:12] <wikibugs>	 06Operations, 10Mail: administrative rights for GLAM@ - https://phabricator.wikimedia.org/T135874#2314378 (10eliza) Hmm.....yes - I actually checked the log and an alias glam@wikimedia.org does not exist.  What's curious is that I sent a test to glam@wikimedia.org and did not receive a bounce back or reply.  I...
[23:02:54] <wikibugs>	 06Operations, 10Mail: administrative rights for GLAM@ - https://phabricator.wikimedia.org/T135874#2313706 (10Dzahn) **glam@wikimedia.org** does not appear in the alias file controlled by ops.  the mail server tells me it's controlled by OTRS  [mx1001:~] $ sudo exim4 -bt glam@wikimedia.org glam@wikimedia.org...
[23:04:13] <wikibugs>	 06Operations, 10Mail: administrative rights for GLAM@ - https://phabricator.wikimedia.org/T135874#2314383 (10Dzahn) @eliza mail to glam@ goes into  https://meta.wikimedia.org/wiki/OTRS
[23:05:09] <grrrit-wm>	 (03PS1) 10BryanDavis: toollabs: Replace trusty PHP5 session cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/289973 (https://phabricator.wikimedia.org/T135861) 
[23:18:19] <ori>	 bd808: why aren't you using the 'cron' resource directly?
[23:18:40] <bd808>	 ori: ... good question
[23:19:10] <bd808>	 there is an existing file that php5-common installs I'd need to ensure absent
[23:19:23] <bd808>	 or I can just overwrite like this
[23:21:19] <ori>	 apt will prompt you about a config file conflict every time you upgrade php5
[23:22:11] <ori>	 I think doing it this way is fine, but I think you should add a comment to the Puppet manifest saying you are overwriting a file that ships with the package
[23:22:22] <bd808>	 *nod*
[23:22:53] <ori>	 shellcheck has some recommendations too: https://dpaste.de/MHkJ/raw
[23:23:14] <ori>	 I realize this came from upstream, so up to you if you want to apply them in our version, not apply them at all, or submit them upstream, or whatever
[23:23:20] <ori>	 happy to +2
[23:23:58] <ori>	 I think the "-- SC2016: Expressions don't expand in single quotes, use double quotes for that." is bogus
[23:24:11] <ori>	 it is seeing the '$k => $v' and assuming you wanted those to be expanded by the shell
[23:24:48] <ori>	 the rest seem legit
[23:24:51] <bd808>	 "Prefer [ p ] && [ q ]" is a bash-ism
[23:25:06] <bd808>	 isn't it?
[23:25:18] <ori>	 [[ ]] is a bashism, I don't think && is
[23:25:33] <bd808>	 Maybe I'm thinking of [[ p && q ]]
[23:25:42] <ori>	 that's definitely a bashism yeah
[23:26:08] <bd808>	 "fixing" Debian's shell seems nit picky :)
[23:26:31] <ori>	 yeah maybe it's best to just leave it alone
[23:26:34] <grrrit-wm>	 (03PS2) 10BryanDavis: toollabs: Replace trusty PHP5 session cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/289973 (https://phabricator.wikimedia.org/T135861) 
[23:26:42] <bd808>	 comment added
[23:28:11] <ori>	 "lsof check used in the Ubuntu stock script can hang indefinitely" -- in Hebrew "ein sof" (or nsof) means "without end" or forever.
[23:28:18] <ori>	 funny coincidence
[23:28:39] <bd808>	 or deep conspiracy? ;)
[23:28:48] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] toollabs: Replace trusty PHP5 session cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/289973 (https://phabricator.wikimedia.org/T135861) (owner: 10BryanDavis)
[23:29:21] <ori>	 merged
[23:29:44] <bd808>	 cool. I'll for a couple of puppet runs
[23:29:49] <bd808>	 *force
[23:34:34] <grrrit-wm>	 (03PS9) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) 
[23:35:42] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)
[23:41:17] <grrrit-wm>	 (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/2866/carbon.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn)
[23:47:19] <grrrit-wm>	 (03PS10) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) 
[23:56:18] <grrrit-wm>	 (03PS11) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757)