[00:01:58] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 93 % full
[00:03:49] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 58 % full
[00:05:36] !log mwscript deleteEqualMessages.php --wiki maiwiki
[00:05:38] Reedy: i'd like to reinstall bast1001 soonish, i noticed your home dir is 53G, it's no problem to copy it back but all needed?
[00:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:07:44] RoanKattouw: still need /home/catrope/oldlogs ? ^ for the same reason
[00:07:57] On bast1001?
[00:08:01] yes
[00:09:07] Hah, I didn't realize I had 3 gigs of old log files from the wtp* servers lying around from back in 2013
[00:09:21] gwicke: ^ same wtp* files there for you i think
[00:09:39] RoanKattouw: :) not a big deal to copy them all but i thought it'd be old, yea
[00:10:28] thanks
[00:10:28] That may have been from that outage caused by a bug in the rewritten logging backend that caused disk space exhaustion on all wtp* servers around the same time
[00:11:11] *nod* and in the other case it's the videos from devsummit or wikimania
[00:11:28] Yeah those can be deleted too, they've probably been imported to Commons already
[00:12:58] i'm not sure that actually happened
[00:13:16] keep seeing ancient tickets about that
[00:13:38] like we had a ticket for the new videos before we managed to upload the ones from the year before
[00:14:18] yea 2014 is still open https://phabricator.wikimedia.org/T84465
[00:14:38] this too https://phabricator.wikimedia.org/T106038
[00:14:46] not even surprised
[00:15:31] "the output of the conversion is larger than swift can handle, and as a result can't be uploaded to commons. resolving this task. " of course that resolves it
[00:25:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[00:27:58] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 54 % full
[00:35:08] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[00:38:38] !log mwscript deleteEqualMessages.php --wiki newiki
[00:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:06:42] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 390.36 seconds
[01:18:58] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: puppet fail
[01:19:58] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[01:31:58] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 73.33% of data above the critical threshold [5000000.0]
[01:37:56] anyone doing anything on db1048 (m3)
[01:38:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[01:40:49] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 27 % full
[01:41:13] volans: nope...
[01:42:00] because the replica is just stopped
[01:42:21] hmm
[01:42:32] * YuviPanda doesn't know what to do / help
[01:42:33] and I can see a connection with user root from dbstore1001 that is doing basically a dump
[01:42:50] on phabricator_repository
[01:42:51] volans: ah, I think you've maybe discovered the strange ways of the eventlogging replication
[01:42:54] oh
[01:43:01] not that then.
[01:43:22] you could kill it and see if it recovers.
[01:43:33] * YuviPanda doesn't know what he's saying or doing
[01:43:42] I'm just here for the company :)
[01:43:57] I can see a peak every day at this time in bytes sent, so looks like a daily job
[01:44:28] * volans checking icinga for past alerts
[01:46:10] and now it was started again
[01:46:39] so yes we have a script that stops the replica and does a dump... it would be so kind if it put a downtime on the icinga alert too :)
[01:47:09] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[01:47:52] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.12 seconds
[01:50:18] PROBLEM - MariaDB Slave Lag: x1 on db2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.20 seconds
[01:51:42] PROBLEM - MariaDB Slave Lag: x1 on db1031 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 398.22 seconds
[01:51:44] PROBLEM - MariaDB Slave Lag: x1 on db2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 403.12 seconds
[01:51:53] yep, x1 turn
[01:52:02] is this just dumps?
[01:52:06] yes
[01:52:18] PROBLEM - puppet last run on mw1080 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[01:52:32] ok!
[01:52:55] not really, they shouldn't alarm IMHO
[01:53:01] volans: you're in the sf evening dead zone, where usually I'm the only one around. I don't know much about databases, but happy to help in whatever way I can if help is needed.
[01:53:20] nothing to do really
[01:53:28] thanks for the offer
[01:53:55] ok!
[01:54:09] RECOVERY - MariaDB Slave Lag: x1 on db2008 is OK: OK slave_sql_lag Replication lag: 0.23 seconds
[01:55:32] RECOVERY - MariaDB Slave Lag: x1 on db1031 is OK: OK slave_sql_lag Replication lag: 0.28 seconds
[01:55:34] RECOVERY - MariaDB Slave Lag: x1 on db2009 is OK: OK slave_sql_lag Replication lag: 0.41 seconds
[01:55:58] crontab completed, last one was x1
[01:58:15] checked with irc logs, it's pretty regular each week
[02:04:02] opened a phab task for tracking
[02:12:44] (03PS1) 10BryanDavis: Labs: Preserve env for vagrant commands [puppet] - 10https://gerrit.wikimedia.org/r/283118 (https://phabricator.wikimedia.org/T120186)
[02:13:09] bd808: let me know if you want a merge
[02:13:30] YuviPanda: I'm going to test it out and see if it actually works :)
[02:13:41] but hopefully I'll poke you soon
[02:14:41] bd808: ok!
[02:14:47] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[02:16:28] (03CR) 10BryanDavis: "Tested via cherry-pick. Works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/283118 (https://phabricator.wikimedia.org/T120186) (owner: 10BryanDavis)
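The weekly dump job discussed above stops replication on db1048 and trips the MariaDB slave-lag alerts; volans notes it would be friendlier if the job scheduled an Icinga downtime first. A minimal sketch of how a dump wrapper could do that through Icinga's external command file, using the standard SCHEDULE_SVC_DOWNTIME command; the command-file path and the author/comment strings are assumptions, not the actual job:

    import time

    # Hypothetical path; the real location depends on the Icinga installation.
    ICINGA_CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"

    def schedule_downtime(host, service, minutes, author, comment):
        """Schedule a fixed service downtime before a noisy maintenance job."""
        now = int(time.time())
        end = now + minutes * 60
        # SCHEDULE_SVC_DOWNTIME;<host>;<service>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
        cmd = (f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
               f"{now};{end};1;0;{minutes * 60};{author};{comment}\n")
        with open(ICINGA_CMD_FILE, "a") as f:
            f.write(cmd)

    # Example: silence the slave-lag check on db1048 for the duration of the weekly dump.
    schedule_downtime("db1048", "MariaDB Slave Lag: m3", 120,
                      "dump-cron", "weekly dump stops replication")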
[02:16:36] YuviPanda: ^ looks good
[02:16:37] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 42 % full
[02:16:58] (03CR) 10Yuvipanda: [C: 032] Labs: Preserve env for vagrant commands [puppet] - 10https://gerrit.wikimedia.org/r/283118 (https://phabricator.wikimedia.org/T120186) (owner: 10BryanDavis)
[02:17:17] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[02:27:48] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 11m 56s)
[02:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:48] PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:39:37] RECOVERY - Hadoop NodeManager on analytics1046 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:53:05] (03PS3) 10Yuvipanda: tools: Don't track mediawiki's font list for precise [puppet] - 10https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282)
[02:53:12] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Don't track mediawiki's font list for precise [puppet] - 10https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda)
[03:03:18] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[03:05:08] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 24 % full
[03:14:25] (03PS1) 10Yuvipanda: tools: Remove non-precise fonts from precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/283121 (https://phabricator.wikimedia.org/T132282)
[03:14:49] (03CR) 10jenkins-bot: [V: 04-1] tools: Remove non-precise fonts from precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/283121 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda)
[03:27:18] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[03:29:08] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 30 % full
[03:29:16] 06Operations, 06Commons: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#2202028 (10Pokefan95) p:05Triage>03High
[03:38:38] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 94 % full
[03:40:19] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 62 % full
[03:46:57] (03PS2) 10Yuvipanda: tools: Remove non-precise fonts from precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/283121 (https://phabricator.wikimedia.org/T132282)
[03:51:27] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 94 % full
[03:53:18] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 28 % full
[04:02:47] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 96 % full
[04:04:37] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 57 % full
[04:05:41] (03CR) 10Yuvipanda: [C: 032] tools: Remove non-precise fonts from precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/283121 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda)
[04:30:29] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:41:59] 06Operations, 10Traffic, 07HTTPS: HTTPS error on status.wikimedia.org (watchmouse certificate mismatch) - https://phabricator.wikimedia.org/T131017#2202093 (10Pokefan95) p:05Triage>03Normal
[04:43:37] RECOVERY - Hadoop NodeManager on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[04:50:48] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 93 % full
[04:52:39] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 57 % full
[04:58:28] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: puppet fail
[05:09:04] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202098 (10Dzahn)
[05:11:22] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202114 (10Pokefan95) p:05Triage>03Normal
[05:12:06] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202115 (10Pokefan95) p:05Triage>03Normal
[05:14:08] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202116 (10Dzahn) looking at the config i already see: 13 Header always set Strict-Transport-Security "max-age=604800" isn't it already enabled?
[05:14:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 93 % full
[05:15:30] ori: sorry, I was sleeping
[05:15:58] btw bblack and ori, my bot went back editing at 1:38 CEST
[05:16:49] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 54 % full
[05:16:50] actually I can now reach everything
[05:20:37] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202117 (10Dzahn) 05Open>03Invalid already resolved/invalid it's enabled and *.planet. uses use standard cache cluster termination, it's misc-web, besides having a separate wildcard cert,...
[05:22:26] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202119 (10Dzahn) @Pokefan do me a favor and update https://wikitech.wikimedia.org/wiki/HTTPS/domains ? can't login on wikitech due to lack of second factor
[05:26:19] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[05:29:06] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202121 (10Pokefan95) @Dhann: Doing...
[05:29:23] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202123 (10Dzahn)
[05:29:25] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2202122 (10Dzahn)
[05:31:00] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#1146945 (10Dzahn)
[05:31:02] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202098 (10Dzahn) 05Invalid>03Resolved @Pokefan95 thank you , then there was actually something to resolve, heh
[05:35:13] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202129 (10Dzahn) the change that enabled this was https://gerrit.wikimedia.org/r/#/c/253758/ on 2015-11-18
[05:36:41] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202130 (10Pokefan95) @Dhann: For now, I just changed it from "No" to "Yes". What is the duration of the HSTS?
[05:39:10] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202131 (10Dzahn) it's max-age=31536000 (from https://www.ssllabs.com/ssltest/analyze.html?d=es.planet.wikimedia.org&s=208.80.153.248) so that means [[ https://duckduckgo.com/?q=31536000+seco...
[05:39:17] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 97 % full
[05:39:30] 06Operations, 10Traffic, 07HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543#2202132 (10Pokefan95) Ah, ok, thanks
[05:41:07] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 51 % full
[06:03:28] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[06:04:53] analytics1057 and 1046 (node manager down) were probably having issues for stuff like "CPU 14 THERMAL EVENT TSC 6c7bed1096c482"
[06:05:01] 06Operations, 07Puppet, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#1847021 (10mmodell) Is this really difficult to do? I'm very interested in fixing this but not at all sure where to start.
[06:05:02] that is not the first analytics* host
[06:05:04] sigh
[06:05:18] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 39 % full
[06:07:01] ---^ tons of TIME_WAITs (a lot towards rdb hosts)
[06:07:49] also, [Wed Apr 13 05:38:08 2016] nf_conntrack: table full, dropping packet in the dmesg, might want to check to avoid problems with job runners
[06:08:06] moritzm: --^ (morning :)
[06:08:40] (brb)
[06:21:28] RECOVERY - Updater process on wdqs1002 is OK: PROCS OK: 1 process with UID = 998 (blazegraph), regex args ^java .* org.wikidata.query.rdf.tool.Update
[06:27:28] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 92 % full
[06:29:19] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 62 % full
[06:30:10] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[06:32:08] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:29] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:07] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 99 % full
[06:37:14] <_joe_> why only that server
[06:37:18] <_joe_> goddamnit
[06:38:48] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 60 % full
[06:46:54] 06Operations, 10Traffic, 07HTTPS: Preload HSTS for select hostnames within wikimedia.org - https://phabricator.wikimedia.org/T111967#2202184 (10BBlack) Yeah, I'm in the process of enumerating those....
[06:47:58] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 99 % full
[06:48:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[06:51:37] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 32 % full
[06:56:18] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:57:28] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:59] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:58] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 96 % full
[07:02:49] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 55 % full
[07:11:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 97 % full
[07:15:38] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 28 % full
[07:21:52] 06Operations, 10ops-eqiad, 06Analytics-Kanban: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2202237 (10elukey)
[07:22:59] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 99 % full
[07:26:22] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202245 (10BBlack) ========= //Audit Data// ========= Methodology: ----------------- The starting point is our raw D...
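The HSTS audit comment above and the *.planet.wikimedia.org thread earlier both come down to what Strict-Transport-Security header a site actually serves: the apache config showed max-age=604800 (one week) while ssllabs reported max-age=31536000 (one year; 31536000 seconds is 365 days). A small sketch for checking the header directly, assuming the third-party requests library is available:

    import re
    import requests

    def hsts_max_age_days(url):
        """Return the HSTS max-age served by a site, in days, or None if absent."""
        resp = requests.get(url, timeout=10)
        hsts = resp.headers.get("Strict-Transport-Security")
        if not hsts:
            return None
        match = re.search(r"max-age=(\d+)", hsts)
        return int(match.group(1)) / 86400 if match else None

    # e.g. 365.0 for max-age=31536000, 7.0 for max-age=604800
    print(hsts_max_age_days("https://es.planet.wikimedia.org/"))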
[07:26:38] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 33 % full
[07:26:48] !log temporarily bumped connection tracking table size on mw1163 to 512k (randomly spiking)
[07:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:27:40] elukey: I'll add the diamond collectors we applied to the kafka brokers to the job runners so that we have better data
[07:27:50] (03CR) 10DCausse: [C: 031] Convert mwgrep to use regexp by default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283107 (owner: 10EBernhardson)
[07:30:20] moritzm: I was about to ask the same :)
[07:34:44] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202250 (10BBlack) While we should fix all of these issues in the long term (they should all be 301->https on the same...
[07:43:16] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202254 (10BBlack) As for the rest of the work, IMHO we should re-purpose the wiki tracking page at https://wikitech.w...
[07:48:23] 06Operations, 10ops-eqiad, 10DBA: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515#2202271 (10Volans) For reference I found it looking at a small spike in connections errors from here: https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError
[07:56:37] 06Operations: rsync module doesnt work on trusty - https://phabricator.wikimedia.org/T132532#2201690 (10MoritzMuehlenhoff) There are plenty of trusty systems with rsync::server, e.g. the swift-be systems.
[08:01:44] (03CR) 10Stats: [C: 031] varnishmedia: ignore transactions without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[08:02:35] (03CR) 10Elukey: [C: 031] "The Stats review was me logged in Gerrit with a different user, sorry :)" [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[08:05:02] volans: cmjonhson has said previously there has been a batch that has been overheating and needing new thermal paste
[08:05:44] (03CR) 10BBlack: [C: 031] varnishmedia: ignore transactions without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[08:06:37] p858snake|L_: thanks, you mean a bunch of servers from the same order?
[08:06:58] no idea
[08:07:44] There was the phab host recently (iridium?) and I have seen passing reports about others
[08:07:51] but haven't paid attention
[08:08:35] yes, I'm aware of them, I was chatting with chris yesterday, he will take a look today if it's an hot spot on the rack or not
[08:11:00] (03PS1) 10Elukey: Add nfconntrack and TCP States diamond collectors to Jobrunners. [puppet] - 10https://gerrit.wikimedia.org/r/283129
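The recurring mw1163 alerts above compare the live conntrack entry count against the table limit; the fix applied at 07:26 was to raise nf_conntrack_max to 512k. A minimal sketch of the same fullness calculation using the standard /proc/sys/net/netfilter counters (the 90% threshold mirrors the alert level seen here, not the exact Icinga plugin; writing the new limit needs root):

    def read_int(path):
        with open(path) as f:
            return int(f.read().strip())

    count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
    limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
    pct = 100.0 * count / limit
    print(f"nf_conntrack is {pct:.0f} % full ({count}/{limit})")

    if pct > 90:
        # Temporary bump, equivalent to `sysctl -w net.netfilter.nf_conntrack_max=524288`;
        # a permanent change would belong in sysctl.d / puppet instead.
        with open("/proc/sys/net/netfilter/nf_conntrack_max", "w") as f:
            f.write("524288\n")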
[08:12:31] mutante: errm
[08:13:24] mutante: 7.0M now
[08:13:24] :P
[08:13:58] (03PS1) 10Muehlenhoff: Update to Linux 4.4.7 [debs/linux44] - 10https://gerrit.wikimedia.org/r/283130
[08:17:11] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to Linux 4.4.7 [debs/linux44] - 10https://gerrit.wikimedia.org/r/283130 (owner: 10Muehlenhoff)
[08:17:58] (03PS3) 10Ema: varnishmedia: ignore transactions without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430)
[08:18:22] (03CR) 10Ema: [C: 032 V: 032] varnishmedia: ignore transactions without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[08:19:38] PROBLEM - Hadoop NodeManager on analytics1054 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[08:19:45] boooooo
[08:20:19] --^ checking
[08:23:48] sad_trombone.wav
[08:23:54] java.lang.OutOfMemoryError: GC overhead limit exceeded
[08:24:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[08:25:44] (03PS5) 10Filippo Giunchedi: write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 (owner: 10Eevans)
[08:25:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 (owner: 10Eevans)
[08:26:59] RECOVERY - Hadoop NodeManager on analytics1054 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[08:27:16] godog: .ogg >.>
[08:28:32] p858snake|L_: hehe wav felt like a touch of old
[08:28:35] Duplicate declaration: File[/etc/cassandra-instances.d] is already declared in file /etc/puppet/modules/cassandra/manifests/instance.pp:214; cannot redeclare at /etc/puppet/modules/cassandra/manifests/instance.pp:214
[08:28:39] thanks puppet
[08:30:57] (03CR) 10Muehlenhoff: [C: 031] Add nfconntrack and TCP States diamond collectors to Jobrunners. [puppet] - 10https://gerrit.wikimedia.org/r/283129 (owner: 10Elukey)
[08:30:57] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: puppet fail
[08:36:36] (03PS2) 10Elukey: Add nfconntrack and TCP States diamond collectors to Jobrunners. [puppet] - 10https://gerrit.wikimedia.org/r/283129
[08:38:42] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/2414/mw1163.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/283129 (owner: 10Elukey)
[08:45:54] (03PS1) 10Filippo Giunchedi: cassandra: don't require cassandra Package for cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283132
[08:48:21] (03PS2) 10Filippo Giunchedi: cassandra: don't require cassandra Package for cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283132
[08:52:27] (03PS3) 10Filippo Giunchedi: cassandra: don't require cassandra Package for cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283132
[08:52:28] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:52:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: don't require cassandra Package for cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283132 (owner: 10Filippo Giunchedi)
[08:53:04] is it the analytics joyful morning today? sigh.. checking aqs..
[08:53:31] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2202415 (10Chmarkine) >>! In T132521#2202254, @BBlack wrote: > As for the rest of the work, IMHO we should re-purpose...
[08:53:53] (03PS1) 10Muehlenhoff: Drop Debian-specific logging fix (clashes with 4.4.7 point update). [debs/linux44] - 10https://gerrit.wikimedia.org/r/283133
[08:54:09] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[08:54:49] (03PS1) 10Volans: MariaDB: use Puppet certs for TLS [puppet] - 10https://gerrit.wikimedia.org/r/283134 (https://phabricator.wikimedia.org/T111654)
[08:55:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] Drop Debian-specific logging fix (clashes with 4.4.7 point update). [debs/linux44] - 10https://gerrit.wikimedia.org/r/283133 (owner: 10Muehlenhoff)
[08:55:17] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:55:28] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:19] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[08:57:48] (03PS1) 10Volans: Depool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283135 (https://phabricator.wikimedia.org/T111654)
[09:02:58] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:04:38] (03CR) 10Muehlenhoff: [C: 04-1] "See comment for proposed fix" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282761 (owner: 10Chad)
[09:07:08] (03CR) 10Volans: "All changes looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/283134 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:17:38] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[09:19:51] (03CR) 10Gehel: [C: 031] "Looks good and there should be no issues now that T128813 is fixed. I'd like to wait for the reinstall of wdqs1002 to be done and tested b" [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev)
[09:21:58] !log start upgrading TLS for cross-dc replica on shards s6 and s7 - T111654
[09:21:59] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654
[09:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:22:43] (03CR) 10Volans: [C: 032] Depool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283135 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:22:47] (03CR) 10Gehel: Simplification of Cassandra Logstash filtering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval)
[09:23:07] (03Merged) 10jenkins-bot: Depool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283135 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:25:42] (03PS1) 10Volans: Repool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283139 (https://phabricator.wikimedia.org/T111654)
[09:29:03] I had some issues with scap regarding some mw hosts anyone can give me an hand? _joe_ godog ?
[09:29:59] volans: sure, what problems?
[09:30:03] https://phabricator.wikimedia.org/P2891
[09:30:15] seems that there is an mw with a RO filesystem :/
[09:30:56] <_joe_> volans: so we need to depool it
[09:31:19] but which one is? the command that failed has a bunch of them
[09:31:45] <_joe_> yeah I was trying to understand that
[09:31:59] mw1080, https://phabricator.wikimedia.org/T132529
[09:32:11] salt "touch /src/__test" ? :D
[09:32:19] s/src/srv/
[09:33:09] oh yes godog it says "on mw1080.eqiad.wmnet returned" at the end of the line
[09:33:17] <_joe_> so just remove it from mediawiki-installation
[09:33:32] <_joe_> and from conftool-data
[09:33:34] never done...
[09:33:55] <_joe_> find puppet -name mediawiki-installation
[09:34:36] remove or comment?
[09:35:07] <_joe_> remove
[09:35:34] <_joe_> it's also already depooled
[09:36:54] (03PS1) 10Volans: MW: remove mw1080, has RO filesystem [puppet] - 10https://gerrit.wikimedia.org/r/283142 (https://phabricator.wikimedia.org/T111654)
[09:37:28] (03CR) 10Giuseppe Lavagetto: [C: 031] MW: remove mw1080, has RO filesystem [puppet] - 10https://gerrit.wikimedia.org/r/283142 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:37:37] (03PS2) 10Volans: MW: remove mw1080, has RO filesystem [puppet] - 10https://gerrit.wikimedia.org/r/283142 (https://phabricator.wikimedia.org/T111654)
[09:37:47] <_joe_> volans: check if it's a scap proxy by any chance
[09:38:30] I've grepped mw1080 in puppet repo and was only there and linux-host-entries of course
[09:38:35] <_joe_> ok
[09:38:38] anywhere else I should check?
[09:38:41] <_joe_> nope
[09:38:52] <_joe_> remember to run puppet-merge with sudo -i
[09:39:04] <_joe_> as this is going to do act on conftool
[09:39:06] 06Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2202510 (10mmodell) Phabricator's support for "High Availability" is making progress recently, see [[ https://secure.phabricator.com/T10751 | upstream task (T10751) ]] for...
[09:39:37] ok, thanks, didn't know
[09:39:54] (03CR) 10Volans: [C: 032] MW: remove mw1080, has RO filesystem [puppet] - 10https://gerrit.wikimedia.org/r/283142 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:40:34] 06Operations, 10Traffic, 13Patch-For-Review, 07Varnish: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2202515 (10ema) 05Open>03Resolved
[09:41:02] _joe_: should I force puppet on tin then?
[09:41:17] <_joe_> volans: if you need to deploy again, yes
[09:41:18] (03PS2) 10Volans: MariaDB: use Puppet certs for TLS [puppet] - 10https://gerrit.wikimedia.org/r/283134 (https://phabricator.wikimedia.org/T111654)
[09:41:22] <_joe_> if not, it's going to run
[09:41:53] I don't know if scap finished properly
[09:44:47] 06Operations, 10Traffic, 07Graphite, 07HTTPS: HTTPS redirects for graphite.wikimedia.org - https://phabricator.wikimedia.org/T132461#2202522 (10fgiunchedi) the most critical I can think of is `check_graphite` which already supports https (not sure about following redirects) ``` $ /usr/lib/nagios/plugins/c...
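godog's suggestion above (salt "touch /srv/__test") is a quick way to find which app server has the read-only filesystem that broke scap. A per-host sketch of that probe, catching the EROFS error a read-only /srv would raise; the path and any orchestration around it (salt, cumin, a for loop over ssh) are assumptions:

    import errno
    import os

    def is_writable(path="/srv"):
        """Try to create and remove a scratch file; False means the mount is read-only."""
        probe = os.path.join(path, "__rw_probe")
        try:
            with open(probe, "w") as f:
                f.write("ok\n")
            os.unlink(probe)
            return True
        except OSError as e:
            if e.errno == errno.EROFS:
                return False
            raise

    if __name__ == "__main__":
        print("writable" if is_writable() else "read-only filesystem, depool me")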
[09:45:50] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2201642 (10Volans) @Dzahn FYI scap failed on that host, I've merged this: https://gerrit.wikimedia.org/r/#/c/283142/
[09:47:28] (03CR) 10Volans: [C: 032] MariaDB: use Puppet certs for TLS [puppet] - 10https://gerrit.wikimedia.org/r/283134 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[09:47:51] <_joe_> so, puppet is failing on aqs1002
[09:48:00] <_joe_> was it converted to multi-instance?
[09:48:15] <_joe_> godog mobrovac ?
[09:48:30] nope
[09:48:44] but that's me, checking
[09:48:52] <_joe_> it laments there is no /etc/cassandra-instances.d directory
[09:49:03] yeah my lame mistake, I'll fix it
[09:49:06] <_joe_> (the error is more convoluted of course)
[09:49:16] _joe_ I was about to check, aqs100[23] had some troubles because of some cassandra timeouts (we are waiting for SSDs)
[09:49:37] ah ok thanks godog :)
[09:52:13] (03PS1) 10Filippo Giunchedi: cassandra: create /etc/cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283144
[09:53:26] still baffled that I can't eyeball a puppet change and tell whether it is going to fail to compile or not
[09:54:40] (03PS2) 10Muehlenhoff: Enable base::firewall for all rdb* servers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/282979
[09:56:37] PROBLEM - Hadoop NodeManager on analytics1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[09:57:18] PROBLEM - Hadoop NodeManager on analytics1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[10:01:36] (03PS2) 10Filippo Giunchedi: cassandra: create /etc/cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283144
[10:02:13] checking analytics hosts
[10:05:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: create /etc/cassandra-instances.d [puppet] - 10https://gerrit.wikimedia.org/r/283144 (owner: 10Filippo Giunchedi)
[10:05:17] RECOVERY - Hadoop NodeManager on analytics1049 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[10:05:26] they all went down for out of memory, sigh, opening a phab yask
[10:05:28] *task
[10:05:33] (03PS3) 10Muehlenhoff: Enable base::firewall for all rdb* servers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/282979
[10:05:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall for all rdb* servers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/282979 (owner: 10Muehlenhoff)
[10:06:16] moritzm: going to merge that too ^
[10:06:36] ok, I was just about to ask you whether I should merge your's along :-)
[10:08:27] RECOVERY - Hadoop NodeManager on analytics1043 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[10:09:38] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2202590 (10elukey)
[10:14:16] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2202603 (10elukey)
[10:17:15] 06Operations, 10Analytics, 10Traffic, 13Patch-For-Review, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2202604 (10ema) One more: Feb 11 09:17:47 cp4010 varnishstatsd[2820]: Traceback (most recent call la...
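The varnishstatsd tracebacks quoted in T132430 above are what motivates the "log ValueError exceptions" patch merged later in the day (r/283179): instead of letting one malformed VSL record kill the daemon, the callback catches the error, logs it, and keeps consuming. A generic sketch of that pattern only; the decorator, callback name, and transaction shape are illustrative, not varnishstatsd's actual code:

    import functools
    import logging

    log = logging.getLogger("varnishstatsd")

    def log_value_errors(callback):
        """Wrap a VSL callback so a single bad record is logged instead of being fatal."""
        @functools.wraps(callback)
        def wrapper(*args, **kwargs):
            try:
                return callback(*args, **kwargs)
            except ValueError:
                log.exception("skipping malformed transaction")
        return wrapper

    @log_value_errors
    def vsl_callback(transaction):
        # Illustrative parsing step: this is the kind of place where the original
        # code raised ValueError on transactions missing a status code.
        return int(transaction["status"])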
[10:23:57] PROBLEM - mediawiki-installation DSH group on mw1080 is CRITICAL: Host mw1080 is not in mediawiki-installation dsh group
[10:24:46] 06Operations, 10Analytics: kafkatee cronspam from oxygen - https://phabricator.wikimedia.org/T132322#2202623 (10elukey) p:05Triage>03Low
[10:27:34] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2102820 (10elukey) Installed on maps hosts by @ema, we will rollout the new version everywhere along wiht the Varnish 4 upgrade.
[10:29:23] 06Operations, 10ops-eqiad: ms-be1001.eqiad.wmnet: slot=8 dev=sdi failed - https://phabricator.wikimedia.org/T132142#2202643 (10fgiunchedi) 05Open>03Resolved thanks @Cmjohnson ! the disk came back in the right order, now rebuiling ``` /dev/sdi1 1.9T 2.2G 1.9T 1% /srv/swift-storage/sdi1 ```
[10:42:06] 06Operations, 10Analytics, 10Traffic, 13Patch-For-Review, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2202664 (10ema) And this one: Mar 18 12:41:35 cp4010 varnishstatsd[10396]: Traceback (most recent ca...
[10:42:34] (03PS1) 10Giuseppe Lavagetto: puppet: add a function for performing conftool lookups [puppet] - 10https://gerrit.wikimedia.org/r/283151
[10:50:56] (03CR) 10Dereckson: "Sure, but the ProofreadPage code to use extension registration is in wmf/1.27.0-wmf.21." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281976 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson)
[10:52:25] (03CR) 10Volans: [C: 032] Repool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283139 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[10:52:48] (03Merged) 10jenkins-bot: Repool db1050 and db1041 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283139 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[10:56:10] !log volans@tin Synchronized wmf-config/db-eqiad.php: Repool db1050 and db1041 after TLS upgrade - T111654 (duration: 00m 42s)
[10:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:04:14] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[11:04:22] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:04:54] ---^ checking
[11:07:44] (03CR) 10Filippo Giunchedi: [C: 04-1] "to be merged once the full switchover is in place (cfr. https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Media_storage.2FSwift )" [puppet] - 10https://gerrit.wikimedia.org/r/282893 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi)
[11:08:08] the problem on stat1003 seems to be libcairo2-dev install
[11:08:20] * elukey wonders what libcairo2-dev is
[11:08:31] elukey: transient or an error?
[11:08:43] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:09:56] volans: error, I am checking now
[11:10:40] 2016-04-13 10:55:21 remove libcairo2-dev:amd64 1.13.0~20140204-0ubuntu1.1
[11:14:33] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:15:12] (03PS3) 10Alexandros Kosiaris: scap: A basic workaround for the git clone issue [puppet] - 10https://gerrit.wikimedia.org/r/282992 (https://phabricator.wikimedia.org/T132267) (owner: 10Ladsgroup)
[11:15:18] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] scap: A basic workaround for the git clone issue [puppet] - 10https://gerrit.wikimedia.org/r/282992 (https://phabricator.wikimedia.org/T132267) (owner: 10Ladsgroup)
[11:16:43] PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[11:18:42] RECOVERY - Hadoop NodeManager on analytics1053 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[11:18:51] --^ /me doing sudo service hadoop-yarn-nodemanager restart
[11:19:06] there is a phab task for this
[11:20:59] (03PS1) 10Faidon Liambotis: Remove allocation for old eqiad-ulsfo GTT link [dns] - 10https://gerrit.wikimedia.org/r/283152
[11:21:48] (03CR) 10Faidon Liambotis: [C: 032] Remove allocation for old eqiad-ulsfo GTT link [dns] - 10https://gerrit.wikimedia.org/r/283152 (owner: 10Faidon Liambotis)
[11:22:42] !log completed upgrading TLS for cross-dc replica on shards s6 and s7 - T111654
[11:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:22:50] cleanup!
[11:25:22] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:26:39] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#2202721 (10fgiunchedi)
[11:26:41] 07Blocked-on-Operations, 06Operations, 10RESTBase-Cassandra: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951#2202719 (10fgiunchedi) 05Open>03Resolved done ``` restbase2001.codfw.wmnet: /dev/mapper/restbase2001--vg-srv 4.5T 2.5T 2.0T 57% /srv restbase2002.codfw.wmnet: /de...
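The Hadoop NodeManager alerts that keep firing in this log are a check_procs-style check: zero java processes with org.apache.hadoop.yarn.server.nodemanager.NodeManager in their arguments means the daemon OOMed and exited, and elukey is restarting it by hand (sudo service hadoop-yarn-nodemanager restart). A small sketch of the same check done locally by scanning /proc, so a wrapper could restart the service automatically; the auto-restart policy itself is an assumption, not what was running here:

    import os
    import subprocess

    NM_CLASS = "org.apache.hadoop.yarn.server.nodemanager.NodeManager"

    def nodemanager_running():
        """Return True if a java process with the NodeManager main class exists."""
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/cmdline", "rb") as f:
                    argv = f.read().split(b"\0")
            except OSError:
                continue  # process exited while we were scanning
            if argv and b"java" in argv[0] and NM_CLASS.encode() in argv:
                return True
        return False

    if not nodemanager_running():
        # Same command used by hand in the log above.
        subprocess.run(["service", "hadoop-yarn-nodemanager", "restart"], check=True)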
[11:27:42] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:33:23] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:43:30] we lost grrrit-wm
[11:46:06] !log volans@tin Synchronized wmf-config/db-eqiad.php: Reduce db1050 weight - T111654 (duration: 00m 30s)
[11:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:46:49] volans: it should restart itself iirc, if not the following people can give it a little push https://wikitech.wikimedia.org/wiki/Grrrit-wm#Access
[11:47:41] ok, let's see, thanks
[11:48:44] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail
[11:58:29] (03PS2) 10Muehlenhoff: Enable base::firewall for rdb* servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/282980
[12:00:23] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall for rdb* servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/282980 (owner: 10Muehlenhoff)
[12:07:44] 06Operations, 06Commons, 10media-storage: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#2202787 (10Aklapper)
[12:16:03] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [5000000.0]
[12:18:24] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:18:30] (03PS1) 10Muehlenhoff: Disable base::firewall on rdb1* again, needs more tweaks [puppet] - 10https://gerrit.wikimedia.org/r/283160
[12:22:58] :(
[12:23:16] we have some issue with DPKG
[12:23:32] at least icinga started complaining on a lot of DBs, from puppet logs:
[12:23:35] W: Failed to fetch http://ubuntu.wikimedia.org/ubuntu/dists/trusty-updates/universe/binary-amd64/Packages Hash Sum mismatch
[12:24:21] <_joe_> moritzm: as I feared...
[12:24:25] <_joe_> sigh
[12:24:33] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[12:25:55] _joe_: icinga alarms just cleared now
[12:26:53] but I didn't check what the check checks (clear concept right :-P)
[12:28:33] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[12:29:44] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[12:30:32] 06Operations, 10media-storage, 13Patch-For-Review: swift capacity planning - https://phabricator.wikimedia.org/T1268#2202894 (10fgiunchedi) over the last year we're still averaging ~140GB/day or ~51TB/year (not including 3x replication) : [[ https://graphite.wikimedia.org/render/?width=816&height=329&_salt=1...
[12:30:47] 06Operations, 10Traffic, 10Wiki-Loves-Monuments-General, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2202895 (10SindyM3) @Akoopal Thank you! I will contact server admin.
[12:32:46] volans: yeah, I noticed this on a few other hosts as well, this was also the reason for the earlier icinga alerts on stat100[23], since it failed to install a binary package which was partly unavailable due to the failing "apt-get update"
[12:34:10] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[12:38:52] (03CR) 10Elukey: "After a chat with Joe I believe that the picture is:" [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) (owner: 10Elukey)
[12:40:05] jzerebecki: Dear anthropoid, the time has come. Please deploy MediaWiki deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T1240).
[12:47:30] PROBLEM - Hadoop NodeManager on analytics1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:50:44] ---^ taking care of it
[12:51:19] RECOVERY - Hadoop NodeManager on analytics1044 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:54:31] (03PS1) 10Muehlenhoff: Disable connection tracking for redis/jobqueue [puppet] - 10https://gerrit.wikimedia.org/r/283167
[12:55:11] (03CR) 10Muehlenhoff: [C: 032 V: 032] Disable base::firewall on rdb1* again, needs more tweaks [puppet] - 10https://gerrit.wikimedia.org/r/283160 (owner: 10Muehlenhoff)
[13:07:48] (03PS1) 10Volans: MariaDB: Use Puppet certs for s2 [puppet] - 10https://gerrit.wikimedia.org/r/283170 (https://phabricator.wikimedia.org/T111654)
[13:09:50] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[13:12:03] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203044 (10elukey) From https://grafana.wikimedia.org/dashboard/db/server-board I don't see major memory problems at host level, a lot of GBs are simply cac...
[13:14:33] (03CR) 10Volans: "changes looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/283170 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[13:14:46] !log start upgrading TLS for cross-dc replica on shards s2 - T111654
[13:14:51] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:17:20] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:18:41] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[13:22:18] !log jzerebecki@tin Started scap: php-1.27.0-wmf.21: Update Wikidata to wmf/1.27.0-wmf.21
[13:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:24:40] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:25:42] 06Operations, 10Beta-Cluster-Infrastructure, 07Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#2203091 (10Nikerabbit)
[13:26:16] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203092 (10elukey) Another ticket was opened for the same thing https://phabricator.wikimedia.org/T102954
[13:26:25] (03PS2) 10Volans: MariaDB: Use Puppet certs for s2 [puppet] - 10https://gerrit.wikimedia.org/r/283170 (https://phabricator.wikimedia.org/T111654)
[13:26:50] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203094 (10elukey) 05Open>03Resolved p:05Triage>03Normal
[13:28:21] (03PS1) 10Muehlenhoff: Ferm rules for puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/283174
[13:30:06] (03CR) 10Volans: [C: 032] MariaDB: Use Puppet certs for s2 [puppet] - 10https://gerrit.wikimedia.org/r/283170 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[13:30:42] (03CR) 10Andrew Bogott: Use half-baked ldap auth for librenms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[13:32:02] 06Operations, 10ops-eqiad, 06Analytics-Kanban: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2203112 (10elukey) ``` elukey@neodymium:~$ sudo -i salt -t 120 analytics10* cmd.run 'grep "Hardware event" /var/log/mcelog | uniq -c' analytics1041.eqiad.wmnet: analytics1...
[13:32:38] (03PS5) 10Andrew Bogott: Use half-baked ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702)
[13:32:51] (03PS1) 10Muehlenhoff: Add ferm rules for pybal_conf / http [puppet] - 10https://gerrit.wikimedia.org/r/283175
[13:33:29] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[13:34:05] (03CR) 10jenkins-bot: [V: 04-1] Add ferm rules for pybal_conf / http [puppet] - 10https://gerrit.wikimedia.org/r/283175 (owner: 10Muehlenhoff)
[13:34:37] elukey: any idea of what's up with an1045?
[13:34:50] (03PS2) 10Muehlenhoff: Add ferm rules for pybal_conf / http [puppet] - 10https://gerrit.wikimedia.org/r/283175
[13:36:31] paravoid: there's some discussion how to fix this in #wikimedia-analytics
[13:36:49] PROBLEM - Hadoop NodeManager on analytics1045 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:37:25] ok
[13:39:33] (03PS1) 10Volans: Remove require on resource not managed by Puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/283176 (https://phabricator.wikimedia.org/T111654)
[13:39:40] PROBLEM - Hadoop NodeManager on analytics1056 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:39:58] Hmph!
[13:40:03] paravoid: hi! So the hadoop nodemanager registers a Java OOM and shuts down (no upstart script with respawn). ottomata saw this issue a while ago but very sporadically, and puppet was basically fixing the problem for us. We will investigate what's happening..
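The T132256 update above surveys the analytics hosts for thermal problems with a salt one-liner that greps /var/log/mcelog for "Hardware event" lines (the events suspected of taking out analytics1046/1057 earlier). The same count done locally, as a rough sketch of what that grep | uniq -c produces on one host:

    from collections import Counter

    def mce_summary(path="/var/log/mcelog"):
        """Count mcelog lines mentioning hardware events, roughly like `grep | uniq -c`."""
        counts = Counter()
        try:
            with open(path, errors="replace") as f:
                for line in f:
                    if "Hardware event" in line:
                        counts[line.strip()] += 1
        except FileNotFoundError:
            pass  # mcelog not enabled on this host
        return counts

    for line, n in mce_summary().most_common():
        print(n, line)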
[13:40:14] (03CR) 10Volans: [C: 032] Remove require on resource not managed by Puppet [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/283176 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[13:40:27] a while ago meaning over a year ago
[13:41:04] ottomata: just restarted yarn on 1045
[13:41:18] k
[13:41:30] am looking at logs there and on 56
[13:42:40] RECOVERY - Hadoop NodeManager on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:42:43] restarted also 1056
[13:42:45] (03PS1) 10Volans: MariaDB: Update submodule reference [puppet] - 10https://gerrit.wikimedia.org/r/283178 (https://phabricator.wikimedia.org/T111654)
[13:43:39] RECOVERY - Hadoop NodeManager on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:45:01] (03CR) 10Volans: [C: 032] MariaDB: Update submodule reference [puppet] - 10https://gerrit.wikimedia.org/r/283178 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[13:46:08] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[13:47:20] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:47:23] (03PS1) 10Ema: varnishstatsd: log ValueError exceptions [puppet] - 10https://gerrit.wikimedia.org/r/283179 (https://phabricator.wikimedia.org/T132430)
[13:51:10] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[13:52:00] (03CR) 10BBlack: [C: 031] varnishstatsd: log ValueError exceptions [puppet] - 10https://gerrit.wikimedia.org/r/283179 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[13:52:15] !log jzerebecki@tin Finished scap: php-1.27.0-wmf.21: Update Wikidata to wmf/1.27.0-wmf.21 (duration: 29m 56s)
[13:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:52:36] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203158 (10Dzahn) Ok, thanks. So that means depooling is 3 steps now? Running conftool, changing conftool config AND still changing the old dsh files?
[13:52:54] (03PS2) 10Ema: varnishstatsd: log ValueError exceptions [puppet] - 10https://gerrit.wikimedia.org/r/283179 (https://phabricator.wikimedia.org/T132430)
[13:53:04] (03CR) 10Ema: [C: 032 V: 032] varnishstatsd: log ValueError exceptions [puppet] - 10https://gerrit.wikimedia.org/r/283179 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema)
[13:57:21] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2203163 (10BBlack) Notable: there's an ongoing report of 1.9.14 causing an HTTP/2 proto error in Chrome. We may need to be wary and stick with .13 or wait for .15: http://mailman.ngin...
[13:57:49] akosiaris: starttls fails immediately; any suggestions on what to check or try?
[13:58:44] 06Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2203165 (10Ottomata) @Robh, over in T132067 it looks like these nodes were ordered, is this correct? If so, any idea on ETA?
[13:59:27] andrewbogott: fails ? interesting. Error message ?
[13:59:28] (03CR) 10Faidon Liambotis: "What does half-baked mean here? I'm okay with this in concept and, if tested/baby-sat on, in implementation as well." [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[14:00:16] (03PS1) 10Urbanecm: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183
[14:00:54] 06Operations: rsync module doesnt work on trusty - https://phabricator.wikimedia.org/T132532#2203182 (10Dzahn) For some reason ms-be systems have a **/etc/init.d/rsync ** but when i put an "include rsync::server" on osmium we do not get that init script and it just didnt exist. Putting the identical class on a...
[14:00:59] andrewbogott: the php call 'ldap_start_tls' returns 'false.' Very informative.
[14:01:37] https://www.irccloud.com/pastebin/CnEW4ryi/
[14:01:40] akosiaris: ^
[14:01:47] (03PS2) 10Urbanecm: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754)
[14:01:54] (03CR) 10Faidon Liambotis: [C: 031] Enable ChallengeResponseAuthentication [puppet] - 10https://gerrit.wikimedia.org/r/281629 (owner: 10Muehlenhoff)
[14:02:07] andrewbogott: https://secure.php.net/manual/en/function.ldap-error.php
[14:02:12] that should help explain what happens
[14:02:25] ah
[14:02:27] it's already there
[14:02:30] (03CR) 10Andrew Bogott: "@faidon, I'm just pissed at the shitty librenmns ldap code that I had to dig through to get here. I will fix the commit message :)" [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[14:02:45] and it does not return an error ?
[14:03:01] "Fatal error: LDAP TLS required but not successfully negotiated:Connect error"
[14:03:02] wth ?
[14:03:08] hmm
[14:03:13] ok lemme check this a bit then
[14:03:35] thanks. I'm watching the slapd logs but there's nothing of interest there that I can see
[14:05:00] akosiaris: could this be a missing cert on netmon1001?
[14:05:35] meaning the CA cert ? maybe ...
[14:05:44] gimme 10 mins and I 'll tell you what it is
[14:05:47] well at least I hope
[14:05:52] damned LDAP
[14:05:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:05:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:06:18] akosiaris: works for me :)
[14:06:48] Um, I mean, having you debug it works for me, not starttls. Starttls does not work for me, as established
[14:07:31] (03PS1) 10Ottomata: Use @yarn_heapsize, not @hadoop_heapsize when setting $yarn_heapsize [puppet/cdh] - 10https://gerrit.wikimedia.org/r/283185 (https://phabricator.wikimedia.org/T102954)
[14:08:09] (03CR) 10Ottomata: [C: 032 V: 032] Use @yarn_heapsize, not @hadoop_heapsize when setting $yarn_heapsize [puppet/cdh] - 10https://gerrit.wikimedia.org/r/283185 (https://phabricator.wikimedia.org/T102954) (owner: 10Ottomata)
[14:09:09] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:42] (03PS1) 10Ottomata: Update cdh submodule with yarn_heapsize fix, set yarn_heapsize to 2048m [puppet] - 10https://gerrit.wikimedia.org/r/283186
[14:09:48] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[14:10:41] (03CR) 10jenkins-bot: [V: 04-1] Update cdh submodule with yarn_heapsize fix, set yarn_heapsize to 2048m [puppet] - 10https://gerrit.wikimedia.org/r/283186 (owner: 10Ottomata)
[14:10:59] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[14:11:34] (03CR) 10Ottomata: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/283186 (owner: 10Ottomata)
[14:13:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:13:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:13:54] (03CR) 10Ottomata: [C: 032] Update cdh submodule with yarn_heapsize fix, set yarn_heapsize to 2048m [puppet] - 10https://gerrit.wikimedia.org/r/283186 (owner: 10Ottomata)
[14:13:58] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:13:58] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:15:48] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[14:16:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] Use half-baked ldap auth for librenms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott)
[14:16:19] andrewbogott: found it. it's 'ldap://ldap-labs.eqiad.wikimedia.org ldap://ldap-labs.codfw.wikimedia.org' otherwise cert names don't match DNS names
[14:16:22] commented on the change
[14:17:14] and I needed 10 mins and 10seconds (by some accounts)... missed my mark by 10 seconds
[14:17:16] grrr
[14:17:35] still pretty good
[14:17:48] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:19:39] hm, akosiaris, can you log in now?
[14:20:19] in librenms ? no
[14:20:24] ah silly me
[14:21:00] I was trying with full name instead of username, but that does not work either
[14:21:35] yeah, it gets further now but still doesn't work...
[14:21:40] let me figure out which change broke it
[14:21:55] (03CR) 10Ottomata: [C: 032] Add stat1004 configuration to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) (owner: 10Elukey)
[14:22:35] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203236 (10Volans) @Dzahn I got help from @Joe and @fgiunchedi on IRC, I didn't know the whole thing. Looks like it is needed, but double check with them please.
[14:22:59] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:23:09] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:23:18] (03PS2) 10Elukey: Add stat1004 configuration to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877)
[14:23:34] mobrovac: I am betting that that gov site does not work again ^
[14:23:48] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
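The librenms login problem above was resolved once the LDAP servers were referenced by the DNS names that match their certificates (ldap-labs.eqiad.wikimedia.org / ldap-labs.codfw.wikimedia.org); STARTTLS fails up front when the name in the URI doesn't match the cert. The failure is easy to reproduce outside PHP with the python-ldap library, as a rough sketch; the CA bundle path is an assumption about the host's trust store:

    import ldap

    URI = "ldap://ldap-labs.eqiad.wikimedia.org:389"  # must match the cert's names

    conn = ldap.initialize(URI)
    # Assumed CA bundle location; adjust for the host in question.
    conn.set_option(ldap.OPT_X_TLS_CACERTFILE, "/etc/ssl/certs/ca-certificates.crt")
    conn.set_option(ldap.OPT_X_TLS_NEWCTX, 0)  # apply the TLS options to a fresh context
    try:
        conn.start_tls_s()
        print("STARTTLS negotiated")
    except ldap.LDAPError as e:
        # A hostname/cert mismatch or a missing CA shows up here as a connect error.
        print("STARTTLS failed:", e)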
[14:23:48] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:23:52] akosiaris: i think you have won that lottery [14:23:54] <_joe_> akosiaris: last time I checked, that was the case [14:24:16] yeah, like stealing candy from a baby [14:24:22] <_joe_> mobrovac: we should really add smarter timeouts to service_checker [14:24:40] (03PS1) 10ArielGlenn: don't allow en wiki dump jobs to overlap (yet) [puppet] - 10https://gerrit.wikimedia.org/r/283188 [14:24:48] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2203239 (10Milimetric) I'd say we should wait for @ezachte to figure out what we should do with those 2T. Pinging. [14:25:13] akosiaris: i'm quite sure you wouldn't be able to steal candy from my niece [14:25:20] _joe_: as in configurable? [14:25:29] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:25:36] !log completed upgrading TLS for cross-dc replica on shards s2 - T111654 [14:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:11] _joe_: also notice that it's not the service_checker script that is failing now, but check_nrpe [14:26:12] \ [14:26:16] (03CR) 10ArielGlenn: [C: 032] don't allow en wiki dump jobs to overlap (yet) [puppet] - 10https://gerrit.wikimedia.org/r/283188 (owner: 10ArielGlenn) [14:28:26] Morning anomie ostriches thcipriani MarkTraceur Krenair, how's it going? Just wondering who'll do the SWAT deploy this morning, and if it's possible to add a CN patch that needs to sync first to mw1017 for a small amount of extra testing? [14:29:04] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2201642 (10Cmjohnson) it this something we want to fix or just decommission? [14:31:49] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:31:58] heya, anybody know what happened to the salt role grains? [14:32:07] they stopped working a LONG time ago, and I never investigated [14:32:10] https://wikitech.wikimedia.org/wiki/Salt#List.2Fping_all_nodes_with_a_puppet_role [14:32:14] is this not true? [14:32:20] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203275 (10Joe) decommission it, it would've been decommissioned after next week anyways. [14:32:36] (03PS6) 10Andrew Bogott: Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [14:32:52] not sure about those, but you can match server groups via the debdeploy salt grains [14:33:00] (03PS2) 10BBlack: Use https://config-master.wm.o for rolematcher T132459 [puppet] - 10https://gerrit.wikimedia.org/r/283089 [14:33:08] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [14:33:10] (03CR) 10BBlack: [C: 032 V: 032] Use https://config-master.wm.o for rolematcher T132459 [puppet] - 10https://gerrit.wikimedia.org/r/283089 (owner: 10BBlack) [14:33:14] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203276 (10Joe) @Dzahn no, the immediate depooling is done via conftool; I asked to remove it from conftool-data since it's going to be decommissioned. 
[14:33:16] HMMM [14:33:17] ja [14:33:17] debdeploy-hadoop-worker [14:33:18] ok [14:33:19] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:33:29] those aren't very automated though, but i guess its cool [14:33:36] wondering if i should add it back into system::role or something [14:33:49] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:33:57] (03CR) 10jenkins-bot: [V: 04-1] Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [14:34:23] they're mostly based on the roles (except the canary ones) [14:34:41] right, but they all have to be added manually, i'm sure there are many roles that don't have debdeploy grains, no? [14:35:15] !log start cleanup on restbase100[569] - T128107 [14:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:59] (03PS7) 10Andrew Bogott: Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [14:36:16] \o/ [14:36:24] 06Operations, 10ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2203301 (10Dzahn) @Volans @joe gotcha, thank you [14:37:24] 06Operations, 10Pybal, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for config-master.wikimedia.org - https://phabricator.wikimedia.org/T132459#2203303 (10BBlack) Re: rolematcher - the only real host I could trace it to in puppetization was fluorine. However, post-merge the update did not get... [14:37:25] ottomata: all roles should have debdeploy grains (except a few corner cases), if I spot some systems not matched by existing grains I add them [14:37:37] so unless they're very fresh they should be in there [14:37:44] (03CR) 10Andrew Bogott: "Tested, looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [14:38:17] !log restarting hadoop-yarn-nodemanager on all hadoop worker nodes one by one to apply increase in heap size [14:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:01] (03PS1) 10BBlack: config-master.wm.o HTTPS redirect T132459 [puppet] - 10https://gerrit.wikimedia.org/r/283190 [14:39:08] AndyRussG: I'll SWAT this morning, Go ahead and add your patch and we'll do it at the end of the window. [14:39:50] thcipriani: cool thanks much! [14:39:59] just preparing the patch itself [14:40:09] okie doke. 
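A rough sketch of the grain matching mentioned above, run from the salt master; the exact name/value layout of the debdeploy grains is an assumption here, so it is worth inspecting one host before relying on a pattern:
```
# Inspect what grains a known Hadoop worker actually carries...
salt 'analytics1047*' grains.items
# ...then target the whole group by grain (value glob, since the value is not shown above).
salt -G 'debdeploy-hadoop-worker:*' test.ping
```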
[14:41:41] (03CR) 10BBlack: [C: 032] config-master.wm.o HTTPS redirect T132459 [puppet] - 10https://gerrit.wikimedia.org/r/283190 (owner: 10BBlack) [14:42:37] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203332 (10BBlack) [14:42:39] 06Operations, 10Pybal, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for config-master.wikimedia.org - https://phabricator.wikimedia.org/T132459#2203330 (10BBlack) 05Open>03Resolved a:03BBlack [14:42:41] (03PS1) 10Volans: MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) [14:42:48] (03PS1) 10Volans: Depool db1057 for TLS upgrade for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283193 (https://phabricator.wikimedia.org/T111654) [14:44:11] (03PS1) 10Ema: Install test version of ${vcl}.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/283194 [14:44:22] (03PS2) 10Volans: MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) [14:45:10] (03CR) 10BBlack: [C: 031] Install test version of ${vcl}.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/283194 (owner: 10Ema) [14:47:08] (03CR) 10Alexandros Kosiaris: [C: 031] Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [14:47:16] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for transparency.wikimedia.org - https://phabricator.wikimedia.org/T132464#2203360 (10BBlack) I don't see any mixed content in simple checks, and it seems to not use proto-absolute URLs in general. Since this site is clearly for human consumption, I'll proba... [14:47:17] (03PS3) 10Elukey: Add stat1004 configuration to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) [14:48:17] (03PS8) 10Andrew Bogott: Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [14:48:23] !log Yarn nodemanager Xmx size bumped up from 1000m to 2048 on all the analytics* hosts to overcome the OutOfMemory errors. [14:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:30] (03PS1) 10Ema: misc-backend.inc.vcl: set do_stream=false when testing [puppet] - 10https://gerrit.wikimedia.org/r/283195 [14:49:00] 06Operations, 10Analytics-Cluster, 10Traffic, 07HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203363 (10BBlack) I don't see any mixed content in simple checks (and we checked/fixed that in a much earlier ticket: (T93702). Since this site is clearly for human c... [14:49:12] (03CR) 10Elukey: [C: 032] Add stat1004 configuration to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) (owner: 10Elukey) [14:50:00] (03PS3) 10Volans: MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) [14:51:27] 06Operations, 06Release-Engineering-Team, 10Traffic, 05Gitblit-Deprecate, 07HTTPS: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2203375 (10BBlack) As gitblit is on the chopping block for deprecation anyways, my inclination is to go ahead and enable HTTPS for this soon... 
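For reference, the NodeManager heap bump logged above amounts to something like the following on each worker; whether the cdh Puppet module applies it exactly this way (via YARN_NODEMANAGER_HEAPSIZE in yarn-env.sh) is an assumption:
```
# yarn-env.sh: NodeManager JVM heap in MB (stock default is 1000)
export YARN_NODEMANAGER_HEAPSIZE=2048
# then restart the daemon, one worker at a time as in the !log above
service hadoop-yarn-nodemanager restart
```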
[14:51:45] (03CR) 10Volans: [C: 032] Depool db1057 for TLS upgrade for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283193 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [14:52:10] (03Merged) 10jenkins-bot: Depool db1057 for TLS upgrade for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283193 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [14:52:25] (03CR) 10BBlack: [C: 031] "+1 Works for now! In the long term, we'll want to figure out how to test multiple layers/tiers..." [puppet] - 10https://gerrit.wikimedia.org/r/283195 (owner: 10Ema) [14:52:51] (03PS16) 10Mobrovac: Kafka config: Add config functions [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) [14:53:01] 06Operations, 10Gitblit, 06Release-Engineering-Team, 10Traffic, 07HTTPS: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2203384 (10greg) [14:53:25] !log volans@tin Synchronized wmf-config/db-eqiad.php: Depool db1057 to upgrade TLS on s1 - T111654 (duration: 00m 26s) [14:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:49] PROBLEM - Hadoop NodeManager on analytics1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:53:55] !log start upgrading TLS for cross-dc replica on shards s1 - T111654 [14:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:09] (03PS2) 10Ema: Install test version of ${vcl}.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/283194 [14:54:12] (03CR) 10Mobrovac: "@Ottomata {{done}}" [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [14:54:18] (03CR) 10Ema: [C: 032 V: 032] Install test version of ${vcl}.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/283194 (owner: 10Ema) [14:54:29] (03PS2) 10Ema: misc-backend.inc.vcl: set do_stream=false when testing [puppet] - 10https://gerrit.wikimedia.org/r/283195 [14:54:40] (03CR) 10Ema: [C: 032 V: 032] misc-backend.inc.vcl: set do_stream=false when testing [puppet] - 10https://gerrit.wikimedia.org/r/283195 (owner: 10Ema) [14:55:30] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2203387 (10Ottomata) datasets.wikimedia.or is hosted on stat1001, not a dataset100x host. I think HTTPS only is fine, but maybe @mil... [14:55:49] RECOVERY - Hadoop NodeManager on analytics1048 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:56:14] (03PS3) 10Ema: New misc VTC test: 09-chunked-response-add-cl.vtc [puppet] - 10https://gerrit.wikimedia.org/r/282895 (https://phabricator.wikimedia.org/T128813) [14:56:22] (03CR) 10Ema: [C: 032 V: 032] New misc VTC test: 09-chunked-response-add-cl.vtc [puppet] - 10https://gerrit.wikimedia.org/r/282895 (https://phabricator.wikimedia.org/T128813) (owner: 10Ema) [14:58:33] andrewbogott: so any luck ? did you manage to get librenms working ? 
[14:58:36] (03PS9) 10Andrew Bogott: Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [14:58:50] oh still not merged, sorry [14:58:52] akosiaris: yep, all good, I going to merge just as soon as I can catch up with the git head [14:58:57] ahaha [14:58:57] ok [14:59:51] (03CR) 10Andrew Bogott: [C: 032] Use ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T1500). [15:00:04] jdlrobson Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] \o [15:00:25] I'll SWAT today. [15:02:17] 06Operations: librsvg path patch needs to be applied for jessie - https://phabricator.wikimedia.org/T132584#2203430 (10MoritzMuehlenhoff) [15:02:26] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2203442 (10elukey) stat1004 is ready for a test! @Ottomata would you be the first one to try it? :P [15:02:28] Urbanecm: around for SWAT? I can get your patch out while I'm waiting on Jenkins for the others. [15:02:38] Yes. [15:02:56] 06Operations, 13Patch-For-Review: Configure librenms to use LDAP for authentication - https://phabricator.wikimedia.org/T107702#2203445 (10Andrew) 05Open>03Resolved a:03Andrew [15:03:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:03:27] 06Operations: librsvg path patch needs to be applied for jessie - https://phabricator.wikimedia.org/T132584#2203463 (10MoritzMuehlenhoff) [15:03:29] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2203462 (10MoritzMuehlenhoff) [15:03:54] akosiaris: works for you now? [15:04:31] (03PS3) 10Thcipriani: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:04:33] (03CR) 10Volans: "change looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [15:06:18] Thcipriani: I can see that you're rebasing my changes everytime when you SWAT my patches. So should I rebase them on master everytime when I upload it to Gerrit? [15:06:54] 06Operations, 10media-storage, 07Tracking: [tracking] refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2203496 (10fgiunchedi) for the current refresh we're replacing 12x 2TB machines in eqiad/codfw, each machine has 12x 1.9TB = 22.8TB usable, so 273.6TB and 144 disks per data... [15:07:15] Urbanecm: you can only merge the patch on the very tippy-top. So there's almost always a rebase step before a merge if the patch is more than a few minutes old. [15:07:25] Urbanecm: nah, I'm just making sure that I don't end up with merge commits. [15:07:34] yeah, what andrewbogott said. [15:07:34] thcipriani: ping me when you need me to test. I've got some examples ready [15:07:45] jdlrobson: kk, just waiting on Jenkins for now. 
[15:08:02] (03CR) 10Thcipriani: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:08:13] (03CR) 10Thcipriani: [C: 032] "SWAT again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:08:38] (03Merged) 10jenkins-bot: Fix typo in newikibooks namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283183 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:08:44] ^ also in this instance this patch needed a rebase before it could merge (which I forgot to do before I +2'd) :\ [15:09:59] Andrewbogott and Thcipriani: So instead of git review -R I should run git review? So shouldn't we update https://www.mediawiki.org/wiki/Gerrit/Tutorial ? (If I'm asking in wrong time because we're SWATing, please tell it to me and I'll ask after SWAT) [15:10:25] andrewbogott: thank you so much for that librenms ldap change [15:10:36] (03PS2) 10BBlack: use https://parsoid-tests in testreduce T132462 [puppet] - 10https://gerrit.wikimedia.org/r/283088 [15:10:45] (03CR) 10BBlack: [C: 032 V: 032] use https://parsoid-tests in testreduce T132462 [puppet] - 10https://gerrit.wikimedia.org/r/283088 (owner: 10BBlack) [15:10:45] oh, I lost my dashboard [15:10:47] oh well [15:10:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Fix typo in newikibooks namespaces [[gerrit:283183]] (duration: 00m 30s) [15:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:10] ^ Urbanecm check please [15:11:38] Urbanecm: FWIW all your patches have seemed fine to me in terms of how they are submitted. [15:11:40] 06Operations, 10Analytics-Cluster, 10Traffic, 07HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203508 (10Ottomata) +1 should be fine to do. [15:11:52] 06Operations, 10Parsoid, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for parsoid-tests.wikimedia.org - https://phabricator.wikimedia.org/T132462#2203509 (10BBlack) Brief discussion on mediawiki-parsoid IRC channel seems to indicate this is low risk, so going for it. [15:12:50] It seems ok. Thanks. [15:13:56] Urbanecm: thanks for checking! [15:15:35] (03CR) 10Mobrovac: "Still looking good - https://puppet-compiler.wmflabs.org/2426/" [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [15:15:43] Thcipriani: So the pushing. I have one commit in branch in my local repo and I worked on it for one hour. Should I run git review or rebase them on master and then run git review -R? I think that these commands are same, so I can run only git review. Am I right? [15:16:55] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [15:17:06] thcipriani: just waiting on gate-and-submits... Since u said near the end of the window, I didn't rush too much, sorry [15:17:10] 06Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2203525 (10Eevans) [15:17:59] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2203529 (10Halfak) +1 for HTTPS only being OK. 
[15:18:43] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for transparency.wikimedia.org - https://phabricator.wikimedia.org/T132464#2203530 (10Chmarkine) Redirect to https should be fine, since we enabled HSTS for transparency.wikimedia.org in May 2015.[1] But was there any reason that the redirect was dropped? [1... [15:18:52] (03PS1) 10Physikerwelt: Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) [15:18:58] Hmmm looks like the queue is not so fast this morning! [15:19:00] (03CR) 10Ottomata: [V: 031] "COOOOL! Marko, let's apply this together this week, ja?" [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [15:19:14] mobrovac: maybe my morning tomorrow? [15:19:30] we can apply in in beta first and run puppet in a few places to make sure it does what we think, and then fully merge it? [15:19:31] Urbanecm: I don't use git-review too much. FWIW, I think it's probably going to be easier to do a manual rebase and then do git review -R. I'm not sure what happens if your rebase fails using git review: it's possible that it's the same thing that would happen if your rebase failed running git rebase, but I'm not 100% on that. [15:19:54] ottomata: sounds good! [15:19:55] i.e. the error message may be more opaque thanks to git-review [15:20:13] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:20:20] Thanks. So no more manual rebases by SWATters, I'll remember it :) [15:20:30] thcipriani: do you mind if I merge on puppet? [15:20:37] yeah seriously andrewbogott thanks for the librenms/ldap work! works like a charm [15:21:05] (03PS4) 10Volans: MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) [15:21:33] !log thcipriani@tin Synchronized php-1.27.0-wmf.21/extensions/WikidataPageBanner: SWAT: Attempt at fixing table of contents problem [[gerrit:282995]] (duration: 00m 29s) [15:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:58] ^ jdlrobson check please: .21 only right now. [15:22:05] (group0 wikis) [15:22:13] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:23:20] thcipriani: WikidataPageBanner is only on .20 right now [15:23:52] so i can't verify there until Wikivoyage is on .21 :/ [15:24:11] jdlrobson: kk, I'll go forward with .20 [15:25:39] godog: cool [15:26:45] well...I would move forward with .20. Jenkins is sure taking its time. [15:27:33] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:14] (03CR) 10Volans: [C: 032] MariaDB: use Puppet cert for s1 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283192 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [15:28:16] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for transparency.wikimedia.org - https://phabricator.wikimedia.org/T132464#2203583 (10BBlack) @Chmarkine: not sure - those changes are still in puppet, and I've confirmed the backend server for it today (bromine) still has that config deployed as well. But i... 
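Spelled out, the manual-rebase-then-push workflow suggested to Urbanecm above looks roughly like this (a sketch; remote and branch names are the usual Gerrit defaults and may differ per checkout):
```
git fetch origin                # pick up the current tip from Gerrit
git rebase origin/master        # replay the local commit on top of it
# resolve any conflicts, `git add` the fixes, then `git rebase --continue`
git review -R                   # -R pushes for review without letting git-review rebase again
```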
[15:28:23] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:43] PROBLEM - DPKG on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:43] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:44] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:44] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:51] (03PS1) 10Giuseppe Lavagetto: scap: use conftool data to populate dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/283201 (https://phabricator.wikimedia.org/T132529) [15:29:02] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:03] PROBLEM - Disk space on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:13] PROBLEM - RAID on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:33] PROBLEM - SSH on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:43] PROBLEM - nutcracker process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:53] PROBLEM - configured eth on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:30:12] PROBLEM - salt-minion processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:30:40] (03CR) 10Mobrovac: [C: 031] Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [15:30:42] PROBLEM - puppet last run on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:30:42] PROBLEM - Check size of conntrack table on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:30:43] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:30:43] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [15:30:43] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:31:06] (03PS1) 10BBlack: parsoid-tests.wm.o HTTPS redirect T132462 [puppet] - 10https://gerrit.wikimedia.org/r/283202 [15:31:31] (03CR) 10BBlack: [C: 032 V: 032] parsoid-tests.wm.o HTTPS redirect T132462 [puppet] - 10https://gerrit.wikimedia.org/r/283202 (owner: 10BBlack) [15:31:50] hm, looks like mw1139 has had some historical issues. Is anyone messing with it just now? [15:32:03] PROBLEM - HHVM processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:32:09] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203620 (10BBlack) [15:32:12] 06Operations, 10Parsoid, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for parsoid-tests.wikimedia.org - https://phabricator.wikimedia.org/T132462#2203618 (10BBlack) 05Open>03Resolved a:03BBlack [15:32:23] PROBLEM - nutcracker port on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:33:07] moritzm: do you know anything about 1139 other than what's in the SAL from a couple of weeks ago? [15:33:15] thcipriani: I think zuul is too slow today to get the CN patches ready within this window [15:33:53] AndyRussG: I just stopped a job that was hung at 16% for the past 30 mins :\ Queue should start to clear, hopefully. 
[15:33:57] andrewbogott: I don't eben remember what I put into SAL a few week ago, let me check [15:34:08] Ahh K thx! :) [15:34:10] (03PS1) 10BBlack: transparency.wm.o HTTPS redirect T132464 [puppet] - 10https://gerrit.wikimedia.org/r/283203 [15:34:12] (03PS1) 10BBlack: stats.wm.o HTTPS redirect T132465 [puppet] - 10https://gerrit.wikimedia.org/r/283204 [15:34:37] unfortunately it was a patch for SWAT :\ [15:34:37] (03CR) 10BBlack: [C: 032 V: 032] transparency.wm.o HTTPS redirect T132464 [puppet] - 10https://gerrit.wikimedia.org/r/283203 (owner: 10BBlack) [15:34:49] (03CR) 10BBlack: [C: 032 V: 032] stats.wm.o HTTPS redirect T132465 [puppet] - 10https://gerrit.wikimedia.org/r/283204 (owner: 10BBlack) [15:35:03] RECOVERY - Disk space on mw1139 is OK: DISK OK [15:35:03] andrewbogott: that a few weeks ago was just one of the occasional hhvm lockups, this here is different [15:35:09] ok [15:35:13] RECOVERY - RAID on mw1139 is OK: OK: no RAID installed [15:35:23] RECOVERY - SSH on mw1139 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [15:35:30] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203640 (10BBlack) [15:35:42] RECOVERY - nutcracker process on mw1139 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:35:42] RECOVERY - configured eth on mw1139 is OK: OK - interfaces up [15:35:43] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1402411 (10BBlack) [15:35:43] RECOVERY - HHVM processes on mw1139 is OK: PROCS OK: 6 processes with command name hhvm [15:35:45] 06Operations, 10Analytics-Cluster, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203641 (10BBlack) 05Open>03Resolved a:03BBlack [15:35:48] oh, it's back! And it didn't reboot... [15:35:54] RECOVERY - salt-minion processes on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:03] RECOVERY - nutcracker port on mw1139 is OK: TCP OK - 0.000 second response time on port 11212 [15:36:12] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:36:23] RECOVERY - Check size of conntrack table on mw1139 is OK: OK: nf_conntrack is 0 % full [15:36:33] RECOVERY - DPKG on mw1139 is OK: All packages OK [15:36:45] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [15:37:31] moritzm: it was oom. What do you think, should I reboot it just to make a full recovery from all the services killed by oom-killer? [15:38:39] hm, it still can't fork [15:39:24] thcipriani: this is the patch for 20: https://gerrit.wikimedia.org/r/#/c/283205/ [15:39:58] bblack: re: redirect for transparency.wm seems to me like it already redirects, unless you just enabled it in varnish? 
[15:40:20] < HTTP/1.1 301 TLS Redirect [15:40:22] ACKNOWLEDGEMENT - Apache HTTP on mw1139 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50394 bytes in 0.039 second response time andrew bogott OOM -- I will reboot and investigate when I have a chance [15:40:22] ACKNOWLEDGEMENT - HHVM rendering on mw1139 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50394 bytes in 0.068 second response time andrew bogott OOM -- I will reboot and investigate when I have a chance [15:40:23] ACKNOWLEDGEMENT - puppet last run on mw1139 is CRITICAL: CRITICAL: puppet fail andrew bogott OOM -- I will reboot and investigate when I have a chance [15:40:28] < Location: https://transparency.wikimedia.org/ [15:40:36] * andrewbogott needs to step away but will look at 1139 as soon as able [15:41:59] AndyRussG: anything requiring a full scap in this update? [15:41:59] mutante: I just did, yes [15:42:15] mutante: something's broken about the apache redirect there, I didn't debug it, but it wasn't redirecting before [15:42:15] ah, and i just saw this: < Server: Varnish [15:42:19] andrewbogott: yeah, reboot won't hurt, whatever went wrong there [15:42:28] hmm, ok [15:42:34] thcipriani: not really. I'm pulling in changes from translatewiki, but it's OK if those only get updated when the train goes thru, no? [15:42:49] (03PS1) 10BBlack: datasets.wm.o HTTPS redirect T132463 [puppet] - 10https://gerrit.wikimedia.org/r/283207 [15:43:01] thcipriani: I'm also fine leaving this for another SWAT [15:43:08] (03CR) 10BBlack: [C: 032 V: 032] datasets.wm.o HTTPS redirect T132463 [puppet] - 10https://gerrit.wikimedia.org/r/283207 (owner: 10BBlack) [15:43:11] thcipriani: did we hit some issues on the wikivoyage patch? [15:43:53] RECOVERY - DPKG on labmon1001 is OK: All packages OK [15:43:55] jdlrobson: yeah. for some reason one of the tests hung at 16% for 30 mins or so. I had to kill it, rejected the patch. Flailed around a bit getting it resubmitted. [15:43:57] (03PS2) 10Bartosz Dziewoński: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) [15:44:01] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203658 (10BBlack) [15:44:03] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2203656 (10BBlack) 05Open>03Resolved a:03BBlack [15:44:36] owch. fingers crossed it works this time :) [15:45:09] (03PS1) 10Volans: Repool db1057 after TLS upgrade on s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283209 (https://phabricator.wikimedia.org/T111654) [15:46:15] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2203677 (10Ottomata) [15:46:31] AndyRussG: let's push this to another SWAT if you don't mind. Jenkins queue is a bit backed up + a full scap seems like it'd run over by quite a bit. [15:46:41] thcipriani: yeah! [15:46:49] AndyRussG: thanks. [15:47:00] thcipriani: likewise! [15:47:51] thcipriani: quick question: since I +2'd the change for .21, does that mean it'll get synced with the train in a few minutes? 
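The redirect check pasted above can be repeated with a plain HEAD request; if the 301 is served by the cache layer rather than the backend Apache, the Server header will say Varnish, as noted:
```
curl -sI http://transparency.wikimedia.org/ | grep -Ei '^(HTTP|Server|Location)'
# expected per the paste above: HTTP/1.1 301 TLS Redirect, Server: Varnish,
# Location: https://transparency.wikimedia.org/
```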
[15:48:57] if so maybe we could quickly push it out to mw1017 [15:49:21] nothing magically pulls to tin [15:49:37] +2 on branches should really only be done by a deployer in the course of deploying [15:49:52] bd808: yes hmm well it was going to get deployed [15:49:56] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2203682 (10Nuria) [15:49:59] Sorry I should have held off tho [15:50:14] bd808: Could still cancel the gate-and-submit I think [15:50:20] sorry not screaming, just correcting gently :) [15:50:27] (03CR) 10Ori.livneh: [C: 031] "for a few days, sure" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [15:50:43] AndyRussG: it's np. Cancelling now would requeue everything in zuul so don't do that. [15:50:51] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10Nuria) [15:50:51] K [15:50:52] well, everything behind it. [15:51:22] fwiw it doesn't need a scap (just re-checked, no new i18n keys) [15:51:40] (03PS1) 10BBlack: HTTPS for graphite monitor URLs T132461 [puppet] - 10https://gerrit.wikimedia.org/r/283210 [15:52:43] thcipriani: hope this setback hasn't impacted the Wikivoyage patch going out today? [15:53:04] jdlrobson: going in 1 second. [15:53:11] phew :) [15:53:20] AndyRussG: I can get .21 out for you here in a minute. [15:53:41] thcipriani: K that would be great! thx much :) [15:54:44] AndyRussG: er, wait, did the .20 patch merge, too (just fetched it down) [15:54:49] yep [15:55:20] thcipriani: do you want to do mw1017? It was suggested that we smoke test there first since we're moving about some modules, but IMHO it's quite unlikely to be problematic [15:56:13] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:14] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:14] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:53] !log thcipriani@tin Synchronized php-1.27.0-wmf.20/extensions/WikidataPageBanner: SWAT: Attempt at fixing table of contents problem [[gerrit:282994]] (duration: 00m 28s) [15:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:00] ^ jdlrobson check please [15:57:59] (03CR) 10BBlack: [C: 032] "I've manually tested the effect on neon by looking at the check_graphite -related commands it runs, and re-running them manually with s/ht" [puppet] - 10https://gerrit.wikimedia.org/r/283210 (owner: 10BBlack) [15:58:12] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:58:12] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:58:12] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [15:58:23] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:58:33] thcipriani: on it! [16:02:09] thcipriani: looks fine thanks [16:02:17] jdlrobson: cool, thanks for checking. [16:02:31] AndyRussG: kk, I'll sync down to mw1017. [16:02:41] thcipriani: cool beans! thx :) [16:03:19] AndyRussG: kk, says it's done, give it a try. 
[16:03:23] (03PS1) 10BBlack: graphite.wm.o HTTPS redirect T132461 [puppet] - 10https://gerrit.wikimedia.org/r/283214 [16:05:39] 06Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2203753 (10Ottomata) [16:06:35] thcipriani: loosk fine! [16:07:13] AndyRussG: kk, I'll run a sync-dir for CentralNotice on .21 then .20 if that plan sounds fine with you. [16:07:26] thcipriani: yeah that'd be amazing :) thanks so much! [16:07:27] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10BBlack) If we're doing this in production, the frontend should probably be through cache_misc. I'm not sure what the backend looks like at all role/software-wise... [16:10:11] !log thcipriani@tin Synchronized php-1.27.0-wmf.21/extensions/CentralNotice: SWAT: Update CentralNotice [[gerrit:283206]] (duration: 00m 33s) [16:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:42] !log thcipriani@tin Synchronized php-1.27.0-wmf.20/extensions/CentralNotice: SWAT: Update CentralNotice [[gerrit:283205]] (duration: 00m 30s) [16:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:48] ^ AndyRussG check please [16:12:50] thcipriani: thx! [16:14:48] 06Operations, 10Traffic, 07Graphite, 07HTTPS, 13Patch-For-Review: HTTPS redirects for graphite.wikimedia.org - https://phabricator.wikimedia.org/T132461#2203793 (10BBlack) With the check_graphite stuff switched to HTTPS, so far neon doesn't seem to be suffering from any significant increase in overall CP... [16:15:10] (03CR) 10Filippo Giunchedi: [C: 04-1] "to be merged once the full switchover is in place (cfr. https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Media_storage.2FSwift )" [puppet] - 10https://gerrit.wikimedia.org/r/268080 (https://phabricator.wikimedia.org/T91869) (owner: 10Filippo Giunchedi) [16:17:51] thcipriani: looking good! :) [16:18:08] AndyRussG: cool. Thanks for checking. [16:18:13] !log rebooting mw1139 — OOM [16:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:33] thcipriani: likewise thx mcuh \o/ [16:18:35] andrewbogott: thanks a lot for librenms! [16:19:37] AndyRussG: yw. weird jenkins problem actually made things work out ok. [16:20:51] heh ... how so? [16:21:32] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 5.729 second response time [16:21:59] AndyRussG: just that your patches ended up landing before one of the other SWAT patches because of the jenkins issue. [16:22:14] Ahh heh right 8p [16:22:54] BTW I'm gonna update the deployments page just for the record... [16:23:30] ty! Meant to ask you to do that during the course of SWAT. [16:24:03] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 67187 bytes in 0.797 second response time [16:25:23] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:28:13] thcipriani: is the SWAT over? (I need to repool a DB) [16:28:50] volans: yes it is. sorry I missed your message earlier. [16:29:04] no problem, thanks [16:29:06] done! 
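The two Synchronized entries above correspond roughly to running sync-dir from the MediaWiki staging area on tin; the paths and messages are taken from the !log lines, while the staging directory (/srv/mediawiki-staging) and exact invocation are the usual ones and may differ:
```
cd /srv/mediawiki-staging
sync-dir php-1.27.0-wmf.21/extensions/CentralNotice 'SWAT: Update CentralNotice [[gerrit:283206]]'
sync-dir php-1.27.0-wmf.20/extensions/CentralNotice 'SWAT: Update CentralNotice [[gerrit:283205]]'
```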
[16:30:38] (03CR) 10Volans: [C: 032] Repool db1057 after TLS upgrade on s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283209 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [16:31:05] (03Merged) 10jenkins-bot: Repool db1057 after TLS upgrade on s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283209 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [16:32:14] !log volans@tin Synchronized wmf-config/db-eqiad.php: Repool db1057 after TLS upgrade on s1 - T111654 (duration: 00m 26s) [16:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:26] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2203876 (10mark) p:05Normal>03Unbreak! a:05jcrespo>03RobH @RobH: please buy 4 appropriate disks today, fastest delivery. Hereby approved. [16:35:28] !log rebuilding raid1 array on aqs1001 after hot swapping sdh [16:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:15] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2203889 (10Ottomata) @Cmjohnson has swapped the disk. Faidon helped get the device to show by doing ``` megacli -CfgForeign -Scan -a0 There are 1 foreign configuration(s) on controller 0. ..... [16:38:57] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 4 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2203890 (10Jdlrobson) @Atsirlin can you make an update the pagebanner template to force a cache flush for the pages that use the... [16:39:14] Reedy: is read-only access adequate for your librenms needs? [16:39:53] andrewbogott: I'd presume so... I've no idea what benefit write access would actually provide [16:40:02] me neither [16:40:33] Presumably it's just configuration [16:43:03] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 4 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2203893 (10Wrh2) >>! In T121135#2203890, @Jdlrobson wrote: > @Atsirlin can you make an update the pagebanner template to force a... [16:43:44] (03CR) 10BBlack: [C: 032] graphite.wm.o HTTPS redirect T132461 [puppet] - 10https://gerrit.wikimedia.org/r/283214 (owner: 10BBlack) [16:43:50] chasemp: is the ldap 'ops' group actually populated from the ops stanza in admin/data/data.yaml? Or are the two lists maintained separately? [16:44:13] the latter iirc [16:44:41] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2203911 (10BBlack) [16:44:43] 06Operations, 10Traffic, 07Graphite, 07HTTPS, 13Patch-For-Review: HTTPS redirects for graphite.wikimedia.org - https://phabricator.wikimedia.org/T132461#2203909 (10BBlack) 05Open>03Resolved a:03BBlack [16:44:50] godog: is it somehow poor form for me to create a new ldap group without adding a corresponding section in puppet? (The puppet bits wouldn't do anything) [16:45:32] (03CR) 10Ori.livneh: [C: 032] Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [16:45:48] andrewbogott: I don't think so but I'm not sure tbh, what would be the group? 
iirc being added to ldap and admin in puppet happens during onboarding [16:46:13] godog: 'librenms-readers' [16:46:32] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2203912 (10faidon) Just for the record, after clearing the foreign config, `megacli -PDMakeJBOD -PhysDrv\[32:7\] -a0` was also needed. [16:49:13] (03PS2) 10Ori.livneh: Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [16:49:56] andrewbogott: seems fine to me, and an update to https://wikitech.wikimedia.org/wiki/LDAP_Groups [16:51:00] godog: done, thanks [16:54:20] !log completed TLS upgrade for s1 - T111654 [16:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:02] (03PS1) 10Andrew Bogott: Give librenms access to members of ldap group librenms-readers [puppet] - 10https://gerrit.wikimedia.org/r/283221 (https://phabricator.wikimedia.org/T131252) [16:55:21] (03CR) 10Ori.livneh: [C: 032] "hellooooooo jenkins" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [16:55:46] (03Merged) 10jenkins-bot: Make MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283199 (https://phabricator.wikimedia.org/T104550) (owner: 10Physikerwelt) [16:58:20] (03PS7) 10Gehel: Add caching headers for nginx [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev) [16:58:29] !log ori@tin Synchronized wmf-config/CommonSettings-labs.php: I5a0abcdc: Make MathML rendering default in labs (duration: 00m 39s) [16:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:13] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2203957 (10RobH) [17:02:00] (03PS3) 10Ori.livneh: Add apache::mod::security [puppet] - 10https://gerrit.wikimedia.org/r/278318 (https://phabricator.wikimedia.org/T132599) [17:03:00] (03PS1) 10Dzahn: install_server: let bast1001 use jessie [puppet] - 10https://gerrit.wikimedia.org/r/283224 (https://phabricator.wikimedia.org/T123721) [17:04:08] (03PS2) 10Dzahn: install_server: let bast1001 use jessie [puppet] - 10https://gerrit.wikimedia.org/r/283224 (https://phabricator.wikimedia.org/T123721) [17:04:19] (03CR) 10Dzahn: [C: 032 V: 032] install_server: let bast1001 use jessie [puppet] - 10https://gerrit.wikimedia.org/r/283224 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [17:14:07] (03PS1) 10Ori.livneh: App servers: make Server response header equal to fqdn (mw1017 only) [puppet] - 10https://gerrit.wikimedia.org/r/283226 (https://phabricator.wikimedia.org/T132599) [17:14:19] !log lvs4003 going offline for maint (icinga has been silenced, i think ;) [17:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:14:33] (03PS4) 10Ori.livneh: Add apache::mod::security [puppet] - 10https://gerrit.wikimedia.org/r/278318 (https://phabricator.wikimedia.org/T132599) [17:14:43] (03CR) 10Ori.livneh: [C: 032 V: 032] Add apache::mod::security [puppet] - 10https://gerrit.wikimedia.org/r/278318 (https://phabricator.wikimedia.org/T132599) (owner: 10Ori.livneh) [17:15:08] !log rebooting bast1001 into PXE [17:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:18] (03CR) 
10jenkins-bot: [V: 04-1] App servers: make Server response header equal to fqdn (mw1017 only) [puppet] - 10https://gerrit.wikimedia.org/r/283226 (https://phabricator.wikimedia.org/T132599) (owner: 10Ori.livneh) [17:16:12] (03PS2) 10Ori.livneh: App servers: make Server response header equal to fqdn (mw1017 only) [puppet] - 10https://gerrit.wikimedia.org/r/283226 (https://phabricator.wikimedia.org/T132599) [17:16:32] (03CR) 10Ori.livneh: [C: 032 V: 032] App servers: make Server response header equal to fqdn (mw1017 only) [puppet] - 10https://gerrit.wikimedia.org/r/283226 (https://phabricator.wikimedia.org/T132599) (owner: 10Ori.livneh) [17:16:58] (03PS1) 10Alex Monk: deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) [17:20:05] andrewbogott they are separately managed [17:20:26] (In meetings fyi) [17:22:28] (03PS1) 10Ori.livneh: debug_proxy: don't clobber server header from backend [puppet] - 10https://gerrit.wikimedia.org/r/283229 [17:22:40] (03CR) 10Ori.livneh: [C: 032 V: 032] debug_proxy: don't clobber server header from backend [puppet] - 10https://gerrit.wikimedia.org/r/283229 (owner: 10Ori.livneh) [17:24:32] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0] [17:25:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [17:26:16] (03PS1) 10Ori.livneh: App servers: make Server response header equal to fqdn [puppet] - 10https://gerrit.wikimedia.org/r/283230 (https://phabricator.wikimedia.org/T132599) [17:26:55] (03CR) 10Ori.livneh: [C: 032 V: 032] App servers: make Server response header equal to fqdn [puppet] - 10https://gerrit.wikimedia.org/r/283230 (https://phabricator.wikimedia.org/T132599) (owner: 10Ori.livneh) [17:26:58] !log bast1001 - revoke puppet cert, delete salt key, reinstall with jessie [17:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:51] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:31:13] (03PS1) 10Dzahn: bast1001: rsync home dirs back from tungsten [puppet] - 10https://gerrit.wikimedia.org/r/283231 (https://phabricator.wikimedia.org/T123721) [17:32:28] (03PS2) 10Dzahn: bast1001: rsync home dirs back from tungsten [puppet] - 10https://gerrit.wikimedia.org/r/283231 (https://phabricator.wikimedia.org/T123721) [17:32:48] (03CR) 10Dzahn: [C: 032] bast1001: rsync home dirs back from tungsten [puppet] - 10https://gerrit.wikimedia.org/r/283231 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [17:33:20] !log mwscript deleteEqualMessages.php --wiki cywiki (T45917) [17:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:07] 06Operations, 10Gitblit, 06Release-Engineering-Team, 10Traffic, 07HTTPS: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2204120 (10BBlack) In lieu of anything more-concrete to go on, I've been monitoring the varnishlog live request flow for git.wikimedia.org this mornin... [17:41:31] I was connected to stat1002 and then got kicked out. Trying to SSH in again but getting the warning that remote host identification for bast1001.wikimedia.org has changed. 
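Pieced together from the T130816 comments above, the sequence for getting the hot-swapped aqs1001 drive visible again looks roughly like this; the enclosure:slot value 32:7 is the one quoted there, and the -CfgForeign -Clear step in the middle is an inference:
```
megacli -CfgForeign -Scan -a0               # the replacement drive arrived with a foreign config
megacli -CfgForeign -Clear -a0              # drop that config so the drive can be reused
megacli -PDMakeJBOD -PhysDrv '[32:7]' -a0   # expose the physical drive as JBOD again
```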
[17:41:38] !log tungsten stop and remove rsync package and config [17:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:55] (03PS1) 10BBlack: git.wm.o HTTPS redirect T132460 [puppet] - 10https://gerrit.wikimedia.org/r/283233 [17:42:04] bearloga: it's being reinstalled right now, in scheduled maintenance [17:42:13] bearloga: will be back soon and until then you can use bast2001 or 3001 [17:43:21] mutante: phew! okay, thanks! Was worried something funky could be going on. Will I need to remove its entry from known_hosts and re-auth after it's back up? [17:43:59] (03CR) 10BBlack: [C: 032] git.wm.o HTTPS redirect T132460 [puppet] - 10https://gerrit.wikimedia.org/r/283233 (owner: 10BBlack) [17:44:02] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 83.42 ms [17:44:06] bearloga: yes, i will give you a link with the host keys and send it on list.. maybe grab a coffee and you can use it again [17:44:15] bearloga: just copying back some data .. [17:45:03] bearloga: or feel free to change your ssh config and just replace bast1001 with bast2001 and you'd be fine [17:45:15] mutante: thanks! [17:45:19] 06Operations, 10media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2204146 (10aaron) I'm seeing very few sync errors in the logs lately. [17:46:01] !log lvs4003 rebooted and back online, lvs4004 offlining for maint. [17:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:48:03] 06Operations, 10RESTBase-Cassandra: restbase1007 not assembling raid after reboot - https://phabricator.wikimedia.org/T130930#2204150 (10fgiunchedi) [17:48:05] 06Operations: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2204152 (10fgiunchedi) [17:48:14] (03PS1) 10BBlack: git.wm.o HTTPS redirect - part 2 - T132460 [puppet] - 10https://gerrit.wikimedia.org/r/283234 [17:49:51] (03CR) 10BBlack: [C: 032] git.wm.o HTTPS redirect - part 2 - T132460 [puppet] - 10https://gerrit.wikimedia.org/r/283234 (owner: 10BBlack) [17:50:21] 06Operations: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2204165 (10fgiunchedi) came across the same problem in {T130930}, could be a jessie-specific issue as I don't remember seeing the same on trusty/precise [17:52:58] 06Operations, 10Gitblit, 06Release-Engineering-Team, 10Traffic, and 2 others: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2204187 (10BBlack) 05Open>03Resolved a:03BBlack Resolving for now, although I suspect this is the most likely of the bunch to trigger some ki... 
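For anyone else hitting the host-key warning bearloga describes above after the reinstall, the usual cleanup is to drop the stale entry and compare the new key against the fingerprints being published on the task, roughly:
```
ssh-keygen -R bast1001.wikimedia.org                  # remove the old entry from ~/.ssh/known_hosts
ssh-keyscan bast1001.wikimedia.org > /tmp/bast1001.keys
ssh-keygen -lf /tmp/bast1001.keys                     # check these fingerprints against the published ones
```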
[17:53:01] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2204190 (10BBlack) [17:54:49] (03PS3) 10Muehlenhoff: Don't use package-> latest for apt-transport-https [puppet] - 10https://gerrit.wikimedia.org/r/282941 [17:55:01] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [5000000.0] [17:56:09] (03PS1) 10Aaron Schulz: Set descriptionCacheExpiry for Commons repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283239 [17:56:11] (03CR) 10Bartosz Dziewoński: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [17:56:17] (03PS3) 10Bartosz Dziewoński: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) [17:57:23] 06Operations, 13Patch-For-Review: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#2204201 (10Dzahn) bast1001 is back with jessie. data from home dirs is being copied back as i type the new fingerprints are: ``` +---------+---------+------------------------------------------------... [17:57:41] (03PS1) 10Bartosz Dziewoński: Disable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283243 (https://phabricator.wikimedia.org/T132200) [17:57:59] bearloga: should work again https://phabricator.wikimedia.org/T123721#2204201 [17:58:17] mutante: awesome, thanks! [18:00:02] 06Operations: librsvg path patch needs to be applied for jessie - https://phabricator.wikimedia.org/T132584#2204210 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [18:00:46] !log disable puppet, stop pybal on lvs400[12] (maint shutdown imminent, depooled from DNS since yesterday) [18:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:40] !log the cp sysetms in ulsfo will be rebooting into maint mode regularly for the next few hours. I'll be scheduling for each host as I get to them, but not echoing every cp host reboot in SAL [18:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:52] !log the cp sysetms in ulsfo will be rebooting into maint mode regularly for the next few hours. 
I'll be scheduling downtime for each host as I get to them, but not echoing every cp host reboot in SAL [18:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:06] !log bast1001 back with jessie [18:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:51] PROBLEM - pybal on lvs4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:06:52] PROBLEM - PyBal backends health check on lvs4001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [18:07:01] PROBLEM - pybal on lvs4002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:08:30] PROBLEM - PyBal backends health check on lvs4002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [18:13:09] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 3 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2204281 (10Jdlrobson) 05Open>03stalled p:05High>03Normal [18:13:29] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 3 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#1870420 (10Jdlrobson) Seems fixed, but we'll check in our next sprint after 2 weeks have passed. [18:13:36] (03PS1) 10Dzahn: bast1001: remove temp rsync for migration [puppet] - 10https://gerrit.wikimedia.org/r/283248 (https://phabricator.wikimedia.org/T123721) [18:13:57] (03PS2) 10Dzahn: bast1001: remove temp rsync for migration [puppet] - 10https://gerrit.wikimedia.org/r/283248 (https://phabricator.wikimedia.org/T123721) [18:14:17] (03CR) 10Dzahn: [C: 032] bast1001: remove temp rsync for migration [puppet] - 10https://gerrit.wikimedia.org/r/283248 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [18:14:56] 06Operations, 10Gitblit, 06Release-Engineering-Team, 10Traffic, and 2 others: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2204324 (10BBlack) For the record: the oddball `Mozilla/8.0 (Windows 2008 SP32 + 3patch)` seems to be making it through the HTTPS redirect just fin... 
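The "monitoring the varnishlog live request flow" mentioned in the T132460 comments is, in rough terms, a host-filtered tail of the shared-memory log on a cache node; the form depends on which Varnish major version the node runs (both variants sketched here):
```
varnishlog -c -m 'RxHeader:Host: git.wikimedia.org'      # Varnish 3.x: client-side records, tag/regex match
varnishlog -q 'ReqHeader:Host eq "git.wikimedia.org"'    # Varnish 4.x: VSL query syntax
```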
[18:15:41] (03PS1) 10BBlack: HTTPS redirect for all: 1/3 remove VCL conditional [puppet] - 10https://gerrit.wikimedia.org/r/283249 (https://phabricator.wikimedia.org/T103919) [18:16:01] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:16:20] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:16:27] !log shutdown lvs400[12] [18:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:57] (03CR) 10Ottomata: [C: 031] stats/datasets: remove Apache virtual host stat1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283086 (owner: 10Dzahn) [18:18:00] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:00] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:43] (03PS1) 10BBlack: HTTPS redirect for all: 2/3 remove vcl_config settings [puppet] - 10https://gerrit.wikimedia.org/r/283250 (https://phabricator.wikimedia.org/T103919) [18:18:45] (03PS1) 10BBlack: HTTPS redirect for all: 3/3 remove misc custom block [puppet] - 10https://gerrit.wikimedia.org/r/283251 (https://phabricator.wikimedia.org/T103919) [18:19:50] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:19:50] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:19:51] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:00] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:02] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:10] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:11] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:11] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:12] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:22] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:30] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:31] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:31] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:36] (03CR) 10Aaron Schulz: [C: 04-1] "I'd like https://gerrit.wikimedia.org/r/#/c/283247/ to be dealt with first, unless there is something urgent here." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [18:20:40] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:20:52] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:21:36] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2204346 (10BBlack) [18:22:47] (03PS2) 10BBlack: HTTPS redirect for all: 1/3 remove VCL conditional [puppet] - 10https://gerrit.wikimedia.org/r/283249 (https://phabricator.wikimedia.org/T103919) [18:22:56] (03CR) 10BBlack: [C: 032 V: 032] HTTPS redirect for all: 1/3 remove VCL conditional [puppet] - 10https://gerrit.wikimedia.org/r/283249 (https://phabricator.wikimedia.org/T103919) (owner: 10BBlack) [18:24:01] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 75.25 ms [18:24:51] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [18:25:11] (03PS1) 10Dzahn: bast1001: reorder includes, rm ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/283252 [18:25:55] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2204382 (10Dzahn) [18:26:05] (03PS2) 10BBlack: HTTPS redirect for all: 2/3 remove vcl_config settings [puppet] - 10https://gerrit.wikimedia.org/r/283250 (https://phabricator.wikimedia.org/T103919) [18:27:25] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931698 (10Dzahn) [18:28:20] (03CR) 10ArielGlenn: "This is the right resting place for it now. If we have snpahsots in the future that only run misc crons then we may need to revisit it.Th" [puppet] - 10https://gerrit.wikimedia.org/r/282866 (owner: 10ArielGlenn) [18:28:30] mutante: do bast4001 too? [18:28:41] and don't forget to cleanup SLAACs from network.pp :) [18:28:42] (03PS2) 10ArielGlenn: add debdeploy and admin group configs for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/282866 [18:29:50] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 36 ESP OK [18:29:51] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 74.64 ms [18:29:51] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 24 ESP OK [18:29:52] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 124 ESP OK [18:29:52] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 24 ESP OK [18:30:09] paravoid: 4001 - i need something else in ulsfo to put the install server on i'm afraid and i'm not sure what to use [18:30:10] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 124 ESP OK [18:30:10] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 124 ESP OK [18:30:11] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 24 ESP OK [18:30:19] just use carbon? [18:30:20] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 24 ESP OK [18:30:21] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 124 ESP OK [18:30:30] also bast1001 seems to not have been installed properly [18:30:38] there is no raid configured [18:30:39] i tried that with bast2001, installing from carbon wouldnt work [18:30:40] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 124 ESP OK [18:30:40] sdb is unused [18:30:48] why wouldn't it not work? 
[18:31:11] (03CR) 10ArielGlenn: [C: 032] add debdeploy and admin group configs for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/282866 (owner: 10ArielGlenn) [18:31:31] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 36 ESP OK [18:31:31] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 36 ESP OK [18:31:32] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 124 ESP OK [18:31:41] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK [18:33:42] paravoid: maybe just because of he hardware problems bast2001 had, i'll try it from carbon. re: SLAAC yep, doing that now .. re: RAID awww man.. so it was a manual setup , i changed nothing [18:33:56] (03PS3) 10BBlack: HTTPS redirect for all: 2/3 remove vcl_config settings [puppet] - 10https://gerrit.wikimedia.org/r/283250 (https://phabricator.wikimedia.org/T103919) [18:33:58] might have been misconfigured in the first place [18:34:03] (03CR) 10BBlack: [C: 032 V: 032] HTTPS redirect for all: 2/3 remove vcl_config settings [puppet] - 10https://gerrit.wikimedia.org/r/283250 (https://phabricator.wikimedia.org/T103919) (owner: 10BBlack) [18:34:31] (03PS2) 10BBlack: HTTPS redirect for all: 3/3 remove misc custom block [puppet] - 10https://gerrit.wikimedia.org/r/283251 (https://phabricator.wikimedia.org/T103919) [18:34:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:34:48] (03CR) 10BBlack: [C: 032 V: 032] HTTPS redirect for all: 3/3 remove misc custom block [puppet] - 10https://gerrit.wikimedia.org/r/283251 (https://phabricator.wikimedia.org/T103919) (owner: 10BBlack) [18:36:10] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [18:36:28] ^ is this from the mw* apache thing? [18:36:56] mutante: in any case... needs a reinstall/fixing :) [18:37:21] does that mean writing a new partman recipe ? sigh [18:37:29] !log activated maintenance page for wdqs1002 (data load in progress) [18:37:29] a new one? [18:37:30] why? [18:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:40] we have one for raid1 already [18:38:00] ok [18:39:00] (03PS1) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 [18:39:24] 5xx is still elevated for eqiad-text! [18:39:24] (03PS2) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 [18:39:56] beh, i was using stat1002.eqiad.wmnet earlier today, and now i try to login and it tells me REMOTE HOST IDENTIFICATION HAS CHANGED! [18:40:09] read ops@ [18:40:17] bast1001 was reinstalled [18:41:26] interestingly, the 5xx spike really is only eqiad-text frontends, not e.g. eqiad-esams and such [18:41:38] could be traffic-induced [18:41:43] also mailed wikitech-l, the new fingerprints are here, yurik https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast1001.wikimedia.org [18:42:21] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 76.49 ms [18:42:34] (03PS2) 10Andrew Bogott: Give librenms access to members of ldap group librenms-readers [puppet] - 10https://gerrit.wikimedia.org/r/283221 (https://phabricator.wikimedia.org/T131252) [18:42:35] bblack: isnt that what ori said earlier? 
[18:42:55] except taking longer than he expected [18:43:04] not plausibly related [18:44:33] (03CR) 10Andrew Bogott: [C: 032] Give librenms access to members of ldap group librenms-readers [puppet] - 10https://gerrit.wikimedia.org/r/283221 (https://phabricator.wikimedia.org/T131252) (owner: 10Andrew Bogott) [18:44:41] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 75.42 ms [18:44:49] yeah ori's thing would've affected all the sites [18:45:45] !log lvs4001 - enable->run puppet post-reboot [18:45:48] (03PS8) 10Gehel: Add caching headers for nginx [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev) [18:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:46:22] !log lvs4002 - enable->run puppet post-reboot [18:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:41] RECOVERY - pybal on lvs4001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:47:42] RECOVERY - PyBal backends health check on lvs4001 is OK: PYBAL OK - All pools are healthy [18:47:59] (03CR) 10Gehel: [C: 032] Add caching headers for nginx [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev) [18:48:11] Reedy: try now? https://librenms.wikimedia.org/ [18:48:21] !log activating cache headers for WDQS [18:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:26] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2204462 (10BBlack) [18:49:28] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2204458 (10BBlack) 05Open>03Resolved a:03BBlack [18:50:17] (03PS2) 10Dzahn: bast1001: reorder includes, rm ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/283252 (https://phabricator.wikimedia.org/T123721) [18:50:23] (03PS1) 10Dzahn: partman: make bast1001 use raid1-lvm [puppet] - 10https://gerrit.wikimedia.org/r/283255 (https://phabricator.wikimedia.org/T123721) [18:51:00] (03PS3) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) [18:51:14] (03PS4) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) [18:51:59] this is causing data inconsistencies on Wikidata right now [18:52:03] no idea? [18:53:33] hoo what is? [18:53:39] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2204471 (10Andrew) 05Open>03Resolved @Reedy, you should be all set -- let me know if this doesn't work. [18:53:44] mutante: woops i think you just broke my time consuming sql query :-). [18:54:05] bblack: See for example https://commons.wikimedia.org/w/api.php?action=query&prop=info&redirects=1&converttitles=1&format=json&titles=File:Iogansen YuI.jpg [18:54:19] but thanks for the updates appreciated :) [18:54:23] We internally do such API requests (within the cluster) and get a security redirect foo bar [18:54:32] breaking page name normalization/ validation [18:54:40] hoo: was there some context before "this is causing..."? 
[18:55:19] bblack: No, just the html of the redirect page which MediaWiki serves me [18:55:25] jdlrobson: ugh, sorry, and i have to do it again. please switch to bast2001 or 3003, actually 3001 will be much closer for you if in UK [18:55:34] it's the new one in esams [18:56:03] so [18:56:03] hoo: I'm sorry, I still don't understand. I tried your URL and it does 200 OK with json output [18:56:06] what's with the &* [18:56:20] e.g. https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fen.wikipedia.org is redirecting to https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fen.wikipedia.org&* [18:56:21] bblack: Now, it 301s or 302s [18:56:24] Ü* no [18:56:26] this is breaking all CORS requests [18:56:33] (03PS3) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [18:57:05] hoo: it's 200 for me, your 'https://commons.wikimedia.org/w/api.php?action=query&prop=info&redirects=1&converttitles=1&format=json&titles=File:Iogansen YuI.jpg' [18:57:10] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4008_v4, cp4008_v6, cp4010_v4, cp4010_v6 [18:57:10] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4008_v4, cp4008_v6, cp4010_v4, cp4010_v6 [18:57:10] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4008_v4, cp4008_v6, cp4010_v4, cp4010_v6 [18:57:21] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 120 not-conn: cp4008_v4, cp4008_v6, cp4010_v4, cp4010_v6 [18:57:40] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:57:41] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:57:52] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:00] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:01] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:06] I do see the Security Redirect on 'https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fen.wikipedia.org' [18:58:08] bblack: That's interesting [18:58:11] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:12] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:19] it also works, if I curl the URL [18:58:20] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 122 not-conn: cp4010_v4, cp4010_v6 [18:58:26] yeah I tesed with curl [18:58:26] "security redirect"? 
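[Editor's note: a minimal reproduction of the redirect being compared here, assuming PHP's cURL extension is available. It issues the same GET request with the origin parameter that hoo quotes above and prints the status line and Location header instead of following the redirect; while the bug was live the Location was the same URL with "&*" appended (later filed as T132612), and after the fix it should be a plain 200.]

    <?php
    // Sketch: reproduce the api.php "&*" redirect described above.
    // Fetch the URL without following redirects so the 302 and its
    // Location header can be inspected directly.
    $url = 'https://commons.wikimedia.org/w/api.php?origin=' . rawurlencode('https://en.wikipedia.org');

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // capture the response instead of printing it
        CURLOPT_FOLLOWLOCATION => false,  // we want to see the redirect itself
        CURLOPT_HEADER         => true,   // include response headers in the output
    ]);
    $response = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    echo "HTTP status: $status\n";
    if (preg_match('/^Location:\s*(.+)$/mi', $response, $m)) {
        echo "Location: " . trim($m[1]) . "\n";   // ended in "&*" while the bug was live
    }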
[18:58:29] but not served via browser or via MediaWiki's http class [18:58:31] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:32] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:40] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:40] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 122 not-conn: cp4010_v4, cp4010_v6 [18:58:41] (03PS4) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [18:58:41] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:58:42] (03PS2) 10Dzahn: partman: make bast1001 use raid1-lvm [puppet] - 10https://gerrit.wikimedia.org/r/283255 (https://phabricator.wikimedia.org/T123721) [18:58:50] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:51] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 122 not-conn: cp4010_v4, cp4010_v6 [18:58:51] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4010_v4, cp4010_v6 [18:58:51] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 122 not-conn: cp4010_v4, cp4010_v6 [18:58:53] hoo: when did it start? [18:59:00] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4010_v4, cp4010_v6 [18:59:01] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:59:10] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:59:17] bblack: Dunno for sure [18:59:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:59:23] (03PS3) 10Dzahn: partman: make bast1001 use raid1-lvm [puppet] - 10https://gerrit.wikimedia.org/r/283255 (https://phabricator.wikimedia.org/T123721) [18:59:28] (03CR) 10Dzahn: [C: 032] partman: make bast1001 use raid1-lvm [puppet] - 10https://gerrit.wikimedia.org/r/283255 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [18:59:43] hoo: since the word Security is in there, and because it's affecting CORS, I'd check in with csteipp [19:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T1900). [19:00:25] it's *breaking* CORS [19:00:33] is this a security patch or something? [19:00:40] MatmaRex: Can't see anything [19:00:42] I have no idea [19:00:46] already checked prior to asking [19:00:48] i see nothing related in SAL :( [19:00:53] same here [19:01:06] ostriches: when deploying, please dont use bast1001 right now in your ssh config [19:01:06] are we sure it's new behavior? does some log point to when it changed in time? [19:01:14] i'm going to tell you when it started in a minute [19:01:20] mutante: What can I use instead? [19:01:30] bblack: Yes, we are sure... 
and no [19:01:35] ostriches: bast2001 [19:01:41] Sitelink handling on Wikidata is completely broken now [19:01:45] and people indeed notice taht [19:01:49] bblack: it started when these uploads stopped: https://commons.wikimedia.org/w/index.php?title=Special:RecentChanges&tagfilter=cross-wiki-upload [19:01:51] ostriches: or 3001 or 4001 even [19:02:03] first user report at 18:13 UTC [19:02:09] around 17:45 UTC [19:02:36] wait, I can even check the logs [19:02:44] Shouldn't be a security patch. For normal mediawiki requests we do the IE-stupidity redirect if we think IE will think this is a filename.. [19:02:57] Actually... might be a security patch. One sec. [19:03:19] No, nevermind. I don't think it's a security patch. [19:03:32] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [19:03:39] IEUrlExtension hasn't been touched in a while [19:03:41] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [19:03:51] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [19:03:51] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [19:04:01] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [19:04:02] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [19:04:11] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [19:04:11] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 124 ESP OK [19:04:13] Did tin's ssh fingerprint change? [19:04:17] csteipp: looking at the code, that's not supposed to happen for POST requests ever, and it's happening [19:04:17] Since yesterday? [19:04:30] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [19:04:31] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [19:04:32] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [19:04:32] no, I just sshed in just fine [19:04:32] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 124 ESP OK [19:04:41] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [19:04:42] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [19:04:42] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 124 ESP OK [19:04:47] (03PS1) 10Chad: group1 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283257 [19:04:50] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 124 ESP OK [19:04:50] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [19:04:51] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [19:04:56] * csteipp is going to back away slowly from tin... [19:05:00] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 124 ESP OK [19:05:01] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [19:05:01] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [19:05:02] csteipp: bast1001 did, if you are going through that [19:05:10] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 124 ESP OK [19:05:10] (03CR) 10Chad: [C: 032] group1 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283257 (owner: 10Chad) [19:05:12] ostriches: can you hold? shit's broken :( [19:05:19] Oh, yeah.. 
bast1001 ;) [19:05:30] ostriches: Yes, seriously [19:05:39] (03Merged) 10jenkins-bot: group1 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283257 (owner: 10Chad) [19:05:49] csteipp: i'll have to change that one more time, you might want to switch to just bast2001 [19:06:06] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204512 (10matmarex) [19:06:18] I won't sync wikiversions. [19:06:31] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204514 (10hoo) p:05High>03Unbreak! [19:06:42] csteipp: bast1001's fingerprint changed (but it's being reinstalled right this second to fix a raid setup issue) [19:07:11] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204366 (10hoo) This also breaks changing sitelinks on Wikidata. This causes data inconsistencies between Wikidata and the clients (Wikipedias... [19:07:24] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204366 (10matmarex) This started around 17:45 UTC today, judging by when the cross-wiki uploads stopped happening: https://commons.wikimedia.... [19:07:52] ostriches: Ok, you might also want to revert so that we can easily use tin for deployments once we have a fix [19:08:33] As long as nobody does sync-wikiversions you're fine :) [19:08:44] mh, ok [19:09:44] (03PS1) 10Dzahn: Revert "bast1001: rsync home dirs back from tungsten" [puppet] - 10https://gerrit.wikimedia.org/r/283258 [19:09:50] (03PS2) 10Dzahn: Revert "bast1001: rsync home dirs back from tungsten" [puppet] - 10https://gerrit.wikimedia.org/r/283258 [19:09:57] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204552 (10matmarex) ``` $ curl -i https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fen.wikipedia.org HTTP/1.1 302 Found Date: Wed,... [19:10:06] I really don't have much hint to go on here, like you I don't see any obvious change in the right timeframe that would be remotely related [19:10:18] unless something unlogged happened [19:10:21] MatmaRex: "This started around 17:45 UTC today" I'm assuming nothing changed in the upload process that suddenly started using cors in a different way, right? [19:10:26] (03CR) 10Dzahn: [C: 032] Revert "bast1001: rsync home dirs back from tungsten" [puppet] - 10https://gerrit.wikimedia.org/r/283258 (owner: 10Dzahn) [19:10:29] csteipp: nope [19:10:37] csteipp: and this affects Wikidata stuff too, apparnetly [19:10:44] I can confirm there have been no unlogged security patches deployed in the last few days. [19:10:47] csteipp: Also MediaWikiPageNameNormalizer in core is also affected (as it does API requests) [19:10:55] this definitely looks like the result of WebRequest::doSecurityRedirect() [19:11:07] (03CR) 10Dzahn: "this is just to copy the home dir data one more time in the opposite direction" [puppet] - 10https://gerrit.wikimedia.org/r/283258 (owner: 10Dzahn) [19:11:59] MatmaRex: Yeah, but nothing about that changed AFAICT [19:12:01] MatmaRex: The redirect, yes. 
So something changed so that the redirect is suddenly being called. Someone want to add a stack trace somewhere to figure out the call path? [19:12:35] csteipp: i can tell you the call path :) api.php does $wgRequest->checkUrlExtension() [19:13:01] and this does the security redirect [19:13:22] now someone should figure out why IEUrlExtension::areServerVarsBad() is suddenly returning true [19:13:42] wait what about ori's Server: header thing? [19:13:43] anything in apache configuration changed? i'm not even sure what exactly that methods checks for [19:13:52] I discounted that as unrelated, but ... [19:13:55] but i know some of our apache rewrites are crazy [19:13:56] (03PS1) 10Chad: Try to tune back ldap logging a tad. Rather spammy. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283259 [19:14:02] it also loads a new apache module too [19:14:11] https://gerrit.wikimedia.org/r/#/c/283226/ [19:14:31] and then followed up with https://gerrit.wikimedia.org/r/#/c/283230/1 [19:14:41] sounds scary enough to me [19:14:57] bblack: Do we at some point mangle URLs as to change the order of get parameters? [19:15:16] hoo: we do very fucked up things, there's some other bug this is causing [19:15:19] https://commons.wikimedia.org/w/api.php?action=query&prop=info&redirects=1&converttitles=1&titles=File:Iogansen%20YuI.jpg&format=json [19:15:32] that works (because the .jpg is not the last parameter) [19:15:37] (03PS1) 10Dzahn: Revert "Revert "bast1001: rsync home dirs back from tungsten"" [puppet] - 10https://gerrit.wikimedia.org/r/283260 [19:16:00] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:05] (03CR) 10Dzahn: "the changes are fine, revert means copy a->b and revert-revert means copy b->a :p" [puppet] - 10https://gerrit.wikimedia.org/r/283260 (owner: 10Dzahn) [19:16:12] hoo: I think the param order thing is a natural effect with the IE check. it's only trying to prevent filename ending problems [19:16:21] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 75.98 ms [19:16:40] ori: ? [19:16:42] bblack: hm... but why is this suddenly a problem? [19:17:06] bblack: catching up [19:17:44] ori: https://phabricator.wikimedia.org/T132612 can it be caused by https://gerrit.wikimedia.org/r/#/c/283230/ ? [19:18:16] yes, it's ori's thing [19:18:22] (indirectly) [19:18:27] in https://doc.wikimedia.org/mediawiki-core/REL1_25/php/IEUrlExtension_8php_source.html [19:18:37] static $whitelist = array( 261 'Apache', 262 'Zeus', 263 'LiteSpeed' ); 264 if ( preg_match( '/^(.*?)($|\/| )/', $serverSoftware, $m ) ) { 265 return in_array( $m[1], $whitelist ); [19:18:51] basically part of that code in there for the IE bug-check is looking for Apache in the server header [19:18:57] and that's been replaced with the hostname [19:19:12] ok [19:19:17] 06Operations, 10ops-ulsfo, 06DC-Ops: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#2204587 (10RobH) In addition to replacing the thermal paste on all lvs4001-4004, I've knocked out the following hosts from T125205. * cp4008 * cp4010 * cp4011 * cp4012 That took care of all... 
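[Editor's note: the snippet quoted just above comes from the doc viewer and has source line numbers (261-265) interleaved in it. Below is a cleaned-up, self-contained sketch of the same whitelist logic (illustrative, not the exact MediaWiki source) showing why a Server value of "Apache" passes while the bare hostname now being sent does not.]

    <?php
    // Sketch of the whitelist check quoted above, with the doc-viewer line numbers
    // stripped. It takes the first token of SERVER_SOFTWARE -- everything up to the
    // first "/", space, or end of string -- and accepts only known server names.
    function serverSoftwareIsWhitelisted($serverSoftware) {
        static $whitelist = ['Apache', 'Zeus', 'LiteSpeed'];
        if (preg_match('/^(.*?)($|\/| )/', $serverSoftware, $m)) {
            return in_array($m[1], $whitelist);
        }
        return false;
    }

    var_dump(serverSoftwareIsWhitelisted('Apache'));                 // true
    var_dump(serverSoftwareIsWhitelisted('Apache/2.4.10 (Debian)')); // true: first token is "Apache"
    var_dump(serverSoftwareIsWhitelisted('mw1171.eqiad.wmnet'));     // false: a hostname is not on the list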
[19:19:39] I suggest live hacking that function to return true [19:19:40] ori: ^ [19:19:42] Server: mw1171.eqiad.wmnet makes that code ( IEUrlExtension::areServerVarsBad ) behave differently than Server: Apache [19:20:02] OK, I don't have the full picture yet, but I'll do what hoo suggests for now [19:20:04] just a moment [19:20:10] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [19:22:06] hoo: (btw, https://phabricator.wikimedia.org/T123276 is the other bug caused by the URL getting rewritten before reaching apaches) [19:23:20] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/libs/IEUrlExtension.php: Live-hack IEUrlExtension::haveUndecodedRequestUri() to always return true (duration: 00m 33s) [19:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:24:00] ori: Looks good [19:24:12] yeah [19:24:13] !log ori@tin Synchronized php-1.27.0-wmf.21/includes/libs/IEUrlExtension.php: Live-hack IEUrlExtension::haveUndecodedRequestUri() to always return true (duration: 00m 33s) [19:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:24:20] confirmed [19:24:46] OK, now that it's not an ongoing crisis, wtf happened exactly, and what is this evil code? [19:24:59] it looks like this would probably work (without the hack) if we made the server header more like "Apache/mw1171" [19:25:47] I want to first ascertain that this code has value [19:25:51] ori: when the bug is triggered, the output is a 302 that says: [19:25:52] We can't serve non-HTML content from the URL you have requested, because [19:25:56] Internet Explorer would interpret it as an incorrect and potentially dangerous [19:25:59] content type.


Instead, please use this URL, which is the same as the [19:26:04] URL you have requested, except that "&*" is appended. This prevents Internet [19:26:08] Explorer from seeing a bogus file extension. [19:26:09] ori: I don't think that code even "worked" before this [19:26:10] so it's some kind of IE bug workaround [19:26:11] ori: yes-ish. it prevents XSS for IE 6. :P [19:26:19] if it's only IE6 or lower that has said bug, IMHO we can nuke this code at WMF, because IE6 can't connect anyways [19:26:28] IE 6 is blacklisted [19:26:32] it can connect? [19:26:35] (due to TLS restrictions, it can't even make a connection) [19:26:47] Yeah, it's IE6 only, I don't think IE7 is doing that insane stuff anymore [19:26:51] right, and there's that [19:27:03] let me confirm that it's not an issue with IE7 and then i'll submit a patch to nuke this code [19:27:07] bblack: are you sure? isn't it just pre-service-pack IE6? [19:27:15] hold on, jesus [19:27:24] Yeah, you can enable TLS1.1 on IE6 or so [19:27:39] Unpatched IE6. You can flip the config to make it work on SP3 (iirc) [19:27:41] IE 6 can view Wikipedia [19:27:44] even if it couldn't [19:27:48] IE 6 can view other MediaWiki sites [19:27:52] which is what this check is meant for [19:27:55] hmm yeah that may be true, although it's rare [19:27:56] you can't just go and delete if c [19:28:01] …it because you don't like it [19:28:11] it's not rare. it's normal [19:28:14] Then slap a config in front of it [19:28:17] bblack: pretty common for corporate managed laptops... [19:28:28] and add a comment that it's IE6 only and supposed to get killed at some point [19:28:29] if anything, yes, this could be behind a config option, on by default [19:28:29] how does SERVER_SOFTWARE enter into it? [19:28:42] SERVER_SOFTWARE is the Server: header [19:28:44] ori: via PHP'S super global [19:28:45] What MatmaRex said. Config it, on by default. [19:28:46] the real problem is that our apaches have shit rewrites [19:29:01] changing the Server header or whatever should not have caused this [19:29:13] it can't be *that* common, it's still way under a percent of our traffic [19:29:16] if you want a real fix, please find out why QUERY_STRING is decoded [19:29:23] (just as an outside bound) [19:29:39] (03CR) 10Ottomata: analytics_cluster: Add wrapper script for beeline (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) (owner: 10Madhuvishy) [19:29:42] bblack: out of all IE 6 traffic, i'd think most is upgraded as far as it can be. :) [19:29:50] out for now, will try to be back for the train and further catastrophes... but first, food [19:29:53] yeah but most traffic is not IE6 to begin with [19:29:53] (it's a small portion of all traffic either way) [19:30:10] ori: so [19:30:28] couldn't it also be user agent switched? [19:30:28] ori: the SERVER_SOFTWARE check is to see if we can rely on REQUEST_URI, or if we should use QUERY_STRING instead. [19:30:43] 0.31% market share for IE6 [19:30:53] I'm still parsing "When passed the value of $_SERVER['SERVER_SOFTWARE'], this function returns true if that server is known to have a REQUEST_URI variable with %2E not decod ed to ".". On such a server, it is possible to detect whether the script filename has been obscured. The function returns false if the server is not know n to have this behavior. Microsoft IIS in particular is known to decode escaped script fil [19:30:53] enames." [19:30:54] ori: apparently, QUERY_STRING is fucked up in WMF deployment. 
it probably shouldn't be automatically decoded. [19:31:28] I don't think it's related to WMF [19:31:33] This check returns true for Apache [19:31:34] not WMF Apache [19:31:41] and various other server softwares as well [19:31:44] ori: and we have other bugs about $_SERVER vars being fucked up, e.g. https://phabricator.wikimedia.org/T123276 [19:31:44] (that WMF does not use) [19:32:05] and at least one other one i can't find now [19:32:07] we do decode the whole URI to some custom degree in varnish, too [19:32:18] We can change the fqnd module to change instead of replace. e.g. Server: Apache (%fqdn) [19:32:19] only for MediaWiki [19:33:00] that would still be fucked up and would bite us again some day [19:33:14] I want a few minutes to understand it properly and think about it, are deployments pending? [19:33:23] train is held up [19:34:25] ostriches: can you wait a few? [19:34:26] https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/normalize_path.inc.vcl.erb is the varnish URI decoding that happens for MW (and a variant for RestBase too, now, relatedly) [19:34:32] (03CR) 10Dzahn: [C: 032] Revert "Revert "bast1001: rsync home dirs back from tungsten"" [puppet] - 10https://gerrit.wikimedia.org/r/283260 (owner: 10Dzahn) [19:34:37] the code has been slightly moved/updated in recent times, but it's ancient in origin [19:34:50] (including most of the commentary on why) [19:35:03] (03PS1) 10Dzahn: Revert "bast1001: remove temp rsync for migration" [puppet] - 10https://gerrit.wikimedia.org/r/283263 [19:35:18] ori: I was [19:35:29] (03PS2) 10Dzahn: Revert "bast1001: remove temp rsync for migration" [puppet] - 10https://gerrit.wikimedia.org/r/283263 [19:35:37] (03CR) 10Dzahn: [C: 032] Revert "bast1001: remove temp rsync for migration" [puppet] - 10https://gerrit.wikimedia.org/r/283263 (owner: 10Dzahn) [19:36:06] thanks [19:36:25] so, is the idea that QUERY_STRING would be un-decoded in cases where REQUEST_URI is decoded? [19:36:35] couldn't we just choose the one with more % in it? [19:36:39] (03PS1) 10Dzahn: Revert "Revert "bast1001: remove temp rsync for migration"" [puppet] - 10https://gerrit.wikimedia.org/r/283264 [19:36:50] and the other bug i know caused by this is https://phabricator.wikimedia.org/T128380 , found it now [19:37:06] oh, the varnish code I linked does stop at ?query - it only decodes the path part [19:37:14] so it's not that [19:37:28] ori: i think the best simplest long-term fix is to prefix the value with "Apache " in your patch [19:37:38] can we do that, verify, and run the train? [19:37:48] I'd say that's a medium-term fix at best, though [19:37:49] no, because I don't agree [19:37:58] we have talked about replacing apache with nginx, and that would bite us in that case too [19:38:01] we'll face this again when we dump apache for hhvm's own HTTP server, etc [19:38:05] or that [19:38:28] bblack: ori: we hopefully won't, if we also fix whatever is decoding the query string. i guess that is the longer-term fix. [19:38:30] it's bad code [19:38:45] https://phabricator.wikimedia.org/T128380#2072420 [19:38:46] "the data passed to HHVM is a mixed bag of already-decoded and non-decoded nonsense" [19:38:50] which is bad code? [19:39:28] haveUndecodedRequestUri [19:39:41] the part that cares about what "Server: " says in reference to some IE6 client side bug? 
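[Editor's note: rough shape of the live hack ori synced out above and mentions committing to the production branch just below. Per the documentation quoted a few lines up, the real method is passed $_SERVER['SERVER_SOFTWARE']; this is a sketch of the effect of the hack, not the actual diff.]

    <?php
    // The live hack: instead of consulting the SERVER_SOFTWARE whitelist, claim
    // unconditionally that REQUEST_URI is left undecoded, so query strings ending
    // in things like ".jpg" or the origin parameter no longer trigger the "&*"
    // security redirect.
    function haveUndecodedRequestUri($serverSoftware) {
        return true; // skip the whitelist check entirely
    }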
[19:39:50] that server whitelist is suspect [19:40:15] I'll unblock the train by committing the live-hack into the production branch [19:40:32] then figure out how to clean this up, and revert the live-hack [19:40:55] it's from http://mediawiki.org/wiki/Special:Code/MediaWiki/89558 [19:42:39] the logic there is hard to follow heh [19:43:19] https://phabricator.wikimedia.org/T30235#331525 [19:45:43] (03PS1) 10Mobrovac: Math: increase the number of concurrent connections to 150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) [19:45:54] (03PS2) 10Alex Monk: deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) [19:47:22] 06Operations, 13Patch-For-Review: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#2204676 (10Dzahn) had to reinstall one more time because it did no use RAID1, now it does: root@bast1001:~# mdadm --detail /dev/md0 |grep Level Raid Level : raid1 ``` +---------+---------+-------... [19:48:10] what about the later comments below that, where it says: [19:48:13] So by sending "Content-Disposition: filename=api.php", we can avoid having to deal with the broken behaviour of GetFileExtensionFromUrl(). [19:48:50] seems like that might be a superior approach [19:49:10] and regardless, surely we could lock this whole thing down to "only if UA string matches IE6 in the first place"? [19:49:51] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204693 (10matmarex) Worked around for now: [19:23] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/libs/IEUrlExtension.ph... [19:49:53] ostriches: train unblocked [19:50:22] i merged the live-hack to both prod branches, will back it out and replace with a proper fix in a little while [19:50:33] (03PS5) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [19:51:50] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:52:01] (03PS6) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [19:52:38] I also don't see how it is correct to assume that Apache is the only server software in the request path [19:54:11] ori: yeah. there's https://phabricator.wikimedia.org/T47501 ;) [19:54:30] PROBLEM - NTP on cp4011 is CRITICAL: NTP CRITICAL: Offset unknown [19:54:31] yeah, there you go [19:54:58] before i go on ranting -- thank you, bblack, MatmaRex, hoo, ostriches [19:55:21] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:56:25] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [19:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:35] mutante: I think that's you with un-puppet-merged stuff? [19:58:08] bblack: yes, sry, done [19:59:11] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:59:31] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
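[Editor's note: a hedged sketch of the two alternatives floated just above: sending a Content-Disposition header that names api.php so IE's extension sniffing sees a harmless filename, and only running the check at all when the User-Agent looks like old IE. The MSIE regex is an assumption for illustration, the quoted comment gives only "filename=api.php" (the "inline;" disposition type is added here for a well-formed header), and neither alternative is what was deployed.]

    <?php
    // Sketch of the alternatives discussed above, not of anything that shipped.
    // Assumption: a plain "MSIE 5-8" test is a good-enough stand-in for
    // "old Internet Explorer"; real detection would be more careful.
    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if (preg_match('/MSIE [5-8]\./', $userAgent)) {
        // Alternative 1: tell IE what the "file name" is, so its URL-extension
        // sniffing never sees a bogus ".jpg" or ".php" at the end of the query string.
        header('Content-Disposition: inline; filename=api.php');

        // Alternative 2 would instead run the existing checkUrlExtension() /
        // security-redirect logic only inside this branch, skipping it for
        // every other user agent.
    }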
[20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T2000). [20:00:36] no parsoid deploy today [20:01:11] PROBLEM - salt-minion processes on bast1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:02:00] (03PS1) 10Alex Monk: shinken: Add myself to Beta Cluster Administrators contact group [puppet] - 10https://gerrit.wikimedia.org/r/283278 [20:02:02] (03PS7) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [20:04:45] (03CR) 10Dzahn: [C: 032] shinken: Add myself to Beta Cluster Administrators contact group [puppet] - 10https://gerrit.wikimedia.org/r/283278 (owner: 10Alex Monk) [20:05:18] ty mutante [20:05:35] np [20:08:31] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 76.11 ms [20:08:55] (03PS8) 10Ottomata: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) (owner: 10Madhuvishy) [20:10:37] (03CR) 10Ottomata: [C: 032] analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) (owner: 10Madhuvishy) [20:11:12] OH madhuvishy i think we don't need that hive.yaml file at all [20:11:14] since we grab sane defaults [20:11:19] we'd only need that to override [20:12:45] ottomata: hmmm - just set the port number in the client class too? [20:13:02] no you have a hiera default lookup [20:13:05] the defaults are good [20:13:17] ah [20:13:18] (03PS1) 10Ottomata: Remove unneeded hive/client.yaml role hiera file [puppet] - 10https://gerrit.wikimedia.org/r/283279 [20:13:28] right, okay [20:13:42] can add if needed [20:13:44] (03CR) 10Ottomata: [C: 032 V: 032] Remove unneeded hive/client.yaml role hiera file [puppet] - 10https://gerrit.wikimedia.org/r/283279 (owner: 10Ottomata) [20:13:53] and also override in labs if needed i guess [20:13:55] (03PS2) 10Dzahn: Revert "Revert "bast1001: remove temp rsync for migration"" [puppet] - 10https://gerrit.wikimedia.org/r/283264 [20:14:21] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2204736 (10ezachte) I can't login right now to check. The vast majority of that 2TB will be backups, which I thin out every half year or so. All html files in htdocs should be co... 
[20:14:34] in labs you have 2 options for that, hiera files in the repo or edit a special wiki page [20:15:09] (03CR) 10Dzahn: [C: 032] "done copying" [puppet] - 10https://gerrit.wikimedia.org/r/283264 (owner: 10Dzahn) [20:17:15] (03PS1) 10BBlack: Revert "Disable ulsfo T128424" [dns] - 10https://gerrit.wikimedia.org/r/283288 [20:17:20] (03PS2) 10BBlack: Revert "Disable ulsfo T128424" [dns] - 10https://gerrit.wikimedia.org/r/283288 [20:17:31] RECOVERY - NTP on cp4011 is OK: NTP OK: Offset -0.0001041889191 secs [20:19:28] (03CR) 10BBlack: [C: 032] Revert "Disable ulsfo T128424" [dns] - 10https://gerrit.wikimedia.org/r/283288 (owner: 10BBlack) [20:19:54] (03CR) 10Physikerwelt: Math: increase the number of concurrent connections to 150 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) (owner: 10Mobrovac) [20:20:08] !log re-pooling ulsfo traffic T128424 [20:20:10] RECOVERY - salt-minion processes on bast1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:04] 07Puppet, 10Beta-Cluster-Infrastructure, 06Services, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2204755 (10mobrovac) `deployment-(mathoid|sca0[12])` have been f... [20:24:44] 07Puppet, 10Beta-Cluster-Infrastructure, 06Services, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2204772 (10Krenair) Yeah, I can only log in as root there, not m... [20:28:15] (03PS1) 10Madhuvishy: analytics_cluster: Fix beeline path in wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/283323 [20:30:22] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2204797 (10ArielGlenn) It's already installed and provides hhvm-gdb which I used above. [20:30:36] (03PS2) 10Madhuvishy: analytics_cluster: Fix beeline path in wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/283323 [20:30:43] 07Puppet, 10Beta-Cluster-Infrastructure, 06Services, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2192956 (10hashar) For deployment-cxserver03 I have filled {T132... 
[20:32:04] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2204810 (10mobrovac) [20:33:22] (03CR) 10Ottomata: [C: 032 V: 032] analytics_cluster: Fix beeline path in wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/283323 (owner: 10Madhuvishy) [20:33:44] (03CR) 10Mobrovac: Math: increase the number of concurrent connections to 150 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283269 (https://phabricator.wikimedia.org/T132096) (owner: 10Mobrovac) [20:34:20] (03PS5) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) [20:34:58] (03PS6) 10Dzahn: network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) [20:35:27] (03CR) 10Dzahn: [C: 032] network: remove bast1001 SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/283254 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [20:41:26] MatmaRex: how does it make sense to default to QUERY_STRING? [20:41:45] elukey: awake enough for a 10-second review? https://gerrit.wikimedia.org/r/#/c/283324/1 [20:43:50] ori: i don't know, but it's tim's code, so i'm assuming it does until proven otherwise. i think it's a WMF misconfiguration problem. [20:44:31] ori_: are you working on a fix? a $wg variable to choose which of REQUEST_URI, QUERY_STRING, PATH_INFO should be checked probably makes sense, on second thought [20:45:30] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2204871 (10matmarex) [20:45:54] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2204887 (10matmarex) [20:46:05] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2204871 (10matmarex) [20:46:10] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204891 (10matmarex) [20:46:18] MatmaRex: I propose checking whether $_SERVER['QUERY_STRING'] contains more %s than the query component of the URI as derived from $_SERVER['REQUEST_URI'] [20:46:18] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2204894 (10Dzahn) If that is mostly HTML and text let me try to compress that, we should achieve a high compression ratio and maybe it's not so bad then. [20:46:49] filed https://phabricator.wikimedia.org/T132629 , btw [20:46:54] 06Operations, 10DBA, 10MediaWiki-Special-pages, 10Wikidata, 07Performance: Batch updates create slave lag on s3 over WAN - https://phabricator.wikimedia.org/T122429#2204902 (10hoo) [20:48:46] ori_: i honestly don't know enough about this to say if that makes sense. it doesn't sound entirely unreasonable. it would be very silly if it resulted in false positives, though. 
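[Editor's note: a minimal sketch of the heuristic ori proposes above: prefer whichever of QUERY_STRING and the query component of REQUEST_URI still contains more percent signs, on the theory that it is the less-decoded one. Unreviewed, and MatmaRex's false-positive concern applies; the example request below is made up.]

    <?php
    // Pick the "rawest" query string available: count literal '%' characters as a
    // proxy for "still percent-encoded" and return whichever source has more.
    function pickRawestQueryString(array $server) {
        $fromRequestUri = '';
        if (isset($server['REQUEST_URI'])) {
            $q = parse_url($server['REQUEST_URI'], PHP_URL_QUERY);
            $fromRequestUri = is_string($q) ? $q : '';
        }
        $fromQueryString = isset($server['QUERY_STRING']) ? $server['QUERY_STRING'] : '';

        return substr_count($fromQueryString, '%') > substr_count($fromRequestUri, '%')
            ? $fromQueryString
            : $fromRequestUri;
    }

    // Example: REQUEST_URI arrives decoded, QUERY_STRING is untouched.
    var_dump(pickRawestQueryString([
        'REQUEST_URI'  => '/w/api.php?origin=https://en.wikipedia.org',
        'QUERY_STRING' => 'origin=https%3A%2F%2Fen.wikipedia.org',
    ])); // the still-encoded QUERY_STRING wins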
[20:53:37] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: hhvm apache fills /var/log/apache2 with access logs - https://phabricator.wikimedia.org/T75262#2204929 (10Krenair) [20:54:37] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: hhvm apache fills /var/log/apache2 with access logs - https://phabricator.wikimedia.org/T75262#755023 (10ori) You can continue finding and disabling everything that needs disk space, or you can just increase the amount of disk space available to these mach... [20:58:35] 06Operations, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2204946 (10Andrew) [20:58:37] 06Operations, 13Patch-For-Review: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#2204944 (10Andrew) 05Open>03Resolved a:03Andrew [20:58:57] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: Beta-cluster web server fills up /var/log with Apache logs - https://phabricator.wikimedia.org/T75262#2204949 (10Krinkle) [20:59:20] (03CR) 10Dzahn: [C: 032] "no diff, carbon is the real aggregator http://puppet-compiler.wmflabs.org/2431/" [puppet] - 10https://gerrit.wikimedia.org/r/283252 (https://phabricator.wikimedia.org/T123721) (owner: 10Dzahn) [20:59:41] (03PS4) 10Dzahn: bast1001: reorder includes, rm ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/283252 (https://phabricator.wikimedia.org/T123721) [21:00:42] I've closed two bugs on the clinic-duty dashboard. That leaves only ~798 to go [21:00:51] there's a clinic-duty dashboard? [21:01:42] andrewbogott: wow, how.. you fixed all of them? [21:01:49] "top-scope vars" that is [21:02:10] looks [21:02:19] greg-g: Around? [21:02:27] mutante: I think so? You already merged the linter change didn't you? Or do I misunderstand that bug? [21:02:44] Krenair: https://phabricator.wikimedia.org/dashboard/view/45/ [21:02:52] Which at the moment is an out-of-control disaster [21:03:22] MatmaRex: is https://phabricator.wikimedia.org/T132612 still "Unbreak now!" after your workaround? [21:04:16] i'm looking.. 5 min [21:04:28] andrewbogott: ori's. probably not anymore [21:05:05] hoo: greg-g is out today [21:05:09] ok [21:05:15] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204953 (10Andrew) p:05Unbreak!>03High [21:05:30] bd808: So... who is the go to person about deploys today? [21:05:41] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2204956 (10matmarex) This was indirectly caused by https://gerrit.wikimedia.org/r/#/c/283230/. That patch resulted in a change to the value of... [21:06:33] hoo: probably ostriches I would guess. He was running the train earlier [21:06:52] What's up? [21:07:04] ostriches: I would like to bump Wikibase (Wikidata) to master [21:07:12] jouncebot_: next [21:07:12] In 1 hour(s) and 52 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T2300) [21:07:18] Sure :) [21:07:32] Thanks :) [21:07:45] Ostrich der Lokomotivfuehrer und die Wilde wmf-13 [21:07:56] shuts up [21:07:58] :'D [21:09:17] !log https://tools.wmflabs.org/sal missing entries since 2016-04-13T09:21. 
Needs to be backfilled [21:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:10:01] andrewbogott: you did fix it, i didnt realize that's what you did with the submodule, thank you! [21:10:20] ori_: hmm, do me a favour in re https://phabricator.wikimedia.org/T132629? live-hack mw1017 or something with `var_dump( $_SERVER )`, then try visiting a few URLs and paste the results on the task? [21:10:21] https://commons.wikimedia.org/w/api.php?origin=https%3A%2F%2Fwww.mediawiki.org [21:10:24] https://en.wikipedia.org/wiki/Why_Me%3F_(Daniel_Johnston_Album) [21:10:28] https://test.wikipedia.org/wiki/Bug%3F?action=history [21:10:33] modules/kafka has a bunch of errors that were gone and got readded [21:10:42] which was only possible by overriding jenkins [21:10:45] i can't fix it, but i'm curious how broken it is. :) [21:10:54] bd808: why did it happen? [21:10:56] tssk tssk, please [21:10:57] MatmaRex: sure, give me a moment. [21:11:32] ori_: not sure exactly. the bot process was running but not in any channels. Maybe it netsplit? maybe something else. [21:11:33] ori_: thanks. also, strip your cookie and stuff from the pastes :D [21:11:47] ($_SERVER has HTTP_COOKIE) [21:11:58] my ssh is complaining about the ECDSA key of bast1001.wikimedia.org (208.80.154.149) [21:12:07] is that an expected key rotation? [21:12:08] there wasn't anything interesting the err log for the bot :/ [21:12:30] cscott: yeah. server was rebuilt today. should have an email about it [21:12:31] cscott: yes, expected, there's an email to the ops list about it [21:12:37] ok, thanks. [21:12:42] been off in parser land today [21:12:46] cscott: https://phabricator.wikimedia.org/T123721#2204676 [21:12:57] mutante: ? [21:13:40] (03PS1) 10Mattflaschen: Enable Echo survey on French-language wikis (retry) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) [21:14:05] (03CR) 10Mattflaschen: "Retry at https://gerrit.wikimedia.org/r/283330" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282414 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [21:14:28] ottomata: re: puppet-lint, we fixed most of the warnings/errors globally so that for one type of error/warning there were 0 left across the repo , so then we could let jenkins vote on that specific one [21:14:37] (03PS2) 10Mattflaschen: Enable Echo survey on French-language wikis (retry) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) [21:14:51] ottomata: then i noticed that some came back [21:15:07] when i looked at that ticket that andrew just closed [21:15:26] it's a different check though, not the "top-scope var" thing, but just the alignment [21:15:33] already started fixing [21:16:04] oh ok [21:16:06] The thing I fixed was a logic error that would've been caught by the linter. So… +1 for linting. [21:16:35] just so that it can actually say OK for the "strict" check too [21:16:41] that we got used to never working. but now it can [21:16:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [5000000.0] [21:18:27] MatmaRex: https://gist.github.com/atdt/15a25c5f493d32b9e53e4add12a9b1d0 [21:19:34] ori: thank you. 
i'll put it on the task [21:21:38] Nginx is running on ~15% of all web servers and it is not included in the whitelist [21:23:09] MatmaRex: the problem with the config var approach is this: presumably the default value would be to check REQUEST_URI, since it appears to be the right thing to do for all web servers except ancient versions of Microsoft IIS [21:23:38] that means currently-secure installations using those versions of IIS will become insecure [21:24:56] IIS 5 was released 16 years ago [21:25:20] ori: i was thinking that we'd default to the current behavior (checking SERVER_SOFTWARE and deciding based on that) unless overriden by the config variable [21:26:40] ori: but, this is extremely not my area :D grab TimStarlin.g when he's out of the RFC meeting or csteip.p or someone [21:26:55] it's a good idea [21:27:05] defaulting to the current behavior, I mean [21:30:05] (03PS1) 10Dereckson: Add maps-cluster referer rules for pl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) [21:30:10] yurik: Hi. Like this for Varnish? ^ [21:30:35] (03PS2) 10Dereckson: Add maps-cluster referer rule for pl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) [21:30:38] oh one rule here [21:31:10] Dereckson, wikimedia is already listed above [21:31:16] you might want to merge them [21:31:36] and Dereckson, pl? means p with an optional l [21:32:25] Let's fix that. [21:32:45] Dereckson, (?i)^https?://(maps|phabricator|wikitech|incubator|pl)\.(m\.)?wikimedia\.org(/|$) [21:32:55] hmmmmm [21:33:15] bblack, ^ [21:33:18] why the optional https? [21:33:36] What about a line for the .m. ones? Incubator + pl? [21:34:15] Krenair, its a referrer, are we positive that every device on the planet will send https? [21:34:19] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2205049 (10matmarex) Some examples courtesy of @ori, for each of the bugs mentioned above. (Something went... [21:34:24] its a voodoo :) [21:34:41] Dereckson, what do you mean? [21:35:05] my RE above adds optional m. to every wikimedia.org, just in case they exist [21:35:20] i wouldn't mind enabling it for all wikimedia.org, without the restriction [21:35:26] && req.http.referer !~ "(?i)^https?://(incubator|pl)\.(m\.)?wikimedia\.org(/|$)" [21:35:51] so we avoid not existing wikitech.m.wikimedia.org or phabricator.m.wikimedia.org [21:36:26] Dereckson, its a filter for referer headers, so it doesn't matter if they don't actually exist - the simpler the better :) [21:36:54] That would also make the file more easy to read, so simpler: [21:36:56] && req.http.referer !~ "(?i)^https?://(maps|phabricator|wikitech)\.wikimedia\.org(/|$)" [21:36:59] && req.http.referer !~ "(?i)^https?://(incubator|pl)\.(m\.)?wikimedia\.org(/|$)" [21:37:03] We have one line for "special" sites [21:37:09] Another for "regular" wikis [21:39:11] (03CR) 10Bartosz Dziewoński: "Thanks Aaron, I'll have that backported first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [21:39:17] !log hoo@tin Started scap: Update Wikibase to master (wmf21) [21:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:40:29] Dereckson, sure [21:40:55] Dereckson, the problem is that i would never remember which wikimedia.org has the m. and which dont [21:41:02] okay so one line [21:41:14] To avoid to create "traps" in config is a bad idea. [21:41:18] (a good idea) [21:41:33] exactly [21:42:10] otherwise we might by accident put some other domain in one line but not in the other, and accidentally not enable .m. for some site [21:42:27] I concur. [21:42:46] (03CR) 10Yurik: [C: 04-1] "pl? is wrong" [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) (owner: 10Dereckson) [21:44:11] (03PS3) 10Dereckson: Add maps-cluster referer rules for pl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) [21:44:35] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2205087 (10Ottomata) > 2.0T wikistats Ja, this is why we don't backup! Too big! stat1001 is in the analytics cluster...so technically we could use HDFS as a holding pen. Migh... [21:47:02] Regexp looks good to me, tested with http://www.regexplanet.com/advanced/ruby/. [21:51:29] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2205129 (10matmarex) And for comparison, similar requests from my local testing wiki, with Apache and a ve... [21:52:07] (03CR) 10Dereckson: "PS3: regex discussed on #wikimedia-operations, and tested through http://www.regexplanet.com/advanced/ruby/." [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) (owner: 10Dereckson) [22:00:01] PROBLEM - Apache HTTP on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:52] PROBLEM - HHVM rendering on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:30] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:03:40] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/282478/ looks like it makes sense [22:03:47] merge ok? [22:03:55] (03CR) 10Yurik: [C: 031] Add maps-cluster referer rules for pl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283332 (https://phabricator.wikimedia.org/T132510) (owner: 10Dereckson) [22:05:29] 06Operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#2205246 (10Dzahn) [22:07:09] mutante: sure [22:07:17] (03CR) 1020after4: [C: 031] Fix viewing raw php files in diffusion [puppet] - 10https://gerrit.wikimedia.org/r/282478 (owner: 10Paladox) [22:07:30] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: Beta-cluster web server fills up /var/log with Apache logs - https://phabricator.wikimedia.org/T75262#2205257 (10Dzahn) Though, disabling the logs means you dont have to deal with data-retention issues. 
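The referer rule the channel settles on keeps a single line with an optional "m." label for every host, so nobody has to remember which wikimedia.org domains have a mobile variant. A minimal sketch that exercises that regex with PHP's preg_match (the live check is a `req.http.referer !~` rule in the Varnish VCL, Gerrit 283332; this only tests the pattern quoted above and shows why the earlier "pl?" was wrong):

<?php
// The combined referer pattern quoted in the discussion, with optional "m." for every host.
$pattern = '%(?i)^https?://(maps|phabricator|wikitech|incubator|pl)\.(m\.)?wikimedia\.org(/|$)%';

$referers = [
	'https://pl.wikimedia.org/wiki/Strona_główna',  // should match
	'https://pl.m.wikimedia.org/',                  // mobile variant, should match
	'https://maps.wikimedia.org/',                  // should match
	'https://p.wikimedia.org/',                     // must NOT match; the earlier "pl?" (p plus optional l) would have let it through
];
foreach ( $referers as $referer ) {
	echo ( preg_match( $pattern, $referer ) ? 'allowed  ' : 'rejected ' ) . $referer . "\n";
}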
[22:07:51] (03PS2) 10Dzahn: Fix viewing raw php files in diffusion [puppet] - 10https://gerrit.wikimedia.org/r/282478 (owner: 10Paladox) [22:07:57] (03CR) 10Dzahn: [C: 032] Fix viewing raw php files in diffusion [puppet] - 10https://gerrit.wikimedia.org/r/282478 (owner: 10Paladox) [22:11:23] !log hoo@tin Finished scap: Update Wikibase to master (wmf21) (duration: 32m 06s) [22:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:15:30] (03CR) 10Dzahn: [C: 031] Enable ChallengeResponseAuthentication [puppet] - 10https://gerrit.wikimedia.org/r/281629 (owner: 10Muehlenhoff) [22:15:36] yurik: what wikis should have wgKartographerWikivoyageMode set at true, wikivoyage + incubator? [22:16:11] Dereckson, wikivoyage + mediawiki.org (since it is used for documentation) [22:16:51] For Incubator, the rationale is because in the future, a new Wikivoyage project in a new language could start on it. [22:22:18] Oh, the extension isn't currently deployed on Incubator. [22:23:21] (03PS1) 10Dereckson: Set wgKartographerWikivoyageMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 [22:23:32] !log hoo@tin Synchronized php-1.27.0-wmf.21/./extensions/Wikidata/extensions/Wikibase/view/resources/jquery/wikibase/jquery.wikibase.statementview.RankSelector.js: touch (duration: 00m 26s) [22:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:25:10] (03PS1) 10Dereckson: Enable Kartographer on pl.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283340 (https://phabricator.wikimedia.org/T132510) [22:26:27] (03CR) 10Yurik: [C: 031] Set wgKartographerWikivoyageMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 (owner: 10Dereckson) [22:27:58] hm :/ [22:28:13] RL doesn't seem to pick up the new messages [22:28:16] Krinkle: ^ [22:28:21] Anything we can do about that [22:28:34] (03PS1) 10Ori.livneh: Force $_SERVER['SERVER_SOFTWARE'] to be "Apache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283341 (https://phabricator.wikimedia.org/T132612) [22:30:09] mutante: we have a request to deploy an extension for a wiki, which is blocked by https://gerrit.wikimedia.org/r/#/c/283332/ (Varnish). Is that mergeable or should I include it for next puppet SWAT? [22:30:40] (03PS2) 10Ori.livneh: Force $_SERVER['SERVER_SOFTWARE'] to be "Apache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283341 (https://phabricator.wikimedia.org/T132612) [22:30:45] hoo: Can you be more specific? Which message where, and what have you done so far? [22:31:47] Krinkle: New message has been added to Wikibase (and a RL module). I scaped it and the localization is present (as can be seen on https://www.wikidata.org/wiki/MediaWiki:Wikibase-statementview-references-counter) [22:32:01] But it's not propagating into RL [22:32:19] After the scap, I touched a file that is in the script array of the module, but not luck [22:34:10] hoo: file touching makes no difference to messages, ever. 
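The "Force $_SERVER['SERVER_SOFTWARE'] to be 'Apache'" patch (Gerrit 283341) ties back to the earlier IIS discussion: rather than changing how MediaWiki decides which path variable to trust, pin the value it inspects. The snippet below is only a guess at the shape of that config change, not a copy of the actual patch:

<?php
// Hypothetical wmf-config sketch (the real change is Gerrit 283341, task T132612).
// MediaWiki's path handling inspects $_SERVER['SERVER_SOFTWARE'] (see the
// SERVER_SOFTWARE/IIS discussion above); pinning it to "Apache" keeps the
// REQUEST_URI code path regardless of what HHVM reports.
$_SERVER['SERVER_SOFTWARE'] = 'Apache';

Per the discussion above, a more general fix would be a config variable that defaults to the current SERVER_SOFTWARE sniffing and can be overridden per installation.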
[22:34:12] just fyi [22:34:51] Ok, I thought it might re-pick up the whole module [22:35:57] https://www.wikidata.org/w/load.php?modules=jquery.wikibase.statementview&debug=false&_1 [22:36:07] "wikibase-statementview-references-counter":"\u003Cwikibase-statementview-references-counter\u003E" [22:36:25] https://www.wikidata.org/w/load.php?modules=jquery.wikibase.statementview&debug=false&_1&lang=nl [22:36:36] "wikibase-statementview-references-counter":"$1{{PLURAL:$2|0=|$3+$2$4}} [22:37:30] https://logstash.wikimedia.org/#/dashboard/elasticsearch/resourceloader [22:37:41] It exists in localisation cache / wfMessage now [22:37:44] but did not at first [22:38:12] "Failed to find wikibase-statementview-references-counter (de)" - /w/load.php?debug=false&lang=de&modules=startup&only=scripts&skin=vector [22:39:14] hoo: Looks like maybe the server started looking for that message too early. In which order did it get deployed? [22:39:45] Krinkle: what scap does, so mw-update-l10n first [22:40:17] Now that it is cached in MessageBlobStore it won't refresh until either 1) localisation cache bumps the touch key for this language, 2) the message is changed on-wiki for the specific language, 3) 1 week cache expires. [22:40:25] (03PS2) 10Dereckson: Enable Kartographer on pl.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283340 (https://phabricator.wikimedia.org/T132510) [22:41:23] hoo: Editing MeidaWiki: wikibase-statementview-references-counter should purge the blob for all languages [22:41:28] and then delete again [22:41:30] Try that? [22:41:42] Sounds reasonable [22:42:15] It seems like somehow scap bumped the cache key before it finished syncing [22:42:22] Which seems like a definitive possibility [22:43:14] scap *builds* the l10n on tin first but actually applies it on the hosts last [22:43:14] hoo: You could also new MessageBlobStore()->updateMessage( String $key ); [22:43:28] (note it is wiki specific) [22:44:14] bd808: Hm.. which script does it use? [22:44:15] bd808: hm... that sounds troublesome [22:44:26] I know that localisationUpdate (nightly) accounts for it [22:44:50] scap doesn't do the blob purge that l10nupdate does [22:45:15] Krinkle: That worked... but I found at least one other message that's missing [22:45:19] hoo: it's life. :) there is a long standing bug about this very case in Phab. [22:45:21] will try purging via MessageBlobStore [22:45:46] bd808: Yeah... I keep running into edge cases all the time :P [22:46:00] (03PS1) 10Dereckson: Add signature edit button for the Comments namespace to ru.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283347 (https://phabricator.wikimedia.org/T132241) [22:46:07] hoo: if your code would just ride the normal train... [22:46:57] an new branch deploy is synced to all hosts before the active versions is updated so it doesn't hit this particular issue [22:47:06] yeah, I would love that [22:47:52] I kind of forgot to branch in time this week... then Jan branched, but he got the wrong HEAD [22:47:58] so I updated our branch just now [22:48:08] doing all that per hand is a pain [22:48:32] hoo: LOoks fixed to me [22:48:48] indeed :) [22:48:57] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2205393 (10BBlack) With the examples, could you be more-specific about what's broken in them? 
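Two workarounds for the stale message blob come up here: a dummy edit of the MediaWiki:wikibase-statementview-references-counter page, which purges the blob for every language, or a per-wiki programmatic purge. A minimal sketch of the latter, assuming it is run through eval.php on the affected wiki; the call is the one Krinkle quotes above, so the exact constructor and signature are unverified:

<?php
// Sketch, e.g. via: mwscript eval.php --wiki=wikidatawiki
// Drop the cached message blob for this key so ResourceLoader rebuilds it from
// the freshly scap'd localisation cache instead of the stale copy in memcached.
$blobStore = new MessageBlobStore();
$blobStore->updateMessage( 'wikibase-statementview-references-counter' );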
[22:48:58] Still missing another message [22:49:23] the general fix for this type of problem will be when scap3 knows how to depool servers while they are getting code changes [22:49:43] then we won't get mixed state from a server that is being updated [22:50:49] found another message :/ [22:51:05] bd808: I don't think that would solve it in this case. [22:51:21] The problem isn't lack of back-compat. If it synced messages first and then the code, it'd be fine. [22:51:25] was it not a completely new message? [22:51:28] And while it does do that (I think?) [22:51:33] bd808: It was [22:51:41] depooling would fix it then [22:51:48] bd808: The problem is that something purged the cache key (which is in memcached) before all servers got the code [22:52:32] (03PS1) 10Dereckson: Consider all pages as valid content articles on pl.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283349 (https://phabricator.wikimedia.org/T131771) [22:52:35] So it was recombobulated and repopulated from another server too early [22:52:51] Hm.. yeah, I'm not sure actually [22:53:10] I guess if old server stayed pooled and handling existing requests with old code then it wouldn't look for that key in the first plae. [22:53:11] place* [22:53:34] bd808: Wait, did you say it syncs the new l10n files last? [22:53:43] yes [22:53:46] why? [22:53:50] well it rebuilds the cdb files last [22:53:53] I remember it was historically first [22:53:56] nope [22:54:28] mw-do-l10n, mw-sync-l01n then sync-common-all [22:54:31] scap bash [22:54:38] before we started using json for the shipping encoding (December 2014) it would have been inode order random [22:55:20] right [22:56:02] if it did things in that order it was before I ever saw the scap source (i.e. before December 2013) [22:56:23] bd808: Can you point to where it builds the l10n files and where it syncs them? Perhaps there's a relatively easy way to (re)trigger the cache touchkey aftereward [22:56:38] I only looked at scap source once, around 2012. [22:57:02] (before the rewrite that is, I've looked at the new one since then) [22:57:34] sure. -- https://phabricator.wikimedia.org/diffusion/MSCA/browse/master/scap/main.py;a4fa153570e0c6f3494eadd6de9d9435754b6c18$328 [22:58:41] lines 278-288 give an outline of the step order (ignore 289: that doesn't normally happen) [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160413T2300). [23:00:04] MatmaRex matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:27] hi [23:00:28] Hi. I've some patches to add in last minute to this windows. [23:00:52] (you can do everyone else before me, i'm in a short meeting) [23:01:33] Present [23:03:44] Okay, I can SWAT. [23:04:12] matt_flaschen: do you have a specific order between the Echo and config? [23:05:13] Dereckson, Echo first. [23:05:18] k [23:07:40] Tests are running. Meanwhile, I'll take one config patch. 
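The ru.wikinews patch queued for this SWAT (Gerrit 283347) enables the toolbar signature button in one extra namespace. The snippet below only illustrates how such a per-wiki override usually looks in InitialiseSettings.php; the setting layout and the namespace id are assumptions, not the content of the patch:

<?php
// Hypothetical InitialiseSettings.php fragment (the deployed change is Gerrit 283347).
// $wgExtraSignatureNamespaces adds namespaces, beyond talk pages, where the
// edit toolbar offers the signature button.
'wgExtraSignatureNamespaces' => [
	'default' => [],
	'ruwikinews' => [ 102 ],  // placeholder id for the local Comments namespace; the real id is wiki-specific
],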
[23:08:09] MatmaRex: okay, ping when you're back [23:09:20] (03CR) 10Dereckson: [C: 032] Add signature edit button for the Comments namespace to ru.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283347 (https://phabricator.wikimedia.org/T132241) (owner: 10Dereckson) [23:09:54] (03Merged) 10jenkins-bot: Add signature edit button for the Comments namespace to ru.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283347 (https://phabricator.wikimedia.org/T132241) (owner: 10Dereckson) [23:10:27] Dereckson: (i'm around now) [23:11:10] k [23:11:45] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add signature edit button for the Comments namespace to ru.wikinews (Task T132241, [[Gerrit:283347]]) (duration: 00m 38s) [23:11:46] T132241: Enabling signature button in toolbar for the Comments namespace in ruwikinews - https://phabricator.wikimedia.org/T132241 [23:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:59] Doesn't work, but could be cache related. Will retest later. [23:13:24] Zuul tests for Echo still running, MatmaRex, you're next. [23:15:30] MaxSem Krenair or greg-g: https://gerrit.wikimedia.org/r/#/c/283247/2 looks SWATtable ? [23:16:28] 283347 works. [23:18:20] (03PS2) 10Dereckson: Consider all pages as valid content articles on pl.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283349 (https://phabricator.wikimedia.org/T131771) [23:19:37] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283349 (https://phabricator.wikimedia.org/T131771) (owner: 10Dereckson) [23:21:03] (03Merged) 10jenkins-bot: Consider all pages as valid content articles on pl.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283349 (https://phabricator.wikimedia.org/T131771) (owner: 10Dereckson) [23:23:05] matt_flaschen: okay, Echo changes merged. We can deploy them. [23:23:08] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Consider all pages as valid content articles on pl.wikisource (Task T131771, [[Gerrit:283349]]) (duration: 00m 35s) [23:23:09] T131771: Set $wgArticleCountMethod to 'any' for plwikisource - https://phabricator.wikimedia.org/T131771 [23:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:31] !log Ran mwscript updateArticleCount.php --wiki=plwikisource --update [23:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:23] Sounds good. Dereckson, let me know when Echo is done, and I can test quickly before the mediawiki-config. [23:26:32] k [23:27:14] 06Operations, 10Traffic, 06Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2205468 (10BBlack) The Zero picture is clearer now from some email threads with @DFoy and @dr0ptp4kt . We're clear for this change on the Zero front already,... [23:35:37] Dereckson, hmm [23:35:39] Dereckson, I would as long as Aaron or Bartosz are around [23:35:54] i'm here. [23:35:55] !log dereckson@tin Synchronized php-1.27.0-wmf.20/extensions/Echo/modules/ooui/mw.echo.ui.FooterNoticeWidget.js: Fixes for Echo ([[Gerrit:282714]] + [[Gerrit:282715]]) (duration: 00m 26s) [23:36:02] matt_flaschen: you can test ^ [23:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:14] i suppose there might be some performance stats we could watch for catastrophic regressions? 
not sure where to find them though [23:36:18] Krenair: ^ [23:36:33] i'm pretty sure we track page save duration across time [23:37:15] not sure. ori? [23:37:17] matt_flaschen: the .less file is no-op? [23:37:21] yeah, somewhere [23:37:23] was it gdash? [23:37:28] Dereckson, no, it's required. [23:37:32] k [23:37:33] no... [23:37:51] grafana? [23:38:01] not graphite [23:38:23] MatmaRex, https://grafana.wikimedia.org/dashboard/db/save-timing [23:38:27] logmsgbot? [23:38:28] !log dereckson@tin Synchronized php-1.27.0-wmf.20/extensions/Echo/modules/ooui/styles/mw.echo.ui.FooterNoticeWidget.less: Fix for Echo footer ([[Gerrit:282715]]) (duration: 00m 27s) [23:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:44] matt_flaschen: here you are ^ [23:39:28] (03PS3) 10Dereckson: Enable Echo survey on French-language wikis (retry) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:39:58] Krenair: oh, neat. i guess it can't be filtered per-wiki? [23:40:09] not there... [23:40:39] (03PS1) 10Dzahn: dhcp: let ulsfo public subnet use carbon as TFTP [puppet] - 10https://gerrit.wikimedia.org/r/283359 (https://phabricator.wikimedia.org/T123674) [23:43:19] Dereckson, yeah, the config patch is good to go. [23:43:39] k [23:43:49] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:44:00] (03PS2) 10Dzahn: dhcp: let ulsfo public subnet use carbon as TFTP [puppet] - 10https://gerrit.wikimedia.org/r/283359 (https://phabricator.wikimedia.org/T123674) [23:44:14] (03Merged) 10jenkins-bot: Enable Echo survey on French-language wikis (retry) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283330 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:46:13] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Echo survey on French-language wikis (Task T131893, [[Gerrit:283330]]) (duration: 00m 26s) [23:46:14] T131893: Invite French users to take the Notification Survey (using the Notifications panel) - https://phabricator.wikimedia.org/T131893 [23:46:16] matt_flaschen: and here you're with the config patch ^ [23:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:32] Thanks, Dereckson, testing now. [23:47:03] (03CR) 10Dzahn: [C: 032] dhcp: let ulsfo public subnet use carbon as TFTP [puppet] - 10https://gerrit.wikimedia.org/r/283359 (https://phabricator.wikimedia.org/T123674) (owner: 10Dzahn) [23:47:45] 06Operations, 10Wikimedia-Apache-configuration, 07Varnish: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#2205523 (10matmarex) For T123276: * P2895 https://commons.wikimedia.org/w/api.php?oxrigin=https%3A%2F%2Fw... [23:49:36] MatmaRex: okay, Zuul is running tests for 283333 [23:50:18] Dereckson, works as expected. Thanks. [23:50:25] (03PS1) 10Dzahn: network: remove bast4001 SLAAC IPs [puppet] - 10https://gerrit.wikimedia.org/r/283361 (https://phabricator.wikimedia.org/T123674) [23:50:31] You're welcome matt_flaschen. Thanks for testing. [23:50:57] Pending running tests and merge, let's resume our config patches. [23:51:43] MatmaRex's blocked by AbuseFilter update, so let's merge the hi.wikt one. 
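The pl.wikisource change and the updateArticleCount.php run belong together: switching $wgArticleCountMethod only changes how pages are counted going forward, so existing pages need one recount. A sketch of the pair, with the InitialiseSettings layout assumed rather than copied from Gerrit 283349:

<?php
// Hypothetical InitialiseSettings.php fragment (the deployed change is Gerrit 283349).
// 'any' counts every non-redirect page in a content namespace as an article;
// the default 'link' method requires the page to contain at least one wikilink.
'wgArticleCountMethod' => [
	'default' => 'link',
	'plwikisource' => 'any',
],

After syncing, existing pages are recounted once with the command logged above: mwscript updateArticleCount.php --wiki=plwikisource --update.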
[23:51:50] (03CR) 10Dereckson: [C: 032] Import sources on hi.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417) (owner: 10Dereckson) [23:53:26] Needs rebase [23:54:21] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:22] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:26] (03PS2) 10Dereckson: Import sources on hi.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417) [23:54:31] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:50] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:52] (03CR) 10Dereckson: [C: 032] Import sources on hi.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417) (owner: 10Dereckson) [23:55:10] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [23:55:10] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:10] PROBLEM - Check size of conntrack table on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:12] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:22] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:31] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:40] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:40] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:55:51] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:00] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:00] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:01] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:56:11] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [23:56:12] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [23:56:12] !log ssh alsafi [23:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:27] Dereckson, I have another patch to deploy, can do it myself [23:56:31] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [23:56:51] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:56:51] RECOVERY - Check size of conntrack table on alsafi is OK: OK: nf_conntrack is 0 % full [23:56:51] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [23:57:01] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [23:57:10] RECOVERY - DPKG on alsafi is OK: All packages OK [23:57:11] RECOVERY - Disk space on alsafi is OK: DISK OK [23:57:20] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy [23:57:20] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [23:57:40] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [23:57:41] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [23:57:41] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [23:57:41] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [23:57:41] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [23:57:42] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [23:57:42] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [23:57:42] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [23:57:42] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [23:57:46] MaxSem: k, we've still two patches for MatmaRex (pending zuul) before [23:58:00] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [23:59:13] whoo it merged at last. [23:59:16] (03PS3) 10Dereckson: Import sources on hi.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417)
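The last patch still pending here, "Import sources on hi.wiktionary" (Gerrit 282843), defines which interwiki prefixes Special:Import accepts as transwiki sources on that wiki. The prefixes below are placeholders; only the shape of the override is the point:

<?php
// Hypothetical InitialiseSettings.php fragment (the real list is in Gerrit 282843, task T132417).
// $wgImportSources lists the interwiki prefixes that Special:Import offers
// as transwiki import sources on this wiki.
'wgImportSources' => [
	'default' => [],
	'hiwiktionary' => [ 'en', 'hi', 'sa' ],  // placeholder prefixes, not the deployed list
],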