[00:00:17] <icinga-wm>	 PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:00:17] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[00:00:17] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[00:00:30] <wikibugs>	 Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182459 (BBlack) >>! In T121135#1910435, @Atsirlin wrote: > @Legoktm: Frankly speaking, for a small project like Wikivo...
[00:01:37] <wikibugs>	 Operations, Release-Engineering-Team: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2182462 (madhuvishy)
[00:01:52] <wikibugs>	 Operations, Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#2182475 (madhuvishy)
[00:01:54] <wikibugs>	 Operations, Release-Engineering-Team: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2182474 (madhuvishy)
[00:02:46] <wikibugs>	 Operations, Release-Engineering-Team: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2182462 (madhuvishy)
[00:05:18] <icinga-wm>	 PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:05:18] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[00:05:18] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[00:09:52] <grrrit-wm>	 (PS2) BryanDavis: Moving elasticsearch::https instatiation to elasticsearch role [puppet] - https://gerrit.wikimedia.org/r/281824 (https://phabricator.wikimedia.org/T131906) (owner: Gehel)
[00:10:17] <icinga-wm>	 PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:10:17] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[00:10:18] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[00:12:10] <MaxSem>	 is anybody looking into payments ^^^?
[00:14:08] <grrrit-wm>	 (PS2) RobH: stat1004 has 4 disks [puppet] - https://gerrit.wikimedia.org/r/281858
[00:14:22] <awight>	 MaxSem: I would absolutely love some help with that.
[00:14:25] <robh>	 codfw payments is not primary
[00:14:38] <robh>	 so it didn't generate pages
[00:14:41] <awight>	 payments-Redis needs to be kicked, though I'd like to know why.
[00:14:55] <awight>	 ah, this was codfw?  looking
[00:15:06] <robh>	 that's what the alerts are for, payments 2XXX
[00:15:10] <robh>	 which is codfw
[00:15:17] <icinga-wm>	 RECOVERY - check_puppetrun on payments2002 is OK: OK: Puppet is currently enabled, last run 91 seconds ago with 0 failures
[00:15:31] <robh>	 heh, of course now that we discuss, it clears?
[00:15:40] <grrrit-wm>	 (CR) Mattflaschen: [C: 1] "Same as schema from I5c1f648cc63ed317508febaece955ec68f640ba3 , which I just +2'ed. If that merges (no Jenkins failures), I'll deploy thi" [mediawiki-config] - https://gerrit.wikimedia.org/r/281640 (https://phabricator.wikimedia.org/T112799) (owner: Matthias Mullie)
[00:15:44] <robh>	 still 2003 issue
[00:15:47] <robh>	 (same one)
[00:15:50] <awight>	 I believe it has online replicas of donor data, so sort of matters
[00:16:12] <grrrit-wm>	 (CR) RobH: [C: 2] stat1004 has 4 disks [puppet] - https://gerrit.wikimedia.org/r/281858 (owner: RobH)
[00:20:40] <Krenair>	 Luke081515, it's still stuck on 3 remaining hosts...
[00:21:16] <Krenair>	 mw2043, mw2177 and mw1184
[00:21:27] <Krenair>	 I think
[00:22:41] <Luke081515>	 seems like I get my data from one of them...
[00:24:47] <Krenair>	 Luke081515, I think there might be something extra it has to do after this sync
[00:25:13] <Luke081515>	 hm, ok
[00:32:09] <icinga-wm>	 PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[00:33:58] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[00:34:08] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[00:34:33] <YuviPanda>	 robh: I merged your change
[00:35:57] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[00:36:00] <MaxSem>	 mhm mwdeploy 29365  0.0  0.0 105764  2076 ?        S<   Apr05   0:00 sshd: mwdeploy@notty
[00:36:07] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[00:36:34] <Luke081515>	 Krenair: So my note: A backport is better, but overwriting the data by creating local sysmessages is faster ;)
[00:45:23] <Krenair>	 bd808, hey
[00:45:27] <Krenair>	 so I was stuck in scap
[00:45:41] <Krenair>	 I found that pressing enter made it move to the next host...
[00:46:01] <Krenair>	 it sat at sync-common:  99% (ok: 424; fail: 0; left: 3) for ages
[00:46:05] <Krenair>	 then I pressed enter
[00:46:15] <Krenair>	 suddenly, sync-common:  99% (ok: 425; fail: 0; left: 2)
[00:46:17] <Krenair>	 and so on until 0
[00:48:43] <bd808>	 hmm
[00:48:52] <bd808>	 you found hidden magic?
[00:49:39] <bd808>	 the trick I've used for stuck hosts before is to open another ssh session to tin and kill the outbound ssh connections to those hosts
[00:49:50] <Luke081515>	 Krenair: My browser shows the right messages now
[00:50:04] <bd808>	 and then you can ssh directly to the hosts and run sync-common
[00:50:51] <ori>	 pressing enter almost certainly had nothing to do with it
[00:51:11] <ori>	 https://en.wikipedia.org/wiki/Placebo_button
[00:51:17] <YuviPanda>	 it could be something stuck waiting for stdin somewhere
[00:51:26] <YuviPanda>	 has happened to me before in other places
[00:51:42] <Krenair>	 that was my assumption
[00:52:00] <bd808>	 hmm.. like unknown host keys?
[00:52:14] <bd808>	 were there servers reimaged recently?
[00:52:28] <icinga-wm>	 PROBLEM - Apache HTTP on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time
[00:52:38] <ori>	 subprocess.Popen's stdin argument defaults to None, so spawned processes do not inherit the parent process's stdin by default.
[00:52:52] <Krenair>	 bd808, mw1184 has been up 20 days
[00:52:59] <Krenair>	 so not that recently, people have run scap in that time
[00:53:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time
[00:53:21] <bd808>	 and as ori proposes the ssh connections aren't attached to your terminal (or shouldn't be)
[00:53:39] <Krenair>	 ori, ^ I've seen that (icinga alert for mw1119) happen a couple of times now for other hosts
[00:53:51] <Krenair>	 do you want to look into it or shall I just restart hhvm?
[00:55:36] <MaxSem>	 !log restarted hhvm on mw1119, stuck
[00:55:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:56:05] <ori>	 actually, I'm wrong
[00:56:08] <Krenair>	 scap-rebuild-cdbs:  99% (ok: 437; fail: 0; left: 1)
[00:56:09] <icinga-wm>	 RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 5.830 second response time
[00:56:18] <ori>	 " With the default settings of None, no redirection will occur; the child’s file handles will be inherited from the parent."
[00:56:30] <ori>	 (from https://docs.python.org/2/library/subprocess.html)
[00:56:39] <ori>	 so it's possible pressing enter really did help, in fact
[00:56:58] <icinga-wm>	 RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 68780 bytes in 0.772 second response time
[00:57:04] <ori>	 it should not; scap's subprocess calls should pass stdin=subprocess.PIPE
[00:57:48] <icinga-wm>	 RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:57:51] <Krenair>	 wouldn't we then just get completely stuck if it prompts for input?
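Editor's sketch of the stdin behavior discussed above. This is not scap's code; the child command and the helper names (CHILD, run_child, run_inherited, run_piped, run_devnull) are hypothetical and only illustrate how the stdin argument of subprocess.Popen changes what a child sees when it tries to read input. subprocess.DEVNULL is available from Python 3.3 onward, so it would not apply to the Python 2 documentation quoted in the log.

```python
import subprocess
import sys

# Hypothetical child: tries to read one line from stdin and prints what it got.
CHILD = [sys.executable, "-c", "import sys; print(repr(sys.stdin.readline()))"]


def run_child(stdin):
    """Run CHILD with the given stdin setting and return its stdout as text."""
    proc = subprocess.Popen(CHILD, stdin=stdin, stdout=subprocess.PIPE)
    # With stdin=PIPE and no input, communicate() closes the pipe for us.
    out, _ = proc.communicate()
    return out.decode().strip()


def run_inherited():
    # stdin=None (the default): the child shares the parent's stdin, so on a
    # terminal it blocks until someone presses enter.
    return run_child(None)


def run_piped():
    # stdin=PIPE: the child gets its own pipe. Because communicate() closes
    # it, the read returns EOF instead of waiting on the terminal.
    return run_child(subprocess.PIPE)


def run_devnull():
    # stdin=DEVNULL (Python 3.3+): the child reads EOF immediately.
    return run_child(subprocess.DEVNULL)
```

On Krenair's question: a child given a pipe that the parent never writes to or closes would indeed block forever on a prompt. Closing the pipe, or using DEVNULL where available, makes the read return EOF so the prompt fails fast instead of hanging.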
[00:58:55] <wikibugs>	 Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182577 (Jdlrobson) >>! In T121135#2182349, @Wrh2 wrote: > Cache is cleared fairly regularly even if articles aren't ed...
[00:59:27] <ori>	 Krenair: could you e-mail ops@ about it or file a task? I could depool it, but I am not able to take the time to debug it further, and I worry that it would just remain depooled until someone comes across it.
[00:59:44] <Krenair>	 mw1119? MaxSem already restarted hhvm there
[01:00:10] <Krenair>	 it was the first thing in the chat after I asked you about it
[01:00:41] <bd808>	 Krenair: is it still stuck?
[01:00:44] <ori>	 not great to just restart hhvm
[01:00:48] <Krenair>	 probably not
[01:00:55] <Krenair>	 icinga said it recovered
[01:00:58] <ori>	 better to capture a trace or just depool it and leave it
[01:01:05] <Krenair>	 I agree
[01:01:16] <MaxSem>	 ori, neither of which us mortals can do...
[01:01:35] <bd808>	 Krenair: the process that is still running is connected to mw1119.eqiad.wmnet
[01:01:51] <Krenair>	 I'll kill that process then
[01:01:51] <bd808>	 "/usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -F/dev/null -oUser=mwdeploy mw1119.eqiad.wmnet sudo -u mwdeploy -n -- /usr/bin/scap-rebuild-cdbs"
[01:02:03] <ori>	 MaxSem: pybal would have depooled it for failing health checks
[01:02:23] <ori>	 so you are not actually helping anything by restarting HHVM; it is not continuing to receive requests
[01:02:31] <ori>	 it is just destroying program state
[01:02:31] <logmsgbot>	 !log krenair@tin Finished scap: https://gerrit.wikimedia.org/r/#/c/281846/ - add messages for the new extendedconfirmed protection (duration: 95m 03s)
[01:02:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:02:58] <MaxSem>	 ori, it resumes receiving requests seconds after a restart, as evidenced by load
[01:03:15] <Krenair>	 pybal doesn't automatically repool?
[01:03:18] <ori>	 yes -- so?
[01:03:23] <ori>	 yes, it does
[01:03:48] <ori>	 (if a server is depooled for failing health checks -- it does not repool a manually depooled server)
[01:04:06] <Krenair>	 Luke081515, ^
[01:04:08] <MaxSem>	 so where does that load on it come from? :P
[01:04:28] <ori>	 from pybal re-pooling it, since it passed health checks after you restarted it
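Editor's sketch of the pooling rule ori describes: a server depooled for failing health checks is repooled once it passes again, while a manually depooled server stays out. This is not pybal's implementation; the Server class and apply_health_check function are hypothetical names used only to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical pool member; not a pybal data structure."""
    name: str
    pooled: bool = True
    manually_depooled: bool = False


def apply_health_check(server: Server, healthy: bool) -> Server:
    """Update pool membership from one health-check result."""
    if server.manually_depooled:
        # A manual depool overrides health checks: never repool automatically.
        server.pooled = False
    else:
        # Otherwise membership simply follows the latest health check.
        server.pooled = healthy
    return server
```

Under this rule, restarting HHVM on a host that was failing checks makes it pass again, which is why it is repooled and starts taking load within seconds.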
[01:04:39] <Luke081515>	 ok
[01:05:10] <MaxSem>	 err <ori> so you are not actually helping anything by restarting HHVM; it is not continuing to receive requests
[01:05:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.71 seconds
[01:05:33] <ori>	 we have plenty of spare capacity on the app server cluster, by design
[01:06:16] <Krenair>	 is it worth me filing a task this time? or waiting for the next?
[01:06:27] <ori>	 the fact that the machine had been depooled is not in itself a problem
[01:06:32] <ori>	 Krenair: probably not worth it, no
[01:06:34] <YuviPanda>	 MaxSem: parsing error I think - I took ori's sentence to mean 'if it is down, it is not actually receiving requests, so there is no actual issue'
[01:06:35] <Krenair>	 k
[01:06:43] <YuviPanda>	 *user facing issue
[01:07:02] <Krenair>	 yes this is probably a misunderstanding
[01:07:02] <ori>	 YuviPanda: right -- I see how that could have been confusing
[01:08:20] <ori>	 (not worth it because it is unlikely that anyone would take the time to investigate it, given that there exists a backlog of similar issues with more debug data. not because it wouldn't be useful to know.)
[01:08:38] <ori>	 eep, it's late here. o/
[01:09:16] <YuviPanda>	 it is late everywhere!
[01:09:23] * YuviPanda disappears too
[01:18:54] <wikibugs>	 Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182584 (Wrh2) If the template was at fault the behavior should be consistent - currently if a page is edited or flushe...
[01:25:58] <Krenair>	 Luke081515, bah... after that very long scap, someone has already overridden one of the messages locally
[01:26:13] <Luke081515>	 lol
[01:26:16] <Luke081515>	 hm, ok
[01:26:31] <Luke081515>	 Kreniar: But thanks for backporting
[01:27:52] <Luke081515>	 *Krenair
[01:27:59] * Luke081515 hates typos in nicknames
[01:28:18] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[01:29:05] <wikibugs>	 Operations, Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2182588 (BBlack)
[01:29:17] <wikibugs>	 Operations, Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2182601 (BBlack) p:Triage>Normal
[01:45:17] <wikibugs>	 Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182613 (Jdlrobson) >>! In T121135#2182584, @Wrh2 wrote: > If the template was at fault the behavior should be consiste...
[01:49:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.26 seconds
[01:50:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on db1031 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.90 seconds
[01:50:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on db2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.48 seconds
[01:51:27] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on db2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 397.01 seconds
[01:51:32] <wikibugs>	 Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182614 (Jdlrobson) My current theory is that under some circumstances the banner is generated before the table of cont...
[01:54:02] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on db1031 is OK: OK slave_sql_lag Replication lag: 0.42 seconds
[01:54:02] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on db2008 is OK: OK slave_sql_lag Replication lag: 0.03 seconds
[01:54:59] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on db2009 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
[02:10:08] <icinga-wm>	 PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=550 [critical =500]
[02:15:08] <icinga-wm>	 PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=552 [critical =500]
[02:23:08] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[02:25:08] <icinga-wm>	 RECOVERY - check_missing_thank_yous on db1025 is OK: OK missing_thank_yous=0
[02:31:31] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.19) (duration: 11m 44s)
[02:31:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:56:51] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 09m 43s)
[02:56:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:06:19] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Apr  6 03:06:18 UTC 2016 (duration 9m 27s)
[03:06:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:08:19] <wikibugs>	 Operations: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2182660 (RobH)
[03:19:58] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[04:06:18] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[04:15:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:15:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:15:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:20:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:20:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:20:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:25:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:25:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:25:09] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:30:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:30:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:30:17] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:35:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:35:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:35:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:40:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:40:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:40:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:45:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:45:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:45:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:50:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:50:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:50:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:55:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:55:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[04:55:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:00:07] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:00:07] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:00:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:05:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:05:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:05:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:10:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:10:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:10:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:15:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:15:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:15:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:20:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:20:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:20:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:25:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:25:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:25:09] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:30:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:30:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:30:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:35:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:35:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:35:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:40:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:40:09] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:40:09] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:42:08] <_joe_>	 uh what's this?
[05:42:35] <_joe_>	 uhm codfw 
[05:45:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:45:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:45:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:50:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:50:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:50:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:55:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:55:08] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[05:55:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[06:00:08] <icinga-wm>	 PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[06:00:08] <icinga-wm>	 PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[06:00:17] <icinga-wm>	 PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[06:01:56] <_joe_>	 I paged jeff, I'm disabling notifications for those services
[06:02:09] <icinga-wm>	 PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: puppet fail
[06:07:38] <_joe_>	 !log restarting HHVM on mw1134, deadlock in what appears to be HPHP::Treadmill::getAgeOldestRequest
[06:07:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:09:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.147 second response time
[06:09:59] <icinga-wm>	 RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 68824 bytes in 0.839 second response time
[06:10:34] <p858snake|L_>	 awight and robh looked at it when it happened earlier, but it fixed itself
[06:11:09] <awight>	 I missed the diagnosis of the last outage, unfortunately
[06:12:48] <awight>	 It's only medium-priority, this is a failover service and replica.
[06:15:11] <_joe_>	 awight: I disabled notifications for those redis services
[06:15:33] <awight>	 _joe_: Perfectly good workaround for now, thanks for doing so!
[06:16:06] <_joe_>	 wasn't really an effort :P
[06:20:37] <awight>	 Jeff_Green is spot on about provisioning either Kafka or Redis, but not both for our payments queue overhaul...  For the saved trouble of keeping a service up, I'm happy to slightly abuse mysql and use it to simulate redis storage types.
[06:20:45] <awight>	 _joe_: Thanks for the idea to look at Kafka!
[06:23:34] <_joe_>	 yw :)
[06:27:39] <icinga-wm>	 PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: puppet fail
[06:29:57] <grrrit-wm>	 (PS1) Gehel: WIP [puppet] - https://gerrit.wikimedia.org/r/281881
[06:30:19] <grrrit-wm>	 (PS1) Gehel: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/281882
[06:30:27] <icinga-wm>	 PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: puppet fail
[06:31:03] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/281882 (owner: Gehel)
[06:31:08] <icinga-wm>	 RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:31:29] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:30] <icinga-wm>	 PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] <icinga-wm>	 PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:38] <icinga-wm>	 PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:58] <icinga-wm>	 PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:58] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:53:08] <icinga-wm>	 RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:38] <icinga-wm>	 PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:39] <icinga-wm>	 RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:28] <icinga-wm>	 RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:08] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:58:29] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:58:48] <icinga-wm>	 RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:49] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:01:27] <icinga-wm>	 RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[07:16:07] <grrrit-wm>	 (CR) Muehlenhoff: [C: 2 V: 2] Ignore packages in deinstalled status [debs/debdeploy] - https://gerrit.wikimedia.org/r/281669 (owner: Muehlenhoff)
[07:20:38] <icinga-wm>	 RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:30:55] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - https://gerrit.wikimedia.org/r/281885
[08:22:21] <grrrit-wm>	 (PS3) Gehel: Moving elasticsearch::https instatiation to elasticsearch role [puppet] - https://gerrit.wikimedia.org/r/281824 (https://phabricator.wikimedia.org/T131906)
[08:23:40] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [C: 2] "DTRT: https://puppet-compiler.wmflabs.org/2314/mw1220.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/281720 (owner: Giuseppe Lavagetto)
[08:23:53] <grrrit-wm>	 (PS2) Giuseppe Lavagetto: hhvm: watch extension packages from the service [puppet] - https://gerrit.wikimedia.org/r/281720
[08:24:51] <_joe_>	 come on jenkinsss
[08:26:02] <grrrit-wm>	 (CR) Gehel: [C: 2] Moving elasticsearch::https instatiation to elasticsearch role [puppet] - https://gerrit.wikimedia.org/r/281824 (https://phabricator.wikimedia.org/T131906) (owner: Gehel)
[08:27:52] <grrrit-wm>	 (PS4) Gehel: Moving elasticsearch::https instatiation to elasticsearch role [puppet] - https://gerrit.wikimedia.org/r/281824 (https://phabricator.wikimedia.org/T131906)
[08:28:12] <_joe_>	 !log disabling puppet on the mw servers to test hhvm changes
[08:28:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:30:25] <grrrit-wm>	 (CR) Hashar: "Well done \O/" [puppet] - https://gerrit.wikimedia.org/r/281706 (owner: Madhuvishy)
[08:33:50] <grrrit-wm>	 (PS2) Giuseppe Lavagetto: hhvm: parametrize directories [puppet] - https://gerrit.wikimedia.org/r/281721
[08:45:23] <grrrit-wm>	 (PS3) Giuseppe Lavagetto: hhvm: parametrize directories [puppet] - https://gerrit.wikimedia.org/r/281721
[08:47:14] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [C: 2] hhvm: parametrize directories [puppet] - https://gerrit.wikimedia.org/r/281721 (owner: Giuseppe Lavagetto)
[08:50:19] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: hhvm: fixup for I5e9403c2 [puppet] - https://gerrit.wikimedia.org/r/281892
[08:50:55] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [C: 2 V: 2] hhvm: fixup for I5e9403c2 [puppet] - https://gerrit.wikimedia.org/r/281892 (owner: Giuseppe Lavagetto)
[08:51:48] <icinga-wm>	 PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: puppet fail
[08:51:59] <icinga-wm>	 PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: puppet fail
[08:52:39] <icinga-wm>	 PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: puppet fail
[08:52:39] <icinga-wm>	 PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: puppet fail
[08:53:48] <icinga-wm>	 PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: puppet fail
[08:53:48] <icinga-wm>	 RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[08:55:27] <_joe_>	 these were all mine ^^
[08:59:08] <icinga-wm>	 RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:01:14] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: hhvm: s/fcgi.ini/server.ini/ [puppet] - 10https://gerrit.wikimedia.org/r/281722 
[09:02:08] <icinga-wm>	 PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail
[09:04:52] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/281722 (owner: 10Giuseppe Lavagetto)
[09:07:12] <wikibugs>	 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2182880 (10Gehel) @RobH I'd really appreciate if you could let me do the reclaim / reinstall so that I learn something in the process (thi...
[09:10:08] <icinga-wm>	 PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 643
[09:17:25] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - 10https://gerrit.wikimedia.org/r/281885 
[09:17:46] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Upgrade to 3.19.8-ckt18 [debs/linux] - 10https://gerrit.wikimedia.org/r/281899 
[09:17:59] <grrrit-wm>	 (03PS1) 10Alex Monk: Send www.wikipedia.org/apple-app-site-association to the right file without an external redirect [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) 
[09:19:59] <icinga-wm>	 RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[09:20:08] <icinga-wm>	 RECOVERY - check_mysql on lutetium is OK: Uptime: 1710230 Threads: 1 Questions: 15361401 Slow queries: 10263 Opens: 106158 Flush tables: 2 Open tables: 64 Queries per second avg: 8.982 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:21:09] <icinga-wm>	 RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[09:21:57] <icinga-wm>	 RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:22:02] <grrrit-wm>	 (03CR) 10Muehlenhoff: hhvm: add systemd/jessie support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/281885 (owner: 10Giuseppe Lavagetto)
[09:23:08] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Upgrade to 3.19.8-ckt18 [debs/linux] - 10https://gerrit.wikimedia.org/r/281899 (owner: 10Muehlenhoff)
[09:23:48] <icinga-wm>	 RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[09:26:48] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: hhvm: add systemd/jessie support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/281885 (owner: 10Giuseppe Lavagetto)
[09:29:37] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0]
[09:32:13] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 04-1] "have been trying to test this on beta but no luck so far" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[09:35:37] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:37:07] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:37:18] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:37:18] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:38:49] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[09:38:58] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[09:38:58] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[09:39:39] <icinga-wm>	 PROBLEM - HHVM rendering on mw2086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:40:39] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:40:49] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[09:41:19] <icinga-wm>	 RECOVERY - HHVM rendering on mw2086 is OK: HTTP OK: HTTP/1.1 200 OK - 68763 bytes in 0.285 second response time
[09:43:07] <elukey>	 small outage?
[09:45:11] <_joe_>	 elukey: wat?
[09:45:12] <grrrit-wm>	 (03PS2) 10Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) 
[09:45:20] <_joe_>	 elukey: why you say that?
[09:46:52] <elukey>	 _joe_ I was asking as "is this normal or is this an outage?" after reading citoid endpoints health on scb2002 is CRITICAL
[09:48:44] <_joe_>	 elukey: nope it's typically an upstream problem for citoid
[09:48:59] <_joe_>	 the test urls include a dependency on an external system
[09:49:38] <elukey>	 ahh okok will take a look at it, thanks :)
[10:09:17] <grrrit-wm>	 (03PS3) 10Mschon: update the DNS record for benefactors.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) 
[10:19:37] <wikibugs>	 6Operations, 10Analytics: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183066 (10elukey) So I checked the replication factor on the aqs nodes and this is the result:  ``` cassandra@cqlsh> SELECT * FROM system.schema_keyspaces;   keyspace_name                                    | dura...
[10:19:49] <wikibugs>	 6Operations, 10Analytics: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183067 (10elukey) p:5Triage>3Normal
[10:27:36] <grrrit-wm>	 (03PS1) 10ArielGlenn: small fixes for dumps cron job script [puppet] - 10https://gerrit.wikimedia.org/r/281913 
[10:28:16] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: otrs: Remove HTTPS ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/281914 
[10:28:22] <_joe_>	 T73486 ?
[10:28:22] <stashbot>	 T73486: HHVM: segfault when serializing/unserializing large preprocessor cache items - https://phabricator.wikimedia.org/T73486
[10:28:35] <_joe_>	 oh, yes
[10:29:08] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] small fixes for dumps cron job script [puppet] - 10https://gerrit.wikimedia.org/r/281913 (owner: 10ArielGlenn)
[10:31:19] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: otrs: Remove HTTPS ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/281914 
[10:31:26] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Remove HTTPS ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/281914 (owner: 10Alexandros Kosiaris)
[10:32:25] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Still fails. https://puppet-compiler.wmflabs.org/2319/mendelevium.eqiad.wmnet/change.mendelevium.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[10:37:37] <icinga-wm>	 RECOVERY - DPKG on etherpad1001 is OK: All packages OK
[10:38:38] <grrrit-wm>	 (03PS3) 10Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - 10https://gerrit.wikimedia.org/r/281885 
[10:38:45] <_joe_>	 moritzm: ^^
[10:38:51] <_joe_>	 (whenever you have time)
[10:39:00] <moritzm>	 thanks, will have a look in a bit
[10:49:22] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] Use local resources in codfw for parsoid, url-downloader and mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279355 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto)
[10:51:27] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "Since this is making excellent sense to be in the cxserver repo config, we should move it over there. I see https://phabricator.wikimedia." [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: 10KartikMistry)
[10:57:22] <Amir1>	 akosiaris: hey, it would be great if you give an eta for when you can check these puppet patches
[10:57:33] <Amir1>	 (or the beta setup)
[10:57:57] <Amir1>	 no rush at all, I'm too excited 
[11:08:22] <akosiaris>	 Amir1: I will be looking into them today, not sure though when they'll be merged. Plan however is for this week to try and get ORES deployed in production
[11:08:43] <Amir1>	 \o/
[11:09:14] <Amir1>	 thanks akosiaris, tell me if you need anything from me. I think I need to explain lots of these patches and choices I've made
[11:10:38] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] "noop for chromium,alsafi, effectively noop for carbon. https://puppet-compiler.wmflabs.org/2320/" [puppet] - 10https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) (owner: 10Filippo Giunchedi)
[11:16:37] <grrrit-wm>	 (03CR) 10Ema: [C: 031] installserver: port squid3 changes for trusty/jessie [puppet] - 10https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) (owner: 10Filippo Giunchedi)
[11:21:30] <grrrit-wm>	 (03CR) 10DCausse: [C: 031] Bump CirrusSearchRequestSet rev to 121456865906 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280448 (https://phabricator.wikimedia.org/T128533) (owner: 10DCausse)
[11:26:11] <grrrit-wm>	 (03PS1) 10Ema: Allow ganglia user to read VSM files [puppet] - 10https://gerrit.wikimedia.org/r/281918 
[11:26:26] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: installserver: port squid3 changes for trusty/jessie [puppet] - 10https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) 
[11:26:34] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] installserver: port squid3 changes for trusty/jessie [puppet] - 10https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) (owner: 10Filippo Giunchedi)
[11:27:26] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Allow ganglia user to read VSM files [puppet] - 10https://gerrit.wikimedia.org/r/281918 (owner: 10Ema)
[11:27:40] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "thanks! another way I tried was "include /etc/squid3/conf.d/*.conf" but alas squid refuses to start if the wildcard matches no files" [puppet] - 10https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) (owner: 10Filippo Giunchedi)
[11:32:57] <matanya>	 godog: u around ?
[11:33:18] <matanya>	 can you please +1 https://phabricator.wikimedia.org/T131895 ?
[11:35:48] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280831 (https://phabricator.wikimedia.org/T131895) (owner: 10Matanya)
[11:35:51] <godog>	 matanya: for sure, {{done}}
[11:35:59] <matanya>	 thanks much godog
[11:36:42] <grrrit-wm>	 (03CR) 10KartikMistry: "I realized that T122498 is not straightforward to fix, but yes - that's the direction we need to go. Until, that is done, can this be merg" [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: 10KartikMistry)
[11:36:59] <grrrit-wm>	 (03CR) 10DCausse: Actiavte SSL + connection pooling for CirrusSearch on PROD (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) (owner: 10Gehel)
[11:37:35] <kart_>	 apergos: reminder for https://phabricator.wikimedia.org/T127793 as you told :)
[11:38:23] <kart_>	 apergos: if possible, can you add an estimated time once we start the work? That will be helpful for setting priority for team (third party awaits, so we can tell them).
[11:40:14] <wikibugs>	 6Operations, 6Labs, 10Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2183182 (10faidon) p:5Normal>3High Hey — puppet hasn't been running properly on labnet1002 with the above failure for almost a...
[11:46:04] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "Not straightforward to fix ? How come ? Care to share more info ?" [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: 10KartikMistry)
[11:48:13] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: Allow ganglia user to read VSM files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/281918 (owner: 10Ema)
[11:49:11] <grrrit-wm>	 (03PS1) 10Mschon: puppet-lint runs through now, changed scope of variable from $hostname to $::hostname [puppet] - 10https://gerrit.wikimedia.org/r/281922 
[11:51:07] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] puppet-lint runs through now, changed scope of variable from $hostname to $::hostname [puppet] - 10https://gerrit.wikimedia.org/r/281922 (owner: 10Mschon)
[11:51:12] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: puppet-lint runs through now, changed scope of variable from $hostname to $::hostname [puppet] - 10https://gerrit.wikimedia.org/r/281922 (owner: 10Mschon)
[11:51:17] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [V: 032] puppet-lint runs through now, changed scope of variable from $hostname to $::hostname [puppet] - 10https://gerrit.wikimedia.org/r/281922 (owner: 10Mschon)
[11:58:33] <wikibugs>	 6Operations, 10Analytics: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183195 (10elukey) Executed the command and started nodetool repair on aqs1002.
[11:59:26] <grrrit-wm>	 (03CR) 10Ema: [C: 04-1] Allow ganglia user to read VSM files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/281918 (owner: 10Ema)
[12:09:09] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[12:09:18] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[12:11:35] <grrrit-wm>	 (03CR) 10Ema: "RxURL is used to match VSL log entries with transactions:" [puppet] - 10https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) (owner: 10BBlack)
[12:19:17] <grrrit-wm>	 (03CR) 10Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) (owner: 10Gehel)
[12:19:48] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:21:17] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:21:18] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:21:39] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:29:24] <grrrit-wm>	 (03PS3) 10Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) 
[12:30:23] <wikibugs>	 6Operations: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2183228 (10Cmjohnson)
[12:30:26] <wikibugs>	 6Operations, 10ops-eqiad: update labels and visible label field for stat1004/WMF4721 - https://phabricator.wikimedia.org/T131902#2183226 (10Cmjohnson) 5Open>3Resolved done
[12:32:55] <wikibugs>	 6Operations, 10ops-eqiad: stat1002 broken disk causing degraded RAID array - https://phabricator.wikimedia.org/T131758#2183231 (10Cmjohnson) 5Open>3Resolved Disk has been replaced and back online
[12:35:04] <elukey>	 ---^ \o/ thanks
[12:37:16] <wikibugs>	 6Operations, 10media-storage, 7Tracking: [tracking] refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2183250 (10fgiunchedi) one of the questions for the next order is 3TB vs 4TB disks, the last order of 3x eqiad and 6x codfw {T114500} and related was for 4TB.  to gauge the im...
[12:47:07] <icinga-wm>	 PROBLEM - Host snapshot1006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:50:57] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[12:51:27] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[12:53:18] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[12:56:37] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[13:01:48] <grrrit-wm>	 (03PS4) 10Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) 
[13:03:33] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 031] "One enhancement proposal, but otherwise looks good to me." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/281885 (owner: 10Giuseppe Lavagetto)
[13:08:32] <icinga-wm>	 RECOVERY - Host snapshot1006 is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms
[13:10:34] <_joe_>	 moritzm: heh fair enough
[13:10:47] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: hhvm: add systemd/jessie support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/281885 (owner: 10Giuseppe Lavagetto)
[13:10:57] <grrrit-wm>	 (03Abandoned) 10Hashar: Increase default thumbnail display size from 220px to 300px [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154408 (https://bugzilla.wikimedia.org/67709) (owner: 10Jforrester)
[13:11:27] <grrrit-wm>	 (03PS4) 10Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - 10https://gerrit.wikimedia.org/r/281885 
[13:11:31] <grrrit-wm>	 (03Abandoned) 10Hashar: [WIP] Make VisualEditor access RESTbase directly on private wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200107 (owner: 10Jforrester)
[13:15:30] <grrrit-wm>	 (03CR) 10Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) (owner: 10Gehel)
[13:16:11] <gehel>	 Could anyone have a look at ^ https://gerrit.wikimedia.org/r/#/c/281881/ ? I wrote some pretty ugly code and I'm sure there is a better way, but my brain seems frozen...
[13:16:18] <gehel>	 Comments inline
[13:17:02] <icinga-wm>	 RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:19:22] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: "Yeah I would like to understand why it is so hard to move this file to the code repository. I removed my -2 because this is not as ugly an" [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: 10KartikMistry)
[13:22:36] <grrrit-wm>	 (03CR) 10Fjalapeno: "@Krenair are you still seeing the redirect? Or are you just unable to test?" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[13:23:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:23:22] <icinga-wm>	 PROBLEM - salt-minion processes on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:23:23] <icinga-wm>	 PROBLEM - DPKG on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:23:23] <icinga-wm>	 PROBLEM - Check size of conntrack table on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:23:45] <grrrit-wm>	 (03CR) 10Alex Monk: "It's live on beta but I still get a redirect" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[13:23:51] <icinga-wm>	 PROBLEM - RAID on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:23:52] <icinga-wm>	 PROBLEM - Disk space on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:24:11] <icinga-wm>	 PROBLEM - configured eth on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:24:21] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:24:22] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:24:32] <icinga-wm>	 PROBLEM - dhclient process on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:24:43] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:24:52] <icinga-wm>	 PROBLEM - puppet last run on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:27:36] <wikibugs>	 6Operations, 7Availability, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2183325 (10fgiunchedi) replication for thumbs has finished: [[ https://graphite.wikimedia.org/render/?width=723...
[13:27:43] <icinga-wm>	 PROBLEM - Host analytics1051 is DOWN: PING CRITICAL - Packet loss = 100%
[13:27:48] <wikibugs>	 6Operations, 7Availability, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2183329 (10fgiunchedi) 5Open>3Resolved
[13:28:04] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 031] hhvm: add systemd/jessie support [puppet] - 10https://gerrit.wikimedia.org/r/281885 (owner: 10Giuseppe Lavagetto)
[13:28:25] <grrrit-wm>	 (03CR) 10Alex Monk: "well... I thought it was live, but puppet seems very broken and the line has gone missing" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[13:30:51] <icinga-wm>	 RECOVERY - Check size of conntrack table on analytics1051 is OK: OK: nf_conntrack is 0 % full
[13:30:51] <icinga-wm>	 RECOVERY - DPKG on analytics1051 is OK: All packages OK
[13:31:00] <wikibugs>	 6Operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2183332 (10BBlack) 5Open>3Resolved I'm guessing by now they're all naturally expiring out anyways since there's no further feedback.
[13:31:01] <icinga-wm>	 RECOVERY - Host analytics1051 is UP: PING OK - Packet loss = 0%, RTA = 1.62 ms
[13:31:13] <icinga-wm>	 RECOVERY - RAID on analytics1051 is OK: OK: optimal, 13 logical, 14 physical
[13:31:21] <icinga-wm>	 RECOVERY - Disk space on analytics1051 is OK: DISK OK
[13:31:41] <grrrit-wm>	 (03CR) 10Fjalapeno: "Oh - ok - hmmm… that's odd. I've also reached out to Brion on this - I CC'd him as well. He also knows the iOS app so may be able to help." [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[13:31:59] <icinga-wm>	 RECOVERY - configured eth on analytics1051 is OK: OK - interfaces up
[13:32:21] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on analytics1051 is OK: DISK OK
[13:32:24] <elukey>	 analytics1051 was rebooted, mmm
[13:32:26] <grrrit-wm>	 (03CR) 10Brion VIBBER: "Patch looks legit enough... Yeah double-check that it got fully deployed. :)" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[13:32:33] <grrrit-wm>	 (03CR) 10Alex Monk: "I don't own any iOS/OS X devices and never have, I'm just fiddling around with redirects in apache" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[13:32:38] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1051 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[13:32:50] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1051 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[13:32:59] <icinga-wm>	 PROBLEM - NTP on analytics1051 is CRITICAL: NTP CRITICAL: Offset unknown
[13:33:12] <wikibugs>	 6Operations, 10Traffic, 6Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2183337 (10BBlack) p:5Low>3Normal We didn't end up keeping SPDY disabled, and HTTP/2 is coming.  From our end, this is a relatively simple change now, but t...
[13:33:30] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1051 is OK: OK: YARN NodeManager analytics1051.eqiad.wmnet:8041 Node-State: RUNNING
[13:34:08] <icinga-wm>	 RECOVERY - dhclient process on analytics1051 is OK: PROCS OK: 0 processes with command name dhclient
[13:34:19] <icinga-wm>	 RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:34:38] <icinga-wm>	 RECOVERY - salt-minion processes on analytics1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:34:45] <grrrit-wm>	 (03PS5) 10Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - 10https://gerrit.wikimedia.org/r/281885 
[13:35:57] <wikibugs>	 6Operations, 10Traffic: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#2183342 (10BBlack) We actually dug further into related issues when investigating WDQS woes on cache_misc, and the problem is different than what we thought we understood i...
[13:36:20] <wikibugs>	 6Operations, 10Traffic: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#2183343 (10BBlack)
[13:36:22] <wikibugs>	 6Operations, 10Traffic, 13Patch-For-Review, 7Varnish: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2183344 (10BBlack)
[13:38:08] <icinga-wm>	 RECOVERY - NTP on analytics1051 is OK: NTP OK: Offset -0.0004067420959 secs
[13:38:34] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] "Practically a noop on trusty" [puppet] - 10https://gerrit.wikimedia.org/r/281885 (owner: 10Giuseppe Lavagetto)
[13:38:59] <wikibugs>	 6Operations, 10Traffic, 6Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2183349 (10BBlack)
[13:39:02] <wikibugs>	 6Operations, 6Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2183350 (10BBlack)
[13:39:26] <_joe_>	 akosiaris: I'm merging a change from you?
[13:41:01] <wikibugs>	 6Operations, 10Citoid, 6Security-Team, 10Traffic, and 3 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#2183351 (10BBlack) 5Open>3Resolved a:3BBlack When we moved various *oid to the text cluster as part of the parsoidcache decom, they got forced to...
[13:41:06] <_joe_>	 seems innocent enough
[13:41:07] <akosiaris>	 _joe_: damn again.. yes sorry
[13:41:08] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[13:41:20] <_joe_>	 yeah already merged
[13:42:05] <wikibugs>	 6Operations, 10Traffic: Stop using LVS from varnishes - https://phabricator.wikimedia.org/T107956#2183359 (10BBlack) 5Open>3declined We've made decisions about this in the past already and moved past this idea.  The general direction is to always use LVS for multi-host varnish backends, and solve HTTPS iss...
[13:43:31] <wikibugs>	 6Operations: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928#2183368 (10MoritzMuehlenhoff)
[13:45:30] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: Revert "Make MediaWiki call the codfw restbase from all datacenters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281935 
[13:45:33] <ema>	 !log Upgrading cp1052 to jessie 8.4 point release and linux 4.4 (T131746, T131928)
[13:45:34] <stashbot>	 T131746: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746
[13:45:35] <stashbot>	 T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[13:45:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:45:40] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: Revert "cache::text: route traffic for restbase, citoid, cxserver to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/281936 
[13:45:48] <_joe_>	 godog: ^^
[13:46:00] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[13:46:47] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] Revert "cache::text: route traffic for restbase, citoid, cxserver to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/281936 (owner: 10Giuseppe Lavagetto)
[13:46:49] <godog>	 _joe_: thanks!
[13:47:11] <_joe_>	 godog: I have an interview in 10, but you can merge those while I'm away
[13:48:59] <icinga-wm>	 PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: puppet fail
[13:50:01] <_joe_>	 that is just a transitional problem that shows how "good" puppet is
[13:51:32] <godog>	 _joe_: ok, as for the order mediawiki first and then varnish, i.e. reversed from what we did yesterday?
[13:52:20] <hashar>	 !sal
[13:52:21] <wm-bot>	 https://wikitech.wikimedia.org/wiki/Server_Admin_Log  https://tools.wmflabs.org/sal/production   See it and you will know all you need.
[13:52:48] <icinga-wm>	 RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:53:34] <_joe_>	 godog: whatever you prefer
[13:53:40] <_joe_>	 it makes no difference
[13:54:29] <icinga-wm>	 PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100%
[13:54:54] <_joe_>	 ema: expected ? ^^
[13:54:59] <icinga-wm>	 RECOVERY - Host cp1052 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[13:55:18] <ema>	 _joe_: yep
[13:55:29] <icinga-wm>	 RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:56:22] <godog>	 _joe_: ok I'll do varnish first and mediawiki second then
[13:56:25] <godog>	 urandom: ^
[13:56:36] <urandom>	 +1
[13:56:53] <grrrit-wm>	 (03CR) 10Alex Monk: "I seem to have fixed puppet in deployment-prep by applying https://github.com/puppetlabs/puppet/commit/149b24542aa3ffaad2afef8daea05188750" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[13:56:58] <_joe_>	 yeah I won't be around
[13:57:05] <_joe_>	 unless it explodes somehow
[13:58:49] <icinga-wm>	 PROBLEM - HTTPS on cp1052 is CRITICAL: Return code of 255 is out of bounds
[13:59:26] <grrrit-wm>	 (03CR) 10Alex Monk: "It works if I comment out the RewriteRule below it though." [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[13:59:38] <ema>	 cp1052 has an issue with nginx, the host is depooled though 
[14:00:03] <grrrit-wm>	 (03PS1) 10Ladsgroup: wikilabels: healthier uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/281940 
[14:00:49] <icinga-wm>	 PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:02:42] <wikibugs>	 6Operations, 6Labs, 10Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2183406 (10chasemp) Here is what I believe is happening.    Labnet1001 is the inactive node at the moment and has an IPv6 address:...
[14:02:49] <icinga-wm>	 RECOVERY - HTTPS on cp1052 is OK: SSLXNN OK - 36 OK
[14:04:04] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: Revert "cache::text: route traffic for restbase, citoid, cxserver to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/281936 (owner: 10Giuseppe Lavagetto)
[14:04:12] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "cache::text: route traffic for restbase, citoid, cxserver to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/281936 (owner: 10Giuseppe Lavagetto)
[14:04:27] <grrrit-wm>	 (03PS1) 10BBlack: add CAP_CHOWN to tlsproxy nginx caps [puppet] - 10https://gerrit.wikimedia.org/r/281941 
[14:04:30] <icinga-wm>	 RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[14:05:05] <grrrit-wm>	 (03CR) 10Brion VIBBER: "Does the rewrite override the alias maybe? Might have to punch a rewrite rule in instead of an Alias here..." [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[14:05:10] <grrrit-wm>	 (03PS2) 10BBlack: add CAP_CHOWN to tlsproxy nginx caps [puppet] - 10https://gerrit.wikimedia.org/r/281941 
[14:05:18] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] add CAP_CHOWN to tlsproxy nginx caps [puppet] - 10https://gerrit.wikimedia.org/r/281941 (owner: 10BBlack)
[14:05:20] <godog>	 !log move restbase/citoid/cxserver varnish traffic back to eqiad
[14:05:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:08:00] <grrrit-wm>	 (03CR) 10Alex Monk: "http://stackoverflow.com/a/12161249/1306662 :/" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[14:08:41] <wikibugs>	 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183411 (10RobH) >>! In T131880#2182880, @Gehel wrote: > @RobH I'd really appreciate if you could let me do the reclaim / reinstall so tha...
[14:08:48] <icinga-wm>	 PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:09:47] <grrrit-wm>	 (03CR) 10Alex Monk: "I wonder if we can use a location block - https://httpd.apache.org/docs/current/mod/mod_alias.html#order" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[14:11:20] <wikibugs>	 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183429 (10BBlack) There's no real need to reinstall them.  I have patches pending to put them into their proper roles, etc.
[14:11:35] <godog>	 I'll let it simmer for half an hour and then merge https://gerrit.wikimedia.org/r/#/c/281935/
[14:12:37] <urandom>	 k
[14:13:13] <wikibugs>	 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183432 (10BBlack) The patch series starts at: https://gerrit.wikimedia.org/r/#/c/268236/ , but needs manual rebases at this point.
[14:13:48] <wikibugs>	 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183446 (10BBlack) (it's better to look at T109162, that had all the patch links)
[14:13:59] <grrrit-wm>	 (03PS4) 10BBlack: maps DNS 2/2: enable geodns routing [dns] - 10https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) 
[14:15:30] <icinga-wm>	 PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:15:50] <bblack>	 !log rebooting baham (ns1) for 4.4 kernel + package updates
[14:15:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:19:09] <icinga-wm>	 PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[14:19:18] <icinga-wm>	 PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100%
[14:19:48] <wikibugs>	 6Operations, 6Labs, 10Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2183482 (10faidon) squid on carbon over IPv4 works fine — we'd have a lot more failures if that wasn't the case (you can verify th...
[14:21:59] <icinga-wm>	 PROBLEM - Apache HTTP on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.021 second response time
[14:22:19] <icinga-wm>	 PROBLEM - HHVM rendering on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.003 second response time
[14:22:53] <grrrit-wm>	 (03PS5) 10Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) 
[14:24:49] <icinga-wm>	 RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 37.09 ms
[14:25:09] <icinga-wm>	 RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 40.47 ms
[14:25:39] <icinga-wm>	 RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.084 second response time
[14:26:00] <icinga-wm>	 RECOVERY - HHVM rendering on mw1187 is OK: HTTP OK: HTTP/1.1 200 OK - 68427 bytes in 0.126 second response time
[14:26:25] <wikibugs>	 6Operations, 6Labs, 10Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2183484 (10chasemp) >>! In T129623#2183482, @faidon wrote: > squid on carbon over IPv4 works fine — we'd have a lot more failures...
[14:26:45] <elukey>	 !log hhvm restarted on mw1187
[14:26:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:28:25] <icinga-wm>	 PROBLEM - Auth DNS on ns1-v6 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[14:28:37] <godog>	 sad_trombone.wav
[14:29:42] <akosiaris>	 what's up ? 
[14:29:44] <icinga-wm>	 PROBLEM - Auth DNS on ns1-v4 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[14:29:59] <grrrit-wm>	 (03PS6) 10Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) 
[14:30:03] <godog>	 akosiaris: I think benign, baham rebooted earlier by bblack 
[14:30:09] <icinga-wm>	 PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: puppet fail
[14:30:14] <akosiaris>	 ok
[14:30:37] <icinga-wm>	 RECOVERY - Auth DNS on ns1-v6 is OK: DNS OK: 5.081 seconds response time. www.wikipedia.org returns 208.80.154.224
[14:31:56] <icinga-wm>	 RECOVERY - Auth DNS on ns1-v4 is OK: DNS OK: 0.065 seconds response time. www.wikipedia.org returns 208.80.154.224
[14:32:12] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: Revert "Make MediaWiki call the codfw restbase from all datacenters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281935 (owner: 10Giuseppe Lavagetto)
[14:32:22] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Make MediaWiki call the codfw restbase from all datacenters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281935 (owner: 10Giuseppe Lavagetto)
[14:34:00] <logmsgbot>	 !log filippo@tin Synchronized wmf-config/ProductionServices.php: move mediawiki traffic back to restbase eqiad (duration: 00m 34s)
[14:34:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:35:40] <icinga-wm>	 RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:36:09] <icinga-wm>	 RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:37:09] <grrrit-wm>	 (03PS2) 10Alex Monk: Send www.wikipedia.org/apple-app-site-association to the right file without an external redirect [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) 
[14:38:59] <grrrit-wm>	 (03PS1) 10ArielGlenn: set dump cron job date range back to normal, adjust start times [puppet] - 10https://gerrit.wikimedia.org/r/281946 
[14:41:01] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] set dump cron job date range back to normal, adjust start times [puppet] - 10https://gerrit.wikimedia.org/r/281946 (owner: 10ArielGlenn)
[14:42:20] <icinga-wm>	 RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[14:47:22] <wikibugs>	 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 6Zero: Tool labs tools should have a method of identifying Zero traffic - https://phabricator.wikimedia.org/T131934#2183516 (10zhuyifei1999)
[14:53:16] <wikibugs>	 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 6Zero: Tool labs tools should have a method of identifying Zero traffic - https://phabricator.wikimedia.org/T131934#2183516 (10valhallasw) Does Wikipedia Zero include non-wikipedia domains? I would expect tools.wmflabs.org to fall out of scope.
[15:00:04] <jouncebot>	 anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160406T1500). Please do the needful.
[15:00:04] <jouncebot>	 matt_flaschen legoktm Urbanecm dcausse bmansurov: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:11] <bmansurov>	 here
[15:00:14] <legoktm>	 o/
[15:00:26] <dcausse>	 \o
[15:00:34] <_joe_>	 godog: what's the situation on the switchover?
[15:00:49] <matt_flaschen>	 Present
[15:00:59] <wikibugs>	 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183560 (10RobH) @gehel: Since this isn't going to end up being a reinstall, I'll ping you to do a reinstall on one of the many I do every...
[15:01:01] <hashar>	 o/
[15:01:20] <thcipriani>	 I can SWAT this morning
[15:01:24] <hashar>	 for swat:  I merged a patch to the MobileFrontend extension an hour or so ago and haven't rebased the extension on tin yet :(
[15:01:58] <icinga-wm>	 PROBLEM - NTP on baham is CRITICAL: NTP CRITICAL: Offset unknown
[15:03:02] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281640 (https://phabricator.wikimedia.org/T112799) (owner: 10Matthias Mullie)
[15:03:42] <hashar>	 !log rebased php-1.27.0-wmf.19/MobileFrontend and php-1.27.0-wmf.20/MobileFrontend  (single commit related to CI)
[15:03:44] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add Flow dumps schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281640 (https://phabricator.wikimedia.org/T112799) (owner: 10Matthias Mullie)
[15:03:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:05:10] <hashar>	  /srv/mediawiki-staging/php-1.27.0-wmf.20 has uncommitted/staged modifications :-(
[15:05:16] <thcipriani>	 matt_flaschen: is there anything special needed or any coordination needed for the flowdumps change other than syncing it out?
[15:05:27] <matt_flaschen>	 thcipriani, no, it's all static.
[15:06:20] <wikibugs>	 6Operations, 10Beta-Cluster-Infrastructure, 7WorkType-NewFunctionality: etcd/confd is not started on deployment-cache-mobile04 - https://phabricator.wikimedia.org/T116224#2183569 (10Krenair) 5Open>3declined Deleting instead: {T130473}
[15:07:20] <logmsgbot>	 !log thcipriani@tin Synchronized docroot/mediawiki/xml: SWAT: Add Flow dumps schema [[gerrit:281640]] (duration: 00m 28s)
[15:07:24] <thcipriani>	 ^ matt_flaschen check please
[15:07:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:08:34] <matt_flaschen>	 Thanks, thcipriani.  Works fine: https://www.mediawiki.org/xml/flow-1.0/ and https://www.mediawiki.org/xml/flow-1.0.xsd
[15:08:38] <thcipriani>	 hashar: yeah, .20 does have a lot of weird modifications :( I don't know what's up with that.
[15:08:44] <thcipriani>	 matt_flaschen: cool, thanks for checking.
[15:11:00] <hashar>	 thcipriani: I guess those are some live patches that haven't been properly applied or failed to rebase
[15:11:03] <thcipriani>	 Urbanecm: around for SWAT?
[15:11:09] <icinga-wm>	 RECOVERY - NTP on baham is OK: NTP OK: Offset -0.002489447594 secs
[15:11:34] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280448 (https://phabricator.wikimedia.org/T128533) (owner: 10DCausse)
[15:11:58] <thcipriani>	 going to go through config changes, then do the big scap at the end legoktm 
[15:12:10] <legoktm>	 ok :P
[15:12:19] <grrrit-wm>	 (03Merged) 10jenkins-bot: Bump CirrusSearchRequestSet rev to 121456865906 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280448 (https://phabricator.wikimedia.org/T128533) (owner: 10DCausse)
[15:12:27] <Urbanecm>	 I don't understand you, thcipriani. I am ready for SWAT. 
[15:14:49] <wikibugs>	 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2183605 (10mmodell) I'm sure we could hack the Jenkins job to use https but the staging...
[15:15:08] <ema>	 !log Upgrading cp* to jessie 8.4 point release and linux 4.4 (T131746, T131928). Not rebooting yet.
[15:15:09] <stashbot>	 T131746: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746
[15:15:09] <stashbot>	 T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[15:15:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:15:32] <wikibugs>	 6Operations, 10Analytics-Cluster, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2183608 (10Ottomata) @robh, bump on this too.
[15:16:45] <godog>	 _joe_: both patches merged, IOW {{done}}
[15:17:02] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/event-schemas: SWAT: Bump CirrusSearchRequestSet rev to 121456865906 PART I [[gerrit:280448]] (duration: 00m 27s)
[15:17:06] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:17:42] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Bump CirrusSearchRequestSet rev to 121456865906 PART II [[gerrit:280448]] (duration: 00m 30s)
[15:17:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:17:49] <thcipriani>	 ^ dcausse check please
[15:18:16] <thcipriani>	 blerg Notice: Avro failed to serialize record for CirrusSearchRequestSet
[15:18:41] <thcipriani>	 lots and lots of those
[15:18:45] <dcausse>	 thcipriani: damn
[15:18:47] <_joe_>	 revert
[15:18:57] <dcausse>	 yes revert (sorry)
[15:19:18] <dcausse>	 thcipriani: only InitialiseSettings should be ok
[15:20:34] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: REVERT Bump CirrusSearchRequestSet rev to 121456865906 PART II [[gerrit:280448]] (duration: 00m 31s)
[15:20:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:20:48] <wikibugs>	 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2183636 (10mmodell) Why is labs blocked from connecting to ssh? Is that to avoid people...
[15:21:20] <thcipriani>	 dcausse: ok, lemme get a patch up for the revert.
[15:21:46] <urandom>	 godog: I'm going to restore the bootstrap stream rate then
[15:21:56] <godog>	 urandom: sweet, thanks!
[15:23:10] <urandom>	 !log Restoring default stream throughput on restbase200{3,4-a}.codfw.wmnet
[15:23:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:24:00] <urandom>	 godog: how is the rebuild on 2003 doing btw?
[15:24:24] <grrrit-wm>	 (03PS1) 10Thcipriani: Revert "Merge "Bump CirrusSearchRequestSet rev to 121456865906"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281952 
[15:24:45] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281952 (owner: 10Thcipriani)
[15:25:10] <grrrit-wm>	 (03Merged) 10jenkins-bot: Revert "Merge "Bump CirrusSearchRequestSet rev to 121456865906"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281952 (owner: 10Thcipriani)
[15:25:46] <godog>	 urandom: progressing afaics, 42% but throttled at 6MB/s
[15:26:51] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm)
[15:26:55] <urandom>	 godog: was it always throttled there, or was that for the switchover?
[15:27:00] <wikibugs>	 6Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2183653 (10RobH) a:5RobH>3None Yes, I think we need a network admin to investigate the dhcp ability of the analytics vlan to carbon, as I cannot seem to...
[15:27:25] <godog>	 urandom: we started like that, though we can bump it now I'd say
[15:27:33] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm)
[15:27:39] <urandom>	 godog: i guess that makes the eta something like, Friday or later
[15:28:33] <thcipriani>	 dcausse: hmmm, still seeing the errors coming in, although at a lower rate: https://logstash.wikimedia.org/#dashboard/temp/AVPsL-QYO3D718AOlQeh
[15:28:56] <dcausse>	 thcipriani: are all the wikis on wmf19?
[15:29:07] <greg-g>	 dcausse: no, just group0
[15:29:11] <greg-g>	 er, just group1 and 2
[15:29:15] <greg-g>	 wmf20 is on group0
[15:29:17] <godog>	 urandom: err, 8MB/s, ETA is like 24h now
[15:29:26] <dcausse>	 hmmm... so it should work :/
[15:29:41] <urandom>	 godog: cool
[15:30:20] <dcausse>	 schema is back to the previous one on enwiki
[15:30:55] <greg-g>	 dcausse: https://tools.wmflabs.org/versions/ and you can click on the version numbers to see which wikis are included
[15:31:09] <dcausse>	 greg-g: thanks
[15:31:43] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Translate extension on uawikimedia [[gerrit:281403]] (duration: 00m 27s)
[15:31:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:31:52] <thcipriani>	 ^ Urbanecm  check please
[15:32:56] <Urbanecm>	 Thcipriani: uawikimedia is down. 
[15:33:03] <Urbanecm>	 output
[15:33:04] <thcipriani>	 yup, running revert now
[15:33:09] <Urbanecm>	 MediaWiki internal error.
[15:33:09] <Urbanecm>	 Exception caught inside exception handler. Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information.
[15:33:24] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: REVERT Enable Translate extension on uawikimedia [[gerrit:281403]] (duration: 00m 25s)
[15:33:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:33:36] <Krenair>	 did you add the tables thcipriani?
[15:33:57] <thcipriani>	 no I did not.
[15:34:16] <Krenair>	 you can use the normal extension table creation script for that
[15:38:05] <thcipriani>	 Krenair: which script?
[15:38:32] <wikibugs>	 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2183691 (10mmodell) >>! In T131775#2180958, @chasemp wrote: > I'm pretty sure you mean LVS :)  Yes, stupid error. Corrected now, thanks!  > A hot/cold setup with a like pha...
[15:40:14] <Krenair>	 thcipriani, create extension tables under the wikimedia maintenance extension
[15:49:07] <thcipriani>	 Krenair: mwscript extensions/WikimediaMaintenance/createExtensionTables.php translate --wiki=uawikimedia ?
[15:49:38] <Krenair>	 mwscript extensions/WikimediaMaintenance/createExtensionTables.php uawikimedia translate
[15:49:41] <Krenair>	 ^ I think it's that
[15:49:57] <Krenair>	 don't remember how much it likes --wiki being at the end
[15:50:00] <Krenair>	 might work
[15:50:06] <thcipriani>	 kk, thanks
[15:50:43] <thcipriani>	 ok, let's try this again.
[15:50:44] <Reedy>	 mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=uawikimedia translate
[15:50:47] <Reedy>	 I'd try that usually :P
[15:51:42] <thcipriani>	 heh, Krenair 's version seemed to work :)
[15:54:42] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Translate extension on uawikimedia [[gerrit:281403]] (duration: 00m 28s)
[15:54:53] <thcipriani>	 ^ Urbanecm check please
[15:56:10] <thcipriani>	 marxarelli: would you check wmf.20 por favor? hashar noticed lots of git weirdness therein.
[15:56:34] <Urbanecm>	 It seems that it's working. Thanks. 
[15:56:45] <thcipriani>	 Urbanecm: thank you for checking!
[15:56:46] <marxarelli>	 thcipriani: git weirdness?
[15:56:50] <grrrit-wm>	 (03PS3) 10Alex Monk: Send wikipedia.org/apple-app-site-association to the right file without an external redirect [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) 
[15:57:21] <thcipriani>	 marxarelli: yeah, modified stuff, check it out on tin.
[15:58:36] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov)
[15:58:49] <thcipriani>	 bmansurov: still around for SWAT (I hope) :)
[15:58:53] <bmansurov>	 yes
[15:58:55] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Remove Language Overlay experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov)
[15:59:19] <bmansurov>	 thcipriani: i'll rebase real quick
[15:59:25] <thcipriani>	 kk, thanks
[16:00:57] <grrrit-wm>	 (03PS3) 10Bmansurov: Remove Language Overlay experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) 
[16:01:07] <bmansurov>	 thcipriani: done
[16:01:12] <marxarelli>	 thcipriani: oh geez. it's from the security patches
[16:01:16] <thcipriani>	 kk, let's try this again.
[16:02:00] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov)
[16:02:56] <wikibugs>	 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2183850 (10ArielGlenn) I've done this for the new snapshot hosts and run a test dump of a wiki; it looked fine. I'll keep this open til the misc cron jobs are...
[16:03:07] <grrrit-wm>	 (03CR) 10Brion VIBBER: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[16:03:25] <grrrit-wm>	 (03Merged) 10jenkins-bot: Remove Language Overlay experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov)
[16:06:01] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove Language Overlay experiment [[gerrit:277837]] (duration: 00m 26s)
[16:06:05] <thcipriani>	 ^ bmansurov check please
[16:06:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:06:58] <bmansurov>	 thcipriani: thanks, looks good
[16:07:03] <thcipriani>	 bmansurov: awesome, thanks!
[16:07:28] <wikibugs>	 6Operations, 6Analytics-Kanban: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183891 (10elukey)
[16:07:39] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[16:08:07] <thcipriani>	 legoktm: marxarelli is rearranging some things on wmf.20, don't want to scap in the middle of it.
[16:08:19] <icinga-wm>	 PROBLEM - Apache HTTP on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:08:35] <legoktm>	 no worries, I'll be here for a while :P
[16:08:59] <icinga-wm>	 PROBLEM - HHVM rendering on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:09:00] <thcipriani>	 legoktm: kk, I'll poke you when we're done, thanks :)
[16:10:25] <hashar>	 thcipriani: there is a pending change for MobileFrontend wmf.20  . It is for CI build
[16:10:46] <thcipriani>	 hashar: right, it's merged already, correct?
[16:10:50] <hashar>	 yeah
[16:10:57] <hashar>	 I haven't rebased the MobileFrontend repo on tin  since the mediawiki working copy has some staged diff
[16:11:08] <hashar>	 but it is definitely harmless for prod (just tweaks package.json)
[16:11:26] <marxarelli>	 hashar: I8ea086cedd81c0cd626452b375a6ae1e81460943 ?
[16:11:33] <marxarelli>	 just pulled that down
[16:11:34] <grrrit-wm>	 (03CR) 10Dereckson: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm)
[16:12:00] <wikibugs>	 6Operations: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746#2183894 (10ema) p:5Triage>3Normal
[16:20:56] <marxarelli>	 thcipriani, legoktm: ok, should be good to go now
[16:21:29] <grrrit-wm>	 (03PS2) 10Ema: Misc cluster VCL: avoid name conflict between directors and probes [puppet] - 10https://gerrit.wikimedia.org/r/281457 (https://phabricator.wikimedia.org/T131501) 
[16:22:03] <grrrit-wm>	 (03CR) 10Ema: [C: 032 V: 032] Misc cluster VCL: avoid name conflict between directors and probes [puppet] - 10https://gerrit.wikimedia.org/r/281457 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema)
[16:24:17] <thcipriani>	 legoktm: I'm around to scap if you're around to check
[16:24:28] <legoktm>	 I am!
[16:25:19] <wikibugs>	 6Operations, 10Mathoid: Travis PNG looks different from vagrant png - https://phabricator.wikimedia.org/T94379#2183924 (10Physikerwelt) p:5Low>3Triage
[16:25:32] <wikibugs>	 6Operations, 10ops-eqiad, 6DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2147470 (10Milimetric) I know we're supposed to convert these to SSDs soon, but I would sleep a lot easier if we fixed the disk.  If another one fails we'll lose a lot of data and have to backfil...
[16:27:28] <wikibugs>	 6Operations, 10Mathoid: Travis PNG looks different from vagrant png - https://phabricator.wikimedia.org/T94379#2183948 (10Physikerwelt) I think we should reclassify the importance of this bug for two reasons. 1. The problem is also prevalent in production (https://en.wikipedia.org/api/rest_v1/media/math/render...
[16:30:19] <icinga-wm>	 PROBLEM - HHVM rendering on mw1210 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.015 second response time
[16:30:57] <grrrit-wm>	 (03PS8) 10Dereckson: Support handoff and credential sharing with the iOS app [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno)
[16:31:38] <legoktm>	 thcipriani: ^^
[16:32:00] <thcipriani>	 legoktm: whoops, missed your reply, kk going :)
[16:32:18] <icinga-wm>	 RECOVERY - HHVM rendering on mw1210 is OK: HTTP OK: HTTP/1.1 200 OK - 66442 bytes in 0.099 second response time
[16:32:20] <wikibugs>	 6Operations, 10Mathoid, 6Services: Travis PNG looks different from vagrant png - https://phabricator.wikimedia.org/T94379#2183958 (10Physikerwelt)
[16:32:47] <logmsgbot>	 !log thcipriani@tin Started scap: SWAT: Add user_wpzero AbuseFilter variable [[gerrit:281867]]
[16:32:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:32:54] <thcipriani>	 ^ legoktm started
[16:33:36] <grrrit-wm>	 (03CR) 10Dereckson: [C: 031] "PS8: use previous version indent style, to preserve the git blame information for the applinks sections." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno)
[16:34:16] <paravoid>	 twentyafterfour: still getting shitton of emails from /srv/phab/tools/public_task_dump.py for "rtppl"
[16:34:41] <legoktm>	 woot :D
[16:34:52] <grrrit-wm>	 (03CR) 10Fjalapeno: [C: 031] Support handoff and credential sharing with the iOS app [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno)
[16:40:03] <grrrit-wm>	 (03PS2) 10ArielGlenn: https redirect for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) 
[16:40:27] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] https redirect for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) (owner: 10ArielGlenn)
[16:42:15] <twentyafterfour>	 paravoid: oh, I thought I fixed that. let me see
[16:45:30] <grrrit-wm>	 (03PS3) 10ArielGlenn: https redirect for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) 
[16:46:57] <wikibugs>	 6Operations, 6Analytics-Kanban: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#1934216 (10Eevans) >>! In T123629#2143751, @MoritzMuehlenhoff wrote: > Upgrade procedure: > - Depool one of the aqs servers via conftool > - Stop restbase  > - nodetool drain && systemctl stop cassandra > - u...
[16:47:49] <wikibugs>	 6Operations, 10MediaWiki-Parser, 10Traffic, 10Wikidata, and 2 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2184019 (10Jdlrobson)
[16:49:51] <grrrit-wm>	 (03CR) 10ArielGlenn: "I think the 301 method may not work for POST whereas the rewrite does. Have a look at the current patchset and see what you think." [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) (owner: 10ArielGlenn)
[16:57:38] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[17:00:26] <logmsgbot>	 !log thcipriani@tin Finished scap: SWAT: Add user_wpzero AbuseFilter variable [[gerrit:281867]] (duration: 27m 39s)
[17:00:31] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:00:31] <thcipriani>	 ^ legoktm done!
[17:00:37] <legoktm>	 YAAAAY
[17:00:49] <thcipriani>	 :D
[17:01:14] <legoktm>	 thcipriani: confirmed working :)
[17:01:15] <legoktm>	 thanks!
[17:01:29] <thcipriani>	 legoktm: cool, thanks for checking!
[17:05:13] <matanya>	 Dereckson: can you please schedule your patch for thursday, and i will do the same ?
[17:07:37] <Dereckson>	 k, but mine should also get approved, as it has only a green light from security point of view, not yet from ops, from a performance point of view
[17:08:04] <matanya>	 godog: ^ ? :)
[17:09:01] <godog>	 Dereckson matanya what's the context?
[17:09:25] <Dereckson>	 godog: in addition to https://gerrit.wikimedia.org/r/#/c/280831 we would like to change https://gerrit.wikimedia.org/r/#/c/281823/
[17:14:21] <godog>	 Dereckson: ack, thanks, looking
[17:18:35] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "a comment on the actual value, also do we know how often mediawiki hits this timeout?" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) (owner: 10Dereckson)
[17:22:06] <grrrit-wm>	 (03CR) 10Fjalapeno: "Brion you mind +1 ing again?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno)
[17:25:55] <grrrit-wm>	 (03PS1) 10DCausse: Remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) 
[17:26:09] <icinga-wm>	 RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[17:26:52] <grrrit-wm>	 (03CR) 10DCausse: [C: 04-1] "I1614ed5 needs to be deployed before" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse)
[17:27:09] <grrrit-wm>	 (03CR) 10Brion VIBBER: [C: 031] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno)
[17:27:37] <grrrit-wm>	 (03CR) 10Dereckson: Raise upload-by-URL request timeout (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) (owner: 10Dereckson)
[17:28:27] <grrrit-wm>	 (03PS2) 10Dereckson: Remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse)
[17:28:41] <grrrit-wm>	 (03CR) 10Fjalapeno: [C: 031] "lgtm as well" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[17:29:57] <grrrit-wm>	 (03CR) 10Fjalapeno: "Krenair - how do merge/deployments work for this type of change? Do I need to schedule a SWAT or will this just get merged and go out on t" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[17:31:34] <grrrit-wm>	 (03CR) 10Alex Monk: "It's a puppet change so someone with ops rights will need to do it. There is puppetswat though..." [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk)
[17:36:30] <Dereckson>	 godog: the question of how many is tricky: it's a feature currently restricted to GWT users (mainly GLAM institutions with a lot of files to upload) and Wikimedia Commons sysops. Would you know who could tell us where/how/if the information is logged?
[17:39:46] <wikibugs>	 6Operations, 10ops-eqiad, 6DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2184216 (10Ottomata) Ja let’s do this.  @cmjohnson1 ja?!
[17:41:36] <grrrit-wm>	 (03PS1) 10Dereckson: Set wgSemiprotectedRestrictionLevels for en.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281967 (https://phabricator.wikimedia.org/T126607) 
[17:46:53] <grrrit-wm>	 (03PS1) 10Muehlenhoff: List all required restarts next to the new restarts introduced by a library upgrade [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/281969 
[17:46:55] <grrrit-wm>	 (03PS2) 10Dereckson: Raise upload-by-URL request timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) 
[17:47:58] <grrrit-wm>	 (03CR) 10Dereckson: "PS2: 180 seconds instead of 300, per Filippo comment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) (owner: 10Dereckson)
[17:52:20] <wikibugs>	 6Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184248 (10faidon) The port was also on the labs-instance-ports interface-range, which set the port-mode to trunk (and also added labs-instances1-eqiad to t...
[17:53:28] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] List all required restarts next to the new restarts introduced by a library upgrade [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/281969 (owner: 10Muehlenhoff)
[17:55:18] * Nemo_bis is eating the last decent oranges of the season and starts craving for susine and pesche
[17:55:30] <wikibugs>	 6Operations, 10ops-codfw: rack conf100[123] - https://phabricator.wikimedia.org/T131959#2184249 (10RobH)
[17:55:39] <Nemo_bis>	 oh I was stuck at a Yuvi comment of many hours ago, sorry :)
[17:55:41] <wikibugs>	 6Operations, 10ops-codfw: rack conf100[123] - https://phabricator.wikimedia.org/T131959#2184266 (10RobH)
[17:55:53] <wikibugs>	 6Operations, 10ops-codfw: rack conf100[123] - https://phabricator.wikimedia.org/T131959#2184249 (10RobH) p:5Triage>3Normal
[17:56:19] <wikibugs>	 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1890778 (10RobH) This has been ordered, and now has a public blocking/racking task of T131959.
[17:56:27] <wikibugs>	 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2184272 (10RobH)
[17:56:43] <wikibugs>	 6Operations, 10ops-codfw: rack/setup/deploy conf100[123] - https://phabricator.wikimedia.org/T131959#2184249 (10RobH)
[17:59:06] <akosiaris>	 yurik: so, maps servers are marked as downtime in icinga, we are good to go with the nodejs 4.3 migration. Whenever you are ready!
[18:14:33] <wikibugs>	 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2184291 (10yuvipanda) 5Resolved>3Open re-opening, since there is some issues still (I just found time to check back on it).  So...
[18:16:39] <wikibugs>	 6Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184309 (10RobH) Ok, multiple attempts have still resulted in no joy (no dhcp request hitting carbon.)  The system was also showing in the config in the def...
[18:18:21] <wikibugs>	 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2184310 (10yuvipanda) now at:  ```  +---------------------------------------------+ [!!] Configuring grub-pc +----------------------...
[18:18:21] <grrrit-wm>	 (03PS1) 10BBlack: LVS: add salt grain for lvs:(primary|secondary) [puppet] - 10https://gerrit.wikimedia.org/r/281972 
[18:19:45] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] LVS: add salt grain for lvs:(primary|secondary) [puppet] - 10https://gerrit.wikimedia.org/r/281972 (owner: 10BBlack)
[18:20:38] <wikibugs>	 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165259 (10Andrew) > Why is labs intentionally blocked from connecting to ssh?  Can you...
[18:21:33] <grrrit-wm>	 (03PS2) 10BBlack: LVS: add salt grain for lvs:(primary|secondary) [puppet] - 10https://gerrit.wikimedia.org/r/281972 
[18:22:35] <grrrit-wm>	 (03PS1) 10Matanya: webp: enabled by default - remove old dead code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281973 (https://phabricator.wikimedia.org/T27397) 
[18:22:46] <wikibugs>	 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2184316 (10yuvipanda) I'm trying on notebook1002 now
[18:23:53] <wikibugs>	 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2184318 (10yuvipanda) notebook1002 also seems to have the same thing going, stuck at the same '4 of 9'. I wonder if that's something...
[18:26:22] <wikibugs>	 6Operations: Boot time race condition when assembling root raid device - https://phabricator.wikimedia.org/T131961#2184334 (10ema)
[18:27:19] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] "works in puppet compiler, ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/281972 (owner: 10BBlack)
[18:30:19] <wikibugs>	 6Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2184350 (10BBlack) So, the gerrit change is held up on comments about `mx ?all` vs `mx -all`.  Are we confident phab emails only come from our mxes? ping @chasemp...
[18:34:34] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184372 (10yuvipanda)
[18:35:01] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184391 (10yuvipanda) now at:  +---------------------------------------------+ [!!] Configuring grub-pc +----------------------------------------------+  |...
[18:36:22] <wikibugs>	 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2145652 (10yuvipanda) 5Open>3Resolved Moved to T131964
[18:40:27] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184402 (10yuvipanda) notebook1002 is now failing with:  ```        +-------------------------------------+ [!!] Select and install software +-------------------------------------+        |...
[18:43:24] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184405 (10yuvipanda) Possible error things:  ``` pr  2 18:05:56 main-menu[392]: (process:7949): /var/lib/partman/devices/=dev=sda Apr  2 18:05:56 main-menu[392]: (process:7949): /bin/autopartition-lvm: line 1: stat...
[18:44:16] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184406 (10yuvipanda) ```Apr  2 18:05:58 debconf: <-- 0 Retrying failed download of http://mirrors.wikimedia.org/debian/dists/jessie/main/binary-amd64/Packages.gz ```  might be more of the problem.
[18:45:49] <grrrit-wm>	 (03PS3) 10BBlack: cache_maps: add all sites in LVS [puppet] - 10https://gerrit.wikimedia.org/r/268238 (https://phabricator.wikimedia.org/T109162) 
[18:45:51] <grrrit-wm>	 (03PS4) 10BBlack: cache_maps: re-role old mobile servers [puppet] - 10https://gerrit.wikimedia.org/r/268236 (https://phabricator.wikimedia.org/T109162) 
[18:45:53] <grrrit-wm>	 (03PS3) 10BBlack: cache_maps: remove cp104[34] test caches [puppet] - 10https://gerrit.wikimedia.org/r/268237 (https://phabricator.wikimedia.org/T109162) 
[18:49:58] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184433 (10yuvipanda) ```Apr  6 18:08:02 in-target: The following packages have unmet dependencies: Apr  6 18:08:02 in-target:  bind9-host : Depends: libbind9-90 (= 1:9.9.5.dfsg-9+deb8u6) but it is not going to be i...
[18:49:58] <yurik>	 akosiaris and I will be switching maps services to node4.3 now, and will use trebuchet to update maps services
[18:50:09] <akosiaris>	 !log disable salt-minion on maps-test200{1,2,3} for maps services deployment. nodejs upgrade is in place
[18:50:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:50:21] <akosiaris>	 yurik: you are good to go
[18:50:33] <yurik>	 akosiaris, going...
[18:51:33] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] wikilabels: healthier uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/281940 (owner: 10Ladsgroup)
[18:51:37] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: wikilabels: healthier uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/281940 (owner: 10Ladsgroup)
[18:51:41] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [V: 032] wikilabels: healthier uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/281940 (owner: 10Ladsgroup)
[18:51:53] <grrrit-wm>	 (03PS1) 10Dereckson: Use extension registration for ProofreadPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281976 (https://phabricator.wikimedia.org/T119117) 
[18:52:01] <wikibugs>	 6Operations: Default gateway unreachable on baham.wikimedia.org after reboot - https://phabricator.wikimedia.org/T131966#2184436 (10ema)
[18:52:49] <wikibugs>	 6Operations: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2184452 (10ema)
[18:53:34] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184453 (10yuvipanda) ``` Apr  6 18:18:23 anna[28516]: wget: server returned error: HTTP/1.0 404 Not Found Apr  6 18:18:23 anna[28516]: WARNING **: package retrieval failed Apr  6 18:18:25 choose-mirror[28692]: DEBU...
[18:56:40] <yurik>	 akosiaris, kartotherian & tilerator have been synced, need restart
[18:56:54] <yurik>	 (service, not box :)
[18:57:21] <akosiaris>	 ok, unmasking and restarting
[18:57:31] <yurik>	 MaxSem, ^^
[18:58:05] <yurik>	 i see maps on 2004
[18:58:25] <MaxSem>	 wee
[18:58:26] <akosiaris>	 service-runner is spawning workers this time around. great!
[18:58:48] <YuviPanda>	 !log reboot notebook1001
[18:58:52] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:58:53] <yurik>	 tilerator is still down
[18:59:40] <akosiaris>	 ok did tilerator and tileratorui as well
[18:59:50] <akosiaris>	 so, now let's check everything works as expected
[19:00:01] <yurik>	 akosiaris, both are up
[19:00:04] <jouncebot>	 marxarelli: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160406T1900).
[19:00:20] <yurik>	 akosiaris, seems all is good
[19:00:43] <grrrit-wm>	 (03PS1) 10Dereckson: Flow dblist on noc.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281977 
[19:02:18] <akosiaris>	 yurik: I concur
[19:02:34] <akosiaris>	 so... exact same drill for maps-test2001 ? 
[19:02:35] <akosiaris>	 or more ?
[19:02:52] <akosiaris>	 aaah, lemme pool first maps-test2004
[19:02:57] <akosiaris>	 or we will be without a service
[19:03:06] <grrrit-wm>	 (03CR) 10Dereckson: "These symbolic links have been generated by docroot/noc/createTxtFileSymlinks.sh." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281977 (owner: 10Dereckson)
[19:04:13] <yurik>	 akosiaris, when you upgrade to node43, do you have to stop the service?
[19:05:09] <icinga-wm>	 RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[19:05:17] <akosiaris>	 I do just to be sure cause the behaviour after the upgrade is not exactly well defined, but it's not required strictly
[19:06:46] <yurik>	 akosiaris, lets take all 3 servers out of rotation, and see if 2004 handles the new load
[19:06:56] <yurik>	 don't stop the service, only update the LVS
[19:07:01] <akosiaris>	 ok, done
[19:07:05] <yurik>	 checking
[19:08:09] <grrrit-wm>	 (03CR) 10GWicke: [C: 031] "The code needed for this is now deployed, so we can start producing resource_changed events." [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko)
[19:09:06] <yurik>	 akosiaris, all seems to be good, lets do all 3
[19:09:38] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[19:09:59] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[19:11:14] <akosiaris>	 yurik: ok, gimme 2 mins
[19:11:28] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[19:11:49] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[19:11:59] <akosiaris>	 yurik: and you are good to go
[19:12:30] <yurik>	 akosiaris, trebuchet is all set for 1-3?
[19:12:35] <akosiaris>	 yup
[19:12:51] <akosiaris>	 and nodejs has been upgraded
[19:13:41] <yurik>	 akosiaris, did you disable 2004?  it shows that all 4 fetched
[19:13:45] <yurik>	 not that it matters
[19:13:54] <akosiaris>	 no, I did not
[19:13:59] <yurik>	 ok, its fine
[19:14:01] <akosiaris>	 exactly because it does not matter ;-)
[19:14:04] <yurik>	 hehe
[19:14:24] <yurik>	 akosiaris, kartotherian is done, syncing tilerator
[19:14:26] <grrrit-wm>	 (03PS1) 10Dduvall: group1 wikis to 1.27.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281978 
[19:15:35] <yurik>	 akosiaris, tilerator is done
[19:15:51] <grrrit-wm>	 (03PS1) 10BBlack: secure WMF-Last-Access cookie T119576 [puppet] - 10https://gerrit.wikimedia.org/r/281979 
[19:15:54] <akosiaris>	 !log restart kartotherian on maps-test200{1,2,3}
[19:15:58] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:16:08] <icinga-wm>	 PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: puppet fail
[19:16:26] <grrrit-wm>	 (03PS2) 10BBlack: secure WMF-Last-Access cookie T119576 [puppet] - 10https://gerrit.wikimedia.org/r/281979 
[19:16:32] <akosiaris>	 !log restart tilerator, tileratorui on maps-test200{1,2,3}
[19:16:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:16:40] <akosiaris>	 yurik: ok, done, let's test this
[19:16:55] * yurik runs for the hills
[19:17:08] <grrrit-wm>	 (03PS1) 10BBlack: secure GeoIP cookie T119576 [puppet] - 10https://gerrit.wikimedia.org/r/281980 
[19:18:19] <icinga-wm>	 PROBLEM - HHVM rendering on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time
[19:18:19] <akosiaris>	 seems to be working fine in my tests
[19:19:03] <marxarelli>	 yurik: will the train group1 promotion affect what you're deploying right now? or vice versa?
[19:19:28] <icinga-wm>	 PROBLEM - Apache HTTP on mw1135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.016 second response time
[19:19:30] <icinga-wm>	 PROBLEM - Apache HTTP on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time
[19:19:57] <akosiaris>	 marxarelli: doubtful
[19:20:21] <akosiaris>	 yurik: I think all is well... and maps-test2004 is handling all the load just fine
[19:20:37] <akosiaris>	 of course it's just 300KB/s
[19:20:39] <icinga-wm>	 PROBLEM - HHVM rendering on mw1135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time
[19:20:48] <grrrit-wm>	 (03PS24) 10Ottomata: Hieraize keyholder::agent configuration [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4)
[19:21:07] <marxarelli>	 akosiaris: kk. choo choo it is ...
[19:21:26] <yurik>	 marxarelli, no effect
[19:21:30] <yurik>	 unrelated stuff
[19:21:38] <yurik>	 and its done anyway
[19:21:43] <marxarelli>	 seemed like it but i wanted to double check :)
[19:21:46] <marxarelli>	 thanks
[19:21:59] <icinga-wm>	 RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 68465 bytes in 0.507 second response time
[19:22:10] <akosiaris>	 !log bounce hhvm on mw1135, mw1145
[19:22:10] <grrrit-wm>	 (03CR) 10Dduvall: [C: 032] group1 wikis to 1.27.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281978 (owner: 10Dduvall)
[19:22:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:22:28] <icinga-wm>	 RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 68465 bytes in 0.351 second response time
[19:22:41] <grrrit-wm>	 (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281978 (owner: 10Dduvall)
[19:22:46] <akosiaris>	 yurik: so, I am pooling back maps-test200{1,2,3}, ok ?
[19:22:56] <yurik>	 akosiaris, go ahead
[19:22:59] <logmsgbot>	 !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.20
[19:23:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.046 second response time
[19:23:10] <icinga-wm>	 RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.081 second response time
[19:24:03] <akosiaris>	 !log pool maps-test200{1,2,3} for kartotherian.svc.codfw.wmnet
[19:24:48] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Hieraize keyholder::agent configuration [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4)
[19:26:16] <akosiaris>	 yurik: I am considering this done and a success. Thank you for your business
[19:26:18] <akosiaris>	 ;-)
[19:26:29] <yurik>	 akosiaris, so do i, thank you for all the help! :D
[19:26:44] <grrrit-wm>	 (03CR) 10Reedy: [C: 031] secure GeoIP cookie T119576 [puppet] - 10https://gerrit.wikimedia.org/r/281980 (owner: 10BBlack)
[19:26:44] * yurik waits for some weird bug to surface
[19:26:52] <ottomata>	 may be breaking puppet on tin... :/...
[19:26:54] <grrrit-wm>	 (03CR) 10Reedy: [C: 031] secure WMF-Last-Access cookie T119576 [puppet] - 10https://gerrit.wikimedia.org/r/281979 (owner: 10BBlack)
[19:26:54] <greg-g>	 uh, wikitech down?
[19:27:08] <yurik>	 yep
[19:27:23] <yurik>	 akosiaris, did you break wikitech? :)
[19:28:55] <akosiaris>	 doubtful
[19:29:29] <akosiaris>	 marxarelli: wikitech is throwing 500s, maybe due to the train deploy ?
[19:29:49] <Reedy>	 I don't think wikitech gets changed today
[19:30:04] <Reedy>	 tail: cannot open ‘apache2/error.log’ for reading: Permission denied
[19:30:11] <akosiaris>	 I don't remember in which group it is 
[19:30:20] <Reedy>	 Ah, you're right
[19:30:29] <Reedy>	 Group 1 to .20 in https://github.com/wikimedia/operations-mediawiki-config/commit/c8c8730443ac6bf969f0852dcf1453dfbf0f52c8
[19:30:30] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail
[19:30:48] <Reedy>	 akosiaris: Someone from ops will have to look at the error log :)
[19:30:53] <andrewbogott>	 It's SMW yet again
[19:30:54] <andrewbogott>	 PHP Fatal error:  Call to undefined method Title::newFromRedirect() in /srv/mediawiki/php-1.27.0-wmf.20/extensions/SemanticMediaWiki/includes/SMW_ParserExtensions.php on line 41
[19:30:54] <akosiaris>	 [Wed Apr 06 19:30:39.350356 2016] [:error] [pid 30370] [client 10.68.17.64:42705] PHP Fatal error:  Call to undefined method Title::newFromRedirect() in /srv/mediawiki/php-1.27.0-wmf.20/extensions/SemanticMediaWiki/includes/SMW_ParserExtensions.php on line 41
[19:31:01] <Reedy>	 lol
[19:31:02] <marxarelli>	 akosiaris: grr, looking
[19:31:03] <andrewbogott>	 akosiaris: I was first!
[19:31:08] <Reedy>	 marxarelli: not your fault
[19:31:17] <akosiaris>	 andrewbogott: groumf... yeah I give you that one
[19:31:27] <Reedy>	 marxarelli: Revert it back to .19, and I'll get SMW fixed
[19:31:35] <andrewbogott>	 thanks Reedy 
[19:31:47] <akosiaris>	 it would have been so much cooler if my client believed it was before andrewbogott's
[19:31:57] * andrewbogott LOVES not being the only one who ever logs into the wikitech host
[19:31:58] <akosiaris>	 at least we would be having a distributed systems discussion then
[19:32:29] <icinga-wm>	 RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:32:41] <akosiaris>	 we still have SMW on wikitech ... what for ?
[19:32:52] <akosiaris>	 you know what ? I don't want to know
[19:33:11] <Reedy>	 probably for the best :)
[19:33:23] <akosiaris>	 exactly because of that ^ reason 
[19:33:27] <logmsgbot>	 !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message)
[19:33:34] <akosiaris>	 I am sure it's some obscure thing 
[19:33:59] <andrewbogott>	 It's not very obscure, but we can probably live without it
[19:34:00] <andrewbogott>	 eventually
[19:34:03] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184584 (10yuvipanda) @faidon upgraded debian-installer on carbon, which has resulted in the install completing but getting stuck in an installer loop!
[19:34:16] <Reedy>	 lol, it's really just one line that needs fixing
[19:34:59] <icinga-wm>	 PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:35:25] <wikibugs>	 6Operations, 10Traffic, 7HTTPS, 13Patch-For-Review, 7Varnish: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#1830027 (10BBlack) Note there are probably question-marks around these about insecure requests.  We don't yet block/deny insecure POST traffic ( T105794 ), but we'...
[19:36:48] <grrrit-wm>	 (03PS1) 10Ottomata: Temporarily removing new group deploy-phabricator to fix puppet on tin [puppet] - 10https://gerrit.wikimedia.org/r/281985 
[19:38:37] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Temporarily removing new group deploy-phabricator to fix puppet on tin [puppet] - 10https://gerrit.wikimedia.org/r/281985 (owner: 10Ottomata)
[19:38:40] <icinga-wm>	 PROBLEM - HHVM rendering on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time
[19:39:39] <icinga-wm>	 PROBLEM - Apache HTTP on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time
[19:40:08] <icinga-wm>	 PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[19:40:18] <icinga-wm>	 PROBLEM - HHVM rendering on mw1140 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time
[19:40:36] <marxarelli>	 Reedy: https://phabricator.wikimedia.org/T131973
[19:40:38] <icinga-wm>	 PROBLEM - Apache HTTP on mw1140 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time
[19:40:48] <Reedy>	 marxarelli: I just made a commit :D
[19:41:17] <grrrit-wm>	 (03CR) 10Ottomata: "Something was wrong with this, we'll have to figure it out as a separate patch. With this defined, i was getting:" [puppet] - 10https://gerrit.wikimedia.org/r/281985 (owner: 10Ottomata)
[19:41:24] <marxarelli>	 Reedy: thanks!
[19:41:28] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[19:43:26] <grrrit-wm>	 (03PS29) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) 
[19:44:30] <icinga-wm>	 PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[19:44:55] <wikibugs>	 6Operations, 10Traffic, 7HTTPS, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2184631 (10BBlack) So, we've had the API warning up for a couple of months now.  In general, we've continually fallen behind on promises to notify -> kill insecure...
[19:45:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.034 second response time
[19:45:50] <icinga-wm>	 PROBLEM - HHVM rendering on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.003 second response time
[19:45:57] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata)
[19:46:10] <icinga-wm>	 PROBLEM - HHVM rendering on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time
[19:46:39] <icinga-wm>	 PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.014 second response time
[19:47:00] <icinga-wm>	 PROBLEM - HHVM rendering on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time
[19:47:06] <wikibugs>	 6Operations, 5Continuous-Integration-Scaling, 7HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#1997890 (10akosiaris) So, up to now we did not have to package HHVM for jessie-wikimedia. I don't have an ETA on when it will be re...
[19:47:10] <icinga-wm>	 PROBLEM - Apache HTTP on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time
[19:47:36] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184637 (10yuvipanda) It's no longer in a loop, is back to:   ```      ┌───────────────┤ [!!] Select and install software ├────────────────┐      │                                                                   │...
[19:48:03] <wikibugs>	 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184638 (10yuvipanda) And back to:   ``` Apr  6 19:41:45 in-target: Reading package lists... Apr  6 19:41:45 in-target:  Apr  6 19:41:45 in-target: Building dependency tree... Apr  6 19:41:45 in-target:  Apr  6 19:4...
[19:48:59] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[19:49:08] <logmsgbot>	 !log dduvall@tin Synchronized php-1.27.0-wmf.20/extensions/SemanticMediaWiki/includes/SMW_ParserExtensions.php: Replace usage of Title::newFromRedirect() (duration: 00m 38s)
[19:49:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:49:50] <icinga-wm>	 RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys.
[19:50:56] <grrrit-wm>	 (03PS1) 10Ottomata: Temporarily comment out dumps/dumps scap source until it is ready [puppet] - 10https://gerrit.wikimedia.org/r/281987 
[19:50:58] <logmsgbot>	 !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: Promote labswiki to 1.27.0-wmf.20 following temporary rollback and fix
[19:51:05] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:51:17] <marxarelli>	 Reedy: thanks for the quick fix!
[19:51:24] <wikibugs>	 6Operations, 10Traffic, 7HTTPS, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2184639 (10konklone) @BBlack If you want someone to remind you about it, I am happy to volunteer. ;)
[19:51:25] <Reedy>	 marxarelli: Do I still need to do a bump fix?
[19:51:28] <Reedy>	 *submodule bump
[19:51:52] <marxarelli>	 Reedy: shouldn't need to. .gitmodules is tracking 1.8.x for SMW
[19:52:01] <Reedy>	 ah, I wasn't sure if it autobumped :)
[19:52:36] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Temporarily comment out dumps/dumps scap source until it is ready [puppet] - 10https://gerrit.wikimedia.org/r/281987 (owner: 10Ottomata)
[19:52:47] <Reedy>	 marxarelli: It doesn't look to have done..
[19:52:50] <icinga-wm>	 PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[19:53:13] <ottomata>	 HMmmMM
[19:53:15] <ottomata>	 on mira?
[19:53:49] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 2 failures
[19:54:22] <marxarelli>	 Reedy: ah, right. no, there's no commit yet. i just pulled down the latest from 1.8.x
[19:54:24] <Reedy>	 https://gerrit.wikimedia.org/r/281988
[19:54:58] <marxarelli>	 Reedy: i went rogue on that one, sorry :)
[19:55:06] <Reedy>	 heh, no worries
[19:55:39] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[19:56:28] <icinga-wm>	 RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys.
[20:00:04] <jouncebot>	 gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160406T2000). Please do the needful.
[20:01:56] <mdholloway>	 no mobileapps deployment today
[20:01:58] <icinga-wm>	 PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures
[20:02:10] <icinga-wm>	 RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:03:40] <icinga-wm>	 RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:05:20] <wikibugs>	 Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184661 (yuvipanda) BIOS had PXEboot ahead of local hard drive, so I've switched that over now (on 1001)
[20:06:56] <subbu>	 !log starting parsoid deploy
[20:07:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:08:51] <wikibugs>	 Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184669 (yuvipanda) now it's stuck just booting up, at:  ```Scanning for devices.  Please wait, this may take several minutes...```
[20:09:19] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0]
[20:10:14] <subbu>	 !log synced code; restarted parsoid on wtp1001 as a canary
[20:10:21] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:12:16] <wikibugs>	 Operations, RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2184683 (Eevans) I conducted an audit of compactions on restbase1007-a.eqiad.wmnet over the weekend (from April 1-4), the result of which can be seen, visualized as a directed graph, here:...
[20:17:05] <wikibugs>	 Operations, RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2184699 (Eevans) /cc'ing @JAllemandou and @elukey as AQS uses DTCS too if I'm not mistaken; It wouldn't hurt to have a look at how compaction is working on the AQS cluster
[20:17:08] <subbu>	 !log finished deploying parsoid sha 5f6c0c60
[20:17:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:19:59] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[20:25:15] <wikibugs>	 Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184706 (yuvipanda) OK, the boot order fixed it for notebook1002!  1001 is still stuck
[20:41:41] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Phabricator, netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2184719 (chasemp) 22 to only 208.80.154.250/32 as the service address for git-ssh shou...
[20:42:04] <Platonides>	 oh
[20:42:14] <Platonides>	 "talk to gearman", not "talk German"
[20:44:37] <grrrit-wm>	 (PS1) ArielGlenn: dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997
[20:44:57] <grrrit-wm>	 (PS1) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998
[20:45:05] * apergos boggles at Platonides
[20:45:22] <apergos>	 we're big on multilingualism but...
[20:45:44] <grrrit-wm>	 (PS2) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998
[20:46:31] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 (owner: ArielGlenn)
[20:47:27] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:47:55] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:48:23] <grrrit-wm>	 (PS3) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998
[20:49:24] <grrrit-wm>	 (PS2) ArielGlenn: dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997
[20:50:36] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 (owner: ArielGlenn)
[20:57:49] <grrrit-wm>	 (PS4) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998
[21:05:57] <grrrit-wm>	 (PS1) Rush: labstore svc addresses to separate mounts [dns] - https://gerrit.wikimedia.org/r/282000 (https://phabricator.wikimedia.org/T131541)
[21:08:25] <grrrit-wm>	 (CR) Rush: [C: 2] labstore svc addresses to separate mounts [dns] - https://gerrit.wikimedia.org/r/282000 (https://phabricator.wikimedia.org/T131541) (owner: Rush)
[21:11:50] <grrrit-wm>	 (PS2) Dereckson: Set wgSemiprotectedRestrictionLevels for en.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281967 (https://phabricator.wikimedia.org/T131976)
[21:12:33] <grrrit-wm>	 (PS2) Gehel: Activate SSL and connection pooling for CirrusSearch. [mediawiki-config] - https://gerrit.wikimedia.org/r/281882 (https://phabricator.wikimedia.org/T131839)
[21:14:21] <wikibugs>	 Operations, Discovery, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2184839 (EBernhardson)
[21:15:43] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Activate SSL and connection pooling for CirrusSearch. [mediawiki-config] - https://gerrit.wikimedia.org/r/281882 (https://phabricator.wikimedia.org/T131839) (owner: Gehel)
[21:16:52] <grrrit-wm>	 (PS5) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998
[21:16:57] <matanya>	 csteipp: poke
[21:20:38] <wikibugs>	 Operations, Discovery, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2184858 (EBernhardson) SATA is great, well not great but the disk requirements here make SSD's a bit untenable.  2x6 isn't a strict requirement, we figure...
[21:21:16] <csteipp>	 matanya: What can I do for you?
[21:21:47] <grrrit-wm>	 (PS1) Dereckson: Add mergehistory right to eliminator group on ja.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/282055 (https://phabricator.wikimedia.org/T131751)
[21:22:02] <wikibugs>	 Operations, Labs, Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2184867 (chasemp) Thank you faidon, that is indeed the story.  I put in a specific allowance for the labs-hosts VLAN in question...
[21:23:26] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:23:46] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:29:27] <icinga-wm>	 PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 186523 MB (3% inode=99%)
[21:31:44] <grrrit-wm>	 (CR) EBernhardson: "code looks sane, but i need to look into the unit test failures to see what's happening. It might just be that it's not pulling in the rig" [mediawiki-config] - https://gerrit.wikimedia.org/r/281882 (https://phabricator.wikimedia.org/T131839) (owner: Gehel)
[21:35:19] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Multimedia, Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#2184918 (hashar) OpenStack enquired about imagemagick on Trusty requiring ffmpeg. But ffmpeg go...
[21:36:36] <gwicke>	 urandom: 2004 is tight
[21:36:57] <urandom>	 gwicke: i have a script
[21:37:06] <urandom>	 it should cull compactions past 93%
[21:37:28] <urandom>	 yeesh, and it has been...
[21:38:14] <urandom>	 hrmm, or at least it should have been
[21:38:37] <icinga-wm>	 RECOVERY - Disk space on restbase2004 is OK: DISK OK
[21:39:28] <gwicke>	 thanks!
[21:39:35] <urandom>	 gwicke: it may not make it either way
[21:40:18] <urandom>	 i show it needing another ~670G, and killing the big compaction just now brought it just north of 400G
[21:40:32] <urandom>	 we might have to hang tight until the new hardware arrives
[21:42:46] <grrrit-wm>	 (PS1) Rush: nslcd specifying shell override [puppet] - https://gerrit.wikimedia.org/r/282060 (https://phabricator.wikimedia.org/T131541)
[21:42:59] <wikibugs>	 Operations, Continuous-Integration-Scaling, HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#1997890 (greg) One thought from Alex in the SoS was creating a trusty nodepool image for these tests (composer) to unblock us (Re...
[21:46:25] <grrrit-wm>	 (CR) Ottomata: [C: 1] "+1, one thought:" [puppet] - https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: Ppchelko)
[21:47:58] <grrrit-wm>	 (PS2) Rush: nslcd specifying shell override [puppet] - https://gerrit.wikimedia.org/r/282060 (https://phabricator.wikimedia.org/T131541)
[21:53:36] <wikibugs>	 Operations, Continuous-Integration-Scaling, HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#2185010 (hashar) Potentially we could generate an image based on Trusty then I would rather switch all of CI to run solely on Deb...
[21:54:03] <grrrit-wm>	 (CR) Yuvipanda: [C: 1] nslcd specifying shell override [puppet] - https://gerrit.wikimedia.org/r/282060 (https://phabricator.wikimedia.org/T131541) (owner: Rush)
[22:05:30] <grrrit-wm>	 (PS3) ArielGlenn: dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997
[22:07:11] <gwicke>	 urandom: too bad that we can't throw brotli at it yet
[22:11:44] <grrrit-wm>	 (PS7) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998
[22:11:46] <grrrit-wm>	 (PS1) Yuvipanda: docker: Don't setup credentials on all hosts [puppet] - https://gerrit.wikimedia.org/r/282071
[22:12:02] <grrrit-wm>	 (PS1) Rush: toollabs bastions install cgroup-bin [puppet] - https://gerrit.wikimedia.org/r/282072 (https://phabricator.wikimedia.org/T131541)
[22:12:07] <grrrit-wm>	 (CR) Hoo man: [C: 2 V: 2] Clarifying i18n parameters [dumps/dcat] - https://gerrit.wikimedia.org/r/277955 (owner: Lokal Profil)
[22:12:59] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] docker: Don't setup credentials on all hosts [puppet] - https://gerrit.wikimedia.org/r/282071 (owner: Yuvipanda)
[22:13:21] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] toollabs bastions install cgroup-bin [puppet] - https://gerrit.wikimedia.org/r/282072 (https://phabricator.wikimedia.org/T131541) (owner: Rush)
[22:20:07] <grrrit-wm>	 (PS2) Rush: toollabs bastions install cgroup-bin [puppet] - https://gerrit.wikimedia.org/r/282072 (https://phabricator.wikimedia.org/T131541)
[22:26:53] <wikibugs>	 Operations, Discovery, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2185144 (RobH) Well, the only potential spare systems would be our recently reclaimed restbase1001-1006, but they would need a memory upgrade, plus the pur...
[22:39:49] <grrrit-wm>	 (PS4) ArielGlenn: dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997
[22:41:07] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 (owner: ArielGlenn)
[22:45:57] <grrrit-wm>	 (PS8) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998
[23:00:04] <jouncebot>	 RoanKattouw ostriches Krenair MaxSem Dereckson fjalapeno: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160406T2300).
[23:00:04] <jouncebot>	 fjalapeno matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:20] * MaxSem is busy
[23:00:59] <Krenair>	 I'm available
[23:01:08] <Krenair>	 Unless you want to, matt_flaschen ..?
[23:01:21] <grrrit-wm>	 (PS1) ArielGlenn: enable dumps cron run on snapshot1006 and 1007 [puppet] - https://gerrit.wikimedia.org/r/282078
[23:01:25] <matt_flaschen>	 Krenair, I can do it.
[23:01:33] <coreyfloyd>	 I am available
[23:01:55] <coreyfloyd>	 (Fjalapeno)
[23:03:00] <grrrit-wm>	 (PS2) ArielGlenn: enable dumps cron run on snapshot1006 and 1007 [puppet] - https://gerrit.wikimedia.org/r/282078
[23:03:01] <matt_flaschen>	 That's a cool feature, BTW.
[23:04:23] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] enable dumps cron run on snapshot1006 and 1007 [puppet] - https://gerrit.wikimedia.org/r/282078 (owner: ArielGlenn)
[23:04:40] <grrrit-wm>	 (CR) Mattflaschen: [C: 2] Support handoff and credential sharing with the iOS app [mediawiki-config] - https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: Fjalapeno)
[23:05:09] <grrrit-wm>	 (Merged) jenkins-bot: Support handoff and credential sharing with the iOS app [mediawiki-config] - https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: Fjalapeno)
[23:05:15] <icinga-wm>	 PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 607
[23:11:27] <logmsgbot>	 !log mattflaschen@tin Synchronized docroot/wikipedia.org/apple-app-site-association: Support handoff and credential sharing with the iOS app (duration: 00m 34s)
[23:11:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:12:57] <matt_flaschen>	 coreyfloyd, done, but I don't see it yet.  I guess we have to wait for Varnish to clear.  Test when you can.
[23:13:17] <coreyfloyd>	 matt_flaschen: I see the same thing. Will keep an eye out
[23:13:27] <coreyfloyd>	 matt_flaschen: thanks
[23:15:15] <icinga-wm>	 RECOVERY - check_mysql on lutetium is OK: Uptime: 1760330 Threads: 2 Questions: 16453277 Slow queries: 10361 Opens: 109216 Flush tables: 2 Open tables: 64 Queries per second avg: 9.346 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[23:16:05] <matt_flaschen>	 coreyfloyd, if you access it with a query string, it shows the new file, though.  So it's definitely Varnish: https://en.wikipedia.org/apple-app-site-association?something
[23:18:44] <Krenair>	 you can purge it from varnish
[23:21:24] <coreyfloyd>	 matt_flaschen: yep I see it.
[23:21:43] <coreyfloyd>	 Krenair: how long does it take normally. I'm not in a rush.
[23:22:09] <Krenair>	 I don't remember
[23:22:46] <matt_flaschen>	 Some things take 5 minutes, I don't know how long that takes.
[23:25:46] <coreyfloyd>	 Ok I'm patient.
[23:30:19] <Dereckson>	 matt_flaschen: try echo 'https://en.wikipedia.org/apple-app-site-association' | mwscript purgeList.php
[23:34:07] <matt_flaschen>	 Thanks, Dereckson.  I ran that, which force-purged it on English Wikipedia.  That won't affect any other language subdomains, though.  coreyfloyd, you could put together the other domains it's intended to work with and force-purge them, or wait for it to auto-expire (but no one knows how long that takes)
[23:35:41] <Dereckson>	 matt_flaschen: the /static folder is served from en.wikip by Varnish so that helps, but here the trick is it's outside /static.
[23:36:41] <Dereckson>	 I wonder if it wouldn't be best to move the file to /static and redirect /apple... to /static/apple... if there is a regular need to update this file.
[23:37:49] <coreyfloyd>	 matt_flaschen: en is enough for me to test on
[23:37:51] <Krenair>	 yeah there's some varnish magic that transparently sends everything static to enwiki
[23:39:19] <Dereckson>	 coreyfloyd: how stable is apple-app-site-association? Will you need to add new app identifiers in the future?
[23:40:38] <logmsgbot>	 !log mattflaschen@tin Synchronized php-1.27.0-wmf.20/extensions/Flow/includes/Data/Listener/NotificationListener.php: Fix new topic notifications (duration: 00m 37s)
[23:40:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:41:35] <coreyfloyd>	 Dereckson: pretty stable. We only need to change it to add additional services. Like this patch.
[23:42:02] <coreyfloyd>	 Dereckson: we are just implementing these services for the first time. After this though I don't see many more changes coming.
[23:43:01] <matt_flaschen>	 Works on MediaWiki.org
[23:44:42] <logmsgbot>	 !log mattflaschen@tin Synchronized php-1.27.0-wmf.19/extensions/Flow/includes/Data/Listener/NotificationListener.php: Fix new topic notifications (duration: 00m 29s)
[23:44:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:46:58] <Dereckson>	 coreyfloyd: okay
[23:54:07] <matt_flaschen>	 And bs.wikipedia.org.
[23:54:13] <matt_flaschen>	 SWAT complete
[23:57:01] <grrrit-wm>	 (PS9) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998
[23:57:28] <greg-g>	 thanks matt_flaschen