[00:00:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [00:00:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [00:00:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [00:00:30] Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182459 (BBlack) >>! In T121135#1910435, @Atsirlin wrote: > @Legoktm: Frankly speaking, for a small project like Wikivo... [00:01:37] Operations, Release-Engineering-Team: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2182462 (madhuvishy) [00:01:52] Operations, Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#2182475 (madhuvishy) [00:01:54] Operations, Release-Engineering-Team: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2182474 (madhuvishy) [00:02:46] Operations, Release-Engineering-Team: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2182462 (madhuvishy) [00:05:18] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [00:05:18] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [00:05:18] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [00:09:52] (PS2) BryanDavis: Moving elasticsearch::https instatiation to elasticsearch role [puppet] - https://gerrit.wikimedia.org/r/281824 (https://phabricator.wikimedia.org/T131906) (owner: Gehel) [00:10:17] PROBLEM -
check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [00:10:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [00:10:18] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [00:12:10] is anybody looking into payments ^^^? [00:14:08] (PS2) RobH: stat1004 has 4 disks [puppet] - https://gerrit.wikimedia.org/r/281858 [00:14:22] MaxSem: I would absolutely love some help with that. [00:14:25] codfw payments is not primary [00:14:38] so it didnt genreate pages [00:14:41] payments-Redis needs to be kicked, though I'd like to know why. [00:14:55] ah, this was codfw? looking [00:15:06] thats what the alerts are for, payments 2XXX [00:15:10] which is codfw [00:15:17] RECOVERY - check_puppetrun on payments2002 is OK: OK: Puppet is currently enabled, last run 91 seconds ago with 0 failures [00:15:31] heh, of course now that we discuss, it clears? [00:15:40] (CR) Mattflaschen: [C: 1] "Same as schema from I5c1f648cc63ed317508febaece955ec68f640ba3 , which I just +2'ed. If that merges (no Jenkins failures), I'll deploy thi" [mediawiki-config] - https://gerrit.wikimedia.org/r/281640 (https://phabricator.wikimedia.org/T112799) (owner: Matthias Mullie) [00:15:44] still 2003 issue [00:15:47] (same one) [00:15:50] I believe it has online replicas of donor data, so sort of matters [00:16:12] (CR) RobH: [C: 2] stat1004 has 4 disks [puppet] - https://gerrit.wikimedia.org/r/281858 (owner: RobH) [00:20:40] Luke081515, it's still stuck on 3 remaining hosts... [00:21:16] mw2043, mw2177 and mw1184 [00:21:27] I think [00:22:41] seems like I get my data from one of them...
[00:24:47] Luke081515, I think there might be something extra is has to do after this sync [00:25:13] hm, ok [00:32:09] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [00:33:58] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [00:34:08] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [00:34:33] robh: I merged your change [00:35:57] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [00:36:00] mhm mwdeploy 29365 0.0 0.0 105764 2076 ? S< Apr05 0:00 sshd: mwdeploy@notty [00:36:07] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [00:36:34] Krenair: So my note: A backport is better, but overwrite the data by creating local sysmessages is faster ;) [00:45:23] bd808, hey [00:45:27] so I was stuck in scap [00:45:41] I found that pressing enter made it move to the next host... [00:46:01] it sat at sync-common: 99% (ok: 424; fail: 0; left: 3) for ages [00:46:05] then I pressed enter [00:46:15] suddenly, sync-common: 99% (ok: 425; fail: 0; left: 2) [00:46:17] and so on until 0 [00:48:43] hmm [00:48:52] you found hidden magic? [00:49:39] the trick I've used for stuck hosts before is to open another ssh session to tin and kill the outbound ssh connections to those hosts [00:49:50] Krenair: My browser shows the right messages now [00:50:04] and then you can ssh directly to the hosts and run sync-common [00:50:51] pressing enter almost certainly had nothing to do with it [00:51:11] https://en.wikipedia.org/wiki/Placebo_button [00:51:17] it could be something stuck waiting for stdin somewhere [00:51:26] has happened to me before in other places [00:51:42] that was my assumption [00:52:00] hmm.. like unknown host keys? [00:52:14] were there servers reimaged recently? 
[00:52:28] PROBLEM - Apache HTTP on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [00:52:38] subprocess.Popen's stdin argument defaults to None, so spawned processes do not inherit the parent process's stdin by default. [00:52:52] bd808, mw1184 has been up 20 days [00:52:59] so not that recently, people have run scap in that time [00:53:09] PROBLEM - HHVM rendering on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time [00:53:21] and as ori proposes the ssh connections aren't attached to your terminal (or shouldn't be) [00:53:39] ori, ^ I've seen that (icinga alert for mw1119) happen a couple of times now for other hosts [00:53:51] do you want to look into it or shall I just restart hhvm? [00:55:36] !log restarted hhvm on mw1119, stuck [00:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:56:05] actually, I'm wrong [00:56:08] scap-rebuild-cdbs: 99% (ok: 437; fail: 0; left: 1) [00:56:09] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 5.830 second response time [00:56:18] " With the default settings of None, no redirection will occur; the child’s file handles will be inherited from the parent." [00:56:30] (from https://docs.python.org/2/library/subprocess.html) [00:56:39] so it's possible pressing enter really did help, in fact [00:56:58] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 68780 bytes in 0.772 second response time [00:57:04] it should not; scap's subprocess calls should pass stdin=subprocess.PIPE [00:57:48] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:57:51] wouldn't we then just get completely stuck if it prompts for input? 
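The stdin question raised above can be sketched in Python. This is an illustrative sketch only, not scap's actual code: the child command and both helper names are hypothetical, and `subprocess.DEVNULL` and `subprocess.run` assume Python 3.5 or later (the docs quoted in the chat are for Python 2, which lacks them). Per the quoted documentation, leaving `stdin=None` lets the child inherit the parent's stdin, so a child that reads stdin could block on the operator's terminal. Giving the child an explicit stdin that hits EOF at once avoids that.

```python
import subprocess
import sys

# Hypothetical child: reads all of stdin and prints how many characters it saw.
CHILD = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]


def run_with_devnull():
    """stdin=DEVNULL: the child reads EOF immediately instead of the terminal."""
    result = subprocess.run(
        CHILD,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        timeout=30,
    )
    return result.stdout.decode().strip()


def run_with_closed_pipe():
    """stdin=PIPE: communicate() sends the (empty) input and then closes the pipe."""
    proc = subprocess.Popen(CHILD, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(input=b"", timeout=30)
    return out.decode().strip()


if __name__ == "__main__":
    print(run_with_devnull())
    print(run_with_closed_pipe())
```

With `stdin=subprocess.PIPE`, a pipe that is never written to or closed would leave a prompting child waiting indefinitely, which is the concern raised in the last message above. Closing the pipe, or using `DEVNULL`, makes the read return EOF instead.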
[00:58:55] Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182577 (Jdlrobson) >>! In T121135#2182349, @Wrh2 wrote: > Cache is cleared fairly regularly even if articles aren't ed... [00:59:27] Krenair: could you e-mail ops@ about it or file a task? I could depool it, but I am not able to take the time to debug it further, and I worry that it would just remain depooled until someone comes across it. [00:59:44] mw1119? MaxSem already restarted hhvm there [01:00:10] it was the first thing in the chat after I asked you about it [01:00:41] Krenair: is it still stuck? [01:00:44] not great to just restart hhvm [01:00:48] probably not [01:00:55] icinga said it recovered [01:00:58] better to capture a trace or just depool it and leave it [01:01:05] I agree [01:01:16] ori, neither of which us mortals can do... [01:01:35] Krenair: the process that is still running is connected to mw1119.eqiad.wmnet [01:01:51] I'll kill that process then [01:01:51] "/usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -F/dev/null -oUser=mwdeploy mw1119.eqiad.wmnet sudo -u mwdeploy -n -- /usr/bin/scap-rebuild-cdbs" [01:02:03] MaxSem: pybal would have depooled it for failing health checks [01:02:23] so you are not actually helping anything by restarting HHVM; it is not continuing to receive requests [01:02:31] it is just destroying program state [01:02:31] !log krenair@tin Finished scap: https://gerrit.wikimedia.org/r/#/c/281846/ - add messages for the new extendedconfirmed protection (duration: 95m 03s) [01:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:02:58] ori, it resumes receiving requests seconds after a restart, as evidenced by load [01:03:15] pybal doesn't automatically repool? [01:03:18] yes -- so?
[01:03:23] yes, it does [01:03:48] (if a server is depooled for failing health checks -- it does not repool a manually depooled server) [01:04:06] Luke081515, ^ [01:04:08] so where does that load on it come from? :P [01:04:28] from pybal re-pooling it, since it passed health checks after you restarted it [01:04:39] ok [01:05:10] err so you are not actually helping anything by restarting HHVM; it is not continuing to receive requests [01:05:13] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.71 seconds [01:05:33] we have plenty of spare capacity on the app server cluster, by design [01:06:16] is it worth me filing a task this time? or waiting for the next? [01:06:27] the fact that the machine had been depooled is not in itself a problem [01:06:32] Krenair: probably not worth it, no [01:06:34] MaxSem: parsing error I think - I took ori's sentence to mean 'if it is down, it is not actually receiving requests, so there is no actual issue' [01:06:35] k [01:06:43] *user facing issue [01:07:02] yes this is probably a misunderstanding [01:07:02] YuviPanda: right -- I see how that could have been confusing [01:08:20] (not worth it because it is unlikely that anyone would take the time to investigate it, given that there exists a backlog of similar issues with more debug data. not because it wouldn't be useful to know.) [01:08:38] eep, it's late here. o/ [01:09:16] it is late everywhere! [01:09:23] * YuviPanda disappears too [01:18:54] Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182584 (Wrh2) If the template was at fault the behavior should be consistent - currently if a page is edited or flushe... [01:25:58] Luke081515, bah...
after than very long scap, someone has already overridden one of the messages locally [01:26:13] lol [01:26:16] hm, ok [01:26:31] Kreniar: But thanks for backporting [01:27:52] *Krenair [01:27:59] * Luke081515 hates typos in nicknames [01:28:18] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0] [01:29:05] Operations, Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2182588 (BBlack) [01:29:17] Operations, Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2182601 (BBlack) p:Triage>Normal [01:45:17] Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182613 (Jdlrobson) >>! In T121135#2182584, @Wrh2 wrote: > If the template was at fault the behavior should be consiste... [01:49:03] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.26 seconds [01:50:13] PROBLEM - MariaDB Slave Lag: x1 on db1031 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.90 seconds [01:50:17] PROBLEM - MariaDB Slave Lag: x1 on db2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.48 seconds [01:51:27] PROBLEM - MariaDB Slave Lag: x1 on db2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 397.01 seconds [01:51:32] Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182614 (Jdlrobson) My current theory is that under some circumstances the banner is generated before the table of cont...
[01:54:02] RECOVERY - MariaDB Slave Lag: x1 on db1031 is OK: OK slave_sql_lag Replication lag: 0.42 seconds [01:54:02] RECOVERY - MariaDB Slave Lag: x1 on db2008 is OK: OK slave_sql_lag Replication lag: 0.03 seconds [01:54:59] RECOVERY - MariaDB Slave Lag: x1 on db2009 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [02:10:08] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=550 [critical =500] [02:15:08] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=552 [critical =500] [02:23:08] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [02:25:08] RECOVERY - check_missing_thank_yous on db1025 is OK: OK missing_thank_yous=0 [02:31:31] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.19) (duration: 11m 44s) [02:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:56:51] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 09m 43s) [02:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:19] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Apr 6 03:06:18 UTC 2016 (duration 9m 27s) [03:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:08:19] Operations: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2182660 (RobH) [03:19:58] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [04:06:18] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [04:15:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:15:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:15:08] PROBLEM
- check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:20:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:20:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:20:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:25:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:25:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:25:09] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:30:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:30:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:30:17] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:35:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:35:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:35:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:40:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:40:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:40:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:45:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can 
not connect to 127.0.0.1 on port 6379 [04:45:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:45:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:50:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:50:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:50:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:55:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:55:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [04:55:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:00:07] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:00:07] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:00:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:05:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:05:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:05:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:10:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:10:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:10:08] PROBLEM - 
check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:15:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:15:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:15:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:20:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:20:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:20:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:25:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:25:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:25:09] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:30:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:30:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:30:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:35:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:35:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:35:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:40:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can 
not connect to 127.0.0.1 on port 6379 [05:40:09] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:40:09] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:42:08] <_joe_> uh what's this? [05:42:35] <_joe_> uhm codfw [05:45:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:45:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:45:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:50:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:50:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:50:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:55:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:55:08] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [05:55:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [06:00:08] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [06:00:08] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [06:00:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [06:01:56] <_joe_> I paged jeff, I'm disabling notifications for those services [06:02:09] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: puppet fail [06:07:38] <_joe_> !log restarting HHVM 
on mw1134, deadlock in what appears to be HPHP::Treadmill::getAgeOldestRequest [06:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:09:49] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.147 second response time [06:09:59] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 68824 bytes in 0.839 second response time [06:10:34] awight and robh looked at it when it happened earlier, but it fixed itself [06:11:09] I missed the diagnosis of the last outage, unfortunately [06:12:48] It's only medium-priority, this is a failover service and replica. [06:15:11] <_joe_> awight: I disabled notifications for those redis services [06:15:33] _joe_: Perfectly good workaround for now, thanks for doing so! [06:16:06] <_joe_> wasn't really an effort :P [06:20:37] Jeff_Green is spot on about provisioning either Kafka or Redis, but not both for our payments queue overhaul... For the saved trouble of keeping a service up, I'm happy to slightly abuse mysql and use it to simulate redis storage types. [06:20:45] _joe_: Thanks for the idea to look at Kafka! 
[06:23:34] <_joe_> yw :) [06:27:39] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: puppet fail [06:29:57] (PS1) Gehel: WIP [puppet] - https://gerrit.wikimedia.org/r/281881 [06:30:19] (PS1) Gehel: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/281882 [06:30:27] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: puppet fail [06:31:03] (CR) jenkins-bot: [V: -1] WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/281882 (owner: Gehel) [06:31:08] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:31:29] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:30] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:57] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:38] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:58] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:58] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:53:08] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:38] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:39] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:28] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:08] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:58:29] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:58:48] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is
currently enabled, last run 1 minute ago with 0 failures [06:59:49] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:01:27] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:16:07] (CR) Muehlenhoff: [C: 2 V: 2] Ignore packages in deinstalled status [debs/debdeploy] - https://gerrit.wikimedia.org/r/281669 (owner: Muehlenhoff) [07:20:38] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:30:55] (PS1) Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - https://gerrit.wikimedia.org/r/281885 [08:22:21] (PS3) Gehel: Moving elasticsearch::https instatiation to elasticsearch role [puppet] - https://gerrit.wikimedia.org/r/281824 (https://phabricator.wikimedia.org/T131906) [08:23:40] (CR) Giuseppe Lavagetto: [C: 2] "DTRT: https://puppet-compiler.wmflabs.org/2314/mw1220.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/281720 (owner: Giuseppe Lavagetto) [08:23:53] (PS2) Giuseppe Lavagetto: hhvm: watch extension packages from the service [puppet] - https://gerrit.wikimedia.org/r/281720 [08:24:51] <_joe_> come on jenkinsss [08:26:02] (CR) Gehel: [C: 2] Moving elasticsearch::https instatiation to elasticsearch role [puppet] - https://gerrit.wikimedia.org/r/281824 (https://phabricator.wikimedia.org/T131906) (owner: Gehel) [08:27:52] (PS4) Gehel: Moving elasticsearch::https instatiation to elasticsearch role [puppet] - https://gerrit.wikimedia.org/r/281824 (https://phabricator.wikimedia.org/T131906) [08:28:12] <_joe_> !log disabling puppet on the mw servers to test hhvm changes [08:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:30:25] (CR) Hashar: "Well done \O/" [puppet] - https://gerrit.wikimedia.org/r/281706 (owner: Madhuvishy) [08:33:50]
(PS2) Giuseppe Lavagetto: hhvm: parametrize directories [puppet] - https://gerrit.wikimedia.org/r/281721 [08:45:23] (PS3) Giuseppe Lavagetto: hhvm: parametrize directories [puppet] - https://gerrit.wikimedia.org/r/281721 [08:47:14] (CR) Giuseppe Lavagetto: [C: 2] hhvm: parametrize directories [puppet] - https://gerrit.wikimedia.org/r/281721 (owner: Giuseppe Lavagetto) [08:50:19] (PS1) Giuseppe Lavagetto: hhvm: fixup for I5e9403c2 [puppet] - https://gerrit.wikimedia.org/r/281892 [08:50:55] (CR) Giuseppe Lavagetto: [C: 2 V: 2] hhvm: fixup for I5e9403c2 [puppet] - https://gerrit.wikimedia.org/r/281892 (owner: Giuseppe Lavagetto) [08:51:48] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: puppet fail [08:51:59] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: puppet fail [08:52:39] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: puppet fail [08:52:39] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: puppet fail [08:53:48] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: puppet fail [08:53:48] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [08:55:27] <_joe_> these were all mine ^^ [08:59:08] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:01:14] (PS2) Giuseppe Lavagetto: hhvm: s/fcgi.ini/server.ini/ [puppet] - https://gerrit.wikimedia.org/r/281722 [09:02:08] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail [09:04:52] (CR) Giuseppe Lavagetto: [C: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/281722 (owner: Giuseppe Lavagetto) [09:07:12] Operations, Discovery, Kartotherian, Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2182880 (Gehel) @RobH I'd really appreciate if you could let me do the reclaim /
reinstall so that I learn something in the process (thi... [09:10:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 643 [09:17:25] (PS2) Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - https://gerrit.wikimedia.org/r/281885 [09:17:46] (PS1) Muehlenhoff: Upgrade to 3.19.8-ckt18 [debs/linux] - https://gerrit.wikimedia.org/r/281899 [09:17:59] (PS1) Alex Monk: Send www.wikipedia.org/apple-app-site-association to the right file without an external redirect [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) [09:19:59] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:20:08] RECOVERY - check_mysql on lutetium is OK: Uptime: 1710230 Threads: 1 Questions: 15361401 Slow queries: 10263 Opens: 106158 Flush tables: 2 Open tables: 64 Queries per second avg: 8.982 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:21:09] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:21:57] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:22:02] (CR) Muehlenhoff: hhvm: add systemd/jessie support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/281885 (owner: Giuseppe Lavagetto) [09:23:08] (CR) Muehlenhoff: [C: 2 V: 2] Upgrade to 3.19.8-ckt18 [debs/linux] - https://gerrit.wikimedia.org/r/281899 (owner: Muehlenhoff) [09:23:48] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:26:48] (CR) Giuseppe Lavagetto: hhvm: add systemd/jessie support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/281885 (owner: Giuseppe Lavagetto) [09:29:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL:
51.72% of data above the critical threshold [5000000.0] [09:32:13] (CR) Alex Monk: [C: -1] "have been trying to test this on beta but no luck so far" [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [09:35:37] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:07] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:18] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:18] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:49] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [09:38:58] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [09:38:58] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [09:39:39] PROBLEM - HHVM rendering on mw2086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:40:39] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [09:40:49] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [09:41:19] RECOVERY - HHVM rendering on mw2086 is OK: HTTP OK: HTTP/1.1 200 OK - 68763 bytes in 0.285 second response time [09:43:07] small outage? [09:45:11] <_joe_> elukey: wat? [09:45:12] (PS2) Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) [09:45:20] <_joe_> elukey: why you say that? [09:46:52] _joe_ I was asking as "is this normal or is this an outage?" 
after reading citoid endpoints health on scb2002 is CRITICAL [09:48:44] <_joe_> elukey: nope it's typically an upstream problem for citoid [09:48:59] <_joe_> the test urls include a dependency on an external system [09:49:38] ahh okok will take a look to it, thanks :) [10:09:17] (PS3) Mschon: update the DNS record for benefactors.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) [10:19:37] Operations, Analytics: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183066 (elukey) So I checked the replication factor on the aqs nodes and this is the result: ``` cassandra@cqlsh> SELECT * FROM system.schema_keyspaces; keyspace_name | dura... [10:19:49] Operations, Analytics: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183067 (elukey) p:Triage>Normal [10:27:36] (PS1) ArielGlenn: small fixes for dumps cron job script [puppet] - https://gerrit.wikimedia.org/r/281913 [10:28:16] (PS1) Alexandros Kosiaris: otrs: Remove HTTPS ferm rule [puppet] - https://gerrit.wikimedia.org/r/281914 [10:28:22] <_joe_> T73486 ? [10:28:22] T73486: HHVM: segfault when serializing/unserializing large preprocessor cache items - https://phabricator.wikimedia.org/T73486 [10:28:35] <_joe_> oh, yes [10:29:08] (CR) ArielGlenn: [C: 2] small fixes for dumps cron job script [puppet] - https://gerrit.wikimedia.org/r/281913 (owner: ArielGlenn) [10:31:19] (PS2) Alexandros Kosiaris: otrs: Remove HTTPS ferm rule [puppet] - https://gerrit.wikimedia.org/r/281914 [10:31:26] (CR) Alexandros Kosiaris: [C: 2 V: 2] otrs: Remove HTTPS ferm rule [puppet] - https://gerrit.wikimedia.org/r/281914 (owner: Alexandros Kosiaris) [10:32:25] (CR) Alexandros Kosiaris: [C: -1] "Still fails. 
https://puppet-compiler.wmflabs.org/2319/mendelevium.eqiad.wmnet/change.mendelevium.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis) [10:37:37] RECOVERY - DPKG on etherpad1001 is OK: All packages OK [10:38:38] (PS3) Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - https://gerrit.wikimedia.org/r/281885 [10:38:45] <_joe_> moritzm: ^^ [10:38:51] <_joe_> (whenever you have time) [10:39:00] thanks, will have a look in a bit [10:49:22] (CR) Alexandros Kosiaris: [C: 1] Use local resources in codfw for parsoid, url-downloader and mathoid [mediawiki-config] - https://gerrit.wikimedia.org/r/279355 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto) [10:51:27] (CR) Alexandros Kosiaris: "Since this is making excellent sense to be in the cxserver repo config, we should move it over there. I see https://phabricator.wikimedia." [puppet] - https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: KartikMistry) [10:57:22] akosiaris: hey, it would be great if you give an eta for when you can check these puppet patches [10:57:33] (or the beta setup) [10:57:57] no rush at all, I'm too excited [11:08:22] Amir1: I will be looking into them today, not sure though when they 'll be merged. Plan however is for this week to try and get ORES deployed in production [11:08:43] \o/ [11:09:14] thanks akosiaris, tell me if you need anything from me. I think I need to explain lots of these patches and choices I've made [11:10:38] (CR) Alexandros Kosiaris: [C: 1] "noop for chromium,alsafi, effectively noop for carbon. 
https://puppet-compiler.wmflabs.org/2320/" [puppet] - https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) (owner: Filippo Giunchedi) [11:16:37] (CR) Ema: [C: 1] installserver: port squid3 changes for trusty/jessie [puppet] - https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) (owner: Filippo Giunchedi) [11:21:30] (CR) DCausse: [C: 1] Bump CirrusSearchRequestSet rev to 121456865906 [mediawiki-config] - https://gerrit.wikimedia.org/r/280448 (https://phabricator.wikimedia.org/T128533) (owner: DCausse) [11:26:11] (PS1) Ema: Allow ganglia user to read VSM files [puppet] - https://gerrit.wikimedia.org/r/281918 [11:26:26] (PS2) Filippo Giunchedi: installserver: port squid3 changes for trusty/jessie [puppet] - https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) [11:26:34] (CR) Filippo Giunchedi: [C: 2 V: 2] installserver: port squid3 changes for trusty/jessie [puppet] - https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) (owner: Filippo Giunchedi) [11:27:26] (CR) jenkins-bot: [V: -1] Allow ganglia user to read VSM files [puppet] - https://gerrit.wikimedia.org/r/281918 (owner: Ema) [11:27:40] (CR) Filippo Giunchedi: "thanks! another way I tried was "include /etc/squid3/conf.d/*.conf" but alas squid refuses to start if the wildcard matches no files" [puppet] - https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) (owner: Filippo Giunchedi) [11:32:57] godog: u around ? [11:33:18] can you please +1 https://phabricator.wikimedia.org/T131895 ? [11:35:48] (CR) Filippo Giunchedi: [C: 1] "LGTM!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/280831 (https://phabricator.wikimedia.org/T131895) (owner: Matanya) [11:35:51] matanya: for sure, {{done}} [11:35:59] thanks much godog [11:36:42] (CR) KartikMistry: "I realized that T122498 is not straightforward to fix, but yes - that's the direction we need to go. Until, that is done, can this be merg" [puppet] - https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: KartikMistry) [11:36:59] (CR) DCausse: Actiavte SSL + connection pooling for CirrusSearch on PROD (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) (owner: Gehel) [11:37:35] apergos: reminder for https://phabricator.wikimedia.org/T127793 as you told :) [11:38:23] apergos: if possible, can you add estimate time once we start the work? That will be helpful for setting priority for team (third party awaits, so we can tell them). [11:40:14] Operations, Labs, Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2183182 (faidon) p:Normal>High Hey — puppet hasn't been running properly on labnet1002 with the above failure for almost a... [11:46:04] (CR) Alexandros Kosiaris: "Not straightforward to fix ? How come ? Care to share more info ?" 
[puppet] - https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: KartikMistry) [11:48:13] (CR) Alexandros Kosiaris: Allow ganglia user to read VSM files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281918 (owner: Ema) [11:49:11] (PS1) Mschon: puppet-lint runs through now, changed scope of variable from $hostname to $::hostname [puppet] - https://gerrit.wikimedia.org/r/281922 [11:51:07] (CR) Alexandros Kosiaris: [C: 2] puppet-lint runs through now, changed scope of variable from $hostname to $::hostname [puppet] - https://gerrit.wikimedia.org/r/281922 (owner: Mschon) [11:51:12] (PS2) Alexandros Kosiaris: puppet-lint runs through now, changed scope of variable from $hostname to $::hostname [puppet] - https://gerrit.wikimedia.org/r/281922 (owner: Mschon) [11:51:17] (CR) Alexandros Kosiaris: [V: 2] puppet-lint runs through now, changed scope of variable from $hostname to $::hostname [puppet] - https://gerrit.wikimedia.org/r/281922 (owner: Mschon) [11:58:33] Operations, Analytics: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183195 (elukey) Executed the command and started nodetool repair on aqs1002. [11:59:26] (CR) Ema: [C: -1] Allow ganglia user to read VSM files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281918 (owner: Ema) [12:09:09] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:09:18] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[12:11:35] (CR) Ema: "RxURL is used to match VSL log entries with transactions:" [puppet] - https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) (owner: BBlack) [12:19:17] (CR) Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) (owner: Gehel) [12:19:48] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:21:17] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:21:18] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:21:39] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:24] (PS3) Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) [12:30:23] Operations: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2183228 (Cmjohnson) [12:30:26] Operations, ops-eqiad: update labels and visible label field for stat1004/WMF4721 - https://phabricator.wikimedia.org/T131902#2183226 (Cmjohnson) Open>Resolved done [12:32:55] Operations, ops-eqiad: stat1002 broken disk causing degraded RAID array - https://phabricator.wikimedia.org/T131758#2183231 (Cmjohnson) Open>Resolved Disk has been replaced and back online [12:35:04] ---^ \o/ thanks [12:37:16] Operations, media-storage, Tracking: [tracking] refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2183250 (fgiunchedi) one of the questions for the next order is 3TB vs 4TB disks, the last order of 3x eqiad and 6x codfw {T114500} and related was for 4TB. to gauge the im... 
[12:47:07] PROBLEM - Host snapshot1006 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:57] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:51:27] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:53:18] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:56:37] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:01:48] (PS4) Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) [13:03:33] (CR) Muehlenhoff: [C: 1] "One enhancement proposal, but otherwise looks good to me." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281885 (owner: Giuseppe Lavagetto) [13:08:32] RECOVERY - Host snapshot1006 is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [13:10:34] <_joe_> moritzm: heh fair enough [13:10:47] (CR) Giuseppe Lavagetto: hhvm: add systemd/jessie support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281885 (owner: Giuseppe Lavagetto) [13:10:57] (Abandoned) Hashar: Increase default thumbnail display size from 220px to 300px [mediawiki-config] - https://gerrit.wikimedia.org/r/154408 (https://bugzilla.wikimedia.org/67709) (owner: Jforrester) [13:11:27] (PS4) Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - https://gerrit.wikimedia.org/r/281885 [13:11:31] (Abandoned) Hashar: [WIP] Make VisualEditor access RESTbase directly on private wikis too [mediawiki-config] - https://gerrit.wikimedia.org/r/200107 (owner: Jforrester) [13:15:30] (CR) Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) (owner: Gehel) [13:16:11] Could anyone have a look at ^ https://gerrit.wikimedia.org/r/#/c/281881/ ? 
I wrote some pretty ugly code and I'm sure there is a better way, but my brain seems frozen... [13:16:18] Comments inline [13:17:02] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:19:22] (CR) Giuseppe Lavagetto: "Yeah I would like to understand why it is so hard to move this file to the code repository. I removed my -2 because this is not as ugly an" [puppet] - https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: KartikMistry) [13:22:36] (CR) Fjalapeno: "@Krenair are you still seeing the redirect? Or are you just unable to test?" [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [13:23:12] PROBLEM - Hadoop NodeManager on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:23:22] PROBLEM - salt-minion processes on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:23:23] PROBLEM - DPKG on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:23:23] PROBLEM - Check size of conntrack table on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:23:45] (CR) Alex Monk: "It's live on beta but I still get a redirect" [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [13:23:51] PROBLEM - RAID on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:23:52] PROBLEM - Disk space on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:11] PROBLEM - configured eth on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:21] PROBLEM - YARN NodeManager Node-State on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:22] PROBLEM - Disk space on Hadoop worker on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:24:32] PROBLEM - dhclient process on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:43] PROBLEM - Hadoop DataNode on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:52] PROBLEM - puppet last run on analytics1051 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:27:36] Operations, Availability, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2183325 (fgiunchedi) replication for thumbs has finished: [[ https://graphite.wikimedia.org/render/?width=723... [13:27:43] PROBLEM - Host analytics1051 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:48] Operations, Availability, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2183329 (fgiunchedi) Open>Resolved [13:28:04] (CR) Muehlenhoff: [C: 1] hhvm: add systemd/jessie support [puppet] - https://gerrit.wikimedia.org/r/281885 (owner: Giuseppe Lavagetto) [13:28:25] (CR) Alex Monk: "well... I thought it was live, but puppet seems very broken and the line has gone missing" [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [13:30:51] RECOVERY - Check size of conntrack table on analytics1051 is OK: OK: nf_conntrack is 0 % full [13:30:51] RECOVERY - DPKG on analytics1051 is OK: All packages OK [13:31:00] Operations, MediaWiki-Interface, Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2183332 (BBlack) Open>Resolved I'm guessing by now they're all naturally expiring out anyways since there's no further feedback. 
[13:31:01] RECOVERY - Host analytics1051 is UP: PING OK - Packet loss = 0%, RTA = 1.62 ms [13:31:13] RECOVERY - RAID on analytics1051 is OK: OK: optimal, 13 logical, 14 physical [13:31:21] RECOVERY - Disk space on analytics1051 is OK: DISK OK [13:31:41] (CR) Fjalapeno: "Oh - ok - hmmm… thats odd. I'm also out to Brion on this - I CC'd him as well. He also knows the iOS app so may be able to help." [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [13:31:59] RECOVERY - configured eth on analytics1051 is OK: OK - interfaces up [13:32:21] RECOVERY - Disk space on Hadoop worker on analytics1051 is OK: DISK OK [13:32:24] analytics1051 was rebooted, mmm [13:32:26] (CR) Brion VIBBER: "Patch looks legit enough... Yeah double-check that it got fully deployed. :)" [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [13:32:33] (CR) Alex Monk: "I don't own any iOS/OS X devices and never have, I'm just fiddling around with redirects in apache" [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [13:32:38] RECOVERY - Hadoop DataNode on analytics1051 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [13:32:50] RECOVERY - Hadoop NodeManager on analytics1051 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [13:32:59] PROBLEM - NTP on analytics1051 is CRITICAL: NTP CRITICAL: Offset unknown [13:33:12] Operations, Traffic, Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2183337 (BBlack) p:Low>Normal We didn't end up keeping SPDY disabled, and HTTP/2 is coming. From our end, this is a relatively simple change now, but t... 
[13:33:30] RECOVERY - YARN NodeManager Node-State on analytics1051 is OK: OK: YARN NodeManager analytics1051.eqiad.wmnet:8041 Node-State: RUNNING [13:34:08] RECOVERY - dhclient process on analytics1051 is OK: PROCS OK: 0 processes with command name dhclient [13:34:19] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:34:38] RECOVERY - salt-minion processes on analytics1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:34:45] (PS5) Giuseppe Lavagetto: hhvm: add systemd/jessie support [puppet] - https://gerrit.wikimedia.org/r/281885 [13:35:57] Operations, Traffic: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#2183342 (BBlack) We actually dug further into related issues when investigating WDQS woes on cache_misc, and the problem is different than what we thought we understood i... [13:36:20] Operations, Traffic: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#2183343 (BBlack) [13:36:22] Operations, Traffic, Patch-For-Review, Varnish: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2183344 (BBlack) [13:38:08] RECOVERY - NTP on analytics1051 is OK: NTP OK: Offset -0.0004067420959 secs [13:38:34] (CR) Giuseppe Lavagetto: [C: 2] "Practically a noop on trusty" [puppet] - https://gerrit.wikimedia.org/r/281885 (owner: Giuseppe Lavagetto) [13:38:59] Operations, Traffic, Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2183349 (BBlack) [13:39:02] Operations, Performance-Team, Traffic, Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2183350 (BBlack) [13:39:26] <_joe_> akosiaris: I'm merging a change from you? 
[13:41:01] Operations, Citoid, Security-Team, Traffic, and 3 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#2183351 (BBlack) Open>Resolved a:BBlack When we moved various *oid to the text cluster as part of the parsoidcache decom, they got forced to... [13:41:06] <_joe_> seems innocent enough [13:41:07] _joe_: damn again.. yes sorry [13:41:08] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [13:41:20] <_joe_> yeah already merged [13:42:05] Operations, Traffic: Stop using LVS from varnishes - https://phabricator.wikimedia.org/T107956#2183359 (BBlack) Open>declined We've made decisions about this in the past already and moved past this idea. The general direction is to always use LVS for multi-host varnish backends, and solve HTTPS iss... [13:43:31] Operations: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928#2183368 (MoritzMuehlenhoff) [13:45:30] (PS1) Giuseppe Lavagetto: Revert "Make MediaWiki call the codfw restbase from all datacenters" [mediawiki-config] - https://gerrit.wikimedia.org/r/281935 [13:45:33] !log Upgrading cp1052 to jessie 8.4 point release and linux 4.4 (T131746, T131928) [13:45:34] T131746: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746 [13:45:35] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:45:40] (PS1) Giuseppe Lavagetto: Revert "cache::text: route traffic for restbase, citoid, cxserver to codfw" [puppet] - https://gerrit.wikimedia.org/r/281936 [13:45:48] <_joe_> godog: ^^ [13:46:00] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[13:46:47] (CR) Filippo Giunchedi: [C: 1] Revert "cache::text: route traffic for restbase, citoid, cxserver to codfw" [puppet] - https://gerrit.wikimedia.org/r/281936 (owner: Giuseppe Lavagetto) [13:46:49] _joe_: thanks! [13:47:11] <_joe_> godog: I have an interview in 10, but you can merge those while I'm away [13:48:59] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: puppet fail [13:50:01] <_joe_> that is just a transitional problem that shows how "good" puppet is [13:51:32] _joe_: ok, as for the order mediawiki first and then varnish, i.e. reversed from what we did yesterday? [13:52:20] !sal [13:52:21] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [13:52:48] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:53:34] <_joe_> godog: whatever you prefere [13:53:40] <_joe_> it makes no differences [13:54:29] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100% [13:54:54] <_joe_> ema: expected ? ^^ [13:54:59] RECOVERY - Host cp1052 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [13:55:18] _joe_: yep [13:55:29] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:56:22] _joe_: ok I'll do varnish first and mediawiki second then [13:56:25] urandom: ^ [13:56:36] +1 [13:56:53] (CR) Alex Monk: "I seem to have fixed puppet in deployment-prep by applying https://github.com/puppetlabs/puppet/commit/149b24542aa3ffaad2afef8daea05188750" [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [13:56:58] <_joe_> yeah I won't be around [13:57:05] <_joe_> unless it explodes somehow [13:58:49] PROBLEM - HTTPS on cp1052 is CRITICAL: Return code of 255 is out of bounds [13:59:26] (CR) Alex Monk: "It works if I comment out the RewriteRule below it though." 
[puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [13:59:38] cp1052 has an issue with nginx, the host is depooled though [14:00:03] (PS1) Ladsgroup: wikilabels: healthier uwsgi [puppet] - https://gerrit.wikimedia.org/r/281940 [14:00:49] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures [14:02:42] Operations, Labs, Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2183406 (chasemp) Here is what I believe is happening. Labnet1001 is the inactive node at the moment and has an IPv6 address:... [14:02:49] RECOVERY - HTTPS on cp1052 is OK: SSLXNN OK - 36 OK [14:04:04] (PS2) Filippo Giunchedi: Revert "cache::text: route traffic for restbase, citoid, cxserver to codfw" [puppet] - https://gerrit.wikimedia.org/r/281936 (owner: Giuseppe Lavagetto) [14:04:12] (CR) Filippo Giunchedi: [C: 2 V: 2] Revert "cache::text: route traffic for restbase, citoid, cxserver to codfw" [puppet] - https://gerrit.wikimedia.org/r/281936 (owner: Giuseppe Lavagetto) [14:04:27] (PS1) BBlack: add CAP_CHOWN to tlsproxy nginx caps [puppet] - https://gerrit.wikimedia.org/r/281941 [14:04:30] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [14:05:05] (CR) Brion VIBBER: "Does the rewrite override the alias maybe? Might have to punch a rewrite rule in instead of an Alias here..." 
[puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [14:05:10] (PS2) BBlack: add CAP_CHOWN to tlsproxy nginx caps [puppet] - https://gerrit.wikimedia.org/r/281941 [14:05:18] (CR) BBlack: [C: 2 V: 2] add CAP_CHOWN to tlsproxy nginx caps [puppet] - https://gerrit.wikimedia.org/r/281941 (owner: BBlack) [14:05:20] !log move restbase/citoid/cxserver varnish traffic back to eqiad [14:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:08:00] (CR) Alex Monk: "http://stackoverflow.com/a/12161249/1306662 :/" [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [14:08:41] Operations, Discovery, Kartotherian, Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183411 (RobH) >>! In T131880#2182880, @Gehel wrote: > @RobH I'd really appreciate if you could let me do the reclaim / reinstall so tha... [14:08:48] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures [14:09:47] (CR) Alex Monk: "I wonder if we can use a location block - https://httpd.apache.org/docs/current/mod/mod_alias.html#order" [puppet] - https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: Alex Monk) [14:11:20] Operations, Discovery, Kartotherian, Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183429 (BBlack) There's no real need to reinstall them. I have patches pending to put them into their proper roles, etc. 
[14:11:35] I'll let it simmer for half an hour and then merge https://gerrit.wikimedia.org/r/#/c/281935/ [14:12:37] k [14:13:13] Operations, Discovery, Kartotherian, Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183432 (BBlack) The patch series starts at: https://gerrit.wikimedia.org/r/#/c/268236/ , but needs manual rebases at this point. [14:13:48] Operations, Discovery, Kartotherian, Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183446 (BBlack) (it's better to look at T109162, that had all the patch links) [14:13:59] (PS4) BBlack: maps DNS 2/2: enable geodns routing [dns] - https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) [14:15:30] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 1 failures [14:15:50] !log rebooting baham (ns1) for 4.4 kernel + package updates [14:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:09] PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:18] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100% [14:19:48] Operations, Labs, Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2183482 (faidon) squid on carbon over IPv4 works fine — we'd have a lot more failures if that wasn't the case (you can verify th... 
[14:21:59] PROBLEM - Apache HTTP on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.021 second response time [14:22:19] PROBLEM - HHVM rendering on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.003 second response time [14:22:53] (PS5) Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) [14:24:49] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 37.09 ms [14:25:09] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 40.47 ms [14:25:39] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.084 second response time [14:26:00] RECOVERY - HHVM rendering on mw1187 is OK: HTTP OK: HTTP/1.1 200 OK - 68427 bytes in 0.126 second response time [14:26:25] Operations, Labs, Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2183484 (chasemp) >>! In T129623#2183482, @faidon wrote: > squid on carbon over IPv4 works fine — we'd have a lot more failures... [14:26:45] !log hhvm restarted on mw1187 [14:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:28:25] PROBLEM - Auth DNS on ns1-v6 is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:28:37] sad_trombone.wav [14:29:42] what's up ? 
[14:29:44] PROBLEM - Auth DNS on ns1-v4 is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:29:59] (03PS6) 10Gehel: Actiavte SSL + connection pooling for CirrusSearch on PROD [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) [14:30:03] akosiaris: I think benign, baham rebooted earlier by bblack [14:30:09] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: puppet fail [14:30:14] ok [14:30:37] RECOVERY - Auth DNS on ns1-v6 is OK: DNS OK: 5.081 seconds response time. www.wikipedia.org returns 208.80.154.224 [14:31:56] RECOVERY - Auth DNS on ns1-v4 is OK: DNS OK: 0.065 seconds response time. www.wikipedia.org returns 208.80.154.224 [14:32:12] (03PS2) 10Filippo Giunchedi: Revert "Make MediaWiki call the codfw restbase from all datacenters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281935 (owner: 10Giuseppe Lavagetto) [14:32:22] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Make MediaWiki call the codfw restbase from all datacenters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281935 (owner: 10Giuseppe Lavagetto) [14:34:00] !log filippo@tin Synchronized wmf-config/ProductionServices.php: move mediawiki traffic back to restbase eqiad (duration: 00m 34s) [14:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:40] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:36:09] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:37:09] (03PS2) 10Alex Monk: Send www.wikipedia.org/apple-app-site-association to the right file without an external redirect [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) [14:38:59] (03PS1) 10ArielGlenn: set dump cron job date range back to normal, adjust start times [puppet] - 10https://gerrit.wikimedia.org/r/281946 [14:41:01] 
(03CR) 10ArielGlenn: [C: 032] set dump cron job date range back to normal, adjust start times [puppet] - 10https://gerrit.wikimedia.org/r/281946 (owner: 10ArielGlenn) [14:42:20] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:47:22] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 6Zero: Tool labs tools should have a method of identifying Zero traffic - https://phabricator.wikimedia.org/T131934#2183516 (10zhuyifei1999) [14:53:16] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 6Zero: Tool labs tools should have a method of identifying Zero traffic - https://phabricator.wikimedia.org/T131934#2183516 (10valhallasw) Does Wikipedia Zero include non-wikipedia domains? I would expect tools.wmflabs.org to fall out of scope. [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160406T1500). Please do the needful. [15:00:04] matt_flaschen legoktm Urbanecm dcausse bmansurov: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] here [15:00:14] o/ [15:00:26] \o [15:00:34] <_joe_> godog: what's the situation on the switchover? [15:00:49] Present [15:00:59] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2183560 (10RobH) @gehel: Since this isn't going to end up being a reinstall, I'll ping you to do a reinstall on one of the many I do every... 
[15:01:01] o/ [15:01:20] I can SWAT this morning [15:01:24] for swat: I had merged a patch to MobileFrontend extension an hour or so again and havent rebased the extension on tin yet :( [15:01:58] PROBLEM - NTP on baham is CRITICAL: NTP CRITICAL: Offset unknown [15:03:02] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281640 (https://phabricator.wikimedia.org/T112799) (owner: 10Matthias Mullie) [15:03:42] !log rebased php-1.27.0-wmf.19/MobileFrontend and php-1.27.0-wmf.20/MobileFrontend (single commit related to CI) [15:03:44] (03Merged) 10jenkins-bot: Add Flow dumps schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281640 (https://phabricator.wikimedia.org/T112799) (owner: 10Matthias Mullie) [15:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:10] /srv/mediawiki-staging/php-1.27.0-wmf.20 has uncommited/staged modifications :-( [15:05:16] matt_flaschen: is there anything special needed or any coordination needed for the flowdumps change other than syncing it out? [15:05:27] thcipriani, no, it's all static. [15:06:20] 6Operations, 10Beta-Cluster-Infrastructure, 7WorkType-NewFunctionality: etcd/confd is not started on deployment-cache-mobile04 - https://phabricator.wikimedia.org/T116224#2183569 (10Krenair) 5Open>3declined Deleting instead: {T130473} [15:07:20] !log thcipriani@tin Synchronized docroot/mediawiki/xml: SWAT: Add Flow dumps schema [[gerrit:281640]] (duration: 00m 28s) [15:07:24] ^ matt_flaschen check please [15:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:34] Thanks, thcipriani. Works fine: https://www.mediawiki.org/xml/flow-1.0/ and https://www.mediawiki.org/xml/flow-1.0.xsd [15:08:38] hashar: yeah, .20 does have a lot of weird modifications :( I don't know what's up with that. [15:08:44] matt_flaschen: cool, thanks for checking. 
[15:11:00] thcipriani: I guess that is some live patches that havent been properly applied or failed to rebase [15:11:03] Urbanecm: around for SWAT? [15:11:09] RECOVERY - NTP on baham is OK: NTP OK: Offset -0.002489447594 secs [15:11:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280448 (https://phabricator.wikimedia.org/T128533) (owner: 10DCausse) [15:11:58] going to go through config changes, then do the big scap at the end legoktm [15:12:10] ok :P [15:12:19] (03Merged) 10jenkins-bot: Bump CirrusSearchRequestSet rev to 121456865906 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280448 (https://phabricator.wikimedia.org/T128533) (owner: 10DCausse) [15:12:27] I don't understand you thcipriany. I am ready for SWAT. [15:14:49] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2183605 (10mmodell) I'm sure we could hack the Jenkins job to use https but the staging... [15:15:08] !log Upgrading cp* to jessie 8.4 point release and linux 4.4 (T131746, T131928). Not rebooting yet. [15:15:09] T131746: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746 [15:15:09] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [15:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:32] 6Operations, 10Analytics-Cluster, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2183608 (10Ottomata) @robh, bump on this too. 
[15:16:45] _joe_: both patches merged, IOW {{done}} [15:17:02] !log thcipriani@tin Synchronized wmf-config/event-schemas: SWAT: Bump CirrusSearchRequestSet rev to 121456865906 PART I [[gerrit:280448]] (duration: 00m 27s) [15:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:42] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Bump CirrusSearchRequestSet rev to 121456865906 PART II [[gerrit:280448]] (duration: 00m 30s) [15:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:49] ^ dcausse check please [15:18:16] blerg Notice: Avro failed to serialize record for CirrusSearchRequestSet [15:18:41] lots and lots of those [15:18:45] thcipriani: damn [15:18:47] <_joe_> revert [15:18:57] yes revert (sorry) [15:19:18] thcipriani: only InitialiaseSettings should be ok [15:20:34] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: REVERT Bump CirrusSearchRequestSet rev to 121456865906 PART II [[gerrit:280448]] (duration: 00m 31s) [15:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:48] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2183636 (10mmodell) Why is labs blocked from connecting to ssh? Is that to avoid people... [15:21:20] dcausse: ok, lemme get a patch up for the revert. [15:21:46] godog: I'm going to restore the bootstrap stream rate then [15:21:56] urandom: sweet, thanks! [15:23:10] !log Restoring default stream throughput on restbase200{3,4-a}.codfw.wmnet [15:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:00] godog: how is the rebuild on 2003 doing btw? 
[15:24:24] (03PS1) 10Thcipriani: Revert "Merge "Bump CirrusSearchRequestSet rev to 121456865906"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281952 [15:24:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281952 (owner: 10Thcipriani) [15:25:10] (03Merged) 10jenkins-bot: Revert "Merge "Bump CirrusSearchRequestSet rev to 121456865906"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281952 (owner: 10Thcipriani) [15:25:46] urandom: progressing afaics, 42% but throttled at 6MB/s [15:26:51] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [15:26:55] godog: was it always throttled there, or was that for the switchover? [15:27:00] 6Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2183653 (10RobH) a:5RobH>3None Yes, I think we need a network admin to investigate the dhcp ability of the analytics vlan to carbon, as I cannto seem to... [15:27:25] urandom: we started like that, though we can bump it now I'd say [15:27:33] (03Merged) 10jenkins-bot: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm) [15:27:39] godog: i guess that makes the eta something like, Friday or later [15:28:33] dcausse: hmmm, still seeing the errors coming in, although at a lower rate: https://logstash.wikimedia.org/#dashboard/temp/AVPsL-QYO3D718AOlQeh [15:28:56] thcipriani: are all the wiki on wmf19? [15:29:07] dcausse: no, just group0 [15:29:11] er, just group1 and 2 [15:29:15] wmf20 is on group0 [15:29:17] urandom: err, 8MB/s, ETA is like 24h now [15:29:26] hmmm... 
so it should work :/ [15:29:41] godog: cool [15:30:20] schema is back to the previous one on enwiki [15:30:55] dcausse: https://tools.wmflabs.org/versions/ and you can click on the verison numbers to see which wikis is included [15:31:09] greg-g: thanks [15:31:43] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Translate extension on uawikimedia [[gerrit:281403]] (duration: 00m 27s) [15:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:52] ^ Urbanecm check please [15:32:56] Thcipriani: uawikimedia is down. [15:33:03] output [15:33:04] yup, running revert now [15:33:09] MediaWiki internal error. [15:33:09] Exception caught inside exception handler. Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information. [15:33:24] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: REVERT Enable Translate extension on uawikimedia [[gerrit:281403]] (duration: 00m 25s) [15:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:36] did you add the tables thcipriani? [15:33:57] no I did not. [15:34:16] you can use the normal extension table creation script for that [15:38:05] Krenair: which script? [15:38:32] 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2183691 (10mmodell) >>! In T131775#2180958, @chasemp wrote: > I'm pretty sure you mean LVS :) Yes, stupid error. Corrected now, thanks! > A hot/cold setup with a like pha... [15:40:14] thcipriani, create extension tables under the wikimedia maintenance extension [15:49:07] Krenair: mwscript extensions/WikimediaMaintenance/createExtensionTables.php translate --wiki=uawikimedia ? 
[15:49:38] mwscript extensions/WikimediaMaintenance/createExtensionTables.php uawikimedia translate [15:49:41] ^ I think it's that [15:49:57] don't remember how much it likes --wiki being at the end [15:50:00] might work [15:50:06] kk, thanks [15:50:43] ok, let's try this again. [15:50:44] mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=uawikimedia translate [15:50:47] I'd try that usually :P [15:51:42] heh, Krenair 's version seemed to work :) [15:54:42] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Translate extension on uawikimedia [[gerrit:281403]] (duration: 00m 28s) [15:54:53] ^ Urbanecm check please [15:56:10] marxarelli: would you check wmf.20 por favor? hashar noticed lots of git weirdness therein. [15:56:34] It seems that it's working. Thanks. [15:56:45] Urbanecm: thank you for checking! [15:56:46] thcipriani: git weirdness? [15:56:50] (03PS3) 10Alex Monk: Send wikipedia.org/apple-app-site-association to the right file without an external redirect [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) [15:57:21] marxarelli: yeah, modified stuff, check it out on tin. [15:58:36] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov) [15:58:49] bmansurov: still around for SWAT (I hope) :) [15:58:53] yes [15:58:55] (03CR) 10jenkins-bot: [V: 04-1] Remove Language Overlay experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov) [15:59:19] thcipriani: i'll rebase real quick [15:59:25] kk, thanks [16:00:57] (03PS3) 10Bmansurov: Remove Language Overlay experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) [16:01:07] thcipriani: done [16:01:12] thcipriani: oh geez. 
it's from the security patches [16:01:16] kk, let's try this again. [16:02:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov) [16:02:56] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2183850 (10ArielGlenn) I've done this for the new snapshot hosts and run a test dump of a wiki; it looked fine. I'll keep this open til the misc cron jobs are... [16:03:07] (03CR) 10Brion VIBBER: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk) [16:03:25] (03Merged) 10jenkins-bot: Remove Language Overlay experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov) [16:06:01] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove Language Overlay experiment [[gerrit:277837]] (duration: 00m 26s) [16:06:05] ^ bmansurov check please [16:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:58] thcipriani: thanks, looks good [16:07:03] bmansurov: awesome, thanks! [16:07:28] 6Operations, 6Analytics-Kanban: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183891 (10elukey) [16:07:39] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0] [16:08:07] legoktm: marxarelli is doing some rearranging some things on wmf.20, don't want to scap in the middle of it. 
[16:08:19] PROBLEM - Apache HTTP on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:35] no worries, I'll be here for a while :P [16:08:59] PROBLEM - HHVM rendering on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:00] legoktm: kk, I'll poke you when we're done, thanks :) [16:10:25] thcipriani: there is a pending change for MobileFrontend wmf.20 . It is for CI build [16:10:46] hashar: right, it's merged already, correct? [16:10:50] yeah [16:10:57] I havent rebased the MobileFrontend repo on tin since the mediawiki working copy has some staged diff [16:11:08] but it is definitely harmless for prod (just tweak package.json) [16:11:26] hashar: I8ea086cedd81c0cd626452b375a6ae1e81460943 ? [16:11:33] just pulled that down [16:11:34] (03CR) 10Dereckson: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [16:12:00] 6Operations: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746#2183894 (10ema) p:5Triage>3Normal [16:20:56] thcipriani, legoktm: ok, should be good to go now [16:21:29] (03PS2) 10Ema: Misc cluster VCL: avoid name conflict between directors and probes [puppet] - 10https://gerrit.wikimedia.org/r/281457 (https://phabricator.wikimedia.org/T131501) [16:22:03] (03CR) 10Ema: [C: 032 V: 032] Misc cluster VCL: avoid name conflict between directors and probes [puppet] - 10https://gerrit.wikimedia.org/r/281457 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [16:24:17] legoktm: I'm around to scap if you're around to check [16:24:28] I am! 
[16:25:19] 6Operations, 10Mathoid: Travis PNG looks different from vagrant png - https://phabricator.wikimedia.org/T94379#2183924 (10Physikerwelt) p:5Low>3Triage [16:25:32] 6Operations, 10ops-eqiad, 6DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2147470 (10Milimetric) I know we're supposed to convert these to SSDs soon, but I would sleep a lot easier if we fixed the disk. If another one fails we'll lose a lot of data and have to backfil... [16:27:28] 6Operations, 10Mathoid: Travis PNG looks different from vagrant png - https://phabricator.wikimedia.org/T94379#2183948 (10Physikerwelt) I think we should reclassify the importance o this bug for two reasons. 1. The problem is also preverlent in production (https://en.wikipedia.org/api/rest_v1/media/math/render... [16:30:19] PROBLEM - HHVM rendering on mw1210 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.015 second response time [16:30:57] (03PS8) 10Dereckson: Support handoff and credential sharing with the iOS app [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno) [16:31:38] thcipriani: ^^ [16:32:00] legoktm: whoops, missed your reply, kk going :) [16:32:18] RECOVERY - HHVM rendering on mw1210 is OK: HTTP OK: HTTP/1.1 200 OK - 66442 bytes in 0.099 second response time [16:32:20] 6Operations, 10Mathoid, 6Services: Travis PNG looks different from vagrant png - https://phabricator.wikimedia.org/T94379#2183958 (10Physikerwelt) [16:32:47] !log thcipriani@tin Started scap: SWAT: Add user_wpzero AbuseFilter variable [[gerrit:281867]] [16:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:54] ^ legoktm started [16:33:36] (03CR) 10Dereckson: [C: 031] "PS8: use previous version indent style, to preserve the git blame information for the applinks sections." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno) [16:34:16] twentyafterfour: still getting shitton of emails from /srv/phab/tools/public_task_dump.py for "rtppl" [16:34:41] woot :D [16:34:52] (03CR) 10Fjalapeno: [C: 031] Support handoff and credential sharing with the iOS app [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno) [16:40:03] (03PS2) 10ArielGlenn: https redirect for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) [16:40:27] (03CR) 10jenkins-bot: [V: 04-1] https redirect for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) (owner: 10ArielGlenn) [16:42:15] paravoid: oh, I thought I fixed that. let me see [16:45:30] (03PS3) 10ArielGlenn: https redirect for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) [16:46:57] 6Operations, 6Analytics-Kanban: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#1934216 (10Eevans) >>! In T123629#2143751, @MoritzMuehlenhoff wrote: > Upgrade procedure: > - Depool one of the aqs servers via conftool > - Stop restbase > - nodetool drain && systemctl stop cassandra > - u... [16:47:49] 6Operations, 10MediaWiki-Parser, 10Traffic, 10Wikidata, and 2 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2184019 (10Jdlrobson) [16:49:51] (03CR) 10ArielGlenn: "I think the 301 method may not work for POST whereas the rewrite does. Have a look at the current patchset and see what you think." 
[puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) (owner: 10ArielGlenn) [16:57:38] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:00:26] !log thcipriani@tin Finished scap: SWAT: Add user_wpzero AbuseFilter variable [[gerrit:281867]] (duration: 27m 39s) [17:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:31] ^ legoktm done! [17:00:37] YAAAAY [17:00:49] :D [17:01:14] thcipriani: confirmed working :) [17:01:15] thanks! [17:01:29] legoktm: cool, thanks for checking! [17:05:13] Dereckson: can you please schedule your patch for thursday, and i will do the same ? [17:07:37] k, but mine should also get approved, as it has only a green light from security point of view, not yet from ops, from a performance point of view [17:08:04] godog: ^ ? :) [17:09:01] Dereckson matanya what's the context? [17:09:25] godog: in addition to https://gerrit.wikimedia.org/r/#/c/280831 we would like to change https://gerrit.wikimedia.org/r/#/c/281823/ [17:14:21] Dereckson: ack, thanks, looking [17:18:35] (03CR) 10Filippo Giunchedi: "a comment on the actual value, also do we know how often mediawiki hits this timeout?" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) (owner: 10Dereckson) [17:22:06] (03CR) 10Fjalapeno: "Brion you mind +1 ing again?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno) [17:25:55] (03PS1) 10DCausse: Remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) [17:26:09] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:26:52] (03CR) 10DCausse: [C: 04-1] "I1614ed5 needs to be deployed before" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse) [17:27:09] (03CR) 10Brion VIBBER: [C: 031] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno) [17:27:37] (03CR) 10Dereckson: Raise upload-by-URL request timeout (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) (owner: 10Dereckson) [17:28:27] (03PS2) 10Dereckson: Remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse) [17:28:41] (03CR) 10Fjalapeno: [C: 031] "lgtm as well" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk) [17:29:57] (03CR) 10Fjalapeno: "Krenair - how do merge/deployments work for this type of change? Do I need to schedule a SWAT or will this just get merged and go out on t" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk) [17:31:34] (03CR) 10Alex Monk: "It's a puppet change so someone with ops rights will need to do it. There is puppetswat though..." 
[puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk) [17:36:30] godog: the question of how many is tricky: it's a feature currently restricted to GWT users (mainly GLAM institutions with a lot of files to upload) and Wikimedia Commons sysops. Would you know who could tell us where/how/if the information is logged? [17:39:46] 6Operations, 10ops-eqiad, 6DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2184216 (10Ottomata) Ja let’s do this. @cmjohnson1 ja?! [17:41:36] (03PS1) 10Dereckson: Set wgSemiprotectedRestrictionLevels for en.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281967 (https://phabricator.wikimedia.org/T126607) [17:46:53] (03PS1) 10Muehlenhoff: List all required restarts next to the new restarts introduced by a library upgrade [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/281969 [17:46:55] (03PS2) 10Dereckson: Raise upload-by-URL request timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) [17:47:58] (03CR) 10Dereckson: "PS2: 180 seconds instead of 300, per Filippo comment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) (owner: 10Dereckson) [17:52:20] 6Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184248 (10faidon) The port was also on the labs-instance-ports interface-range, which set the port-mode to trunk (and also added labs-instances1-eqiad to t... 
[17:53:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] List all required restarts next to the new restarts introduced by a library upgrade [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/281969 (owner: 10Muehlenhoff) [17:55:18] * Nemo_bis is eating the last decent oranges of the season and starts craving for susine and pesche [17:55:30] 6Operations, 10ops-codfw: rack conf100[123] - https://phabricator.wikimedia.org/T131959#2184249 (10RobH) [17:55:39] oh I was stuck at a Yuvi comment of many hours ago, sorry :) [17:55:41] 6Operations, 10ops-codfw: rack conf100[123] - https://phabricator.wikimedia.org/T131959#2184266 (10RobH) [17:55:53] 6Operations, 10ops-codfw: rack conf100[123] - https://phabricator.wikimedia.org/T131959#2184249 (10RobH) p:5Triage>3Normal [17:56:19] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1890778 (10RobH) This has been ordered, and now has a public blocking/racking task of T131959. [17:56:27] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2184272 (10RobH) [17:56:43] 6Operations, 10ops-codfw: rack/setup/deploy conf100[123] - https://phabricator.wikimedia.org/T131959#2184249 (10RobH) [17:59:06] yurik: so, maps server are marked as downtime in icinga, we are good to go with the nodejs 4.3 migration. Whenever you are ready! [18:14:33] 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2184291 (10yuvipanda) 5Resolved>3Open re-opening, since there is some issues still (I just found time to check back on it). So... 
[18:16:39] 6Operations, 10Analytics-Cluster, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184309 (10RobH) Ok, multiple attempts have still resulted in no joy (no dhcp request hitting carbon.) The system was also showing in the config in the def... [18:18:21] 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2184310 (10yuvipanda) now at: ``` +---------------------------------------------+ [!!] Configuring grub-pc +----------------------... [18:18:21] (03PS1) 10BBlack: LVS: add salt grain for lvs:(primary|secondary) [puppet] - 10https://gerrit.wikimedia.org/r/281972 [18:19:45] (03CR) 10jenkins-bot: [V: 04-1] LVS: add salt grain for lvs:(primary|secondary) [puppet] - 10https://gerrit.wikimedia.org/r/281972 (owner: 10BBlack) [18:20:38] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165259 (10Andrew) > Why is labs intentionally blocked from connecting to ssh? Can you... 
[18:21:33] (03PS2) 10BBlack: LVS: add salt grain for lvs:(primary|secondary) [puppet] - 10https://gerrit.wikimedia.org/r/281972 [18:22:35] (03PS1) 10Matanya: webp: enabled by default - remove old dead code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281973 (https://phabricator.wikimedia.org/T27397) [18:22:46] 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2184316 (10yuvipanda) I'm trying on notebook1002 now [18:23:53] 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2184318 (10yuvipanda) notebook1002 also seems to have the same thing going, stuck at the same '4 of 9'. I wonder if that's something... [18:26:22] 6Operations: Boot time race condition when assembling root raid device - https://phabricator.wikimedia.org/T131961#2184334 (10ema) [18:27:19] (03CR) 10BBlack: [C: 032] "works in puppet compiler, ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/281972 (owner: 10BBlack) [18:30:19] 6Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2184350 (10BBlack) So, the gerrit change is held up on comments about `mx ?all` vs `mx -all`. Are we confident phab emails only come from our mxes? ping @chasemp... [18:34:34] 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184372 (10yuvipanda) [18:35:01] 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184391 (10yuvipanda) now at: +---------------------------------------------+ [!!] Configuring grub-pc +----------------------------------------------+ |... 
[18:36:22] 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2145652 (10yuvipanda) 5Open>3Resolved Moved to T131964 [18:40:27] 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184402 (10yuvipanda) notebook1002 is now failing with: ``` +-------------------------------------+ [!!] Select and install software +-------------------------------------+ |... [18:43:24] 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184405 (10yuvipanda) Possible error things: ``` pr 2 18:05:56 main-menu[392]: (process:7949): /var/lib/partman/devices/=dev=sda Apr 2 18:05:56 main-menu[392]: (process:7949): /bin/autopartition-lvm: line 1: stat... [18:44:16] 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184406 (10yuvipanda) ```Apr 2 18:05:58 debconf: <-- 0 Retrying failed download of http://mirrors.wikimedia.org/debian/dists/jessie/main/binary-amd64/Packages.gz ``` might be more of the problem. [18:45:49] (03PS3) 10BBlack: cache_maps: add all sites in LVS [puppet] - 10https://gerrit.wikimedia.org/r/268238 (https://phabricator.wikimedia.org/T109162) [18:45:51] (03PS4) 10BBlack: cache_maps: re-role old mobile servers [puppet] - 10https://gerrit.wikimedia.org/r/268236 (https://phabricator.wikimedia.org/T109162) [18:45:53] (03PS3) 10BBlack: cache_maps: remove cp104[34] test caches [puppet] - 10https://gerrit.wikimedia.org/r/268237 (https://phabricator.wikimedia.org/T109162) [18:49:58] 6Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184433 (10yuvipanda) ```Apr 6 18:08:02 in-target: The following packages have unmet dependencies: Apr 6 18:08:02 in-target: bind9-host : Depends: libbind9-90 (= 1:9.9.5.dfsg-9+deb8u6) but it is not going to be i... 
[18:49:58] akosiaris and I will be switching maps services to node4.3 now, and will use trebuchet to update maps services [18:50:09] !log disable salt-minion on maps-test200{1,2,3} for maps services deployment. nodejs upgrade is in place [18:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:21] yurik: you are good to go [18:50:33] akosiaris, going... [18:51:33] (CR) Alexandros Kosiaris: [C: 2] wikilabels: healthier uwsgi [puppet] - https://gerrit.wikimedia.org/r/281940 (owner: Ladsgroup) [18:51:37] (PS2) Alexandros Kosiaris: wikilabels: healthier uwsgi [puppet] - https://gerrit.wikimedia.org/r/281940 (owner: Ladsgroup) [18:51:41] (CR) Alexandros Kosiaris: [V: 2] wikilabels: healthier uwsgi [puppet] - https://gerrit.wikimedia.org/r/281940 (owner: Ladsgroup) [18:51:53] (PS1) Dereckson: Use extension registration for ProofreadPage [mediawiki-config] - https://gerrit.wikimedia.org/r/281976 (https://phabricator.wikimedia.org/T119117) [18:52:01] Operations: Default gateway unreachable on baham.wikimedia.org after reboot - https://phabricator.wikimedia.org/T131966#2184436 (ema) [18:52:49] Operations: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2184452 (ema) [18:53:34] Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184453 (yuvipanda) ``` Apr 6 18:18:23 anna[28516]: wget: server returned error: HTTP/1.0 404 Not Found Apr 6 18:18:23 anna[28516]: WARNING **: package retrieval failed Apr 6 18:18:25 choose-mirror[28692]: DEBU... [18:56:40] akosiaris, kartotherian & tilerator have been synced, need restart [18:56:54] (service, not box :) [18:57:21] ok, unmasking and restarting [18:57:31] MaxSem, ^^ [18:58:05] i see maps on 2004 [18:58:25] wee [18:58:26] service-runner is spawning workers this time around. great!
[18:58:48] !log reboot notebook1001 [18:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:53] tilerator is still down [18:59:40] ok did tilerator and tileratorui as well [18:59:50] so, now let's check everything works as expected [19:00:01] akosiaris, both are up [19:00:04] marxarelli: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160406T1900). [19:00:20] akosiaris, seems all is good [19:00:43] (PS1) Dereckson: Flow dblist on noc.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/281977 [19:02:18] yurik: I concur [19:02:34] so... exact same drill for maps-test2001 ? [19:02:35] or more ? [19:02:52] aaah, lemme pool first maps-test2004 [19:02:57] or we will be without a service [19:03:06] (CR) Dereckson: "There symbolic links have been generated by docroot/noc/createTxtFileSymlinks.sh." [mediawiki-config] - https://gerrit.wikimedia.org/r/281977 (owner: Dereckson) [19:04:13] akosiaris, when you upgrade to node43, do you have to stop the service? [19:05:09] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:05:17] I do just to be sure cause the behaviour after the upgrade is not exactly well defined, but it's not required strictly [19:06:46] akosiaris, lets take all 3 servers out of rotation, and see if 2004 handles the new load [19:06:56] don't stop the service, only update the LVS [19:07:01] ok, done [19:07:05] checking [19:08:09] (CR) GWicke: [C: 1] "The code needed for this is now deployed, so we can start producing resource_changed events."
[puppet] - https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: Ppchelko) [19:09:06] akosiaris, all seems to be good, lets do all 3 [19:09:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:09:59] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:11:14] yurik: ok, gimme 2 mins [19:11:28] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [19:11:49] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:11:59] yurik: and you are good to go [19:12:30] akosiaris, trebuchet is all set for 1-3? [19:12:35] yup [19:12:51] and nodejs has been upgraded [19:13:41] akosiaris, did you disable 2004? it shows that all 4 fetched [19:13:45] not that it matters [19:13:54] no, I did not [19:13:59] ok, its fine [19:14:01] exactly because it does not matter ;-) [19:14:04] hehe [19:14:24] akosiaris, kartotherian is done, syncing tilerator [19:14:26] (PS1) Dduvall: group1 wikis to 1.27.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/281978 [19:15:35] akosiaris, tilerator is done [19:15:51] (PS1) BBlack: secure WMF-Last-Access cookie T119576 [puppet] - https://gerrit.wikimedia.org/r/281979 [19:15:54] !log restart kartotherian on maps-test200{1,2,3} [19:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:08] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: puppet fail [19:16:26] (PS2) BBlack: secure WMF-Last-Access cookie T119576 [puppet] - https://gerrit.wikimedia.org/r/281979 [19:16:32] !log restart tilerator, tileratorui on maps-test200{1,2,3} [19:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:40] yurik: ok, done, let's
test this [19:16:55] * yurik runs for the hills [19:17:08] (PS1) BBlack: secure GeoIP cookie T119576 [puppet] - https://gerrit.wikimedia.org/r/281980 [19:18:19] PROBLEM - HHVM rendering on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [19:18:19] seems to be working fine in my tests [19:19:03] yurik: will the train group1 promotion affect what you're deploying right now? or visa versa? [19:19:28] PROBLEM - Apache HTTP on mw1135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.016 second response time [19:19:30] PROBLEM - Apache HTTP on mw1145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time [19:19:57] marxarelli: doubtful [19:20:21] yurik: I think all is well... and maps-test2004 is handling all the load just fine [19:20:37] of course it's just 300KB/s [19:20:39] PROBLEM - HHVM rendering on mw1135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [19:20:48] (PS24) Ottomata: Hieraize keyholder::agent configuration [puppet] - https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 20after4) [19:21:07] akosiaris: kk. choo choo it is ...
[19:21:26] marxarelli, no affect [19:21:30] unrelated stuff [19:21:38] and its done anyway [19:21:43] seemed like it but i wanted to double check :) [19:21:46] thanks [19:21:59] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 68465 bytes in 0.507 second response time [19:22:10] !log bounce hhvm on mw1135, mw1145 [19:22:10] (CR) Dduvall: [C: 2] group1 wikis to 1.27.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/281978 (owner: Dduvall) [19:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:22:28] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 68465 bytes in 0.351 second response time [19:22:41] (Merged) jenkins-bot: group1 wikis to 1.27.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/281978 (owner: Dduvall) [19:22:46] yurik: so, I am pooling back maps-test200{1,2,3}, ok ? [19:22:56] akosiaris, go ahead [19:22:59] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.20 [19:23:08] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.046 second response time [19:23:10] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.081 second response time [19:24:03] !log pool maps-test200{1,2,3} for kartotherian.svc.codfw.wmnet [19:24:48] (CR) Ottomata: [C: 2] Hieraize keyholder::agent configuration [puppet] - https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 20after4) [19:26:16] yurik: I am considering this done and a success. Thank you for your business [19:26:18] ;-) [19:26:29] akosiaris, so do i, thank you for all the help! :D [19:26:44] (CR) Reedy: [C: 1] secure GeoIP cookie T119576 [puppet] - https://gerrit.wikimedia.org/r/281980 (owner: BBlack) [19:26:44] * yurik waits for some weird bug to surface [19:26:52] may be breaking puppet on tin...
:/... [19:26:54] (CR) Reedy: [C: 1] secure WMF-Last-Access cookie T119576 [puppet] - https://gerrit.wikimedia.org/r/281979 (owner: BBlack) [19:26:54] uh, wikitech down? [19:27:08] yep [19:27:23] akosiaris, did you break wikitech? :) [19:28:55] doubtful [19:29:29] marxarelli: wikitech is throwing 500s, maybe due to the train deploy ? [19:29:49] I don't think wikitech gets changed today [19:30:04] tail: cannot open ‘apache2/error.log’ for reading: Permission denied [19:30:11] I don't remember in which group it is [19:30:20] Ah, you're right [19:30:29] Group 1 to .20 in https://github.com/wikimedia/operations-mediawiki-config/commit/c8c8730443ac6bf969f0852dcf1453dfbf0f52c8 [19:30:30] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [19:30:48] akosiaris: Someone from ops will have to look at the error log :) [19:30:53] It's SMW yet again [19:30:54] PHP Fatal error: Call to undefined method Title::newFromRedirect() in /srv/mediawiki/php-1.27.0-wmf.20/extensions/SemanticMediaWiki/includes/SMW_ParserExtensions.php on line 41 [19:30:54] [Wed Apr 06 19:30:39.350356 2016] [:error] [pid 30370] [client 10.68.17.64:42705] PHP Fatal error: Call to undefined method Title::newFromRedirect() in /srv/mediawiki/php-1.27.0-wmf.20/extensions/SemanticMediaWiki/includes/SMW_ParserExtensions.php on line 41 [19:31:01] lol [19:31:02] akosiaris: grr, looking [19:31:03] akosiaris: I was first! [19:31:08] marxarelli: not your fault [19:31:17] andrewbogott: groumf...
yeah I give you that one [19:31:27] marxarelli: Revert it back to .19, and I'll get SMW fixed [19:31:35] thanks Reedy [19:31:47] it would have been so much cooler if my client believed it was before andrewbogott's [19:31:57] * andrewbogott LOVES not being the only one who ever logs into the wikitech host [19:31:58] at least we would be having a distributed systems discussions then [19:32:29] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:41] we still have SMW on wikitech ... what for ? [19:32:52] you know what ? I don't want to know [19:33:11] probably for the best :) [19:33:23] exactly because of that ^ reason [19:33:27] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [19:33:34] I am sure it's some obscure thing [19:33:59] It's not very obscure, but we can probably live without it [19:34:00] eventually [19:34:03] Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184584 (yuvipanda) @faidon upgraded debian-installer on carbon, which has resulted in the install completing but getting stuck in a installer loop! [19:34:16] lol, it's really just one line that needs fixing [19:34:59] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Puppet has 1 failures [19:35:25] Operations, Traffic, HTTPS, Patch-For-Review, Varnish: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#1830027 (BBlack) Note there are probably question-marks around these about insecure requests. We don't yet block/deny insecure POST traffic ( T105794 ), but we'...
[19:36:48] (PS1) Ottomata: Temporarily removing new group deploy-phabricator to fix puppet on tin [puppet] - https://gerrit.wikimedia.org/r/281985 [19:38:37] (CR) Ottomata: [C: 2] Temporarily removing new group deploy-phabricator to fix puppet on tin [puppet] - https://gerrit.wikimedia.org/r/281985 (owner: Ottomata) [19:38:40] PROBLEM - HHVM rendering on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [19:39:39] PROBLEM - Apache HTTP on mw1132 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [19:40:08] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:40:18] PROBLEM - HHVM rendering on mw1140 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [19:40:36] Reedy: https://phabricator.wikimedia.org/T131973 [19:40:38] PROBLEM - Apache HTTP on mw1140 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [19:40:48] marxarelli: I just made a commit :D [19:41:17] (CR) Ottomata: "Something was wrong with this, we'll have to figure it out as a separate patch. With this defined, i was geting:" [puppet] - https://gerrit.wikimedia.org/r/281985 (owner: Ottomata) [19:41:24] Reedy: thanks! [19:41:28] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [19:43:26] (PS29) Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [19:44:30] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[19:44:55] Operations, Traffic, HTTPS, MW-1.27-release-notes, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2184631 (BBlack) So, we've had the API warning up for a couple of months now. In general, we've continually fallen behind on promises to notify -> kill insecure... [19:45:08] PROBLEM - Apache HTTP on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.034 second response time [19:45:50] PROBLEM - HHVM rendering on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.003 second response time [19:45:57] (CR) Ottomata: [C: 2] Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) (owner: Ottomata) [19:46:10] PROBLEM - HHVM rendering on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [19:46:39] PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.014 second response time [19:47:00] PROBLEM - HHVM rendering on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [19:47:06] Operations, Continuous-Integration-Scaling, HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#1997890 (akosiaris) So, up to now we did not have to package HHVM for jessie-wikimedia. I don't have an ETA on when it will be re... [19:47:10] PROBLEM - Apache HTTP on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [19:47:36] Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184637 (yuvipanda) It's no longer in a loop, is back to: ``` ┌───────────────┤ [!!]
Select and install software ├────────────────┐ │ │... [19:48:03] Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184638 (yuvipanda) And back to: ``` Apr 6 19:41:45 in-target: Reading package lists... Apr 6 19:41:45 in-target: Apr 6 19:41:45 in-target: Building dependency tree... Apr 6 19:41:45 in-target: Apr 6 19:4... [19:48:59] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:49:08] !log dduvall@tin Synchronized php-1.27.0-wmf.20/extensions/SemanticMediaWiki/includes/SMW_ParserExtensions.php: Replace usage of Title::newFromRedirect() (duration: 00m 38s) [19:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:50] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [19:50:56] (PS1) Ottomata: Temporarily comment out dumps/dumps scap source until it is ready [puppet] - https://gerrit.wikimedia.org/r/281987 [19:50:58] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: Promote labswiki to 1.27.0-wmf.20 following temporary rollback and fix [19:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:17] Reedy: thanks for the quick fix! [19:51:24] Operations, Traffic, HTTPS, MW-1.27-release-notes, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2184639 (konklone) @BBlack If you want someone to remind you about it, I am happy to volunteer. ;) [19:51:25] marxarelli: Do I still need to do a bump fix? [19:51:28] *submodule bump [19:51:52] Reedy: shouldn't need to.
.gitmodules is tracking 1.8.x for SMW [19:52:01] ah, I wasn't sure if it autobumped :) [19:52:36] (CR) Ottomata: [C: 2] Temporarily comment out dumps/dumps scap source until it is ready [puppet] - https://gerrit.wikimedia.org/r/281987 (owner: Ottomata) [19:52:47] marxarelli: It doesn't look to have done.. [19:52:50] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [19:53:13] HMmmMM [19:53:15] on mira? [19:53:49] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 2 failures [19:54:22] Reedy: ah, right. no, there's no commit yet. i just pulled down the latest from 1.8.x [19:54:24] https://gerrit.wikimedia.org/r/281988 [19:54:58] Reedy: i went rogue on that one, sorry :) [19:55:06] heh, no worries [19:55:39] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:56:28] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160406T2000). Please do the needful.
[20:01:56] no mobileapps deployment today [20:01:58] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures [20:02:10] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:03:40] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:05:20] Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184661 (yuvipanda) BIOS had PXEboot ahead of local hard drive, so I've switched that over now (on 1001) [20:06:56] !log starting parsoid deploy [20:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:51] Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184669 (yuvipanda) now it's stuck just booting up, at: ```Scanning for devices. Please wait, this may take several minutes...``` [20:09:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0] [20:10:14] !log synced code; restarted parsoid on wtp1001 as a canary [20:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:16] Operations, RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2184683 (Eevans) I conducted an audit of compactions on restbase1007-a.eqiad.wmnet over the weekend (from April 1-4), the result of which can be seen, visualized as a directed graph, here:...
[20:17:05] Operations, RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2184699 (Eevans) /cc'ing @JAllemandou and @elukey as AQS uses DTCS too if I'm not mistaken; It wouldn't hurt to have a look at how compaction is working on the AQS cluster [20:17:08] !log finished deploying parsoid sha 5f6c0c60 [20:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:59] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:25:15] Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2184706 (yuvipanda) OK, the boot order fixed it for notebook1002! 1001 is still stuck [20:41:41] Operations, Continuous-Integration-Infrastructure, Phabricator, netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2184719 (chasemp) 22 to only 208.80.154.250/32 as the service address for git-ssh shou... [20:42:04] oh [20:42:14] "talk to gearman", not "talk German" [20:44:37] (PS1) ArielGlenn: dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 [20:44:57] (PS1) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998 [20:45:05] * apergos boggles at Platonides [20:45:22] we're big on multilingualism but...
[20:45:44] (PS2) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998 [20:46:31] (CR) jenkins-bot: [V: -1] dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 (owner: ArielGlenn) [20:47:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:47:55] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:48:23] (PS3) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998 [20:49:24] (PS2) ArielGlenn: dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 [20:50:36] (CR) jenkins-bot: [V: -1] dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 (owner: ArielGlenn) [20:57:49] (PS4) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998 [21:05:57] (PS1) Rush: labstore svc addresses to separate mounts [dns] - https://gerrit.wikimedia.org/r/282000 (https://phabricator.wikimedia.org/T131541) [21:08:25] (CR) Rush: [C: 2] labstore svc addresses to separate mounts [dns] - https://gerrit.wikimedia.org/r/282000 (https://phabricator.wikimedia.org/T131541) (owner: Rush) [21:11:50] (PS2) Dereckson: Set wgSemiprotectedRestrictionLevels for en.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281967 (https://phabricator.wikimedia.org/T131976) [21:12:33] (PS2) Gehel: Activate SSL and connection pooling for CirrusSearch.
[mediawiki-config] - https://gerrit.wikimedia.org/r/281882 (https://phabricator.wikimedia.org/T131839) [21:14:21] Operations, Discovery, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2184839 (EBernhardson) [21:15:43] (CR) jenkins-bot: [V: -1] Activate SSL and connection pooling for CirrusSearch. [mediawiki-config] - https://gerrit.wikimedia.org/r/281882 (https://phabricator.wikimedia.org/T131839) (owner: Gehel) [21:16:52] (PS5) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998 [21:16:57] csteipp: poke [21:20:38] Operations, Discovery, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2184858 (EBernhardson) SATA is great, well not great but the disk requirements here make SSD's a bit untenable. 2x6 isn't a strict requirement, we figure... [21:21:16] matanya: What can I do for you? [21:21:47] (PS1) Dereckson: Add mergehistory right to eliminator group on ja.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/282055 (https://phabricator.wikimedia.org/T131751) [21:22:02] Operations, Labs, Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2184867 (chasemp) Thank you faidon, that is indeed the story. I put in a specific allowance for the labs-hosts VLAN in question...
[21:23:26] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:23:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:29:27] PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 186523 MB (3% inode=99%) [21:31:44] (CR) EBernhardson: "code looks sane, but i need to look into the unit test failures to see what's happening. It might just be that it's not pulling in the rig" [mediawiki-config] - https://gerrit.wikimedia.org/r/281882 (https://phabricator.wikimedia.org/T131839) (owner: Gehel) [21:35:19] Operations, Continuous-Integration-Infrastructure, Multimedia, Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#2184918 (hashar) OpenStack enquired about imagemagick on Trusty requiring ffmpeg. But ffmpeg go... [21:36:36] urandom: 2004 is tight [21:36:57] gwicke: i have script [21:37:06] it should cull compactions past 93% [21:37:28] yeesh, and it has been... [21:38:14] hrmm, or at least it should have been [21:38:37] RECOVERY - Disk space on restbase2004 is OK: DISK OK [21:39:28] thanks! [21:39:35] gwicke: it may not make it either way [21:40:18] i show it needing another ~670G, and killing the big compaction just now brought it just north of 400G [21:40:32] we might have to hang tight until the new hardware arrives [21:42:46] (PS1) Rush: nslcd specifying shell override [puppet] - https://gerrit.wikimedia.org/r/282060 (https://phabricator.wikimedia.org/T131541) [21:42:59] Operations, Continuous-Integration-Scaling, HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#1997890 (greg) One thought from Alex in the SoS was creating a trusty nodepool image for these tests (composer) to unblock us (Re...
[21:46:25] (CR) Ottomata: [C: 1] "+1, one though:" [puppet] - https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: Ppchelko) [21:47:58] (PS2) Rush: nslcd specifying shell override [puppet] - https://gerrit.wikimedia.org/r/282060 (https://phabricator.wikimedia.org/T131541) [21:53:36] Operations, Continuous-Integration-Scaling, HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#2185010 (hashar) Potentially we could generate an image based on Trusty then I would rather switch all of CI to run solely on Deb... [21:54:03] (CR) Yuvipanda: [C: 1] nslcd specifying shell override [puppet] - https://gerrit.wikimedia.org/r/282060 (https://phabricator.wikimedia.org/T131541) (owner: Rush) [22:05:30] (PS3) ArielGlenn: dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 [22:07:11] urandom: too bad that we can't throw brotli at it yet [22:11:44] (PS7) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998 [22:11:46] (PS1) Yuvipanda: docker: Don't setup credentials on all hosts [puppet] - https://gerrit.wikimedia.org/r/282071 [22:12:02] (PS1) Rush: toollabs bastions install cgroup-bin [puppet] - https://gerrit.wikimedia.org/r/282072 (https://phabricator.wikimedia.org/T131541) [22:12:07] (CR) Hoo man: [C: 2 V: 2] Clarifying i18n parameters [dumps/dcat] - https://gerrit.wikimedia.org/r/277955 (owner: Lokal Profil) [22:12:59] (CR) Yuvipanda: [C: 2 V: 2] docker: Don't setup credentials on all hosts [puppet] - https://gerrit.wikimedia.org/r/282071 (owner: Yuvipanda) [22:13:21] (CR) jenkins-bot: [V: -1] toollabs bastions install cgroup-bin [puppet] - https://gerrit.wikimedia.org/r/282072 (https://phabricator.wikimedia.org/T131541) (owner: Rush) [22:20:07] (PS2) Rush: toollabs
bastions install cgroup-bin [puppet] - https://gerrit.wikimedia.org/r/282072 (https://phabricator.wikimedia.org/T131541) [22:26:53] Operations, Discovery, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2185144 (RobH) Well, the only potential spare sysems would be our recently reclaimed restbase1001-1006, but they would need a memory upgrade, plus the pur... [22:39:49] (PS4) ArielGlenn: dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 [22:41:07] (CR) ArielGlenn: [C: 2] dumps: fix up all references to directory with config files [puppet] - https://gerrit.wikimedia.org/r/281997 (owner: ArielGlenn) [22:45:57] (PS8) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998 [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson fjalapeno: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160406T2300). [23:00:04] fjalapeno matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:20] * MaxSem is busy [23:00:59] I'm available [23:01:08] Unless you want to, matt_flaschen ..? [23:01:21] (PS1) ArielGlenn: enable dumps cron run on snapshot1006 and 1007 [puppet] - https://gerrit.wikimedia.org/r/282078 [23:01:25] Krenair, I can do it. [23:01:33] I am available [23:01:55] (Fjalapeno) [23:03:00] (PS2) ArielGlenn: enable dumps cron run on snapshot1006 and 1007 [puppet] - https://gerrit.wikimedia.org/r/282078 [23:03:01] That's a cool feature, BTW.
[23:04:23] (CR) ArielGlenn: [C: 2] enable dumps cron run on snapshot1006 and 1007 [puppet] - https://gerrit.wikimedia.org/r/282078 (owner: ArielGlenn) [23:04:40] (CR) Mattflaschen: [C: 2] Support handoff and credential sharing with the iOS app [mediawiki-config] - https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: Fjalapeno) [23:05:09] (Merged) jenkins-bot: Support handoff and credential sharing with the iOS app [mediawiki-config] - https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: Fjalapeno) [23:05:15] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 607 [23:11:27] !log mattflaschen@tin Synchronized docroot/wikipedia.org/apple-app-site-association: Support handoff and credential sharing with the iOS app (duration: 00m 34s) [23:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:57] coreyfloyd, done, but I don't see it yet. I guess we have to wait for Varnish to clear. Test when you can. [23:13:17] matt_flaschen: I see the same thing. Will keep an eye out [23:13:27] matt_flaschen: thanks [23:15:15] RECOVERY - check_mysql on lutetium is OK: Uptime: 1760330 Threads: 2 Questions: 16453277 Slow queries: 10361 Opens: 109216 Flush tables: 2 Open tables: 64 Queries per second avg: 9.346 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [23:16:05] coreyfloyd, if you access it with a query string, it shows the new file, though. So it's definitely Varnish: https://en.wikipedia.org/apple-app-site-association?something [23:18:44] you can purge it from varnish [23:21:24] matt_flaschen: yep I see it. [23:21:43] Krenair: how long does it take normally. I'm not in a rush. [23:22:09] I don't remember [23:22:46] Some things take 5 minutes, I don't know how long that takes. [23:25:46] Ok I'm patient.
[23:30:19] matt_flaschen: try echo 'https://en.wikipedia.org/apple-app-site-association' | mwscript purgeList.php
[23:34:07] Thanks, Dereckson. I ran that, which force-purged it on English Wikipedia. That won't affect any other language subdomains, though. coreyfloyd, you could put together the other domains it's intended to work with and force-purge them, or wait for it to auto-expire (but no one knows how long that takes)
[23:35:41] matt_flaschen: the /static folder is served from en.wikip by Varnish so that helps, but here the trick is it's outside /static.
[23:36:41] I wonder if it wouldn't be best to move the file to /static and redirect /apple... to /static/apple... if there is a regular need to update this file.
[23:37:49] matt_flaschen: en is enough for me to test on
[23:37:51] yeah there's some varnish magic that transparently sends everything static to enwiki
[23:39:19] coreyfloyd: how stable is apple-app-site-association? Will you need in the future to add new app identifiers?
[23:40:38] !log mattflaschen@tin Synchronized php-1.27.0-wmf.20/extensions/Flow/includes/Data/Listener/NotificationListener.php: Fix new topic notifications (duration: 00m 37s)
[23:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:41:35] Dereckson: pretty stable. We only need to change it to add additional services. Like this patch.
[23:42:02] Dereckson: we are just implementing these services for the first time. After this though I don't see many more changes coming.
[23:43:01] Works on MediaWiki.org
[23:44:42] !log mattflaschen@tin Synchronized php-1.27.0-wmf.19/extensions/Flow/includes/Data/Listener/NotificationListener.php: Fix new topic notifications (duration: 00m 29s)
[23:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:46:58] coreyfloyd: okay
[23:54:07] And bs.wikipedia.org.
[23:54:13] SWAT complete
[23:57:01] (PS9) Yuvipanda: docker: Add nginx frontend for registry [puppet] - https://gerrit.wikimedia.org/r/281998
[23:57:28] thanks matt_flaschen