[00:35:08] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[00:55:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 646 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2969889 keys, up 59 days 16 hours - replication_delay is 646
[00:57:08] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2952337 keys, up 59 days 16 hours - replication_delay is 0
[01:01:58] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[01:24:08] PROBLEM - puppet last run on wdqs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:29:58] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[01:32:08] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[01:53:08] RECOVERY - puppet last run on wdqs1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[02:39:58] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:52:48] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:07:58] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[03:21:48] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[04:08:28] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3867.50 Read Requests/Sec=1863.50 Write Requests/Sec=2.70 KBytes Read/Sec=19475.60 KBytes_Written/Sec=1007.20
[04:17:28] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=7.00 Read Requests/Sec=0.00 Write Requests/Sec=0.60 KBytes Read/Sec=0.00 KBytes_Written/Sec=15.20
[05:57:47] (PS3) BryanDavis: [WIP] Provision MediaWiki-Vagrant on Jessie hosts [puppet] - https://gerrit.wikimedia.org/r/245920
[05:58:20] (CR) BryanDavis: "> Uploaded patch set 3." [puppet] - https://gerrit.wikimedia.org/r/245920 (owner: BryanDavis)
[05:58:39] (CR) jerkins-bot: [V: -1] [WIP] Provision MediaWiki-Vagrant on Jessie hosts [puppet] - https://gerrit.wikimedia.org/r/245920 (owner: BryanDavis)
[06:01:10] (PS4) BryanDavis: [WIP] Provision MediaWiki-Vagrant on Jessie hosts [puppet] - https://gerrit.wikimedia.org/r/245920
[06:28:58] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:32:18] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:33:58] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:40:08] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:43:18] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:01:18] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:22:58] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[08:12:18] Operations, DNS, Labs, Labs-Infrastructure, and 3 others: Set SPF (... -all) for toolserver.org - https://phabricator.wikimedia.org/T131930#2907047 (Nemo_bis) I keep getting quite a bit of spam (and phishing) from fake toolserver.org addresses, would be nice to fix this. `~all` is better than not...
[08:17:15] Operations, Mail: Get mail relay out of Yahoo! blacklist: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T58414#2907049 (Nemo_bis) As far as I can see everything is ready here and T66795#2867163 should reduce bounces to/from Yahoo (although it was arguably their fault, not...
[08:34:47] Operations, ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2907052 (Joe) @robh option 1 seems good. A side note: why are we reusing old hostnames? we never did that in eqiad and I thought that was a policy.
[08:35:05] Operations, ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2907053 (Joe) a:Joe>RobH
[08:38:08] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[08:43:58] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[08:55:07] Operations, ChangeProp, Parsing-Team, Parsoid, and 6 others: Check concurrency/retry/timeout limits and syncronize those between services - https://phabricator.wikimedia.org/T152073#2907055 (Joe)
[08:56:18] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[10:45:22] Operations, MediaWiki-Special-pages, Wikimedia-General-or-Unknown: Special:Import error: "Import failed: Could not open import file" - https://phabricator.wikimedia.org/T17000#2907097 (TTO) Sorry Faidon... I'm adding Ops back per my previous comment. Otherwise this will be stalled forever.
[10:46:26] Operations, MediaWiki-Export-or-Import, Wikimedia-General-or-Unknown: Special:Import error: "Import failed: Could not open import file" - https://phabricator.wikimedia.org/T17000#2907102 (TTO)
[10:46:45] (CR) Thiemo Mättig (WMDE): [C: 1] "Approved by PM. Please merge." [mediawiki-config] - https://gerrit.wikimedia.org/r/329453 (https://phabricator.wikimedia.org/T153186) (owner: Ladsgroup)
[10:54:08] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:22:08] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[11:25:18] PROBLEM - puppet last run on graphite1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:53:18] RECOVERY - puppet last run on graphite1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[12:00:59] Operations, MediaWiki-Export-or-Import, Wikimedia-General-or-Unknown: Special:Import error: "Import failed: Could not open import file" - https://phabricator.wikimedia.org/T17000#2907184 (Nemo_bis) If ops don't answer, I think it's reasonable to proceed by trial and error. The first step would be to...
[14:09:18] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[14:37:10] Operations, MediaWiki-Export-or-Import, Wikimedia-General-or-Unknown: Special:Import error: "Import failed: Could not open import file" - https://phabricator.wikimedia.org/T17000#191958 (Joe) >>! In T17000#2907184, @Nemo_bis wrote: > If ops don't answer, I think it's reasonable to proceed by trial an...
[14:38:18] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[15:20:54] Operations, Gerrit, Release-Engineering-Team, Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2907409 (Paladox) Upstream have figured out a fix that will fix both online reindex and clean shutdown (@Luca fixed it :)) See https://g...
[15:41:08] Operations, ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2907426 (Papaul) The last new mw servers we put in, we didn't reused old hostnames we started with mw2215 and the last one is mw2250, I have already put in racktable mw2251-mw2260 for the 10 new mw servers.
[15:42:18] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:52:18] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:59:38] Operations, ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2907442 (RobH) Indeed, we shouldn't be reusing the old hostnames, and I didn't think we planned to. (Seems that @papaul is also on the same page!) I'll go ahead and decommission the existing hosts so the...
[16:07:23] Operations, ops-codfw, Discovery, Discovery-Search, Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2907453 (Papaul) @Gehel please see below the racking schema for the new elastic servers. Let me know if you are approve so i can start the rackin...
[16:10:18] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[16:21:18] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[16:23:14] Operations, Gerrit, Release-Engineering-Team, Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2907469 (Paladox)
[16:24:38] Operations, Gerrit, Release-Engineering-Team, Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2904053 (Paladox) I have now tested https://gerrit-review.googlesource.com/#/c/93479/ and it works. @demon could we cherry-pick t...
[16:32:11] Operations, Gerrit, Release-Engineering-Team: Gerrit: Schedule downtime for T154205 - https://phabricator.wikimedia.org/T154327#2907498 (Paladox)
[16:32:21] Operations, Gerrit, Release-Engineering-Team: Gerrit: Schedule downtime for T154205 - https://phabricator.wikimedia.org/T154327#2907514 (Paladox) p:Triage>High
[16:35:18] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[16:44:50] Operations, DBA, Gerrit, Release-Engineering-Team: Gerrit: Schedule downtime for T154205 - https://phabricator.wikimedia.org/T154327#2907534 (Paladox)
[16:46:35] Operations, DBA, Gerrit, Release-Engineering-Team: Gerrit: Schedule downtime for T154205 - https://phabricator.wikimedia.org/T154327#2907498 (Paladox)
[16:46:38] Operations, Gerrit, Release-Engineering-Team, Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2907536 (Paladox)
[17:03:18] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[17:16:20] Operations, ops-codfw, Discovery, Discovery-Search, Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2907543 (RobH) a:Papaul>Gehel Assigned to @gehel for his feedback. @gehel: Please review and if all looks good, comment and assign back t...
[17:31:28] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:57:23] Operations, Gerrit, Release-Engineering-Team, Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2907562 (Paladox) This deftly looks like it caused T153079 as he managed to force merge without an object error but gerrit was sug...
[17:58:08] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:00:28] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[18:26:08] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[18:58:28] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:22:45] (PS1) BryanDavis: vagrant: Update LXC packages and apparmor conf for systemd [puppet] - https://gerrit.wikimedia.org/r/329702 (https://phabricator.wikimedia.org/T154294)
[19:27:28] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[20:37:37] (PS1) Madhuvishy: nfs: Clean up post tools nfs migration [puppet] - https://gerrit.wikimedia.org/r/329707
[21:09:28] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:20:01] Operations, ops-codfw, Discovery, Discovery-Search, Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2907759 (Gehel) a:Gehel>Papaul That looks fine! Thanks!
[21:38:28] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[21:48:47] Operations, Labs, Tracking: Migrate misc to secondary labstore HA cluster - https://phabricator.wikimedia.org/T154336#2907785 (madhuvishy)
[21:54:08] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:12:17] (PS1) Madhuvishy: nfs: Dual mount misc projects from labstore-secondary cluster [puppet] - https://gerrit.wikimedia.org/r/329711 (https://phabricator.wikimedia.org/T154336)
[22:16:41] madhuvishy sorry to bother but as a lab user myself will I be affected by the NFS convert at all or any of the tools I maintain? I know that your probably busy so take your time getting to this reply, what you are doing is more important
[22:17:30] Zppix: Hi! Do you maintain any active labs projects(that have nfs enabled)?
[22:17:53] if it's only tools - the migration is complete as of mid november and you shouldn't be affected
[22:22:08] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[23:03:37] Zppix: -labs is probably the better channel for that
[23:07:16] Puppet, Labs, MediaWiki-Vagrant, User-bd808: Make role::labs::mediawiki_vagrant work on Debian Jessie host systems - https://phabricator.wikimedia.org/T154340#2907871 (bd808)
[23:07:51] Puppet, Labs, MediaWiki-Vagrant, User-bd808: Make role::labs::mediawiki_vagrant work on Debian Jessie host systems - https://phabricator.wikimedia.org/T154340#2907885 (bd808)
[23:08:27] Puppet, Labs, MediaWiki-Vagrant, User-bd808: Make role::labs::mediawiki_vagrant work on Debian Jessie host systems - https://phabricator.wikimedia.org/T154340#2907871 (bd808) One prior blocker was fixed by {T122734}
[23:15:40] Puppet, Labs, MediaWiki-Vagrant, User-bd808: Make role::labs::mediawiki_vagrant work on Debian Jessie host systems - https://phabricator.wikimedia.org/T154340#2907907 (bd808)
[23:28:25] hi guys, when I try to email arbcom-l@lists.wikimedia.org or clerks-l@lists.wikimedia.org, i'm getting an error: Message rejected for privacy protection: The list of recipients contains both private and public mail lists