[00:00:44] ori legoktm ostriches just reading backscroll. Yes it's okay to undeploy it. I'm personally glad to see the end of this saga. But i really would appreciate you remember that people who were asked to build this on the behalf of the WMF built this to the best of their ability. Phrases such as "Kill that shit with fire" don't make me too happy to be part of
[00:00:44] this community and cautious about other things I build.
[00:01:14] !log dereckson@tin Synchronized wmf-config/: Undeploy Gather extension (T128568) (duration: 00m 33s)
[00:01:15] T128568: Undeploy the Gather extension from Wikimedia wikis - https://phabricator.wikimedia.org/T128568
[00:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:01:40] legoktm: please test ^
[00:01:47] jdlrobson: my comment (I did not say to kill it with fire) was based on the fact that affected users were already notified a long time ago
[00:01:59] not a statement about the quality of the extension
[00:02:00] Dereckson: I checked Special:Version, looks good, and thank you :)
[00:02:03] ori i was referring to "ostriches> Kill that shit with fire then."
[00:02:21] I've only synced the files, is there any other step to do to undeploy an extension?
[00:02:29] Dereckson: I will take care of the rest
[00:02:40] okay, thanks
[00:02:47] but thank you ori for clarifying. A lot of people got hurt feelings with this whole saga so I'm glad its behind us.
[00:02:49] removal from make-wmf-branch is the only other step, I'm also going to archive the extension
[00:03:35] So two changes remaining, one config and the Echo full scap.
[00:03:37] jdlrobson: I wouldn't read "kill that shit with fire" as a criticism of the extension either. We talk like that about moribund code pretty often, including stuff that we have written and cared about..
[00:04:12] (PS2) Dereckson: Add Puotal: namespace to jam.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/289331 (https://phabricator.wikimedia.org/T135479)
[00:04:19] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/289331 (https://phabricator.wikimedia.org/T135479) (owner: Dereckson)
[00:04:58] (Merged) jenkins-bot: Add Puotal: namespace to jam.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/289331 (https://phabricator.wikimedia.org/T135479) (owner: Dereckson)
[00:05:09] but, I hear you. Sometimes a lot of heart goes into some product and it doesn't take, and if people rejoice when it gets undeployed it can hurt.
[00:05:13] (PS1) Dzahn: admin: add new group labnet-users [puppet] - https://gerrit.wikimedia.org/r/289343 (https://phabricator.wikimedia.org/T133992)
[00:05:20] but I think ostriches was rejoicing about having less code to maintain :)
[00:05:24] jdlrobson: I believe you misunderstood me :)
[00:05:28] Yep
[00:05:51] Anything on the cluster that doesn't need to be there...is by definition shit :P
[00:06:08] It can be high quality shit!
[00:06:21] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team, Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2303541 (Dzahn)
[00:06:44] ostriches ori thanks guys. Just my team has been demoralised a lot by this whole thing... especially given some of the community is requesting a very similar sounding reading list... the saga never ends.
[00:07:41] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add Puotal: namespace to jam.wikipedia (T135479) (duration: 00m 27s)
[00:07:42] T135479: add a Portal namespace for jam.wikipedia.org - https://phabricator.wikimedia.org/T135479
[00:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:09:35] !log mwscript namespaceDupes.php jamwiki --fix: 4 pages to fix, 4 were resolvable. (T135479)
[00:09:36] T135479: add a Portal namespace for jam.wikipedia.org - https://phabricator.wikimedia.org/T135479
[00:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:12:22] Let's go for the full scpa.
[00:12:50] !log dereckson@tin Started scap: Echo: Bring back Echo email messages
[00:12:55] (PS2) Dzahn: admin: add new group labnet-users [puppet] - https://gerrit.wikimedia.org/r/289343 (https://phabricator.wikimedia.org/T133992)
[00:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:18:37] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team, Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2303604 (Dzahn) This would add a new group for unprivilege...
[00:21:00] Blocked-on-Operations, Operations, Wikidata, Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2303610 (Dereckson) **Portal namespace** During the life on Incubator, a portal namespace have been prepared for Jamaica, https://jam.wikiped...
[00:22:45] (PS2) Dzahn: planet: only run updates when in active datacenter [puppet] - https://gerrit.wikimedia.org/r/289340 (https://phabricator.wikimedia.org/T134507)
[00:24:17] (CR) jenkins-bot: [V: -1] planet: only run updates when in active datacenter [puppet] - https://gerrit.wikimedia.org/r/289340 (https://phabricator.wikimedia.org/T134507) (owner: Dzahn)
[00:29:36] (PS3) Dzahn: planet: only run updates when in active datacenter [puppet] - https://gerrit.wikimedia.org/r/289340 (https://phabricator.wikimedia.org/T134507)
[00:31:08] (CR) jenkins-bot: [V: -1] planet: only run updates when in active datacenter [puppet] - https://gerrit.wikimedia.org/r/289340 (https://phabricator.wikimedia.org/T134507) (owner: Dzahn)
[00:31:44] (caches done, scap is syncing)
[00:34:48] (PS4) Dzahn: planet: only run updates when in active datacenter [puppet] - https://gerrit.wikimedia.org/r/289340 (https://phabricator.wikimedia.org/T134507)
[00:37:24] (PS5) Dzahn: planet: only run updates when in active datacenter [puppet] - https://gerrit.wikimedia.org/r/289340 (https://phabricator.wikimedia.org/T134507)
[00:37:39] (sync-masters and sync-proxies ok, we're at the sync-apaches step)
[00:38:36] (CR) Dzahn: [C: 2] planet: only run updates when in active datacenter [puppet] - https://gerrit.wikimedia.org/r/289340 (https://phabricator.wikimedia.org/T134507) (owner: Dzahn)
[00:42:00] Operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#2303635 (Dzahn) >>! In T88761#2284248, @akosiaris wrote: > Given that we will probably have to do one more switchover in the next 6 months, I am thinking that the s...
[00:43:00] Operations, Wikimedia-Planet, Patch-For-Review: install planet2001 - https://phabricator.wikimedia.org/T134507#2303638 (Dzahn) Open>Resolved done. the cron jobs that run the actual feed updates are deactivated if the current DC is not the active DC
[00:45:10] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 33 failures
[00:45:10] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 48 failures
[00:45:34] sync-apaches: 42% (ok: 172; fail: 0; left: 229)
[00:48:39] Operations, Wikimedia-Planet: install planet2001 - https://phabricator.wikimedia.org/T134507#2303642 (Dzahn)
[00:48:47] (PS1) Dzahn: toolserver: add wiki.toolserver LE cert [puppet] - https://gerrit.wikimedia.org/r/289345 (https://phabricator.wikimedia.org/T134798)
[00:50:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 48 failures
[00:50:09] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 33 failures
[00:53:20] (PS2) Dzahn: toolserver: add wiki.toolserver LE cert [puppet] - https://gerrit.wikimedia.org/r/289345 (https://phabricator.wikimedia.org/T62220)
[00:55:09] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 144 seconds ago with 0 failures
[00:55:09] RECOVERY - check_puppetrun on fdb2001 is OK: OK: Puppet is currently enabled, last run 111 seconds ago with 0 failures
[00:56:01] Operations, Labs, Tool-Labs, Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2303651 (Dzahn) @Nemo_bis Possible, but we'll need it in DNS first, then Apache config for wiki.toolserver to work.. LE...
[00:56:17] (sync-apaches done)
[01:00:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[01:03:06] !log dereckson@tin Finished scap: Echo: Bring back Echo email messages (duration: 50m 15s)
[01:03:12] RoanKattouw: please test ^
[01:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:05:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[01:05:39] Dereckson: Hrmph I just realized it should have been cherry-picked to wmf1 instead :(
[01:05:40] Oh well
[01:05:58] The impact is fairly minor and wmf1 will be gone in 48h
[01:07:12] you could still do, I won't tell anyone
[01:07:31] It's the last window of the day, no other window is pending, we still have time for a full scap.
[01:07:38] OK
[01:07:41] * RoanKattouw cherry-picks
[01:07:54] Operations, Architecture, Incident-20150423-Commons, RESTBase, and 5 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#2303695 (RobLa-WMF)
[01:09:34] So during the scap, I noticed we had some stale files left in stale php-1.27.0-wmf.19
[01:10:09] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[01:10:12] I've asked on releng, thcipriani thinks we should run SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -F 20 -M -g mediawiki-installation -r ssh -o -oUser=mwdeploy -- rm -rf /srv/mediawiki/php-1.27.0-wmf.19
[01:10:16] per https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Remove_left_over_files_from_expired_branches
[01:10:41] I open a ticket so it could be a part of the next MediaWiki train or I do that now?
[01:12:10] Dereckson: you can open ticket and I'll take a look at it in the morning. I'd like to verify that what I think is the problem *is* the problem :)
[01:12:17] k
[01:12:26] thank you!
[01:14:47] Dereckson: https://gerrit.wikimedia.org/r/289346
[01:14:56] Sorry for the delay, I had to do a manual cherry-pick because of conflicts
[01:15:31] no problem
[01:23:07] Merged.
[01:24:20] ok on Tin, let's scap
[01:24:58] (PS1) Dzahn: toolserver: redirect some old tool URLs that still get 404s [puppet] - https://gerrit.wikimedia.org/r/289347 (https://phabricator.wikimedia.org/T85167)
[01:25:19] !log dereckson@tin Started scap: Echo: Bring back Echo email messages ([[Gerrit:289346]], for wmf1)
[01:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:27:06] (PS2) Dzahn: toolserver: redirect some old tool URLs that still get 404s [puppet] - https://gerrit.wikimedia.org/r/289347 (https://phabricator.wikimedia.org/T85167)
[01:28:08] mutante: can I add a PS3 with a fix for quentinv57 sul or should I send another change? it currently points to sulutil.php instead of sulinfo.php
[01:28:12] (PS3) Dzahn: toolserver: redirect some old tool URLs that still get 404s [puppet] - https://gerrit.wikimedia.org/r/289347 (https://phabricator.wikimedia.org/T85167)
[01:28:42] (CR) Dzahn: [C: 2] toolserver: redirect some old tool URLs that still get 404s [puppet] - https://gerrit.wikimedia.org/r/289347 (https://phabricator.wikimedia.org/T85167) (owner: Dzahn)
[01:29:01] Dereckson: eh, just saw. yes, go ahead and add PS3
[01:29:12] it's not merged because didnt have V+2 yet
[01:29:24] k
[01:33:09] (PS4) Dereckson: toolserver: redirect some old tool URLs that still get 404s [puppet] - https://gerrit.wikimedia.org/r/289347 (https://phabricator.wikimedia.org/T85167) (owner: Dzahn)
[01:34:22] (CR) Dereckson: "PS4: +/~vvv/sulutil.php → https://tools.wmflabs.org/quentinv57-tools/tools/sulinfo.php" [puppet] - https://gerrit.wikimedia.org/r/289347 (https://phabricator.wikimedia.org/T85167) (owner: Dzahn)
[01:36:00] (CR) Dzahn: [C: 2] toolserver: redirect some old tool URLs that still get 404s [puppet] - https://gerrit.wikimedia.org/r/289347 (https://phabricator.wikimedia.org/T85167) (owner: Dzahn)
[01:37:30] (CR) Ricordisamoa: "I don't maintain any of those tools but the redirects look fine." [puppet] - https://gerrit.wikimedia.org/r/289347 (https://phabricator.wikimedia.org/T85167) (owner: Dzahn)
[01:42:22] (PS5) Dzahn: Apache redirects for w.wiki [puppet] - https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) (owner: Dereckson)
[01:51:49] !log dereckson@tin Finished scap: Echo: Bring back Echo email messages ([[Gerrit:289346]], for wmf1) (duration: 26m 30s)
[01:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:52:12] RoanKattouw: ^
[01:52:22] Thanks, will test
[01:53:04] Working, thanks
[01:53:54] You're welcome. Thanks for cherry-pick/testing so late.
[01:54:56] Thank YOU for staying up so late o.O
[01:55:42] (For some reason I thought you were in Europe, but now I realize I don't actually know why I think that)
[01:56:54] Yes, I'm from Belgium.
[01:57:25] Aha
[01:57:32] * RoanKattouw is a Dutchman in San Francisco
[02:10:11] Operations, Documentation: put lint:ignore documention on wikitech - https://phabricator.wikimedia.org/T133222#2303774 (Dzahn) https://wikitech.wikimedia.org/w/index.php?title=Puppet_coding&type=revision&diff=533037&oldid=177752
[02:10:27] Operations, Documentation: put lint:ignore documention on wikitech - https://phabricator.wikimedia.org/T133222#2303776 (Dzahn)
[02:10:30] Operations, Need-volunteer, Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2303775 (Dzahn)
[02:10:49] Operations, Documentation: put lint:ignore documention on wikitech - https://phabricator.wikimedia.org/T133222#2225527 (Dzahn) Open>Resolved
[02:10:51] Operations, Need-volunteer, Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1192884 (Dzahn)
[02:11:31] Operations, Documentation: put lint:ignore documention on wikitech - https://phabricator.wikimedia.org/T133222#2225527 (Dzahn) http://puppet-lint.com/controlcomments/
[02:18:09] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail
[02:31:28] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.1) (duration: 11m 05s)
[02:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:45:59] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:59:31] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 11m 15s)
[02:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:09:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed May 18 03:09:11 UTC 2016 (duration 9m 40s)
[03:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:25:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 200, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR
[04:26:10] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR
[04:37:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 202, down: 0, dormant: 0, excluded: 0, unused: 0
[04:38:19] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[04:46:10] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR
[04:47:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 200, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR
[05:03:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 202, down: 0, dormant: 0, excluded: 0, unused: 0
[05:04:10] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[05:34:37] (PS1) Dzahn: rm files/misc/scripts/Makefile [puppet] - https://gerrit.wikimedia.org/r/289351
[05:38:24] (PS1) Dzahn: rm files/misc/scripts/rotate_fundraising_logs [puppet] - https://gerrit.wikimedia.org/r/289352
[05:43:18] (PS1) Dzahn: logging: move files/misc/demux.py to modules/udp2log [puppet] - https://gerrit.wikimedia.org/r/289353
[05:48:04] (PS1) Dzahn: mv files/misc/udp2log.init into modules/udp2log [puppet] - https://gerrit.wikimedia.org/r/289354
[06:21:43] Operations, Parsing-Team, Services, Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2303918 (mobrovac)
[06:24:46] Operations, Ops-Access-Requests, Patch-For-Review: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2303922 (Joe)
[06:24:48] Operations, Ops-Access-Requests, Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2303923 (Joe)
[06:29:25] (PS2) Muehlenhoff: Remove access credentials for kleduc [puppet] - https://gerrit.wikimedia.org/r/288153
[06:30:20] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:40] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:50] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:00] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:01] (CR) Muehlenhoff: Remove access credentials for kleduc [puppet] - https://gerrit.wikimedia.org/r/288153 (owner: Muehlenhoff)
[06:31:10] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:10] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:20] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:10] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:39] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:50] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:58] (CR) Muehlenhoff: [C: 2 V: 2] Remove access credentials for kleduc [puppet] - https://gerrit.wikimedia.org/r/288153 (owner: Muehlenhoff)
[06:38:48] !log restarted hhvm on mw1207
[06:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:40:20] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.052 second response time
[06:40:20] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 68446 bytes in 0.458 second response time
[06:53:40] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:54:09] !log restbase deploy start of 75a94ee
[06:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:55:35] Operations, MediaWiki-JobQueue, MediaWiki-JobRunner: Rationalize our jobqueues redis topology - https://phabricator.wikimedia.org/T135113#2303963 (ori) Yes, I think that makes sense. Thanks for suggesting it.
[06:56:20] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:40] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:50] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:56:51] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:09] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:20] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:20] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:29] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:58:29] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:49] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:43] !log restbase deploy end of 75a94ee
[07:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:15:57] (PS1) Mobrovac: service::node: Add a wrapper script for service_checker [puppet] - https://gerrit.wikimedia.org/r/289358
[07:19:18] (CR) Giuseppe Lavagetto: [C: -1] "Good idea, some minor comments" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/289358 (owner: Mobrovac)
[07:19:37] that was fast _joe_ :)
[07:19:42] i'm still waiting on the compiler
[07:19:43] hehe
[07:20:00] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:20:22] (CR) Muehlenhoff: "Thanks for the patch. The salt grains also need to be added to the debdeploy.conf template, though. I'll amend in a PS4" [puppet] - https://gerrit.wikimedia.org/r/289001 (https://phabricator.wikimedia.org/T134716) (owner: Dzahn)
[07:22:35] (PS4) Muehlenhoff: notebook: add debdeploy grains [puppet] - https://gerrit.wikimedia.org/r/289001 (https://phabricator.wikimedia.org/T134716) (owner: Dzahn)
[07:25:13] Operations, ops-eqiad: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2304005 (Joe) p:Triage>High a:Ottomata
[07:26:02] Operations, ops-eqiad: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2303007 (Joe) @ottomata I assigned the ticket to you since it seems you were actively working on it; feel free to assign it to someone else.
[07:27:25] (PS2) Jcrespo: Remove scap::clean now that scap is clean [puppet] - https://gerrit.wikimedia.org/r/289333 (owner: Thcipriani)
[07:29:01] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch: high load on elastic1001 - https://phabricator.wikimedia.org/T135509#2304013 (Joe) p:Triage>Normal a:Gehel
[07:30:18] (CR) Jcrespo: [C: 2] Remove scap::clean now that scap is clean [puppet] - https://gerrit.wikimedia.org/r/289333 (owner: Thcipriani)
[07:31:42] Operations, Ops-Access-Requests, Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2304016 (Joe) As a personal note: Marko was granted the right to run/disable/manage puppet because he is performing non-emergency coverage and regula...
[07:39:19] (PS2) Mobrovac: service::node: Add a wrapper script for service_checker [puppet] - https://gerrit.wikimedia.org/r/289358
[07:39:54] (CR) Mobrovac: service::node: Add a wrapper script for service_checker (2 comments) [puppet] - https://gerrit.wikimedia.org/r/289358 (owner: Mobrovac)
[07:42:09] Operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#2304022 (akosiaris) >>! In T88761#2303635, @Dzahn wrote: > Yes, it was easy enough. Just did it :). It's running but just the actual feed update crons are only acti...
[07:44:38] (CR) Mobrovac: "The PCC approves - https://puppet-compiler.wmflabs.org/2838/" [puppet] - https://gerrit.wikimedia.org/r/289358 (owner: Mobrovac)
[07:48:11] (PS5) Muehlenhoff: notebook: add debdeploy grains [puppet] - https://gerrit.wikimedia.org/r/289001 (https://phabricator.wikimedia.org/T134716) (owner: Dzahn)
[07:48:19] (CR) Muehlenhoff: [C: 2 V: 2] notebook: add debdeploy grains [puppet] - https://gerrit.wikimedia.org/r/289001 (https://phabricator.wikimedia.org/T134716) (owner: Dzahn)
[07:48:38] Operations, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2304050 (KartikMistry) Only packages those are arch:any need rebuild but I'll check against WMF policy and upd...
[07:48:54] Operations, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2304052 (KartikMistry)
[07:58:51] (PS3) Giuseppe Lavagetto: admin: add new group labnet-users [puppet] - https://gerrit.wikimedia.org/r/289343 (https://phabricator.wikimedia.org/T133992) (owner: Dzahn)
[07:59:07] (CR) Giuseppe Lavagetto: [C: 2 V: 2] admin: add new group labnet-users [puppet] - https://gerrit.wikimedia.org/r/289343 (https://phabricator.wikimedia.org/T133992) (owner: Dzahn)
[08:10:51] !log starting rolling restart of Elasticsearch equiad fro Java update (T135499)
[08:10:52] T135499: Restart elasticsearch clusters for Java update - https://phabricator.wikimedia.org/T135499
[08:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:16:40] (CR) Alexandros Kosiaris: [C: 2] service::node: Add a wrapper script for service_checker [puppet] - https://gerrit.wikimedia.org/r/289358 (owner: Mobrovac)
[08:16:45] (PS3) Alexandros Kosiaris: service::node: Add a wrapper script for service_checker [puppet] - https://gerrit.wikimedia.org/r/289358 (owner: Mobrovac)
[08:16:49] (CR) Alexandros Kosiaris: [V: 2] service::node: Add a wrapper script for service_checker [puppet] - https://gerrit.wikimedia.org/r/289358 (owner: Mobrovac)
[08:17:29] (CR) Mobrovac: [C: -1] keyholder: ops into trusted groups unconditionally (1 comment) [puppet] - https://gerrit.wikimedia.org/r/288624 (owner: Alexandros Kosiaris)
[08:36:16] !log performing schema change on s2 T130692
[08:36:17] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692
[08:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:36:53] Operations: mod_deflate + mod_uwsgi causing mangled apache responses - https://phabricator.wikimedia.org/T135595#2304125 (fgiunchedi)
[08:37:11] Operations, Traffic, Patch-For-Review: graphite.wikimedia.org 503s on some css/js resources - https://phabricator.wikimedia.org/T135515#2301423 (fgiunchedi) Open>Resolved a:fgiunchedi confirmed this is fixed on the graphite side, thanks @BBlack @ema ! re: mod_deflate, @elukey was mention...
[08:40:48] (CR) Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally (1 comment) [puppet] - https://gerrit.wikimedia.org/r/288624 (owner: Alexandros Kosiaris)
[08:40:53] (PS9) Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - https://gerrit.wikimedia.org/r/288624
[08:43:01] !log installing xerces-c security updates
[08:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:46:57] (PS10) Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - https://gerrit.wikimedia.org/r/288624
[08:49:49] (CR) Alexandros Kosiaris: [C: 2] "https://puppet-compiler.wmflabs.org/2840/tin.eqiad.wmnet/ says the expected thing. Thanks for the reviews, merging" [puppet] - https://gerrit.wikimedia.org/r/288624 (owner: Alexandros Kosiaris)
[08:50:13] (PS11) Alexandros Kosiaris: keyholder: ops into trusted groups unconditionally [puppet] - https://gerrit.wikimedia.org/r/288624
[08:50:21] (CR) Alexandros Kosiaris: [C: 2 V: 2] keyholder: ops into trusted groups unconditionally [puppet] - https://gerrit.wikimedia.org/r/288624 (owner: Alexandros Kosiaris)
[08:52:52] akosiaris: \o/ --^ thanks :)
[08:53:19] elukey: I was thinking it might be of use to you. it definitely is to me
[08:54:03] <_joe_> ops: granting themselves privileges since 2001
[08:54:20] hehe
[08:54:35] I was reviewed by at least 2 ppl !!
[08:54:47] and only 50% was ops :P
[08:55:42] and it was actuall Krenair that gave me the idea
[08:55:49] s/ll/ly/
[09:00:27] akosiaris: do we need to slap the keyholder on tin after the merge by any chance?
[09:00:50] (and the puppet run)
[09:01:43] elukey: quite possibly
[09:01:45] lemme see
[09:01:53] !log installing libarchive security updates
[09:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:07:33] elukey: a sudo service keyholder-proxy restart was required indeed
[09:07:46] but not a keyholder-agent restart (and the subsequent rearming)
[09:07:58] ahh only the proxy, makes sense!
[09:08:01] thanks!
[09:08:25] 09:08:12 Finished Deploy: analytics/aqs/deploy (duration: 00m 01s)
[09:08:29] so I am now able to deploy :)
[09:11:31] PROBLEM - DPKG on mw1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:15:30] RECOVERY - DPKG on mw1020 is OK: All packages OK
[09:15:58] ^that was me, fallout from updates
[09:18:13] (CR) Alexandros Kosiaris: [C: 2] sysfs: puppet always restarted the sysfsutils service [puppet] - https://gerrit.wikimedia.org/r/266684 (owner: Hashar)
[09:18:19] (PS5) Alexandros Kosiaris: sysfs: puppet always restarted the sysfsutils service [puppet] - https://gerrit.wikimedia.org/r/266684 (owner: Hashar)
[09:18:34] (CR) Alexandros Kosiaris: [V: 2] sysfs: puppet always restarted the sysfsutils service [puppet] - https://gerrit.wikimedia.org/r/266684 (owner: Hashar)
[09:18:50] (PS1) Giuseppe Lavagetto: openstack: allow unprivileged users to access nova logs [puppet] - https://gerrit.wikimedia.org/r/289370 (https://phabricator.wikimedia.org/T133992)
[09:18:55] (PS1) Giuseppe Lavagetto: openstack::nova::api: allow users to access api logs [puppet] - https://gerrit.wikimedia.org/r/289371 (https://phabricator.wikimedia.org/T133992)
[09:18:57] (PS1) Giuseppe Lavagetto: admin: add the releng team to labnet-users [puppet] - https://gerrit.wikimedia.org/r/289372 (https://phabricator.wikimedia.org/T133992)
[09:19:57] Blocked-on-Operations, Operations, RESTBase, RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2304204 (fgiunchedi) same error today after some hours on restbase2008-a ```lines=5 ERROR [STREAM-IN-/10.1...
[09:23:39] (CR) Alexandros Kosiaris: "needs a manual rebase btw" [puppet] - https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: Eevans)
[09:23:57] (CR) Alexandros Kosiaris: "un cherry-picked from beta because of the manual rebase need" [puppet] - https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: Eevans)
[09:25:40] (PS1) Mobrovac: service::update: Fix: Correct the she-bang for check-${title} [puppet] - https://gerrit.wikimedia.org/r/289374
[09:25:49] akosiaris: _joe_: ^^^
[09:26:04] needs to be merged otherwise we'll have all services start screaming at us
[09:26:34] (PS2) Giuseppe Lavagetto: service::update: Fix: Correct the she-bang for check-${title} [puppet] - https://gerrit.wikimedia.org/r/289374 (owner: Mobrovac)
[09:26:43] (CR) Giuseppe Lavagetto: [C: 2 V: 2] service::update: Fix: Correct the she-bang for check-${title} [puppet] - https://gerrit.wikimedia.org/r/289374 (owner: Mobrovac)
[09:26:57] thnx _joe_!
[09:26:59] I missed that :-(
[09:27:07] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch: high load on elastic1001 - https://phabricator.wikimedia.org/T135509#2304207 (Gehel)
[09:27:08] <_joe_> heh me too
[09:27:10] Operations, Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2304209 (Gehel)
[09:27:17] join the club :)
[09:27:30] <_joe_> mobrovac: you can run puppet now :P
[09:27:40] :)
[09:28:40] Blocked-on-Operations, Operations, DBA, Labs, Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2304213 (akosiaris) > @jcrespo this is a bit shrouded in mystery with no documentation. It seems post replication someone would run [[ https://phabricator.wik...
[09:35:33] !log installing jansson security updates
[09:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:43:46] Blocked-on-Operations, Operations, RESTBase, RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2304280 (fgiunchedi) same grep for stream id on restbase2008-a ```lines=5 restbase2008:/var/log$ fgrep 9a4...
[09:45:34] (PS1) Jcrespo: Install parallel package on mariadb::client(s) [puppet] - https://gerrit.wikimedia.org/r/289379
[09:46:00] PROBLEM - DPKG on rhenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:48:13] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2304310 (fgiunchedi) the raid arrays issue might be related to {T131961} though that should be fixed already in puppet for jessie, modulo rebuild of initramfs
[09:52:01] RECOVERY - DPKG on rhenium is OK: All packages OK
[09:54:07] akosiaris: I am super happy you reviewed and approved my sysfs puppet tweak ( https://gerrit.wikimedia.org/r/#/c/266684/ )
[09:54:22] I have learned about about service {} this way
[09:57:01] hashar, thank you for that, I saw such an issue recently but did not have time to investigate
[10:00:07] glad it help
[10:00:19] (PS1) Muehlenhoff: Amend changelog entry with recently assigned CVE ID [debs/linux44] - https://gerrit.wikimedia.org/r/289382
[10:00:29] I run puppet manually a lot, and found it disturbing :D
[10:01:36] <_joe_> did you check what you did works on jessie, btw?
[10:03:03] I did, on a couple of beta cluster instances using sysfsutils
[10:03:16] one is Trusty, deployment-sentry01.deployment-prep.eqiad.wmflabs
[10:03:17] the other is Jessie, deployment-sentry2.deployment-prep.eqiad.wmflabs
[10:03:43] it is on the CI slaves; if you have an idea for a test case I don't mind rebooting some as needed (got Precise/Trusty/Jessie there)
[10:05:47] (CR) Filippo Giunchedi: [C: 1] Cassandra 2.2.6 config [puppet] - https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: Eevans)
[10:08:14] (PS2) Filippo Giunchedi: cleanup uneeded jars [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/285712 (owner: Eevans)
[10:08:21] (CR) Filippo Giunchedi: [C: 2 V: 2] cleanup uneeded jars [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/285712 (owner: Eevans)
[10:08:36] _joe_: confirmed it works on Jessie. A reboot applied the sys parameter :-}
[10:08:39] (PS2) Filippo Giunchedi: descriptors should be world readable [puppet] - https://gerrit.wikimedia.org/r/286020 (owner: Eevans)
[10:09:09] (CR) Filippo Giunchedi: [C: 2 V: 2] descriptors should be world readable [puppet] - https://gerrit.wikimedia.org/r/286020 (owner: Eevans)
[10:10:08] (CR) Hashar: "Double checked on Jessie instance integration-slave-jessie-1001. After a reboot:" [puppet] - https://gerrit.wikimedia.org/r/266684 (owner: Hashar)
[10:12:54] (CR) Filippo Giunchedi: [C: 1] "LGTM, what sort of coordination is needed to deploy this after merge? bounce logstash manually ?" [puppet] - https://gerrit.wikimedia.org/r/278315 (owner: BryanDavis)
[10:13:47] <_joe_> hashar: my point was that with systemd you _have_ the status of the service correctly reported
[10:14:00] <_joe_> but it's ok, thanks for looking into it :)
[10:15:33] (PS5) Alexandros Kosiaris: Introduce service::uwsgi [puppet] - https://gerrit.wikimedia.org/r/288613
[10:17:14] (CR) Alexandros Kosiaris: Introduce service::uwsgi (1 comment) [puppet] - https://gerrit.wikimedia.org/r/288613 (owner: Alexandros Kosiaris)
[10:23:53] (CR) Alexandros Kosiaris: [C: 2] Introduce service::uwsgi [puppet] - https://gerrit.wikimedia.org/r/288613 (owner: Alexandros Kosiaris)
[10:35:29] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:36:30] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:37:01] PROBLEM - zotero on sca1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:37:18] euh?
[10:37:22] !log Updating cxserver to 700dac2
[10:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:37:45] mobrovac: any issue with sca02? I'm deploying cxserver.
[10:38:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (web site in alternative language) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (Zot
[10:38:37] <_joe_> citoid is indeed unreachable, it seems
[10:38:43] <_joe_> because zotero?
[10:38:45] !log zotero restarted it on sca1002, was on 50% mem and 100% cpu
[10:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:38:52] <_joe_> exactly
[10:39:00] kart_, do not try to use SELECT INTO OUTFILE - it will try to create a file on the *DB* server
[10:39:00] RECOVERY - zotero on sca1002 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.009 second response time
[10:39:02] <_joe_> see? our paging on lvs works pretty well :P
[10:39:14] jynus: got it. thanks.
[10:39:21] (CR) Filippo Giunchedi: Initial debianization (1 comment) [debs/mcrouter] - https://gerrit.wikimedia.org/r/288196 (https://phabricator.wikimedia.org/T132317) (owner: Giuseppe Lavagetto)
[10:39:30] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[10:39:31] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:39:42] jynus: do you know any easy way to get the output of a query to a local machine?
[10:39:45] !log zotero restarted it on sca100, was on 50% mem
[10:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:40:03] <_joe_> mobrovac: also thanks for being that fast
[10:40:07] that should have been s/sca100/sca1001/ but oh well
[10:40:15] kart_, tell me what you really need
[10:40:24] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[10:40:38] kart_: sca100x are ok with regards to cxserver / apertium
[10:40:45] (CR) Filippo Giunchedi: "FTBFS for me with "git-buildpage --git-pbuilder"" [debs/mcrouter] - https://gerrit.wikimedia.org/r/288196 (https://phabricator.wikimedia.org/T132317) (owner: Giuseppe Lavagetto)
[10:41:24] jynus: need to run a query which has long output.
[10:41:38] mobrovac: thanks. deployment was successful.
[10:41:48] <_joe_> mobrovac: am I wrong or is this the first time we get an alert from the LVS checks?
[10:41:56] !log Updated cxserver to 700dac2
[10:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:42:33] _joe_: i think so, yes
[10:42:39] kart_, how long would the output be? 1000? 1 million rows?
[10:42:58] <_joe_> mobrovac: that's good, this was a real alert
[10:43:24] \o/
[10:43:37] and is it a one-time thing or do you want to do it regularly?
[10:43:41] jynus: 15000 right now.
[10:43:50] jynus: once a week.
[10:44:07] ok, long by human standards, but not too long for a server
[10:44:32] I would say that if it is a regular thing, you should ask for analytics access
[10:44:57] jynus: eventually, yes. this comes from wikishared db though.
[10:45:03] and you will be able to run mysqldump and export to a text file
[10:45:15] mmm, I think I have wikishared replicated there, too
[10:45:17] let me check
[10:45:50] jynus: ok. That will be great, as I guess amire80 also runs regular queries there
[10:46:11] AFAIK, there was a request to replicate it to the labs db.
[10:46:17] the important thing is that long running queries can impact production performance
[10:46:41] analytics slaves are prepared to do anything(*) you want without affecting other production slaves
[10:46:57] jynus: ok. can you check and let me know.
[10:47:05] jynus: I'll check with Amir too.
[10:47:14] kart_, yes, I can confirm cx* tables are on the analytics slave
[10:47:34] jynus: cool. Let me head towards analytics.
[10:47:39] jynus: thanks.
[10:48:20] I think you have to formally request access if you do not have it; add me to a ticket and I will explain the reasons
[10:48:32] OK.
[10:49:22] if you want a quick one-time download, send me a query and I will do it once for you
[10:49:46] but please request access for subsequent queries
[11:04:47] Operations, MediaWiki-General-or-Unknown: Special pages on cswiki not updated for longer than usual (since May 12th 6am) - https://phabricator.wikimedia.org/T135326#2304453 (Danny_B) Open>declined Updated on 2016-05-17 17:** UTC
[11:08:27] (PS1) Muehlenhoff: Blacklist asn1_decoder kernel module [puppet] - https://gerrit.wikimedia.org/r/289389
[11:09:31] Operations, MediaWiki-General-or-Unknown: Special pages on cswiki not updated for longer than usual (since May 12th 6am) - https://phabricator.wikimedia.org/T135326#2295392 (jcrespo) There were some issue with dewiki vslow db host recently, now solved, could be related.
[11:16:06] (CR) Muehlenhoff: [C: 2 V: 2] Amend changelog entry with recently assigned CVE ID [debs/linux44] - https://gerrit.wikimedia.org/r/289382 (owner: Muehlenhoff)
[11:17:42] !log install expat security updates
[11:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:21:49] (PS1) ArielGlenn: filter out rsync warnings about vanishing files, for kiwix rsync [puppet] - https://gerrit.wikimedia.org/r/289391
[11:22:56] (CR) ArielGlenn: [C: 2] filter out rsync warnings about vanishing files, for kiwix rsync [puppet] - https://gerrit.wikimedia.org/r/289391 (owner: ArielGlenn)
[11:31:24] Operations, Scap3: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2304520 (mobrovac)
[11:33:22] Operations, ops-esams, DC-Ops, netops: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#2304535 (mark)
[11:36:56] morebots: ping
[11:36:56] I am a logbot running on tools-exec-1220.
[11:36:56] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[11:36:56] To log a message, type !log .
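A note on jynus's advice to kart_ earlier in the log: SELECT ... INTO OUTFILE writes its file on the database server's filesystem, so a client that wants the rows locally has to fetch them over the connection and write the file itself. Below is a minimal sketch of that client-side export, assuming any Python DB-API connection. The standard-library sqlite3 module stands in for the real replica so the sketch runs without a server, and the table, column and file names are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def export_query_to_tsv(conn, query, path, batch_size=1000):
    """Run `query` on `conn` and write the rows to a local TSV file.

    Rows are fetched in batches so a long result set is never held in
    memory all at once. Returns the number of data rows written.
    """
    cur = conn.cursor()
    cur.execute(query)
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        # Header row taken from the cursor metadata (DB-API `description`).
        writer.writerow([col[0] for col in cur.description])
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            writer.writerows(rows)
            written += len(rows)
    return written


if __name__ == "__main__":
    # Stand-in database; in practice `conn` would point at a replica
    # the user has been granted access to (hypothetical table below).
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE cx_demo (id INTEGER, lang TEXT)")
    demo.executemany("INSERT INTO cx_demo VALUES (?, ?)", [(1, "es"), (2, "fr")])
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "cx_demo.tsv")
        count = export_query_to_tsv(demo, "SELECT id, lang FROM cx_demo ORDER BY id", out)
        print(count, "rows written to", out)
```

For a weekly job of roughly 15000 rows, as mentioned in the conversation, the batch size hardly matters; it is there so the same function stays safe on much larger results.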
[11:41:16] !log rolling restart of Elasticsearch eqiad for Java update completed (T135499)
[11:41:17] T135499: Restart elasticsearch clusters for Java update - https://phabricator.wikimedia.org/T135499
[11:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:41:28] !log starting rolling restart of Elasticsearch codfw for Java update (T135499)
[11:41:29] T135499: Restart elasticsearch clusters for Java update - https://phabricator.wikimedia.org/T135499
[11:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:55:27] (CR) Hashar: "/var/log/nova is created by the nova-common package ( nova-common.dirs )" [puppet] - https://gerrit.wikimedia.org/r/289370 (https://phabricator.wikimedia.org/T133992) (owner: Giuseppe Lavagetto)
[11:57:21] (CR) Hashar: [C: 1] admin: add the releng team to labnet-users [puppet] - https://gerrit.wikimedia.org/r/289372 (https://phabricator.wikimedia.org/T133992) (owner: Giuseppe Lavagetto)
[11:57:54] Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2304642 (elukey) I took a look to Ganglia's mem_report and all the caches seems to have almost recovered from the last restart event, except of course mc1009...
[12:14:10] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80351 MB (15% inode=99%)
[12:17:41] (PS1) Sbisson: Remove EchoBundleEmailInterval [mediawiki-config] - https://gerrit.wikimedia.org/r/289395 (https://phabricator.wikimedia.org/T135446)
[12:23:02] Operations, Parsoid, Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2304716 (ssastry) >>! In T135176#2303498, @GWicke wrote: > @Arlolra, if you look at the dates & the discussion you'll notice that a) there is no dependency on node v4.x, and b) ther...
[12:27:30] PROBLEM - Disk space on elastic1015 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80546 MB (15% inode=99%)
[12:29:26] (PS1) ArielGlenn: fix link removal for cirrus cearch dumps cleanup [puppet] - https://gerrit.wikimedia.org/r/289397
[12:31:42] (CR) ArielGlenn: [C: 2] fix link removal for cirrus cearch dumps cleanup [puppet] - https://gerrit.wikimedia.org/r/289397 (owner: ArielGlenn)
[12:37:50] (PS19) Eevans: Cassandra 2.2.6 config [puppet] - https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629)
[12:40:30] RECOVERY - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is OK: TCP OK - 0.001 second response time on port 9042
[12:40:49] RECOVERY - AQS root url on aqs1004 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.024 second response time
[12:42:00] PROBLEM - Disk space on elastic1015 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80019 MB (15% inode=99%)
[12:42:51] ACKNOWLEDGEMENT - Disk space on elastic1015 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80019 MB (15% inode=99%): Gehel I will rebalance cluster...
[12:45:46] !log elasticsearch eqiad - reducing high watermark to rebalance disk space
[12:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:48:07] (CR) Eevans: "Rebased and (re)cherry-picked to beta." [puppet] - https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: Eevans)
[12:48:59] RECOVERY - Disk space on elastic1016 is OK: DISK OK
[12:50:10] RECOVERY - Disk space on elastic1015 is OK: DISK OK
[12:51:39] RECOVERY - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on port 9042
[12:53:30] RECOVERY - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is OK: TCP OK - 0.005 second response time on port 9042
[12:54:17] --^ this is me trying to bootstrap a cassandra cluster
[12:55:48] (PS1) BBlack: varnishxcps: use send interval, def 30s [puppet] - https://gerrit.wikimedia.org/r/289401
[12:57:10] Operations, Scap3: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2304817 (fgiunchedi) I can't reproduce the systemctl/service behaviour you mentioned on a different service, `restart` seems to do the right thing on a jessie labs instance even on a just...
[13:03:23] (CR) ArielGlenn: [C: 1] Install parallel package on mariadb::client(s) [puppet] - https://gerrit.wikimedia.org/r/289379 (owner: Jcrespo)
[13:05:27] (PS1) Jdrewniak: T133732 Updating footer on www.wikipedia.org portal. [mediawiki-config] - https://gerrit.wikimedia.org/r/289403 (https://phabricator.wikimedia.org/T133732)
[13:06:19] Operations, Parsoid, Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2304855 (mobrovac) >>! In T135176#2304716, @ssastry wrote: >>>! In T135176#2303498, @GWicke wrote: >> @Arlolra, if you look at the dates & the discussion you'll notice that a) there...
[13:08:07] elukey: \o/
[13:08:55] mobrovac: two instances up but can't see them talking for the moment, a bit weird :P
[13:09:58] firewall issues perhaps?
[13:10:10] even though there shouldn't be any
[13:10:41] elukey: what do you mean by "can't see them talking" exactly ?
[13:10:55] (CR) Jcrespo: [C: 2] Install parallel package on mariadb::client(s) [puppet] - https://gerrit.wikimedia.org/r/289379 (owner: Jcrespo)
[13:11:19] mobrovac: I used auto_bootstrap: false to bring up cassandra-a/b on aqs1004, but nodetool status doesn't show me two instances
[13:11:24] (PS2) Jcrespo: Install parallel package on mariadb::client(s) [puppet] - https://gerrit.wikimedia.org/r/289379
[13:11:37] also in the logs I can't see the info about "hey I found a new host!"
[13:11:47] (CR) Jcrespo: [V: 2] Install parallel package on mariadb::client(s) [puppet] - https://gerrit.wikimedia.org/r/289379 (owner: Jcrespo)
[13:12:00] elukey: ah, iirc, before you bootstrap the first node you need to fix the seeds in its cassandra.yaml manually to point only to itself
[13:12:11] godog: is ^ correct?
[13:12:58] mobrovac: I thought that auto_bootstrap: false was enough - sigh
[13:13:21] mhh I can't remember exactly if there's a case where having the node itself in the seeds is allowed
[13:14:08] godog: iirc, in general that's a problem, but for the first node it needs to know it's the only one in the cluster
[13:14:24] i may have bad memory on the subject though
[13:14:42] https://wikitech.wikimedia.org/wiki/Cassandra#Bootstrap_a_brand_new_cluster
[13:15:01] so I'll try to bring up all the instances
[13:15:05] heh with a wiki, no memory is needed :)
[13:15:30] then probably with some the right dosage of "encouragement" they will create the ring :)
[13:15:44] *with the right
[13:19:21] (PS1) ArielGlenn: direct output from cron job for full dumps to log files [puppet] - https://gerrit.wikimedia.org/r/289406
[13:20:37] (CR) Faidon Liambotis: [C: 1] "Sounds good to me." [puppet] - https://gerrit.wikimedia.org/r/289389 (owner: Muehlenhoff)
[13:29:02] !log resize volume for nfs dumps per T134896
[13:29:03] T134896: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896
[13:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:30:11] RECOVERY - cassandra-b CQL 10.64.48.149:9042 on aqs1006 is OK: TCP OK - 0.009 second response time on port 9042
[13:30:18] (CR) Alexandros Kosiaris: "isn't this done in a similar way in https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/hieradata/common/discovery.yaml ?" [puppet] - https://gerrit.wikimedia.org/r/289334 (owner: Dzahn)
[13:32:49] Operations, MediaWiki-General-or-Unknown: Special pages on cswiki not updated for longer than usual (since May 12th 6am) - https://phabricator.wikimedia.org/T135326#2304927 (Bawolff) >>! In T135326#2304468, @jcrespo wrote: > There were some issue with dewiki vslow db host recently, now solved, could be r...
[13:36:53] (PS1) BBlack: config-geo: default EU=>esams [dns] - https://gerrit.wikimedia.org/r/289408
[13:36:55] (PS1) BBlack: config-geo: default AS=>ulsfo [dns] - https://gerrit.wikimedia.org/r/289409
[13:38:05] (CR) BBlack: [C: 2] config-geo: default EU=>esams [dns] - https://gerrit.wikimedia.org/r/289408 (owner: BBlack)
[13:38:28] Operations, Scap3: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2304941 (mobrovac) Indeed @fgiunchedi, using the commands directly on the target yields a different result: ``` root@deployment-mathoid:~# systemctl stop mathoid root@deployment-mathoid...
[13:38:49] mobrovac: ah you were maybe referring to https://wikitech.wikimedia.org/wiki/Cassandra#Adding_a_new_.28empty.29_node ?
[13:38:51] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:39:26] ah!
[13:40:53] also I don't see the aqs1004-b instance listed among the seeds of the a instance
[13:40:56] - seeds: aqs1005-a.eqiad.wmnet,aqs1005-b.eqiad.wmnet,aqs1006-a.eqiad.wmnet,aqs1006-b.eqiad.wmnet
[13:41:33] nor the instance itself (?)
[13:42:27] the instance itself shouldn't be there
[13:42:31] so that's ok
[13:42:42] but aqs1004-b should be there
[13:43:02] that's probably a bug in our erb template
[13:43:14] we likely filter it out because it's on the same host
[13:43:28] but that's not a concern
[13:43:57] oh elukey, are you trying to bring up aqs1004-a and then aqs1004-b before aqs100[56]-* ?
[13:44:06] if so, it's clear why they are not talking to each other
[13:45:33] mobrovac: I am following the wiki docs and http://docs.datastax.com/en/cassandra/1.2/cassandra/initialize/initializeSingleDS.html, in which it seems that each node must be brought up with auto-bootstrap false
[13:45:50] what would be your suggestion to bootstrap the cluster?
[13:46:09] hm, that does not really answer my question, though
[13:46:18] if you bring up aqs1004-a
[13:46:20] (CR) Jcrespo: [C: 2] Add interval parameter, and change the default to 1 beat per second [puppet/mariadb] - https://gerrit.wikimedia.org/r/289177 (https://phabricator.wikimedia.org/T133337) (owner: Jcrespo)
[13:46:30] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:46:37] and then do aqs1004-b, it won't see aqs1004-a because it's not in its list of seeds
[13:46:44] and it's the only node that's up
[13:46:57] yep.. but 1006 a/b are up now
[13:47:04] (PS1) Cmjohnson: Removing duplicate entries of mw1305/1306 [dns] - https://gerrit.wikimedia.org/r/289410
[13:47:10] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:47:53] ^that's me
[13:47:56] mobrovac: I got your point, if 1004-a doesn't talk with its configured seeds then it can figure out how the cluster is done
[13:48:13] *it can't
[13:48:16] elukey: the docs you gave the link for are for an old cass version, not sure it applies any more
[13:48:17] (need coffee)
[13:48:49] mobrovac: yeah I was looking for the newest ones, but they seemed consistent with our wikitech page
[13:48:51] (CR) Cmjohnson: [C: 2] Removing duplicate entries of mw1305/1306 [dns] - https://gerrit.wikimedia.org/r/289410 (owner: Cmjohnson)
[13:49:03] elukey: random drive-by comment, but make sure all nodes are also firewalled off, including the main node ip address i.e. aqs1004 not only aqs1004-[ab]
[13:49:52] elukey: a node doesn't need to know *all* of the seeds, it just needs one, but if you are bootstrapping aqs1004-b and only aqs1004-a is present, then nothing won't happen as they are not in each other's seeds
[13:50:04] (PS2) Jcrespo: Enable heartbeat on all masters, even on the pasive datacenter [puppet] - https://gerrit.wikimedia.org/r/289178 (https://phabricator.wikimedia.org/T133337)
[13:50:09] s/won't/will/
[13:50:15] i need to eat
[13:50:18] * mobrovac lunch
[13:51:28] godog: sure thanks! You made me realize that I might have been using ferm rules for the aqs cluster that don't have multi-instance settings
[13:51:39] (PS1) Alexandros Kosiaris: service::uwsgi: Use Uwsgi::App [puppet] - https://gerrit.wikimedia.org/r/289412
[13:52:11] elukey: ok! I came across a similar problem https://phabricator.wikimedia.org/T128590
[13:52:17] (CR) Alexandros Kosiaris: [C: 2 V: 2] service::uwsgi: Use Uwsgi::App [puppet] - https://gerrit.wikimedia.org/r/289412 (owner: Alexandros Kosiaris)
[13:53:49] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:55:29] !log disabling puppet on all database masters to test replication monitoring change T133337
[13:55:29] T133337: Automate database datacenter switchover steps - https://phabricator.wikimedia.org/T133337
[13:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:56:34] (PS3) Jcrespo: Enable heartbeat on all masters, even on the pasive datacenter [puppet] - https://gerrit.wikimedia.org/r/289178 (https://phabricator.wikimedia.org/T133337)
[13:56:50] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail
[13:56:59] (CR) Jcrespo: [C: 2 V: 2] Enable heartbeat on all masters, even on the pasive datacenter [puppet] - https://gerrit.wikimedia.org/r/289178 (https://phabricator.wikimedia.org/T133337) (owner: Jcrespo)
[13:57:53] !log downtime for dataset1001 puppet runs as T134896 causes failure (temporary for resize)
[13:57:54] T134896: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896
[13:57:59] apergos: ^ fyi
[13:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:58:08] okey dokey
[13:58:09] akosiaris, ok to merge?
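The seed argument mobrovac makes in the Cassandra discussion above (a starting instance can only find the ring through a seed that is already up) can be sketched as a small reachability check. This is a toy model under stated assumptions, not Cassandra's gossip protocol: the seed list for aqs1004-a is the one quoted in the log, the identical list for aqs1004-b is an assumption based on the remark that the two are not in each other's seeds, and `visible_peers` is a hypothetical helper.

```python
from collections import deque


def visible_peers(node, seeds, up):
    """Return the set of live nodes that `node` can learn about.

    Toy model: each live node contacts the live seeds from its own
    config, contact is mutual, and knowledge spreads transitively.
    """
    if node not in up:
        return set()
    # Undirected "has talked to" links between live nodes.
    links = {n: set() for n in up}
    for n in up:
        for s in seeds.get(n, ()):
            if s in up and s != n:
                links[n].add(s)
                links[s].add(n)
    seen, queue = {node}, deque([node])
    while queue:
        cur = queue.popleft()
        for nxt in links[cur] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen


# Seed list quoted in the log for aqs1004-a; assumed identical for aqs1004-b.
AQS1004_SEEDS = ["aqs1005-a", "aqs1005-b", "aqs1006-a", "aqs1006-b"]
SEEDS = {"aqs1004-a": AQS1004_SEEDS, "aqs1004-b": AQS1004_SEEDS}

if __name__ == "__main__":
    # Only the two aqs1004 instances are up: neither can find the other.
    print(sorted(visible_peers("aqs1004-b", SEEDS, {"aqs1004-a", "aqs1004-b"})))
    # Once one configured seed is up, both instances meet through it.
    print(sorted(visible_peers("aqs1004-b", SEEDS, {"aqs1004-a", "aqs1004-b", "aqs1006-a"})))
```

In this model a single live seed is enough to join the instances, which matches the remark that a node "doesn't need to know *all* of the seeds, it just needs one".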
[13:59:08] I will wait for your ok
[14:00:08] (PS1) Gergő Tisza: Set $wgDisableAuthManager = true [mediawiki-config] - https://gerrit.wikimedia.org/r/289413 (https://phabricator.wikimedia.org/T135498)
[14:00:24] Operations, MediaWiki-General-or-Unknown: Special pages on cswiki not updated for longer than usual (since May 12th 6am) - https://phabricator.wikimedia.org/T135326#2304978 (Bawolff) Digging further, here are some timings for expensive special pages: * Special:ListRedirects: 08:17, 16 May 2016 UTC * Spe...
[14:02:01] (PS1) Cmjohnson: Adding mac addressed for mw1305/1306 to dhcpd file [puppet] - https://gerrit.wikimedia.org/r/289417
[14:02:12] Operations, Dumps-Generation, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2304979 (ArielGlenn) Well I now see why people couldn't reproduce: my nice copy-pastes were transformed by phab into weirdness, with comments stripped ou...
[14:03:09] Operations, MediaWiki-General-or-Unknown: Special pages on cswiki not updated for longer than usual (since May 12th 6am) - https://phabricator.wikimedia.org/T135326#2304980 (Dvorapa) if that problem with dewiki is solved, then thnak you for closing this
[14:04:10] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:04:50] (CR) Cmjohnson: [C: 2] Adding mac addressed for mw1305/1306 to dhcpd file [puppet] - https://gerrit.wikimedia.org/r/289417 (owner: Cmjohnson)
[14:05:52] jynus I have a change mixed in with yours. feel free to merge
[14:06:28] I was waiting for alex, he already merged
[14:06:44] yeah, thanks
[14:06:57] (PS1) Alexandros Kosiaris: service::uwsgi: log dir owned by www-data [puppet] - https://gerrit.wikimedia.org/r/289418
[14:06:57] cmjohnson1, done
[14:07:14] thx
[14:10:48] (CR) Volans: "Minor cosmetics" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/289178 (https://phabricator.wikimedia.org/T133337) (owner: Jcrespo)
[14:11:32] (CR) Alexandros Kosiaris: [C: 2] service::uwsgi: log dir owned by www-data [puppet] - https://gerrit.wikimedia.org/r/289418 (owner: Alexandros Kosiaris)
[14:11:37] (PS2) Alexandros Kosiaris: service::uwsgi: log dir owned by www-data [puppet] - https://gerrit.wikimedia.org/r/289418
[14:11:37] Volans I left it on purpose for legibility reasons - it may be needed in the future
[14:11:41] (CR) Alexandros Kosiaris: [V: 2] service::uwsgi: log dir owned by www-data [puppet] - https://gerrit.wikimedia.org/r/289418 (owner: Alexandros Kosiaris)
[14:11:49] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:15:51] !log rolling restart of hhvm in codfw to pick up expat security update
[14:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:16:51] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:17:07] (CR) Jgreen: [C: 2] "yes, you're right--this script was deprecated when we switched to the kafka pipeline" [puppet] - https://gerrit.wikimedia.org/r/289352 (owner: Dzahn)
[14:18:00] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[14:19:06] Operations, Parsoid, Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2304993 (ssastry) >>! In T135176#2304855, @mobrovac wrote: > TL;DR that PR tries to address issues in services using #service-runner, not the package itself. Ah, okay. That is clea...
[14:19:35] I will be doing "killall perl && rm /var/run/pt-heartbeat.pid && puppet agent -tv" on all eqiad masters
[14:20:19] double pt-heartbeat will simplify master failover, too
[14:20:54] jynus: probably worth downtiming the icinga lag checks for ~30 min?
[14:21:08] I downtimed puppet
[14:21:13] it is not needed
[14:21:32] checks from codfw will create monitor HA
[14:22:09] that is the whole point of this :-)
[14:22:27] in the Switch_Datacenter doc there is the command to kill pt-heartbeat specifically
[14:22:58] ah, ok, will check it - in any case the world is better without perl
[14:25:30] I am wondering whether new fields (gtid and datacenter) would be useful to prevent split brain
[14:25:47] (CR) Luke081515: [C: 1] Set $wgDisableAuthManager = true [mediawiki-config] - https://gerrit.wikimedia.org/r/289413 (https://phabricator.wikimedia.org/T135498) (owner: Gergő Tisza)
[14:26:41] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[14:32:19] Operations, Scap3: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2305049 (Joe) p:Triage>Normal
[14:33:17] <_joe_> mobrovac: taking a look now
[14:33:30] <_joe_> mobrovac: can I abuse mathoid in deployment-prep for a bit?
[14:33:53] (CR) Ema: [C: 1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/289409 (owner: BBlack)
[14:37:20] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:37:39] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api
[14:37:48] (PS2) Muehlenhoff: Blacklist asn1_decoder kernel module [puppet] - https://gerrit.wikimedia.org/r/289389
[14:37:56] Blocked-on-Operations, Operations, DBA, Labs, Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2305057 (chasemp)
[14:38:30] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:40:00] (PS2) BBlack: varnishxcps: use send interval, def 30s [puppet] - https://gerrit.wikimedia.org/r/289401
[14:40:21] (CR) BBlack: [C: 2 V: 2] varnishxcps: use send interval, def 30s [puppet] - https://gerrit.wikimedia.org/r/289401 (owner: BBlack)
[14:40:50] PROBLEM - Citoid LVS eqiad on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api
[14:41:13] (PS2) BBlack: config-geo: default AS=>ulsfo [dns] - https://gerrit.wikimedia.org/r/289409
[14:41:27] mobrovac: I figured out the problem!
[14:41:49] (CR) BBlack: [C: 2] config-geo: default AS=>ulsfo [dns] - https://gerrit.wikimedia.org/r/289409 (owner: BBlack)
[14:42:08] mobrovac: that was me.. I needed to specify the port for nodetool
[14:42:25] because otherwise it will check for the default JMX one
[14:42:30] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:42:52] I was puzzled by the fact that nodetool showed some results for localhost but I was seeing handshakes in the logs
[14:42:56] sigh
[14:43:03] godog --^
[14:43:31] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[14:43:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:44:41] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[14:44:49] RECOVERY - Citoid LVS eqiad on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[14:44:56] (CR) Alexandros Kosiaris: "Already tested in labs migrating a new web VM (ores-akosiaris02) to this scheme. With the extra step of having to declare a scap::deployme" [puppet] - https://gerrit.wikimedia.org/r/288618 (owner: Alexandros Kosiaris)
[14:45:28] (CR) Yuvipanda: [C: -1] "This reduces the number of uwsgi processes from corecount * 4 to just corecount." [puppet] - https://gerrit.wikimedia.org/r/288618 (owner: Alexandros Kosiaris)
[14:49:41] (PS3) Muehlenhoff: Blacklist asn1_decoder kernel module [puppet] - https://gerrit.wikimedia.org/r/289389
[14:49:47] (CR) Muehlenhoff: [C: 2 V: 2] Blacklist asn1_decoder kernel module [puppet] - https://gerrit.wikimedia.org/r/289389 (owner: Muehlenhoff)
[14:50:39] Operations, Scap3: Scap3 doesn't start the service on Jessie if it's down - https://phabricator.wikimedia.org/T135609#2305071 (Joe) So, I watched the auth log on reployment-mathoid while running scap and it seems that when mathoid is up before we run scap we do issue a restart: ``` May 18 14:47:22 deplo...
[14:50:48] (CR) Andrew Bogott: "You're right, nova-api runs as 'nova'." [puppet] - https://gerrit.wikimedia.org/r/289370 (https://phabricator.wikimedia.org/T133992) (owner: Giuseppe Lavagetto)
[14:51:21] Operations, Scap3: Scap3 calls the checker script before restarting the service, not able to restart a service if it's down. - https://phabricator.wikimedia.org/T135609#2305072 (Joe)
[14:55:50] (CR) Giuseppe Lavagetto: [C: -1] "For the record, the 'root' owner is my typo and it should've been 'nova'." [puppet] - https://gerrit.wikimedia.org/r/289370 (https://phabricator.wikimedia.org/T133992) (owner: Giuseppe Lavagetto)
[14:56:23] (CR) Alexandros Kosiaris: "ok, number of processes is easily fixed. Will do that. regarding the redirect, we definitely not need it in production. We should probably" [puppet] - https://gerrit.wikimedia.org/r/288618 (owner: Alexandros Kosiaris)
[14:56:26] <_joe_> sorry, it was obvious to me as soon as I opened the file
[14:57:53] (PS1) Elukey: Fix typo in IP address settings for aqs1005.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/289424 (https://phabricator.wikimedia.org/T135145)
[15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160518T1500). Please do the needful.
[15:00:04] jan_drewniak tgr: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:49] \O/
[15:01:00] Hello.
[15:01:05] tyler is coming
[15:01:05] o/
[15:01:16] I can SWAT.
[15:02:04] (CR) Ladsgroup: "Okay, let me check what I can do about redirect in nginx" [puppet] - https://gerrit.wikimedia.org/r/288618 (owner: Alexandros Kosiaris)
[15:02:20] !log Zuul/Nodepool is out of instances. Looking
[15:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:02:28] (PS2) Andrew Bogott: Organize Ganglia cluster names for labs [puppet] - https://gerrit.wikimedia.org/r/289315
[15:03:00] somehow Nodepool no longer spawns instances for CI, bah ( digs in labnodepool1001.eqiad.wmnet )
[15:03:36] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/289403 (https://phabricator.wikimedia.org/T133732) (owner: Jdrewniak)
[15:03:47] thcipriani: you will most probably have to force merge
[15:04:16] Blocked-on-Operations, Operations, DBA, Labs, Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2305116 (jcrespo)
[15:04:26] hashar: https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Troubleshooting#Nodepool ?
[15:04:27] (Merged) jenkins-bot: T133732 Updating footer on www.wikipedia.org portal. [mediawiki-config] - https://gerrit.wikimedia.org/r/289403 (https://phabricator.wikimedia.org/T133732) (owner: Jdrewniak)
[15:04:53] hashar: Should we convert php55lint to nodepool
[15:04:57] The basic test.
[15:05:39] RECOVERY - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is OK: TCP OK - 0.002 second response time on port 9042
[15:05:41] (CR) Andrew Bogott: [C: 2] Organize Ganglia cluster names for labs [puppet] - https://gerrit.wikimedia.org/r/289315 (owner: Andrew Bogott)
[15:06:07] mobrovac, godog: cassandra cluster up :)
[15:06:34] thcipriani: yup looks relevant. Though openstack states a bunch of instances are in ERROR or BUILD state
[15:07:05] Operations, Parsoid, Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2305130 (Arlolra) > service-runner loads some global-level goodies, such as corejs ES6 shims, which in this way become available to the service's code automagically. That seems pot...
[15:08:02] hashar: a quick glance at labnodepool1001—looks different than the delete problem. [15:08:57] (03CR) 10Elukey: [C: 032 V: 032] Fix typo in IP address settings for aqs1005.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/289424 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [15:09:12] !log Stopping Nodepool on labnodepool1001 , it can't spawn instances [15:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:02] !log SWAT: running sync-portals for [[gerrit:289403|Updating footer on www.wikipedia.org portal]] [15:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:56] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 33s) [15:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:06] hashar: It seems somone created a autoload_static file in mediawiki/vendor [15:11:16] Im getting [15:11:17] [18-May-2016 12:36:08 UTC] PHP Warning: require_once(/home/randomwi/public_html/en/vendor/composer/autoload_static.php): failed to open stream: No such file or directory in /home/randomwi/public_html/en/vendor/composer/autoload_real.php on line 32 [15:11:29] !log thcipriani@tin Synchronized portals: (no message) (duration: 00m 32s) [15:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:37] ^ jan_drewniak check please [15:11:49] I didnt generate it. I go it straight of mediawiki/vendor on github. Branch wmf 2 1.28 [15:12:43] thcipriani: can you run the sync portals script once again? I'm getting a weird error [15:12:54] !log Nodepool / labs instance issue filled as https://phabricator.wikimedia.org/T135631 [15:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:04] jzerebecki: ^^ [15:13:07] About composer [15:13:12] jan_drewniak: sure [15:13:20] elukey: nice! 
[15:13:26] (03CR) 10BryanDavis: "Puppet will automatically restart the Logstash process when the conf files change. It would probably be a good idea to force a post-merge " [puppet] - 10https://gerrit.wikimedia.org/r/278315 (owner: 10BryanDavis) [15:13:50] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [15:13:59] anomie: API login with tokin does not succeed in test2wiki any more [15:14:04] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 30s) [15:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:36] !log thcipriani@tin Synchronized portals: (no message) (duration: 00m 31s) [15:14:41] doctaxon: Works for me when I try it. [15:14:42] ^ jan_drewniak done [15:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:57] for me not [15:15:10] (I have shutdown nodepool) [15:15:34] anomie: I get token +/ [15:15:46] (03PS1) 10ArielGlenn: always return success for dry runs for onallwikis script [dumps] - 10https://gerrit.wikimedia.org/r/289428 [15:15:52] thcipriani: aww nuts, something is messed up. can you sync a file manually? this url should exist https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-1a803501de.js but it doesn't :/ . 
If not, then I'll go bug my team about it :P [15:16:18] * thcipriani looks [15:18:08] (03PS1) 10BBlack: varnishxcps: misc nits [puppet] - 10https://gerrit.wikimedia.org/r/289429 [15:18:28] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets/js/index-1a803501de.js: SWAT: manual js sync for portals (duration: 00m 28s) [15:18:29] (03CR) 10BBlack: [C: 032 V: 032] varnishxcps: misc nits [puppet] - 10https://gerrit.wikimedia.org/r/289429 (owner: 10BBlack) [15:18:32] anomie: a token without string [15:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:45] +/ only [15:19:13] anomie: API write isn't possible [15:19:42] (03PS4) 10BBlack: varnishxcache - new monitoring script for hit/miss stuff [puppet] - 10https://gerrit.wikimedia.org/r/289071 [15:20:05] anomie: I never had an empty token in my whole life [15:21:01] jan_drewniak: sync'd manually, but file still seems to be missing :\ [15:21:49] thcipriani: ok, I'll double check the commit, but we can leave it as is for now, thanks! [15:21:52] doctaxon: Hmm. The login seems to succeed, but the session isn't actually logged in. I'll have to look into that more. [15:22:16] jan_drewniak: I can confirm that the file exists on the appservers (the ones that I spot-checked anyway) [15:22:26] jan_drewniak: ok. thank you for checking. [15:23:08] tgr: ping for SWAT [15:23:18] thcipriani: o/ [15:23:41] (03PS2) 10Thcipriani: Set $wgDisableAuthManager = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289413 (https://phabricator.wikimedia.org/T135498) (owner: 10Gergő Tisza) [15:23:46] thcipriani: ok, well if the file exists on the servers, then it's not a commit issue... (off to get to the bottom of this..) 
[15:24:03] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289413 (https://phabricator.wikimedia.org/T135498) (owner: 10Gergő Tisza) [15:25:01] (03Merged) 10jenkins-bot: Set $wgDisableAuthManager = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289413 (https://phabricator.wikimedia.org/T135498) (owner: 10Gergő Tisza) [15:25:06] 06Operations, 10MediaWiki-API: login with empty token to test2wiki - https://phabricator.wikimedia.org/T135638#2305313 (10doctaxon) [15:26:42] when will the page https://www.mediawiki.org/w/index.php?title=MediaWiki_1.28/wmf.2 be created? [15:26:54] 06Operations, 10MediaWiki-API: login with empty token to test2wiki - https://phabricator.wikimedia.org/T135638#2305334 (10doctaxon) There is no api write possible, login with empty token only since about May 17th, 20:00 utc [15:27:06] 06Operations, 10MediaWiki-API: login with empty token to test2wiki - https://phabricator.wikimedia.org/T135638#2305335 (10doctaxon) There is no api write possible, login with empty token only since about May 17th, 20:00 utc [15:28:34] anomie: I found a second issue @test2wiki: CentralLogout doesn't work [15:29:10] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:289413|Set $wgDisableAuthManager = true]] (duration: 00m 25s) [15:29:12] ^ tgr sync'd [15:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:16] thcipriani, any possibility to add a late SWAT? https://gerrit.wikimedia.org/r/289423 [15:29:44] tgr: The fix for logout at beta, is that fix live at the group0 wikis? 
[15:30:21] (03PS2) 10Giuseppe Lavagetto: openstack::nova::api: allow users to access api logs [puppet] - 10https://gerrit.wikimedia.org/r/289371 (https://phabricator.wikimedia.org/T133992) [15:30:23] (03PS2) 10Giuseppe Lavagetto: openstack: allow unprivileged users to access nova logs [puppet] - 10https://gerrit.wikimedia.org/r/289370 (https://phabricator.wikimedia.org/T133992) [15:30:25] (03PS2) 10Giuseppe Lavagetto: admin: add the releng team to labnet-users [puppet] - 10https://gerrit.wikimedia.org/r/289372 (https://phabricator.wikimedia.org/T133992) [15:30:33] 06Operations, 07Graphite: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385#2305348 (10fgiunchedi) still looking into what might be causing this with a given periodicity, close to us daily patterns perhaps it looks like: {F4027871} {F4027872} [15:30:43] thcipriani: looks good, nothing changed [15:30:50] (03PS5) 10BBlack: varnishxcache - new monitoring script for hit/miss stuff [puppet] - 10https://gerrit.wikimedia.org/r/289071 [15:30:51] matt_flaschen: sure. looks like I have to force merge for nodepool though :( [15:30:52] (03PS1) 10BBlack: X-Cache-Int refactor [puppet] - 10https://gerrit.wikimedia.org/r/289430 [15:31:19] tgr: my favorite change to SWAT :) [15:31:19] 06Operations, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth: Can't logout @test2wiki, centralauth logs you back in - https://phabricator.wikimedia.org/T135639#2305352 (10Luke081515) [15:31:26] Thanks, thcipriani [15:32:01] !log Deleted Nodepool snapshot images created around 14:30 apparently they havent been provisioned properly. Started nodepool again. 
Poke T135631 [15:32:02] T135631: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631 [15:32:05] Luke081515: duh, no [15:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:20] I assumed I merged that before the branch cut but apparently not :( [15:32:22] 06Operations, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth: Can't logout @test2wiki, centralauth logs you back in - https://phabricator.wikimedia.org/T135639#2305371 (10Luke081515) Related too T135525? Works only at test2, logout at a group2wiki (dewiki) works normal [15:32:26] hm, ok [15:32:27] thcipriani: can we do another one? [15:32:40] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [15:32:48] thcipriani: https://gerrit.wikimedia.org/r/#/c/289248/ [15:33:31] 06Operations, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth: Can't logout @test2wiki, centralauth logs you back in - https://phabricator.wikimedia.org/T135639#2305376 (10Luke081515) [15:33:34] 06Operations, 10MediaWiki-API: login with empty token to test2wiki - https://phabricator.wikimedia.org/T135638#2305377 (10Luke081515) [15:34:17] matt_flaschen: do you have a backport of the commit you need SWATted? [15:34:41] thcipriani, it's still Jenkins-ing, will get it to you when it merges. [15:34:47] (On master) [15:35:09] matt_flaschen: eh, nodepool is broken right now. Probably won't merge via zuul very quickly. [15:35:22] o.O gate-and-submit has 5 [15:36:14] (03CR) 10BBlack: [C: 032 V: 032] X-Cache-Int refactor [puppet] - 10https://gerrit.wikimedia.org/r/289430 (owner: 10BBlack) [15:36:31] thcipriani, oh, thanks. Sorry, I didn't make the connection before. 
[15:37:57] tgr: if you can get me a backport of that patch to the branch(es) you need it on, I can get it out as long as you can be around for a bit of an extended SWAT. [15:38:09] thcipriani: cherry-pick is https://gerrit.wikimedia.org/r/#/c/289431/ (I'll add it to the deployments page) [15:38:16] thanks [15:38:36] stephanebisson, we could maybe schedule it for 4 Pacific and hope nodepool is fixed by then. [15:38:36] (03PS1) 10Ema: config-geo: list all DCs in failover lists for completeness [dns] - 10https://gerrit.wikimedia.org/r/289433 [15:38:39] Your call. [15:38:40] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: puppet fail [15:38:56] matt_flaschen: sounds good [15:39:24] matt_flaschen: like I said in #collab, I don't think this is the kind of emergency worth bypassing the system [15:39:58] stephanebisson, yeah, I agree it's better to wait. [15:40:44] thcipriani: do you know the file path of that portal file that wasn't updating? gehel is looking into into this for me :) [15:40:51] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [15:41:01] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2305454 (10Gilles) [15:41:09] jan_drewniak: sure, I can get it for you, give me 1 minute to sync out this change. [15:41:10] jan_drewniak, thcipriani: I got the file, ... [15:41:17] ok [15:41:34] (03CR) 1020after4: [C: 031] Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/277700 (owner: 10Thcipriani) [15:41:38] thcipriani: keep syncing, I'll ping you when I'm completely lost in symlinks and HTTP 302... 
[15:41:46] :D [15:43:18] !log thcipriani@tin Synchronized php-1.28.0-wmf.2/extensions/CentralAuth/includes/CentralAuthHooks.php: SWAT: [[gerrit:289431|Fix central logout]] (duration: 00m 26s) [15:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:27] ^ tgr check please [15:43:53] thcipriani: works [15:43:54] thcipriani: I checked it, works [15:44:06] tgr: Luke081515 thank you [15:44:11] thcipriani: Ok, I might need some help on this one... https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-1a803501de.js works fine on mw1017, but fails (404) when disabling testing headers... [15:44:43] (03CR) 1020after4: ores: Scap3 deployment configurations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [15:45:35] gehel, maybe it's stuck in Varnish. [15:45:46] 06Operations, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth: Can't logout @test2wiki, centralauth logs you back in - https://phabricator.wikimedia.org/T135639#2305466 (10Luke081515) 05Open>03Resolved a:03Tgr Solved with https://gerrit.wikimedia.org/r/#/c/289431/, the reason was... [15:46:24] matt_flaschen: negative caching? I don't actually know our config, but I would expect a fairly short TTL on negative cache... [15:46:57] 1 of 2 deployment blockers solved [15:47:37] gehel, yeah, you may be right, not sure. [15:48:07] matt_flaschen: me neither, I'm checking VCL, but that's not a language I'm too familiar with... [15:48:38] hmm, shows up without headers or debug=true via curl [15:49:10] thcipriani: yep, for me as well... [15:49:42] and it now works via browser as well for me. jan_drewniak could you check as well? [15:50:11] matt_flaschen might have been right, short negative caching TTL in varnish and issue solved itself [15:50:48] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2305496 (10Gilles) [15:51:04] wow! 
gehel it works! jesus, negative caching TTL sounds like devil-speak to me :P [15:51:14] :D [15:51:57] jan_drewniak: I'm not even sure that's the correct term, but the idea is that we cache failures (in this case 404) as well, but only for a short time, because we expect them to be transient [15:52:43] gehel, also, that would probably be a performance nightmare if someone hit a bunch of random invalid URLs, if there was a long negative caching TTL. [15:52:56] !log restarting varnish frontends on cache_misc + cache_maps to clear cached X-Cache entries [15:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:19] jan_drewniak: let's cross fingers and hope this is actually resolved. Ping me if you see the issue again... [15:53:26] bblack: and speaking of the devil... [15:53:37] Ive added one more blocker to https://phabricator.wikimedia.org/T134450 [15:53:44] Which is https://phabricator.wikimedia.org/T135635 [15:53:51] gehel: ? [15:54:04] phew, I will. Thanks for taking a look at this! [15:54:34] bblack: disregard... we were talking about caching and how that looks like devil speak to some, and here you are restarting varnish servers... [15:54:49] doctaxon: I think I figured it out, working on a fix now. [15:54:54] gehel: I'm not sure negative maxage is even legal [15:55:01] zero should suffice :P [15:55:12] or simply set the various no-store, no-cache sorts of CC values [15:56:03] bblack: we just had transient error on https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-1a803501de.js (returning a 404), it fixed itself. Could it be that we cached the 404 for a short time? (that's what I meant when talking about negative caching, I have no idea what the correct term is) [15:56:18] ok [15:56:29] thcipriani: uh, are you around for one more? [15:56:47] yes, we do cache 404s just like any other. 
varnish will use the same caching rules it uses for 200s by default (looking at CC, Expires, etc), but then we also cap 4xx cache lifetimes to minimize them [15:56:53] Luke081515 seems to have a knack for catching UBN bugs in the middle of SWAT :) [15:56:57] I think we currently limit them to 1 minute on all clusters [15:57:05] tgr: sure :) [15:57:14] (but in practice, it could take up to 4 minutes to clear from all, since the capped TTL is per-layer) [15:57:32] bblack: That could explain the issue. Thanks! [15:58:20] thcipriani: https://gerrit.wikimedia.org/r/#/c/289438/ [15:59:56] 06Operations, 10MediaWiki-General-or-Unknown: Special pages on cswiki not updated for longer than usual (since May 12th 6am) - https://phabricator.wikimedia.org/T135326#2305516 (10Danny_B) Whether is any problem with script or whatever, it is not an issue of cswiki, thus new task has to be opened for that... [16:01:20] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [16:01:38] !log thcipriani@tin Synchronized php-1.28.0-wmf.2/includes/DefaultSettings.php: SWAT: [[gerrit:289438|Increase BotPasswordSessionProvider default priority]] (duration: 00m 26s) [16:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:48] ^ tgr check please [16:01:54] 06Operations, 10MediaWiki-API: login with empty token to test2wiki - https://phabricator.wikimedia.org/T135638#2305313 (10Tgr) Anomie wrote a fix in https://gerrit.wikimedia.org/r/#/c/289437/ / {f00b09f0693391e8fca06d180b3a80fc8833165a}. [16:02:07] thcipriani: works [16:02:11] anomie: thank you [16:02:17] doctaxon: Should be fixed now. 
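Note on the 404 caching explained above: a rough sketch of the capping bblack describes, where 4xx responses follow the normal caching rules but their lifetime is capped at about 1 minute per cache layer, so a stale 404 can take up to 4 minutes to clear everywhere. This is illustrative Python, not the actual VCL. The function names, the 60-second default, and the layer count of 4 (inferred only from "1 minute" and "up to 4 minutes") are assumptions.

```python
# Hypothetical model of per-layer 4xx TTL capping; not the real Varnish config.

def capped_ttl(status: int, origin_ttl_s: int, cap_4xx_s: int = 60) -> int:
    """Return the TTL one cache layer would apply to a response.

    4xx responses keep the TTL derived from the usual headers
    (Cache-Control, Expires, ...), but never more than cap_4xx_s.
    Other statuses are left untouched.
    """
    if 400 <= status < 500:
        return min(origin_ttl_s, cap_4xx_s)
    return origin_ttl_s


def worst_case_stale_seconds(per_layer_cap_s: int, layers: int) -> int:
    """Upper bound on how long a cached 4xx can survive across stacked layers.

    Each layer applies its own cap, and in the worst case a layer refetches
    from the layer behind it just before that layer's copy expires, so the
    caps add up.
    """
    return per_layer_cap_s * layers


if __name__ == "__main__":
    print(capped_ttl(404, 3600))            # 404 with a long origin TTL gets capped
    print(capped_ttl(200, 3600))            # 200 keeps its origin TTL
    print(worst_case_stale_seconds(60, 4))  # assumed 4 layers at 60 s each
```

Under these assumptions the transient 404 on the portal JS file is consistent with a cached negative response simply aging out.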
[16:02:20] (03PS1) 10Filippo Giunchedi: graphite: introduce local carbon-c-relay daemon [puppet] - 10https://gerrit.wikimedia.org/r/289440 (https://phabricator.wikimedia.org/T85451) [16:03:02] hm, a few contint instances get create actually [16:03:31] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:04:54] but gate and submit isn't working yet [16:05:04] (03PS6) 10BBlack: varnishxcache - new monitoring script for hit/miss stuff [puppet] - 10https://gerrit.wikimedia.org/r/289071 [16:05:56] 06Operations, 10MediaWiki-API: login with empty token to test2wiki - https://phabricator.wikimedia.org/T135638#2305589 (10Tgr) 05Open>03Resolved Verified, bot login works now. [16:06:12] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2305592 (10Gilles) [16:07:08] 06Operations, 06Discovery, 10Maps: Configure monitoring / alerting of Postgresql cluster for maps - https://phabricator.wikimedia.org/T135647#2305596 (10Gehel) [16:10:04] (03PS1) 10Jcrespo: Remove unneeded $heartbeat_enable variables [puppet] - 10https://gerrit.wikimedia.org/r/289442 (https://phabricator.wikimedia.org/T133337) [16:10:32] (03PS2) 10Jcrespo: Remove unneeded $heartbeat_enabled variables [puppet] - 10https://gerrit.wikimedia.org/r/289442 (https://phabricator.wikimedia.org/T133337) [16:11:21] (03PS7) 10BBlack: varnishxcache - new monitoring script for hit/miss stuff [puppet] - 10https://gerrit.wikimedia.org/r/289071 [16:19:52] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:20:13] (03PS1) 10Ladsgroup: ores: rewrite http requests to SSL in nginx [puppet] - 10https://gerrit.wikimedia.org/r/289445 [16:21:37] (03CR) 10Jcrespo: [C: 031] "I have solved the HA issue and I do not want to be longer a 
blocker. However, take a look at the comments on T134480#2305635 - I have rede" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [16:24:07] (03CR) 10Jcrespo: [C: 032] Remove unneeded $heartbeat_enabled variables [puppet] - 10https://gerrit.wikimedia.org/r/289442 (https://phabricator.wikimedia.org/T133337) (owner: 10Jcrespo) [16:25:23] (03CR) 10Jcrespo: [V: 032] Remove unneeded $heartbeat_enabled variables [puppet] - 10https://gerrit.wikimedia.org/r/289442 (https://phabricator.wikimedia.org/T133337) (owner: 10Jcrespo) [16:27:37] (03CR) 10Yuvipanda: [C: 04-1] "This won't work, you need the https enforcement code similar to the one in modules/extdist/templates/extdist.nginx.erb" [puppet] - 10https://gerrit.wikimedia.org/r/289445 (owner: 10Ladsgroup) [16:27:40] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [16:27:41] PROBLEM - nova-compute process on labvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [16:27:51] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:28:31] (03CR) 10Yuvipanda: "Reason being we use the labs proxy for https termination, so this nginx will always receive requests on http only." 
[puppet] - 10https://gerrit.wikimedia.org/r/289445 (owner: 10Ladsgroup) [16:28:47] (03PS3) 10Faidon Liambotis: Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/277700 (owner: 10Thcipriani) [16:28:59] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [16:29:01] PROBLEM - nova-compute process on labvirt1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [16:29:03] (03CR) 10Faidon Liambotis: [C: 032] Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/277700 (owner: 10Thcipriani) [16:30:35] (03PS2) 10Ladsgroup: ores: rewrite http requests to SSL in nginx [puppet] - 10https://gerrit.wikimedia.org/r/289445 [16:31:19] ostriches: I keep getting "Server unavailable" in gerrit every so often -- is that known? [16:31:23] (03PS8) 10BBlack: varnishxcache - new monitoring script for hit/miss stuff [puppet] - 10https://gerrit.wikimedia.org/r/289071 [16:31:25] (03PS1) 10BBlack: varnisxcache: initial temporary limit to cp1065 [puppet] - 10https://gerrit.wikimedia.org/r/289448 [16:31:39] paravoid: Shouldn't be a thing.... 
[16:31:40] (03PS6) 10Volans: MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) [16:31:44] I'm not touching it today [16:31:54] (03CR) 10BBlack: [C: 032 V: 032] varnishxcache - new monitoring script for hit/miss stuff [puppet] - 10https://gerrit.wikimedia.org/r/289071 (owner: 10BBlack) [16:31:57] (wfm, as a completely useless anecdote) [16:31:57] (03CR) 10Faidon Liambotis: [V: 032] Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/277700 (owner: 10Thcipriani) [16:32:05] (03PS4) 10Faidon Liambotis: Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/277700 (owner: 10Thcipriani) [16:32:07] Lemme look [16:32:11] (03CR) 10Faidon Liambotis: [V: 032] Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/277700 (owner: 10Thcipriani) [16:32:13] (03CR) 10BBlack: [C: 032 V: 032] varnisxcache: initial temporary limit to cp1065 [puppet] - 10https://gerrit.wikimedia.org/r/289448 (owner: 10BBlack) [16:32:20] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:32:21] (03PS2) 10BBlack: varnisxcache: initial temporary limit to cp1065 [puppet] - 10https://gerrit.wikimedia.org/r/289448 [16:32:23] :P [16:32:29] (03CR) 10BBlack: [V: 032] varnisxcache: initial temporary limit to cp1065 [puppet] - 10https://gerrit.wikimedia.org/r/289448 (owner: 10BBlack) [16:32:38] oops, sorry [16:33:00] is yours ok to go? [16:33:32] * bblack curses the person that made ops/puppet ff-only :P [16:33:41] paravoid: Killed a couple (like 4) replication jobs that were languishing, otherwise not seeing anything interesting.... [16:36:09] (03CR) 10Faidon Liambotis: [C: 04-1] "See inline -- essentially, I see no point in calling this reprepro::apt_repository instead of just reprepro (same for the role)." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [16:36:34] ok, thanks [16:36:37] bblack: wasn't that you? :) [16:37:23] (03CR) 10Faidon Liambotis: [C: 031] "Sounds fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/289143 (https://phabricator.wikimedia.org/T134798) (owner: 10Dzahn) [16:37:28] :) [16:38:24] 07Blocked-on-Operations, 06Operations, 06Services, 06WMDE-Analytics-Engineering, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#2305735 (10fgiunchedi) the patch at https://gerrit.wikimedia.org/r/289440 add a local carbon-c-relay to be used for submitting graph... [16:39:33] (03PS1) 10BBlack: Revert "varnisxcache: initial temporary limit to cp1065" [puppet] - 10https://gerrit.wikimedia.org/r/289450 [16:39:45] (03CR) 10BBlack: [C: 032 V: 032] Revert "varnisxcache: initial temporary limit to cp1065" [puppet] - 10https://gerrit.wikimedia.org/r/289450 (owner: 10BBlack) [16:44:30] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:44:43] (03PS3) 10Ladsgroup: ores: rewrite http requests to SSL in nginx [puppet] - 10https://gerrit.wikimedia.org/r/289445 [16:45:30] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [16:46:42] (03PS4) 10Yuvipanda: ores: rewrite http requests to SSL in nginx [puppet] - 10https://gerrit.wikimedia.org/r/289445 (owner: 10Ladsgroup) [16:46:49] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: rewrite http requests to SSL in nginx [puppet] - 10https://gerrit.wikimedia.org/r/289445 (owner: 10Ladsgroup) [16:47:39] (03CR) 10Faidon Liambotis: [C: 04-1] "One tiny comment so far. 
I'm not too familiar with all those scap manifests and they are hard to follow/understand without a much more tho" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [16:49:02] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Backport nutcracker 0.4.1 to jessie - https://phabricator.wikimedia.org/T132032#2305798 (10Joe) a:03Joe [16:49:13] (03CR) 10Faidon Liambotis: [C: 04-2] "Yeah. There is definitely no notion of a global active datacenter. Some apps are hot/cold and require an active but even for those, we won" [puppet] - 10https://gerrit.wikimedia.org/r/289334 (owner: 10Dzahn) [16:49:28] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Backport nutcracker 0.4.1 to jessie - https://phabricator.wikimedia.org/T132032#2186385 (10Joe) My plan is to use our trusty package and just add a service unit. [16:51:39] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [16:51:49] RECOVERY - nova-compute process on labvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [16:52:30] RECOVERY - nova-compute process on labvirt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [16:53:21] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: puppet fail [16:53:46] (03CR) 10Ladsgroup: "With Ibbf54ce0aa162e72662e6906b04884283ded41ce merged, there is no need to keep the redirecting part in uwsgi. 
I didn't remove it from the" [puppet] - 10https://gerrit.wikimedia.org/r/288618 (owner: 10Alexandros Kosiaris) [16:57:11] (03CR) 1020after4: WIP: keyholder key cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [17:01:40] 06Operations, 10DBA, 07Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#2305868 (10jcrespo) [17:02:10] (03PS1) 10Ladsgroup: ores: Make non-SSL redirects permanent (302 -> 301) [puppet] - 10https://gerrit.wikimedia.org/r/289455 [17:03:24] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [17:05:43] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:05:52] 06Operations, 03Scap3: Scap3 calls the checker script before restarting the service, not able to restart a service if it's down. - https://phabricator.wikimedia.org/T135609#2305902 (10thcipriani) Oh! I know exactly what is happening. To implement `scap --service-restart` we added a `restart_service` stage (se... [17:13:04] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2305919 (10Eevans) One difference w/ 2008 is that it is running Cassandra 2.1.14: ``` restbase1007.eqiad.wmn... [17:13:20] (03PS1) 10Dzahn: mw_rc_irc: remove upstart, pre-jessie support [puppet] - 10https://gerrit.wikimedia.org/r/289458 [17:16:34] (03PS7) 10Volans: MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) [17:16:44] got the general "our servers have a problem" page from integration.wm for a moment. 
then back normal [17:19:01] (03PS2) 10Dzahn: mw_rc_irc: remove upstart, pre-jessie support [puppet] - 10https://gerrit.wikimedia.org/r/289458 [17:20:41] CI is not running? [17:21:07] (03CR) 10Alex Monk: [C: 031] "noop for kraz, assuming we don't want this on ubuntu again it looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/289458 (owner: 10Dzahn) [17:21:07] nope, mom [17:21:16] volans: https://phabricator.wikimedia.org/T135631 [17:21:24] spawning instances is broken [17:21:47] Luke081515: thanks [17:22:09] np ;) [17:22:46] (03CR) 10Volans: [C: 032] MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [17:22:50] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2305957 (10Cmjohnson) [17:22:53] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [17:23:11] (03CR) 10Volans: [V: 032] MariaDB: remove special SSL option multiple-ca [puppet] - 10https://gerrit.wikimedia.org/r/288420 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [17:23:27] (03CR) 1020after4: "@faidon: regarding the key fingerprint lookup, I decided this would be better than wrangling puppet to deal with the keys because it's a o" [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [17:23:37] (03CR) 10Dzahn: [C: 032] "yea, no more Ubuntu. checked with compiler." [puppet] - 10https://gerrit.wikimedia.org/r/289458 (owner: 10Dzahn) [17:23:44] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2261471 (10Cmjohnson) a:05Cmjohnson>03Joe assigning to @joe to complete installs. 
[17:24:17] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1261-1283 - https://phabricator.wikimedia.org/T133798#2305966 (10Cmjohnson) a:05Cmjohnson>03Joe Assigning to @joe to complete installs. [17:25:01] (03PS2) 10Dzahn: rm files/misc/scripts/rotate_fundraising_logs [puppet] - 10https://gerrit.wikimedia.org/r/289352 [17:26:21] (03CR) 10Dzahn: [C: 032] rm files/misc/scripts/rotate_fundraising_logs [puppet] - 10https://gerrit.wikimedia.org/r/289352 (owner: 10Dzahn) [17:27:37] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/289458 (owner: 10Dzahn) [17:28:22] (03PS1) 10BBlack: varnishxcache: always emit zeros [puppet] - 10https://gerrit.wikimedia.org/r/289460 [17:29:03] (03CR) 10BBlack: [C: 032 V: 032] varnishxcache: always emit zeros [puppet] - 10https://gerrit.wikimedia.org/r/289460 (owner: 10BBlack) [17:29:57] mutante: sadly jenkins is currently KO https://phabricator.wikimedia.org/T135631 [17:30:04] leeeroy jeeenkins [17:30:10] godog: ok, thanks [17:30:14] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [17:30:48] (03PS2) 10ArielGlenn: always return success for dry runs of onallwikis script [dumps] - 10https://gerrit.wikimedia.org/r/289428 [17:30:53] (03PS3) 10Dzahn: rm files/misc/scripts/rotate_fundraising_logs [puppet] - 10https://gerrit.wikimedia.org/r/289352 [17:30:59] (03CR) 10Dzahn: [V: 032] rm files/misc/scripts/rotate_fundraising_logs [puppet] - 10https://gerrit.wikimedia.org/r/289352 (owner: 10Dzahn) [17:31:02] bah still out of order? 
[17:31:03] meh [17:31:44] -sub failboat { [17:31:58] (03CR) 10ArielGlenn: [C: 032 V: 032] always return success for dry runs of onallwikis script [dumps] - 10https://gerrit.wikimedia.org/r/289428 (owner: 10ArielGlenn) [17:32:24] (03PS3) 10Dzahn: mw_rc_irc: remove upstart, pre-jessie support [puppet] - 10https://gerrit.wikimedia.org/r/289458 [17:32:33] (03CR) 10Dzahn: [V: 032] mw_rc_irc: remove upstart, pre-jessie support [puppet] - 10https://gerrit.wikimedia.org/r/289458 (owner: 10Dzahn) [17:32:34] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [17:33:03] !log rolling restart of hhvm on mediawiki canaries in eqiad to pick up expat security update [17:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:05] (03CR) 10Dzahn: "jenkins was out. no-op confirmed on kraz." [puppet] - 10https://gerrit.wikimedia.org/r/289458 (owner: 10Dzahn) [17:35:59] (03Abandoned) 10Dzahn: hiera: add variable for the active datacenter [puppet] - 10https://gerrit.wikimedia.org/r/289334 (owner: 10Dzahn) [17:36:38] (03PS3) 10Andrew Bogott: Organize Ganglia cluster names for labs [puppet] - 10https://gerrit.wikimedia.org/r/289315 [17:37:40] mutante godog zuul is now working. [17:37:54] but is now slow [17:37:56] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/289440 (https://phabricator.wikimedia.org/T85451) (owner: 10Filippo Giunchedi) [17:38:09] paladox: neat, thanks! [17:38:52] (03CR) 10Dzahn: "looked at it yesterday,seems ok, the ganglia cluster stuff did have a few surprises in the past and is a bit obscure.. 
one way to find out" [puppet] - 10https://gerrit.wikimedia.org/r/289315 (owner: 10Andrew Bogott) [17:39:40] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2306058 (10mobrovac) My recommendation would be to separate moving to #service-runner and Jessie and Node 4.x because the former implies testing and making sure all of the Parsoid's n... [17:39:41] paladox: cool, tx [17:43:33] (03PS3) 10Dzahn: ssl: remove toolserver.org cert, uses letsencrypt now [puppet] - 10https://gerrit.wikimedia.org/r/289143 (https://phabricator.wikimedia.org/T134798) [17:43:50] (03CR) 10Dzahn: [C: 032] "deleted toolserver.org.key in private repo just now" [puppet] - 10https://gerrit.wikimedia.org/r/289143 (https://phabricator.wikimedia.org/T134798) (owner: 10Dzahn) [17:44:04] (03CR) 10Dzahn: [V: 032] "deleted toolserver.org.key in private repo just now" [puppet] - 10https://gerrit.wikimedia.org/r/289143 (https://phabricator.wikimedia.org/T134798) (owner: 10Dzahn) [17:45:21] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.2/extensions/Renameuser/RenameuserSQL.php: (no message) (duration: 00m 38s) [17:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:48:29] (03PS4) 10Andrew Bogott: Organize Ganglia cluster names for labs [puppet] - 10https://gerrit.wikimedia.org/r/289315 [17:52:39] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2306131 (10Dzahn) - deleted toolserver.org.key in private repo - deleted certs and .key and .keyold on instance, /etc/ssl... 
[17:57:02] (03PS2) 10Yuvipanda: ores: Make non-SSL redirects permanent (302 -> 301) [puppet] - 10https://gerrit.wikimedia.org/r/289455 (owner: 10Ladsgroup) [17:57:03] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2306160 (10GWicke) Given that [Parsoid already supports running on service-runner](https://github.com/wikimedia/mediawiki-node-services/blob/master/config.yaml#L33-L43), and so far th... [17:57:08] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Make non-SSL redirects permanent (302 -> 301) [puppet] - 10https://gerrit.wikimedia.org/r/289455 (owner: 10Ladsgroup) [17:58:44] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2306161 (10Dzahn) >>! In T134798#2303651, @Dzahn wrote: > @Nemo_bis Possible, but we'll need it in DNS first, then Apache... [18:00:04] csteipp dapatrick: Respected human, time to deploy Enable Ex:OATH on testwikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160518T1800). Please do the needful. [18:06:29] (03PS1) 10CSteipp: Enable Ex:OATHAuth on test wikis, disabled for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289486 (https://phabricator.wikimedia.org/T107605) [18:10:19] (03PS1) 10Dzahn: toolserver.org: same ServerAliases for https, add wiki. 
alias [puppet] - 10https://gerrit.wikimedia.org/r/289487 (https://phabricator.wikimedia.org/T62220) [18:11:33] (03CR) 10Thcipriani: [C: 031] admin: add the releng team to labnet-users [puppet] - 10https://gerrit.wikimedia.org/r/289372 (https://phabricator.wikimedia.org/T133992) (owner: 10Giuseppe Lavagetto) [18:13:00] (03PS5) 10Yuvipanda: tools: Enable host automounts [puppet] - 10https://gerrit.wikimedia.org/r/288761 (https://phabricator.wikimedia.org/T134748) [18:13:07] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Enable host automounts [puppet] - 10https://gerrit.wikimedia.org/r/288761 (https://phabricator.wikimedia.org/T134748) (owner: 10Yuvipanda) [18:14:03] !log created oathauth_users in centralauth db [18:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:16] AH00112: Warning: DocumentRoot [/does/not/exist] does not exist [18:14:20] say what [18:18:07] (03CR) 10Alex Monk: [C: 031] toolserver.org: same ServerAliases for https, add wiki. alias [puppet] - 10https://gerrit.wikimedia.org/r/289487 (https://phabricator.wikimedia.org/T62220) (owner: 10Dzahn) [18:19:12] (03PS2) 10CSteipp: Enable Ex:OATHAuth on test wikis, disabled for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289486 (https://phabricator.wikimedia.org/T107605) [18:19:43] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:35] high 500 due to pageviews api [18:21:59] madhuvishy: nuria_ ^ [18:22:07] yessir [18:23:18] (03PS2) 10Yuvipanda: tools: Enable HostPathEnforcer admission controller [puppet] - 10https://gerrit.wikimedia.org/r/288808 (https://phabricator.wikimedia.org/T112718) [18:23:35] YuviPanda: is the ping about pageview api alarms [18:23:36] ? [18:23:39] (03CR) 10jenkins-bot: [V: 04-1] tools: Enable HostPathEnforcer admission controller [puppet] - 10https://gerrit.wikimedia.org/r/288808 (https://phabricator.wikimedia.org/T112718) (owner: 10Yuvipanda) [18:23:42] nuria_: yes. 
[18:23:55] nuria_: sorry, got distracted and didn't explicitly mention it [18:24:09] (03PS3) 10Yuvipanda: tools: Enable HostPathEnforcer admission controller [puppet] - 10https://gerrit.wikimedia.org/r/288808 (https://phabricator.wikimedia.org/T112718) [18:24:28] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Allow RelEng access to labnet servers (was: Allow RelEng nova log access) - https://phabricator.wikimedia.org/T133992#2251644 (10Legoktm) >>! In T133992#2304203, @gerritbot wrote... [18:25:13] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:25:14] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:25:44] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:26:46] (03PS1) 10Rush: labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 [18:27:03] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:27:13] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [18:27:13] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:27:38] (03PS2) 10Rush: labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 [18:28:25] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:30:45] (03CR) 10Yuvipanda: labs nfs client instance refactor (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289490 (owner: 10Rush) [18:30:52] (03CR) 10CSteipp: [C: 032] Enable Ex:OATHAuth on test wikis, disabled for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289486 (https://phabricator.wikimedia.org/T107605) (owner: 10CSteipp) [18:31:10] chasemp: minor comments, major one is me bashing on bash mostly I like the 
concepts etc [18:31:17] I didn't write that bash script [18:31:27] I'm pulling it out of generic manifests/files hell [18:31:27] !log Stopping failed bootstrap of restbase2008-b.codfw.wmnet : T95253 [18:31:28] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [18:31:29] oh? it was an 'A' [18:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:43] no disagreement but I would rather python-ize it separately [18:31:44] ah [18:31:46] (03PS1) 10Catrope: Enable Flow beta feature on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289492 (https://phabricator.wikimedia.org/T135582) [18:32:03] (03Merged) 10jenkins-bot: Enable Ex:OATHAuth on test wikis, disabled for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289486 (https://phabricator.wikimedia.org/T107605) (owner: 10CSteipp) [18:32:04] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [18:32:18] (03PS3) 10Rush: labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 [18:32:33] chasemp: ah that makes more sense now :) [18:32:34] !log Stopping restbase2008-a.codfw.wmnet and downgrading Cassandra to 2.1.13 : T95253 [18:32:35] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [18:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:33:04] chasemp: agree re: pythonization on a separate commit, was confused because I thought you had just written it anew [18:33:31] heh my bad for not actually removing the old one, test setup I did and then I think file removals don't git export well or something [18:34:03] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[18:34:35] (03PS4) 10Rush: labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 [18:35:02] !log Starting restbase2008-a.codfw.wmnet : T95253 [18:35:04] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [18:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:08] (03PS5) 10Rush: labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 [18:36:26] chasemp: it's an almost straight up move, right? [18:36:27] !log csteipp@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 34s) [18:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:47] YuviPanda: yes rearrange not change (tm) [18:36:54] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [18:37:08] chasemp: cool cool :) [18:37:08] but the mount_nfs centralization etc [18:37:19] yeah I think that's fine [18:37:39] on the k8s worker nodes I want to allow us to mount only specific mounts but that can wait [18:37:57] !log csteipp@tin Synchronized wmf-config/CommonSettings.php: Enable OATH on test wikis (duration: 00m 28s) [18:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:19] you could pretty easily by extending mount_nfs_volume and also passing in hostnames or something [18:38:23] but yeah another day [18:38:28] have a bit more to follow this [18:39:02] ^^^ cassandra-b service on restbase2008 is me [18:40:02] chasemp: yeah [18:40:09] chasemp: kk, am going afk for food etc now [18:40:18] YuviPanda: +1ing? 
[18:40:32] or did I miss it and amend wiped it out [18:40:35] !log Starting bootstrap of restbase2008-b.codfw.wmnet : T95253 [18:40:36] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [18:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:40:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:40:48] chasemp: /mnt/nfs? [18:40:57] (03CR) 10Dzahn: [C: 032] toolserver.org: same ServerAliases for https, add wiki. alias [puppet] - 10https://gerrit.wikimedia.org/r/289487 (https://phabricator.wikimedia.org/T62220) (owner: 10Dzahn) [18:40:59] (03CR) 10Dzahn: [V: 032] toolserver.org: same ServerAliases for https, add wiki. alias [puppet] - 10https://gerrit.wikimedia.org/r/289487 (https://phabricator.wikimedia.org/T62220) (owner: 10Dzahn) [18:41:27] (03PS6) 10Rush: labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 [18:41:30] YuviPanda: right removed in this round to make it more noop-y thought I commented to that effect...but nope [18:41:46] YuviPanda: is the grid master down? [18:41:49] chasemp: gerrrrit etc [18:41:51] doctaxon: oh? [18:41:53] looking [18:42:04] i cannot start tasks by crontab [18:42:10] master's up [18:42:17] doctaxon: which tasks? what're you trying to start? which tool? [18:42:25] qstat says: [18:42:28] 6458402 0.30040 qsm tools.taxonb r 05/18/2016 13:40:05 task@tools-exec-1212.eqiad.wmf 1 [18:42:31] 6460010 0.30030 c-uncat tools.taxonb r 05/18/2016 14:50:06 task@tools-exec-1404.eqiad.wmf 1 [18:42:34] 6464606 0.30000 aa-voy tools.taxonb qw 05/18/2016 18:38:03 1 [18:42:37] 6464607 0.30000 aan tools.taxonb qw 05/18/2016 18:38:03 1 [18:42:40] 6464610 0.30000 ld tools.taxonb qw 05/18/2016 18:38:03 1 [18:42:44] 6464611 0.30000 ldw tools.taxonb qw 05/18/2016 18:38:03 1 [18:42:46] state qw ?
[18:42:51] hmm [18:42:53] RECOVERY - cassandra-b service on restbase2008 is OK: OK - cassandra-b is active [18:42:54] it's possible they are trying to access dumps which is still being resized? [18:42:59] queue wait [18:43:14] usually means...no resources to run in this queue atm or pending I think [18:43:21] nope, 201 jobs in qw [18:43:25] let's switch to -labs [18:44:57] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2306337 (10Eevans) Cassandra has now been downgraded to 2.1.13 on restbase2008.codfw.wmnet, and the bootstrap... [18:51:40] !log added oathauth-enable right to Staff group on testwiki. [18:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:59] (03PS3) 10Dzahn: toolserver: add wiki.toolserver LE cert [puppet] - 10https://gerrit.wikimedia.org/r/289345 (https://phabricator.wikimedia.org/T62220) [18:56:59] (03PS4) 10Dzahn: toolserver: add wiki. and stable. LE certs [puppet] - 10https://gerrit.wikimedia.org/r/289345 (https://phabricator.wikimedia.org/T62220) [18:57:40] (03PS5) 10Dzahn: toolserver: add wiki. and stable. LE certs [puppet] - 10https://gerrit.wikimedia.org/r/289345 (https://phabricator.wikimedia.org/T62220) [18:58:24] (03CR) 10Dzahn: [C: 032] toolserver: add wiki. and stable. LE certs [puppet] - 10https://gerrit.wikimedia.org/r/289345 (https://phabricator.wikimedia.org/T62220) (owner: 10Dzahn) [18:58:54] (03CR) 10Dzahn: [V: 032] toolserver: add wiki. and stable. LE certs [puppet] - 10https://gerrit.wikimedia.org/r/289345 (https://phabricator.wikimedia.org/T62220) (owner: 10Dzahn) [19:00:04] hashar: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160518T1900). Please do the needful.
[19:05:46] (03PS1) 10ArielGlenn: add option to onallwikis to run from a base wiki, with wikis as args [dumps] - 10https://gerrit.wikimedia.org/r/289501 [19:06:47] 06Operations: consider debian "experimental" workflow for internal repository - https://phabricator.wikimedia.org/T115758#2306455 (10Eevans) [19:08:09] 06Operations: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#1732058 (10Eevans) [19:11:03] bblack: adding 2 more SANs to an existing LE cert.. also just worked.. on second puppet run, and i could see the requests from LE validation server get a 302 when getting challenge files and it works [19:12:47] 06Operations: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#1732058 (10Eevans) Even more examples in {T95253} (starting [[ https://phabricator.wikimedia.org/T95253#2305919 | here ]]), and {T126629} (starting [[ https://phabricator.wikimedia.org/T126629#2262966... [19:13:54] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2306476 (10Dzahn) done. the cert has additional SANs now, "wiki" and "stable" [19:14:46] (03CR) 10jenkins-bot: [V: 04-1] labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 (owner: 10Rush) [19:15:36] (03CR) 10Dzahn: "wiki.toolserver.org now has a valid cert" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [19:16:15] yurik: which service were you wanting to deploy via scap3? Follow-up-question have you merged any changes in puppet to service::node for that service? [19:17:09] (03CR) 10Dzahn: "wiki. has been added as a server alias to the www.toolserver.org.erb template. 
could you rebase and put it in there" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [19:18:05] (03PS2) 10Dzahn: Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [19:18:28] (03CR) 10jenkins-bot: [V: 04-1] Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [19:19:03] (03CR) 10Dzahn: "@Yuvipanda i think you can remove the -1 now, we have a cert, shouldnt be a problem anymore" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [19:20:54] (03CR) 10Yuvipanda: "done" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [19:21:34] (03CR) 10jenkins-bot: [V: 04-1] add option to onallwikis to run from a base wiki, with wikis as args [dumps] - 10https://gerrit.wikimedia.org/r/289501 (owner: 10ArielGlenn) [19:25:02] 06Operations, 10cassandra: Downgrade Cassandra on apt.wikimedia.org to 2.1.13 - https://phabricator.wikimedia.org/T135673#2306518 (10Eevans) [19:27:29] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2306522 (10Eevans) >>! In T95253#2305919, @Eevans wrote: > [ ... ] > I will follow-up with any ticket(s) need... 
[19:30:45] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR [19:33:53] (03PS49) 10Ladsgroup: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 [19:36:01] (03PS1) 10Dzahn: toolserver.org: send 410 Gone for ~cmarqu URLs [puppet] - 10https://gerrit.wikimedia.org/r/289504 (https://phabricator.wikimedia.org/T85167) [19:36:47] (03PS2) 10Dzahn: toolserver.org: send 410 Gone for ~cmarqu URLs [puppet] - 10https://gerrit.wikimedia.org/r/289504 (https://phabricator.wikimedia.org/T85167) [19:37:06] (03PS3) 10Dzahn: toolserver.org: send 410 Gone for ~cmarqu URLs [puppet] - 10https://gerrit.wikimedia.org/r/289504 (https://phabricator.wikimedia.org/T85167) [19:37:24] (03PS4) 10Yuvipanda: tools: Enable HostPathEnforcer admission controller [puppet] - 10https://gerrit.wikimedia.org/r/288808 (https://phabricator.wikimedia.org/T112718) [19:39:36] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Enable HostPathEnforcer admission controller [puppet] - 10https://gerrit.wikimedia.org/r/288808 (https://phabricator.wikimedia.org/T112718) (owner: 10Yuvipanda) [19:40:21] (03PS4) 10Dzahn: toolserver.org: send 410 Gone for ~cmarqu URLs [puppet] - 10https://gerrit.wikimedia.org/r/289504 (https://phabricator.wikimedia.org/T85167) [19:40:39] (03CR) 10Dzahn: [C: 032] toolserver.org: send 410 Gone for ~cmarqu URLs [puppet] - 10https://gerrit.wikimedia.org/r/289504 (https://phabricator.wikimedia.org/T85167) (owner: 10Dzahn) [19:41:17] (03PS7) 10Rush: labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 [19:42:32] (03CR) 10Dzahn: [V: 032] toolserver.org: send 410 Gone for ~cmarqu URLs [puppet] - 10https://gerrit.wikimedia.org/r/289504 (https://phabricator.wikimedia.org/T85167) (owner: 10Dzahn) [19:45:32] (03CR) 10jenkins-bot: [V: 04-1] labs nfs 
client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 (owner: 10Rush) [19:49:08] (03PS8) 10Rush: labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 [19:51:06] !log disabled puppet on aqs1001 aqs1002 aqs1003 aqs1004 aqs1005 aqs1006 maps-test2001 maps-test2002 maps-test2003 maps-test2004 maps2001 maps2002 maps2003 maps2004 restbase1007 restbase1008 restbase1009 restbase2001 restbase2002 restbase2003 restbase2004 restbase2005 restbase2006 restbase2007 restbase2008 restbase2009 for cassandra upgrade [19:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:38] FYI, this ^ does not mean cassandra will get updated on those hosts, it's just a precaution [19:53:46] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2306727 (10BBlack) cp3048 today: 181G virtual, 88G resident [19:55:36] !log disable puppet on restbase1010 restbase1011 restbase1012 restbase1013 restbase1014 restbase1015 as well [19:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:59:03] (03PS20) 10Alexandros Kosiaris: Cassandra 2.2.6 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [19:59:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Cassandra 2.2.6 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160518T2000). Please do the needful. 
[20:02:04] !log Restarting Cassandra on xenon.eqiad.wmnet post-puppet-run : T126629 [20:02:06] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:27] chasemp: Could you fix the link at https://phabricator.wikimedia.org/phame/post/view/4/horizon_is_now_the_best_ui_for_labs_tools/ ? [20:03:52] (Either [[ url | label ]] or [label](url) - [[ url ]](label) doesn't work) [20:04:36] Krinkle: done? [20:04:45] chasemp: thx [20:11:06] (03PS2) 10ArielGlenn: add option to onallwikis to run from a base wiki, with wikis as args [dumps] - 10https://gerrit.wikimedia.org/r/289501 [20:12:51] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner: Rationalize our jobqueues redis topology - https://phabricator.wikimedia.org/T135113#2306793 (10aaron) Seems sensible to me. We should document that on manual fail-over (say of a server in eqiad) to: a) switchover that logical partition to the other... [20:12:59] (03CR) 10ArielGlenn: [C: 032] add option to onallwikis to run from a base wiki, with wikis as args [dumps] - 10https://gerrit.wikimedia.org/r/289501 (owner: 10ArielGlenn) [20:14:03] !log Deployment of 1.28.0-wmf.2 [T134450] is no longer blocked Preparing to deploy to group1 wikis. 
[20:14:04] T134450: MW-1.28.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T134450 [20:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:59] !log restarting Cassandra on praseodymium.eqiad.wmnet : T126629 [20:16:00] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:12] !log restarting Cassandra on cerium.eqiad.wmnet : T126629 [20:17:13] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:43] !log Rolling restart of restbase-test200[1-3].codfw.wmnet : T126629 [20:19:43] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:21:35] (03PS1) 1020after4: group1 to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289543 [20:23:00] (03PS2) 1020after4: group1 to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289543 (https://phabricator.wikimedia.org/T134450) [20:24:17] (03CR) 1020after4: [C: 032] group1 to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289543 (https://phabricator.wikimedia.org/T134450) (owner: 1020after4) [20:24:36] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [20:24:54] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [20:24:59] (03Merged) 10jenkins-bot: group1 to 1.28.0-wmf.2 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/289543 (https://phabricator.wikimedia.org/T134450) (owner: 1020after4) [20:26:14] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Te [20:26:41] !log git deployed tilerator, will restart one by one manually. https://gerrit.wikimedia.org/r/#/c/289542/ [20:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:19] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.2 [20:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:28:45] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [20:31:24] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [20:34:19] andre__afk: subprojects and milestones should be logged as well? [20:34:31] I guess so [20:35:09] Danny_B: wrong channel? I don't see a reason to log milestones if sprints aren't logged [20:36:07] twentyafterfour: (which other chan would be better?) well, that's why i'm asking... ;-) [20:36:44] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [20:36:47] to clarify: i meant logging their creation in project creation log [20:37:06] Danny_B: #wikimedia-devtools [20:37:32] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2306885 (10Ottomata) Grr, these are getting close to full. 
Luca and I tried to dynamically set topic retention, but kafka d... [20:37:48] (03CR) 10Yuvipanda: [C: 031] labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 (owner: 10Rush) [20:40:35]  [20:40:41] ~ll [20:40:47] gah [20:40:58] ssh session fail. [20:41:17] !log Renabling puppet on restbase production cluster : T126629 [20:41:18] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:43:01] (03PS9) 10Rush: labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 [20:43:53] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [20:45:03] Danny_B: #wikimedia-devtools? [20:45:15] PROBLEM - tilerator on maps-test2003 is CRITICAL: Connection refused [20:45:44] so ... should I be worried about this: {"id":"b6c572002cf0e79d7b681abe","type":"DBReplicationWaitError","file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/db/loadbalancer/LBFactory.php","line":396,"message":"Could not wait for slaves to catch up to 10.64.0.7", [20:46:04] PROBLEM - tilerator on maps-test2001 is CRITICAL: Connection refused [20:46:11] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2306971 (10Ottomata) I take it back! The command I had run previously looks like it had a larger retention.ms than the defa... 
[20:46:44] PROBLEM - tilerator on maps-test2004 is CRITICAL: Connection refused [20:46:54] PROBLEM - tilerator on maps-test2002 is CRITICAL: Connection refused [20:49:49] (03CR) 10Rush: [C: 032 V: 032] labs nfs client instance refactor [puppet] - 10https://gerrit.wikimedia.org/r/289490 (owner: 10Rush) [20:51:34] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200): /page/mobile-sections/{title} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [20:54:05] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200) [20:56:04] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [20:57:45] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [20:59:32] thcipriani: that is an External Store replication borked somehow [20:59:35] err [20:59:37] RFC meeting starting now in #wikimedia-office: Requirements for change propagation [20:59:54] twentyafterfour: that is an External Store replication borked somehow. No idea where those DB hosts are monitored though :( [21:03:54] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [21:04:34] PROBLEM - HHVM rendering on mw2056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:04:43] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:04:44] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:06:24] RECOVERY - HHVM rendering on mw2056 is OK: HTTP OK: HTTP/1.1 200 OK - 67818 bytes in 0.425 second response time [21:06:44] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [21:06:53] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [21:06:58] hashar: well those errors are new since I rolled out wmf.2 [21:10:01] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2307079 (10Ottomata) Ok, brokers have deleted webrequest_upload data older than 48 hours. I've removed the topic config ove... [21:12:27] (03PS5) 10Andrew Bogott: Organize Ganglia cluster names for labs [puppet] - 10https://gerrit.wikimedia.org/r/289315 [21:13:36] (03PS1) 10Alexandros Kosiaris: cassandra::metrics: Refresh the collector on a unit file change [puppet] - 10https://gerrit.wikimedia.org/r/289559 [21:14:21] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/289559 (owner: 10Alexandros Kosiaris) [21:15:51] (03PS1) 10ArielGlenn: enable onallwikis to run non php scripts by passing full arguments [dumps] - 10https://gerrit.wikimedia.org/r/289561 [21:16:22] still seeing a ton of DBReplicationWaitError exceptions in fatalmonitor [21:17:01] (03PS2) 10Alexandros Kosiaris: cassandra::metrics: Refresh the collector on a unit file change [puppet] - 10https://gerrit.wikimedia.org/r/289559 [21:17:09] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cassandra::metrics: Refresh the collector on a unit file change [puppet] - 10https://gerrit.wikimedia.org/r/289559 (owner: 10Alexandros Kosiaris) [21:17:10] I'm considering rollback of wmf.2 [21:19:04] RECOVERY - tilerator on maps-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.091 second response time [21:19:05] (03CR) 10ArielGlenn: [C: 032] enable onallwikis to run non php scripts by passing full arguments [dumps] - 
10https://gerrit.wikimedia.org/r/289561 (owner: 10ArielGlenn) [21:21:18] (03PS1) 1020after4: roll group1 back to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289562 [21:24:58] (03PS6) 10Andrew Bogott: Organize Ganglia cluster names for labs [puppet] - 10https://gerrit.wikimedia.org/r/289315 [21:27:05] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /page/mobile-sections/{title} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [21:29:14] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [21:30:54] (03CR) 1020after4: [C: 032] roll group1 back to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289562 (owner: 1020after4) [21:31:32] (03Merged) 10jenkins-bot: roll group1 back to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289562 (owner: 1020after4) [21:31:32] anyone know anything about the "external store replication" ? [21:32:02] (03PS1) 10Alexandros Kosiaris: maps: Specify cassandra graphite host [puppet] - 10https://gerrit.wikimedia.org/r/289564 [21:32:16] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: roll back group1 to 1.28.0-wmf.1 [21:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:33:23] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Puppet has 1 failures [21:35:34] (03PS1) 10Alexandros Kosiaris: base::service_unit: Remove old cruft [puppet] - 10https://gerrit.wikimedia.org/r/289565 [21:37:53] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [21:38:04] ok rollback relieved the errors, no idea where to look for the culprit [21:38:25] twentyafterfour: Did you have a ticket?
[21:38:33] https://phabricator.wikimedia.org/T135690 [21:38:44] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:39:01] !log enable puppet on aqs1001 aqs1002 aqs1003 aqs1004 aqs1005 aqs1006 maps-test2001 maps-test2002 maps-test2003 maps-test2004 maps2001 maps2002 maps2003 maps2004 [21:39:04] PROBLEM - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Puppet has 1 failures [21:39:04] hoo: T135690 is the ticket I submitted [21:39:04] T135690: DBReplicationWaitError: Could not wait for slaves to catch up to 10.64.0.7 - https://phabricator.wikimedia.org/T135690 [21:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:40:43] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.092 second response time [21:41:13] RECOVERY - puppet last run on maps-test2002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [21:41:26] twentyafterfour: Weird [21:41:56] @replag [21:42:01] all the errors appear to be on commonswiki, triggered by job runners [21:42:03] Er, that changed. [21:42:31] ostriches: @replag? [21:42:40] Used to trigger a bot or something [21:42:42] Anyway [21:44:12] the errors started happening immediately when rolling out the new branch and subsided immediately after I rolled back. So something is going on besides actual replication lag [21:44:29] Yeah, I'm not seeing any actual lag. [21:44:30] that or the error condition just wasn't being logged previously [21:44:36] At least not on the clusters we care about [21:44:42] (er, care about for prod) [21:45:25] can I be of any help from the DB side? [21:45:47] * volans just saw the task on phab via email [21:46:10] twentyafterfour: Were they all from 10.64.0.7/es1012 or was it a bunch of hosts? [21:47:38] ostriches: replag on es1 shard? they are standalone [21:47:55] Yeah, which is why I don't think it's an actual replication bug.
[21:48:13] if the code does show slave status and expect an output could be [21:48:21] it will return empty set on es1 [21:48:47] I'm not seeing anything interesting changed in the LB code in the last week or two, hmm [21:49:29] Last thing was on May 6th, but that should be live already [21:49:50] Errr... [21:49:55] mebbe not [21:50:04] link to the diff? [21:50:32] ostriches: which patch? [21:50:46] https://gerrit.wikimedia.org/r/#/c/286314/ [21:50:46] also, the job runner code is in a separate repo so people could have easily forgotten to update it... [21:50:54] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: puppet fail [21:50:59] that was in wmf.1 though [21:51:02] Yeah [21:51:05] So shouldn't be that [21:51:22] git branch --contains was lying to me locally :) [21:53:24] RECOVERY - tilerator on maps-test2003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.090 second response time [21:53:53] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:54:36] is it a specific job type? or all jobs? [21:56:23] the change was actually from 1.28.0-wmf.1 to .2 or the hosts that showed the errors were in some other version prior to the upgrade? [21:58:29] legoktm: afaict, just RefreshLinks [21:58:31] https://logstash.wikimedia.org/#dashboard/temp/AVTF4CiE0z-7ykXOp4j_ [21:59:10] I'm curious if the same error is happening for group0 on wmf.2 as well, just at too low a volume to spot easily [22:00:18] Although, I'm curious why only es1012 is the complaining host. I haven't seen any others in this log set. [22:00:24] It's certainly not *actual* replication lag. [22:00:40] nothing for es1016 or es1018?
[22:00:43] !log restart cassandra on maps-test200{2,3,4} to get logstash changing working [22:00:45] (same shard) [22:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:01:15] RECOVERY - tilerator on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.089 second response time [22:01:34] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:01:56] ostriches also only commonswiki [22:03:34] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [22:03:36] I see wikdafawimi too [22:03:45] Err wikidatawiki [22:03:50] Stupid phone [22:04:08] volans: not that I see [22:04:32] me neither from the logs [22:04:32] (03PS1) 10ArielGlenn: in configs, allow comma separated list of files of wikis to be skipped [dumps] - 10https://gerrit.wikimedia.org/r/289567 [22:05:05] (03CR) 10jenkins-bot: [V: 04-1] in configs, allow comma separated list of files of wikis to be skipped [dumps] - 10https://gerrit.wikimedia.org/r/289567 (owner: 10ArielGlenn) [22:06:03] updated the task with "what we know so far" [22:06:08] twentyafterfour: Er, you're right. [22:06:13] commonswiki only [22:06:13] (03PS2) 10Alexandros Kosiaris: base::service_unit: Remove old cruft [puppet] - 10https://gerrit.wikimedia.org/r/289565 [22:06:21] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] base::service_unit: Remove old cruft [puppet] - 10https://gerrit.wikimedia.org/r/289565 (owner: 10Alexandros Kosiaris) [22:06:23] I didn't filter enough [22:06:29] twentyafterfour: can you try again with https://gerrit.wikimedia.org/r/#/c/289568/1 ? 
[22:06:53] there is some deeper problem with the non-replicating ES store handling with slave lag, but that should work around it at least [22:07:26] AaronSchulz: ok [22:07:41] that must be a super old bug [22:07:55] guess it didn't get hit enough before [22:09:00] (03PS2) 10ArielGlenn: in configs, allow comma separated list of files of wikis to be skipped [dumps] - 10https://gerrit.wikimedia.org/r/289567 [22:10:29] (03PS1) 1020after4: group1 to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289570 (https://phabricator.wikimedia.org/T134450) [22:11:20] I don't know why but it happened thousands of times as soon as I rolled out wmf.2 [22:11:29] mutante: Are the ircd restarts finished? I think you broke my bot for like a week. [22:12:30] Leah: i have not restarted it since we talked last time [22:12:38] Leah: one time the entire VM crashed though [22:12:46] mutante: I think it got restarted since we talked. [22:12:48] Since the motd changed. [22:12:53] I saw some e-mails about it. [22:13:02] Maybe that was the crash, I dunno. [22:13:04] ostriches, twentyafterfour between 20:30 and 21:00 UTC the same error was logged also for specieswiki, hewiki, mediawikiwiki, ruwikisource, metawiki [22:13:05] most clients auto-reconnect [22:13:07] (03PS3) 10Alexandros Kosiaris: ores: Use the new service::uwsgi define [puppet] - 10https://gerrit.wikimedia.org/r/288618 [22:13:17] i am not aware of another restart [22:13:21] Okay, cool. [22:13:36] (I filtered out commonswiki) [22:15:00] either way, "you broke my bot for like a week" is not it [22:16:48] more like "ganeti freeze issue" https://phabricator.wikimedia.org/T134242 and "bot fails to reconnect" [22:17:26] heh [22:19:13] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
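[editor's note] The diagnosis above (standalone es1 hosts return an empty set for slave status, so a caller that expects replication output fails to "wait") can be illustrated with a minimal sketch. This is not the MediaWiki code and not the content of the Gerrit patches mentioned in the log; `wait_for_replicas`, `get_lag`, and `ReplicationWaitError` are hypothetical names, and treating a missing replication status as "nothing to wait for" is an assumption about one reasonable way to handle it.

```python
import time
from typing import Callable, Iterable, List, Optional


class ReplicationWaitError(Exception):
    """Raised when a replicating host does not catch up within the timeout."""


def wait_for_replicas(
    hosts: Iterable[str],
    get_lag: Callable[[str], Optional[float]],
    timeout: float = 10.0,
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Wait until every replicating host reports zero lag.

    get_lag(host) is a hypothetical callback that returns the lag in
    seconds, or None when the host has no replication status at all
    (a standalone server, like the es1 hosts discussed in the log).
    Returns the list of hosts skipped because they are standalone.
    """
    skipped: List[str] = []
    for host in hosts:
        deadline = clock() + timeout
        while True:
            lag = get_lag(host)
            if lag is None:
                # Assumption: no replication status means nothing to wait
                # for, so skip the host rather than counting it as a failure.
                skipped.append(host)
                break
            if lag <= 0:
                break
            if clock() >= deadline:
                raise ReplicationWaitError(
                    f"replica {host} did not catch up within {timeout}s"
                )
            sleep(poll_interval)
    return skipped
```

With this shape, a standalone host is reported back to the caller instead of raising, while a replicating host that stays behind still times out with an error.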
[22:19:23] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php: deploy https://gerrit.wikimedia.org/r/#/c/289569/ (duration: 00m 29s) [22:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:20:33] (03CR) 1020after4: [C: 032] group1 to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289570 (https://phabricator.wikimedia.org/T134450) (owner: 1020after4) [22:21:23] (03Merged) 10jenkins-bot: group1 to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289570 (https://phabricator.wikimedia.org/T134450) (owner: 1020after4) [22:23:20] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: second attempt: group1 to 1.28.0-wmf.2 [22:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:25:09] AaronSchulz: seems like your patch fixed it [22:50:19] 06Operations: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#2307445 (10GWicke) Previous related discussion: - https://www.mediawiki.org/wiki/Packaging - https://gerrit.wikimedia.org/r/#/c/136128/ An example of a repository manager with good support for mult... [22:53:45] twentyafterfour: lol @ 1319.82% increase [23:00:04] RoanKattouw ostriches Krenair Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160518T2300). Please do the needful. [23:00:04] matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:05:08] I'll do it [23:05:22] And I can verify matt_flaschen's patches as well because I know what they are [23:06:04] I'm here, sorry. 
[23:08:11] (03PS8) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) [23:09:02] (03PS12) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [23:09:18] 06Operations: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#2307476 (10yuvipanda) Also https://wikitech.wikimedia.org/wiki/Aptly [23:10:16] (03PS2) 10Catrope: Enable Flow beta feature on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289492 (https://phabricator.wikimedia.org/T135582) [23:10:18] (03PS1) 10Aaron Schulz: Fix slave lag wait calls for read-only ES clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289575 (https://phabricator.wikimedia.org/T135690) [23:10:22] (03CR) 10Catrope: [C: 032] Enable Flow beta feature on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289492 (https://phabricator.wikimedia.org/T135582) (owner: 10Catrope) [23:11:14] (03Merged) 10jenkins-bot: Enable Flow beta feature on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289492 (https://phabricator.wikimedia.org/T135582) (owner: 10Catrope) [23:14:18] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow beta feature on outreachwiki (T135582) (duration: 00m 31s) [23:14:19] T135582: Enable Flow on Outreach in Beta Feature - https://phabricator.wikimedia.org/T135582 [23:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:59] !log catrope@tin Synchronized php-1.28.0-wmf.2/extensions/Echo/: SWAT: fix URLs in notification emails (T135625) (duration: 00m 35s) [23:28:00] T135625: Broken user rights email notification (test2wiki) - https://phabricator.wikimedia.org/T135625 [23:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master