[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151118T0000). Please do the needful. [00:00:04] MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:08] while now role::labs;:instance stuff is via node.pp [00:00:12] and everything else is via LDAP [00:00:18] Oh, I thought you hadn’t merged the role::labs::instance thing yet [00:00:22] andrewbogott: I had [00:00:27] atchyer service! [00:00:29] so what this meant [00:00:37] is that the conflict which had always existed [00:00:45] ran in such a way that the package {} declaration in the exec node [00:00:47] came first [00:00:49] before the ensure_package [00:00:51] in diamond [00:00:55] and since ensure_package handles package{} fine [00:00:57] since I'm the only one in the list, can deploy myself [00:00:58] it didn't error [00:01:02] MaxSem, ok [00:01:10] and now somehow the package{} came second [00:01:11] was about to do it, but sure [00:01:12] and caused errors [00:01:17] which I've fixed by using ensure_package in both places [00:01:22] (03PS6) 10Dzahn: deactivate wikimania.asia [dns] - 10https://gerrit.wikimedia.org/r/244103 [00:01:25] this also explains why the webproxy had no errors [00:01:29] and neither did other labs projects [00:01:30] YuviPanda: ok, so shall I abandon the notify-realm patch? 
[00:01:40] andrewbogott: I think it'll be useful to verify just inc ase [00:01:42] *case [00:01:44] (03CR) 10Dzahn: [C: 032] deactivate wikimania.asia [dns] - 10https://gerrit.wikimedia.org/r/244103 (owner: 10Dzahn) [00:01:50] ok [00:01:52] this is a theory that fits facts but there is a handwavy bit about 'ordering' [00:02:05] * andrewbogott breaks puppet on EVERY SINGLE LABS AND PRODUCTION HOST [00:02:11] (03CR) 10MaxSem: [C: 032] Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253772 (owner: 10MaxSem) [00:02:24] (03PS2) 10Andrew Bogott: Display 'realm' during each puppet run. [puppet] - 10https://gerrit.wikimedia.org/r/253789 [00:02:29] icinga-wm: die :) [00:02:32] (03Merged) 10jenkins-bot: Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253772 (owner: 10MaxSem) [00:03:22] YuviPanda: nothing handwavy about it - puppet is evil and has happy fun ordering issues. [00:03:26] oops, can I stick a config change in swat? [00:03:33] (03CR) 10Andrew Bogott: [C: 032] Display 'realm' during each puppet run. [puppet] - 10https://gerrit.wikimedia.org/r/253789 (owner: 10Andrew Bogott) [00:03:36] (03PS3) 10Dzahn: deactivate wikimemory.org [dns] - 10https://gerrit.wikimedia.org/r/244101 [00:03:45] Coren: ot [00:03:57] "ot"? [00:04:01] https://gerrit.wikimedia.org/r/#/c/253682/ specifically [00:04:04] Coren: it's pretty handwavy because it doesn't say why there's a difference in ordering between node.pp and ldap [00:04:07] !log maxsem@tin Synchronized portals/: SWAT (duration: 00m 32s) [00:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:04:11] (premature enter, was going to be 'it') [00:04:20] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [00:04:40] legoktm: only if you use mira to deploy :) *kidding* [00:05:01] YuviPanda: realms look right to me. 
[00:05:07] YuviPanda: Ah, I see what you mean - yeah, but I think it's a given that pretty much anything that has any sort of dependency on ordering will randomly break in puppet. :-) [00:05:16] (03CR) 10Dzahn: [C: 032] deactivate wikimemory.org [dns] - 10https://gerrit.wikimedia.org/r/244101 (owner: 10Dzahn) [00:05:18] (03CR) 10Rush: "puppet compiler: http://puppet-compiler.wmflabs.org/1310/" [puppet] - 10https://gerrit.wikimedia.org/r/253780 (owner: 10Rush) [00:05:21] :) [00:05:26] I’m pretty sure that ldap always happens first. [00:05:44] (03CR) 10Legoktm: "@Aaron: Addshore and I will make sure that's merged & deployed before we deploy this to any other wikis outside of group0." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253682 (owner: 10Addshore) [00:05:57] MaxSem: are you done? [00:06:09] andrewbogott: The thing is, dependencies will complicate things even if you think you know what order the definitions come in. [00:06:39] legoktm, still poking [00:06:48] andrewbogott: yeah that'd explain this [00:06:53] Right, I’m not saying we should rely on the ordering, only that it explains why things never happened in this order before. [00:07:19] andrewbogott: Ah, it probably does at that. [00:07:24] * legoktm waits [00:07:49] Coren: this is why things that are potentially unordered should be explicitly disordered :) Doesn’t perl do that with lists? [00:08:15] andrewbogott: i'm going to revert the realm patch now [00:08:24] andrewbogott: hashes :-) [00:08:52] YuviPanda: ok! [00:08:54] (03PS1) 10Yuvipanda: Revert "Display 'realm' during each puppet run." [puppet] - 10https://gerrit.wikimedia.org/r/253791 [00:09:20] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "Display 'realm' during each puppet run." [puppet] - 10https://gerrit.wikimedia.org/r/253791 (owner: 10Yuvipanda) [00:10:51] (03CR) 10Andrew Bogott: [C: 031] "prod.sh is only meant to serve as an example rather than be run, so it's probably fine to leave the neutron stuff in there... 
at least it" [puppet] - 10https://gerrit.wikimedia.org/r/253780 (owner: 10Rush) [00:13:21] MaxSem: ping me once you're done? [00:17:43] (03PS3) 10Rush: WIP: labs remove old neutron configs [puppet] - 10https://gerrit.wikimedia.org/r/253780 [00:18:41] (03PS4) 10Rush: labs remove old neutron configs [puppet] - 10https://gerrit.wikimedia.org/r/253780 [00:19:51] (03PS5) 10Rush: labs remove old neutron configs [puppet] - 10https://gerrit.wikimedia.org/r/253780 [00:21:04] legoktm, we're still investigating but go ahead [00:21:28] (03CR) 10Rush: [C: 032] labs remove old neutron configs [puppet] - 10https://gerrit.wikimedia.org/r/253780 (owner: 10Rush) [00:23:00] ok, thanks [00:23:09] (03CR) 10Legoktm: [C: 032] wgRCWatchCategoryMembership true on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253682 (owner: 10Addshore) [00:23:52] (03Merged) 10jenkins-bot: wgRCWatchCategoryMembership true on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253682 (owner: 10Addshore) [00:26:07] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: wgRCWatchCategoryMembership true on group0 wikis (duration: 00m 27s) [00:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:27:18] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:27:48] > (diff | hist) . . Category:Baz‎; 00:27 . . (0)‎ . . ‎Legoktm (talk | contribs | block)‎ (Blablabla added to category) [00:27:55] yay [00:28:06] ? [00:29:34] Category memberships in recentchanges [00:29:45] on group0 wikis [00:29:59] https://test.wikipedia.org/w/index.php?title=Special:RecentChanges&hidecategorization=0 [00:30:19] oooh [00:32:00] How long before the DBA goes lolno? 
:D [00:32:44] Reedy, too late [00:33:55] (03PS1) 10Rush: trebuchet set default master for realm handling [puppet] - 10https://gerrit.wikimedia.org/r/253795 [00:35:38] (03PS2) 10Rush: trebuchet set default master for realm handling [puppet] - 10https://gerrit.wikimedia.org/r/253795 [00:36:19] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: puppet fail [00:36:26] gj ori [00:37:06] (03CR) 10Rush: [C: 032] trebuchet set default master for realm handling [puppet] - 10https://gerrit.wikimedia.org/r/253795 (owner: 10Rush) [00:40:08] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:43:46] (03CR) 10Tim Landscheidt: "This changes the semantics from "ensure => latest" to "ensure => present", so if Python users encounter bugs in packages, they should be a" [puppet] - 10https://gerrit.wikimedia.org/r/253788 (owner: 10Yuvipanda) [00:44:29] (03CR) 10Yuvipanda: "Hmm, that's a great point and I'm not exactly sure how to fix this. We definitely should have them set to ensure => latest..." [puppet] - 10https://gerrit.wikimedia.org/r/253788 (owner: 10Yuvipanda) [00:45:10] (03PS1) 10Rush: labtest realm minimal mail configuration [puppet] - 10https://gerrit.wikimedia.org/r/253800 [00:47:18] (03PS3) 10Dzahn: deactivate wikidisclosure.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243973 [00:47:23] (03PS2) 10Rush: labtest realm minimal mail configuration [puppet] - 10https://gerrit.wikimedia.org/r/253800 [00:47:31] (03CR) 10Dzahn: [C: 032] deactivate wikidisclosure.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243973 (owner: 10Dzahn) [00:49:02] (03CR) 10Rush: [C: 032] labtest realm minimal mail configuration [puppet] - 10https://gerrit.wikimedia.org/r/253800 (owner: 10Rush) [00:52:53] goddamit [00:52:55] oh well [00:53:10] andrewbogott: I have https://gerrit.wikimedia.org/r/253801 and https://gerrit.wikimedia.org/r/253802 [00:55:19] YuviPanda: what happened to ‘base’? 
[00:55:29] (03PS1) 10Rush: labtestneutron initial site definition [puppet] - 10https://gerrit.wikimedia.org/r/253804 [00:56:54] (03PS2) 10Rush: labtestneutron initial site definition [puppet] - 10https://gerrit.wikimedia.org/r/253804 [00:59:27] andrewbogott: hmm that's been included by role::labs::instance for a while [00:59:32] (03CR) 10Rush: [C: 032] labtestneutron initial site definition [puppet] - 10https://gerrit.wikimedia.org/r/253804 (owner: 10Rush) [00:59:34] andrewbogott: and afaict it's not actually been in most new instances [00:59:35] oh, ok then [00:59:46] andrewbogott: since it got removed from the default list of roles at some point in the past [01:00:01] your patch indicates that it’s added to new instances though? https://gerrit.wikimedia.org/r/#/c/253802/1/wmf-config/wikitech.php [01:00:18] andrewbogott: theoretically [01:00:27] andrewbogott: but I think that doesn't work if you remove it from the default list of roles [01:00:44] http://tools.wmflabs.org/watroles/role/role::labs::bastion [01:01:22] (03PS3) 10Dzahn: deactivate wikifamily.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243972 [01:01:23] sorry, I don’t understand what you’re saying :) [01:01:31] It’s not a role, so I wouldn’t expect to see it there [01:01:48] PROBLEM - nutcracker process on mw1156 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:01:50] PROBLEM - HHVM processes on mw1156 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:38] PROBLEM - dhclient process on mw1156 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:03:02] Hm, although I can confirm that it doesn’t seem to be in the ldap def for many instances [01:03:19] or… any [01:03:19] PROBLEM - salt-minion processes on mw1156 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:03:20] I wonder why not [01:04:07] shiiiit, mw1156 is crapping out in the same way 1158 did earlier today [01:04:09] * andrewbogott checks 1157 [01:04:18] andrewbogott: no, 'role' as in if I add it to 'manage puppet roles' from the left sidebar [01:04:24] andrewbogott: err, remove it from there. [01:05:12] (03PS1) 10Rush: labtestneutron2001 initial site typo correction [puppet] - 10https://gerrit.wikimedia.org/r/253806 [01:05:52] (03PS2) 10Rush: labtestneutron2001 initial site typo correction [puppet] - 10https://gerrit.wikimedia.org/r/253806 [01:06:57] oh, actually… YuviPanda I don’t think that silver config does anything anymore. I think those initial classes are added by a designate plugin. [01:07:02] (03CR) 10Rush: [C: 032] labtestneutron2001 initial site typo correction [puppet] - 10https://gerrit.wikimedia.org/r/253806 (owner: 10Rush) [01:07:15] andrewbogott: oh... I see [01:07:31] we should kill them either way then [01:07:47] yes [01:07:55] kill [01:07:57] but also, look at designate.conf; you’ll want to make a change there too [01:08:19] andrewbogott: ok! [01:11:19] RECOVERY - puppet last run on labtestneutron2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [01:11:57] YuviPanda: andrewbogott fyi there is a salt grain for realm TIL [01:12:13] that could be handy, if salt works [01:12:21] :) [01:12:27] chasemp: can I interrupt you to help me figure out what’s happening with the rendering servers? [01:12:46] I was about to hop off but I have a minute, sorry I am out of the loop what's up? [01:12:51] I depooled 1158 earlier because services were dying. Now services are also dying on 1156 [01:13:02] in both cases dmesg looks scary, ‘kernel bug blahblahblah' [01:13:16] and the oom killer is running on 1156 even though it looks like it has plenty of memory [01:13:27] AND I’m worried this same thing is happening on 1157 et al. 
as well [01:13:40] but probably this is something that’s ‘normal’ and I’m just not used to looking [01:13:49] what's the range on render mw*? [01:13:49] ori, if awake, same questions to you ^ ^^ ^^^ [01:13:52] how many are there? [01:14:02] there were 8, now seven since I pulled out 1158 [01:14:14] andrewbogott, ori is walking home [01:14:21] ok [01:14:33] (03PS1) 10Yuvipanda: designate: Stop populating default classes / variables [puppet] - 10https://gerrit.wikimedia.org/r/253807 (https://phabricator.wikimedia.org/T101447) [01:14:45] mw1153-mw1160 [01:15:23] andrewbogott: https://phabricator.wikimedia.org/T107698 ? [01:15:50] chasemp: https://phabricator.wikimedia.org/T118888 [01:15:55] maybe the same? I’ll compare [01:16:09] I think it's teh same bug is what I meant :) [01:16:35] it’s the same one! [01:16:36] yes [01:16:45] so, it was fixed with a kernel update? [01:17:19] seems so and it was backported to trusty (which 1156 is) [01:17:20] but is behind [01:17:25] we could update these... [01:17:39] apt-get install linux-headers-3.13.0-62 linux-headers-3.13.0-62-generic linux-image-3.13.0-62-generic linux-image-extra-3.13.0-62-generic linux-tools-3.13.0-62 linux-tools-3.13.0-62-generic [01:18:21] hm... [01:18:32] maybe do the one already depooled to see? [01:18:44] I don’t know how quickly depooling responds. But yeah, I’ll do that one at least. [01:18:54] but I'm not sure why 3 in one day out of this pool [01:19:14] 6operations: Kernel errors on mw1158 - https://phabricator.wikimedia.org/T118888#1812759 (10Andrew) This looks to be the same as https://phabricator.wikimedia.org/T107698 I'm going to update the kernel on that box and reboot, and repool if it seems happy. 
[01:19:27] if it’s load-related, then depooling one might cause others to fall [01:19:51] !log upgrading mw1158 to 3.13.0-62 [01:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:21:43] (03CR) 10Dzahn: [C: 032] deactivate wikifamily.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243972 (owner: 10Dzahn) [01:21:59] ehh [01:22:00] !log rebooting mw1158 [01:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:22:06] MaxSem: ? [01:22:11] why the hell do we own wikifamily.* ? [01:22:35] ask legal [01:25:26] heh, 1158 is too sick to launch mollyguard [01:25:52] pull the plug :) [01:26:24] I guess I will.. reboot from mgmt [01:29:41] greg-g, i need to bump graphoid service, noone is deploying, added myself [01:30:40] RECOVERY - nutcracker process on mw1158 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [01:31:28] RECOVERY - salt-minion processes on mw1158 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:32:10] RECOVERY - dhclient process on mw1158 is OK: PROCS OK: 0 processes with command name dhclient [01:32:18] RECOVERY - HHVM processes on mw1158 is OK: PROCS OK: 11 processes with command name hhvm [01:33:28] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: puppet fail [01:35:57] !log repooling mw1158, depooling mw1156 [01:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:36:30] 6operations: Kernel errors on rendering hosts - https://phabricator.wikimedia.org/T118888#1812814 (10Andrew) [01:37:10] 6operations: Kernel errors on rendering hosts - https://phabricator.wikimedia.org/T118888#1811767 (10Andrew) I ran apt-get install linux-headers-3.13.0-62 linux-headers-3.13.0-62-generic linux-image-3.13.0-62-generic linux-image-extra-3.13.0-62-generic linux-tools-3.13.0-62 linux-tools-3.13.0-62-generic on mw115... 
[01:40:22] (03PS2) 10Dzahn: deactivate wiki[p|m]ediastories.[com|net|org] [dns] - 10https://gerrit.wikimedia.org/r/244086 [01:40:58] !log updated kernel on mw1156, now rebooting. [01:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:41:18] (03CR) 10Dzahn: [C: 032] deactivate wiki[p|m]ediastories.[com|net|org] [dns] - 10https://gerrit.wikimedia.org/r/244086 (owner: 10Dzahn) [01:43:55] (03PS3) 10Dzahn: apache: remove visualwikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243340 [01:43:56] !log synced graphoid service [01:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:44:30] RECOVERY - dhclient process on mw1156 is OK: PROCS OK: 0 processes with command name dhclient [01:45:09] RECOVERY - salt-minion processes on mw1156 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:45:29] RECOVERY - nutcracker process on mw1156 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [01:45:39] RECOVERY - HHVM processes on mw1156 is OK: PROCS OK: 11 processes with command name hhvm [01:46:37] !log repooling mw1156 [01:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:48:40] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:49:38] 6operations: Kernel errors on rendering hosts - https://phabricator.wikimedia.org/T118888#1812842 (10Andrew) 1156 is now upgraded and seems ok. Moritz, I will plan on upgrading the kernels of 1153,54, 55, 57, 59 and 60 tomorrow unless you disagree. 
[01:52:13] (03CR) 10Dzahn: [C: 032] apache: remove visualwikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243340 (owner: 10Dzahn) [01:53:07] (03PS3) 10Dzahn: apache: remove wikiartpedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243341 [01:56:01] (03CR) 10Dzahn: [C: 032] apache: remove wikiartpedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243341 (owner: 10Dzahn) [01:57:03] (03Abandoned) 10Dzahn: add template for 'mailonly' domains [dns] - 10https://gerrit.wikimedia.org/r/244115 (owner: 10Dzahn) [01:57:56] 6operations, 7Database: Drop database table "hashs" from Wikimedia wikis - https://phabricator.wikimedia.org/T54927#1812863 (10MZMcBride) a:5Springle>3None [02:03:47] (03PS3) 10Dzahn: apache: remove softwarewikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243342 [02:03:54] Myself and another editor have noticed that new images are not resolving quickly and changes made to pages are not updating immediately after hitting "save" on English Wikipedia. Is this a known issue or something new? [02:05:56] (03CR) 10Dzahn: [C: 032] apache: remove softwarewikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243342 (owner: 10Dzahn) [02:07:02] 6operations, 10Analytics, 6Services: Wikimedia pageview API intermittently throwing HTTP 503s - https://phabricator.wikimedia.org/T118817#1812875 (10MZMcBride) Very nice. Thank you all for the quick investigation and resolution! 
[02:07:05] (03PS3) 10Dzahn: apache: remove wikidisclosure redirects [puppet] - 10https://gerrit.wikimedia.org/r/243347 [02:10:47] (03CR) 10Dzahn: [C: 032] apache: remove wikidisclosure redirects [puppet] - 10https://gerrit.wikimedia.org/r/243347 (owner: 10Dzahn) [02:11:24] (03PS3) 10Dzahn: apache: remove wikifamily redirects [puppet] - 10https://gerrit.wikimedia.org/r/243345 [02:16:19] (03CR) 10Dzahn: [C: 032] apache: remove wikifamily redirects [puppet] - 10https://gerrit.wikimedia.org/r/243345 (owner: 10Dzahn) [02:21:42] (03PS3) 10Dzahn: apache: remove webhostingwikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243344 [02:22:28] (03CR) 10Dzahn: [C: 032] apache: remove webhostingwikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243344 (owner: 10Dzahn) [02:26:10] (03PS3) 10Dzahn: apache: remove wikimaps redirects [puppet] - 10https://gerrit.wikimedia.org/r/243348 [02:26:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000000.0] [02:27:34] (03CR) 10Dzahn: [C: 032] apache: remove wikimaps redirects [puppet] - 10https://gerrit.wikimedia.org/r/243348 (owner: 10Dzahn) [02:28:01] (03PS2) 10Dzahn: apache: remove wikimania.asia redirect [puppet] - 10https://gerrit.wikimedia.org/r/244635 [02:30:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [02:31:30] this in global js code: var baseUrl = 'http://bits.beta.wmflabs.org/event.gif' [02:32:47] i dont know much about that, but prod using bits.beta seems like an issue [02:33:08] (03CR) 10Dzahn: [C: 032] apache: remove wikimania.asia redirect [puppet] - 10https://gerrit.wikimedia.org/r/244635 (owner: 10Dzahn) [02:34:48] (03PS2) 10Dzahn: apache: remove wikimedia.biz redirects [puppet] - 10https://gerrit.wikimedia.org/r/253763 (https://phabricator.wikimedia.org/T81344) [02:35:02] (03PS3) 10Dzahn: apache: remove wikimedia.biz redirects [puppet] 
- 10https://gerrit.wikimedia.org/r/253763 (https://phabricator.wikimedia.org/T81344) [02:35:04] mutante: isn't global (is in a closure), and is overridden at runtime to www.wikipedia.org/beacon/event [02:35:16] but yeah, the pattern is weird to set it to dev-value first and fix later [02:35:51] Krinkle: aah, ok [02:35:52] nod [02:36:39] (03CR) 10Dzahn: [C: 032] apache: remove wikimedia.biz redirects [puppet] - 10https://gerrit.wikimedia.org/r/253763 (https://phabricator.wikimedia.org/T81344) (owner: 10Dzahn) [02:39:48] !log l10nupdate@tin Synchronized php-1.27.0-wmf.6/cache/l10n: l10nupdate for 1.27.0-wmf.6 (duration: 10m 10s) [02:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:49] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [04:08:58] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [100000000.0] [04:56:00] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:09:34] (03PS4) 1020after4: scap: Create wrapper script for master-master rsync [puppet] - 10https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016) (owner: 10BryanDavis) [06:12:10] (03CR) 1020after4: [C: 031] scap: Create wrapper script for master-master rsync [puppet] - 10https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016) (owner: 10BryanDavis) [06:12:57] (03CR) 1020after4: "This is needed for https://phabricator.wikimedia.org/D48" [puppet] - 10https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016) (owner: 10BryanDavis) [06:31:00] <_joe_> !log rsyncing terbium homes from rutherfordium [06:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:31:08] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures 
[06:31:08] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:18] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:00] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:09] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:39] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:39] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: puppet fail [06:32:58] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:56] (03PS1) 10Yuvipanda: mattermost: Introduce module [puppet] - 10https://gerrit.wikimedia.org/r/253839 [06:38:01] (03PS2) 10Yuvipanda: mattermost: Introduce module [puppet] - 10https://gerrit.wikimedia.org/r/253839 [06:38:46] (03CR) 10Yuvipanda: [C: 032 V: 032] mattermost: Introduce module [puppet] - 10https://gerrit.wikimedia.org/r/253839 (owner: 10Yuvipanda) [06:39:38] PROBLEM - Disk space on terbium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=73%) [06:40:00] ^ is being worked on [06:41:29] RECOVERY - Disk space on terbium is OK: DISK OK [06:49:43] (03PS1) 10Yuvipanda: mattermost: Rejigger dependencies [puppet] - 10https://gerrit.wikimedia.org/r/253840 [06:50:45] (03PS1) 10Yuvipanda: mattermost: Fix mariadb version number in role [puppet] - 10https://gerrit.wikimedia.org/r/253841 [06:50:52] (03CR) 10Yuvipanda: [C: 032] mattermost: Rejigger dependencies [puppet] - 10https://gerrit.wikimedia.org/r/253840 (owner: 10Yuvipanda) [06:51:03] (03CR) 10Yuvipanda: [C: 032 V: 032] mattermost: Fix mariadb version number in role [puppet] - 10https://gerrit.wikimedia.org/r/253841 (owner: 10Yuvipanda) [06:54:59] (03PS1) 10Yuvipanda: mattermost: Make /srv/mattermost writeable by www-data [puppet] - 10https://gerrit.wikimedia.org/r/253842 [06:55:20] (03CR) 10Yuvipanda: [C: 032 V: 032] mattermost: Make /srv/mattermost writeable by 
www-data [puppet] - 10https://gerrit.wikimedia.org/r/253842 (owner: 10Yuvipanda) [06:57:29] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:04] (03PS1) 10Yuvipanda: mattermost: Fix really stupid typo in template file [puppet] - 10https://gerrit.wikimedia.org/r/253843 [06:58:18] (03CR) 10Yuvipanda: [C: 032 V: 032] mattermost: Fix really stupid typo in template file [puppet] - 10https://gerrit.wikimedia.org/r/253843 (owner: 10Yuvipanda) [06:58:28] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:58] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:05:33] (03PS1) 10Yuvipanda: mattermost: Fix mysql connection string format [puppet] - 10https://gerrit.wikimedia.org/r/253844 [07:05:58] (03CR) 10Yuvipanda: [C: 032 V: 032] mattermost: Fix mysql connection string format [puppet] - 10https://gerrit.wikimedia.org/r/253844 (owner: 10Yuvipanda) [07:21:53] (03PS1) 10Yuvipanda: mattermost: Setup matterircd, an IRC <-> mattermost relay [puppet] - 10https://gerrit.wikimedia.org/r/253845 [07:22:15] (03CR) 10Yuvipanda: [C: 032 V: 032] mattermost: Setup matterircd, an IRC <-> mattermost relay [puppet] - 10https://gerrit.wikimedia.org/r/253845 (owner: 10Yuvipanda) [07:25:40] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [07:25:49] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:26:48] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:27:19] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:27:39] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 39 
seconds ago with 0 failures [07:41:19] (03PS1) 10Yuvipanda: mattermost: Pull in matterircd from a deploy branch [puppet] - 10https://gerrit.wikimedia.org/r/253848 [07:41:21] (03PS1) 10Yuvipanda: mattermost: Have matterircd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/253849 [07:41:33] (03CR) 10Yuvipanda: [C: 032 V: 032] mattermost: Pull in matterircd from a deploy branch [puppet] - 10https://gerrit.wikimedia.org/r/253848 (owner: 10Yuvipanda) [07:41:46] (03CR) 10Yuvipanda: [C: 032 V: 032] mattermost: Have matterircd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/253849 (owner: 10Yuvipanda) [08:28:20] 6operations, 6Labs: Can not access new public IP despite security settings - https://phabricator.wikimedia.org/T118936#1813172 (10yuvipanda) 3NEW [08:29:05] 6operations, 6Labs: Can not access new public IP despite security settings - https://phabricator.wikimedia.org/T118936#1813179 (10yuvipanda) So port 8065 is also open via a security group to just inside labs, and I can telnet that. So I changed port 6667 to be open to just inside labs and *can not telnet there... [08:31:29] 6operations, 6Labs: Can not access new public IP despite security settings - https://phabricator.wikimedia.org/T118936#1813181 (10yuvipanda) [08:49:49] 6operations, 6Labs: Can not access new public IP despite security settings - https://phabricator.wikimedia.org/T118936#1813206 (10yuvipanda) I've tried a bunch more things: - Moving it to port 9000, and trying 10.0.0.0/8 security group (no luck!) - Moving it to port 9000, and trying 0.0.0.0/0 security group (... [09:02:15] YuviPanda: maybe you are hit by not being able to add a new security group to an existing instance? 
https://phabricator.wikimedia.org/T42525 [09:02:30] YuviPanda: the security rules in labs are a bit scary and have a bunch of limitations :( [09:02:42] hashar: nope, this is me editing the default rule which is already applied [09:04:20] maybe the new rules are not being applied (unlikely) [09:05:02] good luck : [09:05:03] ( [09:06:01] (03PS2) 10Jcrespo: Repool db1027, depool db1044 (regular maintenance) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253600 [09:07:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 216, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/2: down - Core: cr1-ulsfo:xe-0/0/3 GTT/TiNet (02773-004-32) [2Gbps MPLS]BR [09:08:00] (03PS3) 10Jcrespo: Repool db1027, depool db1044 (regular maintenance) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253600 [09:08:12] (03PS4) 10Jcrespo: Repool db1027, depool db1044 (regular maintenance) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253600 [09:08:53] (03CR) 10Jcrespo: [C: 032] Repool db1027, depool db1044 (regular maintenance) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253600 (owner: 10Jcrespo) [09:09:00] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.923 second response time [09:09:17] <_joe_> !log restarted HHVM on mw1095, stuck in treadmill::startrequest() on a contended lock [09:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:09:51] RECOVERY - HHVM rendering on mw1095 is OK: HTTP OK: HTTP/1.1 200 OK - 66267 bytes in 0.140 second response time [09:11:52] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1027, depool db1044 for maintenance (duration: 01m 08s) [09:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:21:24] 6operations, 6Labs: Can not access new public IP despite security settings - https://phabricator.wikimedia.org/T118936#1813221 (10yuvipanda) With 
some debugging help from @akosiaris, it turns out that new security rules aren't being applied on labvirt1010 until a nova-compute restart (I had to restart twice) [09:22:46] (03PS5) 10Filippo Giunchedi: scap: Create wrapper script for master-master rsync [puppet] - 10https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016) (owner: 10BryanDavis) [09:22:52] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] scap: Create wrapper script for master-master rsync [puppet] - 10https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016) (owner: 10BryanDavis) [09:25:08] !log reimage restbase2002 [09:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:27:58] PROBLEM - Host restbase2002 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:53] (03PS1) 10Yuvipanda: mattermost: Adapt matterircd commandline params [puppet] - 10https://gerrit.wikimedia.org/r/253857 [09:29:28] (03PS2) 10Yuvipanda: mattermost: Adapt matterircd commandline params [puppet] - 10https://gerrit.wikimedia.org/r/253857 [09:32:49] 6operations, 7Database, 7Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#1813235 (10jcrespo) And, as an update, the number of updates and selects is in an all-time low right now (compared to last week). Related? 50% less job queuing: https://... 
[09:34:57] (03CR) 10Yuvipanda: [C: 032] mattermost: Adapt matterircd commandline params [puppet] - 10https://gerrit.wikimedia.org/r/253857 (owner: 10Yuvipanda) [09:37:35] (03PS1) 10Yuvipanda: mattermost: Enable team listing by default [puppet] - 10https://gerrit.wikimedia.org/r/253859 [09:37:46] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 [09:37:52] (03PS2) 10Yuvipanda: mattermost: Enable team listing by default [puppet] - 10https://gerrit.wikimedia.org/r/253859 [09:38:07] (03CR) 10Yuvipanda: [C: 032 V: 032] mattermost: Enable team listing by default [puppet] - 10https://gerrit.wikimedia.org/r/253859 (owner: 10Yuvipanda) [09:48:15] !log several db maintenance tasks on db1044 -optimization, upgrade- expect lag (node is depooled) [09:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:26] jynus: <3 for the !logging [09:50:50] +1 [09:50:52] I am confused? 
[09:51:08] lots of people don't !log when they do things [09:51:12] I have downtimed icinga [09:51:12] no, I 'm just saying I really appreciate it [09:51:23] (03PS4) 10Zfilipin: Add rubocop and 'test' target to Rakefile [puppet] - 10https://gerrit.wikimedia.org/r/252686 (https://phabricator.wikimedia.org/T117993) [09:51:45] <_joe_> YuviPanda: I hate people not !logging, as it's actually helpful [09:51:50] I agree [09:51:55] ori, this is other kind of optimization [09:52:02] space [09:52:09] not so much speed :) [09:52:36] I am saying thanks, I am just saying that this is my job, so expected [09:52:44] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/252686 (https://phabricator.wikimedia.org/T117993) (owner: 10Zfilipin) [09:53:32] problem with logging is that it is punctual, I would like to have a way to mention ongoing maintenance [09:54:59] (03PS1) 10Yuvipanda: mattermost: Ensure that local imagedir is present [puppet] - 10https://gerrit.wikimedia.org/r/253860 [09:55:17] (03PS2) 10Yuvipanda: mattermost: Ensure that local imagedir is present [puppet] - 10https://gerrit.wikimedia.org/r/253860 [09:56:48] (03CR) 10Yuvipanda: [C: 032] mattermost: Ensure that local imagedir is present [puppet] - 10https://gerrit.wikimedia.org/r/253860 (owner: 10Yuvipanda) [09:57:15] wikidata issues on db1070 [09:58:27] not ongoing, but do not see a reason either [10:10:57] 6operations: Kernel errors on rendering hosts - https://phabricator.wikimedia.org/T118888#1813280 (10MoritzMuehlenhoff) No, sounds good to me. [10:11:12] 796 wikiadmin users on db1035 [10:11:35] and growing [10:15:05] from terbium?, _joe_ [10:15:38] 6operations, 7Database: Drop phlegal_* databases from m3 - https://phabricator.wikimedia.org/T112573#1813291 (10Liuxinyu970226) [10:18:31] ... 
I would like to have a way to mention ongoing maintenance yeah, but you do not know if it is a punctual thing or it takes 2 months [10:19:35] (even if you log the end of it) [10:19:54] well if its ongoing, you should mention details in the message [10:20:05] for example, the previous message may not have an effect until 6 hours later [10:20:10] in terms of lag [10:20:29] and by that time, people will not notice the old message [10:20:33] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2002-a instance [puppet] - 10https://gerrit.wikimedia.org/r/253864 [10:20:34] if its something that going on for example 2 months, you should probably do more than a !log, email to lists have a phab ticket etc [10:20:45] sure [10:20:51] (03PS2) 10Filippo Giunchedi: cassandra: add restbase2002-a instance [puppet] - 10https://gerrit.wikimedia.org/r/253864 [10:20:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2002-a instance [puppet] - 10https://gerrit.wikimedia.org/r/253864 (owner: 10Filippo Giunchedi) [10:21:04] it is more of a wanting something stateful [10:21:28] e.g. "do not touch this table until new order because you will break the import" [10:21:46] not happening right now [10:22:02] but it happened when 2 people were working on the same domain [10:29:32] 6operations, 10Salt: slow salt-call invocation on minions - https://phabricator.wikimedia.org/T118380#1813313 (10fgiunchedi) looking at bit more into this, redis on tin is `2:2.6.13-1+wmf1` though ipv6 support landed in 2.8 as per https://github.com/antirez/redis/pull/61 [10:30:10] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1813315 (10jcrespo) I am thinking of killing and banning these connections right... 
[10:30:56] !log upgrading Zuul on gallium (CI interruption for a minute) [10:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:31:21] !log gallium: apt-get upgrade [10:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:33:56] jynus: would it be worth to backport and deploy https://gerrit.wikimedia.org/r/#/c/252267/ ? [10:34:23] would be really good to know that this helps or not (or how much) [10:34:43] I do not think that will work at all [10:34:53] hm :( [10:35:15] PROBLEM - Restbase endpoints health on restbase2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [10:35:39] known ^ silencing [10:35:45] the logic of creating those connections is broken in the first place, there is not a max [10:35:46] PROBLEM - Restbase root url on restbase2002 is CRITICAL: Connection refused [10:36:35] jynus: is it only s3 it is breaking? [10:36:57] mostly because it creates 1 or 2 connections per database, which is a WTF [10:37:11] that is 900 wikis there [10:37:38] * aude nods [10:38:03] ahh, s3 has 900, *was looking at https://noc.wikimedia.org/db.php#tabs-3 which afaik only showed around 18! [10:38:44] css is broken there [10:38:47] that is s2 [10:38:49] ahh! [10:39:02] Any wiki not hosted on the other clusters. [10:39:15] ahh okay, that list is super confusing then! :D [10:39:33] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1813321 (10ori) @daniel, could you please make sure to sync up with the Services... 
[10:39:35] page is lacking styles for some reason [10:41:28] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1813325 (10aude) from irc: ``` 05:33 < aude> jynus: would it be worth to backpor... [10:43:59] to give you an idea, we have now more activity on that server than on the whole english wikipedia [10:46:42] cannot repool db1044, will increase weight of the other servers [10:48:34] (03PS1) 10Jcrespo: Decreasing weight of db1035 due to connection exhaustion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253877 [10:48:59] (03CR) 10Jcrespo: [C: 032] Decreasing weight of db1035 due to connection exhaustion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253877 (owner: 10Jcrespo) [10:49:18] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1813338 (10ori) >>! In T118162#1795369, @jcrespo wrote: > * The cron jobs run as... [10:49:55] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Decreasing weight of db1035 due to connection exhaustion (duration: 00m 19s) [10:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:51:53] I do not see connections rejected, but I do not know the impact on uncached page view delay (can it be filtered per wiki/shard?) 
[10:52:15] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1813354 (10ori) [10:52:27] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 3 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1793064 (10ori) [10:55:02] jynus: (can it be filtered per wiki/shard?) -- no, not at the moment. [10:55:07] + on the queing [10:55:10] +1 [10:55:16] ori, it is ok [10:55:31] my patch should mitigate any effect [10:56:10] and now I was under maintenance, so we are with -30% capacity than normal [10:56:27] http://graphite.wikimedia.org/render/?width=942&height=556&_salt=1447844163.593&from=-1days&target=frontend.navtiming.waiting.desktop.authenticated.median for uncached resp time RUM data [10:57:27] * hashar sends ori to sleep [10:57:30] to be fair, the only large wikis there are wikivoyage and some h*wiki, so it may be disolved among the other servers [10:57:33] yes, go to sleep [10:57:49] we will need to setup a cron to auto ban folks that works too late [10:57:50] :D [10:57:58] (myself included) [10:58:06] bye! [10:59:34] I think we are now in "situation is not solved, but controlled" [11:07:55] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [11:11:48] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 3 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1813392 (10daniel) @ori We do not have agreement on how to fix this. Dispatching... 
[11:12:18] 6operations, 10Salt: slow salt-call invocation on minions - https://phabricator.wikimedia.org/T118380#1813393 (10fgiunchedi) proposed ad-hoc fix in trebuchet instead, https://github.com/trebuchet-deploy/trebuchet/pull/17 [11:23:45] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 3 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1813400 (10jcrespo) > Now Jynus sais that he doesn't believe that fix is going to... [11:28:04] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1813406 (10Joe) FTR. this just happened to me with a newly-created instance with jessie; to my knowledge no prior machine with that name existed and puppet is failing even after a reboot: ``... [11:28:14] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1813407 (10Joe) p:5Normal>3High [11:29:54] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [11:32:29] expected ^ [11:44:14] !log Set OSPF/OSPFv3 metric 320 (instead of 360) on new Zayo link between cr2-eqiad:xe-5/2/3 and cr2-codfw:xe-5/0/1 [11:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:54:20] (03PS2) 10Muehlenhoff: Add ferm rules for role::mariadb::misc::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/240042 (https://phabricator.wikimedia.org/T104699) [11:56:23] !log pooling new zuul-merger on scandium [11:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:57:07] !log disabled puppet on gallium [11:57:09] (03CR) 10Jcrespo: "I have to check if there is activity outside of the internal network." 
[puppet] - 10https://gerrit.wikimedia.org/r/240042 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [11:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:00:42] !log reenabled puppet on gallium. Havent pooled zuul-merger yet [12:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:01:55] (03PS1) 10Faidon Liambotis: Fix misplace $ORIGIN for neighbor subnet PTR [dns] - 10https://gerrit.wikimedia.org/r/253892 [12:04:05] (03CR) 10Faidon Liambotis: [C: 032] Fix misplace $ORIGIN for neighbor subnet PTR [dns] - 10https://gerrit.wikimedia.org/r/253892 (owner: 10Faidon Liambotis) [12:05:03] 6operations, 7Database: mysql permission request: racktables from krypton - https://phabricator.wikimedia.org/T118816#1813442 (10jcrespo) DB access is already there- I have already tested. You can close this ticket. I have pending refactoring the grants, so the ticket should be changed to: make sure only 10.64... [12:06:25] (03PS1) 10Hashar: contint: move iptables rule for zuul-merger git daemon [puppet] - 10https://gerrit.wikimedia.org/r/253893 (https://phabricator.wikimedia.org/T95046) [12:09:51] (03PS2) 10Muehlenhoff: WIP: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 [12:09:58] (03CR) 10Hashar: [C: 031 V: 031] "Puppet compilation https://puppet-compiler.wmflabs.org/1311/" [puppet] - 10https://gerrit.wikimedia.org/r/253893 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [12:45:35] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [12:57:16] 6operations, 10Gitblit: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1813511 (10saper) Looks like my patch https://github.com/gitblit/gitblit/pull/950 got merged into the Gitblit repo! 
[12:59:54] 6operations, 10Gitblit, 7Upstream: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1813513 (10saper) [13:03:50] 6operations, 10Gitblit, 7Upstream: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1813516 (10hashar) 5Open>3declined a:3hashar gitblit is being phased out in favor of Phabricator Diffu... [13:07:33] (03CR) 10Hashar: "We can land this one. The zuul-merger on scandium can be stopped manually and puppet is instructed to not start it automatically. That l" [puppet] - 10https://gerrit.wikimedia.org/r/252337 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [13:13:20] (03CR) 10Hashar: [C: 031] "Puppet compile https://puppet-compiler.wmflabs.org/1312/" [puppet] - 10https://gerrit.wikimedia.org/r/252337 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [13:13:29] (03PS1) 10BBlack: disable BGP on lvs100[123] for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/253894 [13:14:07] _joe_: ^ can you double-check that when you get a sec? puppet is disabled on those 3 hosts already [13:14:22] I can use a couple puppet patches to land for contint. 
To enable connection from a new server to gallium/zuul : an iptable rule https://gerrit.wikimedia.org/r/#/c/252337/ and some basic refactoring https://gerrit.wikimedia.org/r/#/c/253893/ :-) [13:20:01] (03PS1) 10Mobrovac: RESTBase: Labs config: move MobileApps back-end spec [puppet] - 10https://gerrit.wikimedia.org/r/253895 [13:21:29] (03PS2) 10Mobrovac: RESTBase: Config: Move MobileApps back-end spec out of config [puppet] - 10https://gerrit.wikimedia.org/r/253895 [13:22:12] (03PS3) 10Mobrovac: RESTBase: Config: Move MobileApps back-end spec out of config [puppet] - 10https://gerrit.wikimedia.org/r/253895 (https://phabricator.wikimedia.org/T102130) [13:22:26] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: puppet fail [13:27:08] (03CR) 10Zfilipin: [C: 04-1] "Needs manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/253351 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:31:10] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/253351 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:43:40] (03PS2) 10Paladox: Update gitblit.properties file with new configs [puppet] - 10https://gerrit.wikimedia.org/r/251836 [13:48:30] (03CR) 10Hashar: "Don't bother with gitblit anymore. It is being removed in favor of Phabricator / Diffusion." [puppet] - 10https://gerrit.wikimedia.org/r/251836 (owner: 10Paladox) [13:49:07] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1813560 (10daniel) Regarding : this should... 
[13:49:36] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:53:10] (03PS3) 10Muehlenhoff: WIP: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 [13:55:37] (03CR) 10Hashar: [C: 04-1] "not meant to be merged" [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) (owner: 10Hashar) [13:55:42] (03CR) 10BBlack: [C: 032] disable BGP on lvs100[123] for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/253894 (owner: 10BBlack) [13:55:44] !log disabling pybal on lvs100[123] - fallback to lvs100[456] [13:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:59] (03Abandoned) 10Hashar: (WIP) admin: support members aliasing (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/234554 (owner: 10Hashar) [13:56:29] (03PS4) 10Hashar: beta: parsoid now uses modules defined in source [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) [13:56:47] (03CR) 10Hashar: "Basic rebase. Already cherry picked on beta cluster puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [13:57:05] (03CR) 10Hashar: [C: 04-1] "Not meant to be merged" [puppet] - 10https://gerrit.wikimedia.org/r/250002 (https://phabricator.wikimedia.org/T117207) (owner: 10Hashar) [13:59:06] (03CR) 10Hashar: [C: 04-1] Update gitblit.properties file with new configs [puppet] - 10https://gerrit.wikimedia.org/r/251836 (owner: 10Paladox) [14:00:44] did someone kill phabricator? 
[14:01:05] some outage going on [14:01:06] PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: Connection refused [14:01:18] oh noes [14:01:25] PROBLEM - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:01:34] PROBLEM - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:01:44] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused [14:01:45] uh oh, bblack ^ ? [14:01:50] <_joe_> what's up? [14:01:54] hmmm [14:01:54] PROBLEM - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: Connection refused [14:01:58] I'll start pybal back [14:01:59] or maybe just misc varnish [14:02:15] PROBLEM - pybal on lvs1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:02:15] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: Connection refused [14:02:25] ahh, that might explain why WMDE can't access phabricator! [14:02:27] phab down :( [14:02:28] aude: congrats human monitoring probe #00213 :-} [14:02:34] :D [14:02:50] no, not just misc apparently [14:02:59] back now? ;D [14:03:05] <_joe_> bblack: what's up? 
[14:03:06] should be [14:03:08] it is back [14:03:14] yep, phabricator is now loading for us again [14:03:16] seems back yeah [14:03:18] _joe_: I stopped pybal on lvs100[123], with [456] taking over [14:03:24] seems something didn't work right :/ [14:03:26] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 560 bytes in 0.029 second response time [14:03:44] :( [14:03:46] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [14:03:54] RECOVERY - pybal on lvs1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [14:03:55] things are up for me now and I can't demonstrate the outage atm [14:04:02] same here [14:04:04] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10126 bytes in 0.096 second response time [14:04:12] I started them back up on [123] [14:04:14] RECOVERY - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 371 bytes in 0.023 second response time [14:04:33] so... 123 down but 456 did not take over ? that's a first [14:04:34] RECOVERY - pybal on lvs1003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [14:04:35] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.003 second response time [14:04:48] connections were showing up in ipvsadm on [456] though [14:04:55] so something's unusual.... [14:04:56] <_joe_> akosiaris: 456 have been upgraded to jessie recently [14:05:00] hmmm [14:05:10] yeah but all the other DCs are running on jessie fine, too [14:05:20] this is slightly different hardware, but still [14:05:43] <_joe_> bblack: uhm [14:05:47] what? [14:06:16] <_joe_> no I was saying if it's not jessie-specific, it's even stranger [14:06:32] ☕ [14:06:35] yeah I'm going to have to dig around on what exactly failed and what didn't and why [14:06:43] so we didn't lose the route, right? 
[14:07:29] traffic did end up on 456, right? [14:07:41] <_joe_> seems so from ipvsadm [14:07:48] paravoid: some did for sure [14:07:51] ipvsadm isn't a good measure [14:07:53] if text-lb.eqiad.wikimedia.org where really down, wikipedia would be down, right? but only phab was down for us. [14:08:00] jzerebecki: esams [14:08:02] but looking now at "show route" on the routers, cr1 doesn't show routes to [456], only cr2 does [14:08:08] (backup routes not being selected right now, I mean) [14:08:08] <_joe_> bblack: ok this is strange [14:08:30] jzerebecki: (more verbose answer) no, because you'd go to text-lb.esams. from Germany and that would continue working fine [14:08:34] I think for some reason, cr1 can't route traffic to [456], only cr2 can [14:08:40] misconfig somewhere? [14:08:45] no, that's on purpose [14:09:00] primary LVSes have BGP with cr1, backup LVSes with cr2 [14:09:03] yea I didn't look at the dc... [14:09:11] paravoid: I know, but they also get routes from each other [14:09:21] cr1 says: [14:09:22] 208.80.154.224/32 *[BGP/170] 00:05:02, localpref 100 [14:09:22] AS path: 64600 I [14:09:22] > to 208.80.154.55 via ae1.1001 [14:09:22] [Static/200] 14w3d 23:18:27 [14:09:24] > to 208.80.154.55 via ae1.1001 [14:09:28] cr2 says: [14:09:30] 208.80.154.224/32 *[BGP/170] 00:05:18, localpref 100, from 208.80.154.196 [14:09:30] <_joe_> bblack: looking at the ipvsadm output on lvs1001 and on lvs1004 [14:09:33] AS path: 64600 I [14:09:35] <_joe_> they are pretty different [14:09:36] > to 208.80.154.55 via ae1.1001 [14:09:38] [BGP/170] 1d 23:36:47, MED 10, localpref 100 [14:09:41] AS path: 64600 I [14:09:43] > to 208.80.154.137 via ae2.1002 [14:09:46] [Static/200] 14w3d 23:18:40 [14:09:48] > to 208.80.154.55 via ae1.1001 [14:09:50] bblack: that's normal [14:10:05] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: puppet fail [14:10:27] paravoid: so what happens when 208.80.154.55 goes away on cr1? 
[14:10:28] cr2 has a best path via lvs1001 due to MED, and so it would not send its path to cr2 [14:10:38] er [14:10:40] cr2 has a best path via lvs1001 due to MED, and so it would not send its path to cr1 [14:11:21] cr1 doesn't know about lvs1004 route for that IP at all... ? [14:11:25] <_joe_> bblack: take a look at ipvsadm -L on lvs1001 and on lvs1004, they look very, very different [14:11:31] _joe_: it's ok [14:11:38] <_joe_> it's ok? [14:11:46] that's not the issue, it's just that pybal never clears the ipvs entries for things that are removed [14:11:59] <_joe_> oh, uhm [14:12:07] lvs1001 has been up for nearly 300 days, services have been pulled from it since then [14:12:18] that's why the discrepancy ones have zero connections [14:12:30] <_joe_> ok [14:12:36] <_joe_> not all [14:12:40] so ok [14:12:49] (03PS4) 10Mobrovac: RESTBase: Config: Move MobileApps back-end spec out of config [puppet] - 10https://gerrit.wikimedia.org/r/253895 (https://phabricator.wikimedia.org/T102130) [14:12:55] <_joe_> text-lb.eqiad.wikimedia.org: wrr which is not present in lvs1004 *does* have active connections [14:12:56] let's establish whether route selection worked as it should and traffic went to lvs100[456] [14:12:59] I think it did [14:13:02] but let's confirm that [14:13:21] then let's see what happened at the IPVS layer, whether lvs100[456] forwarded the traffic to realservers [14:13:28] _joe_: oh, you're right [14:13:38] there are real missing services, too [14:13:43] the requests are normal, but the counters are at 0: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes [14:13:50] <_joe_> bblack: and it's a big issue, I'd say [14:13:55] ok so, that makes sense [14:14:01] we're missing port 80 for text-lb for sure [14:14:07] that's a problem :P [14:14:10] haha [14:14:17] <_joe_> :) [14:14:25] aren't we using 443 nowadays ? 
:D [14:14:25] <_joe_> that's what I've been trying to say [14:14:42] _joe_: sorry, there's others that are what I'm talking about, I've gotten used to it heh [14:15:02] and I don't think I've ever seen pybal just fail to set up a configured service, either :/ [14:15:15] <_joe_> Nov 17 11:01:12 lvs1004 pybal[1911]: Memory allocation problem [14:15:21] <_joe_> this is new [14:15:25] nice [14:15:38] <_joe_> Nov 17 11:01:12 lvs1004 pybal[1911]: [pybal] ERROR: ipvsadm exited with status 255 when executing cmdlis [14:15:41] <_joe_> Nov 17 11:01:12 lvs1004 pybal[1911]: [pybal] ERROR: ipvsadm stderr output: Memory allocation problem [14:15:42] I haven't touched [456] since I turned [123] back on [14:15:44] <_joe_> WAAAT [14:15:50] those should probably be fatal for pybal [14:15:56] hehe [14:15:58] yes [14:16:12] <_joe_> bblack: you should feel lucky that now we at least log them :P [14:16:15] (haven't touched - as in so we can stare at it more) [14:16:40] <_joe_> I'll look at the other hosts [14:17:23] pybal's mem usage doesn't look extreme [14:17:25] 'Memory allocation problem' is a somewhat generic error it will give when referencing non-existent pools and maybe other cases I believe [14:17:37] i've seen it before too [14:17:38] yeah this is an old bug [14:17:39] <_joe_> journalctl -u pybal.service -n3000 -l | grep ERROR: tells a sad story [14:17:42] I've seen it multiple times too [14:17:47] it may have been a race on the host rebooting? [14:17:55] we have known issues with races on ethernet interfaces coming up [14:17:57] it shows that a few dozen times when booting [14:18:33] or used to, haven't booted a jessie lvs recently :) [14:18:49] only 1004 though, not 5 or 6 [14:18:55] <_joe_> yeah [14:19:17] I checked all the others in salt too, it's only 1004 [14:19:35] <_joe_> so this problem was local to 1004 [14:20:28] well the only thing that doesn't make sense is misc-web alerts + phab reported down [14:20:35] that should be on 1002/1005 I think? 
[14:20:59] <_joe_> yes, let me check [14:21:20] I only see port 80 for it on 1005 [14:21:22] not 443 [14:21:48] which matches up with icinga's: LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org [14:21:53] <_joe_> yes [14:22:01] <_joe_> I was about to say the same [14:22:06] <_joe_> let me check pybal's logs [14:22:12] oh [14:22:20] lvs1005 has spam from rcstream being in constant failure mode [14:22:27] it pushed the log entries back further in history in terms of lines [14:22:46] <_joe_> Nov 16 14:30:34 lvs1005 pybal[1943]: [pybal] INFO: Created LVS service 'misc_weblb_443' [14:22:49] <_joe_> Nov 16 14:30:34 lvs1005 pybal[1943]: [pybal] INFO: Created LVS service 'misc_weblb6_443' [14:22:52] Nov 16 14:30:34 lvs1005 pybal[1943]: [pybal] ERROR: ipvsadm exited with status 255 when executing cmdlist ['-a -t 208.80.154.241:443 -r 10.64.0.107 -w 10\n', '-a -t 208.80.154.241:443 -r 10.64.0.106 -w 10\n', '-a -t 208.80.154.241:443 -r 10.64.32.133 -w 10\n', '-a -t 208.80.154.241:443 -r 10.64.32.134 -w 10\n'] [14:22:57] Nov 16 14:30:34 lvs1005 pybal[1943]: [pybal] ERROR: ipvsadm stderr output: [14:23:00] Nov 16 14:30:34 lvs1005 pybal[1943]: [pybal] ERROR: ipvsadm exited with status 255 when executing cmdlist ['-a -t 10.2.2.31:8000 -r 10.64.32.151 -w 10\n', '-a -t 10.2.2.31:8000 -r 10.64.48.42 -w 10\n'] [14:23:04] Nov 16 14:30:34 lvs1005 pybal[1943]: [pybal] ERROR: ipvsadm stderr output: [14:23:07] heh [14:23:45] so, startup races with pybal on system boot? [14:24:04] seems likely it tried to add the real server to a nonexistent pool [14:24:04] although it hasn't been an issue at the other 3 DCs afaik (but different hardware) [14:24:27] chasemp: that's possible too [14:25:52] so is pybal doing ipvsadm updates async now? 
[14:25:58] because that's a change I submitted a while ago but didn't dare merging [14:25:58] <_joe_> no [14:26:01] perhaps someone did [14:26:11] <_joe_> not AFAIR [14:27:04] so, restart pybal on 1005 or 1004 and see if it works fine when it's not at system boot time? I kinda assume it will [14:27:27] <_joe_> mark: yes, I think it was merged, I remember fixing it for porting it to jessie [14:27:36] that's probably it then [14:27:38] <_joe_> IPVSProcessProtocol [14:27:54] ok, so maybe that introduced a new race, and we finally hit it [14:27:59] yes [14:28:02] very likely [14:28:04] <_joe_> mark: let me check [14:30:10] <_joe_> we can roll that back if we need to, or fix that race [14:30:26] <_joe_> https://gerrit.wikimedia.org/r/#/c/172215/ [14:30:33] i'd roll that back tbh, it might bite us a bunch more times if noone tested it properly [14:30:58] <_joe_> mark: never happened in my tests, but I usually add a couple of pools only [14:31:25] <_joe_> mark: ok, I'm working on rolling that back now [14:31:30] thanks [14:31:37] <_joe_> bblack: I'll prepare a new package shortly [14:31:45] ok [14:31:58] <_joe_> mark: actually, we can leave there the good bits and just roll back the change in IPVSManager [14:32:37] what's the point of keeping the unused code? [14:32:51] 6operations, 10ops-eqiad, 10netops: cr1-eqiad PEM 0 fan failed - https://phabricator.wikimedia.org/T118721#1813619 (10Cmjohnson) Dear Juniper Networks Customer, Thank you for contacting the Juniper Networks Global Support. We have opened Service Request number 2015-1118-0268 to track this problem. IF YOU H... 
[14:33:04] <_joe_> uhm there is more than that to fix btw, ugh [14:37:15] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:37:34] PROBLEM - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is CRITICAL: Connection refused [14:38:10] anyways, digging around in varnish graphs, in practice the fallout was pretty minimal [14:38:25] there was a traffic dropoff for misc-web-lb for a few minutes (phab, etc) due to it losing port 443 [14:38:49] but the others' request rates seem normal pattern throughout, probably due to only losing port 80 and/or IPv6 [14:39:53] we lost port 80 for ocg and text-lb, port 443 for misc (all IPv4), and then also port 443 on IPv6 for mobile [14:40:13] and the outage duration is ~4-5 mins [14:40:25] will write up timeline / incident rep, etc in a few mins [14:41:48] godog: https://gerrit.wikimedia.org/r/#/c/253895/ [14:41:54] godog: i'll deploy mobileapps now [14:42:15] bblack: and only eqiad [14:43:08] (03PS4) 10Andrew Bogott: contint: pool in zuul-merger on scandium [puppet] - 10https://gerrit.wikimedia.org/r/252337 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [14:43:32] godog: would it be a problem if dropped some keyspaces from cassandra vis-a-vis rb2002 bootstrapping ? [14:44:15] paravoid: yeah :) [14:44:37] well I was limiting my graphs to eqiad anyways, but still, no graph-notable traffic dropoff other than misc-web [14:44:41] mobrovac: I don't think so, no, also what about the 'version: 1.0.0' addition in module parsoid in that config change? 
[14:45:21] godog: that version stanza is useless [14:45:54] can remove it [14:46:07] if it is useless sure [14:46:20] (03CR) 10Andrew Bogott: [C: 032] contint: pool in zuul-merger on scandium [puppet] - 10https://gerrit.wikimedia.org/r/252337 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [14:46:27] kk [14:47:33] (03CR) 10Glaisher: [C: 04-1] "Looks like it has started again so this might have to be kept as is for now.." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252627 (owner: 10Glaisher) [14:47:39] !log mobileapps deploying 151c312 [14:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:06] (03PS2) 10Andrew Bogott: contint: move iptables rule for zuul-merger git daemon [puppet] - 10https://gerrit.wikimedia.org/r/253893 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [14:48:16] NOTE: RESTBase alerts will start going off in a couple of minutes, please ignore [14:49:58] (03PS5) 10Mobrovac: RESTBase: Config: Move MobileApps back-end spec out of config [puppet] - 10https://gerrit.wikimedia.org/r/253895 (https://phabricator.wikimedia.org/T102130) [14:50:39] (03CR) 10Andrew Bogott: [C: 032] contint: move iptables rule for zuul-merger git daemon [puppet] - 10https://gerrit.wikimedia.org/r/253893 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [14:50:52] (03PS1) 10Giuseppe Lavagetto: ipvs: switch back to blocking execution [debs/pybal] - 10https://gerrit.wikimedia.org/r/253902 [14:51:07] <_joe_> mark, bblack ^^ [14:51:21] mobrovac: kk, good to merge? [14:51:54] _joe_: k. 
let's go [14:52:37] ups, meant godog ^^ [14:52:38] :P [14:53:03] andrewbogott: thanks :) [14:53:21] <_joe_> now I see why we never have patches for puppetswat [14:53:36] (03PS6) 10Filippo Giunchedi: RESTBase: Config: Move MobileApps back-end spec out of config [puppet] - 10https://gerrit.wikimedia.org/r/253895 (https://phabricator.wikimedia.org/T102130) (owner: 10Mobrovac) [14:53:42] ops team is too effective at doing reviews ? :) [14:53:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase: Config: Move MobileApps back-end spec out of config [puppet] - 10https://gerrit.wikimedia.org/r/253895 (https://phabricator.wikimedia.org/T102130) (owner: 10Mobrovac) [14:54:04] mobrovac: {{done}} [14:54:14] godog: mind restarting mobileapps on scb100x please? [14:54:44] godog: and forcing a puppet run in staging? [14:56:49] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1813655 (10hashar) The two last puppet patches has let the zuul-merger on scandium to reach out gallium AND let the slaves git clone from scandium git... [14:57:23] !log bounce mobileapps on scb1001/scb1002 [14:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:42] mobrovac: I think puppet run on staging would be self service in your case? you are root there [14:58:47] !log Pooled zuul-merger instance on scandium.eqiad.wmnet . In case it screw up one has to stop the zuul-merger service on scandium. [14:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:10] _joe_: i vaguely recall having done more with that after that commit [14:59:18] ...so let's check if that didn't also end up in pybal [14:59:26] godog: yeah but you can do via salt, whereas i have to use poor-man's for ssh loop :P [14:59:32] godog: but kk, i'll go for it [14:59:48] <_joe_> mark: what are you referring to? 
[14:59:57] after I created that commit you're now reverting [15:00:06] i vaguely recall doing more with the Deferred that now got added, elsewhere [15:00:10] * mark checks [15:00:52] (03PS1) 10Glaisher: Set $wgCategoryCollation to 'uca-sr' at srwiki and srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253903 (https://phabricator.wikimedia.org/T115806) [15:01:03] <_joe_> your last patch was using decorators for static and classmethods [15:02:04] <_joe_> and that is fairly safe :) [15:02:31] yeah weird [15:02:37] i don't even see that change in the first place [15:02:50] <_joe_> you have a CR with "DO NOT EVER TRY TO MERGE" or something by you :P [15:02:59] that's a different one [15:03:35] * mark dedusts his local git repo [15:04:45] Author: Mark Bergsma [15:04:45] Date: Wed Jan 28 17:08:28 2015 -0800 [15:04:45] First stab at pool/depool feedback [15:04:49] i never pushed that [15:05:01] <_joe_> no [15:05:28] and I think that was one way of trying to fix that race we're now seeing [15:05:42] so, I'll +1 your change [15:06:43] (03CR) 10Mark Bergsma: [C: 031] "Yes, this change was never intended to get merged without further changes and careful testing." 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/253902 (owner: 10Giuseppe Lavagetto) [15:07:00] Wed Jan 28 [15:07:06] this unpushed change, i wrote that on the plane back from SF [15:07:12] along with the other one which says DO NOT MERGE [15:07:26] <_joe_> mark: that change was directly pushed by you to master :P [15:07:33] _joe_: apparently :( [15:07:46] <_joe_> I remember we noticed that and we did some testing like 1 yr ago [15:07:50] <_joe_> not enough apparently [15:08:16] I also recall hashar carefully saying that my code sucked, at the airport in paris ;p [15:08:26] <_joe_> ahahah [15:08:47] (03CR) 10Giuseppe Lavagetto: [C: 032] ipvs: switch back to blocking execution [debs/pybal] - 10https://gerrit.wikimedia.org/r/253902 (owner: 10Giuseppe Lavagetto) [15:08:49] (not that I needed him to realize that - that's why I said i wanted to make a new state machine, for the no-gravity stuff) [15:09:40] (03Merged) 10jenkins-bot: ipvs: switch back to blocking execution [debs/pybal] - 10https://gerrit.wikimedia.org/r/253902 (owner: 10Giuseppe Lavagetto) [15:10:08] <_joe_> bblack: building a new package, let's see how it behaves [15:11:10] _joe_: ok, this is 12.1 + un-async of ipvsadm? [15:11:14] (03PS1) 10Jcrespo: [WIP] New generated key for jcrespo (jynus) [puppet] - 10https://gerrit.wikimedia.org/r/253905 [15:11:34] going to make a task btw, so I can link it into the report [15:11:37] <_joe_> bblack: yes [15:12:01] <_joe_> yeah sorry I tried to be quick as we're in a baaad shape it seems [15:12:18] <_joe_> someone should look at the rcstream issue in the meantime [15:13:36] (03CR) 10Jcrespo: [C: 04-1] "Waiting for testing/consensus." [puppet] - 10https://gerrit.wikimedia.org/r/253905 (owner: 10Jcrespo) [15:14:21] !log stopping zuul-merger on scandium. 
Lacks ssh private key to reach gerrit [15:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:36] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1813684 (10hashar) More unpuppetized / badly pauperized stuff: ``` stderr: 'Cloning into '/srv/ssd/zuul/git/mediawiki/core'... Warning: Identity file... [15:17:32] (03PS1) 10Jgreen: add jgreen yubikey ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/253907 [15:18:04] PROBLEM - zuul_merger_service_running on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [15:21:07] _joe_: rcstream has been in that state for months, it's just kinda been ignored [15:21:23] their port 443 doesn't work right on the service, it serves un-encrypted HTTP on port 443 for some reason [15:21:55] <_joe_> :/ ok let's fix it then later [15:22:34] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [100000000.0] [15:22:34] (03PS1) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [15:23:15] 6operations, 10Traffic, 7Pybal: Pybal 1.12 has issues with executing ipvsadm commands - https://phabricator.wikimedia.org/T118948#1813688 (10BBlack) 3NEW [15:23:16] <_joe_> !log uploaded pybal 1.13 to jessie-wikimedia [15:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:34] <_joe_> bblack: I'll upgrade lvs1004 if it's ok with you [15:23:50] <_joe_> bblack: first, I'd issue an ipvsadm -C to clean up everything there [15:24:27] (03CR) 10jenkins-bot: [V: 04-1] Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 (owner: 10Rush) [15:24:33] <_joe_> if you prefer to do it yourself, please go on [15:26:04] 
https://wikitech.wikimedia.org/wiki/Incident_documentation/20151118-LVS-PyBal [15:26:20] greg-g: ^ [15:27:30] _joe_: yeah I can [15:28:07] * _joe_ whistles "canary in a coldmine" looking at lvs1004 [15:28:29] (03PS1) 10Ottomata: Use new topic interpolation format for eventlogging in labs [puppet] - 10https://gerrit.wikimedia.org/r/253910 [15:28:43] !log upgrading pybal to 1.13 on lvs1004, with cleared tables [15:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:08] looks saner :) [15:29:20] <_joe_> yeah it does [15:29:22] !log restbase canary deploy to rb1001 of 6b5a602 [15:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:34] well wait [15:29:58] the logs on 1004 still show a "Memory allocation problem" before each service creation [15:30:13] Nov 18 15:28:35 lvs1004 pybal[4172]: Memory allocation problem [15:30:14] Nov 18 15:28:35 lvs1004 pybal[4172]: [pybal] INFO: Created LVS service 'mobilelb_80' [15:30:17] etc [15:31:31] (03CR) 10Ottomata: [C: 032] Use new topic interpolation format for eventlogging in labs [puppet] - 10https://gerrit.wikimedia.org/r/253910 (owner: 10Ottomata) [15:33:45] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused [15:34:26] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:34:35] mobrovac: ^? 
[15:35:03] <_joe_> bblack: I guess they're deploying [15:35:23] yeah, mobrovac is deploying [15:35:24] bblack: known [15:35:38] (03PS14) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:35:55] bblack: (15:48:16) mobrovac: NOTE: RESTBase alerts will start going off in a couple of minutes, please ignore [15:36:03] (03PS2) 10Jgreen: add jgreen yubikey ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/253907 [15:36:08] :P [15:36:18] your timezone is awful :) [15:36:55] hahaha [15:37:08] <_joe_> bblack: btw https://bugs.launchpad.net/ubuntu/+source/ipvsadm/+bug/1028585 [15:37:18] !log warming up srwiki.page and srwiki.categorylinks on s3 slaves [15:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:24] (03CR) 10Jgreen: [C: 032 V: 031] add jgreen yubikey ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/253907 (owner: 10Jgreen) [15:38:00] (03CR) 10Jgreen: [V: 032] add jgreen yubikey ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/253907 (owner: 10Jgreen) [15:39:43] "Apparently whenever a syntax or semantic error occurs ipvsadm generates an memory allocation error." [15:39:46] nice :) [15:40:05] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [15:40:27] does pybal issue deletes before creates blindly? 
could've been the deletes of non-existent services I cleared with ipvsadm -C [15:41:17] (03PS15) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:41:37] <_joe_> bblack: I dunno [15:42:06] <_joe_> I mean, it does see what is changed in its conf and then just issues deletes/creates [15:42:15] def createService(self): [15:42:15] """Initializes this LVS instance in LVS.""" [15:42:16] # Remove a previous service and add the new one [15:42:16] cmdList = [self.ipvsManager.commandRemoveService(self.service()), [15:42:18] self.ipvsManager.commandAddService(self.service())] [15:42:21] ^ yup, I bet that's it [15:42:52] so the "memory allocation problem" before each service create is ipvsadm saying that due to delete of non-existent I think. We can confirm that manually [15:43:36] root@lvs1004:~# ipvsadm -D -u 1.2.3.4:9999 [15:43:36] Memory allocation problem [15:44:03] <_joe_> bblack: yeah, gotcha [15:44:18] <_joe_> bblack: I was triple checking our syntax, but that is probably it [15:44:31] (03PS16) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:44:34] <_joe_> (yes, that's the generic error that ipvs sends out) [15:44:46] not a huge deal in the big picture [15:44:51] <_joe_> nope [15:45:17] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.060 second response time [15:45:24] <_joe_> but yeah we must change how we interact with ipvs [15:45:33] <_joe_> "I have plans" [15:46:05] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [15:46:08] (03PS17) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:48:23] <_joe_> bbiab [15:48:30] (03PS2) 10Rush: Remove old "openstack on 
labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [15:48:51] _joe_: ok I'm gonna go through the same stop->clear->upgrade on lvs100[56] now, and then do the backups at the other DCs, and keep going from there (get the remote DCs fully updated today, and then start back on trying to fail off and reinstall lvs100[123]) [15:49:40] (03PS3) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [15:50:05] !log upgrading pybal to 1.13 on lvs100[56] (with service clean) [15:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:13] (03CR) 10jenkins-bot: [V: 04-1] Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 (owner: 10Rush) [15:51:36] 6operations, 6Labs: labs precise instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#1813785 (10Andrew) It looks like you deleted that instance... is that right? If so, can you see if you're able to repeat the issue and ping me with the failed instance? [15:53:00] bblack: ty [15:54:21] (03PS4) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [15:55:22] (03PS5) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [15:56:23] !log restbase start deploy of 6b5a602 [15:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:56:38] _joe_: we have another bug in 1.13 (new since 1.12 I think): Nov 18 15:52:42 lvs1006 pybal[2608]: exceptions.AttributeError: class Alerts has no attribute 'add' [15:56:44] (03CR) 10jenkins-bot: [V: 04-1] Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 (owner: 10Rush) [15:57:55] <_joe_> bblack: sigh, fixing [15:58:08] <_joe_> wth I'm sure i fixed it [15:58:31] looks like s/add/addAlert/ ? 
[15:58:37] <_joe_> yeah [15:58:51] <_joe_> I am sure I f*cking fixed that [15:59:16] mobrovac: I see new mobileapps CFs being created in graphite, are those brand new or be offset by the tables/cf you will remove? [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151118T1600). Please do the needful. [16:00:04] Glaisher: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:35] godog: yup, 2 new CFs per group, they will replace the 4 or 5 per group currently existing [16:01:05] (03PS1) 10Giuseppe Lavagetto: pybal: s/add/addAlert/ [debs/pybal] - 10https://gerrit.wikimedia.org/r/253915 [16:01:08] here [16:01:32] godog: i'm doing the full deploy round now, after that i'll start dropping the old mobileapps KS' [16:01:48] <_joe_> bblack: I'm building the package now [16:02:03] <_joe_> and sorry :/ this is a screwup [16:02:08] mobrovac: sweet! 
let me know when done, I'll remove the respective stats from graphite tomorrow [16:02:24] kk [16:02:30] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: s/add/addAlert/ [debs/pybal] - 10https://gerrit.wikimedia.org/r/253915 (owner: 10Giuseppe Lavagetto) [16:03:24] (03Merged) 10jenkins-bot: pybal: s/add/addAlert/ [debs/pybal] - 10https://gerrit.wikimedia.org/r/253915 (owner: 10Giuseppe Lavagetto) [16:04:06] RECOVERY - Restbase root url on restbase2002 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.113 second response time [16:04:25] RECOVERY - Restbase endpoints health on restbase2002 is OK: All endpoints are healthy [16:04:59] 6operations, 10Traffic, 10Wikimedia-Stream: rcstream service on port 443 is broken, spamming logs - https://phabricator.wikimedia.org/T118956#1813822 (10BBlack) 3NEW [16:05:02] !log restbase end deploy of 6b5a602 [16:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:26] Is no one around for SWAT? [16:05:52] Glaisher: I can SWAT, running a little late this morning. Just looking at patches now. [16:06:09] ok, thanks [16:06:18] (03PS1) 10BBlack: disable stream.wm.o:443, broken for a long time [puppet] - 10https://gerrit.wikimedia.org/r/253917 (https://phabricator.wikimedia.org/T118956) [16:06:22] You need to run updateCollation.php as well [16:06:30] Glaisher: did we ping jcrespo for running the maintenance script on srwiki? 
[16:06:35] yeah [16:06:35] yes [16:06:46] kk, just making sure :) [16:07:06] (03CR) 10Alexandros Kosiaris: [C: 032] trebuchet: make the role a module [puppet] - 10https://gerrit.wikimedia.org/r/253640 (owner: 10Alexandros Kosiaris) [16:07:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253903 (https://phabricator.wikimedia.org/T115806) (owner: 10Glaisher) [16:07:21] !log restbase cassandra dropping old mobileapps keyspaces [16:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:07:24] godog: ^^ [16:07:37] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'uca-sr' at srwiki and srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253903 (https://phabricator.wikimedia.org/T115806) (owner: 10Glaisher) [16:07:49] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/1308/ says a noop throughout a good portion of the cluster, the failures are puppet compiler stuff mis" [puppet] - 10https://gerrit.wikimedia.org/r/253640 (owner: 10Alexandros Kosiaris) [16:07:51] <_joe_> bblack: on which server did you see that error? 
[16:09:09] <_joe_> akosiaris: most of those errors are missing secrets in the labs/private repo [16:09:19] <_joe_> bblack: package updated btw [16:09:37] (03PS6) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [16:09:41] <_joe_> oh, 1006, i see [16:10:00] (03PS7) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [16:10:20] _joe_: ok upgrading 1004-6 again [16:11:13] !log upgrading pybal on lvs1004 (1.13.1) [16:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:47] (03CR) 10jenkins-bot: [V: 04-1] Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 (owner: 10Rush) [16:11:51] (03PS2) 10Alexandros Kosiaris: trebuchet: make the role a module [puppet] - 10https://gerrit.wikimedia.org/r/253640 [16:11:54] <_joe_> bblack: I'd say we wait a wee bit to see how pybal behaves on these servers before we upgrade the old one? [16:12:06] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Set $wgCategoryCollation to "uca-sr" at srwiki and srwiktionary [[gerrit:253903]] (duration: 00m 20s) [16:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:11] _joe_: yes, already addressed in https://gerrit.wikimedia.org/r/253918 [16:12:36] (03PS8) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [16:12:43] _joe_: well once things look even temporarily "good" on lvs100[456], I'm going to go ahead and upgrade the backup LVS only at codfw, ulsfo, then later esams [16:12:47] (03PS9) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [16:12:51] and then the primaries there a bit later, like before [16:12:57] Glaisher: jynus running updateCollation on srwiki first. 
[16:13:07] then after all that we'll get back around to trying to down/upgrade lvs100[123] again at the end of it all [16:13:27] thanks [16:13:29] thcipriani: maybe it's better to run on wiktionary because it will be quicker? [16:13:43] <_joe_> bblack: ok [16:13:47] Glaisher: ack, sure. [16:14:06] <_joe_> akosiaris: the compiler seems to behave reasonably well, right? [16:14:15] mobrovac: ack, thanks! [16:14:15] fixing 82,713 rows [16:14:38] (03CR) 10Mobrovac: [C: 031] restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [16:14:51] as a reminder, lag in db1044 is normal (it is under maintenance) [16:14:52] _joe_: yup [16:14:59] _joe_: if this was just normal updates I'd go even slower, but I think killing the async ipvsadm stuff >> risk reduction of stretching this over days [16:15:05] (03CR) 10Alexandros Kosiaris: [C: 032] trebuchet: make the role a module [puppet] - 10https://gerrit.wikimedia.org/r/253640 (owner: 10Alexandros Kosiaris) [16:15:32] <_joe_> yeah, I just want primaries in eqiad to be left alone for a bit [16:15:52] yeah they'll stay on precise and be last, after everything else has successfully gone to 1.13 and been stable a bit [16:16:04] (03PS13) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [16:16:09] <_joe_> also, never again let a year pass between upgrades [16:16:14] :) [16:16:15] !log updateCollation.php for srwiktionary complete 82713 rows processed [16:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:33] now starting srwiki [16:16:40] :) [16:16:46] <_joe_> bblack: the async stuff is in the latest pybal for precise too, I guess [16:17:14] _joe_: lvs100[123] are on precise 1.06 [16:17:26] <_joe_> heh! 
[16:17:35] <_joe_> like march 2014 or something [16:18:27] (03PS1) 10Jgreen: remove jgreen non-yubikey ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/253921 [16:19:12] (03PS1) 10BBlack: Revert "disable BGP on lvs100[123] for reinstall" [puppet] - 10https://gerrit.wikimedia.org/r/253922 [16:19:18] (03PS2) 10BBlack: Revert "disable BGP on lvs100[123] for reinstall" [puppet] - 10https://gerrit.wikimedia.org/r/253922 [16:19:18] would someone kindly review https://gerrit.wikimedia.org/r/253921 for me? [16:19:36] (03CR) 10BBlack: [C: 032 V: 032] "So I can re-enable puppet on lvs100[123] until we re-schedule the reinstalls." [puppet] - 10https://gerrit.wikimedia.org/r/253922 (owner: 10BBlack) [16:19:48] <_joe_> bblack: :/ sorry [16:20:14] (03PS2) 10Jgreen: remove jgreen non-yubikey ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/253921 [16:20:21] eh it wasn't that big an outage [16:20:28] I've done worse! :) [16:20:28] (03CR) 10Chmarkine: [C: 031] planet: add HSTS headers [puppet] - 10https://gerrit.wikimedia.org/r/253758 (owner: 10Dzahn) [16:20:37] (03CR) 10Rush: [C: 031] "seems reasonable :)" [puppet] - 10https://gerrit.wikimedia.org/r/253921 (owner: 10Jgreen) [16:20:40] <_joe_> bblack: I'm updating 1006 in the meantime [16:21:06] !log upgrading pybal on lvs1004 (1.13.1) [16:21:11] !log upgrading pybal on lvs1005 (1.13.1) [16:21:12] heh [16:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:21:20] <_joe_> ahah ok I'll wait [16:21:20] (03CR) 10Jgreen: [C: 032 V: 031] remove jgreen non-yubikey ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/253921 (owner: 10Jgreen) [16:21:36] (03CR) 10Jgreen: [V: 032] remove jgreen non-yubikey ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/253921 (owner: 10Jgreen) [16:22:05] wow I only got one log message for that screwup [16:22:12] 6operations, 7HTTPS, 5Patch-For-Review: releases.wikimedia.org should be https only and have hsts set - 
https://phabricator.wikimedia.org/T118787#1813859 (10Chmarkine) [16:22:19] <_joe_> ? [16:22:38] <_joe_> bblack: what screwup? [16:23:15] <_joe_> can I upgrade 1006 now? [16:23:29] for double-logging above. I up-arrowed to my old 1004 msg to edit for 1005, but sent it too [16:23:35] <_joe_> !log upgrading pybal on lvs1006 (1.13.1) [16:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:42] !log upgrading pybal on lvs1006 (1.13.1) [16:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:50] heh I'm too slow with the msgs [16:23:59] <_joe_> lol [16:24:04] (03CR) 10Chmarkine: [C: 031] releases: enforce http->https redirect behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/253757 (https://phabricator.wikimedia.org/T118787) (owner: 10Dzahn) [16:24:05] <_joe_> I usually first log then act [16:24:08] <_joe_> :P [16:24:24] yeah well, I have a lot of windows open and I'm short on coffee :) [16:25:20] (03CR) 10Chmarkine: [C: 031] releases: enable strict transport security [puppet] - 10https://gerrit.wikimedia.org/r/253759 (https://phabricator.wikimedia.org/T118787) (owner: 10Dzahn) [16:25:57] _joe_: on 1006 new pybal logs, what's with the spam of IdleConnection fail->ok on start? [16:26:01] just too many sockets again or something? [16:26:41] 16:24:18 - 16:24:22 or so [16:26:47] <_joe_> bblack: uh just on starts and then recovering? 
[16:26:47] (03PS10) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [16:27:10] yeah it was just one big block of them shortly after startup [16:27:24] <_joe_> bblack I think so, we should check [16:27:26] (03PS11) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [16:28:24] <_joe_> however, all servers are up and ok now [16:28:31] <_joe_> let's keep an eye on it [16:28:31] yeah [16:28:53] well also I'm noting that the systemd unit file says nofiles should be "unlimited", yet it ends up at 64K [16:29:01] chasemp, was that the code that allowed us to test things in the openstack project? [16:29:04] <_joe_> if it's just a problem at startup because we do too many connections of something, it's ok if we solve it later [16:29:04] err infinity [16:29:25] <_joe_> as long as it doesn't happen repeatedly, let's just wait and see [16:29:27] Krenair: I'm not sure what you mean there, but there was code to run openstack in openstack on labs but it's been broken for a long time now [16:29:57] but then again under sysvinit, I had only raised it to 10240 and I think that solved it back then, at least for runtime [16:30:03] <_joe_> bblack: I guess it might have to do with the aggressive tcp keepalive we do on idleconnectionmonitor [16:30:07] surely it doesn't do 64K+ monitor conns [16:30:15] <_joe_> bblack: I hope not [16:30:19] chasemp, at some point we had a test wikitech set up in the openstack project in labs [16:30:35] I don't remember if you could actually make instances with it [16:31:54] (03PS1) 10Hashar: zuul: support for zuul-merger gerrit ssh key [puppet] - 10https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) [16:31:57] now we don't, so the only way to test changes to things like OpenStackManager is in actual production wikitech [16:32:34] I counted the entries from the instrumentation output on 1006, there's only 324 total backend 
servers. even if there are several checks each for different ports and idle+fetch, we're talking a thousand or two [16:32:58] maybe some of the trailing ones being closed overlap and count too, but still [16:35:16] (03CR) 10Hashar: [C: 04-1] "We would need to grab the content from gallium:/var/lib/zuul/.ssh/id_rsa and put it in our private repository under ssh/ci/jenkins-bot_ge" [puppet] - 10https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [16:36:03] Glaisher: about half way on srwiki, FYI [16:36:28] thanks for the update [16:36:37] no lag issues > 1 sec [16:38:13] <_joe_> bblack: I repeat, if I had to bet a penny, I'd say is the aggressive keepalive we do [16:38:32] _joe_: it does happen again, it did another burst on apaches and apis (but not other services, like before) [16:38:43] Nov 18 16:28:09 lvs1006 pybal[25075]: [api_80 IdleConnection] INFO: mw1128.eqiad.wmnet (enabled/partially up/not pooled): Connection lost. [16:38:46] Nov 18 16:28:09 lvs1006 pybal[25075]: [apaches_80] ERROR: Monitoring instance IdleConnection reports server mw1112.eqiad.wmnet (enabled/up/pooled) down: Connection to the other side was lost in a non-clean fashion. [16:38:50] <_joe_> uhm [16:38:51] etc [16:38:56] (03PS1) 10Alexandros Kosiaris: Introduce seaborgium and serpens [dns] - 10https://gerrit.wikimedia.org/r/253927 (https://phabricator.wikimedia.org/T118726) [16:39:12] maybe this is just "normal" with the other side timing out the idleconn? [16:39:32] it shouldn't be a failure if the other side closes. although I don't know what "non-clean" really means here [16:39:42] <_joe_> bblack: it happens in showers? always on the same servers? [16:39:53] I think so [16:39:55] <_joe_> bblack: that comes from the twisted socket implementation [16:40:16] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[16:40:45] ^ fixed, was ssh key [16:41:36] _joe_: I think it's just a fixed timeout for hhvm side closing the idleconn, and for some reason we're treating that as a failure when it's probably normal [16:41:48] 6operations, 10vm-requests, 5Patch-For-Review: VM request for OpenLDAP labs servers - https://phabricator.wikimedia.org/T118726#1813909 (10akosiaris) It turns that out due to the strict nature of our firewalling, labs instances can not contact hosts in production with private IPs. However that limitation is... [16:41:54] <_joe_> bblack: only happens on apaches? [16:42:00] apaches + apis [16:42:08] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce seaborgium and serpens [dns] - 10https://gerrit.wikimedia.org/r/253927 (https://phabricator.wikimedia.org/T118726) (owner: 10Alexandros Kosiaris) [16:42:14] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:42:19] <_joe_> bblack: so cases where pybal connects to apaches [16:42:27] <_joe_> as in the webserver software :) [16:42:31] it's about 4 minutes [16:42:36] <_joe_> bblack: I'll try to isolate the problem [16:43:02] <_joe_> bblack: I have pybal-test2001 for all my tests [16:43:26] <_joe_> you can use it too, my idea would be to set up a single apache in codfw as a backend and define just idleconnectionmonitor [16:43:37] <_joe_> and tcpdump all communications [16:43:38] (03PS12) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [16:43:49] (03PS13) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [16:43:55] <_joe_> bblack: that should give us an idea of what is happening [16:44:10] <_joe_> it could be that apache closes the connection in some unclean way? 
[16:44:27] <_joe_> meaning no RST no FIN handshake [16:44:31] (03PS1) 10Ottomata: Remove reference to eventlogging/EventLogging for deployment [puppet] - 10https://gerrit.wikimedia.org/r/253929 (https://phabricator.wikimedia.org/T118863) [16:44:32] <_joe_> on a timeout [16:45:27] _joe_: actually the two bursts were different backend servers [16:45:34] (03CR) 10jenkins-bot: [V: 04-1] Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 (owner: 10Rush) [16:45:51] <_joe_> yup, and it's been idle ever since, or am I wrong [16:46:05] 65 failed in the first burst, then 4 minutes later 68 failed, and there's no common server between the two lists [16:46:08] (03PS1) 10Jcrespo: Allow user.user_touched field on labs replicas [software/redactatron] - 10https://gerrit.wikimedia.org/r/253930 [16:46:13] all apaches though [16:46:49] (03CR) 10Jcrespo: [C: 032] Allow user.user_touched field on labs replicas [software/redactatron] - 10https://gerrit.wikimedia.org/r/253930 (owner: 10Jcrespo) [16:46:51] nothing yet since [16:46:51] (03CR) 10Ottomata: [C: 032] Remove reference to eventlogging/EventLogging for deployment [puppet] - 10https://gerrit.wikimedia.org/r/253929 (https://phabricator.wikimedia.org/T118863) (owner: 10Ottomata) [16:46:51] <_joe_> bblack: apache is surely part of the equation I'd say [16:46:59] (03CR) 10Jcrespo: [V: 032] Allow user.user_touched field on labs replicas [software/redactatron] - 10https://gerrit.wikimedia.org/r/253930 (owner: 10Jcrespo) [16:47:05] the bursts were at :24 and :28 -ish [16:47:06] PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: puppet fail [16:48:07] checking [16:49:04] RECOVERY - puppet last run on db2070 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:56:22] _joe_: is pybal back to behaving reliably? 
[16:57:05] <_joe_> andrewbogott: ask bblack, but I guess so [16:57:55] PROBLEM - cassandra-a service on restbase2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:58:00] Glaisher: at 1,250,000 rows processed. [16:58:07] still going. [16:58:13] hmm [16:58:44] thcipriani: Do we know the total number of rows that needs to be updated? [16:58:59] andrewbogott: why? [16:59:36] bblack: just need to do a depool/repool dance [16:59:42] to reboot some things [16:59:45] should be normal, yes [16:59:48] ok thanks [17:00:04] Glaisher: it's weird, at the beginning of the script it said: Fixing collation for 1014805 rows [17:00:20] :O [17:00:52] in the bug it mentions 1.5 million rows. [17:01:42] _joe_: btw, people are increasingly impatient about https://phabricator.wikimedia.org/T112421 and I think you touched that package last. Can you comment at least enough to get me unblocked? [17:01:55] It looks like the latest build source isn’t in gerrit. [17:02:34] <_joe_> andrewbogott: well, it's marked "low" priority, what should I say about it? [17:02:57] block it on the ticket for "hire more opsen" :) [17:02:58] <_joe_> I probably just backported it? [17:03:33] _joe_: there are two issues, I think… 1) do we still need to patch the latest upstream version? and 2) where did the currently-installed build come from? [17:04:05] <_joe_> andrewbogott: pure backport [17:04:12] <_joe_> so there is no trace in gerrit [17:04:17] <_joe_> but you can apt-get source ofc [17:04:32] akosiaris, what are the steps to force sync trebuchet (in deploy-labs) [17:04:33] <_joe_> and as for 1) I think we don't [17:04:45] <_joe_> I think I solved that long time ago with TimStarling and ori [17:05:02] <_joe_> solved as in determined we don't need the internal security patch anymore [17:05:02] _joe_: are we sure the existing puppet client certs don't work for this? 
[17:05:04] (03PS1) 10DCausse: CirrusSearch: Include all languages we can detect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) [17:05:06] akosiaris, git deploy sync - graphoid is broken [17:05:11] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1813990 (10mobrovac) As @ori mentioned, the #Services team is working with #Analy... [17:05:14] I mean, they did have that client auth bit set or whatever in openssl [17:05:16] <_joe_> bblack: they don't have subjectAltNames we need [17:05:28] we don't even know that we need SANs I don't think [17:05:32] <_joe_> and they have one of the client auth bits, yes [17:05:37] _joe_: ok, thanks. Sounds like we can just upgrade to a newer upstream package, then, no custom build needed? [17:05:53] <_joe_> bblack: yup, we need them for things different from varnishes maybe [17:05:54] !log restart restbase2002-a cassandra instance, bootstrap failed with stream error [17:05:55] RECOVERY - cassandra-a service on restbase2002 is OK: OK - cassandra-a is active [17:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:06:08] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1814005 (10daniel) >>! In T118162#1813400, @jcrespo wrote: > Connection re-use do... 
[17:06:16] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1814006 (10Andrew) p:5Low>3Normal [17:06:30] <_joe_> andrewbogott: don't take my word at 6 pm as gold [17:06:34] I think if cp2011.codfw.wmnet makes an HTTPS connection to mw1153.eqiad.wmnet, we can have the TLS mutual auth on the real hostnames, and the rest be just a Host: header inside HTTPS. Assuming we don't need SNI, which I don't think we do. [17:06:36] ok :) [17:06:37] 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1814008 (10csteipp) [17:07:25] !log updateCollation.php for srwiki 1670608 rows processed [17:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:31] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1814019 (10Joe) Yes, the correct version of this patch is already included in our trusty packages. [17:07:31] ^ Glaisher jynus [17:07:39] finished \o/ [17:07:39] <_joe_> andrewbogott: check the source package [17:07:50] yep, will do. [17:07:54] 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1645318 (10csteipp) Security review is done. Note my comments about no aphlict.log, and making sure the Admin server is not exposed anywhere, wh... [17:08:34] want to make releases.wikimedia.org https-only. any concerns? [17:08:51] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1814031 (10Andrew) Giuseppe says: - The current package was a clean backport with the security patch so he didn't check anything into gerrit. - He's pretty... 
[17:09:32] !log depooled mw1153 for https://phabricator.wikimedia.org/T118888 [17:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:55] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1814039 (10jcrespo) You are opening one connection per wiki. That is wrong. Locki... [17:10:07] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1814041 (10daniel) @mobrovac Yes, and I'll be very happy to replace the entire ch... [17:11:31] jynus: hi. can we chat about the dispatcher briefly? [17:11:44] I'm still under the impression that there is a fundamental misunderstanding here [17:11:54] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1814061 (10jcrespo) ``` | 206189024 | wikiadmin | 10.64.32.13:49847 | ruwi... [17:12:01] See ^ [17:12:13] mutante: +1 [17:12:37] jynus: yes, i understand that. i wasn't doubting it either. the question is: why does it happen, and how can it be fixed. [17:12:49] you open 1 connection per wiki [17:12:57] (03CR) 10Hashar: "On gallium /var/lib/zuul/.ssh/id_rsa has been manually created. It will be overridden /managed by puppet once this patch lands. 
I create" [puppet] - https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) (owner: Hashar) [17:13:03] servers and wikis do not correlate [17:13:25] jynus: LoadBalancer is opening a new connection every time, even though it shouldn't, as per the spec of LoadBalancer (or my interpretation of it) [17:13:39] jynus: that would make this a LoadBalancer bug. Which, as far as I know, Aaron fixed. [17:13:57] ori: ok, thanks [17:14:04] sorry, I cannot tell it more clearly [17:14:23] the only other thing I can do is block the connections [17:14:46] jynus: you could try whether Aaron's patch fixes the problem [17:14:59] it was written with the intention to fix the problem, right? [17:15:11] I am not a developer nor a deployer [17:15:40] I maintain the infrastructure [17:15:47] the infrastructure has a problem [17:15:48] jynus: so, if a developer tells you "here is a patch that hopefully fixes the problem", you are not willing to try it? [17:15:50] I reported it [17:15:53] what else can I do but that? [17:15:59] am I blocking it? [17:16:04] yes, and we investigated, found an issue, and got a patch out [17:16:14] where am I blocking it? [17:16:35] where is my -1? [17:16:41] jynus: i think your statement earlier kept it from being backported [17:16:45] maybe I did one by error [17:16:55] which one?
[17:17:00] Ops-Access-Requests, operations, Mail: add jacob rogers (legal) to dns-admin alias - https://phabricator.wikimedia.org/T118970#1814084 (RobH) NEW a:RobH [17:17:12] (PS2) Dzahn: releases: enforce http->https redirect behind misc-web [puppet] - https://gerrit.wikimedia.org/r/253757 (https://phabricator.wikimedia.org/T118787) [17:17:17] !log repooled mw1153, depooling mw1154 [17:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:17:26] jynus: https://gerrit.wikimedia.org/r/#/c/252267/ [17:17:43] I am not a reviewer there [17:17:47] jynus: but according to aude, this is going to ride the regular deployment train in a couple of hours anyway [17:17:48] nor have I -1'ed [17:17:56] i actually assumed this would be backported asap when it got merged [17:18:08] I am happy about people trying things [17:18:12] is that ok? [17:18:31] jynus: i think the issue is that the patch that is supposed to fix this didn't get into SWAT, as it should have [17:18:48] presumably because we didn't push it, since it's core code [17:19:03] ok, I just was pointing out we have an issue, I do not follow the current HEAD [17:19:03] DanielK_WMDE_: would you like me to backport and sync it for you? [17:19:14] and I wasn't working last week, so it all sat there for a week [17:19:29] ori: sure, that would be awesome! [17:19:36] this may just fix it [17:19:37] * ori on it [17:19:50] or it may change nothing. in which case, it wasn't the only problem. [17:19:53] nobody is telling you personally anything, I am worried about the current state of the servers [17:20:04] :-) [17:20:08] in that case, my and catie's patch should kill the symptoms [17:20:17] but then we still have to dig up the actual problem, so it doesn't bite us again [17:20:40] i'm not sure closing the connection is the right thing to do here [17:20:48] jynus: and that's appreciated. i just got the impression you are barking up the wrong tree.
[17:20:59] i don't see this done much in other code [17:21:00] aude: me neither. but it would kill the symptom. [17:21:04] and jenkins had issues with it [17:21:05] Ops-Access-Requests, operations, Mail: add jacob rogers (legal) to dns-admin alias - https://phabricator.wikimedia.org/T118970#1814107 (RobH) Jacob has been updated via email (and confirmed) that he understands his use of the dns-admin alias should ONLY be to receive copyright issue emails. He knows t... [17:21:11] well, I can tell you what is happening [17:21:29] every wiki has a 400 second connection open per wiki [17:22:02] that is not a problem for s1, s2, s4, s5, s6 and s7 [17:22:06] but it is for s3 [17:22:07] jynus: i understood this, and it made me scratch my head. then i saw Aaron's patch, and I was like - oh oops, that would cause that issue, yea. [17:22:34] i saw the patch being merged, and I assumed it would be SWATed because you guys were after this issue getting fixed. [17:22:47] i suppose that's where things got stuck - nobody was responsible for getting it into swat. [17:22:47] please note that when a server fails connections, I get woken up [17:22:58] so I am quite invested in this being fixed [17:23:54] jynus: getting the patch deployed that *says* it is a fix for this would probably help [17:24:17] whatever works [17:25:03] jynus: who's responsibility should it be to get such a patch deployed? we need to answer that question, so we can avoid this kind of deadlock in the future [17:25:30] if it was a critical bug in wikibase, it would be aude or me or jan pushing this [17:25:33] but this was a core patch [17:25:53] good question [17:26:02] who maintains core [17:26:09] i don't think there is a clear decision-procedure, so probably best for all parties with a stake in the issue to assume responsibility and whoever deploys it first deploys it [17:26:10] had I known that this wasn't backported, i would have pushed for it - but I wasn't in the office, and nobody poked me about it.
[17:26:13] I would like an answer to that, too [17:26:15] !log repooled mw1154, depooled mw1155 [17:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:26:48] ori, jynus: once it'S synced, we should know within 10 minutes whether it worked, right? [17:27:10] i don't know how frequently the job is invoked [17:27:17] i haven't looked at the cron job def [17:27:19] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1814133 (10Reedy) So... That just means tin and silver to be upgraded and then we're done... I think [17:27:21] I can tell [17:27:26] every 3 minutes iirc [17:27:33] (03CR) 10Dzahn: [C: 032] releases: enforce http->https redirect behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/253757 (https://phabricator.wikimedia.org/T118787) (owner: 10Dzahn) [17:29:21] 6operations, 7Swift: add ms-be1019 / 1020 / 1021 to swift - https://phabricator.wikimedia.org/T118183#1814141 (10fgiunchedi) see also the final allocation plan originally at https://phabricator.wikimedia.org/T114711#1705505 | row | allocation | zones | total | |----- |------------| -------| ------| | A |... [17:29:30] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1814142 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi [17:30:05] !log ori@tin Synchronized php-1.27.0-wmf.6/includes/db/loadbalancer/LoadBalancer.php: I03e1386b: Make getLaggedSlaveMode() use reuseConnection() as needed (T118162) (duration: 00m 19s) [17:30:09] ori: so, generally, when ops files a critical bug, and a patch for the issue is merged into master - who's responsibility should it be to get it SWATed? 
[17:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:31:50] * DanielK_WMDE_ is really curious to see whether this works [17:32:16] s/who's/whose :P [17:32:21] (03PS1) 10Dzahn: releases: load mod_headers for proto redirect [puppet] - 10https://gerrit.wikimedia.org/r/253936 (https://phabricator.wikimedia.org/T118787) [17:32:25] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [500.0] [17:32:25] * aude going home in order to be back (and not hungry) for the train deployment [17:32:45] that doesn't look great [17:32:45] ori: i'll try to get that right in the future? [17:32:59] (03PS2) 10Dzahn: releases: load mod_headers for proto redirect [puppet] - 10https://gerrit.wikimedia.org/r/253936 (https://phabricator.wikimedia.org/T118787) [17:33:00] DanielK_WMDE_: relax :) i'm just trying to diffuse tension [17:33:06] probably failing as usual [17:33:27] (03CR) 10Dzahn: [C: 032] releases: load mod_headers for proto redirect [puppet] - 10https://gerrit.wikimedia.org/r/253936 (https://phabricator.wikimedia.org/T118787) (owner: 10Dzahn) [17:33:29] i'm currently trying to deam up a way to actualyl test this locally [17:33:41] setting up a multi-master environment isn't exactly easy [17:33:58] labs replicas? [17:34:03] DanielK_WMDE_: https://phabricator.wikimedia.org/T118829 filed yesterday [17:34:05] thoughts welcome [17:34:28] jynus: yea, but probably a couple of days of work for me. 
i'm not a server admin, i have to read up on basically everything [17:34:54] "i'm not a server admin", as if I knew what I was doing :-P [17:35:21] where I am making mediawiki suggestions without reading a single line of code [17:35:40] there was a brief spike in fatals, triggered by the deployment, but not the actual payload of the deployment [17:35:41] well, maybe 1 oe 2 [17:36:25] we've seen this before; a sudden spike in 'PHP Fatal Error: request has exceeded memory limit' in wmf-config/StartProfiler.php:70, which is where traces for all running threads are collated and shipped off to redis [17:36:38] !log releases.wikimedia.org now enforcing https protocol [17:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:16] mutante: \o/ that's great [17:37:26] godog: :) [17:37:40] i will wait a little bit before adding the HSTS header [17:37:44] just in case [17:38:03] this is just the proto redirect [17:38:31] well, certainly db1035 looks better already [17:39:53] pager grep -c wikiadmin; SHOW PROCESSLIST; 3 [17:40:18] jynus: that sounds about right [17:40:29] DanielK_WMDE_, I can help with setting up any testing environement, that is my job [17:40:40] whatever makes your life easier [17:40:41] jynus: that would be awesome, thanks! [17:40:48] and as a consequence, mince :-) [17:41:44] ori: all I can say to the "Automate the provisioning" ticket is "YES" :) [17:42:14] (03CR) 10Tjones: [C: 031] "Looks okay, but also unsure about some of the parts David is unsure about." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) (owner: 10DCausse) [17:42:19] thanks ori and DanielK_WMDE_ [17:42:42] was Aarons' patch? thanks to him also [17:43:04] AaronSchulz: ^ [17:43:27] I have a question and it is why this wasn't caught before if it was on core? [17:43:37] (the actual mistake is normal) [17:43:50] was is a larger number of queries? 
[17:44:49] operations, HTTPS, Patch-For-Review: releases.wikimedia.org should be https only and have hsts set - https://phabricator.wikimedia.org/T118787#1814197 (Dzahn) Merged the protocol redirect. It now redirects http->https. Waiting with the HSTS headers just a little bit just in case.. because that can't be... [17:45:27] jynus: yes, that was Aaron's patch, from a week ago [17:45:29] jynus, ori: so, in the future, let's make sure we actually deploy the fixes we already have, ok? [17:45:47] my mistake was thinking it was already deployed [17:46:01] same here [17:46:31] I was never terribly involved with this issue apart from agitating on Phabricator. And jynus wouldn't be the one deploying core changes anyway. So you should probably talk with Aaron. [17:46:37] that is why I said "that didn't work" [17:47:22] jynus: as to why it wasn't caught: a) because core has no unit tests for the LoadBalancer code. None. b) any spot checking and smoke tests usually don't try to connect to a few hundred client wikis, so it never becomes apparent that the connections stay open [17:47:29] jynus: to be fair, you said "I do not think that will work at all" [17:47:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:48:11] !log repooled mw1155, depooled 1157 (I did 56 and 58 last night) [17:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:02] and that is why you should not listen to me [17:50:16] (PS14) Rush: Remove old "openstack on labs" configuration [puppet] - https://gerrit.wikimedia.org/r/253909 [17:52:13] (CR) jenkins-bot: [V: -1] Remove old "openstack on labs" configuration [puppet] - https://gerrit.wikimedia.org/r/253909 (owner: Rush) [17:52:47] I meant that there was another problem - the open connection.
I failed to recognize that was not the main issue [17:52:54] it was [17:53:35] the 488 second connection persists [17:57:19] jynus: yea, we still keep the connections open for the entire time the script lives. not nice, but not so terrible for 3 connections [17:57:23] or 10 [17:57:28] +1 [17:57:35] my patch will reduce that to 1 per script [17:57:47] but the patch isn't easy to deploy, so it may be a week or two [17:57:51] wouldn't running via the job queue fix that too? [17:57:59] no [17:58:17] the problem with long running connections is what I mentioned about failures/configuration changes [17:58:18] running via the job queue alone wouldn't change a thing. except perhaps the db user [17:58:21] but then it would not run as wikiadmin, so if the script did misbehave, the databases would enforce a limit [17:58:35] yea, so dispatching would break [17:58:39] a slightly better failure mode [17:58:44] but not a fix to the problem [17:59:01] well that is actually the idea [17:59:11] always make small idempotent changes [17:59:15] yurik: sorry I was in an interview, so the problem is still there ? [17:59:26] because I can guarantee the service [17:59:27] ori: dispatching via the job queue means we need a completely new batching strategy. hoo and aude have been discussing this with me on and off. [17:59:29] !log repooling mw1157, depooled mw1159 [17:59:30] not the server [17:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:34] yurik: more precisely, what's the problem ?
[17:59:35] perhaps we can get some input from aaron on this during the summit [17:59:55] (03PS15) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [18:00:05] every day I depool and repool a server, query times must me small for maintenance and for failover [18:00:38] akosiaris, https://phabricator.wikimedia.org/T118929 [18:00:44] 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1814260 (10MZMcBride) [18:00:59] Lydia_WMDE: yea! good news: problem solved. aaron's patch did the trick. [18:01:01] akosiaris, pls comment how you solve it so that i could do it later too ) [18:01:05] the only exception being dumps, which have a dedicated server [18:01:25] Lydia_WMDE: the dispatch script still has a lot of issues, but nothing critical, as far as i can tell. [18:03:54] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1814269 (10mmodell) [18:04:39] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 5 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1814271 (10jcrespo) 5Open>3Resolved a:3jcrespo Resolving, and hopefuly futu... 
[18:04:59] 6operations, 10RESTBase-Cassandra: establish new thresholds for cassandra alarms after switching restbase to dtcs - https://phabricator.wikimedia.org/T118976#1814288 (10fgiunchedi) p:5Triage>3Normal a:3fgiunchedi [18:05:11] (03PS2) 10Dzahn: planet: add HSTS headers [puppet] - 10https://gerrit.wikimedia.org/r/253758 [18:05:15] there is one fear, and that is connections piling up [18:05:25] (03CR) 10Dzahn: [C: 032] "certificate renewal has been approved" [puppet] - 10https://gerrit.wikimedia.org/r/253758 (owner: 10Dzahn) [18:05:26] will reopen if that happens [18:08:45] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/1320/" [puppet] - 10https://gerrit.wikimedia.org/r/253909 (owner: 10Rush) [18:09:21] !log repooled mw1159, depooled mw1160 [18:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:18] (03PS16) 10Rush: Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 [18:11:39] 6operations, 10vm-requests, 5Patch-For-Review: VM request for OpenLDAP labs servers - https://phabricator.wikimedia.org/T118726#1814306 (10akosiaris) VMs created with MAC addresses |seaborgium.wikimedia.org | aa:00:00:4d:be:84 | |serpens.wikimedia.org | aa:00:00:0e:a4:81 | [18:12:03] (03CR) 10Rush: [C: 032] Remove old "openstack on labs" configuration [puppet] - 10https://gerrit.wikimedia.org/r/253909 (owner: 10Rush) [18:15:15] (03CR) 10Dzahn: "@Paladox i heard that gitblit may die within the next couple days even" [puppet] - 10https://gerrit.wikimedia.org/r/251836 (owner: 10Paladox) [18:16:22] 6operations, 10Beta-Cluster-Infrastructure, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1814315 (10dduvall) [18:17:49] apergos: how about https://gerrit.wikimedia.org/r/#/c/230738/1 it just adds a link and i see that link already works [18:18:23] !log repooled mw1160. And that’s all of them! 
[18:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:51] AaronSchulz: new redis box quotes in for higher clock and less cores =] [18:19:12] one vendor of two so far [18:19:14] mutante: I never merged that? huh [18:19:23] 6operations: Kernel errors on rendering hosts - https://phabricator.wikimedia.org/T118888#1814317 (10Andrew) 5Open>3Resolved a:3Andrew All of the rendering servers (mw1153-mw1160) are now running 3.13.0-62-generic. Note that that buggy kernel is surely running on lots of other bits of our cluster. [18:19:25] sure, just check that the rewrite does the right thing after [18:19:42] (03CR) 10Dzahn: [C: 031] "the link this redirects to does work. i'm not sure if there is a difference between ./fundraising/ and ./other/fundraising/" [puppet] - 10https://gerrit.wikimedia.org/r/230738 (owner: 10ArielGlenn) [18:20:03] apergos: is there a difference between ./fundraising/ and ./other/fundraising/ ? [18:20:25] both work [18:20:29] yes: we want anything that's not a wiki* dump to be under "other" [18:20:45] there's a symlink I think but we might as well redirect them [18:20:49] ok [18:20:59] because the old link is already published, is all [18:23:10] (03PS1) 10Filippo Giunchedi: restbase: higher sstable per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/253941 (https://phabricator.wikimedia.org/T118976) [18:23:12] (03PS1) 10Filippo Giunchedi: restbase: remove cassandra pending compactions alert [puppet] - 10https://gerrit.wikimedia.org/r/253942 (https://phabricator.wikimedia.org/T118976) [18:25:14] (03PS2) 10Dzahn: dataset: add redirect for fundraising data and link on web page [puppet] - 10https://gerrit.wikimedia.org/r/230738 (owner: 10ArielGlenn) [18:25:45] (03PS1) 10Alexandros Kosiaris: Introduce seaborgium and serpens [puppet] - 10https://gerrit.wikimedia.org/r/253943 (https://phabricator.wikimedia.org/T118726) [18:25:49] 10Ops-Access-Requests, 6operations, 7Mail: add jacob rogers (legal) to 
dns-admin alias - https://phabricator.wikimedia.org/T118970#1814334 (10RobH) 5Open>3Resolved Merged. [18:28:35] addshore, apergos, I’m pretty sure that https://phabricator.wikimedia.org/T118739 is blocked on comments from one or the other of you… could you follow up? [18:29:03] waiting for addshore [18:29:43] *reads* [18:30:17] thanks! [18:30:59] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1814340 (10Addshore) I am fine with doing it either way :) Someone from the analytics team may also have an opinion... [18:31:01] andrewbogott: apergos ^^ [18:31:46] apergos: can you continue with implementation since you’re already in to it a bit? (If you’re swamped then I can catch up and work on it, what with clinic duty and all) [18:32:27] I'll do it as I described there, tomorrow am though (while I'm still hanging around in irc, I'm dinner and off the clock unless emergencies now) [18:32:47] (03PS3) 10Dzahn: dataset: add redirect for fundraising data and link on web page [puppet] - 10https://gerrit.wikimedia.org/r/230738 (owner: 10ArielGlenn) [18:32:55] apergos: thank you! [18:33:14] yw [18:33:32] (03CR) 10Dzahn: [C: 032] dataset: add redirect for fundraising data and link on web page [puppet] - 10https://gerrit.wikimedia.org/r/230738 (owner: 10ArielGlenn) [18:34:59] (03CR) 10Mobrovac: RESTBase configuration for scap3 deployment (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/252887 (owner: 10Thcipriani) [18:35:12] (03CR) 10Andrew Bogott: [C: 031] "Is that key already present in the private puppet repo?" [puppet] - 10https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [18:41:22] (03PS1) 10Faidon Liambotis: Various fixes [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/253946 [18:41:26] Coren: ^ [18:41:32] Coren: pretty solid work overall [18:41:58] paravoid: Lots of RTFM. 
Ima look at your patch now. [18:42:05] 6operations, 10hardware-requests, 5Patch-For-Review: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1814370 (10RobH) [18:42:22] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Various fixes [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/253946 (owner: 10Faidon Liambotis) [18:42:51] ottomata: https://gerrit.wikimedia.org/r/#/c/252961/ ? :) [18:42:59] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1035333 (10RobH) [18:43:24] (03PS2) 10Alexandros Kosiaris: Introduce seaborgium and serpens [puppet] - 10https://gerrit.wikimedia.org/r/253943 (https://phabricator.wikimedia.org/T118726) [18:44:30] paravoid: will amend... [18:45:23] paravoid: Ah, goodie; I was mostly worried around the diversion stuff which I had never toyed with before. I'm still unhappy about having to hook into a very internal glibc thing, but afaict that was the only way to avoid nscd screwing us up. [18:46:49] paravoid: Also, I have nfs-kernel-server as a pre-depends because of the diversion; that might have been overly cautious? [18:46:56] (03PS1) 10Cmjohnson: Removing all entries regarding analytics1003/4/10 bug: task T118572 [puppet] - 10https://gerrit.wikimedia.org/r/253947 [18:50:13] 6operations, 10Analytics, 10CirrusSearch, 6Discovery, 7audits-data-retention: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1814417 (10Ottomata) Hey wait a minute, isn't this already happening? https://github.com/wikimedia/operat... [18:50:30] (03PS1) 10Dzahn: datasets: add missing line break in index HTML [puppet] - 10https://gerrit.wikimedia.org/r/253948 [18:51:02] (03PS2) 10Dzahn: datasets: add missing line break in index HTML [puppet] - 10https://gerrit.wikimedia.org/r/253948 [18:51:25] Jeff_Green: can we remove the kafkatee fundraising stuff from erbium too? 
it was only there for verification [18:51:30] and now you are running it inside of frack, ja? [18:52:34] (03PS2) 10DCausse: CirrusSearch: Include all languages we can detect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) [18:52:51] (03PS1) 10Rush: Labs dedupe top level hiera values [puppet] - 10https://gerrit.wikimedia.org/r/253949 [18:52:56] 6operations, 7Database, 7Wikimedia-log-errors: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035 - https://phabricator.wikimedia.org/T107072#1814418 (10Krinkle) [18:53:20] (03CR) 10Dzahn: [C: 032] datasets: add missing line break in index HTML [puppet] - 10https://gerrit.wikimedia.org/r/253948 (owner: 10Dzahn) [18:53:22] (03PS2) 10Rush: Labs dedupe top level hiera values [puppet] - 10https://gerrit.wikimedia.org/r/253949 [18:53:58] matanya: hey, i'm here btw. 3rd floor [18:54:16] matanya: food? [18:54:19] ah, great, see you at our break [18:54:31] what time is it again? [18:54:31] (03CR) 10Mobrovac: [C: 031] restbase: higher sstable per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/253941 (https://phabricator.wikimedia.org/T118976) (owner: 10Filippo Giunchedi) [18:54:33] (03PS5) 10Ottomata: Remove role::logging::udp2log::erbium and friends [puppet] - 10https://gerrit.wikimedia.org/r/252961 (https://phabricator.wikimedia.org/T84062) (owner: 10Faidon Liambotis) [18:54:43] don't really have time for food today, sadly :) [18:54:48] approx 13:00 [18:55:01] matanya: ok, i'll go before that and be back by 1 [18:55:08] thanks! 
[18:55:19] (03PS3) 10Rush: Labs dedupe top level hiera values [puppet] - 10https://gerrit.wikimedia.org/r/253949 [18:55:35] (03CR) 10jenkins-bot: [V: 04-1] Remove role::logging::udp2log::erbium and friends [puppet] - 10https://gerrit.wikimedia.org/r/252961 (https://phabricator.wikimedia.org/T84062) (owner: 10Faidon Liambotis) [18:58:55] (03PS6) 10Ottomata: Remove role::logging::udp2log::erbium and friends [puppet] - 10https://gerrit.wikimedia.org/r/252961 (https://phabricator.wikimedia.org/T84062) (owner: 10Faidon Liambotis) [18:59:41] (03PS1) 10Dzahn: dumps: fix redirect for fundraising dumps [puppet] - 10https://gerrit.wikimedia.org/r/253950 [18:59:43] ori: jynus: regarding connection re-use, in Wikimedia Labs it's quite common for tools to "query all wikis". In that case the meta registry is used to know which wikis are on which shard, and the tool then opens (as needed) up to 7 connections, switches db contexts when needed, and closes them at the end. [18:59:56] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce seaborgium and serpens [puppet] - 10https://gerrit.wikimedia.org/r/253943 (https://phabricator.wikimedia.org/T118726) (owner: 10Alexandros Kosiaris) [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151118T1900). [19:00:05] (03PS3) 10Alexandros Kosiaris: Introduce seaborgium and serpens [puppet] - 10https://gerrit.wikimedia.org/r/253943 (https://phabricator.wikimedia.org/T118726) [19:00:06] We did the same in production in 2011 during the usability initiative with the preference feature that changed the user preference for skin=Vector on all wikis.
[19:00:12] (03PS2) 10Dzahn: dumps: fix redirect for fundraising dumps [puppet] - 10https://gerrit.wikimedia.org/r/253950 [19:00:14] (03CR) 10jenkins-bot: [V: 04-1] dumps: fix redirect for fundraising dumps [puppet] - 10https://gerrit.wikimedia.org/r/253950 (owner: 10Dzahn) [19:01:25] (03CR) 10Rush: [C: 032 V: 032] "http://puppet-compiler.wmflabs.org/1321/" [puppet] - 10https://gerrit.wikimedia.org/r/253949 (owner: 10Rush) [19:02:32] paravoid: amended patch [19:02:37] removed a couple more things [19:02:59] also i removed the kafkatee role from erbium, as i'm pretty sure it was only used for initial testing by fr folks [19:03:04] Jeff_Green: should verify [19:03:17] with that gone, erbium is a spare, and can just be reinstalled [19:03:37] (03PS3) 10Dzahn: dumps: fix redirect for fundraising dumps [puppet] - 10https://gerrit.wikimedia.org/r/253950 [19:03:56] (03CR) 10Ottomata: [C: 031] "Jeff should approve too, since I also removed the kafkatee fundraising stuff from erbium. I'm pretty sure that this was only being used f" [puppet] - 10https://gerrit.wikimedia.org/r/252961 (https://phabricator.wikimedia.org/T84062) (owner: 10Faidon Liambotis) [19:04:44] (03CR) 10Dzahn: [C: 032] dumps: fix redirect for fundraising dumps [puppet] - 10https://gerrit.wikimedia.org/r/253950 (owner: 10Dzahn) [19:05:20] 6operations, 6Analytics-Backlog, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Migrate reqstats icinga alerts to new graphite metrics and deprecate or adapt reqstats gdash - https://phabricator.wikimedia.org/T118979#1814440 (10Ottomata) 3NEW a:3Ottomata [19:05:46] 6operations, 6Analytics-Backlog, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Migrate reqstats icinga alerts to new graphite metrics and deprecate or adapt reqstats gdash - https://phabricator.wikimedia.org/T118979#1814440 (10Ottomata) a:5Ottomata>3None [19:06:01] 6operations, 7Monitoring: Migrate reqstats icinga alerts to new graphite metrics and deprecate or adapt reqstats gdash - 
https://phabricator.wikimedia.org/T118979#1814440 (10Ottomata) [19:06:44] (03PS4) 10Alexandros Kosiaris: Introduce seaborgium and serpens [puppet] - 10https://gerrit.wikimedia.org/r/253943 (https://phabricator.wikimedia.org/T118726) [19:06:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce seaborgium and serpens [puppet] - 10https://gerrit.wikimedia.org/r/253943 (https://phabricator.wikimedia.org/T118726) (owner: 10Alexandros Kosiaris) [19:09:17] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1814464 (10Andrew) @Joe, when you say 'the corect version of this patch' do you mean this? https://gerrit.wikimedia.org/r/#/c/28496/ Or is there a different patch someplace else? [19:10:04] apergos: i fixed the redirect with one follow-up. done now [19:10:16] thank you! [19:10:27] (03PS1) 10Chad: Revoke my shell access [puppet] - 10https://gerrit.wikimedia.org/r/253951 [19:10:32] apergos: yw [19:11:19] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1814470 (10Andrew) Wait, I'm dumb, it took me a while to figure out what @joe meant by 'simple backport.' [19:12:53] (03CR) 10Tjones: [C: 031] "Still looks good, still not sure about unsure parts. Added comments on some comments. (So very useful!)" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) (owner: 10DCausse) [19:15:49] (03Abandoned) 10Dzahn: move all non-default contact_group variables to hiera [puppet] - 10https://gerrit.wikimedia.org/r/244814 (owner: 10John F. Lewis) [19:16:13] (03CR) 10Greg Grossmeier: [C: 04-1] "I don't think this issue is worth revocation. 
Yes, there are areas for improvement for this issue, but that doesn't trump all the other gr" [puppet] - 10https://gerrit.wikimedia.org/r/253951 (owner: 10Chad) [19:17:34] ottomata: erbium can be lobotomized, it's fine [19:17:56] paravoid: ^^ LESDOIT [19:19:35] mutante: i need some help regarding an abused stolen account [19:20:07] (03CR) 10Ryan Lane: "It's there because you need to run a command when new repos are added to trebuchet. If trebuchet isn't being used for new repos, then it i" [puppet] - 10https://gerrit.wikimedia.org/r/219372 (owner: 10ArielGlenn) [19:20:17] or any sysadmin with shell account [19:21:44] (03CR) 10Alex Monk: [C: 04-1] Revoke my shell access [puppet] - 10https://gerrit.wikimedia.org/r/253951 (owner: 10Chad) [19:23:03] matanya: i'm here, what's up [19:23:26] https://wikitech.wikimedia.org/wiki/Password_reset [19:23:36] (03CR) 10Yuvipanda: [C: 04-2] "What Greg said." [puppet] - 10https://gerrit.wikimedia.org/r/253951 (owner: 10Chad) [19:23:46] a zh-wiki long-term admin got his account stolen and email reset [19:24:14] in addition, the account taker made a steward remove his rights, and deleted the email account [19:24:41] we need to reset his email so he can take the account back please, before more harm is done [19:25:21] his identity was verified over a phone call [19:26:46] eh, ok, i don't know how to yet [19:27:14] the wiki link above has the process [19:28:27] matanya: alright. so https://wikitech.wikimedia.org/wiki/Password_reset/Confirming_identities has been done? [19:28:39] yes, by two people [19:28:42] PM me please [19:28:46] independently.
[19:29:00] (03CR) 10DCausse: CirrusSearch: Include all languages we can detect (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) (owner: 10DCausse) [19:30:06] (03PS3) 10DCausse: CirrusSearch: Include all languages we can detect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) [19:31:23] " Don't reset things just because a user asked you on IRC ":) [19:31:50] ok, hold on [19:35:23] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1814545 (10MoritzMuehlenhoff) 3NEW [19:39:59] matanya, I'm on it but would like to confirm in person. where are you at the office? [19:40:29] 5th [19:40:41] what room? [19:41:48] matanya, ^^^ [19:41:56] sec [19:43:49] (03PS2) 10Dzahn: sudo journalctl: make missing restrions obvious [puppet] - 10https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: 10JanZerebecki) [19:44:47] greogey something [19:45:59] gvachaknavi MaxSem [19:46:06] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1814569 (10RobH) a:3MoritzMuehlenhoff The requirements for these are very low (basically any minimal server we have can do this.) However, can this live in a ganeti VM? I assume tha... 
[19:47:09] (03PS3) 10Dzahn: sudo journalctl: make missing restrions obvious [puppet] - 10https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: 10JanZerebecki) [19:47:30] (03PS4) 10Dzahn: sudo journalctl: make missing restrictions obvious [puppet] - 10https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: 10JanZerebecki) [19:48:26] (03PS5) 10Dzahn: sudo journalctl: make missing restrictions obvious [puppet] - 10https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: 10JanZerebecki) [19:49:00] matanya, PM me your email [19:49:13] (03CR) 10Dzahn: [C: 032] "it's true, this is not an effective restriction. and just making it consistent with other existing services. to actually restrict this we'" [puppet] - 10https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: 10JanZerebecki) [19:49:31] 6operations, 6Project-Creators: create #ops-eqdfw & #ops-eqord projects - https://phabricator.wikimedia.org/T117585#1814581 (10Aklapper) @RobH: Do it! :) [19:52:24] 6operations, 10ops-eqiad, 10netops: cr1-eqiad PEM 0 fan failed - https://phabricator.wikimedia.org/T118721#1814592 (10Cmjohnson) Juniper Networks Service Logistics has created RMA R405131-1, from Case Number 2015-1118-0268 for the replacement of defective with serial QCS1046C0AY. You will receive shipment a... [19:54:32] (03PS1) 1020after4: group1 wikis to 1.27.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253956 [19:54:51] (03CR) 1020after4: [C: 032] group1 wikis to 1.27.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253956 (owner: 1020after4) [19:55:13] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253956 (owner: 1020after4) [19:56:24] (03CR) 10Tjones: [C: 031] "Still looks good."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) (owner: 10DCausse) [19:58:27] (03PS1) 10Yuvipanda: shinken: Deal with instances that have no classes [puppet] - 10https://gerrit.wikimedia.org/r/253957 [19:58:48] (03PS2) 10Yuvipanda: shinken: Deal with instances that have no classes [puppet] - 10https://gerrit.wikimedia.org/r/253957 [19:59:04] Jeff_Green: if you could paste that template for VM requests.. i'll create one for you for the testing [19:59:11] and install it with jessie? [19:59:34] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.7 [19:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:08] mutante: will do [20:01:28] (03CR) 10Yuvipanda: [C: 032] shinken: Deal with instances that have no classes [puppet] - 10https://gerrit.wikimedia.org/r/253957 (owner: 10Yuvipanda) [20:07:34] https://phabricator.wikimedia.org/T118988 ... spamming the logs quite a bit [20:07:45] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [20:07:52] "CURLOPT_SAFE_UPLOAD" is an undefined constant [20:09:02] Reset User:人神之间 email (confirmed in person with Matanya) [20:09:19] 2um not sure what's up with palladium [20:09:20] * YuviPanda looks [20:09:40] !log maxsem reset User:人神之间 email (confirmed in person with Matanya) [20:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:35] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:11:54] matanya, ^^^^ [20:13:18] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 5 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1814722 (10ori) #Wikidata folks (especially @daniel and @Lydia_Pintscher): Just a... 
[20:17:46] (03PS1) 10Dereckson: Whitelist domains for server-side upload - Coding Da Vinci Hackathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253963 (https://phabricator.wikimedia.org/T118844) [20:22:46] (03CR) 10Ottomata: "1 comment, otherwise +1." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/253947 (owner: 10Cmjohnson) [20:23:06] 6operations, 10Analytics, 10CirrusSearch, 6Discovery, 7audits-data-retention: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1814779 (10EBernhardson) The oldest file i see is CirrusSearchRequests.log-20150726.gz. A 90 day retention... [20:29:37] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1814814 (10MoritzMuehlenhoff) >>! In T118983#1814569, @RobH wrote: > The requirements for these are very low (basically any minimal server we have can do this.) However, can this live... [20:29:58] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1814815 (10MoritzMuehlenhoff) a:5MoritzMuehlenhoff>3RobH [20:31:18] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 5 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1814824 (10Lydia_Pintscher) @ori: Don't worry. We have it fixed and will do some... [20:39:23] 6operations, 7Performance: forceprofile=1 with X-Wikimedia-Debug: 1 header does not work on non-wikipedias - https://phabricator.wikimedia.org/T118990#1814853 (10aude) 3NEW [20:40:08] (03CR) 10Hashar: "No the key is not in the production private repo. 
I probably just generated it locally as the zuul user and thus building up tech debt :(" [puppet] - 10https://gerrit.wikimedia.org/r/253925 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [20:43:07] (03PS1) 10Ottomata: Use mtime instead of ctime when considering file retention, fix retention for mw-logs on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/253968 (https://phabricator.wikimedia.org/T118527) [20:43:09] !log upgrading pybal to 1.13.1 on lvs200[456] [20:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:01] (03CR) 10Ottomata: [C: 032] Use mtime instead of ctime when considering file retention, fix retention for mw-logs on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/253968 (https://phabricator.wikimedia.org/T118527) (owner: 10Ottomata) [20:46:05] (03CR) 10EBernhardson: CirrusSearch: Include all languages we can detect (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253933 (https://phabricator.wikimedia.org/T118571) (owner: 10DCausse) [20:46:12] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1814895 (10RobH) I cannot disclose our discounted pricing on this public task, so we'll leave the approval for allocation of one spare in EQIAD. Dell PowerEdge R420, single Intel Xeon... [20:46:26] 6operations, 10Analytics, 10CirrusSearch, 6Discovery, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1814896 (10Ottomata) Ok, yup, bug! the job that removed old files was using ctime instead of mtime, and apparently th... [20:47:21] 6operations, 6Analytics-Kanban, 10CirrusSearch, 6Discovery, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old.
- https://phabricator.wikimedia.org/T118527#1814901 (10Ottomata) [20:47:29] (03CR) 10Dereckson: [C: 04-1] "Commit message should be rewritten." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) (owner: 10Luke081515) [20:48:46] (03CR) 10Dereckson: "By the way, it would be nice if this change is only about the new group." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) (owner: 10Luke081515) [20:50:34] 6operations, 6Analytics-Kanban, 10CirrusSearch, 6Discovery, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1814908 (10Ottomata) Let's wait a few days and make sure this is working, then I think we can close. [20:52:56] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:05] (03PS3) 10Dereckson: Add a rollbacker group for wuuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253084 (https://phabricator.wikimedia.org/T116270) (owner: 10Luke081515) [20:55:12] (03CR) 10Dereckson: [C: 031] Add a rollbacker group for wuuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253084 (https://phabricator.wikimedia.org/T116270) (owner: 10Luke081515) [21:00:00] 6operations, 7Performance: forceprofile=1 with X-Wikimedia-Debug: 1 header does not work on non-wikipedias - https://phabricator.wikimedia.org/T118990#1814965 (10aude) also doesn't work on test.wikipedia.org ``` curl -H 'X-Wikimedia-Debug: 1' --dump-header - https://test.wikipedia.org/wiki/Kitten ``` my gues... [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151118T2100). 
[21:04:13] (03PS2) 10Luke081515: Add new group "curator" to enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) [21:06:48] !log starting parsoid deploy [21:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:20] (03PS1) 10Luke081515: Remove rights from sysop group, and give them to the bureaucrats group at enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254029 (https://phabricator.wikimedia.org/T113109) [21:09:38] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1815012 (10Krenair) >>! In T104147#1814133, @Reedy wrote: > So... That just means tin and silver to be upgraded and then we're done... I think I thought silver already had the patch? [21:10:47] Is Tin on 14.04?//backported rsvg? [21:10:53] !log synced code; restarted parsoid on wtp1003 as a canary [21:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:11:23] (03CR) 10Dzahn: "re:Robh -> https://phabricator.wikimedia.org/T118816" [puppet] - 10https://gerrit.wikimedia.org/r/250050 (https://phabricator.wikimedia.org/T105555) (owner: 10Dzahn) [21:12:52] 6operations, 5Patch-For-Review: move racktables and RT to a VM - https://phabricator.wikimedia.org/T105555#1815047 (10Dzahn) [21:12:54] 6operations, 7Database: mysql permission request: racktables from krypton - https://phabricator.wikimedia.org/T118816#1815044 (10Dzahn) 5Open>3Resolved a:3Dzahn @jcrespo ok! thank you. [21:13:07] wtp1003 looking good. restarting on all nodes. 
[21:13:12] 6operations, 7Database: mysql permission request: racktables from krypton - https://phabricator.wikimedia.org/T118816#1815053 (10Dzahn) 5Resolved>3Open [21:13:13] 6operations, 5Patch-For-Review: move racktables and RT to a VM - https://phabricator.wikimedia.org/T105555#1446410 (10Dzahn) [21:14:32] 6operations, 7Database: mysql permission request: racktables from krypton - https://phabricator.wikimedia.org/T118816#1809991 (10Dzahn) >>! In T118816#1813442, @jcrespo wrote: > ticket should be changed to: make sure only 10.64.32.182 has access to racktables. renaming ticket. i will take it until i'm done wi... [21:15:06] 6operations, 7Database: mysql privs: restrict access to racktables to krypton - https://phabricator.wikimedia.org/T118816#1815064 (10Dzahn) [21:15:55] !log finished deploying parsoid sha e0a4fc91 [21:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:34] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1815076 (10RobH) a:5RobH>3mark @Mark, also please assign this back to me post approval/comment so I can follow up, thank you! (I'll also keep this open until the codfw allocation i... 
[21:19:35] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [21:20:25] (03PS2) 10Dzahn: racktables: add role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250050 (https://phabricator.wikimedia.org/T105555) [21:22:03] (03PS3) 10Dzahn: racktables: add role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250050 (https://phabricator.wikimedia.org/T105555) [21:22:39] (03CR) 10Dzahn: [C: 032] "nothing is moved yet, i'm just starting with the role to see what else we'll need when applying this over here" [puppet] - 10https://gerrit.wikimedia.org/r/250050 (https://phabricator.wikimedia.org/T105555) (owner: 10Dzahn) [21:31:16] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures [21:32:25] (03PS1) 10Eevans: [WIP] hooks-based event production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 [21:32:38] (03PS2) 10Cmjohnson: Removing all entries regarding analytics1003/4/10 bug: task tT118572 [puppet] - 10https://gerrit.wikimedia.org/r/253947 [21:35:18] (03PS3) 10Cmjohnson: Removing all entries regarding analytics1003/4/10 bug: task tT118572 [puppet] - 10https://gerrit.wikimedia.org/r/253947 [21:37:04] (03CR) 10Cmjohnson: [C: 032] Removing all entries regarding analytics1003/4/10 bug: task tT118572 [puppet] - 10https://gerrit.wikimedia.org/r/253947 (owner: 10Cmjohnson) [21:39:50] (03CR) 10Aude: "i am enthusiastic about the work on events here, but it somewhat scares me to have this much hooks code in the config (e.g. without any so" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 (owner: 10Eevans) [21:41:51] (03CR) 10Greg Grossmeier: "Mukunda: If this looks ok, let's schedule this for the week of Dec 7th (since the next two weeks we don't have the train anyways)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/253371 (https://phabricator.wikimedia.org/T115002) (owner: 10Greg Grossmeier) [21:41:54] (03CR) 10Bartosz Dziewoński: "There is a WikimediaEvents extension where this kind of glue code has been dumped." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 (owner: 10Eevans) [21:42:23] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1815162 (10EBernhardson) 3NEW [21:42:36] (03CR) 10Dereckson: "This change contains artefact of another change." (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) (owner: 10Luke081515) [21:42:56] (03CR) 10Dereckson: "(commit message is good)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252012 (https://phabricator.wikimedia.org/T113109) (owner: 10Luke081515) [21:43:01] (03CR) 1020after4: "ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253371 (https://phabricator.wikimedia.org/T115002) (owner: 10Greg Grossmeier) [21:43:12] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1815172 (10Tfinc) Approved [21:46:48] (03CR) 10Chad: [C: 04-2] "Couple of nitpicks inline. Generally speaking though this looks like an extension and should be written/deployed as such and not ad-hoc in" (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 (owner: 10Eevans) [21:47:07] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1815184 (10EBernhardson) @jgirault @jdrewniak You will both need to generate ssh keys used only for wmf prod... 
[21:47:21] (03CR) 10Chad: "(Or put it in WM Events like Bartosz suggested)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 (owner: 10Eevans) [21:52:04] (03PS1) 10Dzahn: racktables: puppetize /srv/org/.. directory tree [puppet] - 10https://gerrit.wikimedia.org/r/254037 (https://phabricator.wikimedia.org/T105555) [21:52:36] (03CR) 10Dereckson: "What about 'nuke and unblockself only for bureaucrats on en.wikiversity' as title?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254029 (https://phabricator.wikimedia.org/T113109) (owner: 10Luke081515) [21:52:37] (03PS1) 10Cmjohnson: Removing dns entries for decommmission analytics1003,4,10 and a few straggler mgmt entries for previously decom'd analytics servers bug task# T118572 [dns] - 10https://gerrit.wikimedia.org/r/254038 [21:53:15] (03PS2) 10Dzahn: racktables: puppetize /srv/org/.. directory tree [puppet] - 10https://gerrit.wikimedia.org/r/254037 (https://phabricator.wikimedia.org/T105555) [21:53:55] (03CR) 10Dzahn: [C: 032] racktables: puppetize /srv/org/.. 
directory tree [puppet] - 10https://gerrit.wikimedia.org/r/254037 (https://phabricator.wikimedia.org/T105555) (owner: 10Dzahn) [21:54:30] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decommmission analytics1003,4,10 and a few straggler mgmt entries for previously decom'd analytics servers bug task [dns] - 10https://gerrit.wikimedia.org/r/254038 (owner: 10Cmjohnson) [21:55:12] (03PS2) 10Luke081515: Nuke and unblockself only for bureaucrats on en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254029 (https://phabricator.wikimedia.org/T113109) [21:55:38] (03CR) 10Luke081515: "Was a good idea ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254029 (https://phabricator.wikimedia.org/T113109) (owner: 10Luke081515) [21:57:51] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:58:14] 6operations, 10ops-eqiad, 5Patch-For-Review: Wipe and remove from rack Analytics1003, 1004 and 1010 - https://phabricator.wikimedia.org/T118999#1815200 (10Cmjohnson) 3NEW a:3Cmjohnson [21:59:39] 6operations, 10ops-eqiad, 5Patch-For-Review: Decommission cisco servers, Analytics1003, 1004 and 1010 - https://phabricator.wikimedia.org/T118572#1815210 (10Cmjohnson) 5Open>3Resolved Decommissioned according to server lifecycle page...also deleted salt keys on neodymium. Created sub-task for wiping/rack... 
[21:59:59] 6operations, 10ops-eqiad, 5Patch-For-Review: Wipe and remove from rack Analytics1003, 1004 and 1010 - https://phabricator.wikimedia.org/T118999#1815200 (10Cmjohnson) p:5Triage>3Normal [22:01:55] (03PS2) 10Dereckson: Remove Browse experimental config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252660 (https://phabricator.wikimedia.org/T113686) (owner: 10Phuedx) [22:02:02] (03CR) 10Dereckson: [C: 031] Remove Browse experimental config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252660 (https://phabricator.wikimedia.org/T113686) (owner: 10Phuedx) [22:24:21] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:25:37] (03CR) 10MaxSem: "Agree with Aude, this needs to be a proper extension because it is not configuration." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 (owner: 10Eevans) [22:26:12] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [22:31:50] (03PS1) 10Dzahn: deactivate wikidata.pt [dns] - 10https://gerrit.wikimedia.org/r/254042 [22:32:02] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:41] PROBLEM - puppet last run on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:22] PROBLEM - RAID on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:42] PROBLEM - Check size of conntrack table on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:33:42] PROBLEM - SSH on analytics1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:52] PROBLEM - configured eth on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:01] PROBLEM - DPKG on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:34:03] PROBLEM - dhclient process on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:12] PROBLEM - Disk space on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:13] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:22] PROBLEM - Disk space on Hadoop worker on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:31] PROBLEM - salt-minion processes on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:34:40] (03PS1) 10Dzahn: deactivate wiki-pedia.org [dns] - 10https://gerrit.wikimedia.org/r/254043 [22:34:43] PROBLEM - Hadoop DataNode on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:26] (03PS2) 10Dzahn: deactivate wikidata.pt [dns] - 10https://gerrit.wikimedia.org/r/254042 [22:38:51] !log powercycled analytics1030 - no ssh response, mgmt console login timed out [22:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:41:36] (03PS1) 10Dzahn: deactivate wikiepdia.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/254049 [22:41:42] RECOVERY - Disk space on Hadoop worker on analytics1030 is OK: DISK OK [22:41:51] RECOVERY - salt-minion processes on analytics1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:41:53] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures [22:42:03] RECOVERY - Hadoop DataNode on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [22:42:42] RECOVERY - RAID on analytics1030 is OK: OK: optimal, 13 logical, 14 physical [22:42:52] RECOVERY - Check size of conntrack table on analytics1030 is OK: OK: nf_conntrack is 0 % full [22:42:52] RECOVERY - SSH on analytics1030 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) 
[22:43:01] (03PS1) 10Dzahn: deactivate wiikipedia.com [dns] - 10https://gerrit.wikimedia.org/r/254054 [22:43:02] RECOVERY - configured eth on analytics1030 is OK: OK - interfaces up [22:43:12] RECOVERY - DPKG on analytics1030 is OK: All packages OK [22:43:12] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [22:43:21] RECOVERY - dhclient process on analytics1030 is OK: PROCS OK: 0 processes with command name dhclient [22:43:22] RECOVERY - Disk space on analytics1030 is OK: DISK OK [22:44:21] (03PS1) 10Dzahn: deactivate wicipediacymraeg.org [dns] - 10https://gerrit.wikimedia.org/r/254055 [22:45:07] (03PS1) 10Rush: WIP: hiera-ize labs openstack nova configuration [puppet] - 10https://gerrit.wikimedia.org/r/254056 [22:47:42] and now the worst commit message ... [22:47:42] (03PS1) 10Dzahn: deactivate wikimedia.community [dns] - 10https://gerrit.wikimedia.org/r/254057 [22:50:11] (03PS1) 10Dzahn: deactivate voyagewiki.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/254058 [22:50:11] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: puppet fail [22:51:53] (03CR) 10Aaron Schulz: "WM Events is fine by me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254033 (owner: 10Eevans) [22:54:19] (03PS1) 10Dzahn: apache: remove wikimemory.org redirect [puppet] - 10https://gerrit.wikimedia.org/r/254059 [22:56:29] (03PS1) 10Dzahn: apache: remove wikipaedia.net redirect [puppet] - 10https://gerrit.wikimedia.org/r/254060 [23:02:06] (03PS1) 10Dzahn: remove quality.wikipedia.[org|com] redirects. [puppet] - 10https://gerrit.wikimedia.org/r/254061 [23:10:53] 6operations, 10vm-requests, 5Patch-For-Review: VM request for OpenLDAP labs servers - https://phabricator.wikimedia.org/T118726#1815415 (10akosiaris) 5Open>3Resolved a:3akosiaris VMs are up and running, resolving [23:11:05] yo can someone here answer a quick question about the upcoming freeze for me? 
(I've also read the deployments page.) [23:12:26] RoanKattouw: ^ maybe? [23:12:42] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1815424 (10JGirault) ``` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDMR2TEdzI8TP4Khbvm7AOqX/dTKWoxtrcb4fb0gDzocfdi... [23:15:43] anyways, I'm looking for some clarity on what qualifies as "High Priority SWATs" during the pre- and post-Thanksgiving weeks [23:16:10] (03CR) 10Andrew Bogott: [C: 04-1] "Apart from a minor naming request, this looks good to me -- it's a big improvement." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/254056 (owner: 10Rush) [23:16:46] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [23:21:59] (03PS1) 10MaxSem: Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254065 [23:24:12] jhobs: You should ask greg-g [23:24:47] jhobs: I don't mean to deflect that question but I'm not the deployments coordinator, I'm just a random guy that does deploys sometimes (and also designed many of the original processes 3 years ago). Greg is the Master of Deployments :) [23:24:55] totally fine, thanks [23:26:53] greg-g: I'm looking for some clarity on what qualifies as "High Priority SWATs" during the pre- and post-Thanksgiving weeks. Got a minute to answer a question or two? 
[23:27:16] (03PS5) 10BBlack: wikimedia.vcl: set CP ('Connection Properties') cookie in vcl_deliver [puppet] - 10https://gerrit.wikimedia.org/r/253645 (owner: 10Ori.livneh) [23:34:10] (03CR) 10BBlack: [C: 032] wikimedia.vcl: set CP ('Connection Properties') cookie in vcl_deliver [puppet] - 10https://gerrit.wikimedia.org/r/253645 (owner: 10Ori.livneh) [23:38:32] 6operations: Investigate why mw2208 is powered off - https://phabricator.wikimedia.org/T118857#1815511 (10BBlack) Yeah I've found mw2208 also spamming failures in PyBal while still being pooled. I'm going to depool it for now in PyBal as well with a comment referencing this ticket. [23:45:09] !log upgrading pybal to 1.13.1 on lvs400[34] [23:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:16] (03PS1) 10EBernhardson: Turn on language detection user test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254070 (https://phabricator.wikimedia.org/T118290) [23:45:18] (03PS1) 10EBernhardson: Turn off language detection user test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254071 (https://phabricator.wikimedia.org/T118292) [23:47:05] greg-g, i need a few minutes to re-deploy graphoid service. 
15 min before swat should be tons of time [23:49:20] !log upgrading pybal to 1.13.1 on lvs300[34] [23:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:51:13] !log git deploy sync latest graphoid [23:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:52:58] (03PS1) 10Dereckson: Namespace configuration on es.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254072 (https://phabricator.wikimedia.org/T119006) [23:53:06] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: puppet fail [23:54:18] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago [23:54:57] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [23:56:08] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures