[00:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151106T0000).
[00:00:04] bd808 matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:15] hi
[00:00:22] o/
[00:00:31] hmm
[00:00:40] Here
[00:00:40] James_F and i were supposed to deploy a thing too
[00:00:47] Oh.
[00:00:48] Yeah.
[00:00:49] Whoops.
[00:00:53] Krenair: my 2 patches mess with logging config so matt_flaschen should probably go first
[00:00:55] Forgot to add to the calendar.
[00:01:04] good idea
[00:01:33] and James_F if he can talk you into a late entry ;)
[00:01:44] matt_flaschen, schema changes? in swat?
[00:02:00] MatmaRex: Are you putting them in the list?
[00:02:41] eww. that seems very not-SWAT
[00:02:45] Oh, you've already dealt with the schema part?
[00:02:52] and just need to start using it?
[00:02:52] James_F: not at the moment
[00:02:54] but i can
[00:02:58] Doing it.
[00:03:14] alright
[00:03:43] Krenair, the schema change is not part of SWAT.
[00:03:46] Just the code deployment.
[00:03:49] Krenair: https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/core+branch:wmf/1.27.0-wmf.5+owner:%22Jforrester+%253Cjdforrester%2540gmail.com%253E%22,n,z
[00:04:12] Phase 1 of the schema change is already done, the other schema change will be done later not part of SWAT.
[00:04:26] Krenair, but if you would prefer, we can schedule a window.
[00:05:11] Yes, basically it will start inserting to the new column.
[00:05:14] greg-g, thoughts?
[00:05:18] Everything else is before or later.
[00:05:39] If it's just a submodule bump that seems normal
[00:05:54] (PS1) Yuvipanda: dynamicproxy: Explicitly specify callable for uwsgi [puppet] - https://gerrit.wikimedia.org/r/251438
[00:06:12] Yeah, that's all I planned for SWAT.
[00:06:20] (CR) Yuvipanda: [C: 2 V: 2] dynamicproxy: Explicitly specify callable for uwsgi [puppet] - https://gerrit.wikimedia.org/r/251438 (owner: Yuvipanda)
[00:07:09] mmh?
[00:08:04] yeah, using the new db structure is fine, but not making it, during swat :)
[00:08:54] (CR) BryanDavis: "I put this change up for SWAT on the morning of 2015-11-09." [mediawiki-config] - https://gerrit.wikimedia.org/r/250897 (https://phabricator.wikimedia.org/T111829) (owner: Bgerstle)
[00:11:28] matt_flaschen, it seems okay, I've abandoned the wmf.4 one and approved the wmf.5 one
[00:12:10] Krenair, thanks, I should have re-checked the roadmap.
[00:12:21] The roadmap hasn't been updated to take it into account yet.
[00:12:22] But:
[00:12:26] krenair@tin:~$ mwversionsinuse
[00:12:26] 1.27.0-wmf.5
[00:12:41] krenair@tin:~$
[00:13:34] (PS1) Yuvipanda: dynamicproxy: Mount dynamicproxy API on a prefix [puppet] - https://gerrit.wikimedia.org/r/251439
[00:14:50] (CR) Yuvipanda: [C: 2] dynamicproxy: Mount dynamicproxy API on a prefix [puppet] - https://gerrit.wikimedia.org/r/251439 (owner: Yuvipanda)
[00:15:26] Ah, did you clean up php-1.27.0-wmf.5, twentyafterfour
[00:15:28] ?
[00:15:49] thcipriani did
[00:15:56] ok
[00:16:00] thanks thcipriani
[00:16:38] oh Krenair did you hear that the tin->mira sync is in scap now?
[00:16:38] Krenair: np, twentyafterfour did the actual syncing out after I did the requisite git-futzing.
[00:16:53] yes bd808
[00:18:06] !log krenair@tin Synchronized php-1.27.0-wmf.5/extensions/Flow: https://gerrit.wikimedia.org/r/#/c/251416/ (duration: 00m 37s)
[00:18:11] matt_flaschen, ^ please test
[00:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:18:16] Deskana: when you say git-based, does that imply PHP?
[00:18:42] Krenair, thanks, testing.
[00:18:55] paravoid: not yet
[00:19:34] paravoid: they have been talking about some type of static generation system but I think for now it's just static html, js, css
[00:19:41] ...why?
[00:19:56] reasons?
[00:20:21] ebernhardson mentioned there was a horror-filled({{cn}}) 20-page phab ticket somewhere
[00:20:26] the static files first makes sense. It's just replacing a static file pulled from metawiki with a static file in git
[00:21:27] YuviPanda: https://phabricator.wikimedia.org/T110070
[00:21:51] make sure to click "show older changes" several times ;)
[00:23:03] wtf
[00:25:04] matt_flaschen, how's it going?
[00:25:31] (i'm not here right now, please ping James_F if our stuff comes up while i'm not around)
[00:25:42] Krenair, done. Tested on Wikipedia and MediaWiki.org, looks good.
[00:25:49] thanks
[00:25:56] a ghost has taken over MatmaRex's irc client!
[00:27:04] I should really work on my 'Hide certain people's comments from phab' user script.
[00:27:07] * YuviPanda puts it on next week
[00:28:00] (CR) Alex Monk: [C: 2] Monolog: Use useMicrosecondTimestamps() on Loggers [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550) (owner: BryanDavis)
[00:28:05] who are you ignoring YuviPanda? :P
[00:28:42] any discussion of that in public will not be productive and I think you already know anyway :P
[00:28:49] (CR) jenkins-bot: [V: -1] Monolog: Use useMicrosecondTimestamps() on Loggers [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550) (owner: BryanDavis)
[00:29:00] maybe :P
[00:29:05] what's up jerkins?
[00:29:22] * YuviPanda curses wikitech, re-logs in
[00:29:23] PHP Fatal error: Class 'Monolog\Logger' not found in /mnt/jenkins-workspace/workspace/operations-mw-config-phpunit/wmf-config/logging.php on line 280
[00:29:46] It succeeded before. . .
[00:30:53] hey paravoid
[00:31:00] hi MaxSem
[00:31:05] nevermind, I saw the task
[00:31:12] so what was your concern? :)
[00:31:14] baby steps sounds fine :P
[00:31:40] Krenair: that's a weird error
[00:31:42] yeah
[00:32:05] That line has been there for quite a while so the error is something more jenkins related
[00:32:07] shall we do the others and come back to this?
[00:32:13] sure
[00:32:39] (PS2) Alex Monk: logstash: Exclude runJobs info events from logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/251317 (https://phabricator.wikimedia.org/T113571) (owner: BryanDavis)
[00:32:44] (CR) Alex Monk: [C: 2] logstash: Exclude runJobs info events from logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/251317 (https://phabricator.wikimedia.org/T113571) (owner: BryanDavis)
[00:33:14] (Merged) jenkins-bot: logstash: Exclude runJobs info events from logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/251317 (https://phabricator.wikimedia.org/T113571) (owner: BryanDavis)
[00:33:38] (PS2) BryanDavis: Monolog: Use useMicrosecondTimestamps() on Loggers [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550)
[00:33:59] (CR) jenkins-bot: [V: -1] Monolog: Use useMicrosecondTimestamps() on Loggers [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550) (owner: BryanDavis)
[00:34:10] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/251317/ (duration: 00m 35s)
[00:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:34:16] bd808, ^
[00:35:12] Krenair: looks good
[00:35:35] runJobs in logstash just dropped like a rock and runJobs.log is still going strong
[00:36:07] the jerkins failure looks like it is because mediawiki/vendor isn't present in the test environment
[00:36:24] strangely we merged a change to the same file earlier
today
[00:36:35] so a new jenkins/zuul change?
[00:37:18] I can't do CI stuff. legoktm?
[00:38:40] hi
[00:38:41] what's up?
[00:38:49] I don't see anything in -releng that explains it
[00:39:06] legoktm: jenkins being mean about https://gerrit.wikimedia.org/r/#/c/249313/
[00:39:30] It is saying that a class provided by mediawiki/vendor isn't present
[00:39:57] because the phpunit tests aren't run alongside MW
[00:40:08] but it worked 83 minutes ago for -- https://gerrit.wikimedia.org/r/#/c/251431/
[00:40:14] (PS1) Yuvipanda: dynamicproxy: Tweak uwsgi parameters some more [puppet] - https://gerrit.wikimedia.org/r/251443
[00:40:34] you're now unconditionally calling Monolog in your file
[00:40:51] previously it was guarded by method_exists
[00:40:58] oh crap you're right
[00:41:03] I'll put the guard back
[00:41:54] (CR) Yuvipanda: [C: 2] dynamicproxy: Tweak uwsgi parameters some more [puppet] - https://gerrit.wikimedia.org/r/251443 (owner: Yuvipanda)
[00:42:13] (PS3) BryanDavis: Monolog: Use useMicrosecondTimestamps() on Loggers [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550)
[00:42:33] thanks for reading my patch for me legoktm :/
[00:43:17] :))
[00:43:28] Krenair: it's green now
[00:43:37] ok
[00:43:42] doing James_F's patches now, am coming back to it
[00:43:54] *nod*
[00:45:08] (PS1) Yuvipanda: uwsgi: Fix notify behavior [puppet] - https://gerrit.wikimedia.org/r/251444
[00:45:08] Matma's patches. He did the work. I'm just the guy to blame if it breaks stuff. :-)
[00:47:16] Krenair: your invisible unicorn patch has been applied!
[00:47:23] and I also refactored some of the other unicorn bits
[00:47:24] yay
[00:47:38] Krenair: but however OSM still adds the domain entry
[00:47:42] since there's no validation in mw
[00:47:44] I updated ticket
[00:48:03] Won't OSM just fail to add the entry?
[00:48:08] no
[00:48:12] it succeeds adding the domain entry
[00:48:14] what
[00:48:16] and then fails adding the proxy entry
[00:48:18] Successfully added test//234 entry for IP address 208.80.155.156.
[00:48:19] Ohhh.
[00:48:20] Failed to create new proxy test//234.wmflabs.org.
[00:48:37] That's bad.
[00:48:44] yes
[00:48:49] so it needs an OSM patch
[00:48:50] But probably an existing issue, if it were to fail for any other reason.
[00:49:01] yeah
[00:49:04] totally
[00:49:09] you'll end up with ghost proxies
[00:49:13] that vaguely exist but do not really?
[00:49:26] also the invisible unicorn code is now in the puppet repo and not in a separate deb anymore, easier to hack on
[00:50:40] Krenair: ah, but because Special:NovaProxy lists responses from the proxy API, it no longer shows the malformed domain
[00:50:50] I can only see it in manage addresses for project-proxy
[00:51:29] (CR) Yuvipanda: [C: 2] uwsgi: Fix notify behavior [puppet] - https://gerrit.wikimedia.org/r/251444 (owner: Yuvipanda)
[00:51:36] so you can technically identify all the broken ones through project-proxy's NovaAddress list?
[00:51:55] yeah
[00:54:42] Krenair: ok so from what I see
[00:54:45] this would:
[00:54:55] 1. cleanup the broken ones by identifying from the NovaAddress list
[00:54:57] 2. OSM patch
[00:55:07] (i'm here now)
[00:55:07] I can probably do (1) but not (2)
[00:55:16] I might be able to do (2)
[00:56:12] (hmm, they're still merging?)
[00:56:20] (for… almost 20 minutes? :/)
[00:56:28] Dear CI.
[00:56:30] 20 minutes.
[00:56:31] You suck.
[00:56:40] Krenair: awesome.
[00:56:55] Krenair: I'll do (1) later today
[00:57:20] I had almost forgotten I was doing this.
[00:57:21] Okay.
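Annotation: the "ghost proxy" failure discussed above (the domain entry is added, then the proxy creation fails and nothing cleans up) can be illustrated with a minimal Python sketch. This is not OpenStackManager's code, which is a PHP MediaWiki extension; every name here (`add_domain_entry`, `remove_domain_entry`, `create_proxy`, `create_proxy_with_rollback`, the in-memory dictionaries, the `VALID_LABEL` rule and the backend address) is hypothetical and chosen only to show the idea of undoing step one when step two fails.

```python
import re

# Hypothetical in-memory stand-ins for the DNS entries and the proxy API.
dns_entries = {}
proxies = {}

# Assumed validity rule for a host label; the log does not state the real rule.
VALID_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


class ProxyError(Exception):
    """Raised when the (hypothetical) proxy API rejects a request."""


def add_domain_entry(name, ip):
    dns_entries[name] = ip


def remove_domain_entry(name):
    dns_entries.pop(name, None)


def create_proxy(name, backend):
    # Mimic the proxy API rejecting malformed names such as "test//234".
    if not VALID_LABEL.match(name):
        raise ProxyError(f"Failed to create new proxy {name}.wmflabs.org.")
    proxies[name] = backend


def create_proxy_with_rollback(name, ip, backend):
    """Add the domain entry, then the proxy; undo the entry if the proxy fails."""
    add_domain_entry(name, ip)
    try:
        create_proxy(name, backend)
    except ProxyError:
        remove_domain_entry(name)  # avoid leaving a ghost entry behind
        return False
    return True
```

Validating the name before the first step would avoid the rollback entirely; the rollback still covers the case raised in the channel where the proxy call fails for any other reason.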
[00:59:11] (PS1) BryanDavis: logstash: Exclude zero from Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/251446
[00:59:40] !log krenair@tin Synchronized php-1.27.0-wmf.5/resources/src/mediawiki/mediawiki.ForeignStructuredUpload.js: https://gerrit.wikimedia.org/r/#/c/251301/ https://gerrit.wikimedia.org/r/#/c/251306/ and https://gerrit.wikimedia.org/r/#/c/251307/ (duration: 00m 35s)
[00:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:00:04] * YuviPanda goes for food
[01:00:11] (PS1) MaxSem: Update MEDIAWIKI_DBLIST_DIR [mediawiki-config] - https://gerrit.wikimedia.org/r/251447
[01:00:42] !log krenair@tin Synchronized php-1.27.0-wmf.5/resources/src/mediawiki/mediawiki.ForeignStructuredUpload.BookletLayout.js: https://gerrit.wikimedia.org/r/#/c/251307/ (duration: 00m 34s)
[01:00:46] James_F, please test ^
[01:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:01:00] (CR) BryanDavis: [C: 1] Update MEDIAWIKI_DBLIST_DIR [mediawiki-config] - https://gerrit.wikimedia.org/r/251447 (owner: MaxSem)
[01:02:37] bblack, around? i'm doing some maps stress tests, and for some reason only one varnish peaks out
[01:03:39] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=maps+Cluster+codfw&tab=m&vn=&hide-hf=false&sh=1
[01:03:43] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Maps%2520caches%2520eqiad&tab=m&vn=&hide-hf=false
[01:06:06] James_F?
[01:06:38] Krenair: Still testing.
[01:07:13] k
[01:08:45] Krenair: Yup, all looks good.
[01:09:09] (CR) Alex Monk: [C: 2] Update MEDIAWIKI_DBLIST_DIR [mediawiki-config] - https://gerrit.wikimedia.org/r/251447 (owner: MaxSem)
[01:09:31] (Merged) jenkins-bot: Update MEDIAWIKI_DBLIST_DIR [mediawiki-config] - https://gerrit.wikimedia.org/r/251447 (owner: MaxSem)
[01:10:35] !log krenair@tin Synchronized multiversion/defines.php: https://gerrit.wikimedia.org/r/#/c/251447/ (duration: 00m 34s)
[01:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:11:45] operations, Beta-Cluster-Infrastructure, MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#1787372 (Mattflaschen) Resolved>Open Seems like it did, unless it was a different cause.
[01:13:08] (PS6) MaxSem: Beta: deploy www.wikipedia.beta.wmflabs.org/ from version control [puppet] - https://gerrit.wikimedia.org/r/248374
[01:13:15] (PS7) MaxSem: Beta: deploy www.wikipedia.beta.wmflabs.org/ from version control [puppet] - https://gerrit.wikimedia.org/r/248374
[01:14:04] (CR) Ori.livneh: [C: 2 V: 2] Beta: deploy www.wikipedia.beta.wmflabs.org/ from version control [puppet] - https://gerrit.wikimedia.org/r/248374 (owner: MaxSem)
[01:21:58] Krenair: still working on the patches for James_F?
[01:22:12] no
[01:22:47] can I jump in and do my logging changes that got stuck?
[01:24:04] yep
[01:24:11] sweet
[01:24:23] (CR) BryanDavis: [C: 2] Monolog: Use useMicrosecondTimestamps() on Loggers [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550) (owner: BryanDavis)
[01:24:45] (Merged) jenkins-bot: Monolog: Use useMicrosecondTimestamps() on Loggers [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550) (owner: BryanDavis)
[01:26:01] !log bd808@tin Synchronized wmf-config/logging.php: https://gerrit.wikimedia.org/r/#/c/249313/ (duration: 00m 34s)
[01:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:28:07] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: Touch (duration: 00m 35s)
[01:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:28:25] duh. gotta actually apply the patch after fetching
[01:29:57] !log bd808@tin Synchronized wmf-config/logging.php: https://gerrit.wikimedia.org/r/#/c/249313/ (forgot to apply after fetching) (duration: 00m 34s)
[01:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:30:28] that's better
[01:31:02] (PS2) BryanDavis: logstash: Exclude zero from Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/251446
[01:31:09] (CR) BryanDavis: [C: 2] logstash: Exclude zero from Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/251446 (owner: BryanDavis)
[01:31:44] (Merged) jenkins-bot: logstash: Exclude zero from Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/251446 (owner: BryanDavis)
[01:32:58] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/251446/ (duration: 00m 35s)
[01:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:38:53] !log krenair@mira Synchronized README: test (duration: 00m 34s)
[01:38:59] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:39:08] yay
[01:39:42] cool
[01:42:13] still a bunch of permissions errors to clean up
[01:42:19] Ooh, fancy.
[01:42:28] Krenair: oh? on the tin side
[01:42:31] One step closer to killing 5.3!
[01:42:46] yeah
[01:42:47] we got the mira side chowned earlier today
[01:42:57] !log krenair@mira Synchronized README: done (duration: 00m 34s)
[01:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:43:07] it's a ton of errors under "01:42:23 Copying to tin.eqiad.wmnet from mira.codfw.wmnet"
[01:43:10] if you have the output paste it and we'll get somebody to fix tomorrow
[01:44:02] are they all just mtime errors?
[01:44:02] operations, Discovery, Maps, Traffic, Discovery-Maps-Sprint: Load testing Maps sent all traffic to only one server - https://phabricator.wikimedia.org/T117937#1787418 (Yurik) NEW a:BBlack
[01:44:18] operations, Discovery, Maps, Traffic, Discovery-Maps-Sprint: Load testing Maps sent all traffic to only one server - https://phabricator.wikimedia.org/T117937#1787427 (Yurik)
[01:44:21] chgrp errors bd808
[01:44:34] hmm ok
[01:44:42] e.g. only root can mess with tests/multiversion/MWMultiVersionTest.php
[01:44:54] no one else can deploy changes to that from tin
[01:46:18] I think that flipping back and forth is going to take root ownership fixes every time actually.
[01:46:32] maybe not
[01:46:55] it seems likely that git operations are going to create new files owned by whomever did the deploy
[01:47:09] and then mtims coming from the other side won't work
[01:47:13] *mtime
[01:47:47] nobody should be allowed to make /srv/mediawiki-staging files have any group other than mwdeploy, or make them unwritable by the group
[01:48:08] yeah.
we could actually write up some kind of icinga check for that
[01:49:03] See mira:~krenair/sync-errors
[01:50:55] all the l10n_cache ones might be legitimate
[01:51:05] because l10nupdate owns those?
[01:51:08] yeah
[01:51:29] but mwdeploy should be in that group
[01:51:29] private/WikitechPrivateLdapSettings.php comes up
[01:51:35] wasn't puppet fixed to deal with that?
[01:52:08] I thought it was
[01:52:34] getting closer anyway
[01:52:47] and yet: -r--r--r-- 1 root root
[01:53:02] and we are at least mirroring state to codfw now which is actually huge
[01:53:17] I don't think there has ever been a spare deploy server
[01:53:59] (in case you're wondering why I dumped that file on a server instead of using paste: chrome locked up when I tried to paste it...)
[01:54:30] Krenair: phaste it!
[01:54:42] ?
[01:54:51] see /usr/local/bin/phaste
[01:54:58] a tool o.ri made
[01:55:39] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/deployment/manifests/wikitech.pp;d76f731e2774b4fa3ee74c2aa92e21fc3a514248$14
[01:56:06] um /srv/mediawiki
[01:56:22] ... Huh.
[01:56:48] How did this ever work?
[01:57:10] Did someone manually copy? ಠ_ಠ
[01:57:18] that would be the right place on silver
[01:57:31] and that role is for silver right?
[01:57:37] so yeah maybe copy pasta
[01:58:05] it really doesn't need to be on tin/mira at all probably
[01:58:06] Nope, it's included in role::deployment::mediawiki
[01:59:30] * bd808 leaves for dinner
[02:01:46] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1787436 (Krenair) Think it's mainly just a case of fixing file permissions on tin now so we don't get a ton of errors. I made a basic change on mira to the README...
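Annotation: the rule stated above (files under /srv/mediawiki-staging should belong to group mwdeploy and stay group-writable) and the idea of a monitoring check for it can be sketched roughly as follows. This is only an illustration in Python, not an existing Icinga plugin; the function name `find_bad_permissions` and its parameters are hypothetical, and the expected group is passed as a numeric gid so the function makes no assumption about which groups exist on the host.

```python
import os
import stat


def find_bad_permissions(root, expected_gid):
    """Return paths under root whose group is wrong or that are not group-writable."""
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISLNK(st.st_mode):
                continue  # permissions on the symlink itself are not meaningful
            wrong_group = st.st_gid != expected_gid
            not_group_writable = not (st.st_mode & stat.S_IWGRP)
            if wrong_group or not_group_writable:
                bad.append(path)
    return sorted(bad)
```

On a deploy host the gid would presumably be looked up with `grp.getgrnam("mwdeploy").gr_gid`, and a wrapper following the usual Nagios/Icinga plugin convention would exit 0 when the list is empty and 2 (CRITICAL) otherwise.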
[02:02:27] operations, Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1787437 (Krenair)
[02:06:29] (PS6) Alex Monk: Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: TTO)
[02:07:28] (PS7) Alex Monk: Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: TTO)
[02:07:42] (CR) Alex Monk: Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: TTO)
[02:28:01] !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 05m 56s)
[02:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:18:47] operations, Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1787498 (demon) >>! In T95436#1787436, @Krenair wrote: > Think it's mainly just a case of fixing file permissions on tin now so we don't get a ton of errors. I made a basic change on m...
[03:23:26] operations, Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1787502 (Krenair) See mira:~krenair/sync-errors
[03:31:14] RECOVERY - swift codfw-prod object availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [95.0]
[03:44:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[04:26:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 3 below the confidence bounds
[05:05:34] operations, Traffic, Browser-Support-Internet-Explorer, HTTPS: Xbox 360 Internet Explorer unable to view Wikipedia - https://phabricator.wikimedia.org/T105455#1787575 (Chmarkine) Are there any updates now?
[05:32:26] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: puppet fail
[05:34:15] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[06:00:35] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:17:45] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps DWDM]
[06:23:24] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps DWDM]
[06:25:14] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0
[06:25:15] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0
[06:30:34] PROBLEM -
puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:35] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:24] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:14] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:16] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:45] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:55] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:51:15] PROBLEM - puppet last run on mw2004 is CRITICAL: CRITICAL: puppet fail
[06:57:44] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:34] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:15:56] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:21:28] (CR) TTO: "As this is Labs only, does this need to wait for a SWAT window or can it be merged at any time?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: TTO)
[07:26:45] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[07:26:55] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[07:27:05] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:14] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:27:36] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[07:35:26] operations, Release-Engineering-Team, Database, Patch-For-Review, WorkType-Maintenance: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1787641 (Nemo_bis)
[07:55:25] RECOVERY - cassandra-a CQL 10.64.16.188:9042 on praseodymium is OK: TCP OK - 0.003 second response time on port 9042
[08:05:44] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps DWDM]
[08:05:54] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps DWDM]
[08:28:35] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0
[08:28:54] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0
[08:34:41] (PS2) Muehlenhoff: Assign salt grains for dbstore systems [puppet] -
https://gerrit.wikimedia.org/r/250994
[08:38:35] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [500.0]
[08:42:24] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:51:32] (CR) Muehlenhoff: [C: 2 V: 2] Assign salt grains for dbstore systems [puppet] - https://gerrit.wikimedia.org/r/250994 (owner: Muehlenhoff)
[08:52:24] (CR) Jcrespo: [C: 2] Repool 34,42,35,38,39,40. New: 58,64,60,61,62,59. Depool: 44,45,48 [mediawiki-config] - https://gerrit.wikimedia.org/r/251467 (owner: Jcrespo)
[08:55:30] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db20 34,42,35,38,39,40. New: 58,64,60,61,62,59. Depool: 44,45,48 (duration: 01m 10s)
[08:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:59:12] (CR) Hashar: "Nodepool refresh the reference image once per day or so. I will manually force the update based on https://wikitech.wikimedia.org/wiki/No" [puppet] - https://gerrit.wikimedia.org/r/251432 (https://phabricator.wikimedia.org/T111005) (owner: Gilles)
[09:00:37] (PS1) Giuseppe Lavagetto: redis: install two additional redis servers [puppet] - https://gerrit.wikimedia.org/r/251469 (https://phabricator.wikimedia.org/T117916)
[09:01:46] (PS2) Giuseppe Lavagetto: redis: install two additional redis servers [puppet] - https://gerrit.wikimedia.org/r/251469 (https://phabricator.wikimedia.org/T117916)
[09:04:50] (CR) Giuseppe Lavagetto: [C: 2] redis: install two additional redis servers [puppet] - https://gerrit.wikimedia.org/r/251469 (https://phabricator.wikimedia.org/T117916) (owner: Giuseppe Lavagetto)
[09:09:29] operations, Continuous-Integration-Infrastructure: can't install libcurl-dev on Jessie - https://phabricator.wikimedia.org/T117955#1787753 (hashar) NEW
[09:09:46] !log Stopping mysql and cloning db2044 -> db2065, db2048 -> db2069, db2045 -> db2066
[09:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:10:02] operations, Continuous-Integration-Infrastructure: can't install libcurl-dev on Jessie - https://phabricator.wikimedia.org/T117955#1787761 (hashar)
[09:11:27] any clue as to why puppet would not install a virtual package "libcurl-dev" :-(
[09:11:42] Package libcurl-dev is a virtual package provided by: ....
[09:11:47] E: Package 'libcurl-dev' has no installation candidate
[09:11:51] libcurl4-openssl-dev 7.38.0-4+deb8u2
[09:11:52] libcurl4-nss-dev 7.38.0-4+deb8u2
[09:11:52] libcurl4-gnutls-dev 7.38.0-4+deb8u2
[09:12:12] seems they are different encryption libraries
[09:17:17] (CR) Alexandros Kosiaris: [C: -1] Create a define to register extra LDAP schemas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: Muehlenhoff)
[09:17:25] <_joe_> hashar: aptitude why might help you
[09:17:27] hashar: there's multiple versions of the library to link against (since some licenses are GPL-incompatible)
[09:17:43] ohhh
[09:17:56] I actually had the issue with Squid on Jessie not being compiled with ssl support
[09:18:02] because of license issue with openssl
[09:18:39] (CR) Alexandros Kosiaris: [C: -1] "looks good, minor comment inline" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) (owner: Muehlenhoff)
[09:18:46] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps DWDM]
[09:18:54] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps DWDM]
[09:19:06] <_joe_> uhm it's down
again
[09:19:11] <_joe_> akosiaris: ^^
[09:19:57] ok watching it again then
[09:20:57] gilinx/zayo is already seeing traffic
[09:21:23] (PS1) Muehlenhoff: Update to 3.19.8-ckt8 [debs/linux] - https://gerrit.wikimedia.org/r/251475
[09:26:20] (CR) Muehlenhoff: [C: 2 V: 2] Update to 3.19.8-ckt8 [debs/linux] - https://gerrit.wikimedia.org/r/251475 (owner: Muehlenhoff)
[09:33:29] (PS4) Zfilipin: rubocop: Ignoring Style/WordArray offense [puppet] - https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651)
[09:33:44] (PS1) Muehlenhoff: Update to 3.19.8-ckt9 [debs/linux] - https://gerrit.wikimedia.org/r/251477
[09:34:53] (CR) Zfilipin: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[09:39:11] Puppet, Continuous-Integration-Config: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1787784 (zeljkofilipin) a:zeljkofilipin
[09:42:23] operations, RESTBase-Cassandra, Patch-For-Review: PID not expanded in heap dumps - https://phabricator.wikimedia.org/T116814#1787787 (fgiunchedi) a:fgiunchedi
[09:43:48] operations, Continuous-Integration-Infrastructure: can't install libcurl-dev on Jessie - https://phabricator.wikimedia.org/T117955#1787788 (hashar) So we have to pick one of them. The Debian package python-pycurl at https://packages.debian.org/jessie/python-pycurl mentions: > NOTE: the SSL support is pro...
[09:49:17] (03PS1) 10Hashar: Revert "Add libcurl-dev to Python Jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/251479 [09:49:32] (03PS2) 10Hashar: Revert "Add libcurl-dev to Python Jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/251479 [09:50:36] (03PS3) 10Hashar: contint: explicitly set ssl/tls libcurl dev package [puppet] - 10https://gerrit.wikimedia.org/r/251479 (https://phabricator.wikimedia.org/T111005) [09:51:29] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 3.19.8-ckt9 [debs/linux] - 10https://gerrit.wikimedia.org/r/251477 (owner: 10Muehlenhoff) [09:51:56] GNUTLS all the way \O/ [09:52:47] (03CR) 10Muehlenhoff: [C: 031] contint: explicitly set ssl/tls libcurl dev package [puppet] - 10https://gerrit.wikimedia.org/r/251479 (https://phabricator.wikimedia.org/T111005) (owner: 10Hashar) [09:53:22] moritzm: I am cherry picking it on the CI puppet master to make sure it fixes the puppet failures on all three distros [09:53:34] ok [09:53:35] <_joe_> hashar: why gnutls all the way? [09:53:42] libcurl-dev [09:53:45] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [09:53:45] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 [09:53:51] asks to pick one of openssl/gnutls/nss [09:54:05] and the python package mentions it is using gnutls [09:54:32] not that I know the differences between openssl and gnutls besides the different license :-/ [09:54:40] <_joe_> my question was more: why excitement about using gnutls? [09:54:49] ohh [09:54:57] <_joe_> :) [09:55:05] I guess the newbie in me is excited about using something with GNU in the name [09:55:16] and I am quite happy to have figured out the proper lib to use all by myself [09:55:19] low-hanging fruit!
[09:55:20] <_joe_> :) [09:55:37] 80% of the time [09:55:42] I have no clue what you guys are talking about [09:55:57] <_joe_> we don't have one either [09:55:58] fortunately openssl is changing its license to apache2, so these license roadblocks will be moot in the mid-future [09:56:07] <_joe_> we just learned all the buzzwords [09:56:10] <_joe_> and the jargon [09:56:25] the thing [09:56:34] is I had the same issue last week with squid 3.3 on Jessie [09:56:44] it does not have SSL support because of conflict with openssl [09:56:53] and gnutls is on squid 3.4 (iirc) [09:57:24] that is mostly a Debian issue, from reading the license very strictly, RHEL e.g. enables SSL in squid [09:57:43] the GPL has an exception for system libraries [09:57:53] and red hat argues openssl is such a thing [09:57:53] none of us are lawyers, but do we run a risk of being sued by having Squid linked against OpenSSL? [09:58:01] which makes total sense [09:58:11] and they probably have 50 lawyers on payroll [09:58:56] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:02] who would sue anyway? openssl certainly not [09:59:52] and RH taking the "risk" actually has something to lose in terms of compensation in court (with more than a bn revenue per quarter) [09:59:55] seems to me the debian community is very picky when it comes to licenses [10:00:01] indeed :-) [10:00:04] and if in doubt, prefer to refuse including the software [10:00:15] but that will be resolved with the license change in openssl for good [10:00:17] I find it annoying, but at the same time I like how they take zero risk [10:00:57] OTOH Debian has a more relaxed stance on sw patents, e.g. Fedora lacks a lot of codecs while Debian enables all the MPEG/H264 goodies by default [10:01:18] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on integration puppetmaster. 
On Precise/Trusty/Jessie that yields:" [puppet] - 10https://gerrit.wikimedia.org/r/251479 (https://phabricator.wikimedia.org/T111005) (owner: 10Hashar) [10:01:37] ^^^that one can land, that fix puppet runs on ci slaves :-} [10:01:59] isn't it that software patents are not applicable in europe but are in US ? [10:02:15] so if Fedora is mostly US driven, that might explain the difference [10:05:28] it's a little more than that, https://www.debian.org/reports/patent-faq has further links [10:07:20] * godog remembers the "non-US" archive days, looking into distance [10:08:23] hashar: https://wiki.debian.org/non-US that was even more silly [10:12:08] godog: it is nice it has been documented for historians :-} [10:12:24] I think France had a law to prevent usage of cryptography which is too strong [10:13:17] so you can use cryptography freely (as in speech) [10:14:38] but if you provide, import, transfer, export a cryptography system you got to declare it to some government agency and depending on the target device (ex: phone) you might need to require a license [10:17:38] (03PS1) 10Filippo Giunchedi: Revert "cassandra: add xenon-b instance" [puppet] - 10https://gerrit.wikimedia.org/r/251481 [10:20:46] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "cassandra: add xenon-b instance" [puppet] - 10https://gerrit.wikimedia.org/r/251481 (owner: 10Filippo Giunchedi) [10:24:14] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:28:49] anyone can please merge in https://gerrit.wikimedia.org/r/#/c/251479/ // contint: explicitly set ssl/tls libcurl dev package // thx :) [10:28:57] I need it to refresh the Nodepool instances [10:29:16] (uses tip of production branch, not the ci puppet master, so I can't really cherry pick) [10:30:43] (03PS5) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) [10:31:38] 
hashar: candidate for puppet swat? [10:33:20] (03PS3) 10Muehlenhoff: openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) [10:33:42] godog: not really, the lib is needed to run tests for Thumbor, yet another python software [10:33:54] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:36:56] (03PS4) 10Muehlenhoff: openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) [10:37:16] hashar: I'll do, in about 10 mins [10:37:23] thanks ! [10:39:15] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [10:44:20] <_joe_> uhm, checking [10:49:04] <_joe_> ok this is weird [10:49:39] (03PS4) 10Muehlenhoff: contint: explicitly set ssl/tls libcurl dev package [puppet] - 10https://gerrit.wikimedia.org/r/251479 (https://phabricator.wikimedia.org/T111005) (owner: 10Hashar) [10:49:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] contint: explicitly set ssl/tls libcurl dev package [puppet] - 10https://gerrit.wikimedia.org/r/251479 (https://phabricator.wikimedia.org/T111005) (owner: 10Hashar) [10:50:27] \O/ [10:55:10] 6operations: Allow rsync to dataset1001 from Analytics VLAN - https://phabricator.wikimedia.org/T117428#1787896 (10akosiaris) 5Open>3Invalid a:3akosiaris ``` akosiaris@stat1003:~$ telnet -4 dataset1001.wikimedia.org 873 Trying 208.80.154.11... Connected to dataset1001.wikimedia.org. Escape character is '^]... [10:55:18] 6operations, 10netops: Allow rsync to dataset1001 from Analytics VLAN - https://phabricator.wikimedia.org/T117428#1787901 (10akosiaris) [10:55:30] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: can't install libcurl-dev on Jessie - https://phabricator.wikimedia.org/T117955#1787902 (10hashar) 5Open>3Resolved a:3hashar The CI permanent slaves have it all fine. 
I have refreshed the Nodepool snapshot image: 2015-11-06... [10:56:44] 6operations, 5Patch-For-Review: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1787906 (10Joe) Both machines seem uncabled after an attempt at configuring the switches. @cmjohnson we will need you to check the cabling for both servers. [10:56:54] 6operations, 10ops-eqiad, 5Patch-For-Review: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1787908 (10Joe) [10:58:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] Create a define to register extra LDAP schemas (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [11:00:14] (03CR) 10Muehlenhoff: Create a define to register extra LDAP schemas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [11:01:45] RECOVERY - service on xenon is OK: OK - cassandra-a is active [11:02:03] (03PS6) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) [11:02:40] (03CR) 10Filippo Giunchedi: [C: 04-1] puppet-lint: re-enable 'unquoted file modes' check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251295 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [11:04:39] (03CR) 1020after4: "It's not an issue related to get configuration. 
Phabricator is hard-coded to ignore non-standard refs, even though they are in the repo ph" [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [11:05:52] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:06:04] (03CR) 10Filippo Giunchedi: restbase: move to systemd unit file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [11:06:29] (03PS4) 10Filippo Giunchedi: restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) [11:06:36] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: can't install libcurl-dev on Jessie - https://phabricator.wikimedia.org/T117955#1787935 (10hashar) 5Resolved>3Open ``` 00:00:48.871 src/pycurl.h:164:30: fatal error: gnutls/gnutls.h: No such file or directory ``` [11:09:21] (03PS1) 10Jcrespo: Depool db2042, db2049, db2053 and db2054 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251484 [11:10:57] (03CR) 10Jcrespo: [C: 032] Depool db2042, db2049, db2053 and db2054 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251484 (owner: 10Jcrespo) [11:12:18] 7Puppet, 10Deployment-Systems, 3Scap3: Refactor `mediawiki::scap` to make sure Scap dependencies are not dependent on mediawiki - https://phabricator.wikimedia.org/T116606#1787941 (10mmodell) [11:12:45] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2042, db2049, db2053 and db2054 for cloning (duration: 00m 34s) [11:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:13:30] (03CR) 10Filippo Giunchedi: [C: 031] "some more hosts using ldap (I think) https://puppet-compiler.wmflabs.org/1205/" [puppet] - 10https://gerrit.wikimedia.org/r/251251 (owner: 10Dzahn) [11:16:00] hashar: it needs libgnutls28-dev in addition [11:22:46] (03PS1) 
10Hashar: contint: get libgnutls28-dev on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/251485 (https://phabricator.wikimedia.org/T117955) [11:23:21] moritzm: yeah noticed that :-( my fault for only testing on Precise [11:23:28] on Jessie it is merely a suggestion [11:23:44] and libgnutls-dev is a virtual package on Jessie :( [11:24:08] moritzm: https://gerrit.wikimedia.org/r/#/c/251485/ will do it [11:24:12] sorry [11:28:12] hashar: did you check trusty? it [11:28:24] is similar in version, so the same might occur there as well [11:30:06] moritzm: yeah on Ubuntu the package has Depends: libgnutls-dev [11:30:13] so it installs whatever it needs [11:30:17] ok [11:30:18] but Debian made it a Suggests [11:30:31] and libgnutls-dev is a virtual package which iirc does not play well with puppet [11:30:49] (puppet would keep trying to reinstall a virtual package on each run because it thinks it is not installed for some reason) [11:31:15] !log stopping mysql and cloning db2042 -> db2070, db2054 -> db2068, db2053 -> db2067, db2049 -> db2056 [11:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:32:46] (03CR) 10Mobrovac: restbase: move to systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [11:33:37] hashar: looks good, will merge in a bit [11:34:13] thx [11:36:53] (03CR) 10Muehlenhoff: [C: 032 V: 032] contint: get libgnutls28-dev on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/251485 (https://phabricator.wikimedia.org/T117955) (owner: 10Hashar) [11:37:29] * hashar refreshes Nodepool images [11:46:22] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: can't install libcurl-dev on Jessie - https://phabricator.wikimedia.org/T117955#1788006 (10hashar) 5Open>3Resolved [11:46:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below 
the confidence bounds [11:54:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [12:10:44] !log mark@cr2-knams> request chassis mic fpc-slot 1 mic-slot 1 offline [12:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:18:08] !log mark@cr2-knams> request chassis mic fpc-slot 1 mic-slot 1 online [12:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:18:25] !log Swapped cr2-knams fpc 1 mic 1 for a new 2x 10 [12:18:28] !log Swapped cr2-knams fpc 1 mic 1 for a new 2x 10G line card [12:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:52] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 59, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - BRxe-1/2/0: down - BR [12:24:43] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 59, down: 0, dormant: 0, excluded: 2, unused: 0 [12:26:49] 6operations, 6Discovery, 10Maps, 10Traffic, 3Discovery-Maps-Sprint: Load testing Maps sent all traffic to only one server - https://phabricator.wikimedia.org/T117937#1788062 (10BBlack) Without seeing the actual test script or traffic, it's hard to say exactly, but I don't find it all that surprising in t... [12:59:35] and that almost concludes the migration of the new 16 servers [13:01:12] <_joe_> jynus: congrats! 
[13:09:29] !log upgrading pybal 1.10 -> 1.12 on lvs4003 [13:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:11:31] !log re-enabled aws-d-eqiad in librenms [13:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:07] (03PS1) 10Faidon Liambotis: Add mgmt A/PTR for cr2-esams [dns] - 10https://gerrit.wikimedia.org/r/251491 [13:15:17] 6operations, 10ops-esams: Replace cr2-knams MX80 MIC slot with a 2x10G MIC - https://phabricator.wikimedia.org/T111765#1788133 (10faidon) 5Open>3Resolved Done! [13:16:14] !log upgrading pybal 1.10 -> 1.12 on lvs300[34].esams.wmnet [13:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:20] 6operations, 10ops-eqiad, 5Patch-For-Review: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1788137 (10RobH) a:5RobH>3Cmjohnson [13:18:10] (03CR) 10Faidon Liambotis: [C: 032] Add mgmt A/PTR for cr2-esams [dns] - 10https://gerrit.wikimedia.org/r/251491 (owner: 10Faidon Liambotis) [13:19:00] (03PS1) 10Lokal Profil: Make issued and modified typed [puppet] - 10https://gerrit.wikimedia.org/r/251492 (https://phabricator.wikimedia.org/T117533) [13:22:44] (03PS1) 10Lokal Profil: Localisation updates from translatewiki.net [puppet] - 10https://gerrit.wikimedia.org/r/251493 [13:30:55] (03CR) 10Paladox: "Would we need to upload a patch that hardcode a those paths too. Only until it isent hard coded." [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox) [13:32:14] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1788167 (10BBlack) @cmjohnson - We seem to have lost link on lvs1010 eth1 -> asw2-a5:11 , new cabling issue? was working before. 
[13:33:56] 6operations, 10Traffic: cp4007 crashed - https://phabricator.wikimedia.org/T117746#1788170 (10BBlack) re-pooling on the assumption this was a rare kernel bug, been stable since [13:36:19] !log cp4007 repooled - T117746 [13:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:14] !log swift codfw-prod: ms-be2016 / ms-be2018 / ms-be2020 weight 2000 [13:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:44:42] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [13:55:13] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000000.0] [13:55:30] (03CR) 10Filippo Giunchedi: restbase: move to systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [13:57:48] (03PS1) 10RobH: scandium needs to be jessie [puppet] - 10https://gerrit.wikimedia.org/r/251496 [13:58:11] !log bounce ganglia-monitor-aggregator-instance ID=2027 on install2001, debugging missing ms-be2020 [13:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:40] (03CR) 10RobH: [C: 032] scandium needs to be jessie [puppet] - 10https://gerrit.wikimedia.org/r/251496 (owner: 10RobH) [13:58:45] (03PS2) 10RobH: scandium needs to be jessie [puppet] - 10https://gerrit.wikimedia.org/r/251496 [13:59:03] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [5000000.0] [14:03:13] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [14:09:23] (03PS1) 10Joal: Update mediawiki_CirrusSearchREquestSet camus cron [puppet] - 10https://gerrit.wikimedia.org/r/251497 [14:09:33] dcausse: --^ [14:13:47] (03CR) 10DCausse: [C: 031] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/251497 (owner: 10Joal) [14:14:32] (03PS1) 10Muehlenhoff: Add missing changelog entry for previous fix [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251498 [14:14:34] (03PS1) 10Muehlenhoff: Store the name of the server group in the jobs database [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251499 [14:14:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add missing changelog entry for previous fix [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251498 (owner: 10Muehlenhoff) [14:14:56] 6operations, 7Swift: audit and improve ring dispersion for swift - https://phabricator.wikimedia.org/T94557#1788243 (10fgiunchedi) 5Open>3Resolved fixed when adding new machines and tweaking the `overload` factor ``` i7:~/src/wikimedia/swift-ring (master)$ git grep Dispersion */*.dispersion codfw-prod/acc... [14:15:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Store the name of the server group in the jobs database [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251499 (owner: 10Muehlenhoff) [14:19:31] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1788248 (10Cmjohnson) The sfp on asw2:11 failed and was replaced. the link is up [14:20:41] (03CR) 10Zfilipin: [C: 031] contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar) [14:21:20] 6operations, 10ops-eqiad, 5Patch-For-Review: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1788250 (10Cmjohnson) The on-site was notified of this task and the cables were disconnected from previous wipe on rdb1008. rdb1007 had a bad cable and the swi... 
[14:21:39] 6operations, 10ops-eqiad, 5Patch-For-Review: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1788251 (10Cmjohnson) a:5Cmjohnson>3Joe [14:26:41] (03PS1) 10Jcrespo: Repool all db servers that were depooled and add the last batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251501 (https://phabricator.wikimedia.org/T84428) [14:27:33] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1788276 (10RobH) a:5RobH>3hashar Reinstalled to jessie and has all keys accepted, ready for service implementation. [14:28:05] (03CR) 10Jcrespo: [C: 032] Repool all db servers that were depooled and add the last batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251501 (https://phabricator.wikimedia.org/T84428) (owner: 10Jcrespo) [14:31:46] !log jynus@tin Synchronized wmf-config/db-codfw.php: Pool all available db servers in codfw (duration: 00m 49s) [14:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:35] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1788309 (10RobH) [14:46:22] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1616223 (10RobH) [14:47:32] (03PS2) 10Ottomata: Update mediawiki_CirrusSearchREquestSet camus cron [puppet] - 10https://gerrit.wikimedia.org/r/251497 (owner: 10Joal) [14:48:19] (03CR) 10Ottomata: [C: 032 V: 032] Update mediawiki_CirrusSearchREquestSet camus cron [puppet] - 10https://gerrit.wikimedia.org/r/251497 (owner: 10Joal) [14:48:40] grrmbmblb puppet [14:50:10] 6operations, 7Swift: swift upgrade plans - https://phabricator.wikimedia.org/T117972#1788316 (10fgiunchedi) 3NEW [14:50:19] 6operations, 7Swift: swift upgrade plans - 
https://phabricator.wikimedia.org/T117972#1788323 (10fgiunchedi) a:3fgiunchedi [14:52:32] !log uploaded debdeploy 0.0.9 to apt.wikimedia.org [14:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:23] ah debdeploy [14:56:18] (03PS3) 10BBlack: config-geo: more of the middle US -> codfw [dns] - 10https://gerrit.wikimedia.org/r/251247 (https://phabricator.wikimedia.org/T114659) [14:56:58] (03CR) 10BBlack: [C: 032] config-geo: more of the middle US -> codfw [dns] - 10https://gerrit.wikimedia.org/r/251247 (https://phabricator.wikimedia.org/T114659) (owner: 10BBlack) [14:58:20] (03PS3) 10BBlack: config-geo: ulsfo fallback to codfw before eqiad [dns] - 10https://gerrit.wikimedia.org/r/251255 (https://phabricator.wikimedia.org/T114659) [15:03:15] (03PS1) 10Hashar: .gitreview file [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251505 [15:03:17] (03PS1) 10Hashar: (WIP) gbp configuration (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251506 [15:03:29] (03CR) 10Hashar: "check experimental" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251506 (owner: 10Hashar) [15:04:01] (03PS1) 10Jcrespo: Deploy the new db20XX databases with the actual shards provisioned [puppet] - 10https://gerrit.wikimedia.org/r/251507 (https://phabricator.wikimedia.org/T84428) [15:04:50] moritzm: debdeploy confuses me: upstream/0.0.9 is not a valid treeish [15:05:04] moritzm: seems the repo has no tag :/ [15:05:18] ok, I need a +1 from someone that likes sudokus: https://gerrit.wikimedia.org/r/#/c/251507 [15:11:10] (03PS2) 10Hashar: (WIP) gbp configuration (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251506 [15:11:24] (03CR) 10Hashar: "check experimental" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251506 (owner: 10Hashar) [15:13:38] (03CR) 10Hashar: "https://integration.wikimedia.org/ci/job/debian-glue/27/ \O/" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251506 (owner: 10Hashar) [15:14:34] (03CR) 10Mobrovac: 
restbase: move to systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [15:14:41] (03PS3) 10Hashar: Point git buildpackage upstream to HEAD of repo [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251506 [15:15:09] (03CR) 10Mobrovac: [C: 031] restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [15:18:20] (03CR) 10Jcrespo: "Why waiting for human review, when machines can do the job for us. I, For One, Welcome Our New Jenkins Overlords:" [puppet] - 10https://gerrit.wikimedia.org/r/251507 (https://phabricator.wikimedia.org/T84428) (owner: 10Jcrespo) [15:21:43] (03PS1) 10Faidon Liambotis: Drain esams due to network maintainance [dns] - 10https://gerrit.wikimedia.org/r/251510 [15:22:16] (03PS2) 10Faidon Liambotis: Drain esams due to network maintainance [dns] - 10https://gerrit.wikimedia.org/r/251510 [15:22:34] (03CR) 10Faidon Liambotis: [C: 032] Drain esams due to network maintainance [dns] - 10https://gerrit.wikimedia.org/r/251510 (owner: 10Faidon Liambotis) [15:25:34] (03PS2) 10Jcrespo: Deploy the new db20XX databases with the actual shards provisioned [puppet] - 10https://gerrit.wikimedia.org/r/251507 (https://phabricator.wikimedia.org/T84428) [15:27:30] (03CR) 10Jcrespo: [C: 032] "Confirmed no error on compilation and that the commit summary and the actual change are the same." 
[puppet] - 10https://gerrit.wikimedia.org/r/251507 (https://phabricator.wikimedia.org/T84428) (owner: 10Jcrespo) [15:29:23] !log draining esams for network maintainance (deployed about 7mins ago) [15:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:14] hashar: in a meeting ATM [15:30:50] moritzm: no hurries, we can talk about Jenkins for deb deploy next week } [15:35:22] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:36:25] (03CR) 10Alexandros Kosiaris: [C: 031] Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff) [15:39:41] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: puppet fail [15:41:11] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [15:44:41] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:12] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [15:49:21] PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2002.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.020 second response time [15:49:31] RECOVERY - Restbase root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.004 second response time [15:49:41] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:54:11] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2003.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.018 second response time [15:54:21] PROBLEM - 
check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2002.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.017 second response time [15:54:31] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2001.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.019 second response time [15:55:00] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1788478 (10BBlack) List of current ports hooked up for reference, and switch-redundancy issues to address with further port moves... | host | port | switch | po... [15:57:42] (03PS1) 10Rush: decom: haedus was removed in august 2015 [puppet] - 10https://gerrit.wikimedia.org/r/251515 (https://phabricator.wikimedia.org/T94474) [15:58:42] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - free space: / 850 MB (1% inode=63%) [15:58:51] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 53, down: 1, dormant: 0, excluded: 2, unused: 0BRxe-0/0/3: down - Core: csw2-esams:xe-3/1/1 [10Gbps DF CWDM C47]BR [15:59:21] PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2002.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.016 second response time [15:59:21] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:31] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2001.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.019 second response time 
[16:03:01] 6operations, 7Graphite: graphite / carbon-cache leaks memory on corrupted whisper files - https://phabricator.wikimedia.org/T101572#1788503 (10fgiunchedi) [16:04:15] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2003.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.016 second response time [16:04:25] PROBLEM - check_payments_wiki on payments2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2002.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.016 second response time [16:04:25] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2001.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.020 second response time [16:05:04] ignore payments alerts ^^^, fixing [16:06:16] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:07:13] godog, you are almost running out of space on copper [16:09:15] PROBLEM - check_payments_wiki on payments2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2003.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.015 second response time [16:09:25] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments2001.frack.codfw.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 271 bytes in 0.016 second response time [16:09:35] PROBLEM - check_payments_wiki on payments2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:15] RECOVERY - check_payments_wiki on payments2003 is OK: HTTP OK: HTTP/1.1 200 OK - 226 bytes in 0.025 second 
response time [16:14:25] RECOVERY - check_payments_wiki on payments2002 is OK: HTTP OK: Status line output matched HTTP/1.1 503 - 214 bytes in 0.010 second response time [16:14:35] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:35] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:10] ACKNOWLEDGEMENT - check_payments_wiki on payments2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Jeff_Green testing mw config [16:24:25] RECOVERY - check_payments_wiki on payments2001 is OK: HTTP OK: HTTP/1.1 200 OK - 226 bytes in 0.029 second response time [16:24:58] 6operations, 10ops-eqiad, 5Patch-For-Review: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1788613 (10Joe) rdb1007 and 1008 are installed and replica from 1007 to 1008 is working. [16:27:46] RECOVERY - Disk space on copper is OK: DISK OK [16:29:18] jynus: interesting! what did you free up? 
[16:30:04] nothing, I suppose the process ended (it was /var/cache) [16:30:52] I do not delete random files without asking ;-) [16:31:32] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1788642 (10Joe) [16:31:34] 6operations, 10ops-eqiad, 5Patch-For-Review: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1788641 (10Joe) 5Open>3Resolved [16:31:48] <_joe_> ori, AaronSchulz ^^ [16:32:07] <_joe_> rdb1007-8 installed and tested to replicate correctly [16:32:14] (03PS1) 10Filippo Giunchedi: ferm: add new swift machines [puppet] - 10https://gerrit.wikimedia.org/r/251522 [16:33:03] ah, yeah pbuilder then, we have /tmp mounted there but I'm not sure it is actually used during build at this point [16:33:22] (03PS1) 10Jforrester: Enable Flow user opt-in Beta Feature on three more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251523 (https://phabricator.wikimedia.org/T116611) [16:34:17] (03PS2) 10Filippo Giunchedi: ferm: add new swift machines [puppet] - 10https://gerrit.wikimedia.org/r/251522 [16:34:23] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] ferm: add new swift machines [puppet] - 10https://gerrit.wikimedia.org/r/251522 (owner: 10Filippo Giunchedi) [16:39:26] PROBLEM - check_apache2 on payments2001 is CRITICAL: PROCS CRITICAL: 2 processes with command name apache2 [16:39:41] (03CR) 10Eevans: [C: 031] restbase: move to systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi) [16:43:28] ok, I saw a worsening of performance since 15:22 UTC and saw no explanation, now I can see that maintanance is going on [16:44:36] RECOVERY - check_apache2 on payments2001 is OK: PROCS OK: 8 processes with command name apache2 [16:44:36] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:28] jynus: yeah 
eqiad load is in general higher than usual, because esams is depooled and most of the users end up at eqiad for now [16:49:36] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:46] it basically doubles the user load in eqiad when we do that [16:50:03] bblack, yes, I saw it not worried now that I know the cause [16:50:24] but it is funny to see the impact that that has on lower layers, too [16:52:42] (03PS1) 10Filippo Giunchedi: swift: monitor mediawiki originals upload rate [puppet] - 10https://gerrit.wikimedia.org/r/251526 (https://phabricator.wikimedia.org/T92322) [16:53:16] 6operations, 7Database: Adapt wmf-mariadb10 package for jessie or puppetize differently its service to adapt it to systemd - https://phabricator.wikimedia.org/T116903#1788692 (10jcrespo) p:5Triage>3Low [16:54:36] PROBLEM - check_payments_wiki on payments2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:06] 6operations: Issues with partman/install server/autoinstall for db servers on Jessie - https://phabricator.wikimedia.org/T116902#1788697 (10jcrespo) p:5Triage>3Normal [16:57:24] 7Puppet, 10Continuous-Integration-Config: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1788702 (10zeljkofilipin) [16:58:18] 6operations, 10Wikimedia-Mailing-lists: wikisk-l: Give the list an administrator - https://phabricator.wikimedia.org/T111054#1788704 (10JohnLewis) Looking at the issue, @kubof seems to suggests [gtranslate] the list is not in active use and that he also has no time to manage the list if he was given it. As suc... [16:59:15] 6operations, 10Wikimedia-Mailing-lists, 6Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1788706 (10JohnLewis) No response at all. As such, I am going to close the list [really unfortunate as it is a project mailing list] and if any one wants to administrate it, th... 
[16:59:26] RECOVERY - check_payments_wiki on payments2001 is OK: HTTP OK: HTTP/1.1 200 OK - 226 bytes in 0.025 second response time [17:02:22] _joe_: \o/ thank you sir [17:03:36] (03Abandoned) 10BryanDavis: logstash: Drop replica count to 1 after 21 days [puppet] - 10https://gerrit.wikimedia.org/r/250501 (https://phabricator.wikimedia.org/T117438) (owner: 10BryanDavis) [17:03:59] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1788729 (10JohnLewis) [17:04:01] 6operations, 10Wikimedia-Mailing-lists: wikisk-l: Give the list an administrator - https://phabricator.wikimedia.org/T111054#1788727 (10JohnLewis) 5Open>3Resolved Disabled. [17:04:08] <_joe_> ori: thank cmjohnson1 first :) [17:04:09] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1577914 (10JohnLewis) [17:04:15] thanks cmjohnson1! [17:04:50] ori, do you know if writing to the external storage database is a significative portion of the save timing? [17:05:05] Aaron showed me a trace once, I cannot find it now [17:05:16] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1577914 (10JohnLewis) All blockers closed. Done at last. Taking a look at current lists, all within manageable values (200-) [17:05:22] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1788742 (10JohnLewis) 5Open>3Resolved a:3JohnLewis [17:05:36] jynus: i'll show you, sec [17:11:02] so, every ten minutes we capture a stack trace of all running threads. in aggregate, it gives us a picture of where we spend our time. 
using xenon-grep (a tool i wrote to crunch these logs) we see that mysqli::hh_real_query is on-CPU about 21% of the time for page save requests: https://dpaste.de/T5Cw/raw [17:12:34] <_joe_> ori: that's all db queries, right? [17:12:39] yeah, i'm breaking it down further [17:12:55] yes, there is currently an issue with one read [17:13:16] <_joe_> and anyways, 21 % of time in the db is not uncommon at all :) [17:13:21] I have that identified, with a potential way to fix it [17:14:15] I am wondering if partitioning on a new cluster is worth it- we have such a large table right now that it may be interesting to try for lowering insert overhead [17:14:36] I am using the mediawiki name for cluster here [17:14:55] basically, writing to a separate, smaller table [17:15:15] the insert barely registers [17:15:17] 17 | DatabaseBase::insert | 0.92% [17:15:21] ok [17:15:24] vs [17:15:25] 1 | DatabaseBase::select | 16.34% [17:15:34] so not interesting from the performance point of view [17:15:41] that is great to know [17:15:46] I will focus on the select [17:16:02] I think I can reduce that 10 times [17:16:02] jynus: can i show you one more thing? [17:16:05] sure [17:16:18] https://performance.wikimedia.org/xenon/svgs/daily/ <-- these are interactive flame graphs from the same data [17:16:40] so check out https://performance.wikimedia.org/xenon/svgs/daily/2015-11-05.index.reversed.svgz and https://performance.wikimedia.org/xenon/svgs/daily/2015-11-05.index.svgz [17:16:40] those are the ones that Aaron showed me, and that I lost [17:16:46] bookmarked it [17:17:15] thank you, ori [17:17:22] np! 
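The per-function percentages quoted in the exchange above (for example `DatabaseBase::select | 16.34%`) come from aggregating periodic stack samples. The following is only a rough sketch of that kind of aggregation, not xenon-grep itself: it assumes the samples are available as "folded" lines (semicolon-separated frames followed by a sample count), and the `frame_share` helper, the `EditPage::save` and `Parser::parse` frames, and all counts are made up for illustration.

```python
from collections import Counter


def frame_share(folded_lines):
    """Return {function: percent of samples whose stack contains it}.

    Each input line is assumed to look like "frame1;frame2;frame3 COUNT".
    A function is counted once per sample even if it recurses.
    """
    hits = Counter()
    total = 0
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # Split off the trailing sample count from the frame list.
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        for func in set(stack.split(";")):
            hits[func] += n
    if total == 0:
        return {}
    return {func: 100.0 * n / total for func, n in hits.items()}


# Hypothetical sample data, invented for this sketch.
samples = [
    "main;EditPage::save;DatabaseBase::select;mysqli::hh_real_query 2",
    "main;EditPage::save;DatabaseBase::insert;mysqli::hh_real_query 1",
    "main;EditPage::save;Parser::parse 7",
]

shares = frame_share(samples)
ranked = sorted(shares.items(), key=lambda kv: -kv[1])
for rank, (func, pct) in enumerate(ranked, 1):
    print(f"{rank} | {func} | {pct:.2f}%")
```

Counting a function once per sample means a share reads as "fraction of samples in which this function was somewhere on the stack", which matches how the select and insert figures are compared in the conversation.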
[17:18:19] 6operations, 7Graphite, 7Monitoring: Add monitoring for analytics-statsv service - https://phabricator.wikimedia.org/T117994#1788765 (10Krinkle) [17:26:12] (03PS1) 10Aaron Schulz: Add rdb1007 to jobqueue pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251535 [17:27:31] (03PS2) 10Aaron Schulz: Add rdb1007 to jobqueue pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251535 (https://phabricator.wikimedia.org/T89400) [17:29:47] RECOVERY - salt-minion processes on ms-be2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:31:09] (03PS1) 10Ori.livneh: add Restart=always to webperf jobs [puppet] - 10https://gerrit.wikimedia.org/r/251536 [17:32:24] (03CR) 10Ori.livneh: [C: 032] add Restart=always to webperf jobs [puppet] - 10https://gerrit.wikimedia.org/r/251536 (owner: 10Ori.livneh) [17:33:55] (03CR) 10Ori.livneh: [C: 031] Add rdb1007 to jobqueue pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251535 (https://phabricator.wikimedia.org/T89400) (owner: 10Aaron Schulz) [17:35:52] (03PS4) 10BBlack: config-geo: ulsfo fallback to codfw before eqiad [dns] - 10https://gerrit.wikimedia.org/r/251255 (https://phabricator.wikimedia.org/T114659) [17:36:58] 6operations, 7Graphite, 7Monitoring: Add monitoring for analytics-statsv service - https://phabricator.wikimedia.org/T117994#1788839 (10ori) It might be better to monitor the metrics, and fire an alert if they have not received an update in N seconds. [17:37:27] (03CR) 10BryanDavis: [C: 04-1] "This is a WMF production cluster concentric change that will break beta cluster and other Labs projects that use this role. 
Same problem a" [puppet] - 10https://gerrit.wikimedia.org/r/248918 (https://phabricator.wikimedia.org/T104964) (owner: 10Dzahn) [17:39:48] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0] [17:43:29] (03CR) 10Chad: [C: 032] checkoutMediaWiki: sudo as mwdeploy for most things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251327 (owner: 10Chad) [17:43:32] (03CR) 10Aaron Schulz: [C: 04-1] Add rdb1007 to jobqueue pool (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251535 (https://phabricator.wikimedia.org/T89400) (owner: 10Aaron Schulz) [17:44:12] (03Merged) 10jenkins-bot: checkoutMediaWiki: sudo as mwdeploy for most things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251327 (owner: 10Chad) [17:44:57] RECOVERY - configured eth on lvs1010 is OK: OK - interfaces up [17:45:31] !log demon@tin Synchronized multiversion/checkoutMediaWiki: no-op outside tin; just for completeness (duration: 00m 36s) [17:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:37] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [17:49:39] bd808: checkoutMW change sync'd, so future branches shouldn't have permission problems anymore. [17:50:11] nice [17:50:33] you should clean up the eisting branches and extra l10n cache files while you are there ;) [17:50:39] Also, sync-master added 17s to sync-file :\ [17:50:43] yeha [17:50:45] I was going to remind mukunda of that yesterday but forgot :) [17:50:49] and it feels all stally [17:51:02] I wonder if it could go async from the rest of the process. 
[17:51:16] possibly, but don't go async unless you have to [17:51:22] it makes everything harder to reason about [17:51:26] I wanted to make the codwf slaves sync from mira [17:51:31] as a follow-up [17:51:32] Oh ya [17:51:37] nvm then [17:51:42] which should actually speed that bit up [17:51:42] i noticed the delay sync-master added but i don't think it's significant in the grand scheme of things [17:52:32] it will be moot when scap3 makes the sync a git fetch rather than a huge rsync [17:52:35] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1788927 (10jcrespo) Sorry, @Papaul, I misunderstood you. SSH is actually broken, but **on the mgmt interface**. We can check it when you are back, at least the serve... [17:52:47] also it will be faster if you prune the dame old branches out [17:52:58] *damn [17:53:06] * bd808 can't even cures right today [17:53:10] lol [17:56:09] We really should automate that prunin' [17:56:27] * AaronSchulz can barely use gerrit [17:57:36] Connection to 208.80.154.81 timed out [17:59:00] ostriches: or get rid of them :) [17:59:08] all together, I mean [17:59:08] That too. [17:59:44] (03CR) 10BBlack: [C: 032] config-geo: ulsfo fallback to codfw before eqiad [dns] - 10https://gerrit.wikimedia.org/r/251255 (https://phabricator.wikimedia.org/T114659) (owner: 10BBlack) [18:02:47] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, and 2 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1788998 (10Jgreen) >>! In T97676#1785592, @Pcoombe wrote: > @awight Sounds like it would be safest to just take campaigns do... 
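The idea of automating the pruning of old branches, raised in the exchange above, could be sketched roughly as follows. This is not an existing scap or multiversion feature: the `prunable_branches` helper, the 30-day retention window, and every date below are assumptions for illustration, and how long a branch really has to stay around is left open here.

```python
from datetime import date, timedelta


def prunable_branches(last_used, today, keep_days=30):
    """Return the branches whose last day in use is older than the window.

    last_used maps a branch directory name to the last date it served
    traffic; keep_days is an assumed retention window, not a known policy.
    """
    cutoff = today - timedelta(days=keep_days)
    return sorted(name for name, used in last_used.items() if used < cutoff)


# Hypothetical data: the branch names follow the php-1.27.0-wmf.N pattern
# seen in this log, but the dates are invented.
last_used = {
    "php-1.27.0-wmf.2": date(2015, 10, 1),
    "php-1.27.0-wmf.4": date(2015, 11, 4),
    "php-1.27.0-wmf.5": date(2015, 11, 6),
}

to_prune = prunable_branches(last_used, today=date(2015, 11, 6))
print(to_prune)
```

Keeping the window as an explicit parameter makes the one uncertain input easy to change without touching the selection logic.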
[18:03:39] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, and 2 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1789013 (10Jgreen) [18:03:43] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1789012 (10Jgreen) [18:06:59] (03PS1) 10Dzahn: mw1083: add back to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/251537 (https://phabricator.wikimedia.org/T116184) [18:07:44] (03PS2) 10MaxSem: Switch www.wikimedia.org to source control [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) [18:08:08] (03PS2) 10Dzahn: mw1083: add back to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/251537 (https://phabricator.wikimedia.org/T116184) [18:08:14] (03CR) 10Dzahn: [C: 032] mw1083: add back to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/251537 (https://phabricator.wikimedia.org/T116184) (owner: 10Dzahn) [18:08:24] ostriches: the tricky bit is guessing when the branches have been off of group2 long enough that they aren't needed for Varnish hits any more. If you can solve that then it is easy to automate [18:14:55] (03CR) 10Krinkle: Switch www.wikimedia.org to source control (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [18:15:46] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 59, down: 0, dormant: 0, excluded: 2, unused: 0 [18:18:09] bd808: could you maybe deploy to mw1083 to sync it? 
[18:18:20] it was not in dsh because of hardware issues, and just came back [18:18:31] (03CR) 10Krinkle: [C: 04-1] "It seems http://www.wikipedia.beta.wmflabs.org/ and http://www.wikimedia.beta.wmflabs.org/ are both still using the ones from their Meta-W" [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [18:18:32] or any deployer [18:18:54] mutante: i'll do it [18:19:00] ori: thanks [18:19:12] !log running sync-common on mw1083 [18:19:14] ref T116184 [18:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:21:09] mutante: done [18:22:02] mutante: generally, don't ping bd80.8 for those, ping chad or mukunda or tyler ;) [18:22:32] or just do it yourself :) [18:22:59] `ssh mw1083 -- sync-common --verbose` [18:23:18] ori: thanks [18:23:19] greg-g: ok [18:23:24] <_joe_> yeah, do it yourself. [18:23:26] <_joe_> :) [18:23:47] +1 ;) [18:28:37] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:28:37] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:32:37] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [18:32:37] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[18:37:08] (03PS3) 10Ori.livneh: Add rdb1007 to jobqueue pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251535 (https://phabricator.wikimedia.org/T89400) (owner: 10Aaron Schulz) [18:37:14] AaronSchulz: ^ [18:40:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [500.0] [18:45:11] 7Puppet, 10Deployment-Systems, 3Scap3: Refactor `mediawiki::scap` to make sure Scap dependencies are not dependent on mediawiki - https://phabricator.wikimedia.org/T116606#1789336 (10thcipriani) p:5Triage>3Normal [18:46:41] (03CR) 10Aaron Schulz: [C: 031] Add rdb1007 to jobqueue pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251535 (https://phabricator.wikimedia.org/T89400) (owner: 10Aaron Schulz) [18:47:31] 7Puppet, 10Deployment-Systems, 3Scap3: Refactor `mediawiki::scap` to make sure Scap dependencies are not dependent on mediawiki - https://phabricator.wikimedia.org/T116606#1789357 (10Joe) a:3Joe [18:47:48] <_joe_> thcipriani: I assigned that task to myself, but if you want to do that today, be my guest [18:48:12] <_joe_> thcipriani: it's gonna be easier for me to unwrap the whole puppet manifest mess we have atm [18:49:46] 6operations, 6Performance-Team: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1789370 (10Dzahn) [18:50:19] _joe_: be my guest :) We were just talking about it in the deployment sprint meeting. [18:50:55] (03CR) 10Ori.livneh: [C: 032] Add rdb1007 to jobqueue pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251535 (https://phabricator.wikimedia.org/T89400) (owner: 10Aaron Schulz) [18:51:30] <_joe_> thcipriani: I guess I should start working with you guys in integrating pooling/depooling primitives in scap3, btw [18:51:38] (03Merged) 10jenkins-bot: Add rdb1007 to jobqueue pool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251535 (https://phabricator.wikimedia.org/T89400) (owner: 10Aaron Schulz) [18:53:13] _joe_: will that use etcd? 
[18:53:23] pybal's etcd capabilities i mean [18:53:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:53:42] <_joe_> ori: that is the plan, yes [18:53:46] \o/ [18:54:01] !log ori@tin Synchronized wmf-config/jobqueue-eqiad.php: I9e5be66680: Add rdb1007 to jobqueue pool (duration: 00m 35s) [18:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:18] <_joe_> there are quite a few things I should do before we can go live [18:54:31] <_joe_> including updating etcd to 2.2, and start using ACLs [18:54:48] nod [18:55:16] * twentyafterfour has been wanting to tackle the etcd stuff but I have a very limited understanding of how it works and documentation of our setup seems somewhat slim [18:55:20] <_joe_> but yeah, pybal using etcd will be reality as of... soon [18:55:31] <_joe_> twentyafterfour: heh, guilty as charged [18:55:36] <_joe_> sorry, gtg [18:55:44] have a good weekend! [18:55:46] <_joe_> but please let's talk about this on monday [18:55:56] 6operations, 10Wikimedia-Mailing-lists: wikisk-l: Give the list an administrator - https://phabricator.wikimedia.org/T111054#1789411 (10KuboF) @JohnLewis Could you please edit the description of the list to make it clear about the possibility to reopen and how to do it? Please ping me and I will translate it t... [18:56:13] (03PS1) 10Andrew Bogott: Labs: For instance dns, use actual project_name rather than tenant_id. 
[puppet] - 10https://gerrit.wikimedia.org/r/251543 (https://phabricator.wikimedia.org/T117610) [18:57:26] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:57:56] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:02] (03PS1) 10Jdlrobson: Enable banners extension on mobile web beta only (enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251544 (https://phabricator.wikimedia.org/T101108) [19:05:38] ori: ./hieradata/eqiad/mediawiki/jobrunner.yaml still needs update [19:06:17] AaronSchulz: got it, sec [19:07:07] 6operations, 10Wikimedia-Mailing-lists: wikisk-l: Give the list an administrator - https://phabricator.wikimedia.org/T111054#1789497 (10Dzahn) @KuboF i can do this. i just asked on #wikipedia-sk for this. [19:07:56] (03PS1) 10Ori.livneh: jobrunners: add rdb1007 to queue_servers and aggr_servers [puppet] - 10https://gerrit.wikimedia.org/r/251548 [19:07:59] AaronSchulz: ^ [19:09:28] (03CR) 10Ori.livneh: [C: 032] jobrunners: add rdb1007 to queue_servers and aggr_servers [puppet] - 10https://gerrit.wikimedia.org/r/251548 (owner: 10Ori.livneh) [19:09:53] ori: you can ignore aggr_servers since the wmf-config didn't add that (or you can add it there too) [19:10:44] 2 servers are probably fine, though 3 won't really hurt [19:10:48] (03PS1) 10Ori.livneh: Add rdb1007 as fallback aggregator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251549 [19:11:05] plus the symmetry is simpler [19:11:08] (03CR) 10Aaron Schulz: [C: 031] Add rdb1007 as fallback aggregator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251549 (owner: 10Ori.livneh) [19:11:14] (03PS2) 10Ori.livneh: Add rdb1007 as fallback aggregator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251549 [19:11:23] (03CR) 10Ori.livneh: [C: 032] Add rdb1007 as fallback aggregator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251549 (owner: 10Ori.livneh) [19:12:00] robh: any reason you restricted 
joining of the phabricator project Ops-Access-Requests? [19:12:00] (03Merged) 10jenkins-bot: Add rdb1007 as fallback aggregator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251549 (owner: 10Ori.livneh) [19:13:07] !log ori@tin Synchronized wmf-config/jobqueue-eqiad.php: I920627cf8b: Add rdb1007 as fallback aggregator (duration: 00m 35s) [19:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:13:21] jzerebecki: uhhhh, i did? lemme take a look [19:13:33] it may be from before we fixed acl's everyplace, checking [19:14:07] RECOVERY - mediawiki-installation DSH group on mw1083 is OK: OK [19:14:42] chasemp: So before I flip this bit, is there any reason you can think of that anyone shouldnt be joinable to ops-access-requests? (this is not reviews, this is the initial user request) [19:14:52] im 99.999999% certain its ok. [19:15:17] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1789521 (10Nemo_bis) First victims https://www.mediawiki.org/w/index.php?title=Talk%3AInstantCommons&type=revision&diff=19342... [19:15:37] jzerebecki: it was indeed me, and i think it was being overly paranoid in our initial implementaiton of it [19:15:46] so since i did it, im gonna undo it. [19:15:53] (as all the tasks in there are public) [19:16:00] on a call atm will circle back [19:16:01] so following should be allowed, much like #operations now is. [19:16:23] chasemp: no worries, its no big deal, i just didnt open up #ops-access-requests when we opened up #operations [19:17:07] jzerebecki: you can join it now ;] [19:17:31] (03PS1) 10Dzahn: librenms: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) [19:17:45] robh: thx [19:17:54] welcome, thx for finding it [19:20:47] AaronSchulz: forced a puppet run on the jobrunners too. 
I have to head to the office now, so I'll be offline for half an hour. Can you keep an eye on things? [19:21:00] robh: yeah it's not related to security at all afaik [19:21:09] * AaronSchulz wanted to head to the office too [19:21:17] cool, its fixed [19:21:18] (03PS2) 10Jdlrobson: WIP: First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) [19:21:20] (03PS2) 10Dzahn: librenms: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) [19:21:21] AaronSchulz: ok, I'll watch it from my phone [19:21:30] everything seems fine now though [19:21:33] nod [19:22:39] !log disabled puppet on holmium while testing and staging an update for bug T117610 [19:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:08] (03PS1) 10Dzahn: servermon: add ferm rules for http/https [puppet] - 10https://gerrit.wikimedia.org/r/251552 (https://phabricator.wikimedia.org/T105410) [19:25:51] ori: I assume the job services restarted after the change, right? [19:26:01] (just making sure) [19:26:13] * AaronSchulz can just look at the ps list [19:26:30] yes they do [19:32:40] (03PS1) 10Dzahn: smokeping: create a role class, use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/251554 [19:33:39] (03CR) 10JanZerebecki: [C: 031] "Assuming two ferm services on the same host with the same port work." 
[puppet] - 10https://gerrit.wikimedia.org/r/251550 (https://phabricator.wikimedia.org/T105410) (owner: 10Dzahn) [19:33:57] (03PS1) 10Faidon Liambotis: Revert "Drain esams due to network maintainance" [dns] - 10https://gerrit.wikimedia.org/r/251555 [19:34:13] (03PS2) 10Faidon Liambotis: Revert "Drain esams due to network maintainance" [dns] - 10https://gerrit.wikimedia.org/r/251555 [19:34:19] (03CR) 10Faidon Liambotis: [C: 032] Revert "Drain esams due to network maintainance" [dns] - 10https://gerrit.wikimedia.org/r/251555 (owner: 10Faidon Liambotis) [19:38:38] (03PS1) 10Dzahn: smokeping: add ferm rules for http/https to role [puppet] - 10https://gerrit.wikimedia.org/r/251557 (https://phabricator.wikimedia.org/T105410) [19:40:28] (03CR) 10Florianschmidtwelzow: [C: 04-1] Enable banners extension on mobile web beta only (enwiki) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251544 (https://phabricator.wikimedia.org/T101108) (owner: 10Jdlrobson) [19:51:40] 6operations, 10Wikimedia-Mailing-lists: wikisk-l: Give the list an administrator - https://phabricator.wikimedia.org/T111054#1789706 (10Dzahn) updated the list description with text provided by KuboF , with clickable links to Phabricator etc.. 
[19:54:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [19:56:01] (03PS2) 10Rush: decom: haedus/capella were removed in august 2015 [puppet] - 10https://gerrit.wikimedia.org/r/251515 (https://phabricator.wikimedia.org/T94474) [19:58:27] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [20:04:17] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0] [20:33:55] 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1789788 (10chasemp) [20:34:21] (03PS3) 10Rush: decom: haedus/capella were removed in august 2015 [puppet] - 10https://gerrit.wikimedia.org/r/251515 (https://phabricator.wikimedia.org/T94474) [20:34:31] (03CR) 10Rush: [C: 032] decom: haedus/capella were removed in august 2015 [puppet] - 10https://gerrit.wikimedia.org/r/251515 (https://phabricator.wikimedia.org/T94474) (owner: 10Rush) [20:34:41] (03CR) 10Rush: [V: 032] decom: haedus/capella were removed in august 2015 [puppet] - 10https://gerrit.wikimedia.org/r/251515 (https://phabricator.wikimedia.org/T94474) (owner: 10Rush) [20:35:56] chasemp: you said removed in Aug 2015 yet the ticket is one from April? [20:36:13] plus the ticked linked had decom patches from April, interested :) [20:36:33] I got august from somwhere :) [20:37:07] oh, I'm misreading slightly [20:37:24] the ticket was treated as a reclaim so it was left in DHCP and autoinstall [20:40:47] chasemp: should the mgmt interfaces still be around? 
[actual hostname.mgmt not wmfxxxx.mgmt, always wondered in these cases and get different responses] [20:41:24] good question, I haven't decommed a lot of servers so I'm not sure, I assume no [20:41:32] that hostname is dead, etc [20:42:05] maybe want to remove them from dns then :) [20:45:53] Dear opsen, what is the PHP version we run on the cluster right now? [20:45:58] (modulo a few outdated PHP 5.3 places like terbium) [20:46:21] I tried running php --version on a random mw10NN machine but it just gave me HHVM gibberish that I don't know how to interpret [20:46:59] RoanKattouw: 5.6-ish [20:47:32] OK [20:47:34] mw1117 at least is behaving as if it's 5.5 [20:47:37] https://phabricator.wikimedia.org/T117938 [20:47:40] legoktm@mw1152:~$ mwscript eval.php --wiki=testwiki [20:47:40] > echo PHP_VERSION [20:47:40] 5.6.99-hhvm [20:48:00] Hah [20:48:11] https://secure.php.net/manual/en/function.curl-setopt.php claims the default value of CURLOPT_SAFE_UPLOAD is true in 5.6 [20:48:32] Anyway, I'll submit a patch setting it explicitly [20:48:35] 5.3.10-1ubuntu3.20+wmf1 on tin+terbium, 5.5.9-1ubuntu4.13 on silver, HipHop VM 3.6.5 on everything else [20:49:28] OK so maybe it's one of the 5.5.9 boxes then [20:49:36] No wait that's just silver you said [20:50:05] yes, so only wikitech should be on 5.5.9 [20:50:47] Nope, it's 5.6.99 even on that specific appserver [20:51:00] I guess what the PHP manual claims either isn't true, or isn't true for HHVM [20:57:39] RoanKattouw, legoktm: https://phabricator.wikimedia.org/P2288 [20:58:05] uhmmmmmmm [20:58:12] mw1107.eqiad.wmnet: : creat("jeprof.24649.0.f.heap"), 0644) failed [20:58:15] lolwut [20:58:22] Krenair: Thanks for that list [20:59:18] PROBLEM - puppet last run on mc2001 is CRITICAL: CRITICAL: puppet fail [20:59:36] legoktm, ? [20:59:48] the jemalloc errors :P [21:00:17] heh [21:06:11] Krenair: do you want to kick off a manual update of SpecialGadgetUsage too? 
:)
[21:06:54] Or maybe just enwp and commons, since those look to be the most affected
[21:07:07] I'll just run the whole thing again
[21:07:58] {{doing}}
[21:08:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 9 below the confidence bounds
[21:17:38] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0]
[21:18:06] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[21:23:57] RECOVERY - puppet last run on mc2001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[21:25:45] (PS2) Andrew Bogott: Labs: For instance dns, use actual project_name rather than tenant_id. [puppet] - https://gerrit.wikimedia.org/r/251543 (https://phabricator.wikimedia.org/T117610)
[21:27:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[21:30:59] (PS3) Jdlrobson: First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443)
[21:35:18] (PS2) Dzahn: Redirect pietrodn's intersect-contribs to Tool Labs [puppet] - https://gerrit.wikimedia.org/r/250516 (owner: Ricordisamoa)
[21:35:44] (CR) Dzahn: [C: 2] Redirect pietrodn's intersect-contribs to Tool Labs [puppet] - https://gerrit.wikimedia.org/r/250516 (owner: Ricordisamoa)
[21:39:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 4 below the confidence bounds
[21:39:23] (PS3) Dzahn: ldap: split client.pp into one file per class [puppet] - https://gerrit.wikimedia.org/r/251251
[21:39:37] (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/1205/" [puppet] - https://gerrit.wikimedia.org/r/251251 (owner: Dzahn)
[21:42:38] (PS4) Dzahn: puppet-lint: re-enable 'unquoted file modes' check [puppet] - https://gerrit.wikimedia.org/r/251295 (https://phabricator.wikimedia.org/T93645)
[21:43:36] (CR) Dzahn: puppet-lint: re-enable 'unquoted file modes' check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/251295 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[21:43:57] (PS5) Dzahn: puppet-lint: re-enable 'unquoted file modes' check [puppet] - https://gerrit.wikimedia.org/r/251295 (https://phabricator.wikimedia.org/T93645)
[21:44:08] (CR) Dzahn: [C: 2] puppet-lint: re-enable 'unquoted file modes' check [puppet] - https://gerrit.wikimedia.org/r/251295 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[21:45:42] operations, hardware-requests, Patch-For-Review: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1789983 (RobH)
[22:09:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 4 below the confidence bounds
[22:10:34] seems to get better https://atlas.torproject.org/#details/DB19E709C9EDB903F75F2E6CA95C84D637B62A02
[22:15:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[22:16:36] (CR) Jhobs: [C: -1] First QuickSurvey for reader segmentation research - external survey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: Jdlrobson)
[22:19:16] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[22:20:13] (PS2) Jdlrobson: Enable banners extension on mobile web beta only (enwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/251544 (https://phabricator.wikimedia.org/T101108)
[22:21:18] (PS1) Rush: autoinstall setup for labs-hosts1-b-codfw [puppet] - https://gerrit.wikimedia.org/r/251644 (https://phabricator.wikimedia.org/T115491)
[22:22:16] (PS4) Jdlrobson: First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443)
[22:26:06] (PS1) Ori.livneh: $wmfUdp2logDest: replace IPs with hostnames [mediawiki-config] - https://gerrit.wikimedia.org/r/251647
[22:26:54] (CR) Jhobs: [C: 1] First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: Jdlrobson)
[22:27:47] (CR) Jhobs: "Assuming that survey link is correct once it's enabled. (Will test and update T113443 once it is.)" [mediawiki-config] - https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443) (owner: Jdlrobson)
[22:30:32] (CR) Andrew Bogott: [C: 1] "I only vaguely understand this, but it looks right to me :)" [puppet] - https://gerrit.wikimedia.org/r/251644 (https://phabricator.wikimedia.org/T115491) (owner: Rush)
[22:33:40] (PS1) Rush: labtest* hosts pxe and install setup [puppet] - https://gerrit.wikimedia.org/r/251648 (https://phabricator.wikimedia.org/T117097)
[22:38:06] (CR) Andrew Bogott: [C: 1] "If the partman setup is really this easy then I will breath a great sigh of relief" [puppet] - https://gerrit.wikimedia.org/r/251648 (https://phabricator.wikimedia.org/T117097) (owner: Rush)
[22:40:24] (PS2) Rush: autoinstall setup for labs-hosts1-b-codfw [puppet] - https://gerrit.wikimedia.org/r/251644 (https://phabricator.wikimedia.org/T115491)
[22:41:16] (CR) Rush: [C: 2 V: 2] autoinstall setup for labs-hosts1-b-codfw [puppet] - https://gerrit.wikimedia.org/r/251644 (https://phabricator.wikimedia.org/T115491) (owner: Rush)
[22:41:32] (PS2) Rush: labtest* hosts pxe and install setup [puppet] - https://gerrit.wikimedia.org/r/251648 (https://phabricator.wikimedia.org/T117097)
[22:41:42] (CR) Rush: [C: 2 V: 2] labtest* hosts pxe and install setup [puppet] - https://gerrit.wikimedia.org/r/251648 (https://phabricator.wikimedia.org/T117097) (owner: Rush)
[22:42:08] !log mwscript updateSpecialPages.php --wiki=enwiki --only=GadgetUsage
[22:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:50:32] (PS5) Jdlrobson: First QuickSurvey for reader segmentation research - external survey [mediawiki-config] - https://gerrit.wikimedia.org/r/251133 (https://phabricator.wikimedia.org/T113443)
[23:10:20] (PS1) Andrew Bogott: Assign IPs for public labtest hosts. [dns] - https://gerrit.wikimedia.org/r/251655
[23:11:21] (CR) Rush: [C: 1] Assign IPs for public labtest hosts. [dns] - https://gerrit.wikimedia.org/r/251655 (owner: Andrew Bogott)
[23:18:36] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[23:26:25] (PS1) Reedy: Enable DynamicPageList on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/251657
[23:31:03] (PS2) Reedy: Enable DynamicPageList on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/251657
[23:31:23] (CR) Reedy: [C: 2] Enable DynamicPageList on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/251657 (owner: Reedy)
[23:31:49] (Merged) jenkins-bot: Enable DynamicPageList on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/251657 (owner: Reedy)
[23:32:49] btw, I know about that ^ :)
[23:34:44] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable DynamicPageList on officewiki (duration: 01m 03s)
[23:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master