[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160211T0000). Please do the needful.
[00:00:04] csteipp Dereckson Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:38] * csteipp is around..
[00:00:53] idem
[00:01:44] and now the swat starts? hehe
[00:01:53] hm
[00:02:04] (PS1) Hashar: contint: lower tmpfs from 512MB to 200MB [puppet] - https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545)
[00:02:09] ori, tgr, bd808: all good?
[00:03:22] (CR) Dzahn: "commit message says 200, actual code says 128" [puppet] - https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: Hashar)
[00:03:32] (CR) Hashar: "We went with a 512MB tmpfs to have a bunch of sqlite in parallel but jobs now use MySQL.. So we should be fine with less memory :-)" [puppet] - https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: Hashar)
[00:04:00] * csteipp is reading backscroll-- are we still trying to get the train deployed?
[00:04:06] unclear
[00:04:33] Krenair: the train was rolled back
[00:04:34] greg-g, ori: I think we have a patch coming to fix .13
[00:04:44] I guess we could unrollback it now?
[00:05:02] um, no, it's the swat window now
[00:05:46] well, hard to test swat patches if they are not live anywhere
[00:06:07] I guess you could use mw1017, that's on wmf.13
[00:06:25] there's only one patch for wmf.13 only
[00:06:33] the others should be unaffected by it
[00:06:50] ok, get the swats out, then we can do tgr's stuff
[00:06:55] and that one wmf.13 patch can't be tested on mw1017 anyway
[00:06:56] sorry tgr
[00:07:13] s/do tgr's stuff/roll forward and do tgr's patch/
[00:07:21] works for me
[00:07:33] (anomie's patch, actually)
[00:07:37] :)
[00:07:41] sure, sounds good
[00:07:46] csteipp, manually rebasing
[00:07:49] Krenair: ty
[00:07:58] the fix is a fscking missing "global $wgHooks;"
[00:08:15] (PS1) Anomie: Better hack for T49647, for real [mediawiki-config] - https://gerrit.wikimedia.org/r/269881 (https://phabricator.wikimedia.org/T49647)
[00:08:47] (PS2) Alex Monk: Set password policy for global steward group [mediawiki-config] - https://gerrit.wikimedia.org/r/259439 (https://phabricator.wikimedia.org/T104371) (owner: CSteipp)
[00:08:59] (CR) BryanDavis: [C: 1] Better hack for T49647, for real [mediawiki-config] - https://gerrit.wikimedia.org/r/269881 (https://phabricator.wikimedia.org/T49647) (owner: Anomie)
[00:09:12] (CR) Alex Monk: [C: 2] Set password policy for global steward group [mediawiki-config] - https://gerrit.wikimedia.org/r/259439 (https://phabricator.wikimedia.org/T104371) (owner: CSteipp)
[00:09:19] (PS3) Dzahn: Enable base::firewall for mw1152 [puppet] - https://gerrit.wikimedia.org/r/269688 (owner: Muehlenhoff)
[00:09:56] (Merged) jenkins-bot: Set password policy for global steward group [mediawiki-config] - https://gerrit.wikimedia.org/r/259439 (https://phabricator.wikimedia.org/T104371) (owner: CSteipp)
[00:11:09] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[00:12:24] (CR) Dzahn: [C: 2] Enable base::firewall for mw1152 [puppet] - https://gerrit.wikimedia.org/r/269688 (owner: Muehlenhoff)
[00:12:55] !log krenair@mira Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/259439/ (duration: 02m 20s)
[00:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:07] csteipp, ^
[00:13:20] Thanks!
[00:13:41] csteipp, I guess you can't really test that?
[00:16:22] oh, enwikt and itwiki are read-only?
[00:17:23] not obvious why from tendril but exception.log is showing lots of job runners complaining about it
[00:17:28] ... still? (they were for the master db switch over)
[00:18:00] the entries coming in are timestamped 2016-02-11 00:17
[00:18:03] so yeah
[00:18:44] weird log lag again or just not a lot of fatalmonitor traffic today?
[00:20:21] seems to be mw1163?
[00:20:28] only
[00:22:05] https://phabricator.wikimedia.org/T126436
[00:23:41] mw1167
[00:24:11] Krenair: can you confirm if that task above is what you're seeing as well?
[00:24:38] Puppet, operations, RESTBase-Cassandra: cassandra - puppet compiler fail on test/staging hosts - https://phabricator.wikimedia.org/T125943#2017372 (Dzahn) Just noticed it also happens in production in codfw, not just on staging host. So while prod eqiad works, it also fails with this error on restbase2...
[00:24:58] looks the same, yeah
[00:25:04] Puppet, operations, RESTBase-Cassandra: cassandra - puppet compiler fail on codfw/test/staging hosts - https://phabricator.wikimedia.org/T125943#2017373 (Dzahn)
[00:25:04] different format because that's from logstash
[00:25:17] * greg-g nods
[00:25:19] thanks
[00:25:28] but seems like the same issue
[00:25:37] Krenair: do you have anything half staged on mira? I want to run a sync-common --verbose on mw1167 to see if it possibly has stale config
[00:25:47] nope, go for it
[00:25:55] the config looks up to date to me
[00:26:08] (I checked that :))
[00:26:25] it just got a ton of changes...
[00:26:57] possibly only timestamp diff though. --verbose doesn't say why rsync is updating something
[00:27:13] how's swat doing?
[00:27:19] how far I mean?
[00:27:33] one patch done, 3 more to go
[00:27:57] all these changes happened -- https://phabricator.wikimedia.org/P2590
[00:28:47] hm, didn't check InitialiseSettings, just CommonSettings and db-eqiad
[00:28:56] bd808, can I continue with the other patches?
[00:29:03] yeah I think so
[00:29:16] (CR) Alex Monk: [C: 2] Set timezone, logo, site name on hi.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/269827 (https://phabricator.wikimedia.org/T126185) (owner: Dereckson)
[00:30:13] (Merged) jenkins-bot: Set timezone, logo, site name on hi.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/269827 (https://phabricator.wikimedia.org/T126185) (owner: Dereckson)
[00:32:52] !log krenair@mira Synchronized w/static/images/project-logos/hiwikiquote.png: https://gerrit.wikimedia.org/r/#/c/269827/1 (duration: 02m 13s)
[00:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:32:56] Testing.
[00:34:04] Logo ok.
[00:35:33] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/269827/1 (duration: 02m 13s)
[00:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:37:27] Testing.
[00:38:39] how do you test timezone if nothing happens in RC?
[00:38:47] make something happen in RC
[00:38:57] edit your user sandbox or something
[00:39:20] ०६:०९
[00:39:25] seems good
[00:40:09] Tested, will need a follow-up patch for the meta namespace.
[00:40:19] But works.
[00:40:21] "omg, wanted to let you know you expose all your infrastructure internals to the whole world" "yea, we know, ganglia is public, cool, eh?"
[00:40:24] (CR) Alex Monk: [C: 2] Add import source for ru.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/264937 (https://phabricator.wikimedia.org/T123837) (owner: Mdann52)
[00:41:02] mutante: :)
[00:41:09] (Merged) jenkins-bot: Add import source for ru.wikisource.org [mediawiki-config] - https://gerrit.wikimedia.org/r/264937 (https://phabricator.wikimedia.org/T123837) (owner: Mdann52)
[00:44:02] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264937/ (duration: 02m 15s)
[00:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:44:06] Dereckson, ^
[00:44:12] (PS4) Dzahn: cassandra: fix top-scope vars without namespaces [puppet] - https://gerrit.wikimedia.org/r/266975
[00:44:18] Can't test this one, will leave a message asking bug reporter to test.
[00:44:18] (PS1) Dereckson: Set meta namespace on hi.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/269883 (https://phabricator.wikimedia.org/T126185)
[00:44:27] (need importer right)
[00:44:29] (CR) Alex Monk: [C: 2] Set logo, timezone, autoconfirm on wuu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: Mdann52)
[00:45:30] (Merged) jenkins-bot: Set logo, timezone, autoconfirm on wuu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: Mdann52)
[00:45:31] Could we also deploy https://gerrit.wikimedia.org/r/#/c/269883 for the hi.wikiquote meta namespace?
[00:46:59] bd808: tgr: I need to go afk, I trust you to revert if things aren't fixed
[00:47:09] greg-g: yup yup
[00:47:15] ori will keep us honest
[00:47:32] * bd808 is not helpful either
[00:47:40] just step back and look at the big picture calmly
[00:47:45] * greg-g had to scroll up to quote it
[00:48:33] !log krenair@mira Synchronized w/static/images/project-logos/wuuwiki.png: https://gerrit.wikimedia.org/r/#/c/263051/ (duration: 02m 13s)
[00:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:48:54] Dereckson, ^
[00:49:29] Krenair: logo not deployed
[00:50:02] (works with ?debug=true, but not yet purged)
[00:50:13] hmm
[00:50:14] https://wuu.wikipedia.org/w/static/images/project-logos/wuuwiki.png
[00:50:18] vs.
[00:50:18] https://www.wikimedia.org/static/images/project-logos/wuuwiki.png
[00:50:25] works if you add ?a
[00:50:52] yup
[00:51:24] you did your echo 'https://www.wikimedia.org/static/images/project-logos/wuuwiki.png' | mwscript purgeList.php thing ?
[00:52:19] yes
[00:52:32] bblack, ^
[00:53:37] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/263051/ (duration: 02m 13s)
[00:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:53:41] Dereckson, ^
[00:54:31] oh, okay, seriously
[00:54:42] wtf
[00:55:01] "cache: Change static_host from www.wikimedia.org to en.wikipedia.org" - Krinkle
[00:55:21] This would explain that.
[00:55:45] For 263051, tz ok.
[00:56:27] operations, Mail: Remove exim aliases: (nitika, noopur, shiju, subhashish) - https://phabricator.wikimedia.org/T126523#2017497 (Dzahn) Open>Resolved done ``` -# RT-3648 - Centre for Internet & Society India -nitika: nitika@cis-india.org -noopur: noopur@cis-india.org -shiju: shiju@cis-india.org -...
[00:56:29] operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2017499 (Dzahn)
[00:58:29] Krenair: could we also deploy https://gerrit.wikimedia.org/r/#/c/269883 for the hi.wikiquote meta namespace?
[00:58:44] (CR) Alex Monk: [C: 2] Set meta namespace on hi.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/269883 (https://phabricator.wikimedia.org/T126185) (owner: Dereckson)
[00:59:28] (Merged) jenkins-bot: Set meta namespace on hi.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/269883 (https://phabricator.wikimedia.org/T126185) (owner: Dereckson)
[01:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160211T0100). Please do the needful.
[01:00:40] error: insufficient permission for adding an object to repository database .git/objects
[01:00:40] fatal: failed to write object
[01:00:41] fatal: unpack-objects failed
[01:00:52] Oh Krenair, by the way I forgot to add a Wikiquote: alias, to maintain compatibility
[01:01:02] _joe_, is that you?
[01:01:44] * bd808 fidgets about getting group1 back to wmf.13
[01:01:48] Dereckson, isn't that handled automatically?
[01:02:46] Dereckson, '+wikiquote' => array( 'Wikiquote' => NS_PROJECT ),
[01:02:55] Ah, nice.
[01:03:08] looks like whatever was stopping me from writing .git/objects is gone
[01:05:52] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/269883/1 (duration: 02m 11s)
[01:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:05:59] Dereckson, ^
[01:06:31] operations, Mail: Remove exim alias (atglenn) - https://phabricator.wikimedia.org/T126508#2017517 (Dzahn) a:Dzahn>ArielGlenn
[01:06:36] Works.
[01:06:52] (CR) Alex Monk: "Just what we needed, more hardcoded en.wikipedia.org nonsense." [puppet] - https://gerrit.wikimedia.org/r/269224 (owner: Krinkle)
[01:06:56] Thanks for the deploy.
[01:07:10] operations, Mail: Remove exim aliases (chris, cjohnson) - https://phabricator.wikimedia.org/T126505#2017525 (Dzahn)
[01:08:25] (CR) Alex Monk: "Also, might be a good idea to warn deployers before you start making such changes where purging of static files from caching using the exi" [puppet] - https://gerrit.wikimedia.org/r/269224 (owner: Krinkle)
[01:09:37] Krenair: The .git/objects glitch has been on-going all day. It was complained about at the morning SWAT too, I believe, and went away after retrying.
[01:11:07] i just fixed permissions in .git/objects by request not that long ago
[01:11:21] for some reason they keep getting messed up every once in a while
[01:11:32] it's got to be puppet, no?
[01:11:42] i thought it was a root deploying stuff
[01:11:46] but not sure
[01:11:59] not regular enough for puppet
[01:12:08] that was my suspicion - a root user was logged in at the time
[01:12:50] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: puppet fail
[01:12:55] yea, last time some of those objects were owned by root
[01:13:04] j.oe has been idle for 9 hours so I doubt that was it
[01:13:36] I have deployed today, and I have sudo
[01:13:45] operations, Mail: Remove exim aliases (chris, cjohnson) - https://phabricator.wikimedia.org/T126505#2017529 (Dzahn) done ``` -#Chris Johnson -chris: cmjohnson -cjohnson: cmjohnson - ```
[01:13:51] operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2017531 (Dzahn)
[01:13:53] operations, Mail: Remove exim aliases (chris, cjohnson) - https://phabricator.wikimedia.org/T126505#2017530 (Dzahn) Open>Resolved
[01:13:53] but I was logged in as myself, and my deployment scripts (which you can see for yourself in modules/admin/files/...) don't sudo
[01:14:05] but maybe I'm doing something wrong without realizing it
[01:14:42] Krenair: this was just between a couple of the swat patches though right?
[01:14:46] yes
[01:14:58] so yeah, not you either ori
[01:15:02] there is constant puppet churn on the deployment servers, consisting of:
[01:15:04] Notice: /Stage[main]/Deployment::Deployment_server/Salt::Grain[deployment_server]/Exec[ensure_deployment_server_true]/returns: executed successfully
[01:15:04] Info: Salt::Grain[deployment_server]: Scheduling refresh of Exec[deployment_server_sync_all]
[01:15:04] Notice: /Stage[main]/Deployment::Deployment_server/Exec[deployment_server_sync_all]: Triggered 'refresh' from 1 events
[01:15:06] Notice: /Stage[main]/Deployment::Deployment_server/Exec[eventual_consistency_deployment_server_init]/returns: executed successfully
[01:15:07] (admittedly I didn't check idle times)
[01:15:14] so maybe one of these things does something bad
[01:15:41] that's trebuchet I think
[01:16:29] Krenair: are you still swatting?
[01:16:29] deployment_server_sync_all is just 'salt-call saltutil.sync_all', so that isn't related
[01:16:31] see this related change https://gerrit.wikimedia.org/r/#/c/219372/
[01:16:53] no bd808
[01:16:57] Ryan said "It's there because you need to run a command when new repos are added to trebuchet. If trebuchet isn't being used for new repos, then it isn't needed."
[01:17:11] w00t. tgr you want to do the honors or should I?
[01:17:16] deployment_server_init does do stuff with git
[01:17:25] I can do it
[01:17:34] maybe there is some old cruft from the last attempt to use trebuchet with mediawiki
[01:17:58] tgr: I was thinking that the config fix should land first and then the mwversion bump
[01:18:14] indeed
[01:18:36] (CR) Dzahn: [C: 2] "compiler showed it's a noop in prod eqiad" [puppet] - https://gerrit.wikimedia.org/r/266975 (owner: Dzahn)
[01:18:44] we should retest the config patch on mw1017 this time too
[01:18:58] rather than hoping it matches the live hack
[01:19:39] yeah, that's the plan
[01:20:16] I can log in on mw1017 now though, so something must still be deployed there
[01:20:55] probably still has brad's hack. it was in mobile.php
[01:21:21] mutante: re https://gerrit.wikimedia.org/r/266975 , have you checked codfw as the problem seemed to have been there?
[01:21:37] mobrovac: i am right now
[01:21:45] i updated that ticket
[01:22:11] confirmed noop on 1001, problem on 2001 :/
[01:22:41] damn, it's just using $hostname, wtf
[01:23:13] bd808: I should be able to reset that by running sync-file wmf-config/mobile.php on mw1017, right?
[01:23:32] nope. sync-file doesn't work like that
[01:23:47] you want to run sync-common on mw1017
[01:23:49] mobrovac: remember the ticket about fail on staging? it's the exact same thing
[01:23:58] see https://phabricator.wikimedia.org/T126372 :)
[01:24:11] tgr: what ori said and then put the version hack back
[01:24:12] or only sync-common works like that?
[01:24:27] that strange trailing - comes from somewhere
[01:24:31] sync-common is pull, all the other sync-* commands are push
[01:24:34] sync-{dir,docroot,file,l10n,wikiversions} push data from localhost to remote targets; sync-{common,master} fetch data from remote hosts to localhost.
[01:24:44] RoanKattouw: not all; there's sync-master too
[01:24:44] mutante: ok, please disable puppet in codfw for RB for now until we figure this out or revert your change
[01:24:52] So sync-file is something you run on mira to push that file out everywhere, while sync-common is something you run on mwNNNN to fetch everything
[01:24:57] Oh, right, I forgot about sync-master
[01:25:01] That one is new
[01:25:15] mobrovac: already done on 2002
[01:25:16] yeah, I should have not followed the legacy naming
[01:25:32] and I should have fixed it for sync-common too
[01:25:41] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: puppet fail
[01:25:47] pull-common or fetch-common
[01:25:47] mutante: ^
[01:26:17] mobrovac: yes, i'm on that one
[01:26:44] mutante: the list is restbase200[1-6] && restbase-test200[1-3]
[01:27:37] (PS1) Dzahn: Revert "cassandra: fix top-scope vars without namespaces" [puppet] - https://gerrit.wikimedia.org/r/269886
[01:28:17] (CR) Dzahn: [C: 2] "for unknown reasons fails in codfw while it works fine in eqiad" [puppet] - https://gerrit.wikimedia.org/r/269886 (owner: Dzahn)
[01:30:13] thnx mutante for ^
[01:30:23] we obviously need to study this problem more
[01:30:50] mutante: it's not as trivial as puppet makes us think
[01:31:10] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[01:31:45] mobrovac: of course it has to be an issue with "the very last of these errors in the whole repository" :/
[01:32:01] after that one i would have adjusted puppet-lint.rc
[01:37:49] but have no idea why it failed in the first place
[01:38:18] mobrovac: that is making it weird, so far everything followed the codfw vs. eqiad pattern and i know for sure 1001 was ok
[01:38:41] and it's node /^restbase100[1-9]\.eqiad\.wmnet$/ {
[01:38:47] so it can't be different :/
[01:39:08] mutante: actually it is, hence the failure on rb1008 which is multi-instance
[01:39:25] (CR) BBlack: [C: 1] delete SSL cert for ticket.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) (owner: Dzahn)
[01:39:26] mutante: it's hiera-controlled iirc
[01:39:31] rb1008 is different from 1001 thru 1007 even though they have identical puppet roles ?
[01:39:34] ah!
[01:39:41] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:40:06] tgr: where are we at?
[01:40:08] syncing
[01:40:12] mobrovac: at least a pattern that explains it :)
[01:40:20] :)
[01:40:49] maybe 1007 also seems multi-instance though?
[01:40:56] role/eqiad/restbase.yaml: - restbase1007-a.eqiad.wmnet
[01:41:02] with the -a part
[01:41:29] !log tgr@mira Synchronized wmf-config/mobile.php: fix for T49647 (duration: 02m 15s)
[01:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:42:16] bd808: OK, patch is live and looks good
[01:42:41] * bd808 tests on mw1017
[01:44:05] Puppet, operations, RESTBase-Cassandra: cassandra - puppet compiler fail on codfw/test/staging hosts - https://phabricator.wikimedia.org/T125943#2017594 (Dzahn) as @mobrovac found out this affects all instances that have been converted to multi-instance and not the ones that have not been converted. t...
[01:44:06] bd808: is https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Switch_a_wiki_to_a_different_MW_version up to date? people seem to do this via gerrit
[01:44:22] tgr: mw1017 looks good to me
[01:44:30] * bd808 reads doc
[01:44:46] no that is really stale
[01:45:29] just revert ori's patch, push to gerrit, +2, run sync-wikiversions then?
[01:45:33] tgr: you should be able to just revert the revert patch
[01:45:36] yes
[01:45:38] yes
[01:45:55] this is the new way to do it from scratch -- https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Create_patches_to_update_wikiversions.json
[01:46:31] !log ms-be1008 - delete/gzip large syslog, out of disk
[01:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:47:18] !log labservices1001 - out of disk
[01:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:48:52] Blocked-on-Operations, Beta-Cluster-Infrastructure, Discovery, Release-Engineering-Team, and 2 others: Beta: submodule update reverts new portals commits - https://phabricator.wikimedia.org/T126061#2003572 (JGirault) Thanks! @thcipriani !
[01:52:45] (PS1) Gergő Tisza: group1 wikis to 1.27.0-wmf.13 again [mediawiki-config] - https://gerrit.wikimedia.org/r/269888
[01:53:01] RECOVERY - Disk space on labservices1001 is OK: DISK OK
[01:53:45] (CR) BryanDavis: [C: 1] group1 wikis to 1.27.0-wmf.13 again [mediawiki-config] - https://gerrit.wikimedia.org/r/269888 (owner: Gergő Tisza)
[01:53:57] (CR) Gergő Tisza: [C: 2] group1 wikis to 1.27.0-wmf.13 again [mediawiki-config] - https://gerrit.wikimedia.org/r/269888 (owner: Gergő Tisza)
[01:54:59] (Merged) jenkins-bot: group1 wikis to 1.27.0-wmf.13 again [mediawiki-config] - https://gerrit.wikimedia.org/r/269888 (owner: Gergő Tisza)
[01:56:09] operations, Labs, Labs-Infrastructure: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2017616 (Dzahn) NEW
[01:58:08] !log tgr@mira rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.13 again
[01:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:59:16] tgr: \o/ works for me
[01:59:21] ok, Commons is on wmf.13 and login works
[01:59:52] bd808: are you going to be near the keyboard?
[02:00:11] I can be, yes
[02:00:32] I'll go have lunch then :)
[02:00:43] lol. you should do that
[02:01:21] operations, Labs, Labs-Infrastructure: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2017627 (Dzahn) deleted some of the older rotated logs, gzipped designate-central.log which is also large
[02:09:51] bd808: forgot to put back mw1017 to all-.13, doing that now
[02:10:06] you are supposed to be eating :/
[02:10:16] that's next
[02:10:38] I just realized the global sync overwrote mw1017
[02:11:40] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[02:12:04] hmmm... wtf is up with tin/mira sync?
[02:13:36] apparently sync-wikiversions didn't update tin
[02:14:51] PROBLEM - Apache HTTP on mw1130 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50396 bytes in 9.737 second response time
[02:14:59] hmm
[02:15:02] !log Fetched 1271d36 to tin:/srv/mediawiki-staging; should have been updated by sync-wikiversions
[02:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:15:10] PROBLEM - HHVM rendering on mw1130 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.015 second response time
[02:15:12] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[02:15:21] * ori checks mw1130
[02:16:27] ori: I'm going to go afk to eat. If you need me I'll get direct irc pings on my phone and can be back online quickly
[02:16:40] !log restarted HHVM on mw1130; locked up (T89912)
[02:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:17:02] bd808: I'm off too. The mw1130 was unrelated to the deploy. (Well, it was caused by the deploy, but not the content of the change.)
[02:17:32] *nod* I'll check back after dinner and make sure things are still smooth
[02:18:11] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 484 bytes in 0.050 second response time
[02:18:41] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 66476 bytes in 1.020 second response time
[02:27:00] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: puppet fail
[02:28:14] operations, Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2017669 (BBlack) I was thinking about all of this last night, and we can probably mitigate some of the low-TTL-cap operational concerns with some modifications to the parameters and usage of grace mode....
[02:31:38] operations: ms-be1008 ran out of disk - https://phabricator.wikimedia.org/T126574#2017674 (Dzahn) NEW
[02:32:49] operations: ms-be1008 ran out of disk - https://phabricator.wikimedia.org/T126574#2017683 (Dzahn)
[02:36:33] my bot's token keeps getting revoked after several hours
[02:36:37] from #wikipedia-en
[02:36:55] anomie, ^
[02:37:17] (PS6) Yurik: Add allowedDomains param to graphoid config [puppet] - https://gerrit.wikimedia.org/r/269819
[02:38:23] Both myself and Joe also keep getting logged out of officewiki after a while
[02:51:05] RoanKattouw_away: I'm looking at some session warnings in logstash related to you getting logged out of officewiki. I'll open a bug for Brad to look at them.
[02:53:01] bd808, also https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#.22Keep_me_logged_in_for_up_to_30_days.22_not_working
[02:53:21] pretty sure it's not us invalidating sessions repeatedly
[02:53:51] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:53:59] (PS7) Yurik: Add allowedDomains param to graphoid config [puppet] - https://gerrit.wikimedia.org/r/269819
[02:54:13] MaxSem: thanks for pointing it out
[02:55:03] we just hit group1 today so "for the last few days" probably isn't sessionmanager but worth looking into
[02:56:54] (PS1) Yurik: Add $wgGraphAllowedDomains setting for future [mediawiki-config] - https://gerrit.wikimedia.org/r/269896
[03:06:40] (PS1) Tim Landscheidt: Tools: Remove obsolete roles [puppet] - https://gerrit.wikimedia.org/r/269897
[03:55:49] (CR) Dereckson: "In such cases, we don't require a formal vote like a prise de décision in fr.wikipedia or a whole RfC." [mediawiki-config] - https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) (owner: Mdann52)
[04:08:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[04:10:31] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:11:01] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[04:14:31] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[04:29:23] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/2/1: down - Core: cr1-eqiad:xe-4/2/0 (Telia, IC-307235, 34ms) {#10693} [10Gbps wave]
[04:31:04] (Abandoned) Dereckson: Set wmgULSPosition to personal on gom.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/224772 (https://phabricator.wikimedia.org/T105124) (owner: Dereckson)
[04:37:22] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[04:50:51] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[04:52:42] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[04:56:18] (CR) Mobrovac: [C: 1] Add allowedDomains param to graphoid config [puppet] - https://gerrit.wikimedia.org/r/269819 (owner: Yurik)
[05:11:40] PROBLEM - Disk space on ytterbium is CRITICAL: DISK CRITICAL - free space: / 355 MB (3% inode=87%)
[05:21:31] PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[05:21:48] looking ^
[05:23:47] OOM killer got it
[05:23:55] what is ytterbium?
[05:24:11] gerrit
[05:24:23] it would be bad for it to run out of disk, no?
[05:24:30] 355 MB isn't a whole lot
[05:24:39] seems not goog
[05:24:47] good even
[05:25:01] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[05:25:04] !log Restarted logstash on logstash1003; killed by OOMKiler
[05:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:27:05] / is 9.2G, 4.7G of that is apache logs
[05:27:24] * ori gzips /var/log/apache2/access.log.1
[05:29:20] RECOVERY - Disk space on ytterbium is OK: DISK OK
[05:30:22] PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[05:30:45] wtf logstash1003
[05:31:48] it's okay folks, we still have flat files on fluorine
[05:32:21] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[05:32:22] *hollow neckbeard laugh*
[05:32:37] ori: that's about the 5th poke you've taken at me or my team today.
[05:32:54] I'm not amused
[05:33:48] why, does it need to be an even number?
[05:37:40] PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[05:38:52] "Error: Your application used more memory than the safety cap of 500M."
[05:39:08] so not OOMKiler, java heap exhaustion?
[05:40:43] !log logstash process on logstash1003 flapping; continuing to investigate
[05:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:44:18] (PS1) Tim Landscheidt: Tools: Move role classes to module role [puppet] - https://gerrit.wikimedia.org/r/269902
[05:44:42] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[05:45:19] basically as soon as it starts fully it runs out of heap and dies
[05:46:31] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[05:46:34] Only log output from the app before it dies is an initialization warning for one plugin
[05:47:31] does it use redis pub/sub, or is there a queue? if the latter, is it possible that there is something problematic with the item at the front?
[05:48:19] hmmm... that may be the node that kafka errors are sent to
[05:49:35] bytes in spiked at ~05:12 : http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Logstash+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[05:49:54] woah
[05:50:02] PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[05:51:21] kibana isn't showing any matching event count jump
[05:52:19] I'm going to disable puppet, comment out the kafka input and see if it starts
[05:52:31] does that sound too hacky?
[05:56:15] no, sounds sane [05:57:02] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [05:57:03] RESTBase: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Restbase+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [05:57:04] !log disabled puppet on logstash1003 to debug configuration [05:57:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 [05:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:57:29] that looks suspiciously similar [05:57:37] yeah, ~30M [05:57:48] and it barfed again without kafka [05:57:59] so maybe gelf from restbase [05:58:04] * bd808 looks up that port [05:58:04] yeah [05:58:34] (03PS6) 1020after4: scap::target to configure scap3 deployment repository and deploy-user. [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072) [06:00:09] (03PS1) 10Dzahn: kibana,wdqs,wikimetrics: lint fix indentation [puppet] - 10https://gerrit.wikimedia.org/r/269903 [06:00:22] tcpdump is only showing gelf from parsoid and not at a high rate [06:02:41] PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [06:02:42] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave] [06:02:49] !log Raised logstash mem limit to -Xms512m on logstash1003 [06:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:04:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 [06:06:30] (03PS1) 10Dzahn: phabricator: fix 16 
lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/269904 [06:08:06] the traffic from logstash abated [06:08:37] restbase too [06:08:48] err, to logstash [06:08:49] right [06:09:21] it's still not starting and not logging anything [06:09:57] does systemd swallow startup errors to some magic log? [06:10:14] oh, i think i read something about that the other day [06:10:30] i think it was if the service failed before forking [06:10:39] "Incompatible minimum and maximum heap sizes specified" [06:10:53] I goofed my config change apparently [06:11:58] (03PS1) 10Dzahn: haproxy: fix 12 lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/269905 [06:13:11] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [06:14:31] well it's not dead yet and things are back as puppet would like them [06:15:08] and restbase calmed down [06:15:12] if restbase was able to kill it with a giant stream of GELF packets to reassemble... that's not cool [06:15:42] we surely could reconfigure things to give logstash a bigger heap [06:16:39] !log Re-enabled puppet on logstash1003 and forced pupept run [06:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:16:52] oh noes! a typo sent to twitter [06:17:59] <_joe_> morning [06:18:04] quick,edit wikitech SAL page on wiki, does the twitter bridge even work :) [06:18:05] <_joe_> what happened with restbase? [06:18:19] not sure yet _joe_ [06:18:23] (03PS1) 10Dzahn: scap: fix lint warnings, alignment [puppet] - 10https://gerrit.wikimedia.org/r/269906 [06:18:41] it looks like it threw a huge pile of logs towards logstash. [06:18:52] <_joe_> call filippo or eric in case of problems with cassandra [06:18:55] but they crashed logstash [06:18:58] <_joe_> when did that happen? [06:19:06] <_joe_> bd808: and mediawiki survived? 
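The "Incompatible minimum and maximum heap sizes specified" error quoted above is what the JVM reports when the initial heap (-Xms) is larger than the maximum heap (-Xmx). That would fit raising -Xms to 512m while the cap mentioned earlier stayed at 500M, though the log does not show the full option list. A minimal sketch of a pre-flight check, assuming the JVM options are available as a plain argument list; the helper names parse_heap_size and check_heap_flags are hypothetical and not part of logstash or its packaging:

```python
import re

# Multipliers for the size suffixes accepted by -Xms/-Xmx (hypothetical helper table).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_heap_size(value):
    """Convert a JVM size such as '512m' or '1g' to bytes."""
    match = re.fullmatch(r"(\d+)([kmg]?)", value.lower())
    if match is None:
        raise ValueError("unrecognised heap size: %r" % value)
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_heap_flags(args):
    """Return an error string if -Xms exceeds -Xmx, else None.

    Only explicit -Xms/-Xmx flags are considered; JVM defaults are not modelled.
    """
    sizes = {}
    for arg in args:
        if arg.startswith(("-Xms", "-Xmx")):
            sizes[arg[:4]] = parse_heap_size(arg[4:])
    if "-Xms" in sizes and "-Xmx" in sizes and sizes["-Xms"] > sizes["-Xmx"]:
        return "initial heap %d exceeds maximum heap %d" % (
            sizes["-Xms"], sizes["-Xmx"])
    return None
```

Calling check_heap_flags(["-Xms512m", "-Xmx500m"]) returns an error string, while raising both flags together passes the check.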
[06:19:23] afaik mediawiki was fine [06:19:47] MW only lobs udp packets towards logstash these days [06:20:34] _joe_: inbound traffic spike to logstash -- https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Logstash+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [06:20:50] outbound from restbase -- https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Restbase+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [06:21:11] (03PS3) 10Dzahn: apache: add conf_type "mods" [puppet] - 10https://gerrit.wikimedia.org/r/264313 [06:21:41] <_joe_> bd808: what intercepts said packets? [06:21:57] the logstash process [06:22:10] <_joe_> oh ok [06:22:20] lots of "Setting host 10.64.0.230:9042 as DOWN" made it to logstash from restbase [06:22:20] <_joe_> sorry, asking because of reasoning yesterday [06:23:18] <_joe_> we should really ping filippo and eric [06:23:22] <_joe_> that is restbase1007-a.eqiad.wmnet [06:23:23] there is no big spike or hole in https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase [06:23:32] <_joe_> ok I will page filippo [06:23:40] there needs to be a !dig [06:24:04] I'm here [06:24:11] (surprise!) [06:24:12] https://snoonet.org/gonzobot has it [06:24:33] _joe_: ^ [06:24:45] <_joe_> godog_: hey :) [06:25:02] <_joe_> so, see backscroll [06:25:33] <_joe_> in the meantime, it's 6 hours ms-be1008 has a full root partition [06:25:39] <_joe_> wtf, opsens, wtf [06:27:09] looks like casandra may be pinned to sending logs to logstash1003 -- https://github.com/wikimedia/operations-puppet/blob/d2fc3bc5261c803950374c10e0f485480926cf93/modules/cassandra/manifests/init.pp#L281 [06:28:25] bd808: ah, do you know if it is logs from restbase or cassandra specifically? I'm asking because there were some load tests yesterday on the restbase staging cluster [06:29:19] godog: we don't, no. What we know right now is the logstash process was dying repeatedly due to heap exhaustion. 
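On the earlier worry about a giant stream of GELF packets to reassemble: chunked GELF over UDP prefixes each datagram with a 12-byte header (magic bytes 0x1e 0x0f, an 8-byte message ID, a sequence number, and a sequence count), and the receiver has to buffer chunks until a message is complete. Whether that buffering is what exhausted the heap is not established in the log. A minimal sketch of a reassembler that bounds how many partial messages it holds; the class name ChunkReassembler and the max_pending limit are assumptions for illustration, and this is not logstash's implementation:

```python
from collections import OrderedDict

GELF_MAGIC = b"\x1e\x0f"
HEADER_LEN = 12  # 2 magic + 8 message id + 1 sequence number + 1 sequence count


class ChunkReassembler:
    """Reassemble chunked GELF datagrams while capping buffered partial messages."""

    def __init__(self, max_pending=1000):
        self.max_pending = max_pending
        self.pending = OrderedDict()  # message id -> {sequence number: payload}

    def feed(self, datagram):
        """Return the full payload once every chunk has arrived, else None."""
        if not datagram.startswith(GELF_MAGIC):
            return datagram  # unchunked message, passed through untouched
        if len(datagram) < HEADER_LEN:
            return None  # truncated header, drop it
        msg_id = datagram[2:10]
        seq, count = datagram[10], datagram[11]
        if count == 0 or seq >= count:
            return None  # malformed chunk, drop it
        chunks = self.pending.setdefault(msg_id, {})
        chunks[seq] = datagram[HEADER_LEN:]
        if len(chunks) == count:
            del self.pending[msg_id]
            return b"".join(chunks[i] for i in range(count))
        # Evict the oldest partial messages so a flood cannot grow memory unbounded.
        while len(self.pending) > self.max_pending:
            self.pending.popitem(last=False)
        return None
```

The eviction loop is the design point: without some cap (or a timeout), every incomplete message stays in memory indefinitely.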
[06:29:35] and those graphs in backscroll [06:30:21] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:21] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:28] ack, thanks [06:30:40] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:03] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:04] there is a corresponding hole in this dashboard too -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/cassandra [06:31:10] so looking like cassandra [06:31:31] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] <_joe_> godog: ok I think I got what's up with ms-be1008, and it ain't pretty [06:32:01] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:01] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:12] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:20] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 5 failures [06:32:22] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:30] <_joe_> godog: how do I stop swift [06:32:31] <_joe_> ? 
[06:32:40] _joe_: swift-init all stop [06:32:46] <_joe_> ok thanks [06:32:58] <_joe_> so puppet re-created a partition on the local disk [06:34:55] <_joe_> godog: there is still an rsync running [06:35:21] _joe_: yeah you can stop the rsync daemon if you want [06:35:31] it'll resume/recover [06:35:38] <_joe_> clients, I mean [06:35:39] <_joe_> :) [06:35:45] <_joe_> but yeah, I'll kill those [06:36:06] also clients yeah [06:36:52] PROBLEM - swift-container-replicator on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [06:37:01] bd808: looks like cerium, staging host http://ganglia.wikimedia.org/latest/?c=Restbase%20eqiad&h=cerium.eqiad.wmnet&m=network_report&r=4hr&s=by%20name&hc=4&mc=2 [06:37:01] PROBLEM - swift-account-reaper on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:37:10] PROBLEM - swift-container-server on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:37:30] yeah that matches pretty well [06:37:30] PROBLEM - swift-account-auditor on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:37:31] PROBLEM - swift-account-replicator on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:37:31] PROBLEM - swift-account-server on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:37:49] I'm going to open a bug about this crashing logstash [06:37:50] PROBLEM - swift-container-updater on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [06:38:00] PROBLEM - swift-object-server on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python 
/usr/bin/swift-object-server [06:38:01] PROBLEM - swift-container-auditor on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:38:02] PROBLEM - swift-object-auditor on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [06:38:21] PROBLEM - swift-object-updater on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [06:38:21] PROBLEM - swift-object-replicator on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [06:38:50] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old. [06:39:23] <_joe_> !log stipped swift on ms-be1008, disk full because object were being written to the root partition [06:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:41:03] bd808: let me know the bug number I'll add some more info [06:41:29] <_joe_> godog: is cassandra ok? [06:42:44] _joe_: it is yeah [06:43:25] _joe_: is ms-be1008 ok? [06:43:31] godog: https://phabricator.wikimedia.org/T126582 [06:43:46] <_joe_> godog: no it's not [06:43:55] and I'm going to head to bed because tired [06:44:07] <_joe_> g'night bd808 [06:44:11] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old. [06:44:11] <_joe_> sleep well :) [06:44:16] night bd808 thanks! [06:44:59] <_joe_> godog: so, for some reason yet tto determine, yesterday at 22 UTC the /srv/swift-object/sdj1 was being written to the / partition [06:45:09] <_joe_> the original partition is still there [06:45:21] <_joe_> so, should I just wipe the contents of what is on /? [06:46:41] _joe_: yeah, likely it threw an error and got unmounted so it is probably not mountable anymore (?) 
[06:47:02] <_joe_> it is mountable [06:47:05] <_joe_> I tried [06:47:30] <_joe_> so, my point: no reason to copy objects back to the original partition, right? [06:48:32] <_joe_> uhm so puppet did "define" the mounts for /dev/sdd1 and /dev/sdj1 in two runs yesterday [06:48:38] <_joe_> but /dev/sdd1 is mounted [06:48:44] <_joe_> the other one wasnt [06:49:41] _joe_: no need to copy no, swift will catch up [06:51:45] <_joe_> still trying to find out what happened exactly [06:53:01] <_joe_> but well, "puppetfart" [06:53:30] _joe_: mind opening a task for that so we can followup? [06:53:49] <_joe_> godog: so, both sdd and sdj are local, sigh [06:53:50] <_joe_> wtf [06:54:11] <_joe_> did we change anything swift-related in puppet yesterday? [06:54:46] <_joe_> root@ms-be1008:/srv/swift-storage# mount /srv/swift-storage/sdd1 [06:54:46] <_joe_> mount: /dev/sdd1: can't read superblock [06:54:49] <_joe_> ok [06:54:55] <_joe_> what should I do in this case? [06:55:07] <_joe_> just remove that disk from puppet? 
[06:56:24] <_joe_> sorry but I don't know enough about our puppet swift code [06:56:28] <_joe_> I guess I'll read [06:56:42] _joe_: looking [06:56:55] <_joe_> !log removing content from sd{d,j}1 written on the root partition on ms-be1008 [06:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:57:01] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:57:01] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:10] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:57:11] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:11] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:57:31] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:51] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:21] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:31] RECOVERY - Disk space on ms-be1008 is OK: DISK OK [07:11:04] 6operations: ms-be1008 ran out of disk - https://phabricator.wikimedia.org/T126574#2017884 (10fgiunchedi) thanks @dzahn ! the problem as found out by @joe was that `/srv/swift-storage/sdd1` and `/srv/swift-storage/sdj1` were being written to `/` instead of their respective disks, I think that happens when the di... 
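The failure mode described in that task (a mount point directory that exists but has no disk mounted on it, so writes land on the root filesystem) can be guarded against by checking the mount before writing. A minimal sketch, assuming a hypothetical helper named require_mounted; it does not reproduce Swift's own mount handling or the puppet code discussed above:

```python
import os


def require_mounted(path):
    """Return `path` if it is an active mount point, otherwise raise.

    A plain directory left behind by an unmounted disk fails this check,
    so callers stop before filling the parent filesystem.
    """
    if not os.path.ismount(path):
        raise RuntimeError("%s is not a mount point; refusing to write" % path)
    return path
```

A caller would wrap the storage directory, for example require_mounted("/srv/swift-storage/sdd1"), before opening any file beneath it.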
[07:22:26] (03PS1) 10KartikMistry: New upstream release [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/269912 (https://phabricator.wikimedia.org/T124137) [07:38:37] (03PS1) 10KartikMistry: New upstream release [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/269914 (https://phabricator.wikimedia.org/T124317) [07:45:40] (03PS1) 10KartikMistry: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/269915 (https://phabricator.wikimedia.org/T124317) [07:48:16] (03PS1) 10KartikMistry: apertium-dan-nor: New upstream release [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/269916 (https://phabricator.wikimedia.org/T124317) [07:49:56] (03PS2) 10KartikMistry: apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/269915 (https://phabricator.wikimedia.org/T124317) [07:50:12] (03PS2) 10KartikMistry: apertium-nob: New upstream release [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/269914 (https://phabricator.wikimedia.org/T124317) [07:50:35] (03PS2) 10KartikMistry: apertium-dan: New upstream release [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/269912 (https://phabricator.wikimedia.org/T124137) [07:56:53] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#2017918 (10adrianheine) 5Open>3Resolved Thank you very much, @demon. I've yet to find any issue :) [08:00:31] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#2017922 (10Nemo_bis) >>! In T109395#1710588, @Dzahn wrote: > Ideas for the best way to "un-stall" this? You could notify the authors of the messages included in the mbox file and see if someone... 
[08:03:22] 6operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2017925 (10Joe) @ori since rcstream is **not** in codfw at the moment, this is not necessary. On th... [08:04:26] (03PS2) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [08:04:57] (03CR) 10jenkins-bot: [V: 04-1] Clean up phabricator roles in puppet to remove tag pinning. [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [08:08:46] 6operations, 10Wikimedia-Mailing-lists: Redirect links to mailing list intlwiki-l - https://phabricator.wikimedia.org/T35898#2017940 (10Nemo_bis) [08:10:05] 6operations, 10Wikimedia-Mailing-lists: Redirect links to mailing list intlwiki-l - https://phabricator.wikimedia.org/T35898#2017945 (10Nemo_bis) This would be a good request but there is no way to know what ID a message might have in a mailman mailing list. Should we consider this an #upstream request to some... [08:12:10] (03PS3) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [08:25:38] (03PS7) 1020after4: scap::target to configure scap3 deployment repository and deploy-user. [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072) [08:26:03] (03PS4) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [08:30:00] (03PS1) 10Elukey: Temporary remove of mc1007 from the memcached/redis pool for maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/269921 (https://phabricator.wikimedia.org/T123711) [08:31:36] (03CR) 10Elukey: [C: 032] Temporary remove of mc1007 from the memcached/redis pool for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269921 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [08:32:41] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "this would make /etc/apache2/mods-enabled puppet managed with recurse => true, purge => true" [puppet] - 10https://gerrit.wikimedia.org/r/264313 (owner: 10Dzahn) [08:33:10] !log removed mc1007 from the redis/memcached pool for Jessie migration [08:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:33:49] (03PS5) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [08:34:13] (03CR) 10Giuseppe Lavagetto: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [08:37:57] <_joe_> 3 [08:38:00] <_joe_> yeah [08:38:05] (03CR) 10Jcrespo: [C: 032 V: 032] CREATE OR REPLACE view + new s2 master [software] - 10https://gerrit.wikimedia.org/r/269700 (owner: 10Jcrespo) [08:39:56] (03PS6) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. 
[puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [08:45:40] !log Disabling puppet, redis, memcached on mc1007 for maintenance [08:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:46:37] <_joe_> !log enabled puppet on mw1017, was disabled with no reason [08:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:47:51] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago [08:49:41] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:50:23] _joe_: I believe I am ready to enable puppet on iridium, just need to merge https://gerrit.wikimedia.org/r/#/c/269561/ [08:53:45] then I'll run through my checklist: service apache2 stop && service phd stop; mv /srv/phab/repos /srv/repos && mv /srv/phab/ /srv/phab.bak && puppet agent --test; [08:53:52] then double check a bunch of things, then enable puppet [08:56:23] 6operations, 10Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2017991 (10elukey) a:3elukey [09:04:18] !log restarting HHVM on all running mediawiki job queue processors [09:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:04:55] 6operations, 10MediaWiki-Logging: Warning: unable to connect to unix:///var/run/nutcracker/redis_eqiad.sock [111]: Connection refused - https://phabricator.wikimedia.org/T126476#2018005 (10elukey) confirmed, it is me doing maintenance, I just removed mc1007 from the pool and the timings match. Closing this ti... [09:05:08] 6operations, 10MediaWiki-Logging: Warning: unable to connect to unix:///var/run/nutcracker/redis_eqiad.sock [111]: Connection refused - https://phabricator.wikimedia.org/T126476#2018007 (10elukey) 5Open>3Resolved [09:12:18] akosiaris: around? 
[09:22:55] 6operations, 10ops-eqiad: elastic1023 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126586#2018030 (10MoritzMuehlenhoff) 3NEW a:3Cmjohnson [09:24:13] ACKNOWLEDGEMENT - Host elastic1023 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff hardware problems, T126586 [09:24:31] PROBLEM - puppet last run on mw2046 is CRITICAL: CRITICAL: puppet fail [09:31:44] !log re-enabled puppet on mc1007.eqiad [09:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:34] kart_: yup [09:34:37] (03PS1) 10ArielGlenn: dumps mirroring tool, change remote and local to source and dest [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269933 [09:37:04] (03CR) 10ArielGlenn: [C: 032] dumps mirroring tool, change remote and local to source and dest [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/269933 (owner: 10ArielGlenn) [09:39:24] !log depooled elastic1023 for hardware problems (T126586) [09:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:44:01] (03PS1) 10Elukey: Add mc1007.eqiad back in service after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269934 (https://phabricator.wikimedia.org/T123711) [09:51:19] (03CR) 10Elukey: [C: 032] Add mc1007.eqiad back in service after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269934 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [09:51:51] RECOVERY - puppet last run on mw2046 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:52:38] !log re-added mc1007.eqiad back into redis/memcached pools after maintenance [09:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:57:40] (03PS2) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [10:03:23] akosiaris: Jessie migration for cxserver - let's plan it. 
[10:03:35] !log ms-be2018 / ms-be2019 swift weight to 3500 [10:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:03:43] anyone willing to review + hopefully merge https://gerrit.wikimedia.org/r/#/c/269560 and https://gerrit.wikimedia.org/r/#/c/269561 ? [10:03:56] akosiaris: I tested in Labs instance, and it is fine. However, I would be happy if beta goes to Jessie too. [10:04:39] kart_: ok. for when ? say Monday ? [10:05:23] akosiaris: Monday sounds good. [10:05:36] akosiaris: do you know when Beta is moving? [10:05:40] (03PS2) 10Hashar: contint: lower tmpfs from 512MB to 200MB [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) [10:05:41] I am not sure about beta. hashar ? should/can we get jessie images on beta ? is there like a policy/plan/something that tell us not to ? [10:05:56] kart_: I have no idea. hashar probably knows [10:06:41] unless there is an outage going on, I am not there today sorry. [10:06:50] beta is broken for sure due to php55, code is no more updating [10:07:14] for beta migration toward Jessie, nobody is working on it afaik. At least nobody on releng side [10:07:27] huh... 
I got email about puppet failures on some labs instances - and it links to wikitech.org, which redirects to http://how-to.wikia.com/wiki/How_To [10:07:49] akosiaris: I don't think there is any kind of policy against it [10:08:00] if prodution is migrating to jessie then beta should too [10:08:25] twentyafterfour: I will suggest to migrate beta first :) [10:08:30] twentyafterfour: heh, the correct way would be beta to migrate beta first of course [10:08:35] not beta following production [10:08:36] sigh [10:08:37] but #releng is already heavily overworked / behind schedule on most of our projects so taking on new ones doesn't seem feasible either [10:09:15] akosiaris: I mean it should follow the same path not strictly production first then beta [10:09:24] I think the beta sca0x hosts are running Jessie now, so maybe migrate cxserver to the sca host? [10:09:39] in my mind beta and prod should be closely in sync [10:10:17] hashar: that will be great and +1. [10:10:18] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 649 [10:10:35] hashar: have some time for helping in it? 
[10:12:40] kart_: yeah in June (maybe) [10:12:56] kart_: more seriously, no I have no idling cycles left :-\ [10:13:21] kart_: in theory it is all about creating a Jessie instance in beta, add the relevant puppet class to it [10:13:31] eventually setup scap if it is ready for services deployment [10:13:56] or add that instance as a jenkins slave (there is a puppet class on the current instance that does it) [10:14:12] add the new instance as a slave in Jenkins and migrate the Jenkins job to run on the new instance (those two steps are trivial) [10:15:33] (03CR) 10Hashar: "Hey puppet manage to remount:" [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: 10Hashar) [10:18:13] (03PS3) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [10:18:28] hashar: given enough documentation, I can create instance and setup too. [10:18:38] twentyafterfour: why wouldn't the move to apt and the changing of the [10:18:41] yes I'm here [10:18:51] of the scap binary location be in one patch [10:18:58] and the rest in a second patch for afterwards? [10:19:23] apergos: no reason [10:19:41] other than I didn't see much need to split them up? [10:20:08] RECOVERY - check_mysql on db1008 is OK: Uptime: 1968114 Threads: 4 Questions: 12650742 Slow queries: 13228 Opens: 4849 Flush tables: 2 Open tables: 400 Queries per second avg: 6.427 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:20:29] it seems like that's easier to manage (if something goes awry), I think that's what thcipriani was getting at [10:22:01] in their comment on the first patchset [10:23:15] right [10:23:21] apergos: I can split it up [10:23:24] ok [10:23:27] I'll be here [10:23:36] ping if I don't see it in a reasonable period of time [10:23:52] I've done git rebase -i about 100 times tonight [10:26:16] you don't work in a branch? 
[10:28:04] (03PS4) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [10:28:46] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#2018140 (10Krenair) @demon, are all the steps you took documented? [10:28:52] (03PS1) 10Elukey: Remove mc1008/mc1009 from redis/memcached pool for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269939 [10:30:13] (03PS2) 10Elukey: Remove mc1008/mc1009 from redis/memcached pool for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269939 (https://phabricator.wikimedia.org/T123711) [10:31:25] (03CR) 10jenkins-bot: [V: 04-1] Remove mc1008/mc1009 from redis/memcached pool for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269939 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [10:34:33] --^ 10:30:55 /usr/lib/ruby/vendor_ruby/puppet-lint/bin.rb:78:in `block in run': invalid option: --no-puppet_url_without_modules-check (OptionParser::InvalidOption) [10:34:38] mmmm [10:35:18] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 671 [10:38:30] <_joe_> elukey: wat? [10:38:42] <_joe_> oh that's jenkins [10:40:08] RECOVERY - check_mysql on db1008 is OK: Uptime: 1969315 Threads: 2 Questions: 12682726 Slow queries: 13247 Opens: 4850 Flush tables: 2 Open tables: 401 Queries per second avg: 6.440 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:42:16] _joe_: yep, not sure why it is complaining, I just added Bug: T123711 to the commit log [10:43:07] (03PS8) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. 
[puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [10:43:09] (03PS9) 1020after4: make scap::target use the scap3 package provider [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T114363) [10:43:11] (03PS1) 1020after4: Install the scap package from deb instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/269942 (https://phabricator.wikimedia.org/T114363) [10:43:24] (03PS3) 10Elukey: Remove mc1008/mc1009 from redis/memcached pool for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/269939 (https://phabricator.wikimedia.org/T123711) [10:44:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Lacks removal of sslcert::certificate { 'ticket.wikimedia.org': } from ::otrs::web" [puppet] - 10https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) (owner: 10Dzahn) [10:46:59] (03CR) 10Elukey: [C: 032] Remove mc1008/mc1009 from redis/memcached pool for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/269939 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [10:48:15] (03CR) 10Alexandros Kosiaris: "various inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [10:48:17] apergos: I work on a branch but I've been juggling 3 patches, if I want to amend the one that isn't at the tip of the branch then I use rebase -i ... which is what I just did to split 2 into 3 [10:48:34] ah I see [10:50:11] !log removed mc1008/mc1009 from redis/memcached pools. Puppet disabled. [10:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:43] !log Correction about mc1008/mc1009 - puppet not disabled, will be done later on. 
[10:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:53:11] (03PS2) 1020after4: Install the scap package from deb instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/269942 (https://phabricator.wikimedia.org/T114363) [10:53:13] (03PS9) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [10:53:15] (03PS10) 1020after4: make scap::target use the scap3 package provider [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T114363) [10:53:17] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked and deployed on CI puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: 10Hashar) [10:54:33] (03CR) 1020after4: [C: 031] "alexandros: I split the package part out into a separate patch (https://gerrit.wikimedia.org/r/#/c/269942/2) but I did address your commen" [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [10:54:38] (03CR) 10Alexandros Kosiaris: [C: 031] "ah, it's done in https://gerrit.wikimedia.org/r/#/c/269877/. Sorry I missed that. 
+1ing it" [puppet] - 10https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) (owner: 10Dzahn) [10:54:56] apergos: https://gerrit.wikimedia.org/r/#/c/269942/2 [10:55:04] (03CR) 10Alexandros Kosiaris: [C: 031] OTRS: remove ssl cert and config [puppet] - 10https://gerrit.wikimedia.org/r/269877 (https://phabricator.wikimedia.org/T122320) (owner: 10Dzahn) [10:55:22] akosiaris: thanks for the review [10:55:55] yeah good for not latest [10:57:42] (03CR) 1020after4: "patch set 2 addresses comments from akosiaris on https://gerrit.wikimedia.org/r/#/c/269560/8" [puppet] - 10https://gerrit.wikimedia.org/r/269942 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [10:57:59] I'd like to test a puppet patch for elastic search on deployment-prep (https://gerrit.wikimedia.org/r/#/c/269100/). Should I just cherry pick the change directly to deployment-puppetmaster.eqiad.wmflabs? [10:58:45] Also if someone has 5 minutes to do a review of the patch before I crash labs, would be great ! [11:00:47] <_joe_> gehel: yes, and sorry not now :( [11:01:50] _joe_: your a busy man ! [11:02:19] Anyone else for a review ? Else I'll just deploy on labs and cross fingers (should be good though ...) 
[11:02:58] (03CR) 10Alexandros Kosiaris: "comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [11:03:04] (03PS3) 10ArielGlenn: Install the scap package from deb instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/269942 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [11:03:16] gehel: yes cherry pick it there, it would not crash labs I think [11:03:22] but see my comment inline [11:03:34] twentyafterfour: I'm gonna merge this first one, it looks good and I double checked for other occurrences of the path, made sure no precise host needs the package, etc [11:03:40] premise is that the boolean flag send_logs_to_logstash is redundant [11:03:56] hm dunno if labs [11:04:02] (03PS1) 10Aude: Exclude Sauce Labs IP ranges from rate limits on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269944 (https://phabricator.wikimedia.org/T126585) [11:05:24] (03CR) 10Alexandros Kosiaris: [C: 032] "That's a noop, https://puppet-compiler.wmflabs.org/1714/ I'll merge" [puppet] - 10https://gerrit.wikimedia.org/r/269758 (owner: 10Dereckson) [11:05:33] (03PS4) 10Alexandros Kosiaris: Separate math and TeX packages classes [puppet] - 10https://gerrit.wikimedia.org/r/269758 (owner: 10Dereckson) [11:05:39] (03CR) 10Alexandros Kosiaris: [V: 032] Separate math and TeX packages classes [puppet] - 10https://gerrit.wikimedia.org/r/269758 (owner: 10Dereckson) [11:07:26] akosiaris: funny how different organizations have different best practice ... [11:07:56] kart_: so Feb 15th is out. US holiday [11:08:08] Feb 16th, same battime, same batchannel ?
[11:08:25] apergos: should be good [11:08:31] gehel: hehe, indeed [11:09:43] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable Redis cross-dc replication - https://phabricator.wikimedia.org/T126470#2018203 (10Joe) @ori while on the medium-to-long run we should start using a better redis client proxy (like dynomite, which seems pretty interesting, but some caveats apply... [11:10:02] 6operations, 10ops-eqiad: elastic1023 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126586#2018204 (10dcausse) [11:11:51] twentyafterfour: if I knew all of deployment-prep was off of precise I would be done with that check [11:12:49] !log moving pending s2 slaves to the new master [11:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:13:01] looks good [11:13:14] (03PS4) 10ArielGlenn: Install the scap package from deb instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/269942 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [11:14:52] (03CR) 10ArielGlenn: [C: 032] Install the scap package from deb instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/269942 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [11:21:19] (03PS1) 10Hashar: hhvm: can now change enable/ensure service [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) [11:21:21] (03PS1) 10Hashar: contint: disable HHVM background service [puppet] - 10https://gerrit.wikimedia.org/r/269947 (https://phabricator.wikimedia.org/T126594) [11:21:43] ok that change is going around, in 1/2 hour scap should be on all the relevant hosts in /usr/bin [11:21:46] twentyafterfour: [11:22:13] scap being deployed via .deb right ? [11:22:15] \O/ [11:22:19] yes deb [11:22:33] so once that goes around everywhere we just need one check to make sure it's working [11:22:36] like the l10nupdate [11:22:41] how often does that go?
[11:23:42] <_joe_> once a day [11:23:45] <_joe_> at 2 am [11:23:51] well that's not convenient for us [11:32:05] <_joe_> so, this is still scap2, right? [11:32:08] <_joe_> not scap3 [11:32:14] scap3 [11:32:18] <_joe_> wat? [11:32:29] <_joe_> we're swapping out scap2 w scap3? [11:33:04] <_joe_> was this tested on beta? [11:33:18] <_joe_> and did we send out an announcement? [11:33:23] just a sec; I thought this was "just" scap (existing) in a deb package [11:33:27] if not I'm reverting this [11:33:29] twentyafterfour: ?? [11:33:40] !log disabled puppet, memcached, redis on mc1008/mc1009 for maintenance [11:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:34:17] <_joe_> apergos: oh dear [11:34:27] <_joe_> I'm not sure reverting is enough [11:34:46] well it's just "reset paths" [11:34:58] and if I need to uninstall the deb [11:35:25] <_joe_> twentyafterfour, hashar care to comment? [11:35:26] the copy in /srv/deployment/scap/scap will still be there untouched [11:35:40] <_joe_> apergos: yeah but I don't trust this being ok [11:35:55] <_joe_> please ALL: do NOT deploy ANYTHING right now [11:36:10] <_joe_> apergos: revert please [11:36:13] sure [11:36:15] so on mira, scap now points to scap3 [11:36:23] <_joe_> waaat [11:36:31] <_joe_> shit. [11:36:35] looking [11:36:39] <_joe_> ok let's be rational [11:36:49] dpkg -S /usr/bin/scap [11:36:49] scap: /usr/bin/scap [11:36:58] yes [11:36:59] we know [11:37:04] if the deb is removed that will be gone [11:37:55] (03PS1) 10ArielGlenn: Revert "Install the scap package from deb instead of trebuchet" [puppet] - 10https://gerrit.wikimedia.org/r/269949 [11:37:56] <_joe_> let me check labs [11:38:08] err,, /usr/bin/X11/scap [11:38:09] <_joe_> oblivian@deployment-mediawiki01:~$ which scap [11:38:09] <_joe_> /srv/deployment/scap/scap/bin/scap [11:38:12] what is what ? [11:38:20] <_joe_> akosiaris: lol no idea [11:38:25] <_joe_> where is that? 
[11:38:27] mira [11:38:40] unsure how it got there [11:38:47] sounds really really wrong [11:38:49] (03CR) 10Hashar: [V: 031] "Disabling the hhvm service makes no sense on production MediaWiki application server. But maybe for some specific prod servers we might w" [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [11:38:51] <_joe_> akosiaris: I guess the deb package? [11:38:56] nope [11:38:56] yes, the deb package [11:38:58] no? [11:39:10] dpkg -S /usr/bin/X11/scap [11:39:10] dpkg-query: no path found matching pattern /usr/bin/X11/scap [11:39:23] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install scap [11:39:33] that's in the log as of last [11:39:35] (03CR) 10Hashar: [V: 031] "Depends on https://gerrit.wikimedia.org/r/#/c/269946/ that adds the service_enable / service_ensure." [puppet] - 10https://gerrit.wikimedia.org/r/269947 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [11:39:35] so yes, the deb [11:39:36] <_joe_> so, let's revert this patch, apply on every machine where this had any effect [11:39:51] <_joe_> and then remove the deb everywhere it was installed [11:39:53] so /usr/bin/scap is the deb. 
/usr/bin/X11/scap is not [11:39:56] something is weird [11:40:01] but that is something not for right now [11:40:05] /usr/bin/X11/scap is unrelated to anything I know of [11:40:21] not in the package so ignore [11:40:21] <_joe_> -rwxr-xr-x 1 root root 348 Feb 2 10:17 /usr/bin/X11/scap [11:40:22] ok, so revert and remove the package [11:40:27] yes I'm trying to do that [11:40:28] <_joe_> on an mw server [11:40:35] damn [11:40:41] <_joe_> I guess it is a result of some post-install hook [11:40:45] <_joe_> or something [11:40:52] (03CR) 10ArielGlenn: [C: 032] Revert "Install the scap package from deb instead of trebuchet" [puppet] - 10https://gerrit.wikimedia.org/r/269949 (owner: 10ArielGlenn) [11:40:55] (03CR) 10JanZerebecki: "Note, Wikibase jobs still have sqlite and we might want to introduce more sqlite jobs again. Would that conflict?" [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: 10Hashar) [11:41:41] <_joe_> -rwxr-xr-x 1 root root 348 Feb 2 10:17 /usr/bin/scap [11:41:48] <_joe_> yup, comes from the package [11:41:56] feb 2 ? [11:41:59] <_joe_> yes [11:42:05] <_joe_> same timestamp, both the files [11:42:14] so, feb 2 ? today is the 11th [11:42:18] I am missing something [11:42:24] <_joe_> akosiaris: that's the deb timestamp... [11:42:36] ah yes [11:42:44] <_joe_> anyways [11:43:16] running everywhere [11:43:17] (03CR) 10Hashar: "yeah it would. We now have two different kind of Trusty slaves:" [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: 10Hashar) [11:43:21] I took it from mira first [11:43:30] <_joe_> apergos: what is running everywhere? [11:43:34] removing the deb [11:43:38] <_joe_> puppet applying on mw1018, then I'll see what other things are needed [11:44:08] <_joe_> probably files need to be removed by hand [11:45:07] it's gone on all hosts [11:45:33] <_joe_> apergos: which hosts did you purge? 
[11:45:37] <_joe_> log that btw [11:45:40] all [11:45:44] I ran it across the cluster [11:45:59] !log removed scap3 deb install from all hosts in prod [11:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:46:21] <_joe_> so, if scap3 has to be installed, it better be as /usr/bin/scap3 [11:46:33] <_joe_> and then use alternatives to set up /usr/bin/scap [11:46:39] <_joe_> if needed [11:46:53] indeed [11:46:58] <_joe_> so for at least half an hour no one should dare deploying anything [11:47:07] so that puppet can clean up [11:47:15] well, scap will fail [11:47:23] because /usr/bin/scap is nonexistent [11:47:23] you can add the flag to be sure [11:47:26] but still not nice [11:47:49] <_joe_> apergos: stop rsync on both mira and tin, after puppet has run there, and disable puppet [11:48:19] right [11:48:37] https://phabricator.wikimedia.org/D36 [11:48:57] <_joe_> jynus: oh neat, thanks [11:49:08] PROBLEM - Disk space on ms-be2016 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdi1 is not accessible: Input/output error [11:49:39] akosiaris@mw1020:/usr/bin$ ls -ld X11 [11:49:39] lrwxrwxrwx 1 root root 1 Jan 29 2014 X11 -> . [11:49:44] btw... that solves that question [11:49:58] I must say I did not remember that symlink existed at all [11:50:17] ok, normal scap2 deploys will fail now, right ? [11:51:01] <_joe_> akosiaris: me neither [11:51:22] <_joe_> akosiaris: well, yes [11:51:22] (03CR) 10JanZerebecki: [C: 031] Exclude Sauce Labs IP ranges from rate limits on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269944 (https://phabricator.wikimedia.org/T126585) (owner: 10Aude) [11:51:37] <_joe_> akosiaris: worse, they'll work on a portion of the cluster [11:52:12] _joe_: done [11:52:20] <_joe_> apergos: what?
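[editor's note] The `ls -ld X11` output at 11:49:39 explains the earlier puzzle: with `X11 -> .` inside /usr/bin, the path /usr/bin/X11/scap is just another name for /usr/bin/scap, while `dpkg -S` apparently matches only the literal path the package recorded, hence "no path found". A minimal sketch of that layout, assuming a POSIX filesystem; it builds the symlink in a scratch directory rather than the real /usr/bin, and `demo_self_symlink` is a name made up for illustration:

```python
import os
import tempfile


def demo_self_symlink():
    """Recreate the 'X11 -> .' layout in a scratch directory (hypothetical stand-in for /usr/bin)."""
    with tempfile.TemporaryDirectory() as bindir:
        scap = os.path.join(bindir, "scap")
        with open(scap, "w") as fh:
            fh.write("#!/bin/sh\n")

        # Same shape as the listing on mw1020: a symlink named X11 that points at "."
        os.symlink(".", os.path.join(bindir, "X11"))

        via_link = os.path.join(bindir, "X11", "scap")
        # Both paths refer to one file on disk...
        same_file = os.path.samefile(scap, via_link)
        # ...and resolving symlinks collapses the X11 component away.
        resolved_equal = os.path.realpath(via_link) == os.path.realpath(scap)
        return same_file, resolved_equal


if __name__ == "__main__":
    print(demo_self_symlink())
```

This also fits the identical size and timestamp _joe_ saw on both paths: there was only ever one file.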
[11:52:27] !log updated xenon, praseodymium and cerium to nodejs 4.3.0 [11:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:52:36] puppet on mira and tin, stopped rsync, disabled puppet on both [11:53:19] _joe_: portion ? why ? [11:53:28] I could move scap out of the way on tin and mira to be extra safe [11:53:30] <_joe_> apergos: no need [11:53:50] <_joe_> apergos: actually, reenable puppet, jynus pointed out a better way to do that [11:54:11] <_joe_> !log stopped scap runs for now (touched sync.flag) [11:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:54:16] oh the flag [11:54:23] I didn't click through his link [11:54:54] enabled on both hosts [11:55:07] rsync still off but I presume it will be back next puppet run [11:55:13] <_joe_> yes [11:55:28] (03CR) 10Mark Bergsma: [C: 031] admin: add gehel to ops [puppet] - 10https://gerrit.wikimedia.org/r/269674 (https://phabricator.wikimedia.org/T125651) (owner: 10Filippo Giunchedi) [11:55:36] * twentyafterfour is confused [11:56:18] so the package, the deb twentyafterfour [11:56:20] it was scap3 [11:56:53] what is in place in /srv/deployment/scap/scap though? [11:57:04] the deb is both scap 2 and scap3 [11:57:10] wha? [11:57:12] scap3 includes scap2 [11:57:18] they are the same thing [11:57:18] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Puppet has 1 failures [11:57:20] <_joe_> and that was tested where? [11:57:32] <_joe_> i mean deploying mediawiki with the deb? [11:57:48] _joe_: not tested with the deb [11:57:53] but they are the same code [11:57:57] <_joe_> if it wasn't, I suggest we install this patch in beta first [11:58:08] <_joe_> see everything works there, if we have a test install [11:58:27] ok, the one thing I should have caught on review but didn't [11:58:34] because seemed too obvious.
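[editor's note] The "touched sync.flag" entry at 11:54:11 relies on a flag-file guard: while the file exists, deploy runs refuse to start, and removing it lifts the block. The sketch below shows only that general pattern. It is not scap's code; apart from the file name `sync.flag`, which comes from the log, every name (`DeployBlocked`, `guarded_run`, `demo`) is hypothetical, and where the real flag lives is not stated in the log.

```python
import os
import tempfile


class DeployBlocked(Exception):
    """Raised when the flag file is present (hypothetical name)."""


def guarded_run(flag_path, action):
    """Run action() unless the flag file exists; a sketch of the pattern only."""
    if os.path.exists(flag_path):
        raise DeployBlocked("deploys are blocked while %s exists" % flag_path)
    return action()


def demo():
    """Run, block, then unblock a dummy deploy inside a scratch directory."""
    results = []
    with tempfile.TemporaryDirectory() as workdir:
        flag = os.path.join(workdir, "sync.flag")

        results.append(guarded_run(flag, lambda: "synced"))  # no flag yet

        open(flag, "w").close()  # equivalent of `touch sync.flag`
        try:
            guarded_run(flag, lambda: "synced")
        except DeployBlocked:
            results.append("blocked")

        os.remove(flag)  # equivalent of removing the flag later
        results.append(guarded_run(flag, lambda: "synced"))
    return results


if __name__ == "__main__":
    print(demo())
```

A guard like this gives one switch for the whole deploy tool, which seems to be why it was preferred here over disabling puppet and rsync host by host.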
[11:58:35] <_joe_> twentyafterfour: I'm not thinking it's not the same code [11:58:58] <_joe_> I just guess there are a few good reasons for which this could fail [11:59:18] <_joe_> permissions, non-obvious path changes/crons [11:59:34] _joe_: yeah [11:59:34] <_joe_> this is just the most obvious thing to check for tbh [11:59:49] <_joe_> so I guess the revert is "please be sure this tests ok" [12:00:30] _joe_: I am really sorry I didn't ask 'has this been tested', I usually do even though I figure it's a no brainer [12:00:32] totally my bad [12:00:53] obviously I will be around during any deploys tonight to help clean up, if there is cleanup needed [12:02:09] RECOVERY - Disk space on ms-be2016 is OK: DISK OK [12:02:33] (03PS5) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [12:02:58] <_joe_> godog: ^^ is that you or puppet did the trick again? [12:04:05] akosiaris: Thanks. 16 Feb is fine! [12:04:47] ok cherry-picked on beta puppetmaster [12:05:11] <_joe_> twentyafterfour: thanks [12:07:10] are the symlinks in /usr/local/bin created by puppet? if so I didn't find them by grepping [12:07:35] <_joe_> uhm [12:07:52] !log enabled puppet on mc1008 [12:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:08:26] (03CR) 10Gehel: Ship Elasticsearch logs to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [12:11:05] gehel: Gutten Tag. I am not sure how big logstash is on beta cluster and if it will be able to consume all the ElasticSearch logs sent to it [12:11:25] <_joe_> twentyafterfour: which symlinks? 
[12:11:25] _joe_: not me no [12:11:27] gehel: no clue how it is setup nor how to check though :( [12:13:13] !log re-enabled puppet on mc1009 [12:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:15:42] hashar: a quick check on 2 of the lab's elasticsearch nodes indicate that there is a few 100's k of logs per day. I don't know either how logstash is sized, but it seems reasonable to assume that it will scale to an additional meg (or 2) per day [12:16:04] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/269762 (https://phabricator.wikimedia.org/T126468) (owner: 10Dereckson) [12:16:17] (03PS3) 10Muehlenhoff: Add texvc to role::labs::openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/269762 (https://phabricator.wikimedia.org/T126468) (owner: 10Dereckson) [12:16:22] hashar: should I check in more details before deploying on lab ? [12:17:01] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add texvc to role::labs::openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/269762 (https://phabricator.wikimedia.org/T126468) (owner: 10Dereckson) [12:21:20] gehel: if you are familiar with logstash the instance is deployment-logstash2.deployment-prep.eqiad.wmflabs [12:21:39] no clue where the logs are saved too though [12:22:09] PROBLEM - DPKG on silver is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:22:50] <_joe_> can someone check silver? [12:22:53] ^ that's ok, the current puppet run is installing a lot of packages [12:23:06] that was the fastest check evah [12:23:12] :-) [12:23:14] <_joe_> moritzm: oh texvc [12:23:17] <_joe_> yeah [12:23:24] <_joe_> bbl [12:23:39] now we finally have ocaml on wikitech/silver :-) [12:25:33] hashar: having a look right now ... [12:25:48] RECOVERY - DPKG on silver is OK: All packages OK [12:26:02] moritzm: I loved ocaml! (but that was a long long time ago ...) 
[12:27:19] gehel: I guess 99% of all people who ever used ocaml had it installed to run mldonkey :-) [12:27:46] gehel: apparently the logs are written to elasticsearch on logstash2 as well. Seems /var/lib/elasticsearch is a hardlink to /srv which has 37G used and 95G available. So maybe it is enough :) [12:28:49] moritzm: ocaml is a popular language in French university. Pretty sure most friends ended up doing ocaml during their study [12:29:14] hashar: yep, that's what I was looking. Seems that we generate between 300Mo and 1Go of indice per day. Additional load should not be too much of an issue [12:29:49] hashar, moritzm: I actually have one friend who managed to get a real job writing ocaml ... [12:31:15] (03PS1) 10Elukey: Add mc1008/mc1009 back into redis/memcached pools after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269953 (https://phabricator.wikimedia.org/T123711) [12:32:02] gehel: so that sounds good to me. Sorry I am just being paranoid all the time :-( [12:32:38] hashar: even paranoids have real enemies ... And being paranoid about the new guy is probably healthy... [12:33:20] oh you are not new until you have broken the cluster and restored it. [12:33:30] :-D [12:34:06] anyway eager to see more logs sent to logstash . The beta cluster one has Kibana at https://logstash-beta.wmflabs.org [12:34:42] hashar: yep, I found that one. Lunch break and I'll get started to break that labs cluster... [12:34:56] (03PS2) 10Elukey: Add mc1008/mc1009 back into redis/memcached pools after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269953 (https://phabricator.wikimedia.org/T123711) [12:35:24] (03CR) 10Elukey: [C: 032] Add mc1008/mc1009 back into redis/memcached pools after maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/269953 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [12:36:33] query uovobw [12:36:41] argh / missed :D [12:40:50] !log uploaded nodejs 4.3.0 for jessie-wikimedia to carbon [12:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:42:09] !log Add mc1008/mc1009 back into redis/memcached pools after maintenance [12:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:43:34] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:44:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:46:44] 6operations: ferm rules for eventlog - https://phabricator.wikimedia.org/T126462#2018412 (10MoritzMuehlenhoff) [12:46:46] 6operations: Add ferm rules for eventlog hosts - https://phabricator.wikimedia.org/T113343#2018413 (10MoritzMuehlenhoff) [12:51:12] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:51:53] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Puppet has 1 failures [12:52:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:00:01] !log upgrading backup LVS servers to Linux 4.4.0 across all sites [13:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:02:12] PROBLEM - Host lvs1010 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:04] PROBLEM - Host lvs1012 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:13] PROBLEM - Host lvs1011 is DOWN: PING CRITICAL - Packet loss = 100% [13:04:02] RECOVERY - Host lvs1010 is UP: PING OK - Packet loss = 0%, RTA = 4.21 ms [13:04:02] RECOVERY - Host lvs1011 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [13:04:13] RECOVERY - Host lvs1012 is UP: PING OK - Packet 
loss = 0%, RTA = 0.27 ms [13:05:57] <_joe_> ...when you make the mistake of reading the backscroll from the end [13:07:50] :P [13:07:52] PROBLEM - Host lvs1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:08:02] PROBLEM - Host lvs1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:08:13] RECOVERY - Host lvs1006 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [13:08:23] RECOVERY - Host lvs1005 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [13:08:50] _joe_: the scap symlinks, sorry. I figured it out though, no worries [13:12:56] so ... when did /usr/bin/php start pointing to hphp? [13:14:23] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:23] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.10 ms [13:15:47] (03CR) 10Bmansurov: [C: 04-1] "wrong bug number" [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/269915 (https://phabricator.wikimedia.org/T124317) (owner: 10KartikMistry) [13:16:10] (03CR) 10Bmansurov: [C: 04-1] "wrong bug number" [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/269916 (https://phabricator.wikimedia.org/T124317) (owner: 10KartikMistry) [13:16:17] PROBLEM - Host lvs2004 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:22] PROBLEM - Host lvs2006 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:33] RECOVERY - Host lvs2004 is UP: PING OK - Packet loss = 0%, RTA = 37.53 ms [13:17:24] RECOVERY - Host lvs2006 is UP: PING OK - Packet loss = 0%, RTA = 37.03 ms [13:17:36] 6operations, 10Traffic: Upgrade LVS servers to a 4.3+ kernel - https://phabricator.wikimedia.org/T119515#2018468 (10faidon) 5Open>3Resolved a:3faidon I just finished upgading the rest to 4.4.0 as well. Considering this task resolved. 
[13:17:50] moritzm: ^ [13:17:51] all done [13:18:03] these were the only boxes with 4.3.0, so I think we can remove that from apt too [13:21:55] !log migrating dbstore1001 to the new s2-master [13:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:24:50] moritzm: ok, I just reprepro removesrc jessie-wikimedia linux 4.3-1~wmf1 -- no 4.3.0 available anymore [13:26:28] <_joe_> twentyafterfour: since the hhvm migration [13:26:50] oh hmm.. [13:27:26] on beta it's php 5.3,,,, [13:27:36] <_joe_> uh? [13:27:36] er only on deployment-bastion [13:27:41] <_joe_> ah right [13:27:52] <_joe_> we should upgrade it to trusty I guess :P [13:28:00] well there is deployment-tin [13:28:24] deployment-bastion used to be the beta deploy host but I'm not sure if it still is or if deployment-tin has taken over that role [13:29:24] is there any sort of guideline about which hosts should be trusty and which should be jessie? [13:29:32] * twentyafterfour doesn't really like ubuntu on servers [13:29:50] * twentyafterfour doesn't trust trusty [13:30:02] (03PS2) 10Faidon Liambotis: lvs: add schedule_icmp ipvs sysctl [puppet] - 10https://gerrit.wikimedia.org/r/269423 [13:30:11] <_joe_> well, anything hhvm-related should be trusty [13:30:20] I see [13:30:23] (03CR) 10Faidon Liambotis: [C: 032] lvs: add schedule_icmp ipvs sysctl [puppet] - 10https://gerrit.wikimedia.org/r/269423 (owner: 10Faidon Liambotis) [13:30:38] (03PS1) 10Giuseppe Lavagetto: Rationalize services definitions for labs too. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 [13:30:44] _joe_: what do you think about re-enabling scap? any checks come to mind we should do before that? 
[13:30:48] (03PS3) 10KartikMistry: apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/269915 (https://phabricator.wikimedia.org/T124137) [13:30:51] (03PS2) 10Filippo Giunchedi: admin: add gehel to ops [puppet] - 10https://gerrit.wikimedia.org/r/269674 (https://phabricator.wikimedia.org/T125651) [13:30:55] <_joe_> apergos: nope, go on :) [13:30:59] k [13:30:59] (03PS2) 10KartikMistry: apertium-dan-nor: New upstream release [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/269916 (https://phabricator.wikimedia.org/T124137) [13:31:02] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] admin: add gehel to ops [puppet] - 10https://gerrit.wikimedia.org/r/269674 (https://phabricator.wikimedia.org/T125651) (owner: 10Filippo Giunchedi) [13:31:46] ok, great [13:31:49] grumble :) [13:31:54] (03PS3) 10Faidon Liambotis: lvs: add schedule_icmp ipvs sysctl [puppet] - 10https://gerrit.wikimedia.org/r/269423 [13:32:06] (03CR) 10Faidon Liambotis: [V: 032] lvs: add schedule_icmp ipvs sysctl [puppet] - 10https://gerrit.wikimedia.org/r/269423 (owner: 10Faidon Liambotis) [13:32:45] paravoid: haha rebase wars! 
[13:33:22] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2018492 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi related https://gerrit.wikimedia.org/r/269674 has been merged, resolving [13:34:00] !log reenabled scap runs (removed flag) [13:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:28] <_joe_> godog: I evaluated keeping a batch of dumb noop patches ready for such events [13:34:36] <_joe_> to retaliate properly [13:36:16] switchover a delayed slave with our current system is awful- and that is the reason I hate database events [13:37:55] !log downgrading berkelium to Linux 3.19 & rebooting [13:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:16] the good news is that s2 topology is how it should be [13:38:31] finally [13:39:11] rebooting curium too [13:39:22] (tell me to move my comments to -databases if offtopic here) [13:39:42] PROBLEM - Host berkelium is DOWN: PING CRITICAL - Packet loss = 100% [13:40:52] RECOVERY - Host berkelium is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [13:41:53] PROBLEM - Host curium is DOWN: PING CRITICAL - Packet loss = 100% [13:42:08] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2018503 (10BBlack) So, looking at @Krinkle's metrics from above ( https://grafana.wikimedia.org/dashboard/db/performance-metrics ) from the change to now, we're se... 
[13:42:13] RECOVERY - Host curium is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [13:47:31] !log backuping, reimaging, restarting and defragmenting db1024 (old s2 master) [13:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:47:52] (03PS2) 10Alexandros Kosiaris: Graphoid: pass the config options directly to service::node [puppet] - 10https://gerrit.wikimedia.org/r/269828 (owner: 10Mobrovac) [13:47:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Graphoid: pass the config options directly to service::node [puppet] - 10https://gerrit.wikimedia.org/r/269828 (owner: 10Mobrovac) [13:51:24] (03CR) 10Alexandros Kosiaris: "inline nitpick but this should work fine otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [13:51:52] akosiaris, thx, could you review my related patch for that pls [13:52:23] akosiaris, https://gerrit.wikimedia.org/r/#/c/269819/ [13:52:25] yurik: hmm gerrit won't allow me to see related now that it is merged [13:52:27] ah, thanks [13:52:28] (03PS1) 10Alex Monk: Fix the meta:System_administrators table generation script [puppet] - 10https://gerrit.wikimedia.org/r/269959 [13:52:47] (03PS2) 10Alex Monk: Fix the meta:System_administrators table generation script [puppet] - 10https://gerrit.wikimedia.org/r/269959 [13:53:03] (03PS6) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [13:53:39] akosiaris: thanks, I missed that one [13:54:03] (03CR) 10jenkins-bot: [V: 04-1] Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [14:01:29] 6operations: ferm rules for eventlog - https://phabricator.wikimedia.org/T126462#2018519 (10MoritzMuehlenhoff) It also needs rules for rsync access from stat100[23] [14:01:45] !log `nodetool stop -- COMPACTION' on 
restbase1002.eqiad to free disk space (current state: https://phabricator.wikimedia.org/P2593) [14:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:41] (03PS1) 10Elukey: Remove mc1010/mc1011 from redis/memcached pools for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/269961 (https://phabricator.wikimedia.org/T123711) [14:03:03] (03PS12) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [14:03:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] "puppet compiler in https://puppet-compiler.wmflabs.org/1715/scb1001.eqiad.wmnet/" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269819 (owner: 10Yurik) [14:05:03] (03CR) 10Hashar: [C: 04-1] "I have found out why /etc/init/mysql.override keeps disappearing. The puppet implementation for an upstart service uses the .override to e" [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [14:05:14] (03CR) 10Elukey: [C: 032] Remove mc1010/mc1011 from redis/memcached pools for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/269961 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [14:05:16] akosiaris, where would you like those comments? 
[14:05:30] ah, never mind, sec [14:07:26] !log removed mc1010/mc1011 from the redis/memcached pools for maintenance [14:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:48] (03CR) 10Alex Monk: [C: 04-1] "also needs urldownloader - deployment-urldownloader.deployment-prep.eqiad.wmflabs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 (owner: 10Giuseppe Lavagetto) [14:19:52] (03PS1) 10BBlack: VCL: ttl fixed/cap params vcl_fetch [puppet] - 10https://gerrit.wikimedia.org/r/269967 (https://phabricator.wikimedia.org/T124954) [14:19:54] (03PS1) 10BBlack: VCL: drop default ttl_cap to 21 days [puppet] - 10https://gerrit.wikimedia.org/r/269968 (https://phabricator.wikimedia.org/T124954) [14:21:23] 6operations, 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Deploy Mathoid for Wikitech too, or texvc as fallback - https://phabricator.wikimedia.org/T126468#2018558 (10Dereckson) 5Open>3Resolved a:3Dereckson [14:21:25] 7Blocked-on-Operations, 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 5Patch-For-Review: Enable math extension on wikitech - https://phabricator.wikimedia.org/T126338#2018560 (10Dereckson) [14:25:31] <_joe_> Krenair: I didn't add it as it was not in LocaliseSetting-labs.php [14:28:43] <_joe_> but yes, I will amend [14:30:00] (03CR) 10BBlack: tlsproxy: nginx keepalives param for testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269708 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [14:30:04] (03CR) 10Subramanya Sastry: "@mobrovac, @dzahn .. parsoid_rt_client, parsoid_vd_client, etc. are strictly not instances of the parsoid service or parsoid roles. 
They j" [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [14:30:06] (03PS2) 10BBlack: tlsproxy: nginx keepalives param for testing [puppet] - 10https://gerrit.wikimedia.org/r/269708 (https://phabricator.wikimedia.org/T107749) [14:30:20] akosiaris, i added # See https://www.mediawiki.org/wiki/Extension:Graph?venotify=restored#External_data [14:30:45] (03CR) 10Subramanya Sastry: "This is strictly not a parsoid service instance or role. It just happens to have parsoid in its name. See comment on https://gerrit.wikime" [puppet] - 10https://gerrit.wikimedia.org/r/269707 (owner: 10Dzahn) [14:31:04] 6operations: ferm rules for eventlog - https://phabricator.wikimedia.org/T126462#2018606 (10Ottomata) Ah yes! Good catch. [14:33:13] (03PS8) 10Yurik: Add allowedDomains param to graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/269819 [14:33:18] akosiaris, ^ [14:35:48] (03CR) 10Ottomata: "Hmmmm. We usually pull data on stat boxes instead of push. It would be better to run an rsyncd on the dataset boxes, and then pull from " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [14:36:12] (03PS2) 10Giuseppe Lavagetto: Rationalize services definitions for labs too. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 [14:36:14] (03PS4) 10Giuseppe Lavagetto: Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273) [14:36:16] (03PS13) 10Giuseppe Lavagetto: Define service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [14:36:49] (03CR) 10Giuseppe Lavagetto: "added" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 (owner: 10Giuseppe Lavagetto) [14:36:51] (03PS1) 10MarcoAurelio: Adding WP and WT as namespace aliases for tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269970 (https://phabricator.wikimedia.org/T126604) [14:38:02] 6operations, 10Datasets-General-or-Unknown, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#2018625 (10Ottomata) Hey sorry, I don't think I've seen this ticket before, hence the silence! I just commented on ti... [14:42:00] !log Stopped puppet, memcached, redis on mc1010/mc1011 for maintenance [14:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:35] (03CR) 10ArielGlenn: "The same is true for dataset1001, we usually control the rsyncs on our end, which is why I wanted to go that route in the first place." [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [14:47:11] (03PS3) 10BBlack: tlsproxy: nginx keepalives param for testing [puppet] - 10https://gerrit.wikimedia.org/r/269708 (https://phabricator.wikimedia.org/T107749) [14:49:45] (03CR) 10Ottomata: "Ahh right! A battle! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [14:50:25] (03PS4) 10BBlack: tlsproxy: nginx keepalives param for testing [puppet] - 10https://gerrit.wikimedia.org/r/269708 (https://phabricator.wikimedia.org/T107749) [14:50:56] (03CR) 10BBlack: [C: 032 V: 032] "Compiler checks out ok: no-op for cp1065, correct change for cp1008" [puppet] - 10https://gerrit.wikimedia.org/r/269708 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [14:53:25] (03CR) 10Ottomata: "If you add dataset1001 to this list:" [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [14:53:58] (03CR) 10Krinkle: "128M or 200M? Maybe from 512 to 256?" [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: 10Hashar) [14:57:32] (03CR) 10Ottomata: send web server logs from dataset hosts to stat1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [15:00:42] (03CR) 10BBlack: [C: 031] "Works as intended in compiler on a range of text/upload/misc/maps machines at various tiers: https://puppet-compiler.wmflabs.org/1719/" [puppet] - 10https://gerrit.wikimedia.org/r/269967 (https://phabricator.wikimedia.org/T124954) (owner: 10BBlack) [15:02:35] (03CR) 10Krinkle: [C: 04-1] "If we apply to the large slaves too, we should probably go with 256M for now and do 128M later once we replaced all the slaves with medium" [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: 10Hashar) [15:07:29] (03CR) 10Luke081515: [C: 031] "per normal code review (Community consensus still needed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258671 (https://phabricator.wikimedia.org/T119727) (owner: 10Dereckson) [15:07:59] (03CR) 10Luke081515: [C: 031] Add new WEF enwiki IP rate limit exception 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/269862 (https://phabricator.wikimedia.org/T126541) (owner: 10Alex Monk) [15:08:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:08:40] ^ fixing [15:10:16] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:11:44] (03CR) 10Ottomata: [C: 031] make scap::target use the scap3 package provider [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [15:12:11] (03PS3) 10Ottomata: I will include the proper fix as an amendment to this revert-revert before merging. [puppet] - 10https://gerrit.wikimedia.org/r/269857 [15:14:57] (03CR) 10Ottomata: [C: 032] I will include the proper fix as an amendment to this revert-revert before merging. [puppet] - 10https://gerrit.wikimedia.org/r/269857 (owner: 10Ottomata) [15:18:59] (03PS2) 10Ottomata: Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) [15:20:19] (03CR) 10jenkins-bot: [V: 04-1] Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [15:22:07] bah whaa? [15:22:17] /usr/lib/ruby/vendor_ruby/puppet-lint/bin.rb:78:in `block in run': invalid option: --no-puppet_url_without_modules-check (OptionParser::InvalidOption) [15:22:46] <_joe_> that is jenkins failing, otto [15:22:59] oh just jenkins? [15:23:05] should I recheck? [15:23:22] <_joe_> yup, saw someone else pasting the same error this morning [15:23:26] ottomata: it is puppet-lint failling [15:23:49] (03PS2) 10Giuseppe Lavagetto: Configure redis LockManager in both DCs, use the master everywhere. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 [15:23:51] (03PS2) 10Giuseppe Lavagetto: Add references to wmfServices for Cirrusearch. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) [15:23:53] (03PS2) 10Giuseppe Lavagetto: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) [15:24:10] !log puppet re-enabled on mc1010/mc1011 [15:24:10] (03CR) 1020after4: "@dzahn: https://gerrit.wikimedia.org/r/#/c/269561/ edits the same file and I got it down to zero:" [puppet] - 10https://gerrit.wikimedia.org/r/269904 (owner: 10Dzahn) [15:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:17] (03PS1) 10Filippo Giunchedi: swift: let mount_filesystem fail on unmountable fs [puppet] - 10https://gerrit.wikimedia.org/r/269980 (https://phabricator.wikimedia.org/T126574) [15:24:27] (03PS1) 10Ottomata: Add role::analytics_cluster::java class [puppet] - 10https://gerrit.wikimedia.org/r/269981 (https://phabricator.wikimedia.org/T109859) [15:24:51] (03CR) 10Ottomata: [C: 032 V: 032] Add role::analytics_cluster::java class [puppet] - 10https://gerrit.wikimedia.org/r/269981 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [15:25:08] (03PS3) 10Ottomata: Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) [15:25:52] (03CR) 10jenkins-bot: [V: 04-1] Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 (owner: 10Giuseppe Lavagetto) [15:26:04] (03CR) 10jenkins-bot: [V: 04-1] Add references to wmfServices for Cirrusearch. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [15:26:17] (03CR) 10jenkins-bot: [V: 04-1] Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [15:27:19] (03CR) 10jenkins-bot: [V: 04-1] swift: let mount_filesystem fail on unmountable fs [puppet] - 10https://gerrit.wikimedia.org/r/269980 (https://phabricator.wikimedia.org/T126574) (owner: 10Filippo Giunchedi) [15:30:39] 6operations, 5Patch-For-Review: puppet should try to mount all mountable swift filesystems - https://phabricator.wikimedia.org/T126574#2018720 (10fgiunchedi) [15:32:33] hmm ok [15:32:42] does hiera not work with puppet -compiler hashar? [15:32:56] it does [15:33:09] Failed to compile catalog for node analytics1026.eqiad.wmnet: Could not find data item cdh::hadoop::namenode_hosts in any Hiera data file [15:33:10] but [15:33:19] it does indeed [15:33:25] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/eqiad.yaml#L158 [15:34:35] https://puppet-compiler.wmflabs.org/1721/analytics1026.eqiad.wmnet/change.analytics1026.eqiad.wmnet.err [15:35:33] (03PS2) 10BBlack: VCL: ttl fixed/cap params vcl_fetch [puppet] - 10https://gerrit.wikimedia.org/r/269967 (https://phabricator.wikimedia.org/T124954) [15:36:25] (03PS1) 10Elukey: Add mc1010/mc1011 back to the redis/memcached pools after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269982 (https://phabricator.wikimedia.org/T123711) [15:37:07] is it available on the instance yet? 
[15:37:24] (03PS1) 10Muehlenhoff: Add ferm rules for rsyncd used in eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/269983 [15:38:08] seems like it "should be" [15:39:57] apergos: ja should be [15:40:00] oh [15:40:03] (03CR) 10BBlack: [C: 032] VCL: ttl fixed/cap params vcl_fetch [puppet] - 10https://gerrit.wikimedia.org/r/269967 (https://phabricator.wikimedia.org/T124954) (owner: 10BBlack) [15:40:04] maybe id din't puppet merge Ummm [15:40:08] no yes i id [15:40:13] nm yes of course i did [15:40:16] yeah cause I saw it in local repo on my laptop [15:40:22] first thing I checked [15:40:37] (03PS2) 10Elukey: Add mc1010/mc1011 back to the redis/memcached pools after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269982 (https://phabricator.wikimedia.org/T123711) [15:40:39] there is another change i didn't puppet merge, but that would be unrelated to this one. bblack, cna I puppet-merge [15:40:39] ? [15:40:43] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [15:40:47] ottomata: already did [15:40:49] danke [15:40:57] gonna re run my puppet compiler, just in case [15:41:21] it's doing a gc run on merge of course :P [15:41:42] :-D [15:42:03] https://wikimediafoundation.org/donate/2007/psa/backup/index.html [15:42:12] OggHandler still trying to load a png from commons from there [15:42:31] ohgee [15:42:32] (03CR) 10Elukey: [C: 032] Add mc1010/mc1011 back to the redis/memcached pools after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269982 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [15:42:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:42:41] Hey look we're linking a mp4 file! [15:42:43] (404 though) [15:43:29] !log Add mc1010/mc1011 back to the redis/memcached pools after maintenance. 
[15:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:59] (03CR) 10BryanDavis: Ship Elasticsearch logs to logstash (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [15:44:43] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /srv 13732 MB (3% inode=99%) [15:45:22] hmmm, ja [15:45:23] no good [15:45:23] https://gist.github.com/ottomata/eddba28183a2b0c44ad0 [15:45:27] apergos: ^ :/ [15:45:29] dunno what then [15:47:12] hmmmm [15:47:22] maybe its the unquoted names with :: [15:47:23] ? [15:47:55] (03CR) 10Gehel: Ship Elasticsearch logs to logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: 10Gehel) [15:48:16] hrm [15:48:26] hmm, nope, don't think so [15:49:32] akosiaris: about your previous comment on my elasticsearch -> logstash patch. Relying on undef to disable log shipping makes it awkward to configure. We need to ensure that logstash does not send logs to itself. Using hiera seems the right place to do it, but it gets fairly cryptic if we want to enable log shipping for all hosts except logstash. Handling of [15:49:32] `undef` in hiera is misleading at best [15:50:00] you know maybe [15:50:11] if you do classes you have to split these out into separate files and subdirs [15:50:50] nope, this works fine [15:50:50] ./utils/hiera_lookup --fqdn=terbium.eqiad.wmnet admin::groups [15:51:10] <_joe_> ottomata: what's your problem? [15:51:13] apergos: ? 
i thought that was for role lookup [15:51:21] yeah I might be ful of bs [15:51:26] _joe_: i have new hiera keys like [15:51:27] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/eqiad.yaml#L153 [15:51:36] stumped [15:51:38] but they don't seem to work [15:51:46] ./utils/hiera_lookup --fqdn=$(hostname -f) cdh::hadoop::cluster_name [15:51:46] nil [15:51:48] it'll be something obvious (to someone with fresh eyes) [15:51:50] (that was on palladium) [15:51:59] <_joe_> ottomata: of course they won't [15:52:06] haha, _joe_ of course! [15:52:09] (...?) [15:52:14] ottomata: also --debug is helpful [15:52:23] OOO [15:52:42] <_joe_> ottomata: those should be in https://github.com/wikimedia/operations-puppet/tree/production/hieradata/eqiad/cdh/hadoop.yaml [15:52:51] so I was not wrong [15:52:54] <_joe_> and written like datanode_mounts: etc [15:52:54] grrrr [15:53:14] some of them mabye _joe_ [15:53:20] but i do need to access them from other places [15:53:26] not just in the module class [15:53:33] gehel: [15:53:37] <_joe_> ottomata: sigh, that has nothing to do with the module class [15:53:39] gehel: I am not sure I follow [15:53:46] <_joe_> it's just how variables are encoded in hiera [15:53:52] <_joe_> hiera is a k-v store [15:53:57] yes, but scoped [15:54:06] er, not really ... [15:54:06] oh, eqiad/cdh/hadoop [15:54:07] hmmm [15:54:16] i didn't know that worked, i thought that only looked for our special role stuff [15:54:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [15:54:27] so, if in eqiad/cdh/hadoop.yaml [15:54:28] I put [15:54:32] <_joe_> well, if you look at the docs, it's written there :) [15:54:33] cluster_name: blabla [15:54:36] <_joe_> yes [15:54:43] the wikitech ones! I read them yesterday! must have missed it... 
[15:54:44] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [1000.0] [15:54:47] heh [15:54:53] so, ja if I do that [15:55:01] then in an arbitrary class [15:55:02] i can do [15:55:02] <_joe_> uhm eqiad [15:55:06] hiera(cdh::hadoop::cluster_name) [15:55:07] for all the cdh vars, I see you have a pile in there [15:55:07] ? [15:55:08] <_joe_> sorry, looking [15:55:10] <_joe_> ottomata: yes [15:55:15] ok, thanks [15:55:18] will try that [15:55:20] <_joe_> hiera('cdh::hadoop::cluster_name') [15:55:24] aye [15:55:24] <_joe_> to be more precise [15:55:33] <_joe_> or, if you included the class somewhere [15:55:44] <_joe_> $cdh::hadoop::cluster_name [15:55:58] _joe_: is totally documente "so foo::params::param would be searched insidehieradata/${::site}/foo/params.yaml as param" [15:55:59] sorry [15:56:11] <_joe_> I know it is :P [15:56:23] aye, might not have the class included at that time [15:56:27] <_joe_> you gave me a good occasion to play BOTH :) [15:56:38] akosiaris: conversation going on in #wikimedia-labs with bd808. We need to ensure that logstash (which also uses elasticsearch) does not ship logs to itself. [15:56:39] <_joe_> ottomata: ok so if the class is included, puppet will do for you [15:56:48] aye [15:56:56] <_joe_> if it's not, explicitly look up the label on hiera [15:56:56] yeah makes sense, you can just use usual variable lookup then [15:56:58] aye [15:57:11] gehel: sounds doable [15:57:47] gehel: ah, you fear that undef in hiera will cause problems... [15:57:53] hmm never had to test that [15:57:58] akosiaris: this could be done by setting `es::graylog: ...` in deployment-prep/common.yaml and `es::graylog: ~` in deployment-logstash2.yaml [15:58:06] <_joe_> what are you trying to do gehel? 
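[editor's note] The lookup rule _joe_ quotes above ("foo::params::param would be searched inside hieradata/${::site}/foo/params.yaml as param") can be sketched roughly as below. This is a minimal illustration, not the actual Hiera backend code: the function name `hiera_location`, the default site `eqiad` and the `hieradata` root are assumptions made for the example, and keys without a `::` namespace are deliberately not handled.

```python
def hiera_location(key, site="eqiad", root="hieradata"):
    """Map a namespaced key to (yaml file, key inside that file).

    Hypothetical helper: it mirrors only the rule quoted in the channel,
    not the real backend used on the puppetmasters.
    """
    parts = key.split("::")
    if len(parts) < 2:
        # Bare keys are out of scope for this sketch.
        raise ValueError("expected a namespaced key such as 'foo::params::param'")
    # Everything before the last '::' becomes the directory path,
    # the last component is the key looked up inside the file.
    *namespace, leaf = parts
    return f"{root}/{site}/{'/'.join(namespace)}.yaml", leaf


print(hiera_location("cdh::hadoop::cluster_name"))
# ('hieradata/eqiad/cdh/hadoop.yaml', 'cluster_name')
```

Under this reading, ottomata's key would be written as `cluster_name:` inside eqiad/cdh/hadoop.yaml rather than as `cdh::hadoop::cluster_name:` in eqiad.yaml, which matches what _joe_ suggests.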
[15:58:28] _joe_: create a flag saying elastic search should ship logs to logstash [15:58:39] but logstash's elasticsearch should not [15:58:53] <_joe_> akosiaris: so, define it as a class parameter, default to false, turn it on with true in hiera where needed [15:58:59] (03PS1) 10Muehlenhoff: Add ferm rules for eventlogging udp receiver [puppet] - 10https://gerrit.wikimedia.org/r/269986 (https://phabricator.wikimedia.org/T113343) [15:59:20] <_joe_> or do the opposite if not shipping logs is the exception [15:59:36] not shipping logs is the exception [15:59:45] _joe_: that was my first implementation, but akosiaris find it cleaner to rely on the host to which we ship logs being undef to deactivate log shipping [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160211T1600). [16:00:05] Addshore Krenair yurik Dereckson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:06] but my comment was that instead of a boolean class parameter, relying on undef make is cleaner [16:00:06] <_joe_> ok that works equally well [16:00:10] Hello. [16:00:14] chasemp, valhallasw`cloud, _joe_, tools-worker-1001 is firing a disk space alarm, apparently due to grown in /var/lib/docker. Does one of you know what to do about that? Or if we care? 
[16:00:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:00:22] <_joe_> akosiaris: heh, whatever, tbh :P [16:00:44] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:01:01] <_joe_> but let me check one thing [16:01:11] akosiaris, _joe_ : relying on undef is awkward if you want to have default == ship log and disable it on exceptional case (IMHO) [16:01:55] my fear is it might not be doable [16:01:56] here! [16:02:05] not the awkwardness of it [16:02:09] <_joe_> akosiaris: what might not be doable? [16:02:12] as in.. undef in hiera ? [16:02:19] not sure how to even specify that yet [16:02:23] Krenair: are you SWATing this morning? [16:02:34] akosiaris: ~ == undef [16:02:36] ah [16:02:38] role/common/aqs.yaml:restbase::parsoid_uri: undef [16:02:40] that works [16:02:54] ok then, seems like it's a working solution [16:03:00] why is it awkward ? [16:03:51] because we had to ask ourselves "is it doable?". We were not clear on how hiera is handling undef, so I prefer a more explicit solution. [16:04:02] (03PS4) 10Thcipriani: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore) [16:04:04] <_joe_> akosiaris: hiera('host', undef) surely works [16:04:10] <_joe_> but I don't like it [16:04:14] (03PS2) 10Yurik: Add $wgGraphAllowedDomains setting for future [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269896 [16:04:16] <_joe_> if it's a class variable [16:04:25] <_joe_> just default to undef [16:04:29] I can SWAT, this morning :) [16:04:35] thcipriani: awesome! [16:04:43] _joe_: look at role/common/aqs.yaml [16:04:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore) [16:04:51] akosiaris, is this good? 
https://www.mediawiki.org/wiki/Extension:Graph#External_data [16:04:54] _joe_: does the classpath lookup work in the labs/ dir too? [16:04:54] like [16:04:58] labs/cdh/hadoop.yaml [16:04:58] ? [16:05:01] <_joe_> oh well, there is kip thorne talking [16:05:09] we are already doing. It does not look bad [16:05:18] <_joe_> ottomata: look at hiera.yaml on palladium [16:05:44] k [16:05:55] (03Merged) 10jenkins-bot: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore) [16:06:11] yurik: custom wiki protocols ? [16:06:19] akosiaris, yep :) [16:06:22] wikirest:///api/rest_v1/page/... ??? [16:06:25] yep [16:06:32] what ? when did we come up with that ? [16:06:37] <_joe_> rotfl [16:06:40] trust me, it was a very painful process :( [16:06:40] <_joe_> ROTFL [16:06:45] akosiaris: Ok, I will keep the undef [16:06:46] thcipriani, am here now [16:06:52] * addshore waves at Krenair [16:06:55] gehel: thanks! [16:06:55] <_joe_> that's so great [16:06:59] <_joe_> WIKIREST!! [16:07:00] akosiaris and _joe_ talk to csteip :))) [16:07:04] _joe_: - "%{::site}/%{::realm}" [16:07:05] so [16:07:11] eqiad/labs/cdh/hadoop.yaml [16:07:12] ? [16:07:25] yurik: er, ok [16:07:38] ottomata: nope, sorry [16:07:40] wikirestinpeace: :P [16:07:45] ottomata: where did you get that? [16:07:47] hey addshore [16:08:19] Krenair: okie doke. Would you like to take over? Or should I keep on trucking? 
[16:08:20] haha wikirestinpeace, /etc/puppet/hiera.yaml [16:08:30] on palladium [16:08:32] ottomata: sorry, you looked on palladium [16:08:42] "_joe_ [16:08:42] ottomata: look at hiera.yaml on palladium [16:08:42] labs has a different hierarchy ofc [16:08:42] " [16:08:42] haha [16:08:46] oh [16:09:01] akosiaris and _joe_ check your email :))) [16:09:05] thcipriani, you can keep going if you like [16:09:05] <_joe_> also, sorry, going off for a bit [16:09:15] reading [16:09:16] more [16:09:16] doc [16:09:17] https://wikitech.wikimedia.org/wiki/Puppet_Hiera#In_Labs [16:09:24] <_joe_> yurik: russian spam is waiting for me? [16:09:35] _joe_, why russian? i object! :D [16:09:37] ottomata: hiera in labs is fun(TM) [16:09:47] <_joe_> yurik: "russian" [16:10:05] _joe_, only mail order. [16:10:22] ok _joe_ am fine with labs.yaml directly [16:10:29] <_joe_> everything written in cyrillic in my inbox is classified as "russian spam" by my eye :) [16:10:30] but i'm worried that the lookup will break like it did with eqiad.yaml [16:10:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269862 (https://phabricator.wikimedia.org/T126541) (owner: 10Alex Monk) [16:10:48] <_joe_> ottomata: no; but use the project-specific files please [16:11:03] <_joe_> or the wikitech Hiera: namespace :) [16:11:04] no, this is a global setting, all labs hadoop setups should have this default [16:11:17] i could put it in the pupppet clas [16:11:23] if $::realm == 'labs' [16:11:25] :p [16:11:45] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/labs.yaml#L65 [16:11:48] (03Merged) 10jenkins-bot: Add new WEF enwiki IP rate limit exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269862 (https://phabricator.wikimedia.org/T126541) (owner: 10Alex Monk) [16:11:57] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: wgRCWatchCategoryMembership true on wikisource [[gerrit:264735]] (duration: 02m 14s) 
[16:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:01] * akosiaris sharpens his two handed greataxe [16:12:03] its kinda already there, just trying to smooth the process of spawning hadoop clusters in labs [16:12:04] ^ addshore check please [16:12:10] checking [16:12:28] https://github.com/wikimedia/operations-puppet/blob/production/modules/role/manifests/analytics_cluster/hadoop/client.pp#L233 [16:13:00] (03CR) 10Mobrovac: [C: 031] "ok @Dzahn, makes sense, LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/269603 (owner: 10Dzahn) [16:13:13] looking good [16:13:19] andrewbogott: I don't know anything about the current k8s setup to be honest [16:13:25] addshore: cool, thanks. [16:13:32] valhallasw`cloud: me neither really [16:13:34] akosiaris: what are you about to behead? [16:13:35] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269558 (owner: 10Andrew Bogott) [16:13:59] 6operations, 10Traffic: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2018818 (10BBlack) 3NEW a:3MoritzMuehlenhoff [16:14:54] 6operations, 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2018828 (10Andrew) 5Open>3Resolved ok, I'm happy with how this is fixed now. 
[16:15:43] (03PS1) 10Ottomata: Move analytics_cluster prod hiera vars to eqiad/cdh/ [puppet] - 10https://gerrit.wikimedia.org/r/269991 (https://phabricator.wikimedia.org/T109859) [16:16:37] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Add new WEF enwiki IP rate limit exception [[gerrit:269862]] (duration: 02m 13s) [16:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:44] ^ Krenair sync'd [16:16:48] apergos: people adding if $::realm == labs checks ;-) [16:16:50] thanks [16:16:54] (03PS2) 10Ottomata: Move analytics_cluster prod hiera vars to eqiad/cdh/ [puppet] - 10https://gerrit.wikimedia.org/r/269991 (https://phabricator.wikimedia.org/T109859) [16:17:00] (03Merged) 10jenkins-bot: Add wgOpenStackManagerNovaIdentityV3URI to wikitech configs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269558 (owner: 10Andrew Bogott) [16:17:01] heh [16:17:13] thcipriani: afk now but still pingable on phone! [16:17:22] addshore: ack. [16:19:26] (03PS1) 10Elukey: Remove mc1012/mc1013 from the redis/memcached pools for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269993 (https://phabricator.wikimedia.org/T123711) [16:19:40] (03CR) 10Ottomata: [C: 032] "No op, nothing included yet." [puppet] - 10https://gerrit.wikimedia.org/r/269991 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [16:20:39] !log thcipriani@mira Synchronized wmf-config/wikitech.php: SWAT: Add wgOpenStackManagerNovaIdentityV3URI to wikitech configs [[gerrit:269558]] (duration: 02m 11s) [16:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:48] jaa that works, thanks joe! 
[16:20:51] ^ Krenair check if possible [16:21:29] (03PS4) 10Ottomata: Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) [16:21:41] (03PS2) 10Elukey: Remove mc1012/mc1013 from the redis/memcached pools for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269993 (https://phabricator.wikimedia.org/T123711) [16:21:52] thcipriani, looking [16:22:18] thcipriani, looks like that set it as expected, thanks [16:22:26] Krenair: cool, thanks [16:22:29] (no code in prod actually checks this variable yet, but will soon) [16:24:35] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269896 (owner: 10Yurik) [16:25:08] akosiaris, ^ :) [16:25:20] its a matching patch [16:25:54] (03Merged) 10jenkins-bot: Add $wgGraphAllowedDomains setting for future [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269896 (owner: 10Yurik) [16:26:06] (03CR) 10Elukey: [C: 032] Remove mc1012/mc1013 from the redis/memcached pools for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/269993 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [16:26:38] !log puppet disabled on cp1065 for some live and careful experimentation... [16:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:53] yurik: so should I hold off on syncing that one out? [16:27:03] thcipriani, its ok [16:27:07] they are not used yet [16:27:22] just pestering akosiaris :) [16:27:33] fair enough. going then :) [16:27:52] !log Removed mc1012/mc1013 from the redis/memcached pools for maintenance. 
[16:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:17] thcipriani: I was expressing my flabbergastness (is that even a word) about https://www.mediawiki.org/wiki/Extension:Graph/Guide#External_Data where we seem to be introducing new protocols like wikirest:/// [16:28:37] feel free to go on, I am waiting for csteipp to feed me some more info on this, per yurik's advice [16:28:45] andrewbogott: not sure, there are a few tools running in k8s land tho [16:28:50] I'll ask valhallasw`cloud in -labs [16:28:57] flabbergastedness...? :) [16:29:06] chasemp: I asked him already :) [16:29:06] chasemp: I'm here ;-) [16:29:20] what was the verdict (I missed it)? [16:29:21] I don’t think it’s totally safe to ignore but I also don’t know what to do about it. [16:29:26] ok [16:29:32] (03PS1) 10Ottomata: Add missing net-topology.py.erb to analytics_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/269994 (https://phabricator.wikimedia.org/T109859) [16:29:40] "I don't know", and I'm inclined to ignore it until Yuvi is back [16:29:54] I think only grrrit-wm is using k8s at the moment? 
[16:30:03] and PAWS, I guess [16:30:05] right, let me look at the storage growth rate there, assuming it's small no worries [16:30:21] I thought there was some of magnus's stuff [16:30:23] but I'm not sure [16:30:51] * andrewbogott is about to be in a meeting [16:31:11] (03CR) 10Ottomata: [C: 031] Add ferm rules for rsyncd used in eventlogging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269983 (owner: 10Muehlenhoff) [16:31:17] 6operations, 10RESTBase-Cassandra: cassandra slow streaming during (de)commission - https://phabricator.wikimedia.org/T126619#2018891 (10fgiunchedi) 3NEW [16:32:21] !log thcipriani@mira Synchronized wmf-config/CommonSettings.php: SWAT: Add $wgGraphAllowedDomains setting for future [[gerrit:269896]] (duration: 02m 13s) [16:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:24] ^ yurik sync'd [16:32:38] thcipriani, thx, it should be a noop [16:32:44] checking just in case [16:33:06] thcipriani, alls good [16:33:16] yurik: cool, thank you. 
[16:34:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [16:34:35] (03CR) 10Ottomata: [C: 032] Add missing net-topology.py.erb to analytics_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/269994 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [16:34:57] (03PS5) 10Ottomata: Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) [16:35:15] (03Merged) 10jenkins-bot: Change Nepali Wikibooks sitename and logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [16:35:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268419 (owner: 10Dereckson) [16:36:20] (03Merged) 10jenkins-bot: Clean duplicate wgCopyUploadsDomains setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268419 (owner: 10Dereckson) [16:38:05] (03PS1) 10Ottomata: Fix use of variable in analytics_cluster/hadoop/client.pp [puppet] - 10https://gerrit.wikimedia.org/r/269996 (https://phabricator.wikimedia.org/T109859) [16:38:27] We've a lot of irccloud users. [16:39:18] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2018958 (10BBlack) I'm live-testing this (at a setting of `1`) on an eqiad text machine (cp1065) right now. The connection/thread stuff I'm watching is the output of: ```... 
[16:41:15] !log thcipriani@mira Synchronized w/static/images/project-logos/newikibooks.png: SWAT: Change Nepali Wikibooks sitename and logo Part I [[gerrit:267170]] (duration: 02m 13s) [16:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:48] (03CR) 10Ottomata: [C: 032] Fix use of variable in analytics_cluster/hadoop/client.pp [puppet] - 10https://gerrit.wikimedia.org/r/269996 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [16:41:49] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2018972 (10BBlack) The above (if it holds) basically confirms earlier thinking: that this is the "right" thing to do and helps, but we've got upload-cluster-specific issues... [16:42:19] (03PS6) 10Ottomata: Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) [16:43:27] thcipriani: you've already purged the logo? If so, it's echo 'https://en.wikipedia.org/static/image/project-logos/newikibooks.png' | mwscript purgeList.php now, not any more www.wikimedia.org. [16:44:16] Dereckson: haven't yet, syncing initialisesettings. 
Thank you for the command :) [16:44:55] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Change Nepali Wikibooks sitename and logo Part II [[gerrit:267170]], Clean duplicate wgCopyUploadsDomains setting [[gerrit:268419]] (duration: 02m 14s) [16:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:45:28] Dereckson: sync'd and purged, check please [16:45:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269770 (owner: 10Dereckson) [16:46:05] (03PS1) 10Ottomata: Move hadoop memory settings for production above class declaration [puppet] - 10https://gerrit.wikimedia.org/r/269998 (https://phabricator.wikimedia.org/T109859) [16:46:51] (03Merged) 10jenkins-bot: Revert "Revert "Enable Math extension on Wikitech"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269770 (owner: 10Dereckson) [16:48:02] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2019001 (10BBlack) Should also note: the initial test of this was with keepalive's idle conns parameter set to `4`, whereas the new tests are at `1`. That alone may improv... [16:48:21] (03CR) 10Ottomata: [C: 032] Move hadoop memory settings for production above class declaration [puppet] - 10https://gerrit.wikimedia.org/r/269998 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [16:48:33] thcipriani: so, site name is OK, a follow-up patch is needed for the meta namespace which has been forgotten, logo isn't sync'ed. 
[16:48:44] oh logo is okay now [16:49:09] (03PS7) 10Ottomata: Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) [16:50:00] Krenair: syncing the semanticform patch now, FYI [16:50:05] ty [16:50:08] was about to ask about that [16:51:38] !log thcipriani@mira Synchronized php-1.27.0-wmf.13/extensions/SemanticForms/includes/SF_Utils.php: SWAT: Semantic form path fixes [[gerrit:269869]] (duration: 02m 14s) [16:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:42] ^ Krenair check please [16:51:52] (03PS1) 10Dereckson: Namespace configuration on ne.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270000 (https://phabricator.wikimedia.org/T124881) [16:52:09] thcipriani, looks like that fixed it, thanks [16:52:15] thcipriani: could we deploy this one too, as a follow-up for MtDu's 267170 ? ^ [16:52:19] Krenair: awesome, thanks for checking [16:52:34] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago [16:52:34] (03PS1) 10ArielGlenn: dumps mirroring tool, don't assume dest is local filesystem [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/270001 [16:53:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270000 (https://phabricator.wikimedia.org/T124881) (owner: 10Dereckson) [16:53:12] Dereckson: could you add that to the deploy page on wikitech, please? [16:53:16] Sure. [16:54:00] (03Merged) 10jenkins-bot: Namespace configuration on ne.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270000 (https://phabricator.wikimedia.org/T124881) (owner: 10Dereckson) [16:55:27] I got disconnected from IRC and can't rejoin #wikimedia_security. My IRC-fu is rusty, do I need an invite each time? [16:55:52] mediawiki_ [16:56:40] right, I meant #mediawiki_security. Still cannot rejoin ... 
[16:56:57] gehel, maybe this isn't the right place to discuss that [16:56:59] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Revert "Enable Math extension on Wikitech"" [[gerrit:269770]] (duration: 02m 13s) [16:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:04] ^ Dereckson check please [16:57:09] well he can't get into the other channel :-D [16:57:40] * Dereckson sighs. [16:57:42] bblack: what would be the right place? [16:57:44] We'll have to do more tests on beta wikitech. Exception encountered, of type "FileBackendException" [16:58:03] Dereckson: :( kk, reverting [16:59:17] I guess the wikitech extension tries to use Mathoid by default, and we'll have to tweak the config to use legacy texvc for wikitech (or see if mathoid is reacheable from there). [16:59:25] s/wikitech/Math [16:59:57] (03PS1) 10Thcipriani: Revert "Merge "Revert "Revert "Enable Math extension on Wikitech"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270002 [17:00:04] jynus moritzm: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160211T1700). [17:00:04] yurik: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:26] akosiaris, any objections to that patch? 
it was scheduled for puppet swat ^ [17:00:29] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270002 (owner: 10Thcipriani) [17:01:05] (03Merged) 10jenkins-bot: Revert "Merge "Revert "Revert "Enable Math extension on Wikitech"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270002 (owner: 10Thcipriani) [17:01:36] 2016-02-11 16:57:23 silver labswiki 1.27.0-wmf.13 exception ERROR: [efc7b4f6] /wiki/User:Dereckson/Sandbox FileBackendException from line 154 of /srv/mediawiki/php-1.27.0-wmf.13/includes/filebackend/FileBackendGroup.php: No backend defined with the name `global-multiwrite`. {"exception_id":"efc7b4f6"} [17:01:53] sounds like an entirely different issue Dereckson [17:02:13] probably to save the generated image [17:02:41] (03CR) 10BBlack: [C: 031] mediawiki: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/268802 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [17:03:20] Krenair: by default, Math uses PNG images as rendering method in user settings [17:03:34] MathML is an option [17:04:29] also saw this: [17:04:30] 2016-02-11 11:45:56 silver labswiki 1.27.0-wmf.13 fatal ERROR: [0ec09468] PHP Fatal Error: Call to undefined function wfCheckLimits() {"exception_id":"0ec09468"} [17:04:31] [Exception ErrorException] (/srv/mediawiki/php-1.27.0-wmf.13/extensions/SemanticMediaWiki/specials/QueryPages/SMW_SpecialProperties.php:29) PHP Fatal Error: Call to undefined function wfCheckLimits() [17:04:31] #0 [internal function]: MWExceptionHandler::handleFatalError() [17:04:31] #1 {main} [17:04:37] Reedy, perhaps another removed function? [17:04:39] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert enable math extension on Wikitech. 
Namespace configuration on ne.wikibooks (duration: 02m 15s) [17:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:44] ^ Dereckson check please [17:04:52] jynus: moritzm: Can I nominate https://gerrit.wikimedia.org/r/268802 as last-minute addition to puppet swat for the coming hour? [17:04:56] (03CR) 10Bmansurov: apertium-dan-nor: New upstream release [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/269916 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [17:05:00] sounds like something reedy removed [17:05:01] (see -sec) [17:05:10] thcipriani: works [17:05:20] Dereckson: thank you. [17:05:24] (03CR) 10Bmansurov: apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/269915 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [17:05:27] SWAT's done :) [17:05:30] thcipriani: you synced 270000? [17:05:40] ah yes, works too. [17:05:47] Krinkle, ok for me [17:05:48] Thanks for the deploy. [17:05:51] 6operations, 10RESTBase, 10hardware-requests: 3x additional SSD for restbase hp hardware - https://phabricator.wikimedia.org/T126626#2019073 (10fgiunchedi) 3NEW a:3RobH [17:05:57] For anyone wondering if depooling/repooling redis servers has a production impact -- https://graphite.wikimedia.org/render/?width=1000&height=600&_salt=1455047738.237&lineMode=connected&target=MediaWiki.edit.failures.session_loss.count [17:06:09] Dereckson: yup, as part of the last sync. Both the revert and namespace change. 
[17:06:12] https://gerrit.wikimedia.org/r/#/c/262028/ (can't blame reedy) [17:06:26] and i don't want to blame anyone [17:06:32] !log disabled puppet, memcached, redis on mc1012/mc1013 for maintenance [17:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:45] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:11:53] (03CR) 10Jcrespo: "Latest version: https://puppet-compiler.wmflabs.org/1728/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/269819 (owner: 10Yurik) [17:13:20] bd808: btw, we're all good on SM today and wmf.13 is where it should be? [17:13:38] I was watching the twitter account for !log and that's what I think is the case [17:13:46] (last night, that is) [17:14:03] greg-g: yeah we are back on wmf.13 in the right places. [17:14:15] right on, good work [17:14:24] Krinkle: unless that has been tested in deployment-prep I'd rather not, we recently had a bigger breakage caused by a seeminglingly harmless apache config change and consensus was that they're not in scope for puppet swat unless extensively tested [17:14:29] * bd808 asks anomie to take a bow [17:15:18] (03PS1) 10Mobrovac: Cassandra: fix top-scope vars without namespaces [puppet] - 10https://gerrit.wikimedia.org/r/270005 (https://phabricator.wikimedia.org/T125943) [17:16:07] Re: yurik akosiaris had -1 that, I want his confirmation that the changes are enough for him to go on [17:16:18] moritzm: I've tested it a fair bit on mw1017 myself, and monitored production varnish frontend traffic to those urls as being low and unaffected. But point taken. I'll apply there first. [17:16:34] * yurik pokes akosiaris with a poker [17:16:36] * anomie takes a bow [17:16:56] 6operations, 10ops-eqiad: ms-be1008.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T126627#2019101 (10fgiunchedi) 3NEW [17:17:10] moritzm: The beta config is quite different though. 
So risk assesment wouldn't change much [17:17:16] anomie: good work :) [17:17:35] (03PS7) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [17:18:08] jynus, i pointed akosiaris towards the documentation, and i think he was fine with it -- this related to overall custom protocols that we had to introduce due to security. This patch simply refactors existing code's configurations and adds a small thing, no major changes [17:19:15] jynus, it should be a noop because that config value is not being used yet [17:20:02] jynus: still reading up a bug, will post a review soon [17:21:22] (03CR) 10Mobrovac: [C: 04-1] "Hmmm, still failing for x-instances: https://puppet-compiler.wmflabs.org/1729/" [puppet] - 10https://gerrit.wikimedia.org/r/270005 (https://phabricator.wikimedia.org/T125943) (owner: 10Mobrovac) [17:21:23] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago [17:21:26] yurik, it is a question of not oveariding a team mate, if that is ok, I will personally merge it outside of the puppet swat window [17:21:59] jynus, that's totally fine, no rush on this :) I thought akosiaris was done with it [17:23:14] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:24:30] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#2019127 (10fgiunchedi) is there a status update on replacement and/or order? thanks! 
[17:25:03] (03PS1) 10Krinkle: beta: Add public-wiki-rewrites.incl to better match mediawiki production [puppet] - 10https://gerrit.wikimedia.org/r/270009 [17:26:02] (03PS1) 10Ottomata: Move production hadoop memory configuration into hiera cdh/hadoop.yaml [puppet] - 10https://gerrit.wikimedia.org/r/270010 [17:26:58] (03PS1) 10Krinkle: beta: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/270011 [17:27:07] (03PS4) 10Krinkle: mediawiki: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/268802 (https://phabricator.wikimedia.org/T99096) [17:28:32] 7Puppet, 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: cassandra - puppet compiler fail on codfw/test/staging hosts - https://phabricator.wikimedia.org/T125943#2019135 (10mobrovac) [Gerrit 270005](https://gerrit.wikimedia.org/r/270005) tries to correct the mistake - `$instance_name` is not a top-sc... [17:30:11] so, can this be enabled progressively in production, or would it hurt the cache? [17:31:04] (03CR) 10Ottomata: [C: 032] Move production hadoop memory configuration into hiera cdh/hadoop.yaml [puppet] - 10https://gerrit.wikimedia.org/r/270010 (owner: 10Ottomata) [17:31:08] jynus: it should be ok to let it roll out naturally with puppet [17:31:51] (03PS8) 10Ottomata: Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) [17:31:56] I am referring to "Revisit Puppet SWAT and general +2 merge procedures around Apache configuration changes" [17:32:01] (03CR) 10Alexandros Kosiaris: [C: 032] Add allowedDomains param to graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/269819 (owner: 10Yurik) [17:32:03] ah! 
[17:32:07] (03PS10) 10Alexandros Kosiaris: Add allowedDomains param to graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/269819 (owner: 10Yurik) [17:32:12] (03CR) 10Alexandros Kosiaris: [V: 032] Add allowedDomains param to graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/269819 (owner: 10Yurik) [17:32:25] referring to this particular patch in reference to that [17:32:44] we should do whatever testing we need to do there. I was just talking about the actual production roll post-puppet-merge to production - it doesn't need special treatment for varnish cache effects [17:33:23] (03CR) 10Krinkle: "Hm.. have to give it a different name. Duplicate declaration: File[/etc/apache2/sites-enabled/public-wiki-rewrites.incl] is already declar" [puppet] - 10https://gerrit.wikimedia.org/r/270009 (owner: 10Krinkle) [17:33:41] 6operations, 10ops-codfw: ms-be2016.codfw.wmnet: slot=0 dev=sdi failed - https://phabricator.wikimedia.org/T126630#2019168 (10fgiunchedi) 3NEW [17:35:34] yurik: jynus, merged [17:35:44] akosiaris, thx! [17:35:52] akosiaris, any comments about the approach? :D [17:36:11] other than the "eeek, this is ugly :)" [17:36:15] akosiaris, I can do that, I just needed your +1 [17:36:27] or undoing your -1 [17:36:44] jynus: already done +2ed, merged and deployed [17:36:51] thanks though [17:37:05] jynus: where did we end up documenting puppetswat improvements for apache config? I see [Done] on the incident, but can't find the info [17:37:17] (03PS2) 10Krinkle: beta: Use public-wiki-rewrites.incl to better match mediawiki production [puppet] - 10https://gerrit.wikimedia.org/r/270009 [17:37:30] yurik: well, took me a while to understand the thinking about that. Probably some documentation at some point soon about why these was necessary would be needed. 
[17:37:38] oh [17:37:40] "Changes to the Apache configuration for the MediaWiki application server cluster are not eligible for SWAT, as due to the potentially far reaching impact / unavailability, these need extensive testing." [17:37:43] (03Abandoned) 10Krinkle: beta: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/270011 (owner: 10Krinkle) [17:37:49] that was the improvement I guess [17:37:54] (03PS5) 10Krinkle: mediawiki: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/268802 (https://phabricator.wikimedia.org/T99096) [17:37:56] :-) [17:38:07] so I am ok with testing that [17:38:20] *help testing that [17:38:35] 6operations, 10Adminbot, 10Deployment-Systems: [[wikitech:Server_admin_log]] should not rely on freenode irc for logmsgbot entries - https://phabricator.wikimedia.org/T46791#2019192 (10greg) [17:38:47] probably, disable puppet, enabling it slowly [17:39:07] it seems kind of backwards to say "X should not go through the slightly-more-formal procedure of PuppetSWAT because it's too dangerous", but instead should be "tested extensively" and deployed outside of PuppetSWAT under no formalism at all... [17:39:43] I think when you put it outside of puppetswat means more than one +1 from ops [17:39:43] I guess that's the essential conflict in the fact that our only ops deployment formalism is intended only for lightweight changes [17:40:14] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: puppet fail [17:42:13] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:43:35] do we still have puppet 2.7 somewhere? Is there any reason not to upgrade puppet-stdlib to v4 ? [17:43:53] except of course that there are more important things to do ... 
[17:44:19] 6operations, 10RESTBase, 10hardware-requests: 3x additional SSD for restbase hp hardware - https://phabricator.wikimedia.org/T126626#2019222 (10RobH) @fgiunchedi: The spares are samsung EVO SSDs that we ONLY use in restbase. So I imagine this would come out of the services budget to do this; has it been dis... [17:44:36] <_joe_> gehel: we have no puppet 2.7 [17:44:51] <_joe_> but well, upgrading is risky and requires a lot of testing, basically [17:45:20] _joe_: yeah, always the same issue ... especially with puppet ... [17:45:38] <_joe_> gehel: did you get introduced to our puppet compiler? [17:45:53] _joe_: WAT ? [17:45:55] <_joe_> that would reduce the amount of guesswork by a lot [17:46:24] no I have not ... [17:46:43] <_joe_> we have a jenkins job queue to compile a puppet catalog for production hosts that spits out comparisons between what is in production now and a gerrit change [17:47:09] <_joe_> see e.g. https://puppet-compiler.wmflabs.org/1550/ [17:47:12] That's cool ! [17:48:41] it's not linked from the gerrit page ? [17:49:04] <_joe_> nope [17:49:11] <_joe_> you have to run this manually [17:49:34] <_joe_> it's not ran automatically as it's expensive to compile for all hosts in the production catalog [17:49:52] <_joe_> IIRC, it takes ~ 1.5 hours to make a full catalog run [17:49:54] also puppet compiler says it is a noop, even if it is not [17:50:18] 6operations, 10RESTBase, 10hardware-requests: 3x additional SSD for restbase hp hardware - https://phabricator.wikimedia.org/T126626#2019247 (10GWicke) @RobH, getting some extra 850 Pro SSDs to simplify the transition period & for use as spares sounds good to me. [17:50:19] <_joe_> jynus: what do you mean? 
[17:50:32] _joe_, unrelated to your conv [17:50:43] talking to bblack [17:51:16] 6operations, 10RESTBase, 10hardware-requests: 3x additional SSD for restbase hp hardware - https://phabricator.wikimedia.org/T126626#2019254 (10RobH) Sounds good, I'll attach a #procurement task (with pricing for @gwicke & @mark's approvals) shortly. [17:51:25] <_joe_> oh [17:51:40] <_joe_> gehel: so, you have a little gem in your puppet tree [17:51:44] <_joe_> ./utils/pcc [17:51:45] _joe_: I quite like the idea of this puppet compiler. The other thing I've seen is always running puppet in noop mode and alerting when something happens. [17:51:52] <_joe_> it can submit jobs via the CLI [17:53:37] _joe_: so, I should use pcc with respect and understanding that it consumes precious resources ... [17:54:46] 7Puppet, 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: cassandra - puppet compiler fail on codfw/test/staging hosts - https://phabricator.wikimedia.org/T125943#2019263 (10fgiunchedi) I believe this was a legit case of secrets missing from labs/private.git, I've updated the repo with last secrets, c... [17:55:31] <_joe_> gehel: well, if you test for less than 10 hosts, it's not so expensive :) [17:55:44] <_joe_> and if it becomes too expensive, we can scale out [17:55:47] I need to at least try that ! [17:55:49] <_joe_> to two jenkins slaves [17:56:03] <_joe_> it requires a bit of work, but should be doable [17:56:16] (03PS8) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [17:59:38] bblack, can I reserve some time with you next week to do this? I do not want to leave it unattended, just give it the proper time [18:00:05] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160211T1800). Please do the needful. 
[18:00:19] i'm getting ready kartotherian bugfix [18:00:40] no parsoid deploy [18:01:15] !log `nodetool stop -- COMPACTION' on restbase1001.eqiad to free disk space (https://phabricator.wikimedia.org/P2596) [18:01:16] (03PS9) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [18:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:16] (03PS1) 10Elukey: Add mc1012/mc1013 to redis/memcached pools after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/270014 (https://phabricator.wikimedia.org/T123711) [18:09:26] 7Puppet, 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: cassandra - puppet compiler fail on codfw/test/staging hosts - https://phabricator.wikimedia.org/T125943#2019394 (10mobrovac) Thanks Filippo! The puppet compiler is now [happy](https://puppet-compiler.wmflabs.org/1735/). [18:09:55] (03CR) 10Mobrovac: "It was a case of missing secret keys for x-instances. The compiler now shows this to be a no-op: https://puppet-compiler.wmflabs.org/1735/" [puppet] - 10https://gerrit.wikimedia.org/r/270005 (https://phabricator.wikimedia.org/T125943) (owner: 10Mobrovac) [18:11:30] (03CR) 10Elukey: [C: 032] Add mc1012/mc1013 to redis/memcached pools after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/270014 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [18:12:42] !log Add mc1012/mc1013 to redis/memcached pools after maintenance. 
[18:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:31] (03CR) 10Krinkle: "This doesn't currently do anything bad, but it also doesn't work in beta because beta has been broken for >24 hours and as such doesn't ev" [puppet] - 10https://gerrit.wikimedia.org/r/270009 (owner: 10Krinkle) [18:15:33] (03PS6) 10Krinkle: mediawiki: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/268802 (https://phabricator.wikimedia.org/T99096) [18:16:57] 6operations, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#2019446 (10Milimetric) [18:17:01] 6operations, 10Analytics, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#2019444 (10Milimetric) 5Open>3declined We decided to not split and merge events, this would be... 
[18:19:32] (03CR) 10Dzahn: "oh, thank you Marko and Filippo :)" [puppet] - 10https://gerrit.wikimedia.org/r/270005 (https://phabricator.wikimedia.org/T125943) (owner: 10Mobrovac) [18:22:02] !log updated kartotherian https://gerrit.wikimedia.org/r/#/c/270016/ [18:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:07] (03PS2) 10Dzahn: Cassandra: fix top-scope vars without namespaces [puppet] - 10https://gerrit.wikimedia.org/r/270005 (https://phabricator.wikimedia.org/T125943) (owner: 10Mobrovac) [18:27:23] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/1735/" [puppet] - 10https://gerrit.wikimedia.org/r/270005 (https://phabricator.wikimedia.org/T125943) (owner: 10Mobrovac) [18:29:10] running puppet on restbast to double confirm [18:29:21] !log concluding manual experiment on cp1065 (puppet re-enabled) [18:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:52] mobrovac: confirmed noop on restbase2001, 1001.. thx [18:37:40] 6operations, 10RESTBase, 10hardware-requests: 3x additional SSD for restbase hp hardware - https://phabricator.wikimedia.org/T126626#2019073 (10RobH) [18:39:46] do we have have a preferred framework for unit test of puppet modules ? [18:42:35] mutante: \o/ [18:46:10] (03CR) 10RobH: [C: 031] "pls ping me when we finally kill all of these from the repo, as I'll revoke all the certs." [puppet] - 10https://gerrit.wikimedia.org/r/269877 (https://phabricator.wikimedia.org/T122320) (owner: 10Dzahn) [18:46:28] mobrovac: a see a single puppet fail now, on 1009, but unrelated. 
it's about nodejs package upgrade [18:47:49] (03CR) 10RobH: [C: 031] netboot.cfg - replace tabs with spaces [puppet] - 10https://gerrit.wikimedia.org/r/268717 (owner: 10Dzahn) [18:49:03] (03CR) 10RobH: [C: 031] dhcp: replace tabs with spaces [puppet] - 10https://gerrit.wikimedia.org/r/268706 (owner: 10Dzahn) [18:50:21] 6operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2019673 (10eross) 3NEW [18:50:27] (03CR) 10RobH: [C: 031] "I think this will work, and the explanation makes sense and is something we need. I'm not 100% sure on the actual rewrite rule, but it ap" [puppet] - 10https://gerrit.wikimedia.org/r/268851 (https://phabricator.wikimedia.org/T118176) (owner: 10Dzahn) [18:52:30] (03CR) 10Muehlenhoff: Add ferm rules for rsyncd used in eventlogging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269983 (owner: 10Muehlenhoff) [18:53:35] 6operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2019692 (10Capt_Swing) [18:53:40] ahh , nice moritzm cool [18:54:37] 6operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2019673 (10Capt_Swing) Sent originally to techsupport@wikimedia, and Emerauld posted it here for me. Please let me know if you need more info from me to process this request. [18:57:16] query mutante [18:57:24] nope, I need a / [19:00:03] the mutant query! [19:05:33] elukey: here you go: / [19:05:43] first one is free [19:06:27] you think you have a lock on the market. I'm cuttin in [19:06:39] and cuttin out... 
need dinner [19:08:00] ACKNOWLEDGEMENT - puppet last run on restbase1009 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn server has held nodejs package [19:08:55] 7Puppet, 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: cassandra - puppet compiler fail on codfw/test/staging hosts - https://phabricator.wikimedia.org/T125943#2019772 (10Dzahn) a:3mobrovac [19:08:59] 7Puppet, 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: cassandra - puppet compiler fail on codfw/test/staging hosts - https://phabricator.wikimedia.org/T125943#2019773 (10Dzahn) 5Open>3Resolved [19:09:12] 6operations, 10ops-eqiad: elastic1023 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126586#2019774 (10Cmjohnson) Not sure what caused it but cpu1 reports having an internal error. I cleared the error and drained flea power. Server is up [19:09:17] 6operations, 10ops-eqiad: elastic1023 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126586#2019775 (10Cmjohnson) 5Open>3Resolved [19:11:23] (03PS1) 10Tim Landscheidt: vagrant_lxc: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270025 [19:12:47] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2019786 (10EBernhardson) So it looks like the next steps are # Include 4 backend boxes and 4 varnish boxes in the budget ask for strategic goals, FY16-17 # Acquire 4 back... 
[19:12:59] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2019789 (10Dzahn) [19:13:19] 10Ops-Access-Requests, 6operations, 3Discovery-Search-Sprint: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1993559 (10Dzahn) [19:13:35] (03PS1) 10Subramanya Sastry: sudo: Run update_parsoid.sh as root, not parsoid [puppet] - 10https://gerrit.wikimedia.org/r/270026 [19:16:20] (03PS2) 10Dzahn: OTRS: remove ssl cert and config [puppet] - 10https://gerrit.wikimedia.org/r/269877 (https://phabricator.wikimedia.org/T122320) [19:17:35] (03CR) 10Dzahn: [C: 032] OTRS: remove ssl cert and config [puppet] - 10https://gerrit.wikimedia.org/r/269877 (https://phabricator.wikimedia.org/T122320) (owner: 10Dzahn) [19:19:07] (03PS1) 10Ottomata: Fix hadoop_namenode_opts param, move hadoop logstash logging configs into separate class [puppet] - 10https://gerrit.wikimedia.org/r/270027 [19:21:28] (03PS1) 10Chad: all wikis to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270029 [19:22:54] (03PS1) 10Chad: Revert "Revert "Add \n to wikiversions.json"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270030 [19:23:50] (03CR) 10Ottomata: [C: 032] Fix hadoop_namenode_opts param, move hadoop logstash logging configs into separate class [puppet] - 10https://gerrit.wikimedia.org/r/270027 (owner: 10Ottomata) [19:24:04] (03PS9) 10Ottomata: Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) [19:27:13] !log ruthenium - chmod g+w on git repo so wikidev members can deploy [19:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:33] (03CR) 10Chad: [C: 032] Revert "Revert "Add \n to wikiversions.json"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270030 
(owner: 10Chad) [19:31:33] (03CR) 10Alex Monk: [C: 031] "the modules/mediawiki/manifests/web/beta_sites.pp change is just a new linebreak which could be removed, lgtm otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/270009 (owner: 10Krinkle) [19:32:02] (03Merged) 10jenkins-bot: Revert "Revert "Add \n to wikiversions.json"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270030 (owner: 10Chad) [19:33:24] (03CR) 10Dzahn: "i ran chmod g+w on the files on those files in /usr/lib/parsoid on ruthenium, since the group owning the files is already wikidev and all " [puppet] - 10https://gerrit.wikimedia.org/r/270026 (owner: 10Subramanya Sastry) [19:34:38] (03PS4) 10Dzahn: delete SSL cert for ticket.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) [19:34:44] (03CR) 10Dzahn: [C: 032] delete SSL cert for ticket.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) (owner: 10Dzahn) [19:34:51] (03CR) 10Ottomata: [C: 032] Include analytics_cluster::client role on analytics1026 for testing [puppet] - 10https://gerrit.wikimedia.org/r/269844 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [19:35:48] (03CR) 10Subramanya Sastry: "This command is run as part of the update_parsoid.sh script which also needs to restart the services which requires sudo permissions. 
So, " [puppet] - 10https://gerrit.wikimedia.org/r/270026 (owner: 10Subramanya Sastry) [19:39:11] (03PS5) 10Dzahn: delete SSL cert for ticket.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) [19:39:57] (03PS1) 10Ottomata: Use proper path for grep in analytics_cluster/hadoop/logstash.pp [puppet] - 10https://gerrit.wikimedia.org/r/270032 [19:40:39] !log demon@mira Synchronized multiversion/MWWikiversions.php: newlines redux (duration: 02m 13s) [19:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:44] (03PS2) 10Ottomata: Use proper path for grep in analytics_cluster/hadoop/logstash.pp [puppet] - 10https://gerrit.wikimedia.org/r/270032 [19:41:56] (03CR) 10Ottomata: [C: 032 V: 032] Use proper path for grep in analytics_cluster/hadoop/logstash.pp [puppet] - 10https://gerrit.wikimedia.org/r/270032 (owner: 10Ottomata) [19:42:01] (03CR) 10Smalyshev: [C: 031] A/B/C test of control vs textcat vs accept-lang + textcat (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [19:42:30] 6operations, 10Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#2019869 (10Dzahn) >>! In T109395#2017922, @Nemo_bis wrote: > You could notify the authors of the messages included in the mbox file and see if someone complains. All of them? :o And since this... 
[19:43:00] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Puppet has 1 failures [19:43:38] (03PS6) 10Dzahn: delete SSL cert for ticket.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269753 (https://phabricator.wikimedia.org/T122320) [19:44:42] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:44:57] (03PS1) 10Ottomata: Include new analytics_cluster::hadoop::worker for testing on analytics1057 [puppet] - 10https://gerrit.wikimedia.org/r/270033 (https://phabricator.wikimedia.org/T109859) [19:45:41] RECOVERY - swift-account-server on ms-be1008 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:45:50] RECOVERY - swift-container-server on ms-be1008 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:45:52] !log restart swift on ms-be1008, sdd offlined [19:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:01] RECOVERY - swift-object-replicator on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:46:11] RECOVERY - swift-container-updater on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:46:21] RECOVERY - swift-object-auditor on ms-be1008 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:46:22] RECOVERY - swift-account-replicator on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:46:31] RECOVERY - swift-object-updater on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:46:31] RECOVERY - swift-account-reaper on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:46:32] RECOVERY - swift-object-server on 
ms-be1008 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:46:42] RECOVERY - swift-container-auditor on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:46:42] RECOVERY - swift-account-auditor on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:47:02] RECOVERY - swift-container-replicator on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:47:08] (03PS2) 10Ottomata: Include new analytics_cluster::hadoop::worker for testing on analytics1057 [puppet] - 10https://gerrit.wikimedia.org/r/270033 (https://phabricator.wikimedia.org/T109859) [19:47:14] (03CR) 10Ottomata: [C: 032 V: 032] Include new analytics_cluster::hadoop::worker for testing on analytics1057 [puppet] - 10https://gerrit.wikimedia.org/r/270033 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [19:49:40] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures [19:52:52] ACKNOWLEDGEMENT - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdi failed https://phabricator.wikimedia.org/T126630 [19:53:43] ACKNOWLEDGEMENT - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdd failed https://phabricator.wikimedia.org/T126627 [19:53:53] (03PS1) 10Ottomata: Apply new analytics_cluster::hadoop::standby role on analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/270035 (https://phabricator.wikimedia.org/T109859) [19:57:35] (03CR) 10Ottomata: [C: 032] Apply new analytics_cluster::hadoop::standby role on analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/270035 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [19:58:50] (03PS1) 10Ottomata: Use new analytics_cluster::hadoop::master role on analytics1001 [puppet] - 
10https://gerrit.wikimedia.org/r/270037 (https://phabricator.wikimedia.org/T109859) [20:00:05] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160211T2000). Please do the needful. [20:00:11] Yo jouncebot [20:00:14] Let's do this [20:00:21] (03CR) 10Chad: [C: 032] all wikis to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270029 (owner: 10Chad) [20:00:47] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270029 (owner: 10Chad) [20:01:16] !log demon@mira rebuilt wikiversions.php and synchronized wikiversions files: remaining wikis to wmf.13 [20:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:09] 6operations, 10OTRS, 7HTTPS, 5Patch-For-Review: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#2019918 (10Dzahn) deleted ticket.wikimedia.org.key in private repo pinged Robh, he is revoking the cert also deleted "new.ticket.wikimedia.org... [20:03:25] (03PS3) 10Krinkle: beta: Use public-wiki-rewrites.incl to better match mediawiki production [puppet] - 10https://gerrit.wikimedia.org/r/270009 [20:03:41] (03CR) 10Krinkle: "Fixed." [puppet] - 10https://gerrit.wikimedia.org/r/270009 (owner: 10Krinkle) [20:08:54] !log mendelevium (OTRS) - delete unused ssl cert files, shred key [20:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:53] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#2019941 (10mark) Approved. 
[20:14:23] (03PS3) 10Dzahn: Fix the meta:System_administrators table generation script [puppet] - 10https://gerrit.wikimedia.org/r/269959 (owner: 10Alex Monk) [20:14:30] (03PS1) 10Ottomata: Move hadoop_user_posix_groups setting to eqiad/cdh/hadoop/users.yaml [puppet] - 10https://gerrit.wikimedia.org/r/270040 (https://phabricator.wikimedia.org/T109859) [20:14:32] (03CR) 10Dzahn: [C: 032] Fix the meta:System_administrators table generation script [puppet] - 10https://gerrit.wikimedia.org/r/269959 (owner: 10Alex Monk) [20:15:19] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#2019950 (10Paladox) [20:15:43] 6operations, 6Labs: RDNS for some labs instance IPs resolve to multiple different instances - https://phabricator.wikimedia.org/T115194#2019952 (10dduvall) [20:17:23] 6operations, 6Labs: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2019965 (10hashar) [20:18:41] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[20:18:50] (03CR) 10Ottomata: [C: 032] Move hadoop_user_posix_groups setting to eqiad/cdh/hadoop/users.yaml [puppet] - 10https://gerrit.wikimedia.org/r/270040 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [20:20:14] (03PS2) 10Ottomata: Use new analytics_cluster::hadoop::master role on analytics1001 [puppet] - 10https://gerrit.wikimedia.org/r/270037 (https://phabricator.wikimedia.org/T109859) [20:23:40] (03CR) 10Ottomata: [C: 032] Use new analytics_cluster::hadoop::master role on analytics1001 [puppet] - 10https://gerrit.wikimedia.org/r/270037 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [20:25:05] 6operations, 6Labs: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2019987 (10dduvall) [20:26:59] (03CR) 10Ottomata: [C: 031] Add ferm rules for eventlogging udp receiver [puppet] - 10https://gerrit.wikimedia.org/r/269986 (https://phabricator.wikimedia.org/T113343) (owner: 10Muehlenhoff) [20:29:00] (03PS1) 10Ottomata: Include new analytics_cluster::client role on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270045 [20:30:28] (03CR) 10Ottomata: [C: 032 V: 032] Include new analytics_cluster::client role on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270045 (owner: 10Ottomata) [20:30:49] 6operations, 10DBA, 5Patch-For-Review: Reimage db2012 - https://phabricator.wikimedia.org/T126209#2020004 (10jcrespo) 5Open>3Resolved dbstore2001 reimported too, from db2012. There could be some junk at m3, but at 70GB, it is not worth the time investigating. [20:31:43] 6operations, 10RESTBase, 6Services, 3Mobile-Content-Service, 7Varnish: Enable caching for the Mobile Content Service's RESTBase public endpoints - https://phabricator.wikimedia.org/T113591#2020012 (10GWicke) A basic PR is now available at https://github.com/wikimedia/restbase/pull/511. This does not set... 
[20:32:39] (03PS1) 10Ottomata: Can't use analytics_cluster::client on stat1002 yet, refinery dependency [puppet] - 10https://gerrit.wikimedia.org/r/270046 [20:33:38] (03CR) 10Ottomata: [C: 032 V: 032] Can't use analytics_cluster::client on stat1002 yet, refinery dependency [puppet] - 10https://gerrit.wikimedia.org/r/270046 (owner: 10Ottomata) [20:35:21] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [20:37:18] (03PS1) 10Ottomata: Use new analytics_cluster::hadoop::worker role on all Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/270050 (https://phabricator.wikimedia.org/T109859) [20:37:53] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure: db1069 is running low on space - https://phabricator.wikimedia.org/T124464#2020049 (10jcrespo) at 78% and going down. [20:39:07] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2020051 (10Dzahn) @gamaliel @drmies (cc: @Jalexander) Sorry, i made a mistake and your password did not actually work yet when i closed this the other day. working on fixing that now.. 
[20:39:14] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2020053 (10Dzahn) 5Resolved>3Open [20:40:57] (03CR) 10Ottomata: [C: 032] Use new analytics_cluster::hadoop::worker role on all Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/270050 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [20:43:10] 6operations, 10Continuous-Integration-Infrastructure, 7HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020104 (10hashar) 3NEW [20:45:10] 6operations, 10Continuous-Integration-Infrastructure, 7HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020138 (10Paladox) [20:46:11] (03PS1) 10Tim Landscheidt: quarry: Move role classes to module role [puppet] - 10https://gerrit.wikimedia.org/r/270097 [20:50:01] RECOVERY - Disk space on cerium is OK: DISK OK [20:51:12] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:52:04] 6operations, 10Continuous-Integration-Infrastructure, 7HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020180 (10hashar) The link is provided by puppet. `modules/hhvm/manifests/init.pp` has: ``` hhvm... 
[20:55:44] (03PS1) 10Tim Landscheidt: ores: Move role classes to module role [puppet] - 10https://gerrit.wikimedia.org/r/270102 [20:57:11] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2020215 (10Dzahn) 5Open>3Resolved _now_ it should work @Gamaliel @Drmies @Jalexander [20:59:15] (03PS1) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [20:59:37] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#2020226 (10Papaul) Yes received some 2 TB drives today will be replacing the bad drive. [20:59:56] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#2020227 (10Papaul) p:5Triage>3Normal [21:00:11] (03PS1) 10Tim Landscheidt: testsystem: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270105 [21:02:11] (03PS1) 10Tim Landscheidt: simplelamp: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270106 [21:02:26] (03PS2) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [21:02:51] 6operations, 10Continuous-Integration-Infrastructure, 7HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020239 (10hashar) [21:03:32] (03CR) 10Hashar: [C: 04-1] "Cause T126658. The symlink /usr/lib/x86_64-linux-gnu/hhvm/extensions/current is created by the hhvm upstart job!" 
[puppet] - 10https://gerrit.wikimedia.org/r/269947 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [21:03:42] (03PS3) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [21:03:56] (03PS1) 10Tim Landscheidt: rcstream: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270107 [21:03:59] (03CR) 10Hashar: [C: 04-1] "Cause T126658. The symlink /usr/lib/x86_64-linux-gnu/hhvm/extensions/current is created by the hhvm upstart job!" [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [21:07:25] (03PS4) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [21:07:40] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [21:12:08] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#2020283 (10RobH) [21:13:29] 6operations, 10ops-eqiad: Update Label for oresrdb1001 (WMF4577) & relocate and update label for oresrdb1002 (WMF4578) - https://phabricator.wikimedia.org/T125565#2020286 (10RobH) 5Resolved>3Open [21:13:31] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#1990909 (10RobH) [21:21:14] (03PS3) 10Hashar: contint: lower tmpfs from 512MB to 256MB [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) [21:22:13] (03CR) 10Hashar: "Thanks for the review. I have changed my mind while writing the patch hence the inconsistency." 
[puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: 10Hashar) [21:37:15] something is wrong with servers, I can't get this page: https://meta.wikimedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvlimit=1&titles=Huggle/Config&rawcontinue=1&format=xml [21:37:41] WFM [21:37:44] petan: I am afraid we are just on the same provider that has issues. [21:37:55] petan, see traceroute [21:38:19] mhm... people are complaining in #huggle about login issues, I doubt they share same provider as me :/ [21:38:24] like, lots of people [21:38:48] (I currently cannot reach any WMF sites apart from status.wikimedia.org it seems.) [21:38:54] meh [21:39:09] I can use vpn I guess [21:40:15] From #wikimedia: phabricator.wikimedia.org and dewiki are down [21:41:18] Yepp, that user I just mentioned also has a UPC IP [21:42:32] I can't go to any Wikipedia pages anymore [21:42:51] MGC_, what's your provider? [21:44:14] MGC_: same issue here, and it seems like something bigger on EU level or whatever [21:44:20] My provider is the german Telekom [21:44:33] yep, central EU here as well, same problem [21:44:41] Okay... now that makes it more interesting now. [21:44:47] all the wikipedias are down [21:44:48] I know some people from Germany who don't have the problem [21:44:58] all languages [21:45:01] German Telekom, Czech UPC, Austrian UPC we have so far... [21:45:11] just fyi I got a server on czech backbone, and it's not possible to access any WMF sites from there [21:45:29] so WMF sites probably don't work from any ISP in czech right now, probably other countries as well [21:45:38] I'm in the Europe entry point and everything is ok for me, so probably outside of WMF infrastructure [21:45:51] petan: are the huggle problems Huggle2 users? [21:45:54] what about Amsterdam DC?
[21:45:59] it works for me (Greece) [21:46:02] bd808: nope, HG3 now [21:46:18] bd808: but I can't debug this without connectivity to WMF servers :/ [21:46:41] my traceroute output is at http://fpaste.org/321452/ if that helps anybody [21:47:29] There is also no noticeable HTTP request slowdown [21:47:31] petan: boo to connectivity problems. 1.27.0-wmf.13 just hit wikipedias which brings SessionManager back (was in wmf.11, problems, skipped in wmf.12) [21:48:10] I remember hearing that Huggle3 had some problems with changes to the cookies/headers caused by SessionManager [21:48:30] AWB definitely had: https://phabricator.wikimedia.org/T126577 [21:48:42] petan: anomie, tgr and I are here to help debug this. I don't think any of us run Huggle ever [21:48:43] bd808: I don't know, but I would be more happy if every release of MW wouldn't break everything :/ [21:48:58] More reports from Poland and France in #wikimedia-tech about not being able to reach Wikimedia servers [21:49:08] ok, let's page paravoid and bblack [21:49:30] ok I can't really do much about that with no connectivity to debug stuff, bd808 when did this release happen? people started reporting issues like a few hours ago [21:49:48] it works from here (Italy) [21:49:57] it seems extremely unlikely that sessionmanager has anything to do with it [21:50:03] andre__: clearly central EU is affected by this [21:50:06] I can see a drop in traffic in the varnish traffic dashboard on grafana [21:50:07] petan: 19:40Z [21:50:22] bd808: today?
aha in that case it might be related [21:50:24] ori: not connectivity but Huggle3 client issues [21:51:08] I see it now https://grafana.wikimedia.org/dashboard/db/varnish-http-requests [21:51:12] works for me from France, but my ISP peers with us [21:51:14] * andre__ would love to move Huggle session talk to #huggle instead :) [21:51:32] jynus: I was looking at https://grafana.wikimedia.org/dashboard/db/varnish-traffic [21:51:58] there is a huge drop in traffic in the last 7~8 minutes [21:52:36] the graph is wrong [21:52:50] volans_, make sure you are not counting yet-to-be received requests [21:53:21] I would not trust those grafana metrics for the last minutes. The counters are buffered and only show up after a few minutes [21:53:24] out of curiosity is there some documentation of what this new session-thing does eg. "how to update your code so it still works on new MW" [21:53:24] Copying from #wikimedia: http://status.wikimedia.org/ shows dns disruption [21:53:35] the drop I see is at 21:20 [21:53:53] esams only [21:54:03] btw I am watching for API warnings and I haven't been getting any, so it would be nicer if warnings were emitted for stuff that is going to be changed [21:54:05] or deprecated [21:54:14] petan, not now please :) [21:54:22] petan, #huggle please or private chat. thanks. [21:54:37] http://status.wikimedia.org/ shows DNS Server disruption, dewiki and phabricator are dead [21:54:57] 6operations, 5Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2020435 (10Legoktm) [21:57:38] (03PS1) 10BBlack: depool esams [dns] - 10https://gerrit.wikimedia.org/r/270116 [21:57:54] (03CR) 10BBlack: [C: 032 V: 032] depool esams [dns] - 10https://gerrit.wikimedia.org/r/270116 (owner: 10BBlack) [21:58:33] s2-master DNS change was pending too...? [21:58:44] -s2-master 5M IN CNAME db1024.eqiad.wmnet. [21:58:45] +s2-master 5M IN CNAME db1018.eqiad.wmnet.
[21:58:50] I went ahead and pushed it [21:58:58] please, I forgot [21:59:07] it doesn't matter [22:00:04] Krenair: Respected human, time to deploy Adyghe Wikipedia creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160211T2200). Please do the needful. [22:00:42] Krenair, delaying that [22:01:02] k [22:01:34] still digging around as to what's going on [22:03:43] I'm guessing at this point, some kind of broader routing disruption in the EU [22:04:10] andre__ is affected, is going to run traceroute [22:04:30] (03PS5) 10Legoktm: contint: Use slave-scripts/bin/php wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) [22:04:38] I'm here from my phone [22:04:42] http://fpaste.org/321452/ [22:04:56] andre__'s traceroute from ~15m ago [22:05:36] yepp. exactly the same output for text-lb.esams as for en.wp here, hence I won't repost my traceroute [22:05:43] paravoid: traffic dropped sharply in esams only, but not completely [22:05:51] now it's dropping a lot more, because I depooled it [22:06:10] (of course, if it turns out to be general EU routing issues, that may or may not help anyways) [22:06:21] :-) [22:06:23] check librenms' ports > transit + peering [22:06:30] k [22:06:34] tele2 maybe yeah.... [22:06:37] can we live without it? [22:06:45] to see if one of them shows the drop or it's all of them [22:07:09] re: andre's traceroute, the reverse one (from hooft to andre) would be useful [22:07:22] Whatever command I shall enter, paste it here. :P [22:07:30] bblack: I've been taking a look at traffic stats on DE-CIX and AMS-IX but there is not enough granularity to say much, if there is anything there it will show up later [22:07:36] the forward one shows tele2 and we have a port with them [22:07:42] (03CR) 10Legoktm: "PS5: Added the dependency, but not sure I did it right..."
[puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [22:07:49] https://ams-ix.net/technical/statistics and https://www.de-cix.net/about/statistics/ for reference [22:07:55] andre__: just send people your ip :) [22:08:59] bblack: we can live without esams, but be on alert for congested transit ports [22:09:07] same page on librenms [22:09:32] some reports of success, maybe after depool? "ato_> Hallelujah! Wikipedia is back :)" [22:10:09] YES. I can reach https://en.wikipedia.org/ again. [22:10:09] paravoid: tele2 but not e.g. datahop on that page for knams/esams stuff [22:10:18] (and I was via tele2) [22:11:19] <_joe_> uh seems I'm late to the party [22:11:21] > For issues requiring immediate attention, please contact the Tele2 Data NOC at dnoc@swip.net or give them a call on +46 8 1204 3025. They will escalate to an on call engineer if needed. For non-immediate issues, you may contact bgp4-adm@swip.net which is read by our senior routing staff. [22:11:25] (from http://ipv6.tele2.net/peering/) [22:11:37] nah no need to [22:11:40] works again for me [22:11:47] we can just disable the port if needed [22:11:56] dewiki is available again [22:12:02] ran a reverse traceroute from hooft to andre [22:12:26] if it's also tele2, likely where the issue is [22:12:39] is it? [22:12:50] oh, you meant run, not ran [22:13:00] yes [22:13:02] phabricator.wikimedia.org is still dead [22:13:04] sorry [22:13:16] andre__: can you PM me your IP? (http://whatismyip.org/) [22:13:21] enwiki works too [22:13:22] (this is where people are talking about the tool labs stuff, right?) [22:13:55] <_joe_> enterprisey: that would be #wikimedia-labs I guess?
[22:14:14] enterprisey: we know of issues for some European users to our sites, it is being investigated, we are trying to keep this channel focused for those fixing the issue [22:14:22] greg-g: okay, thanks [22:14:44] paravoid: yes, it's tele2 from hooft as well [22:14:47] i'll pastebin in a sec [22:15:09] phabricator.wikimedia.org works for me boshomi [22:15:23] https://dpaste.de/jVMZ/raw [22:15:35] <_joe_> boshomi: phabricator shouldn't be affected in fact [22:15:47] _joe_, it was for me [22:15:52] _joe_: it can be, we have edge termination for misc-web everywhere [22:16:01] yep [22:16:03] <_joe_> oh right now we have :P [22:16:05] <_joe_> sorry [22:16:34] ok, login to cr2-knams [22:16:39] I'm there already [22:16:40] and tirn [22:16:42] disable tele2 port? [22:16:54] yeah, either that or bgp [22:16:57] either works [22:17:30] !log disable port xe-0/0/2 on cr2-knams (tele2) [22:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:17:54] ok, andre__ now try pinging hooft.esams.wikimedia.org [22:18:00] it should work [22:18:16] i can reach andre__ from hooft [22:18:18] now phab is available [22:18:18] paravoid, yes, works for me [22:18:30] awesome [22:18:38] thanks everybody so far [22:18:38] (03PS1) 10BBlack: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/270119 [22:18:44] :)) [22:18:56] I don't think it is an EU issue because I'm in the EU, so I think it may be a provider or something else.
[22:18:57] seconds before [22:18:59] curl -I https://phabricator.wikimedia.org/ [22:19:00] curl: (7) Failed to connect to phabricator.wikimedia.org port 443: Connection timed out [22:19:21] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [22:19:21] boshomi: use -v and show a server IP [22:19:27] * andre__ is afk for 10-15min [22:19:28] <_joe_> boshomi: yes unfortunately the dns record for phabricator has a long TTL [22:19:30] <_joe_> 1 hour [22:19:40] oh good point, but either way, tele2 is off now [22:19:49] <_joe_> yeah and now it works for him :) [22:20:05] oh missed the import of "seconds before" [22:20:17] repooling esams... [22:20:25] curl -I -v https://phabricator.wikimedia.org/ [22:20:27] * Trying 208.80.154.251... [22:20:43] <_joe_> boshomi: is it working? [22:20:51] RECOVERY - DPKG on restbase1009 is OK: All packages OK [22:20:53] HTTP/1.1 200 OK yes [22:21:01] <_joe_> boshomi: ok, thanks :) [22:21:08] <_joe_> great [22:21:17] !log restbase1009 removed dpkg hold on nodejs [22:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:21:39] (03CR) 10BBlack: [C: 032] Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/270119 (owner: 10BBlack) [22:23:22] (03PS1) 10Mattflaschen: Use Wikipedia logo for all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270121 (https://phabricator.wikimedia.org/T49662) [22:27:26] (03PS1) 10Mobrovac: RESTBase: Move to HyperSwitch [puppet] - 10https://gerrit.wikimedia.org/r/270122 [22:27:42] <_joe_> WTF is hyperswitch?
[22:27:59] <_joe_> mobrovac: oh I see [22:28:07] :P [22:28:12] <_joe_> I thought it was some alternative nodejs framework [22:28:22] <_joe_> the name is hipstery enough :D [22:28:27] hahaha [22:28:54] _joe_: the name has been picked via consensus in the office [22:29:24] full disclosure, I voted for hyperswitch [22:30:03] <_joe_> mobrovac: boo dig +short hyperswitch.io [22:30:05] <_joe_> booo [22:30:29] <_joe_> I mean you call a framework hyperswitch and don't own the .io domain? [22:30:35] <_joe_> tsk tsk [22:30:38] hyperswtchr.ly [22:30:50] I was just going to say [22:30:53] _joe_: we're not hyper-hipster-y enough [22:30:54] <_joe_> MaxSem: evil! [22:30:56] .io is so 2015 [22:31:06] .ly is Libya [22:31:06] <_joe_> gwicke: .ly was 2014 ? [22:31:12] they're both horrible, politically speaking [22:31:13] _joe_: it came back [22:31:16] <_joe_> gwicke: what is 2015 then? [22:31:19] the cycles are short these days [22:31:23] do they already have the .hipster TLD? [22:31:29] <_joe_> gwicke: you can say that [22:31:36] _joe_: 2015 is buying your own TLD [22:31:37] MaxSem: probably [22:32:06] the true 2016 domain would be hyperswitch.xyz [22:32:12] (03PS2) 10Alex Monk: Use Wikipedia logo for all Wikipedias for Echo notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270121 (https://phabricator.wikimedia.org/T49662) (owner: 10Mattflaschen) [22:32:32] www.hyper.switch [22:32:36] <_joe_> www? [22:32:38] <_joe_> naaah [22:32:51] www is so 1994 [22:32:52] _joe_: hipster.hyper.switch [22:32:59] well it ain't going to be hyperswit.ch [22:33:13] <_joe_> apergos: ofc [22:34:41] lol @ ssh deployment-bastion.eqiad.wmflabs [22:34:45] "do not use this server" [22:34:57] er? [22:35:05] mobrovac, sounds like the same notice as tin.eqiad.wmnet?
[22:35:14] copy/pasta left-over from _joe_'s tin move [22:35:37] "Connect to 'deployment.eqiad.wmnet' instead" [22:35:41] euh not really [22:35:42] * apergos wonders what copy pasta tastes like [22:35:55] but we still don't even have all the traffic back from the DNS un-depool yet, all the way to stats and graphs anyways [22:35:59] apergos: spaghetti! [22:36:05] ewwww [22:36:16] https://newgtlds.icann.org/en/program-status/delegated-strings 21 January 2016 PAMPEREDCHEF [22:36:51] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:36:52] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:01] oh come on alsafi [22:37:11] you're getting old [22:37:12] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:31] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:50] PROBLEM - Check size of conntrack table on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:52] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:53] alsafi makes me think of gaddafi [22:37:59] apergos: "Fruehverrentung" though [22:38:00] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:11] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:38:51] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [22:39:10] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [22:39:14] ssh to server, fixed [22:39:17] (without doing stuff) [22:39:22] RECOVERY - Check size of conntrack table on alsafi is OK: OK: nf_conntrack is 0 % full [22:39:31] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:39:32] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [22:39:50] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures [22:40:18] mobrovac: it's both from Arabic. "Alsafi, corrupted from Athāfi, erroneously transcribed from the Arabic plural Athāfiyy, by which the nomads designated the tripods of their open-air kitchens" [22:40:20] RECOVERY - Disk space on alsafi is OK: DISK OK [22:40:21] RECOVERY - DPKG on alsafi is OK: All packages OK [22:41:17] bblack, how's it going? [22:42:36] still waiting for data to update fully in graphs and everything to settle back to "normal" looking [22:43:45] * ori removes "Mobile GETs" panels from , which have been blank since the text/mobile consolidation. [22:44:25] (03CR) 10Catrope: [C: 031] Use Wikipedia logo for all Wikipedias for Echo notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270121 (https://phabricator.wikimedia.org/T49662) (owner: 10Mattflaschen) [22:44:53] what makes it confusing was the EU was on a sharp traffic downslope anyways (normal daily pattern) [22:45:04] it makes it hard to line up when things returned to what should be normal :) [22:45:20] this data is in librenms right?
[22:45:41] well, everywhere [22:46:06] the graphite varnish-http-requests helps too [22:47:28] I think it's reasonable at this point to say everything looks pretty healthy [22:47:33] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2020866 (10Dzahn) {F3334534} [22:47:41] bblack: time to write a seasonal adjuster for Grafana that is TZ sensible [22:48:09] "last day" behind helps [22:49:01] bblack: it's back to normal [22:49:15] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#2020868 (10Papaul) a:5Papaul>3fgiunchedi Drive replacement complete. [22:49:17] blue is same time yesterday [22:49:21] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:53:56] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2020886 (10Papaul) Racking complete. [22:54:16] jynus, bblack: so are we good to proceed? [22:54:31] he is the boss here [22:59:24] so I would say yes [23:00:44] mutante: ah didn't know that! thnx [23:04:16] (03CR) 10Mobrovac: [C: 031] "Checked in labs." [puppet] - 10https://gerrit.wikimedia.org/r/270122 (owner: 10Mobrovac) [23:04:45] mutante: have time for a restbase hipster config change ^ ? [23:05:21] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2020906 (10ori) The Chrome team [[ http://blog.chromium.org/2016/02/transitioning-from-spdy-to-http2.html | announced ]] that they will drop support for SPDY (and NPN) in three months, on May 15th. [23:08:23] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2020920 (10RobH) a:5RobH>3Papaul @Papaul: I should have requested that you do the following for the new systems as you rack them, so I'll list it off and assign back to you. For the new syst... 
[23:09:40] ema, any thoughts? [23:09:57] Krenair: go ahead [23:10:53] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2020941 (10Dzahn) yubihsm hardware plugged in and detected installed the `yhsm-tools` package which provides yhsm-keystore-unlock -- keystore unlock yhsm-linux-add-entropy -- entropy seeder yhsm-decrypt-aead -- decrypt... [23:14:41] yes we're fine! [23:14:50] I've been distracted with real life since my last statement, sorry! [23:15:33] great [23:16:09] (03PS7) 10Alex Monk: Initial configuration for ady.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [23:16:38] (03PS1) 10Jcrespo: Installing jessie on db1024 [puppet] - 10https://gerrit.wikimedia.org/r/270133 [23:16:46] (03CR) 10Alex Monk: [C: 032] Initial configuration for ady.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [23:17:07] (03CR) 10Jcrespo: [C: 032] Installing jessie on db1024 [puppet] - 10https://gerrit.wikimedia.org/r/270133 (owner: 10Jcrespo) [23:17:19] sigh [23:17:26] I really wish people would not break the wiki creation script [23:17:30] (03Merged) 10jenkins-bot: Initial configuration for ady.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [23:17:32] :-) [23:17:52] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2020987 (10Papaul) All the 9 new es systems are racked. 
[23:17:53] you always have to complain every time [23:19:50] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2021001 (10RobH) [23:20:19] alright [23:20:20] looks good on mw1017 [23:21:10] :) [23:22:21] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:22:33] !log krenair@mira Synchronized dblists: https://gerrit.wikimedia.org/r/#/c/268004/ (duration: 01m 23s) [23:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:56] !log krenair@mira rebuilt wikiversions.php and synchronized wikiversions files: (no message) [23:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:42] (03CR) 10Dzahn: [C: 032] RESTBase: Move to HyperSwitch [puppet] - 10https://gerrit.wikimedia.org/r/270122 (owner: 10Mobrovac) [23:24:22] (03PS2) 10Dzahn: RESTBase: Move to HyperSwitch [puppet] - 10https://gerrit.wikimedia.org/r/270122 (owner: 10Mobrovac) [23:24:40] !log krenair@mira Synchronized langlist: https://gerrit.wikimedia.org/r/#/c/268004/ (duration: 01m 21s) [23:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:27] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/268004/ (duration: 01m 19s) [23:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:06] (03PS4) 10Dzahn: contint: lower tmpfs from 512MB to 256MB [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: 10Hashar) [23:27:23] (03CR) 10Dzahn: [C: 032] "already cherry-picked" [puppet] - 10https://gerrit.wikimedia.org/r/269880 (https://phabricator.wikimedia.org/T126545) (owner: 10Hashar) [23:28:46] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2021069 (10Dzahn) this also provided: python-pyhsm (see 
https://github.com/Yubico/python-pyhsm/) the example tool "yhsm-sysinfo.py" works and shows the device has a "power-up count" of 768. [23:28:53] !log krenair@mira Synchronized w/static/images/project-logos/adywiki.png: https://gerrit.wikimedia.org/r/#/c/268004/ (duration: 01m 19s) [23:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:15] krenair@mira:/srv/mediawiki-staging (master)$ mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php adywiki --backend=local-multiwrite [23:30:16] Exception encountered, of type "FileBackendException" [23:30:16] [ff9b6761] [no req] FileBackendException from line 154 of /srv/mediawiki-staging/php-1.27.0-wmf.13/includes/filebackend/FileBackendGroup.php: No backend defined with the name ``. [23:30:17] what? [23:30:24] (03PS4) 10Dzahn: Fix the meta:System_administrators table generation script [puppet] - 10https://gerrit.wikimedia.org/r/269959 (owner: 10Alex Monk) [23:30:25] arg [23:30:30] (03PS1) 10Ori.livneh: update my (ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/270138 [23:30:42] That's the backend hell day [23:31:11] #0 /srv/mediawiki-staging/php-1.27.0-wmf.13/extensions/WikimediaMaintenance/filebackend/setZoneAccess.php(12): FileBackendGroup->get(NULL) [23:31:21] yet that line is: $backend = FileBackendGroup::singleton()->get( $this->getOption( 'backend' ) ); [23:31:24] (03PS2) 10Ori.livneh: update my (ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/270138 [23:32:04] Krenair: heh, are you set up to automatically be added to my puppet changesets? [23:32:13] (03CR) 10Ori.livneh: [C: 032 V: 032] update my (ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/270138 (owner: 10Ori.livneh) [23:32:21] ori, no, changesets touching certain directories [23:32:40] (03PS5) 10Dzahn: Fix the meta:System_administrators table generation script [puppet] - 10https://gerrit.wikimedia.org/r/269959 (owner: 10Alex Monk) [23:33:51] hmm. 
passing --help to that script doesn't reveal --private/--backend params [23:33:57] yet the code tries to set that up [23:34:14] (03PS1) 10Dereckson: Fix wgUploadPath for labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270140 [23:34:21] (03CR) 10Ori.livneh: [C: 031] VCL: drop default ttl_cap to 21 days [puppet] - 10https://gerrit.wikimedia.org/r/269968 (https://phabricator.wikimedia.org/T124954) (owner: 10BBlack) [23:37:13] doesn't appear to be another hhvm vs. php argument parsing thing [23:39:01] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [23:40:42] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:40:49] Ugh. [23:40:54] gj Reedy [23:41:22] What've I broken now? [23:41:55] obviously Krenair's spirit [23:42:11] actually, maybe not you. you were just the last one to touch the file :p [23:42:23] - public function construct() { [23:42:23] + public function __construct() { [23:42:30] :| [23:42:41] how was this broken for so long [23:44:13] I've run this script before... yet it was apparently committed entirely broken in september 2012? [23:44:39] PHP 5.3 played nice? ;D [23:45:36] Ah [23:45:38] maybe this was it: https://gerrit.wikimedia.org/r/#/c/261586/8/maintenance/Maintenance.php [23:46:01] Now it checks mParams before recording params [23:49:56] okay, so now I need to fix search [23:50:38] error given while running updateSearchIndexConfig was this: [23:50:42] Couldn't update mappings. Here is elasticsearch's error message: RemoteTransportException[[elastic1001][inet[/10.64.0.108:9300]][indices:admin/mapping/put]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [page]) within 2m]; [23:51:31] but elastic1001 does not appear to be listening on that port [23:51:42] any ideas ebernhardson? 
[23:52:25] (03PS1) 10Jcrespo: s2: bye bye coredb; hello mariadb 10 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/270142 [23:52:54] Reedy: sorry [23:53:10] !log ori@mira Synchronized php-1.27.0-wmf.13/extensions/MobileFrontend: I1fe8538eb67: Provide low-resolution NetSpeed option in Special:MobileOptions (duration: 01m 22s) [23:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:16] (03CR) 10Jcrespo: [C: 032] s2: bye bye coredb; hello mariadb 10 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/270142 (owner: 10Jcrespo) [23:56:56] (03PS1) 10Reedy: Add CirrusSearch-production.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270143 [23:58:21] RECOVERY - Host mw2173 is UP: PING OK - Packet loss = 0%, RTA = 36.86 ms [23:59:37] The firewall rule is there for 9300