[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151103T0000). [00:00:04] ebernhardson Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:25] 6operations: Make ops-l a list for humans again (no cheating) - https://phabricator.wikimedia.org/T117508#1776136 (10Krinkle) +1 for separate aliases. Individual teams or people that own a particular feed (or are interested in it) can add themselves to said aliases. Similar to what we do in puppet/icinga already... [00:01:36] Hi, I've added 3 config changes patches. Sorry for the late addition. [00:03:52] Dereckson: looks like its you me and krenair, i can start pushing things out i suppose [00:04:00] I'm sort of here [00:04:09] Failing at focusing on other things [00:04:20] (03CR) 10EBernhardson: [C: 032] Enable WikidataPageBanner on fr.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246169 (https://phabricator.wikimedia.org/T115023) (owner: 10Dereckson) [00:04:44] (03Merged) 10jenkins-bot: Enable WikidataPageBanner on fr.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246169 (https://phabricator.wikimedia.org/T115023) (owner: 10Dereckson) [00:06:09] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/246169/ (duration: 00m 18s) [00:06:09] Dereckson: first patch is out ^ [00:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:30] Testing. 
[00:06:30] (03CR) 10EBernhardson: [C: 032] Add *.unesco.org to server-side upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246167 (https://phabricator.wikimedia.org/T115338) (owner: 10Dereckson) [00:06:54] (03Merged) 10jenkins-bot: Add *.unesco.org to server-side upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246167 (https://phabricator.wikimedia.org/T115338) (owner: 10Dereckson) [00:07:05] (03CR) 10EBernhardson: [C: 032] Add www.webarchive.org.uk to server-side upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250595 (https://phabricator.wikimedia.org/T116179) (owner: 10Dereckson) [00:07:33] (03Merged) 10jenkins-bot: Add www.webarchive.org.uk to server-side upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250595 (https://phabricator.wikimedia.org/T116179) (owner: 10Dereckson) [00:08:33] 246169 works. [00:09:06] Dereckson: cool, pushing out the next ones [00:09:13] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: add www.webarchive.co.uk and *.unesco.org to server side upload whitelist (duration: 00m 18s) [00:09:39] Dereckson: ^^ [00:10:25] (03CR) 10EBernhardson: [C: 032] Add new VE RESTBase URL config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250590 (owner: 10Alex Monk) [00:10:41] fyi, this config is expected to be a no-op for now ebernhardson [00:11:02] it will start getting used by VE as the next version rolls out via the train deploys [00:11:14] (03Merged) 10jenkins-bot: Add new VE RESTBase URL config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250590 (owner: 10Alex Monk) [00:11:15] ok [00:11:19] 6operations: Make ops-l a list for humans again (no cheating) - https://phabricator.wikimedia.org/T117508#1776155 (10yuvipanda) 525db7d3bd38802d3864af60a08e43c5d2494b18 in the private repo now moves catchpoint alerting away from ops@ [00:11:50] slave lag... 
[00:12:00] 250595 tested, no URL of a file in *.unesco.org to test 246167 [00:12:10] appears to be gone [00:12:26] ori: so catchpoint is gone now [00:12:40] ebernhardson: Krenair: Let me know when you're done. I've got a CentralAuth patch w/ AaronSchulz afterward. No rush :) [00:13:07] !log ebernhardson@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/250590/ (duration: 00m 18s) [00:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:13:48] Krenair: you're all set for train [00:13:53] thanks [00:13:57] 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1776172 (10Cmjohnson) Nuria I will need the following things from you User's direct supervisor has approved of access request via comment on phabricator task. Approval from proj... [00:14:10] (03CR) 10EBernhardson: [C: 032] Enable A/B test for combined language search. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241226 (https://phabricator.wikimedia.org/T3837) (owner: 10Smalyshev) [00:14:46] (03Merged) 10jenkins-bot: Enable A/B test for combined language search. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241226 (https://phabricator.wikimedia.org/T3837) (owner: 10Smalyshev) [00:18:06] Krenair: sure, mine might take a little longer. It refactors the code mediawiki-config uses for setting up logging, so i need to test it on testwiki first [00:18:15] err, Krinkle ^ [00:18:32] Yeah, no worries. [00:18:47] We've got 15 hours until the next deployment slow [00:18:49] slot [00:18:52] :) [00:19:07] 6operations: Make ops-l a list for humans again (no cheating) - https://phabricator.wikimedia.org/T117508#1776185 (10JohnLewis) There was a suggestion by (I think) @hashar about moving the catch point alerts to a new list like ops-infrastructure (or otherwise named list). I do find these annoying either way (so... 
[00:20:38] YuviPanda: thanks for those catch point mail death <3 [00:20:42] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:26:08] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/241226/4 (duration: 00m 17s) [00:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:26:41] (03CR) 10EBernhardson: [C: 032] Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [00:27:20] (03Merged) 10jenkins-bot: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [00:27:59] !log sync-common on mw1017 for https://gerrit.wikimedia.org/r/240615 [00:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:32:08] i realized...i dont know what i can do to force mw1017 to log some things outside using eval.php ... 
[00:32:21] i guess that will have to be close enough [00:34:46] !log ebernhardson@tin Synchronized wmf-config/avro/CirrusSearchRequestSet.avsc: https://gerrit.wikimedia.org/r/#/c/240615/ (duration: 00m 17s) [00:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:35:39] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/240615/ (duration: 00m 17s) [00:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:35:53] another spike of db errors for labswiki [00:35:56] (unrelated to deploys) [00:37:40] !log ebernhardson@tin Synchronized wmf-config/logging.php: https://gerrit.wikimedia.org/r/#/c/240615/ (duration: 00m 19s) [00:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:39:16] Krinkle: you should be good to go now [00:42:05] ebernhardson: OK [00:42:08] AaronSchulz: We're up [00:42:43] Krinkle: want to do the sync? [00:42:47] Yep [00:42:57] I'm staging on tin in a minute and then syncing to mw1017 [00:43:00] Then full out [00:47:23] Krenair: Why was jobqueue: delta metric type fix reverted 5 hours ago? [00:47:55] local revert on tin in wmf.4 [00:48:11] I left a comment explaining it. [00:48:37] Revert "jobqueue: Pass count value delta instead of $type for the inserts_actual metric" [00:48:37] [00:48:37] This reverts commit 23d8fd02dc8623a7955f34e2d1dd7bcf7c7263ef. [00:48:41] Thats all the commit says [00:49:07] Yes, see 23d8fd02dc8623a7955f34e2d1dd7bcf7c7263ef. [00:49:36] 2015-10-30 [00:49:37] 06:24 logmsgbot: krinkle@tin Synchronized php-1.27.0-wmf.4/includes/jobqueue/JobQueueRedis.php: (no message) (duration: 00m 18s) [00:49:39] You should've received an email when I left the comment. [00:49:41] That deployed it [00:49:48] and I confirmed that in the graphs that day [00:50:05] 4 days ago [00:50:09] It wasn't merged when I found it. 
[00:50:16] define merged [00:50:26] it was merged in gerrit, applied on tin and deployed on the cluster [00:50:27] actually applied on tin [00:51:09] (PS5) Dzahn: analytics: impala,refinery: small lint fixes [puppet] - https://gerrit.wikimedia.org/r/241318 [00:52:22] OK. It looks like it wasn't deployed yet, and it is most definitely deployed on the cluster and in the branch underneath that on tin [00:52:37] so I'm making that go away from local tin history before someone accidentally deploys that [00:52:44] I mean your revert wasn't deployed yet [00:54:40] (CR) EBernhardson: "imo we should switch all the portals within a short (same day?) time period, rather than a piecemeal approach that leaves people wondering" [puppet] - https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: MaxSem) [00:54:56] ori: btw job queue health seems to be changing in mysterious ways over the last couple of hours [00:54:57] https://grafana.wikimedia.org/dashboard/db/job-queue-health [00:55:06] first time that those two diverge since 4 days ago [00:55:09] The revert was for a commit that can't have completed the deployment process, it wasn't going to be deployed itself [00:55:18] This time in the other direction though [00:55:31] (CR) MaxSem: "Yup. But for starters, I'd like to break the least visible portal :P" [puppet] - https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) (owner: MaxSem) [00:55:44] ebernhardson, ^ [00:56:08] Krenair: I don't know what to say. The commit was merged in gerrit, and applied on tin and most definitely deployed to all app servers. Looking on any mw server or on local tin /srv/mediawiki/ shows it. [00:57:33] AaronSchulz: Staged on tin, syncing to mw1017 now [01:01:32] AaronSchulz: Visiting a page and checking the cache key from the command line shows that it works. Any particular test you wanna do before we sync further? 
[01:02:20] no [01:02:40] that's fine [01:03:05] !log krinkle@tin Synchronized php-1.27.0-wmf.4/includes/: I58836a24b9e239f: User cache key (duration: 00m 21s) [01:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:03:51] !log krinkle@tin Synchronized php-1.27.0-wmf.4/extensions/CentralAuth/includes/CentralAuthUser.php: Ia9673a448da81: User cache key (duration: 00m 17s) [01:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:05:20] 6operations, 7Shinken: Make the Shinken IRC alert and icinga-wm bots use colors - https://phabricator.wikimedia.org/T113785#1776350 (10Dzahn) [01:09:20] 6operations, 10Math: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#1776355 (10Dzahn) @Physikerwelt ok,thanks for the explanation. That sounds to me like the ticket is rejected. @Mathmensch what do you think about the comments above? [01:09:41] 6operations, 10Math: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#1776356 (10Dzahn) p:5Normal>3Low [01:11:59] 6operations: request for ganeti vm for people.wm.org - https://phabricator.wikimedia.org/T117517#1776360 (10Dzahn) 3NEW [01:12:13] 6operations, 10vm-requests: request for ganeti vm for people.wm.org - https://phabricator.wikimedia.org/T117517#1776360 (10Dzahn) [01:16:28] Krinkle: so the enqueue graphs are messed up again? [01:17:03] AaronSchulz: Depends, are we sure it's just the graphs? 
[01:17:23] AaronSchulz: https://grafana.wikimedia.org/dashboard/db/job-queue-health?from=now-7d [01:17:45] They started diverging about 3 hours ago [01:17:47] Krinkle: I mean https://grafana-admin.wikimedia.org/dashboard/db/job-queue-rate?panelId=5&fullscreen&edit [01:18:22] Interesting [01:18:42] My sync fixed it I guess [01:18:50] it was fine between Friday and 4 hours ago [01:18:54] 6operations, 10vm-requests: request for ganeti vm for people.wm.org - https://phabricator.wikimedia.org/T117517#1776379 (10Dzahn) [01:19:23] yeah seems to better again [01:19:27] 6operations, 10vm-requests: request for ganeti vm for people.wm.org - https://phabricator.wikimedia.org/T117517#1776392 (10Dzahn) a:3akosiaris [01:19:55] e.g. 0 and 1 keys are down to 0/sec [01:19:58] Yeah [01:20:53] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:21:01] (03PS6) 10Dzahn: analytics: impala,refinery: small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/241318 [01:21:14] (03CR) 10Dzahn: [C: 032] "no diff - http://puppet-compiler.wmflabs.org/1169/" [puppet] - 10https://gerrit.wikimedia.org/r/241318 (owner: 10Dzahn) [01:22:41] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [01:24:49] (03PS2) 10Dzahn: deactivate wikidisclosure.[com|org] [dns] - 10https://gerrit.wikimedia.org/r/243973 [01:36:31] (03CR) 10Dzahn: [C: 031] "i see no sign that this was ever a working site or project or anything except that i can show we have it since at least 2009 (a blog post " [dns] - 10https://gerrit.wikimedia.org/r/243973 (owner: 10Dzahn) [01:37:40] (03CR) 10Dzahn: [C: 031] "we are _not_ going to use this for the shop, right ?:)" [dns] - 10https://gerrit.wikimedia.org/r/244084 (owner: 10Dzahn) [01:41:51] (03PS2) 10Dzahn: deactivate wikimedia.biz [dns] - 10https://gerrit.wikimedia.org/r/244084 (https://phabricator.wikimedia.org/T81344) [01:44:07] (03CR) 
NehalDaveND: [C: 1] "I have seen this it is proper. We can proceed..." [mediawiki-config] - https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: Luke081515) [01:46:05] (CR) Sbgujarat: [C: 1] "Yes I have also seen this it patch is proper Thank you @Luke081515" [mediawiki-config] - https://gerrit.wikimedia.org/r/250454 (https://phabricator.wikimedia.org/T117314) (owner: Luke081515) [01:47:43] operations, netops: Re-Label transcode1 and transcode2 - https://phabricator.wikimedia.org/T81345#1776462 (Dzahn) [01:54:47] operations: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1776482 (Dzahn) a:Dzahn [01:56:03] mutante: Hm.. wikifamily.org aliases wikimedia.com in ops-dns, but yet in a browser it goes to www.wikipedia not www.wikimedia (like wikimedia.com) [01:56:41] Krinkle: some of them have actual Apache config on cluster, and some don't but the default redirects them anyways to the portal page [01:56:58] yes, they both redirect [01:57:10] but wikifamily redirects to wikipedia instead of wikimedia like the dns says [01:57:14] which seems odd? [01:59:29] Krinkle: yea, agreed. that is an odd combination. usually that would be linked to wikipedia.org in DNS too, if it ends up redirecting to Wikipedia.org in Apache... [01:59:36] though, it doesn't make a difference in the end [01:59:42] yeah [01:59:46] because it all ends up with redirects.conf and that has [02:00:03] Also, is it intentional that one can not/should not copy files from home directories between hosts? Not even from bastion? [02:00:07] # funnel *wikifamily.org //www.wikipedia.org [02:00:21] Since we stopped forwarding keys I now find myself having to copy any files for terbium/people.wm.o via localhost first [02:00:44] what do other people do for this purpose? 
[02:01:14] in one case i used puppet to get an rsyncd on the other host [02:01:21] in some cases i downloaded and uploaded [02:01:29] and i did "scp -3" [02:02:33] that is still technically up and download and no progress bar whatsoever.. but yea [02:02:57] but at least in one step and i don't really copy the files to my disk [02:04:02] about the "intended" part, not sure, i just know the intention was to not allow agent forwarding [02:04:46] from a ferm-rule point of view it would also be limited though.. [02:15:09] (PS1) Dzahn: add private IP for hafnium [dns] - https://gerrit.wikimedia.org/r/250611 (https://phabricator.wikimedia.org/T117449) [02:15:11] (PS1) Dzahn: hafnium, remove the public IP [dns] - https://gerrit.wikimedia.org/r/250612 (https://phabricator.wikimedia.org/T117449) [02:16:32] Krinkle, I think you can set up a directory with a simple python HTTP server and download across the cluster that way [02:18:39] rsync::server::module { 'foo': path => '/home/baz/files', read_only => 'no', [02:19:04] host_allow => 'someotherbox', } [02:22:02] (CR) coren: [C: 1] "I can see no semantic change, this should be all okay." [puppet] - https://gerrit.wikimedia.org/r/249675 (owner: Dzahn) [02:22:25] !log l10nupdate@tin Synchronized php-1.27.0-wmf.4/cache/l10n: l10nupdate for 1.27.0-wmf.4 (duration: 07m 00s) [02:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:59] (PS2) Dzahn: gridengine: lint fixes [puppet] - https://gerrit.wikimedia.org/r/249675 [02:24:01] (CR) Dzahn: [C: 2] gridengine: lint fixes [puppet] - https://gerrit.wikimedia.org/r/249675 (owner: Dzahn) [02:25:17] operations, Analytics, Analytics-Kanban, Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1776521 (GWicke) @ottomata: In my recollection of the discussion & the log you linked to, the question of which REST producer proxy to use was left open. Our priority is to get... 
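The 02:16:32 suggestion (serve a directory with a simple Python HTTP server and download from the other host) could look roughly like the sketch below. It is only an illustration under assumptions: the helper names `serve_directory`, `fetch` and `demo` are hypothetical, both ends run on localhost so the example is self-contained, and nothing here is taken from the actual cluster setup. Firewall (ferm) rules between real hosts may block this, and the server has no authentication.

```python
import functools
import http.server
import tempfile
import threading
import urllib.request
from pathlib import Path


def serve_directory(directory, host="127.0.0.1", port=0):
    """Serve `directory` over HTTP from a background thread.

    port=0 asks the OS for a free port; read it from server.server_address.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(url, destination):
    """Download `url` into `destination`; return the number of bytes written."""
    # An empty ProxyHandler ignores any proxy settings from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=10) as response:
        data = response.read()
    Path(destination).write_bytes(data)
    return len(data)


def demo():
    """Round-trip one file: 'source host' and 'target host' are both local."""
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        Path(src, "notes.txt").write_text("copied across hosts\n")
        server = serve_directory(src)
        try:
            port = server.server_address[1]
            fetch(f"http://127.0.0.1:{port}/notes.txt", Path(dst, "notes.txt"))
            return Path(dst, "notes.txt").read_text()
        finally:
            server.shutdown()
            server.server_close()
```

On a real source host the serving half is what `python3 -m http.server` does from inside the directory; the `fetch` half would then run on the destination host against the source host's name and port.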
[02:25:44] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (phab-project-changes)$ ssh gerrit gerrit ls-projects --has-acl-for ldap/wmf [02:25:44] fatal: internal server error [02:25:49] dammit gerrit [02:26:10] :/ [02:26:15] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.4) at 2015-11-03 02:26:14+00:00 [02:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:37] Krenair: "ssh gerrit gerrit" ? [02:26:57] the first 'gerrit' is the hostname I have setup in my ssh config to point to gerrit.wikimedia.org port 29418 [02:27:03] with my correct remote username [02:27:19] the second gerrit is the command name you have to type to do anything there [02:27:36] there's no interactive shell [02:28:03] yea, i have used it a few times [02:28:15] to create a project [02:28:36] does the --has-acl-for thing work on newer versions? [02:29:04] is it really "ldap/wmf" or maybe just "wmf" [02:29:23] yes, just not for ldap groups, I think [02:30:36] sigh, gotcha [02:31:08] (03PS1) 10Dzahn: hafnium: switch from public to private IP [puppet] - 10https://gerrit.wikimedia.org/r/250614 (https://phabricator.wikimedia.org/T117449) [02:31:16] gerrit can't list members of an ldap group either [02:41:38] (03PS3) 10Dzahn: Re enable bzip2 for gitblit downloads [puppet] - 10https://gerrit.wikimedia.org/r/250447 (owner: 10Paladox) [02:43:04] (03CR) 10Dzahn: [C: 032] Re enable bzip2 for gitblit downloads [puppet] - 10https://gerrit.wikimedia.org/r/250447 (owner: 10Paladox) [02:45:16] (03CR) 10Dzahn: "would it help to set $cluster in hiera instead of the puppet role class?" 
[puppet] - 10https://gerrit.wikimedia.org/r/250068 (owner: 10Dzahn) [02:50:21] j /win go #fauve [02:51:16] (03PS1) 10Dzahn: dns::recursor: move 'standard' and v6 IP to role [puppet] - 10https://gerrit.wikimedia.org/r/250616 [02:55:45] (03PS1) 10Dzahn: analytics::mysql::meta, move standard/fw to role [puppet] - 10https://gerrit.wikimedia.org/r/250617 [02:55:47] try to request details of a non existent group = internal service error [02:55:49] classic gerrit [02:56:05] (03PS2) 10Dzahn: dns::recursor: move 'standard' and v6 IP to role [puppet] - 10https://gerrit.wikimedia.org/r/250616 [02:56:12] server* [02:56:39] (03CR) 10jenkins-bot: [V: 04-1] analytics::mysql::meta, move standard/fw to role [puppet] - 10https://gerrit.wikimedia.org/r/250617 (owner: 10Dzahn) [03:00:27] (03PS1) 10Dzahn: bastion: move 'standard' include to role [puppet] - 10https://gerrit.wikimedia.org/r/250618 [03:03:02] (03PS2) 10Dzahn: analytics::mysql::meta, move standard/fw to role [puppet] - 10https://gerrit.wikimedia.org/r/250617 [03:04:11] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 1 failures [03:06:51] (03PS1) 10Dzahn: deployment::server: move IPv6 int to role [puppet] - 10https://gerrit.wikimedia.org/r/250619 [03:17:17] (03PS1) 10Dzahn: icinga: move "standard" inclusion to icinga::web [puppet] - 10https://gerrit.wikimedia.org/r/250621 [03:23:32] mutante, I've tried to document a bunch of LDAP permissions here: https://wikitech.wikimedia.org/wiki/LDAP_Groups [03:28:43] Krenair good work, nice. there is also servermon.wm.org (ops), and eh.. non-LDAP passwords for a user "wmf" :p [03:28:57] 36 AuthUserFile /etc/apache2/htpasswd.stats [03:28:57] 37 Require user wmf [03:29:07] stats.wikimedia.org/htdocs/reportcard/staff :p [03:29:31] * Krenair grumbles [03:29:40] yea, that should also be LDAP, at least [03:29:45] gotta run though.. [03:29:59] yeah not sure why servermon is a localpassword [03:30:36] 2 separate things. 
that was stats.wm.org [03:30:50] /me cancels the damn Amazon Prime autorenewal [03:31:21] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:31:34] kibana? [03:31:56] kibana is LDAP no? [03:32:12] templates/kibana/apache-auth-local.erb: AuthName "<%= @auth_realm %>" [03:32:40] 3 AuthUserFile <%= @auth_file %> [03:32:53] hrmm.. no [03:37:41] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1776570 (10KartikMistry) [03:38:06] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1334128 (10KartikMistry) [04:17:23] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [100000000.0] [05:04:31] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [05:09:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [05:15:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [05:20:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [05:29:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [05:34:42] (03PS1) 10Dzahn: wdqs: indentation of => is not properly.. 
[puppet] - https://gerrit.wikimedia.org/r/250627 [05:35:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [05:36:01] the noisy 5xx alerts look to be the result of an exceptionally quiet(!) and relatively error-free stretch which has trained the alert to be more sensitive [05:36:16] in case you are reading this and feel alarmed [05:37:35] thanks ori, i was..to a certain extent [05:38:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Nov 3 05:38:33 UTC 2015 (duration 38m 32s) [05:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:39:51] (CR) Dzahn: [C: 2] "yes, it looks like nitpicking, but there are just a couple hundred and then we can enable --no-arrow_alignment-check again and be done with it" [puppet] - https://gerrit.wikimedia.org/r/250627 (owner: Dzahn) [05:40:22] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [05:42:25] operations, Math: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#1776605 (Mathmensch) @Physikerwelt Can I expect the new rendering mode to be complete in the next summer term? (I'm currently trying to accumulate some free time there for work at wikibooks.) 
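The 05:36:01 explanation (a quiet, error-free stretch makes the anomaly alert more sensitive) can be illustrated with a toy band detector. This is a simplified stand-in, not the check that runs on graphite1001: it assumes a plain mean ± 3 standard deviations band, and every number below is hypothetical rather than read from the real graphs.

```python
import statistics


def confidence_band(history, width=3.0):
    """Return (lower, upper) as mean +/- width * population stddev of history."""
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    return mean - width * spread, mean + width * spread


def count_outliers(history, recent, width=3.0):
    """Count how many recent points fall above and below the band."""
    lower, upper = confidence_band(history, width)
    above = sum(1 for value in recent if value > upper)
    below = sum(1 for value in recent if value < lower)
    return above, below


# Hypothetical 5xx ratios in percent (illustrative only).
noisy_history = [0.8, 1.4, 0.6, 1.6, 0.9, 1.5, 0.7, 1.3]
quiet_history = [0.50, 0.52, 0.49, 0.51, 0.50, 0.51, 0.49, 0.50]
recent = [0.9, 1.0, 0.8, 0.95, 0.4]

# Same recent traffic, different training history.
print(count_outliers(noisy_history, recent))
print(count_outliers(quiet_history, recent))
```

With the noisy history the band is wide and the recent points sit inside it; with the quiet history the band shrinks, so the same recent points land outside it and would be reported as "N data above and M below the confidence bounds".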
[05:44:02] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [05:54:37] (03PS1) 10Dzahn: dumps,ganglia,nodepool: indentation of => [puppet] - 10https://gerrit.wikimedia.org/r/250628 [05:54:39] (03PS1) 10Dzahn: eventlogging: fix indentation for lint checks [puppet] - 10https://gerrit.wikimedia.org/r/250629 [06:05:17] (03PS1) 10Dzahn: toollabs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250632 [06:08:22] (03CR) 1020after4: iridium system-wide gitconfig needs http.proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250370 (owner: 1020after4) [06:10:31] (03PS1) 10Dzahn: scap: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250634 [06:13:50] (03PS1) 10Dzahn: apache: indentation of => [puppet] - 10https://gerrit.wikimedia.org/r/250635 [06:22:22] (03PS1) 10Dzahn: teredo: minimal lint fix [puppet] - 10https://gerrit.wikimedia.org/r/250637 [06:26:10] (03PS1) 10Dzahn: authdns: minimal lint fix [puppet] - 10https://gerrit.wikimedia.org/r/250638 [06:29:50] (03PS1) 10Dzahn: jmxtrans: auto-fixed indentation of => [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/250641 [06:30:11] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:12] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:26] (03CR) 10jenkins-bot: [V: 04-1] jmxtrans: auto-fixed indentation of => [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/250641 (owner: 10Dzahn) [06:30:51] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: puppet fail [06:30:59] (03CR) 10Dzahn: [C: 04-1] "a test of the --fix option (vs. 
a human doing it), but looks like crap :)" [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/250641 (owner: 10Dzahn) [06:31:12] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:22] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:42] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:42] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:52] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:32] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:21] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 3 failures [06:33:32] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:12] (03PS1) 10Dzahn: deployment,aptly: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/250642 [06:55:22] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:56:22] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:56:52] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:56:52] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:57:21] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:57:52] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures 
[06:58:31] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:32] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:51] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:08] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/250447 (owner: 10Paladox) [07:47:03] PROBLEM - puppet last run on mw2031 is CRITICAL: CRITICAL: puppet fail [07:51:46] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1776665 (10Arrbee) [07:54:12] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think we should remove this instead of moving it around." [puppet] - 10https://gerrit.wikimedia.org/r/249345 (owner: 10Dzahn) [08:00:07] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me (also double-checked with puppet compiler)" [puppet] - 10https://gerrit.wikimedia.org/r/250619 (owner: 10Dzahn) [08:03:29] 6operations, 10vm-requests: request for ganeti vm for people.wm.org - https://phabricator.wikimedia.org/T117517#1776674 (10akosiaris) I see memory with a ?. Since this is going to be a webserver only thing, let's go for a conservative number like 2G for now and we can increase as needed. Sounds like a misc VM... 
[08:04:15] (03CR) 10Muehlenhoff: "Looks good to me (also double-checked with puppet compiler)" [puppet] - 10https://gerrit.wikimedia.org/r/250621 (owner: 10Dzahn) [08:04:24] (03CR) 10Muehlenhoff: [C: 031] icinga: move "standard" inclusion to icinga::web [puppet] - 10https://gerrit.wikimedia.org/r/250621 (owner: 10Dzahn) [08:14:22] RECOVERY - puppet last run on mw2031 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [08:15:01] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me (also double-checked in puppet compiler)" [puppet] - 10https://gerrit.wikimedia.org/r/250618 (owner: 10Dzahn) [08:20:47] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me (also-double-checked with puppet compiler)" [puppet] - 10https://gerrit.wikimedia.org/r/250616 (owner: 10Dzahn) [09:44:59] !log decomission xenon.eqiad.wmnet from cassandra, pending conversion to multi-instance [09:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:15] (03PS1) 10Filippo Giunchedi: install_server: move restbase test cluster to /srv [puppet] - 10https://gerrit.wikimedia.org/r/250649 [09:47:15] 6operations, 10Beta-Cluster-Infrastructure: [OPS] udp2log prevents udp2log-mw from starting - https://phabricator.wikimedia.org/T40995#1776756 (10hashar) 5Open>3declined a:3hashar udp2log is gone, at least from bastion. 
[09:47:26] (03PS2) 10Filippo Giunchedi: install_server: move restbase test cluster to /srv [puppet] - 10https://gerrit.wikimedia.org/r/250649 [09:47:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: move restbase test cluster to /srv [puppet] - 10https://gerrit.wikimedia.org/r/250649 (owner: 10Filippo Giunchedi) [09:50:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 648 [09:51:22] PROBLEM - Cassandra CQL query interface on xenon is CRITICAL: Connection refused [09:53:51] (03PS1) 10Filippo Giunchedi: cassandra: provision multiple instances on xenon [puppet] - 10https://gerrit.wikimedia.org/r/250650 [09:54:21] (03PS2) 10Filippo Giunchedi: cassandra: provision multiple instances on xenon [puppet] - 10https://gerrit.wikimedia.org/r/250650 [09:54:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: provision multiple instances on xenon [puppet] - 10https://gerrit.wikimedia.org/r/250650 (owner: 10Filippo Giunchedi) [09:55:09] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1776773 (10mobrovac) FWIW, one does not exclude the other: the EL-based service can be used in production, while the node-based REST proxy may be used for development and/or small... 
[09:55:11] RECOVERY - check_mysql on db1008 is OK: Uptime: 8788042 Threads: 1 Questions: 103707685 Slow queries: 59561 Opens: 103943 Flush tables: 2 Open tables: 64 Queries per second avg: 11.801 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:55:12] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:00] !log reimage xenon.eqiad.wmnet [09:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:56:09] mobrovac: ^ [09:56:52] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1776776 (10mobrovac) Please take a look at the [proposed event definitions](https://github.com/wikimedia/restevent/pull/5) and voice any concerns you... [09:57:01] kk godog thnx for the heads up [09:57:43] godog: btw, once reimaging is done, we'll need to do a deploy on xenon since by default the rb version from tin is used, and that one's outdated [10:00:04] mobrovac: ack, we can update that copy while xenon is reimaging though (?) [10:00:40] which copy? [10:00:51] the one that's outdated [10:01:01] on tin? no, we shouldn't really [10:01:38] we'll just change the remote ref and update on xenon [10:02:23] ah, why isn't the checkout on tin updated too though? 
[10:03:57] because we use ansible to deploy it [10:04:29] it just occurred to me that ansible could update tin's copy too [10:06:08] (03PS2) 10Giuseppe Lavagetto: New package, minor debian tweaks [debs/pybal] - 10https://gerrit.wikimedia.org/r/249984 [10:14:25] (03CR) 10Filippo Giunchedi: [C: 04-1] New package, minor debian tweaks (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/249984 (owner: 10Giuseppe Lavagetto) [10:22:18] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:32:13] (03CR) 10Alexandros Kosiaris: [C: 031] swift: no "if $hostname" in node blocks, use role [puppet] - 10https://gerrit.wikimedia.org/r/250072 (owner: 10Dzahn) [10:34:00] (03CR) 10Filippo Giunchedi: [C: 031] "(not for this code review) but I'm not sure what's the story with ganglia_aggregator (if we are still using it that is)" [puppet] - 10https://gerrit.wikimedia.org/r/250072 (owner: 10Dzahn) [10:34:53] PROBLEM - configured eth on xenon is CRITICAL: Connection refused by host [10:35:14] PROBLEM - dhclient process on xenon is CRITICAL: Connection refused by host [10:35:42] PROBLEM - puppet last run on xenon is CRITICAL: Connection refused by host [10:35:53] PROBLEM - salt-minion processes on xenon is CRITICAL: Connection refused by host [10:36:02] PROBLEM - service on xenon is CRITICAL: Connection refused by host [10:36:22] PROBLEM - Cassandra CQL query interface on xenon is CRITICAL: Connection refused [10:36:28] damn race, silenced [10:42:23] RECOVERY - configured eth on xenon is OK: OK - interfaces up [10:42:43] RECOVERY - dhclient process on xenon is OK: PROCS OK: 0 processes with command name dhclient [10:43:23] RECOVERY - salt-minion processes on xenon is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:44:24] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: puppet fail [10:46:16] (03PS3) 10Giuseppe Lavagetto: New package, minor debian tweaks [debs/pybal] - 
10https://gerrit.wikimedia.org/r/249984 [10:48:43] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:49:51] (03CR) 10Giuseppe Lavagetto: New package, minor debian tweaks (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/249984 (owner: 10Giuseppe Lavagetto) [10:53:48] (03PS2) 10Filippo Giunchedi: cassandra: multi-instance aware CQL checks [puppet] - 10https://gerrit.wikimedia.org/r/250439 (https://phabricator.wikimedia.org/T93886) [10:55:28] (03CR) 10Giuseppe Lavagetto: [C: 031] "Seems good, I expect it to be mergeable whenever you feel like it." [puppet] - 10https://gerrit.wikimedia.org/r/250439 (https://phabricator.wikimedia.org/T93886) (owner: 10Filippo Giunchedi) [10:55:49] <_joe_> godog: you addressed my only comment while I was writing it :P [10:56:52] _joe_: hahah a case of 'mind writing' as opposed to 'mind reading' [10:57:57] (03CR) 10Alexandros Kosiaris: [C: 031] cassandra: multi-instance aware CQL checks [puppet] - 10https://gerrit.wikimedia.org/r/250439 (https://phabricator.wikimedia.org/T93886) (owner: 10Filippo Giunchedi) [10:58:18] (03PS12) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [10:58:40] (03CR) 10Hashar: "Rebased, nodejs-legacy got moved to contint::packages::javascript" [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar) [10:58:44] (03PS1) 10Muehlenhoff: Move role declaration earlier to make the role keyword work [puppet] - 10https://gerrit.wikimedia.org/r/250659 [10:59:09] mobrovac: xenon is ready btw if you need to update restbase [11:00:04] RECOVERY - service on xenon is OK: OK - cassandra-a is active [11:00:16] !log start cassandra-a on xenon [11:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:00:44] (03CR) 10Hashar: [C: 031] "cherry picked on integration 
puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar) [11:01:39] <_joe_> godog: nah I just found your keylogger on my pc [11:01:52] <_joe_> godog: you could've had the decency of not writing it in perl [11:02:25] lol [11:02:44] hahah but it runs everywhere! [11:03:16] <_joe_> godog: even cockroaches would survive a nuclear fallout, but I don't think they are nice pets on that grounds [11:04:51] heheh [11:07:56] (03CR) 10Hashar: "Puppet compiler result https://puppet-compiler.wmflabs.org/1177/ytterbium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [11:08:19] can some one please merge the above conf change for Gerrit ? [11:08:40] I ran it on the puppet compiler https://puppet-compiler.wmflabs.org/1177/ytterbium.wikimedia.org/ that drop a configuration setting from Gerrit replication [11:09:02] it is currently preventing me from adding new zuul-merger instances [11:09:23] <_joe_> hashar: puppetSWAT is this afternoon, can't this wait for it? [11:09:34] I can't attend the puppet swat :-( [11:09:45] either due to family constraints or conflicting meeting [11:10:01] (03PS3) 10Filippo Giunchedi: cassandra: multi-instance aware CQL checks [puppet] - 10https://gerrit.wikimedia.org/r/250439 (https://phabricator.wikimedia.org/T93886) [11:10:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: multi-instance aware CQL checks [puppet] - 10https://gerrit.wikimedia.org/r/250439 (https://phabricator.wikimedia.org/T93886) (owner: 10Filippo Giunchedi) [11:10:14] <_joe_> well, that's not a reason for me to drop what I am doing suddenly, right? 
[11:10:29] yeah [11:10:31] not forcing anyone [11:10:33] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:10:33] <_joe_> so, maybe in a few if I'm done with what I'm working on :) [11:11:21] (03CR) 10Hashar: [C: 031] dumps,ganglia,nodepool: indentation of => [puppet] - 10https://gerrit.wikimedia.org/r/250628 (owner: 10Dzahn) [11:17:52] PROBLEM - puppet last run on mw2114 is CRITICAL: CRITICAL: puppet fail [11:19:06] (03PS1) 10Jcrespo: [WIP] Migrate eventlogging_sync process from terbium to the slaves [puppet] - 10https://gerrit.wikimedia.org/r/250662 [11:20:15] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Migrate eventlogging_sync process from terbium to the slaves [puppet] - 10https://gerrit.wikimedia.org/r/250662 (owner: 10Jcrespo) [11:22:51] (03PS2) 10Jcrespo: [WIP] Migrate eventlogging_sync process from terbium to the slaves [puppet] - 10https://gerrit.wikimedia.org/r/250662 [11:28:44] (03PS3) 10Jcrespo: [WIP] Migrate eventlogging_sync process from terbium to the slaves [puppet] - 10https://gerrit.wikimedia.org/r/250662 [11:29:22] (03CR) 10Gilles: [C: 031] Add perf-roots to Graphite role [puppet] - 10https://gerrit.wikimedia.org/r/249966 (owner: 10Ori.livneh) [11:30:15] (03CR) 10Gilles: [C: 031] Revert "Don't commit interwiki.cdb anymore" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250294 (owner: 10Ori.livneh) [11:36:27] (03CR) 10Gilles: [C: 04-1] Made the session/main stashes write to both DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: 10Aaron Schulz) [11:42:20] (03PS2) 10Giuseppe Lavagetto: gdash: redirect reqstats page to grafana [puppet] - 10https://gerrit.wikimedia.org/r/250395 [11:42:34] <_joe_> akosiaris: can I get one more opinion on ^^ [11:43:59] (03CR) 10Alexandros Kosiaris: [C: 031] gdash: redirect reqstats page to grafana [puppet] - 10https://gerrit.wikimedia.org/r/250395 (owner: 10Giuseppe Lavagetto) 
[11:45:37] (03CR) 10Filippo Giunchedi: "minor nit on spaces/tabs, LGTM otherwise" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/249984 (owner: 10Giuseppe Lavagetto) [11:46:07] (03PS1) 10Filippo Giunchedi: puppetmaster: long options for wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/250669 [11:46:22] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [11:47:27] (03PS4) 10Jcrespo: Migrate eventlogging_sync process from terbium to the slaves [puppet] - 10https://gerrit.wikimedia.org/r/250662 [11:51:01] (03CR) 10Jcrespo: "Adding _joe_ because he was working on terbium so he is in the loop." [puppet] - 10https://gerrit.wikimedia.org/r/250662 (owner: 10Jcrespo) [11:51:13] (03PS2) 10Filippo Giunchedi: puppetmaster: long options for wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/250669 [11:56:53] (03CR) 10Faidon Liambotis: [C: 04-1] "I think keeping the graphs but adding a big fat warning should be the best way forward. This is *the* most popular dashboard after all." [puppet] - 10https://gerrit.wikimedia.org/r/250395 (owner: 10Giuseppe Lavagetto) [11:57:13] 6operations, 10ops-esams, 10hardware-requests: Buy fiber patches - https://phabricator.wikimedia.org/T94846#1776935 (10mark) 5Open>3Resolved I ordered: 10x 3m 5x 10m 5x 7m [11:58:25] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Two small comments, seems good otherwise." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250669 (owner: 10Filippo Giunchedi) [11:58:34] <_joe_> paravoid: that was my doubt all along [11:59:11] <_joe_> paravoid: I'll just add a link to the page under the deprecation warning [12:02:11] (03PS4) 10Giuseppe Lavagetto: New package, minor debian tweaks [debs/pybal] - 10https://gerrit.wikimedia.org/r/249984 [12:02:43] (03CR) 10Giuseppe Lavagetto: [C: 032] New package, minor debian tweaks [debs/pybal] - 10https://gerrit.wikimedia.org/r/249984 (owner: 10Giuseppe Lavagetto) [12:03:45] (03PS5) 10Jcrespo: Migrate eventlogging_sync process from terbium to the slaves [puppet] - 10https://gerrit.wikimedia.org/r/250662 [12:04:13] (03Merged) 10jenkins-bot: New package, minor debian tweaks [debs/pybal] - 10https://gerrit.wikimedia.org/r/249984 (owner: 10Giuseppe Lavagetto) [12:06:29] (03CR) 10Filippo Giunchedi: puppetmaster: long options for wmf-reimage (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250669 (owner: 10Filippo Giunchedi) [12:06:36] (03PS3) 10Filippo Giunchedi: puppetmaster: long options for wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/250669 [12:07:19] 6operations, 10Wikidata, 7Database, 7Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#1776954 (10jcrespo) The trend has not finished: {F2911313} [12:07:51] (03CR) 10Giuseppe Lavagetto: [C: 031] puppetmaster: long options for wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/250669 (owner: 10Filippo Giunchedi) [12:08:11] 6operations: hafnium: corrupted filesystem? - https://phabricator.wikimedia.org/T117536#1776955 (10MoritzMuehlenhoff) 3NEW [12:09:07] 6operations: hafnium: corrupted filesystem? 
- https://phabricator.wikimedia.org/T117536#1776962 (10MoritzMuehlenhoff) [12:12:48] (03PS4) 10Filippo Giunchedi: puppetmaster: long options for wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/250669 [12:12:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] puppetmaster: long options for wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/250669 (owner: 10Filippo Giunchedi) [12:13:51] I will deploy the eventlogging migration after lunch, because it requires all my attention [12:13:58] 6operations, 10Wikidata, 7Database, 7Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#1776967 (10aaron) Sounds like the the result of fixing the rpc/RunJobs to properly run jobs till the 30 sec limit rather than 1 at a time (which wasted hug... [12:15:52] <_joe_> !log updated pybal for jessie to 1.12 [12:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:23:52] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures [12:25:54] (03PS4) 10Aaron Schulz: Made the session/main stashes write to both DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) [12:28:29] 6operations, 10Wikidata, 7Database, 7Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#1777027 (10jcrespo) ok, if it is explained and expected, then it is not urgent. A higher of updates does not necessarily imply less performance, (it coul... 
[12:29:13] 6operations, 7Database, 7Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#1777030 (10jcrespo) [12:33:18] (03CR) 10Gilles: [C: 031] Made the session/main stashes write to both DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: 10Aaron Schulz) [12:36:11] (03CR) 10Aaron Schulz: Make mysql-multiwrite use getInstance() factory spec [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz) [12:37:02] PROBLEM - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:37:32] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused [12:38:33] (03PS3) 10Giuseppe Lavagetto: gdash: redirect reqstats page to grafana, minor correction [puppet] - 10https://gerrit.wikimedia.org/r/250395 [12:40:22] (03PS4) 10Giuseppe Lavagetto: gdash: deprecate reqerror dashboard, minor correction [puppet] - 10https://gerrit.wikimedia.org/r/250395 [12:40:24] <_joe_> paravoid: would you be ok with this version of the change? 
^^ [12:47:20] (03PS1) 10Giuseppe Lavagetto: ganglia: remove jobqueue stats [puppet] - 10https://gerrit.wikimedia.org/r/250674 [12:50:01] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [12:51:02] (03CR) 10Giuseppe Lavagetto: "I already removed the jobs from their original place, now I remove them completely in https://gerrit.wikimedia.org/r/250674" [puppet] - 10https://gerrit.wikimedia.org/r/249345 (owner: 10Dzahn) [12:58:13] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1777079 (10Joe) As we worked out on the patch, the intervals are all configurable per-socket. this is now resolved with the new pybal package. [12:58:20] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1777080 (10Joe) 5Open>3Resolved [12:58:32] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1656273 (10Joe) [12:58:33] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: Make pybal accept 30[12] for ProxyFetch - https://phabricator.wikimedia.org/T102393#1777081 (10Joe) 5Open>3Resolved [12:58:51] 6operations, 10Traffic, 5Patch-For-Review, 7discovery-system, 5services-tooling: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1777083 (10Joe) 5Open>3Resolved [13:01:11] 6operations, 10Traffic, 7Availability, 5Patch-For-Review: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1777084 (10aaron) [13:19:08] (03PS1) 10DCausse: Elastic: added an option to load bitsets lazily [puppet] - 10https://gerrit.wikimedia.org/r/250678 [13:27:52] godog: ack [13:30:11] (03PS6) 10Jcrespo: 
Migrate eventlogging_sync process from terbium to the slaves [puppet] - 10https://gerrit.wikimedia.org/r/250662 [13:31:26] (03CR) 10Jcrespo: [C: 032] Migrate eventlogging_sync process from terbium to the slaves [puppet] - 10https://gerrit.wikimedia.org/r/250662 (owner: 10Jcrespo) [13:33:17] !log migrating eventlogging_sync from terbium to db1047 and dbstore1002 (analytics-store) [13:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:39] (03PS2) 10BBlack: authdns: minimal lint fix [puppet] - 10https://gerrit.wikimedia.org/r/250638 (owner: 10Dzahn) [13:33:45] (03CR) 10BBlack: [C: 032 V: 032] authdns: minimal lint fix [puppet] - 10https://gerrit.wikimedia.org/r/250638 (owner: 10Dzahn) [13:36:23] small issue, fixing it now [13:40:21] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Puppet has 1 failures [13:41:59] (03PS1) 10Jcrespo: Fixing bug with init.d script content [puppet] - 10https://gerrit.wikimedia.org/r/250680 [13:43:00] (03CR) 10Jcrespo: [C: 032] Fixing bug with init.d script content [puppet] - 10https://gerrit.wikimedia.org/r/250680 (owner: 10Jcrespo) [13:45:52] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:47:22] PROBLEM - Last backup of the others filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-others was exit-code [13:47:31] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code [13:54:48] Coren: ^ ? [13:55:02] akosiaris: I'm looking at it now. 
[13:55:30] ok [13:57:12] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [13:57:42] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.010 second response time [13:59:46] and load and network in terbium went down an order of magnitude [14:01:45] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1777165 (10fgiunchedi) xenon reimaged today, currently bootstrapping one instance ``` xenon:~$ nodetool-a netstats | grep "bytes total" R... [14:03:59] (03PS2) 10Filippo Giunchedi: graphite: enable labs instances archiver [puppet] - 10https://gerrit.wikimedia.org/r/248317 (https://phabricator.wikimedia.org/T111540) [14:04:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: enable labs instances archiver [puppet] - 10https://gerrit.wikimedia.org/r/248317 (https://phabricator.wikimedia.org/T111540) (owner: 10Filippo Giunchedi) [14:04:54] RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful [14:07:13] !log manual invocation of /usr/local/bin/archive-instances on labmon1001 to test [14:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds [14:10:32] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: puppet fail [14:15:04] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1777195 (10fgiunchedi) [14:15:06] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra/CQL query interface monitoring - https://phabricator.wikimedia.org/T93886#1777193 (10fgiunchedi) 5Open>3Resolved >>! 
In T93886#1716348, @Eevans wrote: >>>! In T93886#1715162, @fgiunchedi wrote: >> * error messages should includ... [14:15:54] RECOVERY - Last backup of the others filesystem on labstore1001 is OK: OK - Last run for unit replicate-others was successful [14:17:33] 6operations, 5Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1777198 (10Cmjohnson) Ori, To make the IP change some downtime will be required to update the network cfg on the server. [14:19:31] !log disabling puppet on dbstore1002 and db1047 to debug replication issue [14:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 4 below the confidence bounds [14:22:43] (03PS1) 10Giuseppe Lavagetto: cassandra: do not manage the service via puppet [puppet] - 10https://gerrit.wikimedia.org/r/250682 (https://phabricator.wikimedia.org/T103134) [14:23:43] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1777204 (10Joe) I implemented in puppet terms what we hypothesized earlier, but I do think what @akosiaris proposed is a way better idea. 
[14:33:33] (03PS1) 10Filippo Giunchedi: swift: add new ms-be machines to eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/250684 (https://phabricator.wikimedia.org/T114500) [14:37:13] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:42:07] (03PS3) 10Subramanya Sastry: Update Parsoid server.js path in the upstart config [puppet] - 10https://gerrit.wikimedia.org/r/249399 [14:45:22] (03PS1) 10Muehlenhoff: Fix Hiera path for analytics::spark::standalone::worker role [puppet] - 10https://gerrit.wikimedia.org/r/250685 [14:51:42] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:42] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 920249 bytes in 8.684 second response time [14:58:21] (03PS1) 10Filippo Giunchedi: codfw-prod: add ms-be2016 / ms-be2018 / ms-be2020 at weight 1000 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/250687 (https://phabricator.wikimedia.org/T116842) [15:07:59] (03PS2) 10Giuseppe Lavagetto: impala: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250038 (owner: 10Muehlenhoff) [15:08:20] (03PS2) 10Rush: Elastic: added an option to load bitsets lazily [puppet] - 10https://gerrit.wikimedia.org/r/250678 (owner: 10DCausse) [15:09:00] (03CR) 10Mobrovac: [C: 031] "Nice!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/250682 (https://phabricator.wikimedia.org/T103134) (owner: 10Giuseppe Lavagetto) [15:10:27] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1777329 (10fgiunchedi) allocation plans swift-wise for codfw: * 3x machines in different zones at weight 1000 increments until weight 4000 * add the other 3x machines at weight 1000 increme... 
[15:10:54] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [15:14:53] (03PS1) 10Muehlenhoff: Assign salt grains for nova::api [puppet] - 10https://gerrit.wikimedia.org/r/250690 [15:19:22] (03CR) 10Rush: "I'm not crazy about creating a load_fixed_bitset_filters_lazily param in puppet that is a layer of indirection for the actual load_fixed_b" [puppet] - 10https://gerrit.wikimedia.org/r/250678 (owner: 10DCausse) [15:22:55] (03CR) 10Rush: [C: 031] "I'm good with this... +1 but it will wait for the ops meeting I imagine" [puppet] - 10https://gerrit.wikimedia.org/r/249966 (owner: 10Ori.livneh) [15:25:40] (03CR) 10Rush: "I think this has aged a bit and needs updating so I'm removing myself for now :) But in general like the idea" [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [15:28:59] (03PS3) 10DCausse: Elastic: added an option to load bitsets lazily [puppet] - 10https://gerrit.wikimedia.org/r/250678 [15:30:12] (03PS1) 10EBernhardson: Remove sampling of CirrusSearchRequestSet log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250691 [15:34:31] (03CR) 10DCausse: [C: 031] Remove sampling of CirrusSearchRequestSet log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250691 (owner: 10EBernhardson) [15:37:18] (03PS8) 10Faidon Liambotis: Labs instance subnet allocation for codfw [dns] - 10https://gerrit.wikimedia.org/r/249919 (https://phabricator.wikimedia.org/T115492) (owner: 10Rush) [15:38:42] (03CR) 10Faidon Liambotis: [C: 032] Labs instance subnet allocation for codfw [dns] - 10https://gerrit.wikimedia.org/r/249919 (https://phabricator.wikimedia.org/T115492) (owner: 10Rush) [15:39:00] (03Abandoned) 10Filippo Giunchedi: WIP: xenon additional instances [puppet] - 10https://gerrit.wikimedia.org/r/234292 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [15:39:04] (03Abandoned) 10Filippo Giunchedi: xenon additional instances [dns] - 
10https://gerrit.wikimedia.org/r/234286 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [15:39:46] (03CR) 10Andrew Bogott: "Can I get some context for this?" [puppet] - 10https://gerrit.wikimedia.org/r/250690 (owner: 10Muehlenhoff) [15:42:02] (03PS2) 10Rush: Fix codfw row a labs-hosts1 and labs-support1 IP overlap [dns] - 10https://gerrit.wikimedia.org/r/250458 [15:43:13] (03PS1) 10Filippo Giunchedi: swift: force rsync protocol version 30 [puppet] - 10https://gerrit.wikimedia.org/r/250693 (https://phabricator.wikimedia.org/T93587) [15:51:52] (03CR) 10Greg Grossmeier: [C: 031] "It's Chad's patch, and he+"releng" have ownership of this tool/service. This is a symbolic +1. :)" [puppet] - 10https://gerrit.wikimedia.org/r/250578 (owner: 10Chad) [15:55:37] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1777558 (10Ottomata) > In my recollection of the discussion & the log you linked to, the question of which REST producer proxy to use was left open. I think you may be referring... [15:59:29] (03PS1) 10Jcrespo: Temporarelly disabling sleep on the eventlogging sync [puppet] - 10https://gerrit.wikimedia.org/r/250695 [15:59:40] * James_F waves. [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151103T1600). [16:00:04] James_F dcausse: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:25] * James_F remains alive. [16:00:28] hi [16:01:43] okie doke, I can SWAT today. 
[16:01:51] (03PS2) 10Jcrespo: Temporarily disabling sleep on the eventlogging sync [puppet] - 10https://gerrit.wikimedia.org/r/250695 [16:02:42] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250470 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:02:48] Thanks thcipriani [16:03:06] James_F: of course! [16:04:05] (03Merged) 10jenkins-bot: Enable VisualEditor for 10% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250470 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:05:06] (03CR) 10Jcrespo: [C: 032] Temporarily disabling sleep on the eventlogging sync [puppet] - 10https://gerrit.wikimedia.org/r/250695 (owner: 10Jcrespo) [16:07:05] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor for 10% of new accounts on eswiki [[gerrit:250470]] (duration: 00m 18s) [16:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:07:11] ^ James_F check please [16:07:12] Whee. [16:09:19] thcipriani: Yup, looks good. [16:09:30] James_F: cool, thanks for checking. [16:09:59] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250691 (owner: 10EBernhardson) [16:10:24] (03Merged) 10jenkins-bot: Remove sampling of CirrusSearchRequestSet log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250691 (owner: 10EBernhardson) [16:12:29] (03PS1) 10Paladox: Fix error Sorry, the repository $1 does not have a $2 branch! [puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) [16:12:42] (03PS2) 10Paladox: Fix error Sorry, the repository $1 does not have a $2 branch! 
[puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) [16:12:44] (03PS4) 10Rush: Elastic: added an option to load bitsets lazily [puppet] - 10https://gerrit.wikimedia.org/r/250678 (owner: 10DCausse) [16:12:50] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove sampling of CirrusSearchRequestSet log channel [[gerrit:250691]] (duration: 00m 18s) [16:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:59] ^ dcausse check if possible please [16:13:15] (03PS3) 10Paladox: Fix error Sorry, the repository $1 does not have a $2 branch! [puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) [16:14:04] (03CR) 10Paladox: "@Dzahn could you merge this I would like to see if this fixes the problem. if it dosent I will try the second solution that is in the ref." [puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: 10Paladox) [16:14:44] thcipriani: looks good, thanks! [16:15:01] dcausse: cool, thanks for checking! 
[16:15:02] (03CR) 10Rush: [C: 032] "compiler seems to agree:" [puppet] - 10https://gerrit.wikimedia.org/r/250678 (owner: 10DCausse) [16:16:57] 6operations, 6Revscoring, 6Services, 5acl*operations-team, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1777632 (10Halfak) 3NEW [16:17:08] 6operations, 6Revscoring, 6Services, 5acl*operations-team, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1777640 (10Halfak) [16:17:32] (03PS2) 10Muehlenhoff: Assign salt grains for nova::api [puppet] - 10https://gerrit.wikimedia.org/r/250690 [16:17:41] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for nova::api [puppet] - 10https://gerrit.wikimedia.org/r/250690 (owner: 10Muehlenhoff) [16:19:16] 6operations, 6Revscoring, 6Services, 5acl*operations-team, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1777632 (10Halfak) [16:19:55] !log installed unzip and audiofile security updates across the fleet [16:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:26] !log restarting elastic on nobelium to check new settings [16:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:22] (03PS1) 10Jcrespo: Fixing eventlogging_sync process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/250699 [16:30:14] (03PS2) 10Jcrespo: Fixing eventlogging_sync process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/250699 [16:31:20] (03CR) 10Jcrespo: [C: 032] Fixing eventlogging_sync process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/250699 (owner: 10Jcrespo) [16:43:23] (03PS2) 10Filippo Giunchedi: swift: add new ms-be machines to eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/250684 (https://phabricator.wikimedia.org/T114500) [16:43:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: add new ms-be machines to eqiad/codfw 
[puppet] - https://gerrit.wikimedia.org/r/250684 (https://phabricator.wikimedia.org/T114500) (owner: Filippo Giunchedi) [16:49:22] operations, Wikimedia-Mailing-lists, Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1777766 (JohnLewis) a:JohnLewis I've repoked the IRC channel and asked onwiki ([[https://en.wiktionary.org/wiki/Wiktionary:Beer_parlour/2015/November#Wiktionary-l_mailing_... [16:52:13] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 8 failures [16:53:34] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:54:30] operations, Wikimedia-Mailing-lists: wikisk-l: Give the list an administrator - https://phabricator.wikimedia.org/T111054#1777795 (JohnLewis) a:JohnLewis Adapted message from wiktionary-l also posted to local wiki [[https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Kr%C4%8Dma/R%C3%B4zne#wikisk-l_mailing_lis... [16:57:03] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Puppet has 8 failures [17:00:04] _joe_ andrewbogott: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151103T1700). Please do the needful. [17:00:04] Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:15] Grrrrrrrrrrrrr [17:00:27] who is the king of jouncebot?
[17:00:35] andrewbogott: not sure where that's coming from, this week it's akosiaris and myself [17:00:47] well, obviously from https://wikitech.wikimedia.org/wiki/Deployments [17:01:03] but not sure how you end up there every week :-) [17:01:20] ah, it’s not the bot that’s broken [17:01:26] I think you’re supposed to add yourself there :) [17:01:56] operations, ops-codfw, Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1777832 (fgiunchedi) on ms-be2018 one of the onboard nics is being detected as eth0 and the 10g as eth1, not clear the reason why ``` ~ # ip address list 1: lo: yup [17:02:25] <_joe_> andrewbogott: we should edit the template, or this damn thing will call us every week [17:02:41] _joe_: I’ll look [17:03:03] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Puppet has 8 failures [17:04:41] PROBLEM - swift-object-replicator on ms-be1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:05:13] PROBLEM - swift-object-server on ms-be1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:05:42] PROBLEM - swift-object-replicator on ms-be1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:06:01] PROBLEM - swift-object-server on ms-be1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:06:21] PROBLEM - swift-account-auditor on ms-be1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:06:23] that's me, silencing [17:06:51] PROBLEM - swift-account-reaper on ms-be1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:07:01] _joe_: fixed, I think. [17:07:19] <_joe_> andrewbogott: thanks!
[17:07:21] PROBLEM - swift-account-replicator on ms-be1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:07:21] PROBLEM - swift-account-auditor on ms-be1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:07:32] PROBLEM - swift-account-reaper on ms-be1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:13:07] So who is doing this then, andrewbogott? [17:13:25] Krenair: moritzm and akosiaris [17:13:32] !log installed security updates on the internal ntpds (system-local ones to follow) [17:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:13:47] moritzm: can you catch up with krenair? He has a pending patch I believe. [17:14:05] Krenair: I'm on it, currently waiting whether Alex wants to comment [17:14:13] operations, Wikimedia-Mailing-lists, Mail: Spam solutions for Education-l mailing list - https://phabricator.wikimedia.org/T100428#1777877 (JohnLewis) Open>Resolved https://wikitech.wikimedia.org/wiki/Mailman#Fighting_spam_in_mailman should be the aggregated source for this information. Really we... [17:14:37] operations, ops-codfw, Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1777882 (fgiunchedi) >>! In T114712#1777832, @fgiunchedi wrote: > on ms-be2018 one of the onboard nics is being detected as eth0 and the 10g as eth1, not clear the reason why scratc...
[17:17:49] (PS3) Muehlenhoff: Make mediawiki-config clone be owned by mwdeploy [puppet] - https://gerrit.wikimedia.org/r/249684 (https://phabricator.wikimedia.org/T117016) (owner: Alex Monk) [17:18:41] RECOVERY - swift-object-server on ms-be1021 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:18:41] RECOVERY - swift-account-replicator on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:19:42] RECOVERY - swift-account-auditor on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:19:47] (CR) Muehlenhoff: [C: 2 V: 2] Make mediawiki-config clone be owned by mwdeploy [puppet] - https://gerrit.wikimedia.org/r/249684 (https://phabricator.wikimedia.org/T117016) (owner: Alex Monk) [17:19:52] RECOVERY - swift-object-replicator on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:20:02] (CR) Nemo bis: [C: -1] Tidy robots.txt (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [17:20:02] RECOVERY - swift-account-reaper on ms-be1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:20:31] RECOVERY - swift-account-auditor on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:20:51] RECOVERY - swift-account-reaper on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:20:52] RECOVERY - swift-object-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:21:05] Krenair: merged (and this also concludes puppet swat as per https://wikitech.wikimedia.org/wiki/Deployments) [17:21:12] RECOVERY - swift-object-server on ms-be1019 is OK: PROCS OK: 101
processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:22:47] sorry for the noise, recoveries go through even when hosts are in downtime [17:23:32] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:24:01] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:24:19] godog, I *know* - in my case it is worse because my alerts are usually paging [17:24:48] jynus: hehe yeah the pagestorm! [17:24:54] or page shower [17:25:03] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:25:37] hey, get me some help so I do not have to setup 20 servers every time myself! [17:26:42] thanks moritzm [17:37:57] (CR) Rush: "I proposed an alternative https://gerrit.wikimedia.org/r/#/c/244471/" [puppet] - https://gerrit.wikimedia.org/r/227327 (https://phabricator.wikimedia.org/T114161) (owner: Alex Monk) [17:37:59] Puppet, Labs, Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1778025 (mmodell) p:Triage>High [17:38:43] Ops-Access-Requests, operations, Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1778030 (Deskana) This has gone through the required three day waiting period. Can it be actioned now? [17:40:10] operations, Analytics-Backlog, Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1778032 (JohnLewis) Open>declined a:JohnLewis Closing as declined for several reasons: * requires Varnish, whic...
[17:40:36] (PS3) Alex Monk: admin: allow all active users to be applied [puppet] - https://gerrit.wikimedia.org/r/244471 (https://phabricator.wikimedia.org/T114161) (owner: Rush) [17:40:43] (PS1) Dereckson: Set import sources on en.wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/250709 (https://phabricator.wikimedia.org/T115938) [17:40:52] (Abandoned) Alex Monk: Add all groups to general bastions, mostly empty bastiononly group [puppet] - https://gerrit.wikimedia.org/r/227327 (https://phabricator.wikimedia.org/T114161) (owner: Alex Monk) [17:43:04] Ops-Access-Requests, operations, Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1778050 (Dzahn) a:Cmjohnson [17:47:29] (PS1) Dzahn: addmin: add add-shore to researchers [puppet] - https://gerrit.wikimedia.org/r/250711 (https://phabricator.wikimedia.org/T116784) [17:48:17] cmjohnson1: ^ feel like taking that one? i think it is unblocked, has waited and approval and no sudo involved [17:48:30] (CR) John F. Lewis: [C: 1] addmin: add add-shore to researchers [puppet] - https://gerrit.wikimedia.org/r/250711 (https://phabricator.wikimedia.org/T116784) (owner: Dzahn) [17:48:55] mutante: looking [17:52:39] (CR) JanZerebecki: [C: 1] Also published bzip2 compressed Wikidata TTL dumps [puppet] - https://gerrit.wikimedia.org/r/249981 (owner: Hoo man) [17:52:57] (CR) Cmjohnson: [C: 2] "Pushing this change, no objections were made and access request criteria was met."
[puppet] - https://gerrit.wikimedia.org/r/250711 (https://phabricator.wikimedia.org/T116784) (owner: Dzahn) [17:53:28] cmjohnson1: thank you:) [17:58:46] (CR) Steinsplitter: [C: 1] Set import sources on en.wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/250709 (https://phabricator.wikimedia.org/T115938) (owner: Dereckson) [18:00:31] Ops-Access-Requests, operations, Wikidata, Patch-For-Review: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1778116 (Addshore) Open>Resolved [18:12:05] (PS1) Alex Monk: Fix global for wgVisualEditorFullRestbaseURL [mediawiki-config] - https://gerrit.wikimedia.org/r/250712 [18:23:07] (CR) Chad: "Won't this break all existing urls?" [puppet] - https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: Paladox) [18:29:27] (CR) Paladox: "I am not sure. But am trying since some repo carn't view raw files. Yes there's phabricator but currently we link to git.wikimedia.org. fo" [puppet] - https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: Paladox) [18:29:29] (CR) Dzahn: "is it possible to test this somewhere not in production first? because ...the "try" part and Chad's question ..." [puppet] - https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: Paladox) [18:29:56] !log Moved navtiming script from hafnium to a screen session on eventlog2001 in preparation for hafnium reimage [18:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:34] (CR) Dzahn: "i'm not sure, but maybe this part is what we want to change anyways? "Yes there's phabricator but currently we link to git.wikimedia.org.
" [puppet] - https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: Paladox) [18:33:55] operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1778254 (Ottomata) Cool, added some comments. [18:34:20] !log Moved statsv script from hafnium to a screen session on eventlog2001 in preparation for hafnium reimage [18:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:08] (PS1) Ori.livneh: Do not apply eventlogging and webperf roles on hafnium [puppet] - https://gerrit.wikimedia.org/r/250717 [18:36:22] (PS2) Ori.livneh: Do not apply eventlogging and webperf roles on hafnium [puppet] - https://gerrit.wikimedia.org/r/250717 [18:36:49] (CR) Ori.livneh: [C: 2 V: 2] "(mutante, fyi)" [puppet] - https://gerrit.wikimedia.org/r/250717 (owner: Ori.livneh) [18:40:13] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [18:40:25] operations: hafnium: corrupted filesystem? - https://phabricator.wikimedia.org/T117536#1778322 (ori) I deleted /usr/lib/a* by mistake. Owning up to it so that it does not seem like the machine was compromised. It is getting re-imaged per T117449. [18:40:43] operations: hafnium: corrupted filesystem? - https://phabricator.wikimedia.org/T117536#1778328 (ori) [18:40:44] operations, Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1774610 (ori) [18:42:28] operations, Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1778337 (ori) >>! In T117449#1777198, @Cmjohnson wrote: > To make the IP change some downtime will be required to update the network cfg on the server. Yes, and (per agreement with Daniel...
[18:46:34] (CR) Jforrester: [C: 1] Fix global for wgVisualEditorFullRestbaseURL [mediawiki-config] - https://gerrit.wikimedia.org/r/250712 (owner: Alex Monk) [18:48:33] Jeff_Green: Load on Bismuth is staying just below 3... is that going to start paging anyone? [18:53:02] operations, Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1778402 (RobH) I'm going to add #vm-requests to this and remove #hardware-requests. It appears this can live i... [18:53:10] operations, Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1778405 (RobH) a:JMinor [18:53:19] operations, Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, vm-requests, iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1746196 (RobH) [18:54:40] csteipp: checking [18:55:40] csteipp: I think it's fine until it's sustained over 5 [18:55:53] and I don't think it pages until more like 20 [18:56:56] Cool. I'll aim to keep it under 5. I think I'm hitting some other issues so I killed it off for now. I'll ping you when I start it up again. [18:57:03] ok [19:00:05] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151103T1900). Please do the needful.
[19:01:37] (PS4) Rush: Allocate reserved labs-hosts1-b-codfw [dns] - https://gerrit.wikimedia.org/r/249914 [19:03:14] (CR) Rush: [C: 2] Allocate reserved labs-hosts1-b-codfw [dns] - https://gerrit.wikimedia.org/r/249914 (owner: Rush) [19:03:47] (PS9) Rush: Labs instance subnet allocation for codfw [dns] - https://gerrit.wikimedia.org/r/249919 (https://phabricator.wikimedia.org/T115492) [19:04:24] (CR) Rush: [C: 2] Labs instance subnet allocation for codfw [dns] - https://gerrit.wikimedia.org/r/249919 (https://phabricator.wikimedia.org/T115492) (owner: Rush) [19:05:32] (PS3) Rush: Fix codfw row a labs-hosts1 and labs-support1 IP overlap [dns] - https://gerrit.wikimedia.org/r/250458 [19:06:00] operations, Labs, Labs-Infrastructure, netops, and 2 others: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1778439 (chasemp) [19:06:05] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:06:05] (PS4) Rush: admin: allow all active users to be applied [puppet] - https://gerrit.wikimedia.org/r/244471 (https://phabricator.wikimedia.org/T114161) [19:06:51] !log deploying 1.27.0-wmf.5 to group0 [19:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:26] (CR) Rush: "I guess -1 your own change means WIP?
:)" [puppet] - https://gerrit.wikimedia.org/r/250370 (owner: 20after4) [19:10:00] (CR) 20after4: "@rush: For some reason it doesn't seem to actually work, though it should" [puppet] - https://gerrit.wikimedia.org/r/250370 (owner: 20after4) [19:11:45] !log ori@tin Synchronized php-1.27.0-wmf.4/includes/libs/objectcache/BagOStuff.php: 46dff75b5b: Make makeKeyInternal() limit more conservative (duration: 00m 18s) [19:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:12:08] (PS2) Rush: iridium system-wide gitconfig needs http.proxy [puppet] - https://gerrit.wikimedia.org/r/250370 (owner: 20after4) [19:13:54] (CR) 20after4: "Once I applied the proxy settings phabricator's git processes could not access gerrit anymore, and further, github was still inaccessible." [puppet] - https://gerrit.wikimedia.org/r/250370 (owner: 20after4) [19:17:09] !log twentyafterfour@tin Started scap: testwiki to 1.27.0-wmf.5 [19:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:24] (CR) Paladox: "I am not sure. Since I think it is to do with ngnix.
Plus viewing raw files is currently showings errors for mediawiki/extensions/ and med" [puppet] - https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: Paladox) [19:19:23] oh no, puppet isn't safe [19:20:26] (PS5) Yuvipanda: tools: Make home page check critical [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116825) [19:20:37] (CR) Yuvipanda: [C: 2 V: 2] tools: Make home page check critical [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116825) (owner: Yuvipanda) [19:22:11] (CR) Rush: "there is some trick here I recall from ages ago, let me see if I have working config somewhere" [puppet] - https://gerrit.wikimedia.org/r/250370 (owner: 20after4) [19:22:54] operations, Revscoring, Services, service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1778518 (RobH) [19:27:54] (CR) Yuvipanda: "This method is definitely far less hacky than doing it via facter..." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/243357 (owner: Alex Monk) [19:27:59] (PS1) Dzahn: hafnium: use jessie installer now [puppet] - https://gerrit.wikimedia.org/r/250737 [19:28:14] Krenair: ^^ what were we blocking you on? [19:28:47] (PS2) Dzahn: hafnium: use jessie installer now [puppet] - https://gerrit.wikimedia.org/r/250737 [19:29:32] (CR) Dzahn: [C: 2] "yep, switching the hostname in DNS, it's already pending in gerrit" [puppet] - https://gerrit.wikimedia.org/r/250737 (owner: Dzahn) [19:30:19] operations, Labs, Tool-Labs, Icinga, and 2 others: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1778548 (yuvipanda) Open>Resolved Done [19:33:51] (CR) Dzahn: "assigning 10.64.32.176 to hafnium. unused, no response and right in the middle between other used ones."
[dns] - https://gerrit.wikimedia.org/r/250611 (https://phabricator.wikimedia.org/T117449) (owner: Dzahn) [19:33:54] (PS2) Dzahn: add private IP for hafnium [dns] - https://gerrit.wikimedia.org/r/250611 (https://phabricator.wikimedia.org/T117449) [19:34:41] (CR) Dzahn: [C: 2] add private IP for hafnium [dns] - https://gerrit.wikimedia.org/r/250611 (https://phabricator.wikimedia.org/T117449) (owner: Dzahn) [19:36:42] operations, Revscoring, Services, service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1778582 (yuvipanda) From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis too!) on the same hosts - on the SCB cluster... [19:36:55] operations, Deployment-Systems, Performance-Team, Epic, Tracking: During deployment old servers may populate new cache URIs (tracking) - https://phabricator.wikimedia.org/T47877#1778584 (Krinkle) [19:37:13] operations, MediaWiki-Cache, Performance-Team, hardware-requests, Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1778586 (GWicke) @Ottomata, based on the data we have so far even the smallest spares sho...
[19:37:28] operations, ops-codfw, ops-eqiad, ops-esams, Project-Creators: create #ops-eqdfw & #ops-eqord projects - https://phabricator.wikimedia.org/T117585#1778587 (RobH) NEW a:RobH [19:37:38] operations, Project-Creators: create #ops-eqdfw & #ops-eqord projects - https://phabricator.wikimedia.org/T117585#1778596 (RobH) [19:37:49] (CR) Rush: "The syntax I have for proxy in general is:" [puppet] - https://gerrit.wikimedia.org/r/250370 (owner: 20after4) [19:38:02] operations, Revscoring, Services, service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1778600 (chasemp) p:Triage>Normal [19:38:46] operations, MediaWiki-Cache, Performance-Team, hardware-requests, Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1778608 (Ottomata) Mmk! @GWicke, I can't remember. Have we talked about where the event... [19:39:07] (PS2) Dzahn: hafnium: switch from public to private IP [puppet] - https://gerrit.wikimedia.org/r/250614 (https://phabricator.wikimedia.org/T117449) [19:39:55] (CR) Dzahn: [C: 2] hafnium: switch from public to private IP [puppet] - https://gerrit.wikimedia.org/r/250614 (https://phabricator.wikimedia.org/T117449) (owner: Dzahn) [19:39:59] operations, MediaWiki-Cache, Performance-Team, hardware-requests, Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1778630 (GWicke) @ottomata: Yes, we'd like to co-locate. The benchmarks were done with co... [19:40:44] operations, MediaWiki-Cache, Performance-Team, hardware-requests, Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1778640 (Ottomata) Ok, good with me. Am fine with the spares decision.
:) [19:41:35] (CR) Rush: "unfortunately this takes a long time to rollout atm as it requires a restart. I'm guessing we can couple w/ our security updates in a wee" [puppet] - https://gerrit.wikimedia.org/r/238850 (owner: EBernhardson) [19:42:10] (CR) EBernhardson: "sounds like a plan to me." [puppet] - https://gerrit.wikimedia.org/r/238850 (owner: EBernhardson) [19:47:11] !log twentyafterfour@tin Finished scap: testwiki to 1.27.0-wmf.5 (duration: 30m 00s) [19:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:29] !log hafnium - reboot into PXE for jessie reinstall [19:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:56] !log hafnium - revoke puppet cert, salt key for the .wikimedia.org name [19:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:07] (PS1) 20after4: group0 wikis to 1.27.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/250748 [19:57:44] (CR) 20after4: [C: 2] group0 wikis to 1.27.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/250748 (owner: 20after4) [19:58:10] (Merged) jenkins-bot: group0 wikis to 1.27.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/250748 (owner: 20after4) [19:59:09] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.5 [19:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:59:28] operations, MediaWiki-Cache, Performance-Team, hardware-requests, Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1778780 (RobH) [19:59:55] operations, MediaWiki-Cache, Performance-Team, hardware-requests, Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams -
https://phabricator.wikimedia.org/T114191#1687618 (RobH) [20:01:59] operations, MediaWiki-Cache, Performance-Team, hardware-requests, Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1778794 (RobH) I've updated the task description to reflect the discussion results of usi... [20:06:32] (PS2) Dzahn: hafnium, remove the public IP [dns] - https://gerrit.wikimedia.org/r/250612 (https://phabricator.wikimedia.org/T117449) [20:07:07] (CR) Dzahn: [C: 2] hafnium, remove the public IP [dns] - https://gerrit.wikimedia.org/r/250612 (https://phabricator.wikimedia.org/T117449) (owner: Dzahn) [20:09:15] YuviPanda, I've forgotten about this commit [20:09:26] To be honest you've left it too long [20:09:42] yeah :( [20:10:46] Krenair: do you still have energy to work on the patch? [20:12:57] operations, Project-Creators: create #ops-eqdfw & #ops-eqord projects - https://phabricator.wikimedia.org/T117585#1778885 (Krenair) Seems to just be keeping consistency with the existing pattern, I would just go ahead and create them, then document it at T103700 per https://www.mediawiki.org/wiki/Phabricat... [20:13:03] Krenair: I can take it over for a while otherwise... [20:13:29] I'll take a look through it later [20:13:49] Krenair: ok! [20:18:04] (CR) Yuvipanda: [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/243357 (owner: Alex Monk) [20:18:31] Krenair: I just did a manual rebase, fixing some minor things as well [20:18:39] (PS2) Yuvipanda: [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - https://gerrit.wikimedia.org/r/243357 (owner: Alex Monk) [20:18:45] so which comments are still outstanding? [20:20:52] Krenair: I found the alias_script thing and fixing it now.
we still need to figure a way of not restarting the dns recursor all the time [20:22:29] Krenair: ok, so this patchset does the alias_script fix. [20:22:40] Krenair: do you mind if I also not make the python file a template? [20:22:48] (also the python file is using tabs) [20:23:10] go fori t [20:23:29] ok [20:23:38] (PS3) Yuvipanda: [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - https://gerrit.wikimedia.org/r/243357 (owner: Alex Monk) [20:32:08] operations, Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1778994 (RobH) I just changed the switch port from public to private vlan per @dzahn's request. [20:38:06] (PS2) Dzahn: dumps,ganglia,nodepool: indentation of => [puppet] - https://gerrit.wikimedia.org/r/250628 [20:38:31] (CR) Dzahn: [C: 2] dumps,ganglia,nodepool: indentation of => [puppet] - https://gerrit.wikimedia.org/r/250628 (owner: Dzahn) [20:38:45] !log cleared cassandra snapshots on aqs servers [20:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:41:17] (PS2) Dzahn: Also published bzip2 compressed Wikidata TTL dumps [puppet] - https://gerrit.wikimedia.org/r/249981 (owner: Hoo man) [20:41:25] (CR) Dzahn: [C: 2] Also published bzip2 compressed Wikidata TTL dumps [puppet] - https://gerrit.wikimedia.org/r/249981 (owner: Hoo man) [20:41:48] !log restbase cassandra: switched local_group_wikipedia_T_parsoid_html to the Date-Tiered Compaction Strategy (DTCS) [20:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:45:44] !log deployed patch for T109724 to wmf4/5 [20:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:47:31] (CR) Dzahn: "We are using this: "class { 'ganglia::monitor::aggregator':". Where you see that in site.pp we actually use it.
one of them per data cente" [puppet] - https://gerrit.wikimedia.org/r/250072 (owner: Dzahn) [20:48:12] (PS8) Dzahn: swift: no "if $hostname" in node blocks, use role [puppet] - https://gerrit.wikimedia.org/r/250072 [20:49:08] (CR) Dzahn: [C: 2] swift: no "if $hostname" in node blocks, use role [puppet] - https://gerrit.wikimedia.org/r/250072 (owner: Dzahn) [20:50:35] (PS2) Dzahn: miredo: minimal lint fix, add comment [puppet] - https://gerrit.wikimedia.org/r/250637 [20:54:47] (PS3) Dzahn: miredo: minimal lint fix, add comment [puppet] - https://gerrit.wikimedia.org/r/250637 [20:55:02] (CR) Dzahn: [C: 2] miredo: minimal lint fix, add comment [puppet] - https://gerrit.wikimedia.org/r/250637 (owner: Dzahn) [21:01:47] operations, Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1779091 (Dzahn) I made the DNS changes and re-installed it with jessie. Removed old puppet certs/salt-key and added new ones in .eqiad.wmnet. initial puppet run is ongoing and regular she... [21:05:53] operations, Analytics, Analytics-Kanban, Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1779098 (Ottomata) Ok, still various TODOs around the code, but this is ready for review. https://gerrit.wikimedia.org/r/#/c/235671 There are concepts that it'll be good to do... [21:09:41] operations, Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1779145 (Dzahn) a:ori assigning to Ori per: < ori> mutante: OK, good to go. to prevent metrics from being submitted twice from two places, I removed the role from hafnium. If you hand it...
[21:10:10] operations, Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1779147 (Dzahn) p:Triage>Normal [21:12:37] operations, Commons, Wikimedia-Media-storage, MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1779156 (matmarex) [21:13:00] operations, Commons, Wikimedia-Media-storage, MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1779158 (matmarex) a:aaron So, what can we do about this now with the shiny new script? :) [21:13:27] (CR) Chad: [C: -1] "We should stop linking to Gitblit instead." [puppet] - https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: Paladox) [21:15:19] operations, Patch-For-Review: hafnium should not have a public interface - https://phabricator.wikimedia.org/T117449#1779165 (Dzahn) host hafnium.eqiad.wmnet hafnium.eqiad.wmnet has address 10.64.32.176 root@hafnium:~# ssh-keygen -lf /etc/ssh/ssh_host_rsa_key.pub 2048 f3:83:90:e8:e0:5f:f4:33:67:a2:89:4f:... [21:15:56] (CR) Paladox: "Yes but I am not sure if it all repos have been migrated since on mediawiki the template was changed to use phabricator but then reverted." [puppet] - https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: Paladox) [21:21:40] !log ms-be2018,ms-be2019: sign puppet certs, initial run [21:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:11] (CR) Chad: "Like < 90% of them are, I wouldn't freak out about it much."
[puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: 10Paladox) [21:23:20] (03CR) 10Chad: ">, not <" [puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: 10Paladox) [21:25:28] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1779325 (10Dzahn) a:3Papaul [21:27:57] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1779347 (10Dzahn) because there is already a `role swift::storage` for `node /^ms-be20(1[6-9]|2[0-1])\.codfw\.wmnet$/` they are already getting all the swift things and monitoring just a... [21:31:38] (03PS1) 10Yuvipanda: Increase max-line-length to at least 100 for python [puppet] - 10https://gerrit.wikimedia.org/r/250831 [21:31:56] can I get someone to +1 ^? [21:32:09] * YuviPanda looks at mutante / chasemp / andrewbogott [21:32:40] (03CR) 10Rush: [C: 031] "no argument" [puppet] - 10https://gerrit.wikimedia.org/r/250831 (owner: 10Yuvipanda) [21:33:03] +1 [21:33:05] (03CR) 10Dzahn: [C: 031] "100 seems fine" [puppet] - 10https://gerrit.wikimedia.org/r/250831 (owner: 10Yuvipanda) [21:34:04] PROBLEM - swift-object-replicator on ms-be2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:34:04] PROBLEM - swift-account-auditor on ms-be2019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:34:34] PROBLEM - swift-account-reaper on ms-be2019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:34:34] PROBLEM - swift-object-server on ms-be2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:34:54] PROBLEM - swift-account-replicator on ms-be2019 is CRITICAL: PROCS CRITICAL: 0 processes with regex 
args ^/usr/bin/python /usr/bin/swift-account-replicator [21:35:14] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: Puppet has 9 failures [21:35:55] ACKNOWLEDGEMENT - NTP on ms-be2018 is CRITICAL: NTP CRITICAL: Offset unknown daniel_zahn fresh install [21:35:55] ACKNOWLEDGEMENT - puppet last run on ms-be2018 is CRITICAL: CRITICAL: Puppet has 9 failures daniel_zahn fresh install [21:35:55] ACKNOWLEDGEMENT - swift-account-reaper on ms-be2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper daniel_zahn fresh install [21:35:55] ACKNOWLEDGEMENT - swift-account-replicator on ms-be2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator daniel_zahn fresh install [21:35:55] ACKNOWLEDGEMENT - swift-object-replicator on ms-be2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator daniel_zahn fresh install [21:35:55] ACKNOWLEDGEMENT - swift-object-server on ms-be2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server daniel_zahn fresh install [21:35:56] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1779364 (10chasemp) [21:36:04] RECOVERY - swift-object-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:36:34] RECOVERY - swift-object-server on ms-be2018 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:36:54] ACKNOWLEDGEMENT - puppet last run on ms-be2019 is CRITICAL: CRITICAL: Puppet has 7 failures daniel_zahn fresh install [21:36:54] ACKNOWLEDGEMENT - swift-account-auditor on ms-be2019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor daniel_zahn fresh install [21:36:54] 
ACKNOWLEDGEMENT - swift-account-reaper on ms-be2019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper daniel_zahn fresh install [21:36:54] ACKNOWLEDGEMENT - swift-account-replicator on ms-be2019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator daniel_zahn fresh install [21:36:54] ACKNOWLEDGEMENT - swift-object-replicator on ms-be2019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator daniel_zahn fresh install [21:36:55] ACKNOWLEDGEMENT - swift-object-server on ms-be2019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server daniel_zahn fresh install [21:40:04] RECOVERY - swift-account-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:40:25] RECOVERY - swift-account-reaper on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:40:54] RECOVERY - swift-account-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:42:13] (03PS4) 10Yuvipanda: [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/243357 (owner: 10Alex Monk) [21:42:14] (03PS2) 10Yuvipanda: Increase max-line-length to at least 100 for python [puppet] - 10https://gerrit.wikimedia.org/r/250831 [21:42:39] Krenair: ok I fixed up the python file a bit. 
I'll have to make the config file actually be written now [21:42:46] ok [21:42:46] !log krenair@tin Synchronized README: (no message) (duration: 00m 18s) [21:42:47] (03PS3) 10Yuvipanda: Increase max-line-length to at least 100 for python [puppet] - 10https://gerrit.wikimedia.org/r/250831 [21:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:59] Krenair: and then we just need to figure out how to do the restart bits [21:43:10] (03CR) 10Yuvipanda: [C: 032 V: 032] Increase max-line-length to at least 100 for python [puppet] - 10https://gerrit.wikimedia.org/r/250831 (owner: 10Yuvipanda) [21:47:21] 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117, 3labs-sprint-118: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1779404 (10RobH) If these exist as bare metal to the OS (that has the userspace the labs user is in) then they have direct hardware access... [21:47:44] 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117, 3labs-sprint-118: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1779414 (10RobH) [21:47:46] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1766457 (10RobH) [21:48:24] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1779427 (10RobH) a:5chasemp>3RobH [21:50:51] (03PS1) 10BryanDavis: Add deployment-bastion to beta /etc/dsh/group/mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/250837 (https://phabricator.wikimedia.org/T117574) [21:51:31] (03CR) 10Alex Monk: "and mira?" 
[puppet] - 10https://gerrit.wikimedia.org/r/250837 (https://phabricator.wikimedia.org/T117574) (owner: 10BryanDavis) [21:54:14] (03PS2) 10BryanDavis: Add deploy masters to beta /etc/dsh/group/mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/250837 (https://phabricator.wikimedia.org/T117574) [21:58:55] ebernhardson: halfak is asking if there's a schema for the elasticsearch dumps [21:59:12] :D [21:59:16] !log krenair@tin Synchronized README: rv, was just demonstrating for https://phabricator.wikimedia.org/T117574#1779384 (duration: 00m 17s) [21:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:59:39] halfak: sortof, but not described anywhere except in the code :P they look like this: https://en.wikipedia.org/wiki/California?action=cirrusdump [21:59:46] halfak: (thats json, everything will be roughly the same as that) [22:00:35] I'm looking for what fields are possible, what _type could be, etc. [22:00:55] I see there's no rev_id. is that right? [22:01:04] Oh wait. [22:01:05] halfak: hmm, well i guess i can add a ticket to document it :) [22:01:08] I see "version" [22:01:20] ebernhardson, that or just tell me where to request it. :) [22:01:22] halfak: rev_id is the version at the very end of the json document (we also use that to prevent out of order writes to ES) [22:01:38] * halfak runs to a meeting. [22:01:42] Thanks for the info. 
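A minimal sketch of the out-of-order write protection ebernhardson describes above, where the revision ID is stored as the document "version". This is an illustrative in-memory toy, not the CirrusSearch or Elasticsearch API; the class name `VersionedStore`, the page ID, and the revision numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Toy in-memory index (hypothetical; not the real ES client)."""
    # Maps page_id -> (version, body); version stands in for rev_id.
    docs: dict = field(default_factory=dict)

    def write(self, page_id: int, version: int, body: dict) -> bool:
        """Store body only if version is newer than what is already held."""
        current = self.docs.get(page_id)
        if current is not None and current[0] >= version:
            # A stale or duplicate write arrived late: reject it.
            return False
        self.docs[page_id] = (version, body)
        return True


store = VersionedStore()
results = [
    store.write(1, 200, {"text": "newer revision"}),
    # An older revision arriving after the newer one must not overwrite it.
    store.write(1, 150, {"text": "older revision, arrived late"}),
]
```

Here the second write is rejected because its version is lower than the stored one, so the index keeps the newer revision regardless of arrival order.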
[22:01:48] I'll be scoping this out more later :D [22:01:50] i'll add a thing to document somewhere, cant hurt :) [22:07:18] (03PS2) 10BBlack: ssl_ciphersuite: add ECDHE+3DES options [puppet] - 10https://gerrit.wikimedia.org/r/249017 [22:08:57] (03CR) 10BBlack: [C: 032] "Did some ciphersuite simulations against our raw clienthello data, and there are some (very few <0.01% in my limited data) clients that wi" [puppet] - 10https://gerrit.wikimedia.org/r/249017 (owner: 10BBlack) [22:12:35] (03CR) 10BryanDavis: "Tested via cherry-pick in beta cluster; fixes T117574" [puppet] - 10https://gerrit.wikimedia.org/r/250837 (https://phabricator.wikimedia.org/T117574) (owner: 10BryanDavis) [22:13:20] (03PS3) 10Rush: Add deploy masters to beta /etc/dsh/group/mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/250837 (https://phabricator.wikimedia.org/T117574) (owner: 10BryanDavis) [22:15:04] (03CR) 10Rush: [C: 032] Add deploy masters to beta /etc/dsh/group/mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/250837 (https://phabricator.wikimedia.org/T117574) (owner: 10BryanDavis) [22:24:08] 6operations, 5Patch-For-Review, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1779643 (10Dzahn) ms-be2020/2021 reinstalled, and they switched names around. 
but they don't have signed puppet certs yet [22:27:40] (03PS2) 10Dzahn: eventlogging: fix indentation for lint checks [puppet] - 10https://gerrit.wikimedia.org/r/250629 [22:27:48] (03CR) 10Dzahn: [C: 032] eventlogging: fix indentation for lint checks [puppet] - 10https://gerrit.wikimedia.org/r/250629 (owner: 10Dzahn) [22:29:35] (03PS2) 10Rush: Fetch scap from Phabricator instead of Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/250578 (owner: 10Chad) [22:31:41] (03CR) 10Rush: [C: 032] Fetch scap from Phabricator instead of Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/250578 (owner: 10Chad) [22:32:36] (03CR) 10Dzahn: "i think inheritance could be the issue yea, it definitely will be in the future because puppet3 won't support it anymore afaik" [puppet] - 10https://gerrit.wikimedia.org/r/250038 (owner: 10Muehlenhoff) [22:32:46] (03CR) 10Dzahn: [C: 04-1] impala: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/250038 (owner: 10Muehlenhoff) [22:33:58] (03CR) 10Dzahn: "i agreed with reverting the gravatar change and the bzip2 change, here i don't really have an opinion or knowledge to base it on" [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) (owner: 10Paladox) [22:34:28] (03PS4) 10Dzahn: gitblit: Fix "Sorry, the repository $1 does not have a $2 branch" [puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: 10Paladox) [22:35:15] (03CR) 10Dzahn: [C: 04-1] "per Chad" [puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: 10Paladox) [22:36:46] !log tin: scap now pointing to Phab repo instead of Gerrit. 
[22:36:47] YuviPanda: which host does "k8" run on [22:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:08] (03PS2) 10Dzahn: k8s: Move the ferm fules into the role [puppet] - 10https://gerrit.wikimedia.org/r/246295 (owner: 10Muehlenhoff) [22:38:23] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: puppet fail [22:39:34] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Puppet has 2 failures [22:40:03] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Puppet has 1 failures [22:40:04] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: Puppet has 1 failures [22:40:13] PROBLEM - puppet last run on elastic1003 is CRITICAL: CRITICAL: Puppet has 1 failures [22:40:14] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 2 failures [22:40:16] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1779694 (10RobH) @Ssastry: So if it isn't puppetized, you'd need to have the same full sudo r... [22:40:23] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures [22:40:24] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Puppet has 1 failures [22:40:31] checks puppetmaster [22:40:36] that doesnt look promising [22:40:43] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Puppet has 1 failures [22:40:44] PROBLEM - puppet last run on mw1167 is CRITICAL: CRITICAL: Puppet has 1 failures [22:40:55] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: Puppet has 1 failures [22:40:59] yeah seems like strontium failing? [22:41:14] (03CR) 10Paladox: "Yes but until it is changed on mediawiki we should fix gitblit so it works." 
[puppet] - 10https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: 10Paladox) [22:41:25] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:41:40] all the puppet procs there are < 10 minutes old (master + apache) ? [22:41:42] bblack: yes, but when i got to look at the log it was already compiling catalogs again [22:41:54] PROBLEM - puppet last run on mw2040 is CRITICAL: CRITICAL: Puppet has 1 failures [22:41:58] and then recovery after i ran puppet on logstash1004 [22:42:04] RECOVERY - puppet last run on mw1037 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:42:21] oh this could be delayed icinga reporting of just before the restart then, or at the moment of [22:42:33] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:42:35] i think it is. i did nothing and it looks ok [22:42:44] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Puppet has 1 failures [22:45:46] (03PS2) 10Dzahn: deployment::server: move IPv6 int to role [puppet] - 10https://gerrit.wikimedia.org/r/250619 [22:46:21] (03CR) 10Paladox: "Ok. Well the performance seems to be bad after upgrading not sure why I notice it to be a little slower but as we are moving to phabricato" [puppet] - 10https://gerrit.wikimedia.org/r/250453 (https://phabricator.wikimedia.org/T117393) (owner: 10Paladox) [22:47:59] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1779729 (10demon) p:5Triage>3Normal [22:49:33] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1779738 (10Krinkle) Checking https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors shows these connection errors from RunJobs for... 
[22:49:47] (03CR) 10Paladox: "@Dzahn it shows In the file that 0 = hide tags/branches so it is hiding tags. it is currently web.summaryRefsCount = 0 I am changing it to" [puppet] - 10https://gerrit.wikimedia.org/r/250449 (owner: 10Paladox) [22:54:46] (03PS1) 10Reedy: Remove 3 old wmgMonologChannels related to closed bugs/tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250850 [22:54:59] (03CR) 10Dzahn: "please discuss with ori who made the original change" [puppet] - 10https://gerrit.wikimedia.org/r/250449 (owner: 10Paladox) [22:55:13] (03PS1) 10MaxSem: Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250851 [22:55:33] (03CR) 10Dzahn: [C: 032] deployment::server: move IPv6 int to role [puppet] - 10https://gerrit.wikimedia.org/r/250619 (owner: 10Dzahn) [22:56:42] (03PS1) 10Reedy: Temporary: Enable 'error' log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250852 [22:57:21] (03CR) 10jenkins-bot: [V: 04-1] Temporary: Enable 'error' log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250852 (owner: 10Reedy) [22:57:30] lol [22:57:58] (03PS2) 10Reedy: Temporary: Enable 'error' log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250852 [23:04:48] Reedy: I thought about removing some of those, but wondered if they were meant to be kept in case the bug(s) ever resurfaced [23:04:54] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:05:04] RECOVERY - puppet last run on elastic1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [23:05:14] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [23:05:14] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:05:34] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [23:05:35] RECOVERY 
- puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:05:42] ostriches: I'd presume that the log calls were just added temporarily etc... I can grep the relevant code bases later to see if the calls are still there [23:05:54] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:06:08] Reedy: Probably best to, just in case. [23:06:44] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:06:46] (03CR) 10Chad: "If we merge I5fead02b it'll shut up one of the spammiest warnings we saw last time this was on." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250852 (owner: 10Reedy) [23:06:53] RECOVERY - puppet last run on mw2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:06:55] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:07:12] (03PS3) 10Krinkle: Temporary: Enable 'error' log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250852 (owner: 10Reedy) [23:07:23] RECOVERY - puppet last run on mw2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:07:35] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:07:38] (03CR) 10Reedy: [C: 04-1] "Need to grep to check these debug logs are still not in place in relevant code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250850 (owner: 10Reedy) [23:10:06] jouncebot: next [23:10:06] In 0 hour(s) and 49 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151104T0000) [23:14:27] (03PS2) 10Dzahn: apache: indentation of => [puppet] - 10https://gerrit.wikimedia.org/r/250635 [23:35:21] 10Ops-Access-Requests, 6operations, 10Wikidata, 5Patch-For-Review: Requesting access to 
researchers for Addshore - https://phabricator.wikimedia.org/T116784#1780014 (10Deskana) Thanks! [23:46:15] (03PS1) 10Dereckson: Set $wgCategoryCollation for bs.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250862 (https://phabricator.wikimedia.org/T116527)