[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T0000). Please do the needful. [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:04:32] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:05:22] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 76137 bytes in 0.634 second response time [00:06:09] !log releases1001/2001 - upgraded kernel image, planet - upgraded openssl et al [00:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:49] !log contint1001/2001 - upgraded php5-related packages [00:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:32] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885143 (10RobH) [00:17:20] !log netmon1002/2001 - upgraded php7.0 related packages | krypton (webserver_misc_apps) - upgraded php5 packages [00:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:40] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3884118 (10RobH) @cy534: Please note we'll need a few further details from you to make this happen. These are documented on https://wikitech.wikimedia.org/wiki/Requesting_shell_acc... 
[00:19:51] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885158 (10RobH) a:03cy534 [00:20:13] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Package/upload service-checker for Debian stretch - https://phabricator.wikimedia.org/T184224#3885164 (10dduvall) @Joe, will the existing service-checker package for jessie work for stretch? The dependencies don't look all that complicated. [00:25:13] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885171 (10atgo) Hi @RobH ! I'm the staff member sponsoring this request (and approve it). This is for analyst work supporting #new-readers . The NDA is in process with Legal. We'll... [00:26:26] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885190 (10RobH) [00:30:32] PROBLEM - HHVM rendering on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:41] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885195 (10cy534) Thanks @RobH! Using the @gmail.com address for this request will be fine. [00:31:22] RECOVERY - HHVM rendering on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 76137 bytes in 0.288 second response time [00:40:47] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885215 (10RobH) It doesn't quite answer the question, but I'll paste the two options below: ``` analytics-users Access to stat1004 to connect to the Analytics/Cluster (Hadoop/... 
[00:43:42] !log phabricator servers: upgraded php5-*, openssh [00:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:23] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885284 (10cy534) Thanks again @RobH. I spoke with another analyst who believe thats analytics-users will be sufficient. So, let's go with that unless @atgo thinks otherwise. [01:08:10] (03CR) 10Awight: Update gerrit login display (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [01:09:33] (03CR) 10Dzahn: [C: 032] prometheus: move duplicate include, use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/397727 (owner: 10Dzahn) [01:10:25] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885287 (10RobH) >>! In T184473#3885284, @cy534 wrote: > Thanks again @RobH. I spoke with another analyst who believe thats analytics-users will be sufficient. So, let's go with that... [01:10:43] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885288 (10RobH) [01:10:52] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3884118 (10RobH) p:05Triage>03Normal [01:13:32] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:13:51] yea, it's me. i got it^ [01:14:43] just a sec [01:15:22] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:17:16] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3885303 (10RobH) [01:18:41] (03CR) 10Paladox: Update gerrit login display (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [01:18:54] (03PS1) 10Dzahn: Revert "prometheus: move duplicate include, use profile::base::firewall" [puppet] - 10https://gerrit.wikimedia.org/r/403087 [01:19:03] (03PS1) 10RobH: adding shell user cy534 [puppet] - 10https://gerrit.wikimedia.org/r/403088 (https://phabricator.wikimedia.org/T184473) [01:19:24] (03CR) 10jerkins-bot: [V: 04-1] Revert "prometheus: move duplicate include, use profile::base::firewall" [puppet] - 10https://gerrit.wikimedia.org/r/403087 (owner: 10Dzahn) [01:20:43] (03PS3) 10RobH: adding imarlier to groups [puppet] - 10https://gerrit.wikimedia.org/r/402103 (https://phabricator.wikimedia.org/T184190) [01:21:00] (03CR) 10Dzahn: [V: 032 C: 032] Revert "prometheus: move duplicate include, use profile::base::firewall" [puppet] - 10https://gerrit.wikimedia.org/r/403087 (owner: 10Dzahn) [01:21:47] (03PS4) 10RobH: adding imarlier to groups [puppet] - 10https://gerrit.wikimedia.org/r/402103 (https://phabricator.wikimedia.org/T184190) [01:22:01] mutante: rebase race =P [01:22:21] hehe [01:22:25] i assume its you maybe not [01:22:32] (03PS1) 10RobH: adds cy534 to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/403090 (https://phabricator.wikimedia.org/T184473) [01:22:47] (03CR) 10RobH: [C: 032] adding imarlier to groups [puppet] - 10https://gerrit.wikimedia.org/r/402103 (https://phabricator.wikimedia.org/T184190) (owner: 10RobH) [01:23:32] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:24:04] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - 
https://phabricator.wikimedia.org/T184473#3885331 (10RobH) [01:26:17] robh: sorry, it was a special case where i had no choice but to V+2 to revert, heh [01:26:22] normally wouldnt [01:27:00] well, at least to keep it a clean revert and not amend to it so it wouldnt actually be a revert [01:30:22] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:31:03] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:33:32] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:36:12] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:43:08] no worries was just amusing to be in rebase race at this hour heh [01:43:52] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:44:27] 10Operations, 10Ops-Access-Requests: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3885370 (10RobH) [01:44:49] hrmm, race condition of adding new users on stat machines maybe, checking [01:45:19] i went too quickly in merging them, annoying... [01:45:54] i have to revert my group addition, and let the user add run correctly... [01:46:11] oh wait, it was just a user add... strange... [01:46:13] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:46:34] (03PS1) 10RobH: Revert "adding imarlier to groups" [puppet] - 10https://gerrit.wikimedia.org/r/403091 [01:46:44] mutante: so these failures are me not you i think? 
[01:46:53] cuz it references the admin module, and i just merged a change to the module [01:47:11] (03CR) 10RobH: [C: 032] Revert "adding imarlier to groups" [puppet] - 10https://gerrit.wikimedia.org/r/403091 (owner: 10RobH) [01:48:10] on the systems whose groups he was added to... meh its too late to troubleshoot just reverting and testing if puppet works on those hosts. [01:49:02] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:49:11] yes yes icinga i know [01:49:23] rerunning puppet agent on stat1005 after revert and so far so good. [01:49:39] yeah, ok, it ran without issues, so the compile failure was totally on my changeset [01:50:15] 10Operations, 10Ops-Access-Requests: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3885375 (10RobH) 05Resolved>03Open >>! In T184190#3885368, @RobH wrote: > This is merged live, and affected... 
[01:50:29] all of the failed compilation alerts will clear up as they call in again [01:51:03] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:58:32] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [02:01:12] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:13:52] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:14:02] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:16:22] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:23:31] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.15) (duration: 05m 31s) [02:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:45] 10Operations, 10MediaWiki-Configuration, 10Availability (Multiple-active-datacenters), 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), and 4 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3885442 (10CCicalese_WMF) [03:26:02] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 798.99 seconds [03:33:55] robh: wasnt here earlier but the issue was that imarlier is "ldap-only" admin and not shell admin user, so adding one of these to groups means it's trying to create but there is no key field etc [03:34:30] needs to be moved from ldap section to shell user section and get key [03:34:39] yeah i realized that after hte fact [03:34:43] =P [03:34:54] forgot i had to do two patchsets too ;D [03:35:01] so ill just fix tomorrow [03:35:11] (i reverted it so it wasnt breaking things any longer) [03:35:27] yep :) sounds good 
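Note on the diagnosis above (03:33:55 to 03:34:30): the catalog failures came from adding an "ldap-only" user to shell groups, so puppet tried to create an account for which no key was defined. A minimal sketch of a pre-merge check for that situation follows. The data layout (GROUPS, SHELL_USERS, LDAP_ONLY_USERS), the "members" and "ssh_keys" field names and the function name are assumptions made for illustration only; they are not taken from the actual admin module, and the sample data is invented apart from the user and group names mentioned in the log.

```python
# Hypothetical layout: group name -> {"members": [...]}, plus two user tables.
# None of these names or structures are confirmed by the log; they only
# illustrate the "ldap-only user added to a shell group" problem.
GROUPS = {
    "perf-team": {"members": ["imarlier"]},
    "analytics-users": {"members": ["cy534"]},
}
SHELL_USERS = {"cy534": {"ssh_keys": ["ssh-rsa PLACEHOLDER"]}}
LDAP_ONLY_USERS = {"imarlier": {}}


def ldap_only_group_members(groups, shell_users, ldap_only_users):
    """Return {group: [users]} for members that exist only as ldap-only users."""
    problems = {}
    for group, info in groups.items():
        bad = sorted(
            member
            for member in info.get("members", [])
            if member in ldap_only_users and member not in shell_users
        )
        if bad:
            problems[group] = bad
    return problems


print(ldap_only_group_members(GROUPS, SHELL_USERS, LDAP_ONLY_USERS))
# {'perf-team': ['imarlier']}
```

A check of this kind would have flagged the group addition before merge; the fix described in the log (move the user to the shell user section and add a key) makes the report empty.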
[03:35:32] not sure why i didnt do it that way to start, i must have had it in git stash or something and forgot, who knows [03:46:28] (03PS1) 10Dzahn: prometheus: remove duplicate firewall include [puppet] - 10https://gerrit.wikimedia.org/r/403094 [03:47:41] (03CR) 10Dzahn: [C: 032] prometheus: remove duplicate firewall include [puppet] - 10https://gerrit.wikimedia.org/r/403094 (owner: 10Dzahn) [03:50:01] and that's what i couldnt submit earlier after i had to revert.. all good .. bye, out [03:59:15] (03PS1) 10Aaron Schulz: Replace yubikey nano key with yubikey 4 key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/403095 [04:05:48] (03PS2) 10Aaron Schulz: Replace yubikey nano key with yubikey 4 key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/403095 [04:06:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 213.11 seconds [04:12:23] PROBLEM - HHVM rendering on mw2226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:13] RECOVERY - HHVM rendering on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 75113 bytes in 0.292 second response time [04:17:30] mutante: I did the tin file thing again. 
[sigh], the nano one got untied from my key chain :/ [04:17:48] (03PS1) 10KartikMistry: apertium-mlt-ara: Update dependency on cg3 [debs/contenttranslation/apertium-mlt-ara] - 10https://gerrit.wikimedia.org/r/403099 (https://phabricator.wikimedia.org/T171406) [04:18:23] (03CR) 10jerkins-bot: [V: 04-1] apertium-mlt-ara: Update dependency on cg3 [debs/contenttranslation/apertium-mlt-ara] - 10https://gerrit.wikimedia.org/r/403099 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [04:20:52] (03PS1) 10KartikMistry: apertium-nno: Update dependency on cg3 [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/403100 (https://phabricator.wikimedia.org/T171406) [04:21:34] (03CR) 10jerkins-bot: [V: 04-1] apertium-nno: Update dependency on cg3 [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/403100 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [04:24:11] (03PS1) 10KartikMistry: apertium-nno-nob: Update dependency on cg3 [debs/contenttranslation/apertium-nno-nob] - 10https://gerrit.wikimedia.org/r/403101 (https://phabricator.wikimedia.org/T171406) [04:24:37] (03CR) 10jerkins-bot: [V: 04-1] apertium-nno-nob: Update dependency on cg3 [debs/contenttranslation/apertium-nno-nob] - 10https://gerrit.wikimedia.org/r/403101 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [04:27:10] (03PS3) 10Tim Starling: Fix 'sql' script for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/397912 (https://phabricator.wikimedia.org/T182713) (owner: 10Anomie) [04:27:30] (03CR) 10Tim Starling: [C: 032] Fix 'sql' script for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/397912 (https://phabricator.wikimedia.org/T182713) (owner: 10Anomie) [04:32:29] (03CR) 10Tim Starling: "This is getting really ugly. I wonder if it is possible to rewrite it in PHP. 
Or at least move all the PHP one-liners to a single helper s" [puppet] - 10https://gerrit.wikimedia.org/r/397913 (owner: 10Anomie) [04:35:51] just saying, I'm around for the database split. [04:53:13] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-eventlogging04 due to missing directory '/var/lib/superset'? - https://phabricator.wikimedia.org/T184238#3885503 (10mmodell) [05:02:43] 10Puppet, 10Beta-Cluster-Infrastructure, 10ORES, 10Scoring-platform-team: Puppet broken on deployment-ores01 due to missing hieradata - https://phabricator.wikimedia.org/T184478#3885510 (10Ladsgroup) Yes, That's an ongoing work which I failed to finish yesterday and this is the first thing to pick up and f... [05:05:41] Now ORES in beta cluster is GDFR [05:06:35] (03PS5) 10Jcrespo: db-eqiad.php: Set s5 on read_only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401434 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [05:08:07] Morning :) [05:08:09] Thanks Amir1 [05:08:34] morning [05:08:36] hello o/ \o [05:10:50] I am going to start with the pre work tasks [05:12:37] !log Start pre-failover tasks T177208 T181645 [05:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:50] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208 [05:12:50] T181645: Help communicate read-only time for dewiki and wikidata for database split - https://phabricator.wikimedia.org/T181645 [05:13:03] RECOVERY - exim queue on mx1001 is OK: OK: Less than 1000 mails in exim queue. 
[05:13:19] (03PS2) 10Marostegui: db1070.yaml: Update new socket location [puppet] - 10https://gerrit.wikimedia.org/r/401433 (https://phabricator.wikimedia.org/T177208) [05:17:10] All done, as puppet agent has been disabled on db1070 after the full ugprade I am going to merge: https://gerrit.wikimedia.org/r/#/c/401433/ [05:18:17] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T184285#3885517 (10Marostegui) 05Open>03Resolved Thanks! ``` root@db2055:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337C9270) Port Name: 1I Port Name:... [05:26:29] (03CR) 10Marostegui: [C: 032] db1070.yaml: Update new socket location [puppet] - 10https://gerrit.wikimedia.org/r/401433 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [05:36:33] PROBLEM - Check Varnish expiry mailbox lag on cp4024 is CRITICAL: CRITICAL: expiry mailbox lag is 2058707 [05:43:44] Going to merge: https://gerrit.wikimedia.org/r/#/c/401434/5 (but NOT deploy) [05:44:53] BTW, we take an exclusive lock on both puppet and mediawiki-config deployment right now [05:45:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Set s5 on read_only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401434 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [05:46:16] ping if you want to deploy something, but we reserved it at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T0600 [05:47:16] (03Merged) 10jenkins-bot: db-eqiad.php: Set s5 on read_only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401434 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [05:47:27] (03CR) 10jenkins-bot: db-eqiad.php: Set s5 on read_only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401434 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [05:48:46] (03PS1) 10Marostegui: db-eqiad.php: Set s5, s8 read only OFF [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/403104 (https://phabricator.wikimedia.org/T177208) [05:55:50] hi [05:56:01] In 5 minutes I will deploy the read_only [05:56:05] Hi aude! Morning! [05:56:16] ok [05:57:10] (03PS1) 10KartikMistry: apertium-nob: Update dependency on cg3 [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/403105 (https://phabricator.wikimedia.org/T171406) [05:57:41] (03CR) 10jerkins-bot: [V: 04-1] apertium-nob: Update dependency on cg3 [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/403105 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [05:57:45] aude: o/ [06:00:04] marostegui and jynus: Time to snap out of that daydream and deploy s8 Migration. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T0600). [06:00:04] No GERRIT patches in the queue for this window AFAICS. [06:00:07] morning [06:00:25] Deploying [06:00:41] <_joe_> hi :) [06:00:45] hi [06:01:02] oh, the calvary :) [06:01:06] i see https://gerrit.wikimedia.org/r/#/c/403104/ [06:01:25] i know there is also a mediawiki setting, do we need to set that also? [06:01:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Set s5 on read-only to start failover T177208 T181645 (duration: 00m 50s) [06:01:50] setting mysql read only [06:01:58] done [06:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:02] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208 [06:02:02] T181645: Help communicate read-only time for dewiki and wikidata for database split - https://phabricator.wikimedia.org/T181645 [06:02:12] aude: not the time, but that takes care of it [06:02:15] ok [06:02:32] I see everything read only [06:02:41] me too [06:02:42] noted positions [06:02:48] ready to stop mysql on db1070? 
[06:02:49] confirmed on wiki [06:03:04] ok to go [06:03:08] ok [06:03:17] stopping + killing heartbeat + running puppet [06:03:28] go [06:03:39] yeah, on its way :) [06:04:22] still stopping [06:04:41] running puppet [06:05:13] starting mysql [06:05:28] running mysql_upgrade [06:05:39] running puppet for heartbeat [06:06:11] pt-heartbeat is up [06:06:19] jynus: you can stop replication between db1070 and db1071 [06:06:19] binlog_format | STATEMENT [06:06:23] doing [06:06:58] once done I will rename the wikidata tables on db1070 - let me know [06:07:10] (03PS7) 10Marostegui: db-eqiad.php: Point wikidatawiki to s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) [06:07:17] !log stopping slave and resetting on db1071 [06:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:42] done, marostegui [06:07:49] I got the latest coord too [06:07:50] ok [06:07:52] renaming [06:07:52] but not copying it [06:07:57] Done [06:08:27] we are split now [06:08:39] waiting for CI on https://gerrit.wikimedia.org/r/#/c/401436/ [06:08:42] time to go live! 
[06:08:49] merging [06:08:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Point wikidatawiki to s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [06:09:00] only deploying db-eqiad.php [06:09:06] cool [06:09:30] fatalmonitor is going crazy but that's expected [06:09:38] yeah :-( [06:09:50] it should be ok when we are back in rw [06:10:19] the jobs' fault [06:10:34] (03Merged) 10jenkins-bot: db-eqiad.php: Point wikidatawiki to s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [06:10:44] (03CR) 10jenkins-bot: db-eqiad.php: Point wikidatawiki to s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [06:10:44] latest edit on wikidata 06:01 [06:11:29] marostegui: is it deploying? [06:11:32] yep [06:11:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Splitting s5 and s8 T177208 T181645 (duration: 00m 50s) [06:11:59] ready for read_only off? 
[06:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:07] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208 [06:12:07] go [06:12:07] T181645: Help communicate read-only time for dewiki and wikidata for database split - https://phabricator.wikimedia.org/T181645 [06:12:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Set s5, s8 read only OFF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403104 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [06:12:24] (03PS2) 10Marostegui: db-eqiad.php: Set s5, s8 read only OFF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403104 (https://phabricator.wikimedia.org/T177208) [06:12:55] Will also remove read only from mysql as soon as the deploy is done [06:13:31] (03PS1) 10KartikMistry: apertium-oc-ca: Cleanup [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/403106 [06:13:57] A big piece of the read only is waiting for CI :) [06:14:00] tendril is looking ok [06:14:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen [06:14:10] QPS increasing [06:14:14] (03CR) 10jenkins-bot: db-eqiad.php: Set s5, s8 read only OFF [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403104 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [06:14:15] on s8 [06:14:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove read_only from s5 and s8 T177208 T181645 (duration: 00m 27s) [06:14:47] <_joe_> https://cdn.meme.am/cache/instances/folder46/65289046.jpg [06:14:53] we are live! 
[06:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:59] read only is also off on mysql [06:14:59] :)))) [06:15:14] I can see binlog growing, testing edits [06:15:25] <_joe_> so, now let's edit the 5 most used wikidata items, and kill the jobqueue :P [06:15:38] \o/ [06:15:38] * mark takes away joe's coffee [06:15:40] edits succesful [06:15:49] fatals and errors are faltlined now [06:15:53] *flatlined [06:15:54] I can edit finely on wikidatawiki [06:16:01] (Q2316811)‎; 06:14 . . (0)‎ . . ‎Jorge (talk | contribs)‎ (‎Page moved from [eswiki:Bandera de la Región de Antofagasta] to [eswiki:Bandera de la región de Antofagasta]) [06:16:19] _joe_: it is a chicken and egg problem [06:16:25] watchlists working fine on dewiki too [06:16:39] regarding job queue- disabling it also is not perfect [06:16:44] jynus: ok to deploy s5.dblists, s8.dblists and noc [06:16:46] _joe_: regarding job queue and wikidata and refreshlinks, I made an amazing discovery a couple of days ago [06:16:55] <_joe_> Amir1: do share! 
[06:17:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=2fullscreen [06:17:06] the best option would be for the jobqueue to understand read only [06:17:37] we are not yet out of the forest, a revert is still on the table [06:17:39] any sort of edit on certain (and quite often edited) items in wikidata causes around 1M refresh on ruwiki [06:17:51] 1M update per edit [06:17:54] <_joe_> Amir1: and that we knew [06:18:06] Servers are doing fine so far, and they were kinda cold [06:18:08] I got the list and how that can be fixed now [06:18:09] <_joe_> Amir1: I didn't know which objects specifically though [06:18:13] yes, we knew when watchlist and rcs broke on ruwiki [06:18:17] <_joe_> Amir1: that's great [06:18:27] due to large amount of rcs coming from wikidatawiki [06:18:37] and I have several bugs open beceause of that [06:18:56] https://quarry.wmflabs.org/query/23917 [06:19:03] <_joe_> jynus: syncing wikidata is one of the largest issues we have at an infrastructural level [06:19:27] any edit on P856 causes 0.6M refreshes on ruwiki [06:19:28] I agree [06:19:38] (03PS1) 10KartikMistry: apertium-oc-es: Cleanup [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/403107 [06:19:46] Amir1: and 0.6 insertions on RCs [06:19:50] among others [06:19:57] jynus: that part is disabled for now [06:19:58] s8 is a bit low [06:20:15] on throughput [06:20:35] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [06:20:37] there is one big thing that will come soon(TM) and kill this problem [06:20:38] I don't see insert failures with: table doesn't exist, so, so far nothing is inserting on the wrong side :) [06:20:44] fine-grained usage tracking [06:20:59] practically we kill X aspect [06:21:41] <_joe_> Amir1: 
oh ok so fine-grained tracking, that I knew about :) [06:21:46] not to cut the conversation short, but we havent finished [06:21:55] <_joe_> jynus: sorry, go on [06:22:01] thanks jynus and marostegui :) [06:22:15] marostegui: other deployments needed? [06:22:21] s5,s8 and noc [06:22:42] let's go, unless someone sees something that requires revert [06:22:46] any error [06:22:49] anything strange? [06:22:57] I don't see anything [06:23:01] speak now before we commit to the new hardware even more [06:23:02] <_joe_> jynus: fatalmonitor and icinga are ok [06:23:14] no user reports? [06:23:40] mediawiki-side logs are fine [06:23:42] performance issues [06:23:43] ? [06:24:10] <_joe_> not that I see [06:24:14] I was checking for lags and stuff like that as the servers were sort of cold [06:24:17] I see a spike but that would be expected [06:24:17] but so far they look fine [06:24:43] I am going to deploy the dblists [06:25:05] go [06:25:11] the spike started yesterday actually [06:25:14] <_joe_> jynus: I think by all means this is a go :) [06:25:27] _joe_: https://grafana.wikimedia.org/dashboard/db/save-timing?panelId=11&fullscreen&orgId=1 [06:25:45] unrelated, but you may want to check if you need to restart some mw server? [06:26:01] timing seems to grow again like on christmas? 
[06:26:19] !log marostegui@tin Synchronized dblists/s5.dblist: Deploy the dblist files with the correct databases after the split (duration: 00m 53s) [06:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:45] <_joe_> jynus: that can very well be an artifact of that graph [06:26:50] that is true [06:26:52] looking back [06:27:04] <_joe_> but I'm going to check anyways [06:27:08] (03PS1) 10KartikMistry: apertium-pt-ca: Cleanup [debs/contenttranslation/apertium-pt-ca] - 10https://gerrit.wikimedia.org/r/403109 [06:27:08] it could be the normal peak time [06:27:16] !log marostegui@tin Synchronized dblists/s8.dblist: Deploy the dblist files with the correct databases after the split (duration: 00m 50s) [06:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:41] <_joe_> jynus: if you look at the last 6 hours, you get a different picture [06:27:55] cool [06:28:07] I am trying to cover all things [06:28:22] <_joe_> jynus: of course, we don't remap shards that often [06:28:37] we exected a hit on s8, too [06:28:52] it is impossible to warmup perfectly wikidata [06:29:20] !log marostegui@tin Synchronized docroot/noc/conf/s8.dblist: Deploy the dblist files with the correct databases after the split (duration: 00m 48s) [06:29:30] All the dblists and noc are now deployed [06:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:38] cool [06:30:41] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885585 (10Marostegui) [06:32:20] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3770320 (10Marostegui) Failover is done Read only started: 06:01 Read only finished: 06:14 Thanks @Ladsgroup @aude @mark @joe for being online and supporting the DBAs! The... 
[06:32:23] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885588 (10Marostegui) Failover is done Read only started: 06:01 Read only finished: 06:14 Thanks @Ladsgroup @aude @mark @joe for being online and supporting the DBAs! The... [06:32:25] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885589 (10Marostegui) Failover is done Read only started: 06:01 Read only finished: 06:14 Thanks @Ladsgroup @aude @mark @joe for being online and supporting the DBAs! The... [06:32:45] <_joe_> !log restarting pdfrender on scb1003 [06:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:55] Why did get written 3 times...weird [06:33:32] smooth work guys :) [06:33:32] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [06:33:52] phabricator weirdness? [06:34:37] hats off guys! 
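Note on the read-only window: the task update above gives it as 06:01 to 06:14. As a rough cross-check, the sketch below takes the two "Synchronized" !log entries copied from this log (read-only on, read-only off) and subtracts their timestamps. The helper names are made up for this note, and it assumes both entries fall on the same day and that the bot timestamp is a fair proxy for when each sync finished.

```python
from datetime import datetime, timedelta

# The two relevant entries, copied verbatim from the log above.
LOG_LINES = [
    "[06:01:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: "
    "Set s5 on read-only to start failover T177208 T181645 (duration: 00m 50s)",
    "[06:14:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: "
    "Remove read_only from s5 and s8 T177208 T181645 (duration: 00m 27s)",
]


def entry_time(line):
    """Parse the leading [HH:MM:SS] stamp of a log entry."""
    return datetime.strptime(line[1:9], "%H:%M:%S")


def read_only_window(lines):
    """Elapsed time between the first and last entry (same day assumed)."""
    return entry_time(lines[-1]) - entry_time(lines[0])


window = read_only_window(LOG_LINES)
print(window)  # 0:12:58
```

That is just under 13 minutes, which agrees with the 06:01 to 06:14 figures reported on the task.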
[06:34:59] I wrote on another ticket and it went fine [06:36:18] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885593 (10jcrespo) [07:22:14] (03PS1) 10Jcrespo: mariadb: Setup s8 on dbstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/403110 (https://phabricator.wikimedia.org/T177208) [07:22:48] (03CR) 10Marostegui: [C: 031] mariadb: Setup s8 on dbstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/403110 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [07:53:04] (03PS5) 10Faidon Liambotis: wmflib: cleanup secret.rb a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359449 [07:53:07] (03PS5) 10Faidon Liambotis: graphite: cleanup configparser_format a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359451 [07:53:09] (03PS6) 10Faidon Liambotis: Fix Style/FormatString RuboCop across all Rakefiles [puppet] - 10https://gerrit.wikimedia.org/r/359452 [07:53:11] (03PS6) 10Faidon Liambotis: wmflib: fix RuboCop infractions in serializers [puppet] - 10https://gerrit.wikimedia.org/r/359453 [07:53:16] (03PS5) 10Faidon Liambotis: Fix more whitespace-related RuboCop across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359478 [07:53:16] (03PS5) 10Faidon Liambotis: Fix Style/RegexpLiteral RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359479 [07:53:17] (03PS4) 10Faidon Liambotis: wmflib: fix another couple minor RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359480 [07:53:19] (03PS4) 10Faidon Liambotis: wmflib, admin: fix RuboCop Style/For offenses [puppet] - 10https://gerrit.wikimedia.org/r/359481 [07:53:21] (03PS4) 10Faidon Liambotis: base: fix RuboCop MethodCallWithoutArgsParentheses [puppet] - 10https://gerrit.wikimedia.org/r/359482 [07:53:23] (03PS4) 10Faidon Liambotis: utils/expanderrb.rb: fix Style/SpecialGlobalVars [puppet] - 10https://gerrit.wikimedia.org/r/359483 [07:53:25] (03PS4) 10Faidon Liambotis: Fix 
Style/NegatedIf RuboCop offense across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359484 [07:53:27] (03PS4) 10Faidon Liambotis: rubocop: move three ignores to .rubocop.yml [puppet] - 10https://gerrit.wikimedia.org/r/359485 [07:56:55] (03CR) 10jerkins-bot: [V: 04-1] Fix Style/NegatedIf RuboCop offense across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359484 (owner: 10Faidon Liambotis) [07:58:03] wot [07:58:07] (03CR) 10jerkins-bot: [V: 04-1] Fix more whitespace-related RuboCop across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359478 (owner: 10Faidon Liambotis) [07:58:21] (03PS1) 10Giuseppe Lavagetto: site.pp: one role per node called with role() [puppet] - 10https://gerrit.wikimedia.org/r/403112 [07:58:27] (03CR) 10jerkins-bot: [V: 04-1] wmflib, admin: fix RuboCop Style/For offenses [puppet] - 10https://gerrit.wikimedia.org/r/359481 (owner: 10Faidon Liambotis) [07:59:04] these seem like a bunch of unrelated errors? [07:59:10] like [07:59:10] 07:56:53 rspec ./spec/functions/pick_initscript_spec.rb:10 # pick_initscript Returns false if no init script provided [07:59:35] has something changed? 
[07:59:53] <_joe_> paravoid: uhm, no idea, I'll have to check [08:00:19] <_joe_> in theory, those specs should run every time someone changes anything in wmflib [08:01:56] <_joe_> btw - we should really convert as many functions as we can to the new API [08:02:23] <_joe_> it's more deterministic (you can't mess with the local scope) and nicer to write to [08:02:40] <_joe_> not sure about some of our most glorious hacks like require_package [08:02:56] <_joe_> btw, someone changed something in pcc that makes its output unreadable [08:03:12] (03CR) 10Jcrespo: [C: 032] mariadb: Setup s8 on dbstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/403110 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [08:06:50] (03PS2) 10Giuseppe Lavagetto: site.pp: one role per node called with role() [puppet] - 10https://gerrit.wikimedia.org/r/403112 [08:15:59] !log stopping dbstore2001:s5 for cloning to s8 [08:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:49] (03PS1) 10ArielGlenn: load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) [08:19:00] (03CR) 10ArielGlenn: "I've tested this for abstracts dumps and it works with mediawiki master. Untested otherwise." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: 10ArielGlenn) [08:27:10] BTW apergos- we have now production testing databases, with large wikis like commons- if you need them, coordinate with MCR developers if that would help [08:28:00] jynus: that's great news! I can't make use of them right away, because the client end for testing would still be a little tiny labs instance right now [08:28:05] oooh but maybe in the future! 
[08:32:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:39:53] !log rebooting app servers in codfw for kernel security update [08:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:04] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9637/netmon1002.wikimedia.org/ seems ok" [puppet] - 10https://gerrit.wikimedia.org/r/403112 (owner: 10Giuseppe Lavagetto) [08:41:09] (03PS3) 10Giuseppe Lavagetto: site.pp: one role per node called with role() [puppet] - 10https://gerrit.wikimedia.org/r/403112 [08:45:28] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [08:51:42] (03PS1) 10KartikMistry: apertium-pt-gl: Cleanup [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/403116 [08:54:47] (03PS1) 10KartikMistry: apertium-rus: Update dependency on cg3 [debs/contenttranslation/apertium-rus] - 10https://gerrit.wikimedia.org/r/403118 (https://phabricator.wikimedia.org/T171406) [08:55:16] (03CR) 10jerkins-bot: [V: 04-1] apertium-rus: Update dependency on cg3 [debs/contenttranslation/apertium-rus] - 10https://gerrit.wikimedia.org/r/403118 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [09:04:53] (03PS1) 10Marostegui: mariadb: Add s8 to dbstore config and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/403119 (https://phabricator.wikimedia.org/T177208) [09:08:12] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler03/9639/" [puppet] - 10https://gerrit.wikimedia.org/r/403119 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [09:10:05] (03CR) 10Jcrespo: [C: 031] "This means we have to fix both dbstore1001 and 2 at the same time, isn't it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/403119 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [09:11:22] (03CR) 10Marostegui: "> This means we have to fix both dbstore1001 and 2 at the same time," [puppet] - 10https://gerrit.wikimedia.org/r/403119 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [09:11:25] !log roll restart swift in eqiad for kernel upgrade [09:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:26] (03Draft2) 10محمد شعیب: Changing namespaces on some Urdu language projects, 'وکی لغت‌' to 'ویکی لغت', 'وکی کتب' to 'ویکی کتب', 'وکی اقتباسات' to 'ویکی اقتباس'. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403120 [09:15:07] (03CR) 10Marostegui: [C: 032] mariadb: Add s8 to dbstore config and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/403119 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [09:16:37] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0 [09:17:48] (03PS11) 10Elukey: role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [09:18:18] (03CR) 10jerkins-bot: [V: 04-1] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [09:22:42] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885725 (10Marostegui) dbstore1002 is fixed [09:23:07] (03CR) 10Volans: [C: 031] "LGTM. The other approach would be to manually rename the interfaces in the udev rule file and reboot. I'm fine with both approaches." 
[puppet] - 10https://gerrit.wikimedia.org/r/402859 (https://phabricator.wikimedia.org/T167299) (owner: 10Ema) [09:23:30] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885726 (10Marostegui) [09:26:20] (03CR) 10Elukey: "Pcc for nihal and nitrogen: https://puppet-compiler.wmflabs.org/compiler03/9642/" [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [09:26:30] 10Puppet, 10Beta-Cluster-Infrastructure, 10Services (done): Puppet disabled for a month on deployment-restbase0[12] instances - https://phabricator.wikimedia.org/T184477#3885729 (10mobrovac) a:05Pchelolo>03mobrovac That was supposed to be a temporary measure to test reverting a problematic commit. I enab... [09:26:35] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3885733 (10mobrovac) [09:26:38] 10Puppet, 10Beta-Cluster-Infrastructure, 10Services (done): Puppet disabled for a month on deployment-restbase0[12] instances - https://phabricator.wikimedia.org/T184477#3885732 (10mobrovac) 05Open>03Resolved [09:30:39] (03PS6) 10Giuseppe Lavagetto: wmflib: cleanup secret.rb a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359449 (owner: 10Faidon Liambotis) [09:30:41] (03PS1) 10Giuseppe Lavagetto: secret: add spec [puppet] - 10https://gerrit.wikimedia.org/r/403121 [09:31:26] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): npm 1.4.21 can't use a http proxy - https://phabricator.wikimedia.org/T183569#3885738 (10hashar) [09:31:35] (03PS3) 10Filippo Giunchedi: RESTBase: Set up RESTBase for the production_ng role as well [puppet] - 10https://gerrit.wikimedia.org/r/401784 (https://phabricator.wikimedia.org/T184110) (owner: 10Mobrovac) [09:31:59] !log deploy restbase to cassandra 3 nodes [09:32:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:37] (03CR) 10Filippo Giunchedi: [C: 032] RESTBase: Set up RESTBase for the production_ng role as well [puppet] - 10https://gerrit.wikimedia.org/r/401784 (https://phabricator.wikimedia.org/T184110) (owner: 10Mobrovac) [09:33:01] (03PS2) 10Giuseppe Lavagetto: secret: add spec [puppet] - 10https://gerrit.wikimedia.org/r/403121 [09:33:09] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] secret: add spec [puppet] - 10https://gerrit.wikimedia.org/r/403121 (owner: 10Giuseppe Lavagetto) [09:34:31] mobrovac: I'll start with restbase2007 [09:34:56] kk godog [09:35:03] !log reboot kafka1014 for kernel updates [09:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:09] mobrovac: 2007 works, 2001 fails Server Error: Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::restbase::salt_key [09:37:30] (03PS7) 10Giuseppe Lavagetto: wmflib: cleanup secret.rb a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359449 (owner: 10Faidon Liambotis) [09:37:35] godog: hm that means it lack secrets somehow [09:37:48] yeah I'm checking on the master [09:38:04] <_joe_> I still didn't merge the change I was about to merge, I'll hold [09:38:26] 10Puppet, 10Beta-Cluster-Infrastructure, 10ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Puppet broken on deployment-ores01 due to missing hieradata - https://phabricator.wikimedia.org/T184478#3885745 (10Ladsgroup) a:03Ladsgroup [09:38:32] <_joe_> but that seems more of a hiera secret than an actual secret() [09:39:17] _joe_: thanks, yeah hiera secret [09:39:35] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:39:40] <_joe_> ok lemme know if I can proceed [09:40:08] _joe_: yup go ahead [09:41:35] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: cleanup secret.rb a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359449 (owner: 10Faidon Liambotis) [09:42:26] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4030_v4, cp4030_v6 [09:43:26] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 114 ESP OK [09:46:15] PROBLEM - MariaDB disk space on dbstore2001 is CRITICAL: DISK CRITICAL - free space: /srv 680260 MB (5% inode=99%) [09:48:18] ^ that is expected as j*nus is working out s5/s8 copies [09:48:31] I am trying to get rid of cruft [09:48:38] *unnecessary stuff [09:49:06] I didn't know we were so short on space there [09:49:37] maybe we need to defragment [09:49:49] the labsdb backup and the s8 doesn't help [09:50:16] RECOVERY - MariaDB disk space on dbstore2001 is OK: DISK OK [09:50:17] mobrovac: ok running on restbase2001 now, takes a couple of puppet runs to converge [09:50:24] I would like to delete the labsdb backup [09:50:24] k [09:50:29] it doesn't belong there [09:50:38] jynus: labsdb1003/1001 one? [09:50:59] it is only 16G, I wanted to have it on different places [09:51:15] ok [09:51:26] so we need then to delete something :-) [09:51:26] !log lvs3004: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [09:51:28] I can move it to 2002 if it is better there [09:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:37] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [09:52:03] I am going to delete sqldata.s1.deleteme [09:52:28] no idea what that is [09:52:28] ok, we now have 3TB available [09:52:38] that should be enough [09:53:07] probably 90% was 1 TB or so [09:53:09] nice! [09:53:48] mobrovac: can you check restbase on 2001 ? 
looks good according to service-checker-swagger [09:54:18] looking [09:54:35] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:55:38] godog: yup, looking good! [09:56:00] 10Operations, 10Developer-Relations: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3885767 (10Qgil) [09:56:03] 10Operations, 10Developer-Relations (Jan-Mar-2018), 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3885765 (10Qgil) 05Open>03Resolved [09:56:13] mobrovac: sweet, I'll re-enable and run puppet on the c2 nodes first [09:56:42] 10Operations, 10Developer-Relations (Jan-Mar-2018), 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3771512 (10Qgil) ... and announced! https://discourse-mediawiki.wmflabs.org/t/announcing-the-new-wikimedia-developer-support-... 
[09:57:42] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [09:59:24] !log failover traffic lvs3002 -> lvs3004 (new kernel) [09:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:22] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3041_v4, cp3041_v6 [10:00:23] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3041_v4, cp3041_v6 [10:00:23] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3041_v4, cp3041_v6 [10:00:32] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp3041_v4 not-conn: cp3041_v6 [10:00:32] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3041_v4, cp3041_v6 [10:00:33] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3041_v4, cp3041_v6 [10:00:52] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3041_v4, cp3041_v6 [10:00:52] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3041_v4, cp3041_v6 [10:00:52] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3041_v4, cp3041_v6 [10:00:52] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3041_v4, cp3041_v6 [10:00:52] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp3041_v4, cp3041_v6 [10:00:53] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3041_v4, cp3041_v6 [10:00:53] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 112 connecting: cp3041_v4 not-conn: cp3041_v6 [10:00:53] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3041_v4, cp3041_v6 [10:01:03] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: 
cp3041_v4, cp3041_v6 [10:01:12] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3041_v4, cp3041_v6 [10:01:13] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3041_v4, cp3041_v6 [10:02:22] PROBLEM - Host cp3041 is DOWN: PING CRITICAL - Packet loss = 100% [10:02:44] looking ^ [10:02:52] PROBLEM - pybal on lvs3002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [10:02:53] PROBLEM - PyBal backends health check on lvs3002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [10:03:07] cp3041 was just a bit slow at rebooting [10:03:21] !log reboot kafka-jumbo1002 for kernel updates [10:03:23] lvs3002 is me, sorry for the noise [10:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:02] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 110 connecting: cp3041_v4 not-conn: cp3041_v6, cp4023_v4, cp4023_v6 [10:05:37] uh, no, cp3041 is having troubles [10:05:38] NMI watchdog: BUG: soft lockup - CPU#38 stuck for 22s! [swapper/38:0] [10:06:47] godog: have you pooled rb2001? 
[10:07:00] mobrovac: no [10:07:26] !log cp3041 soft lockup, rebooting [10:07:27] mobrovac: I was leaving pooling for last, when everything is in place [10:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:42] kk [10:08:21] mobrovac: also I'm looking at https://config-master.wikimedia.org/pybal/codfw/restbase to double check [10:09:22] hm some nodes are not even there [10:09:31] i guess that will be fixed with the pooling [10:11:12] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [10:11:22] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [10:11:22] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [10:11:22] RECOVERY - Host cp3041 is UP: PING OK - Packet loss = 0%, RTA = 83.80 ms [10:11:23] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [10:11:32] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [10:11:32] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [10:11:33] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 114 ESP OK [10:11:33] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [10:11:33] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [10:11:52] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [10:11:52] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [10:11:52] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [10:11:52] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [10:11:53] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 114 ESP OK [10:11:53] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [10:11:53] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 114 ESP OK [10:12:02] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [10:12:02] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 114 ESP OK [10:14:32] PROBLEM - MariaDB Slave SQL: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect [10:14:36] mobrovac: not the fastest but it 
is proceeding [10:14:43] :) [10:14:45] good good [10:15:03] PROBLEM - MariaDB Slave IO: s5 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect [10:15:51] <_joe_> uh what's up with dbstore2001 ? [10:16:16] (03PS1) 10Mobrovac: RESTBase: Do not manage Cassandra 2 in the legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/403122 (https://phabricator.wikimedia.org/T184100) [10:16:39] (03CR) 10jerkins-bot: [V: 04-1] RESTBase: Do not manage Cassandra 2 in the legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/403122 (https://phabricator.wikimedia.org/T184100) (owner: 10Mobrovac) [10:17:37] ACKNOWLEDGEMENT - PyBal backends health check on lvs3002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 Ema Prod traffic sent to lvs3004 to test new kernel - The acknowledgement expires at: 2018-01-09 16:16:40. [10:17:37] ACKNOWLEDGEMENT - pybal on lvs3002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Ema Prod traffic sent to lvs3004 to test new kernel - The acknowledgement expires at: 2018-01-09 16:16:40. [10:20:21] (03PS6) 10Giuseppe Lavagetto: graphite: cleanup configparser_format a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359451 (owner: 10Faidon Liambotis) [10:21:22] (03CR) 10Giuseppe Lavagetto: [C: 032] graphite: cleanup configparser_format a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359451 (owner: 10Faidon Liambotis) [10:22:53] RECOVERY - pybal on lvs3002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [10:23:12] RECOVERY - PyBal backends health check on lvs3002 is OK: PYBAL OK - All pools are healthy [10:24:33] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Package[restbase/deploy] [10:25:31] the restbase puppet failures is me [10:26:31] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/402081 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [10:27:39] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/402101 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [10:28:10] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/402343 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [10:28:57] (03PS2) 10Mobrovac: RESTBase: Do not manage Cassandra 2 in the legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/403122 (https://phabricator.wikimedia.org/T184100) [10:29:26] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 54 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[restbase/deploy] [10:29:36] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:34:17] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:34:24] (03PS7) 10Giuseppe Lavagetto: Fix Style/FormatString RuboCop across all Rakefiles [puppet] - 10https://gerrit.wikimedia.org/r/359452 (owner: 10Faidon Liambotis) [10:36:56] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix Style/FormatString RuboCop across all Rakefiles [puppet] - 10https://gerrit.wikimedia.org/r/359452 (owner: 10Faidon Liambotis) [10:38:02] (03CR) 10Alexandros Kosiaris: [C: 031] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [10:40:13] (03PS12) 10Elukey: role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [10:40:43] (03CR) 10jerkins-bot: [V: 04-1] 
role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [10:41:07] akosiaris: sorry I changed https://gerrit.wikimedia.org/r/#/c/394966/12/modules/profile/files/puppetmaster/puppetdb/jvm_prometheus_puppetdb_jmx_exporter.yaml after a chat with Filippo, basically removing those metrics will force us to rely on the jmx_agent's defaults [10:41:13] https://github.com/prometheus/client_java/blob/master/simpleclient_hotspot/src/main/java/io/prometheus/client/hotspot/DefaultExports.java [10:41:16] removing duplicates [10:41:21] (credits to gehel :) [10:41:40] final list in https://phabricator.wikimedia.org/P6528#36923 [10:41:42] what did I do again? [10:41:59] you established a standard! :) [10:42:09] so I was fixing my change to follow it [10:42:21] Oh... great! [10:42:48] gehel: aaand you also won multiple code reviews to fix the hadoop configurations :P [10:42:57] jokes aside, I'd be glad to have your review [10:43:04] ok, will look after lunch... see you! [10:43:06] when I'll have them ready [10:43:09] thanks! [10:43:37] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 46 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[restbase/deploy] [10:44:01] shush! 
[10:44:25] (03PS3) 10Mobrovac: RESTBase: Do not manage Cassandra 2 in the legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/403122 (https://phabricator.wikimedia.org/T184100) [10:45:33] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/402344 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [10:47:46] mobrovac: ok another round of puppet runs and I think we should be ok to start pooling [10:48:37] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:48:37] (03PS7) 10Giuseppe Lavagetto: wmflib: fix RuboCop infractions in serializers [puppet] - 10https://gerrit.wikimedia.org/r/359453 (owner: 10Faidon Liambotis) [10:48:54] godog: \o/ [10:50:39] (03PS1) 10Elukey: Standardize Analytics jmx agent's configurations [puppet] - 10https://gerrit.wikimedia.org/r/403123 (https://phabricator.wikimedia.org/T177458) [10:50:50] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): npm 1.4.21 can't use a http proxy - https://phabricator.wikimedia.org/T183569#3885794 (10hashar) a:03hashar Came back to this the tldr is we have to upgrade to 0.4.3 The Debian bug report is https://bugs... 
[10:52:54] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3885798 (10fgiunchedi) [10:52:57] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Limit http methods reported by varnishmtail - https://phabricator.wikimedia.org/T183926#3885796 (10fgiunchedi) 05Resolved>03Open Reopening, we need to match method names exactly, not prefixes like we're doing now [10:53:35] (03PS1) 10Hashar: New upstream version 0.4.3 [debs/node-tunnel-agent] (upstream) - 10https://gerrit.wikimedia.org/r/403124 (https://phabricator.wikimedia.org/T183569) [10:54:33] (03PS1) 10Hashar: pristine-tar data for node-tunnel-agent_0.4.3.orig.tar.gz [debs/node-tunnel-agent] (pristine-tar) - 10https://gerrit.wikimedia.org/r/403125 (https://phabricator.wikimedia.org/T183569) [10:56:43] (03Abandoned) 10Ema: lvs: rename lvs1007 eth interfaces [puppet] - 10https://gerrit.wikimedia.org/r/402859 (https://phabricator.wikimedia.org/T167299) (owner: 10Ema) [10:57:30] (03PS1) 10Hashar: Merge tag 'upstream/0.4.3' [debs/node-tunnel-agent] - 10https://gerrit.wikimedia.org/r/403126 (https://phabricator.wikimedia.org/T183569) [10:57:32] (03PS1) 10Hashar: Package 0.4.3 [debs/node-tunnel-agent] - 10https://gerrit.wikimedia.org/r/403127 (https://phabricator.wikimedia.org/T183569) [10:58:11] (03PS6) 10Alexandros Kosiaris: ircecho: Remove redundant thread [puppet] - 10https://gerrit.wikimedia.org/r/402081 (https://phabricator.wikimedia.org/T184103) [10:58:13] (03PS4) 10Alexandros Kosiaris: ircecho: Force unbuffered stdin/stdout/stderr [puppet] - 10https://gerrit.wikimedia.org/r/402101 (https://phabricator.wikimedia.org/T184103) [10:58:15] (03PS3) 10Alexandros Kosiaris: ircecho: Normalize print statements [puppet] - 10https://gerrit.wikimedia.org/r/402343 (https://phabricator.wikimedia.org/T184103) [10:58:17] (03PS3) 10Alexandros Kosiaris: ircecho: set EchoNotifier 
threads as daemon [puppet] - 10https://gerrit.wikimedia.org/r/402344 (https://phabricator.wikimedia.org/T184103) [10:58:39] (03CR) 10Hashar: [V: 032 C: 032] New upstream version 0.4.3 [debs/node-tunnel-agent] (upstream) - 10https://gerrit.wikimedia.org/r/403124 (https://phabricator.wikimedia.org/T183569) (owner: 10Hashar) [10:59:41] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2002.codfw.wmnet [10:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:57] (03CR) 10Hashar: [V: 032 C: 032] pristine-tar data for node-tunnel-agent_0.4.3.orig.tar.gz [debs/node-tunnel-agent] (pristine-tar) - 10https://gerrit.wikimedia.org/r/403125 (https://phabricator.wikimedia.org/T183569) (owner: 10Hashar) [11:00:05] _joe_: much <3 for merging those patches of mine [11:00:49] (03CR) 10Hashar: [V: 032 C: 032] Merge tag 'upstream/0.4.3' [debs/node-tunnel-agent] - 10https://gerrit.wikimedia.org/r/403126 (https://phabricator.wikimedia.org/T183569) (owner: 10Hashar) [11:00:51] (03CR) 10Alexandros Kosiaris: [C: 032] ircecho: Remove redundant thread [puppet] - 10https://gerrit.wikimedia.org/r/402081 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [11:00:54] (03CR) 10Alexandros Kosiaris: [C: 032] ircecho: Force unbuffered stdin/stdout/stderr [puppet] - 10https://gerrit.wikimedia.org/r/402101 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [11:01:06] (03CR) 10Alexandros Kosiaris: [C: 032] ircecho: Normalize print statements [puppet] - 10https://gerrit.wikimedia.org/r/402343 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [11:01:07] (03CR) 10Alexandros Kosiaris: [C: 032] ircecho: set EchoNotifier threads as daemon [puppet] - 10https://gerrit.wikimedia.org/r/402344 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [11:01:32] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=restbase2001.codfw.wmnet [11:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:49] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885816 (10jcrespo) [11:01:52] (03CR) 10Hashar: [V: 032 C: 032] Package 0.4.3 [debs/node-tunnel-agent] - 10https://gerrit.wikimedia.org/r/403127 (https://phabricator.wikimedia.org/T183569) (owner: 10Hashar) [11:01:57] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:01:57] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [11:02:03] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2004.codfw.wmnet [11:02:08] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:02:10] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2006.codfw.wmnet [11:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:17] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title}{/revision} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} 
(Retrieve aggregated feed content for April 29, [11:02:17] fore a response was received [11:02:17] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:02:17] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [11:02:18] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for J [11:02:18] ed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:02:18] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 [11:02:18] rue)) timed out before a response was received [11:02:18] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 
2016) timed out before a response was received [11:02:19] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [11:02:19] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:27] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:02:34] what the [11:02:37] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read artic [11:02:37] 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:02:37] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:02:37] PROBLEM - 
restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:02:37] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:02:37] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:02:38] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for 2nd Earl of Derby) timed out before a response was received: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 [11:02:38] a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:02:40] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out befo [11:02:40] eceived: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:02:40] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: 
/{domain}/v1/page/media/{title} (retrieve media items of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before [11:02:40] depooling [11:02:41] eived: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:02:48] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2006.codfw.wmnet [11:02:55] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2002.codfw.wmnet [11:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:02] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2001.codfw.wmnet [11:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:08] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2004.codfw.wmnet [11:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:17] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [11:03:17] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [11:03:17] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [11:03:18] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [11:03:18] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [11:03:18] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [11:03:18] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [11:03:25] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:27] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [11:03:28] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [11:03:28] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [11:03:28] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [11:03:28] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [11:03:28] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:03:28] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [11:03:28] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:03:37] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [11:03:38] sorry about the spam [11:03:48] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [11:03:48] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [11:03:56] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885823 (10jcrespo) [11:03:58] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [11:04:07] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [11:04:07] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [11:04:20] just spam though, right? 
:) [11:04:49] (03PS1) 10Elukey: profile::hive/oozie: add a hiera parameter for the ferm srange [puppet] - 10https://gerrit.wikimedia.org/r/403128 (https://phabricator.wikimedia.org/T166248) [11:05:53] paravoid: heheh the failures were real [11:07:48] 10Operations, 10Data-Services, 10MediaWiki-Maintenance-scripts, 10Wikidata, 10Patch-For-Review: Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179#3885829 (10jcrespo) a:03bd808 I think cloud stuff is ready -although there is a... [11:08:04] woo, what happened there? [11:08:45] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/9650/" [puppet] - 10https://gerrit.wikimedia.org/r/403128 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [11:09:57] mobrovac: pooled restbase hosts in codfw, fail, depool [11:10:12] now looking at what went wrong there [11:12:36] mobrovac: thoughts? [11:13:11] i'll take a look at rb2001 [11:15:52] godog: i'll try to pool rb2001 again [11:16:04] it's not clear to me why these errors coincided with your pooling [11:16:28] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/node-tunnel-agent] - 10https://gerrit.wikimedia.org/r/403130 [11:16:45] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/node-tunnel-agent] - 10https://gerrit.wikimedia.org/r/403130 (owner: 10Hashar) [11:16:52] mobrovac: ok trying again [11:17:09] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2001.codfw.wmnet [11:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:27] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was re [11:19:28] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:19:28] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [11:19:37] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [11:19:37] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:37] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 201 [11:19:37] true)) timed out before a response was received [11:19:38] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was 
received [11:19:38] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:19:48] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:19:57] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:57] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:19:57] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:19:57] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:19:57] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
[11:19:57] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:58] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with a [11:19:58] med out before a response was received [11:19:58] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:19:59] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:20:05] (03PS1) 10Elukey: profile::analytics::database::meta: add ferm rules hiera parameter [puppet] - 10https://gerrit.wikimedia.org/r/403131 (https://phabricator.wikimedia.org/T166248) [11:20:07] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a 
response was received [11:20:10] mobrovac: ^ [11:20:15] yeah i see [11:20:17] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:20:17] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:20:17] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:20:18] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:20:19] this is really weird [11:20:38] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban): npm 1.4.21 can't use a http proxy - https://phabricator.wikimedia.org/T183569#3885851 (10hashar) p:05Triage>03Normal I have forked the Debian repository and bumped the package to 0... [11:20:46] mobrovac: we can try the off and on again technique [11:21:06] godog: restart rb you mean? 
[11:21:10] yeah [11:21:34] kk, restarting [11:21:47] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [11:21:47] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [11:21:48] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:21:48] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:21:48] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [11:21:48] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [11:21:57] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [11:21:57] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [11:21:58] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [11:22:02] works every time [11:22:07] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [11:22:08] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [11:22:17] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [11:22:17] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [11:22:18] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [11:22:27] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [11:22:27] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [11:22:27] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [11:22:28] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [11:22:28] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [11:22:37] RECOVERY - restbase endpoints health on restbase2009 is OK: All 
endpoints are healthy [11:22:38] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [11:22:44] huh [11:22:45] weird [11:22:47] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [11:22:57] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:23:06] ah no godog ^ [11:23:10] ah no ok, phew [11:23:38] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9651/" [puppet] - 10https://gerrit.wikimedia.org/r/403131 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [11:24:41] godog: hm, ok so restbase receives a response,but it takes 12 seconds [11:24:44] and the limit is 10 [11:24:57] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:24:57] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:24:58] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:24:58] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a 
response was received [11:24:58] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:24:58] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:24:58] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:25:17] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [11:25:17] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:25:18] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:25:25] mobrovac: ok, I'm going to depool 
2001 since clearly it isn't working and we can debug [11:25:27] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:25:27] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:25:28] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [11:25:28] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:25:28] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [11:25:37] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [11:25:37] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) 
timed out before a response was received [11:25:38] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:25:44] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2001.codfw.wmnet [11:25:47] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:25:47] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:57] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:25:57] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [11:25:57] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [11:25:57] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [11:25:57] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [11:25:57] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [11:25:57] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy 
[11:25:58] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:25:59] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [11:26:07] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [11:26:11] godog: also, one weirdness is that cass2 restbase nodes have the pool-restbase command, while restbase2001 does not [11:26:17] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [11:26:17] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [11:26:17] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [11:26:18] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [11:26:27] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [11:26:30] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [11:26:30] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [11:26:30] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [11:26:30] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [11:26:30] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [11:26:37] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [11:26:37] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [11:26:48] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [11:26:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [11:27:10] mobrovac: indeed, and errors from other hosts too so things seem to grind to a halt? 
[11:27:40] godog: now that rb2001 is depooled, it takes it 1 second to respond for that endpoint that timed out [11:27:47] there is something very wrong here [11:28:05] what's going on? [11:28:16] that's a good question paravoid ;) [11:29:17] mobrovac: and only on featured feed? I'm looking at https://grafana.wikimedia.org/dashboard/db/restbase?panelId=16&fullscreen&orgId=1&from=now-1h&to=now [11:30:15] paravoid: pooling one rb host results in a shower of alerts ATM, timeouts, looks like [11:30:43] godog: featured feed and onthisday too [11:31:47] *nod* [11:31:48] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885884 (10jcrespo) So I contaminated by mistake s5/s8 on dbstore1001 due to timing of delayed replication. As a consolation, I do not think they were in such a good shape a... [11:32:11] mobrovac: unless things have changed iirc rb nodes talk to each other only (?) for rate limiting purposes? [11:33:08] (03PS4) 10Mobrovac: RESTBase: Do not manage Cassandra 2 in the legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/403122 (https://phabricator.wikimedia.org/T184100) [11:33:42] godog: right, but i don't see how this can influence this change just based on pooling a server [11:34:14] godog: if there were a problem in intra-node comms, then we'd see the same thing regardless of whether the node is pooled or not [11:35:51] heh, I don't have a better explanation of why other nodes are affected when rb2001 receives traffic [11:37:51] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885885 (10Marostegui) From the checksums I did past weeks it was indeed pretty inconsistent in all the shards so I don't think it is a big deal to have added a bit more d... 
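[editor's note] The timing observation in the conversation (a response that takes 12 seconds against a limit of 10 while rb2001 is pooled, versus about 1 second once it is depooled) can be illustrated with a small sketch. This is not the actual service-checker code; the `CheckResult` and `summarize` names, the short endpoint strings, and the strict "greater than" comparison are all assumptions made for illustration. Only the 10-second limit and the alert wording come from the log.

```python
from dataclasses import dataclass

# Taken from the conversation: "it takes 12 seconds and the limit is 10".
CHECK_TIMEOUT_S = 10.0


@dataclass
class CheckResult:
    """One probed endpoint and how long it took (hypothetical structure)."""
    endpoint: str
    elapsed_s: float

    @property
    def timed_out(self) -> bool:
        # Assumption: a check fails only when it exceeds the limit.
        return self.elapsed_s > CHECK_TIMEOUT_S


def summarize(results):
    """Build an alert line in the style of the bot messages in this log."""
    failed = [r.endpoint for r in results if r.timed_out]
    if failed:
        details = ": ".join(
            f"{endpoint} timed out before a response was received"
            for endpoint in failed
        )
        return f"CRITICAL: {details}"
    return "OK: All endpoints are healthy"


# Pooled: the slow response crosses the limit and raises an alert.
print(summarize([CheckResult("/feed/featured", 12.0)]))
# Depooled: the same endpoint answers in about a second.
print(summarize([CheckResult("/feed/featured", 1.0)]))
```

A fixed threshold like this explains why the alerts flap as a group: any shared slowdown that pushes responses past the limit turns every dependent check CRITICAL at once.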
[11:37:57] godog: ah wait a sec, i might have found a discrepancy between c2 and c3 nodes [11:40:25] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3885898 (10jcrespo) I've set up `Replicate_Wild_Do_Table: dewiki.%,heartbeat.%` on s5 to unbreak it (it will not be affected), but s8 will still not be in a good state. Time... [11:41:23] (03PS1) 10Mobrovac: RESTBase: Add LVS pools to the production_ng role [puppet] - 10https://gerrit.wikimedia.org/r/403133 (https://phabricator.wikimedia.org/T184110) [11:41:27] godog: ^^^ [11:41:36] godog: they were missing the lvs role definitions [11:42:04] godog: i'll run pcc [11:43:05] !log rebooting app servers mw1238-mw1258 for kernel security update (along with update to HHVM 3.18.6 where applicable) [11:43:06] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: fix RuboCop infractions in serializers [puppet] - 10https://gerrit.wikimedia.org/r/359453 (owner: 10Faidon Liambotis) [11:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:25] (03PS8) 10Giuseppe Lavagetto: wmflib: fix RuboCop infractions in serializers [puppet] - 10https://gerrit.wikimedia.org/r/359453 (owner: 10Faidon Liambotis) [11:43:55] mobrovac: ugh, indeed [11:44:54] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler02/9653/" [puppet] - 10https://gerrit.wikimedia.org/r/403133 (https://phabricator.wikimedia.org/T184110) (owner: 10Mobrovac) [11:44:59] godog: ^ [11:45:05] godog: i think that ought to do it [11:45:08] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [11:45:50] (03PS1) 10KartikMistry: apertium-sme-nob: Updated dependency on cg3 [debs/contenttranslation/apertium-sme-nob] - 10https://gerrit.wikimedia.org/r/403134 (https://phabricator.wikimedia.org/T171406) [11:45:54] (03CR) 10Filippo Giunchedi: [C: 032] RESTBase: Add LVS pools 
to the production_ng role [puppet] - 10https://gerrit.wikimedia.org/r/403133 (https://phabricator.wikimedia.org/T184110) (owner: 10Mobrovac) [11:46:19] (03CR) 10jerkins-bot: [V: 04-1] apertium-sme-nob: Updated dependency on cg3 [debs/contenttranslation/apertium-sme-nob] - 10https://gerrit.wikimedia.org/r/403134 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [11:47:25] (03PS9) 10Giuseppe Lavagetto: wmflib: fix RuboCop infractions in serializers [puppet] - 10https://gerrit.wikimedia.org/r/359453 (owner: 10Faidon Liambotis) [11:48:06] mobrovac: indeed that looks better, I'll bounce rb there too since config changed [11:48:20] (03PS1) 10KartikMistry: apertium-spa: Updated dependency on cg3 [debs/contenttranslation/apertium-spa] - 10https://gerrit.wikimedia.org/r/403135 (https://phabricator.wikimedia.org/T171406) [11:48:44] (03CR) 10jerkins-bot: [V: 04-1] apertium-spa: Updated dependency on cg3 [debs/contenttranslation/apertium-spa] - 10https://gerrit.wikimedia.org/r/403135 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [11:50:07] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2001.codfw.wmnet [11:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:15] godog: hehe, looking good now! 
[11:51:22] service-checker is happy [11:51:48] heheh indeed much better [11:52:03] I'll run puppet on the other nodes and finish pooling codfw [11:52:11] kk [11:53:55] (03PS1) 10KartikMistry: apertium-spa-arg: Updated dependency on cg3 [debs/contenttranslation/apertium-spa-arg] - 10https://gerrit.wikimedia.org/r/403136 (https://phabricator.wikimedia.org/T171406) [11:54:22] (03CR) 10jerkins-bot: [V: 04-1] apertium-spa-arg: Updated dependency on cg3 [debs/contenttranslation/apertium-spa-arg] - 10https://gerrit.wikimedia.org/r/403136 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [11:54:49] (03PS6) 10Giuseppe Lavagetto: Fix more whitespace-related RuboCop across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359478 (owner: 10Faidon Liambotis) [11:55:29] (03PS5) 10Mobrovac: RESTBase: Do not manage Cassandra 2 in the legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/403122 (https://phabricator.wikimedia.org/T184100) [11:56:14] !log roll-restart restbase c3 nodes in codfw/eqiad [11:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:26] err, that means rb service not the host [11:56:33] :) [11:57:15] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3885934 (10dr0ptp4kt) Thanks, @Shilad . @Ottomata would you be game to take another run at this? @Shilad what's your availability days+time UTC this current week... 
[11:58:28] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2002.codfw.wmnet [11:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:22] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix more whitespace-related RuboCop across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359478 (owner: 10Faidon Liambotis) [12:02:34] (03PS6) 10Giuseppe Lavagetto: Fix Style/RegexpLiteral RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359479 (owner: 10Faidon Liambotis) [12:03:23] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2003.codfw.wmnet [12:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:02] (03PS6) 10Mobrovac: RESTBase: Do not manage Cassandra 2 in the legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/403122 (https://phabricator.wikimedia.org/T184100) [12:04:41] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2004.codfw.wmnet [12:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:43] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2005.codfw.wmnet [12:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:56] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix Style/RegexpLiteral RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359479 (owner: 10Faidon Liambotis) [12:06:32] (03PS6) 10Urbanecm: Initial configuration for inhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) [12:07:34] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2006.codfw.wmnet [12:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:50] mobrovac: I'll go ahead with eqiad [12:08:54] (03CR) 10Mobrovac: "PCC OK - 
https://puppet-compiler.wmflabs.org/compiler02/9656/" [puppet] - 10https://gerrit.wikimedia.org/r/403122 (https://phabricator.wikimedia.org/T184100) (owner: 10Mobrovac) [12:09:13] godog: yup, +1 [12:09:25] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1007.eqiad.wmnet [12:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:53] (03PS1) 10Chad: CI: Add php7.0-zip stretch and beyond [puppet] - 10https://gerrit.wikimedia.org/r/403138 [12:12:01] (03PS5) 10Giuseppe Lavagetto: wmflib: fix another couple minor RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359480 (owner: 10Faidon Liambotis) [12:12:13] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: fix another couple minor RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359480 (owner: 10Faidon Liambotis) [12:15:31] godog: ping me once you are done, i'll do a full rb deploy then to make sure all is bueno indeed [12:17:01] !log rebooting scb2001 for kernel security update [12:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:16] (03PS7) 10Urbanecm: Initial configuration for inhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) [12:17:32] mobrovac: kk [12:17:47] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1008.eqiad.wmnet [12:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:06] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1009.eqiad.wmnet [12:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:23] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1010.eqiad.wmnet [12:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:42] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=restbase1011.eqiad.wmnet [12:22:48] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1012.eqiad.wmnet [12:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:38] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1014.eqiad.wmnet [12:23:44] ACKNOWLEDGEMENT - HP RAID on ms-be1033 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 2I:2:1 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184514 [12:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:47] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T184514#3886055 (10ops-monitoring-bot) [12:24:18] mobrovac: {{done}} [12:25:18] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183896#3886072 (10fgiunchedi) [12:25:20] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T184514#3886074 (10fgiunchedi) [12:25:32] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T184514#3886055 (10fgiunchedi) Duplicate, host was rebooted as part of kernel upgrade [12:35:57] (03PS5) 10Giuseppe Lavagetto: wmflib, admin: fix RuboCop Style/For offenses [puppet] - 10https://gerrit.wikimedia.org/r/359481 (owner: 10Faidon Liambotis) [12:39:32] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib, admin: fix RuboCop Style/For offenses [puppet] - 10https://gerrit.wikimedia.org/r/359481 (owner: 10Faidon Liambotis) [12:40:13] (03CR) 10Giuseppe Lavagetto: [C: 032] "see for example a few hosts wthat should use all the functions modified here https://puppet-compiler.wmflabs.org/compiler02/9658/" [puppet] - 
10https://gerrit.wikimedia.org/r/359481 (owner: 10Faidon Liambotis) [12:45:55] (03PS5) 10Giuseppe Lavagetto: base: fix RuboCop MethodCallWithoutArgsParentheses [puppet] - 10https://gerrit.wikimedia.org/r/359482 (owner: 10Faidon Liambotis) [12:50:12] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3886180 (10jcrespo) [12:50:54] (03CR) 10Giuseppe Lavagetto: [C: 032] base: fix RuboCop MethodCallWithoutArgsParentheses [puppet] - 10https://gerrit.wikimedia.org/r/359482 (owner: 10Faidon Liambotis) [12:51:28] (03PS5) 10Giuseppe Lavagetto: utils/expanderrb.rb: fix Style/SpecialGlobalVars [puppet] - 10https://gerrit.wikimedia.org/r/359483 (owner: 10Faidon Liambotis) [12:52:14] (03CR) 10Giuseppe Lavagetto: [C: 032] utils/expanderrb.rb: fix Style/SpecialGlobalVars [puppet] - 10https://gerrit.wikimedia.org/r/359483 (owner: 10Faidon Liambotis) [12:54:55] !log akosiaris@tin Started deploy [servermon/servermon@10e165e]: Update servermon [12:54:57] !log akosiaris@tin Finished deploy [servermon/servermon@10e165e]: Update servermon (duration: 00m 02s) [12:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:17] !log rebooting labnodepool* for kernel security update [12:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:09] !log mobrovac@tin Started deploy [restbase/deploy@837f5a9]: Force deploy on all targets - T184110 [13:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:23] T184110: Set up RESTBase on Cassandra 3 nodes - https://phabricator.wikimedia.org/T184110 [13:01:20] (03PS1) 10Jcrespo: proxysql: Correct path for proxysql persistance layer [puppet] - 10https://gerrit.wikimedia.org/r/403144 [13:03:59] (03CR) 10Elukey: [C: 031] Create profile::cache::kafka::certificate to DRY require of cert 
[puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [13:04:46] (03PS5) 10Giuseppe Lavagetto: Fix Style/NegatedIf RuboCop offense across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359484 (owner: 10Faidon Liambotis) [13:04:57] (03CR) 10Elukey: [C: 031] Tweaks to profile::cache::kafka::webrequest::jumbo test [puppet] - 10https://gerrit.wikimedia.org/r/403064 (owner: 10Ottomata) [13:05:20] 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3886222 (10jcrespo) I @aaron, I would like still to reproduce your results. Meanwhile, I thought a reason why that can be- wit... [13:05:29] 10Operations, 10IRCecho, 10monitoring, 10Patch-For-Review: ircecho doesn't reconnect on failure - https://phabricator.wikimedia.org/T184103#3886223 (10akosiaris) 05Open>03Resolved a:03akosiaris This should finally be resolved with the above changes [13:05:33] (03CR) 10jerkins-bot: [V: 04-1] Fix Style/NegatedIf RuboCop offense across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359484 (owner: 10Faidon Liambotis) [13:06:06] (03CR) 10Jcrespo: [C: 032] proxysql: Correct path for proxysql persistance layer [puppet] - 10https://gerrit.wikimedia.org/r/403144 (owner: 10Jcrespo) [13:07:31] !log mobrovac@tin Finished deploy [restbase/deploy@837f5a9]: Force deploy on all targets - T184110 (duration: 07m 23s) [13:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:41] T184110: Set up RESTBase on Cassandra 3 nodes - https://phabricator.wikimedia.org/T184110 [13:08:20] (03CR) 10Elukey: "Added some comments even if WIP! 
:)" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [13:10:20] !log reboot kafka1020 for kernel updates [13:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:43] 10Operations, 10RESTBase, 10Services (done), 10User-mobrovac: Set up RESTBase on Cassandra 3 nodes - https://phabricator.wikimedia.org/T184110#3886232 (10mobrovac) 05Open>03Resolved RESTBase is now alive and kicking on all nodes. Resolving. [13:19:04] (03PS1) 10Jcrespo: dbstore: include s8 on the dbstore2001 backups [puppet] - 10https://gerrit.wikimedia.org/r/403147 (https://phabricator.wikimedia.org/T184179) [13:19:54] (03CR) 10Jcrespo: [C: 032] dbstore: include s8 on the dbstore2001 backups [puppet] - 10https://gerrit.wikimedia.org/r/403147 (https://phabricator.wikimedia.org/T184179) (owner: 10Jcrespo) [13:20:10] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Multiple systems in esams OE10 showing PSU failures - https://phabricator.wikimedia.org/T177228#3886246 (10mark) 05Open>03Resolved a:03mark cp3048 had one PSU loosely connected, fixed. All other systems in the rack have redundant power atm. [13:34:41] !log rebooting remaining video scalers in eqiad for kernel security update (along with HHVM update) [13:34:49] (03PS2) 10Hashar: Save -> Publish on remaining Wikinewses which haven't updated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403077 (owner: 10Jforrester) [13:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:17] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063#3886309 (10mark) [13:39:12] 10Operations, 10ops-esams, 10DC-Ops, 10netops: cr2-esams temperature warning - https://phabricator.wikimedia.org/T176816#3886324 (10mark) cr2-esams is mounted facing the hot row (like all network equipment) so this makes sense. 
I've removed 2 blind plates from the cold row side to let some cold air leak i... [13:46:58] moritzm: shall I set net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 for the mw servers rebooted? [13:48:06] elukey: I forgot why is this thing still manual... [13:48:44] volans: race condition that is still not fixed, checking what task it was [13:48:47] elukey: feel free, otherwise I'll do it en bloc once all are rebooted [13:49:07] volans: https://phabricator.wikimedia.org/T136094 [13:49:28] moritzm: ack! [13:50:09] thx [13:50:16] usually only jobrunners/vs are problematic, they all look good, will not interfere with the reboots :) [13:51:24] !log reboot kafka-jumbo1003 for kernel updates [13:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:39] and following runs of puppet don't fix it right? [13:51:52] yep [13:53:03] (03PS8) 10Rush: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) [13:53:28] (03CR) 10jerkins-bot: [V: 04-1] toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [13:53:48] (03CR) 10Rush: toolforge: ferm hook to restart components post updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [13:58:32] (03CR) 10Filippo Giunchedi: [C: 031] Standardize Analytics jmx agent's configurations [puppet] - 10https://gerrit.wikimedia.org/r/403123 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [13:59:00] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [13:59:13] (03CR) 10Gehel: "LGTM. Minor comment inline." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403123 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T1400). [14:00:05] James_F: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:50] o/ [14:01:22] !log copy poolcounter from jessie-wikimedia into stretch-wikimedia - T183385 [14:01:23] gehel: sigh [14:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:33] T183385: Package Poolcounter for Debian Stretch - https://phabricator.wikimedia.org/T183385 [14:01:57] (03CR) 10Hashar: [C: 032] Save -> Publish on remaining Wikinewses which haven't updated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403077 (owner: 10Jforrester) [14:02:05] elukey: yeah, that comment is mainly me being pedantic and not liking the jmx_exporter implementation much [14:02:27] elukey: it might make sense to ignore it unless we know we have an app that exposes a ton of mbeans [14:02:57] gehel: it is indeed extremely useful, for this use case I think I could skip it but we might need to start collecting these info in a wiki page [14:03:01] that we can refer to [14:03:13] like standards/conventions, etc.. 
[14:03:20] so we'll just have one source of truth [14:03:31] (03Merged) 10jenkins-bot: Save -> Publish on remaining Wikinewses which haven't updated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403077 (owner: 10Jforrester) [14:03:34] elukey: or just collecting the questions we have :) [14:03:43] yep :) [14:03:45] (03CR) 10jenkins-bot: Save -> Publish on remaining Wikinewses which haven't updated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403077 (owner: 10Jforrester) [14:04:09] !log reboot kafka1022 for kernel updates [14:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:08] !log lvs3002: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [14:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:18] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [14:07:19] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Save -> Publish on remaining Wikinewses which haven't updated - https://gerrit.wikimedia.org/r/#/c/403077/ (duration: 00m 53s) [14:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:15] (03PS1) 10Filippo Giunchedi: mtail: fix counting for frontend varnish http_method [puppet] - 10https://gerrit.wikimedia.org/r/403158 (https://phabricator.wikimedia.org/T183926) [14:10:43] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3886420 (10Gehel) Upstream has released version 1.0.2 with the additional elasticsearch metrics that we need: https://github.com/ju... [14:11:13] hashar: Thank you. All looks good. 
[14:14:07] !log lvs3003: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [14:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:18] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [14:14:28] !log rolling reboot of scb in codfw for kernel security update [14:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:00] (03CR) 10Ema: [C: 031] mtail: fix counting for frontend varnish http_method [puppet] - 10https://gerrit.wikimedia.org/r/403158 (https://phabricator.wikimedia.org/T183926) (owner: 10Filippo Giunchedi) [14:15:07] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4025_v4, cp4025_v6 [14:16:07] (03PS2) 10Filippo Giunchedi: mtail: fix counting for frontend varnish http_method [puppet] - 10https://gerrit.wikimedia.org/r/403158 (https://phabricator.wikimedia.org/T183926) [14:17:00] James_F: awesome :) [14:17:07] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [14:17:15] (03CR) 10Filippo Giunchedi: [C: 032] mtail: fix counting for frontend varnish http_method [puppet] - 10https://gerrit.wikimedia.org/r/403158 (https://phabricator.wikimedia.org/T183926) (owner: 10Filippo Giunchedi) [14:20:27] 10Operations, 10monitoring, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3886440 (10fgiunchedi) [14:21:36] !log reboot kafka-jumbo1004 for kernel updates [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:28] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.10 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/402906 (https://phabricator.wikimedia.org/T183907) (owner: 10Gilles) [14:23:39] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.9 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/402582 
(https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [14:25:46] elukey: I added some notes: https://wikitech.wikimedia.org/wiki/Prometheus#JMX [14:27:03] (03PS1) 10KartikMistry: apertium-srd: New upstream and updated cg3 dependency [debs/contenttranslation/apertium-srd] - 10https://gerrit.wikimedia.org/r/403159 (https://phabricator.wikimedia.org/T171406) [14:27:28] (03CR) 10jerkins-bot: [V: 04-1] apertium-srd: New upstream and updated cg3 dependency [debs/contenttranslation/apertium-srd] - 10https://gerrit.wikimedia.org/r/403159 (https://phabricator.wikimedia.org/T171406) (owner: 10KartikMistry) [14:29:29] (03CR) 10Brion VIBBER: [C: 031] load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: 10ArielGlenn) [14:30:16] gehel: <3 [14:30:56] (03CR) 10Aklapper: "https://phabricator.wikimedia.org/T173537#3886490 implies that we might need to revert this. Comments in that task welcome." 
[puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) (owner: 10MaxSem) [14:32:17] !log reboot kafka1023 for kernel updates [14:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:03] !log upgrade and roll-restart thumbor in codfw/eqiad - T182656 T183907 T169144 [14:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:17] T169144: Serve thumb.php requests with Thumbor - https://phabricator.wikimedia.org/T169144 [14:36:17] T183907: Thumbor 500 while thumbnailing some webm files - https://phabricator.wikimedia.org/T183907 [14:36:17] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [14:41:41] !log reboot kafka-jumbo1005 for kernel updates [14:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:46] elukey: wikimedia commons is producing error 400 https://usercontent.irccloud-cdn.com/file/Gb6l3SYX/ [14:46:03] Steinsplitter: not related to my reboots (completely different systems), not able to repro on my browser though [14:46:36] 400 bad request is a weird return code though [14:46:38] elukey: it seems back now (loading again). thanks anyway. [14:46:54] thank you for checking! [14:48:24] hashar: is SWAT complete (currently withholding further mw* reboots)? [14:48:31] !log rolling reboot of scb in eqiad for kernel security update [14:48:36] moritzm: yes swat is complete [14:48:40] forgot to log it sorry [14:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:42] k, thanks [14:48:52] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3886647 (10Ottomata) I'm game to get together for another hour or so to meet and try, but I don't really have time to take this on and see it through on my own....
[14:51:31] (03PS5) 10Ottomata: Create profile::hadoop::apt_pin to ensure zookeeper is the correct version [puppet] - 10https://gerrit.wikimedia.org/r/402370 [14:52:47] (03CR) 10Ottomata: [C: 032] Create profile::hadoop::apt_pin to ensure zookeeper is the correct version [puppet] - 10https://gerrit.wikimedia.org/r/402370 (owner: 10Ottomata) [14:54:53] (03PS1) 10Andrew Bogott: simplestatic.erb: @qualify local .erb variables [puppet] - 10https://gerrit.wikimedia.org/r/403166 [14:54:55] (03PS4) 10Ottomata: Create profile::cache::kafka::certificate to DRY require of cert [puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) [14:55:27] hello ottomata :) [14:55:29] (03CR) 10Ottomata: [C: 032] Create profile::cache::kafka::certificate to DRY require of cert [puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [14:56:37] (03PS5) 10Ottomata: Tweaks to profile::cache::kafka::webrequest::jumbo test [puppet] - 10https://gerrit.wikimedia.org/r/403064 [14:57:34] 10Operations, 10ORES, 10Graphite, 10Patch-For-Review, and 2 others: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3886670 (10Halfak) @fgiunchedi, what part of this config specifies that only metrics that haven't been updated in 30 days will be purged. It looks to me... 
[14:58:37] 10Operations, 10ops-esams: To purchase for next esams visit - https://phabricator.wikimedia.org/T184522#3886671 (10mark) [14:59:10] (03PS2) 10Andrew Bogott: simplestatic.erb: @qualify local .erb variables [puppet] - 10https://gerrit.wikimedia.org/r/403166 [14:59:34] !log lvs3001: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267, replace sdb T166965 [14:59:37] (03PS3) 10Volans: Migration to Python 3 [software/cumin] - 10https://gerrit.wikimedia.org/r/402059 [14:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:47] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [14:59:47] (03CR) 10Andrew Bogott: [C: 032] simplestatic.erb: @qualify local .erb variables [puppet] - 10https://gerrit.wikimedia.org/r/403166 (owner: 10Andrew Bogott) [14:59:48] T166965: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965 [15:00:05] anomie: (Dis)respected human, time to deploy Deploy MCR tables (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T1500). Please do the needful. [15:00:05] No GERRIT patches in the queue for this window AFAICS. 
[15:00:36] !log reboot kafka-jumbo1006 for kernel updates [15:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:48] !log temporarily disabling puppet agents and rebooting puppet masters for security updates [15:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:55] (03CR) 10Ottomata: [C: 032] "Looks good: https://puppet-compiler.wmflabs.org/compiler02/9659/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/403064 (owner: 10Ottomata) [15:03:59] (03PS6) 10Ottomata: Tweaks to profile::cache::kafka::webrequest::jumbo test [puppet] - 10https://gerrit.wikimedia.org/r/403064 [15:04:09] (03CR) 10Ottomata: [V: 032 C: 032] Tweaks to profile::cache::kafka::webrequest::jumbo test [puppet] - 10https://gerrit.wikimedia.org/r/403064 (owner: 10Ottomata) [15:07:14] !log Creating MCR tables on all wikis (T183486) [15:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:26] T183486: MCR schema migration stage 0: create tables - https://phabricator.wikimedia.org/T183486 [15:10:14] !log reboot analytics1028 (hadoop worker and hdfs journal node) for kernel updates [15:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:09] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184528 [15:14:13] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184528#3886727 (10ops-monitoring-bot) [15:15:22] why this triggered again... 
:( [15:16:00] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184528#3886736 (10Volans) [15:16:03] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3886738 (10Volans) [15:17:51] we are working on it, that's why [15:18:10] !log lvs3001 disk swap: failover traffic to lvs3003 T166965 [15:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:22] T166965: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965 [15:18:30] elukey: ping [15:18:40] cmjohnson1: pong! [15:18:48] LMK when you're ready [15:19:42] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183896#3886745 (10Cmjohnson) Disk was shipped should be here today [15:20:40] cmjohnson1: going to do the prep steps to shutdown the host, will ping you back when ready [15:23:32] !log puppet master reboots complete. re-enabling puppet agents [15:23:39] cmjohnson1: qq - did you see in the task that it was also mentioned that a DIMM bank might be broken? [15:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:48] (wondering if we would be ready to swap it as well) [15:23:49] mark: yeah I guessed, but it's not the first time [15:23:50] herron: please make sure you don't re-enable puppet on lvs3001 [15:24:00] elukey: not much I can do there but reseat. The server is out of warranty [15:24:12] :( [15:24:25] I will reseat and see if it comes back and we can isolate to see if it's DIMM or CPU [15:25:03] 10Operations, 10ORES, 10Graphite, 10Patch-For-Review, and 2 others: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3886760 (10awight) @Halfak It's a scary parameter name, +1 that it should be called "delete_files_not_modified_since" or something. Looking at the code... 
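Editorial note on the T169969 exchange above: the question is which part of the purge configuration restricts deletion to metrics that have not been updated in 30 days. The comments are truncated in this log, so the actual configuration is not shown. As a minimal sketch only, assuming the purge is driven by each metric file's last-modification time, the selection could look like the following. The function name `stale_metrics`, the paths, and the in-memory hash standing in for the filesystem are all hypothetical.

```ruby
SECONDS_PER_DAY = 86_400

# Return the paths whose last-modified time is older than max_age_days.
# mtimes is a Hash of path => Time (hypothetical stand-in for File.mtime
# lookups on the metric files; nothing here touches a real filesystem).
def stale_metrics(mtimes, now, max_age_days)
  cutoff = now - max_age_days * SECONDS_PER_DAY
  mtimes.select { |_path, mtime| mtime < cutoff }.keys
end

now = Time.utc(2018, 1, 9, 15, 0, 0)
mtimes = {
  'ores/old_model.wsp'  => now - 45 * SECONDS_PER_DAY, # untouched for 45 days
  'ores/live_model.wsp' => now - 2 * SECONDS_PER_DAY   # updated 2 days ago
}

puts stale_metrics(mtimes, now, 30).inspect
```

Under this assumption only files older than the cutoff are selected, which is the behaviour the suggested parameter name "delete_files_not_modified_since" would describe.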
[15:25:12] ema, herron: if disable-puppet / enable-puppet is used with a message, there should be no risk [15:25:19] ema: no problem, I’m using cumin which will only re-enable agents that have a matching message [15:25:58] herron: very well, just making sure. Thanks! :) [15:26:22] 👍 [15:27:09] andrewbogott: yt? [15:27:18] i think some webproxy stuff might be messed up [15:27:20] ottomata: yep! What's up? [15:27:21] not sure [15:27:26] ema: this works of course unless you disable it *after* herron, so the registered message is herron's one and not yours ;) [15:27:32] the proxies in the dashiki project are 502ing [15:27:35] https://flow-reportcard.wmflabs.org/ [15:27:53] ottomata: yeah, puppet is screwed up on those hosts, I talked to milimetric about it a few minutes ago [15:28:03] volans: yeah it's my message so all good [15:28:09] well, puppet had been screwed up forever and I tried to fix it and accidentally broke it some more [15:28:19] ok [15:28:24] still investigating..maybe not proxy... [15:28:29] puppet is running fine though [15:29:12] reporting in here as well [15:29:19] Jan 09 15:22:37 dashiki-staging-01 apache2[11992]: AH00526: Syntax error on line 7 of /etc/apache2/sites-enabled/50-simplestatic.conf [15:29:19] hmm host_names not set [15:29:21] yeah [15:29:25] ServerName is empty [15:29:39] ottomata: the problem is with role::simplestatic [15:30:01] it had some unqualified variables so was failing to compile. I merged https://gerrit.wikimedia.org/r/#/c/403166/ [15:30:06] (03PS1) 10ArielGlenn: don't send mail on dumps failures from labs [puppet] - 10https://gerrit.wikimedia.org/r/403176 [15:30:07] and now apache won't start [15:30:14] !log stop mysql on dbstore1002 as prep step for shutdown (stop all slaves, mysql stop) [15:30:22] PROBLEM - Check systemd state on lvs3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
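Editorial note on the disable-puppet / enable-puppet exchange above: the safety argument is that an agent is only re-enabled when the caller's message matches the message registered at disable time, and that a later disable does not replace an earlier one's message. The following is a minimal sketch of that rule as described in the log, not the actual disable-puppet, enable-puppet, or cumin code. The class `PuppetAgentLock` and the message strings are hypothetical.

```ruby
# Toy model of a message-guarded agent lock (hypothetical, for illustration).
class PuppetAgentLock
  attr_reader :reason

  def initialize
    @reason = nil
  end

  def disabled?
    !@reason.nil?
  end

  # The first disable registers its message; a later disable on an
  # already-disabled agent leaves the registered message untouched.
  def disable(message)
    @reason ||= message
  end

  # Re-enable only when the caller's message matches the registered one.
  def enable(message)
    return false unless @reason == message

    @reason = nil
    true
  end
end

lock = PuppetAgentLock.new
lock.disable('replacing sdb')           # first operator disables the agent
lock.disable('puppet master reboots')   # fleet-wide disable arrives later

# The fleet-wide re-enable does not match, so the agent stays disabled.
puts lock.enable('puppet master reboots')
puts lock.disabled?
```

The caveat raised at 15:27:26 is the reverse ordering: if the fleet-wide disable had been registered first, its message would be the stored one and the fleet-wide enable would match.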
[15:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:28] and I'm not clear on how it ever worked, maybe it was hotfixed during the interval when puppet wasn't working [15:30:46] andrewbogott: those aren't puppet variables [15:30:49] that's the loop var [15:30:50] host_name [15:30:51] PROBLEM - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:30:53] it shouldn't have a @ on it [15:31:02] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [15:31:02] @base_path should [15:31:05] but not host_name [15:31:10] please ignore the alerts above about lvs3001 ^ [15:31:13] ah, so I see [15:31:17] ottomata: I will write another patch [15:31:21] ok thanks [15:31:38] !log reboot maps-test* for kernel upgrade [15:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:20] ottomata: but host_names is still empty isn't it? [15:32:22] ACKNOWLEDGEMENT - Check systemd state on lvs3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
Ema Replacing sdb T166965 [15:32:22] ACKNOWLEDGEMENT - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 Ema Replacing sdb T166965 [15:32:22] ACKNOWLEDGEMENT - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Ema Replacing sdb T166965 [15:33:06] (PS1) Andrew Bogott: simplestatic.erb: remove some inappropriate variable qualifications [puppet] - https://gerrit.wikimedia.org/r/403178 [15:33:18] ottomata: ^ [15:33:57] (CR) Andrew Bogott: [C: 2] simplestatic.erb: remove some inappropriate variable qualifications [puppet] - https://gerrit.wikimedia.org/r/403178 (owner: Andrew Bogott) [15:34:36] thanks [15:34:52] (CR) Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS (3 comments) [puppet] - https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: Ottomata) [15:35:00] (PS3) Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [15:35:12] (PS4) Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [15:35:25] huh, I don't understand why this works, but it seems to work [15:35:37] andrewbogott: your patch? [15:36:08] oh, the list of hosts must be set in hiera someplace [15:36:12] oh [15:36:17] that i don't know either :) [15:36:27] (PS2) ArielGlenn: don't send mail on dumps failures from labs [puppet] - https://gerrit.wikimedia.org/r/403176 [15:36:27] andrewbogott: does the old wikitech based hiera still live somewhere? [15:36:31] i betcha it is there somehow... [15:36:51] ottomata: while we're on the subject, did you get those pings about other VM puppet issues? [15:37:22] ya, there were quite a few!
some of them will be resolved soon as I apply more kafka jumbo stuff in beta. gonna port eventlogging stuff over to new kafka hosts there [15:37:39] but for others yarrrrr, why u guys always gotta be looking! :) [15:37:41] Operations, ops-eqiad, Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3886787 (Cmjohnson) These are hitting the install server but not receiving the image. Chasemp or robh can you take a look at this please. They were received with 10G Nics that I tu... [15:38:20] cmjohnson1: dbstore1002 shutting down in 1m [15:38:32] okay [15:39:21] (PS5) Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [15:40:19] !log akosiaris@tin Started deploy [servermon/servermon@10e165e]: Testing scap check [15:40:21] !log akosiaris@tin Finished deploy [servermon/servermon@10e165e]: Testing scap check (duration: 00m 02s) [15:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:55] (CR) Thiemo Kreuz (WMDE): [C: 1] Fix linewrap issue on wikimedia error page [puppet] - https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: Phantom42) [15:44:09] (PS1) Alexandros Kosiaris: servermon: Remove /media and use /staticfiles [puppet] - https://gerrit.wikimedia.org/r/403183 [15:44:19] (PS3) ArielGlenn: don't send mail on dumps failures from labs [puppet] - https://gerrit.wikimedia.org/r/403176 [15:44:34] !log roll-restart swift frontends in codfw and eqiad [15:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:49] (CR) ArielGlenn: [C: 2] don't send mail on dumps failures from labs [puppet] - https://gerrit.wikimedia.org/r/403176 (owner: ArielGlenn) [15:46:17] (PS2) Alexandros
Kosiaris: servermon: Remove /media and use /staticfiles [puppet] - https://gerrit.wikimedia.org/r/403183 [15:46:21] (CR) Alexandros Kosiaris: [V: 2 C: 2] servermon: Remove /media and use /staticfiles [puppet] - https://gerrit.wikimedia.org/r/403183 (owner: Alexandros Kosiaris) [15:47:38] (PS1) Ottomata: Add $monitoring_enabled parameter to cache::kafka::webrequest profile [puppet] - https://gerrit.wikimedia.org/r/403185 [15:49:21] (PS2) Ottomata: Add $monitoring_enabled parameter to cache::kafka::webrequest profile [puppet] - https://gerrit.wikimedia.org/r/403185 [15:53:41] cmjohnson1: let me know if you need anything from my side [15:53:41] (PS6) Alexandros Kosiaris: Add all ops members to docker group [puppet] - https://gerrit.wikimedia.org/r/401492 [15:53:49] (CR) Alexandros Kosiaris: [V: 2 C: 2] Add all ops members to docker group [puppet] - https://gerrit.wikimedia.org/r/401492 (owner: Alexandros Kosiaris) [15:56:02] cmjohnson1: moreover, dbstore1002 is not officially OOW no? [15:57:16] elukey: back up mgmt is accessible [15:57:21] the HW warranty expiration: 2017-02-25 [15:58:25] oh sorry you are completely right [15:58:34] thanks a lot for the mgmt ! [15:59:07] I reseated all the DIMM [16:00:03] (PS1) Cmjohnson: Adding production dns notebook1003/4 [dns] - https://gerrit.wikimedia.org/r/403186 (https://phabricator.wikimedia.org/T183935) [16:00:57] cmjohnson1: super ignorant - what does it mean to reset all the DIMM ?
(I mean, I am curious about the manual steps to take from the dc perspective) [16:01:20] it means I pulled them out and put them back in [16:01:43] (CR) Ottomata: "Is good: https://puppet-compiler.wmflabs.org/compiler02/9662/" [puppet] - https://gerrit.wikimedia.org/r/403185 (owner: Ottomata) [16:02:04] cmjohnson1: ack thanks for the patience :) [16:02:08] I don't think it does much but it's one of the basics of troubleshooting...kind of like your internet router...turn it off and on [16:02:31] yep yep, I was super curious about these steps [16:02:58] (CR) Cmjohnson: [C: 2] Adding production dns notebook1003/4 [dns] - https://gerrit.wikimedia.org/r/403186 (https://phabricator.wikimedia.org/T183935) (owner: Cmjohnson) [16:03:51] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184530 [16:03:55] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184530#3886846 (ops-monitoring-bot) [16:07:52] Operations, ops-eqiad, Analytics, Patch-For-Review: rack/setup/install notebook[34] - https://phabricator.wikimedia.org/T183935#3886850 (RobH) [16:08:28] (PS1) Cmjohnson: Adding notebook1003/4 to dhcpd file [puppet] - https://gerrit.wikimedia.org/r/403187 (https://phabricator.wikimedia.org/T183935) [16:08:38] Operations, ops-esams: Setup new access switches - https://phabricator.wikimedia.org/T184065#3886853 (mark) asw-oe14-esams, asw-oe15-esams and asw-oe16-esams have all been mounted in their respective racks, all at position 24 (so midway).
[16:09:02] (PS6) Giuseppe Lavagetto: Fix Style/NegatedIf RuboCop offense across the tree [puppet] - https://gerrit.wikimedia.org/r/359484 (owner: Faidon Liambotis) [16:09:04] (PS1) Giuseppe Lavagetto: base: fix spec tests for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/403188 [16:09:05] !log re-started mysql on dbstore1002 (and slave replication) after hw maintenance [16:09:06] (PS1) Giuseppe Lavagetto: service: properly fix specs for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/403189 [16:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:37] !log data-services: added s8.{analytics,web}.db.svc.eqiad.wmflabs and aliases (T181643, T184179) [16:09:48] (CR) jerkins-bot: [V: -1] base: fix spec tests for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/403188 (owner: Giuseppe Lavagetto) [16:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:49] T181643: Announce wikidata move to s8 to cloud-announce & update wiki docs - https://phabricator.wikimedia.org/T181643 [16:09:49] T184179: Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179 [16:11:12] (PS2) Cmjohnson: Adding notebook1003/4 to dhcpd file [puppet] - https://gerrit.wikimedia.org/r/403187 (https://phabricator.wikimedia.org/T183935) [16:11:46] (CR) Cmjohnson: [C: 2] Adding notebook1003/4 to dhcpd file [puppet] - https://gerrit.wikimedia.org/r/403187 (https://phabricator.wikimedia.org/T183935) (owner: Cmjohnson) [16:12:59] Operations, ops-eqiad, Analytics, Patch-For-Review: rack/setup/install notebook[34] - https://phabricator.wikimedia.org/T183935#3886875 (Cmjohnson) [16:13:47] Operations, ops-eqiad, Analytics, Patch-For-Review: rack/setup/install notebook[34] - https://phabricator.wikimedia.org/T183935#3867856 (Cmjohnson) All the on-site work has been completed, production dns added and
install server. @robh can you look into the partman recipe and complete the install... [16:13:55] Operations, ops-eqiad, Analytics, Patch-For-Review: rack/setup/install notebook[34] - https://phabricator.wikimedia.org/T183935#3886877 (Cmjohnson) a:Cmjohnson>RobH [16:14:15] Operations, Data-Services, MediaWiki-Maintenance-scripts, Wikidata, Patch-For-Review: Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179#3886878 (bd808) After running `maintain-meta_p`: ``` (u3518@s7.labsdb) [meta_p]... [16:18:00] !log starting cluster reboot for elasticsearch / cirrus codfw [16:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:50] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:25] ACKNOWLEDGEMENT - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% Ema T166965 - The acknowledgement expires at: 2018-01-09 20:21:57. [16:24:50] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 83.85 ms [16:27:01] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184533 [16:27:04] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184533#3886898 (ops-monitoring-bot) [16:28:14] !log disabled Icinga event handlers on RAID checks for lvs3001, WIP on the host [16:28:18] mark: FYI ^^^ [16:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:16] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184533#3886907 (Volans) [16:29:18] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3886909 (Volans) [16:29:24] volans: thanks I'm currently re-building the raid [16:29:24] Operations, ops-esams: Degraded RAID on lvs3001 -
https://phabricator.wikimedia.org/T184530#3886911 (Volans) [16:29:26] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3370107 (Volans) [16:29:43] ema: great thanks, we can re-enable it once it's resolved [16:29:52] I'm sorry for the duplicates [16:30:08] Operations, Puppet, Patch-For-Review, User-Joe, cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3886916 (herron) [16:30:10] Operations, Puppet: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894#3886914 (herron) Open>Resolved a:herron [16:31:52] (PS1) Brion VIBBER: Install php7.0-bz2 package, needed for dumps [puppet] - https://gerrit.wikimedia.org/r/403193 [16:35:33] !log disabling puppet for decom on mw1180-1200 [16:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:23] (PS2) Giuseppe Lavagetto: base: fix spec tests for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/403188 [16:36:25] (PS2) Giuseppe Lavagetto: service: properly fix specs for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/403189 [16:36:27] (PS7) Giuseppe Lavagetto: Fix Style/NegatedIf RuboCop offense across the tree [puppet] - https://gerrit.wikimedia.org/r/359484 (owner: Faidon Liambotis) [16:39:39] (PS16) ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - https://gerrit.wikimedia.org/r/394977 [16:41:44] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3886940 (fgiunchedi) [16:41:47] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Limit http methods reported by varnishmtail - https://phabricator.wikimedia.org/T183926#3886938 (fgiunchedi) Open>Resolved Fixed and rolled out [16:41:54] (CR) Giuseppe Lavagetto: [C: 2] base: fix spec
tests for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/403188 (owner: Giuseppe Lavagetto) [16:42:18] (CR) Giuseppe Lavagetto: [C: 2] service: properly fix specs for puppet 4 [puppet] - https://gerrit.wikimedia.org/r/403189 (owner: Giuseppe Lavagetto) [16:42:24] (Abandoned) Brion VIBBER: Install php7.0-bz2 package, needed for dumps [puppet] - https://gerrit.wikimedia.org/r/403193 (owner: Brion VIBBER) [16:44:20] (PS2) RobH: adding shell user imarlier [puppet] - https://gerrit.wikimedia.org/r/402102 (https://phabricator.wikimedia.org/T184190) [16:44:54] (CR) RobH: [C: 2] adding shell user imarlier [puppet] - https://gerrit.wikimedia.org/r/402102 (https://phabricator.wikimedia.org/T184190) (owner: RobH) [16:47:12] (CR) Giuseppe Lavagetto: [C: 2] Fix Style/NegatedIf RuboCop offense across the tree [puppet] - https://gerrit.wikimedia.org/r/359484 (owner: Faidon Liambotis) [16:47:17] (PS8) Giuseppe Lavagetto: Fix Style/NegatedIf RuboCop offense across the tree [puppet] - https://gerrit.wikimedia.org/r/359484 (owner: Faidon Liambotis) [16:48:45] PROBLEM - Host ms-be3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:45] PROBLEM - Host ms-be3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:49:08] \o/ \o/ [16:49:29] <_joe_> oh we're killing swift in esams for good?
[16:49:56] (PS1) Lucas Werkmeister (WMDE): Enable caching of constraint check results [mediawiki-config] - https://gerrit.wikimedia.org/r/403195 (https://phabricator.wikimedia.org/T181060) [16:50:00] (PS1) RobH: adding imarlier to groups [puppet] - https://gerrit.wikimedia.org/r/403196 (https://phabricator.wikimedia.org/T184190) [16:50:13] (PS2) RobH: adding imarlier to groups [puppet] - https://gerrit.wikimedia.org/r/403196 (https://phabricator.wikimedia.org/T184190) [16:50:15] Hey ops just incase you were aware debian released an sec update for meltdown/spectre [16:50:21] Werent* [16:50:49] (CR) RobH: [C: 2] adding imarlier to groups [puppet] - https://gerrit.wikimedia.org/r/403196 (https://phabricator.wikimedia.org/T184190) (owner: RobH) [16:52:24] (CR) Lucas Werkmeister (WMDE): "I haven’t proposed this change for any SWAT yet, because I’d love to have a bit more time to test it on Wikidata, and I’m not sure if it’s" [mediawiki-config] - https://gerrit.wikimedia.org/r/403195 (https://phabricator.wikimedia.org/T181060) (owner: Lucas Werkmeister (WMDE)) [16:53:20] godog: actually i thought those were already fully decom'ed?
[16:53:28] (ms-be) [16:53:28] (PS1) Cmjohnson: Removing mw1180-1200 from site.pp for decom [puppet] - https://gerrit.wikimedia.org/r/403197 (https://phabricator.wikimedia.org/T183895) [16:53:37] _joe_: yes [16:54:22] (CR) Cmjohnson: [C: 2] Removing mw1180-1200 from site.pp for decom [puppet] - https://gerrit.wikimedia.org/r/403197 (https://phabricator.wikimedia.org/T183895) (owner: Cmjohnson) [16:54:27] (PS2) Cmjohnson: Removing mw1180-1200 from site.pp for decom [puppet] - https://gerrit.wikimedia.org/r/403197 (https://phabricator.wikimedia.org/T183895) [16:54:56] (CR) Cmjohnson: [V: 2 C: 2] Removing mw1180-1200 from site.pp for decom [puppet] - https://gerrit.wikimedia.org/r/403197 (https://phabricator.wikimedia.org/T183895) (owner: Cmjohnson) [16:56:57] mark: they are logically decom'd, IOW spare::system in puppet [16:58:56] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3887016 (RobH) Open>Resolved I shouldn't merge changes at the end of my workday... [17:00:04] godog, moritzm, and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T1700). [17:00:04] tgr: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:01:15] ema,bblack - I think that the patch requires your input - https://gerrit.wikimedia.org/r/#/c/402433/1/modules/profile/manifests/cache/text.pp [17:02:13] (this one is the only scheduled for puppet swat) [17:04:33] (PS3) RobH: adding esanders to two groups [puppet] - https://gerrit.wikimedia.org/r/402420 (https://phabricator.wikimedia.org/T184206) [17:05:14] (CR) RobH: [C: 2] adding esanders to two groups [puppet] - https://gerrit.wikimedia.org/r/402420 (https://phabricator.wikimedia.org/T184206) (owner: RobH) [17:05:27] Operations, ops-eqiad, Analytics-Kanban: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3887025 (elukey) Maintenance done, the mgmt interface is now up and running (Chris also did a reseat of the DIMM banks). @Marostegui, @jcrespo - We (as Analytics team) would like t... [17:06:13] Operations, Ops-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3887026 (RobH) [17:06:23] Operations, ops-eqiad, Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3887027 (chasemp) @Andrew said he would have a minute to take a look and that this sounded vaguely familiar [17:07:01] Operations, Ops-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876092 (RobH) Open>Resolved a:RobH No objections were filed, and this group addition has been merged live. All affected systems should... [17:08:14] Operations, Ops-Access-Requests: Requesting extended access to stat1005 for jdcc - https://phabricator.wikimedia.org/T184085#3887049 (RobH) a:Nuria @Nuria, @slaporte requests that you review and approve this extension of access expiry. Please comment and assign back to me for followup, thanks!
[17:08:26] elukey: seems sane to me, virtual +1 [17:10:00] tgr: would you mind coordinating with the traffic team for the deployment of this change? [17:10:31] elukey: sure, what does that mean specifically? [17:11:31] Operations, ORES, Graphite, Patch-For-Review, and 2 others: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3887057 (fgiunchedi) >>! In T169969#3886670, @Halfak wrote: > @fgiunchedi, what part of this config specifies that only metrics that haven't been updat... [17:11:37] Too late for me to sneak 2 more in? [17:11:41] (both are beta-only) [17:11:55] tgr: getting in touch with either ema or bblack and asking them when it is best to review/deploy it. It seems easy enough but I am not comfortable merging this during puppet swat (maybe other ops reading will, not sure!) [17:12:21] no_justification: let's see them :) [17:12:45] https://gerrit.wikimedia.org/r/#/c/386869/ and https://gerrit.wikimedia.org/r/#/c/394203/ [17:12:50] Was about to add to wikitech [17:13:47] (PS2) Ema: Add DELETE to list of allowed methods for text varnish [puppet] - https://gerrit.wikimedia.org/r/402433 (https://phabricator.wikimedia.org/T182825) (owner: Gergő Tisza) [17:13:50] (CR) Ema: [V: 2 C: 2] Add DELETE to list of allowed methods for text varnish [puppet] - https://gerrit.wikimedia.org/r/402433 (https://phabricator.wikimedia.org/T182825) (owner: Gergő Tisza) [17:13:52] (PS8) Elukey: hieradata: add redis stretch deployment-prep instances [puppet] - https://gerrit.wikimedia.org/r/386869 (https://phabricator.wikimedia.org/T179371) (owner: Filippo Giunchedi) [17:14:00] elukey, tgr: patch looks good, merged [17:14:06] thanks! [17:14:13] ty!
[17:14:35] RECOVERY - MD RAID on lvs3001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:14:40] yeee \o/ [17:14:52] Both of those I posted are already running in beta [17:15:29] !log depool restbase cassandra 2 nodes - T184100 [17:15:35] (CR) Elukey: [C: 2] hieradata: add redis stretch deployment-prep instances [puppet] - https://gerrit.wikimedia.org/r/386869 (https://phabricator.wikimedia.org/T179371) (owner: Filippo Giunchedi) [17:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:41] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100 [17:16:30] (PS5) Elukey: Beta: Moving all docroots to standard-docroot [puppet] - https://gerrit.wikimedia.org/r/394203 (https://phabricator.wikimedia.org/T126306) (owner: Chad) [17:17:04] RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [17:17:08] !log failover traffic back to lvs3001, raid rebuilt [17:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:24] RECOVERY - PyBal backends health check on lvs3001 is OK: PYBAL OK - All pools are healthy [17:18:43] no_justification: 386869 seems to depend on another one, gerrit is not giving me the submit [17:18:55] Herp derp. That may be. Nvm then [17:19:04] (CR) Elukey: [C: 2] Beta: Moving all docroots to standard-docroot [puppet] - https://gerrit.wikimedia.org/r/394203 (https://phabricator.wikimedia.org/T126306) (owner: Chad) [17:19:31] Thx for the other one tho! It's part of my ongoing effort to clean up the pile of symlinks that hold MW & docroots together [17:19:45] elukey depends on https://gerrit.wikimedia.org/r/#/c/387579/2 [17:19:58] yep [17:20:16] godog: can I merge --^ and then https://gerrit.wikimedia.org/r/386869 ?
[17:20:23] Operations, ops-esams, Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3887101 (ema) Open>Resolved a:ema Disk replaced today, raid rebuilt. [17:20:25] (not sure what is the status in labs now) [17:20:41] no_justification: 394203 merged [17:20:44] Operations, ops-esams, Traffic: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3887109 (ema) [17:20:46] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3887106 (ema) Open>Resolved a:ema Disk replaced today, raid rebuilt. [17:20:59] (PS1) Rush: wip: rabbitmq: handling users and initial setup [puppet] - https://gerrit.wikimedia.org/r/403202 [17:21:27] (CR) jerkins-bot: [V: -1] wip: rabbitmq: handling users and initial setup [puppet] - https://gerrit.wikimedia.org/r/403202 (owner: Rush) [17:21:30] elukey: I'm not sure either tbh what's the status now, I'll have to check better [17:21:41] elukey: Tyyyyyy. There's a follow-up doing same in prod but far more risky! [17:21:42] :) [17:22:20] godog: all right so let's wait for some verification, is it ok no_justification ? [17:22:48] the prod change can surely be done I think, but possibly let's involve Giuseppe and others in the review :) [17:22:53] Oh yeah. I'll probably even break it up into domain-by-domain to keep it even safer. [17:23:00] Like wikivoyage, then wiktionary, etc. [17:23:01] oh wait already cherry-picked in beta? then safe to merge [17:23:07] reading backlog from no_justification [17:23:12] (CR) Elukey: hieradata: add redis stretch deployment-prep instances [puppet] - https://gerrit.wikimedia.org/r/386869 (https://phabricator.wikimedia.org/T179371) (owner: Filippo Giunchedi) [17:23:19] godog: Your parent change doesn't look merged.
[17:23:22] Just that one ^^ [17:23:25] (er, cherry-picked) [17:24:04] 341a8c0ff19e17b229764c1567f72a46488029e3 is on deployment-prep, but doesn't look like 6fcd7966712894df2eda0c6f3c4eb81f2419044b is [17:24:08] ah, ok then I'll need to clean all of that up soon (tm) [17:24:18] sorry I dropped the ball on that whole redis+stretch thing [17:24:59] nbd, I'm just poking cherry-picks [17:25:06] https://gerrit.wikimedia.org/r/#/c/361796/ is also cherry-picked, probably safe in prod [17:26:17] Operations, Ops-Access-Requests: Requesting extended access to stat1005 for jdcc - https://phabricator.wikimedia.org/T184085#3887143 (Nuria) Extension approved. [17:27:54] "[LOCAL] Add cert for etcd in deployment-prep, hiera data for instance" would be nice to get in, it's all hieradata really [17:28:02] ema: yay for lvs3001! [17:28:20] volans: \o/ [17:28:29] I'll re-enable event handlers ;) [17:28:40] yes please [17:28:42] Operations, Ops-Access-Requests: Requesting extended access to stat1005 for jdcc - https://phabricator.wikimedia.org/T184085#3887147 (RobH) a:Nuria>RobH [17:30:51] !log re-enabled Icinga event handlers on RAID checks for lvs3001 [17:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:22] Operations, Analytics, Analytics-Cluster, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887160 (Shilad) I have availability this week and next, but I think @Ottomata is right. It will be tough to do this work if it's attached to a production machi... [17:31:38] ema: systemd still failing? [17:31:41] Operations, Commons, Multimedia, Traffic, and 5 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3887161 (Tgr) [17:31:56] volans: uh?
[17:31:58] (PS1) RobH: extending jdcc account expiry [puppet] - https://gerrit.wikimedia.org/r/403204 (https://phabricator.wikimedia.org/T184085) [17:32:08] icinga still alarming for it [17:32:14] didn't ssh yet to check though [17:32:36] PROBLEM - Restbase root url on restbase2009 is CRITICAL: connect to address 10.192.48.53 and port 7231: Connection refused [17:32:42] Operations, Commons, Multimedia, Traffic, and 5 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3340971 (Tgr) >>! In T167400#3884214, @Gilles wrote: > My advice would be to implement it, but not enable it until we really need to.... [17:32:46] (CR) RobH: [C: 2] extending jdcc account expiry [puppet] - https://gerrit.wikimedia.org/r/403204 (https://phabricator.wikimedia.org/T184085) (owner: RobH) [17:33:00] (PS1) Chad: deployment-prep: Commit hiera config for etcd [puppet] - https://gerrit.wikimedia.org/r/403205 [17:33:21] (PS1) Ottomata: Use hadoop cluster name variable in camus templates [puppet] - https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) [17:33:25] (Abandoned) Mobrovac: RESTBase: Do not manage Cassandra 2 in the legacy cluster [puppet] - https://gerrit.wikimedia.org/r/403122 (https://phabricator.wikimedia.org/T184100) (owner: Mobrovac) [17:33:45] Operations, Ops-Access-Requests, Patch-For-Review: Requesting extended access to stat1005 for jdcc - https://phabricator.wikimedia.org/T184085#3887171 (RobH) Open>Resolved a:RobH>None extension of expiry is now merged live [17:33:46] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:33:55] (CR) jerkins-bot: [V: -1] Use hadoop cluster name variable in camus templates [puppet] - https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) (owner: Ottomata) [17:34:15] volans: mdadm: No mail address or alert command - not monitoring. [17:34:17] restbase2009 is me [17:34:44] ema: mmmh that might have been me, I think I disabled it because it was sending emails every day [17:34:49] let me check the task [17:35:44] ema: T166965#3399661 let me fix it [17:35:44] T166965: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965 [17:36:38] (PS2) Ottomata: Use hadoop cluster name variable in camus templates [puppet] - https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) [17:38:37] RECOVERY - Check systemd state on lvs3001 is OK: OK - running: The system is fully operational [17:38:43] nice [17:38:45] ema: done ;) [17:38:56] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:39:07] PROBLEM - Restbase root url on restbase1011 is CRITICAL: connect to address 10.64.0.113 and port 7231: Connection refused [17:39:23] Operations, Analytics, Analytics-Cluster, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887218 (RobH) Moving the GPU between boxes is not advised. Items are warrantied to work in the server they were ordered in, so its typically messy to move the... [17:40:15] (PS3) Ottomata: Use hadoop cluster name variable in camus templates [puppet] - https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) [17:40:33] (PS1) Filippo Giunchedi: decom legacy restbase/cassandra 2 cluster [puppet] - https://gerrit.wikimedia.org/r/403208 (https://phabricator.wikimedia.org/T184100) [17:41:46] AaronSchulz: do you still need help with a key thing on tin?
i dont remember i'm afraid [17:41:47] !log rebooting image scalers in codfw for kernel security update (along with HHVM update) [17:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:09] (PS1) Cmjohnson: Removing dhcp file entries mw1180-1200 [puppet] - https://gerrit.wikimedia.org/r/403209 (https://phabricator.wikimedia.org/T183895) [17:42:35] (PS4) Ottomata: Use hadoop cluster name variable in camus templates [puppet] - https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) [17:42:47] (CR) Cmjohnson: [C: 2] Removing dhcp file entries mw1180-1200 [puppet] - https://gerrit.wikimedia.org/r/403209 (https://phabricator.wikimedia.org/T183895) (owner: Cmjohnson) [17:44:46] AaronSchulz: i was already off when i got your ping last night, let me know [17:46:47] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational [17:47:09] (PS1) Brion VIBBER: Remove firejail config for now-unused ffmpeg2theora [puppet] - https://gerrit.wikimedia.org/r/403212 (https://phabricator.wikimedia.org/T181591) [17:47:36] RECOVERY - Restbase root url on restbase2009 is OK: HTTP OK: HTTP/1.1 200 - 15785 bytes in 0.103 second response time [17:48:21] (CR) Muehlenhoff: [C: 1] "Ack. I'll take care of merging that in the next days." [puppet] - https://gerrit.wikimedia.org/r/403212 (https://phabricator.wikimedia.org/T181591) (owner: Brion VIBBER) [17:48:34] \o/ [17:48:42] Operations, Analytics, Analytics-Cluster, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887279 (Ottomata) Rats, well then. This is partially our fault for sticking this thing in a 'production' box in the first place, but we did it to save some mo...
[17:49:11] 10Operations, 10Commons, 10Multimedia, 10Traffic, and 5 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3887282 (10Tgr) [17:49:22] (03PS2) 10Chad: Move wiktionary and foundationwiki docroots to standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/402090 (https://phabricator.wikimedia.org/T126306) [17:50:13] godog: oh, ticket said "left to do is wiping os drives" so i thought everything else was done [17:50:17] (03PS1) 10Ema: cache_upload vtc: allow_inline_c for backend tests [puppet] - 10https://gerrit.wikimedia.org/r/403213 [17:50:29] can you (or someone) actually decom them besides the physical removal/wiping? [17:51:21] (03CR) 10Ema: [C: 032] cache_upload vtc: allow_inline_c for backend tests [puppet] - 10https://gerrit.wikimedia.org/r/403213 (owner: 10Ema) [17:51:26] (03CR) 10Paladox: Update gerrit login display (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [17:51:50] mark: ah my bad, yeah I'll cleanup dns and puppet [17:51:56] 10Operations, 10Commons, 10Multimedia, 10Traffic, and 5 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3350563 (10Tgr) a:03Tgr [17:52:04] (03PS2) 10Dzahn: CI: Add php7.0-zip stretch and beyond [puppet] - 10https://gerrit.wikimedia.org/r/403138 (owner: 10Chad) [17:52:28] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3887312 (10Tgr) a:03Tgr [17:52:43] (03CR) 10Dzahn: [C: 032] CI: Add php7.0-zip stretch and beyond [puppet] - 10https://gerrit.wikimedia.org/r/403138 (owner: 10Chad) [17:53:04] (03PS5) 10Ottomata: Use hadoop cluster name variable in camus templates [puppet] - 10https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) [17:54:05] elukey: Actually, my first prod swap just does foundationwiki and 
wiktionary. Still no need to rush today, but should be far more manageable than "basically all non-wikipedias" [17:54:07] :) [17:54:22] +1 :D [17:54:27] (03CR) 10Dzahn: "releases1001:" [puppet] - 10https://gerrit.wikimedia.org/r/403138 (owner: 10Chad) [17:56:59] !log MediaWiki Train: Branching 1.31.0-wmf.16 [17:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:46] (03PS6) 10Ottomata: Use hadoop cluster name variable in camus templates [puppet] - 10https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) [17:57:49] (03PS2) 10Dzahn: peopleweb: access based on roles, not host names [puppet] - 10https://gerrit.wikimedia.org/r/401829 [17:57:59] mark: I've downtimed all ms-fe3 / ms-be3 so it isn't confusing in icinga, I have to run now but I'll cleanup puppet/dns tomorrow unless someone gets to it first [17:58:46] godog: ticket link please? [17:59:06] mutante: hey, sure that's https://phabricator.wikimedia.org/T169518 [17:59:29] godog: hi:) i'll look at DNS [17:59:53] adding that decom template with the checkboxes [18:00:01] mutante: sweet, thanks a lot! [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:04] yw [18:00:14] Nothing for ORES [18:00:21] (03PS7) 10Ottomata: Use hadoop cluster name variable in camus templates [puppet] - 10https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) [18:01:16] (03CR) 10Mobrovac: [C: 031] decom legacy restbase/cassandra 2 cluster [puppet] - 10https://gerrit.wikimedia.org/r/403208 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [18:02:45] godog: sure, no rush [18:03:56] 10Operations, 10ops-esams, 10Patch-For-Review, 10User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3887365 (10Dzahn) [18:04:15] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3887366 (10Gehel) a:03Gehel [18:04:16] (03CR) 10Ottomata: "got it ! https://puppet-compiler.wmflabs.org/compiler02/9668/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) (owner: 10Ottomata) [18:04:20] 10Operations, 10ops-esams, 10Patch-For-Review, 10User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3400383 (10Dzahn) [18:04:27] (03PS8) 10Ottomata: Use hadoop cluster name variable in camus templates [puppet] - 10https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) [18:04:34] (03CR) 10Ottomata: [V: 032 C: 032] Use hadoop cluster name variable in camus templates [puppet] - 10https://gerrit.wikimedia.org/r/403206 (https://phabricator.wikimedia.org/T166248) (owner: 10Ottomata) [18:05:31] 10Operations, 10ops-esams, 10Patch-For-Review, 10User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3400383 (10Dzahn) [18:10:44] (03PS1) 10Dzahn: decom esams swift machines, rm from puppet/dhcp [puppet] - 10https://gerrit.wikimedia.org/r/403215 (https://phabricator.wikimedia.org/T169518) [18:13:03] 
10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 6 others: Archive mediawiki/extensions/Collection and others - https://phabricator.wikimedia.org/T183891#3887427 (10ovasileva) Currently, the collections extension is still being used to create books and to save them to the books namespace or user na... [18:14:37] (03PS1) 10Dzahn: decom esams swift machines, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/403216 (https://phabricator.wikimedia.org/T169518) [18:18:07] (03PS2) 10Dzahn: decom esams swift machines, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/403216 (https://phabricator.wikimedia.org/T169518) [18:19:41] (03PS2) 10Dzahn: decom esams swift machines, rm from puppet/dhcp [puppet] - 10https://gerrit.wikimedia.org/r/403215 (https://phabricator.wikimedia.org/T169518) [18:21:38] 10Operations, 10ops-esams, 10Patch-For-Review, 10User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3887456 (10Dzahn) [18:22:30] (03CR) 10Dzahn: [C: 032] decom esams swift machines, rm from puppet/dhcp [puppet] - 10https://gerrit.wikimedia.org/r/403215 (https://phabricator.wikimedia.org/T169518) (owner: 10Dzahn) [18:23:00] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 42.69, 36.55, 31.42 [18:23:30] _joe_: FYI ^^^ [18:23:35] yea, should i kill it [18:23:39] already on [18:24:23] <_joe_> mutante: take a hhvm-dump-debug output maybe [18:24:34] <_joe_> also, cool we caught it :) [18:25:03] done! 
[18:26:12] !log mw1227 hhvm-dump-debug > /root/hhvm-dump-debug-20170109-1024PST.log ; then killed hhvm and restarted it with systemctl [18:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:26] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183896#3887465 (10Cmjohnson) @fgiunchedi The disk has been replaced...please resolve this after confirmation [18:29:07] !log mw1227 killed it one more time and also restarted apache.. now load going down [18:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:36] <_joe_> mutante: you can alternatively use restart-hhvm [18:29:49] <_joe_> which depools the server, restarts hhvm, then repools it [18:29:57] ah, ok! [18:33:14] hmm.. it's still busy, just not _that_ busy [18:38:10] !log mw1227 still not showing recovery, using restart-hhvm [18:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:30] !log ms-fe3001 - shutting down for decom, removed from puppet [18:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:02] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 4.21, 14.31, 22.87 [18:39:17] well, i should have used the script right away ^.. ok [18:39:36] didn't kill it good enough before [18:42:53] !log ms-fe3002,ms-fe3001 - powering down, removing from puppet and icinga, ms-be* removing from puppet/icinga (T169518) [18:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:06] T169518: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518 [18:43:38] just ms-be3* of course, heh [18:52:22] no_justification: fyi, the train will be deploying 'that mcr stuff(refactoring)' this week [18:52:36] fyi twentyafterfour is doing the train :) [18:52:45] ooooh, okay! 
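The restart-hhvm behaviour _joe_ describes at 18:29:49 (depool the server, restart hhvm, repool it) can be sketched as below. This is a minimal illustration, not the actual script: the function name `restart_depooled` and the three callables are hypothetical stand-ins, and the choice to leave the host depooled when the restart fails is an assumption, not something stated in the log.

```python
from typing import Callable, List


def restart_depooled(
    depool: Callable[[], None],
    restart: Callable[[], None],
    repool: Callable[[], None],
) -> List[str]:
    """Depool, restart, then repool; return the steps that completed.

    If any step raises, the exception propagates and later steps are
    skipped, so a host whose restart failed is not put back in rotation.
    """
    done: List[str] = []
    depool()
    done.append("depool")
    restart()
    done.append("restart")
    repool()
    done.append("repool")
    return done


if __name__ == "__main__":
    # Hypothetical stand-ins; real callables would wrap the site's own tooling.
    events: List[str] = []
    steps = restart_depooled(
        depool=lambda: events.append("depooled"),
        restart=lambda: events.append("hhvm restarted"),
        repool=lambda: events.append("repooled"),
    )
    print(steps)
    print(events)
```

Ordering the steps this way means the load balancer stops sending traffic before the process goes down, which is the difference from killing hhvm in place as was first tried on mw1227.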
[18:52:55] * addshore will be here to watch things as they happen :) [18:53:07] jouncebot: next [18:53:08] In 0 hour(s) and 6 minute(s): Pre new MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T1900) [18:53:14] haha [18:53:34] no_justification: it still says you on the cal [18:54:45] addshore: Calendar lies [18:55:54] (03CR) 10Dzahn: [C: 032] "removed from icinga and shut down, mgmt DNS is still there per lifecycle but you might not need it for wiping anyways" [dns] - 10https://gerrit.wikimedia.org/r/403216 (https://phabricator.wikimedia.org/T169518) (owner: 10Dzahn) [18:56:58] 10Operations, 10hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. - https://phabricator.wikimedia.org/T184551#3887542 (10Ottomata) [18:57:12] no_justification: obviously [18:57:15] 10Operations, 10Analytics, 10hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. - https://phabricator.wikimedia.org/T184551#3887542 (10Ottomata) [18:57:18] 10Operations, 10ops-esams, 10Patch-For-Review, 10User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3887551 (10Dzahn) [18:57:45] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=1fullscreen [18:57:46] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3887553 (10Andrew) First, on labvirt1018: # Create new base images with updated kernels # upgrade kernels on all VMs (how? Probably we nee... [18:58:23] mark: (godog) they are now gone from icinga and DNS, the only part i can't do is the "disable switch port" checkbox. i copied that decom checklist from the server lifecycle page. 
[18:59:44] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:00:04] Deploy window Pre new MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T1900) [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:04] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:00:54] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:04] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:15] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:44] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:44] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:44] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:54] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=1fullscreen [19:02:14] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:02:15] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:03:04] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:03:25] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:03:44] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3887563 (10chasemp) [19:03:44] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:03:50] puppetdb died and was restarted by systemd 6min ago [19:03:54] FYI [19:03:54] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:03:54] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:03:54] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:03:54] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:04:05] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:04:14] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:04:26] 'kk [19:04:34] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:04:44] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:05:04] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:05:37] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3875329 (10chasemp) [19:06:44] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:07:14] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:09:24] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3887598 (10chasemp) [19:10:37] 10Operations, 10ops-esams, 10Patch-For-Review, 10User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3887605 (10Dzahn) a:03mark @mark @fgiunchedi They are shutdown and removed from Icinga and DNS now. Only the "disable switch port" part i could not do due to lac... [19:18:37] (03CR) 10ArielGlenn: "Great. It would be nice to have as little of that cruft in host files as possible." 
[puppet] - 10https://gerrit.wikimedia.org/r/401829 (owner: 10Dzahn) [19:18:42] (03CR) 10ArielGlenn: [C: 031] peopleweb: access based on roles, not host names [puppet] - 10https://gerrit.wikimedia.org/r/401829 (owner: 10Dzahn) [19:19:14] hate when I click the button and immediately realize i forgot to add the +1 [19:19:43] apergos: thank you :) [19:19:58] yw [19:20:21] (03CR) 10Dzahn: [C: 032] peopleweb: access based on roles, not host names [puppet] - 10https://gerrit.wikimedia.org/r/401829 (owner: 10Dzahn) [19:20:26] (03PS3) 10Dzahn: peopleweb: access based on roles, not host names [puppet] - 10https://gerrit.wikimedia.org/r/401829 [19:21:12] apergos: yes, exactly. i would like to remove all admin groups from the ./hosts/* tree [19:21:20] and only use role/common/ [19:22:10] (03CR) 10Dzahn: "thanks, yea, the plan would be to remove any remaining admin groups from ./hosts/ and only ever use ./role/common/" [puppet] - 10https://gerrit.wikimedia.org/r/401829 (owner: 10Dzahn) [19:22:37] once upon a time there was not even puppet... 
how far we've come [19:22:47] heh, indeed [19:26:48] (03CR) 10Dzahn: "noop on rutherfordium" [puppet] - 10https://gerrit.wikimedia.org/r/401829 (owner: 10Dzahn) [19:28:25] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:44] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [19:28:48] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3887685 (10chasemp) [19:28:54] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:28:54] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:28:54] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:28:54] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:29:05] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:29:14] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [19:29:34] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:30:54] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:31:04] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:31:15] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:31:44] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is 
currently enabled, last run 3 minutes ago with 0 failures [19:31:44] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:31:46] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3887694 (10chasemp) [19:32:14] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:33:04] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:33:23] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3875329 (10chasemp) > tldr PCID feature on all, invpcid on 1010+ only Checking for PCID and INVPCID feature flabs across labvirts. ```for... [19:35:18] mutante: what is the meltdown kernel version target for stretch and jessie in prod do you know? [19:38:23] chasemp: should be linux-image-4.9.0-0.bpo.5 [19:38:40] jessie [19:44:36] chasemp: stretch looks like linux-image-4.9.0-5-amd64, afaict... [19:44:48] mutante: tx man [19:44:58] Fyi for jessie it required a restart, but i dont know if it will to you chasemp [19:48:30] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3887751 (10BBlack) >>! In T182993#3871545, @Ottomata wrote: >> The sigalgs lists being negotiated for mutual certificate-based auth seem to i... [19:54:23] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887767 (10dr0ptp4kt) Bummer. I think we need to stall out this task until a future date, in alignment with @ottomata 's suggestion to @shilad about it needing an... 
[19:55:56] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887772 (10Shilad) That makes sense. There are plenty of other avenues I can explore without a GPU. [20:00:05] no_justification: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180109T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:00:35] !log phab2001 - reboot for upgrade [20:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:23] !log gerrit2001 - rebooting [20:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:14] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint: maps-test2001 is low on disk space - https://phabricator.wikimedia.org/T182583#3887796 (10Fjalapeno) 05Open>03Resolved [20:06:00] twentyafterfour: give me a ping before you run the train pretty please :) [20:06:08] (03PS1) 10Andrew Bogott: vmbuilder: @qualify an erb variable [puppet] - 10https://gerrit.wikimedia.org/r/403226 [20:06:11] addshore: just about to do it [20:06:17] so, ping [20:06:48] (03CR) 10Andrew Bogott: [C: 032] vmbuilder: @qualify an erb variable [puppet] - 10https://gerrit.wikimedia.org/r/403226 (owner: 10Andrew Bogott) [20:07:08] twentyafterfour: awesome! [20:09:12] addshore: should I wait or you just wanted to know when it's happening? [20:09:28] twentyafterfour: just wanted to know when it was happening so I can watch the logs :) [20:12:04] !log twentyafterfour@tin Started scap: Deploy 1.31.0-wmf.16 to test wikis and rebuild l10n. 
refs T180749 [20:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:17] T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749 [20:12:34] (03PS1) 10Andrew Bogott: wikibase: add default dummy args for wikibase role [puppet] - 10https://gerrit.wikimedia.org/r/403228 [20:12:59] mutante: do you like https://gerrit.wikimedia.org/r/#/c/403228/? Or would you rather put in real values in the VM hiera config? [20:13:30] !log netmon2001 - rebooting [20:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:42] andrewbogott: the style guide wants us to not have default values. " Profile classes should only have parameters that default to an explicit hiera calls with no fallback value. [20:14:48] !log twentyafterfour@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="test2wiki" --outdir="/tmp/scap_l10n_3984299293" --threads=10 --lang en --quiet' returned non-zero exit status 1 (duration: 02m 44s) [20:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:58] mutante: ok [20:15:13] eww [20:15:16] andrewbogott: so yea, i would say in Horizon Hiera? or in the puppet repo in hiera? [20:15:20] ugh. `scap prep` fail [20:15:20] so… what would you like those values to be on wikibase-stretch? (Or, maybe thats nothing to do with you?) [20:15:38] (03Abandoned) 10Andrew Bogott: wikibase: add default dummy args for wikibase role [puppet] - 10https://gerrit.wikimedia.org/r/403228 (owner: 10Andrew Bogott) [20:16:04] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 34.26, 34.58, 32.12 [20:17:00] (03CR) 10Rush: [V: 032 C: 032] "I believe these are all style warnings and refactoring this giant mess isn't in scope here." [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [20:17:04] andrewbogott: one sec.. 
looking [20:17:04] (03PS9) 10Rush: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) [20:17:08] (03CR) 10Rush: [V: 032 C: 032] toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [20:17:58] andrewbogott: i would like it to use this: https://wikitech.wikimedia.org/wiki/Hiera:Wikidata-dev [20:18:12] andrewbogott: except the Hiera stuff needs to move from Wikitech over to Horizon now.. right [20:18:21] and that's why it stopped running i assume [20:18:23] I'll do it, I have that page open [20:18:28] cool, thanks [20:18:30] It stopped working a LONG time ago as far as I know [20:18:38] yea, i just saw when logging in [20:18:43] a lot of minutes [20:19:09] ah, it's broken in other ways too [20:19:31] looks at the error now [20:20:24] andrewbogott: i'll fix [20:20:30] thanks [20:20:32] it's because we are using the new httpd module [20:20:34] now [20:20:48] i should send a list email too, heh [20:21:01] !log twentyafterfour@tin Started scap: Deploy 1.31.0-wmf.16 to test wikis and rebuild l10n. refs T180749 (attempt 2) [20:21:08] addshore: trying again [20:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:12] T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749 [20:22:05] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887877 (10dr0ptp4kt) @RobH what's the proper way for us to contact AMD for support about driver options? I wasn't sure if we needed to use a particular customer... [20:24:09] twentyafterfour: ack! [20:24:54] mutante: wikibase and wikibase-vue are broken in similar ways [20:25:22] andrewbogott: ok, leave them to me for now. 
maybe after lunch but i'll fix them [20:25:30] thank you! [20:25:59] (03PS1) 10Rush: tools: ferm handler updates [puppet] - 10https://gerrit.wikimedia.org/r/403231 [20:26:04] (03PS2) 10Rush: tools: ferm handler updates [puppet] - 10https://gerrit.wikimedia.org/r/403231 [20:26:27] (03CR) 10jerkins-bot: [V: 04-1] tools: ferm handler updates [puppet] - 10https://gerrit.wikimedia.org/r/403231 (owner: 10Rush) [20:27:12] (03CR) 10Rush: [V: 032 C: 032] tools: ferm handler updates [puppet] - 10https://gerrit.wikimedia.org/r/403231 (owner: 10Rush) [20:27:19] (03PS1) 10Dzahn: wikibase: include missing httpd profile for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/403232 [20:28:06] (03PS2) 10Dzahn: wikibase: include missing httpd profile for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/403232 [20:29:18] (03PS3) 10Dzahn: wikibase: include missing httpd profile for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/403232 [20:29:39] (03CR) 10Dzahn: [C: 032] wikibase: include missing httpd profile for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/403232 (owner: 10Dzahn) [20:32:28] andrewbogott: done :) [20:32:32] all 3 green now [20:32:34] great, thanks [20:33:00] yw! it was me when re-organizing the prod setup for "misc webserver" [20:33:03] bbl [20:35:51] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887930 (10RobH) @dr0ptp4kt: We don't have any contacts with AMD. You may want to ask on the ops list though, as someone may know someone (our team seems to know... 
[20:43:05] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 39.95, 34.18, 32.42 [20:44:44] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:45:44] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 75646 bytes in 4.702 second response time [20:45:54] PROBLEM - High CPU load on API appserver on mw1201 is CRITICAL: CRITICAL - load average: 41.69, 35.64, 30.47 [20:51:44] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 59.00, 32.35, 25.31 [20:52:05] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 50.13, 33.85, 31.83 [20:52:24] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 49.63, 24.49, 18.25 [20:54:24] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 21.59, 22.68, 18.39 [20:57:35] !log twentyafterfour@tin Finished scap: Deploy 1.31.0-wmf.16 to test wikis and rebuild l10n. refs T180749 (attempt 2) (duration: 36m 34s) [20:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:46] T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749 [20:59:26] coolio! 
[20:59:45] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 22.08, 23.82, 23.97 [21:03:39] right, so twentyafterfour I see a couple of things in the logs I will have to investigate before tomorrow [21:03:52] [{exception_id}] {exception_url} MediaWiki\Storage\RevisionAccessException from line 217 of /srv/mediawiki/php-1.31.0-wmf.16/includes/Storage/RevisionStore.php: Could not determine title for page ID 182830 and revision ID 303630 [21:03:56] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-netbox, looks like it thinks its a prod box - https://phabricator.wikimedia.org/T184242#3887999 (10greg) (Adding #operations since this box is in support of their evaluation of netbox) [21:04:16] addshore: ok, I'm about to promote the branch to group0, any concerns? [21:04:18] similar to the things that I saw and fixed over the Christmas period, but apparently a different code path [21:04:35] twentyafterfour: oh, it's not already on group0? [21:04:49] just testwikis [21:05:10] ahh, okay, let me look at the stack first [21:06:20] hmm, they look to be issues with Echo [21:07:07] At a guess, rolling this out will break mention notifications and possibly cause exceptions when talk pages are edited [21:07:24] hmm [21:07:46] https://www.irccloud.com/pastebin/YyDkaxB0/ [21:08:01] 10Operations, 10Wikimedia-Mailing-lists: All IP addresses used for sending emails by Wikimedia's services - https://phabricator.wikimedia.org/T184555#3888010 (10Krenair) [21:08:40] also appeared to only be on testwikidatawiki... hmmm [21:08:42] addshore: should I hold the train? [21:09:36] 10Operations, 10Mail: All IP addresses used for sending emails by Wikimedia's services - https://phabricator.wikimedia.org/T184555#3887812 (10Krenair) Could potentially give them the IPs for mx1001.wikimedia.org / mx2001.wikimedia.org, but they might change in future... And other stuff (misc services) might al... 
[21:12:03] 10Puppet, 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-eventlogging04 due to missing directory '/var/lib/superset'? - https://phabricator.wikimedia.org/T184238#3888024 (10greg) per [[Dev/Maint]] [21:12:17] okay, I also triggered the code path on testwiki. [21:12:36] Editing talkpages, and including links works, but actual mention notifications won't get fired [21:12:52] I guess we don't really want that on mw.org, so I think hold the train for now [21:13:31] It could be nice to keep the code on the test wikis? in case it surfaces any other issues [21:23:23] (03PS1) 10Cmjohnson: Reoving dns entries for mw1180-1200 [dns] - 10https://gerrit.wikimedia.org/r/403239 (https://phabricator.wikimedia.org/T183895) [21:25:46] (03CR) 10Anomie: "I'm sure it's possible. I don't know if anyone is going to actually do it. Meanwhile, I'd like to not have to have a local copy of the she" [puppet] - 10https://gerrit.wikimedia.org/r/397913 (owner: 10Anomie) [21:33:00] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3888087 (10demon) [21:33:12] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 6 others: Archive mediawiki/extensions/Collection and others - https://phabricator.wikimedia.org/T183891#3888085 (10demon) 05Open>03declined Then we shouldn't do this :) [21:36:07] twentyafterfour: I filed https://phabricator.wikimedia.org/T184559 [21:39:00] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3888118 (10dr0ptp4kt) Thanks! [21:40:14] (03CR) 10Dzahn: "it should use the ones from https://wikitech.wikimedia.org/wiki/Hiera:Wikidata-dev" [puppet] - 10https://gerrit.wikimedia.org/r/403228 (owner: 10Andrew Bogott) [21:41:52] (03CR) 10RobH: [C: 031] "typo in commit message but otherwise lgtm!" 
[dns] - 10https://gerrit.wikimedia.org/r/403239 (https://phabricator.wikimedia.org/T183895) (owner: 10Cmjohnson) [21:44:38] 10Operations, 10Mail: All IP addresses used for sending emails by Wikimedia's services - https://phabricator.wikimedia.org/T184555#3887812 (10herron) As @Krenair mentioned IP addresses are subject to change. SPF records are best used to convey IP information for a sending domain. But if individual IPs must b... [21:47:23] 10Puppet, 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-eventlogging04 due to missing directory '/var/lib/superset'? - https://phabricator.wikimedia.org/T184238#3888157 (10Krenair) I ran the exact same command that puppet does (as the user specified in the puppet file), and it appea... [21:49:38] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3888162 (10Ottomata) OO I have done some [[ https://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/JSSERefGuide.html#DisabledAl... [21:51:29] (03PS2) 10Cmjohnson: Reoving dns entries for mw1180-1200 [dns] - 10https://gerrit.wikimedia.org/r/403239 (https://phabricator.wikimedia.org/T183895) [21:52:29] (03CR) 10Cmjohnson: [C: 032] Reoving dns entries for mw1180-1200 [dns] - 10https://gerrit.wikimedia.org/r/403239 (https://phabricator.wikimedia.org/T183895) (owner: 10Cmjohnson) [21:55:31] 10Operations, 10Puppet: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894#3888227 (10herron) [21:56:46] 10Puppet, 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-eventlogging04 due to missing directory '/var/lib/superset'? - https://phabricator.wikimedia.org/T184238#3888234 (10Krenair) 05Open>03Resolved ottomata, -eventlogging04? superset is there? ??... 
[21:56:50] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3888236 (10Krenair) [22:01:28] 10Operations, 10Puppet: Upgrade puppetDB to version 3.2 or newer - https://phabricator.wikimedia.org/T177253#3888256 (10herron) [22:01:44] 10Puppet, 10Operations-Software-Development, 10Patch-For-Review: Cumin: PuppetDB backend, add support for API v4 - https://phabricator.wikimedia.org/T182575#3888257 (10herron) [22:06:04] RECOVERY - High CPU load on API appserver on mw1201 is OK: OK - load average: 22.33, 23.20, 23.99 [22:06:15] [WlU8yQpAEK0AAE12AJUAAABR] 2018-01-09 22:06:01: Fatal exception of type "BannerExistenceException" [22:09:34] PROBLEM - MariaDB Slave Lag: m3 on db1059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.51 seconds [22:12:53] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3888280 (10atgo) Hi @RobH ! @cy534 will be signing an acknowledgement of NDA rather than a full NDA, as he's part of a consulting group that already has an NDA... [22:13:24] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [22:14:06] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3888294 (10RobH) @atgo: So I can check the legal departments NDA spreadsheet to see when their name is listed. Alternatively @RStallman-legalteam can confirm w... [22:18:55] i'll fix that on netmon2001, that's from rebooting it [22:18:58] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3888304 (10atgo) Awesome, thanks! 
[22:18:59] (03PS4) 10Dzahn: planet: add some missing Hiera calls and rename params [puppet] - 10https://gerrit.wikimedia.org/r/397729 [22:19:51] * Platonides finds out that rstallman is called Rachel :P [22:20:06] Platonides: and a lawyer [22:20:21] what scares you most :P ? [22:20:24] (03PS1) 10Rush: tools: ferm pre hook to stop kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722) [22:20:25] RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. [22:20:27] !log netmon2001 - arming keyholder for rancid [22:20:35] I'm not sure if there is a difference between lawyer and paralegal [22:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:04] I'd say yes but I'm not familiar with U.S. lawyer classifications. [22:21:48] A paralegal, like a lawyer, can be employed by a law office or work freelance at a company or law office. Paralegals are not allowed to offer legal services directly to the public on their own and must perform their legal work under an attorney or law firm (except in Ontario Canada) [22:21:53] https://en.wikipedia.org/wiki/Paralegal [22:22:29] it seems to be something like a pharmacy assistant compared to a pharmacist [22:22:59] !log aaron@tin Synchronized php-1.31.0-wmf.16/includes/Setup.php: 68b4bbfbc12c626 (duration: 01m 15s) [22:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:26] Platonides: exactly [22:42:49] 10Operations, 10Puppet, 10Goal: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#3888373 (10Volans) [22:48:39] 10Operations, 10Puppet, 10Goal: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#3888386 (10Volans) [22:50:14] (03PS5) 10Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - 10https://gerrit.wikimedia.org/r/397729 [22:52:00] !log ms-be1033 
truncate unrotated and big server.log [22:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:34] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3888423 (10bd808) [22:54:44] RECOVERY - MariaDB Slave Lag: m3 on db1059 is OK: OK slave_sql_lag Replication lag: 22.22 seconds [23:10:30] (03PS1) 10BryanDavis: wmcs: Add s8.labsdb and move wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/403315 (https://phabricator.wikimedia.org/T184179) [23:15:10] (03PS6) 10Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - 10https://gerrit.wikimedia.org/r/397729 [23:15:35] (03CR) 10jerkins-bot: [V: 04-1] planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - 10https://gerrit.wikimedia.org/r/397729 (owner: 10Dzahn) [23:18:38] (03CR) 10BryanDavis: "Untested, but concept seems sound. One puppet style whine inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [23:20:38] (03CR) 10Madhuvishy: [C: 032] wmcs: Add s8.labsdb and move wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/403315 (https://phabricator.wikimedia.org/T184179) (owner: 10BryanDavis) [23:20:54] (03PS7) 10Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - 10https://gerrit.wikimedia.org/r/397729 [23:31:44] (03PS8) 10Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - 10https://gerrit.wikimedia.org/r/397729 [23:32:09] (03CR) 10jerkins-bot: [V: 04-1] planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - 10https://gerrit.wikimedia.org/r/397729 (owner: 10Dzahn) [23:35:30] (03PS9) 10Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - 10https://gerrit.wikimedia.org/r/397729 [23:37:07] (03CR) 10Eevans: [C: 031] decom legacy restbase/cassandra 2 cluster [puppet] - 
10https://gerrit.wikimedia.org/r/403208 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [23:43:15] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [23:48:04] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0 [23:50:04] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 [23:53:19] (03PS10) 10Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - 10https://gerrit.wikimedia.org/r/397729 [23:56:54] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3888589 (10bd808) [23:56:58] 10Operations, 10Data-Services, 10MediaWiki-Maintenance-scripts, 10Wikidata, 10cloud-services-team (Kanban): Missing references to s8 on maintenance and cloud scripts (and potentially others) - https://phabricator.wikimedia.org/T184179#3888586 (10bd808) 05Open>03Resolved I think the #cloud-services bi...