[00:01:53] !log maxsem@tin Synchronized wmf-config/InitialiseSettings-labs.php: Labs-only cleanups (duration: 00m 25s)
[00:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:02:04] (CR) Andrew Bogott: [C: 2] Rename oslo.config to oslo_config [puppet] - https://gerrit.wikimedia.org/r/300453 (owner: Andrew Bogott)
[00:09:42] (PS1) Thcipriani: Beta: Scap canary deploy dsh groups [puppet] - https://gerrit.wikimedia.org/r/300457
[00:09:44] (PS1) Thcipriani: Beta: Add logstash host to scap.cfg [puppet] - https://gerrit.wikimedia.org/r/300458
[00:10:25] mutante: or maybe you would know since Max is busy: https://phabricator.wikimedia.org/T139552#2484944
[00:11:10] (CR) Thcipriani: [C: -1] "WIP: Depends on https://phabricator.wikimedia.org/D248" [puppet] - https://gerrit.wikimedia.org/r/300458 (owner: Thcipriani)
[00:11:17] kaldari: sorry, no, i haven't done that
[00:11:24] kaldari: cd to /srv/mediawiki/
[00:11:29] mediawiki-staging
[00:11:34] oh, wait, it's terbium
[00:11:36] no, that's right
[00:11:42] kaldari: technically, you don't even need to cd
[00:11:47] mwscript sql.php --wiki=enwiki extensions/PageAssessments/db/addProjectsTable.sql
[00:11:50] that from your home dir will work
[00:11:58] that's easy
[00:12:22] Reedy: and no special params needed for the master db?
[00:12:42] don't think so
[00:12:55] the script has a parameter to force to use a slave
[00:13:46] } else {
[00:13:46] $index = DB_MASTER;
[00:13:46] }
[00:13:50] kaldari: Will work fine
[00:13:51] ok, I'm going to try this...
[00:16:31] (CR) Paladox: [C: 1] gerrit: up heap size limit from 20GB to 28GB [puppet] - https://gerrit.wikimedia.org/r/300446 (https://phabricator.wikimedia.org/T141064) (owner: Dzahn)
[00:16:39] (PS1) Dzahn: gerrit: add missing source for remote recursion [puppet] - https://gerrit.wikimedia.org/r/300459
[00:17:50] (CR) Paladox: [C: 1] gerrit: add missing source for remote recursion [puppet] - https://gerrit.wikimedia.org/r/300459 (owner: Dzahn)
[00:18:36] kaldari: if it doesn't work, you might need to use a full path
[00:19:12] yeah, looks like it. I suppose /srv/mediawiki/php/extensions/PageAssessments/db/addProjectsTable.sql should work
[00:19:41] yeah
[00:19:53] It's easier to add it to CreateExtensionTables in WikimediaMaintenance ;)
[00:20:09] Yay, success!
[00:21:31] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2485974 (Liuxinyu970226) @Pavanaja what about Module and Module talk?
[00:22:56] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (Reedy) >>! In T140898#2485974, @Liuxinyu970226 wrote: > @Pavanaja what about Module and Module talk? How do you translate them? Note, they s...
[00:26:19] (CR) Dzahn: [C: 2] gerrit: add missing source for remote recursion [puppet] - https://gerrit.wikimedia.org/r/300459 (owner: Dzahn)
[00:28:31] gerrit restarted and is already back
[00:28:44] had to apply that follow-up fix there
[00:28:59] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[00:29:10] mutante ^^
[00:29:31] yea:) that's what it fixed
[00:29:42] yep
[00:29:56] now lead
[00:32:00] RECOVERY - puppet last run on lead is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[00:34:53] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 707 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5138654 keys - replication_delay is 707
[00:38:43] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5111319 keys - replication_delay is 0
[00:43:23] !log restarted grrrit-wm
[00:45:28] (CR) Dzahn: ""secret(): invalid secret gerrit/ssh_host_key" when running in compiler. we gotta add a fake secret to labs/private too or we will have th" [puppet] - https://gerrit.wikimedia.org/r/300279 (owner: Chad)
[00:46:56] Operations, Ops-Access-Requests, Editing-Analysis, Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2486001 (Neil_P._Quinn_WMF) Ah, okay, thanks for the clarification!
[00:48:31] (PS2) Dzahn: Gerrit: Disable downloading of archives [puppet] - https://gerrit.wikimedia.org/r/300304 (owner: Chad)
[00:53:46] !log Restarted elasticsearch on logstash1003; couldn't find master (even though the master thought 1003 was fine)
[00:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:54:03] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 26, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards
[00:59:04] bd808: odd, that's kinda what 1002 did earlier
[01:25:06] (PS1) Andrew Bogott: Designate: Don't specify two nameservers if they're the same. [puppet] - https://gerrit.wikimedia.org/r/300462
[01:26:14] (CR) jenkins-bot: [V: -1] Designate: Don't specify two nameservers if they're the same. [puppet] - https://gerrit.wikimedia.org/r/300462 (owner: Andrew Bogott)
[01:36:08] (PS2) Andrew Bogott: Designate: Don't specify two nameservers if they're the same. [puppet] - https://gerrit.wikimedia.org/r/300462
[01:37:27] (CR) jenkins-bot: [V: -1] Designate: Don't specify two nameservers if they're the same. [puppet] - https://gerrit.wikimedia.org/r/300462 (owner: Andrew Bogott)
[01:39:22] (PS3) Andrew Bogott: Designate: Don't specify two nameservers if they're the same. [puppet] - https://gerrit.wikimedia.org/r/300462
[01:40:48] (CR) Andrew Bogott: [C: 2] Designate: Don't specify two nameservers if they're the same. [puppet] - https://gerrit.wikimedia.org/r/300462 (owner: Andrew Bogott)
[02:18:05] So do we proceed with https://phabricator.wikimedia.org/T138460 ?
[02:18:52] twentyafterfour ?
[02:20:06] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 08m 23s)
[02:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:21:51] (PS2) Jcrespo: Set m3-master as an alias of dbproxy1003 [dns] - https://gerrit.wikimedia.org/r/299764 (https://phabricator.wikimedia.org/T138460)
[02:23:43] (CR) Jcrespo: [C: 2] Set m3-master as an alias of dbproxy1003 [dns] - https://gerrit.wikimedia.org/r/299764 (https://phabricator.wikimedia.org/T138460) (owner: Jcrespo)
[02:25:53] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jul 22 02:25:53 UTC 2016 (duration 5m 47s)
[02:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:49] !log making db2012.codfw.wmnet:3306 a child of db1048.eqiad.wmnet
[02:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:47] !log making dbstore1002.eqiad.wmnet:3306 a child of db1048.eqiad.wmnet:3306
[02:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:43] !log setting db1043 as read-only (phabricator/m3)
[02:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:28] !log updating m3-master dns
[02:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:14] currently getting from phabricator:
[02:35:17] AphrontQueryException
[02:35:17] #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement
[02:35:33] yep
[02:35:38] How long should this last?
[02:35:44] !log SET GLOBAL read_only=0; on db1048
[02:35:50] jynus just put it in that mode deliberately, it just went into SAL before you joined
[02:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:04] a few seconds only
[02:36:07] okay
[02:36:26] (Curious that it's preventing me from even loading pages, but whatever.)
[02:36:41] It's a good question harej
[02:37:10] But unfortunately one for upstream developers, not our ops :)
[02:37:27] Does every page load trigger a db write? That would be interesting.
[02:37:46] Clearly it's trying to update something
[02:38:05] or insert, delete, whatever. some sort of write statement
[02:43:59] (PS1) Jcrespo: Change db1048 as m3-master [dns] - https://gerrit.wikimedia.org/r/300466
[02:44:21] (CR) Jcrespo: [C: 2] Change db1048 as m3-master [dns] - https://gerrit.wikimedia.org/r/300466 (owner: Jcrespo)
[02:44:49] phabricator (maniphest) is down (viewing any task) https://phabricator.wikimedia.org/T1
[02:45:01] "#1290: The MariaDB server is running with the --read-only option so it cannot execute this statement"
[02:45:33] Vulpix: masters are being switched, see SAL
[02:45:52] ok, thanks
[02:46:50] I have problems, phabricator is not picking up the dns update
[02:48:04] change the phab config to point to the ip?
[02:49:48] phabricator in maintenance?
[02:49:56] yes
[02:50:00] read up Danny_B
[02:50:21] jynus: lmk if you need another pair of eyes/hands
[02:51:09] jynus, looks like it's $mysql_host/$mysql_slave in modules/role/manifests/phabricator/main.pp
[02:51:31] any eta?
[02:52:18] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd)
[02:52:26] I think iridium is not properly puppetized
[02:52:36] and it may be hardconfig its database
[02:53:26] what's not properly puppetized?
[02:54:12] twentyafterfour, I have updated dns and cleared cache on iridium
[02:54:17] ok
[02:54:24] but it still point to the old master
[02:54:50] hmm
[02:54:58] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Puppet has 1 failures
[02:55:02] want me to change config?
[02:55:06] ACKNOWLEDGEMENT - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) ori.livneh Phabricator maintenance
[02:55:35] iridium already sees the new dns, so it is not a dns issue
[02:56:27] m3-master.eqiad.wmnet
[02:56:36] is what it's set to connect to
[02:56:43] restarted apache2 and phd service?
[02:57:06] that is the right ip
[02:57:12] /etc/hosts
[02:57:15] but it says "read only"
[02:57:16] has an entry
[02:57:19] 10.64.16.32 m3-master.eqiad.wmnet
[02:57:20] ah!
[02:57:28] * twentyafterfour didn't do it!
[02:57:32] :D
[02:57:43] can we nuke that?
[02:57:46] that's lame
[02:57:49] yes absolutely
[02:57:55] nuke it from orbit
[02:58:20] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1208.36 seconds
[02:58:27] PROBLEM - MariaDB Slave Lag: m3 on db2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1214.22 seconds
[02:58:28] ^ignore that
[02:59:01] !log restarted phd on iridium
[02:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:59:14] it is up
[02:59:21] \o/
[03:00:03] I was right, that should not be there, or if it should, it would have been changed automatically by puppet
[03:00:07] ori: nice catch
[03:00:38] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 11 processes with UID = 997 (phd)
[03:00:40] please test the hell out of phab now
[03:00:46] jynus: on it
[03:00:48] it is new db, new version
[03:01:06] and let me fail back to the original plan of a proxy
[03:01:34] nice
[03:01:38] (PS1) Jcrespo: Revert "Change db1048 as m3-master" [dns] - https://gerrit.wikimedia.org/r/300467
[03:01:39] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1408.70 seconds
[03:02:44] ignore the replication errors, it is just heartbeat not being properly updated, it will go away soon (replication is ok)
[03:03:13] (CR) Jcrespo: [C: 2] Revert "Change db1048 as m3-master" [dns] - https://gerrit.wikimedia.org/r/300467 (owner: Jcrespo)
[03:04:05] hmm phd user has an invalid home directory... strange
[03:04:27] other than that unrelated thing, everything looks ok
[03:04:33] !log reverting m3-master dns back to the proxy
[03:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:51] should we restart phd once more?
[03:05:29] ok and apache if you changed dns again
[03:05:54] phd and apache I mean
[03:06:29] did you do it or I do?
[03:06:38] I'll do it
[03:06:56] !log restarted apache2 and phd on iridium
[03:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:07:47] no errors from phd except that thing about the home directory not existing
[03:08:01] but that cannot be the database, twentyafterfour
[03:08:11] right, unrelated and I'll write a patch to fix that in puppet
[03:08:13] It seems okay now
[03:08:25] I literally didn't touch iridium at all
[03:08:32] I asked in #phabricator why read-only mode kills the whole thing instead of just write actions
[03:09:01] I need read only for master failover- that normally only take a few second
[03:09:18] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - Received genError(5) error-status at error-index 1
[03:09:25] but m3-master was hardcoded on iridium
[03:09:37] there is a somewhat new cluster config in phabricator which probably will help with being tolerant to read only
[03:09:53] it can even fall back to the slaves if master isn't accepting connections
[03:10:09] we haven't got that part configured in phab though
[03:10:27] if everything seems sane, I am going to nuke db1043
[03:10:35] (old master)
[03:10:44] and check everything continues being ok
[03:11:18] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[03:12:15] ori, sorry, it is too late for me to be very bright at this time
[03:12:19] Everything seems sane as far as I can tell
[03:12:43] Krenair: it fails when viewing phab being logged in. Logged out/private browsing worked fine, but strange indeed
[03:13:09] it logs to the database
[03:13:22] some things anyway
[03:14:02] !log stopping db1043 db
[03:14:03] probably updating xsrf tokens for one thing
[03:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:14:28] It sounds like Phabricator may not have been put into read-only mode, just the database.
[03:14:35] We don't currently detect that the database is in read-only mode -- there's a separate Phabricator switch (cluster.read-only) or we auto-detect based on the master being completely unreachable.
[03:15:02] sounds like wgReadOnly
[03:15:16] heh
[03:15:54] Interesting
[03:15:56] jynus, twentyafterfour: Maybe that's something that should be used next time?
[03:16:06] indeed
[03:16:52] Operations, DBA, Phabricator, Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2486108 (jcrespo) Binary log at the time of the master change: ``` MariaDB db1043.eqiad.wmnet (none) > SHOW MASTER STATUS\G *************************** 1. row ********...
[03:17:32] Krenair, next time, the /etc/hosts wrong entry will not be there, and it will only take 1 second to perform the maintenance
[03:17:38] :)
[03:17:43] also we have a proxy now
[03:17:55] which means no longer rely on dns
[03:18:08] so both things will make that a no problem
[03:19:02] the problem with DBAs coming and going is that we will hit the same rocks over and over
[03:19:22] at least these issues will not be documented
[03:19:43] https://wikitech.wikimedia.org/wiki/MariaDB/misc
[03:19:58] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[03:19:58] most misc databases were previously undocumented
[03:20:02] :)
[03:20:18] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1043.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1043.eqiad.wmnet (111 Connection refused)
[03:20:20] ^that is the "slave" (old master
[03:20:27] being down
[03:20:31] so no worries
[03:20:55] sorry for the spam, it was not precisely the smoothest process
[03:21:26] yeah, well, the lack of smoothness wasn't your fault
[03:21:36] no need to apologise :)
[03:21:53] in part, this has a component of: we do it once under controlled environment
[03:22:00] detect issues
[03:22:12] those will not happen again (e.g. on emergency)
[03:22:34] (like when we did the datacenter failover)
[03:23:06] twentyafterfour, is the puppet run the home dir you mentioned?
[03:23:35] jynus: no I haven't submitted that yet and I can't merge patches
[03:23:43] I can!
[03:24:12] let me help if you need me
[03:24:48] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:25:28] ah , that is ok now
[03:25:43] it was only the read only making start phd failing
[03:26:36] (PS2) 20after4: Fix path to jenkins homedir for nodepool slaves [puppet] - https://gerrit.wikimedia.org/r/299029
[03:26:38] (PS1) 20after4: specify home directory for phd user [puppet] - https://gerrit.wikimedia.org/r/300468
[03:26:59] (PS2) 20after4: specify home directory for phd user [puppet] - https://gerrit.wikimedia.org/r/300468
[03:27:25] jynus: https://gerrit.wikimedia.org/r/#/c/300468/
[03:28:32] (CR) jenkins-bot: [V: -1] specify home directory for phd user [puppet] - https://gerrit.wikimedia.org/r/300468 (owner: 20after4)
[03:28:50] (PS1) Jcrespo: Set db1048 as the new phabricator master on config [puppet] - https://gerrit.wikimedia.org/r/300469 (https://phabricator.wikimedia.org/T138460)
[03:29:15] twentyafterfour, it probably has a syntax problem
[03:29:54] weird
[03:30:28] missing comma
[03:30:42] ah yes
[03:30:44] my bad
[03:30:44] but do we really want it to have a home?
[03:31:08] being a system user, shouldn't it be /bin/false ?
[03:31:23] why does it need a home?
[03:31:27] well, yes I think so. git causes ssh to be used which stores things in ~/.ssh/
[03:31:44] the logged error is about storing ssh keys
[03:31:50] ssh_known_hosts
[03:31:56] then maybe it shouldn't be a system user?
[03:32:14] I would wait for someone else to comment on that
[03:32:50] (CR) Jcrespo: [C: 2] Set db1048 as the new phabricator master on config [puppet] - https://gerrit.wikimedia.org/r/300469 (https://phabricator.wikimedia.org/T138460) (owner: Jcrespo)
[03:33:27] the thing is that /etc/passwd has a home directory configured it just is a directory that doesn't exist
[03:33:30] (/home/phd)
[03:33:44] and maybe it shouldn't be a system user, that part I'm not sure about
[03:34:11] but it shouldn't be /home/phd
[03:35:01] is that an ongoing issue?
[03:35:11] as in, what is that breaking?
[03:35:15] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.11 seconds
[03:35:15] RECOVERY - MariaDB Slave Lag: m3 on db2012 is OK: OK slave_sql_lag Replication lag: 0.65 seconds
[03:35:23] I only just discovered it. it's breaking phd pushes to gerrit, apparently
[03:35:30] I don't know how/when it changed
[03:35:52] it might not be really the cause of breakage, it could just be collateral error message in the logs
[03:36:06] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.41 seconds
[03:36:26] so it's not really anything you should worry about, you said it's late already so you don't need to deal with it :)
[03:36:35] I can worry about it
[03:36:47] ACKNOWLEDGEMENT - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo db1043 is down for reimage
[03:37:19] I would like to ask for someone else's opinion before changing it
[03:37:29] jynus: thanks for reviewing, merge not necessary right now
[03:37:38] we can wait for a second opinion
[03:37:39] regarding user policies
[03:38:10] only a few git repositories are replicating via phab -> gerrit so nothing is critically broken
[03:38:32] ok, that is why I asked, if it could wait a few hours
[03:39:25] yeah that's fine
[03:45:07] ACKNOWLEDGEMENT - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1043.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1043.eqiad.wmnet (111 Connection refused) Jcrespo because dbstore1001 is a delayed slave, it requires a day for proper failover
[03:53:15] Operations: this is a test ticket - ignore - https://phabricator.wikimedia.org/T141075#2486119 (jcrespo)
[03:53:27] Operations: this is a test ticket - ignore - https://phabricator.wikimedia.org/T141075#2486131 (jcrespo) Open>Invalid
[03:54:42] Operations, DBA, Phabricator, Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2486132 (jcrespo) db1048 is now the new m3 master, and it is being used through the proxy dbproxy1003.
[04:03:56] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24
[04:05:37] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[04:08:06] !log backing up, shutting down and reimage db1043
[04:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:12:27] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 238 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[04:14:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 28 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:14:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 25 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[04:18:06] (PS1) MaxSem: Labs: remove wmgUseGuidedTour - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300472
[04:18:08] (PS1) MaxSem: Labs: remove wmgUseWPB - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300473
[04:18:10] (PS1) MaxSem: Labs: remove wgThumbnailMinimumBucketDistance - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300474
[04:18:12] (PS1) MaxSem: Labs: remove wgThumbnailBuckets - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300475
[04:18:14] (PS1) MaxSem: Labs: remove wgUseBetaFeatures - it's wmg actually [mediawiki-config] - https://gerrit.wikimedia.org/r/300476
[04:18:16] (PS1) MaxSem: Labs: remove wmgUseMultimediaViewer - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300477
[04:18:18] (PS1) MaxSem: Labs: remove wmgUseImageMetrics - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300478
[04:18:20] (PS1) MaxSem: Labs: remove wmgUseRestbaseVRS - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300479
[04:18:22] (PS1) MaxSem: Labs: remove wmgVisualEditorAccessRESTbaseDirectly [mediawiki-config] - https://gerrit.wikimedia.org/r/300480
[04:18:24] (PS1) MaxSem: Labs: remove wmgUseNavigationTiming - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300481
[04:18:26] (PS1) MaxSem: Labs: RevisionSlider is already loaded in prod, remove [mediawiki-config] - https://gerrit.wikimedia.org/r/300482
[04:18:27] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 16 probes of 238 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[04:18:28] (PS1) MaxSem: Labs: Kartographer is already loaded in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300483
[04:18:30] (PS1) MaxSem: Labs: don't load Interwiki - duplicates prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300484
[04:18:32] (PS1) MaxSem: Labs: don't load MultimediaViewer - already done in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300485
[04:18:34] (PS1) MaxSem: Labs: remove commented out OnlineStatusBar [mediawiki-config] - https://gerrit.wikimedia.org/r/300486 (https://phabricator.wikimedia.org/T34128)
[04:19:23] just recalled that I haven't spammed for several hours, had to fix that
[04:24:03] thanks for clearing those up MaxSem
[04:25:24] (PS6) Alex Monk: [WIP] Puppetise script to manage labs floating IP PTR records [puppet] - https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521)
[04:26:27] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:26:51] I think you've taken off over a hundred lines of InitialiseSettings-labs.php so far
[04:26:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 16 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[04:28:22] (CR) Alex Monk: "Example output (from an older version with less sanity checking of user data) in https://phabricator.wikimedia.org/P3544" [puppet] - https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: Alex Monk)
[04:28:48] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: puppet fail
[04:34:33] (CR) jenkins-bot: [V: -1] [WIP] Puppetise script to manage labs floating IP PTR records [puppet] - https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: Alex Monk)
[04:38:30] (PS1) Jcrespo: Reconfigure m3 servers to use modern mysql configuration [puppet] - https://gerrit.wikimedia.org/r/300487 (https://phabricator.wikimedia.org/T138460)
[04:39:45] (CR) jenkins-bot: [V: -1] Reconfigure m3 servers to use modern mysql configuration [puppet] - https://gerrit.wikimedia.org/r/300487 (https://phabricator.wikimedia.org/T138460) (owner: Jcrespo)
[04:41:01] (PS2) Jcrespo: Reconfigure m3 servers to use modern mysql configuration [puppet] - https://gerrit.wikimedia.org/r/300487 (https://phabricator.wikimedia.org/T138460)
[04:42:22] (CR) Jcrespo: [C: 2] Reconfigure m3 servers to use modern mysql configuration [puppet] - https://gerrit.wikimedia.org/r/300487 (https://phabricator.wikimedia.org/T138460) (owner: Jcrespo)
[04:53:27] (PS2) Dzahn: planet: "RelEng" (jargon)-> "WMF Release Engineering" [puppet] - https://gerrit.wikimedia.org/r/300172
[04:53:37] (CR) Dzahn: [C: 2] planet: "RelEng" (jargon)-> "WMF Release Engineering" [puppet] - https://gerrit.wikimedia.org/r/300172 (owner: Dzahn)
[04:56:57] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[04:57:57] (CR) Dzahn: "suggesting to uncouple this one from the switch on Sunday, and maybe move it to -post step" [puppet] - https://gerrit.wikimedia.org/r/300304 (owner: Chad)
[05:00:02] (CR) Dzahn: "needs to be added to private repo, and fake key to labs/private" [puppet] - https://gerrit.wikimedia.org/r/300279 (owner: Chad)
[05:01:57] (CR) Dzahn: "is a goal that it is the same home dir on master and slave? if yea, which one would have to be adjusted" [puppet] - https://gerrit.wikimedia.org/r/299029 (owner: 20after4)
[05:09:54] (CR) 20after4: "it only needs to be on the slave, arcanist is not used on the master." [puppet] - https://gerrit.wikimedia.org/r/299029 (owner: 20after4)
[05:12:54] (PS3) 20after4: specify home directory for phd user [puppet] - https://gerrit.wikimedia.org/r/300468
[05:13:27] (CR) Dzahn: "ok, i was just wondering in general about the jenkins user's home directory because you point out how it's different" [puppet] - https://gerrit.wikimedia.org/r/299029 (owner: 20after4)
[05:16:22] (PS4) 20after4: Specify home directory for phd user [puppet] - https://gerrit.wikimedia.org/r/300468
[05:16:56] (PS1) Dzahn: chromium: Ubuntu and Debian compatibility (WIP) [puppet] - https://gerrit.wikimedia.org/r/300491
[05:29:40] (PS2) Dzahn: chromium: Ubuntu and Debian compatibility (WIP) [puppet] - https://gerrit.wikimedia.org/r/300491
[05:34:56] Operations, Ops-Access-Requests, Editing-Analysis, Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2486160 (Dzahn) @Neil_P._Quinn_WMF yep, we will get this done and add the groups asap. thanks for the patience.
[05:37:30] (PS1) 20after4: Configure phabricator database cluster settings [puppet] - https://gerrit.wikimedia.org/r/300494
[05:43:25] (CR) 20after4: "http://puppet-compiler.wmflabs.org/3436/iridium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/300494 (owner: 20after4)
[05:43:26] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:44:07] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0
[05:52:42] Operations, Puppet, Labs, Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2486166 (mmodell) FWIW I am pretty sure arcanist works with php 5.3. In fact, it works with 5.2: From [[ https://secure.phabricator.com/book/phabricator/art...
[05:55:59] (CR) Gergő Tisza: [C: 1] Labs: don't load MultimediaViewer - already done in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300485 (owner: MaxSem)
[05:56:02] (CR) Gergő Tisza: [C: 1] Labs: remove wmgUseMultimediaViewer - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300477 (owner: MaxSem)
[05:56:18] Operations, DBA, Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2486171 (jcrespo)
[05:56:23] Operations, DBA, Phabricator, Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2486169 (jcrespo) Open>Resolved m3 servers are all now on jessie/10, and we are no longer in degraded/reduced redundancy mode. I've left a copy of the old maste...
[05:59:29] Operations, Patch-For-Review, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936521 (jcrespo) Total count: 17, none of them mariadbs (technically, there is labsdb1005/6/7, but those are mostly postgres machines).
[05:59:55] Operations, DBA, Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2486176 (jcrespo) T123525#2486173
[06:00:02] Operations, Patch-For-Review, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936540 (jcrespo)
[06:00:05] Operations, DBA, Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2486178 (jcrespo) Open>Resolved
[06:00:30] Operations, Patch-For-Review, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1936556 (jcrespo)
[06:04:38] Operations, Beta-Cluster-Infrastructure, HHVM: HHVM emits logs filling /var/log/upstart/hhvm.log and /var/log/syslog/ filling disk - https://phabricator.wikimedia.org/T71976#2486182 (Joe) Just FTR, this is solved and the title of the bug is misleading. Resolving.
[06:05:29] Operations, Beta-Cluster-Infrastructure, HHVM: HHVM emits logs filling /var/log/upstart/hhvm.log and /var/log/syslog/ filling disk - https://phabricator.wikimedia.org/T71976#2486183 (Joe) Open>Invalid
[06:13:10] Operations, DBA, Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2486189 (jcrespo)
[06:13:16] Operations, DBA, Phabricator, Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2486187 (jcrespo) Resolved>Open I forgot we need to reenable slave jobs.
[06:14:54] (03PS2) 1020after4: Configure phabricator database cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/300494 (https://phabricator.wikimedia.org/T112776) [06:15:02] (03PS1) 10Jcrespo: Set db1043 as the new slave of m3 [dns] - 10https://gerrit.wikimedia.org/r/300497 [06:15:28] (03PS2) 10Jcrespo: Set db1043 as the new slave of m3 [dns] - 10https://gerrit.wikimedia.org/r/300497 (https://phabricator.wikimedia.org/T138460) [06:19:02] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2486192 (10mmodell) [06:31:14] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:34] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: puppet fail [06:31:53] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:42] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:01] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:01] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:11] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:20] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:34] (03PS1) 10Giuseppe Lavagetto: Revert "flip m2-master to db1020" [dns] - 10https://gerrit.wikimedia.org/r/300501 [06:35:43] <_joe_> jynus: ^^ [06:35:48] <_joe_> tell me when it's ok 
[06:37:07] (03PS1) 10Jcrespo: Revert m2-master back to the proxy (dbproxy1002) [dns] - 10https://gerrit.wikimedia.org/r/300502 (https://phabricator.wikimedia.org/T140983) [06:37:17] <_joe_> eheh [06:37:44] (03CR) 10Jcrespo: [C: 032] Set db1043 as the new slave of m3 [dns] - 10https://gerrit.wikimedia.org/r/300497 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [06:38:16] (03CR) 10Jcrespo: [V: 032] Set db1043 as the new slave of m3 [dns] - 10https://gerrit.wikimedia.org/r/300497 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [06:38:53] (03CR) 10Jcrespo: [C: 032] Revert m2-master back to the proxy (dbproxy1002) [dns] - 10https://gerrit.wikimedia.org/r/300502 (https://phabricator.wikimedia.org/T140983) (owner: 10Jcrespo) [06:40:47] oh [06:40:49] sorry [06:40:53] I didn't see you [06:41:12] _joe_, it is the same change, right? [06:41:26] (03PS2) 10Jcrespo: Revert "flip m2-master to db1020" [dns] - 10https://gerrit.wikimedia.org/r/300501 (owner: 10Giuseppe Lavagetto) [06:41:46] yes it is, by rebasing, so technically you reviewed my change [06:42:14] <_joe_> yes, don't worry [06:42:23] <_joe_> it's not exactly a change I invested my time in :P [06:42:31] (03Abandoned) 10Jcrespo: Revert "flip m2-master to db1020" [dns] - 10https://gerrit.wikimedia.org/r/300501 (owner: 10Giuseppe Lavagetto) [06:43:29] !log updating dns records: m3-slave to db1043; m2-master to dbproxy1002 [06:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:44:18] checking m2-master db [06:45:17] it could take some time for changes to actually apply- but we do not care as proxy and server point to the same place [06:45:26] <_joe_> yes [06:45:34] <_joe_> I am checking ytterbium [06:45:47] as long as there is no access denied or anything strange [06:46:37] _joe_, check that it doesn't have a hardcoded record on /etc/hosts, like iridium had :-/ [06:47:18] maybe we should have puppet clearing that file to prevent temptations by roots? 
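Editorial aside on the /etc/hosts check mentioned above: a minimal sketch, not anything that was run in the channel, of how one might spot a service name that has been pinned in a hosts file and would therefore ignore a DNS change. The function name, the sample hostname, and the sample address are all hypothetical and chosen only for illustration.

```python
def find_pinned_hosts(hosts_text, watched_names):
    """Return {hostname: [addresses]} for watched names found in hosts-file text.

    hosts_text is the content of a hosts file (for example, read from /etc/hosts).
    watched_names are the DNS names whose records are about to change.
    """
    watched = {name.lower() for name in watched_names}
    pinned = {}
    for raw_line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        for name in names:
            if name.lower() in watched:
                pinned.setdefault(name.lower(), []).append(address)
    return pinned


if __name__ == "__main__":
    # Made-up sample content; the hostname and address are assumptions.
    sample = (
        "127.0.0.1 localhost\n"
        "10.0.0.5 m2-master.example.internal  # leftover from maintenance\n"
    )
    print(find_pinned_hosts(sample, ["m2-master.example.internal"]))
```

An empty result means none of the watched names is overridden locally, so the host should follow the updated DNS record once caches expire.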
[06:48:41] otrs and gerrit both look good [06:48:52] <_joe_> gerrit is connecting to dbproxy now [06:49:16] <_joe_> jynus: regarding /etc/hosts, it can be useful from time to time [06:49:27] <_joe_> e.g. for puppet testing, I'm using it on rhodium [06:49:34] or it can create an outage :-) [06:49:46] <_joe_> well, of course if you leave leftovers [06:49:51] <_joe_> I won't [06:50:28] there was not an outage, but maintenance on phab went for longer than needed [06:50:39] <_joe_> :/ [06:52:01] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Puppet has 1 failures [06:52:38] So I think I will set up dbproxy1006-10 to be backups of the first 5 (starting manually) [06:53:13] and then I will think about the best way to bring proper HA to them [06:53:50] <_joe_> jynus: I guess the best way to go can be RR dns and keep the data on those consistent in some way [06:54:04] <_joe_> s/best/simpler/ [06:54:12] ugh [06:54:14] ugly [06:55:00] I will think about it [06:55:10] <_joe_> not really ugly [06:55:19] I think redundancy is certainly a priority [06:55:37] I will need it anyway to failover dbproxy1001 [06:55:47] <_joe_> where do you write the config state of haproxy? [06:55:49] <_joe_> puppet? [06:55:52] yes [06:55:57] state? 
[06:56:00] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:56:02] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:56:09] I write the re the initial config [06:56:32] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:42] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:56:46] *there [06:56:51] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:57:03] maintaining them in sync is not a problem [06:57:11] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:16] it is, but it is not my concern [06:57:21] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:30] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:31] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:57:41] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:51] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:52] I really need to think, because I do not even know what I need [06:58:15] so I will gather that first before taking any decision [06:58:32] but reimaging servers and having them ready is free for now [06:58:38] <_joe_> yeah :) [06:58:40] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:43] in reality I may want separate proxy services per service (maybe 
on its own dns, like the large load balancer) [06:59:51] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Very, very good work; I have one doubt though:" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/299000 (https://phabricator.wikimedia.org/T137878) (owner: 10Mobrovac) [07:00:40] I do not know, I will create a list of needs and ask you for advice about options, when I have my mind clear [07:04:03] <_joe_> jynus: ok :) [07:12:17] (03CR) 10Giuseppe Lavagetto: [C: 031] Parsoid: clean up the manifests and files [puppet] - 10https://gerrit.wikimedia.org/r/300067 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [07:18:14] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:18:21] (03PS1) 10Jcrespo: Add dbproxy1006-10 to production as redundant instances of 1-5 [puppet] - 10https://gerrit.wikimedia.org/r/300505 (https://phabricator.wikimedia.org/T140983) [07:19:20] (03CR) 10Jcrespo: [C: 04-2] "Need reimaging first." [puppet] - 10https://gerrit.wikimedia.org/r/300505 (https://phabricator.wikimedia.org/T140983) (owner: 10Jcrespo) [07:33:16] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [07:34:55] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [07:43:41] (03PS1) 10Jcrespo: Reenable jobs running on the phabricator db slave [puppet] - 10https://gerrit.wikimedia.org/r/300506 (https://phabricator.wikimedia.org/T138460) [07:49:13] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2486271 (10jcrespo) @mmodell @demon @dzahn I've lost track after so many enables and disables of jobs. 
I think the only pending thing to enable is: https://gerrit.wikimedi... [07:59:41] godog: thanks for the puppet swat merges yesterday. Could not really attend it :( [08:16:32] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/3437/" [puppet] - 10https://gerrit.wikimedia.org/r/300505 (https://phabricator.wikimedia.org/T140983) (owner: 10Jcrespo) [08:18:14] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2486307 (10Gehel) p:05Triage>03Normal [08:19:46] !log reimage dbproxy1008 [08:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:24:00] 06Operations, 10ops-eqiad: db1011 disk failure (degraded RAID) - https://phabricator.wikimedia.org/T141046#2486309 (10Gehel) p:05Triage>03High Seems like we need to fix this rather sooner than later: setting priority as high. This probably needs @RobH or @Cmjohnson to look at what we have in spares. [08:24:36] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: Puppet has 1 failures [08:32:43] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2486318 (10Pavanaja) >>! In T140898#2485978, @Reedy wrote: >>>! In T140898#2485974, @Liuxinyu970226 wrote: >> @Pavanaja what about Module and Module tal... [08:36:24] (03CR) 10Paladox: "Yeh, probably, since finding out the old gerrit does this, this may disrupt some people who do it that way if we disable it." 
[puppet] - 10https://gerrit.wikimedia.org/r/300304 (owner: 10Chad) [08:36:31] RECOVERY - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is OK: TCP OK - 0.045 second response time on port 9042 [08:39:34] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2486323 (10fgiunchedi) [08:40:15] 06Operations, 10Analytics-Cluster: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486325 (10Gehel) p:05Triage>03High It seems that there are local commits on stat1002: ``` stats@stat1002:/a/refinery-source$ git status On branch master Your branch and 'ori... [08:48:40] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [08:49:32] 06Operations, 10Icinga: implement icinga paging for non-ops teams - https://phabricator.wikimedia.org/T141038#2486344 (10Gehel) p:05Triage>03Normal This is complex enough that it does require some time and thinking. I am pretty sure other teams besides services would be interested in being paged (I can thi... [08:54:18] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [08:55:37] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [09:06:27] 06Operations, 10Analytics-Cluster: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486357 (10elukey) We have a jenkins job (https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Refinery-source) that releases new refinery source jars to Archiva so I am not sure... 
[09:06:46] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486359 (10elukey) [09:07:23] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2486360 (10WMDE-leszek) FWIW I can confirm that @Jonas is who he says he is. [09:20:20] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2486374 (10Gehel) @Jonas has not yet signed NDA. He needs access to access logs of query.wikidata.org. We probably can't give access to that without NDA, so l... [09:20:34] (03PS1) 10Jcrespo: New m3 database grants for dbproxy1008 [puppet] - 10https://gerrit.wikimedia.org/r/300512 (https://phabricator.wikimedia.org/T140983) [09:33:24] (03CR) 10Jcrespo: [C: 032] New m3 database grants for dbproxy1008 [puppet] - 10https://gerrit.wikimedia.org/r/300512 (https://phabricator.wikimedia.org/T140983) (owner: 10Jcrespo) [09:36:33] !log applying new m3 db grants [09:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:45:38] !log reimage dbproxy1006 [09:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:56:55] (03PS2) 10Giuseppe Lavagetto: Change-Prop: Revert the revert - ignore bots on ORES [puppet] - 10https://gerrit.wikimedia.org/r/300450 (owner: 10Ppchelko) [09:57:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Change-Prop: Revert the revert - ignore bots on ORES [puppet] - 10https://gerrit.wikimedia.org/r/300450 (owner: 10Ppchelko) [09:57:51] 06Operations, 10MediaWiki-General-or-Unknown: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879#2486408 (10abian) I've tested visiting today's URL without a curid parameter... 
https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_... [09:58:43] (03PS2) 10Giuseppe Lavagetto: Change-Prop: Definition rerender bug - don't react to revision change [puppet] - 10https://gerrit.wikimedia.org/r/300442 (owner: 10Ppchelko) [10:00:04] (03CR) 10Giuseppe Lavagetto: [C: 032] Change-Prop: Definition rerender bug - don't react to revision change [puppet] - 10https://gerrit.wikimedia.org/r/300442 (owner: 10Ppchelko) [10:01:26] (03CR) 10Mobrovac: [C: 031] restbase: have systemd restart failed nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) (owner: 10Giuseppe Lavagetto) [10:06:56] (03PS2) 10Giuseppe Lavagetto: restbase: have systemd restart failed nodes [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) [10:07:26] (03CR) 10Giuseppe Lavagetto: restbase: have systemd restart failed nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) (owner: 10Giuseppe Lavagetto) [10:09:03] 06Operations, 06Discovery, 10netops, 03Discovery-Search-Sprint: deploy elasticsearc/plugins to relforge1001-1002 servers - https://phabricator.wikimedia.org/T141085#2486473 (10Gehel) [10:15:43] (03CR) 10Filippo Giunchedi: "I still think that until restbase has unconditional schema migration at startup this is going to bite us in the future, feel free to merge" [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) (owner: 10Giuseppe Lavagetto) [10:20:59] (03CR) 10Giuseppe Lavagetto: "@Filippo I am not sure how Restart=always will create a problem there. Can't we just mask the resource during the migration?" 
[puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) (owner: 10Giuseppe Lavagetto) [10:23:03] (03Abandoned) 10Dereckson: Allow maintenance scripts to work on wikitech without private settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299996 (https://phabricator.wikimedia.org/T140889) (owner: 10Dereckson) [10:23:43] !log Jenkins has some random deadlock. Will probably reboot it [10:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:26:16] (03PS1) 10Jcrespo: Extra grants on m1 databases for dbproxy1006 [puppet] - 10https://gerrit.wikimedia.org/r/300517 (https://phabricator.wikimedia.org/T140983) [10:27:08] (03PS2) 10Jcrespo: Extra grants on m1 databases for dbproxy1006 [puppet] - 10https://gerrit.wikimedia.org/r/300517 (https://phabricator.wikimedia.org/T140983) [10:27:20] !log Restarting Jenkins entirely (deadlocked) [10:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:32:28] (03CR) 10Jcrespo: [C: 032 V: 032] Extra grants on m1 databases for dbproxy1006 [puppet] - 10https://gerrit.wikimedia.org/r/300517 (https://phabricator.wikimedia.org/T140983) (owner: 10Jcrespo) [10:35:28] (03CR) 10Filippo Giunchedi: [C: 031] "yeah in that case masking would work, what I'm saying is that the reason why this was disabled in the first place is still there, anyways " [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) (owner: 10Giuseppe Lavagetto) [10:36:09] !log applying new m1 db grants [10:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:54] (03CR) 10Mobrovac: "> Is a file owned by graphoid:graphoid going to be readable by graphoid-admins?" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/299000 (https://phabricator.wikimedia.org/T137878) (owner: 10Mobrovac) [10:41:12] (03PS8) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [10:41:41] (03CR) 10Ema: [C: 032 V: 032] cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [10:47:31] !log reimage dbproxy1007 T140983 [10:47:32] T140983: dbproxy1002 down - https://phabricator.wikimedia.org/T140983 [10:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:49:11] (03PS3) 10Mobrovac: service::node: Output std out/err to a file [puppet] - 10https://gerrit.wikimedia.org/r/299000 (https://phabricator.wikimedia.org/T137878) [10:49:15] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: puppet fail [10:49:26] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2486548 (10Joe) Findings from this morning: # Our admin classes confuse the catalog diff tool; I monkey-patched th... [10:49:49] <_joe_> I am going to break site.pp on rhodium [10:49:55] (03CR) 10Mobrovac: service::node: Output std out/err to a file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/299000 (https://phabricator.wikimedia.org/T137878) (owner: 10Mobrovac) [10:50:04] <_joe_> so if anyone has to sync that, please ring me [10:51:16] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2486553 (10elukey) We'd need to fix some pwstore issues so this could be a good occasion to add Madhu's key. Best to wait a bit for Moritz in my opinion! 
[10:53:59] <_joe_> my unsuccessful hack doesn't work [10:54:04] (03CR) 10Mobrovac: "PCC run - https://puppet-compiler.wmflabs.org/3438/" [puppet] - 10https://gerrit.wikimedia.org/r/299000 (https://phabricator.wikimedia.org/T137878) (owner: 10Mobrovac) [10:56:01] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2486555 (10Joe) For the record, I tried to explicitly set `source_permissions => use` but to no avail, basically. I... [10:56:44] (03PS1) 10Jcrespo: Add extra grants on s2 for dbproxy1007 [puppet] - 10https://gerrit.wikimedia.org/r/300520 (https://phabricator.wikimedia.org/T140983) [10:57:24] (03CR) 10Jcrespo: [C: 032] Add extra grants on s2 for dbproxy1007 [puppet] - 10https://gerrit.wikimedia.org/r/300520 (https://phabricator.wikimedia.org/T140983) (owner: 10Jcrespo) [10:57:49] (03PS2) 10Elukey: admin: add addshore to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/299522 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [10:59:54] (03CR) 10Jcrespo: [V: 032] Add extra grants on s2 for dbproxy1007 [puppet] - 10https://gerrit.wikimedia.org/r/300520 (https://phabricator.wikimedia.org/T140983) (owner: 10Jcrespo) [11:04:27] !log applying new m2 db grants [11:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:12:12] !log reimage dbproxy1009 T140983 [11:12:13] T140983: dbproxy1002 down - https://phabricator.wikimedia.org/T140983 [11:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:17:35] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:05] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[11:21:38] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484278 (10elukey) So we could create an eventbus-admins group allowed to run systemctl + journalctl commands for eventbus. This needs to go t... [11:21:54] (03PS3) 10Elukey: admin: add addshore to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/299522 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [11:23:36] 06Operations, 10Ops-Access-Requests, 06Services, 03Scap3: Allow Pchelolo to deploy services via Scap3 - https://phabricator.wikimedia.org/T141086#2486581 (10mobrovac) [11:24:05] 06Operations, 10Ops-Access-Requests, 06Services, 03Scap3: Allow Pchelolo to deploy services via Scap3 - https://phabricator.wikimedia.org/T141086#2486593 (10mobrovac) @GWicke please approve. [11:24:26] 06Operations, 10Ops-Access-Requests, 06Services, 03Scap3: Allow Pchelolo to deploy services via Scap3 - https://phabricator.wikimedia.org/T141086#2486594 (10elukey) p:05Triage>03Normal [11:24:33] (03PS1) 10Jcrespo: Add grants on m4 databases for dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/300521 (https://phabricator.wikimedia.org/T140983) [11:25:23] (03CR) 10Elukey: [C: 032] "All the approvals collected in the ticket and the change was already discussed during an ops meeting, safe to merge in my opinion." [puppet] - 10https://gerrit.wikimedia.org/r/299522 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [11:27:14] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. 
[11:31:13] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/294371 (owner: 10BBlack) [11:31:50] 06Operations, 10Ops-Access-Requests, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2486623 (10elukey) 05Open>03Resolved a:03elukey Last code review merged, both ne... [11:32:28] (03PS2) 10Jcrespo: Add grants on m4 databases for dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/300521 (https://phabricator.wikimedia.org/T140983) [11:37:01] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: puppet fail [11:37:09] (03PS1) 10Mobrovac: Add ppchelko to the deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/300523 (https://phabricator.wikimedia.org/T141086) [11:37:27] (03CR) 10Jcrespo: [C: 032] Add grants on m4 databases for dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/300521 (https://phabricator.wikimedia.org/T140983) (owner: 10Jcrespo) [11:37:30] 06Operations, 10Ops-Access-Requests, 06Services, 13Patch-For-Review, 03Scap3: Allow Pchelolo to deploy services via Scap3 - https://phabricator.wikimedia.org/T141086#2486636 (10mobrovac) [11:38:50] (03PS1) 10Hashar: (DO NOT MERGE) testing CI run [software/conftool] - 10https://gerrit.wikimedia.org/r/300524 [11:39:45] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: put pfw1- ge-2/0/11 in the 'fundraising' vlan for new host frqueue1001 - https://phabricator.wikimedia.org/T140991#2486639 (10Gehel) p:05Triage>03High [11:41:41] (03Abandoned) 10Hashar: (DO NOT MERGE) testing CI run [software/conftool] - 10https://gerrit.wikimedia.org/r/300524 (owner: 10Hashar) [11:42:11] (03PS1) 10Elukey: Add user hjiang to analytics/research related groups. 
[puppet] - 10https://gerrit.wikimedia.org/r/300526 (https://phabricator.wikimedia.org/T140659) [11:42:25] 06Operations: operations/software/conftool fails tox-py27-jessie - https://phabricator.wikimedia.org/T112853#2486654 (10hashar) 05Open>03Resolved a:03hashar The original issue with pbr/mock is definitely fixed. I have sent a dummy patch and it passed just fine https://gerrit.wikimedia.org/r/#/c/300524/ [11:44:30] (03CR) 10Elukey: [C: 04-1] "LGTM, but let's wait a formal ops meeting approval." [puppet] - 10https://gerrit.wikimedia.org/r/300523 (https://phabricator.wikimedia.org/T141086) (owner: 10Mobrovac) [11:44:53] (03CR) 10Gehel: [C: 031] "LGTM. Related phab ticket has been opened for more than 3 days, my understanding is that we should merge this." [puppet] - 10https://gerrit.wikimedia.org/r/300526 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [11:49:31] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2486680 (10elukey) Update about https://phabricator.wikimedia.org/T139353 - all the old appservers/apiservers have not been serving traffic from at least a... 
[12:03:27] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:36:46] (03PS1) 10Jcrespo: Add extra grants needed on m5 for the dbproxy1010 [puppet] - 10https://gerrit.wikimedia.org/r/300529 (https://phabricator.wikimedia.org/T140983) [12:39:10] (03CR) 10Jcrespo: [C: 032 V: 032] Add extra grants needed on m5 for the dbproxy1010 [puppet] - 10https://gerrit.wikimedia.org/r/300529 (https://phabricator.wikimedia.org/T140983) (owner: 10Jcrespo) [12:42:40] !log applying new m5 db grants [12:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:04] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2486836 (10Ottomata) Hm, we should also probably add this new group to the scap keyholder_agents trusted_groups eventlogging. This will allow... [12:53:23] PROBLEM - Host mw1099 is DOWN: PING CRITICAL - Packet loss = 100% [12:55:00] <_joe_> uh, mw1099? [12:55:03] <_joe_> cmjohnson1: ping? [12:55:33] _joe_ that was me...1099 is in the middle of a group of decom servers...just powered it back up [12:55:35] sorry [12:55:46] didn't realize that 1 host was still in use [12:55:58] <_joe_> cmjohnson1: not an issue, I kind of expected it was you [12:56:08] <_joe_> cmjohnson1: mw1017-1025 and mw1099 [12:56:19] <_joe_> but mw1018-25 will go away next week [12:56:25] okay [12:57:02] RECOVERY - Host mw1099 is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [13:00:51] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2486858 (10Gehel) Looking at [[ https://wikitech.wikimedia.org/wiki/Volunteer_NDA | the documentation ]], this require approval from a WMF manager. @K4-713 as... 
[13:01:12] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [13:01:59] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486860 (10Ottomata) 05Open>03Resolved a:03Ottomata refinery-source is cloned by `role::analytics_cluster::refinery::source`, and mainly exists just to... [13:03:52] 06Operations, 10Analytics-Cluster, 10EventBus, 06Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#2486867 (10Ottomata) [13:03:56] 06Operations, 10EventBus, 06Services, 10hardware-requests: 4 more Kafka brokers, 2 in eqiad and 2 codfw - https://phabricator.wikimedia.org/T124469#2486865 (10Ottomata) 05Open>03declined No need to do this for now, as long as we have decided to go with datacenter prefixed topic names. Declining. [13:06:27] (03PS2) 10Rush: toollabs: collect stats on grid usage by job [puppet] - 10https://gerrit.wikimedia.org/r/300534 (https://phabricator.wikimedia.org/T140999) [13:07:32] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2486869 (10Gehel) Documentation says: > (Have someone with access double-check which mediawiki.org account that the manager's Phabricator account is linked t... [13:09:38] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2486871 (10Joe) So, good news: the file permissions for the git checkout on rhodium are different than those on str... 
[13:10:07] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486877 (10Gehel) As far as I can see, someone ran `mvn release:prepare && mvn release::perform` which does a bit more than `mvn package`. The release will cr... [13:11:19] (03PS2) 10Luke081515: gerrit: up heap size limit from 20GB to 28GB [puppet] - 10https://gerrit.wikimedia.org/r/300446 (https://phabricator.wikimedia.org/T141064) (owner: 10Dzahn) [13:14:02] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2486880 (10JanZerebecki) a:03JanZerebecki [13:14:28] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2486881 (10Ottomata) Ah! I bet Madhu did this when she was developing jenkins deployments. Not sure. [13:16:57] (03CR) 10Gehel: [C: 031] "Merging this will require restart of logstash100[1-3]" [puppet] - 10https://gerrit.wikimedia.org/r/300440 (https://phabricator.wikimedia.org/T141063) (owner: 10EBernhardson) [13:18:21] (03PS1) 10Giuseppe Lavagetto: puppetmaster: pass the group to git::clone [puppet] - 10https://gerrit.wikimedia.org/r/300537 (https://phabricator.wikimedia.org/T98173) [13:18:59] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: pass the group to git::clone [puppet] - 10https://gerrit.wikimedia.org/r/300537 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [13:19:01] 06Operations, 10ops-eqiad: db1011 disk failure (degraded RAID) - https://phabricator.wikimedia.org/T141046#2486886 (10Cmjohnson) Disk has been replaced and rebuilding nclosure Device ID: 32 Slot Number: 7 Drive's position: DiskGroup: 0, Span: 3, Arm: 1 Enclosure position: N/A Device Id: 7 WWN: 5000C5003240E73... 
[13:19:45] (03CR) 10Giuseppe Lavagetto: [V: 032] puppetmaster: pass the group to git::clone [puppet] - 10https://gerrit.wikimedia.org/r/300537 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [13:21:54] 06Operations, 10ops-eqiad, 10DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2486908 (10jcrespo) 05Open>03Resolved It finished, no media errors. [13:22:38] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2486911 (10Addshore) >>! In T140911#2486858, @Gehel wrote: > Looking at [[ https://wikitech.wikimedia.org/wiki/Volunteer_NDA | the documentation ]], this requ... [13:26:46] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2486924 (10Gehel) Then @debt (who now has @Deskana's role) could probably also approve this. Actually any of [ @K4-713, @Deskana, @debt ] should be sufficient... [13:37:31] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2487002 (10Joe) After fixing the puppet owner, the git clone has the correct owner/group; there is still a problem... [13:40:02] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: puppet fail [13:40:42] <_joe_> ^^ that's me, looking [13:40:57] 06Operations, 10ops-codfw, 10ops-eqiad: ship 7 ex4200s from codfw to eqiad - https://phabricator.wikimedia.org/T140655#2487009 (10Cmjohnson) Incoming ticket created for equinix [13:42:52] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-Addshore: MediaWiki deployment shell access request for addshore - https://phabricator.wikimedia.org/T140276#2458869 (10JanZerebecki) I think addshore personally is trustworthy for production access. >>! In T140276#2459199, @aude wrote: > It... 
[13:44:02] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:47:54] (03PS1) 10Hashar: zuul: explicitly define the Gerrit event delay [puppet] - 10https://gerrit.wikimedia.org/r/300545 [13:50:28] (03CR) 10Paladox: [C: 031] zuul: explicitly define the Gerrit event delay [puppet] - 10https://gerrit.wikimedia.org/r/300545 (owner: 10Hashar) [13:51:17] (03PS2) 10Hashar: zuul: explicitly define the Gerrit event delay [puppet] - 10https://gerrit.wikimedia.org/r/300545 [13:52:53] (03CR) 10Paladox: [C: 031] zuul: explicitly define the Gerrit event delay [puppet] - 10https://gerrit.wikimedia.org/r/300545 (owner: 10Hashar) [13:54:42] RECOVERY - MegaRAID on db1011 is OK: OK: optimal, 1 logical, 2 physical [14:00:29] 06Operations, 10ops-eqiad: db1011 disk failure (degraded RAID) - https://phabricator.wikimedia.org/T141046#2487042 (10jcrespo) 05Open>03Resolved a:03jcrespo ``` nclosure Device ID: 32 Slot Number: 7 Drive's position: DiskGroup: 0, Span: 3, Arm: 1 Enclosure position: N/A Device Id: 7 WWN: 5000C5003240E730... [14:08:29] (03CR) 10Andrew Bogott: [C: 031] toollabs: collect stats on grid usage by job [puppet] - 10https://gerrit.wikimedia.org/r/300534 (https://phabricator.wikimedia.org/T140999) (owner: 10Rush) [14:09:42] (03PS3) 10Giuseppe Lavagetto: restbase: have systemd restart failed nodes [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) [14:10:21] (03PS1) 10Eevans: Enable Cassandra instance restbase2004-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/300547 (https://phabricator.wikimedia.org/T134016) [14:11:54] urandom: let me know if you need any help with --^ [14:12:12] (03CR) 10Giuseppe Lavagetto: [C: 032] restbase: have systemd restart failed nodes [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) (owner: 10Giuseppe Lavagetto) [14:12:27] elukey: you can +2 if you want! 
[14:12:37] it's ready to roll [14:13:38] (03PS2) 10Elukey: Enable Cassandra instance restbase2004-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/300547 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [14:15:01] (03PS3) 10Hashar: zuul: explicitly define the Gerrit event delay [puppet] - 10https://gerrit.wikimedia.org/r/300545 [14:15:08] (03PS2) 10Gehel: Increase elasticsearch heap on logstash routing instances [puppet] - 10https://gerrit.wikimedia.org/r/300440 (https://phabricator.wikimedia.org/T141063) (owner: 10EBernhardson) [14:16:34] (03CR) 10Elukey: [C: 032] "Checked IP address and general config, looks good! Moreover I had a chat with urandom on IRC and the node is ready to go." [puppet] - 10https://gerrit.wikimedia.org/r/300547 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [14:16:45] (03CR) 10Gehel: [C: 032] Increase elasticsearch heap on logstash routing instances [puppet] - 10https://gerrit.wikimedia.org/r/300440 (https://phabricator.wikimedia.org/T141063) (owner: 10EBernhardson) [14:17:30] elukey: thank you sir! [14:17:44] (03PS3) 10Gehel: Increase elasticsearch heap on logstash routing instances [puppet] - 10https://gerrit.wikimedia.org/r/300440 (https://phabricator.wikimedia.org/T141063) (owner: 10EBernhardson) [14:17:50] urandom: all set, you are welcome :) [14:18:02] (03CR) 10Hashar: [C: 031] "Puppet compile https://puppet-compiler.wmflabs.org/3442/" [puppet] - 10https://gerrit.wikimedia.org/r/300545 (owner: 10Hashar) [14:19:03] hashar: I am reviewing https://phabricator.wikimedia.org/T140894 [14:19:09] gehel: elukey: if you are still at the clinic would you mind merging in a conf setting for zuul ? 
Going to be a noop on prod regardless https://gerrit.wikimedia.org/r/300545 [14:19:22] hashar: sure [14:19:27] !log T134016: Boostrapping restbase2004-c.codfw.wmnet [14:19:28] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [14:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:35] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2487128 (10JanZerebecki) > (Have someone with access double-check which mediawiki.org account that the manager's Phabricator account is linked to, where the S... [14:19:38] elukey: oh I have missed the reply on the task yesterday :( [14:20:18] gehel: the new setting that patch introduced is going to be needed for a later version of Zuul. The currently deployed one just skip/ignore it entirely :} [14:20:40] hashar: just a minute, I'm finishing a change on logstash... [14:20:44] sure :} [14:21:20] I can takeover [14:21:23] !log rolling restart of logstash100[1-3] - T141063 [14:21:25] T141063: Increase java heap on logstash1001-3 - https://phabricator.wikimedia.org/T141063 [14:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:35] hashar: looks good, I can merge if you are ready [14:21:37] (03PS2) 10Giuseppe Lavagetto: puppetmaster: temporarily allow rhodium to compile all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/300307 (https://phabricator.wikimedia.org/T98173) [14:21:42] elukey: ready! [14:21:48] I will do the puppet run on gallium [14:21:51] elukey: thanks! 
[14:22:24] (03PS4) 10Elukey: zuul: explicitly define the Gerrit event delay [puppet] - 10https://gerrit.wikimedia.org/r/300545 (owner: 10Hashar) [14:22:45] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: temporarily allow rhodium to compile all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/300307 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [14:22:54] <_joe_> rebase-sniping [14:23:15] hashar: for the zuul upgrade, we can maybe do it on monday? It would be super great if we could sanity check the package with a debdiff, but you have probably done all the homework [14:23:30] comeeeee oooonnn [14:23:32] elukey: yeah I will do it on monday [14:23:33] I just rebased! [14:23:35] :P [14:23:48] elukey: it is probably not a good idea to sneak upgrade it on a friday evening [14:23:51] (sorry hashar it wasn't for you but for Joe the sniper) [14:23:56] the deb diff is super annoying [14:24:04] that package embeds a bunch of python modules from pypi :/ [14:24:09] (03PS5) 10Elukey: zuul: explicitly define the Gerrit event delay [puppet] - 10https://gerrit.wikimedia.org/r/300545 (owner: 10Hashar) [14:24:50] (03PS1) 10Ottomata: Finish adding --until param to check_graphite script [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) [14:25:20] hashar: I was asking not because I don't trust you (I am not a good debian package maintainer) but just to ask if somebody else reviewed it to remove any build bug/inconsistency [14:26:00] (03CR) 10Elukey: [C: 032] "Change looks good, Hashar is ready to execute the puppet run!" 
[puppet] - 10https://gerrit.wikimedia.org/r/300545 (owner: 10Hashar) [14:26:38] hashar: all set, you are free to run puppet [14:29:37] (03PS2) 10Ottomata: Finish adding --until param to check_graphite script [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) [14:29:41] puppet agent --log "exploding gallium" [14:29:53] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2487147 (10BBlack) Re the `rest.wikimedia.org`: we effectively shut off service back in April ( https://... [14:30:01] elukey: na the package hasn't been reviewed. All the basic / original package got paired with Filipo at least :} [14:30:07] godog: would love a review and coordinated deploy (next week) of this one https://gerrit.wikimedia.org/r/300548 [14:30:20] it got left in the dust, and I was recently reminded that I needed to finish up that thing [14:30:36] elukey: zuul.conf is all up to date thank you! [14:30:52] super! [14:30:59] let's chat on Monday about the Zuul upgrade [14:31:41] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 10Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2487152 (10hashar) @elukey proposed to review the package and we had a quick discussion about it. Turns out upgradin... 
[14:31:54] elukey: I am going to upgrade it over night on sunday-monday due to Gerrit upgrade [14:32:12] 06Operations, 06Services, 10Traffic, 13Patch-For-Review: Declarative configuration for varnish services and backends - https://phabricator.wikimedia.org/T110717#2487166 (10BBlack) [14:32:28] 06Operations, 06Services, 10Traffic, 13Patch-For-Review: Declarative configuration for varnish services and backends - https://phabricator.wikimedia.org/T110717#1585084 (10BBlack) (re-titled because I could never find the old title when I looked for it!) [14:32:33] elukey: I mean, the server instance on gallium. The other task is about upgrading the merger instance on scandium, and that one I would love review / chat about it ;} [14:32:49] it is not using systemd yet :/ [14:33:53] yes the merger :D [14:34:00] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [14:34:28] ottomata: yup! will take a look [14:34:31] hashar: I meant https://phabricator.wikimedia.org/T140894, is there another one? [14:35:40] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [14:36:29] elukey: sorry. We are upgrading Gerrit this week-end and I will probably upgrade the Zuul package on gallium for the zuul server (I got root there) [14:36:46] elukey: T140894 is a package for Jessie to be pushed on scandium on which I do not have root :} [14:36:47] T140894: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894 [14:36:55] 06Operations, 07Puppet, 07Need-volunteer: MaxClients on puppetmaster - https://phabricator.wikimedia.org/T97466#1243042 (10elukey) This task is very old and it will probably be fixed with Joe and Alex's current work to upgrade puppet and migrate palladium to Jessie (we'll get Apache 2.4 in the process). MaxC... 
[14:37:21] elukey: and if you get spare time to talk about the packaging work there and have ideas about what should be improved for Jessie, I am all for it :} [14:38:21] PROBLEM - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is CRITICAL: Connection refused [14:38:24] (03PS7) 10Alex Monk: [WIP] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [14:38:30] thar she blows [14:38:35] got it btw ^^^ [14:39:02] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-24 14:38:49. [14:40:34] 06Operations, 10Trebuchet: git fat/git deploy doesn't always unstub files [Trebuchet] - https://phabricator.wikimedia.org/T98962#2487220 (10elukey) I suspect that this task will get even more dust with the migration to Scap3. @greg is there any value in keeping this open? [14:46:52] (03CR) 10Filippo Giunchedi: "LGTM, just a couple of comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [14:49:19] hashar: I am not a good packager, we might ask ema/godog to assist and give us some feedback :) [14:50:55] sure, feel free to add me to the code review [14:52:39] elukey: godog will poke you on monday :)} [14:53:18] ok! [14:53:55] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2487261 (10Deskana) Approved. 
[14:54:47] (03PS1) 10Faidon Liambotis: mirror1001 -> sodium [dns] - 10https://gerrit.wikimedia.org/r/300551 [14:55:21] (03CR) 10Faidon Liambotis: [C: 032] mirror1001 -> sodium [dns] - 10https://gerrit.wikimedia.org/r/300551 (owner: 10Faidon Liambotis) [14:56:02] 06Operations, 10ops-eqiad: re-label mirror1001 to sodium - https://phabricator.wikimedia.org/T141105#2487266 (10Cmjohnson) [14:57:24] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2487284 (10mobrovac) Hm, I can see 40k log entries for this domain in the logs for requests coming from... [14:59:47] (03PS1) 10Andrew Bogott: Rearrange arguments for pool_target options [puppet] - 10https://gerrit.wikimedia.org/r/300554 [15:00:53] (03PS1) 10Faidon Liambotis: mirror1001 -> sodium [puppet] - 10https://gerrit.wikimedia.org/r/300555 [15:01:46] (03PS2) 10Faidon Liambotis: mirror1001 -> sodium [puppet] - 10https://gerrit.wikimedia.org/r/300555 [15:02:26] (03CR) 10Faidon Liambotis: [C: 032] mirror1001 -> sodium [puppet] - 10https://gerrit.wikimedia.org/r/300555 (owner: 10Faidon Liambotis) [15:08:04] (03PS2) 10Andrew Bogott: Rearrange arguments for pool_target options [puppet] - 10https://gerrit.wikimedia.org/r/300554 [15:10:51] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2487316 (10Nuria) >This will allow eventbus-admins to deploy eventlogging for both the analytics instance and the eventbus instance, butI thin... 
[15:11:58] (03PS1) 10Faidon Liambotis: Update references to ubuntu.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/300559 [15:12:19] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Update references to ubuntu.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/300559 (owner: 10Faidon Liambotis) [15:15:44] 06Operations, 06Commons, 10media-storage, 13Patch-For-Review, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2487317 (10Joe) 05Open>03Resolved [15:16:00] !log T140825: Restarting Cassandra to apply 8MB trickle_fsync (restbase1015-a.eqiad.wmnet) [15:16:02] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [15:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:48] (03PS3) 10Andrew Bogott: Rearrange arguments for pool_target options [puppet] - 10https://gerrit.wikimedia.org/r/300554 [15:17:03] anomie testing dologmsg [15:17:15] Hmm. Works from tin but not terbium? 
[15:18:28] (03CR) 10Andrew Bogott: [C: 032] Rearrange arguments for pool_target options [puppet] - 10https://gerrit.wikimedia.org/r/300554 (owner: 10Andrew Bogott) [15:20:09] (03PS1) 10BBlack: text VCL: refactor backend selection [puppet] - 10https://gerrit.wikimedia.org/r/300560 (https://phabricator.wikimedia.org/T110717) [15:20:11] (03PS1) 10BBlack: Text VCL: split X-Wikimedia-Debug from the rest [puppet] - 10https://gerrit.wikimedia.org/r/300561 (https://phabricator.wikimedia.org/T110717) [15:21:43] (03PS8) 10Alex Monk: Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [15:22:02] PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: Connection refused [15:22:10] got it ^^^ [15:22:53] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: Connection refused eevans Starting... - The acknowledgement expires at: 2016-07-22 17:22:24. [15:24:38] (03PS1) 10Andrew Bogott: Formatting followup for I32b30fccaf044ae2865b331f28f9238ac6693f81 [puppet] - 10https://gerrit.wikimedia.org/r/300562 [15:24:49] !log Starting script to populate empty gu_auth_token [[phab:T140478]] [15:24:51] T140478: Populate gu_auth_token for existing users - https://phabricator.wikimedia.org/T140478 [15:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:26] 06Operations, 10Ops-Access-Requests, 06Services, 13Patch-For-Review, 03Scap3: Allow Pchelolo to deploy services via Scap3 - https://phabricator.wikimedia.org/T141086#2487367 (10GWicke) Approved. 
[15:26:26] (03CR) 10Andrew Bogott: [C: 032] Formatting followup for I32b30fccaf044ae2865b331f28f9238ac6693f81 [puppet] - 10https://gerrit.wikimedia.org/r/300562 (owner: 10Andrew Bogott) [15:29:44] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2487399 (10BBlack) Sounds good to me, you want to do it, or me, or @gwicke since he sent the first one? [15:34:00] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2487410 (10mobrovac) >>! In T133001#2487399, @BBlack wrote: > Sounds good to me, you want to do it, or m... [15:35:57] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2487415 (10GWicke) Okay, I'll send out an announcement today. Lets say we'll switch it off by September... [15:37:37] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2487421 (10mobrovac) >>! In T133001#2487415, @GWicke wrote: > Okay, I'll send out an announcement today.... 
[15:38:21] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [15:40:42] RECOVERY - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is OK: TCP OK - 0.009 second response time on port 9042 [15:44:41] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2487449 (10CCogdill_WMF) IBM tried to validate wikipedia.org, but the validation failed because the the p= value was wrong.... [15:49:26] (03PS2) 10Greg Grossmeier: [Beta Cluster] Remove PoolCounter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298919 (https://phabricator.wikimedia.org/T38891) [15:52:58] (03PS1) 10Jgreen: corrected public key for spop1024._domainkey in wikipedia.org zone [dns] - 10https://gerrit.wikimedia.org/r/300565 [15:54:43] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2487522 (10Jgreen) >>! In T135410#2487449, @CCogdill_WMF wrote: > IBM tried to validate wikipedia.org, but the validation f... 
[15:57:03] (03PS2) 10Jgreen: corrected public key for spop1024._domainkey in wikipedia.org zone [dns] - 10https://gerrit.wikimedia.org/r/300565 (https://phabricator.wikimedia.org/T135410) [15:59:46] (03CR) 10Jgreen: [C: 032 V: 031] "minor fix to correct the DKIM public key to the original spec" [dns] - 10https://gerrit.wikimedia.org/r/300565 (https://phabricator.wikimedia.org/T135410) (owner: 10Jgreen) [16:05:06] !log running authdns-update to correct a DKIM public key on wikipedia.org [16:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:07:00] (03CR) 10GWicke: "@Filippo: The biggest concern we have isn't so much about schema migrations (old code expecting an older schema would just exit, thanks to" [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) (owner: 10Giuseppe Lavagetto) [16:12:57] (03CR) 10Andrew Bogott: [C: 04-1] "I'd like a few hard-coded things moved into hiera (as commented inline)." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [16:16:32] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2487639 (10faidon) [16:16:47] (03CR) 1020after4: [C: 031] Reenable jobs running on the phabricator db slave [puppet] - 10https://gerrit.wikimedia.org/r/300506 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [16:17:49] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2487642 (10BBlack) [16:18:04] 06Operations, 10Citoid, 10ContentTranslation-CXserver, 10RESTBase, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2216638 (10BBlack) thanks! 
[16:22:32] PROBLEM - puppet last run on mw2062 is CRITICAL: CRITICAL: puppet fail [16:22:50] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2487671 (10Gehel) I added @Jonas to #wmf-nda-requests project, which should allow him to [[ https://phabricator.wikimedia.org/legalpad/view/2/ | sign the NDA... [16:25:42] (03PS2) 1020after4: Reenable jobs running on the phabricator db slave [puppet] - 10https://gerrit.wikimedia.org/r/300506 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [16:26:36] (03CR) 10Dzahn: [C: 032] "yep, thanks, i was just about to merge it" [puppet] - 10https://gerrit.wikimedia.org/r/300506 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [16:27:29] 06Operations, 10Icinga: implement icinga paging for non-ops teams - https://phabricator.wikimedia.org/T141038#2487679 (10greg) Yeah, just to be clear, when I don't capitalize "services" I mean all the things that look like services, not just the things that Services team does/owns :) (so, I was implicitly incl... [16:29:04] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2487683 (10CCogdill_WMF) Thanks, Jeff! [16:31:47] 06Operations, 10Gerrit, 10Mail, 07Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2487685 (10Paladox) As we are updating to gerrit 2.12, we need to set sendemail.connectTimeout in gerrit, to resolve this task. @demon , I'm wondering do you have a number in mind... [16:32:37] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-Addshore: MediaWiki deployment shell access request for addshore - https://phabricator.wikimedia.org/T140276#2487687 (10greg) >>! In T140276#2487014, @JanZerebecki wrote: > I think addshore personally is trustworthy for production access. > >... 
[16:32:37] (03PS1) 10Gehel: Maps - initial data import [puppet] - 10https://gerrit.wikimedia.org/r/300572 (https://phabricator.wikimedia.org/T138501) [16:33:28] (03PS1) 10Yuvipanda: tools: Alert when iowait info is missing as well [puppet] - 10https://gerrit.wikimedia.org/r/300573 (https://phabricator.wikimedia.org/T141017) [16:33:58] (03CR) 10jenkins-bot: [V: 04-1] tools: Alert when iowait info is missing as well [puppet] - 10https://gerrit.wikimedia.org/r/300573 (https://phabricator.wikimedia.org/T141017) (owner: 10Yuvipanda) [16:34:12] (03PS2) 10Rush: tools: Alert when iowait info is missing as well [puppet] - 10https://gerrit.wikimedia.org/r/300573 (https://phabricator.wikimedia.org/T141017) (owner: 10Yuvipanda) [16:36:36] (03Abandoned) 10BBlack: [WIP] - move app_directors logic to puppet parser func [puppet] - 10https://gerrit.wikimedia.org/r/294495 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [16:37:44] 06Operations, 10Trebuchet: git fat/git deploy doesn't always unstub files [Trebuchet] - https://phabricator.wikimedia.org/T98962#2487744 (10greg) Until we remove trebuchet from production we probably don't want to mass close any #trebuchet tasks quite yet. If it's a valid issue then it should probably stay ope... [16:46:11] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2487806 (10Dzahn) >>! In T138460#2486271, @jcrespo wrote: > Once that is deployed, we can set this as resolved. @jcrespo thank you! has been deployed, the cron has been c... 
[16:48:51] (03PS2) 10BBlack: text VCL: refactor backend selection [puppet] - 10https://gerrit.wikimedia.org/r/300560 (https://phabricator.wikimedia.org/T110717) [16:48:53] (03PS2) 10BBlack: Text VCL: split X-Wikimedia-Debug from the rest [puppet] - 10https://gerrit.wikimedia.org/r/300561 (https://phabricator.wikimedia.org/T110717) [16:48:55] (03PS1) 10BBlack: VCL backends work 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [16:49:02] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:49:14] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2487812 (10Dzahn) 05Open>03Resolved setting to resolved per Jaime's comment. (and since last phabricator upgrade this should not mean i automatically claim the task ) [16:49:49] 06Operations, 10DBA, 10Phabricator: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2487818 (10Dzahn) [16:52:22] (03CR) 10jenkins-bot: [V: 04-1] VCL backends work 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [16:53:07] (03CR) 10Ori.livneh: [C: 031] Text VCL: split X-Wikimedia-Debug from the rest [puppet] - 10https://gerrit.wikimedia.org/r/300561 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [16:54:31] 06Operations, 10Trebuchet: git fat/git deploy doesn't always unstub files [Trebuchet] - https://phabricator.wikimedia.org/T98962#2487867 (10Dzahn) p:05Normal>03Low [16:59:02] 06Operations, 06Commons, 06Multimedia: Deploy a PHP and HHVM patch (Exif values retrieved incorrectly if they appear before IFD) - https://phabricator.wikimedia.org/T140419#2487894 (10matmarex) Could this be triaged or declined, please? I would like to know if this is possible to do at all, or if we have to... 
[17:00:07] * MatmaRex eyes gehel and elukey [17:00:38] (03PS9) 10Alex Monk: Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [17:00:44] MatmaRex: yes? [17:00:56] gehel: is the request in https://phabricator.wikimedia.org/T140419 possible to do? [17:02:15] MatmaRex: that definitely does not look like something to do today :) [17:02:29] gehel: no, of course. i'm asking if it's possible at all :) [17:02:53] MatmaRex: there was some discussion about that in Monday's Ops meeting... Lemme see if I can find my notes... [17:04:21] MatmaRex: Can't find the notes... but I'm pretty sure the idea was to merge it at some point. elukey do you remember better than me? [17:04:40] MatmaRex: I know nothing about HHVM, so I tend to forget when conversation is about it... [17:05:25] :) [17:05:51] yeah I don't remember either :/ [17:06:15] Not much more I can do at the moment. That's something I would ask _joe, but he has started his weekend... [17:06:33] MatmaRex: can you ping again on Monday? elukey anything else coming to mind? [17:06:44] sure [17:06:49] ori or _joe_ for sure [17:07:35] it's definitely possible [17:07:43] i'll reply [17:09:51] 06Operations, 06Commons, 06Multimedia: Deploy a PHP and HHVM patch (Exif values retrieved incorrectly if they appear before IFD) - https://phabricator.wikimedia.org/T140419#2487934 (10ori) It's definitely possible; we do this with security patches and other critical updates. @Joe, what do you think? [17:09:54] 06Operations, 06DC-Ops: determine/process/document bios firmware tracking/updating policies - https://phabricator.wikimedia.org/T141128#2487936 (10RobH) [17:11:34] thanks ori! [17:13:34] thanks ori! 
[17:13:57] that's the kind of sophisticated technical wizardry you can expect from me on phab tasks [17:14:13] ori: I have some thoughts on newer versions of memcached, will update the long-standing phab task on Monday [17:14:19] so we'll decide the direction to take [17:14:30] elukey: is the newer version the one with the instrumentation? [17:14:55] yeah exactly.. up to 1.4.28 there are no important changes but only logging one [17:14:58] *ones [17:15:16] (03PS10) 10Alex Monk: Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [17:15:23] so it might be useful to have some "canaries" with 1.4.28 in which we can easily observe evictions etc.. [17:15:29] yes, definitely [17:15:39] but I am not sure if this would be worth doing or not [17:16:20] the idea would be to have more tangible data about the memcached layer that for the moment might be a bit abused [17:17:03] anyhow, going afk, have a good weekend folks! [17:17:09] o/ [17:18:49] have a good weekend [17:19:26] (03CR) 10Andrew Bogott: [C: 031] "Thank you for the code comments. Let me know when you'd like me to merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [17:20:00] andrewbogott: how about adding a wiki to labs dns https://gerrit.wikimedia.org/r/#/c/300215/ [17:20:22] it's the latest wikipedia (well it's in the middle of being created) [17:21:13] daaaaaaaaang there are a lot of those [17:21:27] (03PS2) 10Andrew Bogott: labs dnsrecursor: add tcy.wiki(pedia) [puppet] - 10https://gerrit.wikimedia.org/r/300215 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [17:21:32] (03CR) 10Andrew Bogott: [C: 031] labs dnsrecursor: add tcy.wiki(pedia) [puppet] - 10https://gerrit.wikimedia.org/r/300215 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [17:22:15] (03CR) 10Chad: "Don't need a fake key in labs/private, the default is to not install it at all?" [puppet] - 10https://gerrit.wikimedia.org/r/300279 (owner: 10Chad) [17:27:43] (03CR) 10Dzahn: "the fake key would be just there so that the puppet compiler runs dont complain about non existing secret in labs" [puppet] - 10https://gerrit.wikimedia.org/r/300279 (owner: 10Chad) [17:28:02] (03CR) 10Dzahn: "because somehow it did when i ran it yesterday" [puppet] - 10https://gerrit.wikimedia.org/r/300279 (owner: 10Chad) [17:28:26] (03CR) 10Dzahn: [C: 032] labs dnsrecursor: add tcy.wiki(pedia) [puppet] - 10https://gerrit.wikimedia.org/r/300215 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [17:28:56] andrewbogott: thx, is there something that has to be run beside merge? [17:29:18] mutante: I don't think so [17:29:37] I mean, puppet on the recursor host [17:30:08] ok cool, yea that can just run by itself. i was thinking "authdns-update"-style [17:31:48] I ran it :) [17:36:54] 06Operations, 06Release-Engineering-Team, 15User-greg: Institute a weekly review of all UBN! tasks - https://phabricator.wikimedia.org/T141130#2488009 (10greg) [17:37:14] 06Operations, 06Release-Engineering-Team, 15User-greg: Institute a weekly review of all UBN! 
tasks - https://phabricator.wikimedia.org/T141130#2488026 (10greg) [17:37:22] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Puppet has 1 failures [17:38:18] (03Abandoned) 10Chad: Gerrit: Disable downloading of archives [puppet] - 10https://gerrit.wikimedia.org/r/300304 (owner: 10Chad) [17:38:57] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2488032 (10Reedy) >>! In T140898#2486318, @Pavanaja wrote: > ಮೋಡ್ಯೂಲ್ ಪಾತೆರ I don't want to mangle these... ``` $namespaceNames['en'] = array( 828 =>... [17:44:19] (03PS2) 10BBlack: VCL backends work 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [17:44:21] (03PS1) 10BBlack: VCL backends 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [17:44:23] (03PS1) 10BBlack: VCL backends 3/N: no need for (?i) on planet [puppet] - 10https://gerrit.wikimedia.org/r/300580 (https://phabricator.wikimedia.org/T110717) [17:44:26] (03PS1) 10BBlack: VCL backends 4/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [17:46:58] (03CR) 10jenkins-bot: [V: 04-1] VCL backends work 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:47:00] (03PS1) 10Chad: Gerrit: Set sendemail.connectTimeout to 1 minute [puppet] - 10https://gerrit.wikimedia.org/r/300583 (https://phabricator.wikimedia.org/T131189) [17:47:28] (03CR) 10jenkins-bot: [V: 04-1] VCL backends 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:48:40] (03CR) 10jenkins-bot: [V: 04-1] VCL backends 3/N: no need for (?i) on planet [puppet] - 10https://gerrit.wikimedia.org/r/300580 
(https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:49:19] (03CR) 10Paladox: [C: 031] Gerrit: Set sendemail.connectTimeout to 1 minute [puppet] - 10https://gerrit.wikimedia.org/r/300583 (https://phabricator.wikimedia.org/T131189) (owner: 10Chad) [17:49:38] andrewbogott: oh, i almost missed that line! thanks [17:49:45] (03CR) 10jenkins-bot: [V: 04-1] VCL backends 4/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:50:41] (03PS2) 10BBlack: VCL backends 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [17:50:43] (03PS2) 10BBlack: VCL backends 4/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [17:50:45] (03PS2) 10BBlack: VCL backends 3/N: no need for (?i) on planet [puppet] - 10https://gerrit.wikimedia.org/r/300580 (https://phabricator.wikimedia.org/T110717) [17:50:47] (03PS3) 10BBlack: VCL backends work 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [17:51:27] the great jenkins-bot battle of 2016 [17:51:40] :) [17:52:01] we should have a keyword to tell jenkins not to bother with CI checks yet for early WIP patches [17:52:33] yup, for that I usually use --draft if I want gerrit to have a copy [17:52:45] before that I push the current branch as 'production' on a labs puppetmaster [17:52:52] I've never even heard of --draft heh [17:52:54] Drafts are ugly because nobody can see them :\ [17:53:00] err, git-review's --draft [17:53:06] ah [17:53:13] They're private, but not private at all :p [17:53:53] so basically they exist in the git clone (not private), but just aren't obviously-findable via branches and/or the UI? 
[17:54:29] I think they are visible to you only, so my use case is the rare time when I switch computers and want the code reviews shared [17:55:02] You + anyone you put on review list. [17:55:13] I just hate leaving complex work in my working copy local-only, because I could lose it to an accident [17:55:15] ah, that part I didn't know [17:55:48] But yeah, if you know the change # you can construct the ref it exists at and end up fetching the object. [17:55:51] I guess I could sync my local git to a thumb drive once a day or something instead. but this way also gives a little visibility where someone might notice and go "woah you're way off base with where you're going" too :) [17:56:36] <_joe_> godog: yes, I have a couple of drafts to which I added reviewers [17:56:45] Drafts aren't bad, they're just not as private as they look so I tend to tell people to avoid them. In which case: why not just make it public if it doesn't need to be private :) [17:57:09] <_joe_> I use them just to avoid spamming the channels until I feel confident with what I wrote [17:57:14] ostriches, in gerrit 2.12 they will definitely be drafts if users use the web editor [17:57:17] <_joe_> I don't consider drafts private [17:57:23] but will turn to refs/changes/ [17:57:29] once you click the public button [17:57:35] Yeah I know. [17:57:45] Not a fan of that, but *shrug* [17:57:52] oh [17:58:03] imho the default should be visible :) [17:58:15] I quite like editing through the web editor. But yes it should be visible by default [17:58:31] Ooooh, I could do that actually :p [17:58:46] Oh [17:58:57] my brain has been thinking in vi for too long, there's no hope for me and new editors anymore :P [17:58:58] Make drafts viewable by default so nobody thinks they're private. We would just need to tweak the IRC bot to not report them. [17:59:06] :D [17:59:19] I think I have a permission for that.... [17:59:27] oh [18:00:05] Ah yes, the aptly-named "View Drafts" lol.
[18:00:19] ha [18:00:22] lol [18:02:01] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:07:33] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2488149 (10Pavanaja) >>! In T140898#2488032, @Reedy wrote: >>>! In T140898#2486318, @Pavanaja wrote: >> ಮೋಡ್ಯೂಲ್ ಪಾತೆರ > > I don't want to mangle these... [18:17:07] (03CR) 10Dzahn: [C: 032] Gerrit: Set sendemail.connectTimeout to 1 minute [puppet] - 10https://gerrit.wikimedia.org/r/300583 (https://phabricator.wikimedia.org/T131189) (owner: 10Chad) [18:18:34] ^ gerrit restart coming up.. expect just a couple seconds [18:19:36] done [18:25:02] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2488316 (10faidon) So, this was simply the case of a misconfigured VLAN on the switch. I did that and with another small hack managed to make the server install. However, it is currently... [18:25:02] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures [18:35:23] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:43:42] 06Operations, 10netops: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2488531 (10faidon) We finally got the LOA. Subtasks for the cross-connect (both protected under S4) have been opened for Equinix/EvoSwitch respectively. [18:51:52] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
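The remark in the drafts discussion about constructing the ref from a change number can be sketched as follows. This is a minimal illustration, assuming Gerrit's usual refs/changes/<last two digits of change>/<change number>/<patch set> layout; the change number and patch set are borrowed from a patch announced in this log purely as sample values, and the remote name "origin" is an assumption.

```shell
#!/bin/sh
# Build the Gerrit ref for a given change number and patch set.
# Assumed layout: refs/changes/<last two digits of change>/<change>/<patch set>
change=300331     # illustrative value: a change number that appears in this log
patchset=11       # illustrative value: one of its patch sets

# Zero-pad the last two digits so that, for example, change 300308 maps to "08".
shard=$(printf '%02d' $((change % 100)))
ref="refs/changes/${shard}/${change}/${patchset}"
echo "$ref"

# With a clone whose remote is named "origin" (an assumption), the patch set
# could then be fetched and inspected with:
#   git fetch origin "$ref" && git checkout FETCH_HEAD
```

The fetch itself is left as a comment because it needs network access and a matching clone; only the ref construction is executed.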
[18:54:36] !log restart grrrit-wm [18:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:44] (03CR) 10Dzahn: [C: 032] Gerrit: Store the ssh_host_key in private puppet secrets [puppet] - 10https://gerrit.wikimedia.org/r/300279 (owner: 10Chad) [18:57:52] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [18:57:59] (03CR) 10Dzahn: "no-op on ytterbium confirmed, compiles in labs too" [puppet] - 10https://gerrit.wikimedia.org/r/300279 (owner: 10Chad) [18:58:21] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2488563 (10Jonas) Apparently not ``` Access Denied: L2 Trusted Volunteer Access & Confidentiality Agreement You do not have permission to edit this object.... [18:59:23] (03PS2) 10Dzahn: Gerrit: Further tweaks to down/maintenance mode [puppet] - 10https://gerrit.wikimedia.org/r/300323 (owner: 10Chad) [19:01:39] 06Operations, 10Gerrit, 10Mail, 13Patch-For-Review, 07Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2158668 (10Dzahn) >>! In T131189#2487685, @Paladox wrote: > As we are updating to gerrit 2.12, we need to set sendemail.connectTimeout in gerrit, to resolve th... [19:07:02] 06Operations, 10Gerrit, 10Mail, 13Patch-For-Review, 07Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2488597 (10demon) I'd rather not until we swap over to the new host. 
[19:07:38] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/3445/" [puppet] - 10https://gerrit.wikimedia.org/r/300323 (owner: 10Chad) [19:13:04] (03PS1) 10Dzahn: gerrit: fixup error document config line [puppet] - 10https://gerrit.wikimedia.org/r/300597 [19:13:44] (03PS2) 10Dzahn: gerrit: fixup error document config line [puppet] - 10https://gerrit.wikimedia.org/r/300597 [19:14:56] (03CR) 10Chad: [C: 031] gerrit: fixup error document config line [puppet] - 10https://gerrit.wikimedia.org/r/300597 (owner: 10Dzahn) [19:15:00] (03CR) 10Dzahn: [C: 032] gerrit: fixup error document config line [puppet] - 10https://gerrit.wikimedia.org/r/300597 (owner: 10Dzahn) [19:15:22] (03CR) 10Alex Monk: "Do you think we should do it before or after the labs openstack upgrade?" [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [19:18:57] (03CR) 10Andrew Bogott: "Safer to wait until after, if you can stand the suspense." [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [19:34:10] (03CR) 10Dzahn: "checked on iridium with apt-file. /usr/bin/mysql is already provided my package mariadb-client. 
installing that instead looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/297803 (owner: 10Chad) [19:34:23] (03PS7) 10Chad: WIP: Gerrit: Swap lead to point at production data [puppet] - 10https://gerrit.wikimedia.org/r/298673 [19:37:56] (03PS3) 10Chad: Phab: make sure the mail crons have mariadb-client installed [puppet] - 10https://gerrit.wikimedia.org/r/297803 [19:46:41] (03PS4) 10Chad: Phab: make sure the mail crons have mariadb-client installed [puppet] - 10https://gerrit.wikimedia.org/r/297803 [19:47:53] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2488655 (10Gehel) The correct link should be https://phabricator.wikimedia.org/L2 (thanks @Krenair). I'm not able to test... so I might be missing something... [19:49:02] (03PS1) 10Hashar: contint: jenkins-debian-glue 0.17.0 [puppet] - 10https://gerrit.wikimedia.org/r/300600 (https://phabricator.wikimedia.org/T141114) [19:51:28] (03CR) 10Chad: "Swapped to require_package() since this can (and does) get called multiple times." [puppet] - 10https://gerrit.wikimedia.org/r/297803 (owner: 10Chad) [19:53:43] (03CR) 10Hashar: [C: 031] "Essentially a noop. Cherry picked on CI puppet master. Can be merged anytime :)" [puppet] - 10https://gerrit.wikimedia.org/r/300600 (https://phabricator.wikimedia.org/T141114) (owner: 10Hashar) [20:00:36] (03CR) 10Dzahn: [C: 032] "yes, looks good. and yes, require_package !" 
[puppet] - 10https://gerrit.wikimedia.org/r/297803 (owner: 10Chad) [20:02:23] (03CR) 10Dzahn: "noop on iridium confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/297803 (owner: 10Chad) [20:03:50] (03CR) 10Dzahn: [C: 032] contint: jenkins-debian-glue 0.17.0 [puppet] - 10https://gerrit.wikimedia.org/r/300600 (https://phabricator.wikimedia.org/T141114) (owner: 10Hashar) [20:04:00] (03PS2) 10Dzahn: contint: jenkins-debian-glue 0.17.0 [puppet] - 10https://gerrit.wikimedia.org/r/300600 (https://phabricator.wikimedia.org/T141114) (owner: 10Hashar) [20:04:49] mutante: need to amend it sorry :( [20:04:55] mutante: or i can send another [20:05:05] hashar: oh? wasnt it just rebsae? [20:05:07] rebase [20:05:30] ok [20:06:26] mutante: upstream simplified their packages and I am missing dependencies [20:06:38] I should really only send / +1 a patch once I am 100% sure it is all good :D [20:07:37] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: fix chromium-browser startup script on osmium (was: fix puppet run on osmium ) - https://phabricator.wikimedia.org/T141023#2488698 (10Dzahn) [20:08:02] mutante: double checked it is all good after all :))) [20:09:05] hashar: alright, submitted [20:09:54] did it based on "cherry-picked on CI master" [20:10:35] and I have found a shell linter :} [20:10:42] :) [20:10:46] find . -name README | xargs rm [20:10:46] ^-- SC2038: Use -print0/-0 or -exec + to allow for non-alphanumeric filenames. [20:10:55] looks like a legit gotcha! [20:11:37] deleting all READMEs? 
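The SC2038 warning quoted above can be reproduced safely in a scratch directory. This is a minimal sketch and not what was run in the channel: the directory layout and file names are hypothetical, -print0 with xargs -0 is the widely available GNU/BSD extension the warning mentions, and -exec ... + is the POSIX alternative.

```shell
#!/bin/sh
# Hypothetical scratch tree; nothing outside the temp directory is touched.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/plain" "$demo_dir/with space" "$demo_dir/other dir"
touch "$demo_dir/plain/README" "$demo_dir/with space/README" \
      "$demo_dir/other dir/README" "$demo_dir/plain/keep.txt"

# The flagged form splits paths on whitespace, so "with space/README" would
# reach rm as two bogus arguments and the file would survive:
#   find "$demo_dir" -name README | xargs rm

# Option 1: NUL-delimited hand-off between find and xargs.
find "$demo_dir/plain" "$demo_dir/with space" -name README -print0 | xargs -0 rm -f

# Option 2: let find batch the arguments itself, with no pipe at all.
find "$demo_dir/other dir" -name README -exec rm -f {} +

# Count what is left; tr strips the padding some wc implementations add.
remaining=$(find "$demo_dir" -name README | wc -l | tr -d ' ')
echo "READMEs left: $remaining"
```

Both options pass each path to rm as a single argument regardless of spaces or quotes in the name, which is the point of the lint message; keep.txt is there only to show that unrelated files are untouched.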
[20:11:45] oh [20:12:25] I just ran the tool "shellcheck" on random .sh files on my disk [20:13:41] oh, shellcheck.net i assume [20:14:09] but why does it delete READMEs [20:15:20] mutante: yeah that one :} [20:15:36] (03PS11) 10Alex Monk: Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [20:16:43] and shellcheck is supported by the vim plugin that does syntax check :D https://github.com/scrooloose/syntastic [20:17:15] yea, legit gotcha [20:17:31] non-alphanumeric file names [20:18:43] may I abuse your time to bump a package on apt.wikimedia.org ? that is for building .deb packages :) [20:18:48] if you have idle cycles [20:21:27] hashar: not sure that i want to touch build hosts.. where are they? [20:24:47] :) [20:49:57] (03PS3) 10BBlack: VCL backends 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [20:49:59] (03PS3) 10BBlack: VCL backends 4/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [20:50:01] (03PS3) 10BBlack: VCL backends 3/N: no need for (?i) on planet [puppet] - 10https://gerrit.wikimedia.org/r/300580 (https://phabricator.wikimedia.org/T110717) [20:50:03] (03PS1) 10BBlack: VCL backends work 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300654 (https://phabricator.wikimedia.org/T110717) [20:50:05] (03PS1) 10BBlack: VCL backends 5/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 [20:50:07] (03PS1) 10BBlack: VCL backends 6/N: use for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/300656 [20:53:38] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 6x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2488772 (10Eevans) In discussing this with @GWicke the question of quantity came up again. 
Since Cassandra is deployed across two data-centers with three racks eac... [20:55:34] (03PS1) 10Dzahn: gerrit: lower TTLs to 600 [dns] - 10https://gerrit.wikimedia.org/r/300657 (https://phabricator.wikimedia.org/T70271) [20:56:04] (03PS2) 10Dzahn: gerrit: lower TTLs to 600 [dns] - 10https://gerrit.wikimedia.org/r/300657 (https://phabricator.wikimedia.org/T70271) [20:57:38] (03CR) 10Dzahn: "we have a mix of 5M, 600 and 1H that is a bit random. many things are just 600 all the time and never get upped to 1H again.." [dns] - 10https://gerrit.wikimedia.org/r/300657 (https://phabricator.wikimedia.org/T70271) (owner: 10Dzahn) [20:57:54] (03CR) 10jenkins-bot: [V: 04-1] VCL backends 5/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 (owner: 10BBlack) [21:00:17] (03CR) 10jenkins-bot: [V: 04-1] VCL backends 6/N: use for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/300656 (owner: 10BBlack) [21:01:02] does it matter if all services have just 5 minute TTLs instead of one hour? does "lower to 5M and later back up to 1H" ? [21:01:38] we used to always do that.. then many things just became 5M all the time [21:01:57] (03CR) 10Paladox: [C: 031] gerrit: lower TTLs to 600 [dns] - 10https://gerrit.wikimedia.org/r/300657 (https://phabricator.wikimedia.org/T70271) (owner: 10Dzahn) [21:02:56] (03CR) 10BBlack: [C: 031] gerrit: lower TTLs to 600 [dns] - 10https://gerrit.wikimedia.org/r/300657 (https://phabricator.wikimedia.org/T70271) (owner: 10Dzahn) [21:04:11] (03CR) 10Dzahn: [C: 032] gerrit: lower TTLs to 600 [dns] - 10https://gerrit.wikimedia.org/r/300657 (https://phabricator.wikimedia.org/T70271) (owner: 10Dzahn) [21:06:04] 06Operations, 10Cassandra, 10hardware-requests: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2488791 (10Eevans) A good first-step here might be the procurement of 3 additional machines, instead of the full 6. This doesn't provide for parity with our multi-data cente... 
[21:08:12] (03PS1) 10Dzahn: switch gerrit-new to gerrit [dns] - 10https://gerrit.wikimedia.org/r/300660 (https://phabricator.wikimedia.org/T70271) [21:10:17] (03CR) 10Dzahn: "DNS change to go with that: https://gerrit.wikimedia.org/r/#/c/300660/ (TTLs lowered to 5M)" [puppet] - 10https://gerrit.wikimedia.org/r/298673 (owner: 10Chad) [21:11:04] (03CR) 10Chad: "I71e32568?" [dns] - 10https://gerrit.wikimedia.org/r/300660 (https://phabricator.wikimedia.org/T70271) (owner: 10Dzahn) [21:11:17] (03PS2) 10Dzahn: switch gerrit-new to gerrit [dns] - 10https://gerrit.wikimedia.org/r/300660 (https://phabricator.wikimedia.org/T70271) [21:12:51] (03CR) 10Dzahn: WIP: Gerrit: Swap DNS to new host, lead (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/299007 (owner: 10Chad) [21:13:36] (03CR) 10Dzahn: [C: 04-1] WIP: Gerrit: Swap DNS to new host, lead [dns] - 10https://gerrit.wikimedia.org/r/299007 (owner: 10Chad) [21:13:59] (03Abandoned) 10Dzahn: switch gerrit-new to gerrit [dns] - 10https://gerrit.wikimedia.org/r/300660 (https://phabricator.wikimedia.org/T70271) (owner: 10Dzahn) [21:15:10] (03CR) 10Dzahn: "we are doing this after Sunday" [puppet] - 10https://gerrit.wikimedia.org/r/300446 (https://phabricator.wikimedia.org/T141064) (owner: 10Dzahn) [21:15:56] (03PS2) 10Dzahn: Add user hjiang to analytics/research related groups. [puppet] - 10https://gerrit.wikimedia.org/r/300526 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [21:17:33] (03CR) 10Dzahn: [C: 032] Add user hjiang to analytics/research related groups. [puppet] - 10https://gerrit.wikimedia.org/r/300526 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [21:20:13] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-Addshore: MediaWiki deployment shell access request for addshore - https://phabricator.wikimedia.org/T140276#2488800 (10JanZerebecki) I want to clarify. One thing I am certain about was that the APG grant was referred to. One example of what I... 
[21:23:28] (03PS2) 10Chad: WIP: Gerrit: Swap DNS to new host, lead [dns] - 10https://gerrit.wikimedia.org/r/299007 [21:24:36] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2488805 (10Dzahn) on bast1001.wikimedia.org: Notice: /Stage[main]/Admin/Admin::Hashuser[hjiang]/Admin::User[hjiang]/Ssh::Userkey[hji... [21:27:06] (03CR) 10Dzahn: "yep, looks good now" [dns] - 10https://gerrit.wikimedia.org/r/299007 (owner: 10Chad) [21:35:45] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2488819 (10Dzahn) 05Open>03Resolved a:03Dzahn HJiang-WMF your access has been granted. You should be able to connect now (or wi... [21:46:38] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-Addshore: MediaWiki deployment shell access request for addshore - https://phabricator.wikimedia.org/T140276#2488841 (10greg) >>! In T140276#2488800, @JanZerebecki wrote: > Can you point to the explicit communication regarding the expectations... [21:47:38] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-Addshore: MediaWiki deployment shell access request for addshore - https://phabricator.wikimedia.org/T140276#2488842 (10MZMcBride) >>! In T140276#2475172, @gerritbot wrote: > Change 299755 had a related patch set uploaded (by Elukey): > Add ad... 
[22:15:08] (03CR) 10Aaron Schulz: [C: 032] Enable debug logging for DBTransaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299095 (owner: 10Aaron Schulz) [22:16:40] (03PS3) 10Aaron Schulz: Enable debug logging for DBTransaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299095 [22:16:55] (03CR) 10Aaron Schulz: [C: 032] Enable debug logging for DBTransaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299095 (owner: 10Aaron Schulz) [22:17:42] (03Merged) 10jenkins-bot: Enable debug logging for DBTransaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299095 (owner: 10Aaron Schulz) [22:19:09] !log aaron@tin Synchronized wmf-config/InitialiseSettings.php: Enable debug logging for DBTransaction (duration: 00m 38s) [22:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:22:23] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 621 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5082826 keys - replication_delay is 621 [22:30:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5081219 keys - replication_delay is 0 [22:35:15] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2465141 (10Dzahn) Hi @Brentjoseph we just need that wikitech user name then we can go ahead and resolve this for... [22:38:38] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2480587 (10Dzahn) confirmed Jonas has signed L2 now [22:39:27] (03CR) 10Paladox: "+1." 
[dns] - 10https://gerrit.wikimedia.org/r/299007 (owner: 10Chad) [22:42:51] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 620 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5081254 keys - replication_delay is 620 [22:43:49] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2489027 (10Dzahn) @Jonas Do you have a user on wikitech wiki? (https://wikitech.wikimedia.org/wiki/Main_Page) please paste the user name or create one if you... [22:44:43] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5081406 keys - replication_delay is 45 [22:52:19] (03PS1) 10Dzahn: admin: add jsamra to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/300678 (https://phabricator.wikimedia.org/T140445) [22:53:51] (03PS2) 10Dzahn: admin: add jsamra to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/300678 (https://phabricator.wikimedia.org/T140445) [22:54:43] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 645 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5081406 keys - replication_delay is 645 [23:07:11] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5082838 keys - replication_delay is 15 [23:39:16] (03PS1) 10Ppchelko: Change-Prop: Updates to error hangling [puppet] - 10https://gerrit.wikimedia.org/r/300681 [23:52:30] (03PS2) 10RobH: robh on vacation, removing from paging [puppet] - 10https://gerrit.wikimedia.org/r/300591