[00:15:08] mutante: I'll test the phab deploy key one sec
[00:15:42] twentyafterfour: cool, so i had to restart the keyholder to unload the old key, but should be all good
[00:17:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0]
[00:17:32] hmm I might be doing it wrong
[00:18:04] twentyafterfour@tin:~$ ssh iridium.eqiad.wmnet -l phab-deploy
[00:18:06] Permission denied (publickey).
[00:18:35] (after running export SSH_AUTH_SOCK=/run/keyholder/proxy.sock )
[00:19:10] twentyafterfour: wait, where does the public part go
[00:19:28] on iridium ...
[00:19:59] how does it get there
[00:20:05] puppet installs the private part
[00:20:08] that patch earlier should have put it in phab-deploy's authorized keys...
[00:20:15] (wrong one, obviously)(
[00:21:36] can you show me which part installs the public part on iridium?
[00:22:47] because i need to add it
[00:23:33] duh, now I see that part is not obvious
[00:24:28] so i made a new keypair and the public part also changed
[00:24:39] first i wanted to just add the passphrase with -p
[00:24:47] but then it didnt like the permissions of it
[00:24:55] while in the private repo
[00:28:21] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[00:32:21] mutante: modules/phabricator/files/phab-deploy-key.production
[00:33:19] twentyafterfour: ah! thanks, couldn't find because underscores :p
[00:33:22] adding now
[00:36:47] (PS1) Dzahn: phabricator: add new deployment public key [puppet] - https://gerrit.wikimedia.org/r/279080
[00:39:51] (CR) 20after4: [C: 1] phabricator: add new deployment public key [puppet] - https://gerrit.wikimedia.org/r/279080 (owner: Dzahn)
[00:40:16] it occurs to me that we should really have a centralized key management for all of the things
[00:41:08] so that the process to set up deploy keys (public and private!)
plus passphrases is all easier to remember and less fragmented
[00:41:14] (CR) Dzahn: [C: 2] phabricator: add new deployment public key [puppet] - https://gerrit.wikimedia.org/r/279080 (owner: Dzahn)
[00:43:01] twentyafterfour: yes, something more automatic step by step. in small increments, like have them all ine one place with identical naming scheme first
[00:43:33] (as opposed to that one epic ticket to manage it all)
[00:43:42] i think
[00:44:51] yeah ... I'll see what I can come up with for step 1
[00:45:17] cool!
[00:45:28] try it again ow
[00:45:30] now
[00:51:06] (CR) BBlack: [C: 1] VTC tests compatible with Varnish 3 and 4 [puppet] - https://gerrit.wikimedia.org/r/278948 (https://phabricator.wikimedia.org/T128188) (owner: Ema)
[00:53:49] twentyafterfour: works?
[01:06:37] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 387.69 seconds
[01:17:11] PROBLEM - Host es2019 is DOWN: PING CRITICAL - Packet loss = 100%
[01:40:07] I'm doing something wrong it didn't work
[01:40:30] mutante: I'm gonna try to debug it on iridium
[01:40:56] I did get something different this time: Agent admitted failure to sign using the key.
[01:47:36] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.47 seconds
[01:49:02] PROBLEM - MariaDB Slave Lag: x1 on db2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.15 seconds
[01:49:42] PROBLEM - MariaDB Slave Lag: x1 on db2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 363.69 seconds
[01:50:08] PROBLEM - MariaDB Slave Lag: x1 on db1031 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 384.02 seconds
[01:51:55] iridium's auth log reports 4 entries of "Failed publickey for phab-deploy"
[01:52:06] RECOVERY - MariaDB Slave Lag: x1 on db1031 is OK: OK slave_sql_lag Replication lag: 0.26 seconds
[01:52:08] (for each auth attempt) and 3 of them match the fingerprint in keyholder
[01:52:20] 1 mismatch.
I guess that keyholder-proxy needs to be restarted on tin
[01:52:27] mutante: ^
[01:52:43] RECOVERY - MariaDB Slave Lag: x1 on db2008 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[01:53:31] RECOVERY - MariaDB Slave Lag: x1 on db2009 is OK: OK slave_sql_lag Replication lag: 0.31 seconds
[02:27:59] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 11m 18s)
[02:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:53:15] Operations, Wikimedia-Site-Requests, Security: Squid configuration for url-downloader.wikimedia.org allowing upload.wikimedia.org - https://phabricator.wikimedia.org/T130695#2143360 (Dereckson)
[02:53:47] Operations, Wikimedia-Site-Requests, Security: Squid configuration for url-downloader.wikimedia.org allowing upload.wikimedia.org - https://phabricator.wikimedia.org/T130695#2143360 (Dereckson)
[02:57:21] Operations, Wikimedia-Site-Requests, Security: XFF configuration for url-downloader.wikimedia.org allowing upload.wikimedia.org - https://phabricator.wikimedia.org/T130695#2143381 (Dereckson)
[02:57:50] kaldari: here you are ^
[03:02:47] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.18) (duration: 17m 31s)
[03:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:12:07] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Mar 23 03:12:07 UTC 2016 (duration 9m 20s)
[03:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:26:24] !log tin - restart keyholder - re: < twentyafterfour> 1 mismatch. I guess that keyholder-proxy needs to be restarted on tin
[03:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:26:38] twentyafterfour:
[03:31:22] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[03:31:39] sigh, yes
[03:31:53] gets prepared to type 10 passphrases..
just a sec
[03:34:09] !log tin - re-arm keyholder
[03:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:34:39] come on now
[03:34:52] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys.
[06:25:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[06:27:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0]
[06:30:42] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:43] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:43] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:01] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:11] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:21] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:22] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail
[06:32:01] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:52] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:12] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:39:22] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:55:30] (PS6) Giuseppe Lavagetto: jobqueue_redis: set up encryption and cross-dc replication
[puppet] - https://gerrit.wikimedia.org/r/276980 (https://phabricator.wikimedia.org/T124672)
[06:55:41] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:23] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:57:11] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:32] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:58:03] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:33] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:12] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:13:53] (CR) Giuseppe Lavagetto: [C: 2] "Does the right thing according to the compiler:" [puppet] - https://gerrit.wikimedia.org/r/276980 (https://phabricator.wikimedia.org/T124672) (owner: Giuseppe Lavagetto)
[07:19:25] <_joe_> !log progressively activating cross-dc replica and encryption between the jobqueue redises
[07:19:30] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:33:22] PROBLEM - cassandra-a CQL 10.64.32.192:9042 on restbase1004 is CRITICAL: Connection refused
[07:59:04] !log powercycling es2019 - it was down
[07:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:01:44] RECOVERY - Host es2019 is UP: PING OK - Packet loss = 0%, RTA = 36.50 ms
[08:01:59] PROBLEM - mysqld processes on es2019 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[08:02:15] PROBLEM - MariaDB Slave Lag: es3 on es2019 is CRITICAL: CRITICAL slave_sql_lag could not connect
[08:03:14] PROBLEM - MariaDB Slave IO: es3 on es2019 is CRITICAL: CRITICAL slave_io_state could not connect
[08:03:23] PROBLEM - MariaDB Slave SQL: es3 on es2019 is CRITICAL: CRITICAL slave_sql_state could not connect
[08:03:54] PROBLEM - NTP on es2019 is CRITICAL: NTP CRITICAL: Offset unknown
[08:10:23] (PS4) Muehlenhoff: Add ferm rules for carbon-c-relay for labs graphite [puppet] - https://gerrit.wikimedia.org/r/276482
[08:11:44] (CR) Muehlenhoff: [C: 2 V: 2] Add ferm rules for carbon-c-relay for labs graphite [puppet] - https://gerrit.wikimedia.org/r/276482 (owner: Muehlenhoff)
[08:18:24] (PS2) Muehlenhoff: Revert temporary bump of connection table, underlying bug has been fixed [puppet] - https://gerrit.wikimedia.org/r/278850
[08:19:07] (CR) Muehlenhoff: [C: 2 V: 2] Revert temporary bump of connection table, underlying bug has been fixed [puppet] - https://gerrit.wikimedia.org/r/278850 (owner: Muehlenhoff)
[08:21:33] RECOVERY - NTP on es2019 is OK: NTP OK: Offset -0.000977396965 secs
[08:22:15] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Check the redis (jobqueue) configuration in codfw - https://phabricator.wikimedia.org/T124672#2143701 (Joe) Replication is now active and encrypted. It doesn't seem to cause any major performance issue at this rate of insertion...
[08:22:24] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#2143703 (Joe)
[08:22:26] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Check the redis (jobqueue) configuration in codfw - https://phabricator.wikimedia.org/T124672#2143702 (Joe) Open>Resolved
[08:24:20] (PS2) Jcrespo: Pool db1044 and db1075 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/279031 (https://phabricator.wikimedia.org/T130351)
[08:25:05] (CR) Jcrespo: [C: 2] Pool db1044 and db1075 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/279031 (https://phabricator.wikimedia.org/T130351) (owner: Jcrespo)
[08:26:14] (PS2) Muehlenhoff: Add ferm rules for postgres/maps [puppet] - https://gerrit.wikimedia.org/r/277198
[08:26:30] (PS3) Muehlenhoff: Add ferm rules for postgres/maps [puppet] - https://gerrit.wikimedia.org/r/277198
[08:26:37] (CR) jenkins-bot: [V: -1] Add ferm rules for postgres/maps [puppet] - https://gerrit.wikimedia.org/r/277198 (owner: Muehlenhoff)
[08:27:01] !log jynus@tin Synchronized wmf-config/db-codfw.php: Add db1075 (duration: 00m 40s)
[08:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:27:27] (CR) jenkins-bot: [V: -1] Add ferm rules for postgres/maps [puppet] - https://gerrit.wikimedia.org/r/277198 (owner: Muehlenhoff)
[08:30:19] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1044, pool db1075 (duration: 00m 25s)
[08:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:32:30] Operations, DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2143705 (jcrespo)
[08:33:35] (PS1) Jcrespo: Depool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/279090 (https://phabricator.wikimedia.org/T130702)
[08:34:06] (CR) Jcrespo: [C: 2] Depool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/279090 (https://phabricator.wikimedia.org/T130702) (owner: Jcrespo)
[08:35:13] (PS2) Jcrespo: Add db2008 commented out to x1 shard [mediawiki-config] - https://gerrit.wikimedia.org/r/278022 (https://phabricator.wikimedia.org/T130098) (owner: Volans)
[08:35:47] (CR) Jcrespo: [C: 2] Add db2008 commented out to x1 shard [mediawiki-config] - https://gerrit.wikimedia.org/r/278022 (https://phabricator.wikimedia.org/T130098) (owner: Volans)
[08:37:30] !log jynus@tin Synchronized wmf-config/db-codfw.php: Add db2008 (x1) depooled, depool es2019 (duration: 00m 26s)
[08:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:38:34] !log starting mysql at es2019
[08:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:39:44] Table './mysql/user' is marked as crashed and should be repaired
[08:41:28] Operations, DBA, Patch-For-Review: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2143737 (Volans) From Icinga logs: ``` [Wed Mar 23 01:15:52 2016] SERVICE ALERT: es2019;MariaDB Slave SQL: es3;CRITICAL;SOFT;1;Timeout while attempting connection [Wed Mar 23 01:15:52 2016] SER...
[08:43:50] last timestamp on heartbeat is 2016-03-23T01:15:31.000880
[08:47:25] aligns with failures of all checks on icinga
[08:50:20] next update expects 2016-03-23T01:15:31.000880 and adds +0.5
[08:50:32] (PS4) Muehlenhoff: Add ferm rules for postgres/maps [puppet] - https://gerrit.wikimedia.org/r/277198
[08:50:59] then inserts 5061392 to `arwiki`.`blobs_cluster25`
[08:51:11] I want to check the validity of previous events
[08:51:28] (CR) jenkins-bot: [V: -1] Add ferm rules for postgres/maps [puppet] - https://gerrit.wikimedia.org/r/277198 (owner: Muehlenhoff)
[08:52:23] for example, that blob is already there
[08:53:04] which means that the transaction got committed, but the replication position was not
[08:54:19] we need GTIDs as soon as possible for transactional replication control
[08:56:09] the good news is that append-only stores are easy to fix and check
[08:58:19] (PS1) Elukey: Update CDH submodule with last updates for Hadoop Namenode failover. [puppet] - https://gerrit.wikimedia.org/r/279091 (https://phabricator.wikimedia.org/T129838)
[08:59:37] (CR) jenkins-bot: [V: -1] Update CDH submodule with last updates for Hadoop Namenode failover. [puppet] - https://gerrit.wikimedia.org/r/279091 (https://phabricator.wikimedia.org/T129838) (owner: Elukey)
[09:00:54] srsly Jenkins?
[09:04:34] Operations, Analytics: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2143752 (MoritzMuehlenhoff)
[09:04:39] Operations, DBA, Patch-For-Review: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2143753 (jcrespo) Regarding mysql, it recovered way, but there are certain transactions that mysql has not properly recoded as executed on the binary log but that were committed properly on innod...
[09:04:56] Operations, DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2143754 (jcrespo)
[09:05:55] elukey: jenkins seems broken, I'll tell hashar when he's online.
that one irritiated me as well for my 277198 change
[09:06:37] moritzm: could it be related to the => aligment warning for ./modules/role/manifests/labs/graphite.pp ?
[09:16:39] (PS1) Muehlenhoff: Fix puppetlint warning [puppet] - https://gerrit.wikimedia.org/r/279092
[09:18:39] Happy jenkins is happu
[09:18:43] *happy
[09:20:02] (CR) Muehlenhoff: [C: 2 V: 2] Fix puppetlint warning [puppet] - https://gerrit.wikimedia.org/r/279092 (owner: Muehlenhoff)
[09:20:57] (PS5) Muehlenhoff: Add ferm rules for postgres/maps [puppet] - https://gerrit.wikimedia.org/r/277198
[09:21:17] !log start mysql on es2019 at es2018-bin.000044:287914983
[09:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:29:32] (PS2) Muehlenhoff: Add ferm rules for DNS auth servers [puppet] - https://gerrit.wikimedia.org/r/277258
[09:39:04] (PS2) Elukey: Update CDH submodule with last updates for Hadoop Namenode failover. [puppet] - https://gerrit.wikimedia.org/r/279091 (https://phabricator.wikimedia.org/T129838)
[09:39:19] Operations: Migrate titanium to jessie - https://phabricator.wikimedia.org/T123725#2143787 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[09:43:44] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:45:03] (PS5) Giuseppe Lavagetto: Add select mode [software/conftool] - https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199)
[09:53:47] Operations, DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2143818 (jcrespo) p:Triage>Normal I've done the first steps, the following errors were skipped after restarting replication on es2018-bin.000044:287914983 (20160323 1:15:31): ``` Could not execute Write_rows_v1...
[09:56:24] (CR) Joal: "@ottomata: THis field is used all over the place in our code base (mostly deprecated stuff, but still)."
[puppet] - https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: BBlack)
[09:56:33] Operations, DBA, codfw-rollout, codfw-rollout-Jan-Mar-2016: Improve documentation about database switchover - https://phabricator.wikimedia.org/T129236#2143824 (jcrespo) a:jcrespo>None
[09:58:43] godog: rb1004 critical in icinga is you?
[09:59:16] <_joe_> mobrovac: I saw a "decom rb1004" in the sal from urandom
[10:00:22] ah right, thnx _joe_
[10:02:05] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.32.192:9042 on restbase1004 is CRITICAL: Connection refused Marko Obrovac decommission in progress by urandom
[10:07:51] (CR) Alexandros Kosiaris: [C: 2] Add ferm rules for postgres/maps [puppet] - https://gerrit.wikimedia.org/r/277198 (owner: Muehlenhoff)
[10:10:53] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:16:17] Operations, Continuous-Integration-Infrastructure, Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2143837 (mobrovac) I agree with @Krinkle here. //All tests pass now// is due to (1) most npm jobs running just linters for the time being; and (2) no complex npm scripts in extensio...
[10:23:40] ACKNOWLEDGEMENT - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi offline scrub ongoing T130254
[10:26:04] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[10:26:43] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[10:27:36] akosiaris: ^^^is it you?
[10:27:51] I just ran puppet-merge
[10:27:54] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
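Looking back at the es2019 recovery earlier in the log: a blob was already present because its transaction had been committed while the replication position had not advanced, so restarting from the older binlog position replays rows that already exist. The following is only a minimal sketch of why that is tolerable for an append-only store. The function `replay` and the dict-based `store` are hypothetical stand-ins for illustration, not MariaDB's or MediaWiki's actual mechanism. Rows already present with identical content are skipped, and a row present with different content is reported as a conflict rather than overwritten.

```python
from typing import Dict, Iterable, List, Tuple


def replay(store: Dict[int, bytes],
           events: Iterable[Tuple[int, bytes]]) -> Tuple[int, int, List[int]]:
    """Apply (blob_id, content) insert events to an append-only store.

    Hypothetical illustration: returns (applied, skipped, conflicts).
    """
    applied = 0
    skipped = 0
    conflicts: List[int] = []
    for blob_id, content in events:
        if blob_id not in store:
            # Row never made it before the crash: apply it.
            store[blob_id] = content
            applied += 1
        elif store[blob_id] == content:
            # Row was committed but the position was not recorded: safe to skip.
            skipped += 1
        else:
            # Same id, different content: needs a human to look at it.
            conflicts.append(blob_id)
    return applied, skipped, conflicts
```

Because rows are never updated in place, a duplicate insert with matching content carries no information, which is what makes skipping it safe in this sketch.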
[10:28:14] ok
[10:28:34] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[10:29:46] !log installing various security updates on mediawiki codfw servers (along with HHVM restarts): graphite2, libldap, pixman, sqlite, pygments, gnutls26 (already running fine on canaries since yesterday)
[10:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:31:32] Operations, DBA, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2143866 (jcrespo) a:Volans
[10:32:46] Operations, Traffic: don't serve upload.wikimedia.org 'root' from wmfrewrite/swift - https://phabricator.wikimedia.org/T130709#2143868 (fgiunchedi)
[10:32:57] Operations, Traffic, media-storage: don't serve upload.wikimedia.org 'root' from wmfrewrite/swift - https://phabricator.wikimedia.org/T130709#2143882 (fgiunchedi)
[10:33:40] (PS2) Filippo Giunchedi: varnish: redirect upload.wikimedia.org to commons [puppet] - https://gerrit.wikimedia.org/r/278924 (https://phabricator.wikimedia.org/T130709)
[10:34:04] volans: yeah that was me, sorry :-(
[10:34:40] no prob :)
[10:38:04] (PS1) Elukey: Add support for submodules in git pull. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/279097 (https://phabricator.wikimedia.org/T130703)
[10:38:27] !log depool restbase1003 / restbase1004 prior to deprovisioning the hardware
[10:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:39:01] (CR) jenkins-bot: [V: -1] Add support for submodules in git pull.
[software/puppet-compiler] - https://gerrit.wikimedia.org/r/279097 (https://phabricator.wikimedia.org/T130703) (owner: Elukey)
[10:40:00] (PS1) Jcrespo: Depool db1015 to clone to db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/279098 (https://phabricator.wikimedia.org/T130351)
[10:41:07] (PS2) Jcrespo: Depool db1015 to clone to db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/279098 (https://phabricator.wikimedia.org/T130351)
[10:41:46] (CR) Jcrespo: [C: 2] Depool db1015 to clone to db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/279098 (https://phabricator.wikimedia.org/T130351) (owner: Jcrespo)
[10:42:13] (PS2) Ema: VTC tests compatible with Varnish 3 and 4 [puppet] - https://gerrit.wikimedia.org/r/278948 (https://phabricator.wikimedia.org/T128188)
[10:42:24] (CR) Ema: [C: 2 V: 2] VTC tests compatible with Varnish 3 and 4 [puppet] - https://gerrit.wikimedia.org/r/278948 (https://phabricator.wikimedia.org/T128188) (owner: Ema)
[10:43:13] (PS1) Filippo Giunchedi: restbase: swap restbase1003/4 with restbase1012/3 [puppet] - https://gerrit.wikimedia.org/r/279099
[10:44:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1015, increase weight of db1044 (duration: 00m 25s)
[10:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:45:26] errors slightly increased, but they are not definitive (not related to s3)
[10:46:19] (PS2) Elukey: Add support for submodules in git pull. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/279097 (https://phabricator.wikimedia.org/T130703)
[10:47:28] nah, red herring
[10:48:50] only 12 errors in the last hours on all databases
[10:48:54] mobrovac: when you have a minute, https://gerrit.wikimedia.org/r/#/c/279099/ should be straightforward
[10:54:42] (CR) Giuseppe Lavagetto: [C: 2] Add support for submodules in git pull.
[software/puppet-compiler] - https://gerrit.wikimedia.org/r/279097 (https://phabricator.wikimedia.org/T130703) (owner: Elukey)
[10:59:09] !log stopping and restarting db1015 for upgrade and clone to db1077
[10:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:00:13] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:06:18] dewiki 1070 having small issues
[11:09:08] slowness on Wikibase\Lib\Store\Sql\SqlEntityInfoBuilder::collectTermsForEntities
[11:09:34] (PS1) Elukey: Version bump to 0.1.1 [software/puppet-compiler] - https://gerrit.wikimedia.org/r/279104
[11:11:33] (CR) Elukey: [C: 2] Version bump to 0.1.1 [software/puppet-compiler] - https://gerrit.wikimedia.org/r/279104 (owner: Elukey)
[11:16:20] jynus: poke
[11:17:42] Steinsplitter, yes?
[11:20:56] jynus: or i just create a new tools labs tools and use the passwd and user from there. Because i need to get sql working apap again.
[11:21:22] I did not receive the first part of the or
[11:21:35] but if this is about labs, this is the wrong channel
[11:21:57] (I asked you to use #wikimedia-labs before)
[11:22:22] jynus: as far i know you disalbe the labs user of sbot? so why it matters where i ask....
[11:22:46] it helps me organize
[11:23:09] ok
[11:23:11] and keeps this channel clear of offtpic conversations, I will ignore you here
[11:23:56] elukey: so lets bump the puppet compiler install on the instance ?
[11:24:01] okay, i will write a complain to wikimedia-I as well as Katherine Maher :)
[11:25:08] hashar: _joe_ just did it, it should be know pointing to 0.1.1
[11:25:16] \O/
[11:25:18] yep https://wikitech.wikimedia.org/wiki/Hiera:Puppet3-diffs
[11:25:29] oh it is using hiera
[11:25:34] but sadly I just tested a build and still seeing no changes :(
[11:25:43] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[11:25:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[11:25:54] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 2 failures
[11:26:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[11:26:22] <_joe_> uhm what's this?
[11:27:40] it could be the lagged api issue with dewiki/wikidata
[11:27:52] I see nothing right now
[11:28:01] elukey: at least there is a 0.1.1 egg around /usr/local/lib/python2.7/dist-packages/puppet_compiler-0.1.1-py2.7.egg
[11:28:10] aside from the usual
[11:28:53] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:29:04] (PS1) Ema: Do not use dynamic directors in test VCL code [puppet] - https://gerrit.wikimedia.org/r/279108 (https://phabricator.wikimedia.org/T128188)
[11:29:21] hashar: yes puppe compiler points to load_entry_point('puppet-compiler==0.1.1', 'console_scripts', 'puppet-compiler')()
[11:29:22] >>> import puppet_compiler
[11:29:23] >>> puppet_compiler.__file__
[11:29:23] hashar: SyntaxError: Unexpected reserved word
[11:29:24] hashar: ReferenceError: puppet_compiler is not defined
[11:29:24] '/usr/local/lib/python2.7/dist-packages/puppet_compiler-0.1.1-py2.7.egg/puppet_compiler/__init__.pyc'
[11:30:14] I think I may have IRC issues, but my client reports no lag
[11:30:26] I will reconnect to the servers
[11:30:43] (CR) Mobrovac: [C: 1]
restbase: swap restbase1003/4 with restbase1012/3 [puppet] - https://gerrit.wikimedia.org/r/279099 (owner: Filippo Giunchedi)
[11:31:01] godog: re ^ when will rb1003/4 become inactive?
[11:31:56] (CR) Ema: [C: 2 V: 2] Do not use dynamic directors in test VCL code [puppet] - https://gerrit.wikimedia.org/r/279108 (https://phabricator.wikimedia.org/T128188) (owner: Ema)
[11:32:05] <_joe_> elukey: what is the jenkins console url for your job?
[11:32:55] _joe_ https://puppet-compiler.wmflabs.org/2129/
[11:32:56] maybe it runs a different version :(
[11:33:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:33:28] <_joe_> elukey: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/2129/console
[11:33:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:33:47] mobrovac: thanks! already depooled ~1h ago and I'm going to bring down the machines shortly
[11:34:15] <_joe_> elukey: I don't remember, is --recurse-submodules also going to check out the code? or you need to add --init?
[11:34:20] oh, several apaches complaining in the last minute, _joe_ that could be it
[11:34:21] kk godog, thnx
[11:34:30] <_joe_> (I'm saying it's definitely fetching all submodules)
[11:35:58] false alarm, maybe network glitch?
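The question above, whether `--recurse-submodules` on its own checks out submodule code or `--init` is also needed, can be sidestepped by making the submodule step explicit. The following is a minimal sketch and not puppet-compiler's actual code: `refresh_commands` and `refresh` are hypothetical names. It pairs a plain `git pull` with `git submodule update --init --recursive`, which initializes any unregistered submodules and checks out the commits recorded by the superproject.

```python
import subprocess
from typing import List


def refresh_commands(with_submodules: bool = True) -> List[List[str]]:
    """Return the git commands needed to bring a working copy up to date."""
    commands = [["git", "pull"]]
    if with_submodules:
        # Initialize missing submodules and check out the recorded commits.
        commands.append(["git", "submodule", "update", "--init", "--recursive"])
    return commands


def refresh(repo_dir: str, with_submodules: bool = True) -> None:
    """Run the refresh commands inside repo_dir, stopping at the first failure."""
    for cmd in refresh_commands(with_submodules):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Keeping the command list separate from the code that runs it makes it easy to see, or log, whether a given refresh path includes the submodule step at all.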
[11:36:12] _joe_ yes I think I am still missing something, I'll double check in ~1hr
[11:36:53] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 10.71% of data above the critical threshold [100000000.0]
[11:37:41] elukey: seems there is a def refresh: invoking 'git pull' without submodule processing :/ Missed that one
[11:37:55] apparently used to refresh the puppet.git @production and private repos
[11:53:13] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:54:07] !log swift eqiad-prod ms-be1020 / ms-be1021 to weight 3500
[11:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:54:23] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:58:29] (PS1) Muehlenhoff: Set exim environment for minimal prod configuration [puppet] - https://gerrit.wikimedia.org/r/279109
[11:58:43] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[12:03:11] Operations, procurement: 6x swift ms-be machines order - https://phabricator.wikimedia.org/T130713#2144042 (fgiunchedi)
[12:16:16] Operations, media-storage: [tracking] refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2144063 (fgiunchedi) note that **expansion** not refresh of swift hw is tracked in {T130713}, though we might be able to batch them together
[12:17:00] !log installing various security updates on mediawiki eqiad servers (along with HHVM restarts): graphite2, libldap, pixman, sqlite, pygments, gnutls26 (already running fine on canaries since yesterday)
[12:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:23:54] Operations, media-storage, Tracking: [tracking] refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2144065 (Aklapper)
tracking => #tracking. Is this #DC-Ops too?
[12:26:41] Operations, media-storage, Tracking: [tracking] refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2144067 (fgiunchedi) thanks @aklapper, no not #dc-ops !
[12:27:40] !log halt restbase1003 / restbase1004
[12:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:27:50] (PS2) Filippo Giunchedi: restbase: swap restbase1003/4 with restbase1012/3 [puppet] - https://gerrit.wikimedia.org/r/279099
[12:27:57] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: swap restbase1003/4 with restbase1012/3 [puppet] - https://gerrit.wikimedia.org/r/279099 (owner: Filippo Giunchedi)
[12:29:40] !log pool restbase1012 / restbase1013
[12:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:41:22] (PS1) Filippo Giunchedi: restbase: deprovision restbase100[34] [puppet] - https://gerrit.wikimedia.org/r/279112
[13:00:04] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail
[13:03:04] ^^^Error: Could not retrieve catalog from remote server: end of file reached
[13:10:19] (PS24) Ema: Maps VCL forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279)
[13:12:37] (PS3) BBlack: varnish: redirect upload.wikimedia.org to commons [puppet] - https://gerrit.wikimedia.org/r/278924 (https://phabricator.wikimedia.org/T130709) (owner: Filippo Giunchedi)
[13:12:47] (CR) BBlack: [C: 2 V: 2] varnish: redirect upload.wikimedia.org to commons [puppet] - https://gerrit.wikimedia.org/r/278924 (https://phabricator.wikimedia.org/T130709) (owner: Filippo Giunchedi)
[13:25:19] (PS14) Elukey: Varnish 4 API porting.
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [13:27:47] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:43:00] (03PS25) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [13:47:51] 6Operations, 10Wikimedia-Stream: redis not up after reboot on rcs machines - https://phabricator.wikimedia.org/T130147#2144175 (10Joe) p:5Triage>3Low [13:53:28] 6Operations, 10Traffic, 7discovery-system: Properly package confd and its dependencies - https://phabricator.wikimedia.org/T97971#2144177 (10Joe) 5Open>3declined [13:54:29] 6Operations, 10Traffic, 7discovery-system: Properly package confd and its dependencies - https://phabricator.wikimedia.org/T97971#1256145 (10Joe) I plan on distributing confd via a different method as building it now requires go 1.6, so there is no way I can do that with go 1.3 (which we have on jessie) [13:57:07] 6Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2144180 (10mark) My main fear is that 20 TB (after RAID10) is not a lot of headroom, considering we are using over 10 TB today, and we'll need some room for LVM snapshots etc as well. With just interna... 
[14:01:40] (03PS2) 10Ottomata: Clone mediawiki/event-schemas in refinery role [puppet] - 10https://gerrit.wikimedia.org/r/278937 (https://phabricator.wikimedia.org/T126501) [14:01:50] (03CR) 10Ottomata: [C: 032 V: 032] Clone mediawiki/event-schemas in refinery role [puppet] - 10https://gerrit.wikimedia.org/r/278937 (https://phabricator.wikimedia.org/T126501) (owner: 10Ottomata) [14:02:51] (03PS4) 10Alexandros Kosiaris: stdlib: import deep_merge function [puppet] - 10https://gerrit.wikimedia.org/r/278241 [14:03:24] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] stdlib: import deep_merge function [puppet] - 10https://gerrit.wikimedia.org/r/278241 (owner: 10Alexandros Kosiaris) [14:03:43] (03PS7) 10Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - 10https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200) [14:04:01] (03CR) 10Alexandros Kosiaris: [C: 032] Adds https only redirect to ores-web [puppet] - 10https://gerrit.wikimedia.org/r/278898 (owner: 10Halfak) [14:04:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ores: Collapse the redis configs into one stanza [puppet] - 10https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200) (owner: 10Alexandros Kosiaris) [14:04:19] (03PS3) 10Alexandros Kosiaris: Adds https only redirect to ores-web [puppet] - 10https://gerrit.wikimedia.org/r/278898 (owner: 10Halfak) [14:05:46] (03PS4) 10Alexandros Kosiaris: Adds https only redirect to ores-web [puppet] - 10https://gerrit.wikimedia.org/r/278898 (owner: 10Halfak) [14:05:57] (03CR) 10Eevans: restbase: deprovision restbase100[34] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279112 (owner: 10Filippo Giunchedi) [14:05:59] (03CR) 10Alexandros Kosiaris: [V: 032] Adds https only redirect to ores-web [puppet] - 10https://gerrit.wikimedia.org/r/278898 (owner: 10Halfak) [14:10:32] (03PS2) 10Filippo Giunchedi: restbase: deprovision restbase100[34] [puppet] - 10https://gerrit.wikimedia.org/r/279112 
[14:10:42] (03CR) 10Filippo Giunchedi: restbase: deprovision restbase100[34] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279112 (owner: 10Filippo Giunchedi) [14:12:15] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/279112 (owner: 10Filippo Giunchedi) [14:12:27] 6Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2144195 (10chasemp) This layout: ```/dev/mapper/labstore-tools ext4 8.0T 5.0T 3.1T 63% /srv/project/tools /dev/mapper/labstore-maps ext4 6.0T 2.7T 3.3T 45% /srv/project/maps /dev/... [14:13:28] (03PS3) 10Filippo Giunchedi: restbase: deprovision restbase100[34] [puppet] - 10https://gerrit.wikimedia.org/r/279112 [14:13:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: deprovision restbase100[34] [puppet] - 10https://gerrit.wikimedia.org/r/279112 (owner: 10Filippo Giunchedi) [14:14:02] akosiaris: good to merge I'm assuming [14:31:10] godog: yeah yeah, thanks [14:31:19] all for labs anyway [14:31:50] (03PS1) 10Elukey: Adding submodules support to the prepare module. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/279117 (https://phabricator.wikimedia.org/T130703) [14:32:14] (03CR) 10BBlack: Varnish 4 API porting. (0319 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [14:32:33] (03CR) 10jenkins-bot: [V: 04-1] Adding submodules support to the prepare module. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/279117 (https://phabricator.wikimedia.org/T130703) (owner: 10Elukey) [14:33:07] (03PS2) 10Elukey: Adding submodules support to the prepare module. 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/279117 (https://phabricator.wikimedia.org/T130703) [14:33:38] !log rolling-restart restbase after https://gerrit.wikimedia.org/r/279112 [14:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:53] (03CR) 10jenkins-bot: [V: 04-1] Adding submodules support to the prepare module. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/279117 (https://phabricator.wikimedia.org/T130703) (owner: 10Elukey) [14:35:09] (03PS3) 10Elukey: Adding submodules support to the prepare module. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/279117 (https://phabricator.wikimedia.org/T130703) [14:36:53] godog: would appreciate a systemd brainbounce if you have some moments today [14:37:37] (03Abandoned) 10Ottomata: Add mysql-backupex script in mysql_wmf module to do regular incremental backups [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [14:37:43] ottomata: for sure! 
shoot [14:38:12] ok, so i'm looking to port eventlogging to systemd [14:38:24] currently it uses upstart's events to group things together [14:38:30] (03PS4) 10Ottomata: Re-organize analytics dumps to their own page [puppet] - 10https://gerrit.wikimedia.org/r/269696 (https://phabricator.wikimedia.org/T115344) (owner: 10Milimetric) [14:38:39] https://github.com/wikimedia/operations-puppet/blob/production/modules/eventlogging/files/eventloggingctl [14:39:40] yeah the "ctl" pattern [14:39:46] yeah [14:39:48] so [14:40:07] i think i have grouped services in systemd together successfully using [14:40:12] a dummy service [14:40:12] and [14:40:16] PartOf and WantedBy [14:40:18] as you suggested [14:40:20] but [14:40:29] the grouping only works if you enable all the services [14:40:31] which is fine [14:40:45] (that took me a while to figure out, enabling is what causes the wantedby to be populated) [14:41:12] indeed, systemctl enable is what reads the '[Install]' section to figure out what to do [14:41:33] (03CR) 10Ottomata: [C: 031] Adding submodules support to the prepare module. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/279117 (https://phabricator.wikimedia.org/T130703) (owner: 10Elukey) [14:41:39] (03CR) 10Ottomata: [C: 032 V: 032] Re-organize analytics dumps to their own page [puppet] - 10https://gerrit.wikimedia.org/r/269696 (https://phabricator.wikimedia.org/T115344) (owner: 10Milimetric) [14:42:06] aye ok [14:42:11] but [14:42:16] status is funky, right? [14:42:31] systemctl status doesn't work [14:42:38] but, wildcards d [14:42:39] o [14:42:40] so [14:42:52] systemctl status eventlogging-* works [14:43:42] ottomata: yeah the closest I got to status for dummyparent is 'systemctl list-dependencies dummyparent' [14:46:36] hmm, but that lists a lot of dependencies! [14:46:44] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2144437 (10fgiunchedi) thanks @Cmjohnson ! 
I've halted both restbase1003 and restbase1004, should be good to be reclaimed. once that's done we can move onto the last row with restbase101... [14:46:58] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures [14:47:21] ottomata: actually no, that's not what I was thinking about, nevermind [14:47:42] yeah, and status with a wildcard doesn't really work [14:47:48] because if the services are stopped [14:47:51] they don't show up at all [14:47:54] (03PS1) 10Alexandros Kosiaris: ores: Hardcode Host in the HTTPS redirect [puppet] - 10https://gerrit.wikimedia.org/r/279119 [14:48:05] systemctl list-unit-files helps a bit [14:48:14] (03PS26) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [14:48:17] i could make a wrapper that calls that and then parses it and runs status [14:48:46] godog: really, the best i can think to do is to write a wrapper script that abstracts all of this [14:50:22] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Hardcode Host in the HTTPS redirect [puppet] - 10https://gerrit.wikimedia.org/r/279119 (owner: 10Alexandros Kosiaris) [14:51:04] ottomata: for status yeah that seems fine, it is mostly for interactive use anyways [14:51:50] godog: if i make it work for status i might as well make it work for all [14:51:55] also, i'm considering doing multiple levels [14:52:20] it'd be nice if we could have more options for stopping and starting than just everytihng or individual things [14:52:26] eventlogging has a bunch of types of services [14:52:34] forwarders, processors, consumers, etc. [14:52:38] and each of those might have multiple instances [14:52:53] e.g. 
client-side processor has 12 instances right now [14:52:56] would be nice to be able to do [14:53:07] eventloggingctl stop eventlogging-processor-client-side [14:53:08] or maybe [14:53:15] eventloggingctl stop eventlogging-processor [14:53:19] s [14:53:33] (03PS1) 10Milimetric: Fix mistake in dumps reorg [puppet] - 10https://gerrit.wikimedia.org/r/279120 [14:53:38] i wonder if its possible to make a generic enough scripot [14:53:47] that works for other grouped systemd services, not just eventlogging [14:53:49] hm [14:53:53] k, ottomata: https://gerrit.wikimedia.org/r/#/c/279120/ ^ [14:54:29] could prob write a script that took the parent dummy service as a param, parsed some systemctl output, and then presented possible actions and groupings [14:54:33] ung, but that also sounds complicated [14:55:07] (03PS2) 10Ottomata: Fix mistake in dumps reorg [puppet] - 10https://gerrit.wikimedia.org/r/279120 (owner: 10Milimetric) [14:55:14] (03CR) 10Ottomata: [C: 032 V: 032] Fix mistake in dumps reorg [puppet] - 10https://gerrit.wikimedia.org/r/279120 (owner: 10Milimetric) [14:55:37] godog: does that sounds crazy? [14:56:02] godog: one of the motivations for this project is to fix the way eventlogging is deployed [14:56:04] ottomata: IMO the first step would be having systemctl stop/start do the obvious/right thing, e.g. if it is 'eventlogging' then start everything, and ditto for subsystems [14:56:13] currently a sudo pip install . is done after the code is deployed, which sucks [14:56:49] especially since pip package names don't always == debian package names, especially across distros. by not using pip, i remove those cases. buuuuuuut, i don't HAVE to figure out Jessie stuff at the same time [14:57:02] ah, godog that works [14:57:06] that works with PartOf/WantedBy [14:57:08] already [14:57:27] i wonder if that works witih chains? i should try it...it should, write? like several levels of dummy services? 
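[editor's note] The wrapper idea discussed above (take a parent or subsystem name, parse systemctl output, then act on the matching units) could be sketched roughly as follows. This is only an illustration, not the eventloggingctl script that was eventually written: the function names, the prefix-based grouping and the sample unit names are assumptions made for the example. The only external command it relies on is `systemctl list-unit-files` with its `--no-legend` and `--no-pager` options.

```python
import subprocess


def parse_unit_files(output, prefix):
    """Pick concrete service units out of `systemctl list-unit-files` text.

    `prefix` is a hypothetical grouping key such as "eventlogging" or
    "eventlogging-processor"; grouping by name prefix is an assumption of
    this sketch, not something systemd does on its own.
    """
    units = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[0]
        if not name.startswith(prefix) or not name.endswith(".service"):
            continue
        if name.endswith("@.service"):
            continue  # a template unit is not a runnable service by itself
        units.append(name)
    return units


def systemctl_command(action, units):
    """Build the argv for e.g. `systemctl status unit1 unit2 ...`."""
    return ["systemctl", action] + list(units)


def list_group(prefix):
    """Ask systemd for unit files matching the prefix (needs a systemd host)."""
    out = subprocess.run(
        ["systemctl", "list-unit-files", "--no-legend", "--no-pager",
         prefix + "*"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_unit_files(out, prefix)


def run_on_group(action, prefix):
    """Run start/stop/status on every unit in the group."""
    units = list_group(prefix)
    if not units:
        raise SystemExit("no units match prefix %r" % prefix)
    return subprocess.call(systemctl_command(action, units))
```

With something like this, `run_on_group("status", "eventlogging-processor")` would report on stopped units too, since the list comes from unit files rather than from currently loaded units. Whether prefix matching is good enough, or whether the parent dummy service's dependency list should be used instead, is left open here.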
[14:57:48] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:57:58] oh ok, yeah that might just work [14:58:43] eventlogging.service, eventlogging-forwarder.service, eventlogging-forwarder@.service, then instance symlinks to that [14:58:43] hm [14:59:35] godog: how do you feel about those templated services in general? [14:59:41] they don't seem necessary since we use puppet [14:59:48] .erb templates are almost just as effective [15:00:04] anomie ostriches thcipriani marktraceur Krenair aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160323T1500). Please do the needful. [15:00:44] i almost prefer doing the puppet .erbs, instead of @.service and symlinks, but i am not sure [15:01:04] ottomata: heh ganglia-aggregator uses templated services and it seems to work fine, it was a straightforward port from what was in upstart [15:01:15] the symlinks should be taken care of by systemctl enable tho [15:02:31] right, same as if they weren't symlinks [15:02:44] godog: does puppet provider => systemd do that autotmatically [15:02:48] if enabled => true [15:03:30] it should yeah [15:04:46] ottomata: I'm not an erb fan btw, but e.g. 
cassandra uses erb templates not systemd instances [15:05:13] (03PS27) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [15:05:46] no patches in SWAT [15:06:50] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: codfw: (2) servers for redis jobrunners - https://phabricator.wikimedia.org/T126453#2144486 (10RobH) [15:09:16] godog: aye [15:09:26] hmm ok, so either is fine, ok i'll play with it [15:09:33] i'll start with the systemd symlinks [15:09:44] if it gets annoying for some reason, maybe i'll do erb [15:10:06] i will say, this whole systemd thing, while it has many great features...is pretty cumbersome and inflexible! [15:10:27] yeah I don't think there's a clear winner, for ganglia-aggregator is trivial because the instance name is the port number, which is also the config file name [15:10:44] but, i think probably for good reasons i guess. we had to bend over backwards to make systemd +syslog log to a normal file [15:10:51] aye [15:10:57] yeah these will be logical names [15:11:07] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0] [15:12:46] ottomata: what did you end up doing for syslog btw? 
[15:13:33] umm [15:13:35] lemme see [15:14:47] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [15:14:56] hmm [15:15:22] we had to support logging to syslog in eventloggin (not that hard with python logging), but had to build in support in puppet modules to configure python logging for syslog [15:15:24] then [15:15:30] rsyslog.conf template [15:15:34] if $programname == '<%= @service_name %>' then <%= @_log_file %> [15:15:34] & stop [15:15:55] so, i guess really not that hard [15:16:10] but, there's no way to get stdout/stderr directly from a service into a log file [15:16:23] journalctl is fine [15:16:35] but not usual [15:16:52] wanted to be able to keep a few rotated log files on disk to look at [15:17:34] !log Starting cleanups on restbase10{08,12,13}-{a,b}.eqiad.wmnet : T125842 [15:17:35] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [15:17:35] afaik on debian stdout/stderr from services ends up in /var/log/syslog via rsyslog, no? [15:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:35] !log CORRECTION: Starting cleanups on restbase10{08,10,11}-{a,b}.eqiad.wmnet : T125842 [15:18:36] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [15:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:45] godog: ja but we wanted individual files [15:18:50] hmmm, and does it? [15:18:55] wtf [15:19:25] godog: its been a while, but i don't remember being able to capture the stdout from the service using just a rsyslog.conf file [15:19:45] oh, godog, what do you put in your dummy file? [15:19:53] for ExecStart? [15:20:10] (03CR) 10Hashar: "The doc about "git pull --recurse-submodule=yes" does mention that "submodule update" has to be called after .. 
Sorry I have missed that " [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/279117 (https://phabricator.wikimedia.org/T130703) (owner: 10Elukey) [15:20:21] ottomata: what's in https://phabricator.wikimedia.org/T97402#2125805 /bin/true [15:20:46] re: stdout I'm not sure now you mentioned it, I'm fairly sure stderr does [15:21:48] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [15:23:23] (03PS1) 10Filippo Giunchedi: deprovision restbase1003 / restbase1004 [dns] - 10https://gerrit.wikimedia.org/r/279126 [15:24:19] PROBLEM - HHVM rendering on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time [15:25:38] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:26:04] mw1216: init: hhvm main process (28167) killed by SEGV signal [15:26:08] RECOVERY - HHVM rendering on mw1216 is OK: HTTP OK: HTTP/1.1 200 OK - 67552 bytes in 0.086 second response time [15:27:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] deprovision restbase1003 / restbase1004 [dns] - 10https://gerrit.wikimedia.org/r/279126 (owner: 10Filippo Giunchedi) [15:27:55] hmm, godog one plus for not using systemd symlink template [15:28:03] systemctl list-unit-files eventlogging* shows the template files too [15:28:08] eventlogging-forwarder@.service enabled [15:28:14] (03PS28) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [15:28:18] which is confusing, because it isn't a real service [15:34:59] (03PS4) 10Elukey: Adding submodules support to the prepare module. 
Version bump to 0.1.2 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/279117 (https://phabricator.wikimedia.org/T130703) [15:41:26] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2144567 (10Joe) All the blockers I listed earlier have been removed. Resolving and moving documentation to https://wikitech.wikimedia.org/wiki/Swi... [15:42:40] (03CR) 10Elukey: [C: 032] Adding submodules support to the prepare module. Version bump to 0.1.2 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/279117 (https://phabricator.wikimedia.org/T130703) (owner: 10Elukey) [15:48:59] !log updated puppet-compiler to 0.1.2 version (added submodule support) [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:04] _joe_, hashar - https://puppet-compiler.wmflabs.org/2139/analytics1001.eqiad.wmnet/ \o/ [15:52:17] ottomata: ---^ thanks :) [15:53:36] yeehaw! 
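[editor's note] The submodule problem discussed earlier (a refresh step that ran 'git pull' without any submodule processing) comes down to running a submodule update after the pull. A minimal sketch of that idea is below. It is not the code from change 279117: the function names and the injectable `run` parameter are hypothetical, and only standard git subcommands are used.

```python
import subprocess


def refresh_commands():
    """Commands to bring a checkout and its submodules up to date.

    A plain `git pull` updates the superproject only; the recorded
    submodule commits are checked out by `git submodule update`.
    """
    return [
        ["git", "pull", "--quiet"],
        ["git", "submodule", "update", "--init", "--recursive"],
    ]


def refresh(repo_dir, run=subprocess.check_call):
    """Run the refresh commands inside repo_dir.

    `run` is injectable so the sequence can be checked without a real
    repository; by default each command must exit 0 or an exception is raised.
    """
    for cmd in refresh_commands():
        run(cmd, cwd=repo_dir)
```

Calling `refresh("/path/to/checkout")` on a real clone would pull and then sync submodules; the path is a placeholder.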
[15:55:50] (03PS1) 10Dzahn: mysql: make lint ignore classes inhering from params [puppet] - 10https://gerrit.wikimedia.org/r/279131 [15:56:02] (03PS29) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [16:01:15] (03PS1) 10Alexandros Kosiaris: Revert "ores: Hardcode Host in the HTTPS redirect" [puppet] - 10https://gerrit.wikimedia.org/r/279133 (https://phabricator.wikimedia.org/T130618) [16:01:48] (03PS2) 10Dzahn: mysql: make lint ignore classes inhering from params [puppet] - 10https://gerrit.wikimedia.org/r/279131 [16:02:29] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Varnish support for shutting users out of a DC - https://phabricator.wikimedia.org/T129424#2144618 (10BBlack) 5Open>3Resolved [16:02:31] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2144619 (10BBlack) [16:02:41] (03PS9) 10Alexandros Kosiaris: Add the role::ores::redis class [puppet] - 10https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200) [16:02:47] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add the role::ores::redis class [puppet] - 10https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200) (owner: 10Alexandros Kosiaris) [16:02:47] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#1989606 (10BBlack) 5Open>3Resolved a:3BBlack [16:03:18] (03CR) 10Dzahn: [C: 032] "comments-only" [puppet] - 10https://gerrit.wikimedia.org/r/279131 (owner: 10Dzahn) [16:04:09] (03PS3) 10Dzahn: mysql: make lint ignore classes inhering from params [puppet] - 10https://gerrit.wikimedia.org/r/279131 [16:04:24] (03PS11) 10Alexandros Kosiaris: Apply the 
role::ores::redis class to oresdb100{1,2} [puppet] - 10https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562) [16:04:32] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Apply the role::ores::redis class to oresdb100{1,2} [puppet] - 10https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562) (owner: 10Alexandros Kosiaris) [16:06:32] (03PS2) 10Alexandros Kosiaris: Revert "ores: Hardcode Host in the HTTPS redirect" [puppet] - 10https://gerrit.wikimedia.org/r/279133 (https://phabricator.wikimedia.org/T130618) [16:06:38] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "ores: Hardcode Host in the HTTPS redirect" [puppet] - 10https://gerrit.wikimedia.org/r/279133 (https://phabricator.wikimedia.org/T130618) (owner: 10Alexandros Kosiaris) [16:10:32] (03PS4) 10Dzahn: mysql: make lint ignore classes inhering from params [puppet] - 10https://gerrit.wikimedia.org/r/279131 [16:11:21] (03PS30) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [16:13:29] (03PS6) 10Dzahn: puppet-lint: remove exception for "class_inherits_from_params_class" [puppet] - 10https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645) [16:18:04] (03CR) 10Dzahn: "after https://gerrit.wikimedia.org/r/#/c/279131/ this is now a SUCCESS" [puppet] - 10https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [16:20:01] (03PS7) 10Dzahn: puppet-lint: remove exception for "class_inherits_from_params_class" [puppet] - 10https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645) [16:20:45] (03CR) 10Dzahn: [C: 032] puppet-lint: remove exception for "class_inherits_from_params_class" [puppet] - 10https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [16:21:25] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 3 others: Should we have 
a specific check for SSL certificate expiration on elasticsearch - https://phabricator.wikimedia.org/T130366#2144665 (10Gehel) @faidon said: "yes we should", @Gehel will implement this. [16:24:11] (03PS1) 10Alexandros Kosiaris: apertium.svc.eqiad.wmnet: Point it to the LVS IP [dns] - 10https://gerrit.wikimedia.org/r/279137 [16:24:23] <_joe_> akosiaris: wat? [16:24:50] yup [16:25:13] _joe_: can you look at, https://gerrit.wikimedia.org/r/#/c/277463/9 again? I'm still working on solution to read config from cxserver/ repo [16:25:28] _joe_: 10% of original patch size. [16:25:44] (03CR) 10Alexandros Kosiaris: [C: 032] apertium.svc.eqiad.wmnet: Point it to the LVS IP [dns] - 10https://gerrit.wikimedia.org/r/279137 (owner: 10Alexandros Kosiaris) [16:25:49] akosiaris: still figuring out that :/ [16:28:59] (03PS1) 10RobH: adding star.tools.wmflabs.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/279138 [16:29:45] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 3 others: Should we have a specific check for SSL certificate expiration on elasticsearch - https://phabricator.wikimedia.org/T130366#2133879 (10Dzahn) @Gehel please also see T114059 and consider using check_ssl_http example: https://... [16:30:04] bd808: Dear anthropoid, the time has come. Please deploy MediaWiki ActionApi logging (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160323T1630). [16:30:04] bd808: A patch you scheduled for MediaWiki ActionApi logging is about to be deployed. Please be available during the process. 
[16:30:15] ohi jouncebot [16:30:24] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 3 others: Should we have a specific check for SSL certificate expiration on elasticsearch - https://phabricator.wikimedia.org/T130366#2144697 (10Dzahn) [16:30:26] 6Operations, 10Monitoring, 10Traffic, 7HTTPS, 7Icinga: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#2144698 (10Dzahn) [16:31:01] (03CR) 10RobH: [C: 032] adding star.tools.wmflabs.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/279138 (owner: 10RobH) [16:31:13] bah [16:31:14] (03PS2) 10RobH: adding star.tools.wmflabs.org.crt [puppet] - 10https://gerrit.wikimedia.org/r/279138 [16:31:16] (03PS1) 10Dereckson: Remove T44894 FIXME note [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279140 [16:31:18] rebase war =P [16:31:45] AaronSchulz: Hi. If removeDuplicates = true in jobs, will the earlier ones be discarded or the new one will not be inserted? [16:33:48] I don't think it's a "not inserted" thing [16:33:52] I think they're inserted and purged [16:36:02] Hmm s/inserted/not run/ might have been a better question [16:36:27] err "run", not "not run" [16:37:26] no wikibugs bot [16:37:34] 6Operations, 10Trebuchet: pmtpa remnants in trebuchet redis - https://phabricator.wikimedia.org/T111301#2144728 (10Dzahn) close as won't fix because trebuchet is not used anymore now?? 
[16:37:35] shinken-wm says tools-home down [16:37:39] ah, there it is [16:42:53] (03PS2) 10BryanDavis: Logging: add ApiAction kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278347 (https://phabricator.wikimedia.org/T108618) [16:46:41] (03PS1) 10Dereckson: Remove not needed FIXME statement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279142 [16:50:27] (03PS1) 10Ladsgroup: Flake8 on utils [puppet] - 10https://gerrit.wikimedia.org/r/279143 [16:51:22] !log bd808@tin Synchronized php-1.27.0-wmf.17/includes/api/ApiMain.php: Rename ApiRequest to ApiAction (4dc12de) (duration: 00m 47s) [16:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:27] (03CR) 10BryanDavis: [C: 032] Logging: add ApiAction kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278347 (https://phabricator.wikimedia.org/T108618) (owner: 10BryanDavis) [16:52:52] (03Merged) 10jenkins-bot: Logging: add ApiAction kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278347 (https://phabricator.wikimedia.org/T108618) (owner: 10BryanDavis) [16:56:02] !log bd808@tin Synchronized wmf-config/event-schemas: Logging: add ApiAction kafka logging (34f236c) (duration: 00m 31s) [16:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:30] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: Logging: add ApiAction kafka logging (34f236c) (T108618) (duration: 00m 28s) [16:57:31] T108618: Publish detailed Action API request information to Hadoop - https://phabricator.wikimedia.org/T108618 [16:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:38] (03PS1) 10Ladsgroup: Flake8 on rolematcher [puppet] - 10https://gerrit.wikimedia.org/r/279148 [17:00:01] I created a huge spike of "No such file or directory in /srv/mediawiki/wmf-config/InitialiseSetting [17:00:01] s.php on line 4462" when I pushed the submodule bump for the 
avro schemas [17:00:29] I think it should die off but I'm watching the fatalmonitor to confirm [17:00:31] (03PS1) 10Alexandros Kosiaris: ores: Firewall off the redis boxes [puppet] - 10https://gerrit.wikimedia.org/r/279149 [17:02:54] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: touched (duration: 00m 25s) [17:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:06] 6Operations, 10Trebuchet: pmtpa remnants in trebuchet redis - https://phabricator.wikimedia.org/T111301#1600707 (10greg) Trebuchet still used (though officially deprecated from RelEng's side), with active migration to scap3 is underway. [17:07:26] I'm still seeing the missing file exceptions trickle in from hhvm but I'm not seeing repeats from the same MW servers so I think this is just rsyslog buffering [17:08:07] even when it's buffering, the times are accurate [17:08:12] they are super hot on fatalmonitor because the lines coming in look like "message repeated 1781 times: ..." [17:08:41] hoo: ah good point. the times are "Mar 23 16:57:*" [17:13:10] ok. I'm all done on tin. 
[17:14:18] !log Cancelling offline scrubs on restbase2004.codfw.wmnet : T130254 [17:14:19] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [17:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:24] !log Disabling puppet on restbase2004.codfw.wmnet to override compactor concurrency : T130254 [17:16:24] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [17:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:27] !log Starting Cassandra on restbase2004.codfw.wmnet : T130254 [17:18:28] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [17:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:24] how can I have my user account on tin include the phabricator-roots group? that group seems to only exist on iridium but it needs to exist on tin for keyholder agent to sign my requests [17:20:50] I don't see any puppet rules that apply to my user account so I assume it must be in ldap? 
[17:21:03] !log Disabling gossip and binary transport on restbase2004.codfw.wmnet : T130254 [17:21:04] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [17:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:09] !log Starting scrub of parsoid_html on restbase2004.codfw.wmnet : T130254 [17:24:10] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [17:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:45] (03PS1) 10Milimetric: Moving /analytics to /other/analytics [puppet] - 10https://gerrit.wikimedia.org/r/279151 [17:30:41] twentyafterfour: think you'll have to add phabricator-roots here: https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/deployment/server.yaml#L1 [17:30:44] (03CR) 10Elukey: "Thanks for the review! The next update will contain all the fixes." (0318 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [17:31:24] (03CR) 10ArielGlenn: "Do you want to add a rewrite for nginx? puppet/modules/dumps/templates/nginx.dumps.conf.erb if you want to be able to publish nice dumps.w" [puppet] - 10https://gerrit.wikimedia.org/r/279151 (owner: 10Milimetric) [17:31:29] twentyafterfour: but you'll probably need to step the permissions of that group back a bit to do so. [17:33:22] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Firewall off the redis boxes [puppet] - 10https://gerrit.wikimedia.org/r/279149 (owner: 10Alexandros Kosiaris) [17:33:49] (03CR) 10Milimetric: "You mean to keep /analytics redirecting to /other/analytics?" 
[puppet] - 10https://gerrit.wikimedia.org/r/279151 (owner: 10Milimetric) [17:34:40] (03CR) 10Alex Monk: [C: 04-1] "Should just add a link to T91534" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279142 (owner: 10Dereckson) [17:35:38] (03CR) 10ArielGlenn: "Yes, rewrite /analytics to other/analytics and then you don't have to change any links. Just move the files including the index page to th" [puppet] - 10https://gerrit.wikimedia.org/r/279151 (owner: 10Milimetric) [17:36:20] !log Increasing compactionthroughput to 100MB/s on restbase2004.codfw.wmnet : T130254 [17:36:22] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [17:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:26] (03PS2) 10Milimetric: Moving /analytics to /other/analytics [puppet] - 10https://gerrit.wikimedia.org/r/279151 [17:40:23] (03CR) 10Milimetric: "How's that?" [puppet] - 10https://gerrit.wikimedia.org/r/279151 (owner: 10Milimetric) [17:43:13] PROBLEM - puppet last run on oresrdb1002 is CRITICAL: CRITICAL: puppet fail [17:44:41] (03CR) 10ArielGlenn: "I thought you wanted to use /analytics/ urls (i.e. 
links in other/index.html and public/index.html to go to /analytics and not /other/anal" [puppet] - 10https://gerrit.wikimedia.org/r/279151 (owner: 10Milimetric) [17:46:37] (03CR) 10Milimetric: "Yep, I don't care about where the links go, the only thing that was linked in the email was /analytics and the rewrite is just there to ma" [puppet] - 10https://gerrit.wikimedia.org/r/279151 (owner: 10Milimetric) [17:46:51] (03CR) 10Milimetric: "If you're ok with it, feel free to merge" [puppet] - 10https://gerrit.wikimedia.org/r/279151 (owner: 10Milimetric) [17:47:19] (03PS3) 10ArielGlenn: Moving /analytics to /other/analytics [puppet] - 10https://gerrit.wikimedia.org/r/279151 (owner: 10Milimetric) [17:48:25] !log Increasing compactionthroughput to 120MB/s on restbase2004.codfw.wmnet : T130254 [17:48:26] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [17:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:24] (03CR) 10ArielGlenn: [C: 032] Moving /analytics to /other/analytics [puppet] - 10https://gerrit.wikimedia.org/r/279151 (owner: 10Milimetric) [17:50:25] thx apergos [17:50:44] gimme a sec it's not live yet [17:50:48] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: puppet fail [17:51:51] ok [17:52:10] milimetric: jsut verify that it all looks ok to you (lgtm) [17:52:46] looks good apergos [17:52:52] great thx [17:53:36] apergos: i don't have the rights, but I guess you can delete the publicdir/analytics and publicdir/analytics/index.html now [17:53:43] yep [17:54:12] !log Removing old heap dumps on restbase2004.codfw.wmnet : T130254 [17:54:12] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [17:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:54:28] <_joe_> urandom: any news on rb2004? 
[17:54:44] <_joe_> I hoped since you're removing the heap dumps :) [17:55:30] _joe_: i suspect there is some corrupt data, so i'm doing a 'scrub', a process that would remove it [17:55:42] _joe_: i tried it offline, but it's slow, very slow [17:55:44] <_joe_> cool [17:56:04] so i brought it up, but disabled, so that i can do the online version, but without the normal workload [17:56:24] tweaked to just churn through the data as fast as i can, load be damned :) [17:56:41] so, we'll see [17:59:08] PROBLEM - MariaDB Slave Lag: s3 on db1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 24580.13 seconds [17:59:17] <_joe_> uh [17:59:24] mmmh... [17:59:27] and as usual, the 6 hour downtime was not enough [17:59:39] because meetings [18:00:02] it is depooled, nothing to see here [18:00:05] <_joe_> ok :) [18:00:20] <_joe_> yeah I saw on tendril it's up since 8 minutes [18:01:02] jynus: https://www.youtube.com/watch?v=7DG97dAVZns [18:01:05] https://i.imgflip.com/11bd9g.jpg [18:01:31] gwicke, subbu|sos, labs instance lintbridge has a full disk and has been broken for quite some time. May I delete it?
[18:01:35] (03PS1) 10Gehel: Adding an Icinga check for SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/279154 [18:01:54] I will disable the check so that the recovery does not ping again [18:02:08] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2145278 (10RobH) [18:02:19] (03PS2) 10Gehel: Adding an Icinga check for SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) [18:02:25] (03CR) 10Alexandros Kosiaris: [C: 032] Flake8 on utils [puppet] - 10https://gerrit.wikimedia.org/r/279143 (owner: 10Ladsgroup) [18:02:31] (03PS2) 10Alexandros Kosiaris: Flake8 on utils [puppet] - 10https://gerrit.wikimedia.org/r/279143 (owner: 10Ladsgroup) [18:04:12] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2086645 (10RobH) a:5RobH>3mark The systems that can be used for this were ordered today on T130738. I'm now assigning this task to @Mark.Otaris @Mark: Please review the above request, you... [18:04:14] cscott, RoanKattouw, same question: May I delete labs instance 'lintbridge'? [18:05:00] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2145299 (10Cmjohnson) @fgiunchedi restbase1014 is ready for you. I cannot do restbase1015 until both restabse1005/1006 are taken offline. I do not have enough disks to do both. [18:05:04] andrewbogott: I'm pretty sure the answer is going to be 'yes', but best to wait for subbu|sos [18:05:42] I'm lost and probably wrong, so I need a second pair of eyes on T127014. It seems to me that some WDQS responses are dropped by Varnish (see phab task for example). 
[18:05:42] T127014: Empty result on a tree query - https://phabricator.wikimedia.org/T127014 [18:05:49] !log Incresing compactionthroughput to 200MB/s on restbase2004.codfw.wmnet : T130254 [18:05:50] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [18:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:05:56] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2145301 (10RobH) The systems that can be used for this were ordered today on T130738. I'm now assigning this task to @Mark.Otaris @Mark: Please review the above... [18:06:13] Not sure how to investigate further... [18:08:53] gwicke, subbu|sos, I'd also like to delete 'towtruck' since it's been shut down for a while [18:12:19] 6Operations, 10hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2145327 (10RobH) [18:12:55] !log Removing compaction throughput throttling on restbase2004.codfw.wmnet : T130254 [18:12:55] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [18:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:07] (03PS3) 10Gehel: Adding an Icinga check for SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) [18:13:14] (03PS15) 10Elukey: Varnish 4 API porting. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [18:13:20] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2145344 (10RobH) Sorry about that @ottomata, this was assigned to @mark so I missed your question. System WMF4541 is an older spare pool system with... 
[18:14:30] andrewbogott, yes for lintbridge. towtruck is cscott's domain. [18:14:31] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2145347 (10RobH) [18:14:48] subbu|lunch: thanks! [18:15:51] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2145365 (10mark) Approved. [18:16:58] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2145374 (10RobH) a:5mark>3RobH [18:17:52] ori, I can't reach instance mc2002 — any idea what's going on there? [18:19:22] hm, same for mdc2001 [18:20:02] 6Operations, 10hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2145408 (10Cmjohnson) @robh @mark We need to decide on what disks to buy for them? [18:23:15] 6Operations, 10hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2145422 (10RobH) I know you sent half of the Dell Constellation ES.2 (2.5)ST9250610NS to CODFW, but we could put at least two of each into these. Spares shows 25 left, so 12 could go for this. Th... [18:27:17] 6Operations, 10hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2145428 (10Cmjohnson) I have 14 of the 500GB SATA spares...these are a popular size disks and would rather leave them as spares. Also, I would imagine we would want higher capacity disks for the... [18:28:13] cscott: towtruck? 
[18:28:15] (03PS1) 10Gehel: Increasing between_bytes_timeout for wikidata query service [puppet] - 10https://gerrit.wikimedia.org/r/279157 (https://phabricator.wikimedia.org/T127014) [18:28:26] (03PS1) 10MaxSem: Add channel for slow diff logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279158 [18:34:08] gerrit's web UI is going very slow and throwing random errors at times, known? [18:34:30] gerrit is not working for me [18:34:36] It is poping up The website cannot display the page [18:34:49] works now [18:36:00] (03PS1) 10Catrope: Turn the cross-wiki beta feature back off on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279161 [18:36:33] Hi it seems https://integration.wikimedia.org/zuul/ has crashed because someone forced merged some where. [18:37:27] (03CR) 10Catrope: [C: 032] Turn the cross-wiki beta feature back off on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279161 (owner: 10Catrope) [18:38:43] (03PS2) 10Catrope: Turn the cross-wiki notifications beta feature back off on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279161 [18:38:45] (03CR) 10Jforrester: "Tsk, Timo test. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279161 (owner: 10Catrope) [18:39:11] (03CR) 10Catrope: [C: 032] Turn the cross-wiki notifications beta feature back off on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279161 (owner: 10Catrope) [18:40:01] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2145490 (10Ottomata) Thanks! 
[18:47:00] PROBLEM - puppet last run on mw2015 is CRITICAL: CRITICAL: puppet fail [18:48:15] (03CR) 10Smalyshev: [C: 031] Increasing between_bytes_timeout for wikidata query service [puppet] - 10https://gerrit.wikimedia.org/r/279157 (https://phabricator.wikimedia.org/T127014) (owner: 10Gehel) [18:49:41] (03PS4) 10Dzahn: elasticsarch: add Icinga check for SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) (owner: 10Gehel) [18:51:45] 6Operations, 6Discovery, 10Wikimedia-Logstash, 3Discovery-Search-Sprint, and 2 others: logstash - nginx failed service start - https://phabricator.wikimedia.org/T129934#2145548 (10Deskana) 5Open>3Resolved p:5Triage>3Normal [18:54:29] (03CR) 10Dzahn: "the dependency with "} -> " directly between classes like that is not a common style in our repo but that doesn't mean it's wrong. the com" [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) (owner: 10Gehel) [18:55:03] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2145564 (10RobH) Just to update the public task, we have quotes back from one of our two hardware vendors. Once we have the other back (expected today/tomorrow), th... [18:55:23] (03CR) 10Catrope: [C: 032] Turn the cross-wiki notifications beta feature back off on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279161 (owner: 10Catrope) [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160323T1900). [19:00:37] doing. 
[19:01:46] (03PS5) 10Dzahn: elasticsarch: add Icinga check for SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) (owner: 10Gehel) [19:01:52] !log restarting zuul [19:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:35] (03PS1) 10Thcipriani: group1 wikis to 1.27.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279169 [19:03:56] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.27.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279169 (owner: 10Thcipriani) [19:05:51] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.18 [19:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:53] (03PS1) 10Alexandros Kosiaris: Fix the ores hieradata for ferm [puppet] - 10https://gerrit.wikimedia.org/r/279171 [19:08:10] (03CR) 10Catrope: [C: 032] Turn the cross-wiki notifications beta feature back off on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279161 (owner: 10Catrope) [19:08:16] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix the ores hieradata for ferm [puppet] - 10https://gerrit.wikimedia.org/r/279171 (owner: 10Alexandros Kosiaris) [19:08:20] (03PS2) 10Alexandros Kosiaris: Fix the ores hieradata for ferm [puppet] - 10https://gerrit.wikimedia.org/r/279171 [19:08:24] (03CR) 10Alexandros Kosiaris: [V: 032] Fix the ores hieradata for ferm [puppet] - 10https://gerrit.wikimedia.org/r/279171 (owner: 10Alexandros Kosiaris) [19:08:41] (03Merged) 10jenkins-bot: Turn the cross-wiki notifications beta feature back off on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279161 (owner: 10Catrope) [19:10:10] RECOVERY - puppet last run on oresrdb1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [19:10:39] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [19:13:33] (03PS16) 10BBlack: Varnish 4 API porting. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [19:13:35] (03PS1) 10BBlack: Code format pre-cleanup [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/279173 [19:15:07] 6Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2145638 (10fgiunchedi) [19:16:09] RECOVERY - puppet last run on mw2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:16:23] (03PS6) 10Dzahn: elasticsarch: add Icinga check for SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) (owner: 10Gehel) [19:16:50] (03CR) 10Gehel: "Isn't the absence of change due to exported resources? I would expect the puppet-compiler to report the change on the elasticsearch server" [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) (owner: 10Gehel) [19:17:13] 6Operations: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2145652 (10yuvipanda) [19:19:03] 6Operations: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2145685 (10Ottomata) +1, analytics1017 and analytics1021. analytics1017 never came up after a reinstall a few months ago, but I haven't followed up. Should probably reinstall a... [19:19:12] (03PS1) 10Catrope: Enable Echo footer notice in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279174 [19:19:13] (03PS1) 10Catrope: Add plumbing for $wgEchoShowFooterNotice to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279175 [19:21:18] (03CR) 10Dzahn: "@Gehel yes, looks like it. i just did PS5/PS6 to test a bit if that makes a difference. 
going back to your original version." [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) (owner: 10Gehel) [19:21:40] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 3 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:22:26] (03CR) 10Catrope: [C: 032] Enable Echo footer notice in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279174 (owner: 10Catrope) [19:22:51] (03Merged) 10jenkins-bot: Enable Echo footer notice in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279174 (owner: 10Catrope) [19:23:28] 6Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2145638 (10Cmjohnson) We do not have anything larger than 300GB Intel SSDs on-site. [19:24:14] (03PS7) 10Dzahn: elasticsarch: add Icinga check for SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) (owner: 10Gehel) [19:25:06] (03CR) 10Dzahn: [C: 031] elasticsarch: add Icinga check for SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) (owner: 10Gehel) [19:25:14] !log rebooting labvirt1008 [19:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:04] mutante: about the "->" operator in puppet, is there a specific reason not to use it? Or just a habit? [19:26:47] (03CR) 10Dzahn: "sorry,was just testing a bit, and edited the message. this (PS7) should be just like PS3 all gehel's work. should be fine to merge" [puppet] - 10https://gerrit.wikimedia.org/r/279154 (https://phabricator.wikimedia.org/T130366) (owner: 10Gehel) [19:26:49] (03CR) 10BBlack: Varnish 4 API porting. 
(031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [19:27:49] !log ori@tin Synchronized php-1.27.0-wmf.18/includes/Revision.php: I77575d6d0ea: Request-local caching of revision text (duration: 00m 28s) [19:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:08] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:29:30] PROBLEM - Host labvirt1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:08] RECOVERY - Host labvirt1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:34:33] gehel: no, there isn't (afaik). it's just habit. there is more use of "notify/require" than the arrow syntax (because it existed first?) and we sometimes use this and sometimes that [19:35:18] (03PS2) 10BBlack: Increasing between_bytes_timeout for wikidata query service [puppet] - 10https://gerrit.wikimedia.org/r/279157 (https://phabricator.wikimedia.org/T127014) (owner: 10Gehel) [19:35:28] (03CR) 10BBlack: [C: 032 V: 032] Increasing between_bytes_timeout for wikidata query service [puppet] - 10https://gerrit.wikimedia.org/r/279157 (https://phabricator.wikimedia.org/T127014) (owner: 10Gehel) [19:35:50] i just haven't see monitoring::service being chained like that usually, it does make sense [19:36:09] !log rebooting labvirt1009 [19:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:36:23] you should just go ahead and merge that and see on neon [19:38:26] !log rebooting labvirt1011 [19:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:49] PROBLEM - Host labvirt1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:42:26] ssh: connect to host bastion.wmflabs.org port 22: Network is unreachable [19:42:39] RECOVERY - Host labvirt1009 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [19:42:42] is something going 
on there? [19:43:30] (03PS1) 10Dzahn: puppet-lint: rm exception for inherits across namespaces [puppet] - 10https://gerrit.wikimedia.org/r/279178 (https://phabricator.wikimedia.org/T93645) [19:43:45] chasemp: ^ labvirt reboots related to bastion.wmflabs down ? [19:43:59] most likely [19:44:04] SMalyshev: that [19:44:18] labs servers are being rebooted [19:44:27] mutante: ah, ok. didn't know labvirt and bastion are related [19:44:28] (03CR) 10jenkins-bot: [V: 04-1] puppet-lint: rm exception for inherits across namespaces [puppet] - 10https://gerrit.wikimedia.org/r/279178 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [19:49:59] (03PS1) 10Dzahn: mha: add FIXME/lint-ignore for inherit across namespaces [puppet] - 10https://gerrit.wikimedia.org/r/279182 [19:50:14] (03PS31) 10BBlack: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) (owner: 10Ema) [19:50:38] (03PS2) 10Dzahn: mha: add FIXME/lint-ignore for inherit across namespaces [puppet] - 10https://gerrit.wikimedia.org/r/279182 [19:51:16] (03CR) 10Dzahn: [C: 032] "comments-only" [puppet] - 10https://gerrit.wikimedia.org/r/279182 (owner: 10Dzahn) [19:52:02] (03CR) 10BBlack: [C: 031] "I think we're good to carefully merge here" [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) (owner: 10Ema) [19:53:08] (03PS2) 10Dzahn: puppet-lint: rm exception for inherits across namespaces [puppet] - 10https://gerrit.wikimedia.org/r/279178 (https://phabricator.wikimedia.org/T93645) [19:56:56] !log rebooting labvirt1001 [19:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160323T2000). Please do the needful. 
[20:00:50] PROBLEM - Host ores.wmflabs.org is DOWN: PING CRITICAL - Packet loss = 100% [20:01:28] PROBLEM - Host paws.wmflabs.org is DOWN: PING CRITICAL - Packet loss = 100% [20:01:38] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:27] !log starting parsoid deploy [20:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:09] RECOVERY - Host labvirt1001 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [20:09:39] bblack: thanks for the fast merge of timeouts on WDQS! [20:10:36] !log synced code. restarted parsoid on wtp1002 (~4 minutes back) as a canary [20:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:10:59] RECOVERY - Host ores.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [20:11:28] RECOVERY - Host paws.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 2.25 ms [20:14:50] !log finished deploying parsoid version 5538d868 [20:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:32] !log reboot labvirt1002 [20:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:20:24] for anyone keeping track, i'll be starting a mobileapps deployment in a few minutes [20:20:58] PROBLEM - Host labvirt1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:25:20] RECOVERY - Host labvirt1002 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [20:29:45] !log starting mobileapps deployment [20:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:07] (03PS1) 10Andrew Bogott: Horizon: Update session config [puppet] - 10https://gerrit.wikimedia.org/r/279186 (https://phabricator.wikimedia.org/T130621) [20:39:28] (03CR) 10Andrew Bogott: "experimental -- currently in place on californium" [puppet] - 10https://gerrit.wikimedia.org/r/279186 (https://phabricator.wikimedia.org/T130621) (owner: 10Andrew Bogott) [20:40:36] !log found an 
issue with the mobileapps deployment, reverting to 85856f7 [20:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:43:50] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:47:19] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [20:53:45] !log reboot labvirt1003 [20:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:18] PROBLEM - Host www.toolserver.org is DOWN: CRITICAL - Host Unreachable (www.toolserver.org) [20:58:09] PROBLEM - Host labvirt1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:29] RECOVERY - Host labvirt1003 is UP: PING OK - Packet loss = 0%, RTA = 2.17 ms [21:08:03] chasemp: Are you restarting all servers [21:08:42] we are yes [21:08:50] Ok thanks for replying [21:09:27] chasemp: Will wikipedia be restarted [21:10:32] paladox: we are restarting labs servers so...no, that question is non sequitur either way but no don't worry about it unless your doing something in labs [21:10:53] chasemp: Oh ok thanks for replying [21:11:19] RECOVERY - Host www.toolserver.org is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [21:16:25] (03PS1) 1020after4: WIP: Hieraize keyholder::agent configuration [puppet] - 10https://gerrit.wikimedia.org/r/279198 [21:20:21] !log reboot labvirt1004 [21:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:28] PROBLEM - Host labvirt1004 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:26] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [21:25:47] ok so that's part of reboots^ sorry [21:26:07] ok, np [21:26:09] RECOVERY - Host labvirt1004 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [21:30:33] !log depooling cp3032 - T125485 [21:34:46] !log depooling cp3033 - T125485 [21:38:30] !log depooling cp3042 - T125485 [21:38:31] T125485: esams cache cluster 
re-arrangements, early 2016 - https://phabricator.wikimedia.org/T125485 [21:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:25] !log depooling cp3043 - T125485 [21:41:26] T125485: esams cache cluster re-arrangements, early 2016 - https://phabricator.wikimedia.org/T125485 [21:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:43:56] !log rebooting labvirt1005 [21:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:46:09] (03PS4) 10Muehlenhoff: Set exim environment for labs instances [puppet] - 10https://gerrit.wikimedia.org/r/278899 [21:46:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Set exim environment for labs instances [puppet] - 10https://gerrit.wikimedia.org/r/278899 (owner: 10Muehlenhoff) [21:48:28] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [21:50:12] (03PS1) 10BBlack: cp30[34][23] - esams upload->text re-role [puppet] - 10https://gerrit.wikimedia.org/r/279255 (https://phabricator.wikimedia.org/T125485) [21:50:43] (03Draft1) 10Dereckson: Namespace configuration for ne.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279254 (https://phabricator.wikimedia.org/T129768) [21:51:56] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 805579 bytes in 2.973 second response time [21:52:08] RECOVERY - Host labvirt1005 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [21:59:23] !log rebooting labvirt1006 [21:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:59:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [22:00:14] (03PS1) 10Dereckson: Import sources for ne.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279260 (https://phabricator.wikimedia.org/T129831)