[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T0000).
[00:04:48] (CR) Jforrester: [C: 1] Remove wikimania2015wiki from wmgCentralAuthLoginIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/369814 (owner: Reedy)
[00:05:03] (CR) Jforrester: [C: 1] phpcs for refresh-dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/369808 (owner: Reedy)
[00:05:31] !log rsyncing /srv/repos from iridium to phab1001 (T163938)
[00:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:41] T163938: replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938
[00:10:36] PROBLEM - Check whether ferm is active by checking the default input chain on phab1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[00:11:53] !log iridium restarted ferm - rsync fragment was in config but not applied, breaking data rsync to phab1001
[00:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:19] !log phab1001 starting ferm service
[00:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:36] RECOVERY - Check whether ferm is active by checking the default input chain on phab1001 is OK: OK ferm input default policy is set
[00:17:45] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[00:18:13] (PS1) Catrope: Remove all hacky overrides for ORES in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/369816
[00:19:35] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.84 ms
[00:20:35] PROBLEM - cassandra-b service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[00:20:36] PROBLEM - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect
or SSL handshake:Connection refused
[00:20:55] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:20:56] PROBLEM - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.168 and port 9042: Connection refused
[00:21:05] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[00:23:35] RECOVERY - cassandra-b service on restbase-dev1004 is OK: OK - cassandra-b is active
[00:23:53] (CR) Awight: [C: 2] "Thanks for thinking of this!" [mediawiki-config] - https://gerrit.wikimedia.org/r/369816 (owner: Catrope)
[00:23:55] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational
[00:25:19] (Merged) jenkins-bot: Remove all hacky overrides for ORES in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/369816 (owner: Catrope)
[00:25:45] RECOVERY - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-b valid until 2018-07-20 15:08:05 +0000 (expires in 351 days)
[00:25:54] awight: Still needs deploying ;)
[00:25:56] RECOVERY - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is OK: TCP OK - 0.001 second response time on 10.64.0.168 port 9042
[00:26:13] Reedy: oops!
argh I forgot how -config works
[00:26:21] * awight reads calendar
[00:26:24] jouncebot: next
[00:26:24] In 12 hour(s) and 33 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T1300)
[00:26:26] jouncebot: now
[00:26:26] For the next 0 hour(s) and 33 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T0000)
[00:26:30] !log scheduled downtime for phabricator migration
[00:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:44] awight: you should be fine to jfdi as it's just labs so consistency on tin
[00:27:11] Reedy: thanks for the ping, I should almost revert it…
[00:27:19] nah, don't bother
[00:28:00] * Reedy deploys
[00:28:45] !log reedy@tin Synchronized wmf-config/InitialiseSettings-labs.php: Simply ores (duration: 00m 47s)
[00:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:58] jouncebot: reload
[00:29:11] jouncebot: refresh
[00:29:13] I refreshed my knowledge about deployments.
[00:29:16] gj
[00:29:20] * awight cheers
[00:30:49] (PS1) Thcipriani: keyholder: public keys publicly readable [puppet] - https://gerrit.wikimedia.org/r/369817 (https://phabricator.wikimedia.org/T172333)
[00:34:33] (CR) jenkins-bot: Remove all hacky overrides for ORES in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/369816 (owner: Catrope)
[01:07:01] (CR) Smalyshev: [C: 1] "would be nice to see what is misparsed... I hope I didn't break anything in may last change..."
[puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[01:18:13] (CR) BryanDavis: "https://grokdebug.herokuapp.com/ can be very helpful for debugging" [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[01:21:58] (PS1) Krinkle: wmf-config: Improve docs in CommonSettings.php and LocalSettings.php header [mediawiki-config] - https://gerrit.wikimedia.org/r/369818
[01:26:10] * Krinkle likes Reedy's simplified version of the word simplified.
[01:40:34] !log repositories synced for phabricator migration.
[01:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:38] (PS1) Dzahn: cache::misc/phabricator: switch from iridium to phab1001 backend [puppet] - https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938)
[01:46:49] (CR) 20after4: [C: 1] cache::misc/phabricator: switch from iridium to phab1001 backend [puppet] - https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[01:48:45] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3495944 (mmodell)
[01:48:49] we will also need to copy https://gerrit.wikimedia.org/r/#/c/369001/6/hieradata/hosts/iridium.yaml into phab1001.yaml changing the ips that is for iridium :)
[01:49:01] (PS1) Dzahn: cache::misc/phabricator: add director for phabricator-new, staging [puppet] - https://gerrit.wikimedia.org/r/369821 (https://phabricator.wikimedia.org/T163938)
[01:49:07] (PS2) Krinkle: wmf-config: Improve docs in CommonSettings.php and LocalSettings.php header [mediawiki-config] - https://gerrit.wikimedia.org/r/369818
[01:49:25] paladox: indeed, I think so
[01:49:31] yep :)
[01:49:40] i found one ip for iridium in there
[01:49:58] the rest seem to be for git-ssh / phab1001 git-ssh
[01:50:01] mutante: ^ will we reuse the old IP from iridium for
git-ssh?
[01:50:24] or is the IP non-portable across machines?
[01:50:51] we will reuse it
[01:50:57] (PS2) Dzahn: cache::misc/phabricator: add director for phabricator-new, staging [puppet] - https://gerrit.wikimedia.org/r/369821 (https://phabricator.wikimedia.org/T163938)
[01:53:03] (CR) 20after4: [C: 1] cache::misc/phabricator: add director for phabricator-new, staging [puppet] - https://gerrit.wikimedia.org/r/369821 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[01:54:24] (CR) Paladox: [C: 1] cache::misc/phabricator: add director for phabricator-new, staging [puppet] - https://gerrit.wikimedia.org/r/369821 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[01:59:59] (CR) Reedy: [C: 1] wmf-config: Improve docs in CommonSettings.php and LocalSettings.php header [mediawiki-config] - https://gerrit.wikimedia.org/r/369818 (owner: Krinkle)
[02:09:37] (CR) Paladox: [C: 1] cache::misc/phabricator: switch from iridium to phab1001 backend [puppet] - https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[02:11:21] twentyafterfour and mutante before i go, just wanted to say the ip in profile::base::ssh_server_settings is iridium so that needs to be changed to phab1001 when copying it over. Also the other ips seem to either be phab1001 (git-ssh) and git-ssh. we will need to redirect phab1001-vcs.eqiad.wmnet to the new host too :)
[02:11:23] (PS1) Dzahn: phabricator: set phab_domain to phabricator-new for phab1001 [puppet] - https://gerrit.wikimedia.org/r/369823 (https://phabricator.wikimedia.org/T163938)
[02:11:38] (CR) Paladox: [C: 1] phabricator: set phab_domain to phabricator-new for phab1001 [puppet] - https://gerrit.wikimedia.org/r/369823 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[02:12:12] paladox: thanks, I think mutante is on top of it
[02:12:20] ok :)
[02:12:32] i had problems connecting..
finally it works
[02:12:37] :)
[02:12:38] going to go ahead now
[02:12:42] :)
[02:12:56] changed locations again for better wifi
[02:13:11] (PS3) Dzahn: cache::misc/phabricator: add director for phabricator-new, staging [puppet] - https://gerrit.wikimedia.org/r/369821 (https://phabricator.wikimedia.org/T163938)
[02:14:20] (CR) Dzahn: [V: 2 C: 2] cache::misc/phabricator: add director for phabricator-new, staging [puppet] - https://gerrit.wikimedia.org/r/369821 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[02:16:03] (PS2) Dzahn: phabricator: set phab_domain to phabricator-new for phab1001 [puppet] - https://gerrit.wikimedia.org/r/369823 (https://phabricator.wikimedia.org/T163938)
[02:16:09] runs puppet on cache::misc hosts
[02:18:33] oh oh
[02:18:36] the mysql grants
[02:19:26] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[02:19:26] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[02:21:05] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[02:21:15] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file]
[02:22:43] (PS1) Dzahn: Revert "cache::misc/phabricator: add director for phabricator-new, staging" [puppet] - https://gerrit.wikimedia.org/r/369825
[02:23:52] (CR) Dzahn: [V: 2 C: 2] Revert "cache::misc/phabricator: add director for phabricator-new, staging" [puppet] - https://gerrit.wikimedia.org/r/369825 (owner: Dzahn)
[02:25:16] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[02:27:35] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[02:28:35] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[02:29:05] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[02:33:48] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.11) (duration: 07m 53s)
[02:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:58] Operations, Traffic, Community-Liaisons (Jul-Sep 2017), User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3496011 (Johan) Sounds like a plan. Additionally, this should be mentioned in...
[02:45:05] Operations, Traffic, Patch-For-Review, User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3496013 (Johan)
[02:45:06] (PS1) Dzahn: cache::misc/phabricator: add director for phab-new [puppet] - https://gerrit.wikimedia.org/r/369829 (https://phabricator.wikimedia.org/T163938)
[02:50:56] (PS2) Dzahn: cache::misc/phabricator: add director for phab-new [puppet] - https://gerrit.wikimedia.org/r/369829 (https://phabricator.wikimedia.org/T163938)
[02:53:30] (CR) 20after4: [C: 1] cache::misc/phabricator: add director for phab-new [puppet] - https://gerrit.wikimedia.org/r/369829 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[02:53:42] (CR) Dzahn: [C: 2] cache::misc/phabricator: add director for phab-new [puppet] - https://gerrit.wikimedia.org/r/369829 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[03:02:04] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.12) (duration: 05m 47s)
[03:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:59] (PS3) Dzahn: phabricator: set phab_domain to phabricator-new for phab1001 [puppet] - https://gerrit.wikimedia.org/r/369823 (https://phabricator.wikimedia.org/T163938)
[03:06:34] (CR) Dzahn: [C: 2] phabricator: set phab_domain to phabricator-new for phab1001 [puppet] - https://gerrit.wikimedia.org/r/369823 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[03:09:22] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Aug 3 03:09:22 UTC 2017 (duration 7m 18s)
[03:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:06] (CR) Smalyshev: [C: 1] "Thanks Bryan! Pattern validates fine on debug though. Something else must be missing."
[puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[03:25:02] (PS1) Dzahn: datasets/phabricator: allow phab1001 as rsync host for dumps [puppet] - https://gerrit.wikimedia.org/r/369831 (https://phabricator.wikimedia.org/T163938)
[03:26:25] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 872.14 seconds
[03:28:37] (CR) Dzahn: [C: 2] datasets/phabricator: allow phab1001 as rsync host for dumps [puppet] - https://gerrit.wikimedia.org/r/369831 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[03:32:26] (PS1) Dzahn: mariadb/phabricator: update GRANTS from iridium to phab1001 [puppet] - https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938)
[03:41:22] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3496059 (mmodell) I'm making this comment from `phab1001.eqiad.wmnet` :)
[03:42:57] (PS1) Andrew Bogott: vmbuilder: rm /etc/init.d/puppet in image [puppet] - https://gerrit.wikimedia.org/r/369833 (https://phabricator.wikimedia.org/T172064)
[03:44:09] (CR) Andrew Bogott: [C: 2] vmbuilder: rm /etc/init.d/puppet in image [puppet] - https://gerrit.wikimedia.org/r/369833 (https://phabricator.wikimedia.org/T172064) (owner: Andrew Bogott)
[04:01:16] Operations, Deployment-Systems, Performance-Team, HHVM, Release-Engineering-Team (Watching / External): Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#3496076 (Krinkle)
[04:01:39] Operations, DBA, MediaWiki-extensions-WikibaseClient, Performance-Team, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3496077 (Krinkle)
[04:03:47] (PS1) Dzahn: exim/phabricator: send mail to phab1001, not iridium
[puppet] - https://gerrit.wikimedia.org/r/369834 (https://phabricator.wikimedia.org/T163938)
[04:04:24] Operations, Performance-Team, monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049036 (Krinkle)
[04:04:28] Operations, Performance-Team: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#3496081 (Krinkle) Open>stalled p:Normal>Low a:Krinkle>None The current thinking at T158837 might result in the removal of the coal and coal-web services. Setting this task...
[04:10:19] Operations, Wikimania-Hackathon-2017-Organization, Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3490687 (Antoine2711) >>! In T172217#3490701, @Reedy wrote: > What is the conference guide? >...
[04:10:36] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 281.86 seconds
[04:21:55] (PS1) Krinkle: webperf: Move statsv monitoring from 'webperf' to 'webperf::statsv' [puppet] - https://gerrit.wikimedia.org/r/369835 (https://phabricator.wikimedia.org/T158837)
[04:22:07] Operations, Traffic, Community-Liaisons (Jul-Sep 2017), User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3496111 (Johan) How do we do on the translation side for this, by the way? W...
[04:23:42] Operations, Wikimania-Hackathon-2017-Organization, Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3496112 (Dzahn) Hi @Antoine2711 Are the files already in a public git repository by any chance...
[04:24:44] (PS1) Dzahn: phabricator: set phab1001 to active phab server [puppet] - https://gerrit.wikimedia.org/r/369836 (https://phabricator.wikimedia.org/T163938)
[04:26:06] (PS2) Krinkle: webperf: Move statsv monitoring from 'webperf' to 'webperf::statsv' [puppet] - https://gerrit.wikimedia.org/r/369835 (https://phabricator.wikimedia.org/T158837)
[04:26:39] (CR) Krinkle: "Passed - no difference on hafnium. (Nor any other vaguely-related nodes) - https://puppet-compiler.wmflabs.org/compiler02/7285/" [puppet] - https://gerrit.wikimedia.org/r/369835 (https://phabricator.wikimedia.org/T158837) (owner: Krinkle)
[04:30:37] (PS1) Andrew Bogott: vmbuilder: update-rc.d -f puppet remove [puppet] - https://gerrit.wikimedia.org/r/369837
[04:31:51] (PS2) Andrew Bogott: vmbuilder: update-rc.d -f puppet remove [puppet] - https://gerrit.wikimedia.org/r/369837 (https://phabricator.wikimedia.org/T172064)
[04:33:04] (PS2) Dzahn: mariadb/phabricator: update GRANTS from iridium to phab1001 [puppet] - https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938)
[04:33:29] (CR) Andrew Bogott: [C: 2] vmbuilder: update-rc.d -f puppet remove [puppet] - https://gerrit.wikimedia.org/r/369837 (https://phabricator.wikimedia.org/T172064) (owner: Andrew Bogott)
[04:34:01] (PS1) BryanDavis: Install composer for PHP imaages [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/369838 (https://phabricator.wikimedia.org/T172358)
[04:38:48] Operations, Traffic, Community-Liaisons (Jul-Sep 2017), User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3496127 (Johan) Regarding Firefox, Firefox 52 is the last version to work reli...
[04:39:49] (PS8) Smalyshev: logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[04:40:29] (PS2) Dzahn: exim/phabricator: send mail to phab1001, not iridium [puppet] - https://gerrit.wikimedia.org/r/369834 (https://phabricator.wikimedia.org/T163938)
[04:40:31] (PS3) Dzahn: mariadb/phabricator: update GRANTS from iridium to phab1001 [puppet] - https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938)
[04:40:48] Operations, Traffic, Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#1793501 (Mostafa2018k) ipad air 2
[04:41:06] (CR) Smalyshev: "I've updated the config and it appears to work on beta now. The main thing I think was trying to write into @timestamp - that probably doe" [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[04:41:18] (PS1) Andrew Bogott: vmbuilder: remove /etc/ssh/userkeys/* [puppet] - https://gerrit.wikimedia.org/r/369839
[04:41:20] Operations, Traffic, Patch-For-Review, User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#2684468 (Johan) Since this task was created in 2015, Firefox has stopped supporting Windows XP. They might need to install Firefox...
[04:42:29] (CR) Andrew Bogott: [C: 2] vmbuilder: remove /etc/ssh/userkeys/* [puppet] - https://gerrit.wikimedia.org/r/369839 (owner: Andrew Bogott)
[04:43:20] (PS2) Dzahn: cache::misc/phabricator: switch from iridium to phab1001 backend [puppet] - https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938)
[04:43:36] (CR) jerkins-bot: [V: -1] cache::misc/phabricator: switch from iridium to phab1001 backend [puppet] - https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[04:48:34] Operations, Wikimania-Hackathon-2017-Organization, Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3496134 (Antoine2711) Hi @Dzahn, For me, it's really a simple HTTP need. The ideal of the PHP...
[04:58:38] Operations, Wikimania-Hackathon-2017-Organization, Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3496137 (Dzahn) Hi Antoine2711, the easiest way to upload the files would probably be if you w...
[05:11:40] Operations, Wikimania-Hackathon-2017-Organization, Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3496143 (Antoine2711) Hi @Dzahn, >>! In T172217#3496137, @Dzahn wrote: > Hi Antoine2711, the...
[05:16:25] Operations, Wikimania-Hackathon-2017-Organization, Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3496148 (Dzahn) @Antoine2711 Sounds all good. I will also get back to you tomorrow, we'll get...
[05:17:50] !log Stop replication on labsdb1011 for maintenance - T153743
[05:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:59] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[05:24:31] Operations, Wikimania-Hackathon-2017-Organization, Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3496157 (Mostafa2018k) admin ipad air 2
[05:24:45] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89946.74 seconds
[05:27:42] !log Deploy alter table on enwiki - labsdb1010 - T166204
[05:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:53] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204
[05:33:29] (PS1) Greg Grossmeier: admin: Update gjg's ssh key [puppet] - https://gerrit.wikimedia.org/r/369840
[05:34:05] PROBLEM - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.168 and port 9042: Connection refused
[05:34:05] PROBLEM - cassandra-b service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[05:34:15] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:34:35] PROBLEM - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[05:37:05] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedi
[05:37:05] sday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get
[05:37:05] the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the
[05:37:15] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received
[05:40:23] (CR) Giuseppe Lavagetto: [C: 2] admin: Update gjg's ssh key [puppet] - https://gerrit.wikimedia.org/r/369840 (owner: Greg Grossmeier)
[05:40:35] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200):
/en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedi
[05:40:35] sday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed conte
[05:40:35] 16) timed out before a response was received: /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200)
[05:43:45] PROBLEM - cassandra-a CQL 10.64.16.97:9042 on restbase-dev1005 is CRITICAL: connect to address 10.64.16.97 and port 9042: Connection refused
[05:43:55] PROBLEM - cassandra-a SSL 10.64.16.97:7001 on restbase-dev1005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[05:44:05] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:44:19] (PS1) Marostegui: Revert "db-codfw.php: Depool db2051" [mediawiki-config] - https://gerrit.wikimedia.org/r/369841
[05:44:28] (CR) jerkins-bot: [V: -1] Revert "db-codfw.php: Depool db2051" [mediawiki-config] - https://gerrit.wikimedia.org/r/369841 (owner: Marostegui)
[05:44:30] (Abandoned) Marostegui: Revert "db-codfw.php: Depool db2051" [mediawiki-config] - https://gerrit.wikimedia.org/r/369841 (owner: Marostegui)
[05:44:55] PROBLEM - cassandra-a service on restbase-dev1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[05:51:55] RECOVERY - cassandra-a service on restbase-dev1005 is OK: OK - cassandra-a is active
[05:52:02] <_joe_> should I just silence restbase-dev*?
[05:52:05] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational
[05:52:27] (PS1) Marostegui: db-eqiad.php: Restore db2051 original values [mediawiki-config] - https://gerrit.wikimedia.org/r/369842 (https://phabricator.wikimedia.org/T170351)
[05:53:35] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[05:53:45] RECOVERY - cassandra-a CQL 10.64.16.97:9042 on restbase-dev1005 is OK: TCP OK - 0.000 second response time on 10.64.16.97 port 9042
[05:53:56] RECOVERY - cassandra-a SSL 10.64.16.97:7001 on restbase-dev1005 is OK: SSL OK - Certificate restbase-dev1005-a valid until 2018-07-20 15:08:07 +0000 (expires in 351 days)
[05:54:15] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[05:54:16] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational
[05:54:25] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[05:54:31] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db2051 original values [mediawiki-config] - https://gerrit.wikimedia.org/r/369842 (https://phabricator.wikimedia.org/T170351) (owner: Marostegui)
[05:55:07] RECOVERY - cassandra-b service on restbase-dev1004 is OK: OK - cassandra-b is active
[05:56:00] (Merged) jenkins-bot: db-eqiad.php: Restore db2051 original values [mediawiki-config] - https://gerrit.wikimedia.org/r/369842 (https://phabricator.wikimedia.org/T170351) (owner: Marostegui)
[05:56:16] (CR) jenkins-bot: db-eqiad.php: Restore db2051 original values [mediawiki-config] - https://gerrit.wikimedia.org/r/369842 (https://phabricator.wikimedia.org/T170351) (owner: Marostegui)
[05:56:45] RECOVERY - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-b valid until 2018-07-20 15:08:05 +0000 (expires in 351 days)
[05:57:05] RECOVERY - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.168 port 9042
[05:57:15] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2051 - T170351 (duration: 00m 54s)
[05:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:27] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351
[06:01:54] (CR) Giuseppe Lavagetto: [C: -1] Define network infra ranges and allow them to send syslog to logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/369697 (https://phabricator.wikimedia.org/T166126) (owner: Ayounsi)
[06:01:57] !log Compress s7 on db1102 - T172169
[06:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:08] T172169: Compress InnoDB on db1102 - https://phabricator.wikimedia.org/T172169
[06:03:58] (PS3) Giuseppe Lavagetto: webperf: Move statsv monitoring from 'webperf' to 'webperf::statsv' [puppet] - https://gerrit.wikimedia.org/r/369835 (https://phabricator.wikimedia.org/T158837) (owner: Krinkle)
[06:06:02] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db2074 to s3 [mediawiki-config] -
https://gerrit.wikimedia.org/r/369843 (https://phabricator.wikimedia.org/T170662)
[06:07:18] (CR) Giuseppe Lavagetto: [C: 2] webperf: Move statsv monitoring from 'webperf' to 'webperf::statsv' [puppet] - https://gerrit.wikimedia.org/r/369835 (https://phabricator.wikimedia.org/T158837) (owner: Krinkle)
[06:14:33] (PS2) Marostegui: db-eqiad,db-codfw.php: Add db2074 to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/369843 (https://phabricator.wikimedia.org/T170662)
[06:16:09] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db2074 to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/369843 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[06:17:29] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db2074 to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/369843 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[06:17:39] (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db2074 to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/369843 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[06:18:42] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2074 to s3 - T170662 (duration: 00m 46s)
[06:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:52] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662
[06:19:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db2074 to s3 - T170662 (duration: 00m 46s)
[06:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:52] (CR) Volans: "-1 (after merge)" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/369710 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi)
[06:25:05] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received:
/en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) timed out before a response was received: /en.wikipedia.org/v1/page/mobi [06:25:05] {/revision} (Get MobileApps Foobar page) timed out before a response was received [06:25:46] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 [06:25:46] en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get [06:25:46] age returned the unexpected status 500 (expecting: 200) [06:25:46] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 [06:25:46] en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test 
Get rev by ID returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get [06:25:46] age returned the unexpected status 500 (expecting: 200) [06:29:45] PROBLEM - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.168 and port 9042: Connection refused [06:29:55] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [06:30:05] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:30:25] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [06:40:36] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [06:40:55] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [06:41:05] PROBLEM - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.169 and port 9042: Connection refused [06:45:55] RECOVERY - cassandra-b service on restbase-dev1006 is OK: OK - cassandra-b is active [06:46:05] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [06:46:25] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active [06:47:55] RECOVERY - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-a valid until 2018-07-20 15:08:10 +0000 (expires in 351 days) [06:47:56] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [06:48:05] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [06:48:06] RECOVERY - 
cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.169 port 9042 [06:48:16] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [06:48:45] RECOVERY - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-b valid until 2018-07-20 15:08:11 +0000 (expires in 351 days) [06:48:46] RECOVERY - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.168 port 9042 [06:49:22] (03PS3) 10Muehlenhoff: Add scons to package list [puppet] - 10https://gerrit.wikimedia.org/r/368399 [07:23:29] 10Operations, 10ops-eqiad: Decommission WMF3248 (old R510) - https://phabricator.wikimedia.org/T172323#3496203 (10Peachey88) [07:23:39] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission WMF3248 (old R510) - https://phabricator.wikimedia.org/T172323#3494819 (10Peachey88) [07:32:31] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3496208 (10MoritzMuehlenhoff) @Johan: Firefox is provided by Mozilla in two rele... 
[07:41:51] (03PS1) 10Marostegui: check_private_data_report: Specify socket location [puppet] - 10https://gerrit.wikimedia.org/r/369845 [07:43:15] (03CR) 10Marostegui: [C: 032] check_private_data_report: Specify socket location [puppet] - 10https://gerrit.wikimedia.org/r/369845 (owner: 10Marostegui) [07:44:55] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [07:44:55] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3496219 (10fgiunchedi) CC'ing #operations here too for wider distribution [07:46:46] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [07:46:55] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [07:47:46] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [07:50:45] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[07:51:25] PROBLEM - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [07:51:25] PROBLEM - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.168 and port 9042: Connection refused [07:51:35] PROBLEM - cassandra-b service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [07:51:55] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:52:05] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpect [07:52:05] cting: 200) [07:52:55] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpect [07:52:55] cting: 200) [07:53:36] PROBLEM - cassandra-b CQL 10.64.16.98:9042 on restbase-dev1005 is CRITICAL: connect to address 10.64.16.98 and port 9042: Connection refused [07:53:45] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [07:53:55] RECOVERY - 
restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [07:54:05] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [07:54:36] RECOVERY - cassandra-b service on restbase-dev1004 is OK: OK - cassandra-b is active [07:54:36] RECOVERY - cassandra-b CQL 10.64.16.98:9042 on restbase-dev1005 is OK: TCP OK - 0.000 second response time on 10.64.16.98 port 9042 [07:54:46] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [07:56:26] RECOVERY - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.168 port 9042 [07:56:26] RECOVERY - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-b valid until 2018-07-20 15:08:05 +0000 (expires in 351 days) [07:56:51] (03PS14) 10Ema: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [07:58:50] !log Add db2073 and db2074 to tendril - T170662 [07:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:01] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [08:01:44] (03CR) 10Ema: [C: 032] varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [08:05:02] Krinkle: thanks for the patch! ^ [08:13:20] 10Operations, 10monitoring: broken dependencies for python-snimpy on jessie - https://phabricator.wikimedia.org/T172330#3496282 (10MoritzMuehlenhoff) jessie contains the snimpy source package, but the version in jessie only ships the snimpy CLI tool, but not the Python modules. The version which is in jessie-... 
[08:19:40] (03PS1) 10Marostegui: mariadb: Add db2075 as a new slave in s5 [puppet] - 10https://gerrit.wikimedia.org/r/369850 (https://phabricator.wikimedia.org/T170662) [08:24:31] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/7286/" [puppet] - 10https://gerrit.wikimedia.org/r/369850 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:26:57] (03CR) 10Marostegui: [C: 032] mariadb: Add db2075 as a new slave in s5 [puppet] - 10https://gerrit.wikimedia.org/r/369850 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:28:29] (03PS2) 10Ema: pybal::monitoring: check_pybal_ipvs_diff interval/timeouts [puppet] - 10https://gerrit.wikimedia.org/r/369006 (https://phabricator.wikimedia.org/T134893) [08:29:02] (03PS1) 10Marostegui: db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369854 (https://phabricator.wikimedia.org/T170662) [08:29:35] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:30:47] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369854 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:32:15] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369854 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:32:29] (03CR) 10jenkins-bot: db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369854 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:33:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db2045 - T170662 (duration: 00m 47s) [08:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:45] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [08:34:24] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2045 - T170662 (duration: 00m 47s) [08:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:58] !log Stop MySQL on db2045 to copy its data to db2075 - T170662 [08:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:58] !log banning elastic102[89] - T168816 [08:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:08] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [08:37:28] checking stat1005.. 
[08:39:34] Aug 3 08:27:03 stat1005 atopacctd[13036]: Unexpected write error to shadow file: No space left on device [08:39:59] (03PS4) 10Gehel: maps - tune postgresql for maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/369660 (https://phabricator.wikimedia.org/T169011) [08:41:15] (03CR) 10Gehel: [C: 032] maps - tune postgresql for maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/369660 (https://phabricator.wikimedia.org/T169011) (owner: 10Gehel) [08:45:00] ahh there was an OOM party on stat1005 [08:50:28] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [08:50:58] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [08:51:09] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [08:51:28] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds [08:51:28] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [08:51:28] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [08:53:09] more fun now [08:53:16] (03CR) 10DCausse: [C: 031] CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [08:55:38] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [08:56:50] !log upgrading cache_upload to varnish 4.1.8-1wm1 [08:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:48] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [08:59:37] dcausse: hello :) [08:59:51] elukey: hi :) [09:00:03] stat1005 seems a bit overloaded by ebernhardson's processes, do you know what he is doing? 
[09:00:11] elukey: looking [09:00:35] I just killed some of them [09:00:45] ok [09:01:46] but they keep respawning [09:02:03] and the oom is playing whack-a-mole :D [09:02:24] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=stat1005&refresh=1m&orgId=1&from=now-3h&to=now [09:03:33] might be good now, waiting a few [09:03:45] load went down after killing the biggest ones [09:03:48] RECOVERY - Disk space on stat1005 is OK: DISK OK [09:03:58] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [09:04:13] \o/ [09:04:18] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [09:04:28] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:04:28] RECOVERY - DPKG on stat1005 is OK: All packages OK [09:04:28] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [09:04:29] elukey: strange, I should see this process in yarn, what kind of process do you kill? python or java? [09:04:31] !!! [09:04:38] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:06:22] dcausse: python and java :D [09:11:12] elukey: ok, it's a pyspark shell that I don't see in yarn (so must have been killed I think), I can't really tell what was running, I suppose some training for search, I'll check with him later today. Thanks for the heads up [09:12:59] dcausse: sorry for the brutal kill but the host was really overloaded [09:13:04] elukey: np [09:14:51] (03CR) 10Ema: [C: 04-1] "Thanks for submitting this! Please address the comment and add a VTC test to verify that the patch works as advertised. 
See for example: h" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/365589 (https://phabricator.wikimedia.org/T169683) (owner: 10Gilles) [09:16:46] (03PS1) 10Addshore: Factor graphite class out of statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/369857 [09:18:03] (03CR) 10jerkins-bot: [V: 04-1] Factor graphite class out of statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [09:19:21] (03PS2) 10Addshore: Factor graphite class out of statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/369857 [09:25:38] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Thu 2017-08-03 09:25:31 UTC. [09:25:38] (03PS8) 10Giuseppe Lavagetto: rake: new rakefile tailored for CI [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) [09:31:52] (03PS3) 10Addshore: Factor graphite class out of statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/369857 [09:32:56] !log banning elastic103[01] - T168816 [09:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:06] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [09:38:05] (03PS1) 10Ema: VCL: apply VSV00001 (DSA 3924-1) DoS workaround to text only [puppet] - 10https://gerrit.wikimedia.org/r/369859 [09:40:50] (03CR) 10Ema: [C: 032] pybal::monitoring: check_pybal_ipvs_diff interval/timeouts [puppet] - 10https://gerrit.wikimedia.org/r/369006 (https://phabricator.wikimedia.org/T134893) (owner: 10Ema) [09:40:59] (03PS3) 10Ema: pybal::monitoring: check_pybal_ipvs_diff interval/timeouts [puppet] - 10https://gerrit.wikimedia.org/r/369006 (https://phabricator.wikimedia.org/T134893) [09:41:01] (03CR) 10Ema: [V: 032 C: 032] pybal::monitoring: check_pybal_ipvs_diff interval/timeouts [puppet] - 10https://gerrit.wikimedia.org/r/369006 (https://phabricator.wikimedia.org/T134893) (owner: 10Ema) [09:41:24] 
(03PS1) 10Hashar: openstack: phase out deployment-stream [puppet] - 10https://gerrit.wikimedia.org/r/369860 (https://phabricator.wikimedia.org/T172356) [09:43:24] (03PS1) 10Hashar: beta: phase out RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) [09:46:05] (03CR) 10Giuseppe Lavagetto: [C: 031] "The new rakefile should work well, but we should really disable tox from running independently, and the tox update as well. I did put up p" [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) (owner: 10Giuseppe Lavagetto) [09:52:30] !log installing mysql 5.5 security updates (package as shipped by Debian jessie, not our internal wmf-mariadb package) [09:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:37] <_joe_> !log systemctl reset-failed puppetmaster.service on labpuppetmaster1002 [09:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:48] RECOVERY - Check systemd state on labpuppetmaster1002 is OK: OK - running: The system is fully operational [10:01:31] !log installing poppler security updates on trusty [10:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:15] (03PS4) 10Addshore: Factor graphite class out of statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/369857 [10:11:51] (03CR) 10Elukey: Factor graphite class out of statistics::wmde (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [10:21:09] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [10:21:31] !log removed /run/pacct_shadow.d on stat1005 to allow atopacct.service restart [10:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:11] (03PS5) 10Addshore: Factor graphite class out of statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/369857 [10:24:15] 
(03CR) 10jerkins-bot: [V: 04-1] Factor graphite class out of statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [10:32:12] (03CR) 10Hashar: "https://gerrit.wikimedia.org/r/369870 would update the CI job" [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) (owner: 10Giuseppe Lavagetto) [10:36:18] (03PS2) 10Muehlenhoff: NOOP: Formatting and minor refactoring [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/368179 (owner: 10Volans) [10:37:49] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [10:41:30] (03CR) 10Muehlenhoff: [C: 032] NOOP: Formatting and minor refactoring [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/368179 (owner: 10Volans) [10:42:02] (03CR) 10Addshore: Factor graphite class out of statistics::wmde (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [10:43:28] (03PS6) 10Addshore: Factor graphite class out of statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/369857 [10:43:49] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [10:48:08] (03PS7) 10Addshore: Factor graphite class out of statistics::wmde [puppet] - 10https://gerrit.wikimedia.org/r/369857 [11:12:26] (03PS8) 10Elukey: statistics::wmde: refactoring and decoupling [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [11:13:29] (03CR) 10jerkins-bot: [V: 04-1] statistics::wmde: refactoring and decoupling [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [11:14:22] (03PS9) 10Elukey: statistics::wmde: refactoring and decoupling [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [11:15:23] (03PS10) 10Elukey: statistics::wmde: refactoring and decoupling [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [11:15:27] (03CR) 
10jerkins-bot: [V: 04-1] statistics::wmde: refactoring and decoupling [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [11:29:30] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3496703 (10Paladox) I found https://phabricator-new.wikimedia.org/ continuously redirecting which then the browser short circuits it. But http... [11:30:46] (03PS1) 10Hashar: Fix firstboot salt minion id on labs [puppet] - 10https://gerrit.wikimedia.org/r/369873 [11:32:02] (03CR) 10Hashar: "I can't test it with a self puppet master since firstboot.sh uses labs-puppetmaster.wikimedia.org for the first puppet run." [puppet] - 10https://gerrit.wikimedia.org/r/369873 (owner: 10Hashar) [11:39:59] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] [11:45:09] (03PS1) 10Marostegui: s5.hosts: Add db2075 as a new slave [software] - 10https://gerrit.wikimedia.org/r/369875 (https://phabricator.wikimedia.org/T170662) [11:45:35] * gehel is looking at elastic... 
[11:46:50] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0] Gehel rebalance still in progress after banning nodes for thermal paste [11:46:57] (03CR) 10Marostegui: [C: 032] s5.hosts: Add db2075 as a new slave [software] - 10https://gerrit.wikimedia.org/r/369875 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [11:50:27] (03PS1) 10Marostegui: s4.hosts: db2051 is now s4 codfw master [software] - 10https://gerrit.wikimedia.org/r/369877 (https://phabricator.wikimedia.org/T170351) [11:50:57] (03CR) 10Marostegui: [C: 04-2] "Do not merge until the switchover is done" [software] - 10https://gerrit.wikimedia.org/r/369877 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:53:59] (03CR) 10Hashar: "Cherry picked it on the puppet master and I have live hacked the manifest to fix an oddity in a require (see inline comment)." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) (owner: 10Thcipriani) [11:54:20] (03PS1) 10Marostegui: db-codfw.php: Promote db2051 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369879 (https://phabricator.wikimedia.org/T170351) [11:56:18] (03PS2) 10Marostegui: db-codfw.php: Promote db2051 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369879 (https://phabricator.wikimedia.org/T170351) [11:56:48] (03CR) 10Marostegui: [C: 04-2] "Do not merge until the switchover is done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369879 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [11:57:52] (03PS8) 10Hashar: CI/integration: Create role for docker CI agent [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) (owner: 10Thcipriani) [11:59:42] (03CR) 10Hashar: [C: 031] "PS8 address the few oddities I found in PS7." 
[puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) (owner: 10Thcipriani) [12:00:13] (03PS1) 10Marostegui: mariadb: Promote db2051 as the new s4 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/369880 (https://phabricator.wikimedia.org/T170351) [12:03:49] PROBLEM - cassandra-b service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [12:03:49] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:03:59] PROBLEM - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:04:19] PROBLEM - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.168 and port 9042: Connection refused [12:05:00] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/7288/" [puppet] - 10https://gerrit.wikimedia.org/r/369880 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [12:05:10] (03CR) 10Marostegui: [C: 04-2] "Do not merge until the switchover is done" [puppet] - 10https://gerrit.wikimedia.org/r/369880 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [12:12:30] (03CR) 10Hashar: "I dont know." 
[puppet] - 10https://gerrit.wikimedia.org/r/361680 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:15:53] !log Restart MySQL on db2051 - T170351 [12:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:04] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351 [12:16:19] (03PS3) 10Hashar: contint: PHP packages cleanup [puppet] - 10https://gerrit.wikimedia.org/r/346165 [12:17:15] (03CR) 10Hashar: "Rebased and changed php-tideways to php7.0-tideways" [puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [12:18:34] (03CR) 10Hashar: contint: profile, role, and packages for R language (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: 10Hashar) [12:21:46] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7287/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/369857 (owner: 10Addshore) [12:23:49] RECOVERY - cassandra-b service on restbase-dev1004 is OK: OK - cassandra-b is active [12:23:50] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [12:25:59] RECOVERY - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-b valid until 2018-07-20 15:08:05 +0000 (expires in 351 days) [12:26:19] RECOVERY - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.168 port 9042 [12:30:00] (03CR) 10Hashar: "I have cherry picked it on the CI puppet master and ran puppet on ntegration-r-lang-01.integration.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: 10Hashar) [12:30:47] (03PS8) 10Hashar: contint: profile, role, and packages for R language [puppet] - 10https://gerrit.wikimedia.org/r/363337 
(https://phabricator.wikimedia.org/T153856) [12:32:29] (03CR) 10Hashar: [C: 031] "Cherry picked on the CI puppet master. Puppet managed to pass on integration-r-lang-01" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: 10Hashar) [12:41:52] (03PS9) 10Giuseppe Lavagetto: rake: new rakefile tailored for CI [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) [12:49:21] jouncebot: refresh [12:49:23] jouncebot: next [12:49:58] I refreshed my knowledge about deployments. [12:49:59] In 0 hour(s) and 10 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T1300) [12:50:00] (03CR) 10Giuseppe Lavagetto: [C: 032] rake: new rakefile tailored for CI [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) (owner: 10Giuseppe Lavagetto) [12:51:14] <_joe_> hashar: removing tox-all-envs here e anziche' dire "siete dei ragazzini mettetevi d'accordo per assicurare disponibilita'" [12:51:20] <_joe_> ouch sorry [12:51:24] <_joe_> wrong paste :P [12:51:29] <_joe_> sorry [12:51:31] LANG=C ? [12:51:31] ;-D [12:51:40] <_joe_> no just clipboard fail :P [12:51:51] <_joe_> I pasted the 2nd buffer in the stack :P [12:52:09] <_joe_> so, removing tox-all-envs from operations-puppet.yaml [12:52:22] <_joe_> will prevent updating tox with every jenkins run? 
[12:52:59] yup that macro runs tox [12:53:08] set -o pipefail [12:53:09] TOX_TESTENV_PASSENV=PY_COLORS PY_COLORS=1 tox -v | tee "log/stdout.log" [12:53:12] <_joe_> hashar: ok [12:53:21] * aude has the only thing for swat today [12:53:25] and maybe the rake file would want to pass those env variables [12:53:28] <_joe_> so, let me recheck a few changes [12:53:36] <_joe_> and then we can merge the change to ci [12:53:40] OK [12:54:02] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/369695 (owner: 10Gehel) [12:54:23] aude: that is for the wmf.12 train being blocked isn't it ? [12:55:48] yes [12:55:54] (03CR) 10jenkins-bot: rake: new rakefile tailored for CI [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) (owner: 10Giuseppe Lavagetto) [12:55:56] (03PS1) 10Addshore: statistics::wmde cleanup directory usage [puppet] - 10https://gerrit.wikimedia.org/r/369884 [12:55:58] (03PS1) 10Addshore: statistics::wmde move graphite files into own directory [puppet] - 10https://gerrit.wikimedia.org/r/369885 [12:56:00] (03PS1) 10Addshore: statistics::wmde::graphite fix logrotate name [puppet] - 10https://gerrit.wikimedia.org/r/369886 [12:56:11] i could do swat myself or at least +2 it now so jenkins has time [12:56:19] aude: sounds good [12:58:28] (03PS2) 10Rush: openstack: clientlib refactor [puppet] - 10https://gerrit.wikimedia.org/r/369809 (https://phabricator.wikimedia.org/T171494) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T1300). Please do the needful. [13:00:04] aude: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:59] (03PS3) 10Rush: openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) [13:01:29] !log Disable gtid on s4 codfw slaves to get ready for the topology change - T170351 [13:01:32] <_joe_> 13:00:05 ---> spec:profile [13:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:40] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351 [13:01:44] <_joe_> we are now running spec tests in puppet CI :))) [13:01:58] * aude waiting for jenkins [13:02:08] <_joe_> hashar: there is a bug with rubocop though which I need to fix [13:02:25] <_joe_> but we can merge your change for CI as well [13:02:31] (03CR) 10jerkins-bot: [V: 04-1] openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) (owner: 10Rush) [13:02:58] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#3496885 (10hashar) 05Open>03Resolved Bulk of the rspec integration has been done a while... 
[13:03:09] (03PS2) 10Addshore: statistics::wmde cleanup directory usage [puppet] - 10https://gerrit.wikimedia.org/r/369884 [13:03:14] (03Abandoned) 10Rush: tools: allow generic banner for inf protection [puppet] - 10https://gerrit.wikimedia.org/r/339007 (owner: 10Rush) [13:03:30] Let me close T78342 from Summer 2013 / December 2014 (run rspec for operations puppet in CI) :D [13:03:31] T78342: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342 [13:03:31] https://phabricator.wikimedia.org/T78342 [13:03:45] (03PS3) 10Addshore: statistics::wmde cleanup directory usage [puppet] - 10https://gerrit.wikimedia.org/r/369884 [13:04:00] (03PS2) 10Addshore: statistics::wmde move graphite files into own directory [puppet] - 10https://gerrit.wikimedia.org/r/369885 [13:04:08] (03PS2) 10Addshore: statistics::wmde::graphite fix logrotate name [puppet] - 10https://gerrit.wikimedia.org/r/369886 [13:04:08] <_joe_> hashar: lol [13:04:24] aude: you're deploying it? [13:04:36] (seen the answer above, ok) [13:05:54] (03CR) 10jerkins-bot: [V: 04-1] statistics::wmde cleanup directory usage [puppet] - 10https://gerrit.wikimedia.org/r/369884 (owner: 10Addshore) [13:06:16] (03PS4) 10Rush: openstack: add wikitech-grep as utility for adminscripts [puppet] - 10https://gerrit.wikimedia.org/r/363896 (https://phabricator.wikimedia.org/T169820) [13:06:27] Dereckson: yes [13:06:37] * Dereckson nods [13:06:39] <_joe_> hashar: uhm why did that job filed? [13:06:48] <_joe_> https://integration.wikimedia.org/ci/job/operations-puppet-tests-jessie/7624/console [13:06:51] (03CR) 10Addshore: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/369884 (owner: 10Addshore) [13:07:48] _joe_: yehmm [13:07:59] _joe_: could it be the warning about the use of 'import' ? 
[13:08:10] <_joe_> hashar: I don't think so, no [13:08:37] * hashar tries [13:08:39] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 [13:08:39] en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get [13:08:39] age returned the unexpected status 500 (expecting: 200) [13:09:29] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 [13:09:29] en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get [13:09:29] age returned the unexpected status 500 (expecting: 200) [13:09:34] that all 
looks like testing restbase equip to me [13:09:49] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 [13:09:49] en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get [13:09:50] age returned the unexpected status 500 (expecting: 200) [13:10:12] _joe_: I can reproduce locally after rebasing the patch [13:10:18] <_joe_> yeah me too [13:10:28] (03CR) 10Rush: [C: 032] openstack: clientlib refactor [puppet] - 10https://gerrit.wikimedia.org/r/369809 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [13:10:28] <_joe_> now I just need to understand which check fails [13:10:30] <_joe_> :P [13:10:37] <_joe_> and make it more explicit [13:10:38] ;D [13:10:39] PROBLEM - cassandra-a SSL 10.64.16.97:7001 on restbase-dev1005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:10:50] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:10:57] !log aude@tin Synchronized php-1.30.0-wmf.12/extensions/Wikidata: Fix bug in InjectRCRecordsJob (duration: 02m 12s) [13:10:59] PROBLEM - cassandra-a service on restbase-dev1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [13:11:04] done [13:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:10] PROBLEM - cassandra-a CQL 10.64.16.97:9042 on restbase-dev1005 is CRITICAL: connect to address 10.64.16.97 and port 9042: Connection refused [13:12:03] _joe_: they run in parallel arent they? Also syntax:manifests run twice [13:12:33] <_joe_> hashar: no that's the start/end message [13:13:18] <_joe_> hashar: I ran every test that is ran in parallel there [13:14:07] <_joe_> and each one returns 0 [13:14:12] !log Start topology change for s4 in codfw, slaves will be moved under db2051 - T170351 [13:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:23] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351 [13:14:37] I am not sure what it is running, but it ends up being slower [13:14:54] ahhhh syntax:manifests probably runs on the whole tree [13:15:23] <_joe_> nope [13:15:42] <_joe_> but we run the tasks in parallel so the single ones might look slower [13:16:30] _joe_: with rake --jobs 1 https://phabricator.wikimedia.org/P5848 [13:16:51] bah I should have used rake --trace [13:17:30] updated the paste with --trace [13:17:58] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review, 10User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3496930 (10elukey) Re-tested it in labs now and the current version of varnishkafka is able to use TLS without any modif... [13:18:02] so yeah it runs only once [13:18:26] <_joe_> hashar: and none of them fails [13:18:30] <_joe_> so... uhm [13:18:35] <_joe_> why does rake return 1? 
[13:18:40] <_joe_> I must look into that [13:18:43] (03PS3) 10Gehel: wdqs - remove upstart configuration files [puppet] - 10https://gerrit.wikimedia.org/r/369688 [13:19:25] !log disabling puppet acros wmcs things for testing [13:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:06] <_joe_> hashar: I'd first fix the rubocop bug [13:20:19] (03CR) 10jerkins-bot: [V: 04-1] wdqs - remove upstart configuration files [puppet] - 10https://gerrit.wikimedia.org/r/369688 (owner: 10Gehel) [13:20:31] <_joe_> uhm, no [13:20:37] and note that "The use of 'import' is deprecated at /Users/amusso/projects/operations/puppet/manifests/site.pp" is emitted twice [13:20:49] <_joe_> yeah that's kinda normal [13:20:55] <_joe_> don't get diverted by that [13:21:30] and taking 15 seconds to parse a single file? Most probably it processes the whole tree [13:21:46] <_joe_> uhm [13:21:53] <_joe_> jeez I think I know what happened [13:21:56] <_joe_> my last addition [13:21:57] <_joe_> yes [13:21:59] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [13:21:59] RECOVERY - cassandra-a service on restbase-dev1005 is OK: OK - cassandra-a is active [13:22:07] my bet is that the puppet-syntax gem register a task named syntax:manifests [13:22:08] <_joe_> ok lemme fix this quickly [13:22:10] (03CR) 10Jcrespo: [C: 031] mariadb: Promote db2051 as the new s4 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/369880 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [13:22:12] !log banning elastic1032 - T168816 [13:22:15] hashar, aude you guys ok if we hijack a bit and deploy: https://gerrit.wikimedia.org/r/#/c/369879/ ? 
[13:22:21] and our Rakefile appends another one (though with the same name) [13:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:27] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [13:22:33] <_joe_> hashar: I did add that "global" part [13:22:36] <_joe_> lemme yank that out [13:22:50] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Wikipedia-Android-App-Backlog: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3496939 (10fgiunchedi) >>! In T172123#3490510, @Mholloway wrote: > Hey... [13:23:09] marostegui: yeah most probably your database change does not overlap with aude hotfix to Wikidata [13:23:37] (03CR) 10Jcrespo: [C: 031] db-codfw.php: Promote db2051 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369879 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [13:23:39] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [13:23:39] RECOVERY - cassandra-a SSL 10.64.16.97:7001 on restbase-dev1005 is OK: SSL OK - Certificate restbase-dev1005-a valid until 2018-07-20 15:08:07 +0000 (expires in 351 days) [13:23:49] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [13:23:59] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [13:24:09] marostegui: going to drop the backup tables listed in T171883 ok? 
[13:24:09] T171883: Drop CookieBlock* tables from EventLogging DB - https://phabricator.wikimedia.org/T171883 [13:24:19] RECOVERY - cassandra-a CQL 10.64.16.97:9042 on restbase-dev1005 is OK: TCP OK - 0.000 second response time on 10.64.16.97 port 9042 [13:24:19] elukey: ok [13:26:18] (03PS2) 10Marostegui: mariadb: Promote db2051 as the new s4 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/369880 (https://phabricator.wikimedia.org/T170351) [13:26:29] <_joe_> hashar: we were right [13:26:49] !log Starting the actual s4 codfw failover db2019 -> db2051 - T170351 [13:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:59] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351 [13:27:00] so imho at the end of setup_syntax you are reinstantiating PuppetSyntax::RakeTask.new [13:27:02] <_joe_> my attempt to add globals was naive [13:27:09] <_joe_> no that's not the issue [13:27:22] ah [13:27:31] <_joe_> the problem is the way I do pass the class variables in monkey-patching [13:27:37] <_joe_> but lemme fix the whole thing first [13:28:09] (03PS1) 10Giuseppe Lavagetto: rake: remove global namespace, fix rubocop targeting [puppet] - 10https://gerrit.wikimedia.org/r/369906 [13:28:20] <_joe_> hashar: this fixes it ^^ [13:28:43] !log drop CookieBlock backup tables for T171883 [13:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:57] (03CR) 10Marostegui: [C: 032] mariadb: Promote db2051 as the new s4 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/369880 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [13:29:01] <_joe_> uhm still not fixed it seems [13:29:21] so what I was about to say is that you re added the puppet-syntax tasks with PuppetSyntax::RakeTask.new [13:29:49] which are added in parallel of the one that have been configured and thus with there are two syntax:manifests [13:30:08] 
and the one from puppet-syntax apparently comes without PuppetSyntax.fail_on_deprecation_notices = false [13:30:10] fails [13:30:13] and cause rake to exit 1 [13:30:29] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:30:42] <_joe_> no that's not what happened [13:31:01] <_joe_> it's more subtle [13:31:10] <_joe_> but I'm now trying to fix the rubocop task [13:32:17] (03CR) 10Marostegui: [C: 032] db-codfw.php: Promote db2051 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369879 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [13:34:07] (03Merged) 10jenkins-bot: db-codfw.php: Promote db2051 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369879 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [13:34:20] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3496966 (10fgiunchedi) >>! In T172148#3490229, @Fjalapeno wrote: > @fgiunchedi We can... [13:34:29] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [13:35:18] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Promote db2051 as s4 codfw master - T170351 (duration: 00m 46s) [13:35:19] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [13:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:29] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351 [13:35:49] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:36:17] (03CR) 10jenkins-bot: db-codfw.php: Promote db2051 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369879 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [13:36:24] (03CR) 10Marostegui: [C: 032] s4.hosts: db2051 is now s4 codfw master [software] - 10https://gerrit.wikimedia.org/r/369877 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [13:36:49] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [13:37:50] (03Merged) 10jenkins-bot: s4.hosts: db2051 is now s4 codfw master [software] - 10https://gerrit.wikimedia.org/r/369877 (https://phabricator.wikimedia.org/T170351) (owner: 10Marostegui) [13:38:12] (03PS2) 10Giuseppe Lavagetto: rake: remove global namespace, fix rubocop targeting [puppet] - 10https://gerrit.wikimedia.org/r/369906 [13:39:51] (03PS4) 10Elukey: statistics::wmde cleanup directory usage [puppet] - 10https://gerrit.wikimedia.org/r/369884 (owner: 10Addshore) [13:40:23] (03CR) 10Giuseppe Lavagetto: [C: 032] rake: remove global namespace, fix rubocop targeting [puppet] - 10https://gerrit.wikimedia.org/r/369906 (owner: 10Giuseppe Lavagetto) [13:40:35] (03PS3) 10Giuseppe Lavagetto: rake: remove global namespace, fix rubocop targeting [puppet] - 10https://gerrit.wikimedia.org/r/369906 [13:40:36] _joe_: looks good :) [13:40:48] <_joe_> hashar: yup, now rubocop runs on the rakefile itself [13:40:50] <_joe_> too [13:40:52] <_joe_> :P [13:40:54] _joe_: i guess the global namespace mess up with the variables [13:41:08] <_joe_> hashar: no the error is completely mine and I'll fix it [13:41:18] <_joe_> but I'll take the time to do that properly [13:42:01] <_joe_> it still took almost one minute to run a test that requires 2 seconds [13:42:34] <_joe_> hashar: you can merge your change, I'd say [13:42:41] roger! [13:43:10] _joe_, I'm looking at the ORES stress test again. 
I think the next step should be to bump up the # of workers and retry. I'm looking at the right way to do that. [13:43:39] I'm thinking that it might make sense to use hieradata to change that configuration just for servers that are named ores* [13:43:45] Is that possible with hieradata? [13:43:52] <_joe_> halfak: I'm in the middle of something else, but yes it should [13:43:54] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7290/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/369884 (owner: 10Addshore) [13:44:07] (03PS5) 10Elukey: statistics::wmde cleanup directory usage [puppet] - 10https://gerrit.wikimedia.org/r/369884 (owner: 10Addshore) [13:44:10] (03CR) 10Elukey: [V: 032 C: 032] statistics::wmde cleanup directory usage [puppet] - 10https://gerrit.wikimedia.org/r/369884 (owner: 10Addshore) [13:44:17] <_joe_> halfak: let's find out the right way to do it, ok? I really can't look at in for another ~ 30 m [13:44:30] <_joe_> elukey: please don't +v2 patches for now [13:44:36] !log Enable gtid back on codfw s4 slaves - T170351 [13:44:39] OK no worries. I'll be around :) [13:44:40] <_joe_> we are testing a faster CI process right now [13:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:47] T170351: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351 [13:44:53] (03PS3) 10Elukey: statistics::wmde move graphite files into own directory [puppet] - 10https://gerrit.wikimedia.org/r/369885 (owner: 10Addshore) [13:44:55] <_joe_> halfak: let's make it 1 hour, sorry [13:45:09] Even better (assume it's not getting too late for you) [13:45:16] _joe_: operations-puppet-tests-jessie is updated [13:45:22] <_joe_> hashar: cool! 
[13:45:36] (03CR) 10jerkins-bot: [V: 04-1] statistics::wmde move graphite files into own directory [puppet] - 10https://gerrit.wikimedia.org/r/369885 (owner: 10Addshore) [13:45:51] <_joe_> hashar: ouch [13:45:52] <_joe_> 13:45:32 ERROR: No artifacts found that match the file pattern "log/*". Configuration error? [13:45:55] <_joe_> 13:45:32 ERROR: ‘log/*’ doesn’t match anything, but ‘*’ does. Perhaps that’s what you mean? [13:45:58] <_joe_> https://integration.wikimedia.org/ci/job/operations-puppet-tests-jessie/7638/console [13:46:28] yeah [13:46:33] I was about to paste that :) [13:46:36] <_joe_> it should be noted the whole thing did run in 26s [13:46:40] <_joe_> so very very fast [13:46:42] <_joe_> :P [13:46:46] lol [13:47:02] <_joe_> volans: what failed is jenkins removing logs, not the ci job itself [13:47:23] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3497006 (10BBlack) >>! In T163251#3496111, @Johan wrote: > How do we do on the t... [13:49:19] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Wikipedia-Android-App-Backlog: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3497011 (10Mholloway) >>! In T172123#3496939, @fgiunchedi wrote: >>>!... 
[13:49:53] trying [13:49:54] https://gerrit.wikimedia.org/r/369911 [13:50:08] regenerating the job [13:50:39] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/369885 (owner: 10Addshore) [13:52:38] <_joe_> \o/ [13:52:47] <_joe_> 26s and no failure :) [13:53:23] <_joe_> (of which 5 to actually run all the tests, but we're getting there I think) [13:54:47] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3497020 (10elukey) >>! In T167992#3493916, @RobH wrote: > @elukey: So we do prefer sw raid over hw raid when purchasing serve... [13:56:05] nice! [13:56:49] 10Operations, 10monitoring: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353#3497042 (10Volans) I've started working on this, I hoped to be able to finish it by today but the list of checks is long. I will complete it when I'll be back. [13:58:19] (03CR) 10Elukey: "@Addshore: https://puppet-compiler.wmflabs.org/compiler02/7291/stat1005.eqiad.wmnet/ does it look good?" [puppet] - 10https://gerrit.wikimedia.org/r/369885 (owner: 10Addshore) [13:58:34] (03CR) 10Addshore: "Yup!" 
[puppet] - 10https://gerrit.wikimedia.org/r/369885 (owner: 10Addshore) [13:58:49] _joe_: iirc the finding typos took a while and running tox and rake in parallel nicely speed it up (that is what tyler did in the docker job) [13:59:19] elukey: the job no more complains when /log files are missing :) [13:59:25] \o/ [13:59:33] <_joe_> hashar: even not running rubocop or the linter when not needed helps a bit [13:59:35] (03CR) 10Elukey: [C: 032] statistics::wmde move graphite files into own directory [puppet] - 10https://gerrit.wikimedia.org/r/369885 (owner: 10Addshore) [14:00:00] _joe_: in my experience it took less than a second [14:00:02] but yeah that is one less thing to load [14:00:04] <_joe_> hashar: the other good thing is we don't refresh the tox env if not needed [14:00:04] reedy: Dear anthropoid, the time has come. Please deploy Create-a-wiki! (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T1400). [14:00:22] (03PS3) 10Elukey: statistics::wmde::graphite fix logrotate name [puppet] - 10https://gerrit.wikimedia.org/r/369886 (owner: 10Addshore) [14:00:24] _joe_: yeah that is definitely an improvement :} [14:00:30] !log Add db2075 to tendril - T170662 [14:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:40] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [14:01:01] <_joe_> hashar: now, the next thing to do is [14:01:09] (03PS2) 10Cmjohnson: Adding mgmt and production dns for kafka-jumbo100[1=6] T167992 [dns] - 10https://gerrit.wikimedia.org/r/369725 [14:01:10] !log depooling and shutting down elastic10(28|29|30|31|32) - T168816 [14:01:11] <_joe_> to remove that bundle install --clean [14:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:20] <_joe_> but that's just 3 seconds [14:01:20] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [14:01:41] (03CR) 
10Hashar: [C: 031] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/299151 (owner: 10Hashar) [14:01:47] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369913 [14:01:50] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369913 [14:02:03] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(28|29|30|31|32).eqiad.wmnet [14:02:06] (03CR) 10Elukey: [C: 032] statistics::wmde::graphite fix logrotate name [puppet] - 10https://gerrit.wikimedia.org/r/369886 (owner: 10Addshore) [14:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:18] (03CR) 10Cmjohnson: [C: 032] Adding mgmt and production dns for kafka-jumbo100[1=6] T167992 [dns] - 10https://gerrit.wikimedia.org/r/369725 (owner: 10Cmjohnson) [14:03:21] <_joe_> hashar: 23 s (including running the spec) vs 1m21 [14:03:26] <_joe_> seems fast enough for now [14:03:42] (03PS2) 10Reedy: Remove wikimania2015wiki from wmgCentralAuthLoginIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369814 [14:03:45] (03CR) 10Reedy: [C: 032] Remove wikimania2015wiki from wmgCentralAuthLoginIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369814 (owner: 10Reedy) [14:03:47] <_joe_> next step is moving to use the docker image with the enahngements I added :) [14:03:51] <_joe_> *c [14:03:58] (03CR) 10Ottomata: [C: 031] beta: phase out RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [14:04:04] (03CR) 10Ottomata: [C: 031] openstack: phase out deployment-stream [puppet] - 10https://gerrit.wikimedia.org/r/369860 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [14:05:58] (03Merged) 10jenkins-bot: Remove wikimania2015wiki from wmgCentralAuthLoginIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369814 (owner: 10Reedy) [14:06:09] 
(03CR) 10jenkins-bot: Remove wikimania2015wiki from wmgCentralAuthLoginIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369814 (owner: 10Reedy) [14:06:16] (03PS7) 10Reedy: Initial configuration for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368168 (https://phabricator.wikimedia.org/T155038) (owner: 10Urbanecm) [14:06:21] (03CR) 10Reedy: [C: 032] Initial configuration for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368168 (https://phabricator.wikimedia.org/T155038) (owner: 10Urbanecm) [14:06:22] thanks cmjohnson1 \o/ [14:06:51] !log Compress s5 on db2075 - T170662 [14:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:01] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [14:07:03] (03PS3) 10Marostegui: Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369913 [14:07:09] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:07:29] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/369695 (owner: 10Gehel) [14:07:59] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3497100 (10BBlack) Cross-ticket updates: There's a separate sub-ticket for the Communications side of this change at T163251, and a... 
[14:08:00] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [14:08:32] (03Merged) 10jenkins-bot: Initial configuration for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368168 (https://phabricator.wikimedia.org/T155038) (owner: 10Urbanecm) [14:08:41] (03CR) 10jenkins-bot: Initial configuration for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368168 (https://phabricator.wikimedia.org/T155038) (owner: 10Urbanecm) [14:09:35] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3191313 (10BBlack) And for those wanting to follow the changes in 3DES percentag... [14:09:44] (03PS2) 10Muehlenhoff: Adapt debdeploy server components to Cumin (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/368190 [14:10:16] (03PS8) 10Hashar: Jenkins integration of rspec [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) [14:10:31] (03CR) 10jerkins-bot: [V: 04-1] Jenkins integration of rspec [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [14:10:41] <_joe_> halfak: to answer your question: you could define the number of ores workers via hiera in the file [14:10:51] https://github.com/wikimedia/puppet/tree/production/hieradata/hosts [14:10:53] <_joe_> hieradata/role/common/ores/stresstest.yaml [14:11:00] Oh.. Interesting [14:11:03] <_joe_> as all those hosts have that role applied [14:11:11] <_joe_> via the role() keyword in site.pp [14:11:16] <_joe_> node /^ores100[1-9]\.eqiad\.wmnet$/ { [14:11:20] <_joe_> role(ores::stresstest) [14:11:25] Got it. Will work on a patchset [14:11:26] <_joe_> } [14:11:40] Just to confirm, no other hosts have that role applied. [14:11:44] right? [14:11:49] Dereckson: you there? 
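On the question of which hosts pick up the `ores::stresstest` role: the node pattern _joe_ pasted at 14:11 can be checked against candidate hostnames. The following is only a rough sketch. It uses `grep -E`, whose regex dialect differs from the Ruby regexes Puppet uses but agrees with them for this simple pattern, and every hostname other than the `ores100[1-9]` ones is a made-up input for illustration. It says nothing about other node blocks in site.pp that might apply the same role.

```bash
# Pattern copied from the node block quoted in the chat.
pattern='^ores100[1-9]\.eqiad\.wmnet$'

# Succeeds (exit 0) when the given hostname matches the node pattern.
matches_stresstest() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Illustrative hostnames; all but the first two are hypothetical.
for host in ores1001.eqiad.wmnet ores1009.eqiad.wmnet \
            ores1010.eqiad.wmnet ores2001.codfw.wmnet scb1001.eqiad.wmnet; do
    if matches_stresstest "$host"; then
        echo "$host: matches"
    else
        echo "$host: no match"
    fi
done
```

Only the first two names match. A codfw host would need its own node block (or a broader pattern) to get the role, which is consistent with _joe_'s remark that the corresponding codfw servers could have it applied separately.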
[14:11:50] <_joe_> ok, I can do it at a later time, if you don't have time for it [14:12:03] <_joe_> halfak: it /could/ be the corresponding servers in codfw [14:12:07] <_joe_> but no scb*, no [14:12:17] <_joe_> those have 'role::scb' applied [14:12:28] great [14:14:21] _joe_: a quick question for you, how can I run the spec tests from rake now ? :D [14:14:45] <_joe_> hashar: if you want to run them all? [14:14:54] <_joe_> hashar: wait for my next patch :P [14:15:00] ..... ;D [14:15:39] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369913 (owner: 10Marostegui) [14:16:16] * halfak is going to do something guess-ish and we can iterate from there [14:17:19] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:18:06] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369913 (owner: 10Marostegui) [14:18:10] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [14:18:15] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369913 (owner: 10Marostegui) [14:19:05] (03PS1) 10Halfak: Adds hieradata for ores::celery::workers with default. 
[puppet] - 10https://gerrit.wikimedia.org/r/369915 (https://phabricator.wikimedia.org/T169246) [14:19:05] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2045 - T170662 (duration: 00m 46s) [14:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:16] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [14:19:39] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:20:39] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [14:21:52] !log Restart MySQL on db2019 to get the new socket location updated [14:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:19] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:22:59] (03PS9) 10Hashar: Jenkins integration of rspec [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) [14:24:19] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [14:24:21] (03CR) 10Hashar: "I have rebased it on top of Giuseppe patch that runs rspec from rake :-}" [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [14:25:27] (03PS3) 10Hashar: authdns: basic spec [puppet] - 10https://gerrit.wikimedia.org/r/341602 [14:25:29] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:26:29] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} 
(normal source and target) timed out before a response was received [14:26:37] (03PS2) 10Hashar: contint: boilerplate for spec tests [puppet] - 10https://gerrit.wikimedia.org/r/342206 [14:27:10] Reedy, do you plan to create other pending wikis too? [14:27:19] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:27:19] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [14:27:25] Urbanecm: What else is pending? [14:27:37] !log reedy@tin Synchronized dblists: wikimania2018wiki (duration: 00m 47s) [14:27:46] Reedy, seems only one other wiki, T168765 [14:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:47] T168765: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765 [14:28:24] !log reedy@tin rebuilt wikiversions.php and synchronized wikiversions files: (no justification provided) [14:28:24] (03PS8) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 [14:28:29] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [14:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:07] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3497253 (10Urbanecm) 05Open>03Resolved Everything is done. [14:29:07] (03PS8) 10Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 [14:29:17] Urbanecm: I guess we didn't get a logo? 
[14:29:18] (03CR) 10jerkins-bot: [V: 04-1] interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [14:29:19] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received [14:29:39] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: wikimania2018wiki (duration: 00m 47s) [14:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:49] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:30:19] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [14:30:29] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:30:48] Reedy, really? Logos are in https://gerrit.wikimedia.org/r/#/c/368165/ [14:31:01] Urbanecm: for wikimania2018wiki [14:31:10] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 854.15 seconds [14:31:29] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [14:31:29] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:31:43] (03CR) 10jerkins-bot: [V: 04-1] systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [14:32:23] Reedy, yes, there's no logo currently. I guess we can add the wiki with default logo and add their one as soon as it is available? 
[14:32:32] The wiki's already added :) [14:32:45] (03PS1) 10Giuseppe Lavagetto: Rakefile: re-add some global tasks [puppet] - 10https://gerrit.wikimedia.org/r/369918 [14:32:49] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [14:33:20] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:33:21] is the recommendation api flapping in codfw? [14:33:29] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:33:34] looks like it [14:33:36] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: re-add some global tasks [puppet] - 10https://gerrit.wikimedia.org/r/369918 (owner: 10Giuseppe Lavagetto) [14:34:05] Reedy, great :). What about hiwikiversity? [14:34:23] Urbanecm: I'm just looking at the arguments about Topic/Flow [14:34:27] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3497281 (10fgiunchedi) >>! In T158837#3453266, @Krinkle wrote: > (Draft / brain dump) > > * performance.wikimedia.org: simple frontend, low-p... [14:34:32] Reedy, ack [14:34:43] They seem to be requesting flow is disabled.. but it's in the dblist? [14:35:19] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [14:35:44] Because the request to disable it was declined. 
[14:36:13] Reedy: we sorted that out, they keep Flow enabled ; they submitted a translation for the Flow extension Topic: namespace in hindi, merged in wmf12 [14:36:26] (wikiversity is still wmf11 today) [14:36:29] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [14:36:36] (03PS10) 10Hashar: Jenkins integration of rspec [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) [14:36:36] And I think there's some .12 blockers... [14:36:39] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received [14:37:29] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [14:38:29] Reedy, Dereckson: can we take the translation change to wmf.11? [14:38:29] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [14:38:39] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [14:38:43] I think we can probably ignore it... 
If they're not going to use the namespaces [14:38:45] Just yet [14:40:53] (03PS6) 10Jcrespo: mariadb: Add new python3 script to check the health of a server [puppet] - 10https://gerrit.wikimedia.org/r/369397 (https://phabricator.wikimedia.org/T171928) [14:41:11] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3386400 (10Reedy) Restbase is https://gerrit.wikimedia.org/r/#/c/369919/ [14:43:46] (03PS3) 10Muehlenhoff: Adapt debdeploy server components to Cumin (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/368190 [14:43:48] (03CR) 10Jcrespo: [C: 032] "I am going to merge this, with not yet plans to use it on icinga, but we can check the output on every server including the root clients a" [puppet] - 10https://gerrit.wikimedia.org/r/369397 (https://phabricator.wikimedia.org/T171928) (owner: 10Jcrespo) [14:44:39] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [14:44:56] mobrovac: o/ --^ [14:45:07] recommendation_api flapping in codfw [14:45:08] sigh [14:45:21] good morning :) [14:45:29] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [14:45:39] ha, it doesn't look like it at this point but buongiorno elukey :) [14:46:38] I checked logs on scb2005 but I didn't get much out of it, let me know if you need help [14:47:38] elukey: i think this wdqs [14:47:43] *this is [14:47:53] I saw some 500s, are those ones? [14:48:06] (03PS14) 10Reedy: Initial configuration for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) (owner: 10Urbanecm) [14:48:13] on scb2005? 
[14:48:14] (03CR) 10Reedy: [C: 032] Initial configuration for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) (owner: 10Urbanecm) [14:48:18] yeah [14:48:24] for sparql [14:48:31] Reedy: Urbanecm: yes, if they're not in an hurry to use Flow, they can live with a Topic: instead of the Hindi localisation [14:48:57] yup, that was sparql [14:49:08] TabbyCat: how can I help you? [14:49:26] elukey: QueryTimeoutException: Query deadline is expired [14:49:37] (03PS1) 10Reedy: Add hiwikiversity to labsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/369924 (https://phabricator.wikimedia.org/T168765) [14:49:46] Dereckson: hi, I left the query at conpherence room site requests [14:49:54] sorry :( [14:51:03] (03Merged) 10jenkins-bot: Initial configuration for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) (owner: 10Urbanecm) [14:51:13] (03CR) 10jenkins-bot: Initial configuration for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) (owner: 10Urbanecm) [14:52:17] !log unbanning and repooling elastic10(29|30|31|32) - T168816 [14:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:27] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [14:53:13] !log mwscript initSiteStats.php --wiki srwikiquote --update (T172241) [14:53:17] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:22] T172241: Fix article counter on sr.wikiquote.org - https://phabricator.wikimedia.org/T172241 [14:53:33] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(29|30|31|32).eqiad.wmnet [14:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:13] TabbyCat: how do you know they have > 222 articles ? [14:54:28] (an article is defined by a minimal weight in Kb for example) [14:54:32] Dereckson: I don't know, they said so ;) [14:54:51] just wondering if it was updateArticleCount or InitSiteStats [14:55:00] both do roughly the same? [14:55:03] !log reedy@tin Synchronized dblists: hiwikiversity (duration: 00m 49s) [14:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:27] TabbyCat: 1184 pages, 222 pages, seems legit to me [14:55:48] !log reedy@tin rebuilt wikiversions.php and synchronized wikiversions files: hiwikiversity [14:55:53] I did an approximate count and 222 looks legit to me as well Dereckson [14:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:09] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:56:34] Dereckson: I think the requestor confuses the number of articles with the number of pages [14:56:45] TabbyCat: Can you do me a favour? :P [14:56:55] Reedy: yes, I'll keep quiet [14:57:00] Add wikimania2018wiki to https://meta.wikimedia.org/wiki/Interwiki_map please? 
[14:57:01] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: hiwikiversity (duration: 00m 48s) [14:57:08] So I don't have to login with my staff account [14:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:14] (03Draft1) 10Hashar: (WIP) trigger all modules [puppet] - 10https://gerrit.wikimedia.org/r/369923 [14:57:16] Reedy: ok [14:57:19] thanks [14:58:22] !log reedy@tin Synchronized static/images/project-logos/: hiwikiversity (duration: 00m 46s) [14:58:24] Then I'll update the IW map for all [14:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:43] Reedy: can I do that? just got access to deployment-prep, good way to start :) [14:58:47] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:00] TabbyCat: You can try... I dunno how well it'll actually work :) [14:59:04] the labs and sanitarium errors it's me [14:59:07] (03CR) 10Hashar: "That apparently causes a death loop. I suspect rspec recursively process modules which have fixtures having a spec directory..." 
[puppet] - 10https://gerrit.wikimedia.org/r/369923 (owner: 10Hashar) [14:59:14] We've still got an hour in the deploy window, so plenty of time [14:59:32] https://meta.wikimedia.org/w/index.php?title=Interwiki_map&diff=17068295&oldid=17010488 <-- added wm2018 [14:59:38] tryin' the map [14:59:44] * TabbyCat figures out [14:59:57] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [15:00:48] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [15:01:01] (03PS1) 10Jcrespo: labsdb/mariadb: Change package resource for require_package function [puppet] - 10https://gerrit.wikimedia.org/r/369927 [15:01:48] (03CR) 10jerkins-bot: [V: 04-1] labsdb/mariadb: Change package resource for require_package function [puppet] - 10https://gerrit.wikimedia.org/r/369927 (owner: 10Jcrespo) [15:02:21] Reedy: Urbanecm also prepared the config for techconduct. by the way (https://phabricator.wikimedia.org/T165977) [15:02:39] !log unbanning and repooling elastic1028 - T168816 [15:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:49] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [15:03:25] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic1028.eqiad.wmnet [15:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:07] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [15:04:08] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:04:36] (03CR) 10Jforrester: [C: 031] "Sounds good to me, leaving to Reedy to deploy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368770 (owner: 10Steinsplitter) [15:04:52] (03PS1) 10Cmjohnson: Adding mgmt dns for druid1006 [dns] - 10https://gerrit.wikimedia.org/r/369930 [15:05:07] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [15:05:39] (03PS4) 10Muehlenhoff: Adapt debdeploy server components to Cumin (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/368190 [15:06:21] (03PS2) 10Jcrespo: labsdb/mariadb: Change package resource for require_package function [puppet] - 10https://gerrit.wikimedia.org/r/369927 [15:08:20] (03CR) 10Reedy: Revert "Revert "Set initial configuration for techconduct.wikimedia.org"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368415 (https://phabricator.wikimedia.org/T165977) (owner: 10Urbanecm) [15:08:47] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:08:47] (03PS4) 10Reedy: "Set initial configuration for techconduct.wikimedia.org" take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368415 (https://phabricator.wikimedia.org/T165977) (owner: 10Urbanecm) [15:09:01] (03PS2) 10Cmjohnson: Adding mgmt dns for druid1006 [dns] - 10https://gerrit.wikimedia.org/r/369930 [15:09:37] (03CR) 10Jcrespo: [C: 032] labsdb/mariadb: Change package resource for require_package function [puppet] - 10https://gerrit.wikimedia.org/r/369927 (owner: 10Jcrespo) [15:09:40] (03PS2) 10Hashar: (WIP) trigger all modules [puppet] - 10https://gerrit.wikimedia.org/r/369923 [15:09:45] (03CR) 10Reedy: [C: 032] "Set initial configuration for techconduct.wikimedia.org" take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368415 (https://phabricator.wikimedia.org/T165977) (owner: 10Urbanecm) [15:09:53] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for druid1006 [dns] - 10https://gerrit.wikimedia.org/r/369930 (owner: 10Cmjohnson) [15:12:01] (03PS5) 10EBernhardson: 
CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) [15:12:29] (03CR) 10Thcipriani: "> I guess the population of the git cache that was in PS5 will be done differently?" [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) (owner: 10Thcipriani) [15:15:30] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:16:50] PROBLEM - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:140773F2:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert unexpected message [15:16:55] (03CR) 10Hashar: "Forget me. My patch created trigger.foo files in authdns and contint modules though they have no spec files there. So I guess rspec ended " [puppet] - 10https://gerrit.wikimedia.org/r/369923 (owner: 10Hashar) [15:17:00] (03Merged) 10jenkins-bot: "Set initial configuration for techconduct.wikimedia.org" take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368415 (https://phabricator.wikimedia.org/T165977) (owner: 10Urbanecm) [15:17:10] (03CR) 10jenkins-bot: "Set initial configuration for techconduct.wikimedia.org" take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368415 (https://phabricator.wikimedia.org/T165977) (owner: 10Urbanecm) [15:19:20] PROBLEM - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.168 and port 9042: Connection refused [15:19:50] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:19:51] PROBLEM - cassandra-b service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:21:35] wikimania2018wiki creation apparently got stuck on dbstore1002 [15:21:46] stuck? 
[15:21:55] yeah, literally stuck [15:22:11] dbstore1002 is not critical- it is not production for mediawiki [15:22:18] but that shouldn't happen anyway [15:23:07] of course- TokuDB cannot create an index [15:23:16] on an empty table [15:23:28] if it is stuck, it is always TokuDB [15:23:50] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [15:23:51] RECOVERY - cassandra-b service on restbase-dev1004 is OK: OK - cassandra-b is active [15:23:56] especially with tables with no primary keys [15:24:31] !log reedy@tin Synchronized dblists/: techconductwiki (duration: 00m 47s) [15:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:38] !log reedy@tin Synchronized static/images/project-logos/: techconductwiki (duration: 00m 44s) [15:25:40] RECOVERY - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-b valid until 2018-07-20 15:08:05 +0000 (expires in 350 days) [15:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:57] jynus: it's the second window in a row where there is a dbstore1002 issue, isn't it? [15:26:05] should we file a task about that? 
[15:26:05] and killing the replication doesn't obey [15:26:14] no [15:26:20] RECOVERY - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.168 port 9042 [15:26:24] we all know tokudb doesn't work [15:26:31] we just have to deal with it [15:27:00] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:27:09] Dereckson: this is not related to production- we do not use tokudb there [15:27:13] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: techconductwiki (duration: 00m 46s) [15:27:20] this among many other reasons [15:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:31] so entirely our responsibility [15:28:13] o/ _joe_. How're things looking? Have some time now? [15:29:15] ok [15:29:25] (03PS1) 10Reedy: fix techconductwiki in wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369937 [15:29:35] no reason to report it if we cannot do any actionables [15:29:36] (03CR) 10Reedy: [C: 032] fix techconductwiki in wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369937 (owner: 10Reedy) [15:30:17] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3497550 (10Reedy) Wiki has been created. Looks like #wikidata is broken... 
[15:30:41] I have reported it upstream, but I only got 2 answers: mariadb says it is percona's fault, percona says it is mariadb's fault, and other people say to upgrade [15:30:54] jynus: install mongodb [15:30:57] problem solved [15:31:04] (03Merged) 10jenkins-bot: fix techconductwiki in wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369937 (owner: 10Reedy) [15:31:16] (03CR) 10jenkins-bot: fix techconductwiki in wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369937 (owner: 10Reedy) [15:31:37] !log reedy@tin rebuilt wikiversions.php and synchronized wikiversions files: techconductwiki [15:31:45] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 3 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3497557 (10Urbanecm) >>! In T168765#3497550, @Reedy wrote: > Wiki has been created. Looks like #wikidata is broken... Adding them. [15:31:47] Reedy is the real MVP [15:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:03] the only way forward I see is to stop using tokudb for mediawiki data and convince mobile, analytics and research they do not need multi-source [15:33:02] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:33:06] * halfak sees research get pinged [15:33:16] multi-source? [15:33:37] yes, single monolithic instances of mysql [15:33:53] rather than smaller, light, easy to failover and repair multiple instances [15:33:57] Ahh. Yeah being able to join across databases. We're moving towards hive for a lot of things. [15:34:18] jynus, how would querying a user's activity between wikis work there? 
[15:34:18] the monolithic model will not scale forever [15:34:22] Agreed [15:34:40] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/pa [15:34:41] et rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexp [15:34:41] xpecting: 200) [15:34:43] just the same, just 7 queries will have to be done instead of 1 [15:35:01] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/pa [15:35:01] et rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexp [15:35:01] xpecting: 200) [15:35:01] they can be 
done in parallel, so they may be even faster [15:35:11] jynus, yeah... so that'll make it so we spend a lot of researcher/analyst time filling a technological gap. [15:35:19] Seems like monolithic hive is a better solution. [15:35:25] (03PS1) 10Reedy: Update interwiki.php for 3 new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369939 [15:35:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/ [15:35:31] Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status [15:35:31] 0): /en.wikipedia.org/v1/page/mobile-sections/{title}{/revision} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [15:35:40] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:35:41] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:35:47] well, right now it is spending a lot of time of ops/dbas [15:35:50] (03CR) 10Reedy: [C: 032] Update interwiki.php for 3 new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369939 (owner: 10Reedy) [15:35:58] repairing one of those hosts takes 6 months [15:36:10] PROBLEM - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.169 and port 9042: Connection refused [15:36:11] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:36:14] repairing a single instance takes 1 hour [15:36:59] jynus, sure. OK. But it seems like single instances don't fill a need that other batch-processing systems do. [15:37:19] Seems like the multi-db-host is a step between the two with a high maintenance cost. [15:37:22] (03Merged) 10jenkins-bot: Update interwiki.php for 3 new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369939 (owner: 10Reedy) [15:37:32] (03CR) 10jenkins-bot: Update interwiki.php for 3 new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369939 (owner: 10Reedy) [15:37:56] nope [15:38:00] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:38:33] we are moving all hosts to multiinstance, including labs and production [15:38:38] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3497579 (10Cmjohnson) [15:38:48] jynus, I don't understand what you meant by "nope" [15:39:02] "Seems like the multi-db-host is a step between the two with a high maintenance cost. 
" [15:39:07] -> no [15:39:13] !log reedy@tin Synchronized wmf-config/interwiki.php: Update for 3 new wikis (duration: 00m 46s) [15:39:16] It doesn't have a high maintenance cost? [15:39:19] no [15:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:27] Oh. So what's the problem then? [15:39:30] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3471295 (10Cmjohnson) a:05Cmjohnson>03RobH Assigning to @robh to do the installs. **Network ports are disabled [15:39:44] oh, you call multisource hosts multi-db? [15:39:55] multi-db-host = many dbs on a single host [15:39:57] (03CR) 10Faidon Liambotis: "Post-merge (and post-rollback?) review too." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369710 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [15:40:05] I call that multi-instance [15:40:17] Cool. Words are nice., [15:40:17] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3497587 (10Cmjohnson) [15:40:29] multiinstance is easy [15:40:36] multisource is painful [15:40:50] I find those terms very ambiguous, but OK [15:40:59] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3457615 (10Cmjohnson) a:05Cmjohnson>03RobH Assigning to @robh for installs **Network ports are disabled. [15:41:26] (03PS2) 10Giuseppe Lavagetto: Rakefile: re-add some global tasks [puppet] - 10https://gerrit.wikimedia.org/r/369918 [15:41:48] So !nope. Regardless, I don't expect splitting analytics-store into "multiinstance" to go over well without a replacement that allows for cross DB queries. [15:42:01] halfak: hive? 
[15:42:04] It seems HDFS and Hive do allow for many use-cases to still be covered [15:42:06] o/ ottomata [15:42:11] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3497594 (10elukey) [15:42:15] talking to jynus about the pain of maintaining analytics-store [15:42:23] yeah, its a huge pain for eventlogging for simliar reason [15:42:24] s [15:42:26] it doesn't scale [15:42:27] for analytics [15:42:41] And proposing that we start looking at hive as a replacement for all of our cross DB analytics/research [15:42:49] +1 :) [15:42:52] for eventlogging to [15:42:54] too [15:43:00] To see if we can ditch the giant-rdbms-server pattern. [15:43:32] centralauth, log, is a common join pattern. We'll want to make sure that is well supported. [15:43:35] dbstore1002 stuck again [15:43:58] !log T172384: lower tombstone failure threshold in RESTBase dev to 1000 [15:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:12] T172384: OOM exceptions in dev environment - https://phabricator.wikimedia.org/T172384 [15:44:12] But otherwise, I think we should start talking about phasing out analytics-store and I think hive will be a good option for doing so. [15:44:33] jynus, I'll file a task. [15:44:36] 10Operations: Copper root (/) 95% full - https://phabricator.wikimedia.org/T172409#3497614 (10herron) [15:44:50] RECOVERY - cassandra-b service on restbase-dev1006 is OK: OK - cassandra-b is active [15:45:40] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [15:46:50] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [15:47:00] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1927, Errmsg: Error Connection was killed on query. Default database: wikimania2018wiki. 
[Query snipped] [15:47:10] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [15:47:11] RECOVERY - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.169 port 9042 [15:47:20] RECOVERY - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-b valid until 2018-07-20 15:08:11 +0000 (expires in 350 days) [15:47:40] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [15:48:49] FYI: https://phabricator.wikimedia.org/T172410 [15:48:58] ottomata, jynus: ^ [15:49:52] (03PS9) 10Thcipriani: CI/integration: Create role for docker CI agent [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) [15:49:56] not a DBA ticket, though [15:50:00] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [15:50:12] analytics-ops probably? [15:50:17] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 3 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3497672 (10Reedy) [15:50:17] jynus, please feel free to re-tag [15:50:57] 10Operations, 10Analytics, 10Research: Phase out and replace analytics-store (multisource) - https://phabricator.wikimedia.org/T172410#3497676 (10jcrespo) [15:53:21] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:54:01] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1061, Errmsg: Error Duplicate key name wl_user_notificationtimestamp on query. Default database: wikimania2018wiki. 
[Query snipped] [15:54:24] jouncebot: next [15:54:24] In 93 hour(s) and 5 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170807T1300) [15:55:11] Reedy: You broke deployments page with https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1767225&oldid=1767215 [15:55:19] (which is why jouncebot thinks theres nothing til Monday) [15:55:34] booo [15:55:36] Oh, I see the break [15:55:42] every single table is failing to be created [15:55:55] jouncebot: reload [15:55:59] jouncebot: refresh [15:56:01] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [15:56:02] I refreshed my knowledge about deployments. [15:56:05] jouncebot: next [15:56:07] In 0 hour(s) and 3 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T1600) [15:56:57] Reedy: #til...if you and I make the *exact* same edit byte-for-byte....I don't get an edit conflict. [15:57:02] MW just treats it as a null edit [15:58:02] hh [15:58:12] (03PS1) 10Andrew Bogott: vmbuilder: don't copy sshd_config in postinst [puppet] - 10https://gerrit.wikimedia.org/r/369943 [15:58:27] 10Operations, 10Analytics, 10Wikimedia-Stream, 10hardware-requests, 10Patch-For-Review: decommission rcs100[12] - https://phabricator.wikimedia.org/T170157#3497701 (10elukey) [15:58:49] (03CR) 10Chad: [C: 032] wmf-config: Improve docs in CommonSettings.php and LocalSettings.php header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369818 (owner: 10Krinkle) [15:59:44] Dereckson: at what time did you create the last wiki? [16:00:05] jynus: Today? It was me creating them [16:00:05] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T1600). 
[16:00:05] mobrovac, thcipriani, and reedy: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:10] oh [16:00:12] sorry [16:00:16] * mobrovac here [16:00:17] (03CR) 10Andrew Bogott: [C: 032] vmbuilder: don't copy sshd_config in postinst [puppet] - 10https://gerrit.wikimedia.org/r/369943 (owner: 10Andrew Bogott) [16:00:19] (03Merged) 10jenkins-bot: wmf-config: Improve docs in CommonSettings.php and LocalSettings.php header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369818 (owner: 10Krinkle) [16:00:22] I thought becase he noticed the errors [16:00:29] Reedy: same question [16:00:34] mobrovac: taking a look at your patch [16:00:37] k [16:00:38] ah first [16:00:39] (03CR) 10jenkins-bot: wmf-config: Improve docs in CommonSettings.php and LocalSettings.php header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369818 (owner: 10Krinkle) [16:00:57] !log silence cassandra-related alerts on restbase-dev cluster, known OOMs [16:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:08] mobrovac: If you're about for puppetswat... Can you +1 https://gerrit.wikimedia.org/r/369919 and help it go out? 
(I don't know how much ops does, and how much you guys do :)) [16:01:19] looking [16:01:26] it's a new wiki created today [16:01:40] PROBLEM - Check size of conntrack table on ms-fe1005 is CRITICAL: CRITICAL: nf_conntrack is 98 % full [16:01:55] (03PS2) 10Filippo Giunchedi: RESTBase: Add the Recommendation API URI [puppet] - 10https://gerrit.wikimedia.org/r/369446 (https://phabricator.wikimedia.org/T170877) (owner: 10Mobrovac) [16:01:55] yes, I just need a timestamp after no script was run related to the new wiki [16:02:01] give or take [16:02:04] interesting, that's new (conntrack on ms-fe1005 [16:02:06] Reedy: ok, will go out today [16:02:15] (will deploy restbase a bit later in the day) [16:02:18] Reedy: better than "now" :-) [16:02:23] jynus: techconductwiki was created between 1517 and 1524 UTC [16:02:31] !log demon@tin Synchronized scap/plugins/prep.py: doc fixes, no-op (duration: 00m 47s) [16:02:37] was there more, or was that the last one? [16:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:40] RECOVERY - Check size of conntrack table on ms-fe1005 is OK: OK: nf_conntrack is 79 % full [16:02:42] That was the 3rd and final [16:02:49] thank you Reedy [16:02:52] hiwikiversity and wikimania2018wiki were created before that [16:02:59] that is good info! [16:03:27] * thcipriani here for puppet swat [16:03:37] !log demon@tin Synchronized wmf-config/: doc fixes, no-op (duration: 00m 48s) [16:03:41] ema: I guess the increased load on swift is due to your varnish upgrade? 
[16:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:51] mobrovac: cheers [16:03:56] (03CR) 10Filippo Giunchedi: [C: 032] RESTBase: Add the Recommendation API URI [puppet] - 10https://gerrit.wikimedia.org/r/369446 (https://phabricator.wikimedia.org/T170877) (owner: 10Mobrovac) [16:04:07] note that techconduct is private, must not replicate to labs [16:04:25] godog: it can be yes [16:04:26] mobrovac: merged [16:04:33] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/369923 (owner: 10Hashar) [16:04:44] currently replicating 14:25:55 data [16:04:49] godog: is it causing problems? [16:04:52] grazie godog, will run puppet on rb hosts [16:04:53] TabbyCat: That was done beforehand [16:04:58] Error: 503, Backend fetch failed at Thu, 03 Aug 2017 16:04:41 GMT [16:05:18] @phabricator, but back up again [16:05:19] I will wait until 15:24 arrives to revert default storage engine back to tokudb [16:05:51] !log demon@tin Started deploy [gerrit/gerrit@15f1544]: This is a test, disregard [16:05:55] !log demon@tin Finished deploy [gerrit/gerrit@15f1544]: This is a test, disregard (duration: 00m 03s) [16:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:19] godog: I've stopped the upgrades for now (only 3 hosts to go) [16:06:21] ema: there was a conntrack exhaustion icinga alert earlier, so I'm not sure yet, requests are up 2x tho [16:06:44] ema: ok thanks! 
I was looking at https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1&from=1501667571675&to=1501776290373&var-DC=eqiad&refresh=1m [16:07:25] godog: yup I might have a been moving a bit too quickly, sorry about that [16:08:35] (03CR) 10Paladox: [C: 031] keyholder: public keys publicly readable [puppet] - 10https://gerrit.wikimedia.org/r/369817 (https://phabricator.wikimedia.org/T172333) (owner: 10Thcipriani) [16:08:37] (03PS1) 10Cmjohnson: Adding production dns for labmon1002 in labs-support1-a-eqiad T165784 [dns] - 10https://gerrit.wikimedia.org/r/369945 [16:09:12] (03PS2) 10Filippo Giunchedi: keyholder: public keys publicly readable [puppet] - 10https://gerrit.wikimedia.org/r/369817 (https://phabricator.wikimedia.org/T172333) (owner: 10Thcipriani) [16:09:16] ema: hehe no worries! [16:09:36] phabricator issues? [16:10:23] (03CR) 10Filippo Giunchedi: [C: 032] keyholder: public keys publicly readable [puppet] - 10https://gerrit.wikimedia.org/r/369817 (https://phabricator.wikimedia.org/T172333) (owner: 10Thcipriani) [16:10:30] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [16:10:41] thcipriani: ^ merged [16:10:57] godog: \o/ [16:11:10] jynus hi, what do you mean phabricator issues? [16:11:26] dbproxy1008 just complained [16:11:32] jynus: Yeah, it's been flapping [16:11:36] "master read only" etc [16:11:44] that is the proxy serving phabricator [16:12:02] I mean it's flapping from the Phab side. I didn't mention it cuz I thought it was what you were talking about ^^^ [16:12:07] (like way further up) [16:13:00] db1043 is down from the point of view of dbproxy1008 [16:13:13] Hmm, but it's not actually? 
[16:14:00] the active proxy right now is dbproxy1003 [16:14:23] which means 8 shaw the db down for some time [16:14:55] (03CR) 10Cmjohnson: [C: 032] Adding production dns for labmon1002 in labs-support1-a-eqiad T165784 [dns] - 10https://gerrit.wikimedia.org/r/369945 (owner: 10Cmjohnson) [16:15:21] I am going to repoint back to the actual master on the passive proxy [16:15:48] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3497729 (10Cmjohnson) [16:15:50] !log repointing dbproxy1008 back to db1043 [16:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:34] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [16:16:36] it is strange that only one proxy complained, it must have been a spurious error [16:16:43] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3276770 (10Cmjohnson) a:05Cmjohnson>03RobH Handing this off to @robh to do installs. **Network port is disabled....Has a raid card -- not sure if h/w raid is needed. [16:17:13] (03CR) 10Nikerabbit: "It seems this was not applied to https://phabricator.wikimedia.org/source/keyholder/#keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/358884 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [16:19:14] godog: dangit. making world readable public keys is no good without x on the parent dir :( [16:19:40] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ldap-utils] [16:21:07] thcipriani: depends, if you know the filename already then you can read it without x [16:21:38] thcipriani: unless ssh refused altogether? [16:22:36] godog: seems to refuse all together if the parent is 750. Like I currently can't cat /etc/keyholder.d/mwdeploy.pub. 
I think I may need 751 :\ [16:22:40] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:23:04] (03CR) 10Chad: "Created https://phabricator.wikimedia.org/D736 to apply it there (we should kill this puppet thing btw)" [puppet] - 10https://gerrit.wikimedia.org/r/358884 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [16:24:36] (03CR) 10Ayounsi: Define network infra ranges and allow them to send syslog to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369697 (https://phabricator.wikimedia.org/T166126) (owner: 10Ayounsi) [16:25:44] thcipriani: ah yeah, you are right x is missing not r and not required [16:25:54] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 3 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3497776 (10MarcoAurelio) [16:26:32] (03PS1) 10Thcipriani: Keyholder: allow users to access files they can read [puppet] - 10https://gerrit.wikimedia.org/r/369947 [16:26:45] godog: I think ^ patch is what I need [16:26:52] (03PS2) 10Ayounsi: Define network infra ranges and allow them to send syslog to logstash [puppet] - 10https://gerrit.wikimedia.org/r/369697 (https://phabricator.wikimedia.org/T166126) [16:28:07] (03CR) 10Filippo Giunchedi: [C: 032] Keyholder: allow users to access files they can read [puppet] - 10https://gerrit.wikimedia.org/r/369947 (owner: 10Thcipriani) [16:28:13] thcipriani: yup [16:28:36] thanks :) [16:30:30] thcipriani: try again on tin [16:30:39] 10Operations, 10Wikidata, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928#3497801 (10jcrespo) ``` $ check_mariadb.py -h db1052 --slave-status --primary-dc=eqiad {"datetime": 1501777331.898183, "ssl_expiration": 1619... [16:30:49] * thcipriani does [16:31:09] godog: works! [16:31:24] godog: thank you! 
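[note] The keyholder permission exchange above (a 750 parent directory blocking reads of a world-readable public key until it became 751) can be sketched as below. This is a simplified model, not the puppet change itself: it only looks at the "other" permission bits, ignores ACLs, root, and directories higher up the path, and the 0o644 file mode is an assumption based on "world readable" in the discussion. The function names are hypothetical.

```python
import stat


def other_can_open_by_name(dir_mode: int, file_mode: int) -> bool:
    """Can a user outside the owner and group open a known filename for reading?

    Opening by path needs search (x) permission on the directory
    and read (r) permission on the file itself.
    """
    can_traverse = bool(dir_mode & stat.S_IXOTH)
    can_read_file = bool(file_mode & stat.S_IROTH)
    return can_traverse and can_read_file


def other_can_list(dir_mode: int) -> bool:
    """Can that same user list the directory's entries? Needs r on the directory."""
    return bool(dir_mode & stat.S_IROTH)


if __name__ == "__main__":
    for dir_mode in (0o750, 0o751):
        print(
            oct(dir_mode),
            "open known file:", other_can_open_by_name(dir_mode, 0o644),
            "list dir:", other_can_list(dir_mode),
        )
```

With 751 the directory still cannot be listed by other users, but a file whose name is already known can be opened, which matches "x is missing not r and not required".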
[16:31:32] thcipriani: np :) [16:32:13] Oooooh, should I be able to actually use a key now? [16:32:14] :p [16:34:01] RainbowSprinkles: I *think* so, still need a scap update to support it tho :P [16:34:26] Yeah, but I should be able to test said key now [16:35:48] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3497818 (10Andrew) [16:36:05] (03PS1) 10Andrew Bogott: designate: distinguish puppetmaster from controller [puppet] - 10https://gerrit.wikimedia.org/r/369951 (https://phabricator.wikimedia.org/T171786) [16:36:06] (03PS1) 10Andrew Bogott: designate: switch to the new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/369952 (https://phabricator.wikimedia.org/T171786) [16:41:50] (03CR) 10Andrew Bogott: [C: 032] designate: distinguish puppetmaster from controller [puppet] - 10https://gerrit.wikimedia.org/r/369951 (https://phabricator.wikimedia.org/T171786) (owner: 10Andrew Bogott) [16:42:30] (03PS1) 10RobH: druid100[456] install params [puppet] - 10https://gerrit.wikimedia.org/r/369954 (https://phabricator.wikimedia.org/T171626) [16:43:07] (03CR) 10jerkins-bot: [V: 04-1] druid100[456] install params [puppet] - 10https://gerrit.wikimedia.org/r/369954 (https://phabricator.wikimedia.org/T171626) (owner: 10RobH) [16:43:12] 10Operations, 10MW-1.30-release-notes, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3497832 (10Reedy) 05Open>03Resolved a:03Reedy [16:43:40] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [16:44:59] Has puppet swat stopped? 
:P [16:45:07] (03PS2) 10RobH: druid100[456] install params [puppet] - 10https://gerrit.wikimedia.org/r/369954 (https://phabricator.wikimedia.org/T171626) [16:45:49] (03CR) 10jerkins-bot: [V: 04-1] druid100[456] install params [puppet] - 10https://gerrit.wikimedia.org/r/369954 (https://phabricator.wikimedia.org/T171626) (owner: 10RobH) [16:46:46] grumble typo grumble [16:46:55] (03PS3) 10RobH: druid100[456] install params [puppet] - 10https://gerrit.wikimedia.org/r/369954 (https://phabricator.wikimedia.org/T171626) [16:46:57] * grumble grumbles [16:47:40] grumble: ha, sorry about that ping ;D [16:48:07] (03CR) 10RobH: [C: 032] druid100[456] install params [puppet] - 10https://gerrit.wikimedia.org/r/369954 (https://phabricator.wikimedia.org/T171626) (owner: 10RobH) [16:48:16] (03PS4) 10RobH: druid100[456] install params [puppet] - 10https://gerrit.wikimedia.org/r/369954 (https://phabricator.wikimedia.org/T171626) [16:48:27] godog: What about my patches? :P [16:48:59] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3497835 (10Johan) Meta will probably do. I'll take care of it. [16:52:54] (03CR) 10Dzahn: "i didn't even know it existed in another place" [puppet] - 10https://gerrit.wikimedia.org/r/358884 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [16:53:14] 10Operations: Copper root (/) 95% full - https://phabricator.wikimedia.org/T172409#3497844 (10herron) 05Open>03Resolved a:03herron /home has been brought down to 80% which clears the warning. Filesystem Size Used Avail Use% Mounted on /dev/md0 46G 35G 8.9G 80% / Thanks @Joe and @M... 
[16:54:25] (03CR) 10Andrew Bogott: [C: 032] designate: switch to the new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/369952 (https://phabricator.wikimedia.org/T171786) (owner: 10Andrew Bogott) [16:54:31] (03PS2) 10Andrew Bogott: designate: switch to the new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/369952 (https://phabricator.wikimedia.org/T171786) [16:55:02] (03CR) 10Dzahn: "what do you mean by "kill this puppet thing"? delete the module?" [puppet] - 10https://gerrit.wikimedia.org/r/358884 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T1700). [17:00:24] Nothing for ORES today [17:00:33] If anyone is doing parsoid... https://gerrit.wikimedia.org/r/#/c/369928/ please :) [17:00:35] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3497892 (10Whatamidoing-WMF) >>! In T163251#3497006, @BBlack wrote: > For the te... [17:01:21] subbu: Are you doing a parsoid deploy? 
[17:02:20] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 271.06 seconds [17:03:51] (03PS1) 10Rush: openstack: Jessie has Liberty py3 client libraries [puppet] - 10https://gerrit.wikimedia.org/r/369957 (https://phabricator.wikimedia.org/T171494) [17:04:21] (03PS1) 10Thcipriani: Gerrit: Add supplemental key to authorizedkeys [puppet] - 10https://gerrit.wikimedia.org/r/369958 [17:04:23] (03PS2) 10Rush: openstack: Jessie has Liberty py3 client libraries [puppet] - 10https://gerrit.wikimedia.org/r/369957 (https://phabricator.wikimedia.org/T171494) [17:05:00] (03PS1) 10Andrew Bogott: labs puppetmaster: allow ssh through firewall from designate [puppet] - 10https://gerrit.wikimedia.org/r/369959 [17:05:15] (03PS2) 10Phuedx: pagePreviews: Tidy up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) [17:05:17] (03PS1) 10Phuedx: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) [17:05:29] (03CR) 10Paladox: [C: 031] Gerrit: Add supplemental key to authorizedkeys [puppet] - 10https://gerrit.wikimedia.org/r/369958 (owner: 10Thcipriani) [17:05:51] (03CR) 10Andrew Bogott: [C: 032] labs puppetmaster: allow ssh through firewall from designate [puppet] - 10https://gerrit.wikimedia.org/r/369959 (owner: 10Andrew Bogott) [17:06:53] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Tidy up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [17:06:57] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [17:07:23] 10Operations: Copper root (/) 95% full - https://phabricator.wikimedia.org/T172409#3497910 (10herron) 05Resolved>03Open Reopening to migrate /home to a 
larger and dedicated filesystem [17:11:26] (03PS3) 10Rush: openstack: Jessie has Liberty py3 client libraries [puppet] - 10https://gerrit.wikimedia.org/r/369957 (https://phabricator.wikimedia.org/T171494) [17:12:10] (03PS3) 10Phuedx: pagePreviews: Tidy up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) [17:12:12] (03PS2) 10Phuedx: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) [17:13:43] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [17:14:20] (03PS1) 10RobH: druid1006 mac address [puppet] - 10https://gerrit.wikimedia.org/r/369962 (https://phabricator.wikimedia.org/T166510) [17:14:31] (03PS2) 10RobH: druid1006 mac address [puppet] - 10https://gerrit.wikimedia.org/r/369962 (https://phabricator.wikimedia.org/T166510) [17:14:43] (03CR) 10RobH: [C: 032] druid1006 mac address [puppet] - 10https://gerrit.wikimedia.org/r/369962 (https://phabricator.wikimedia.org/T166510) (owner: 10RobH) [17:18:20] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [17:18:52] (03PS3) 10Phuedx: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) [17:20:01] (03PS4) 10Phuedx: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) [17:20:29] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3497942 (10BBlack) 
The current message text (which needs massaging and updating... [17:21:30] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [17:22:28] (03PS1) 10RobH: druid100[56] dns entry corrections [dns] - 10https://gerrit.wikimedia.org/r/369965 (https://phabricator.wikimedia.org/T171626) [17:23:50] (03CR) 10RobH: [C: 032] druid100[56] dns entry corrections [dns] - 10https://gerrit.wikimedia.org/r/369965 (https://phabricator.wikimedia.org/T171626) (owner: 10RobH) [17:25:15] (03PS4) 10Rush: openstack: Jessie has Liberty py3 client libraries [puppet] - 10https://gerrit.wikimedia.org/r/369957 (https://phabricator.wikimedia.org/T171494) [17:26:08] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3497948 (10Johan) If possible, I'd like to have "Wikipedia won't work in Interne... [17:26:59] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3497949 (10Johan) (I'll take care of gathering translations, in that case.) [17:27:20] (03PS5) 10Rush: openstack: Jessie has Liberty py3 client libraries [puppet] - 10https://gerrit.wikimedia.org/r/369957 (https://phabricator.wikimedia.org/T171494) [17:29:07] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3497950 (10BBlack) Works for me. If you can paste back the text form of whateve... 
[17:30:00] (03CR) 10Rush: [C: 032] openstack: Jessie has Liberty py3 client libraries [puppet] - 10https://gerrit.wikimedia.org/r/369957 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:32:10] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:34:29] andrewbogott: see nova-fullstack [17:34:55] chasemp: I'm looking, I think this is rabbit being stuck again? [17:35:21] andrewbogott: I didn't get a chance to look I'm right in teh middle of fixing the labstores [17:35:49] timeouts all over the place, seemingly unrelated to what I'm doing [17:35:50] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-novaclient] [17:36:05] I'm going to just give it a few minutes [17:36:22] * bd808 punched rabbitmq in its cute little bunbun face [17:37:58] well it seems the liberty has the right packages idea may be bad as it does but then not all dependencies [17:38:01] for jessie [17:38:20] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-novaclient] [17:39:21] 10Operations: Copper root (/) 95% full - https://phabricator.wikimedia.org/T172409#3497964 (10herron) /home has been migrated to an lvm backed ext4 filesystem copper:/# df -h /home Filesystem Size Used Avail Use% Mounted on /dev/mapper/copper--vg-home 69G 23G 44G 34% /home For p... 
[17:39:26] 10Operations: Copper root (/) 95% full - https://phabricator.wikimedia.org/T172409#3497965 (10herron) 05Open>03Resolved [17:39:34] !log arlolra@tin Started deploy [parsoid/deploy@612e711]: Updating Parsoid to 651f12c2 [17:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:28] (03CR) 10Thcipriani: "> what do you mean by "kill this puppet thing"? delete the module?" [puppet] - 10https://gerrit.wikimedia.org/r/358884 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [17:47:03] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Get translations for "IE8 on XP won't work" for page - https://phabricator.wikimedia.org/T172418#3497985 (10Johan) [17:47:25] (03PS4) 10Phuedx: pagePreviews: Tidy up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) [17:47:27] (03PS5) 10Phuedx: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) [17:48:03] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:50:49] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3498017 (10RobH) [17:50:52] !log arlolra@tin Finished deploy [parsoid/deploy@612e711]: Updating Parsoid to 651f12c2 (duration: 11m 17s) [17:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:17] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 3 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3498018 (10Jayprakash12345) Look good https://www.wikidata.org/w/index.php?title=Q5296&action=historysubmit&type=revision&diff=531399495&oldid=... 
[17:51:33] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:51:38] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3471295 (10RobH) a:05RobH>03Ottomata These 3 systems are ready to be placed into service, and are calling into puppet with role spare at present. This task can be reso... [17:52:44] (03PS1) 10Reedy: Redirect resources.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/369970 (https://phabricator.wikimedia.org/T172417) [17:53:58] 10Operations, 10Wikimania-Hackathon-2017-Organization, 10Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3498026 (10Dzahn) @Antoine2711 Please go to [[ https://wikitech.wikimedia.org | wikitech wiki ]... [17:55:14] (03PS1) 10Reedy: Add resources.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/369971 (https://phabricator.wikimedia.org/T172417) [17:55:21] (03CR) 10Jdlrobson: [C: 04-1] pagePreviews: Deploy to first 50 of stage 1 wikis (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [17:55:22] !log restarting rabbitmq-server on labcontrol1001 [17:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:44] 10Operations, 10Wikimania-Hackathon-2017-Organization, 10Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3498043 (10Dzahn) @Antoine2711 Please go to [[ https://wikitech.wikimedia.org | wikitech wiki ]... 
[17:59:21] (03PS6) 10Phuedx: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) [17:59:35] (03PS1) 10Rush: openstack: Jessie has cannot fulfill py3 clients without backports [puppet] - 10https://gerrit.wikimedia.org/r/369972 (https://phabricator.wikimedia.org/T171494) [18:00:02] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#2684468 (10Bawolff) perhaps in the error page, the "use Firefox!" should be directly linked to the firefox 52 esr download page. The... [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T1800). [18:00:04] phuedx: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:34] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [18:00:53] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-novaclient] [18:01:08] 10Operations, 10Domains, 10Traffic, 10Wikimedia Resource Center, 10Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3498059 (10Harej) Note that this is pending final c-level approval. [18:02:07] I will be here in place for phuedx today :) [18:02:11] who's swatting? 
[18:02:14] doh, i forgot to put a patch on swat but i have something :) will edit [18:02:17] i'm also watching and testing [18:02:26] (03CR) 10Rush: [C: 032] openstack: Jessie has cannot fulfill py3 clients without backports [puppet] - 10https://gerrit.wikimedia.org/r/369972 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:02:28] (03PS7) 10Phuedx: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) [18:02:55] I can swat [18:03:35] jdlrobson: "tidy up" first? [18:03:38] !log Updated Parsoid to 651f12c2 (T119802, T73386, T170289) [18:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:51] T170289: VisualEditor/Parsoid unable to parse template containing external link with double brackets - https://phabricator.wikimedia.org/T170289 [18:03:51] T119802: Incorrect params list in transclusion - https://phabricator.wikimedia.org/T119802 [18:03:51] T73386: the Sanitizer allows only ASCII and a some punctuation in extension tag attributes - https://phabricator.wikimedia.org/T73386 [18:03:54] RainbowSprinkles: yes please tidy up should be a noop [18:04:02] (03CR) 10Chad: [C: 032] pagePreviews: Tidy up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [18:04:07] we're just making some final adjustments to the follow up [18:04:53] (03CR) 10Jdlrobson: [C: 031] pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [18:04:54] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:05:27] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Get translations for "IE8 on XP won't work" for page - 
https://phabricator.wikimedia.org/T172418#3498116 (10Whatamidoing-WMF) [18:05:31] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [18:06:30] (03Merged) 10jenkins-bot: pagePreviews: Tidy up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [18:06:40] (03CR) 10jenkins-bot: pagePreviews: Tidy up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [18:08:05] (03PS8) 10Phuedx: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) [18:09:57] !log mobrovac@tin Started deploy [restbase/deploy@65af18d] (staging): (no justification provided) [18:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:34] (03PS1) 10Andrew Bogott: labs puppetmaster: allow designate to ssh in on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/369974 [18:11:55] !log mobrovac@tin Finished deploy [restbase/deploy@65af18d] (staging): (no justification provided) (duration: 01m 57s) [18:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:14] (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster: allow designate to ssh in on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/369974 (owner: 10Andrew Bogott) [18:12:42] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [18:13:15] 10Operations, 10Wikimania-Hackathon-2017-Organization, 10Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3498151 (10Antoine2711) Hi @Dzahn, my WikiTech account is 
the same as my Wiki account: Antoine27... [18:14:02] !log demon@tin Synchronized dblists/pp_stage0.dblist: clean up config (duration: 00m 47s) [18:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:56] (03PS2) 10Andrew Bogott: labs puppetmaster: allow designate to ssh in on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/369974 [18:15:05] RainbowSprinkles: both patches are good to go btw [18:15:07] !log mobrovac@tin Started deploy [restbase/deploy@65af18d]: Expose the recommendation API publicly and activate hiwikiversity - T170877 T168765 [18:15:11] lemme know when to test the first :) [18:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:18] T170877: Recommendation API public end points - https://phabricator.wikimedia.org/T170877 [18:15:18] T168765: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765 [18:15:36] !log demon@tin Synchronized docroot/noc/conf/pp_stage0.dblist: tidy up pp config (duration: 00m 46s) [18:15:44] jdlrobson: Ok so for first one, dblist (and its symlink from docroot) are live. [18:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:54] (no point in pulling to mwdebug first, waste of time) [18:16:12] (03CR) 10Andrew Bogott: [C: 032] labs puppetmaster: allow designate to ssh in on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/369974 (owner: 10Andrew Bogott) [18:16:22] CommonSettings also trivial, just syncing [18:16:38] mobrovac: Thanks! :) [18:16:45] np Reedy :) [18:17:44] !log demon@tin Synchronized wmf-config/CommonSettings.php: Add new dblist for page preview stuff, basically no-op (duration: 00m 47s) [18:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:56] RainbowSprinkles: looks good! can you please stage the next one on mwdebug [18:18:06] InitialiseSettings? 
Yes, I was going to [18:18:10] sweet [18:18:15] ready when you are [18:18:40] pulled to mwdebug1001 [18:20:02] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:21:44] (03PS1) 10Andrew Bogott: Revert "labs puppetmaster: allow designate to ssh in on ipv6" [puppet] - 10https://gerrit.wikimedia.org/r/369975 [18:23:03] RainbowSprinkles:not seeing https://gerrit.wikimedia.org/r/#/c/369960 on mwdebug1001 [18:23:06] are you sure it's there? [18:23:15] Oooohhhhh [18:23:20] I meant the last file from the parent change [18:23:24] (since it seemed riskier) [18:23:35] (03CR) 10Andrew Bogott: [C: 032] Revert "labs puppetmaster: allow designate to ssh in on ipv6" [puppet] - 10https://gerrit.wikimedia.org/r/369975 (owner: 10Andrew Bogott) [18:23:40] !log mobrovac@tin Finished deploy [restbase/deploy@65af18d]: Expose the recommendation API publicly and activate hiwikiversity - T170877 T168765 (duration: 08m 33s) [18:23:42] https://gerrit.wikimedia.org/r/#/c/351287/4/wmf-config/InitialiseSettings.php [18:23:43] That ^ [18:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:53] T170877: Recommendation API public end points - https://phabricator.wikimedia.org/T170877 [18:23:53] T168765: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765 [18:24:57] RainbowSprinkles: so which patch was "Ok so for first one, dblist (and its symlink from docroot) are live." referring to? [18:25:02] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:26:07] RainbowSprinkles: okay so you put half of https://gerrit.wikimedia.org/r/#/c/351287/4 live and the other half on mwdebug100? 
[18:27:14] RainbowSprinkles: if so then we are good and can sync to production and move on to https://gerrit.wikimedia.org/r/#/c/369960 [18:27:56] Ok [18:29:05] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: last part of gerrit 351287 (duration: 00m 47s) [18:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:52] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [18:30:52] (03Abandoned) 10Phuedx: pagePreviews: Create pp_stage0.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351286 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [18:32:14] RainbowSprinkles: ready for https://gerrit.wikimedia.org/r/#/c/369960 :) [18:32:41] Yep! [18:32:44] E_TOOMANYTABS [18:32:47] Getting lost [18:32:49] In my own work hah [18:33:09] (03CR) 10Chad: [C: 032] pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [18:36:09] (03PS1) 10Andrew Bogott: labs puppetmaster: further attempt to allow ipv6 access to labservices [puppet] - 10https://gerrit.wikimedia.org/r/369978 [18:36:27] (03Merged) 10jenkins-bot: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [18:36:37] (03CR) 10jenkins-bot: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369960 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [18:38:13] (03CR) 10Dzahn: "i think you typically want 2 x "resolve", one for v4 and one for v6. example we have these "(@resolve((${prometheus_ferm_nodes})) @resolve" [puppet] - 10https://gerrit.wikimedia.org/r/369978 (owner: 10Andrew Bogott) [18:39:26] RainbowSprinkles: on mwdebug1001 ? 
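Note on the review comment at 18:38:13 (two "resolve" calls, one for v4 and one for v6) and the follow-up at 18:49:42 that the box has no IPv6 address: a minimal Python sketch of why a firewall rule built from only one address family can end up matching nothing. The function name and the addresses (taken from the documentation ranges) are illustrative assumptions, not anything from the puppet repository.

```python
import ipaddress


def allowed_sources(addresses, want_v4=True, want_v6=True):
    """Return the addresses a rule would allow, filtered by address family.

    `addresses` stands in for whatever a name resolves to; splitting by
    family mimics resolving once for A records and once for AAAA records.
    """
    allowed = []
    for text in addresses:
        ip = ipaddress.ip_address(text)
        if (ip.version == 4 and want_v4) or (ip.version == 6 and want_v6):
            allowed.append(str(ip))
    return allowed


# Hypothetical host that only has an IPv4 address.
v4_only_host = ["192.0.2.10"]
# Hypothetical dual-stack host.
dual_stack_host = ["192.0.2.10", "2001:db8::10"]

# A v6-only rule allows nothing for the v4-only host.
print(allowed_sources(v4_only_host, want_v4=False))
# Resolving for both families covers either kind of host.
print(allowed_sources(v4_only_host))
print(allowed_sources(dual_stack_host))
```

Asking for both families is the safe default here: it keeps working whether or not the source host ever gains an IPv6 address.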
[18:39:51] !log demon@tin Synchronized dblists/: gerrit 369960 (duration: 00m 47s) [18:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:21] (03PS1) 10Madhuvishy: nfs-exportd: Downgrade to python version 2 [puppet] - 10https://gerrit.wikimedia.org/r/369980 [18:40:32] jdlrobson: Putting the no-op dblist parts live everywhere first, then yes [18:41:08] !log demon@tin Synchronized docroot/: gerrit 369960 (duration: 00m 47s) [18:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:28] (03CR) 10Dzahn: "modules/profile/manifests/puppetmaster/frontend.pp:" [dns] - 10https://gerrit.wikimedia.org/r/369732 (owner: 10Dzahn) [18:41:48] (03CR) 10Dzahn: "eh, comment on wrong patch, ignore" [dns] - 10https://gerrit.wikimedia.org/r/369732 (owner: 10Dzahn) [18:42:05] (03PS1) 10Andrew Bogott: designate: reach puppetmaster via ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/369981 [18:42:07] (03Abandoned) 10Andrew Bogott: labs puppetmaster: further attempt to allow ipv6 access to labservices [puppet] - 10https://gerrit.wikimedia.org/r/369978 (owner: 10Andrew Bogott) [18:42:12] PROBLEM - dhclient process on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:13] PROBLEM - salt-minion processes on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:22] (03CR) 10Dzahn: "or like here:" [puppet] - 10https://gerrit.wikimedia.org/r/369978 (owner: 10Andrew Bogott) [18:42:32] PROBLEM - nutcracker process on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:42:57] jdlrobson: Ok, full change is live on mwdebug1001 now [18:43:02] RECOVERY - dhclient process on thumbor1003 is OK: PROCS OK: 0 processes with command name dhclient [18:43:12] RECOVERY - salt-minion processes on thumbor1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:43:22] RECOVERY - nutcracker process on thumbor1003 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [18:44:09] (03CR) 10Andrew Bogott: [C: 032] designate: reach puppetmaster via ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/369981 (owner: 10Andrew Bogott) [18:45:05] (03CR) 10Bearloga: contint: profile, role, and packages for R language (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: 10Hashar) [18:46:03] !log arlolra@tin Started deploy [parsoid/deploy@48be65b]: Updating Parsoid to 6e1a20d5 [18:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:47] jdlrobson: We good? [18:49:02] RainbowSprinkles: just finishing up testing :) [18:49:07] Okie dokie [18:49:15] andrewbogott: your first change should have worked as well, only that it would have relied on IPv6 and not allow v4 as well [18:49:42] mutante: I think I had both? But in any case it was moot since that box doesn't have an ipv6 address [18:49:54] oh, heh, ok [18:50:02] RainbowSprinkles: you can sync now! 
all good [18:51:02] (03PS2) 10Dzahn: Gerrit: Add supplemental key to authorizedkeys [puppet] - 10https://gerrit.wikimedia.org/r/369958 (owner: 10Thcipriani) [18:51:05] CommonSettings going live first....now [18:51:24] !log demon@tin Synchronized wmf-config/CommonSettings.php: new dblist for gerrit 369960 (duration: 00m 46s) [18:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:36] (03CR) 10BryanDavis: [C: 031] nfs-exportd: Downgrade to python version 2 [puppet] - 10https://gerrit.wikimedia.org/r/369980 (owner: 10Madhuvishy) [18:51:38] (03PS6) 10Chad: CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [18:51:48] RainbowSprinkles: w00t [18:52:13] (03CR) 10Dzahn: [C: 032] Gerrit: Add supplemental key to authorizedkeys [puppet] - 10https://gerrit.wikimedia.org/r/369958 (owner: 10Thcipriani) [18:52:27] !log arlolra@tin Finished deploy [parsoid/deploy@48be65b]: Updating Parsoid to 6e1a20d5 (duration: 06m 25s) [18:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:06] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Final gerrit 369960: "pagePreviews: Deploy to first 50 of stage 1 wikis" (duration: 00m 46s) [18:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:15] jdlrobson: Live errywhur [18:53:23] ebernhardson: There's only 7 minutes left in swat, is your change safe enough to do in a short time? [18:53:32] (diff is big, but I dunno the implications, could be basically no-op) [18:53:41] (03CR) 10Dzahn: "eh.. 
" Invalid parameter ssh_authorized_keys_file on Class[Ssh::Server]" :/ but i checked and this same thing is already used in multiple" [puppet] - 10https://gerrit.wikimedia.org/r/369958 (owner: 10Thcipriani) [18:54:32] (03CR) 10Dzahn: "without the leading "ssh_"" [puppet] - 10https://gerrit.wikimedia.org/r/369958 (owner: 10Thcipriani) [18:55:31] thanks RainbowSprinkles :) [18:55:43] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:55:52] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:56:13] jdlrobson: You're very welcome. Sorry that took so long, was trying to be careful :) [18:56:27] mutante, thcipriani: That puppet failure on you guys? ^ [18:56:53] yes [18:57:06] Ok, just checking before I hopped on to debug :) [18:57:10] don't [18:57:14] it will be gone soon [18:57:19] (03PS1) 10Dzahn: gerrit: fix authorized_keys_file param name [puppet] - 10https://gerrit.wikimedia.org/r/369983 [18:57:37] mutante: I won't ;-) [18:57:57] (03CR) 10Dzahn: [C: 032] gerrit: fix authorized_keys_file param name [puppet] - 10https://gerrit.wikimedia.org/r/369983 (owner: 10Dzahn) [18:58:22] ebernhardson: Sorry, swat window is over, I don't want to rush it in < 3 mins at this point. Plz reschedule for 4pm window or w/e you want [18:58:31] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/369983/" [puppet] - 10https://gerrit.wikimedia.org/r/369958 (owner: 10Thcipriani) [18:59:33] !log Updated Parsoid to 6e1a20d5 (T168765, T155038, T165977) [18:59:35] mutante: whoops. 
Sorry about that, didn't get a chance to run puppet compiler yet :( [18:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:47] T165977: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977 [18:59:47] T168765: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765 [18:59:47] T155038: Create Wikimania 2018 wiki - https://phabricator.wikimedia.org/T155038 [18:59:52] twentyafterfour: Swat over [19:00:02] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [19:00:03] * RainbowSprinkles hands off the conch [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T1900). Please do the needful. [19:00:25] thcipriani: no worries, it probably wouldn't have caught it and thanks for the fix, didn't know about that [19:00:40] (03PS1) 10Andrew Bogott: labs puppetmaster: add certmanager creds [puppet] - 10https://gerrit.wikimedia.org/r/369984 [19:00:45] sshd_config is now adjusted on gerrit servers [19:00:52] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [19:00:54] i guess you can give scap deploy another try [19:01:29] (03CR) 10Andrew Bogott: [C: 032] labs puppetmaster: add certmanager creds [puppet] - 10https://gerrit.wikimedia.org/r/369984 (owner: 10Andrew Bogott) [19:02:10] I'll still bet we get bit by the too many many authentication failures thing T172333 I have a patch in scap for it, too. 
[19:02:10] T172333: Scap: keyholder Too many authentication failures - https://phabricator.wikimedia.org/T172333 [19:07:56] *nod* ok, yep [19:08:38] (03PS1) 10Rush: openstack: revert shenkengen to py2 libs [puppet] - 10https://gerrit.wikimedia.org/r/369987 (https://phabricator.wikimedia.org/T169099) [19:08:41] !log deploying 1.30.0-wmf.12 to group1, will proceed to group2 after verifying the branch is stable. [19:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:37] (03PS1) 1020after4: group1 wikis to 1.30.0-wmf.12 refs T168053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369989 [19:09:39] (03CR) 1020after4: [C: 032] group1 wikis to 1.30.0-wmf.12 refs T168053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369989 (owner: 1020after4) [19:11:11] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.12 refs T168053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369989 (owner: 1020after4) [19:12:16] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.12 refs T168053 [19:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:31] T168053: 1.30.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T168053 [19:13:52] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:15:13] (03PS1) 10Rush: openstack: toolschecker requires openstack clientlibs [puppet] - 10https://gerrit.wikimedia.org/r/369990 (https://phabricator.wikimedia.org/T171494) [19:15:54] aude: looks like the hotfix for T172320 didn't work :( [19:15:54] T172320: Error in Wikibase/client/includes/Changes/InjectRCRecordsJob.php line 120: Bad value for parameter $params: $params['change'] not set. - https://phabricator.wikimedia.org/T172320 [19:20:32] RainbowSprinkles: doh, i was distracted :P oh well ... 
[19:20:55] looks like train might not roll forward anyways, so it wouldn't matter [19:21:40] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.12/extensions/Wikidata/extensions/Wikibase/client/includes/Changes: be sure https://gerrit.wikimedia.org/r/#/c/369847/ is sync'd (duration: 00m 46s) [19:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:23] 10Operations, 10Interactive-Sprint, 10Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3498456 (10debt) This is blocked right now, as we don't have a test cluster in which to test this out on right now. [19:25:31] thcipriani does your puppet patch fix gerrit2 ssh for scap deploy? [19:26:33] 10Operations, 10Interactive-Sprint, 10Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3498513 (10Gehel) Documentation to build karthoterian / tilerator: https://wikitech.wikimedia.org/wiki/Maps#Building_Kartotherian_and_Tilerator [19:29:31] (03PS6) 10ArielGlenn: write dump output files to temporary location, move in place when done [dumps] - 10https://gerrit.wikimedia.org/r/368744 (https://phabricator.wikimedia.org/T169849) [19:30:32] (03CR) 10Rush: [C: 032] openstack: toolschecker requires openstack clientlibs [puppet] - 10https://gerrit.wikimedia.org/r/369990 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [19:33:07] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.12 refs T168053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369989 (owner: 1020after4) [19:33:48] wikibugs is slow today [19:33:51] and out of order [19:33:53] heh [19:34:37] it's a flood protection feature [19:36:18] twentyafterfour i wonder if it is related to the slowness of bastion hosts for cloud services last night [19:36:55] (03PS1) 10Dzahn: simplelamp/simplelap: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/369993 [19:37:10] bugs in the bug reporting 
bug should be reported by the meta-bug-reporting-bug [19:37:36] paladox: ^ there, simplelamp with stretch support.... [19:37:36] (CR) Paladox: simplelamp/simplelap: add support for stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/369993 (owner: Dzahn) [19:37:51] mutante heh :) [19:37:53] s/bug/bot [19:38:24] eh, how did you do the "mark code section in yellow" [19:38:40] mutante you just select the piece you want [19:38:58] ok, i think i will just fix "simplelap" then [19:39:03] and not "simplelamp" [19:39:05] ok [19:39:52] well, actually.. i don't make anything worse.. it's just fixing ONE thing and not all [19:41:13] (CR) Dzahn: simplelamp/simplelap: add support for stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/369993 (owner: Dzahn) [19:42:46] (PS2) Dzahn: simplelamp/simplelap: add support for stretch [puppet] - https://gerrit.wikimedia.org/r/369993 [19:43:12] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [19:43:20] Wikimedia-wikibugs-IRC-bot (backlog): the bug reporting bug has a bug.
https://phabricator.wikimedia.org/T40728 [19:43:46] lol [19:44:37] also, I think that the bot reporting bug has a bot [19:44:39] (03PS3) 10Dzahn: simplelamp/simplelap: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/369993 [19:45:58] (03CR) 10Dzahn: [C: 032] simplelamp/simplelap: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/369993 (owner: 10Dzahn) [19:47:50] (03PS1) 10Ottomata: Use newer guava jar in druid hdfs cdh storage extension [puppet] - 10https://gerrit.wikimedia.org/r/369997 (https://phabricator.wikimedia.org/T170590) [19:48:30] (03CR) 10jerkins-bot: [V: 04-1] Use newer guava jar in druid hdfs cdh storage extension [puppet] - 10https://gerrit.wikimedia.org/r/369997 (https://phabricator.wikimedia.org/T170590) (owner: 10Ottomata) [19:49:25] (03CR) 10Ottomata: [V: 032 C: 032] Use newer guava jar in druid hdfs cdh storage extension [puppet] - 10https://gerrit.wikimedia.org/r/369997 (https://phabricator.wikimedia.org/T170590) (owner: 10Ottomata) [19:49:39] (03PS2) 10Ottomata: Use newer guava jar in druid hdfs cdh storage extension [puppet] - 10https://gerrit.wikimedia.org/r/369997 (https://phabricator.wikimedia.org/T170590) [19:49:41] (03CR) 10Ottomata: [V: 032 C: 032] Use newer guava jar in druid hdfs cdh storage extension [puppet] - 10https://gerrit.wikimedia.org/r/369997 (https://phabricator.wikimedia.org/T170590) (owner: 10Ottomata) [19:50:33] (03PS1) 10Ottomata: Improve debian/README.Debian [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/369998 [19:50:56] (03CR) 10Ottomata: [C: 032] Improve debian/README.Debian [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/369998 (owner: 10Ottomata) [19:51:15] !log milimetric@tin Started deploy [analytics/refinery@cc40bf2]: Fix sqoop script [19:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:32] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - 
https://phabricator.wikimedia.org/T145742#2639607 (10brion) Per email consultation between me & @MoritzMuehlenhoff we're thinking we should go ahead and deprecate the Ogg Theora video output (us... [19:59:40] (03PS1) 10Madhuvishy: shinkengen: Downgrade to python version 2 [puppet] - 10https://gerrit.wikimedia.org/r/370001 [20:01:02] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:05:04] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3498746 (10Krinkle) [20:06:34] 10Operations, 10Wikimania-Hackathon-2017-Organization, 10Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3498769 (10Dzahn) fixed simplelap puppet role to support stretch so we can put Apache/PHP on Deb... [20:06:39] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#2639607 (10brion) [20:11:49] (03PS1) 10Mobrovac: [WIP] JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 [20:13:16] (03CR) 10jerkins-bot: [V: 04-1] [WIP] JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [20:23:19] (03PS1) 1020after4: All wikis except wikidata to wmf.12, wikidata to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370007 (https://phabricator.wikimedia.org/T168053) [20:23:26] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#2684468 (10Quiddity) >>! In T147199#3498052, @Bawolff wrote: > perhaps in the error page, the "use Firefox!" should be directly link... 
[20:24:10] 10Operations, 10Wikimania-Hackathon-2017-Organization, 10Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3498846 (10Dzahn) @Antoine2711 I created a new VM (instance) in the existing project called "w... [20:24:31] (03CR) 1020after4: "going ahead with this to get the train unblocked." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370007 (https://phabricator.wikimedia.org/T168053) (owner: 1020after4) [20:24:32] !log milimetric@tin Finished deploy [analytics/refinery@cc40bf2]: Fix sqoop script (duration: 33m 17s) [20:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:22] (03CR) 1020after4: [C: 032] All wikis except wikidata to wmf.12, wikidata to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370007 (https://phabricator.wikimedia.org/T168053) (owner: 1020after4) [20:27:47] (03Merged) 10jenkins-bot: All wikis except wikidata to wmf.12, wikidata to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370007 (https://phabricator.wikimedia.org/T168053) (owner: 1020after4) [20:27:57] (03CR) 10jenkins-bot: All wikis except wikidata to wmf.12, wikidata to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370007 (https://phabricator.wikimedia.org/T168053) (owner: 1020after4) [20:28:48] !log milimetric@tin Started deploy [analytics/refinery@cc40bf2]: Fix sqoop script with updated scap config [20:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:37] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis except wikidata to 1.30.0-wmf.12, leaving wikidata behind due to T172320 [20:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:48] T172320: Error in Wikibase/client/includes/Changes/InjectRCRecordsJob.php line 120: Bad value for parameter $params: $params['change'] 
not set. - https://phabricator.wikimedia.org/T172320 [20:30:22] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:34:51] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3498884 (10Andrew) [20:40:40] !log milimetric@tin Finished deploy [analytics/refinery@cc40bf2]: Fix sqoop script with updated scap config (duration: 11m 51s) [20:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [20:44:57] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3498895 (10RobH) a:05RobH>03Cmjohnson So wdqs1004 shows a link, and I'll proceed to install it. I can login to mgmt on wdqs1005, but its actual netwo... [20:48:06] (03PS1) 10Herron: Change wikipedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/370040 (https://phabricator.wikimedia.org/T170891) [20:51:57] (03PS1) 10RobH: wdqs100[45] install params [puppet] - 10https://gerrit.wikimedia.org/r/370050 (https://phabricator.wikimedia.org/T171210) [20:53:26] (03CR) 10RobH: [C: 032] wdqs100[45] install params [puppet] - 10https://gerrit.wikimedia.org/r/370050 (https://phabricator.wikimedia.org/T171210) (owner: 10RobH) [21:00:04] kaldari and MaxSem: Respected human, time to deploy CodeMirror deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T2100). Please do the needful. 
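Note on the CirrusSearch latency alert at 20:43:52 ("40.00% of data above the critical threshold [1000.0]"): a minimal Python sketch of a percentage-over-threshold check of that general shape. The 500.0 and 1000.0 thresholds and the 20% limit are read off the alert and recovery messages in this log; the function names, the treatment of missing datapoints, and the exact comparison operators are assumptions, not the monitoring plugin's actual code.

```python
def percent_above(datapoints, threshold):
    """Percentage of non-missing datapoints strictly above `threshold`."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


def classify(datapoints, warning=500.0, critical=1000.0, percent_limit=20.0):
    """Return "CRITICAL", "WARNING" or "OK" for a series of latency samples."""
    if percent_above(datapoints, critical) >= percent_limit:
        return "CRITICAL"
    if percent_above(datapoints, warning) >= percent_limit:
        return "WARNING"
    return "OK"


# Two of five samples exceed 1000.0, i.e. 40% of the data.
samples = [1200.0, 1100.0, 300.0, 200.0, 100.0]
print(percent_above(samples, 1000.0), classify(samples))
```

Alerting on the share of samples over a threshold, rather than on any single sample, keeps one slow datapoint from paging anyone while still catching a sustained slowdown.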
[21:00:22] Operations, Cloud-Services, Cloud-VPS, cloud-services-team (Kanban): Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3498932 (Andrew) [21:00:25] Operations, Cloud-VPS: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3498931 (Andrew) Open>Resolved [21:01:33] (PS1) Ppchelko: JobQueueEventBus: Enable on group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) [21:02:14] (CR) Ppchelko: "To go out on Monday Aug 07" [mediawiki-config] - https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) (owner: Ppchelko) [21:03:02] (CR) jerkins-bot: [V: -1] JobQueueEventBus: Enable on group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) (owner: Ppchelko) [21:04:10] !log copper /var/cache/pbuilder 95% full - grew /dev/copper-vg/pbuilder fs by +5G and tune2fs -m 0.
now at 85% full [21:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:07] (03PS2) 10Ppchelko: JobQueueEventBus: Enable on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) [21:08:41] (03PS2) 10Madhuvishy: nfs-exportd: Downgrade to python version 2 [puppet] - 10https://gerrit.wikimedia.org/r/369980 [21:10:24] (03CR) 10Madhuvishy: [C: 032] nfs-exportd: Downgrade to python version 2 [puppet] - 10https://gerrit.wikimedia.org/r/369980 (owner: 10Madhuvishy) [21:11:04] (03PS1) 10MaxSem: Enable CodeMirror everywhere but RTL wikis and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370072 (https://phabricator.wikimedia.org/T170966) [21:12:11] (03CR) 10Kaldari: [C: 031] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370072 (https://phabricator.wikimedia.org/T170966) (owner: 10MaxSem) [21:12:40] 10Operations, 10Traffic: Non zero rated LVS IPs - https://phabricator.wikimedia.org/T170518#3498973 (10ayounsi) [21:14:02] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [21:19:24] @seen hashar [21:19:24] mutante: Last time I saw hashar they were quitting the network with reason: Quit: Textual IRC Client: www.textualapp.com N/A at 8/3/2017 3:34:39 PM (5h44m44s ago) [21:20:13] (03PS2) 10Madhuvishy: shinkengen: Downgrade to python version 2 [puppet] - 10https://gerrit.wikimedia.org/r/370001 [21:20:19] !log maxsem@tin Synchronized php-1.30.0-wmf.12/extensions/CodeMirror: https://gerrit.wikimedia.org/r/#/c/370066/ https://gerrit.wikimedia.org/r/#/c/370074/ (duration: 00m 47s) [21:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:53] (03CR) 10Madhuvishy: [C: 032] openstack: revert shenkengen to py2 libs [puppet] - 10https://gerrit.wikimedia.org/r/369987 (https://phabricator.wikimedia.org/T169099) (owner: 10Rush) [21:21:02] (03PS2) 
10Madhuvishy: openstack: revert shenkengen to py2 libs [puppet] - 10https://gerrit.wikimedia.org/r/369987 (https://phabricator.wikimedia.org/T169099) (owner: 10Rush) [21:21:07] (03CR) 10Madhuvishy: [V: 032 C: 032] openstack: revert shenkengen to py2 libs [puppet] - 10https://gerrit.wikimedia.org/r/369987 (https://phabricator.wikimedia.org/T169099) (owner: 10Rush) [21:22:32] (03PS3) 10Madhuvishy: shinkengen: Downgrade to python version 2 [puppet] - 10https://gerrit.wikimedia.org/r/370001 [21:23:02] (03CR) 10MaxSem: [C: 032] Enable CodeMirror everywhere but RTL wikis and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370072 (https://phabricator.wikimedia.org/T170966) (owner: 10MaxSem) [21:23:25] (03CR) 10Madhuvishy: [C: 032] shinkengen: Downgrade to python version 2 [puppet] - 10https://gerrit.wikimedia.org/r/370001 (owner: 10Madhuvishy) [21:23:42] mutante: https://tools.wmflabs.org/openstack-browser/project/integration [21:23:54] https://tools.wmflabs.org/nagf/?project=integration [21:24:35] (03Merged) 10jenkins-bot: Enable CodeMirror everywhere but RTL wikis and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370072 (https://phabricator.wikimedia.org/T170966) (owner: 10MaxSem) [21:27:28] paladox: thanks [21:28:06] your welcome :) [21:30:40] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: Enable CodeMirror https://gerrit.wikimedia.org/r/#/c/370072/ (duration: 00m 47s) [21:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:55] !log clear out nova-fullstack project so we can monitor fresh on new puppetmaster [21:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:30] (03PS1) 10Madhuvishy: shinkengen: Update package declaration to require_package to avoid duplicate decl. 
[puppet] - 10https://gerrit.wikimedia.org/r/370090 [21:32:42] (03CR) 10jenkins-bot: Enable CodeMirror everywhere but RTL wikis and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370072 (https://phabricator.wikimedia.org/T170966) (owner: 10MaxSem) [21:33:10] (03CR) 10jerkins-bot: [V: 04-1] shinkengen: Update package declaration to require_package to avoid duplicate decl. [puppet] - 10https://gerrit.wikimedia.org/r/370090 (owner: 10Madhuvishy) [21:34:12] (03PS1) 10Rush: openstack: clean up openstack::repo [puppet] - 10https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494) [21:34:14] (03PS2) 10Madhuvishy: shinkengen: Update package declaration to require_package [puppet] - 10https://gerrit.wikimedia.org/r/370090 [21:36:22] (03PS3) 10Madhuvishy: shinkengen: Update package declaration to require_package [puppet] - 10https://gerrit.wikimedia.org/r/370090 [21:37:27] (03PS1) 10Ayounsi: Reserve non zero rated IP ranges [dns] - 10https://gerrit.wikimedia.org/r/370094 (https://phabricator.wikimedia.org/T170518) [21:37:50] (03CR) 10Madhuvishy: [C: 032] shinkengen: Update package declaration to require_package [puppet] - 10https://gerrit.wikimedia.org/r/370090 (owner: 10Madhuvishy) [21:38:59] jouncebot: reload [21:39:07] jouncebot: refresh [21:39:07] (03PS4) 10Dzahn: contint: PHP packages cleanup [puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [21:39:10] I refreshed my knowledge about deployments. 
[21:40:13] (03PS2) 10Rush: openstack: clean up openstack::repo [puppet] - 10https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494) [21:43:01] (03PS1) 10MaxSem: Remove code path from labs settings that now repeats prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370096 [21:45:31] (03PS1) 10Eevans: WIP: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) [21:47:12] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:47:23] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:47:33] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:47:33] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:47:42] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:47:43] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:47:43] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:47:52] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:47:57] (03CR) 10Dzahn: "re "Update few packages to use the php7.0 prefix."" [puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [21:48:29] (03CR) 10MaxSem: [C: 032] Remove code path from labs settings that now repeats prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370096 (owner: 10MaxSem) [21:50:05] (03Merged) 10jenkins-bot: Remove code path from labs settings that now repeats prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370096 (owner: 10MaxSem) [21:50:15] (03CR) 10jenkins-bot: Remove code path from labs settings that now repeats prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370096 (owner: 10MaxSem) [21:51:12] (03CR) 10Dzahn: "for example, php7.0-xml is 
NOT virtual, you can do "apt-cache show" on it. but php7.0-imagick, php7.0-redis, php7.0-memcached are all vir" [puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [21:51:41] !log maxsem@tin Synchronized wmf-config/CommonSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/370096/ (duration: 00m 47s) [21:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:05] (03PS2) 10Ayounsi: Reserve non zero rated IPs and ranges [dns] - 10https://gerrit.wikimedia.org/r/370094 (https://phabricator.wikimedia.org/T170518) [21:52:07] (03CR) 10Dzahn: "not saying this is -1 because of that, more asking myself because i remember we talked about virtual packages before but unsure what the c" [puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [21:54:46] 10Operations, 10monitoring: broken dependencies for python-snimpy on jessie - https://phabricator.wikimedia.org/T172330#3499076 (10ayounsi) 05Open>03Resolved a:03MoritzMuehlenhoff Thanks, tested manually and it seems to work. Also python3-snimpy doesn't have this issue. 
[21:57:42] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:03:32] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [22:03:42] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:03:42] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [22:03:42] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [22:03:52] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [22:03:52] RECOVERY - DPKG on stat1005 is OK: All packages OK [22:03:52] RECOVERY - Disk space on stat1005 is OK: DISK OK [22:04:13] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [22:05:32] (03PS3) 10Dzahn: cache::misc/phabricator: switch from iridium to phab1001 backend [puppet] - 10https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938) [22:08:39] (03PS1) 10Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/370103 [22:08:56] (03CR) 10Paladox: [C: 031] cache::misc/phabricator: switch from iridium to phab1001 backend [puppet] - 10https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [22:09:19] (03CR) 10jerkins-bot: [V: 04-1] Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/370103 (owner: 10Ayounsi) [22:11:00] (03PS1) 10Dzahn: phab: phabricator-new to phab2001, phab1001 using normal domain [puppet] - 10https://gerrit.wikimedia.org/r/370104 (https://phabricator.wikimedia.org/T163938) [22:17:54] (03PS2) 10Eevans: WIP: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) [22:19:51] (03CR) 10Paladox: 
[C: 031] phab: phabricator-new to phab2001, phab1001 using normal domain [puppet] - 10https://gerrit.wikimedia.org/r/370104 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [22:20:10] (03PS3) 10Eevans: WIP: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) [22:24:07] looks like we have some replication problems? I'm getting the notice that changes newer than 33 seconds are not displayed, now for the fifth time [22:25:39] just commons? [22:25:58] no, dewiki (s5) [22:26:26] only s4 is lagging https://dbtree.wikimedia.org/ [22:26:34] And that's only the other side of codfw [22:26:50] hm. when I take a look at https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1&var-dc=eqiad%20prometheus%2Fops I see a lot of spikes [22:26:53] at s1 for example too [22:27:42] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Thu 2017-08-03 22:27:35 UTC. 
[22:28:24] Doesn't look so out of ordinary if you zoom out [22:29:18] (03PS1) 10Dzahn: cache::misc/graphite: rename director, don't send cross-dc traffic [puppet] - 10https://gerrit.wikimedia.org/r/370107 [22:30:11] hm, ok [22:30:59] (03PS2) 10Dzahn: cache::misc/graphite: rename director, don't send cross-dc traffic [puppet] - 10https://gerrit.wikimedia.org/r/370107 [22:34:58] (03PS7) 10EBernhardson: CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) [22:37:14] (03CR) 10Dzahn: ""restbase1016.eqiad.wmnet is decommissioned" just means "it's taken out of the pool / receives no traffic" and then it's reinstalled and i" [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170803T2300). Please do the needful. [23:00:04] RoanKattouw and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:02] \o [23:01:17] i see kaldari snuck one in at the last moment too :) [23:01:25] :) [23:01:27] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3499252 (10Bawolff) >>! In T147199#3498842, @Quiddity wrote: >>>! In T147199#3498052, @Bawolff wrote: >> perhaps in the error page,... 
[23:01:27] trying to [23:01:32] i suppose i can swat today [23:02:05] (03CR) 10EBernhardson: [C: 032] CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [23:03:31] (03Merged) 10jenkins-bot: CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [23:04:52] 10Operations, 10Traffic, 10netops: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3499267 (10ayounsi) [23:05:10] ebernhardson: Thanks for the +2, created cherry pick at https://gerrit.wikimedia.org/r/#/c/370113/ [23:05:45] (03CR) 10jenkins-bot: CirrusSearch configuration for LTR AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369594 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [23:07:45] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: CirrusSearch config for MLR test (step 1) (duration: 00m 47s) [23:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:41] kaldari: will be on mwdebug1001 in just a minute, after scap sync-file finishes [23:08:50] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: CirrusSearch config for MLR test (step 2) (duration: 00m 46s) [23:08:52] cool, ready to test [23:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:33] 10Operations, 10Cloud-Services: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402#3499313 (10ayounsi) Fyi, I opened a new task to discuss and perform a similar maintenance ( T172459 ). [23:10:14] kaldari: ok mwdebug1001 has it [23:10:48] looking [23:12:01] ebernhardson: seems to be OK [23:12:05] push ahead! 
[23:12:54] !log ebernhardson@tin Synchronized php-1.30.0-wmf.12/extensions/WikimediaEvents/extension.json: (no justification provided) (duration: 00m 47s) [23:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:59] RoanKattouw: around? [23:14:33] !log ebernhardson@tin Synchronized php-1.30.0-wmf.12/extensions/CodeMirror/resources/ext.CodeMirror.js: Only show popup if CodeMirror button exists (duration: 00m 46s) [23:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:53] kaldari: ^ [23:15:41] (03CR) 10Mobrovac: "Perhaps it would be easier to proceed/reason about this if we created a new, temporary role/profile for RB+cass3 at this early stage. That" [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [23:15:54] (03PS2) 10Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/370103 [23:17:35] (03CR) 10Mobrovac: "> "restbase1016.eqiad.wmnet is decommissioned" just means "it's taken out of the pool / receives no traffic" and then it's reinstalled and" [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [23:19:41] Krinkle: We can fix the load issue and do an emergency bug fix deployment tomorrow or we can just turn off the extension everywhere right now. Any preference? [23:20:33] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3499331 (10Dzahn) steps for migration: - stop phd on iridium - rsync /srv/repos - merge https://gerrit.wikimedia.org/r/369834 - merge https:/... 
[23:20:37] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3499332 (10BBlack) Even while FF 52 is still supported by Mozilla, it's unlikely that Mozilla's security efforts can actually preven... [23:21:00] kaldari: I'm inclined towards the latter. I don't like saying that, but it's because there are already 5 other regressions we're juggling right now and it'd help to rule out CodeMirror as a contributing factor. [23:22:18] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Performance-Team, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3499337 (10Krinkle) [23:22:39] Krinkle: No worries. We'll just turn it off for now. [23:23:13] (03PS1) 10MaxSem: Revert "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370117 (https://phabricator.wikimedia.org/T172458) [23:23:47] ebernhardson, are you still deploying? [23:24:06] MaxSem: not at the moment, but i have one more patch to go out. 
Looking into a potential bug with the backend config for this test [23:24:19] ebernhardson, then I'll push ^ [23:24:59] (03CR) 10MaxSem: [C: 032] Revert "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370117 (https://phabricator.wikimedia.org/T172458) (owner: 10MaxSem) [23:25:24] hi twentyafterfour [23:30:10] ebernhardson: whoops sorry forgot about my swat [23:30:23] I'm not at my computer so maybe I should just move it to another day [23:31:55] RoanKattouw: kk [23:32:16] (03PS2) 10MaxSem: Revert "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370117 (https://phabricator.wikimedia.org/T172458) [23:32:28] (03CR) 10MaxSem: Revert "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370117 (https://phabricator.wikimedia.org/T172458) (owner: 10MaxSem) [23:32:31] (03CR) 10MaxSem: [C: 032] Revert "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370117 (https://phabricator.wikimedia.org/T172458) (owner: 10MaxSem) [23:32:39] fuck you gerrit [23:33:08] (03PS1) 10Dzahn: phabricator: switch service IPs to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370119 (https://phabricator.wikimedia.org/T163938) [23:33:10] :) [23:33:51] (03PS2) 10Dzahn: phabricator: switch service IPs to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370119 (https://phabricator.wikimedia.org/T163938) [23:33:59] (03Merged) 10jenkins-bot: Revert "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370117 (https://phabricator.wikimedia.org/T172458) (owner: 10MaxSem) [23:34:29] (03CR) 10Mobrovac: [C: 04-1] JobQueueEventBus: Enable on group1 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) 
[23:35:16] (03CR) 10Paladox: phabricator: switch service IPs to phab1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370119 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [23:36:10] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/370117/2 (duration: 00m 47s) [23:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:31] (03CR) 10jenkins-bot: Revert "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370117 (https://phabricator.wikimedia.org/T172458) (owner: 10MaxSem) [23:37:47] (03PS3) 10Ppchelko: JobQueueEventBus: Enable on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) [23:37:54] mutante left a comment for you on ^^ :) [23:39:01] aude: what's up? [23:39:03] (03CR) 10Dzahn: phabricator: switch service IPs to phab1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370119 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [23:39:16] MaxSem: all done? [23:41:17] (03PS1) 10Dzahn: site/phabricator: remove phab role from iridium, make it a spare [puppet] - 10https://gerrit.wikimedia.org/r/370122 (https://phabricator.wikimedia.org/T163938) [23:42:18] (03CR) 10Paladox: "this should wait a couple of days before doing just to make sure there's no problems with phab1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/370122 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [23:42:51] (03PS3) 10Dzahn: phabricator: switch service IPs to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370119 (https://phabricator.wikimedia.org/T163938) [23:43:51] !log ebernhardson@tin Synchronized php-1.30.0-wmf.12/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: T171212: Turn on CirrusSearch MLR AB test (duration: 00m 46s) [23:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:02] T171212: Interleaved results A/B test: turn on - https://phabricator.wikimedia.org/T171212 [23:44:45] (03PS1) 10Dzahn: phabricator/dumps: remove iridium as allowed dumps host [puppet] - 10https://gerrit.wikimedia.org/r/370123 (https://phabricator.wikimedia.org/T163938) [23:45:01] twentyafterfour: so the bug still happens on wmf12? [23:45:13] aude: yes [23:45:43] ok [23:46:07] looking into it, if there is still an easy/hacky fix [23:46:10] ebernhardson, yup [23:46:12] I even re-sync'd the files for the hotfix, just in case some webservers didn't get the original deployment [23:46:19] but then maybe we dont want to deploy it on friday... maybe monday [23:47:01] aude: yeah monday probably is better, or just wait until next week's train if you want [23:47:30] yeah [23:47:39] thanks for handling this [23:47:46] no problem :) [23:51:39] jouncebot: next [23:51:39] In 0 hour(s) and 8 minute(s): Phabricator migration iridium -> phab1001 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170804T0000) [23:56:05] is it safe to assume phab will be completely down during migration? [23:56:27] Zppix: yes [23:56:38] ok