[00:22:03] (03CR) 10Dzahn: "thanks for merging! though the rebase actually changed the patch. it doesn't apply it on labtestservices2002/2003 anymore but instead on l" [puppet] - 10https://gerrit.wikimedia.org/r/365171 (owner: 10Dzahn) [00:23:17] (03PS1) 10Gergő Tisza: Deploy TemplateStyles to some non-content productions wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365879 (https://phabricator.wikimedia.org/T170863) [00:29:18] (03PS1) 10Dzahn: labtestweb2002/2003: add mapped IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/365880 [00:29:41] (03CR) 10Gergő Tisza: "Will be deployed this Tuesday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365879 (https://phabricator.wikimedia.org/T170863) (owner: 10Gergő Tisza) [00:30:56] 10Operations, 10Performance-Team, 10TemplateStyles, 10Traffic, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3446882 (10Johan) [00:54:10] (03PS1) 10Dzahn: add IPv6 for labweb*, labtestweb2001, labtestservices2001 [dns] - 10https://gerrit.wikimedia.org/r/365882 [00:55:08] (03PS2) 10Dzahn: labtestweb2002/2003: add mapped IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/365880 [00:55:27] (03PS3) 10Dzahn: labtestweb2002/2003: add mapped IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/365880 [00:58:29] (03PS1) 10Andrew Bogott: add ipv6 to labtestservices200[23] [puppet] - 10https://gerrit.wikimedia.org/r/365883 [00:59:00] (03CR) 10Andrew Bogott: "Yikes, I've never seen a rebase do /that/ before. It's fine to add it to the *web nodes, though... 
follow up is in https://gerrit.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/365171 (owner: 10Dzahn) [01:00:15] (03PS1) 10Niharika29: Enable CodeMirror on simplewiki for better testing and more exposure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365884 [01:19:51] (03PS4) 10Dzahn: labtestservices2002/2003: add mapped IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/365880 [01:23:26] (03CR) 10Dzahn: [C: 032] labtestservices2002/2003: add mapped IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/365880 (owner: 10Dzahn) [01:30:46] (03PS2) 10Dzahn: add IPv6 for labweb*, labtestservices* [dns] - 10https://gerrit.wikimedia.org/r/365882 [01:38:55] (03PS7) 10D3r1ck01: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) [01:43:40] PROBLEM - puppet last run on labtestservices2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/blacklist-wmf.conf] [01:43:59] PROBLEM - puppet last run on labtestservices2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/modprobe.d/blacklist-wmf.conf] [01:45:59] RECOVERY - puppet last run on labtestservices2003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [01:49:49] RECOVERY - puppet last run on labtestservices2002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [01:53:39] (03PS2) 10Dzahn: add ipv6 to labtestservices200[23] [puppet] - 10https://gerrit.wikimedia.org/r/365883 (owner: 10Andrew Bogott) [01:54:02] (03CR) 10Dzahn: "ah, i didn't see that until after i did https://gerrit.wikimedia.org/r/#/c/365880/ already done" [puppet] - 10https://gerrit.wikimedia.org/r/365883 (owner: 10Andrew Bogott) [01:58:41] (03Abandoned) 10Dzahn: add ipv6 to labtestservices200[23] [puppet] - 10https://gerrit.wikimedia.org/r/365883 (owner: 10Andrew Bogott) [02:03:24] (03PS1) 10Dzahn: puppetdb: remove postgres::ganglia from puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/365887 (https://phabricator.wikimedia.org/T169953) [02:05:33] (03PS2) 10Dzahn: puppetdb: remove postgres::ganglia from puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/365887 (https://phabricator.wikimedia.org/T169953) [02:07:17] (03CR) 10Dzahn: [C: 032] "It has never worked in puppetdb context - ganglia is deprecated - removing disk_stat as well - remove log spam" [puppet] - 10https://gerrit.wikimedia.org/r/365887 (https://phabricator.wikimedia.org/T169953) (owner: 10Dzahn) [02:13:19] !log nitrogen/nihal - rm /usr/lib/ganglia/python_modules/postgresql.py ; rm /etc/ganglia/conf.d/* ; restart gmond (T169953) [02:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:13:36] T169953: postgresql::ganglia on puppetdb servers - authentication failed - https://phabricator.wikimedia.org/T169953 [02:16:00] 10Operations, 10Patch-For-Review: postgresql::ganglia on puppetdb servers - authentication failed - https://phabricator.wikimedia.org/T169953#3446978 (10Dzahn) 05Open>03Resolved 
removed from nitrogen and nihal - cleaned up - do not see it anymore in logs now. for the scope of this ticket, should be done. [02:17:13] 10Operations, 10Patch-For-Review: postgresql::ganglia on puppetdb servers - authentication failed - https://phabricator.wikimedia.org/T169953#3446980 (10Dzahn) @faidon You reported it and asked me to take a look, i saw you weren't on the ticket, so fyi now. Should be gone. [02:33:59] PROBLEM - Check systemd state on nihal is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:36:53] oh, come on [02:38:49] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.9) (duration: 13m 25s) [02:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:59] RECOVERY - Check systemd state on nihal is OK: OK - running: The system is fully operational [02:44:01] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3446995 (10biplabanand) Successfully logged in to my account but when i tried to create an account via Special:CreateAccount, it... 
[02:45:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jul 18 02:45:25 UTC 2017 (duration 6m 36s) [02:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:48] (03PS1) 10Dzahn: cache::misc: add director for netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/365890 (https://phabricator.wikimedia.org/T166180) [03:08:10] (03PS2) 10Dzahn: cache::misc: add director for netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/365890 (https://phabricator.wikimedia.org/T166180) [03:10:30] (03CR) 10Dzahn: [C: 032] cache::misc: add director for netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/365890 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn) [03:17:18] (03PS1) 1020after4: Change $deploy_user home directory to /var/lib/${deploy_user} [puppet] - 10https://gerrit.wikimedia.org/r/365891 [03:18:18] (03PS1) 10Dzahn: smokeping: switch backend from netmon1002 to netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/365892 (https://phabricator.wikimedia.org/T166180) [03:18:24] (03CR) 10jerkins-bot: [V: 04-1] Change $deploy_user home directory to /var/lib/${deploy_user} [puppet] - 10https://gerrit.wikimedia.org/r/365891 (owner: 1020after4) [03:19:05] (03PS2) 10Dzahn: smokeping: switch backend from netmon1002 to netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/365892 (https://phabricator.wikimedia.org/T166180) [03:19:18] (03PS2) 1020after4: Change $deploy_user home directory to /var/lib/${deploy_user} [puppet] - 10https://gerrit.wikimedia.org/r/365891 (https://phabricator.wikimedia.org/T166013) [03:20:19] ok, bit of a shit question to have to ask. 
I need to dig through my physical files but I seem to have screwed up and ended up 2fa codeless [03:20:27] (03CR) 10jerkins-bot: [V: 04-1] Change $deploy_user home directory to /var/lib/${deploy_user} [puppet] - 10https://gerrit.wikimedia.org/r/365891 (https://phabricator.wikimedia.org/T166013) (owner: 1020after4) [03:21:09] what sort of standard of proof would I need to get somebody with shell access to disable it. Not a whole lot I can't prove although dont have a pgp which would be ideal [03:21:54] (03PS3) 1020after4: Change $deploy_user home directory to /var/lib/${deploy_user} [puppet] - 10https://gerrit.wikimedia.org/r/365891 (https://phabricator.wikimedia.org/T166013) [03:28:19] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 769.35 seconds [03:32:07] (03PS1) 10Dzahn: smokeping: sync data to netmon2001, use quickdatacopy, in role [puppet] - 10https://gerrit.wikimedia.org/r/365893 (https://phabricator.wikimedia.org/T166180) [03:45:36] (03PS2) 10Dzahn: smokeping: sync data to netmon2001, use quickdatacopy, in role [puppet] - 10https://gerrit.wikimedia.org/r/365893 (https://phabricator.wikimedia.org/T166180) [03:48:46] (03CR) 10Dzahn: [C: 032] smokeping: sync data to netmon2001, use quickdatacopy, in role [puppet] - 10https://gerrit.wikimedia.org/r/365893 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn) [03:49:21] (03PS3) 10Dzahn: smokeping: sync data to netmon2001, use quickdatacopy, in role [puppet] - 10https://gerrit.wikimedia.org/r/365893 (https://phabricator.wikimedia.org/T166180) [03:53:44] (03PS4) 10Dzahn: smokeping: sync data to netmon2001, use quickdatacopy, in role [puppet] - 10https://gerrit.wikimedia.org/r/365893 (https://phabricator.wikimedia.org/T166180) [04:08:19] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=757.50 Read Requests/Sec=519.20 Write Requests/Sec=3.10 KBytes Read/Sec=48052.40 KBytes_Written/Sec=78.00 [04:10:29] RECOVERY - MariaDB 
Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 157.07 seconds [04:16:29] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=10.50 Read Requests/Sec=0.20 Write Requests/Sec=4.30 KBytes Read/Sec=1.20 KBytes_Written/Sec=111.60 [04:22:09] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:22:57] !log remove 2FA from NativeForeigner per T170911 [04:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:41] (03PS1) 10Dzahn: rsync::quickdatacopy: avoid duplicate declaration errors [puppet] - 10https://gerrit.wikimedia.org/r/365895 [04:26:52] (03CR) 10jerkins-bot: [V: 04-1] rsync::quickdatacopy: avoid duplicate declaration errors [puppet] - 10https://gerrit.wikimedia.org/r/365895 (owner: 10Dzahn) [04:40:57] (03PS2) 10Dzahn: rsync::quickdatacopy: avoid duplicate declaration errors [puppet] - 10https://gerrit.wikimedia.org/r/365895 [04:42:16] (03PS3) 10Dzahn: rsync::quickdatacopy: avoid duplicate declaration errors [puppet] - 10https://gerrit.wikimedia.org/r/365895 [04:44:21] (03CR) 10Dzahn: [C: 032] rsync::quickdatacopy: avoid duplicate declaration errors [puppet] - 10https://gerrit.wikimedia.org/r/365895 (owner: 10Dzahn) [04:47:19] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [04:48:29] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:30] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:48:57] (03CR) 10Dzahn: "fixed netmon2001 - 21:47 < icinga-wm> RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 15 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/365895 (owner: 10Dzahn) [04:49:29] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:49:29] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [05:26:29] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3447120 (10Marostegui) When we replaced the main board we normally do not reimage the server if it not necessary (to avoid copying all the content back, in this case 2TB). I would say let's try if it works without reimage fir... [05:26:39] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [05:28:49] RECOVERY - WDQS SPARQL on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 13048 bytes in 0.003 second response time [05:29:09] RECOVERY - WDQS HTTP on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 13048 bytes in 0.001 second response time [05:32:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365897 [05:32:35] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365897 [05:39:12] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365897 (owner: 10Marostegui) [05:40:20] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365897 (owner: 10Marostegui) [05:40:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365897 (owner: 10Marostegui) [05:41:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 - 
T166204 (duration: 00m 44s) [05:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:58] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [05:44:49] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [05:54:59] (03PS1) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365898 (https://phabricator.wikimedia.org/T166204) [05:56:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365898 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [05:58:02] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365898 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [05:58:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365898 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [05:59:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T166204 (duration: 00m 43s) [05:59:05] !log Deploy alter table on s1 - db1065 - T166204 [05:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:14] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [05:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:39] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [06:18:34] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3447146 (10Marostegui) @Papaul I have started MySQL again on pc2006 so it doesn't fall behind too many days, if you need this host to be taken down again, please ping us so we can stop MySQL again. Thank you! 
[06:42:09] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [06:42:46] !log upgrading restbase on the various test clusters to nodejs 6.11 [06:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:34] (03PS2) 10Giuseppe Lavagetto: motd::script: use validate_numeric for priority [puppet] - 10https://gerrit.wikimedia.org/r/365569 [07:13:36] (03PS2) 10Giuseppe Lavagetto: rsyslog::conf: validate priority with validate_numeric [puppet] - 10https://gerrit.wikimedia.org/r/365570 [07:13:38] (03PS2) 10Giuseppe Lavagetto: sysctl::conffile: validate priority as numeric [puppet] - 10https://gerrit.wikimedia.org/r/365571 [07:13:40] (03PS4) 10Giuseppe Lavagetto: role::configcluster: move to future environment [puppet] - 10https://gerrit.wikimedia.org/r/365572 [07:13:42] (03PS1) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [07:16:46] (03PS1) 10Muehlenhoff: Reduce account expiration warning to seven days [puppet] - 10https://gerrit.wikimedia.org/r/365901 [07:20:06] (03CR) 10Muehlenhoff: [C: 032] Reduce account expiration warning to seven days [puppet] - 10https://gerrit.wikimedia.org/r/365901 (owner: 10Muehlenhoff) [07:21:10] RECOVERY - Disk space on stat1006 is OK: DISK OK [07:22:47] this is me --^ [07:23:11] stat1006's home too big, moving it to /srv [07:32:10] !log moved /home to /srv/home on stat1006 to free disk space (created symlink from /home -> /srv/home too) - T152712 [07:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:22] T152712: Replacement of stat1002 and stat1003 - https://phabricator.wikimedia.org/T152712 [07:43:27] (03PS1) 10Elukey: production-m4.sql.erb: fix grants for user eventlogcleaner [puppet] - 10https://gerrit.wikimedia.org/r/365902 [07:45:02] (03PS2) 10Elukey: role::mariadb::grants::production-m4.sql.erb: fix grants for user eventlogcleaner 
[puppet] - 10https://gerrit.wikimedia.org/r/365902 [07:45:59] (03CR) 10jerkins-bot: [V: 04-1] role::mariadb::grants::production-m4.sql.erb: fix grants for user eventlogcleaner [puppet] - 10https://gerrit.wikimedia.org/r/365902 (owner: 10Elukey) [07:46:09] (03PS5) 10Giuseppe Lavagetto: role::configcluster: move to future environment [puppet] - 10https://gerrit.wikimedia.org/r/365572 [07:49:22] hit by the commit msg validator! \o/ [07:51:38] (03PS3) 10Elukey: role::mariadb::grants: fix grants for user eventlogcleaner [puppet] - 10https://gerrit.wikimedia.org/r/365902 [07:52:17] !log upgrade wtp1001 to nodejs 6.11 [07:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:41] (03CR) 10Elukey: [C: 032] role::mariadb::grants: fix grants for user eventlogcleaner [puppet] - 10https://gerrit.wikimedia.org/r/365902 (owner: 10Elukey) [08:02:20] (03PS1) 10Ayounsi: Diffscan: Don't send emails if no new open/closed ports are found [puppet] - 10https://gerrit.wikimedia.org/r/365903 (https://phabricator.wikimedia.org/T169624) [08:14:55] (03PS1) 10Muehlenhoff: Add extended account data for mkroetzsch [puppet] - 10https://gerrit.wikimedia.org/r/365905 [08:16:51] (03CR) 10Ema: [V: 032] Linux kernel module handling [puppet] - 10https://gerrit.wikimedia.org/r/365030 (owner: 10Ema) [08:16:58] (03PS4) 10Ema: Linux kernel module handling [puppet] - 10https://gerrit.wikimedia.org/r/365030 [08:17:02] (03CR) 10Ema: [V: 032] Linux kernel module handling [puppet] - 10https://gerrit.wikimedia.org/r/365030 (owner: 10Ema) [08:18:15] (03PS2) 10Muehlenhoff: Add extended account data for mkroetzsch [puppet] - 10https://gerrit.wikimedia.org/r/365905 [08:18:21] (03PS6) 10Giuseppe Lavagetto: role::configcluster: move to future environment [puppet] - 10https://gerrit.wikimedia.org/r/365572 [08:22:09] (03Restored) 10Jcrespo: Parsercache: Purge rows every day, and reduce TTL to 22 days [puppet] - 10https://gerrit.wikimedia.org/r/361656 
(https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [08:22:11] (03CR) 10Muehlenhoff: [C: 032] Add extended account data for mkroetzsch [puppet] - 10https://gerrit.wikimedia.org/r/365905 (owner: 10Muehlenhoff) [08:22:20] (03PS5) 10Jcrespo: Parsercache: Purge rows every day, and reduce TTL to 22 days [puppet] - 10https://gerrit.wikimedia.org/r/361656 (https://phabricator.wikimedia.org/T167784) [08:22:22] (03Restored) 10Jcrespo: Parsercache: Reduce expiration time to 22 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361659 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [08:22:26] (03PS2) 10Jcrespo: Parsercache: Reduce expiration time to 22 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361659 (https://phabricator.wikimedia.org/T167784) [08:22:56] (03CR) 10Marostegui: [C: 031] Parsercache: Reduce expiration time to 22 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361659 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [08:32:59] <_joe_> win 19 [08:33:25] 10Operations, 10Commons, 10Performance-Team, 10Thumbor, 10media-storage: HTTP 429 on thumbnail images for specific SVG file on Commons - https://phabricator.wikimedia.org/T170628#3447225 (10MoritzMuehlenhoff) [08:38:03] (03PS2) 10Giuseppe Lavagetto: utils/pcc: add --future argument [puppet] - 10https://gerrit.wikimedia.org/r/365579 [08:38:07] (03PS1) 10Giuseppe Lavagetto: puppet_compiler: install ruby-rgen [puppet] - 10https://gerrit.wikimedia.org/r/365909 [08:42:07] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/365579 (owner: 10Giuseppe Lavagetto) [08:42:09] (03PS3) 10Giuseppe Lavagetto: utils/pcc: add --future argument [puppet] - 10https://gerrit.wikimedia.org/r/365579 [08:42:11] (03PS2) 10Giuseppe Lavagetto: puppet_compiler: install ruby-rgen [puppet] - 10https://gerrit.wikimedia.org/r/365909 [08:44:03] 10Operations, 10Commons, 10Performance-Team, 10Thumbor, 10media-storage: HTTP 429 on thumbnail images for 
specific SVG file on Commons - https://phabricator.wikimedia.org/T170628#3447243 (10MoritzMuehlenhoff) Backtrace is here: It crashes accessing right->deferred.other in active_edges(), but the code in... [08:51:07] (03PS1) 10Ema: [1/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365911 [08:51:09] (03PS1) 10Ema: [2/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365912 [08:51:11] (03CR) 10Giuseppe Lavagetto: [C: 032] utils/pcc: add --future argument [puppet] - 10https://gerrit.wikimedia.org/r/365579 (owner: 10Giuseppe Lavagetto) [08:51:23] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet_compiler: install ruby-rgen [puppet] - 10https://gerrit.wikimedia.org/r/365909 (owner: 10Giuseppe Lavagetto) [08:52:25] (03CR) 10jerkins-bot: [V: 04-1] [1/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365911 (owner: 10Ema) [08:52:27] (03CR) 10jerkins-bot: [V: 04-1] [2/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365912 (owner: 10Ema) [08:57:59] (03PS2) 10Ema: [1/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365911 [08:59:06] (03CR) 10jerkins-bot: [V: 04-1] [1/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365911 (owner: 10Ema) [09:00:03] !log reboot conf1002 for kernel updates [09:00:14] (03PS3) 10Ema: [1/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365911 [09:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:31] (03PS2) 10Ema: [2/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365912 [09:08:25] !log reboot conf1003 for kernel updates [09:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:04] (03PS8) 10Hashar: contint: move 
from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) [09:10:06] (03PS3) 10Hashar: Migrate puppet compiler instance from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/330412 (https://phabricator.wikimedia.org/T146381) [09:13:06] !log lvs300[34] upgrade pybal to 1.13.9 T82747 [09:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:18] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747 [09:15:19] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me, thanks for cleaning that up." [puppet] - 10https://gerrit.wikimedia.org/r/365911 (owner: 10Ema) [09:15:23] (03PS2) 10Phuedx: Enable page previews for everyone on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365696 (https://phabricator.wikimedia.org/T167365) (owner: 10Jdlrobson) [09:15:51] !log lvs300[12] upgrade pybal to 1.13.9 T82747 [09:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:19] PROBLEM - Check systemd state on conf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:16:31] checking, surely etcdmirror [09:17:12] yep, fixed [09:17:19] RECOVERY - Check systemd state on conf1002 is OK: OK - running: The system is fully operational [09:19:48] elukey: thanks, esams pybals upgraded to 1.13.9 (and hence restarted, thus properly connected to conf1003) [09:22:58] \o/ [09:26:13] 10Operations, 10TimedMediaHandler-Transcode, 10User-Elukey: Videoscalers overloaded once in a while triggering alarms - https://phabricator.wikimedia.org/T162815#3447287 (10elukey) 05Open>03declined It didn't re-happen during the last two months, so I am inclined to close this task and re-open if necessary. 
[09:27:04] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3447293 (10Joe) [09:27:06] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Add support for directory environments to our puppet classes, production puppetmaster - https://phabricator.wikimedia.org/T169485#3447294 (10Joe) [09:27:08] 10Operations, 10Puppet, 10puppet-compiler, 10Patch-For-Review, 10User-Joe: Add results of compilation with the future parser to the puppet compiler - https://phabricator.wikimedia.org/T169546#3447292 (10Joe) 05Open>03Resolved [09:27:50] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/365903 (https://phabricator.wikimedia.org/T169624) (owner: 10Ayounsi) [09:29:13] _joe_: hello, I would like to switch the puppet compiler to use /srv instead of /mnt. I've got a patch up and a migration plan. Would you have some spare time this week for it? [09:29:19] !log cp3030: upgrade to varnish 4.1.7-1wm1 and reboot for kernel update [09:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:47] <_joe_> hashar: why se /srv? who cares? [09:32:50] <_joe_> *use [09:33:04] cause pretty much everything else uses /srv :) [09:33:17] <_joe_> you mean the jenkins slaves? 
[09:33:30] yeah [09:33:38] I have migrated them all using a cherry pick on the CI puppet master [09:33:49] but the puppet compiler instance is attached to the labs puppet master, so I haven't migrated it [09:34:18] I wrote a migration plan on https://gerrit.wikimedia.org/r/#/c/330412/ [09:34:37] (basically unmount, merge puppet patch, mount /srv, ensure /mnt is gone from /etc/fstab, run puppet) [09:34:44] and that should point the puppet compiler to /srv [09:38:02] (03CR) 10Elukey: "I am still wondering if there is a more flexible approach for this, since every zk cluster will be whitelisted to accept the same srange, " [puppet] - 10https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815) (owner: 10Muehlenhoff) [09:42:23] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3447314 (10elukey) Just had a chat with Joe, and the approach that we'd like to follow is: 1) expand the current conf100[123] cluster with the conf100[456] nodes 2) verify that everythin... [09:52:21] (03CR) 10Muehlenhoff: [C: 031] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/365900 (owner: 10Giuseppe Lavagetto) [09:54:05] !log esams cache_text/upload: upgrade to varnish 4.1.7-1wm1 and reboot for kernel updates [09:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:50] !log rebooting oresrdb2002 for kernel update [10:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:34] !log lvs100[45] upgrade pybal to 1.13.9 T82747 T154759 [10:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:48] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747 [10:32:48] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759 [10:33:33] !log rebooting oresrdb1002 for kernel update [10:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:23] (03PS1) 10Muehlenhoff: Switch oresrdb.svc.eqiad.wmnet to oresrdb1002 [dns] - 10https://gerrit.wikimedia.org/r/365935 [10:41:56] (03Draft1) 10Paladox: Add .DS_Store to .gitignore [software/gerrit] - 10https://gerrit.wikimedia.org/r/365936 [10:41:58] (03PS2) 10Paladox: Add .DS_Store to .gitignore [software/gerrit] - 10https://gerrit.wikimedia.org/r/365936 [10:43:57] !log lvs100[12] upgrade pybal to 1.13.9 T82747 T154759 [10:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:09] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747 [10:44:09] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759 [10:51:28] just a heads up that i'm going to merge https://gerrit.wikimedia.org/r/#/c/365696/2 (beta cluster only) and update the deployment host [10:51:32] ^ hashar et al [10:52:23] !log lvs400[34] upgrade pybal to 1.13.9 T82747 T154759 [10:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:36] T82747: pybal health checks 
are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747 [10:52:36] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759 [10:54:16] !log lvs400[12] upgrade pybal to 1.13.9 T82747 T154759 [10:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:39] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [10:59:15] !log lvs200[45] upgrade pybal to 1.13.9 T82747 T154759 [10:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:27] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747 [10:59:28] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759 [11:00:36] !log lvs200[12] upgrade pybal to 1.13.9 T82747 T154759 [11:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:51] (03CR) 10Alexandros Kosiaris: [C: 032] CI/integration: Create role for docker CI agent [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) (owner: 10Thcipriani) [11:02:57] (03PS5) 10Alexandros Kosiaris: CI/integration: Create role for docker CI agent [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) (owner: 10Thcipriani) [11:05:06] (03CR) 10Alexandros Kosiaris: [C: 032] Switch oresrdb.svc.eqiad.wmnet to oresrdb1002 [dns] - 10https://gerrit.wikimedia.org/r/365935 (owner: 10Muehlenhoff) [11:06:26] (03CR) 10Phuedx: [C: 032] "Beta Cluster only!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/365696 (https://phabricator.wikimedia.org/T167365) (owner: 10Jdlrobson) [11:07:33] (03Merged) 10jenkins-bot: Enable page previews for everyone on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365696 (https://phabricator.wikimedia.org/T167365) (owner: 10Jdlrobson) [11:07:50] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747#3447457 (10ema) 05Open>03Resolved All LVSs upgraded to 1.13.9, which fixes this bug. [11:08:52] (03CR) 10jenkins-bot: Enable page previews for everyone on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365696 (https://phabricator.wikimedia.org/T167365) (owner: 10Jdlrobson) [11:09:06] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review, 10User-Joe: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759#3447460 (10ema) 05Open>03Resolved a:03ema All LVSs upgraded to pybal 1.13.9, which fixes this bug. [11:10:54] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3407910 (10ema) [11:10:56] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal should reset the etcdindex it's looking at after losing a connection - https://phabricator.wikimedia.org/T169893#3447465 (10ema) 05Open>03Resolved a:03ema All LVSs upgraded to pybal 1.13.9, which fixes this bug. Note that the partent task (... 
[11:11:22] 10Operations, 10DBA, 10Pybal, 10Availability: Create a backend check for pybal to monitor the MySQL protocol being up - https://phabricator.wikimedia.org/T165677#3447471 (10ema) p:05Triage>03Normal [11:11:39] done [11:12:43] (03PS9) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [11:12:45] (03PS8) 10Paladox: Gerrit: Upgrading gerrit to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [11:17:24] 10Operations, 10DBA, 10Pybal, 10Availability: Create a backend check for pybal to monitor the MySQL protocol being up - https://phabricator.wikimedia.org/T165677#3273473 (10ema) It might be worth looking into the built-in [[ http://twistedmatrix.com/documents/current/core/howto/rdbms.html | adbapi ]] for... [11:19:29] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [11:19:59] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3447484 (10biplabanand) resolved after discussion with @Dereckson on IRC. [11:20:58] akosiaris: around? ores doesn't look happy [11:21:17] ema: it's not a big deal [11:21:23] the alarm needs to be fixed [11:21:26] ema yes I am around [11:21:56] there's a 503 spike in misc: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=now-3h&to=now [11:22:02] we are failing over to oresrdb1002 for kernel upgrades on oresrdb1001 but in theory that should not cause a problem [11:22:45] the switchover for clients themselves is happening slowly. 
I am tracking it at https://grafana.wikimedia.org/dashboard/db/ores?panelId=18&fullscreen&orgId=1 [11:23:15] hmmm ores got overloaded [11:23:21] why though [11:23:29] yeah, and the 503 rate is still pretty high [11:24:09] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:24:10] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:24:19] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:24:19] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:24:19] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:24:19] this is me ^ [11:24:19] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:24:29] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:24:29] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:24:39] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:24:39] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:24:40] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:24:49] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:24:49] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:24:49] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:24:52] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:24:59] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 
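Note on the IPsec alerts above: every one of them reports the same two tunnels, cp3034_v4 and cp3034_v6, as not connected, which fits a single host going away rather than many separate faults. A minimal sketch of how that could be checked mechanically, assuming the alert text keeps exactly the "ok: N not-conn: a, b" shape seen here; the function name and the regular expression are illustrative and not part of any monitoring tool:

```python
import re
from collections import Counter

# Assumed alert shape, copied from the messages in this log:
# "PROBLEM - IPsec on <host> is CRITICAL: Strongswan CRITICAL - ok: <n> not-conn: <a>, <b>"
ALERT_RE = re.compile(
    r"PROBLEM - IPsec on (?P<host>\S+) is CRITICAL: "
    r"Strongswan CRITICAL - ok: (?P<ok>\d+) not-conn: "
    r"(?P<tunnels>\w+(?:, \w+)*)"
)


def unreachable_peers(text):
    """Count how often each peer host is named in a not-conn list."""
    counts = Counter()
    for match in ALERT_RE.finditer(text):
        for tunnel in match.group("tunnels").split(", "):
            # "cp3034_v4" -> "cp3034": drop the address-family suffix.
            counts[tunnel.rsplit("_", 1)[0]] += 1
    return counts


# Two alerts taken from the log, left on one line as they appear here.
SAMPLE = (
    "[11:24:09] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - "
    "ok: 56 not-conn: cp3034_v4, cp3034_v6 "
    "[11:24:19] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - "
    "ok: 70 not-conn: cp3034_v4, cp3034_v6"
)

peer_counts = unreachable_peers(SAMPLE)
print(peer_counts.most_common())
```

A single peer dominating the count points at that peer (here the host being rebooted) rather than at the hosts raising the alerts.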
[11:24:59] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:25:00] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:25:00] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3034_v4, cp3034_v6 [11:25:09] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:25:10] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3034_v4, cp3034_v6 [11:25:18] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3447511 (10biplabanand) @Urbanecm The logo of MWUG looks bit smaller for me. would you please check it and resize it to bit bigg... [11:25:39] PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100% [11:25:43] !log powercycle cp3034, not rebooting properly [11:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:05] (03PS1) 10Phuedx: pagePreviews: Re-enable Popups extension on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365939 (https://phabricator.wikimedia.org/T167365) [11:26:25] I have nothing, I 'll revert the DNS change [11:26:38] (03PS1) 10Alexandros Kosiaris: Revert "Switch oresrdb.svc.eqiad.wmnet to oresrdb1002" [dns] - 10https://gerrit.wikimedia.org/r/365940 [11:26:41] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Switch oresrdb.svc.eqiad.wmnet to oresrdb1002" [dns] - 10https://gerrit.wikimedia.org/r/365940 (owner: 10Alexandros Kosiaris) [11:26:44] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Switch oresrdb.svc.eqiad.wmnet to oresrdb1002" [dns] - 10https://gerrit.wikimedia.org/r/365940 (owner: 10Alexandros Kosiaris) [11:27:45] (03PS1) 10Phuedx: Revert "Log all events for page previews in beta cluster" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/365941 [11:27:54] (03PS2) 10Phuedx: Revert "Log all events for page previews in beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365941 [11:27:59] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [11:27:59] RECOVERY - Host cp3034 is UP: PING OK - Packet loss = 16%, RTA = 83.88 ms [11:28:00] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 72 ESP OK [11:28:09] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK [11:28:09] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 72 ESP OK [11:28:09] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 72 ESP OK [11:28:10] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 72 ESP OK [11:28:10] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [11:28:19] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK [11:28:19] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [11:28:19] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [11:28:19] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 58 ESP OK [11:28:20] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 72 ESP OK [11:28:29] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [11:28:29] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 72 ESP OK [11:28:39] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 72 ESP OK [11:28:40] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 58 ESP OK [11:28:49] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 72 ESP OK [11:28:49] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK [11:28:49] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [11:28:49] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 72 ESP OK [11:28:56] ok jdlrobson's change turned off the popups extension on the beta cluster -- so i'm going to revert it [11:28:59] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 72 ESP OK [11:29:00] (03PS1) 10Esanders: Add Welsh mobile logo (just changes 'k' to 'c'). 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 [11:29:33] (03Abandoned) 10Phuedx: Revert "Log all events for page previews in beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365941 (owner: 10Phuedx) [11:29:45] (03PS1) 10Phuedx: Revert "Enable page previews for everyone on labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365943 [11:30:21] akosiaris: let me know if I can be of any assistance [11:31:14] (03CR) 10Jcrespo: [C: 032] mariadb: Add db2072 to the list of enwiki hosts [software] - 10https://gerrit.wikimedia.org/r/365283 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [11:31:25] (03PS2) 10Jcrespo: mariadb: Add db2072 to the list of enwiki hosts [software] - 10https://gerrit.wikimedia.org/r/365283 (https://phabricator.wikimedia.org/T170662) [11:31:28] (03CR) 10Phuedx: [C: 032] Revert "Enable page previews for everyone on labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365943 (owner: 10Phuedx) [11:31:53] (03CR) 10Aude: [C: 031] "ca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [11:32:29] (03CR) 10Aude: [C: 031] "good to be deployed in swat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [11:32:36] (03Merged) 10jenkins-bot: Revert "Enable page previews for everyone on labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365943 (owner: 10Phuedx) [11:32:49] (03CR) 10jenkins-bot: Revert "Enable page previews for everyone on labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365943 (owner: 10Phuedx) [11:33:11] (03PS1) 10Jcrespo: Revert "mariadb: Add db2072 to the list of enwiki hosts" [software] - 10https://gerrit.wikimedia.org/r/365944 [11:33:28] (03CR) 10Jcrespo: [V: 032 C: 032] "I had already been added" [software] - 10https://gerrit.wikimedia.org/r/365944 (owner: 10Jcrespo) [11:34:53] done [11:35:49] akosiaris: in 
particular scb1001 seems to be flapping according to pybal? [11:36:36] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365946 (https://phabricator.wikimedia.org/T142582) [11:36:37] * akosiaris looking, but there's nothing special about it [11:36:51] the 503 rate seems to be going down meanwhile [11:37:05] yeah I 've reverted the swap to oresrdb1002 [11:37:12] (and scb1001 stopped flapping after I said that heh) [11:37:14] damn if I know why that failed this time around [11:38:24] they overload is done [11:38:26] the* [11:38:37] why did this happen though [11:38:42] logs were not helpful at all [11:39:18] there's still a bunch of errors for: [11:39:19] GET http://ores.wikimedia.org/scores/wikidatawiki/?models=damaging%7Cgoodfaith&revids=523068397&precache=true&format=json [11:39:28] and similar [11:39:44] that's returning fine to me right now [11:39:49] although less frequent than before [11:41:50] now both scb1001 and scb1002 are flapping: Getting http://localhost/v2/scores/ took longer than 5 seconds. [11:42:37] (03PS1) 10Urbanecm: Make maiwikimedia's logo a little bit bigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365948 (https://phabricator.wikimedia.org/T170922) [11:44:07] (03CR) 10Hashar: [C: 031] "sync wmf-config/mobile.php first to have wgMFAllowNonJavaScriptEditing defined. Then sync InitializeSettings.php." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [11:44:10] ema: that's one URL endpoint that is not dependent on the backend store IIRC [11:44:22] it's the one thing that should be able to work almost always [11:44:33] (03CR) 10Hashar: [C: 031] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365946 (https://phabricator.wikimedia.org/T142582) (owner: 10Jdrewniak) [11:44:38] akosiaris: interesting [11:44:46] and it's working fine for me currently [11:45:00] (03CR) 10Hashar: [C: 031] "Dont forget to run the sync-portals script :]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365946 (https://phabricator.wikimedia.org/T142582) (owner: 10Jdrewniak) [11:45:04] akosiaris@scb1001:/var/log$ curl -I http://localhost:8081/v2/scores/ [11:45:05] HTTP/1.1 200 OK [11:45:10] so what's going on... [11:45:12] akosiaris: right now it is fine according to pybal too [11:45:12] (03CR) 10Hashar: [C: 031] Provide HD logos for several Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365618 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [11:45:16] (03CR) 10Hashar: [C: 031] Update enwikiquote's logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365409 (https://phabricator.wikimedia.org/T170722) (owner: 10Urbanecm) [11:45:28] (03CR) 10Hashar: [C: 031] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365663 (https://phabricator.wikimedia.org/T170844) (owner: 10Urbanecm) [11:45:42] akosiaris: it was not ok at 11:42:18 on scb1001 [11:45:55] also, I do not see any 503 anymore [11:46:03] Hi hashar. 
Just a note because you're reviewing my changes, I'm adding one change (365948) to the SWAT calendar [11:46:43] akosiaris: I see that the OOM killer had some fun on 1001 yesterday (random observation) [11:46:59] yeah with pdfrender [11:47:15] when you run chrome that's to be expected [11:47:22] or any web browser [11:47:29] right :) [11:47:30] but that OOM is not the "system" OOM [11:47:35] it's the cgroup's [11:47:44] * akosiaris not even sure that terminology is right [11:47:45] ok [11:47:58] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3447604 (10Urbanecm) For record: This was discussed on IRC. I've created T170922 for that purpose. [11:48:44] logs literally have nothing about the event [11:48:48] whatever happened, the DNS change seems to have been the trigger, and reverting it did not immediately fix the issue (but it did apparently after a while) [11:48:52] I must be blind, it can't be that bad [11:51:34] akosiaris: so there's a few pretty slow responses to pybal that I can see on scb1001's logs: [11:51:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [11:51:41] journalctl -u pdfrender.service --since=today | awk '/Twisted/ && $15 > 2' [11:52:09] ema: yeah like 1+ sec right ? [11:52:17] or even more.. up to 2.5 secs ? [11:52:28] up to 5.089 (max) [11:52:41] why are you looking into pdfrender ? [11:52:46] it's not related to ORES [11:52:49] oh! [11:52:51] sorry [11:52:58] they are just colocated for now [11:52:58] (because I don't know what I'm doing) [11:53:11] it's uwsgi-ores and ores-celery-worker [11:53:30] generated 181 bytes in 1179 msecs (HTTP/1.1 200) [11:53:35] that's relatively slow [11:53:48] but ORES can be slow when the score hasn't been generated yet [11:54:10] generated 162 bytes in 15376 msecs [11:54:13] ouch.. 
that's a lot [11:54:42] I 'll hack /etc/hosts on scb1001 to have it use oresrdb1002 again [11:54:48] ok [11:54:55] so I 'll effectively reproduce the issue on just 25% of the boxes [11:55:23] I'll be looking for 503s on cache_misc boxes in eqiad meanwhile [11:56:01] !log add oresrdb.svc.eqiad.wmnet in scb1001's /etc/hosts, restart uwsgi-ores and ores-celery-worker [11:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:30] akosiaris: 503s started immediately [11:57:07] and yet logs in /srv/log/ores/main.log do not report that [11:57:29] they seem to have stopped [11:57:40] last one at 11:57:13 [11:58:03] ok then those maybe been related to the uwgi restart ? [11:58:13] I don't think it does graceful restarts [11:58:28] ah no, my mistake [11:58:35] I 've issued restart instead of reload [11:58:47] ok [11:58:52] in any case, they've stopped entirely now [11:59:05] surely ores should log those errors though [11:59:33] not necessarily the ones happening during restart, but those we've seen earlier [11:59:44] yeah, I agree [12:01:58] akosiaris: did you also restart ores after the dns change or only this time around when hacking /etc/hosts? [12:02:03] ah finally got it ... 
it's just saying 503 instead of 200 and they are all intermixed [12:02:05] only this time around [12:02:15] previously we wanted to wait it out [12:02:44] ok [12:02:50] akosiaris@scb1001:/srv/log/ores$ grep '(HTTP/1.1 503)' main.log | wc -l [12:02:50] 4859 [12:02:53] ok we have something [12:02:55] ah there you go [12:03:00] but nothing about why this happened [12:03:13] for example [12:03:20] [2017-07-18T11:56:56] [pid: 22909] 10.2.2.10 (-) {32 vars in 644 bytes} [Tue Jul 18 11:56:56 2017] GET /v2/scores/eswiki/?models=reverted&revids=100546982&precache=true&format=json => generated 144 bytes in 2 msecs (HTTP/1.1 503) 6 headers in 225 bytes (1 switches on core 0) user agent "ChangePropagation/WMF" [12:05:41] hmm uwsgi has clearly no wait to know what happened if the application is not sharing this info with uwsgi [12:05:49] https://uwsgi-docs.readthedocs.io/en/latest/LogFormat.html [12:06:10] so it's lacking severity as well as some informational message [12:10:26] !log remove oresrdb.svc.eqiad.wmnet in scb1001's /etc/hosts, but do not restart/reload uwsgi-ores and ores-celery-worker [12:10:31] let's see what happens now [12:10:36] yep [12:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:12] all quiet so far [12:15:01] it has been falling back though .. I can see many ESTABLISHED connections to oresrdb1001 instead of oresrdb1002 [12:15:41] ok [12:15:51] we should perhaps try again an actual DNS change [12:16:12] (although without logs it's not really clear what we can conclude) [12:16:56] true [12:20:51] ema: I 'll wait first for all connections to fallback to the correct entry and retry with an /etc/hosts entry [12:20:59] and if that works, retry the DNS change [12:21:09] maybe it was more than one thing together ... [12:22:04] akosiaris: sounds good, I'm logging 503s from ores on cache_eqiad hosts [12:22:32] (03CR) 10Alexandros Kosiaris: [C: 031] "Merging. 
I 'll do the various boxes 1 by 1 to make sure we don't end up killing the service" [puppet] - 10https://gerrit.wikimedia.org/r/365518 (owner: 10Lucas Werkmeister (WMDE)) [12:22:52] in the meantime.. let's kill another service [12:24:01] (03PS3) 10Alexandros Kosiaris: Add sandboxing directives to wdqs-blazegraph.service [puppet] - 10https://gerrit.wikimedia.org/r/365518 (owner: 10Lucas Werkmeister (WMDE)) [12:24:17] (03CR) 10Alexandros Kosiaris: [C: 032] Add sandboxing directives to wdqs-blazegraph.service [puppet] - 10https://gerrit.wikimedia.org/r/365518 (owner: 10Lucas Werkmeister (WMDE)) [12:24:20] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add sandboxing directives to wdqs-blazegraph.service [puppet] - 10https://gerrit.wikimedia.org/r/365518 (owner: 10Lucas Werkmeister (WMDE)) [12:32:12] !log Run maintain-views on labsdb1001,1003,1009,1010 and 1011 - T168788 [12:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:24] T168788: Prepare and check storage layer for maiwikimedia - https://phabricator.wikimedia.org/T168788 [12:35:24] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3447690 (10Marostegui) [12:38:36] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3447694 (10Urbanecm) 05Open>03Resolved Wiki was created. Only minor improvements are to happen. [12:39:11] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [12:41:19] ugh [12:42:01] router hickuped massively, only just got back online [12:42:12] good to see the beta cluster is up after the revert! 
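Note on the ORES log search above: the 503s were hard to spot because uwsgi writes them in the same request-line format as the 200s, with the status only appearing in the "(HTTP/1.1 503)" suffix. A minimal sketch of tallying statuses and slow responses from such lines, assuming the line format quoted at 12:03:20. The helper name, the 5000 ms cutoff (borrowed from the pybal "took longer than 5 seconds" message) and the second and third sample lines are illustrative assumptions, not real log entries:

```python
import re
from collections import Counter

# Tail of a uwsgi request line as quoted in the log, e.g.
# "GET /v2/scores/... => generated 144 bytes in 2 msecs (HTTP/1.1 503)"
REQUEST_RE = re.compile(
    r"(?P<method>[A-Z]+) (?P<path>\S+) => generated (?P<size>\d+) bytes "
    r"in (?P<msecs>\d+) msecs \(HTTP/\d\.\d (?P<status>\d{3})\)"
)


def summarize(lines, slow_msecs=5000):
    """Return (status counts, list of (msecs, path) at or above slow_msecs)."""
    statuses = Counter()
    slow = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not a request line (startup messages, tracebacks, ...)
        statuses[match.group("status")] += 1
        msecs = int(match.group("msecs"))
        if msecs >= slow_msecs:
            slow.append((msecs, match.group("path")))
    return statuses, slow


SAMPLE_LINES = [
    # Real line from the log above.
    '[2017-07-18T11:56:56] [pid: 22909] 10.2.2.10 (-) {32 vars in 644 bytes} '
    '[Tue Jul 18 11:56:56 2017] GET /v2/scores/eswiki/?models=reverted'
    '&revids=100546982&precache=true&format=json => generated 144 bytes '
    'in 2 msecs (HTTP/1.1 503) 6 headers in 225 bytes (1 switches on core 0) '
    'user agent "ChangePropagation/WMF"',
    # Hypothetical lines: sizes and timings are from the chat, the rest is made up.
    '[pid: 1] 127.0.0.1 (-) {30 vars in 400 bytes} [Tue Jul 18 11:53:30 2017] '
    'GET /v2/scores/ => generated 181 bytes in 1179 msecs (HTTP/1.1 200)',
    '[pid: 1] 127.0.0.1 (-) {30 vars in 400 bytes} [Tue Jul 18 11:54:10 2017] '
    'GET /v2/scores/ => generated 162 bytes in 15376 msecs (HTTP/1.1 200)',
]

status_counts, slow_requests = summarize(SAMPLE_LINES)
print(dict(status_counts), slow_requests)
```

This only reports what uwsgi recorded. As noted in the discussion, the reason for each 503 would still have to come from the application's own logging.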
[12:45:54] (03PS2) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [12:49:06] (03PS2) 10Phuedx: pagePreviews: Enable for anons/as pref on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365939 (https://phabricator.wikimedia.org/T167365) [12:51:06] jouncebot, next [12:51:10] In 0 hour(s) and 8 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170718T1300) [12:54:55] 10Operations, 10ops-eqiad: Unresponsive mgmt on oxygen - https://phabricator.wikimedia.org/T170924#3447713 (10MoritzMuehlenhoff) [12:56:13] (03CR) 10Ayounsi: [C: 032] Diffscan: Don't send emails if no new open/closed ports are found [puppet] - 10https://gerrit.wikimedia.org/r/365903 (https://phabricator.wikimedia.org/T169624) (owner: 10Ayounsi) [12:56:19] (03PS2) 10Ayounsi: Diffscan: Don't send emails if no new open/closed ports are found [puppet] - 10https://gerrit.wikimedia.org/r/365903 (https://phabricator.wikimedia.org/T169624) [12:59:56] (03CR) 10Faidon Liambotis: [C: 031] "\o/ Couple of notes:" [puppet] - 10https://gerrit.wikimedia.org/r/365911 (owner: 10Ema) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170718T1300). [13:00:04] James_F, Urbanecm, and jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:08] Present [13:00:11] o/ [13:00:18] (03CR) 10Faidon Liambotis: [C: 032] [2/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365912 (owner: 10Ema) [13:00:41] Heya. [13:00:44] o/ [13:00:58] I can SWAT today! 
[13:01:08] James_F: want to deploy your change, or should I? [13:01:17] You please. [13:01:55] (03PS4) 10Zfilipin: Enable mobile non-JavaScript editing on all MobileFrontend wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [13:02:23] James_F: sure, will start in a minute; can you test it at mwdebug? [13:03:29] Yeah. [13:04:16] 10Operations, 10ops-eqiad, 10OCG-General: ocg1001 is broken - https://phabricator.wikimedia.org/T170886#3447753 (10Peachey88) [13:04:17] James_F: reviewing... will ping you in a few minutes; any particular order files should be synced in (during deployment)? [13:04:38] marostegui: heya, there are a few new tables with slightly different schemas than the usual eventlogging ones [13:04:53] as such, they did not get indexes added to their uuid and timestamp fields (which have different names) [13:05:05] the biggest of these tables has 900K records [13:05:05] zeljkof, see hashar's comment in the patch (sync wmf-config/mobile.php first to have wgMFAllowNonJavaScriptEditing defined. Then sync InitializeSettings.php.) [13:05:12] i'm planning on stopping the process that is inserting into these tables [13:05:34] and adding the unique index on the uuid (meta_id) field, and an index on timestamp (meta_dt) [13:05:38] objection? [13:05:49] ottomata: Not much, you know more than I do about those schemas! :) [13:05:50] zeljkof: mobile -> Initialise -> Initialise-labs [13:06:00] Urbanecm, James_F: thanks! [13:06:04] zeljkof: Argh, I forgot one of my changes. [13:06:08] (03CR) 10Ema: "> - As confirmed by Moritz, blacklist-linux44 can be merged with" [puppet] - 10https://gerrit.wikimedia.org/r/365911 (owner: 10Ema) [13:06:15] James_F: still time to add it [13:06:28] Kk, one second. [13:06:54] marostegui: ok cool, just a heads up then, tahnks :) [13:07:07] ottomata: thank you! 
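Note on the index plan above: ottomata describes a unique index on the uuid field (meta_id) and a plain index on the timestamp field (meta_dt). A minimal sketch of that shape, using Python's built-in sqlite3 with an in-memory database as a stand-in for the real MariaDB tables. The table name, the index names and the payload column are hypothetical; only meta_id and meta_dt come from the discussion:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; only meta_id and meta_dt are named in the chat.
conn.execute(
    "CREATE TABLE example_events (meta_id TEXT, meta_dt TEXT, payload TEXT)"
)

# Unique index on the uuid field, plain index on the timestamp field.
conn.execute(
    "CREATE UNIQUE INDEX ix_example_events_meta_id ON example_events (meta_id)"
)
conn.execute(
    "CREATE INDEX ix_example_events_meta_dt ON example_events (meta_dt)"
)

conn.execute(
    "INSERT INTO example_events VALUES ('a1', '2017-07-18T13:05:00', '{}')"
)

# A second row with the same meta_id is rejected by the unique index.
try:
    conn.execute(
        "INSERT INTO example_events VALUES ('a1', '2017-07-18T13:06:00', '{}')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# PRAGMA index_list rows are (seq, name, unique, origin, partial).
index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list('example_events')")
)
print(duplicate_rejected, index_names)
```

On a table that already holds rows, creating the unique index fails if duplicates exist, which is one reason to pause the inserting process first, as proposed in the chat.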
[13:07:35] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [13:09:30] (03Merged) 10jenkins-bot: Enable mobile non-JavaScript editing on all MobileFrontend wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [13:09:39] (03CR) 10jenkins-bot: Enable mobile non-JavaScript editing on all MobileFrontend wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [13:10:18] zeljkof: Added to the list (finally!). [13:10:19] (03PS2) 10Urbanecm: Provide HD logos for several Wikiquotes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365618 (https://phabricator.wikimedia.org/T150618) [13:10:29] (03PS3) 10Urbanecm: Provide HD logos for several Wikiquotes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365618 (https://phabricator.wikimedia.org/T150618) [13:13:22] James_F: 349274 is at mwdebug1002, please test and let me know if I can proceed [13:14:06] zeljkof: Yeah, LGTM. [13:14:17] James_F: ok, deploying... 
[13:16:45] !log zfilipin@tin Synchronized wmf-config/mobile.php: SWAT: [[gerrit:349274|Enable mobile non-JavaScript editing on all MobileFrontend wikis (T125174)]] (duration: 00m 44s) [13:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:57] T125174: [EPIC] Enable editing for mobile users without JavaScript and kill Special:MobileEditor code in MobileFrontend - https://phabricator.wikimedia.org/T125174 [13:18:18] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:349274|Enable mobile non-JavaScript editing on all MobileFrontend wikis (T125174)]] (duration: 00m 43s) [13:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:40] (03PS4) 10Alexandros Kosiaris: lvs: Remove all bgp keywords from configuration [puppet] - 10https://gerrit.wikimedia.org/r/356790 [13:19:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] lvs: Remove all bgp keywords from configuration [puppet] - 10https://gerrit.wikimedia.org/r/356790 (owner: 10Alexandros Kosiaris) [13:19:28] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:349274|Enable mobile non-JavaScript editing on all MobileFrontend wikis (T125174)]] (duration: 00m 43s) [13:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:56] James_F: deployed, please check [13:20:27] James_F: reviewing 360371 [13:20:44] zeljkof: LGTM, thanks. [13:21:40] James_F: there is a merge conflict for 360371 [13:22:08] Hmm. [13:22:58] zeljkof: Can you rebase? I'm not on my dev machine. [13:23:05] Sorry. [13:23:11] James_F: I can try! :) [13:25:02] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10User-Urbanecm: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3447871 (10Amire80) [13:25:06] Oh, yeah, I see, I added metawiki to the list later in an earlier patch and didn't rebase this one, sorry. 
Should be default false, mw/meta/wikipedia true. [13:26:48] (Sorry everyone else.) [13:26:54] 10Operations, 10Pybal, 10Traffic, 10netops: Deploy pybal with BGP MED support (for primary/backup) in production - https://phabricator.wikimedia.org/T165584#3447907 (10ema) p:05Triage>03Normal [13:30:25] PROBLEM - eventlogging_sync processes on dbstore1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [13:30:54] James_F: some trouble rebasing... [13:31:02] have to reconnect, back in a minute [13:31:17] zeljkof: Let's skip it so others' patches can go, I'll reschedule for this evening. [13:31:19] (03CR) 10Alexandros Kosiaris: "this has worked fine on lvs1006. Doubled checked from both cr2's PoV and lvs1006's PoV. Proceeding with the rest of the standby LVSes" [puppet] - 10https://gerrit.wikimedia.org/r/356790 (owner: 10Alexandros Kosiaris) [13:32:07] James_F: ok [13:34:06] (03CR) 10Pmiazga: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365939 (https://phabricator.wikimedia.org/T167365) (owner: 10Phuedx) [13:34:52] _joe_: could you have a look at https://gerrit.wikimedia.org/r/#/c/365053/ ? It fixes my issue but I want to make sure I'm not misunderstanding. [13:34:56] (03PS3) 10Ayounsi: Diffscan: Don't send emails if no new open/closed ports are found [puppet] - 10https://gerrit.wikimedia.org/r/365903 (https://phabricator.wikimedia.org/T169624) [13:35:13] <_joe_> andrewbogott: which issue? [13:35:42] sorry, internet problems, I'm back [13:35:49] (03PS1) 10Urbanecm: Update wikiversity's logos to 2017 form [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365966 (https://phabricator.wikimedia.org/T160491) [13:35:52] <_joe_> I'm not sure that is correct, I'll have to look into it [13:36:00] Urbanecm: 365618 is next [13:36:10] (03CR) 10Rush: [C: 031] "I looked at modules/labstore/manifests/traffic_shaping.pp and modules/labstore/manifests/traffic_shaping.pp seems more readable to me. 
tha" [puppet] - 10https://gerrit.wikimedia.org/r/365911 (owner: 10Ema) [13:36:14] _joe_: let me see revert my local patch so you can see... [13:36:39] <_joe_> andrewbogott: no need, I just need to finish something and I can look better into it [13:36:41] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3447930 (10Urbanecm) a:05Dereckson>03None [13:36:47] ok, thanks [13:37:17] <_joe_> andrewbogott: so, just to make sure: your problem was that your puppetmaster was using the ca cert it got as a client of another puppetmaster, and not its own ca? [13:37:31] <_joe_> because that's the kind of issue this could fix [13:37:33] _joe_: yep, correct [13:37:45] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365618 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:38:00] <_joe_> ok let me check for side effects though [13:38:34] RECOVERY - eventlogging_sync processes on dbstore1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [13:38:45] <_joe_> it looks like it would solve your problem alright, just wanna check it [13:39:10] (03Merged) 10jenkins-bot: Provide HD logos for several Wikiquotes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365618 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:39:19] (03CR) 10jenkins-bot: Provide HD logos for several Wikiquotes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365618 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [13:39:25] elukey ^could the purge process be affecting the replication process? [13:39:51] or is it otto's planned maintenance? 
[13:40:00] 10Operations, 10User-fgiunchedi: prometheus-puppet-agent-stats cronspam on missing puppet stats - https://phabricator.wikimedia.org/T170932#3447944 (10fgiunchedi) [13:40:02] 10Operations, 10ops-eqiad: Unresponsive mgmt on oxygen - https://phabricator.wikimedia.org/T170924#3447958 (10MoritzMuehlenhoff) 05Open>03Resolved Faidon pointed me to "racadm racseset" and that fixed it, sorry for the noise. [13:40:33] jynus: hmm, oh its possible my maintenance affected it [13:40:52] no problem in that case [13:41:03] just mentioned in case we had to debug further [13:41:17] ah yeah [13:41:21] i see an errror in it ERROR 1146 (42S02) at line 1: Table 'log.mediawiki_page_create_1' doesn't exist [13:41:35] which def would have been caused by me, i did a quick rename table as some point to put a new table in place with same name [13:41:40] looks like it started back up though [13:41:42] thanks [13:41:44] that is ok [13:41:58] Urbanecm: 365618 is at mwdebug, please test and let me know if I can proceed [13:42:05] thanks for checking jynus ! 
[13:42:09] the master should be checked [13:42:09] zeljkof, testing [13:42:21] so we do no reinsert data intended to be deleted [13:42:32] or things like that [13:42:49] it is very easy to break things- believe me- I used to do it all the time [13:43:35] I wonder if by deleting some tables, the replication process hasn't recreated them and reinserted [13:43:41] which led to the space issues [13:46:28] zeljkof, working [13:47:59] Urbanecm: deploying [13:48:06] (03PS3) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [13:48:56] !log zfilipin@tin Synchronized static/images/project-logos: SWAT: [[gerrit:365618|Provide HD logos for several Wikiquotes (T150618)]] (duration: 00m 44s) [13:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:08] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [13:50:49] !log codfw cache_text/upload: upgrade to varnish 4.1.7-1wm1 and reboot for kernel updates [13:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:31] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:365618|Provide HD logos for several Wikiquotes (T150618)]] (duration: 00m 43s) [13:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:47] Urbanecm: deployed, please check [13:52:04] reviewing 365409 [13:52:15] zeljkof, working [13:52:17] ack [13:52:50] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365409 (https://phabricator.wikimedia.org/T170722) (owner: 10Urbanecm) [13:54:00] (03PS1) 10Lucas Werkmeister (WMDE): Make wbqc_constraints table available on Labs [puppet] - 10https://gerrit.wikimedia.org/r/365969 (https://phabricator.wikimedia.org/T170927) [13:54:15] (03Merged) 10jenkins-bot: Update enwikiquote's logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365409 
(https://phabricator.wikimedia.org/T170722) (owner: 10Urbanecm) [13:54:56] (03CR) 10jenkins-bot: [V: 04-1] Make wbqc_constraints table available on Labs [puppet] - 10https://gerrit.wikimedia.org/r/365969 (https://phabricator.wikimedia.org/T170927) (owner: 10Lucas Werkmeister (WMDE)) [13:56:09] (03CR) 10jenkins-bot: Update enwikiquote's logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365409 (https://phabricator.wikimedia.org/T170722) (owner: 10Urbanecm) [13:56:30] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [13:56:31] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:28] (03CR) 10Anomie: [C: 031] Deploy TemplateStyles to some non-content productions wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365879 (https://phabricator.wikimedia.org/T170863) (owner: 10Gergő Tisza) [13:59:24] Urbanecm: 365409 is at mwdebug1002 [14:00:00] Urbanecm, jan_drewniak: we are at the end of swat window, can you stay longer? (I can) [14:00:04] Tgr: Dear anthropoid, the time has come. Please deploy TemplateStyles (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170718T1400). [14:00:41] zeljkof: yes please [14:00:45] tgr: sorry, swat is still in progress, I can finish in a minute [14:00:58] jan_drewniak: sorry, just noticed that tgr has this window [14:01:33] tgr: can you start 10-20 minutes later? or should we hurry with finishing up swat? [14:01:36] zeljkof: no worries, I can wait [14:01:40] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 37.78 ms [14:01:40] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.91 ms [14:01:47] tgr: great, in that case... [14:01:59] !log continuing with EU SWAT [14:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:21] Urbanecm: 365409 is at mwdebug1002, did you check? 
[14:02:57] (03PS2) 10Lucas Werkmeister (WMDE): Make wbqc_constraints table available on Labs [puppet] - 10https://gerrit.wikimedia.org/r/365969 (https://phabricator.wikimedia.org/T170927) [14:03:20] (03CR) 10Lucas Werkmeister (WMDE): "Just in case it’s not clear from the linked task: I’m not very familiar with Labs, so if the commit message “make X available on Labs” doe" [puppet] - 10https://gerrit.wikimedia.org/r/365969 (https://phabricator.wikimedia.org/T170927) (owner: 10Lucas Werkmeister (WMDE)) [14:03:34] I was disconnected: Anything that I should do now? [14:03:41] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2086987 [14:03:45] Urbanecm: 365409 is at mwdebug1002 [14:04:02] we can continue with swat, but tgr has this window, so we should hurry up :) [14:04:29] ack [14:05:08] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365663 (https://phabricator.wikimedia.org/T170844) (owner: 10Urbanecm) [14:05:29] zeljkof, please deploy [14:05:34] Urbanecm: ok [14:06:27] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365663 (https://phabricator.wikimedia.org/T170844) (owner: 10Urbanecm) [14:06:37] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365663 (https://phabricator.wikimedia.org/T170844) (owner: 10Urbanecm) [14:06:52] zeljkof, please deploy the throttle rule directly – nothing to test here [14:07:00] Urbanecm: sure [14:07:29] !log zfilipin@tin Synchronized static/images/project-logos/enwikiquote.png: SWAT: [[gerrit:365409|Update enwikiquotes logo (T170722)]] (duration: 00m 43s) [14:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:39] T170722: Refresh logo static image for en.wikiquote.org - https://phabricator.wikimedia.org/T170722 [14:08:28] Urbanecm: 365409 deployed [14:08:31] ack [14:09:35] (03PS1) 10Ottomata: Add expiry shell dates 
for 4 stat box users [puppet] - 10https://gerrit.wikimedia.org/r/365971 (https://phabricator.wikimedia.org/T170878) [14:10:17] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:365663|New throttle rule (T170844)]] (duration: 00m 43s) [14:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:30] T170844: Request a temporary lift of the account creation cap on a specific IP for an outreach event on 2017-07-20 - https://phabricator.wikimedia.org/T170844 [14:10:34] Urbanecm: 365663 deployed [14:10:42] ack [14:11:38] reviewing 365948 [14:11:53] RECOVERY - Check systemd state on notebook1002 is OK: OK - running: The system is fully operational [14:12:05] ack [14:12:28] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365948 (https://phabricator.wikimedia.org/T170922) (owner: 10Urbanecm) [14:13:35] (03PS1) 10Gilles: Upgrade to 1.1 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/365973 (https://phabricator.wikimedia.org/T170677) [14:13:46] (03CR) 10Ottomata: [C: 032] Add expiry shell dates for 4 stat box users [puppet] - 10https://gerrit.wikimedia.org/r/365971 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata) [14:14:01] (03Merged) 10jenkins-bot: Make maiwikimedia's logo a little bit bigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365948 (https://phabricator.wikimedia.org/T170922) (owner: 10Urbanecm) [14:15:25] Urbanecm: 365948 is at mwdebug [14:15:40] zeljkof, working, please deploy [14:16:33] (03CR) 10jenkins-bot: Make maiwikimedia's logo a little bit bigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365948 (https://phabricator.wikimedia.org/T170922) (owner: 10Urbanecm) [14:16:56] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:365948|Make maiwikimedias logo a little bit bigger (T170922)]] (duration: 00m 43s) [14:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:17:08] T170922: Make logo for maiwikimedia a little bit bigger - https://phabricator.wikimedia.org/T170922 [14:17:09] Urbanecm: 365948 is deployed [14:17:32] jan_drewniak: do you want to deploy your change, or should I? [14:17:38] zeljkof, can you purge the URLs? [14:18:17] Urbanecm: sure, will do [14:18:28] zeljkof, thx [14:18:53] zeljkof: could you please? [14:19:05] jan_drewniak: sure, will do [14:19:11] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3448126 (10Urbanecm) [14:20:11] (03PS2) 10Urbanecm: Update wikiversity's logos to 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365966 (https://phabricator.wikimedia.org/T160491) [14:21:04] (03CR) 10Filippo Giunchedi: [C: 04-1] "Conceptually LGTM, though I think the change should be split between adding a new port/endpoint to logstash and switch thumbor to logstash" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) (owner: 10Gilles) [14:21:14] Urbanecm: purged, please check [14:21:27] zeljkof, seems ok, thank you! [14:21:41] (03CR) 10Alexandros Kosiaris: "lvs2004, lvs2005, lvs2006 are fine. proceeding" [puppet] - 10https://gerrit.wikimedia.org/r/356790 (owner: 10Alexandros Kosiaris) [14:21:42] jan_drewniak: reviewing 365946 [14:22:10] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.1 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/365973 (https://phabricator.wikimedia.org/T170677) (owner: 10Gilles) [14:22:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365946 (https://phabricator.wikimedia.org/T142582) (owner: 10Jdrewniak) [14:23:00] jan_drewniak: can you test 365946 at mwdebug1002? [14:23:16] or should I do a full deploy? [14:23:43] zeljkof, just a friendly reminder: You'll have to run a script. 
[14:23:52] Urbanecm: thanks [14:24:23] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [14:24:39] zeljkof: mwdebug1002 looks fine [14:24:57] jan_drewniak: wait, it's not there yet, waiting for merge :) [14:25:06] ci seems to be busy [14:25:21] I was just asking if that can be tested there, sorry, forgot to be explicit [14:25:39] zeljkof: ah, well differences are so minor I didn't even notice [14:25:48] tgr: almost there, on the last commit, waiting for ci... [14:26:00] jan_drewniak: will ping you when it's at mwdebug [14:26:14] thanks [14:27:26] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365946 (https://phabricator.wikimedia.org/T142582) (owner: 10Jdrewniak) [14:27:36] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365946 (https://phabricator.wikimedia.org/T142582) (owner: 10Jdrewniak) [14:30:15] jan_drewniak: 365946 is at mwdebug1002, please test [14:31:27] jan_drewniak: cannot find anything about sync-portals at wikitech :/ [14:31:34] how do I run the script? [14:32:05] https://www.mediawiki.org/wiki/Wikipedia.org_Portal [14:32:09] from the root of the repo, at the mediawiki-config/portals [14:32:09] zeljkof ^^ [14:34:17] jan_drewniak: I'm confused, I should not follow the standard deployment procedure? https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Full_deployment [14:34:34] but this? https://www.mediawiki.org/wiki/Wikipedia.org_Portal#Portal_Deployment [14:34:57] zeljkof, yes. There are multiple syncs and the bash script automates them ;) [14:35:14] (I have never deployed portals, as far as I can remember) [14:35:24] Urbanecm: yes, on what? 
:) [14:35:36] zeljkof: yeah, the portals are a little different [14:35:41] zeljkof, yes, you should follow https://www.mediawiki.org/wiki/Wikipedia.org_Portal#Portal_Deployment :) [14:35:42] I mean, I should do this? https://www.mediawiki.org/wiki/Wikipedia.org_Portal#Portal_Deployment [14:35:50] (I personally haven't deployed them either, though I should some time) [14:35:56] Urbanecm, jan_drewniak: ok, thanks, will do [14:36:25] I'll make better note of that in future [14:37:16] ok, so running this script is this? [14:37:18] zfilipin@tin:/srv/mediawiki-staging$ cd portals/ [14:37:24] zfilipin@tin:/srv/mediawiki-staging/portals$ sync-portals [14:37:30] zeljkof, ./sync-portals [14:37:33] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [14:37:46] ok, so this? zfilipin@tin:/srv/mediawiki-staging/portals$ ./sync-portals [14:37:48] Yes [14:37:58] !log installing apache updates on silver [14:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:46] !log zfilipin@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 44s) [14:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:31] !log zfilipin@tin Synchronized portals: (no justification provided) (duration: 00m 45s) [14:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:16] running the script, but sync-masters is taking a long time... [14:40:50] Dereckson: you are a sysadmin? can you assist with a global rename with +100k edits? 
[14:41:38] !log awight@tin Started deploy [ores/deploy@1d35aa5]: T170485 [14:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:52] T170485: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485 [14:42:04] !log awight@tin Finished deploy [ores/deploy@1d35aa5]: T170485 (duration: 00m 26s) [14:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:20] Steinsplitter: Better to create a task and tag: Operations [14:43:46] marostegui: okay, although i did a number of such renames with assistance here (hoo/legoktm/etc.) [14:44:22] zeljkof: page looks good to me [14:44:23] Steinsplitter: Sure, but if someone cannot help you on the fly, with a task it might be easier to coordinate :-) [14:44:33] jan_drewniak: great! [14:44:42] !log EU SWAT finished! [14:44:51] marostegui, BTW didn't you supervise some renames like this? ;) [14:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:54] tgr: sorry, we took almost all of your window :( [14:45:00] we are done, you can take over [14:45:06] Urbanecm: I did, but I cannot do it now, that is why I am asking for a task :) [14:45:08] thx zeljkof [14:45:17] next window is empty so I'll be ok [14:45:47] marostegui, ah, ok :) [14:46:00] (03PS2) 10Gergő Tisza: Deploy TemplateStyles to some non-content productions wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365879 (https://phabricator.wikimedia.org/T170863) [14:48:51] (03CR) 10Gergő Tisza: [C: 032] Deploy TemplateStyles to some non-content productions wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365879 (https://phabricator.wikimedia.org/T170863) (owner: 10Gergő Tisza) [14:48:53] (03CR) 10Alexandros Kosiaris: "All standbys done." 
[puppet] - 10https://gerrit.wikimedia.org/r/356790 (owner: 10Alexandros Kosiaris) [14:50:27] 10Operations: Global rename user - https://phabricator.wikimedia.org/T170941#3448274 (10Steinsplitter) [14:51:18] (03Merged) 10jenkins-bot: Deploy TemplateStyles to some non-content productions wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365879 (https://phabricator.wikimedia.org/T170863) (owner: 10Gergő Tisza) [14:51:29] (03CR) 10jenkins-bot: Deploy TemplateStyles to some non-content productions wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365879 (https://phabricator.wikimedia.org/T170863) (owner: 10Gergő Tisza) [14:51:40] 10Operations, 10ops-codfw, 10Parsoid, 10Patch-For-Review, 10Services (watching): wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3448294 (10Papaul) a:05Papaul>03RobH Test complete with no errors . {F8793669} [14:52:08] 10Operations, 10DBA: Global rename user - https://phabricator.wikimedia.org/T170941#3448297 (10Marostegui) [14:52:37] (03PS4) 10GWicke: PDF Render: Check hourly if the service is running via cron [puppet] - 10https://gerrit.wikimedia.org/r/359967 (https://phabricator.wikimedia.org/T159922) [14:53:30] 10Operations, 10DBA: Global rename user - https://phabricator.wikimedia.org/T170941#3448274 (10Marostegui) I am happy to monitor the involved DBs here. In which TZ are you? I would prefer to do it tomorrow in the morning, what about 9UTC? [14:53:40] marostegui: ok :-) [14:53:46] \o/ [14:54:09] Urbanecm, jan_drewniak updated docs https://www.mediawiki.org/w/index.php?diff=2516151&oldid=2515791&title=Wikipedia.org_Portal&type=revision [14:54:52] zeljkof: thank you! [14:55:58] !log upload and roll-upgrade thumbor to 1.1 - T170677 [14:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:11] T170677: Thumbor replies 429 on deleted files - https://phabricator.wikimedia.org/T170677 [14:56:13] marostegui: ok = i created the bug. 
i don't have time before 11:00 utc :( [14:56:28] 11 UTC works :) [14:56:37] okay :) thanks :-D [14:57:35] (03CR) 10Muehlenhoff: "What's the basis for these dates and contacts? They need to be synched with what's tracked by Legal." [puppet] - 10https://gerrit.wikimedia.org/r/365971 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata) [14:58:04] 10Operations, 10DBA: Global rename user - https://phabricator.wikimedia.org/T170941#3448322 (10Marostegui) This will be done 19th at 11UTC. [14:58:10] !log tgr@tin Started scap: T170863 deploy TemplateStyles to some non-content wikis (first step: testwiki/labstestwiki only) [14:58:18] 10Operations, 10DBA: Global rename user - https://phabricator.wikimedia.org/T170941#3448325 (10Steinsplitter) >>! In T170941#3448297, @Marostegui wrote: > I am happy to monitor the involved DBs here. > In which TZ are you? I would prefer to do it tomorrow in the morning, what about 9UTC? I am around after 11... [14:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:23] T170863: Identify and coordinate deployment of initial deployment of Template Styles with select non-content Wikis - https://phabricator.wikimedia.org/T170863 [14:58:40] edit conflict. [14:58:43] haha I was faster! :) [14:59:10] (03CR) 10Ottomata: "OO, really? didn't know that. Some of them came from responses in T170878, others were from those that said "yeah, you can expire me", b" [puppet] - 10https://gerrit.wikimedia.org/r/365971 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata) [14:59:17] (03CR) 10Ottomata: "I can revert if that is better" [puppet] - 10https://gerrit.wikimedia.org/r/365971 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata) [15:01:33] (03CR) 10Muehlenhoff: "You can keep the commit as-is, I'll doublecheck the new entries with the tracking spreadsheet." 
[puppet] - 10https://gerrit.wikimedia.org/r/365971 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata) [15:05:57] (03PS1) 10Awight: Add awight to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/365985 [15:07:53] !log tgr@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [15:07:53] !log tgr@tin scap failed: RuntimeError scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) (duration: 09m 42s) [15:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:11] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3448336 (10Dzahn) https://www.wikidata.org/wiki/Q32012187 ? [15:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:45] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3448338 (10Amire80) I'm pretty sure that the item page is not enough by itself. It must be possible to add sitelinks to `din`, and currently it's impo... [15:12:19] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3447871 (10Reedy) >>! In T170930#3448338, @Amire80 wrote: > I'm pretty sure that the item page is not enough by itself. It must be possible to add sit... 
[15:12:33] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2099745 [15:14:16] !log Stop MySQL and shutdown pc2006 for mainboard replacement - T170520 [15:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:29] T170520: pc2006 crashed - https://phabricator.wikimedia.org/T170520 [15:14:52] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3448351 (10Amire80) >>! In T170930#3448341, @Reedy wrote: >>>! In T170930#3448338, @Amire80 wrote: >> I'm pretty sure that the item page is not enough... [15:17:26] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3448360 (10Reedy) Yeah, that can go out in SWAT. No problem there [15:25:02] thcipriani: I checked the logstash errors and none seemed related [15:25:09] should I just force the scap? [15:25:26] (see errors above at :07) [15:25:51] tgr: if none of the errors seem related to the current deployment then yes. [15:26:46] !log tgr@tin Started scap: T170863 deploy TemplateStyles to some non-content wikis (first step: testwiki/labstestwiki only) (forcing; canary errors are unrelated) [15:26:47] I want to change some of that logic. I'd like to have more of a deployment bias than we currently have. I.e. push forward unless catastrophic blow up imminent. 
[15:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:00] T170863: Identify and coordinate deployment of initial deployment of Template Styles with select non-content Wikis - https://phabricator.wikimedia.org/T170863 [15:30:42] (03PS3) 10BryanDavis: Expose wbqc_constraints view on Wiki Replicas [puppet] - 10https://gerrit.wikimedia.org/r/365969 (https://phabricator.wikimedia.org/T170927) (owner: 10Lucas Werkmeister (WMDE)) [15:34:42] thcipriani: wasn't fully unrelated after all, it seems like one of the canaries failed to sync all files? [15:34:45] https://logstash.wikimedia.org/goto/f4f548e93f0946776aae5dcae3abf54f [15:35:04] unhelpfully, the logstash link in the scap error message does not contain all canaries [15:36:53] blerg [15:36:54] * thcipriani fixes [15:37:06] !log tgr@tin Finished scap: T170863 deploy TemplateStyles to some non-content wikis (first step: testwiki/labstestwiki only) (forcing; canary errors are unrelated) (duration: 10m 19s) [15:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:18] T170863: Identify and coordinate deployment of initial deployment of Template Styles with select non-content Wikis - https://phabricator.wikimedia.org/T170863 [15:37:28] Is this a known error? > Failed to apply catalog: Found 1 dependency cycle: [15:37:31] (Exec[recommendation_api config deploy] => Service::Node::Config::Scap3[recommendation_api] => Scap::Target[recommendation-api/deploy] => User[deploy-service] => Exec[recommendation_api config deploy]) [15:40:37] (03PS1) 10Elukey: eventlogging_cleaner: force a cast to char for the uuid field [puppet] - 10https://gerrit.wikimedia.org/r/365992 [15:42:03] (03PS2) 10Filippo Giunchedi: prometheus: move external_url to class parameter [puppet] - 10https://gerrit.wikimedia.org/r/365266 [15:42:36] awight: I saw that problem on beta recently. 
I poked mobrovac about it but I didn't have time to look into it at that moment (same deal different service): http://tyler.zone/changeprop-cycle.png [15:45:04] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: move external_url to class parameter [puppet] - 10https://gerrit.wikimedia.org/r/365266 (owner: 10Filippo Giunchedi) [15:45:33] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2023_v4, cp2023_v6 [15:45:33] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2023_v4, cp2023_v6 [15:45:33] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: T170863 deploy TemplateStyles to some non-content wikis (all target wikis) (duration: 00m 45s) [15:45:34] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2023_v4, cp2023_v6 [15:45:34] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2023_v6 [15:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:46] T170863: Identify and coordinate deployment of initial deployment of Template Styles with select non-content Wikis - https://phabricator.wikimedia.org/T170863 [15:46:03] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:33] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [15:46:33] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [15:46:33] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [15:46:43] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 44 ESP OK [15:46:43] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/update-motd.d/97-last-puppet-run] [15:46:54] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssl/dhparam.pem] [15:48:25] thcipriani: k thanks for the note [15:48:44] & amusing graph [15:49:33] (03PS1) 10Thcipriani: Scap: new canary dashboard [puppet] - 10https://gerrit.wikimedia.org/r/365995 [15:49:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [15:50:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:50:23] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:50:33] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3448498 (10Papaul) switches stacking cabling complete. [15:51:34] (03CR) 10Alexandros Kosiaris: "All LVS servers done" [puppet] - 10https://gerrit.wikimedia.org/r/356790 (owner: 10Alexandros Kosiaris) [15:54:46] 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3448501 (10RobH) a:03RobH [15:55:41] (03PS2) 10Alexandros Kosiaris: Add awight to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/365985 (owner: 10Awight) [15:57:07] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Not a sudo request and already discussed in the ops meeting yesterday and approved, I am merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/365985 (owner: 10Awight) [15:57:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:57:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:58:13] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [15:58:13] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [15:58:14] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4 not-conn: cp2017_v6 [15:58:23] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v6 not-conn: cp2017_v4 [15:58:23] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:58:33] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [15:58:33] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4 not-conn: cp2017_v6 [15:58:43] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [15:58:43] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v6 not-conn: cp2017_v4 [15:58:43] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [15:58:51] looking ^ [15:58:53] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [15:58:53] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [15:58:53] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v6 not-conn: cp2017_v4 [15:58:54] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [15:58:54] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - 
ok: 52 not-conn: cp2017_v4, cp2017_v6 [15:59:03] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [15:59:03] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [15:59:03] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [15:59:03] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [15:59:03] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [15:59:13] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [15:59:13] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v6 not-conn: cp2017_v4 [15:59:14] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [15:59:14] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4 not-conn: cp2017_v6 [15:59:33] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [15:59:43] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100% [15:59:46] !log power-cycle cp2017, stuck rebooting [15:59:53] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [15:59:54] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [15:59:54] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [15:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:03] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. 
Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170718T1600). [16:00:04] Dereckson: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:22] !log Deploy alter table on s1 - labsdb1003 - T166204 [16:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:34] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [16:00:37] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [16:00:44] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2017_v4, cp2017_v6 [16:01:53] RECOVERY - Host cp2017 is UP: PING WARNING - Packet loss = 66%, RTA = 36.20 ms [16:01:53] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [16:01:54] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [16:01:54] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [16:01:54] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [16:02:03] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [16:02:03] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [16:02:03] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [16:02:03] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [16:02:03] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [16:02:03] (03CR) 10Volans: [C: 031] "LGTM, please ensure that all new tables are created properly and optionally track those that needs to be fixed for an eventual fix after t" [puppet] - 10https://gerrit.wikimedia.org/r/365992 (owner: 10Elukey) [16:02:04] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [16:02:04] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [16:02:05] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [16:02:05] RECOVERY - IPsec on cp1071 is OK: 
Strongswan OK - 58 ESP OK [16:02:06] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [16:02:13] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [16:02:13] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [16:02:23] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [16:02:23] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [16:02:23] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [16:02:23] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 54 ESP OK [16:02:23] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [16:02:33] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK [16:02:34] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [16:02:34] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [16:02:43] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 58 ESP OK [16:02:43] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK [16:02:43] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [16:02:44] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [16:02:53] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 58 ESP OK [16:03:23] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [16:07:58] (03PS1) 10Andrew Bogott: nova: add labvirt1014 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/365998 (https://phabricator.wikimedia.org/T170492) [16:08:33] (03CR) 10Mobrovac: [C: 031] "I think this is a good first step towards narrowing down access to ZK, so I think we should still go ahead with this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815) (owner: 10Muehlenhoff) [16:13:48] (03CR) 10Chad: [V: 032 C: 032] Add .DS_Store to .gitignore [software/gerrit] - 10https://gerrit.wikimedia.org/r/365936 (owner: 10Paladox) [16:14:05] (03CR) 10Paladox: "thanks :)" [software/gerrit] - 10https://gerrit.wikimedia.org/r/365936 (owner: 10Paladox) [16:14:13] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:14:23] PROBLEM - Host pc2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:14:42] (03PS29) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [16:14:55] (03PS1) 10Gergő Tisza: Fix labtestwiki typos in InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366000 [16:15:23] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:15:24] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:16:17] pc2006 is expected, mainboard will be replaced [16:17:27] (03CR) 10BryanDavis: [C: 031] Fix labtestwiki typos in InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366000 (owner: 10Gergő Tisza) [16:17:50] (03CR) 10Andrew Bogott: [C: 031] Fix labtestwiki typos in InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366000 (owner: 10Gergő Tisza) [16:17:52] (03CR) 10Chad: [C: 031] "Yes, let's do this. We should *also* fix the entry in LDAP so it's consistent with prod. 
System users do not need to be in /home" [puppet] - 10https://gerrit.wikimedia.org/r/365891 (https://phabricator.wikimedia.org/T166013) (owner: 1020after4) [16:19:14] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3448571 (10RobH) [16:19:28] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (10RobH) [16:19:29] 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3448576 (10RobH) [16:19:35] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (10RobH) [16:19:37] 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp4011, cp4012, cp4019, cp4020 - https://phabricator.wikimedia.org/T167377#3448579 (10RobH) [16:24:35] i realize puppet swat started 20 minutes ago, but could i get a patch in? [16:27:54] (03PS1) 10Chad: group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366001 [16:28:45] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests, 10Patch-For-Review: Decommission RCStream - https://phabricator.wikimedia.org/T170157#3448619 (10RobH) I already synced up with @ottomata about this via IRC, and I'll snag from here. There is a checklist for decoms, which I'll ap... [16:28:51] (03CR) 10Alexandros Kosiaris: [C: 032] Scap: new canary dashboard [puppet] - 10https://gerrit.wikimedia.org/r/365995 (owner: 10Thcipriani) [16:28:53] (03CR) 10Chad: [C: 04-2] "For later." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/366001 (owner: 10Chad) [16:29:00] (03PS2) 10Alexandros Kosiaris: Scap: new canary dashboard [puppet] - 10https://gerrit.wikimedia.org/r/365995 (owner: 10Thcipriani) [16:30:05] (03PS1) 10Ottomata: Remove ensure param from base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/366002 [16:31:34] !log finish rollout of thumbor 1.1 in eqiad - T170677 [16:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:46] T170677: Thumbor replies 429 on deleted files - https://phabricator.wikimedia.org/T170677 [16:33:44] (03PS7) 10Giuseppe Lavagetto: Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [16:35:30] (03PS4) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [16:36:53] (03CR) 10jerkins-bot: [V: 04-1] systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 (owner: 10Giuseppe Lavagetto) [16:37:05] <_joe_> heh, of course [16:38:21] (03CR) 10Giuseppe Lavagetto: [C: 032] Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [16:38:22] (03CR) 10Alexandros Kosiaris: [C: 031] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/366002 (owner: 10Ottomata) [16:40:21] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.7 [keeping static files] (duration: 06m 06s) [16:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:42] (03PS2) 10Ottomata: Remove ensure param from base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/366002 [16:41:59] (03CR) 10Ottomata: [C: 032] "Checked a few out here: https://puppet-compiler.wmflabs.org/compiler02/7094/" [puppet] - 10https://gerrit.wikimedia.org/r/366002 (owner: 10Ottomata) [16:42:05] (03CR) 10Ottomata: [V: 032 C: 032] Remove ensure param from base::firewall [puppet] - 
10https://gerrit.wikimedia.org/r/366002 (owner: 10Ottomata) [16:42:38] (03PS5) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [16:42:53] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[relforge/mjolnir] [16:44:33] !log oblivian@tin Started deploy [search/MjoLniR@0140aed]: (no justification provided) [16:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:03] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:45:17] !log oblivian@tin Started deploy [search/MjoLniR@0140aed]: init [16:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:03] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:48:39] (03PS2) 10Jforrester: Add Welsh mobile logo (just changes 'k' to 'c'). [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 (owner: 10Esanders) [16:48:47] (03CR) 10Jforrester: [C: 031] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 (owner: 10Esanders) [16:51:53] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[relforge/mjolnir] [16:53:48] !log demon@tin Started scap: testwiki to wmf.10 + l10n cache build [16:53:56] (03PS2) 10Elukey: eventlogging_cleaner: force a cast to char for the uuid field [puppet] - 10https://gerrit.wikimedia.org/r/365992 (https://phabricator.wikimedia.org/T170952) [16:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:09] (03PS3) 10Elukey: eventlogging_cleaner: force a cast to char for the uuid field [puppet] - 10https://gerrit.wikimedia.org/r/365992 (https://phabricator.wikimedia.org/T170952) [16:57:10] (03CR) 10jerkins-bot: [V: 04-1] eventlogging_cleaner: force a cast to char for the uuid field [puppet] - 10https://gerrit.wikimedia.org/r/365992 (https://phabricator.wikimedia.org/T170952) (owner: 10Elukey) [16:57:23] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[relforge/mjolnir] [16:57:40] (03CR) 10Jforrester: [C: 031] "Oh dear." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366000 (owner: 10Gergő Tisza) [16:58:53] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:59:09] ahahaha the commit msg again [16:59:21] 10Operations, 10ops-eqiad, 10Analytics: Smartctl errors for one kafka1012 disk - https://phabricator.wikimedia.org/T168927#3448734 (10Cmjohnson) @elukey, I have plenty of disks on-site...just let me know which slot number. [16:59:35] Line 11: Expected 'Bug:' to come before Change-Id on line 10 [16:59:40] * elukey cries in a corner [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170718T1700). 
[17:00:13] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:00:17] no parsoid deploy today [17:00:27] (03PS4) 10Elukey: eventlogging_cleaner: force a cast to char for the uuid field [puppet] - 10https://gerrit.wikimedia.org/r/365992 (https://phabricator.wikimedia.org/T170952) [17:01:13] PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:54] PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:59] cmjohnson1: yt? [17:02:09] i can coordinate that kafka disk replacement [17:02:56] Silly question: If I have a config change, does it have to go out as part of a SWAT or can it go out with the train if we just +2 it? [17:03:05] Link? [17:03:21] (usually I'd say swat over with train, safer) [17:03:24] https://gerrit.wikimedia.org/r/#/c/365884/ [17:04:06] Um, that definitely needs SWAT. If not its own window. It's moving from testwiki to an actual production wiki. I also see no task linked in the commit summary [17:04:09] (03CR) 10Elukey: [C: 032] eventlogging_cleaner: force a cast to char for the uuid field [puppet] - 10https://gerrit.wikimedia.org/r/365992 (https://phabricator.wikimedia.org/T170952) (owner: 10Elukey) [17:04:22] ACKNOWLEDGEMENT - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Giuseppe Lavagetto mjolnir service is failing, ebhernardson is investigating [17:04:23] ACKNOWLEDGEMENT - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Giuseppe Lavagetto mjolnir service is failing, ebhernardson is investigating [17:04:38] tgr: I'll deploy your typofix for labtest in a bit [17:05:03] thx RainbowSprinkles [17:05:10] RainbowSprinkles: It's not moving from labs cluster. It doesn't need a full scan, does it? [17:05:15] scap* [17:07:07] No, it doesn't need a full scap. It's that you're moving an extension to being *actually* deployed on a prod wiki. New extension rollouts typically take their own window [17:07:33] (I don't consider testwiki having "rolled it out" to production enough...just not enough traffic) [17:07:59] ottomata: here [17:08:24] Ah. It's not something that would be affected by traffic, per se. But we're still sending it out to a relatively tiny wiki just to make sure everything's fine. [17:09:00] cmjohnson1: ok [17:09:17] Niharika: Not traffic as in load, but traffic as in "enough people will actually be using this code" [17:09:38] i can stop kafka on kafka1012 [17:09:41] Anyway, you can stick it in SWAT I guess. But definitely not riding the train [17:09:44] and then write data to sdh [17:09:46] to help you find the disk [17:09:47] ya? [17:09:49] (I've got enough to worry about on a new branch day) [17:10:01] It's a beta feature. The number of people using it will be quite low. :) [17:10:13] ottomata: do you need to stop it or can we isolate the failed disk [17:10:31] Fair enough. :) [17:10:39] cmjohnson1: if we unmount the disk, kafka will die [17:10:48] sooo, ya better to stop [17:10:52] its fine, the other brokers will take over [17:11:05] ok [17:12:54] o/ I'm late for the deploy window. [17:13:09] We're looking to deploy ORES. [17:13:13] cmjohnson1: let me know when you are ready to stare at blinking lights [17:13:35] Anyone else deploying on scb nodes? 
[17:13:44] (03CR) 10Gilles: [C: 031] Change $deploy_user home directory to /var/lib/${deploy_user} [puppet] - 10https://gerrit.wikimedia.org/r/365891 (https://phabricator.wikimedia.org/T166013) (owner: 1020after4) [17:13:48] ready [17:14:16] (03CR) 10Gilles: [C: 031] Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [17:15:10] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests, 10Patch-For-Review: Decommission RCStream (rcs100[12]) - https://phabricator.wikimedia.org/T170157#3448838 (10RobH) a:05Ottomata>03RobH [17:15:21] k one sec cmjohnson1 gotta set downtime... [17:15:50] OK looks like we're going to start ORES deployment then. [17:16:12] 10Operations, 10ops-codfw, 10Parsoid, 10Patch-For-Review, and 2 others: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3448844 (10mobrovac) a:05RobH>03mobrovac I'll deploy and repool the node now. [17:16:32] !log stopping kafka broker on kafka1012 to replace disk T168927 [17:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:45] T168927: Smartctl errors for one kafka1012 disk - https://phabricator.wikimedia.org/T168927 [17:18:12] !log demon@tin Finished scap: testwiki to wmf.10 + l10n cache build (duration: 24m 23s) [17:18:22] ok cmjohnson1 writing data to sdh [17:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:43] ottomata: okay [17:18:53] i'm going to pull it...confirm it's the right one [17:18:55] lemme know when you'ved IDed it [17:19:11] can you check? [17:19:19] ls: reading directory h: Input/output error [17:19:21] looks good ! 
[17:19:50] new disk is in....it may have a foreign cfg [17:19:55] to clear first [17:20:10] !log awight@tin Started deploy [ores/deploy@1d35aa5]: T170485 [17:20:19] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:24] T170485: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485 [17:20:37] !log mobrovac@tin Started deploy [parsoid/deploy@1eaa07e]: Bring wtp2019 up to date and repool it - T146113 [17:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:51] T146113: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113 [17:21:26] ok cmjohnson1 doin the stuff, stay tuned... [17:21:39] !log mobrovac@tin Finished deploy [parsoid/deploy@1eaa07e]: Bring wtp2019 up to date and repool it - T146113 (duration: 01m 02s) [17:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:32] 10Operations, 10ops-codfw, 10Parsoid, 10Services (done), 10User-mobrovac: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3448900 (10mobrovac) 05Open>03Resolved `wtp2019` is now up to date with the latest code and is back in the pool. Resolving. [17:24:59] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:25:42] (03CR) 10Nirzar: [C: 031] Add Welsh mobile logo (just changes 'k' to 'c'). [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 (owner: 10Esanders) [17:25:46] ORE canary looks good. 
[17:25:53] *ORES [17:30:42] (03CR) 10Mforns: eventlogging_cleaner: force a cast to char for the uuid field (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/365992 (https://phabricator.wikimedia.org/T170952) (owner: 10Elukey) [17:31:20] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [17:31:45] (03PS3) 10Dzahn: smokeping: switch backend from netmon1002 to netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/365892 (https://phabricator.wikimedia.org/T166180) [17:33:00] cmjohnson1: looks good [17:33:06] great [17:33:07] kafka is back up and replicating from other brokers [17:33:20] thanks to you and to elukey's great documentation :) [17:34:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Smartctl errors for one kafka1012 disk - https://phabricator.wikimedia.org/T168927#3448934 (10Ottomata) [17:35:05] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Smartctl errors for one kafka1012 disk - https://phabricator.wikimedia.org/T168927#3381297 (10Ottomata) disk replaced as spare. Mounted as /var/spool/kafka/h with UUID=247e0397-066b-4b5c-b6c3-cacd1ecf8cdd. Kafka is back up and is replicating missing data from... [17:35:37] (03CR) 10Chad: [C: 032] Fix labtestwiki typos in InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366000 (owner: 10Gergő Tisza) [17:36:42] awight, what's scap's status for ORES deploy? 
[17:37:07] (03Merged) 10jenkins-bot: Fix labtestwiki typos in InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366000 (owner: 10Gergő Tisza) [17:37:09] (03Abandoned) 10Ottomata: Check that default input policy is ACCEPT if base::firewall ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/365120 (owner: 10Ottomata) [17:37:16] (03CR) 10jenkins-bot: Fix labtestwiki typos in InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366000 (owner: 10Gergő Tisza) [17:38:10] RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational [17:38:20] 10Operations, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3448958 (10Ottomata) [17:38:32] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3264224 (10Ottomata) [17:38:51] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3448961 (10Ottomata) [17:39:03] halfak: fetching at 44% [17:39:08] cool [17:39:09] (03CR) 10Dzahn: [C: 032] smokeping: switch backend from netmon1002 to netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/365892 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn) [17:40:30] RECOVERY - Check systemd state on relforge1002 is OK: OK - running: The system is fully operational [17:41:10] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] [17:41:16] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: labtest typofix for tgr (duration: 00m 46s) [17:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:17] 10Operations, 10Discovery-Analysis: Upgrade 
pandoc package to at least 1.12.3 - https://phabricator.wikimedia.org/T168683#3449000 (10Ottomata) 05Open>03Resolved a:03Ottomata pandoc from Debian Stretch is 1.17.2~dfsg-3. [17:47:34] Kafka Broker Under Replicated is expected and will resolve itself once kafka1012 is caught back up [17:47:44] !log smokeping - switched to netmon2001 - ping times to codfw hosts went down - ping times to eqiad hosts went up - since service is on both but data has been synced over [17:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:19] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] [17:52:42] tgr: Oh, I sync'd you up there ^ [17:52:44] (forgot to mention) [17:52:52] cmjohnson1: ......... [17:52:56] we may have replaced the wrong disk? [17:52:58] ???? [17:53:47] okay [17:54:02] i'm not sure how...but its possible that i conflated the disk letter with the wrong partition mount [17:54:05] (03PS1) 10Dzahn: rancid: switch active server to netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/366012 (https://phabricator.wikimedia.org/T166180) [17:54:23] smart still detects the same # of defects on a disk [17:54:37] (03PS2) 10Dzahn: rancid: switch active server to netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/366012 (https://phabricator.wikimedia.org/T166180) [17:55:16] !log awight@tin Finished deploy [ores/deploy@1d35aa5]: T170485 (duration: 35m 06s) [17:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:30] ACKNOWLEDGEMENT - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] ottomata resyncing [17:55:30] T170485: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485 [17:56:19] cmjohnson1: hm. what should we do? try again? re-insert the disk you took out? 
[17:57:17] I'm trying to add a group email address to icinga notifications, but failing to understand the step between a contactgroup's members and how we actually contact that person. [17:57:21] "person". [17:58:06] I can [17:58:14] (03CR) 10Dzahn: [C: 032] rancid: switch active server to netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/366012 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn) [17:58:20] cmjohnson1: ok [17:58:31] ottomata: I am not sure that was the only one blinking...are you sure there is not a foreign cfg on it [17:58:41] i think where they cam from the raid was still set up [17:58:49] awight: the contactgroups are in public repo but the contacts that are members in the groups are in the private repo [17:58:51] cmjohnson1: there was, i had to clear it [17:58:58] on the new one you inserted you mean? [17:59:01] yes [17:59:13] i did the stuff listed here [17:59:14] https://wikitech.wikimedia.org/wiki/Kafka/Administration#Swapping_broken_disk [17:59:28] awight: because we dont want to make phone numbers public. i can add a contact for you that you can then use in groups [17:59:29] mutante: aha, thanks. So I'll prepare a patch on the public repo, then mention that I'll need a private repo change made on my behalf? [17:59:40] awight: yea, sounds good. you can add me to that [17:59:55] cmjohnson1: i think it was probably my fault [17:59:58] rad, ty. I'll PM you the address cos it's an internal list [17:59:59] i may have written data to the wrong disk [18:00:08] awight: eventually i will get to make these public and just hide the phone numbers themselvs like passwords [18:00:43] awight: heh, re: internal list :) [18:00:50] ok, do you need me to do anything with disks right now or just standby? [18:01:26] cmjohnson1: yeah, i'm just not sure if we shoudl reinsert the same disk [18:01:29] i suppose so right? 
[18:01:33] there likely was nothing wrong with it [18:01:50] i'll do the same stuff i did before: stop kafka, write data to the disk [18:01:51] you can swap [18:01:58] yeah..it would just have the same errors...okay..let me know when it's safe [18:02:24] !log stopping kafka on kafka1012 again, i think we swapped the wrong disk T168927 [18:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:37] T168927: Smartctl errors for one kafka1012 disk - https://phabricator.wikimedia.org/T168927 [18:02:52] (03PS1) 10Awight: Change scoring team's name for alerts; point to group member [puppet] - 10https://gerrit.wikimedia.org/r/366016 [18:03:26] ok cmjohnson1 ready, you watching lights? [18:03:40] wait, oh cmjohnson1 ok [18:03:45] yes [18:03:47] what are you going to do? [18:04:05] (03PS2) 10Awight: Change scoring team's name for alerts; point to group member [puppet] - 10https://gerrit.wikimedia.org/r/366016 [18:04:17] (i'm writing data to the disk with errors again now (i hope) [18:04:18] ) [18:04:22] so now its confusing [18:04:24] haha [18:04:24] yeah [18:04:25] ok [18:04:28] let's do this: [18:04:37] put the disk you removed back in the slot where it came from [18:04:45] and then, find the disk that i'm currently writing to [18:05:14] and replace it with a spare (the one you already selected is fine) [18:05:15] i found the disk you're writing to [18:05:17] ok [18:05:35] going to do the above...ready? [18:05:39] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:05:41] one sec, lemme umount [18:05:43] will be better [18:06:19] RECOVERY - Host pc2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [18:06:43] hmm, ^^^ i thought we had a downtime scheduled, must be a different service check [18:07:43] ok cmjohnson1 ready [18:07:47] yep [18:07:52] i umounted both of those drives [18:08:21] proceed with swap dance [18:08:34] done [18:08:46] (03PS1) 10EBernhardson: Update PYTHONPATH for mjolnir-kafka-daemon [puppet] - 10https://gerrit.wikimedia.org/r/366017 [18:08:55] !log demon@tin Started scap: mobilefrontend wmf.9 + forced l10n rebuild [18:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:41] ok... looking [18:12:06] cmjohnson1: am pretty confused now. did one of my partitions get automounted when you inserted? [18:12:10] i umounted two partitions [18:12:16] you swapped around 2 disks right? [18:12:26] i swapped the two disks [18:12:47] so, the one i most recently lit up -> out [18:12:52] put the old back in the old slot (6) and moved that disk to slot 5 that was blinking [18:13:00] great [18:13:05] so the one that was in slot 5 is out [18:13:05] ok [18:13:10] correct [18:13:14] ok [18:16:31] (03CR) 10Dzahn: [C: 031] "contact for team scoring created in private repo... merging in a little while" [puppet] - 10https://gerrit.wikimedia.org/r/366016 (owner: 10Awight) [18:18:52] ok cmjohnson1 we got it right this time [18:18:57] smart defects are 0 on all disks [18:18:59] thanks [18:20:28] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Smartctl errors for one kafka1012 disk - https://phabricator.wikimedia.org/T168927#3449173 (10Ottomata) Ah, we accidentally swapped the wrong disk. My fault. We put the good one back in, took the defected one out, and put the spare back in the other slot. So /... [18:21:49] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [18:22:34] ottomata: awesome news! 
[18:22:52] thanks, apologies for making you do an extra swap dance :) [18:23:09] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] [18:23:40] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [10.0] [18:24:00] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [10.0] [18:24:34] obscure? [18:24:53] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=mw2202.codfw.wmnet [18:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:29] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [10.0] ottomata resyncing after T168927 [18:25:29] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [10.0] ottomata resyncing after T168927 [18:25:29] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [10.0] ottomata resyncing after T168927 [18:25:29] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [10.0] ottomata resyncing after T168927 [18:25:29] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [10.0] ottomata resyncing after T168927 [18:26:02] !log mw2202 - remove /etc/udev/rules.d/70-persistent-net.rules for mainboard replacement - to detect new NICs with new MACs (T170307) [18:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:16] T170307: mw2201, mw2202 - contact Dell and replace main board - 
https://phabricator.wikimedia.org/T170307 [18:26:37] 10Operations, 10Discovery-Analysis: Upgrade pandoc package to at least 1.12.3 - https://phabricator.wikimedia.org/T168683#3449203 (10chelsyx) Thank you @Ottomata ! [18:27:06] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3449209 (10Papaul) a:05Papaul>03Marostegui Main board replacement complete. Server is back up. Please see below for return information on bad main board for reference. {F8794366} [18:27:50] jouncebot: next [18:27:50] In 0 hour(s) and 32 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170718T1900) [18:28:49] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3449232 (10Papaul) Return information for bad main board on mw2201 {F8794370} [18:29:49] !log demon@tin Finished scap: mobilefrontend wmf.9 + forced l10n rebuild (duration: 20m 53s) [18:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:11] (03PS1) 10Dzahn: mw2202: remove from conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/366024 (https://phabricator.wikimedia.org/T170307) [18:33:17] (03PS3) 10Esanders: Add Welsh mobile logo (just changes 'k' to 'c'). 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 [18:33:51] (03PS2) 10Dzahn: mw2202: remove from conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/366024 (https://phabricator.wikimedia.org/T170307) [18:35:09] (03CR) 10Dzahn: [C: 032] mw2202: remove from conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/366024 (https://phabricator.wikimedia.org/T170307) (owner: 10Dzahn) [18:36:39] !log mw2202 - scheduled downtime - mainboard replacement [18:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:39] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [18:39:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 50.00% above the threshold [1.0] [18:39:10] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] [18:39:49] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [18:42:41] !log netmon1002 - reinstall OS - didn't use the right partman recipe - didn't have md0 - revoke old puppet cert , salt-key, scheduled downtime, services over at netmon2001 [18:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:54] 10Operations, 10monitoring, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3449308 (10Dzahn) 11:42 < mutante> !log netmon1002 - reinstall OS - didn't use the right partman recipe - didn't have md0 - revoke old puppet cert , salt-key, scheduled downtime, servic... 
[18:49:50] (03PS2) 10Andrew Bogott: nova: add labvirt1014 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/365998 (https://phabricator.wikimedia.org/T170492) [18:51:23] !log demon@tin Synchronized php-1.30.0-wmf.9/extensions/MobileFrontend/extension.json: One last thing (duration: 02m 55s) [18:51:29] (03CR) 10Andrew Bogott: [C: 032] nova: add labvirt1014 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/365998 (https://phabricator.wikimedia.org/T170492) (owner: 10Andrew Bogott) [18:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:54] (03PS1) 10Andrew Bogott: keystone: add a secondary ldap host [puppet] - 10https://gerrit.wikimedia.org/r/366025 [18:58:05] mutante: Thx for mw2202 fix, was just about to report that (scap complained) [18:59:41] !log demon@tin Synchronized php-1.30.0-wmf.9/extensions/MobileFrontend/extension.json: One (more) last thing (duration: 02m 49s) [18:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170718T1900). [19:00:04] RainbowSprinkles: looks good :) [19:00:18] RainbowSprinkles: i removed it from conftool data thinking that would stop scap from complaining .. [19:00:38] Eh, it takes a bit cuz puppet has to regenerate the rsync list, right? [19:00:41] Or something [19:00:54] jdlrobson: Yay! So....we're done! 
[19:00:58] \o/ [19:01:44] (03CR) 10Chad: [C: 032] group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366001 (owner: 10Chad) [19:05:52] (03Merged) 10jenkins-bot: group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366001 (owner: 10Chad) [19:06:11] (03CR) 10jenkins-bot: group0 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366001 (owner: 10Chad) [19:15:59] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.10 [19:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:50] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3449627 (10Papaul) Main board replacement complete on mw2202. System is back up. See below for return information for bad main board. {F8794552} [19:33:54] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3449631 (10Papaul) a:05Papaul>03Dzahn [19:33:58] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2050682 [19:34:52] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3449633 (10Dzahn) [19:35:07] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3395803 (10Dzahn) mw2202 is fixed as well. 
[19:35:36] (03PS1) 10Dzahn: Revert "mw2202: remove from conftool-data" [puppet] - 10https://gerrit.wikimedia.org/r/366031 [19:35:50] (03PS2) 10Dzahn: Revert "mw2202: remove from conftool-data" [puppet] - 10https://gerrit.wikimedia.org/r/366031 [19:36:57] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 (owner: 10Esanders) [19:40:26] (03CR) 10Dzahn: [C: 032] Revert "mw2202: remove from conftool-data" [puppet] - 10https://gerrit.wikimedia.org/r/366031 (owner: 10Dzahn) [19:41:06] RainbowSprinkles: it would be cool if you could deploy to just mw2202 to get it back in sync [19:41:15] Uno momento [19:43:01] Pulled, but CDB may need refresh [19:43:08] thank you [19:47:51] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3449725 (10Dzahn) looks good and came back just fine without having to reinstall, deleting /etc/udev/rules.d/70-persistent-net.rules did the trick, thanks! re-addi... [19:48:31] !log starting wipe on cp400[1-4] per T169020 [19:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:42] T169020: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020 [19:50:11] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: reports.frdev.wm.o -- still in use? - https://phabricator.wikimedia.org/T170640#3449738 (10Ejegg) @cwdent / @Jgreen : Is the vhost actually pointing to anything, or is this totally obsolete? 
[19:51:09] 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3449741 (10RobH) [19:53:46] 10Operations, 10Citoid, 10VisualEditor, 10Services (done), 10User-Ryasmeen: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#3449746 (10mobrovac) [19:53:55] (03PS1) 10RobH: decom cp400[1-4] [dns] - 10https://gerrit.wikimedia.org/r/366033 (https://phabricator.wikimedia.org/T169020) [19:54:28] (03PS2) 10RobH: decom cp400[1-4] [dns] - 10https://gerrit.wikimedia.org/r/366033 (https://phabricator.wikimedia.org/T169020) [19:54:47] mutante: Actually, no cdb issues. All up to date [19:54:54] (sorry for lag, got confused in reading something) [19:54:54] (03CR) 10RobH: [C: 032] decom cp400[1-4] [dns] - 10https://gerrit.wikimedia.org/r/366033 (https://phabricator.wikimedia.org/T169020) (owner: 10RobH) [19:55:18] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: reports.frdev.wm.o -- still in use? - https://phabricator.wikimedia.org/T170640#3449754 (10Dzahn) reports.frdev.wikimedia.org is a CNAME for frdev-eqiad.wikimedia.org frdev-eqiad.wikimedia.org has address 208.80.155.13 This shares an IP with the... [19:55:56] RainbowSprinkles: :) perfect, repooling [19:56:28] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2202.codfw.wmnet [19:56:38] 10Operations, 10ops-ulsfo, 10hardware-requests, 10Patch-For-Review: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3449758 (10RobH) [19:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:42] 10Operations, 10ops-ulsfo, 10hardware-requests, 10Patch-For-Review: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3384648 (10RobH) robh@asw-ulsfo> show interfaces descriptions | grep cp4001 xe-2/0/0 down down cp4001.ulsfo.wmnet {master:2} robh@asw-ulsfo> show interfaces desc... 
[19:57:27] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3449763 (10Dzahn) [19:57:28] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3449761 (10Dzahn) 05Open>03Resolved repooled mw2202 - this should resolve this ticket - thanks @Papaul [19:57:39] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3449764 (10Dzahn) a:05Dzahn>03Papaul [19:58:01] (03PS1) 10Amire80: Add cookiesandcodeblog to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/366036 [19:58:53] (03PS1) 10RobH: decom of cp400[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/366037 (https://phabricator.wikimedia.org/T169020) [20:00:47] (03CR) 10RobH: [C: 032] decom of cp400[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/366037 (https://phabricator.wikimedia.org/T169020) (owner: 10RobH) [20:02:52] 10Operations, 10ops-ulsfo, 10hardware-requests, 10Patch-For-Review: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3449780 (10RobH) [20:03:34] 10Operations, 10ops-ulsfo, 10hardware-requests, 10Patch-For-Review: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3384648 (10RobH) Setting cp400[1-4] to wipe via usb boot. Since there are two 250GB disks, the wipe will take overnight. I'll come back down later this week or next to p...
[20:08:10] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3449802 (10Luke081515) [20:08:18] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3085886 (10Luke081515) (per T159996#3106831) [20:09:28] PROBLEM - HHVM jobrunner on mw1165 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [20:10:28] RECOVERY - HHVM jobrunner on mw1165 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [20:11:13] 10Operations, 10MediaWiki-Containers, 10Release-Engineering-Team, 10Epic, and 3 others: FY2017/18 Program 6 - Outcome 2 - Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3449809 (10GWicke) a:03mobrovac [20:12:35] 10Operations, 10Cassandra, 10RESTBase-Cassandra, 10Services (blocked): puppetize turning off reserved space for cassandra /srv - https://phabricator.wikimedia.org/T132632#3449811 (10GWicke) [20:13:58] 10Operations, 10Cassandra, 10Patch-For-Review, 10Services (blocked), 10User-mobrovac: Setup automated topk wide row reporting - https://phabricator.wikimedia.org/T147366#3449814 (10GWicke) [20:17:43] (03CR) 10Dzahn: [C: 032] Change scoring team's name for alerts; point to group member [puppet] - 10https://gerrit.wikimedia.org/r/366016 (owner: 10Awight) [20:18:04] (03PS3) 10Dzahn: Change scoring team's name for alerts; point to group member [puppet] - 10https://gerrit.wikimedia.org/r/366016 (owner: 10Awight) [20:22:54] (03CR) 10Seb35: [C: 031] "I support this change. See my comment on the task https://phabricator.wikimedia.org/T168467#3449840." [puppet] - 10https://gerrit.wikimedia.org/r/361685 (https://phabricator.wikimedia.org/T168467) (owner: 10Herron) [20:23:59] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: reports.frdev.wm.o -- still in use? 
- https://phabricator.wikimedia.org/T170640#3449856 (10cwdent) @Ejegg - there are functioning sites at /reports and /webfiledrop, but no real idea if they are still in use or not [20:24:01] 10Operations, 10ops-ulsfo, 10Patch-For-Review: decommission backup4001 - https://phabricator.wikimedia.org/T161904#3146871 (10RobH) just started disk wipes on both sata disks, will leave overnight and check back on them later. [20:24:52] mutante: Thanks, looking forward to the alerts in my inbox. [20:25:30] (03PS4) 10Dzahn: icinga/ores: Change scoring team's name for alerts; point to group member [puppet] - 10https://gerrit.wikimedia.org/r/366016 (owner: 10Awight) [20:25:43] (03CR) 10Dzahn: [V: 032 C: 032] icinga/ores: Change scoring team's name for alerts; point to group member [puppet] - 10https://gerrit.wikimedia.org/r/366016 (owner: 10Awight) [20:33:51] (03PS1) 10Mforns: Add MediaWikiInstallPingback to EL purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/366049 (https://phabricator.wikimedia.org/T170986) [20:34:33] 10Operations, 10Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841#2672637 (10Seb35) Instead of reaching Google about this issue, it should first be tried to implement the DMARC-workaround in M... [20:35:20] (03CR) 10Mforns: [C: 04-1] "We need to wait until MediaWikiInstallPingback starts receiving events before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/366049 (https://phabricator.wikimedia.org/T170986) (owner: 10Mforns) [20:38:02] (03PS1) 10Urbanecm: Change logo on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366058 (https://phabricator.wikimedia.org/T170984) [20:43:52] (03PS1) 10Urbanecm: Change timezone on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366077 (https://phabricator.wikimedia.org/T170985) [20:46:35] (03CR) 10Zppix: [C: 031] Change timezone on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366077 (https://phabricator.wikimedia.org/T170985) (owner: 10Urbanecm) [20:54:33] (03PS1) 10Bearloga: statistics::packages: Add libssl-dev and comments [puppet] - 10https://gerrit.wikimedia.org/r/366107 (https://phabricator.wikimedia.org/T152712) [20:58:11] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3449972 (10Nuria) 05Open>03Resolved [21:00:22] (03PS2) 10Dzahn: Add cookiesandcodeblog to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/366036 (owner: 10Amire80) [21:06:24] (03CR) 10Dzahn: [C: 032] Add cookiesandcodeblog to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/366036 (owner: 10Amire80) [21:06:40] (03PS3) 10Dzahn: planet: Add cookiesandcodeblog to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/366036 (owner: 10Amire80) [21:06:55] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3450028 (10GWicke) [21:09:49] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked): Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#3450042 (10GWicke) @mobrovac @mvolz, what is the status of this task? 
[21:11:23] (03PS2) 10Hashar: contint: role and packages for R language [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) [21:11:25] (03CR) 10Ottomata: [C: 04-1] "Thank you for code comments!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/366107 (https://phabricator.wikimedia.org/T152712) (owner: 10Bearloga) [21:14:55] (03PS3) 10Hashar: contint: role and packages for R language [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) [21:16:29] (03PS2) 10Bearloga: statistics::packages: Add libssl-dev and comments [puppet] - 10https://gerrit.wikimedia.org/r/366107 (https://phabricator.wikimedia.org/T152712) [21:17:08] (03CR) 10Bearloga: "Latest patch should fix the issue :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/366107 (https://phabricator.wikimedia.org/T152712) (owner: 10Bearloga) [21:17:34] (03CR) 10jenkins-bot: [V: 04-1] statistics::packages: Add libssl-dev and comments [puppet] - 10https://gerrit.wikimedia.org/r/366107 (https://phabricator.wikimedia.org/T152712) (owner: 10Bearloga) [21:18:55] (03CR) 10Hashar: [C: 031] "That is merely a copy paste from role::ci:slave::android adjusted to install r-lang." [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: 10Hashar) [21:19:07] (03CR) 10Dzahn: "it's a missing comma on line 68" [puppet] - 10https://gerrit.wikimedia.org/r/366107 (https://phabricator.wikimedia.org/T152712) (owner: 10Bearloga) [21:19:11] (03PS3) 10Bearloga: statistics::packages: Add libssl-dev and comments [puppet] - 10https://gerrit.wikimedia.org/r/366107 (https://phabricator.wikimedia.org/T152712) [21:19:47] bearloga: good morning. Eventually I have enabled the CI job for ortiz :] [21:20:12] bearloga: I wanted to further optimize/polish it up, but I guess it is good enough to be enabled and so I did ! [21:21:11] hashar: thank you!
:D [21:21:37] bearloga: the installed dependencies are kept between builds [21:21:54] so if you remove a dependency in the description file, it will still be around [21:22:02] + the devtools dependencies are installed as well :( [21:22:08] it is not perfect [21:23:55] hashar: would it help if I pulled out the R package installation code from shiny_server module into its own thing? [21:24:38] hashar: also really happy to hear the dependencies are cached :) [21:25:46] bearloga: yeah or to say it otherwise: the libs are not cleaned up between builds :] [21:26:06] and yeah the bits from shiny_server can definitely be extracted out to a new puppet module [21:26:24] with maybe a class to install the basic prerequisites (eg r-base) [21:26:32] and a define to easily install stuff from cran [21:26:42] I could typically use something like: [21:27:12] r:cran( 'devtools' ) [21:27:14] :] [21:29:11] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 31 seconds ago with 5 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/keystone-public-uwsgi],File[/etc/logrotate.d/keystone-admin-uwsgi],Service[uwsgi-keystone-admin],Service[uwsgi-keystone-public] [21:30:11] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:31:48] (03CR) 10Dzahn: [C: 031] "what about the "git submodule update" that is being removed as well?" [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [21:32:32] 10Operations, 10Continuous-Integration-Infrastructure, 10Discovery, 10Discovery-Analysis (Current work), 10Release-Engineering-Team (Watching / External): Setup a mirror for R language dependencies (CRAN) - https://phabricator.wikimedia.org/T170995#3450163 (10hashar) [21:33:58] (03CR) 10Reedy: "It's not exactly being removed, it's just being moved. It doesn't need separately calling on different code paths..."
[puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [21:35:01] (03PS4) 10Dzahn: l10nupdate: Reduce code duplication in git clone operations [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [21:35:04] jouncebot: next [21:35:04] In 1 hour(s) and 24 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170718T2300) [21:35:21] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 8 seconds ago with 3 failures. Failed resources (up to 3 shown): Service[uwsgi-keystone-admin],Service[uwsgi-keystone-public],Service[keystone] [21:38:21] (03CR) 10Dzahn: [C: 032] l10nupdate: Reduce code duplication in git clone operations [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [21:43:59] !log Attempt to deploy mediawiki/services/jobrunner – https://gerrit.wikimedia.org/r/#/c/349364/ - failed. [21:44:06] hashar: https://gist.github.com/Krinkle/b5514b9a80efe336dd5c6f81216bf103 [21:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:17] Looks like deploy isn't working properly. Something wrong? [21:44:59] Krinkle: jobrunner should be cleaned from the deployment server. It is no more deployed by Trebuchet but using scap [21:45:15] Krinkle: though on deployment scap does not restart jobchron (only jobrunner) [21:45:21] I'm also not understanding the details of the failure. "fetch status" being > 100min ago but still status=0 or status=1 [21:45:22] thcipriani ping ^^ some jobrunner fun [21:45:49] Oh I didn't know that change already happened. [21:45:59] I have no idea about Trebuchet output really :( [21:46:02] https://phabricator.wikimedia.org/T129148 is still open [21:46:13] ahhh [21:46:24] and https://wikitech.wikimedia.org/wiki/Jobrunner documents Trebuchet [21:46:44] OK, thcipriani damage control. What did (if anything) 'sync' do just now? [21:46:45] Krinkle: yeah see my last comment. 
It is migrated but pending scap 3.6 to be able to restart both services [21:46:50] hrm [21:46:59] See paste [21:47:13] I think it probably didn't do anything but I'm checking... [21:47:18] Thx [21:47:26] (I guess I failed to clear it out from Trebuchet :((( ) [21:48:22] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [21:50:49] I am resuming to sleep() project. Feel free to follow up on T129148 [21:50:49] T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner) - https://phabricator.wikimedia.org/T129148 [21:51:22] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/var/log/keystone],File[/etc/keystone],Package[keystone] [21:51:53] thcipriani: Once you've verified nothing happened, walk me through the new workflow and I'll document it on Wikitech? [21:51:58] (I still want to deploy this change) [21:52:43] Krinkle: hashar it seems like nothing has changed, I don't see any new tags on any of the servers I spot-checked. It doesn't look like it fetched anything afaict. [21:52:52] Cool [21:53:47] so the new process is: get the repo on tin the way it should look in /srv/deployment/jobrunner/jobrunner, and then run scap deploy -v [21:53:52] (03CR) 10Jforrester: [C: 031] "Scheduled for SWAT at 16:00 SF." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 (owner: 10Esanders) [21:54:05] !log krinkle@tin Started deploy [jobrunner/jobrunner@5f6099f]: (no justification provided) [21:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:18] you can open a separate terminal to run: scap deploy-log -v inside /srv/deployment/jobrunner/jobrunner to watch more verbose output from each server [21:54:44] which I am doing now: looks like canaries worked fine... [21:54:53] Indeed. 
I'm at the canary step now [21:55:31] it should prompt you to continue the deployment, it will roll forward in groups of 5 (defined in scap.cfg) [21:55:42] So it restarted jobrunner on those 2 canaries [21:55:52] "config_deploy is not enabled in scap.cfg, skipping." [21:56:04] yeah, there's no config deployment for jobrunners [21:56:13] What is the option for / capable of? [21:56:25] (03PS2) 10Jforrester: Enable OOjs UI EditPage buttons on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360371 (https://phabricator.wikimedia.org/T162849) [21:56:31] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [21:56:47] (03CR) 10Jforrester: "PS2: Manual rebase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360371 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [21:56:50] it can create a config file on the remote machine based on variables inside the scap directory and the environment to which you are deploying. [21:57:09] it can combine the variables in the scap config with a yaml file on the remote machine to create a different config file [21:57:26] services uses it so that config deploys can be decoupled from puppet runs [21:57:31] third group failed one [21:57:33] 21:57:00 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'jobrunner/jobrunner', '-g', 'default', 'promote', '--refresh-config'] on mw1260.eqiad.wmnet returned [70]: Failed to restart jobrunner.service: Unit jobrunner.service is masked. [21:58:04] It was set to masked on purpose, iirc [21:58:08] ^ [21:58:32] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2: Developers are able to develop and test their applications through a unified pipeline towards production deployment. 
- https://phabricator.wikimedia.org/T170480#3450280 (10GWicke) [21:58:35] 10Operations, 10MediaWiki-Containers, 10Release-Engineering-Team, 10Epic, and 3 others: FY2017/18 Program 6 - Outcome 2 - Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3450281 (10GWicke) [21:58:39] * Krinkle reads up on systemd "masked units" [21:58:39] of course this exceeds the failure limit :\ [21:59:25] masked would survive reboots, so really disabled [21:59:57] What does "being masked" mean in context of restarting a service? [22:00:07] That it will turn the service off, but never back on? [22:00:20] How did it start in the first place. [22:00:31] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[uwsgi-keystone-admin],Service[uwsgi-keystone-public] [22:01:38] IIRC there was something about how we didn't want to start one of the jobrunner services on a particular node...something like that. I'm forgetting the context of the first deployment. hash would remember. [22:01:39] RainbowSprinkles: You mean in general, or for this one server? [22:01:47] * thcipriani digs up ticket [22:01:48] Ah, I see. [22:01:58] Like a way of depooling it [22:02:04] !log krinkle@tin Finished deploy [jobrunner/jobrunner@5f6099f]: (no justification provided) (duration: 07m 58s) [22:02:10] OK. I won't rollback in that case. [22:02:15] https://phabricator.wikimedia.org/T167104#3317805 [22:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:55] thcipriani: I wonder why just that one server? [22:03:22] Does that mean I just started jobrunners on the other codfw servers besides this one? [22:03:23] Krinkle: did that deploy to all servers? 
[22:03:31] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [22:03:38] If this is to disable codfw, then it didn't work I guess [22:03:49] all groups had at least one or two codfw servers in it. [22:03:54] And went without trouble [22:04:06] only 1 in default3 group failed. [22:04:09] And actually, that was in eqiad. [22:04:11] mw1260.eqiad [22:05:16] Darn, it looks like JobRunner is active on mw2161.codfw.wmnet (spot checked). [22:05:21] Not jobchron though, maybe that is fine? [22:05:32] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:06:11] thcipriani: Can we find out if jobrunner was already active on those codfw nodes? [22:07:10] icinga says it has been running for at least 5 days [22:07:34] that is "HHVM jobrunner" service on mw2153 [22:07:41] jobrunner.error count in Grafana just went from 1K to 7K [22:07:47] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=now-3h&to=now [22:09:06] OK. Without docs I'm not sure how to turn this back off, or whether I should. [22:09:20] Krinkle: I just spot-checked mw2159 and it looks like it's still running the old version of the code...so what happens with scap is it'll stop deploying once it hits the failure limit (which is, I suppose, 1 in this instance) I think we need to redeploy with a higher failure limit to account for the service masking. This would explain why you only hit 1 server that had the problem: it was just [22:09:22] the first one you hit. I think you've deployed 17 of 36 servers so far. [22:10:17] Yes, it reached server 3 of group default3 [22:10:26] and stopped when the 4th one in that group failed. 
[22:10:50] But group default1 has 3 codfw servers and also 1 codfw server in group default2 [22:11:23] == DEFAULT1 == [22:11:24] :* mw2243.codfw.wmnet [22:11:24] :* mw2160.codfw.wmnet [22:11:24] :* mw1168.eqiad.wmnet [22:11:24] :* mw2118.codfw.wmnet [22:11:24] :* mw1164.eqiad.wmnet [22:11:25] ah, ok, so in scap.cfg we could add a line that reads something like: 'failure_limit: 25%' and then it won't halt the deployment at least. [22:11:35] == DEFAULT2 == [22:11:35] :* mw1305.eqiad.wmnet [22:11:35] :* mw1304.eqiad.wmnet [22:11:35] :* mw1301.eqiad.wmnet [22:11:35] :* mw1259.eqiad.wmnet [22:11:35] :* mw2161.codfw.wmnet [22:11:49] (03PS6) 10Rush: labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) [22:11:51] Did these codfw wrongly start a jobrunner where previously they were not? [22:12:21] PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:13:50] thcipriani: [22:14:30] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=mw2243&var-network=bond0 [22:14:35] (03CR) 10Rush: [C: 032] labtest: rabbitmq for openstack control node [puppet] - 10https://gerrit.wikimedia.org/r/365868 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [22:14:36] suggests it wrongly started it [22:14:44] OK this is a problem. these need to be stopped immediately. [22:15:39] well for 2243 it seems that it was started 18 mins ago [22:15:46] active (running) since Tue 2017-07-18 21:56:30 UTC; 18min ago [22:15:54] salt command not found. [22:16:28] yes [22:17:11] PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:19:41] PROBLEM - Check systemd state on mw2118 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:23:30] (03PS1) 10Rush: labtest: labcontrol2001 use rabbitmq role [puppet] - 10https://gerrit.wikimedia.org/r/366166 (https://phabricator.wikimedia.org/T167559) [22:23:31] PROBLEM - Check systemd state on mw2243 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:25:21] PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:28:02] Krinkle: if you want to move forward with the deploy, you can remove service_name from the scap.cfg and it will not try to restart any services. [22:28:44] thcipriani: I'd prefer to revert for now. [22:29:37] getting source code back to how it was, codfw staying down, and reverting/restarting any updated eqiad's. [22:29:54] Then I'll write an incident report and wait for it to be resolved. The commit in question isn't important. [22:30:36] hrm, well. After canceling the option to rollback any change would be a fresh deploy since scap is only aware of the current deployment and the last state that was accepted as a successful deployment. [22:30:51] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Service[uwsgi-keystone-admin],Service[uwsgi-keystone-public] [22:30:59] OK. So I'll prepare HEAD as before the commit [22:31:23] k should have been at 161c84cfd4dfa536e09278ce65a585c8d6313aeb [22:31:31] thcipriani: Then I'll locally remove service_name from the cfg file? [22:31:36] then remove service_name and service_port from the scap.cfg [22:31:38] yep [22:32:18] thcipriani: And then run scap deploy again, on the dirty state? [22:32:33] lemme try something first...
[22:33:20] Krinkle: yes, all looks correct in the current state [22:33:28] you can run scap deploy -v [22:34:23] !log krinkle@tin Started deploy [jobrunner/jobrunner@5f6099f]: (no justification provided) [22:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:47] thcipriani: "jobrunner/jobrunner: promote and restart_service stage(s): 100% (ok: 2; fail: 0; left: 0)" [22:35:00] restart_service is mentioned but not running, or.. [22:35:16] CANARY: :* mw1299.eqiad.wmnet, * mw2247.codfw.wmnet [22:36:44] yup, checking them out now, they have the right revision and I don't think anything restarted... [22:37:00] /srv/deployment/jobrunner/jobrunner -> jobrunner-cache/revs/161c84cfd4dfa536e09278ce65a585c8d6313aeb [22:38:13] Krinkle: definitely didn't restart/reload service [22:38:21] at least on 1299 [22:38:32] and rev is correct [22:39:43] thcipriani: OK. Minor bug/enhancement to avoid scaring log messages :) [22:39:52] :) [22:40:34] going through the groups now [22:41:03] I'm deployment stalking :) [22:42:42] !log krinkle@tin Finished deploy [jobrunner/jobrunner@5f6099f]: (no justification provided) (duration: 08m 18s) [22:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:02] thcipriani: What is the finalise stage for? It went through the groups again toward the end. [22:43:36] Krinkle: it's for rollback cleanup, so after finalize you can't rollback anymore, but it ensures that the state is sane. [22:44:07] it just removes a file on each target that is in /srv/deployment/[repo]-cache/.in-progress [22:44:15] that points to the in-progress deployment. [22:44:49] Krinkle: ok, do we need to restart jobrunner on the machines in eqiad that were restarted previously? [22:45:21] thcipriani: Yes. [22:45:34] thcipriani: Hm.. and the finalise stage does not happen if an error happens and I say N to rollback? [22:45:42] (as was in the previous attempt) [22:46:23] ah, right. 
It doesn't remove the .in-progress, it move .in-progress to .done [22:46:42] IIRC [22:47:28] so only successful deployments hit finalize, I think. The logic has been changed a few times :) [22:49:24] thcipriani: So the previous deployment that failed but had No to rollback. Did it leave them, and then not warn about it on the next attempt? [22:49:31] Just trying to see if there is another bug here or not. [22:50:45] leave the .in-progress flag? I believe it did. [22:52:07] we block concurrent deploys on the deployment server rather than on the targets [22:53:21] thcipriani: Right, but without the finalize stage iterating over all servers in the previous error deployment.. [22:53:31] I assume it also left something on the targets [22:53:37] or did it clean up without reporting to stdout. [22:54:12] Let's do the restarts first. [22:54:18] no it probably left a .in-progress in the deployment-cache directory [22:54:44] ok, doing restarts [22:57:01] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:58:37] !log restared jobrunner on mw1299.eqiad.wmnet mw1168.eqiad.wmnet mw1164.eqiad.wmnet mw1305.eqiad.wmnet mw1304.eqiad.wmnet mw1301.eqiad.wmnet mw1259.eqiad.wmnet mw1166.eqiad.wmnet mw1300.eqiad.wmnet [22:58:44] ^ Krinkle should be done [22:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:01] thcipriani: Thanks [22:59:03] What were they staring at thcipriani? [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170718T2300). [23:00:04] Niharika, tgr, Jdlrobson, and James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:20] Reedy: just making sure they see the right version of the code, they got restarted as part of the scap deploy of the new code (that we don't want), so we deployed the old code (that we do want) and wanted to make sure to manually restart those services :) [23:00:35] thcipriani: you restared them, not restarted ;) [23:00:45] damnit :) [23:00:52] Heya. [23:00:53] thcipriani: So the finalise stage, I didn't quite get it, but just to summarise: in the first deploy command when we hit the masked service error in group default3, there was no finalize step. The second time when we reverted, it went without problem, also no apparent conflict detected, and with finalise stage. If you see a bug in there, please file one :) - If not, I trust that its fine. [23:00:58] Who's swatting? [23:01:05] O/ [23:01:51] to the swatter: Please give me 5-10 minutes to verify everything is good now regarding job queue. [23:01:53] Niharika: Are you volunteering? Or saying you're here? ;) [23:02:09] Krinkle: this seems like the correct behavior. I will double check it, but I'm pretty sure all is fine :) [23:02:44] Reedy: Saying I'm here. 😛 [23:02:55] In a meeting, else I'd totally Swat. [23:03:26] \o [23:04:07] I can SWAT since I'm already here [23:04:38] Krinkle: let me know once you've verified job queue things [23:06:45] thcipriani: All good. 
[23:06:48] (I think) [23:06:49] :) [23:06:54] Go ahead [23:06:55] oh good :) [23:07:03] alright, proceeding with swat [23:07:10] (03PS2) 10Thcipriani: Enable CodeMirror on simplewiki for better testing and more exposure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365884 (owner: 10Niharika29) [23:07:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365884 (owner: 10Niharika29) [23:08:55] (03PS1) 10Bearloga: Move R-related code from shiny_server to separate module [puppet] - 10https://gerrit.wikimedia.org/r/366170 (https://phabricator.wikimedia.org/T153856) [23:09:15] (03Merged) 10jenkins-bot: Enable CodeMirror on simplewiki for better testing and more exposure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365884 (owner: 10Niharika29) [23:09:22] (03CR) 10jenkins-bot: Enable CodeMirror on simplewiki for better testing and more exposure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365884 (owner: 10Niharika29) [23:10:57] Niharika: your change is live on mwdebug1002, check please [23:11:02] On it. [23:11:50] thcipriani: Looks great. [23:11:57] Niharika: ok, going live [23:13:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:365884|Enable CodeMirror on simplewiki for better testing and more exposure]] (duration: 00m 48s) [23:13:36] ^ Niharika live now [23:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:52] thcipriani: Thank you! [23:15:29] tgr: your change looks merged on gerrit is that correct? 
[23:16:09] thcipriani: yeah, Rainbow.Sprinkles deployed it a while ago [23:16:15] ok :) [23:16:36] (03PS3) 10Thcipriani: Enable OOjs UI EditPage buttons on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360371 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [23:16:54] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360371 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [23:18:18] (03Merged) 10jenkins-bot: Enable OOjs UI EditPage buttons on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360371 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [23:18:27] (03CR) 10jenkins-bot: Enable OOjs UI EditPage buttons on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360371 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [23:19:31] James_F: enable oojs ui editpage buttons on all wikipedias is live on mwdebug1002, check please [23:20:21] thcipriani: Yeah, LGTM. [23:20:27] going live [23:22:27] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:360371|Enable OOjs UI EditPage buttons on all Wikipedias]] T162849 (duration: 00m 47s) [23:22:34] ^ James_F live everywhere [23:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:39] T162849: Support WMF communities in run-up to switching EditPage over to OOUI - https://phabricator.wikimedia.org/T162849 [23:22:54] thcipriani: Thanks! [23:24:25] jdlrobson: Add missing jQueryMsg dependency for mobile diff view is live on mwdebug1002, check please [23:24:29] thcipriani: When you get down to mine.. There's not really much way of testing it as it will require some wikidata stuff to go off in the background. So please just deploy everywhere [23:24:39] thcipriani: checking [23:24:41] Reedy: ack [23:25:17] thcipriani: verified! 
[23:25:22] going live [23:25:25] thcipriani: sync awwwwayyyy [23:25:49] 10Operations, 10DBA: Global rename of user Moros - https://phabricator.wikimedia.org/T170941#3450716 (10Aklapper) [23:27:28] !log thcipriani@tin Synchronized php-1.30.0-wmf.9/extensions/Thanks/extension.json: SWAT: [[gerrit:366168|Add missing jQueryMsg dependency for mobile diff view]] T170917 (duration: 00m 47s) [23:27:32] ^ jdlrobson live [23:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:40] T170917: Thank button broken on mobilefrontend - displays "{{GENDER:[object Object]|{{GENDER:unknown|Thank}}}}" - https://phabricator.wikimedia.org/T170917 [23:28:54] thcipriani: sweeett [23:29:38] (03PS4) 10Thcipriani: Add Welsh mobile logo (just changes 'k' to 'c'). [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 (owner: 10Esanders) [23:29:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 (owner: 10Esanders) [23:30:39] (03Merged) 10jenkins-bot: Add Welsh mobile logo (just changes 'k' to 'c'). [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 (owner: 10Esanders) [23:31:16] James_F: ^ live on mwdebug1002, check please [23:32:27] One sec. [23:32:48] thcipriani: Yeah, LGTM. [23:32:55] going live [23:34:42] !log thcipriani@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-cy.svg: SWAT: [[gerrit:365942|Add Welsh mobile logo (just changes 'k' to 'c']] PART I (duration: 00m 47s) [23:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:12] (03CR) 10BryanDavis: "Seems reasonable. The other way to do this would be to make the hiera config a list of hosts and join them in the template." 
[puppet] - 10https://gerrit.wikimedia.org/r/366025 (owner: 10Andrew Bogott) [23:35:55] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:365942|Add Welsh mobile logo (just changes 'k' to 'c']] PART II (duration: 00m 46s) [23:36:00] ^ James_F live now [23:36:04] 10Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests: LDAP access to the wmf group for Anne Gomez - https://phabricator.wikimedia.org/T170679#3450749 (10Dzahn) [23:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:23] (03PS3) 10Thcipriani: Add din to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365451 (https://phabricator.wikimedia.org/T168518) (owner: 10Reedy) [23:36:28] thcipriani: Thanks! [23:36:29] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365451 (https://phabricator.wikimedia.org/T168518) (owner: 10Reedy) [23:37:31] (03Merged) 10jenkins-bot: Add din to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365451 (https://phabricator.wikimedia.org/T168518) (owner: 10Reedy) [23:40:18] !log thcipriani@tin Synchronized wmf-config/InterwikiSortOrders.php: SWAT: [[gerrit:365451|Add din to InterwikiSortOrders]] T168518 (duration: 00m 46s) [23:40:26] ^ Reedy your change is live [23:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:29] T168518: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518 [23:40:31] Thanks! [23:40:41] yw :) [23:40:58] I think that about does it for swat! [23:42:14] (03CR) 10jenkins-bot: Add Welsh mobile logo (just changes 'k' to 'c'). 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/365942 (owner: 10Esanders) [23:42:16] (03CR) 10jenkins-bot: Add din to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365451 (https://phabricator.wikimedia.org/T168518) (owner: 10Reedy) [23:44:32] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89927.48 seconds [23:45:07] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3450790 (10Reedy) https://gerrit.wikimedia.org/r/365451 is now merged and deployed [23:53:33] !log netmon1002 - copied Letsencrypt cert/key for librenms from netmon1001 for migration after netmon1002 has been reinstalled and now has RAID. (T159756) [23:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:47] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3450840 (10Koavf) I just tried to add din.wp's "Kïndegɔ̈t" to Wikidata's "Q9779" and it didn't work. [23:53:47] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [23:54:09] (03PS3) 10Dzahn: switch librenms from netmon1001 to netmon1002 [dns] - 10https://gerrit.wikimedia.org/r/364617 (https://phabricator.wikimedia.org/T159756) [23:55:47] (03CR) 10Dzahn: [C: 032] switch librenms from netmon1001 to netmon1002 [dns] - 10https://gerrit.wikimedia.org/r/364617 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [23:56:31] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]