[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170720T0000).
[00:00:26] * Dereckson is back
[00:01:06] (CR) jerkins-bot: [V: -1] phpcbf on mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/366483 (owner: Reedy)
[00:01:20] ebernhardson: okay, syncing
[00:02:05] (CR) Jforrester: "More to do." [mediawiki-config] - https://gerrit.wikimedia.org/r/366483 (owner: Reedy)
[00:02:24] I suppose I'll wait a bit to update phabricator
[00:02:27] Reedy: We should break this up into a bunch of small changes to be less disruptive.
[00:03:01] Maybe
[00:03:06] Operations, Wikidata, Wikimedia-Language-setup, Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3455009 (Dcljr) OK, wait. I'm seeing something weird. Each of the [[ https://din.wikipedia.org/wiki/K%C3%ABc%C3%ABweek:Contributions/Dexbot | pages...
[00:03:12] Reedy: I can do it but I've been intentionally waiting for SWAT to be over so the world stops changing out from under us.
[00:03:19] haha
[00:03:35] !log dereckson@tin Synchronized php-1.30.0-wmf.10/includes/widget/SearchInputWidget.php: Revert "Make mw.widgets.SearchInputWidget extend OO.ui.SearchInputWidget" (1/3) (duration: 00m 46s)
[00:03:35] Reedy: Unlike a certain someone. :-P
[00:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:14] I love that phpcbf didn't fix some of the stuff it could fix
[00:04:34] Coren: want to come over to #wikimedia-cloud and I'll see if I can get you fixed up?
[00:04:39] Reedy: It times out on really long files.
[00:04:45] !log dereckson@tin Synchronized php-1.30.0-wmf.10/resources/Resources.php: Revert "Make mw.widgets.SearchInputWidget extend OO.ui.SearchInputWidget" (2/3) (duration: 00m 46s)
[00:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:11] Reedy: I run it a few times in such cases until it's a no-op.
[00:06:03] !log dereckson@tin Synchronized php-1.30.0-wmf.10/resources/src/mediawiki.widgets/mw.widgets.SearchInputWidget.js: Revert "Make mw.widgets.SearchInputWidget extend OO.ui.SearchInputWidget" (3/3) (duration: 00m 46s)
[00:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:50] (PS4) Reedy: phpcbf on mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/366483
[00:06:59] James_F: There's some files it didn't even attempt
[00:07:17] Reedy: Hmm.
[00:07:32] ebernhardson: 10 synced, now 9
[00:08:05] Operations, Wikidata, Wikimedia-Language-setup, Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3455012 (Reedy) There's a huge backlog of changes to dispatch for Wikidata. It's very possible that's part of the problem
[00:08:09] !log dereckson@tin Synchronized php-1.30.0-wmf.9/includes/widget/SearchInputWidget.php: Revert "Make mw.widgets.SearchInputWidget extend OO.ui.SearchInputWidget" (1/3) (duration: 00m 46s)
[00:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:18] (CR) jerkins-bot: [V: -1] phpcbf on mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/366483 (owner: Reedy)
[00:08:36] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.005 second response time
[00:08:56] Dereckson: thanks
[00:08:56] !log dereckson@tin Synchronized php-1.30.0-wmf.9/resources/Resources.php: Revert "Make mw.widgets.SearchInputWidget extend OO.ui.SearchInputWidget" (2/3) (duration: 00m 46s)
[00:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:43] !log dereckson@tin Synchronized php-1.30.0-wmf.9/resources/src/mediawiki.widgets/mw.widgets.SearchInputWidget.js: Revert "Make mw.widgets.SearchInputWidget extend OO.ui.SearchInputWidget" (3/3) (duration: 00m 46s)
[00:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:00] (PS5) Reedy: phpcbf on mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/366483
[00:10:17] Operations, Wikidata, Wikimedia-Language-setup, Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3447871 (Dcljr) Oops, @Reedy, I was editing my comment when you left yours. Sigh, as usual, I guess it just takes time to work. We'll see…
[00:11:32] (CR) jerkins-bot: [V: -1] phpcbf on mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/366483 (owner: Reedy)
[00:11:39] ebernhardson: and logs look good
[00:12:46] (PS3) Dereckson: Stop RelatedArticles A/B test and clean up config [mediawiki-config] - https://gerrit.wikimedia.org/r/366314 (https://phabricator.wikimedia.org/T169948) (owner: Jdlrobson)
[00:13:01] (PS4) Dereckson: Stop RelatedArticles A/B test and clean up config [mediawiki-config] - https://gerrit.wikimedia.org/r/366314 (https://phabricator.wikimedia.org/T169948) (owner: Jdlrobson)
[00:13:09] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/366314 (https://phabricator.wikimedia.org/T169948) (owner: Jdlrobson)
[00:13:58] (PS6) Reedy: phpcbf on mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/366483
[00:14:39] (Merged) jenkins-bot: Stop RelatedArticles A/B test and clean up config [mediawiki-config] - https://gerrit.wikimedia.org/r/366314 (https://phabricator.wikimedia.org/T169948) (owner: Jdlrobson)
[00:14:50] jdlrobson: live on mwdebug1002.eqiad.wmnet
[00:15:35] (sorry for the delay, I lost connectivity, probably caused by a lightning storm)
[00:17:27] (CR) jenkins-bot: Stop RelatedArticles A/B test and clean up config [mediawiki-config] - https://gerrit.wikimedia.org/r/366314 (https://phabricator.wikimedia.org/T169948) (owner: Jdlrobson)
[00:17:39] syncing order doesn't seem to matter here, will sync CS/IS
[00:18:17] phuedx: ping?
[00:23:57] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[00:24:16] (PS1) Dereckson: Revert "Stop RelatedArticles A/B test and clean up config" [mediawiki-config] - https://gerrit.wikimedia.org/r/366490
[00:24:35] (CR) Dereckson: [C: 2] Revert "Stop RelatedArticles A/B test and clean up config" [mediawiki-config] - https://gerrit.wikimedia.org/r/366490 (owner: Dereckson)
[00:26:56] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[00:27:22] (Merged) jenkins-bot: Revert "Stop RelatedArticles A/B test and clean up config" [mediawiki-config] - https://gerrit.wikimedia.org/r/366490 (owner: Dereckson)
[00:27:32] twentyafterfour: I'm done
[00:28:46] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[00:29:26] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[00:30:07] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[00:30:50] (CR) jenkins-bot: Revert "Stop RelatedArticles A/B test and clean up config" [mediawiki-config] - https://gerrit.wikimedia.org/r/366490 (owner: Dereckson)
[00:33:52] Dereckson: hey
[00:34:03] Sorry I thought you'd forgotten me. There's nothing to test for that patch
[00:34:09] It can just be synced
[00:41:42] jdlrobson: as it's not an emergency, I propose we do it tomorrow
[00:42:16] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active
[00:44:06] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active
[00:45:16] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[00:47:06] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[00:51:06] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active
[00:56:06] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[00:58:16] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active
[01:01:16] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[01:02:16] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active
[01:05:16] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[01:06:26] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[01:10:01] !log begin (belated) phabricator upgrade, expect momentary downtime.
[01:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:56] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:12:06] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:12:06] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:12:16] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active
[01:12:17] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:12:17] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:12:17] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:12:17] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:12:26] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:12:26] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:12:26] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:12:27] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:12:36] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:12:36] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:12:46] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:12:48] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:12:48] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:12:56] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:12:56] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:12:57] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:13:06] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp3048_v4, cp3048_v6
[01:13:06] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3048_v4, cp3048_v6
[01:14:53] !log phabricator upgrade complete
[01:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:16] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[01:16:06] twentyafterfour: Neat. Anything cool?
[01:16:17] James_F: redesigned diffusion UI
[01:16:55] not much else that I know of
[01:17:13] twentyafterfour: Indeed, it looks swish. Thanks!
[01:17:15] !log restarting keystone on labcontrol1001
[01:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:59] (PS1) Jforrester: phpcs: Enable MediaWiki.ControlStructures.IfElseStructure.Space*Else and make pass [mediawiki-config] - https://gerrit.wikimedia.org/r/366493
[01:19:02] (PS1) Jforrester: phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - https://gerrit.wikimedia.org/r/366494
[01:19:03] (PS1) Jforrester: phpcs: Enable MediaWiki.WhiteSpace.MultipleEmptyLines.MultipleEmptyLines and make pass [mediawiki-config] - https://gerrit.wikimedia.org/r/366495
[01:22:46] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.005 second response time
[01:24:16] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active
[01:27:16] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[01:28:26] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active
[01:29:16] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active
[01:30:36] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[01:31:07] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:31:07] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:31:36] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:06] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
[01:32:06] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:32:26] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[01:52:04] (PS3) Dzahn: librenms: rsync rrd data from netmon1001 to netmon1002 [puppet] - https://gerrit.wikimedia.org/r/366324 (https://phabricator.wikimedia.org/T159756)
[01:57:06] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[02:00:56] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[02:07:20] (CR) Dzahn: "no jenkins-bot" [puppet] - https://gerrit.wikimedia.org/r/366324 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[02:07:28] (CR) Dzahn: [V: 2 C: 2] librenms: rsync rrd data from netmon1001 to netmon1002 [puppet] - https://gerrit.wikimedia.org/r/366324 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[02:12:02] (PS1) Dzahn: netmon1001: temp put librenms role back on it [puppet] - https://gerrit.wikimedia.org/r/366499
[02:13:21] (PS2) Dzahn: netmon1001: temp put librenms role back on it [puppet] - https://gerrit.wikimedia.org/r/366499
[02:13:37] (CR) Dzahn: [V: 2 C: 2] netmon1001: temp put librenms role back on it [puppet] - https://gerrit.wikimedia.org/r/366499 (owner: Dzahn)
[02:14:26] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[02:14:56] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[02:15:26] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:22:24] !log netmon1001 - rsyncing librenms rrd data to netmon1002 - T159756
[02:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:39] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756
[02:31:56] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[02:36:32] (PS1) Dzahn: Revert "Revert "switch librenms from netmon1001 to netmon1002"" [dns] - https://gerrit.wikimedia.org/r/366501
[02:36:56] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[02:38:03] (PS2) Dzahn: Revert "Revert "switch librenms from netmon1001 to netmon1002"" [dns] - https://gerrit.wikimedia.org/r/366501
[02:38:16] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[02:39:46] PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:01:13] PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:01:33] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[03:02:34] !log l10nupdate@tin LocalisationUpdate failed: git pull of extensions failed
[03:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:33] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational
[03:04:13] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active
[03:07:13] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[03:07:33] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:27:29] (CR) Dzahn: [C: 2] "i have rsynced all the rrd data" [dns] - https://gerrit.wikimedia.org/r/366501 (owner: Dzahn)
[03:32:19] !log service uwsgi-labspuppetbackend restart on labcontrol1001
[03:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:33:43] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational
[03:34:14] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active
[03:34:40] !log service nova-network restart on labnet1001
[03:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:35:15] Operations, Services (doing), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3455272 (KartikMistry) @MoritzMuehlenhoff Is 6.11.1 available in Labs or Beta? Good to test it there too.
[03:36:35] Operations, Services (doing), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3455273 (KartikMistry) Local testing with nodejs 6.11.1~dfsg-1 and cxserver looks good. I'll retest today and update here.
[03:39:03] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[03:50:43] PROBLEM - SSH cp3048.mgmt on cp3048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:00:03] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[04:00:13] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.007 second response time
[04:05:43] !log restarting rabbitmq-server on labcontrol1001
[04:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:33] (PS1) Andrew Bogott: update labvirt-star.eqiad.wmnet.crt with new ca [puppet] - https://gerrit.wikimedia.org/r/366505
[04:42:42] !log netmon1002 - restarted Apache for LDAP issue - librenms.wm.org switched back to it, after rsyncing rrd data, re-enabling puppet
[04:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:03] (CR) Andrew Bogott: [V: 2 C: 2] update labvirt-star.eqiad.wmnet.crt with new ca [puppet] - https://gerrit.wikimedia.org/r/366505 (owner: Andrew Bogott)
[04:44:55] !let netmon1002 - disable puppet again - crons for librenms running, crons for rancid stopped, rsynced data one last time
[04:50:36] RECOVERY - SSH cp3048.mgmt on cp3048.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[04:56:53] Operations, monitoring, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3455331 (Dzahn) 21:42 < mutante> !log netmon1002 - restarted Apache for LDAP issue - librenms.wm.org switched back to it, after rsyncing rrd data, re-enabling puppet 21:45 < mutante> !...
[04:59:14] (PS1) Andrew Bogott: nova libvirtd: use the new ca, wmf_ca_2017_2020 [puppet] - https://gerrit.wikimedia.org/r/366508
[04:59:24] (CR) Andrew Bogott: [V: 2 C: 2] nova libvirtd: use the new ca, wmf_ca_2017_2020 [puppet] - https://gerrit.wikimedia.org/r/366508 (owner: Andrew Bogott)
[05:05:18] !log Configure replication for s2 on labsdb1009 and labsdb1010 - T153743
[05:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:28] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[05:07:42] (PS1) Marostegui: s2.hosts: Add labsdb1009 and labsdb1010 [software] - https://gerrit.wikimedia.org/r/366510 (https://phabricator.wikimedia.org/T153743)
[05:30:42] !log on contint1001 restarted jenkins
[05:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:25] !log on contint1001 restarted zuul and zuul-merger
[05:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:03] Operations, Puppet, Patch-For-Review: Add the puppet CA to the certification authorities trusted by our systems, on demand - https://phabricator.wikimedia.org/T114638#3455381 (Joe) Open>Resolved a:Joe
[06:00:38] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=cp3048.esams.wmnet
[06:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:33] Operations, Traffic: cp3048 down, mgmt console not reachable - https://phabricator.wikimedia.org/T171145#3455383 (elukey)
[06:03:35] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[06:05:35] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[06:12:41] (PS1) Andrew Bogott: update labvirt-star.codfw.wmnet.crt with new ca [puppet] - https://gerrit.wikimedia.org/r/366514
[06:14:16] (CR) Andrew Bogott: [C: 2] update labvirt-star.codfw.wmnet.crt with new ca [puppet] - https://gerrit.wikimedia.org/r/366514 (owner: Andrew Bogott)
[06:15:08] (PS2) Andrew Bogott: update labvirt-star.codfw.wmnet.crt with new ca [puppet] - https://gerrit.wikimedia.org/r/366514 (https://phabricator.wikimedia.org/T171116)
[06:18:00] (CR) Giuseppe Lavagetto: systemd: add defines to manage systemd units (8 comments) [puppet] - https://gerrit.wikimedia.org/r/365900 (owner: Giuseppe Lavagetto)
[06:18:06] (CR) Giuseppe Lavagetto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/365900 (owner: Giuseppe Lavagetto)
[06:22:55] (PS1) Andrew Bogott: update labtestservices2001.wikimedia.org.crt with the new ca [puppet] - https://gerrit.wikimedia.org/r/366515 (https://phabricator.wikimedia.org/T171116)
[06:24:09] !log netmon1002 - librenms: fix permissions on /srv/librenms/rrd data after rsyncing, mismatching UIDs vs netmon1001 and rsyncd in chroot-issue
[06:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:08] (PS1) Giuseppe Lavagetto: Add my new pubkey [labs/private] - https://gerrit.wikimedia.org/r/366516
[06:33:34] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add my new pubkey [labs/private] - https://gerrit.wikimedia.org/r/366516 (owner: Giuseppe Lavagetto)
[06:34:06] RECOVERY - Check for valid instance states on labnodepool1001 is OK: nodepool state management is OK
[06:36:15] (CR) Marostegui: "recheck" [software] - https://gerrit.wikimedia.org/r/366510 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[06:54:03] !log Force a BBU relearn on db1016 - T166344
[06:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:14] T166344: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344
[06:56:25] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0
[07:04:35] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[07:10:46] Operations, Services (doing), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3455442 (MoritzMuehlenhoff) @KartikMistry : The new nodejs packages are already uploaded to apt.wikimedia.org for jessie and stretch, so can be used for testing in labs.
[07:22:05] Operations, Traffic: cp3048 down, mgmt console not reachable - https://phabricator.wikimedia.org/T171145#3455383 (MoritzMuehlenhoff) I had the same symptom with oxygen a few days ago and a "racadm racreset" fixed the mgmt for me.
[07:24:01] moritzm: o/ - curious about --^ did you send the command via ipmi remote?
[07:25:25] (PS1) Faidon Liambotis: base: remove WMF CA 2014-2017 [puppet] - https://gerrit.wikimedia.org/r/366517
[07:25:27] (PS1) Faidon Liambotis: base: cleanup absented CAs [puppet] - https://gerrit.wikimedia.org/r/366518
[07:25:30] ah via ipmitool it works
[07:26:33] (CR) Faidon Liambotis: [C: 2] base: remove WMF CA 2014-2017 [puppet] - https://gerrit.wikimedia.org/r/366517 (owner: Faidon Liambotis)
[07:27:11] (CR) Faidon Liambotis: [V: 2 C: 2] base: remove WMF CA 2014-2017 [puppet] - https://gerrit.wikimedia.org/r/366517 (owner: Faidon Liambotis)
[07:27:51] elukey: It can simply be run via the "main shell" of the mgmt, the one that prompts "admin1->"
[07:28:55] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 38 probes of 266 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:28:56] moritzm: ah okok, but I can't reach that one :(
[07:29:13] ssh hangs while trying
[07:29:24] I thought there was another way via ipmitool
[07:30:05] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2014_2017.crt]
[07:30:06] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2014_2017.crt]
[07:30:15] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2014_2017.crt]
[07:30:26] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2014_2017.crt]
[07:30:36] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2014_2017.crt]
[07:30:45] PROBLEM - puppet last run on analytics1060 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2014_2017.crt]
[07:30:55] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2014_2017.crt]
[07:31:05] PROBLEM - puppet last run on oresrdb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2014_2017.crt]
[07:31:12] uhoh
[07:31:23] ah the usual race
[07:32:17] Operations, Traffic: cp3048 down, mgmt console not reachable - https://phabricator.wikimedia.org/T171145#3455451 (elukey) From ipmitool sel I got a lot of these: ``` 7b | 07/20/2017 | 01:06:48 | Processor #0x0d | Transition to Non-recoverable | Asserted 7c | 07/20/2017 | 01:06:49 | Unknown #0x28 |...
[07:33:55] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 8 probes of 266 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:35:00] elukey: mgmt works for me, I filed a bug for affected mgmt's yesterday, see https://phabricator.wikimedia.org/T171041
[07:35:25] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2014_2017.crt]
[07:35:58] the negotiation of the DH groups sometimes fails for those old idrac versions, if you specify the downgraded kexOptions listed in the task, the SSH connection to the mgmt should work in your case as well
[07:36:15] works for me at least, if not, I can run the racreset instead
[07:37:06] moritzm: works for me now too, this morning it wasn't! (and I think that Daniel had the same issue)
[07:37:29] well com2 doesn't work :D
[07:37:43] or maybe it is simply that the host is frozen
[07:37:52] I can try to powercycle and see
[07:37:55] yeah, but com2 should be fixed by the "racadm racreset"
[07:37:57] it is depooled
[07:38:01] sel doesn't work for me
[07:38:04] I'd racreset indeed
[07:38:16] ah
[07:38:20] I forgot lanplus
[07:38:53] so these Unknowns
[07:39:01] note that vendors have their own proprietary extensions usually
[07:39:29] try racadm serveraction powercycle
[07:39:41] com2 still dead :D
[07:39:50] ah no there it goes
[07:40:01] freeipmi has --interpret-oem-data
[07:41:05] RECOVERY - Host cp3048 is UP: PING WARNING - Packet loss = 64%, RTA = 349.29 ms
[07:41:06] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK
[07:41:16] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 72 ESP OK
[07:41:17] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 72 ESP OK
[07:41:17] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 72 ESP OK
[07:41:17] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK
[07:41:17] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 72 ESP OK
[07:41:17] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 72 ESP OK
[07:41:25] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 58 ESP OK
[07:41:25] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK
[07:41:25] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK
[07:41:26] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK
[07:41:26] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK
[07:41:36] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 72 ESP OK
[07:41:36] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 58 ESP OK
[07:41:36] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 72 ESP OK
[07:41:37] !log powercycle cp3048 - mgmt reachable - T171145
[07:41:45] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK
[07:41:45] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 72 ESP OK
[07:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:48] T171145: cp3048 down, mgmt console not reachable - https://phabricator.wikimedia.org/T171145
[07:41:56] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 72 ESP OK
[07:41:56] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK
[07:41:56] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 72 ESP OK
[07:42:05] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK
[07:45:09] elukey: is com2 working now?
[07:47:18] paravoid: yep, I am stupid because it was working but since everything was frozen it wasn't showing me anything, wrote on the chan too soon
[07:47:26] oh ok
[07:47:32] * elukey grabs coffee
[07:47:45] but this morning I am sure the mgmt wasn't even working, ssh hanging
[07:47:49] for daniel too
[07:48:11] I re-tried now because Moritz told me it was working fine
[07:48:39] ok :)
[07:48:57] !log restart diamond on serpens/seaborgium to pick up the updated CA
[07:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:30] godog: why does diamond care about the CA?
[07:49:47] paravoid: I'm assuming because of tls
[07:49:52] Jul 20 07:45:58 seaborgium diamond[468]: Unable to query ldap://ldap-labs.eqiad.wikimedia.org:389: {'info': '(unknown error code)', 'desc': 'Connect error'}
[07:49:53] why does it do TLS? :)
[07:50:04] heh that I don't know :)
[07:50:26] it just connects on localhost right?
[07:51:00] yeah it does [07:51:11] so no point in doing TLS I think :) [07:53:21] indeed, my guess would be that the collector's client tries tls and doesn't fallback [07:54:58] (03PS1) 10Muehlenhoff: Restrict HTTP access in role::librenms [puppet] - 10https://gerrit.wikimedia.org/r/366519 [07:55:44] !log Start importing s2 into labsdb1011 - T153743 [07:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:56] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [07:57:46] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:58:05] RECOVERY - puppet last run on analytics1060 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [07:58:25] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [07:58:36] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [07:58:45] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:59:05] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:59:16] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:59:25] RECOVERY - puppet last run on oresrdb2001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [07:59:25] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [08:00:01] <_joe_> ? 
[08:01:47] !log Stop replication on labsdb1011 for maintenance - T153743 [08:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:00] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [08:05:33] (03PS1) 10Muehlenhoff: Restrict HTTP access for racktables [puppet] - 10https://gerrit.wikimedia.org/r/366521 [08:07:07] (03CR) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/365900 (owner: 10Giuseppe Lavagetto) [08:07:34] (03PS9) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [08:14:12] <_joe_> win 17 [08:15:32] 10Operations, 10Puppet, 10Patch-For-Review: PuppetDB misbehaving on 2017-07-15 - https://phabricator.wikimedia.org/T170740#3455518 (10elukey) It would be great in my opinion to monitor JVM metrics before and after the change that Alex proposed: * https://docs.puppet.com/puppetdb/2.3/configure.html#configuri... [08:15:34] (03CR) 10Marostegui: [C: 032] s2.hosts: Add labsdb1009 and labsdb1010 [software] - 10https://gerrit.wikimedia.org/r/366510 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:16:39] (03CR) 10Elukey: [C: 04-1] "Something doesn't look right:" [puppet] - 10https://gerrit.wikimedia.org/r/366229 (https://phabricator.wikimedia.org/T170740) (owner: 10Alexandros Kosiaris) [08:18:06] 10Operations, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3455519 (10KartikMistry) Thanks @MoritzMuehlenhoff cxserver in local and Labs with upgraded nodejs packages looks good. All tests are passing and no issues in service. 
[08:18:18] 10Operations, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3455520 (10KartikMistry) [08:18:51] (03Merged) 10jenkins-bot: s2.hosts: Add labsdb1009 and labsdb1010 [software] - 10https://gerrit.wikimedia.org/r/366510 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:19:35] 10Operations, 10Puppet, 10Patch-For-Review: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757#3455525 (10fgiunchedi) The `create_ecdsa_cert` script was patched already, though our puppet server doesn't use `dns_alt_names`. I think it'd be ok in this case to just regenerate... [08:20:16] godog: so let's find out and fix? [08:20:20] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Joe: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3455526 (10Joe) [08:25:34] !log CI is restored albeit in degraded mode (lack of Castor cache) - T171148 [08:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:43] T171148: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148 [08:25:57] paravoid: heh I haven't finished going through the ops queue yet, mostly cleaning up tasks [08:26:24] 10Operations, 10Puppet, 10Patch-For-Review: PuppetDB misbehaving on 2017-07-15 - https://phabricator.wikimedia.org/T170740#3455544 (10elukey) Checked the patch that Alex wrote and realized that we don't use the heap_size parameter, ending up in a empty Xmx: ``` puppetdb 30718 54.4 27.6 9986036 4547056 ?... 
[08:29:11] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp3048.esams.wmnet [08:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:09] !log Force a BBU relearn on db1016 - T166344 [08:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:19] T166344: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344 [08:34:35] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [08:35:01] marostegui: still playing with BBUs? :( [08:35:15] :_( [08:35:31] add a crontab :-P [08:35:37] hahah [08:35:37] (03PS2) 10Elukey: puppetdb: Bump Java Heap max size to 6GB [puppet] - 10https://gerrit.wikimedia.org/r/366229 (https://phabricator.wikimedia.org/T170740) (owner: 10Alexandros Kosiaris) [08:35:46] it has been a while since it happened last time to be honest [08:45:10] (03CR) 10Faidon Liambotis: [C: 04-2] "For the purpose this changeset is envisioned, it won't work: you can't actually ban the user you want to ban with the "bad_clients" key in" [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) (owner: 10Smalyshev) [08:46:16] 10Puppet, 10Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#3455560 (10hashar) [08:52:27] (03CR) 10Faidon Liambotis: [C: 031] "That's great, thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/365900 (owner: 10Giuseppe Lavagetto) [08:55:02] (03CR) 10Giuseppe Lavagetto: "> That's great, thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/365900 (owner: 10Giuseppe Lavagetto) [08:56:34] (03PS10) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [08:58:35] 10Operations, 10Traffic: cp3048 down, mgmt console not reachable - https://phabricator.wikimedia.org/T171145#3455570 (10ema) 
05Open>03Resolved a:03ema So as @MoritzMuehlenhoff mentioned on IRC the mgmt issues might have been due to T171041. The host is back online and looks fine at the moment so I've re... [08:59:01] (03CR) 10Giuseppe Lavagetto: [C: 032] systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 (owner: 10Giuseppe Lavagetto) [08:59:15] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=501.20 Read Requests/Sec=556.40 Write Requests/Sec=197.50 KBytes Read/Sec=46256.00 KBytes_Written/Sec=1399.20 [08:59:20] (03PS3) 10Giuseppe Lavagetto: motd::script: use validate_numeric for priority [puppet] - 10https://gerrit.wikimedia.org/r/365569 [09:02:18] !log uploaded apache2 2.4.10-10+deb8u10+wmf1 (rebase of WMF-specific patches on top of latest DSA) to apt.wikimedia.org/jessie [09:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:11] !log Restored CI cache storage (castor) on a fresh new instance. 
Cache is empty though so jobs will be a bit slower until the cache is populated - T171148 [09:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:21] T171148: CI jobs are blocked because castor is unreachable - https://phabricator.wikimedia.org/T171148 [09:04:43] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/365569 (owner: 10Giuseppe Lavagetto) [09:04:50] !log eqiad cache_text/upload: upgrade to varnish 4.1.7-1wm1 and reboot for kernel updates [09:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:34] (03CR) 10Faidon Liambotis: [C: 032] motd::script: use validate_numeric for priority [puppet] - 10https://gerrit.wikimedia.org/r/365569 (owner: 10Giuseppe Lavagetto) [09:06:01] (03PS2) 10Faidon Liambotis: base: cleanup absented CAs [puppet] - 10https://gerrit.wikimedia.org/r/366518 [09:06:06] (03CR) 10Faidon Liambotis: [V: 032 C: 032] base: cleanup absented CAs [puppet] - 10https://gerrit.wikimedia.org/r/366518 (owner: 10Faidon Liambotis) [09:08:25] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.70 Read Requests/Sec=3.90 Write Requests/Sec=4.30 KBytes Read/Sec=20.00 KBytes_Written/Sec=126.40 [09:11:09] (03PS1) 10Filippo Giunchedi: base: check and alert on free filesystem inodes [puppet] - 10https://gerrit.wikimedia.org/r/366525 (https://phabricator.wikimedia.org/T129222) [09:18:39] Is it possible to run a python code using only browser on the Toolforge? 
[09:19:31] (03PS3) 10Giuseppe Lavagetto: rsyslog::conf: validate priority with validate_numeric [puppet] - 10https://gerrit.wikimedia.org/r/365570 [09:20:50] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo) [09:22:03] 10Operations, 10Salt: Old salt grains not removed if a role changes - https://phabricator.wikimedia.org/T115983#3455693 (10fgiunchedi) 05Open>03declined Salt is on its way out [09:23:33] 10Puppet, 10Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#3455698 (10hashar) [09:24:03] 10Puppet, 10Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#2864105 (10hashar) I have updated the workaround using the one I originally wrote on T148929. The proposed one did not work for me on CI instances with a self puppet master. [09:24:10] 10Operations, 10Reading-Admin, 10Reading-Community-Engagement, 10Traffic: TEST: redirect small portion of unauthenticated desktop users to mobile web - https://phabricator.wikimedia.org/T117826#3455702 (10fgiunchedi) [09:25:47] No one actually keep in here right now isn't it *drink tea* [09:26:29] r96340: you should ask on #wikimedia-cloud, toolforge isn't really an operations thing [09:27:17] Okay, thanks for *redirect* [09:27:23] (03CR) 10Volans: "@elukey: anyway we could avoid to hardcode the value?" 
[puppet] - 10https://gerrit.wikimedia.org/r/366229 (https://phabricator.wikimedia.org/T170740) (owner: 10Alexandros Kosiaris) [09:27:50] elukey: s/anyway/any way/ ofc :-P [09:29:43] volans: heap_size is not in the manifest anymore and it didn't seem to be super different if it was there or directly in the unit, but we can do it :) [09:33:43] 10Operations: Default gateway unreachable on baham.wikimedia.org after reboot - https://phabricator.wikimedia.org/T131966#2184436 (10fgiunchedi) Has this been reoccuring lately? for dns servers an explanation might be ferm vs @resolve vs dns load ordering like we saw in stretch in {T166653} [09:34:21] (03CR) 10Hashar: "On a freshly created instance, puppet reports issues on the initial provisioning:" [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) (owner: 10Thcipriani) [09:34:28] (03CR) 10Hashar: [C: 04-1] CI/integration: Create role for docker CI agent [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) (owner: 10Thcipriani) [09:36:44] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/365571 (owner: 10Giuseppe Lavagetto) [09:37:53] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3455737 (10Marostegui) So for the record, after: T166344#3455435 we got: ``` ˜/icinga-wm 9:04> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ``` The... 
[09:39:03] (03PS1) 10Muehlenhoff: Add cache::misc hosts to network constants [puppet] - 10https://gerrit.wikimedia.org/r/366526 [09:39:23] (03CR) 10Jcrespo: "> I still think the prometheus::mysqld_exporter part could be tweaked to support both multiinstance and single instance in one patch" [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo) [09:40:22] (03CR) 10Elukey: Make /entity/ redirect internal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [09:42:43] volans: ahhh now I've read the whole comment, yes maybe on labs it can trigger unexpected changes, will try to parametrize it [09:43:18] as I said, there are other things too, j.oe is aware, we were chatting about it the other day [09:43:44] so not a big issue for me if not parameterized this time and then we do all together later [09:44:02] also I'm probably one of the few (if not only) people that runs puppetdb in labs ;) [09:44:50] ahahhaha this is why! [09:49:06] 10Operations, 10monitoring, 10User-fgiunchedi: save grafana dashboards in revision control / puppet - https://phabricator.wikimedia.org/T133392#3455743 (10fgiunchedi) [09:49:47] (03CR) 10Ema: [C: 031] Add cache::misc hosts to network constants [puppet] - 10https://gerrit.wikimedia.org/r/366526 (owner: 10Muehlenhoff) [09:50:55] elukey: yeah! 
I can give you the diff of the commit I have there to make it work ;) [09:51:17] (03PS3) 10Elukey: puppetdb: Bump Java Heap max size to 6GB [puppet] - 10https://gerrit.wikimedia.org/r/366229 (https://phabricator.wikimedia.org/T170740) (owner: 10Alexandros Kosiaris) [09:51:30] this is a test, not sure if the best approach --^ [09:51:42] if it's a test feel free to merge it hardcoded then [09:53:32] 10Operations, 10Puppet: Reboot during puppet run causes /var/lib/puppet/state/agent_catalog_run.lock to be left and puppet to not start running again - https://phabricator.wikimedia.org/T127602#3455750 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This was eventually resolved, we're running puppet 3.8... [09:55:16] (03CR) 10Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/7111/nitrogen.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/366229 (https://phabricator.wikimedia.org/T170740) (owner: 10Alexandros Kosiaris) [09:55:23] 10Operations: Make inventory of (private) data backups on all systems - https://phabricator.wikimedia.org/T83522#3455756 (10fgiunchedi) [09:57:35] 10Operations: export logs to logstash or create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#3455759 (10fgiunchedi) 05Open>03declined Agreed, resolving in favor of {T97297} [10:01:39] (03PS1) 10Urbanecm: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366529 (https://phabricator.wikimedia.org/T171146) [10:02:21] 10Operations: restrict access to puppet logs - https://phabricator.wikimedia.org/T84242#3455769 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi only root and group adm can read puppet logs/syslog in production ATM ``` stretch$ sudo find /var/log/syslog /var/log/puppet/ /var/log/puppet.log -ls 279736... 
[10:03:07] 10Operations, 10community-labs-monitoring: Monitor internal CA expirations - https://phabricator.wikimedia.org/T171157#3455772 (10faidon) [10:03:20] 10Operations, 10monitoring: Monitor internal CA expirations - https://phabricator.wikimedia.org/T171157#3455784 (10faidon) [10:05:28] (03PS4) 10Elukey: puppetdb: Bump Java Heap max size to 6GB [puppet] - 10https://gerrit.wikimedia.org/r/366229 (https://phabricator.wikimedia.org/T170740) (owner: 10Alexandros Kosiaris) [10:08:53] 10Operations, 10Ops-Access-Requests: Requesting access to tools.speedydeletionwikia for Dylann1024 (Nathan Larson) - https://phabricator.wikimedia.org/T171130#3455790 (10Aklapper) 05Open>03stalled @Mdupont: Please clarify what and where is "tools.speedydeletionwikia", what is there to "add", why you think... [10:12:36] (03CR) 10Elukey: "pcc looks good for the new diff too: https://puppet-compiler.wmflabs.org/compiler02/7112/nitrogen.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/366229 (https://phabricator.wikimedia.org/T170740) (owner: 10Alexandros Kosiaris) [10:12:44] 10Operations, 10MediaWiki-Platform-Team, 10monitoring: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729#3455797 (10fgiunchedi) + #mediawiki-platform-team for poolcounter-related tasks [10:13:21] 10Operations, 10MediaWiki-Platform-Team, 10monitoring: High levels of PoolCounter errors should trigger alerts - https://phabricator.wikimedia.org/T133318#3455803 (10fgiunchedi) + #mediawiki-platform-team for poolcounter-related tasks [10:17:05] 10Operations: Default gateway unreachable on baham.wikimedia.org after reboot - https://phabricator.wikimedia.org/T131966#3455814 (10MoritzMuehlenhoff) No, that specific problem were unrelated, ferm not starting was one of the symptoms but the root cause (DNS resolution only working late) also affected other sys... 
[10:21:59] 10Operations: add contract end dates to the ops maint & contract gcal - https://phabricator.wikimedia.org/T84585#3455819 (10fgiunchedi) p:05Normal>03Low [10:43:10] (03PS1) 10Thiemo Mättig (WMDE): Simplify Wikibase "unitStorage" configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366533 (https://phabricator.wikimedia.org/T171107) [10:44:02] (03CR) 10Thiemo Mättig (WMDE): [C: 04-1] "I'm setting a temporary -1 to make it more obvious this should not be merged before the required code is actually deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366533 (https://phabricator.wikimedia.org/T171107) (owner: 10Thiemo Mättig (WMDE)) [10:44:35] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [10:48:04] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3455871 (10Marostegui) ``` ˜/icinga-wm 12:44> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ``` [10:59:48] (03PS1) 10Filippo Giunchedi: releases: switch to using secret() [puppet] - 10https://gerrit.wikimedia.org/r/366539 (https://phabricator.wikimedia.org/T79881) [11:16:20] 10Operations, 10Cloud-Services, 10Release-Engineering-Team, 10Patch-For-Review: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034394 (10hashar) That is happening again after something got restarted yesterday. 
Filled as T171158 [11:27:44] (03PS1) 10Ladsgroup: Fix hywiki big and medium logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366545 [11:38:49] (03PS1) 10Phuedx: Revert "Revert "Stop RelatedArticles A/B test and clean up config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366546 (https://phabricator.wikimedia.org/T169948) [11:41:22] 10Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029#3455969 (10ema) [11:41:24] 10Operations, 10Patch-For-Review: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3455968 (10ema) 05Resolved>03Open [11:43:43] 10Operations, 10Patch-For-Review: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3455970 (10ema) Reopening, I've just seen this happening again on cp1066. This is what `systemd-analyze blame` reported after a slow but succ... [11:52:40] 10Operations, 10Patch-For-Review: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3455983 (10ema) So [[ http://lxr.linux.no/linux/arch/x86/events/Kconfig | Kconfig ]] says that PERF_EVENTS_INTEL_CSTATE is about perf events... [11:53:20] right, out for 20 minutes for lunch [11:56:02] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3455991 (10Jayprakash12345) @MF-Warburg Sir, Can you update us. This task already taked long time. [11:56:14] phuedx: herd you like reverts... 
;) https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0July.C2.A020 [11:59:17] (03PS1) 10Ema: base::kernel: blacklist intel_cstate and intel_rapl_perf [puppet] - 10https://gerrit.wikimedia.org/r/366548 (https://phabricator.wikimedia.org/T162612) [12:01:04] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3455994 (10Urbanecm) p:05Low>03Lowest Before addressing concerns above we can't (and won't) do anything. Hence lowering the task's priority to Lowest. [12:02:57] zeljkof: lol! it wasn't me! :P [12:05:24] phuedx: deploying for a friend? ;) [12:19:47] zeljkof: i'm now responsible for the change [12:20:08] but afaict it was scheduled accidentally and no one was around to verify that it was working correctly [12:21:28] Warning: me and ema are going to depool acamar in codfw to see how traffic changes on kafka2* for T171048 [12:21:28] T171048: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048 [12:21:43] going to silence the hosts to avoid unnecessary pages [12:22:45] ema: whenever you are ready :) [12:23:02] phuedx: want to deploy it yourself? or should I? 
[12:23:24] (during swat, not right now, I mean) [12:25:20] elukey: alright, depooling acamar [12:25:29] !log ema@neodymium conftool action : set/pooled=no; selector: name=acamar.wikimedia.org [12:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:39] ema: from pybal logs only 200[13] failed :) [12:27:44] elukey: indeed [12:28:02] \o/ [12:28:19] elukey: why, we don't know [12:28:40] I didn't see anything hanging on tcpdump, weird [12:28:59] oh but look [12:29:06] it takes a while to get a response [12:29:59] $ time host -t A kafka2001.codfw.wmnet [12:30:03] kafka2001.codfw.wmnet has address 10.192.0.139 [12:30:03] real 0m1.048s [12:30:21] and we've got options timeout:1 in resolv.conf [12:30:30] oh yes [12:30:37] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:52] and now it's fast again [12:31:15] $ time host -t A kafka2001.codfw.wmnet [12:31:15] kafka2001.codfw.wmnet has address 10.192.0.139 [12:31:16] real 0m0.012s [12:31:30] very interesting [12:31:56] zeljkof: could you deploy it while i qa it plz? 
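The two `time host -t A kafka2001.codfw.wmnet` runs above (about 1.05 s while acamar was depooled, 0.012 s shortly after, with `options timeout:1` in resolv.conf) can be reproduced as a repeatable measurement. A minimal Python sketch, assuming only the hostname from the log; the names `time_lookup` and `slow_lookups` and the 1-second threshold are illustrative and were not used in the channel:

```python
import socket
import time


def time_lookup(name, resolve=socket.gethostbyname, clock=time.monotonic):
    """Return (address_or_None, elapsed_seconds) for one lookup of `name`.

    `resolve` and `clock` are injectable (hypothetical parameters for this
    sketch) so the helper can be exercised without real DNS traffic.
    """
    start = clock()
    try:
        address = resolve(name)
    except OSError:
        # socket.gaierror is a subclass of OSError: record a failed lookup.
        address = None
    return address, clock() - start


def slow_lookups(samples, threshold=1.0):
    """Return the samples whose elapsed time is at or above `threshold`.

    The default of 1.0 s mirrors `options timeout:1`: a lookup that took
    that long most likely waited for one resolver to time out first.
    """
    return [sample for sample in samples if sample[1] >= threshold]
```

Collecting `time_lookup("kafka2001.codfw.wmnet")` samples before, during and after a depool and passing them to `slow_lookups` would show how long the slow window lasts. This only measures the symptom; it does not explain why the first resolver stops answering.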
[12:32:03] (just makes it a little smoother on my side) [12:32:37] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/7114/bromine.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/366539 (https://phabricator.wikimedia.org/T79881) (owner: 10Filippo Giunchedi) [12:32:41] (03CR) 10Filippo Giunchedi: [C: 031] releases: switch to using secret() [puppet] - 10https://gerrit.wikimedia.org/r/366539 (https://phabricator.wikimedia.org/T79881) (owner: 10Filippo Giunchedi) [12:32:45] (03CR) 10Filippo Giunchedi: [C: 032] releases: switch to using secret() [puppet] - 10https://gerrit.wikimedia.org/r/366539 (https://phabricator.wikimedia.org/T79881) (owner: 10Filippo Giunchedi) [12:33:18] phuedx: sure [12:34:07] ema: so the best solution would definitely be respecting statsd.eqiad.wmnet's ttl, but not sure how to do it nicely in eventlogging's code [12:34:49] atm it fires a dns resolution for statsd.eqiad.wmnet for each sendto [12:34:49] elukey: but why does name resolution time out when depooling acamar? [12:34:58] and why does it go back to normal after a while? [12:35:16] the answer my friend is blowing in cpython [12:35:30] :D [12:35:36] well I've tried resolving from the CLI with `host` and that also was slow [12:35:49] yep yep I was kidding [12:36:18] ema: what happens when we re-pool? 
IIRC pybal didn't show any issue [12:36:37] elukey: I think so too, let's try [12:36:40] we could repeat the experiment with some scripts that just do host statsd.eqiad.wmnet in a loop [12:37:08] repooling acamar [12:37:18] !log ema@neodymium conftool action : set/pooled=yes; selector: name=acamar.wikimedia.org [12:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:34] all good [12:37:45] elukey: yup, all fast and nice [12:40:32] elukey: so I've captured some DNS req/responses on kafka2001 around 12:29 and they all look fine to me (kafka2001:~ema/y.log) [12:40:54] ema: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&var-server=kafka2002&var-datasource=codfw%20prometheus%2Fops - didn't expect it on kafka2002 [12:43:57] elukey: does 2002 talk with 200[13]? [12:44:16] maybe the decrease in network traffic is due to the fact that they were in trouble [12:44:22] the kafka brokers definitely do, so this might be ok yes [12:45:00] ema: last try, can we add statsd to kafka200[13] /etc/hosts and depool acamar again?
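The per-`sendto` resolution of statsd.eqiad.wmnet discussed above could be avoided by caching the resolved address for a bounded time, which is roughly what the /etc/hosts entry does by hand. A minimal Python sketch, assuming a fixed cache lifetime instead of the record's real TTL (the standard `socket` module does not expose TTLs); `CachedAddress`, `max_age` and `send_metric` are hypothetical names and are not taken from eventlogging's code:

```python
import socket
import time


class CachedAddress:
    """Resolve a hostname once and reuse the result for `max_age` seconds."""

    def __init__(self, host, port, max_age=60.0,
                 resolve=socket.gethostbyname, clock=time.monotonic):
        self.host = host
        self.port = port
        self.max_age = max_age      # assumed lifetime, not the DNS TTL
        self._resolve = resolve     # injectable for testing
        self._clock = clock         # injectable for testing
        self._address = None
        self._expires = 0.0

    def get(self):
        """Return (ip, port), re-resolving only when the cache has expired."""
        now = self._clock()
        if self._address is None or now >= self._expires:
            try:
                self._address = (self._resolve(self.host), self.port)
                self._expires = now + self.max_age
            except OSError:
                # Keep serving the stale address if a refresh fails;
                # re-raise only when nothing has been cached yet.
                if self._address is None:
                    raise
        return self._address


def send_metric(sock, target, payload):
    """Send one UDP datagram to the cached address."""
    sock.sendto(payload, target.get())
```

With something like `CachedAddress("statsd.eqiad.wmnet", 8125)` (8125 being the conventional statsd port, assumed here), a slow recursor would delay at most one send per `max_age` window rather than every datagram. Serving a stale address on refresh failure is a design choice of this sketch; it retries resolution on each later call until one succeeds.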
[12:45:16] elukey: sure [12:45:30] (03PS2) 10Gilles: Send Thumbor error log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) [12:46:13] elukey: statsd added to /etc/hosts [12:47:01] ah I saw it live, I didn't mean to say "you have to do it" :D [12:47:04] thankssss [12:47:13] all right let's depool poor acamar again [12:47:51] wait I also want to check what's going on with dns timeouts :) [12:48:03] BRB, someone at the door [12:48:30] ah yes the scripts right [12:51:59] (03PS3) 10Gilles: Send Thumbor error log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) [12:54:08] elukey: ok, I'm resolving on 2003 once every 5 seconds and logging tcpdump's output [12:54:43] ema: let's do it [12:55:07] !log ema@neodymium conftool action : set/pooled=no; selector: name=acamar.wikimedia.org [12:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:53] elukey: eventbus seems fine judging from pybal's output [12:55:58] (03CR) 10Gilles: "It turns out that there is already a logstash udp json endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) (owner: 10Gilles) [12:56:34] yep, I clearly saw your statsd dns queries on kafka2003 slowing down [12:57:09] no traffic drop from prometheus [12:57:36] ok and I see fast DNS responses, repooling [12:58:20] !log ema@neodymium conftool action : set/pooled=yes; selector: name=acamar.wikimedia.org [12:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:47] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170720T1300). Please do the needful. [13:00:05] Urbanecm and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:24] o/ [13:01:37] o/ [13:01:42] I can SWAT today! [13:01:56] Urbanecm: around for SWAT? [13:03:59] I will deploy Urbanecm's change (since there is nothing to check there) and then phuedx's [13:04:11] phuedx: you can test your change at mwdebug, right? [13:04:53] zeljkof: yarrrp [13:05:21] phuedx: will ping you in a few minutes when the commit is there [13:06:42] sure [13:07:48] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366529 (https://phabricator.wikimedia.org/T171146) (owner: 10Urbanecm) [13:08:52] zeljkof, I'm here [13:09:19] Urbanecm: great, but there is nothing for you to do, right? :) [13:09:47] zeljkof, yeah. This patch can't be broken (because jenkins) :) [13:10:12] Urbanecm: I will deploy it as soon as it is merged (waiting for Jenkins) [13:10:19] (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366529 (https://phabricator.wikimedia.org/T171146) (owner: 10Urbanecm) [13:10:21] zeljkof, thank you [13:10:24] Merged :) [13:10:29] (03CR) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366529 (https://phabricator.wikimedia.org/T171146) (owner: 10Urbanecm) [13:12:37] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:366529|Add new throttle rule (T171146)]] (duration: 00m 48s) [13:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:48] T171146: Mass account creation for Javanese Wikipedia - https://phabricator.wikimedia.org/T171146 [13:13:05] Urbanecm: deployed, thanks for deploying with releng ;) [13:13:12] phuedx: reviewing your commit [13:14:15] phuedx: should I deploy one file before the other? 
or it does not matter? [13:15:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366546 (https://phabricator.wikimedia.org/T169948) (owner: 10Phuedx) [13:16:16] zeljkof: in this case, it shouldn't matter [13:16:39] the variables removed from commonsettings.php shouldn't be referenced by any code running in production [13:16:55] same with the variable removed from initialisesettings.php [13:17:16] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:16] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:16] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:17:55] 10Operations, 10monitoring: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353#3456141 (10faidon) p:05Triage>03Normal [13:17:59] phuedx: ok; looks like CI is a bit busy, it might take a minute or two to merge [13:18:05] ugh, I'll take a look at thumbor1001 [13:19:58] (03Merged) 10jenkins-bot: Revert "Revert "Stop RelatedArticles A/B test and clean up config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366546 (https://phabricator.wikimedia.org/T169948) (owner: 10Phuedx) [13:20:06] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:20:06] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [13:20:06] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [13:20:09] (03CR) 10jenkins-bot: Revert "Revert "Stop RelatedArticles A/B test and clean up config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366546 (https://phabricator.wikimedia.org/T169948) (owner: 10Phuedx) [13:20:47] PROBLEM - Host cp1050 is DOWN: PING CRITICAL - 
Packet loss = 100% [13:20:53] 10Operations, 10Multimedia, 10monitoring: Create grafana dashboard for video scaler job runners - https://phabricator.wikimedia.org/T163033#3456152 (10faidon) p:05Triage>03Low [13:22:56] phuedx: your commit is at mwdebug1002, please test and let me know if I can continue [13:23:08] ta [13:24:12] 10Operations, 10monitoring: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635#3456159 (10faidon) [13:24:17] godog: ^ [13:25:31] zeljkof: lgtm, anything in the logs about the config variables? [13:25:38] actually, sec [13:26:00] 10Operations, 10hardware-requests, 10monitoring, 10Patch-For-Review: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3456179 (10faidon) p:05Normal>03High [13:26:02] phuedx: I don't see anything new in the logs [13:26:15] forgot about the "debug log" feature of x-wikimedia-debug [13:27:08] ok, nothing in logstash about those variables [13:27:11] zeljkof: lgtm [13:27:27] phuedx: ok, deploying [13:28:06] 10Operations, 10monitoring, 10Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#3456184 (10faidon) 05stalled>03Resolved So @godog mentioned today that we can't actually recover the Torrus data from Bacula, as these were lost forever :( We're still lacking a good solution for monit... 
[13:28:39] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:366546|Revert "Revert "Stop RelatedArticles A/B test and clean up config"" (T169948)]] (duration: 00m 47s) [13:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:50] T169948: Stop RelatedArticles A/B test and clean up config - https://phabricator.wikimedia.org/T169948 [13:29:33] !log cp1050 stuck rebooting, power-cycling [13:29:41] !log downtimed restbase-dev100[1-3] to power off and move ssds to newly racked restbase-dev100[4-6] phab task: T166181 [13:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:45] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:366546|Revert "Revert "Stop RelatedArticles A/B test and clean up config"" (T169948)]] (duration: 00m 46s) [13:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:52] T166181: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181 [13:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:31] phuedx: deployed, logs look ok, please check [13:32:26] (03CR) 10Andrew Bogott: [C: 032] update labtestservices2001.wikimedia.org.crt with the new ca [puppet] - 10https://gerrit.wikimedia.org/r/366515 (https://phabricator.wikimedia.org/T171116) (owner: 10Andrew Bogott) [13:32:31] (03PS2) 10Andrew Bogott: update labtestservices2001.wikimedia.org.crt with the new ca [puppet] - 10https://gerrit.wikimedia.org/r/366515 (https://phabricator.wikimedia.org/T171116) [13:32:32] zeljkof: on it [13:32:51] 10Operations, 10MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3456205 (10Anomie) [13:33:14] 10Operations, 10MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3456205 (10Anomie) [13:33:44] 10Operations, 
10MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3456225 (10MoritzMuehlenhoff) I'll take care of that early next week. [13:33:57] 10Operations, 10MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3456226 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:36:22] zeljkof: looking ok [13:37:38] phuedx: great! [13:37:55] !log EU SWAT finished [13:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:57] 10Operations, 10monitoring, 10netops: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3456248 (10faidon) [13:42:14] !log cp1050 stuck at 'Initializing firmware interfaces...', trying to powerdown/powerup [13:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:17] phuedx: oh, forgot, thanks for deploying with #releng ;) [13:45:35] zeljkof: who else would i deploy with? [13:45:42] best service in this channel! [13:46:20] :) [13:46:46] 10Operations, 10monitoring: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635#3456272 (10fgiunchedi) It is a lot of development history between the two releases (https://github.com/python-diamond/Diamond/compare/v3.5...v4.0.515) and I'd say some updated/improved collectors es... [13:47:12] out for a little while [13:47:50] 10Operations, 10monitoring: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635#3456274 (10faidon) If you've backported it already, yeah, we can go forward I'd say :) We can leave trusty behind too, I don't see this as a big deal at all. 
[13:50:36] RECOVERY - Host cp1050 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:51:21] 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3456276 (10fgiunchedi) [13:52:27] !log uprading nodejs on wtp* [13:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:15] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3456280 (10Marostegui) As we discussed, it would be a good idea to do a switchover and get rid of this host, at least as a master of m1. I have two proposals about how we can do it. #1 Take... [13:54:28] 10Operations, 10Commons, 10Thumbor, 10Traffic, 10media-storage: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3456281 (10Jeff_G) Another new symptom using the same browsers as in the original description: https://upload.wikimedia.org/wikipedia/commons/thumb... [13:55:55] 10Operations, 10ops-eqiad, 10Traffic: cp1050 apparently stuck while "Initializing firmware interfaces..." - https://phabricator.wikimedia.org/T171168#3456283 (10ema) [13:56:11] 10Operations, 10ops-eqiad, 10Traffic: cp1050 apparently stuck while "Initializing firmware interfaces..." - https://phabricator.wikimedia.org/T171168#3456296 (10ema) p:05Triage>03Normal [13:56:36] 10Operations, 10ops-eqiad, 10Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3456297 (10ema) @Cmjohnson please replace the disk (sda) whenever you've got the chance! 
[13:57:40] !log mobrovac@tin Started deploy [restbase/deploy@5aa7bc1] (staging): (no justification provided) [13:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:28] 10Operations, 10Analytics, 10EventBus, 10User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048#3456304 (10elukey) Me and @ema set up an experiment, namely adding `10.64.32.155 statsd.eqiad.wmnet` in /etc/hosts on kafka2002 (and not on the othe... [13:59:11] !log mobrovac@tin Finished deploy [restbase/deploy@5aa7bc1] (staging): (no justification provided) (duration: 01m 31s) [13:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:41] !log mobrovac@tin Started deploy [restbase/deploy@5aa7bc1]: Translation API bug fix [13:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:06] !log test diamond 4.0.515-4~bpo8+1 on cp1008 [14:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:15] 10Operations, 10Trebuchet: pmtpa remnants in trebuchet redis - https://phabricator.wikimedia.org/T111301#3456324 (10faidon) 05Open>03declined I think we can safely decline this ahead of time by 2½ months :) [14:05:18] 10Operations, 10Traffic: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442#3456328 (10ema) Forwarding-only caching resolvers would help with issues such as T171048 and T151643. 
[14:07:39] !log mobrovac@tin Finished deploy [restbase/deploy@5aa7bc1]: Translation API bug fix (duration: 07m 58s) [14:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:28] !log upgrading apache on labs via "dpkg -s apache2 && apt-get -y install apache2" [14:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:11] (03PS4) 10Mobrovac: Add the Scap3 configuration [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/366466 (https://phabricator.wikimedia.org/T116340) [14:12:07] (03PS4) 10Mobrovac: Add the Scap configuration [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/366404 (https://phabricator.wikimedia.org/T137371) [14:23:37] !log upload diamond 4.0.515-4~bpo8+1 to jessie-wikimedia - T97635 [14:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:49] T97635: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635 [14:27:07] PROBLEM - DPKG on copper is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:28:06] yes yes [14:29:07] RECOVERY - DPKG on copper is OK: All packages OK [14:29:11] 10Operations, 10Commons, 10Thumbor, 10Traffic, 10media-storage: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3456387 (10Aklapper) No such problems in Firefox 54 or Chromium 59 on a Linux desktop. Issue seems to be browser / platform specific? 
[14:31:58] !log ema@neodymium conftool action : set/pooled=no; selector: name=acamar.wikimedia.org [14:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:07] PROBLEM - DPKG on thumbor1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:32:16] that's me ^ [14:32:21] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked): Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#3456390 (10mobrovac) 05Open>03declined The source repo is never ahead of the deploy repo for more than a couple of days, so th... [14:32:34] 10Operations, 10Citoid, 10VisualEditor, 10Services (done): Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#3456392 (10mobrovac) [14:33:32] !log ema@neodymium conftool action : set/pooled=yes; selector: name=acamar.wikimedia.org [14:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:16] RECOVERY - DPKG on thumbor1001 is OK: All packages OK [14:38:46] PROBLEM - DPKG on thumbor1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:40:46] RECOVERY - DPKG on thumbor1002 is OK: All packages OK [14:41:20] !log upload diamond 4.0.515-4~bpo8+2 to jessie-wikimedia - T97635 [14:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:30] T97635: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635 [14:51:20] 10Operations, 10monitoring: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635#3456476 (10fgiunchedi) I tried on cp1008 and a couple of thumbor machines and diamond seems to work just fine, package is uploaded and pending rollout to jessie machines [14:52:02] (03PS1) 10Ema: varnish cachestats.py: cache statsd server IP [puppet] - 10https://gerrit.wikimedia.org/r/366564 (https://phabricator.wikimedia.org/T151643) [14:53:09] 10Operations, 
10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Services, 10Release-Engineering-Team (Kanban): a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456488 (10hashar) [14:55:02] 10Operations, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Services, 10Release-Engineering-Team (Kanban): a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456488 (10Paladox) Now that puppet is fixed, you can either wait a few hours for puppet t... [14:55:55] 10Operations, 10Beta-Cluster-Infrastructure, 10Cloud-Services, 10Services, 10Release-Engineering-Team (Kanban): a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456519 (10hashar) [15:00:57] 10Operations, 10Cassandra, 10Services (blocked), 10User-Joe: Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735#3456560 (10Joe) [15:01:44] 10Operations, 10Dumps-Generation, 10Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3456564 (10ArielGlenn) For the rolling rsync to be effective for the larger files (revision history content, primarily), the dumps runner should be notified a... [15:07:47] 10Operations, 10ORES, 10Scoring-platform-team-Backlog, 10Graphite, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3415026 (10Halfak) p:05Normal>03High [15:08:26] 10Operations, 10ORES, 10Scoring-platform-team-Backlog, 10Graphite, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3415026 (10Halfak) I think we'd like to keep some high level metrics forever, others for just 90 days and many for just 30 days. Is... 
[15:13:41] 10Operations, 10Beta-Cluster-Infrastructure, 10Services, 10VPS-Projects, 10Release-Engineering-Team (Kanban): a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456611 (10bd808) [15:14:12] 10Operations, 10ORES, 10Scoring-platform-team-Backlog, 10Graphite, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3415026 (10ori) Take a look at https://github.com/wikimedia/puppet/blob/c2543d7f80fefbe39901897882c60d91d98c3950/modules/role/manifes... [15:15:14] 10Operations, 10Beta-Cluster-Infrastructure, 10Services, 10VPS-Projects, 10Release-Engineering-Team (Kanban): New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3456618 (10Ottomata) [15:18:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3456652 (10RobH) a:05Cmjohnson>03RobH [15:18:32] 10Operations, 10Beta-Cluster-Infrastructure, 10Services, 10VPS-Projects, 10Release-Engineering-Team (Kanban): New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3456656 (10hashar) [15:19:23] (03PS2) 10Ema: varnish cachestats.py: cache statsd server IP [puppet] - 10https://gerrit.wikimedia.org/r/366564 (https://phabricator.wikimedia.org/T151643) [15:20:09] 10Operations, 10Beta-Cluster-Infrastructure, 10Services, 10VPS-Projects, 10Release-Engineering-Team (Kanban): New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3456618 (10hashar) Seems the initial puppet run refuses to process for whatever rea... 
[15:20:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services (watching), 10User-fgiunchedi: Decommisson restbase-dev100[1-3] - https://phabricator.wikimedia.org/T171179#3456666 (10Cmjohnson) [15:21:32] 10Operations, 10Analytics, 10EventBus, 10Patch-For-Review, 10User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048#3452124 (10fgiunchedi) FYI caching statsd name forever also has its problems when failing over statsd as many services need re... [15:22:04] (03PS1) 10RobH: restbase-dev100[456] replacing restbase-dev100[123] [puppet] - 10https://gerrit.wikimedia.org/r/366572 (https://phabricator.wikimedia.org/T166181) [15:22:45] (03PS1) 10Muehlenhoff: Add debdeploy client to detect library restarts (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/366573 [15:22:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3456698 (10RobH) [15:23:17] (03CR) 10RobH: [C: 032] restbase-dev100[456] replacing restbase-dev100[123] [puppet] - 10https://gerrit.wikimedia.org/r/366572 (https://phabricator.wikimedia.org/T166181) (owner: 10RobH) [15:27:46] PROBLEM - Host restbase-dev1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:27:57] PROBLEM - Host restbase-dev1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:09] (03CR) 10Filippo Giunchedi: "LGTM, modulo a nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/366564 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [15:29:40] (03CR) 10Dzahn: "this one is not actually served via varnish. librenms.wikimedia.org is an alias for netmon1002.wikimedia.org." 
[puppet] - 10https://gerrit.wikimedia.org/r/366519 (owner: 10Muehlenhoff) [15:40:13] 10Operations, 10ops-eqiad, 10netops: Replace cr1/2-eqiad air filters - https://phabricator.wikimedia.org/T170138#3456762 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson done [15:43:57] 10Operations, 10ops-eqiad: labsdb1001: Investigate eth0 wrong negotiated interface speed - https://phabricator.wikimedia.org/T137555#2371671 (10Marostegui) >>! In T137555#3438553, @akosiaris wrote: > > I am setting it to `stalled` since it seems from T168584#3425254 that the box shouldn't be touched for the... [15:46:55] (03PS3) 10Ema: varnish cachestats.py: cache statsd server IP [puppet] - 10https://gerrit.wikimedia.org/r/366564 (https://phabricator.wikimedia.org/T151643) [15:54:41] ema: nice --^ [15:55:56] RECOVERY - Host ocg1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170720T1600). Please do the needful. [16:00:04] Dereckson, aude, and Amir1: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:16] I'm around on phone [16:00:37] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging [16:00:41] ACKNOWLEDGEMENT - HP RAID on ms-be1016 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T171183 [16:00:45] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3456811 (10ops-monitoring-bot) [16:01:19] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Cloud-VPS: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3456817 (10madhuvishy) @Cmjohnson Do we have an estimate on when these will be racked? These servers being setup are part of our quarterly goal for Q1 - T16... [16:03:08] elukey: let's see if it works :) [16:04:15] Amir1-phone: ok I'll start with your patch [16:04:34] Thanks [16:06:11] (03PS8) 10Filippo Giunchedi: mediawiki: Remove broken wikidata.org/ontology Apache alias [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [16:06:23] 10Operations, 10DBA, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3456828 (10Marostegui) [16:06:35] 10Operations, 10ops-eqiad, 10OCG-General, 10Reading-Web-Backlog (Tracking): ocg1001 is broken - https://phabricator.wikimedia.org/T170886#3446253 (10Cmjohnson) The server was not booting, i did see a h/w error in racadm syslog that pertained to PCIe port...no ports are being used. i did open up and reseat... 
[16:07:21] (03CR) 10Filippo Giunchedi: [C: 032] mediawiki: Remove broken wikidata.org/ontology Apache alias [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [16:09:08] Amir1-phone: it is rolling out now [16:09:57] Thanks. It's not testable as the rule was broken and 404s anyway [16:10:34] Amir1-phone: yeah fairly easy, thanks for being here for puppet swat [16:11:04] godog: Thank you! [16:11:29] aude: here? [16:13:25] Dereckson: here? [16:15:41] 10Operations, 10Cloud-Services: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188#3456889 (10faidon) [16:29:28] (03CR) 10Eevans: [C: 031] Add the Scap3 configuration [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/366466 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [16:29:54] (03CR) 10Eevans: [C: 031] Add the Scap configuration [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/366404 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [16:30:20] 10Operations, 10Beta-Cluster-Infrastructure, 10Services, 10VPS-Projects, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456948 (10hashar) https://wikitech.wikimedia.org/wiki/Incident_documentation/20170719-ldap#CI.2Fbeta [16:32:33] 10Operations, 10Wikibase-DataModel, 10Wikidata, 10Wikidata-Sprint: Remove left-over alias for wikidata.org/ontology (doesn't work) - https://phabricator.wikimedia.org/T169023#3456952 (10thiemowmde) 05Open>03Resolved a:03Krinkle [16:35:38] 10Operations, 10Beta-Cluster-Infrastructure, 10Services, 10VPS-Projects, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456966 (10hashar) So the state as I understand it right now: The puppet master was broken, I had it fixed by removi... 
[16:35:46] 10Operations, 10Beta-Cluster-Infrastructure, 10Services, 10VPS-Projects, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456969 (10hashar) p:05Triage>03High [16:35:49] 10Operations, 10Puppet, 10LDAP: Should puppet auto-restart slapd? - https://phabricator.wikimedia.org/T171191#3456970 (10demon) [16:40:51] Hi [16:41:32] 10Operations, 10Beta-Cluster-Infrastructure, 10Services, 10VPS-Projects, and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3456998 (10hashar) Announced on the QA list pointing back to this task [16:41:48] godog: I'm here [16:46:28] Dereckson: ack, I was looking at your patch https://gerrit.wikimedia.org/r/#/c/354959 [16:48:46] PROBLEM - SSH cp3040.mgmt on cp3040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:17] (03PS3) 10Filippo Giunchedi: Apache: add techconduct.wm.o to remnant sites [puppet] - 10https://gerrit.wikimedia.org/r/354959 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [16:50:17] Dereckson: merging [16:50:27] (03PS1) 10Chad: wikidatawiki back to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366589 (https://phabricator.wikimedia.org/T171107) [16:54:31] (03CR) 10Filippo Giunchedi: [C: 032] Apache: add techconduct.wm.o to remnant sites [puppet] - 10https://gerrit.wikimedia.org/r/354959 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [16:56:01] Dereckson: it is rolling out now, should be done in ~30min [16:56:32] Dereckson: already applied on mwdebug1001 if you want to test [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170720T1700). 
[17:00:28] (03PS3) 10Reedy: Set proofreadpage-showheaders = 1 for tawikisource bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366323 (https://phabricator.wikimedia.org/T169478) [17:03:06] (03CR) 10Smalyshev: "@Faidon please propose alternative approach. So far your position has been "no blocks by IP, period" without offering any alternative solu" [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) (owner: 10Smalyshev) [17:13:11] (03CR) 10Chad: [C: 032] wikidatawiki back to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366589 (https://phabricator.wikimedia.org/T171107) (owner: 10Chad) [17:13:22] (03CR) 10Faidon Liambotis: [C: 04-2] "I'm being asked to review and merge this patch, aren't I? Plus the one that will follow, that presumably will be another privacy policy vi" [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) (owner: 10Smalyshev) [17:13:51] godog: works fine on mwdebug1001 [17:14:13] (it indicates a no wiki found served by our MediaWiki multiversion entry point instead of the generic vhost) [17:14:54] (03Merged) 10jenkins-bot: wikidatawiki back to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366589 (https://phabricator.wikimedia.org/T171107) (owner: 10Chad) [17:14:58] !log killed tranquility instances tranq-banners and tranq-netflow  running on druid1003 in joal's screen sessions [17:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:34] (03PS1) 10RobH: setting restbase-dev to role:spare [puppet] - 10https://gerrit.wikimedia.org/r/366596 (https://phabricator.wikimedia.org/T166181) [17:16:17] (03CR) 10jenkins-bot: wikidatawiki back to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366589 (https://phabricator.wikimedia.org/T171107) (owner: 10Chad) [17:16:27] (03CR) 10RobH: [C: 032] setting restbase-dev to role:spare [puppet] - 10https://gerrit.wikimedia.org/r/366596 
(https://phabricator.wikimedia.org/T166181) (owner: 10RobH) [17:16:50] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: wikidata back to wmf.10 [17:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:20] SMalyshev: Looks like the fixes for wikibase are good, nothing in logstash post rollout [17:21:24] (03PS1) 10Chad: group2 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366597 [17:21:34] (03CR) 10Chad: [C: 04-2] "4 l8r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366597 (owner: 10Chad) [17:23:07] (03CR) 10Smalyshev: "> Plus the one that will follow, that presumably will be another privacy policy violation" [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) (owner: 10Smalyshev) [17:24:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3457127 (10RobH) [17:24:38] SMalyshev: which service in existence you know uses it? [17:24:44] paravoid: uses what? [17:24:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3287641 (10RobH) a:05RobH>03Eevans Assigned to @eevans for followup. These are ready to be used by services, and this task can be resolved once ack... [17:24:59] "All because an obvious solution, which any service in existence I know uses - blocking obvious abusers - is being refused." [17:25:25] paravoid: ah. pretty much any one - including wikis, btw. if you write a bot that spams wiki, don't you get blocked? [17:25:43] paravoid: if you use any major public API outside allowed service parameters - don't you get blocked? [17:25:58] if you send a spam to an ISP - don't you get blocked? 
[17:26:04] lol [17:26:16] I don't think anyone disagrees with blocking those (or some of those) requests [17:26:19] I think we disagree on the how [17:26:32] we don't push commits everytime someone uses a bot against wikis, or sends spams to us [17:26:52] (or against our APIs, for that matter) [17:27:10] these are all handled at the service level, automatically, with throttles and bans and such [17:27:10] paravoid: if there's a system that would allow to do it without commits, I'm more than happy to use it [17:27:22] there isn't any general purpose one, no [17:27:26] paravoid: that's what I am trying to do, bans [17:27:39] only the very crude varnish rate limit, that is already in effect here [17:27:41] unfortunately, since my config is controlled by puppet, I can't do it without puppet [17:28:09] unless I take my config out of puppet and do everything manually. Then I could do it without bothering ops [17:28:18] but I'm not sure that is a good solution? [17:28:32] yeah, obviously the solution to "this shouldn't be this manual" is to even do it more manually :P [17:28:38] do it even* [17:29:13] paravoid: adding and removing IP from abuse lists has to be manual, afaik we do not have any technology that allows to detect this level of outliers automatically [17:29:29] you want to hand-pick individual IPs and ban them -- we don't do that unless it's an absolute high-volume emergency (DDoSes and stuff) [17:29:31] rate limits and timeouts work in 99.99% of cases. This is 0.01% of case where it does not [17:29:47] ...and yet this works in every other service in prod :) [17:30:01] paravoid: for the service, it is high-volume emergency. it almost took the service down twice in the last week. 
[17:30:15] it only didn't because I manually babysat and cleaned up after it [17:30:24] this is small potatoes compared to what I'm talking about [17:30:40] but if I happened to be afk or busy, we'd have 2-3 days downtime by now [17:31:10] paravoid: maybe small potatoes to you, but it's service down for me. so I'd like to have some solution for it [17:31:28] even in the cases of high-volume emergency I'm talking about (which are at a whole different scale), it's usually a stopgap, until the proper fix gets deployed [17:31:50] I don't think we are in agreement that this would be a one-off stopgap, it sounds like you seem to think that this is the right solution to this problem [17:32:41] which I'm afraid it isn't, sorry, not for this, nor for any other public facing service whether it's the wikis, our APIs, or spammers targeting our mailservers [17:39:02] !log arlolra@tin Started deploy [parsoid/deploy@97dbabb]: Updating Parsoid to a89a9cc4 [17:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:04] (03PS3) 10Andrew Bogott: puppetmaster: Don't use storeconfig_thin [puppet] - 10https://gerrit.wikimedia.org/r/365605 [17:41:36] (03CR) 10Andrew Bogott: [C: 032] puppetmaster: Don't use storeconfig_thin [puppet] - 10https://gerrit.wikimedia.org/r/365605 (owner: 10Andrew Bogott) [17:42:27] paravoid: so what do you think would be the right solution to the problem? [17:42:55] build something at the service level that throttles and/or bans expensive queries [17:43:15] paravoid: the fact is there are queries which are expensive, and if sent in large enough volume over sustained period, they will block the service for other users. This is not going to change anytime soon [17:43:41] can you automatically kill queries that run for > X seconds for example? [17:43:56] paravoid: there's no technology I am aware of that can block expensive queries before they did the damage. 
we only know the query is expensive when it took too much time - in which case the damage is already done [17:44:08] paravoid: yes, this is timeout. this is already in place [17:44:26] or recognize the queries in some automated fashion, e.g. examine the query planner ahead of time? [17:44:37] recognize the expensive/damaging ones [17:45:10] paravoid: no, it's not possible. query planner has no idea and query planner is deep inside third party software [17:45:11] what would you do if tomorrow morning someone decides to hammer us with these kind of queries from a hundred thousand IPs all over the internet? [17:45:33] paravoid: then we're screwed. but it didn't happen and probably won't [17:45:49] hope is not a strategy [17:46:03] paravoid: what happens instead that we have people that write broken bots and bad queries [17:46:19] and broken bots instead of failing properly retry the same broken query over and over [17:46:37] seriously, go back a step [17:46:37] again, in 99.99% even that is not a problem, since timeouts/ip limits are enough to deal with it [17:46:49] "we're screwed" is not a strategy [17:46:51] paravoid: ok, which step do you want to go to? [17:47:18] "it didn't happen and probably won't happen" is not a strategy either [17:47:28] paravoid: I never said this is a strategy. We do not have a strategy of dealing with DDoS specially crafted to take our service down. I am not looking for such strategy now. 
[17:47:31] (03PS1) 10RobH: adding additional IPs for cassandra instances on restbase-dev100[456] [dns] - 10https://gerrit.wikimedia.org/r/366604 (https://phabricator.wikimedia.org/T166181) [17:47:56] (03PS2) 10Eevans: WIP: Configure Cassandra for restbase-dev[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/366334 (https://phabricator.wikimedia.org/T171104) [17:47:56] sorry to interpose, if the problem is, mostly, that a single user abuse the system, could we just build something that keeps track of query execution time per user/IP and throttle based on how much expensive they are in the aftermath, so that at the first expensive query they get throttled for a while and/or that specific query gets blocked for a longer period? [17:47:58] (03CR) 10RobH: [C: 032] adding additional IPs for cassandra instances on restbase-dev100[456] [dns] - 10https://gerrit.wikimedia.org/r/366604 (https://phabricator.wikimedia.org/T166181) (owner: 10RobH) [17:47:58] sigh [17:48:01] (03PS2) 10RobH: adding additional IPs for cassandra instances on restbase-dev100[456] [dns] - 10https://gerrit.wikimedia.org/r/366604 (https://phabricator.wikimedia.org/T166181) [17:48:10] !log arlolra@tin Finished deploy [parsoid/deploy@97dbabb]: Updating Parsoid to a89a9cc4 (duration: 09m 09s) [17:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:21] paravoid: that's not what I am looking to solve now. If you want to propose something to solve *that* particular issue, you are more than welcome to, but this is *not* my problem now and I am not looking to solve it [17:48:51] volans: I do not have any means of blocking specific IP [17:49:15] volans: the patch proposed is the way to do it, but paravoid is not fine with it. [17:49:32] how is volans' idea equivalent to your patch?! 
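Editorial sketch, not part of the channel log: volans' suggestion above (track query execution time per user/IP and throttle after the fact) could look roughly like the following. This is a minimal illustration under stated assumptions, not anything deployed or proposed in the patch under discussion. The class name `CostThrottle`, its parameters, and the budget/window/ban values are all hypothetical; it assumes some component in front of the query backend can observe how long each request ran and which client sent it.

```python
import time
from collections import defaultdict, deque


class CostThrottle:
    """Hypothetical per-client throttle keyed on time spent, not request count.

    A client that accumulates more than `budget_seconds` of query run time
    within `window_seconds` is refused for `ban_seconds`.
    """

    def __init__(self, budget_seconds=120.0, window_seconds=600.0, ban_seconds=3600.0):
        self.budget_seconds = budget_seconds
        self.window_seconds = window_seconds
        self.ban_seconds = ban_seconds
        self._history = defaultdict(deque)  # client -> deque of (finished_at, duration)
        self._banned_until = {}             # client -> time the ban expires

    def allow(self, client, now=None):
        """Return False while the client is banned, True otherwise."""
        now = time.monotonic() if now is None else now
        until = self._banned_until.get(client)
        if until is not None:
            if now < until:
                return False
            del self._banned_until[client]  # ban expired
        return True

    def record(self, client, duration, now=None):
        """Record a finished query (including timed-out ones) and ban if over budget."""
        now = time.monotonic() if now is None else now
        history = self._history[client]
        history.append((now, duration))
        # Drop entries that have fallen out of the sliding window.
        while history and history[0][0] <= now - self.window_seconds:
            history.popleft()
        if sum(d for _, d in history) > self.budget_seconds:
            self._banned_until[client] = now + self.ban_seconds
            history.clear()


# Example: two queries that each hit a 60-second timeout exceed a 100-second budget.
throttle = CostThrottle(budget_seconds=100.0)
throttle.record("198.51.100.7", 60.0, now=60.0)
throttle.record("198.51.100.7", 60.0, now=121.0)
blocked = not throttle.allow("198.51.100.7", now=122.0)
```

Because the budget is measured in seconds of execution rather than requests, many cheap queries from a legitimate client stay under it, while a bot that repeatedly retries a query hitting the timeout gets locked out after a couple of attempts. Keying on the client rather than the query text also sidesteps the problem of a bot sending a slightly different query each time.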
[17:49:41] I mean dynamically, not keeping them in a configuration [17:49:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services (watching), 10User-fgiunchedi: Decommisson restbase-dev100[1-3] - https://phabricator.wikimedia.org/T171179#3457232 (10RobH) [17:49:53] volans: I could maybe learn nginx internals and implement, test and deploy new nginx module that does intelligent over-time per-ip tracking, but that would probably take a bit of time... [17:50:37] paravoid: it's not equivalent, mine if much simpler solution that does not require developing complex software from scratch [17:50:54] ok, this isn't going anywhere and it's late [17:50:57] I'm not very familiar with the service, but maybe it needs some very simple middleware "proxy" that handle this throttling? [17:50:59] just FYI, if anyone sees `ConstraintParameterException`s in the logs, that’s my fault, see https://phabricator.wikimedia.org/T171196 [17:51:00] we can agree to disagree if you want [17:51:11] my -2 is not going away regardless, sorry :) [17:51:18] paravoid: if you know any software that does it already please suggest. Otherwise it is not a solution [17:51:37] paravoid: ok, next time service is going down I guess that would just be it [17:51:55] (03PS1) 10Andrew Bogott: remove ldap_host from novaconfig hiera [puppet] - 10https://gerrit.wikimedia.org/r/366605 [17:52:21] volans: there is a proxy - nginx. But nginx does not do this kind of complex request tracking [17:53:36] a single broken bot is bringing down your service [17:54:26] and your solution to that is to go inspect the logs when that happens, find the IP of whoever using that bot, pushing a commit (or asking someone to push a commit to a private repo) and deploy it [17:54:34] that pretty much sums up the situation, right? 
[17:54:50] !log Updated Parsoid to a89a9cc4 (T169293) [17:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:00] T169293: While using ParsoidBatchAPI, [[Media: ]] does not link to media file anymore - https://phabricator.wikimedia.org/T169293 [17:55:06] I've never used it by the first google result leads to https://nginx.org/en/docs/http/ngx_http_limit_req_module.html [17:55:10] paravoid: it doesn't have to be a commit, I don't care how exactly the block happens ,but otherwise yes [17:55:24] volans: yes, this is already in place [17:55:48] ok, glad we had that cleared up [17:56:11] volans: but the limit is 5 reqs/ip/server, and this is enough for the bot to cause damage. I could reduce it further but that would hurt legit users [17:56:51] so I would rather hurt the abuser than the legit users that use service properly and it works for them with 5/ip just fine [17:56:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services (watching), 10User-fgiunchedi: Decommisson restbase-dev100[1-3] - https://phabricator.wikimedia.org/T171179#3457263 (10RobH) a:05Cmjohnson>03RobH I'll start the process of removing these from the repos once the new restbase-dev100[456] are fully online.... [17:57:10] SMalyshev: do you have it setup only for IPs? [17:57:20] volans: what do you mean? [17:57:22] it seems to be that it could be set by IP+URI [17:57:29] and in that case you could use a much lower limit [17:57:35] and have probably both in place [17:57:50] volans: all requests have the same URI [17:58:03] it's /sparql?query=BLAH [17:58:04] (03CR) 10Andrew Bogott: [C: 032] remove ldap_host from novaconfig hiera [puppet] - 10https://gerrit.wikimedia.org/r/366605 (owner: 10Andrew Bogott) [17:58:09] so different query string, same URI [17:58:13] ok then [17:58:17] by query string [17:58:32] anything that is a variable in nginx could be used if I'm reading it correctly [17:59:24] volans: I'm not sure you can use two variables... 
but if I do it per uri that'd require tons of memory to keep I am afraid, since queries are large and every legit user has a different one [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170720T1800). [18:00:04] Amir1 and MatmaRex: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:44] o/ [18:00:44] hello [18:01:46] SMalyshev: probably, but maybe worth a shot to see if a) it's possible and b) make some calculations given the timeframe you want to keep and the service traffic how much memory might be used or c) see if you could use a hash of the query string instead of the raw text [18:02:20] volans: this won't solve a problem in case the bot has slightly different query each time (e.g. different offset) [18:02:37] that happened too yesterday [18:02:55] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3457291 (10Cmjohnson) a:05Cmjohnson>03RobH Disk has been replaced: Return shipping info is USPS 9202 3946 5301 2436 1520 81 FEDEX 96119... [18:03:27] volans: I'm pretty sure nginx doesn't do hashes. Maybe lua does, if we have lua supported... [18:04:50] SMalyshev: substring then? What I'm suggesting is to see if is possible within nginx or consider writing a very simple and little proxy in any language that will sit in the middle between nginx and the backends. Lua might be an option too, if enabled (I'm not sure) [18:05:07] I can SWAT [18:06:00] thanks! [18:06:10] volans: I am not sure any language would do, as it still has to pass traffic efficiently... 
maybe nginx lua module could do it, no idea [18:07:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366545 (owner: 10Ladsgroup) [18:07:42] woth a shot ;) [18:07:45] *worth [18:08:32] (03Merged) 10jenkins-bot: Fix hywiki big and medium logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366545 (owner: 10Ladsgroup) [18:08:39] (03CR) 10jenkins-bot: Fix hywiki big and medium logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366545 (owner: 10Ladsgroup) [18:08:42] volans: also limit_req seems to be opposite of what I need - it limits how many reqs are processed. but I don't care if 1000 reqs are processed per second - my problem is the opposite, that one request takes 1000 seconds [18:09:10] or, in this case, 60 seconds since that's the time limit [18:09:39] (03PS2) 10Andrew Bogott: keystone: add a secondary ldap host [puppet] - 10https://gerrit.wikimedia.org/r/366025 [18:10:55] not sure if a response header/status code could be used, in case you have that info when it times out, but more the constraint more it seems you need a middleware that does this if it's not possible to integrate those feature in the current backend [18:11:05] RainbowSprinkles: are you still prepping stuff on tin or can I remove patch on /srv/mediawiki-staging? [18:11:21] Whoops, forgot to toss that [18:11:22] Gone [18:11:27] thanks :) [18:11:30] yw [18:12:27] volans: well, theoretically we could develop a nginx module for that, but I really don't see how writing a module from scratch, building a deb package for it, maintaining and deploying it in our repo etc. 
is easier than editing a line in a config once or twice per year [18:12:29] (03CR) 10Andrew Bogott: [C: 032] keystone: add a secondary ldap host [puppet] - 10https://gerrit.wikimedia.org/r/366025 (owner: 10Andrew Bogott) [18:12:52] Amir1: hywiki big and medium logos live on mwdebug1002, check please [18:13:18] SMalyshev: that if I was the broken bot, the first thing I'd do if I was blocked is to change IP, it usually starts a never ending whack a mole [18:13:51] volans: if you intend on harming the service, yes. but I don't think this is the case here [18:13:54] (03PS3) 10Eevans: Configure Cassandra for restbase-dev[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/366334 (https://phabricator.wikimedia.org/T171104) [18:14:12] volans: I think it's just somebody who doesn't know what he/she is doing [18:14:18] thcipriani: it looks fine [18:14:27] Amir1: ok, going live [18:14:30] no, not because I want to harm, because if i don't realize that my bot is broken I don't realize that I was blocked and I would just play with things until they work again [18:14:54] volans: getting "blocked" response is pretty good indication you are blocked :) [18:15:13] sure, but why? [18:15:21] it doesn't tell me what I did wrong ;) [18:15:23] I might not get it [18:16:04] volans: at least it a) protects the service and b) delivers the message to the human behind the bot that something is not going well [18:16:20] and I'm not even going to the path that bots should be authenticated or identifiable via UA [18:16:26] and reachable [18:16:34] volans: "should be" meaning? 
[18:16:49] I mean, I am all for it, but I don't see any way of enforcing it [18:16:57] !log thcipriani@tin Synchronized static/images/project-logos: SWAT: [[gerrit:366545|Fix hywiki big and medium logos]] (duration: 00m 47s) [18:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:10] short of requiring pre-registration and API keys [18:17:14] which we're not going to do, hopefully [18:17:38] having very low limits for non-auth users or users with empty or clearly bot UAs for example dunno [18:17:44] but back to the it's a clear message [18:17:50] if now I run the script [18:17:54] Amir1: should be live now [18:17:55] most of my queries hit the timeout [18:18:03] so is not working [18:18:46] thcipriani: Thanks. It looks okay. I tell me friends to double check [18:18:51] it will not change much on the user side blocking it :D [18:18:58] volans: yes, but the damage is done in the meantime. I am more concerned with being able to curtail the damage without hurting the legit users [18:19:28] the clear message on the other end is just a bonus, not a requirement [18:20:20] sure [18:20:27] MatmaRex: your monobook change is live on mwdebug1002, check please [18:21:56] thcipriani: looks good on mw.org [18:21:57] what I suspect is happening is that somebody has this broken bot that is just run on max cylinders without any supervision for like a day or two. And it's pretty noticeable when it happens within a hour or so. 
So if I could get it locked out within an hour instead of 2 days, that'd solve the issue [18:22:09] MatmaRex: ok, going live [18:24:22] !log thcipriani@tin Synchronized php-1.30.0-wmf.10/skins/MonoBook/main.css: SWAT: [[gerrit:366595|Revert "Remove `position: absolute` and z-index from #p-logo"]] T171195 (duration: 00m 47s) [18:24:24] (03CR) 10Eevans: [C: 031] Configure Cassandra for restbase-dev[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/366334 (https://phabricator.wikimedia.org/T171104) (owner: 10Eevans) [18:24:30] ^ MatmaRex live everywhere [18:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:33] T171195: Site logo is no longer clickable in monobook - https://phabricator.wikimedia.org/T171195 [18:25:32] SMalyshev: sure, but hardcoding the IP in a list is not the right solution. It doesn't scale, it's manual, you don't know when removing an item from the list, if this guy is running the bot from a laptop it might change IP each time and you'll continue to block potentially legit users, etc.. [18:26:03] volans: I don;'t need it to scale, I'm not developing a cdn here, it's just a couple of cases per year [18:26:26] thanks thcipriani [18:27:09] volans: and it was the same ip all the time so far, and no legit traffic from that ip as far as I can see [18:28:15] by any chance have you tried to see if we could reach them? [18:29:40] volans: I have no idea how. All I have is the ip [18:30:11] and publishing anything about it is a privacy violation, so I can't just ask "user with this ip please come forward" [18:30:22] whois + abuse email of the provider [18:30:25] MatmaRex: mw.widgets.visibleByteLimit: Temporarily disable whilst OOjs UI label bug is fixed is live on mwdebug1002 for both wmf.9 and wmf.10, check please [18:31:47] volans: by whois, the provider is a large organization with huge ip block, which would not likely to bother to identify a particular person for me, especially given it's not even malicious abuse... 
[18:32:42] thcipriani: looking good [18:32:58] MatmaRex: ok, going live, wmf.10 first then wmf.9 [18:33:09] if it's an abuse you're entitle to write to abuse@, they will not identify the user for you, but likely write to their user a notifying them of the issue, at least that what they should do ;) [18:34:06] anyway, dinner time, I really gotta go, sorry [18:34:12] volans: thanks [18:34:21] 10Operations, 10Beta-Cluster-Infrastructure, 10VPS-Projects, 10Release-Engineering-Team (Kanban), 10Services (watching): New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3457390 (10mobrovac) [18:34:34] hope we can get to a solution somehow :) [18:35:03] 10Operations, 10Beta-Cluster-Infrastructure, 10VPS-Projects, 10Release-Engineering-Team (Kanban), and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3457392 (10mobrovac) [18:35:11] !log thcipriani@tin Synchronized php-1.30.0-wmf.10/resources/src/mediawiki.widgets.visibleByteLimit/mediawiki.widgets.visibleByteLimit.js: SWAT: [[gerrit:366599|mw.widgets.visibleByteLimit: Temporarily disable whilst OOjs UI label bug is fixed]] T169982 (duration: 00m 48s) [18:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:22] T169982: OOjs UI fields with dynamic labels get the user cursor out of position when content it too long (e.g. 
edit summary field) - https://phabricator.wikimedia.org/T169982 [18:36:41] !log thcipriani@tin Synchronized php-1.30.0-wmf.9/resources/src/mediawiki.widgets.visibleByteLimit/mediawiki.widgets.visibleByteLimit.js: SWAT: [[gerrit:366598|mw.widgets.visibleByteLimit: Temporarily disable whilst OOjs UI label bug is fixed]] T169982 (duration: 00m 47s) [18:36:48] ^ MatmaRex live everywhere [18:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:56] thanks [18:37:39] !log upgraded mediawiki version on wikitech-static [18:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:48] (03CR) 10Volans: [C: 032] QueryBuilder: move query string to build() method [software/cumin] - 10https://gerrit.wikimedia.org/r/363750 (owner: 10Volans) [18:42:18] 10Operations, 10wikitech.wikimedia.org: Update mediawiki on wikitech-static - https://phabricator.wikimedia.org/T170854#3457410 (10Andrew) 05Open>03Resolved a:03Andrew Now running 1.29.0 (52abe24) [18:42:35] (03Merged) 10jenkins-bot: QueryBuilder: move query string to build() method [software/cumin] - 10https://gerrit.wikimedia.org/r/363750 (owner: 10Volans) [18:43:35] (03PS4) 10RobH: Configure Cassandra for restbase-dev[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/366334 (https://phabricator.wikimedia.org/T171104) (owner: 10Eevans) [18:44:03] (03CR) 10RobH: [C: 032] Configure Cassandra for restbase-dev[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/366334 (https://phabricator.wikimedia.org/T171104) (owner: 10Eevans) [18:46:02] !log otto@tin Started deploy [eventlogging/analytics@36846d6]: auto add mysql indexes for meta style events [18:46:06] !log otto@tin Finished deploy [eventlogging/analytics@36846d6]: auto add mysql indexes for meta style events (duration: 00m 04s) [18:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:49] PROBLEM - 
puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:48:49] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:57] (03PS1) 10RobH: renamed the hiera-data files for new restbase-dev [puppet] - 10https://gerrit.wikimedia.org/r/366612 (https://phabricator.wikimedia.org/T171104) [18:50:03] we are aware of the restbase-dev issues [18:50:07] and are workign to remedy them shortly [18:50:29] (03PS2) 10RobH: renamed the hiera-data files for new restbase-dev [puppet] - 10https://gerrit.wikimedia.org/r/366612 (https://phabricator.wikimedia.org/T171104) [18:51:13] (03CR) 10RobH: [C: 032] renamed the hiera-data files for new restbase-dev [puppet] - 10https://gerrit.wikimedia.org/r/366612 (https://phabricator.wikimedia.org/T171104) (owner: 10RobH) [18:53:50] PROBLEM - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:55:49] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:57:50] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:57:50] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:00:04] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170720T1900). Please do the needful. 
[19:07:24] (03CR) 10Chad: [C: 032] group2 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366597 (owner: 10Chad) [19:09:38] (03Merged) 10jenkins-bot: group2 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366597 (owner: 10Chad) [19:09:48] (03CR) 10jenkins-bot: group2 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366597 (owner: 10Chad) [19:12:23] PROBLEM - Host labvirt1015 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:03] PROBLEM - cassandra-a SSL 10.64.16.97:7001 on restbase-dev1005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:14:30] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.10 [19:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:54] PROBLEM - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.168 and port 9042: Connection refused [19:14:54] PROBLEM - cassandra-a service on restbase-dev1005 is CRITICAL: NRPE: Command check_cassandra-a-state not defined [19:15:53] PROBLEM - cassandra-b CQL 10.64.16.98:9042 on restbase-dev1005 is CRITICAL: connect to address 10.64.16.98 and port 9042: Connection refused [19:15:53] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:16:43] PROBLEM - Restbase root url on restbase-dev1004 is CRITICAL: connect to address 10.64.0.89 and port 7231: Connection refused [19:16:43] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: NRPE: Command check_cassandra-a-state not defined [19:16:43] PROBLEM - cassandra-b SSL 10.64.16.98:7001 on restbase-dev1005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:17:33] PROBLEM - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.169 and port 9042: Connection refused [19:17:33] PROBLEM - 
cassandra-b service on restbase-dev1005 is CRITICAL: NRPE: Command check_cassandra-b-state not defined [19:18:02] ^^^ that can be ignored [19:18:06] let me silence it [19:18:23] PROBLEM - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.167 and port 9042: Connection refused [19:18:23] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:19:23] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: NRPE: Command check_cassandra-b-state not defined [19:19:23] PROBLEM - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:20:13] PROBLEM - cassandra-a service on restbase-dev1004 is CRITICAL: NRPE: Command check_cassandra-a-state not defined [19:21:44] (03PS1) 10Dzahn: Revert "netmon1001: temp put librenms role back on it" [puppet] - 10https://gerrit.wikimedia.org/r/366617 [19:23:11] (03PS2) 10Dzahn: Revert "netmon1001: temp put librenms role back on it" [puppet] - 10https://gerrit.wikimedia.org/r/366617 [19:23:43] (03CR) 10Dzahn: [C: 032] "this is good - librenms on netmon1002 is fine now, so i can be re-removed here." 
[puppet] - 10https://gerrit.wikimedia.org/r/366617 (owner: 10Dzahn) [19:24:39] (03PS3) 10Dzahn: Revert "netmon1001: temp put librenms role back on it" [puppet] - 10https://gerrit.wikimedia.org/r/366617 (https://phabricator.wikimedia.org/T171018) [19:25:04] 10Operations, 10hardware-requests, 10monitoring, 10Patch-For-Review: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3457561 (10Dzahn) 05stalled>03Open [19:25:54] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [19:26:13] RECOVERY - cassandra-a service on restbase-dev1004 is OK: OK - cassandra-a is active [19:26:23] RECOVERY - puppet last run on restbase-dev1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:31:23] RECOVERY - cassandra-a SSL 10.64.0.167:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-a valid until 2018-07-20 15:08:04 +0000 (expires in 364 days) [19:32:23] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:24] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:32:33] RECOVERY - cassandra-a CQL 10.64.0.167:9042 on restbase-dev1004 is OK: TCP OK - 0.000 second response time on 10.64.0.167 port 9042 [19:34:14] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:34:14] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [19:36:20] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3457615 (10RobH) [19:39:03] PROBLEM - Host ms-be2024 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:17] (03CR) 10Dzahn: "that closed all the firewall holes, set to spare system again" [puppet] - 10https://gerrit.wikimedia.org/r/366617 (https://phabricator.wikimedia.org/T171018) (owner: 10Dzahn) [19:54:32] robh: hi! do you know how long setting up new wdqs systems is going to take? I need to reload wdq1 database, but if it's going to be replaced soon with wdq4/5 then I might not bother... but if it's going to take time I'll reload it [19:54:43] robh: also, could you put 1003 back into the pool? [19:54:56] so i just gave chris the racking task, itll be a day or two i iassume [19:55:09] ah, excellent [19:55:14] so I'll wait with wdq1001 reload then [19:55:35] no point in reloading it if by the time it's done it'll be time for it to be retired anyway :) [19:56:02] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1003.eqiad.wmnet [19:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:14] oh, the old ones are retiring? [19:56:18] all 3 or just 1001 and 1002? [19:56:38] iasked chris to not rack them in the same racks sicne i assumed they all stayed [19:56:44] but they can share with at least 1001 and 1002 it seems =] [19:56:48] robh: I was assuming that... 1001 and 1002, 1003 is newer. 
I'm not 100% sure but I think that was the plan. [19:57:12] oh yeah, 1001 is old [19:57:36] 1002 and 1002 are older afair. gehel will be back on monday and we'll probably discuss it then but I think that was the plan [19:59:47] RECOVERY - cassandra-b service on restbase-dev1005 is OK: OK - cassandra-b is active [19:59:47] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [19:59:57] RECOVERY - cassandra-b SSL 10.64.16.98:7001 on restbase-dev1005 is OK: SSL OK - Certificate restbase-dev1005-b valid until 2018-07-20 15:08:08 +0000 (expires in 364 days) [20:00:16] RECOVERY - cassandra-a service on restbase-dev1005 is OK: OK - cassandra-a is active [20:00:26] RECOVERY - cassandra-a SSL 10.64.16.97:7001 on restbase-dev1005 is OK: SSL OK - Certificate restbase-dev1005-a valid until 2018-07-20 15:08:07 +0000 (expires in 364 days) [20:00:56] RECOVERY - puppet last run on restbase-dev1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:01:06] RECOVERY - cassandra-b CQL 10.64.16.98:9042 on restbase-dev1005 is OK: TCP OK - 0.000 second response time on 10.64.16.98 port 9042 [20:01:45] SMalyshev: cool, i changed the task then [20:01:48] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3457740 (10RobH) [20:01:51] thank you! 
[20:01:54] he can rack them basically anywhere that they dont share a single rack [20:01:57] and dont share with wdqs1003 [20:02:18] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3457615 (10RobH) [20:02:56] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active [20:03:36] RECOVERY - cassandra-b service on restbase-dev1006 is OK: OK - cassandra-b is active [20:03:46] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:03:46] RECOVERY - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-b valid until 2018-07-20 15:08:11 +0000 (expires in 364 days) [20:03:47] RECOVERY - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.169 port 9042 [20:04:37] (03PS1) 10Ottomata: Reload eventlogging-consumer mysql-eventbus if event-schemas change [puppet] - 10https://gerrit.wikimedia.org/r/366623 [20:06:16] RECOVERY - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-a valid until 2018-07-20 15:08:10 +0000 (expires in 364 days) [20:06:16] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [20:07:16] RECOVERY - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.168 port 9042 [20:10:30] robh: also different rows ;) [20:10:42] (if possible or there aren't other constraints) [20:11:11] updated [20:11:16] so 1003 is in a [20:11:16] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3457765 (10RobH) [20:11:26] i asked him to put the two new ones in 
other rows and different racks than existing [20:15:07] (03PS1) 10Dereckson: Add Author namespace on ta.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366626 (https://phabricator.wikimedia.org/T165813) [20:19:20] 10Operations, 10Beta-Cluster-Infrastructure, 10VPS-Projects, 10Release-Engineering-Team (Kanban), and 2 others: a lot of beta cluster instances are not reachable over SSH - https://phabricator.wikimedia.org/T171174#3457796 (10hashar) [20:19:25] 10Operations, 10Beta-Cluster-Infrastructure, 10VPS-Projects, 10Release-Engineering-Team (Kanban), 10Services (watching): New instance in deployment prep can't run puppet for the first time - https://phabricator.wikimedia.org/T171177#3457793 (10hashar) 05Open>03Resolved a:03Ottomata Andrew has delet... [20:29:23] !log nuria@tin Started deploy [eventlogging/analytics@c1c2c39]: (no justification provided) [20:29:25] !log nuria@tin Finished deploy [eventlogging/analytics@c1c2c39]: (no justification provided) (duration: 00m 02s) [20:29:26] (03CR) 10Volans: [C: 04-1] "I think there is a bug, see inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/366564 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [20:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:30] (03PS1) 10Andrew Bogott: nova: add labvirt1015 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/366630 [20:34:03] (03PS2) 10Ottomata: Reload eventlogging-consumer mysql-eventbus if event-schemas change [puppet] - 10https://gerrit.wikimedia.org/r/366623 [20:34:29] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3457839 (10RobH) [20:34:34] (03CR) 10Andrew Bogott: [C: 032] nova: add labvirt1015 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/366630 (owner: 10Andrew Bogott) [20:41:21] (03PS3) 10Ottomata: Reload eventlogging-consumer mysql-eventbus if event-schemas change [puppet] - 10https://gerrit.wikimedia.org/r/366623 [20:41:25] (03CR) 10Ottomata: [C: 032] Reload eventlogging-consumer mysql-eventbus if event-schemas change [puppet] - 10https://gerrit.wikimedia.org/r/366623 (owner: 10Ottomata) [20:41:27] (03CR) 10Ottomata: [V: 032 C: 032] Reload eventlogging-consumer mysql-eventbus if event-schemas change [puppet] - 10https://gerrit.wikimedia.org/r/366623 (owner: 10Ottomata) [20:42:26] RECOVERY - Host labvirt1015 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [20:42:56] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:43:30] (03PS1) 10Ottomata: Invert upstart conditional check in eventlogging consumer [puppet] - 10https://gerrit.wikimedia.org/r/366661 [20:43:36] (03PS1) 10Andrew Bogott: Revert "nova: add labvirt1015 to the scheduling pool" [puppet] - 10https://gerrit.wikimedia.org/r/366662 [20:43:38] (03PS2) 10Ottomata: Invert upstart conditional check in eventlogging consumer [puppet] - 10https://gerrit.wikimedia.org/r/366661 [20:43:42] (03CR) 10Ottomata: [C: 032] Invert upstart conditional check in eventlogging consumer [puppet] - 10https://gerrit.wikimedia.org/r/366661 (owner: 10Ottomata) [20:43:44] (03CR) 10Ottomata: [V: 032 C: 032] Invert upstart conditional check in eventlogging consumer [puppet] - 10https://gerrit.wikimedia.org/r/366661 (owner: 10Ottomata) [20:45:25] (03PS2) 10Andrew Bogott: Revert "nova: add labvirt1015 to the scheduling pool" [puppet] - 10https://gerrit.wikimedia.org/r/366662 [20:47:11] (03CR) 10Andrew Bogott: [C: 032] Revert "nova: add labvirt1015 to the scheduling pool" [puppet] - 10https://gerrit.wikimedia.org/r/366662 (owner: 10Andrew Bogott) [20:52:36] PROBLEM - Host labvirt1015 is DOWN: PING CRITICAL - Packet loss = 100% [20:53:56] RECOVERY - Host labvirt1015 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [21:01:44] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366493 (owner: 10Jforrester) [21:11:07] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:26:13] (03PS1) 10Volans: Tests: simplify and improve parametrized tests [software/cumin] - 10https://gerrit.wikimedia.org/r/366733 (https://phabricator.wikimedia.org/T154588) [21:26:15] (03PS1) 10Volans: CLI: simplify imports and introspection [software/cumin] - 10https://gerrit.wikimedia.org/r/366734 [21:26:17] (03PS1) 10Volans: Logging: add a custom trace() logging level [software/cumin] - 
10https://gerrit.wikimedia.org/r/366735 [21:26:19] (03PS1) 10Volans: Transports: convert hosts to ClusterShell's NodeSet [software/cumin] - 10https://gerrit.wikimedia.org/r/366736 (https://phabricator.wikimedia.org/T170394) [21:26:21] (03PS1) 10Volans: Query: add multi-query support [software/cumin] - 10https://gerrit.wikimedia.org/r/366737 (https://phabricator.wikimedia.org/T170394) [21:33:50] (03PS2) 10Volans: Query: add multi-query support [software/cumin] - 10https://gerrit.wikimedia.org/r/366737 (https://phabricator.wikimedia.org/T170394) [21:36:20] (03PS6) 10Thcipriani: CI/integration: Create role for docker CI agent [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) [21:47:34] RECOVERY - Restbase root url on restbase-dev1004 is OK: HTTP OK: HTTP/1.1 200 - 15600 bytes in 0.025 second response time [21:47:50] jouncebot: now [21:47:50] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [21:47:52] jouncebot: next [21:47:52] In 1 hour(s) and 12 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170720T2300) [21:48:20] (03PS2) 10Reedy: phpcs: Enable MediaWiki.ControlStructures.IfElseStructure.Space*Else and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366493 (owner: 10Jforrester) [21:48:34] (03CR) 10Reedy: [C: 032] phpcs: Enable MediaWiki.ControlStructures.IfElseStructure.Space*Else and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366493 (owner: 10Jforrester) [21:49:57] (03Merged) 10jenkins-bot: phpcs: Enable MediaWiki.ControlStructures.IfElseStructure.Space*Else and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366493 (owner: 10Jforrester) [21:50:07] (03CR) 10jenkins-bot: phpcs: Enable MediaWiki.ControlStructures.IfElseStructure.Space*Else and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366493 (owner: 10Jforrester) [21:50:09] (03PS2) 10Reedy: phpcs: Enable 
MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [21:51:29] (03CR) 10jerkins-bot: [V: 04-1] phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [21:52:01] (03CR) 10Reedy: [C: 04-1] "All of the trailing spaces before the ; are wrong :(" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [21:53:00] (03PS2) 10Reedy: phpcs: Enable MediaWiki.WhiteSpace.MultipleEmptyLines.MultipleEmptyLines and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366495 (owner: 10Jforrester) [21:54:17] (03CR) 10jerkins-bot: [V: 04-1] phpcs: Enable MediaWiki.WhiteSpace.MultipleEmptyLines.MultipleEmptyLines and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366495 (owner: 10Jforrester) [21:56:54] PROBLEM - MegaRAID on db1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [21:56:55] ACKNOWLEDGEMENT - MegaRAID on db1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T171232 [21:56:59] 10Operations, 10ops-eqiad: Degraded RAID on db1001 - https://phabricator.wikimedia.org/T171232#3458257 (10ops-monitoring-bot) [21:57:57] (03PS3) 10Reedy: phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [21:58:15] (03CR) 10Reedy: [C: 032] phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [21:59:42] (03CR) 10jerkins-bot: [V: 04-1] phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [21:59:57] (03CR) 10jerkins-bot: [V: 04-1] phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [22:00:34] (03CR) 10Reedy: [C: 032] "I thought there were none of these? :P" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [22:02:07] (03PS4) 10Reedy: phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [22:04:48] (03CR) 10jerkins-bot: [V: 04-1] phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [22:11:31] (03PS5) 10Reedy: phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [22:11:34] (03CR) 10Reedy: [C: 032] phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [22:13:04] (03Merged) 10jenkins-bot: phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [22:13:16] (03CR) 10jenkins-bot: phpcs: Enable MediaWiki.ExtraCharacters.ParenthesesAroundKeyword.ParenthesesAroundKeywords and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [22:13:50] (03CR) 10Reedy: [C: 032] "T171234 filed for the other sniff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366494 (owner: 10Jforrester) [22:13:53] (03PS3) 10Reedy: phpcs: 
Enable MediaWiki.WhiteSpace.MultipleEmptyLines.MultipleEmptyLines and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366495 (owner: 10Jforrester) [22:14:01] (03CR) 10Reedy: [C: 032] phpcs: Enable MediaWiki.WhiteSpace.MultipleEmptyLines.MultipleEmptyLines and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366495 (owner: 10Jforrester) [22:14:42] PROBLEM - Check size of conntrack table on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:15:30] 10Operations, 10DBA: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3458384 (10MF-Warburg) Thanks for this reply. I think that settles it. [22:15:42] PROBLEM - puppet last run on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:15:42] PROBLEM - Check systemd state on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:15:45] 10Operations, 10DBA: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3458388 (10MF-Warburg) 05Open>03Resolved a:03MF-Warburg [22:16:33] PROBLEM - Check the NTP synchronisation status of timesyncd on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:16:33] PROBLEM - salt-minion processes on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:16:47] (03Merged) 10jenkins-bot: phpcs: Enable MediaWiki.WhiteSpace.MultipleEmptyLines.MultipleEmptyLines and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366495 (owner: 10Jforrester) [22:16:56] (03CR) 10jenkins-bot: phpcs: Enable MediaWiki.WhiteSpace.MultipleEmptyLines.MultipleEmptyLines and make pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366495 (owner: 10Jforrester) [22:17:32] PROBLEM - Check whether ferm is active by checking the default input chain on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:18:08] !log reedy@tin Synchronized wmf-config/: phpcs (duration: 00m 46s) [22:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:22] PROBLEM - DPKG on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:18:40] (03Abandoned) 10Reedy: phpcbf on mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366483 (owner: 10Reedy) [22:18:42] (03Abandoned) 10Reedy: Disable all phpcs rules for lols [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366489 (owner: 10Reedy) [22:19:02] !log reedy@tin Synchronized tests: phpcs (duration: 00m 44s) [22:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:12] PROBLEM - Disk space on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:20:01] !log reedy@tin Synchronized phpcs.xml: phpcs (duration: 00m 43s) [22:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:01] !log reedy@tin Synchronized search-redirect.php: phpcs (duration: 00m 43s) [22:21:02] PROBLEM - MD RAID on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:23] !log reedy@tin Synchronized wmf-config/CommonSettings.php: phpcs (duration: 00m 43s) [22:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:34] (03Merged) 10jenkins-bot: Fix minor code style issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366321 (owner: 10Reedy) [22:22:40] (03CR) 10jenkins-bot: Fix minor code style issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366321 (owner: 10Reedy) [22:22:52] PROBLEM - configured eth on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:23:42] PROBLEM - dhclient process on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:24:42] PROBLEM - SSH on phab1001 is CRITICAL: connect to address 10.64.16.8 and port 22: Connection refused [22:26:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services (watching), 10User-fgiunchedi: Decommisson restbase-dev100[1-3] - https://phabricator.wikimedia.org/T171179#3458454 (10RobH) [22:26:32] RECOVERY - dhclient process on phab1001 is OK: PROCS OK: 0 processes with command name dhclient [22:26:42] RECOVERY - salt-minion processes on phab1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:26:42] RECOVERY - configured eth on phab1001 is OK: OK - interfaces up [22:26:43] RECOVERY - Check size of conntrack table on phab1001 is OK: OK: nf_conntrack is 0 % full [22:26:53] RECOVERY - MD RAID on phab1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [22:27:12] RECOVERY - Disk space on phab1001 is OK: DISK OK [22:27:22] RECOVERY - DPKG on phab1001 is OK: All packages OK [22:27:32] RECOVERY - Check whether ferm is active by checking the default input chain on phab1001 is OK: OK ferm input default policy is set [22:28:53] PROBLEM - puppet last run on phab1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 42 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Service[rsync] [22:31:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services (watching), 10User-fgiunchedi: Decommisson restbase-dev100[1-3] - https://phabricator.wikimedia.org/T171179#3458469 (10RobH) [22:34:47] (03PS1) 10RobH: decommission restbase-dev100[123] dns entries [dns] - 10https://gerrit.wikimedia.org/r/366746 (https://phabricator.wikimedia.org/T171179) [22:35:14] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommisson restbase-dev100[1-3] - https://phabricator.wikimedia.org/T171179#3458505 (10RobH) [22:35:44] (03CR) 10RobH: [C: 032] decommission restbase-dev100[123] dns entries [dns] - 10https://gerrit.wikimedia.org/r/366746 (https://phabricator.wikimedia.org/T171179) (owner: 10RobH) [22:36:56] 10Operations, 10ops-eqiad, 10hardware-requests: Decommisson restbase-dev100[1-3] - https://phabricator.wikimedia.org/T171179#3458512 (10RobH) a:05RobH>03Cmjohnson [22:39:49] (03PS1) 10Reedy: Re-enable Generic.WhiteSpace.DisallowSpaceIndent.SpacesUsed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366747 [22:46:23] RECOVERY - Check the NTP synchronisation status of timesyncd on phab1001 is OK: OK: synced at Thu 2017-07-20 22:46:20 UTC. 
[22:52:12] RECOVERY - puppet last run on phab1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:57:19] (03PS1) 10Reedy: Re-enable MediaWiki.WhiteSpace.SpaceyParenthesis.SpaceBeforeOpeningParenthesis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366749 [22:57:21] (03PS1) 10Reedy: Re-enable MediaWiki.WhiteSpace.SpaceyParenthesis.SingleSpaceAfterOpenParenthesis and MediaWiki.WhiteSpace.SpaceyParenthesis.SingleSpaceBeforeCloseParenthesis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366750 [22:57:38] (03CR) 10Reedy: [C: 032] Re-enable Generic.WhiteSpace.DisallowSpaceIndent.SpacesUsed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366747 (owner: 10Reedy) [22:57:53] (03CR) 10Reedy: [C: 032] Re-enable MediaWiki.WhiteSpace.SpaceyParenthesis.SpaceBeforeOpeningParenthesis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366749 (owner: 10Reedy) [22:57:57] (03CR) 10Reedy: [C: 032] Re-enable MediaWiki.WhiteSpace.SpaceyParenthesis.SingleSpaceAfterOpenParenthesis and MediaWiki.WhiteSpace.SpaceyParenthesis.SingleSpaceBefor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366750 (owner: 10Reedy) [22:57:59] (03PS5) 10Aude: Enable WikibaseQualityConstraints statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 (https://phabricator.wikimedia.org/T169647) (owner: 10Lucas Werkmeister (WMDE)) [22:59:02] (03Merged) 10jenkins-bot: Re-enable Generic.WhiteSpace.DisallowSpaceIndent.SpacesUsed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366747 (owner: 10Reedy) [22:59:11] (03CR) 10jenkins-bot: Re-enable Generic.WhiteSpace.DisallowSpaceIndent.SpacesUsed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366747 (owner: 10Reedy) [22:59:53] 10Operations, 10Performance-Team, 10TemplateStyles, 10Traffic, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3458687 (10Etonkovidova) [22:59:55] (03Merged) 10jenkins-bot: Re-enable 
MediaWiki.WhiteSpace.SpaceyParenthesis.SpaceBeforeOpeningParenthesis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366749 (owner: 10Reedy) [22:59:57] (03Merged) 10jenkins-bot: Re-enable MediaWiki.WhiteSpace.SpaceyParenthesis.SingleSpaceAfterOpenParenthesis and MediaWiki.WhiteSpace.SpaceyParenthesis.SingleSpaceBeforeCloseParenthesis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366750 (owner: 10Reedy) [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170720T2300). [23:00:05] Dereckson and aude: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:35] (03CR) 10jenkins-bot: Re-enable MediaWiki.WhiteSpace.SpaceyParenthesis.SpaceBeforeOpeningParenthesis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366749 (owner: 10Reedy) [23:01:37] !log reedy@tin Synchronized tests: phpcs (duration: 00m 44s) [23:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:33] !log reedy@tin Synchronized wmf-config/: phpcs (duration: 00m 45s) [23:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:41] !log reedy@tin Synchronized docroot/search.wikimedia.org/index.php: phpcs (duration: 00m 43s) [23:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:01] hi [23:04:19] nearly finished making everyones patches conflict [23:04:35] !log reedy@tin Synchronized errorpages/hhvm-fatal-error.php: phpcs (duration: 00m 44s) [23:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:26] !log reedy@tin Synchronized w/health-check.php: phpcs (duration: 00m 43s) [23:05:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:03] (03PS2) 10Reedy: Add Author namespace on ta.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366626 (https://phabricator.wikimedia.org/T165813) (owner: 10Dereckson) [23:06:15] !log reedy@tin Synchronized phpcs.xml: phpcs (duration: 00m 43s) [23:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:26] (03CR) 10Reedy: [C: 032] Add Author namespace on ta.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366626 (https://phabricator.wikimedia.org/T165813) (owner: 10Dereckson) [23:08:49] (03Merged) 10jenkins-bot: Add Author namespace on ta.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366626 (https://phabricator.wikimedia.org/T165813) (owner: 10Dereckson) [23:09:51] (03PS5) 10Reedy: Configure WikibaseQualityConstraints extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (https://phabricator.wikimedia.org/T168938) (owner: 10Lucas Werkmeister (WMDE)) [23:10:03] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Author namespace for tawikisource T165813 (duration: 00m 43s) [23:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:13] T165813: Create Author: namespace on Tamil wikisource - https://phabricator.wikimedia.org/T165813 [23:10:31] (03PS6) 10Reedy: Enable WikibaseQualityConstraints statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 (https://phabricator.wikimedia.org/T169647) (owner: 10Lucas Werkmeister (WMDE)) [23:10:37] (03CR) 10Reedy: [C: 032] Enable WikibaseQualityConstraints statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 (https://phabricator.wikimedia.org/T169647) (owner: 10Lucas Werkmeister (WMDE)) [23:11:11] Reedy: thanks [23:12:23] (03Merged) 10jenkins-bot: Enable WikibaseQualityConstraints statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 
(https://phabricator.wikimedia.org/T169647) (owner: 10Lucas Werkmeister (WMDE)) [23:12:35] (03PS6) 10Reedy: Configure WikibaseQualityConstraints extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (https://phabricator.wikimedia.org/T168938) (owner: 10Lucas Werkmeister (WMDE)) [23:12:41] (03CR) 10Reedy: [C: 032] Configure WikibaseQualityConstraints extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (https://phabricator.wikimedia.org/T168938) (owner: 10Lucas Werkmeister (WMDE)) [23:13:31] i would also like to deploy a fix for https://phabricator.wikimedia.org/T171196 [23:13:44] but will take me some time to update the wikidata build [23:13:58] aude: fine by me, I'm sure you know by now I don't care in the nice way :) [23:14:06] (03Merged) 10jenkins-bot: Configure WikibaseQualityConstraints extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358553 (https://phabricator.wikimedia.org/T168938) (owner: 10Lucas Werkmeister (WMDE)) [23:14:10] We've still got 45 mins :) [23:14:33] ok :) [23:15:42] !log reedy@tin Synchronized wmf-config/Wikibase-production.php: T169647 T168938 (duration: 00m 42s) [23:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:52] T169647: Enable constraint statements on Wikidata - https://phabricator.wikimedia.org/T169647 [23:15:52] T168938: Configure WikibaseQualityConstraints extension on Wikidata - https://phabricator.wikimedia.org/T168938 [23:20:48] https://gerrit.wikimedia.org/r/#/c/366754/ [23:21:16] looks like no i18n changes? [23:21:22] no [23:21:25] cool [23:22:02] i was trying to figure out why extensions were being put in the vendor folder [23:22:15] think it's good now [23:29:04] (03PS1) 10Reedy: Enable 3 squiz phpcs rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366756 [23:32:01] * aude wonders if anyone from ops is around? 
[23:33:08] (03PS1) 10Reedy: Enable MediaWiki.WhiteSpace.SpaceyParenthesis.UnnecessarySpaceBetweenParentheses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366757 [23:36:28] !log reedy@tin Synchronized php-1.30.0-wmf.10/extensions/Wikidata: Update Wikidata - fix uncaught exception in constraints (duration: 02m 09s) [23:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:28] checking [23:37:45] looks good [23:37:47] thanks :) [23:37:48] :) [23:37:54] (03CR) 10Reedy: [C: 032] Enable 3 squiz phpcs rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366756 (owner: 10Reedy) [23:37:58] (03CR) 10Reedy: [C: 032] Enable MediaWiki.WhiteSpace.SpaceyParenthesis.UnnecessarySpaceBetweenParentheses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366757 (owner: 10Reedy) [23:39:15] (03Merged) 10jenkins-bot: Enable 3 squiz phpcs rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366756 (owner: 10Reedy) [23:40:00] (03Merged) 10jenkins-bot: Enable MediaWiki.WhiteSpace.SpaceyParenthesis.UnnecessarySpaceBetweenParentheses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366757 (owner: 10Reedy) [23:40:06] (03PS1) 10Reedy: Enable MediaWiki.WhiteSpace.SpaceBeforeControlStructureBrace.EmptyLines and MediaWiki.WhiteSpace.SpaceAfterControlStructure.Incorrect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366758 [23:40:29] thanks from me as well :) [23:41:02] (03CR) 10Reedy: [C: 032] Enable MediaWiki.WhiteSpace.SpaceBeforeControlStructureBrace.EmptyLines and MediaWiki.WhiteSpace.SpaceAfterControlStructure.Incorrect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366758 (owner: 10Reedy) [23:43:00] (03Merged) 10jenkins-bot: Enable MediaWiki.WhiteSpace.SpaceBeforeControlStructureBrace.EmptyLines and MediaWiki.WhiteSpace.SpaceAfterControlStructure.Incorrect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366758 (owner: 10Reedy) [23:46:32] (03PS1) 10Reedy: Enable 
MediaWiki.WhiteSpace.SpaceBeforeSingleLineComment.SingleSpaceBeforeSingleLineComment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366759 [23:47:27] (03CR) 10Reedy: [C: 032] Enable MediaWiki.WhiteSpace.SpaceBeforeSingleLineComment.SingleSpaceBeforeSingleLineComment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366759 (owner: 10Reedy) [23:49:27] (03Merged) 10jenkins-bot: Enable MediaWiki.WhiteSpace.SpaceBeforeSingleLineComment.SingleSpaceBeforeSingleLineComment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366759 (owner: 10Reedy) [23:50:20] (03PS1) 10Reedy: Enable Generic.ControlStructures.InlineControlStructure.NotAllowed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366760 [23:50:30] (03CR) 10Reedy: [C: 032] Enable Generic.ControlStructures.InlineControlStructure.NotAllowed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366760 (owner: 10Reedy) [23:51:52] (03Merged) 10jenkins-bot: Enable Generic.ControlStructures.InlineControlStructure.NotAllowed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366760 (owner: 10Reedy) [23:53:42] !log reedy@tin Synchronized docroot/: phpcs (duration: 00m 44s) [23:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:42] !log reedy@tin Synchronized errorpages/404.php: phpcs (duration: 00m 43s) [23:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:40] (03PS1) 10Reedy: Enable Squiz.WhiteSpace.LanguageConstructSpacing.Incorrect and Squiz.WhiteSpace.LanguageConstructSpacing.IncorrectSingle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366761 [23:55:41] !log reedy@tin Synchronized tests: phpcs.xml (duration: 00m 42s) [23:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:37] !log reedy@tin Synchronized w: phpcs (duration: 00m 43s) [23:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:32] !log reedy@tin Synchronized wmf-config/: phpcs 
(duration: 00m 45s) [23:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:15] (03CR) 10Reedy: [C: 032] Enable Squiz.WhiteSpace.LanguageConstructSpacing.Incorrect and Squiz.WhiteSpace.LanguageConstructSpacing.IncorrectSingle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366761 (owner: 10Reedy) [23:59:53] (03PS1) 10Reedy: Enable MediaWiki.AlternativeSyntax.AlternativeSyntax.AlternativeSyntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366762