[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T0000). Please do the needful. [00:02:58] (03CR) 10Dzahn: "i guess so .. the puppet ones seem ok, but let's not keep adding more random ones that did not actually happen in real life. the ones in t" [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [00:05:19] (03PS5) 10Dzahn: typos: add 'pupet' and 'puppte' [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [00:06:11] (03PS1) 10Jgreen: repool frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/346669 [00:08:06] (03CR) 10Jgreen: [C: 032] repool frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/346669 (owner: 10Jgreen) [00:08:25] !log preparing to update Phabricator to tag release/2017-04-05/1 #phab-2017-04-05 [00:08:28] (03CR) 10Dzahn: "i'm gonna take "pupet" and "puppte" but not "ppute" and "ppuet". they seem too unlikely to me and we don't want to cause false positives." [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [00:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:54] (03PS6) 10Dzahn: typos: add 'pupet' and 'puppte' [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [00:11:21] (03CR) 10Dzahn: [C: 032] typos: add 'pupet' and 'puppte' [puppet] - 10https://gerrit.wikimedia.org/r/346282 (owner: 10Zppix) [00:19:14] !log updating phabricator, the service will be offline for just a few moments. [00:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:39] !log stopping and starting stashbot for config change - added #wikimedia-traffic channel [00:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:02] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:20:02] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:02] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:02] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:02] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:03] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:03] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:13] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:13] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:22] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:32] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:52] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:52] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:52] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:52] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:52] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:53] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:20:57] :( [00:21:02] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:21:02] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:21:02] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:22:43] twentyafterfour: That's been happening off and on today, jy.nus said not to worry [00:24:11] RainbowSprinkles: ok cool, thanks [00:24:23] good to know, ok thx [00:25:23] !log Phabricator upgrade completed uneventfully, other than the undisputable fact that the new search functionality is awesome. [00:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:52] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:27:02] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:27:03] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:27:03] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:27:36] twentyafterfour: :)) [00:27:42] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [00:27:42] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [00:27:42] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:27:42] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:27:42] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [00:27:43] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:27:52] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [00:27:52] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [00:27:52] RECOVERY - 
MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:27:52] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87086.27 seconds [00:27:52] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [00:27:53] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:27:53] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [00:27:54] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:27:54] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [00:27:55] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [00:27:55] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [00:28:02] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [00:28:02] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [00:28:02] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87117.74 seconds [00:28:02] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87136.76 seconds [00:28:02] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:28:12] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:28:22] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:28:39] (03PS3) 10Faidon Liambotis: Standardize on lowercase os_version/require_os [puppet] - 10https://gerrit.wikimedia.org/r/345561 [00:28:41] (03PS1) 10Faidon Liambotis: wmflib: multiple os_version changes [puppet] - 10https://gerrit.wikimedia.org/r/346673 [00:33:00] (03Abandoned) 10Faidon Liambotis: 
wmflib: os_version now fail when lsb vars are missing [puppet] - 10https://gerrit.wikimedia.org/r/308882 (owner: 10Hashar) [00:37:57] !log install1002/2002: deleting /srv/tftboot/precise-installer | puppetmaster1002/2001: deleting /var/lib/puppet/volatile/tftpboot/precise-installer (clean up after gerrit:345549) [00:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:41] !log install1002/2002: deleting /srv/autoinstall/precise.cfg [00:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:37] (03CR) 10Dzahn: "< mutante> !log install1002/2002: deleting /srv/tftboot/precise-installer | puppetmaster1002/2001: deleting /var/lib/puppet/volatile/tftp" [puppet] - 10https://gerrit.wikimedia.org/r/345549 (owner: 10Faidon Liambotis) [00:42:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [00:47:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [01:27:02] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:29:42] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83719.700952 Seconds [01:29:42] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83722.374779 Seconds [01:29:52] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83732.580882 Seconds [01:30:02] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83827.866869 Seconds [01:32:02] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:32:12] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83960.620656 Seconds [01:32:32] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83977.211427 Seconds [01:36:10] (03CR) 10Dzahn: "i could imagine that this is used in labs with config that is not in puppet repo but still has "IfVersion" snippets. and this removes mod_" [puppet] - 10https://gerrit.wikimedia.org/r/345548 (owner: 10Faidon Liambotis) [01:38:12] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [01:41:12] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 84500.845266 Seconds [01:41:23] (03PS5) 10Dzahn: Gerrit: Disable md5 in ssh [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [01:44:06] (03CR) 10Dzahn: [C: 032] Gerrit: Disable md5 in ssh [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [01:45:52] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [01:45:53] !log restarting gerrit to pick up config change gerrit:346180 (disable MD5) [01:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:52] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 84872.623553 Seconds [01:50:52] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [01:53:52] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 85172.601953 Seconds [01:55:32] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 1.177804 Seconds [01:55:42] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 10.911328 Seconds [01:55:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 13.749064 Seconds [01:55:52] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [01:56:02] RECOVERY - Postgres 
Replication Lag on maps2004 is OK: OK - Rep Delay is: 31.49275 Seconds [01:56:12] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 44.497014 Seconds [01:57:14] (03PS4) 10Dzahn: Phab: Use base:service_unit for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765) (owner: 1020after4) [02:01:06] (03CR) 10Dzahn: "so it looks this code (before this change) is not used yet. i don't see the existing upstart or systemd config on either system?" [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765) (owner: 1020after4) [02:04:13] (03PS5) 10Dzahn: Phabricator: Use base:service_unit for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765) (owner: 1020after4) [02:04:50] (03CR) 10Dzahn: "i'm merging it anyways, since this is no-op in prod (not used yet). i assume you were planning to activate this in a second step http://" [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765) (owner: 1020after4) [02:06:06] (03CR) 10Dzahn: [C: 032] Phabricator: Use base:service_unit for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765) (owner: 1020after4) [02:22:16] (03CR) 10Smalyshev: [C: 031] [cirrus] Increase max field count for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 (owner: 10DCausse) [02:25:00] (03CR) 10Smalyshev: wdqs: active/active public interface (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) (owner: 10Gehel) [02:25:03] (03CR) 1020after4: "That's correct, planning to activate it separately ..." 
[puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765) (owner: 1020after4) [02:25:53] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 10m 28s) [02:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:23] (03CR) 10Aude: [cirrus] Increase max field count for wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 (owner: 10DCausse) [02:36:26] (03CR) 10Smalyshev: [C: 031] [cirrus] Increase max field count for wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 (owner: 10DCausse) [02:43:52] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:45:42] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1818.913159 Seconds [02:45:42] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1822.247884 Seconds [02:46:02] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1925.575519 Seconds [02:46:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [02:47:32] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2014.917561 Seconds [02:47:52] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1951.947328 Seconds [02:48:52] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [02:49:32] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [02:51:42] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [02:52:02] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [02:52:32] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2315.042361 Seconds [02:53:32] RECOVERY - Postgres 
Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [02:54:42] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2361.664929 Seconds [02:55:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [02:55:52] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2431.794304 Seconds [02:56:52] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [02:59:52] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2671.826406 Seconds [03:00:36] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.19) (duration: 15m 46s) [03:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:52] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:01:33] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2854.892937 Seconds [03:02:32] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [03:02:42] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2841.758008 Seconds [03:04:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [03:05:12] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 3078.423391 Seconds [03:05:32] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 3094.870625 Seconds [03:06:32] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [03:06:35] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Apr 6 03:06:35 UTC 2017 (duration 5m 59s) [03:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:52] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [03:11:12] RECOVERY - 
Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:21:02] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 4025.447639 Seconds [03:22:02] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [03:31:52] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:43:08] (03CR) 10Dzahn: [C: 031] interface: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345564 (owner: 10Faidon Liambotis) [03:44:03] (03CR) 10Dzahn: [C: 031] lxc: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345558 (owner: 10Faidon Liambotis) [03:44:32] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 5434.831713 Seconds [03:45:32] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [03:46:02] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 5525.414848 Seconds [03:46:42] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:47:02] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [03:50:32] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:50:42] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2756.815849 Seconds [03:51:42] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 2814.238513 Seconds [03:51:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [03:51:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [03:52:42] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [03:53:32] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 5974.813396 Seconds [03:54:30] (03PS2) 10Dzahn: labs_vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345840 (owner: 10Faidon Liambotis) [03:55:03] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:56:32] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [03:56:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [03:56:58] (03CR) 10Dzahn: [C: 032] labs_vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345840 (owner: 10Faidon Liambotis) [03:57:03] (03PS3) 10Dzahn: labs_vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345840 (owner: 10Faidon Liambotis) [03:57:42] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 3176.916164 Seconds [03:58:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [03:59:52] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [04:00:02] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 6365.535458 Seconds [04:00:32] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 6394.84454 Seconds [04:01:32] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [04:01:42] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:02:01] (03PS3) 10Dzahn: vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345563 (owner: 10Faidon Liambotis) [04:03:02] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [04:06:06] !log restarting apache2 on iridium to apply a minor hotfix [04:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:58] (03PS4) 10Dzahn: vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345563 (owner: 10Faidon Liambotis) [04:10:52] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 3967.181177 Seconds [04:12:53] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [04:14:42] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [04:16:48] (03CR) 10Dzahn: [C: 032] vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345563 (owner: 10Faidon Liambotis) [04:18:32] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [04:21:17] (03CR) 10Dzahn: [C: 031] Standardize on lowercase os_version/require_os [puppet] - 10https://gerrit.wikimedia.org/r/345561 (owner: 10Faidon Liambotis) [04:22:02] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:23:02] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [04:29:42] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [04:31:39] (03PS1) 10Dzahn: typos: add (requires_os|os_version)\([^)]*[[:upper:]] [puppet] - 10https://gerrit.wikimedia.org/r/346677 [04:33:05] (03CR) 10jerkins-bot: [V: 04-1] typos: add (requires_os|os_version)\([^)]*[[:upper:]] [puppet] - 10https://gerrit.wikimedia.org/r/346677 (owner: 10Dzahn) [04:40:52] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:49:02] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [04:49:42] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [04:50:42] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [04:58:12] Does the message "If you report this error to the Wikimedia System Administrators, please include the details below." imply I *should* report it? [04:58:19] Because I have details. 
[04:58:51] [Request from 65.19.8.129 via cp2012 cp2012, Varnish XID 10002400 Error: 503, Backend fetch failed at Thu, 06 Apr 2017 04:35:36 GMT] [04:59:15] and [04:59:27] [Request from 65.19.8.129 via cp2018 cp2018, Varnish XID 90190736 Error: 503, Backend fetch failed at Thu, 06 Apr 2017 04:48:52 GMT] [04:59:42] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [05:02:39] Oh, works now [05:02:42] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:03:35] I see in the logs Phabricator was being updated, guess it was that [05:03:42] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:06:42] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [05:07:42] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:07:42] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:08:52] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [05:13:42] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:37:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346687 [05:37:37] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346687 [05:39:12] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346687 (owner: 10Marostegui) [05:40:24] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346687 (owner: 10Marostegui) [05:40:34] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346687 (owner: 10Marostegui) [05:41:42] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [05:41:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1070 after compression - T153743 (duration: 00m 51s) [05:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:00] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [05:53:31] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db2047" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346690 [05:53:34] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db2047" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346690 [05:56:39] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db2047" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346690 (owner: 10Marostegui) [05:57:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db2047" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346690 (owner: 10Marostegui) [05:58:07] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db2047" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346690 (owner: 
10Marostegui) [05:58:52] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2047 - T160390 (duration: 00m 40s) [05:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:00] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [06:02:37] !log Configure and start replication on db1081 after the defragment - T161088 [06:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:46] T161088: Defragment s4: db1091, db1084, db1081, d1059 and probably the rest - https://phabricator.wikimedia.org/T161088 [06:09:40] !log Deploy schema change db2029 (s7 codfw master) - T160390 [06:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:48] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [06:14:42] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:16:59] (03PS1) 10Marostegui: site.pp: Add tempdb2001 new host [puppet] - 10https://gerrit.wikimedia.org/r/346691 (https://phabricator.wikimedia.org/T162290) [06:17:50] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3159572 (10Marostegui) Thanks Rob! We will take it from here! 
[06:18:39] (03CR) 10Marostegui: site.pp: Add tempdb2001 new host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346691 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [06:20:19] (03PS1) 10Marostegui: x1.hosts: Add tempdb2001.codfw.wmnet [software] - 10https://gerrit.wikimedia.org/r/346693 (https://phabricator.wikimedia.org/T162290) [06:20:52] (03PS2) 10Marostegui: site.pp: Add tempdb2001 new host [puppet] - 10https://gerrit.wikimedia.org/r/346691 (https://phabricator.wikimedia.org/T162290) [06:21:54] (03CR) 10Marostegui: site.pp: Add tempdb2001 new host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346691 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [06:26:41] (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/6032/" [puppet] - 10https://gerrit.wikimedia.org/r/346691 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [06:29:50] !log restart hhvm on mw1165 (jobrunner) - dump debug in /tmp/hhvm.19449.bt. 
- threads stuck in HPHP::Treadmill::getAgeOldestRequest [06:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:32] RECOVERY - HHVM jobrunner on mw1165 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.003 second response time [06:33:01] better :) [06:33:49] (03PS2) 10Marostegui: x1.hosts: Add tempdb2001.codfw.wmnet to x1 [software] - 10https://gerrit.wikimedia.org/r/346693 (https://phabricator.wikimedia.org/T162290) [06:37:23] (03PS1) 10Marostegui: db-eqiad.php: Repool db1081 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346694 (https://phabricator.wikimedia.org/T161088) [06:38:03] (03CR) 10Marostegui: [C: 04-2] "Wait until the server has caught up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346694 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [06:39:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [06:42:42] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:44:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [06:55:54] (03CR) 10Jcrespo: x1.hosts: Add tempdb2001.codfw.wmnet to x1 (031 comment) [software] - 10https://gerrit.wikimedia.org/r/346693 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [06:58:18] (03CR) 10Jcrespo: "All good except for the existing mistake you detected." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346691 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [07:01:03] (03PS3) 10Marostegui: x1.hosts: Add tempdb2001.codfw.wmnet to x1 [software] - 10https://gerrit.wikimedia.org/r/346693 (https://phabricator.wikimedia.org/T162290) [07:03:36] (03PS3) 10Marostegui: site.pp: Add tempdb2001 new host [puppet] - 10https://gerrit.wikimedia.org/r/346691 (https://phabricator.wikimedia.org/T162290) [07:03:55] (03CR) 10Muehlenhoff: [C: 031] Prepare mw2090->mw2096 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345156 (https://phabricator.wikimedia.org/T161488) (owner: 10Elukey) [07:04:22] (03CR) 10Jcrespo: [C: 031] site.pp: Add tempdb2001 new host [puppet] - 10https://gerrit.wikimedia.org/r/346691 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [07:05:43] (03CR) 10Jcrespo: x1.hosts: Add tempdb2001.codfw.wmnet to x1 (031 comment) [software] - 10https://gerrit.wikimedia.org/r/346693 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [07:07:37] (03CR) 10Marostegui: x1.hosts: Add tempdb2001.codfw.wmnet to x1 (031 comment) [software] - 10https://gerrit.wikimedia.org/r/346693 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [07:07:54] (03PS4) 10Marostegui: x1.hosts: Add tempdb2001.codfw.wmnet to x1 [software] - 10https://gerrit.wikimedia.org/r/346693 (https://phabricator.wikimedia.org/T162290) [07:08:27] (03PS1) 10KartikMistry: apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/346696 (https://phabricator.wikimedia.org/T161511) [07:08:38] (03CR) 10Marostegui: [C: 032] site.pp: Add tempdb2001 new host [puppet] - 10https://gerrit.wikimedia.org/r/346691 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [07:08:41] (03CR) 10jerkins-bot: [V: 04-1] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/346696 
(https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry) [07:10:47] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/346696 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry) [07:13:23] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3159628 (10Marostegui) a:03Marostegui [07:13:56] (03CR) 10Jcrespo: [C: 031] "Looks good. Bash parses well spaces and tabs when reading text, but my python scripts assumes there are always tabs." [software] - 10https://gerrit.wikimedia.org/r/346693 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [07:14:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1081 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346694 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [07:15:11] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1081 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346694 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [07:15:13] (03PS2) 10Elukey: Prepare mw2090->mw2096 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345156 (https://phabricator.wikimedia.org/T161488) [07:15:40] (03CR) 10Marostegui: [C: 032] x1.hosts: Add tempdb2001.codfw.wmnet to x1 [software] - 10https://gerrit.wikimedia.org/r/346693 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [07:16:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081 with low weight - T161088 (duration: 00m 48s) [07:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:28] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1081 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346694 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [07:16:29] T161088: Defragment s4: db1091, db1084, db1081, d1059 and probably the 
rest - https://phabricator.wikimedia.org/T161088 [07:23:06] (03CR) 10Elukey: [C: 032] Prepare mw2090->mw2096 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345156 (https://phabricator.wikimedia.org/T161488) (owner: 10Elukey) [07:24:22] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:30:13] PROBLEM - MariaDB Slave IO: x1 on tempdb2001 is CRITICAL: CRITICAL slave_io_state could not connect [07:30:32] PROBLEM - MariaDB Slave SQL: x1 on tempdb2001 is CRITICAL: CRITICAL slave_sql_state could not connect [07:31:36] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3159658 (10ema) We're currently running with a [[ https://gerrit.wikim... [07:31:46] PROBLEM - mysqld processes on tempdb2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [07:32:04] it was faster than me ^ [07:32:06] I will silence it [07:32:21] !log cache_upload: ban all objects with content-type ~ "^text" T162035 [07:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:27] T162035: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 [07:35:02] PROBLEM - HHVM processes on mw2094 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [07:35:03] PROBLEM - HHVM processes on mw2092 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [07:35:03] PROBLEM - Apache HTTP on mw2094 is CRITICAL: connect to address 10.192.16.67 and port 80: Connection refused [07:35:03] PROBLEM - Nginx local proxy to apache on mw2095 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.152 second response time [07:35:03] PROBLEM - HHVM processes on mw2095 is 
CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [07:35:12] PROBLEM - Nginx local proxy to apache on mw2094 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.150 second response time [07:35:12] PROBLEM - HHVM rendering on mw2092 is CRITICAL: connect to address 10.192.16.65 and port 80: Connection refused [07:35:22] PROBLEM - HHVM rendering on mw2094 is CRITICAL: connect to address 10.192.16.67 and port 80: Connection refused [07:35:32] PROBLEM - Nginx local proxy to apache on mw2092 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.150 second response time [07:35:32] PROBLEM - HHVM rendering on mw2095 is CRITICAL: connect to address 10.192.16.68 and port 80: Connection refused [07:35:42] PROBLEM - Apache HTTP on mw2095 is CRITICAL: connect to address 10.192.16.68 and port 80: Connection refused [07:35:42] PROBLEM - Apache HTTP on mw2092 is CRITICAL: connect to address 10.192.16.65 and port 80: Connection refused [07:35:55] RECOVERY - mysqld processes on tempdb2001 is OK: PROCS OK: 1 process with command name mysqld [07:36:00] sigh [07:36:11] mw20* is me, I just ran puppet on einstenium [07:36:26] (I mean, I did it BEFORE turning off apache/hhvm) [07:38:16] acked [07:42:47] (03PS1) 10Marostegui: db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346701 (https://phabricator.wikimedia.org/T161088) [07:46:12] PROBLEM - mediawiki-installation DSH group on mw2094 is CRITICAL: Host mw2094 is not in mediawiki-installation dsh group [07:47:53] PROBLEM - mediawiki-installation DSH group on mw2092 is CRITICAL: Host mw2092 is not in mediawiki-installation dsh group [07:48:03] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:49:56] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3159676 (10elukey) Just decommed mw2090-96 after adding new appser... [07:50:02] PROBLEM - mediawiki-installation DSH group on mw2095 is CRITICAL: Host mw2095 is not in mediawiki-installation dsh group [07:51:08] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3159680 (10elukey) [07:51:11] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3159679 (10elukey) 05Open>03Resolved [07:52:22] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:52:57] (03CR) 10Volans: "A couple of inline comments." 
(033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346311 (owner: 10Giuseppe Lavagetto) [07:53:12] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:54:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346701 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [07:55:16] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346701 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [07:56:20] (03PS4) 10Giuseppe Lavagetto: Add task to restore the TTL of discovery entries to 5 minutes [switchdc] - 10https://gerrit.wikimedia.org/r/346311 [07:56:22] (03PS1) 10Giuseppe Lavagetto: Fix an issue with menu loading if we have a stage 0 [switchdc] - 10https://gerrit.wikimedia.org/r/346702 [07:56:24] (03PS1) 10Giuseppe Lavagetto: Modify the script to switch varnish traffic [switchdc] - 10https://gerrit.wikimedia.org/r/346703 [07:56:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increae db1081 weight - T161088 (duration: 00m 39s) [07:56:26] (03PS1) 10Giuseppe Lavagetto: Rolling restart of parsoid [switchdc] - 10https://gerrit.wikimedia.org/r/346704 [07:56:28] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346701 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [07:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:34] T161088: Defragment s4: db1091, db1084, db1081, d1059 and probably the rest - https://phabricator.wikimedia.org/T161088 [07:56:49] (03CR) 10jerkins-bot: [V: 04-1] Rolling restart of parsoid [switchdc] - 10https://gerrit.wikimedia.org/r/346704 (owner: 10Giuseppe Lavagetto) [07:59:02] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:59:32] 
PROBLEM - Nginx local proxy to apache on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:00:28] checking mw1194.. [08:00:52] elukey@mw1194:~$ hhvmadm check-health [08:00:52] { "load":128 [08:00:52] , "queued":419 [08:01:32] PROBLEM - HHVM rendering on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:02:06] !log restart hhvm on mw1194 - dump debug in /tmp/hhvm.1692.bt. - threads stuck in HPHP::Treadmill::getAgeOldestRequest [08:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:24] big spike in cache_text 503s, unrelated to the running ban in upload [08:03:22] RECOVERY - Nginx local proxy to apache on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.046 second response time [08:03:22] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 74798 bytes in 0.238 second response time [08:03:24] it went down already, just saying because icinga is going to complain soon [08:03:42] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [08:03:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [08:04:02] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.027 second response time [08:04:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:04:51] esams and ulsfo? [08:04:59] 06Operations, 10media-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122#3159706 (10Volans) A second pass was completed successfully without any manual intervention.
[08:05:18] elukey: eqiad, hence everywhere [08:05:30] (03CR) 10Giuseppe Lavagetto: Add task to restore the TTL of discovery entries to 5 minutes (032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346311 (owner: 10Giuseppe Lavagetto) [08:05:42] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:07:17] (03PS5) 10Giuseppe Lavagetto: Add task to restore the TTL of discovery entries to 5 minutes [switchdc] - 10https://gerrit.wikimedia.org/r/346311 [08:07:19] (03PS2) 10Giuseppe Lavagetto: Fix an issue with menu loading if we have a stage 0 [switchdc] - 10https://gerrit.wikimedia.org/r/346702 [08:07:21] (03PS2) 10Giuseppe Lavagetto: Modify the script to switch varnish traffic [switchdc] - 10https://gerrit.wikimedia.org/r/346703 [08:07:23] (03PS2) 10Giuseppe Lavagetto: Rolling restart of parsoid [switchdc] - 10https://gerrit.wikimedia.org/r/346704 [08:07:40] (03CR) 10jerkins-bot: [V: 04-1] Rolling restart of parsoid [switchdc] - 10https://gerrit.wikimedia.org/r/346704 (owner: 10Giuseppe Lavagetto) [08:07:44] ema: ahh okok [08:08:46] elukey: any known issue on the mw* hosts? 
Vast majority of the 503s happened while fetching from the applayer [08:09:23] 07:58:18 -> 08:00:16 [08:09:48] <_joe_> ema: not really, I'd say we should look at the db too [08:09:48] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346702 (owner: 10Giuseppe Lavagetto) [08:10:29] (03PS3) 10Giuseppe Lavagetto: Rolling restart of parsoid [switchdc] - 10https://gerrit.wikimedia.org/r/346704 [08:12:28] ema: https://grafana.wikimedia.org/dashboard/db/production-logging?refresh=5m&orgId=1&from=1491464770112&to=1491466320181 [08:12:42] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:12:42] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:13:28] the only thing that I can see spiking is DBReplication, but not sure if related [08:13:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:13:50] (03PS3) 10Volans: Modify the script to switch varnish traffic [switchdc] - 10https://gerrit.wikimedia.org/r/346703 (owner: 10Giuseppe Lavagetto) [08:13:52] elukey: it does correlate timewise [08:14:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:14:29] (03CR) 10Volans: [C: 031] "LGTM, only typo fixed" (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/346703 (owner: 10Giuseppe Lavagetto) [08:14:32] RECOVERY - MariaDB Slave SQL: x1 on tempdb2001 is OK: OK slave_sql_state not a slave [08:15:04] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346704 (owner: 10Giuseppe Lavagetto) [08:15:22] RECOVERY - MariaDB Slave IO: x1 on tempdb2001 is OK: OK slave_io_state not a slave [08:15:39] (03CR) 10Hashar: "We have made puppet-lint strict which overall is a good thing. 
In this case it fail because of the quoted boolean: WARNING quoted boolean " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346162 (owner: 10Muehlenhoff) [08:16:02] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [08:16:17] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346311 (owner: 10Giuseppe Lavagetto) [08:16:30] (03CR) 10Volans: [C: 032] Add task to restore the TTL of discovery entries to 5 minutes [switchdc] - 10https://gerrit.wikimedia.org/r/346311 (owner: 10Giuseppe Lavagetto) [08:18:05] (03CR) 10Volans: [C: 032] Fix an issue with menu loading if we have a stage 0 [switchdc] - 10https://gerrit.wikimedia.org/r/346702 (owner: 10Giuseppe Lavagetto) [08:19:47] (03CR) 10Volans: [C: 032] Modify the script to switch varnish traffic [switchdc] - 10https://gerrit.wikimedia.org/r/346703 (owner: 10Giuseppe Lavagetto) [08:20:43] (03CR) 10Volans: [C: 032] Rolling restart of parsoid [switchdc] - 10https://gerrit.wikimedia.org/r/346704 (owner: 10Giuseppe Lavagetto) [08:20:52] (03PS4) 10Volans: Rolling restart of parsoid [switchdc] - 10https://gerrit.wikimedia.org/r/346704 (owner: 10Giuseppe Lavagetto) [08:22:35] ema: there seems to be a spike in 503s for upload now [08:23:01] for "\"2017-04-06T08:22:38\",\"503\",\"upload.wikimedia.org\",\"/wikipedia/commons/thumb etc.. [08:23:07] elukey: yep, that's likely the ban [08:23:16] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3159811 (10Beetlebeard) 05Resolved>03Open [08:23:33] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (10Beetlebeard) Dear Dzahn I am so sorry but we would like to go back to the old system. Everyone using @wikimedia.ee addresses had to... 
[08:23:33] * elukey cries in a corner [08:27:22] !log rebooting contint1001 to Linux 4.9 [08:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:40] elukey: this time around I'm gonna wait a bit after that smaller spike before banning on the frontends, let's see if it helps [08:31:21] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (10MoritzMuehlenhoff) @Beetlebeard If cost is the primary issue; G Suite is free of charge for non-profits: https://www.google.com/non... [08:31:41] (03PS1) 10Hashar: (DO NOT SUBMIT) prevent scap on jobrunner02 [puppet] - 10https://gerrit.wikimedia.org/r/346705 (https://phabricator.wikimedia.org/T125735) [08:31:52] (03CR) 10jerkins-bot: [V: 04-1] (DO NOT SUBMIT) prevent scap on jobrunner02 [puppet] - 10https://gerrit.wikimedia.org/r/346705 (https://phabricator.wikimedia.org/T125735) (owner: 10Hashar) [08:34:20] !log starting Jenkins on contint1001 [08:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:23] !log shutting down wdqs codfw for data reimport - T162111 [08:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:30] T162111: Make WDQS active / active - https://phabricator.wikimedia.org/T162111 [08:39:45] (03PS1) 10Marostegui: db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346707 (https://phabricator.wikimedia.org/T161088) [08:40:41] !log installing glibc updates on trusty [08:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:03] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=wdqs [08:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1081 weight [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/346707 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [08:41:45] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3159822 (10Beetlebeard) Thanks Moritz. The others using @wikimedia.ee addresses still prefer to change back. They just liked the old system... [08:42:02] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:42:46] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346707 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [08:42:59] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346707 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [08:43:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increae db1081 weight - T161088 (duration: 00m 39s) [08:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:46] T161088: Defragment s4: db1091, db1084, db1081, d1059 and probably the rest - https://phabricator.wikimedia.org/T161088 [08:49:13] (03CR) 10Lydia Pintscher: [C: 031] "Discussed it with Marius. Ok to go from my side." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [09:00:15] (03CR) 10Matthias Mullie: [C: 031] Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [09:02:31] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 07PostgreSQL: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345#3159843 (10Gehel) [09:02:35] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 07PostgreSQL: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345#3159857 (10Gehel) p:05Triage>03High [09:02:48] (03PS3) 10Hoo man: Update Wikibase site id and group for test2wiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [09:03:06] (03CR) 10Hoo man: "Manually rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [09:05:17] (03PS1) 10Gehel: maps - increase number of retries before alert for posttgresql lag check [puppet] - 10https://gerrit.wikimedia.org/r/346710 (https://phabricator.wikimedia.org/T162345) [09:05:59] (03PS1) 10Hoo man: Remove outdated comment about testwikidata dispatching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346711 [09:12:37] (03CR) 10Gehel: [C: 04-1] wdqs: active/active public interface (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) (owner: 10Gehel) [09:12:59] (03PS3) 10Gehel: wdqs: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) [09:14:10] (03CR) 10Gehel: [C: 04-1] "Still waiting on data reimport to be completed before merging this." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346543 (https://phabricator.wikimedia.org/T162111) (owner: 10Gehel) [09:31:02] PROBLEM - MegaRAID on ms-be1006 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [09:31:13] ACKNOWLEDGEMENT - MegaRAID on ms-be1006 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T162347 [09:31:19] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1006 - https://phabricator.wikimedia.org/T162347#3159886 (10ops-monitoring-bot) [09:33:30] 06Operations, 10Traffic, 10media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3159891 (10ema) [09:33:31] !log installing freetype security updates on trusty [09:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:34] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1006 - https://phabricator.wikimedia.org/T162347#3159906 (10Volans) [09:38:08] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1006 - https://phabricator.wikimedia.org/T162347#3159912 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03Cmjohnson [09:40:11] 06Operations, 10Traffic, 10media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3159922 (10ema) p:05Triage>03High [09:40:50] 06Operations, 10Traffic, 10media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3159891 (10ema) [09:46:03] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[mkfs-/dev/sdj1] [09:48:44] (03PS2) 10Alexandros Kosiaris: Remove system::role {'config-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345536 [09:48:46] (03PS2) 10Alexandros Kosiaris: Remove system::role { 'conftool-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345537 [09:48:48] (03PS2) 10Alexandros Kosiaris: Move and rename system::role{ 'role::docker::builder':} [puppet] - 10https://gerrit.wikimedia.org/r/345538 [09:48:50] (03PS1) 10Alexandros Kosiaris: Move and rename system::role{ 'role::elasticsearch::server':} [puppet] - 10https://gerrit.wikimedia.org/r/346713 [09:48:52] (03PS1) 10Alexandros Kosiaris: Move and rename system::role{ 'role::gerrit::server':} [puppet] - 10https://gerrit.wikimedia.org/r/346714 [09:48:54] (03PS1) 10Alexandros Kosiaris: Remove webserver_misc_static system::roles [puppet] - 10https://gerrit.wikimedia.org/r/346715 [09:48:56] (03PS1) 10Alexandros Kosiaris: Move system::role { 'role::planet::venus': } [puppet] - 10https://gerrit.wikimedia.org/r/346716 [09:48:58] (03PS1) 10Alexandros Kosiaris: Remove system::role { 'profile::redis::master':} [puppet] - 10https://gerrit.wikimedia.org/r/346717 [09:49:00] (03PS1) 10Alexandros Kosiaris: Remove system::role {"profile::redis::${category}":} [puppet] - 10https://gerrit.wikimedia.org/r/346718 [09:49:02] (03PS1) 10Alexandros Kosiaris: Remove system::role {"profile::redis::slave":} [puppet] - 10https://gerrit.wikimedia.org/r/346719 [09:50:15] 06Operations, 10Traffic, 10media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3159942 (10ema) [09:54:12] <_joe_> akosiaris: did you add the recommendation about that to our coding standards, btw?
[09:54:21] _joe_: yup [09:54:29] <_joe_> <3 [09:54:33] already done 4-5 days ago ;-) [09:54:40] <_joe_> great [09:54:43] I 've even linked you the change on wikitech [09:54:50] you are getting forgetful [09:54:52] :P [09:54:58] <_joe_> I need to add something about things that can be included in a profile [09:55:05] <_joe_> like the passwords:: classes [09:56:56] (03PS1) 10Marostegui: db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346721 (https://phabricator.wikimedia.org/T161088) [09:57:46] (03PS1) 10Muehlenhoff: Add adavenport to absented LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/346722 [09:59:18] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add adavenport to absented LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/346722 (owner: 10Muehlenhoff) [10:01:44] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [10:02:11] (03PS1) 10Gehel: maps - publish postgresql replication lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) [10:04:13] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove system::role {"profile::redis::slave":} [puppet] - 10https://gerrit.wikimedia.org/r/346719 (owner: 10Alexandros Kosiaris) [10:04:30] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove system::role {'config-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345536 (owner: 10Alexandros Kosiaris) [10:04:32] (03CR) 10jerkins-bot: [V: 04-1] maps - publish postgresql replication lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [10:05:51] (03CR) 10Giuseppe Lavagetto: [C: 032] "I think it's ok for now for us to think puppetmaster == conftool::master" [puppet] - 10https://gerrit.wikimedia.org/r/345537 (owner: 10Alexandros Kosiaris) [10:06:43] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is 
OK: OK - failed 18 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [10:25:15] 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3159985 (10MoritzMuehlenhoff) What's running on ldap1.corp.wikimedia.org? (the host against which our openldap servers are synching). Some Linux server synching with t... [10:41:32] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, and 2 others: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345#3160016 (10Pnorman) per discussion on irc The plugin uses ``` SELECT CASE WHEN pg_last_xlog_receive_location() = pg_la... [10:52:50] (03PS1) 10Giuseppe Lavagetto: run-puppet-agent: some tweaks [puppet] - 10https://gerrit.wikimedia.org/r/346729 [10:59:15] <_joe_> !log running some tests for the switchdc automation [10:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:18] (03PS1) 10Alexandros Kosiaris: Move role::backup::host into a profile [puppet] - 10https://gerrit.wikimedia.org/r/346732 [11:16:13] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3033.20 Read Requests/Sec=5151.50 Write Requests/Sec=10.80 KBytes Read/Sec=21838.00 KBytes_Written/Sec=4148.00 [11:24:42] (03CR) 10Volans: [C: 04-1] "I think we should use enable-puppet with the message. 
See details inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346729 (owner: 10Giuseppe Lavagetto) [11:24:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346721 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [11:25:57] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346721 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [11:26:27] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1081 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346721 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [11:27:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increae db1081 weight - T161088 (duration: 00m 40s) [11:27:13] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=9.10 Read Requests/Sec=103.60 Write Requests/Sec=39.80 KBytes Read/Sec=1076.80 KBytes_Written/Sec=4753.20 [11:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:15] T161088: Defragment s4: db1091, db1084, db1081, d1059 and probably the rest - https://phabricator.wikimedia.org/T161088 [11:34:12] (03PS2) 10Gehel: maps - publish postgresql replication lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) [11:35:15] (03CR) 10jerkins-bot: [V: 04-1] maps - publish postgresql replication lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [11:36:01] 06Operations, 07HHVM: Frequent TCP RST on connections between HHVM and Redis - https://phabricator.wikimedia.org/T162354#3160102 (10elukey) [11:36:13] (03PS3) 10Gehel: maps - publish postgresql replication lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) [11:36:16] moritzm: --^ 
opened this task to track the work [11:36:21] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3160117 (10Marostegui) [11:37:40] (03PS1) 10Ema: cp1008: override cache::route_table [puppet] - 10https://gerrit.wikimedia.org/r/346733 [11:44:43] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:51:00] !log upgrade cp2006 to linux 4.9 T162029 [11:51:04] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89920.32 seconds [11:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:07] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [12:00:43] jouncebot next [12:00:43] In 0 hour(s) and 59 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T1300) [12:01:23] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:43] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:14:40] (03PS1) 10KartikMistry: apertium-spa: New upstream release [debs/contenttranslation/apertium-spa] - 10https://gerrit.wikimedia.org/r/346748 (https://phabricator.wikimedia.org/T161511) [12:16:23] !log uploaded HHVM 3.18.2 to jessie-wikimedia/experimental [12:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor inline comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [12:20:38] hoo: o/ - re: why I abandoned the patch - I think that it is not useful for the moment no? 
We'd need Debian Jessie and HHVM 3.18 on snapshot* before applying any GC tunables [12:20:55] (03PS1) 10Marostegui: db-eqiad.php: Restore db1081 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346750 (https://phabricator.wikimedia.org/T161088) [12:20:59] !log upgrade cp2009 to linux 4.9 T162029 [12:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:06] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [12:24:07] (03CR) 10Elukey: [V: 032 C: 032] Check the Request Authorization Header for '%u' [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/317825 (owner: 10R4q3NWnUx2CEhVyr) [12:26:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1081 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346750 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [12:27:25] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1081 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346750 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [12:27:34] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1081 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346750 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [12:28:01] !log cp2009 stuck rebooting, powercycled [12:28:03] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2009_v4, cp2009_v6 [12:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1081 original weight - T161088 (duration: 00m 40s) [12:28:24] nevermind the IPsec errors, that's cp2009 ^ [12:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:27] T161088: Defragment s4: db1091, db1084, db1081, d1059 and probably the rest - https://phabricator.wikimedia.org/T161088 [12:28:43] PROBLEM - 
IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2009_v4, cp2009_v6 [12:28:44] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2009_v4, cp2009_v6 [12:28:44] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2009_v4, cp2009_v6 [12:28:44] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2009_v4, cp2009_v6 [12:28:44] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2009_v4, cp2009_v6 [12:28:44] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp2009_v4, cp2009_v6 [12:29:23] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:29:43] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [12:29:43] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 28 ESP OK [12:29:43] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 28 ESP OK [12:29:43] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 28 ESP OK [12:29:43] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [12:29:44] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [12:30:03] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:30:03] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 28 ESP OK [12:30:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:34:25] (03PS4) 10Gehel: maps - publish postgresql replication lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) [12:34:33] (03CR) 10Gehel: maps - publish postgresql replication lag to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [12:35:23] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:38:30] (03CR) 10Alexandros Kosiaris: [C: 031] maps - publish postgresql replication lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [12:39:26] !log rebooting cp2006 again to check for potential issues bringing up network ifaces / loading intel_uncore T162029 [12:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:32] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [12:50:03] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:50:20] !log upgraded mw1261 to HHVM 3.18.2 with cherrypicked fix for stat_cache deadlock, now running with stat_cache enabled again [12:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:29] !log upgrade cp3007 to linux 4.9 T162029 [12:51:33] (03PS5) 10Gehel: maps - publish postgresql replication lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) [12:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:35] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [12:52:33] jouncebot: refresh [12:52:36] I refreshed my knowledge about deployments. [12:52:37] jouncebot: next [12:52:37] In 0 hour(s) and 7 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T1300) [12:52:41] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3160258 (10MoritzMuehlenhoff) The dead lock could be narrowed down to incorrect locking in stat_cache and was eventually fixed by upstream. I've built a new package which is now ru... [12:52:50] (03CR) 10Gehel: [C: 032] maps - publish postgresql replication lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [12:54:51] phuedx: deploying your own change? [12:55:13] PROBLEM - Host frauth1001 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 3095.08 ms [12:55:32] zeljkof: it's been a long time since i've deployed a backported change [12:55:44] elukey: I know, but still this is the right thing [12:55:48] so why not set it for the future? [12:55:50] Reedy: if you have bandwidth, could you review the patch to enable 3D on beta please? 
( https://gerrit.wikimedia.org/r/#/c/341332/ ) [12:56:24] though I can just land it [12:56:28] hmm. frauth1001 is not expected... [12:56:34] PROBLEM - Host tellurium is DOWN: PING CRITICAL - Packet loss = 100% [12:56:39] zeljkof, hashar: wait wait wait, the submodule commit is automatically created!? [12:56:47] WHEN DID THIS HAPPEN :D :D :D [12:56:56] Jeff_Green: neither tellurium ? [12:56:56] phuedx: its been that way for awhile [12:56:57] ah, routerfaceplant. great [12:57:11] akosiaris: yeah [12:57:12] phuedx: since I bitched about it and chad magically made it happen :-} [12:57:21] Zppix: as i said, i haven't deployed anything other than config changes in a long time [12:57:23] is it causing visible autagle Jeff_Green ? [12:57:25] like 1.5 years [12:57:34] *outage [12:57:52] phuedx: so yeah a change to an extension ends up automagically bumping the submodule in mediawiki/core :} [12:57:54] phuedx: thats shorter time than me (which was neverago) [12:57:55] * akosiaris trying to login into the pfw [12:57:55] jynus: probably, yes. checking [12:58:25] yup. donation processing is disrupted [12:58:26] It looks like yes- timeout [12:58:36] hashar, phuedx: what's the plan for swat today? phuedx deploying his commit? [12:58:59] can i watch and learn someone via a stream? [12:59:03] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:59:05] as i say, it's been a //long// time [12:59:11] Jeff_Green: looking into pfw now [12:59:16] * phuedx is just reading the docs again [12:59:38] (03CR) 10Reedy: [C: 031] "One minor (non blocking) issue" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [12:59:42] phuedx: were you using Windows 3.1 / VBScript at that time? :-] [12:59:44] phuedx: we just use scap now i think [12:59:45] hashar: should I do the swat today? or do you want to? 
[12:59:46] akosiaris: i had a hefty rsync going (cloning database), i just stopped it and everything came back coincidentally, maybe it's the cause [13:00:02] hashar: i think i have vbscript on my cv ;) [13:00:04] Jeff_Green: ah yes. been there, done that [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T1300). [13:00:05] matthiasmullie and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:05] starting it again to confirm [13:00:10] deja vu [13:00:13] RECOVERY - Host tellurium is UP: PING WARNING - Packet loss = 0%, RTA = 1686.42 ms [13:00:16] yeah that's almost certainly it [13:00:17] RECOVERY - Host frauth1001 is UP: PING WARNING - Packet loss = 0%, RTA = 1682.66 ms [13:00:30] akosiaris: yeah, although last time it was at codfw [13:00:44] it's the exact same equipment though, isn't it ? [13:00:45] akosiaris, was it you or did it come back on it own? [13:00:46] * volans daja vu too [13:00:58] jynus: it was jeff killing an rsync [13:01:02] ha [13:01:03] it is, although I think we may have more cross-connects between them in eqiad [13:01:05] (03CR) 10Hashar: "This change being solely for the beta cluster, it can be merged anytime outside of the SWAT window. Just remember to rebase the productio" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [13:01:14] hoo: well it would really misleading to have those options in there, especially since they seems to trigger some unknown behavior before 3.18. 
Let's wait that 3.18 is everywhere and then I'll re-add it [13:01:25] Ok [13:01:25] zeljkof: I can do it with phuedx over a hangouts [13:01:31] it's strange that I've been moving data around heftily since we bought this equip, and this issue has only started happening in the past few months [13:01:34] Jeff_Green: didn't you add the --bwlimit last time it happened? [13:01:43] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:01:53] (03PS1) 10Gehel: postgresql - fix publication of postgresql replication lag [puppet] - 10https://gerrit.wikimedia.org/r/346754 [13:01:55] yes, I can do that now if it's the issue [13:02:08] hashar: ok, have fun then :) [13:02:12] phuedx: or I can just do it and you verify on mwdebug1001 [13:02:17] rsync is going again, no disruption yet [13:02:43] hashar: spin up a hangout so that i can refresh my memory faster [13:02:46] akosiaris: are you able to tell whether one of the pfws crashed this time? [13:02:52] hashar: I will be around, on a late lunch, pingable on irc [13:03:07] Jeff_Green: yup. no crash. both have an uptime of 710 days [13:03:12] ok [13:03:13] maybe high bandwidth is necessary but not sufficient condition? [13:03:13] PROBLEM - Host tellurium is DOWN: PING CRITICAL - Packet loss = 100% [13:03:18] ha [13:03:51] Jeff_Green: also tellurium and frauth1001 are on the same FPC [13:03:55] yeah [13:04:05] so no cross connects there [13:04:17] there are no graphs from those hosts so we can check if it is saturating the ifaces? 
[13:04:30] really weird thing is I didn't lose connectivity through tellurium while icinga was reporting it as down [13:04:35] (03CR) 10Matthias Mullie: [C: 031] Enable 3d extension on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [13:04:37] (03PS5) 10Matthias Mullie: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [13:05:00] Jeff_Green: yeah that's TCP being resistant enough to packet loss [13:05:19] is the polling just icmp? [13:05:22] clearly the packetloss was not too much for that TCP flow [13:05:22] yes [13:05:23] i guess it must be [13:05:29] just icmp echo requests [13:05:33] (03CR) 10Gehel: [C: 032] postgresql - fix publication of postgresql replication lag [puppet] - 10https://gerrit.wikimedia.org/r/346754 (owner: 10Gehel) [13:05:34] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:06:06] alright...employing --bwlimit [13:06:54] hashar can you merge that beta config change during this swat, or should I get another window to sync the config change in prod? [13:07:34] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:08:32] * Jeff_Green wishes I could remember what bwlimit worked well last time... 
[13:09:42] matthiasmullie: that can be done out of the swat window [13:09:48] matthiasmullie: but will +2 it once I am done with phuedx patch [13:11:57] alright thanks :) [13:12:03] PROBLEM - Host frauth1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:12:10] crud [13:13:08] Jeff_Green: btw --bwlimit is in B/s so around 80M is a good value for a 1Gbit ethernet connected host [13:13:17] Jeff_Green: [13:13:20] 2017-01-25 21:56:44Jeff_Green| bwlimit=100m bumps ping times from 37 to 500ms - 1s [13:13:43] RECOVERY - Host tellurium is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [13:13:45] 100M is probably a bit too high [13:13:57] yes agree, I was grepping the history ;) [13:14:07] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, and 2 others: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345#3160274 (10Gehel) Graph of postgres replication lag are available on [[ https://grafana.wikimedia.org/dashboard/db/maps-perfor... [13:14:21] I would stay on 60m ~ .5Gbps [13:14:50] alright, trying 60m [13:15:13] RECOVERY - Host frauth1001 is UP: PING OK - Packet loss = 0%, RTA = 10.01 ms [13:15:18] * Jeff_Green is tempted to start direct-connecting database boxes :-P [13:15:57] but but but ... PCI !!!! :P [13:16:14] are you rsyncing mysqls? [13:16:16] how is the auditor gonna say it's PCI compliant ? [13:16:19] ha, seems ok from a PCI perspective :-P [13:16:28] hehehe [13:16:28] jynus yeah [13:16:29] hope is from a snapshot... [13:16:37] or they are dead [13:16:54] jynus: no they're live and heavily active :-P [13:17:03] just kidding. they're shut down [13:17:07] ok [13:17:10] sorry, I had to ask [13:17:21] as a consultant, it woudln't be the first time I've seen that [13:17:24] it's ok, I will get over the crushing insult [13:17:40] lol [13:17:56] then they asked, "but it is transactional, it shoudl work!" [13:17:57] hey, while I have your attention, what filesystem are we preferring these days for mariadb? 
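Editor's note on the --bwlimit figures traded above (100m, 80M, 60m against a 1 Gbit link): the arithmetic can be checked with a small sketch. It assumes the unit semantics described in the rsync man page, where a RATE with no suffix counts units of 1024 bytes per second and K/M/G suffixes are powers of 1024. The helper names `bwlimit_to_bits_per_sec` and `link_fraction` are hypothetical, and only single-letter suffixes are handled.

```python
# Assumed multipliers: rsync treats a bare RATE as units of 1024 bytes/s,
# and K / M / G suffixes as powers of 1024 (per the rsync man page).
SUFFIX_BYTES = {"": 1024, "k": 1024, "m": 1024**2, "g": 1024**3}


def bwlimit_to_bits_per_sec(rate: str) -> int:
    """Convert an rsync --bwlimit value such as '60m' to bits per second."""
    rate = rate.strip().lower()
    suffix = rate[-1] if rate[-1].isalpha() else ""
    number = rate[:-1] if suffix else rate
    return int(float(number) * SUFFIX_BYTES[suffix] * 8)


def link_fraction(rate: str, link_bits_per_sec: int = 1_000_000_000) -> float:
    """Fraction of the link (default 1 Gbit/s) that the limit would allow."""
    return bwlimit_to_bits_per_sec(rate) / link_bits_per_sec


if __name__ == "__main__":
    for rate in ("60m", "80m", "100m"):
        mbit = bwlimit_to_bits_per_sec(rate) / 1e6
        print(f"--bwlimit={rate}: {mbit:.0f} Mbit/s, "
              f"{link_fraction(rate):.0%} of a 1 Gbit/s link")
```

Under those assumptions 60m works out to roughly half of a 1 Gbit/s link, which matches the "60m ~ .5Gbps" estimate in the conversation, while 100m leaves little headroom for other traffic on the same path.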
[13:18:03] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:18:06] I like xfs [13:18:11] jynus: ok [13:18:17] I don't :P [13:18:18] nobarrier & noatime? [13:18:19] ext4 is an alternative, with some tuning [13:18:29] mostly due to all the lockups on swift though [13:18:42] if I had a dollar for every time XFS bugs caused issues where other filesystems would be fine :P [13:18:47] Jeff_Green, if you are using debian, the default is norelatime [13:18:48] * Jeff_Green just had a reiserfs flashback :-( [13:18:53] which is good enough [13:19:31] jynus: what tuning do you do with ext4? [13:19:53] the old dbs were xfs so I'll stick with that, but curious re. ext4 [13:19:56] I haven't done it- but if you enable transactional file operations [13:20:09] you can get away with no double write [13:20:18] oic [13:20:24] and in a lab, it showed equally secure and more performant [13:20:29] I haven't tried myself [13:20:33] yup [13:20:36] plus there are some ext4 nice things [13:20:50] filesystems are a bit like a cult [13:20:51] !log hashar@tin Synchronized php-1.29.0-wmf.19/extensions/Popups: renderer: Pass event to behavior for processing - T162324 (duration: 00m 51s) [13:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:58] T162324: Clicking settings cog as anonymous user doesn't do anything - https://phabricator.wikimedia.org/T162324 [13:21:08] so don't ask me to enter into religios wars :-) [13:21:14] hahaha [13:21:37] xfs is the easy answer, with as alex says, some bugs in older kernels [13:21:51] i've used ext4 and xfs for years, I've been sufficiently untraumatized by them to stay the course for now [13:22:13] assuming you had ssds [13:22:16] i keep reading that btfrs is amazing, but my very limited experience with it (laptop) has been horrendous [13:22:40] ok. 
these are new HP's with hardware RAID and SSDs, running jessie [13:22:41] yeah, that is like cold fusion- always 20 years away [13:22:47] loooool [13:23:30] opensuse switched to btrfs at some point as the default fs, and I foolishly went along with it [13:23:40] akosiaris, didn't I say it was cult-like "feelings" [13:24:01] I can tell you many reasons not to use all filesystems [13:24:15] of course [13:24:56] and they all end up getting better [13:24:58] i don't like anything with less than 10 years of other people suffering through bugs before I try it [13:25:03] PROBLEM - check_puppetrun on frauth1001 is CRITICAL: CRITICAL: Puppet has 10 failures [13:25:29] (03CR) 10Paladox: "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [13:25:36] much like the btrfs cold fusion, all kernels eventually become an "old kernel" that exhibits XFS bugs :) [13:25:48] 06Operations, 10Wikimedia-Logstash, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3160292 (10elukey) Catching up with this task.. So kafkatee seems not supporting group-ids, meanwhile kafkacat does: https://github.com/edenhill/kafkacat ```... [13:25:56] (03PS6) 10Hashar: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [13:26:01] considering we are about to recommend a "large" upgrade of mysql version to at least a 10-year old mysql version, I know what you are saying :-) [13:26:01] is there a regression that keeps coming back? [13:26:04] it seems like it's been a constant for like 20 years now. XFS is always the perf recommendation, and XFS always eventually blows up in your face [13:26:12] matthiasmullie: doing your 3D on beta patch [13:26:21] perfect! 
:) [13:26:46] bblack: I've had pretty good luck with XFS so far, but now that I say this I'm sure it will try to kill me [13:26:53] :) [13:28:13] let's just go back to reiserfs3, before hans decided to become dexter morgan :P [13:28:44] has anyone tried f2fs on ssds for anything serious? [13:28:46] my latest gripe is with what I suspect is the combination of rsyslog and systemd (fails 10y test), where I can't get rsyslog to let go of files on /srv where it is tailing mysql_slow logs [13:28:50] and when we gets released, reiser4 will be ready [13:29:02] the biggest issue against ext* fs in the past is async io on a single file with O_DIRECT [13:29:03] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:19] bblack: only on my smartphone. [13:29:19] I am not sure if it was fixed in the latest versions [13:29:33] akosiaris: I don't want to go back to reiserfs, that is a FS that burned me several times [13:29:42] 06Operations, 10Wikimedia-Logstash, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3160297 (10Ottomata) For simply writing to a log file, ya I think kafkacat would be just fine. But, even though kafkatee doesn’t have consumer offset commits... [13:30:00] (03CR) 10Hashar: [C: 032] "Looks good now :-}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [13:30:03] !log Deploy Deploy schema change dbstore1002 (s7 wikis) - T160390 [13:30:04] RECOVERY - check_puppetrun on frauth1001 is OK: OK: Puppet is currently enabled, last run 213 seconds ago with 0 failures [13:30:04] Jeff_Green: I can guess. 
It never burned me, but last time I used it I was in grad school [13:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:12] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [13:30:23] akosiaris, I heard it resurrected a bit not a long time ago [13:30:37] as in, interesting updates and maintenance [13:30:42] I remember it fondly but in that way you fondly remember some stuff from your childhood [13:30:43] RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [13:30:49] of course you could view all performance choices for filesystems to be micro-optimizations not worth chasing. in theory you can just have more shards or nodes or whatever to make up the loss of io perf and go with a more-stable FS choice [13:31:10] but in the case of a database, I can see how the cost of not chasing disk io optimization could be high :) [13:31:12] lets get a farm of 5400 RPM FAT32 disks [13:31:44] bblack, the bug or feature I told you is normally non-interesting, but for mysql it was quite a stopper [13:31:49] yeah [13:32:36] the best way anyway is to actually test on the hardware [13:32:57] (03Merged) 10jenkins-bot: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [13:33:12] (03CR) 10jenkins-bot: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [13:33:35] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346757 (https://phabricator.wikimedia.org/T160390) [13:34:49] !log European SWAT completed [13:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:50] matthiasmullie: beta deployment dies due to extensions/3d/extension.json does not exist hehe 
[13:36:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346757 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [13:38:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346757 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [13:38:32] (03PS2) 10Ema: cp1008: override cache::route_table [puppet] - 10https://gerrit.wikimedia.org/r/346733 [13:38:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346757 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [13:38:44] matthiasmullie: 3d --> 3D for some reason :( [13:39:01] ugh [13:39:13] !log Deploy schema change db1079 (s7 wikis) - T160390 [13:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:20] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [13:39:36] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db1079 - T160390 (duration: 00m 43s) [13:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:42] matthiasmullie: so I guess wmf-config/extension-list-labs has to be adjusted and in CommonSettings-labs.php most probably want to adjust wfLoadExtension( '3d' ); as well [13:40:44] (03PS3) 10Alexandros Kosiaris: Remove system::role {'config-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345536 [13:40:46] (03PS1) 10Matthias Mullie: Fix 3D extension case [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346758 [13:40:48] (03CR) 10Alexandros Kosiaris: [V: 032] Remove system::role {'config-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345536 (owner: 10Alexandros Kosiaris) [13:41:00] (03PS3) 10Alexandros Kosiaris: Remove system::role { 'conftool-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345537 [13:41:03] PROBLEM - puppet last run on cp3044 is 
CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:41:03] (03CR) 10Alexandros Kosiaris: [V: 032] Remove system::role { 'conftool-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345537 (owner: 10Alexandros Kosiaris) [13:41:15] (03PS2) 10Matthias Mullie: Fix 3D extension case [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346758 [13:41:36] (03PS3) 10Alexandros Kosiaris: Move and rename system::role{ 'role::docker::builder':} [puppet] - 10https://gerrit.wikimedia.org/r/345538 [13:41:43] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Move and rename system::role{ 'role::docker::builder':} [puppet] - 10https://gerrit.wikimedia.org/r/345538 (owner: 10Alexandros Kosiaris) [13:41:47] (03CR) 10Hashar: [C: 032] "Lets try it -:}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346758 (owner: 10Matthias Mullie) [13:41:54] (03PS2) 10Alexandros Kosiaris: Move and rename system::role{ 'role::elasticsearch::server':} [puppet] - 10https://gerrit.wikimedia.org/r/346713 [13:41:58] and I think it's lowercase '3d' in mw-vagrant :) [13:42:01] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Move and rename system::role{ 'role::elasticsearch::server':} [puppet] - 10https://gerrit.wikimedia.org/r/346713 (owner: 10Alexandros Kosiaris) [13:42:11] (03PS2) 10Alexandros Kosiaris: Move and rename system::role{ 'role::gerrit::server':} [puppet] - 10https://gerrit.wikimedia.org/r/346714 [13:42:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Move and rename system::role{ 'role::gerrit::server':} [puppet] - 10https://gerrit.wikimedia.org/r/346714 (owner: 10Alexandros Kosiaris) [13:42:27] (03PS2) 10Alexandros Kosiaris: Remove webserver_misc_static system::roles [puppet] - 10https://gerrit.wikimedia.org/r/346715 [13:42:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove webserver_misc_static system::roles [puppet] - 10https://gerrit.wikimedia.org/r/346715 (owner: 10Alexandros Kosiaris) [13:42:41] coffee.send().one().to( 
User::singleton('hashar') ) [13:42:43] (03PS2) 10Alexandros Kosiaris: Move system::role { 'role::planet::venus': } [puppet] - 10https://gerrit.wikimedia.org/r/346716 [13:42:49] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Move system::role { 'role::planet::venus': } [puppet] - 10https://gerrit.wikimedia.org/r/346716 (owner: 10Alexandros Kosiaris) [13:42:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 - T160390 (duration: 00m 39s) [13:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:00] (03PS2) 10Alexandros Kosiaris: Remove system::role { 'profile::redis::master':} [puppet] - 10https://gerrit.wikimedia.org/r/346717 [13:43:06] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove system::role { 'profile::redis::master':} [puppet] - 10https://gerrit.wikimedia.org/r/346717 (owner: 10Alexandros Kosiaris) [13:43:15] (03PS2) 10Alexandros Kosiaris: Remove system::role {"profile::redis::${category}":} [puppet] - 10https://gerrit.wikimedia.org/r/346718 [13:43:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove system::role {"profile::redis::${category}":} [puppet] - 10https://gerrit.wikimedia.org/r/346718 (owner: 10Alexandros Kosiaris) [13:43:31] (03PS2) 10Alexandros Kosiaris: Remove system::role {"profile::redis::slave":} [puppet] - 10https://gerrit.wikimedia.org/r/346719 [13:43:35] (03CR) 10Alexandros Kosiaris: [V: 032] Remove system::role {"profile::redis::slave":} [puppet] - 10https://gerrit.wikimedia.org/r/346719 (owner: 10Alexandros Kosiaris) [13:44:26] !log re-generating tiles for tasmania on maps codfw cluster [13:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:33] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:46:34] (03PS2) 10Giuseppe Lavagetto: run-puppet-agent: some tweaks [puppet] - 10https://gerrit.wikimedia.org/r/346729 [13:47:09] (03PS1) 10Alexandros Kosiaris: Remove system::role {'requesttracker::server': } [puppet] - 10https://gerrit.wikimedia.org/r/346759 [13:47:11] (03PS1) 10Alexandros Kosiaris: Move system::role {'tor::relay': } [puppet] - 10https://gerrit.wikimedia.org/r/346760 [13:48:09] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3160332 (10ayounsi) From Juniper at about 9am UTC: >Sent by Carrier: UPS >Tracking Number: 1Z223V170461615001 >Tracking URL: http://wwwapps.ups.com/WebTracking/track?track=yes&trackNums=1Z223V170461615001... [13:48:52] (03Merged) 10jenkins-bot: Fix 3D extension case [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346758 (owner: 10Matthias Mullie) [13:49:02] (03CR) 10jenkins-bot: Fix 3D extension case [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346758 (owner: 10Matthias Mullie) [13:52:47] (03PS10) 10Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 [13:55:28] (03CR) 10Muehlenhoff: [C: 032] Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [13:58:03] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:59:38] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3160374 (10Marostegui) tempdb2001 is now replicating from db2033. It is running 10.0.30 (mysql_upgrade has been run), SSL is enabled. Right now it is quite delayed: ``` Seconds_Behi... [14:00:04] hoo: Dear anthropoid, the time has come. 
Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T1400). [14:00:31] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3160378 (10Marostegui) [14:00:48] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3160379 (10jcrespo) Cool! We need to let it replicate for a while before pooling it to confirm is is ok. [14:01:24] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3160381 (10Marostegui) >>! In T162290#3160379, @jcrespo wrote: > Cool! We need to let it replicate for a while before pooling it to confirm is is ok. Yes! I will leave the ticket op... [14:03:30] (03PS1) 10Giuseppe Lavagetto: Fix join order, this ain't ruby... [switchdc] - 10https://gerrit.wikimedia.org/r/346762 [14:03:32] (03PS1) 10Giuseppe Lavagetto: Remove specialized puppet functions [switchdc] - 10https://gerrit.wikimedia.org/r/346763 [14:03:51] (03PS8) 10Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) [14:03:53] (03CR) 10jerkins-bot: [V: 04-1] Remove specialized puppet functions [switchdc] - 10https://gerrit.wikimedia.org/r/346763 (owner: 10Giuseppe Lavagetto) [14:04:08] !log reimage analytics1002 to Debian Jessie (Hadoop Master Node standby) [14:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:08] (03CR) 10Giuseppe Lavagetto: run-puppet-agent: some tweaks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346729 (owner: 10Giuseppe Lavagetto) [14:07:27] 06Operations, 06Discovery, 06Maps, 10Traffic, 03Interactive-Sprint: Make maps active / active - https://phabricator.wikimedia.org/T162362#3160389 (10Gehel) [14:07:34] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: 
Add tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346764 (https://phabricator.wikimedia.org/T162290) [14:08:00] (03CR) 10Hoo man: [C: 032] Update Wikibase site id and group for test2wiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [14:08:57] matthiasmullie: 3D might be on beta now :) [14:09:04] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:09:06] (03Merged) 10jenkins-bot: Update Wikibase site id and group for test2wiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [14:10:01] (03CR) 10Jcrespo: [C: 031] db-codfw,db-eqiad.php: Add tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346764 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [14:10:18] commit a116ba8575bb0cefe2f69ebadf2b8e8da2fd45b6 [14:10:18] Author: Matthias Mullie [14:10:32] That's merged but not deployed! [14:10:38] :S [14:11:04] (03CR) 10jenkins-bot: Update Wikibase site id and group for test2wiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [14:11:06] (03PS5) 10Muehlenhoff: Disable wireshark-common/install-setuid to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 [14:11:39] 06Operations, 06Discovery, 06Maps, 10Traffic, 03Interactive-Sprint: Make maps active / active - https://phabricator.wikimedia.org/T162362#3160407 (10Gehel) Looking at Tasmania on the maps / codfw cluster, it looks like we did not regenerate all tiles after the T159631 incident. This is now in progress.... 
[14:11:53] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/346729 (owner: 10Giuseppe Lavagetto) [14:12:37] !log hoo@tin Synchronized wmf-config/Wikibase.php: Add testwiki and test2wiki to "specialSiteLinkGroups" on testwikidata (T94416) (duration: 00m 40s) [14:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:44] T94416: [Task] Don't use 'enwiki' as site id for test2wiki and testwiki - https://phabricator.wikimedia.org/T94416 [14:14:33] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [14:15:53] (03PS1) 10Elukey: Set Debian Jessie image for analytics100[123] [puppet] - 10https://gerrit.wikimedia.org/r/346766 [14:16:59] (03CR) 10Elukey: [V: 032 C: 032] Set Debian Jessie image for analytics100[123] [puppet] - 10https://gerrit.wikimedia.org/r/346766 (owner: 10Elukey) [14:20:13] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 52613 seconds ago, expected 28800 [14:23:42] matthiasmullie: and now beta fails on loginwiki which lacks MultimediaViewer : [14:25:13] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 52913 seconds ago, expected 28800 [14:27:01] (03PS1) 10Matthias Mullie: Only enable 3d if MultimediaViewer is enabled as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346767 [14:27:22] ugh, I didn't realise it was a hard dependency [14:27:41] extension registry ends up with a fatal : https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/16183/console [14:27:42] :( [14:28:00] and mmv is not on private/loginwiki [14:28:34] I just updated config to also check for $wmgUseMultimediaViewer [14:28:39] (https://gerrit.wikimedia.org/r/346767) [14:28:42] I have no clue how extension registry work, I am surprised it fatals out though [14:29:44] that makes 2; I had no idea it would fail like that, but it's a good thing it is checked :) [14:29:54] ;- [14:29:56] } 
[14:30:13] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 53213 seconds ago, expected 28800 [14:32:14] (03CR) 10Hashar: [C: 032] Only enable 3d if MultimediaViewer is enabled as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346767 (owner: 10Matthias Mullie) [14:33:00] (03PS3) 10Giuseppe Lavagetto: run-puppet-agent: some tweaks [puppet] - 10https://gerrit.wikimedia.org/r/346729 [14:33:27] (03Merged) 10jenkins-bot: Only enable 3d if MultimediaViewer is enabled as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346767 (owner: 10Matthias Mullie) [14:33:35] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] run-puppet-agent: some tweaks [puppet] - 10https://gerrit.wikimedia.org/r/346729 (owner: 10Giuseppe Lavagetto) [14:33:40] (03CR) 10jenkins-bot: Only enable 3d if MultimediaViewer is enabled as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346767 (owner: 10Matthias Mullie) [14:35:13] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 53513 seconds ago, expected 28800 [14:36:55] matthiasmullie: update.php is fixed on beta :-} [14:37:14] alright :) [14:37:20] thanks! [14:37:23] matthiasmullie: would have to remember to use the same thing on prod later on else loginwiki might break somehow [14:38:15] (03PS2) 10Giuseppe Lavagetto: Remove specialized puppet functions [switchdc] - 10https://gerrit.wikimedia.org/r/346763 [14:38:40] wouldn't want that happening :) [14:40:13] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 53813 seconds ago, expected 28800 [14:43:50] 06Operations, 06Labs, 13Patch-For-Review: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3160490 (10chasemp) @andrew (as discussed on irc) I worry about the impact of https://gerrit.wikimedia.org/r/#/c/346318/2/modules/openstack/templates/lib... [14:43:59] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix join order, this ain't ruby... 
[switchdc] - 10https://gerrit.wikimedia.org/r/346762 (owner: 10Giuseppe Lavagetto) [14:44:06] (03CR) 10Giuseppe Lavagetto: [C: 032] Remove specialized puppet functions [switchdc] - 10https://gerrit.wikimedia.org/r/346763 (owner: 10Giuseppe Lavagetto) [14:45:13] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 54113 seconds ago, expected 28800 [14:45:37] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3160496 (10Marostegui) Too early to say, but we might have IO problems on this host: {F7309725} If I `set global innodb_flush_log_at_trx_commit = 0;` the server starts to catch up... [14:46:41] !log hoo@tin Synchronized wmf-config/: Don't use "enwiki" as Wikibase site id on testwiki and test2wiki (T94416) (duration: 01m 08s) [14:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:48] T94416: [Task] Don't use 'enwiki' as site id for test2wiki and testwiki - https://phabricator.wikimedia.org/T94416 [14:47:09] 06Operations, 06Labs, 13Patch-For-Review: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3160515 (10chasemp) Small note though that we have not dropped an instance since we changed dnsmasq params and this is almost certainly /an/ issue histor... [14:47:22] (03CR) 10Mobrovac: "Tested in Beta, works." [puppet] - 10https://gerrit.wikimedia.org/r/346248 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [14:50:13] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 54413 seconds ago, expected 28800 [14:50:54] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3160517 (10jcrespo) No hardware RAID like the other servers- we need to set innodb_flush_log_at_trx_commit = 0; sync_binlog=0; and other stuff to reduce syncronous IO. If it goes dow... 
[14:51:53] !log Restarted apache on mwdebug1001 in order to test a potential CACHE_ACCEL issue [14:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:43] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:13] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 54713 seconds ago, expected 28800 [14:55:30] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 42s) [14:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:09] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 55013 seconds ago, expected 28800 [15:02:19] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:04:47] (03PS2) 10Alexandros Kosiaris: Remove system::role {'requesttracker::server': } [puppet] - 10https://gerrit.wikimedia.org/r/346759 [15:04:53] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove system::role {'requesttracker::server': } [puppet] - 10https://gerrit.wikimedia.org/r/346759 (owner: 10Alexandros Kosiaris) [15:05:09] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 55314 seconds ago, expected 28800 [15:05:13] (03PS2) 10Alexandros Kosiaris: Move system::role {'tor::relay': } [puppet] - 10https://gerrit.wikimedia.org/r/346760 [15:05:29] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Move system::role {'tor::relay': } [puppet] - 10https://gerrit.wikimedia.org/r/346760 (owner: 10Alexandros Kosiaris) [15:07:08] (03PS1) 10Marostegui: mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/346773 [15:09:11] 06Operations, 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA, 13Patch-For-Review: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3160560 (10Nuria) [15:09:31] 
(03PS2) 10Marostegui: mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/346773 [15:10:09] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 55614 seconds ago, expected 28800 [15:10:29] (03PS3) 10Marostegui: mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/346773 [15:11:32] 06Operations, 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA, 13Patch-For-Review: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952524 (10Nuria) 05Open>03Resolved [15:14:39] (03PS4) 10Marostegui: mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/346773 [15:15:04] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3160595 (10Ottomata) [15:15:09] PROBLEM - check_puppetrun on frdb1001 is CRITICAL: CRITICAL: Puppet last ran 55914 seconds ago, expected 28800 [15:17:47] (03CR) 10Jcrespo: [C: 031] "This is ok for a quick'n'dirty patch now, but we need to parametrize it in the future and use hiera to apply it, as I have been doing with" [puppet] - 10https://gerrit.wikimedia.org/r/346773 (owner: 10Marostegui) [15:18:46] (03CR) 10Marostegui: "> This is ok for a quick'n'dirty patch now, but we need to" [puppet] - 10https://gerrit.wikimedia.org/r/346773 (owner: 10Marostegui) [15:18:58] (03PS2) 10Marostegui: db-codfw,db-eqiad.php: Add tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346764 (https://phabricator.wikimedia.org/T162290) [15:19:03] (03CR) 10Jcrespo: "Wait, is it $hostname or $::hostname?" 
[puppet] - 10https://gerrit.wikimedia.org/r/346773 (owner: 10Marostegui) [15:20:09] RECOVERY - check_puppetrun on frdb1001 is OK: OK: Puppet is currently enabled, last run 70 seconds ago with 0 failures [15:22:39] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:24:25] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Add tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346764 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [15:25:40] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Add tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346764 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [15:26:42] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Add tempdb2001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346764 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [15:26:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add tempdb2001 to config files - T162290 (duration: 00m 39s) [15:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:53] T162290: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290 [15:27:00] 06Operations, 10ops-esams, 10netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3160606 (10RobH) Done, I've also asked support about the followup part swap, and how we can arrange it. 
[15:27:33] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add tempdb2001 to config files - T162290 (duration: 00m 40s) [15:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:40] (03CR) 10Hoo man: [C: 032] Don't set removed Wikibase client settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346161 (owner: 10Hoo man) [15:30:48] (03PS2) 10Hoo man: Don't set removed Wikibase client settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346161 [15:31:19] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:35:49] (03PS1) 10Gehel: maps - introduce variables for tilerator / kartotherian scap based config [puppet] - 10https://gerrit.wikimedia.org/r/346775 (https://phabricator.wikimedia.org/T162240) [15:36:36] (03CR) 10jenkins-bot: Don't set removed Wikibase client settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346161 (owner: 10Hoo man) [15:36:46] !log hoo@tin Synchronized wmf-config/Wikibase.php: Don't set removed Wikibase client settings (duration: 00m 40s) [15:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:30] (03PS5) 10Marostegui: mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/346773 [15:39:48] (03PS2) 10BBlack: cache_misc: config-master.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346573 [15:40:29] (03CR) 10BBlack: [V: 032 C: 032] cache_misc: config-master.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346573 (owner: 10BBlack) [15:41:02] 06Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029#3160635 (10ema) The following cache hosts are currently running 4.9: cp2003,2006,2009, cp3007, cp4001,4011,cp1008 cp3003 has also been upgraded but is affected by hardware problems described in T162132. 
[15:41:32] (03PS3) 10Giuseppe Lavagetto: Delink new parsoid-vd test runs from updates to parsoid git repo [puppet] - 10https://gerrit.wikimedia.org/r/346196 (owner: 10Subramanya Sastry) [15:43:00] (03CR) 10Giuseppe Lavagetto: [C: 032] Delink new parsoid-vd test runs from updates to parsoid git repo [puppet] - 10https://gerrit.wikimedia.org/r/346196 (owner: 10Subramanya Sastry) [15:45:23] (03PS1) 10Hoo man: Fix Wikibase site groups for testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346776 [15:45:44] (03CR) 10Hoo man: [C: 032] Fix Wikibase site groups for testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346776 (owner: 10Hoo man) [15:46:49] (03Merged) 10jenkins-bot: Fix Wikibase site groups for testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346776 (owner: 10Hoo man) [15:46:58] (03CR) 10jenkins-bot: Fix Wikibase site groups for testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346776 (owner: 10Hoo man) [15:48:03] !log hoo@tin Synchronized wmf-config/Wikibase.php: Fix Wikibase site groups for testwiki and test2wiki (duration: 00m 40s) [15:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:59] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3160661 (10Papaul) @Marostegui we can not do a7 and b7 because we have 10G switch in a7 and b7 or the serves have only 1GB NIC's. Please relocate db2091 and db2092. Thanks. [15:50:25] marostegui: ^ [15:51:10] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:51:18] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3160662 (10Marostegui) [15:51:50] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3154083 (10Marostegui) >>! In T162159#3160661, @Papaul wrote: > @Marostegui we can not do a7 and b7 because we have 10G switch in a7 and b7 or the serves have only 1GB NIC... [15:53:55] mobrovac: Do you mind if I use the first 30m of the puppet swat? [15:55:27] (03PS2) 10Hoo man: Temporarily disable the change dispatch cron for testwikidata [puppet] - 10https://gerrit.wikimedia.org/r/346545 (https://phabricator.wikimedia.org/T159828) [15:56:50] hoo: i would just need a merge and then I can handle things myself and am on a tight schedule here [15:57:46] I also need a puppet merge to continue [15:58:21] jynus: marostegui: One of you willing to help? ;) [15:58:22] I can merge that one before puppet swat, would that work? [15:58:39] https://gerrit.wikimedia.org/r/346545 [15:58:41] Yes [15:58:49] (03CR) 10Jcrespo: [C: 032] Temporarily disable the change dispatch cron for testwikidata [puppet] - 10https://gerrit.wikimedia.org/r/346545 (https://phabricator.wikimedia.org/T159828) (owner: 10Hoo man) [15:58:50] (03PS2) 10BBlack: cache_misc: noc.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346572 [15:58:58] and later on a revert is needed, but even if not, no one will mind [15:59:08] dispatching on testwikidata didn't work at all until 20 minutes ago [15:59:15] so no one will mind it not working for another day [15:59:27] where does that run, terbium? [15:59:31] es [15:59:33] * yes [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T1600). Please do the needful. 
[16:00:04] mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:18] (03CR) 10BBlack: [C: 032] cache_misc: noc.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346572 (owner: 10BBlack) [16:00:25] (03PS3) 10BBlack: cache_misc: noc.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346572 [16:00:33] (03CR) 10BBlack: [V: 032 C: 032] cache_misc: noc.wm.o active/active [puppet] - 10https://gerrit.wikimedia.org/r/346572 (owner: 10BBlack) [16:00:52] hoo: Notice: /Stage[main]/Mediawiki::Maintenance::Wikidata/Cron[wikibase-dispatch-changes-test]/ensure: removed [16:01:03] Notice: Finished catalog run in 54.66 seconds [16:01:18] thanks! [16:03:31] (03CR) 10Hoo man: [C: 032] Temporarily enable change dispatch logging on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346540 (https://phabricator.wikimedia.org/T159828) (owner: 10Hoo man) [16:03:40] who's doing puppetswat? [16:04:11] (03CR) 10MaxSem: "Why is keyspace present for some hosts but not other ones?" 
[puppet] - 10https://gerrit.wikimedia.org/r/346775 (https://phabricator.wikimedia.org/T162240) (owner: 10Gehel) [16:07:59] (03CR) 10Gehel: "Because it is a mess :)" [puppet] - 10https://gerrit.wikimedia.org/r/346775 (https://phabricator.wikimedia.org/T162240) (owner: 10Gehel) [16:08:02] (03PS2) 10Hoo man: Temporarily enable change dispatch logging on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346540 (https://phabricator.wikimedia.org/T159828) [16:08:03] (03CR) 10Hoo man: [V: 032 C: 032] Temporarily enable change dispatch logging on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346540 (https://phabricator.wikimedia.org/T159828) (owner: 10Hoo man) [16:09:03] (03CR) 10Marostegui: "This compiles fine and changes the host we want it to change: https://puppet-compiler.wmflabs.org/6043/" [puppet] - 10https://gerrit.wikimedia.org/r/346773 (owner: 10Marostegui) [16:09:06] (03Merged) 10jenkins-bot: Temporarily enable change dispatch logging on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346540 (https://phabricator.wikimedia.org/T159828) (owner: 10Hoo man) [16:09:14] (03CR) 10jenkins-bot: Temporarily enable change dispatch logging on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346540 (https://phabricator.wikimedia.org/T159828) (owner: 10Hoo man) [16:10:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346779 [16:10:17] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346779 [16:11:23] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Temporarily enable change dispatch logging on testwikidata (duration: 00m 45s) [16:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:40] (03CR) 10Hoo man: [C: 032] Try using redisLockManager for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345387 
(https://phabricator.wikimedia.org/T159828) (owner: 10Daniel Kinzler) [16:14:29] PROBLEM - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:29] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:37] (03Merged) 10jenkins-bot: Try using redisLockManager for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345387 (https://phabricator.wikimedia.org/T159828) (owner: 10Daniel Kinzler) [16:14:39] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:51] (03CR) 10jenkins-bot: Try using redisLockManager for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345387 (https://phabricator.wikimedia.org/T159828) (owner: 10Daniel Kinzler) [16:16:30] (03CR) 10MaxSem: "I'd rather switch test hosts to v5 so that everything is controllable from one place." [puppet] - 10https://gerrit.wikimedia.org/r/346775 (https://phabricator.wikimedia.org/T162240) (owner: 10Gehel) [16:17:20] !log hoo@tin Synchronized wmf-config/Wikibase-production.php: Try using redisLockManager for test.wikidata.org (T159828) (duration: 00m 39s) [16:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:27] T159828: Use redis-based lock manager for dispatchChanges on test.wikidata.org - https://phabricator.wikimedia.org/T159828 [16:17:34] checking mw1227 [16:18:22] (03PS1) 10KartikMistry: apertium-spa-cat: New upstream release [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) [16:18:42] !log restart hhvm on mw1227 - debug in /tmp/hhvm.30097.bt. - threads stuck in HPHP::Treadmill::getAgeOldestRequest [16:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:00] elukey: re: moving that meeting time... 
[16:20:09] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:20:19] RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.051 second response time [16:20:19] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.061 second response time [16:20:29] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 74783 bytes in 0.314 second response time [16:20:31] elukey: lemme move channels [16:21:18] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3160759 (10Papaul) @Marostegui db2089 needs to move as well. [16:27:11] (03Abandoned) 10Gehel: maps - introduce variables for tilerator / kartotherian scap based config [puppet] - 10https://gerrit.wikimedia.org/r/346775 (https://phabricator.wikimedia.org/T162240) (owner: 10Gehel) [16:29:12] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3160813 (10elukey) [16:32:55] (03Abandoned) 10Hashar: (DO NOT SUBMIT) prevent scap on jobrunner02 [puppet] - 10https://gerrit.wikimedia.org/r/346705 (https://phabricator.wikimedia.org/T125735) (owner: 10Hashar) [16:33:39] (03CR) 10Hashar: ":-}" [puppet] - 10https://gerrit.wikimedia.org/r/346677 (owner: 10Dzahn) [16:35:16] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3160833 (10Marostegui) [16:35:28] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3154083 (10Marostegui) >>! In T162159#3160759, @Papaul wrote: > @Marostegui db2085 needs to move as well. 
Updated [16:35:54] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346779 (owner: 10Marostegui) [16:37:02] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346779 (owner: 10Marostegui) [16:37:12] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346779 (owner: 10Marostegui) [16:38:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1079 - T160390 (duration: 00m 43s) [16:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:06] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [16:38:10] (03PS1) 10Hoo man: Revert "Temporarily disable the change dispatch cron for testwikidata" [puppet] - 10https://gerrit.wikimedia.org/r/346783 (https://phabricator.wikimedia.org/T94416) [16:38:15] jynus: ^ [16:38:54] didn't work? [16:39:06] (03PS2) 10Hoo man: Remove outdated comment about testwikidata dispatching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346711 [16:39:23] jynus: It did… that's why we can enable the cron again [16:39:24] (03PS1) 10Elukey: Remove remaining configuration for mw20[90-96] [puppet] - 10https://gerrit.wikimedia.org/r/346784 (https://phabricator.wikimedia.org/T161488) [16:39:38] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3160857 (10elukey) [16:39:39] I just had it disabled to be able to switch safely and being able to monitor what's going on [16:39:42] good! 
[16:39:50] (03CR) 10Jcrespo: [C: 032] Revert "Temporarily disable the change dispatch cron for testwikidata" [puppet] - 10https://gerrit.wikimedia.org/r/346783 (https://phabricator.wikimedia.org/T94416) (owner: 10Hoo man) [16:40:09] (03CR) 10Hoo man: [C: 032] "Comment only change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346711 (owner: 10Hoo man) [16:41:15] (03Merged) 10jenkins-bot: Remove outdated comment about testwikidata dispatching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346711 (owner: 10Hoo man) [16:41:24] (03CR) 10jenkins-bot: Remove outdated comment about testwikidata dispatching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346711 (owner: 10Hoo man) [16:42:12] hoo: Notice: /Stage[main]/Mediawiki::Maintenance::Wikidata/Cron[wikibase-dispatch-changes-test]/ensure: created [16:42:29] (03PS1) 10Giuseppe Lavagetto: Fix call to disable-puppet [switchdc] - 10https://gerrit.wikimedia.org/r/346785 [16:42:31] (03PS1) 10Giuseppe Lavagetto: Set the logging level for irc_logger [switchdc] - 10https://gerrit.wikimedia.org/r/346786 [16:42:33] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3160873 (10Papaul) @Marostegui Just for your information asw-a2-codfw asw-a7-codfw asw-b2-codfw asw-b7-codfw asw-c2-codfw asw-c7-codfw asw-d2-codfw asw-d7-codfw are 1... 
[16:42:48] (03CR) 10jerkins-bot: [V: 04-1] Set the logging level for irc_logger [switchdc] - 10https://gerrit.wikimedia.org/r/346786 (owner: 10Giuseppe Lavagetto) [16:44:09] (03CR) 10Volans: [C: 032] Fix call to disable-puppet [switchdc] - 10https://gerrit.wikimedia.org/r/346785 (owner: 10Giuseppe Lavagetto) [16:45:29] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3160874 (10elukey) [16:45:56] (03PS2) 10Volans: Set the logging level for irc_logger [switchdc] - 10https://gerrit.wikimedia.org/r/346786 (owner: 10Giuseppe Lavagetto) [16:46:33] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3160875 (10RobH) So these aren't labeled on the switch. @papaul: Can you advise on the switch ports for mw2090-2096? [16:46:37] (03CR) 10Volans: [C: 032] Set the logging level for irc_logger [switchdc] - 10https://gerrit.wikimedia.org/r/346786 (owner: 10Giuseppe Lavagetto) [16:46:57] (03PS1) 10Hoo man: Revert "Temporarily enable change dispatch logging on testwikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346788 [16:47:44] (03CR) 10Hoo man: "We probably want to deploy this once the dust has settled. Sometimes after Monday April 10." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/346788 (owner: 10Hoo man) [16:48:19] (03CR) 10Elukey: [C: 032] Remove remaining configuration for mw20[90-96] [puppet] - 10https://gerrit.wikimedia.org/r/346784 (https://phabricator.wikimedia.org/T161488) (owner: 10Elukey) [16:48:21] (03PS2) 10Muehlenhoff: [Beta] RESTBase: Set the Cassandra seeds correctly [puppet] - 10https://gerrit.wikimedia.org/r/346639 (owner: 10Mobrovac) [16:48:24] (03PS2) 10Elukey: Remove remaining configuration for mw20[90-96] [puppet] - 10https://gerrit.wikimedia.org/r/346784 (https://phabricator.wikimedia.org/T161488) [16:48:27] (03CR) 10Elukey: [V: 032 C: 032] Remove remaining configuration for mw20[90-96] [puppet] - 10https://gerrit.wikimedia.org/r/346784 (https://phabricator.wikimedia.org/T161488) (owner: 10Elukey) [16:49:47] (03CR) 10jerkins-bot: [V: 04-1] Set the logging level for irc_logger [switchdc] - 10https://gerrit.wikimedia.org/r/346786 (owner: 10Giuseppe Lavagetto) [16:50:12] (03CR) 10Muehlenhoff: [C: 032] [Beta] RESTBase: Set the Cassandra seeds correctly [puppet] - 10https://gerrit.wikimedia.org/r/346639 (owner: 10Mobrovac) [16:50:16] (03PS3) 10Muehlenhoff: [Beta] RESTBase: Set the Cassandra seeds correctly [puppet] - 10https://gerrit.wikimedia.org/r/346639 (owner: 10Mobrovac) [16:51:51] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3160894 (10dr0ptp4kt) @Dzahn, no luck for me. So when you're at "https://www.google.com/webmasters/tools/home?hl=en" does it say "No new messages or recent criti... 
[16:52:34] (03CR) 10Volans: [C: 032] "recheck" [switchdc] - 10https://gerrit.wikimedia.org/r/346786 (owner: 10Giuseppe Lavagetto) [16:54:23] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3160902 (10elukey) [16:57:32] (03PS1) 10Andrew Bogott: slapd conf: Allow for unlimited paged searches [puppet] - 10https://gerrit.wikimedia.org/r/346790 [16:57:51] (03CR) 10Andrew Bogott: "I am not entirely confident about this syntax" [puppet] - 10https://gerrit.wikimedia.org/r/346790 (owner: 10Andrew Bogott) [16:58:41] (03PS2) 10Andrew Bogott: slapd conf: Allow for unlimited paged searches [puppet] - 10https://gerrit.wikimedia.org/r/346790 [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T1700). [17:01:56] (03PS1) 10Elukey: Remove DNS entries for mw2090->mw2096 [dns] - 10https://gerrit.wikimedia.org/r/346791 (https://phabricator.wikimedia.org/T161488) [17:05:51] (03PS2) 10Legoktm: Deploy Linter to medium wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346591 (https://phabricator.wikimedia.org/T148609) [17:06:00] (03CR) 10Legoktm: [C: 032] Deploy Linter to medium wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346591 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm) [17:06:10] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3160948 (10elukey) [17:06:22] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3133053 (10elukey) a:03Papaul [17:09:11] (03Merged) 10jenkins-bot: Deploy Linter to medium wikis too [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/346591 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm) [17:09:21] (03CR) 10jenkins-bot: Deploy Linter to medium wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346591 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm) [17:09:57] I'm going to deploy ORES [17:10:00] anyone else? [17:10:07] arlolra is deploying parsoid [17:10:14] kk [17:10:53] (03CR) 10Legoktm: "Even if it's comment-only, please pull it onto tin and sync it out so the deploy repo is up to date with git." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346711 (owner: 10Hoo man) [17:11:11] last commit for ORES is 554ea12 [17:11:22] !log arlolra@tin Started deploy [parsoid/deploy@b5c2a2b]: Updating Parsoid to 56ae82bb [17:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:46] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Deploy Linter to medium wikis too - T148609 (duration: 00m 40s) [17:13:52] arlolra, should I wait to start ORES deploy? [17:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:53] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [17:14:37] o/ [17:14:41] hmm... I'm guessing no. [17:14:44] o/ Amir1 [17:14:46] uhh, probably best not to both be deploying at the same time? [17:14:53] OK I'll hang out [17:14:55] i should be done in 5 mins [17:14:56] :) [17:15:18] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3160972 (10elukey) Hosts shutdown, nothing more left on the puppet repo, hosts ready to be decommed. Really sorry if I started the non interruptibl... [17:16:59] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:19:51] !log arlolra@tin Finished deploy [parsoid/deploy@b5c2a2b]: Updating Parsoid to 56ae82bb (duration: 08m 29s) [17:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:13] halfak: all good in the neighbourhood [17:20:22] OK! Starting ORES [17:20:26] !log halfak@tin Started deploy [ores/deploy@3396b64]: T161748 [17:20:31] Thanks arlolra [17:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:32] T161748: Deploy ORES early April - https://phabricator.wikimedia.org/T161748 [17:20:37] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161007 (10Dzahn) @dr0ptp4kt So you are saying even though i gave you full access to the domain(s) you can't read the associated messages? That seems strange, th... [17:22:21] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161015 (10Dzahn) >>! In T161343#3160894, @dr0ptp4kt wrote: > @Dzahn, no luck for me. What went wrong? Did it say "permission denied" or something? What was th... [17:24:23] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161019 (10Dzahn) @dr0ptp4kt Is this the one you want to approve? " mediawiki-0794@pages.plusgoogle.com would like to associate his YouTube channel MediaWiki to... [17:25:10] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:25:21] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161020 (10dr0ptp4kt) @dzahn, I can review pageview trends, but the console isn't showing me any approval messages to authorize the "MediaWiki" YouTube channel t... [17:25:40] All looks good with Canary [17:25:42] Moving on [17:26:00] (03PS1) 10Dzahn: Revert "change MX records for wikimedia.ee from elkdata.ee to Google" [dns] - 10https://gerrit.wikimedia.org/r/346795 [17:26:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "change MX records for wikimedia.ee from elkdata.ee to Google" [dns] - 10https://gerrit.wikimedia.org/r/346795 (owner: 10Dzahn) [17:26:11] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161022 (10dr0ptp4kt) Race condition! One moment. [17:29:25] (03PS2) 10Dzahn: Revert "change MX records for wikimedia.ee from elkdata.ee to Google" [dns] - 10https://gerrit.wikimedia.org/r/346795 (https://phabricator.wikimedia.org/T158638) [17:34:05] FYI I'm testing the tcpircbot calls from switchdc, so it might endup logging here and in SAL [17:35:01] ORES is at promote and restart [17:35:03] Should be done soon [17:41:34] !log halfak@tin Finished deploy [ores/deploy@3396b64]: T161748 (duration: 21m 08s) [17:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:41] T161748: Deploy ORES early April - https://phabricator.wikimedia.org/T161748 [17:42:14] \o/ [17:42:32] All looks good [17:43:05] (03PS1) 10Catrope: Adjust plwiki, ptwiki ORES thresholds for new model deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346796 [17:43:59] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:45:54] !log maxsem@tin Started deploy 
[tilerator/deploy@71aed11]: https://gerrit.wikimedia.org/r/#/c/346782/ to test hosts [17:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:13] !log maxsem@tin Finished deploy [tilerator/deploy@71aed11]: https://gerrit.wikimedia.org/r/#/c/346782/ to test hosts (duration: 00m 19s) [17:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:34] by any chance are the services deployments completed? [17:54:09] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:54:46] (03PS12) 10EBernhardson: Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [17:55:22] this was me restarting it ^ [17:55:33] (03PS13) 10EBernhardson: Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [17:57:55] !log switchdc (volans@neodymium) Test switchdc IRC/SAL announcement [17:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:53] sorry, need another restart, should be the last one [18:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T1800). Please do the needful. [18:00:05] RoanKattouw: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[18:00:58] !log switchdc (volans@neodymium) Test switchdc IRC/SAL announcement (2) [18:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:04] ok, testing done [18:04:05] I can SWAT, RoanKattouw ping me if you don't feel like deploying your own patch :) [18:04:33] thcipriani: That would be appreciated, thanks [18:04:39] I'd do it myeslf but I'm in the middle of something [18:04:44] no problem [18:05:19] PROBLEM - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:05:33] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346796 (owner: 10Catrope) [18:07:05] (03Merged) 10jenkins-bot: Adjust plwiki, ptwiki ORES thresholds for new model deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346796 (owner: 10Catrope) [18:07:48] RoanKattouw: live on mwdebug1002 if there's anything to check there. [18:07:59] thcipriani: It's a no-op until the train [18:08:14] cool, going live [18:10:24] (03PS1) 10Legoktm: Set $wgLinterStatsdSampleFactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346801 [18:10:24] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:346796|Adjust plwiki, ptwiki ORES thresholds for new model deployment]] (duration: 00m 40s) [18:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:37] ^ RoanKattouw sync'd everywhere [18:11:03] thcipriani: is it too late to add a config change to swat? [18:11:19] legoktm: nope, go for it [18:11:35] could you sync out https://gerrit.wikimedia.org/r/346801 please? 
:) [18:11:39] * legoktm adds to the wiki [18:11:53] yup [18:12:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346801 (owner: 10Legoktm) [18:13:24] (03Merged) 10jenkins-bot: Set $wgLinterStatsdSampleFactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346801 (owner: 10Legoktm) [18:14:32] legoktm: live on mwdebug1002 if there's anything you want to check there [18:15:09] thcipriani: I can't really, I need to wait for people to make enough edits to trigger sending data to statsd [18:15:35] fair enough, will sync everywhere. [18:17:18] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:346801|Set $wgLinterStatsdSampleFactor]] (duration: 00m 45s) [18:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:26] ^ legoktm everywhere now [18:17:34] thanks :D [18:17:44] I'll watch statsd/graphite carefully [18:18:26] okie doke, sounds dandy :) [18:19:07] (03CR) 10jenkins-bot: Adjust plwiki, ptwiki ORES thresholds for new model deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346796 (owner: 10Catrope) [18:19:09] (03CR) 10jenkins-bot: Set $wgLinterStatsdSampleFactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346801 (owner: 10Legoktm) [18:22:23] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161317 (10dr0ptp4kt) @Dzahn it's unclear if that's the same request; in fact in the YouTube administrator interface the request had been submitted as https://me... [18:24:09] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:29:42] (03PS1) 10Chad: group2 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346805 [18:30:30] (03CR) 10Chad: [C: 04-2] "4 l8r duh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346805 (owner: 10Chad) [18:34:19] RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:36:34] 06Operations, 10Traffic, 05MW-1.28-release (WMF-deploy-2016-08-09_(1.28.0-wmf.14)), 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2216137 (10Volker_E) [[ https://github.com/wikimedia/wikimediablog-wordpresscom/commit/292c01ccb221dbadfb91786675d4d3cb5a2f3f... [18:42:06] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3161381 (10Papaul) Will have a replacement board tomorrow between 10:00am and 1:30 PM Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This emai... [18:48:46] !log legoktm@tin Synchronized php-1.29.0-wmf.19/extensions/Linter/includes/RecordLintJob.php: Split statsd metrics by wiki - https://gerrit.wikimedia.org/r/#/c/346807 (duration: 00m 42s) [18:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:23] (03CR) 10Dzahn: [C: 032] "wikimedia.ee has changed their mind about this and wants it back to like it was before." 
[dns] - 10https://gerrit.wikimedia.org/r/346795 (https://phabricator.wikimedia.org/T158638) (owner: 10Dzahn) [18:51:40] (03PS3) 10Dzahn: Revert "change MX records for wikimedia.ee from elkdata.ee to Google" [dns] - 10https://gerrit.wikimedia.org/r/346795 (https://phabricator.wikimedia.org/T158638) [18:52:09] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:57:55] (03PS1) 10Ottomata: Add python3 packages to hadoop workers for ORES in hadoop [puppet] - 10https://gerrit.wikimedia.org/r/346812 [19:00:04] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T1900). Please do the needful. [19:00:59] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 2710.67099 Seconds [19:00:59] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 2710.697211 Seconds [19:00:59] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:01:39] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 2754.503459 Seconds [19:05:42] (03PS1) 10BBlack: cp1008 separate app_directors [puppet] - 10https://gerrit.wikimedia.org/r/346813 [19:05:44] (03PS1) 10BBlack: exclude cp1008 from pass_random conditional [puppet] - 10https://gerrit.wikimedia.org/r/346814 [19:06:14] (03CR) 10Chad: [C: 032] "leggo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346805 (owner: 10Chad) [19:07:39] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [19:07:46] (03Merged) 10jenkins-bot: group2 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346805 (owner: 10Chad) [19:07:56] (03CR) 10jenkins-bot: group2 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346805 (owner: 10Chad) [19:08:55] (03PS9) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [19:09:45] (03CR) 10BBlack: [C: 032] cp1008 separate app_directors [puppet] - 10https://gerrit.wikimedia.org/r/346813 (owner: 10BBlack) [19:09:48] (03CR) 10BBlack: [C: 032] exclude cp1008 from pass_random conditional [puppet] - 10https://gerrit.wikimedia.org/r/346814 (owner: 10BBlack) [19:10:07] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.19 [19:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:37] 06Operations, 10RESTBase, 13Patch-For-Review, 10Scap (Scap3-Adoption-Phase1), and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3161485 (10mobrovac) [19:10:39] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 3294.486397 Seconds [19:11:19] 06Operations, 10netops: cr2-knams<->asw-esams GBLX fiber down - 
https://phabricator.wikimedia.org/T158647#3161486 (10ayounsi) a:03ayounsi [19:14:08] 06Operations, 10RESTBase, 13Patch-For-Review, 10Scap (Scap3-Adoption-Phase1), and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3161487 (10akosiaris) Yes, fine by me. Monday after/before the Ops meeting ? [19:14:55] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3161488 (10Dzahn) @Beetlebeard Ok, i have reverted the change. It's back to elkdata as it was before this ticket. Changes should appear within... [19:14:57] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3161489 (10Dzahn) @Beetlebeard Ok, i have reverted the change. It's back to elkdata as it was before this ticket. Changes should appear within... [19:14:59] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [19:15:00] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [19:17:13] 06Operations, 10RESTBase, 13Patch-For-Review, 10Scap (Scap3-Adoption-Phase1), and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3161490 (10thcipriani) >>! In T116335#3161487, @akosiaris wrote: > Yes, fine by me. Monday after/before the Ops meeting ? I could be around 3pm... [19:18:37] 06Operations, 10RESTBase, 13Patch-For-Review, 10Scap (Scap3-Adoption-Phase1), and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3161503 (10mobrovac) Great! Let's settle for 15h UTC on Monday then. 
[19:19:39] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [19:20:59] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 3910.89494 Seconds [19:21:59] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 58.036037 Seconds [19:28:00] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:31:08] (03PS1) 10Catrope: Fix miscomputed plwiki ORES thresholds, and fix one miscomputed threshold for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346818 [19:32:14] RainbowSprinkles: Could I deploy ---^^ nowish/soonish? I SWATted a config patch in the 11am SWAT to make sure that ptwiki and plwiki would behave properly once the train hit them, but it turns out the numbers in that config were all wrong [19:32:25] yes [19:32:31] Thanks, on it [19:32:42] (03CR) 10Catrope: [C: 032] Fix miscomputed plwiki ORES thresholds, and fix one miscomputed threshold for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346818 (owner: 10Catrope) [19:33:51] (03Merged) 10jenkins-bot: Fix miscomputed plwiki ORES thresholds, and fix one miscomputed threshold for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346818 (owner: 10Catrope) [19:34:04] (03CR) 10jenkins-bot: Fix miscomputed plwiki ORES thresholds, and fix one miscomputed threshold for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346818 (owner: 10Catrope) [19:38:07] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3161574 (10Papaul) a:05Papaul>03RobH mw2090 ge-3/0/10 mw2091 ge-3/0/11 mw2092 ge-3/0/12 mw2093 ge-3/0/13 mw2094 ge-3/0/14 mw2095 ge-3/0/15 mw20... [19:38:59] papaul: Thanks! 
[19:40:21] (03PS1) 10BBlack: cp1008: test recdns rather than authdns for now [puppet] - 10https://gerrit.wikimedia.org/r/346821 [19:42:06] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Fix ORES threshold settings again (duration: 00m 40s) [19:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:17] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3161583 (10RobH) switch ports disabled [19:45:07] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161585 (10Dzahn) >>! In T161343#3161317, @dr0ptp4kt wrote: > @Dzahn it's unclear if that's the same request; in fact in the YouTube administrator interface the... [19:46:26] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3161591 (10RobH) [19:48:16] (03PS1) 10RobH: decom mw2090-mw2096 [dns] - 10https://gerrit.wikimedia.org/r/346823 [19:48:59] (03CR) 10RobH: [C: 032] decom mw2090-mw2096 [dns] - 10https://gerrit.wikimedia.org/r/346823 (owner: 10RobH) [19:50:13] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3161594 (10RobH) [19:50:29] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:51:05] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3133053 (10RobH) [19:52:25] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3161608 (10RobH) a:05RobH>03Papaul Assigned to @papaul to have the disks wiped, once that is done update for decom. The ports were not actuall... [19:52:29] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:52:34] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3161611 (10RobH) [19:53:59] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1977.985122 Seconds [19:54:59] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [20:05:24] (03PS1) 10Jgreen: swap civi1001 vs barium IPs [dns] - 10https://gerrit.wikimedia.org/r/346828 [20:06:59] (03CR) 10Jgreen: [C: 032] swap civi1001 vs barium IPs [dns] - 10https://gerrit.wikimedia.org/r/346828 (owner: 10Jgreen) [20:07:06] (03PS2) 10Jgreen: swap civi1001 vs barium IPs [dns] - 10https://gerrit.wikimedia.org/r/346828 [20:07:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:09:45] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 2923.34269 Seconds [20:10:45] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [20:14:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 276 (alerts on 19) - 
https://atlas.ripe.net/measurements/1791212/#!map [20:18:26] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:19:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:27:45] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:05] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [20:28:06] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [20:28:15] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [20:28:16] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [20:28:16] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [20:28:25] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [20:28:35] 
PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 4055.609387 Seconds [20:29:35] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [20:34:43] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161691 (10dr0ptp4kt) >> One other approach we might try is updating the YouTube account so that noc@ can become a manager of it > > I would try to avoid that.... [20:36:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:37:55] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:40:35] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 4775.562264 Seconds [20:41:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:41:35] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [20:41:39] 06Operations, 06Labs: create a 'root' group strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161730 (10chasemp) [20:41:45] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 4845.25946 Seconds [20:43:45] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [20:44:49] (03PS1) 10Rush: admin: add a group for cloud services roots [puppet] - 10https://gerrit.wikimedia.org/r/346838 [20:46:36] (03PS2) 10Rush: admin: add a group for cloud services roots [puppet] - 10https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) [20:46:45] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: 
Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:47:14] 06Operations, 06Labs: create a 'root' group strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3161763 (10chasemp) https://gerrit.wikimedia.org/r/#/c/346838/ [20:48:35] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 5255.709436 Seconds [20:49:35] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [20:54:35] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 5615.556821 Seconds [20:55:35] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [20:55:45] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:57:19] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3102305 (10grin) Am I right to guess that we don't do (strict or else) SPF checking while we definitely should? Exim can handle SPF just fine alone, as well as spamassassin. It's also a bit weir... [21:00:27] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3161835 (10grin) >>! In T160529#3135437, @KTC wrote: > I'll also accept suggestion for what I can do on my end. Dropping/autorejecting email with matching header `​X-Spam-Score: .+\+\+\+\+\+` (... 
[21:04:55] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [21:06:45] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 78.57% of data above the critical threshold [140.0] [21:07:04] (03PS1) 10Rush: shinken (labs): add chasemp as contact tools/inf groups [puppet] - 10https://gerrit.wikimedia.org/r/346874 [21:07:08] ^^ https://integration.wikimedia.org/zuul/ [21:07:12] Oh zuul is really back up [21:07:17] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:08:59] paladox: yes, security release today [21:09:12] Yep, was thinking it was those changes [21:09:19] saw many security changes today :) [21:09:31] Yep, pushed them en masse. I attached CodeReview+2 at push time, so hopefully gets them through a little bit faster [21:09:37] (straight to gate-and-submit that way) [21:09:44] (03CR) 10Rush: [V: 032 C: 032] shinken (labs): add chasemp as contact tools/inf groups [puppet] - 10https://gerrit.wikimedia.org/r/346874 (owner: 10Rush) [21:10:22] RainbowSprinkles it sent it straight to gate and submit but also it is testing it in the test pipeline too [21:10:23] :) [21:10:36] Well, that's zuuls fault for not de-duping them [21:10:40] I hate zuul :) [21:10:46] I did my part [21:11:29] RainbowSprinkles how can you hate zuul :) [21:11:40] Because it's a flaming pile of horse shit [21:11:42] That's how [21:11:50] lol [21:12:43] RainbowSprinkles: 10/10 on accuracy of zuul's description [21:14:45] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [21:15:41] (03PS1) 10Rush: shinken (labs): add madhuvishy as contact tools/inf groups [puppet] - 10https://gerrit.wikimedia.org/r/346879 [21:16:54] (03CR) 10Madhuvishy: [C: 031] shinken (labs): add madhuvishy as 
contact tools/inf groups [puppet] - 10https://gerrit.wikimedia.org/r/346879 (owner: 10Rush) [21:17:35] (03CR) 10Rush: [V: 032 C: 032] shinken (labs): add madhuvishy as contact tools/inf groups [puppet] - 10https://gerrit.wikimedia.org/r/346879 (owner: 10Rush) [21:18:24] (03PS9) 10Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [21:18:25] (03PS1) 10Andrew Bogott: Keystonehooks: Add two more ldap ous for sudo handling. [puppet] - 10https://gerrit.wikimedia.org/r/346880 [21:20:56] (03PS2) 10Andrew Bogott: Keystonehooks: Add two more ldap ous for sudo handling. [puppet] - 10https://gerrit.wikimedia.org/r/346880 [21:20:57] (03PS10) 10Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [21:21:22] (03PS1) 10Rush: shinken (labs): add bdavis as contact tools/inf groups [puppet] - 10https://gerrit.wikimedia.org/r/346881 [21:23:45] (03CR) 10Rush: [V: 032 C: 032] shinken (labs): add bdavis as contact tools/inf groups [puppet] - 10https://gerrit.wikimedia.org/r/346881 (owner: 10Rush) [21:26:15] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:26:20] ebernhardson: Hm.. any idea why 'foo\s*' would match 'foo', 'foo ' *and* "foos" ? [21:26:25] (mwgrep) [21:26:32] It seems as if something is making it match both \s and literal s [21:26:52] Or maybe the former two matches are just a coincidence and I'm not escaping it right? [21:27:04] (03PS1) 10Rush: shinken (labs): remove coren as contact tools/inf groups [puppet] - 10https://gerrit.wikimedia.org/r/346882 [21:27:07] did you try doing that in brackets? 
[21:27:39] using \\ makes it yield 0 matches, so that can't be right either [21:28:00] (03CR) 10Rush: [V: 032 C: 032] shinken (labs): remove coren as contact tools/inf groups [puppet] - 10https://gerrit.wikimedia.org/r/346882 (owner: 10Rush) [21:28:06] Zppix: [\s]* ? [21:28:07] Interesting [21:28:11] yes [21:28:51] https://it.wikipedia.org/?search=insource:/addPortletLink\s*\(/&ns8=1 [21:29:05] https://it.wikipedia.org/?search=insource%3A%2FaddPortletLink%5B%5Cs%5D%2A%5C%28%2F&ns8=1&searchToken=964s7pmbr579zk5pxuq56zmae [21:29:10] still getting addPortletLinks() [21:29:47] https://it.wikipedia.org/w/index.php?search=insource%3A%2F%5B%5E.%5DaddPortletLink%5B%5Cs%5D%2A%5C%28%2F&title=Speciale:Ricerca&profile=advanced&fulltext=1&ns8=1 [21:29:52] This is the actual one I am using [21:30:02] which is supported to match addPortletLink() but not mw.util.addPortletLink [21:30:18] Which is working but it is also matching addPortletLinks() [21:30:29] Krinkle: honestly i dont know how the search handles regex requests well enough to give any other advice [21:31:25] (03PS1) 10Rush: shinken (labs): add andrew Shinken Administrators [puppet] - 10https://gerrit.wikimedia.org/r/346883 [21:31:31] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3161892 (10Dzahn) following https://www.google.com/webmasters/verification/verification?siteUrl=https%3A%2F%2Fmediawiki.org%2F as noc@ gets me "You are already... 
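Note on the `\s` puzzle above: one possible explanation, which nobody in the channel confirms, is that the search backend's regex dialect has no `\s` shorthand and reads `\s` as an escaped literal "s". That would make `addPortletLink\s*\(` behave like `addPortletLinks*\(`, which matches `addPortletLinks(`. The sketch below uses Python's `re` module only to mimic that reading; the pattern names and sample strings are made up for illustration, and `re.search` stands in for an unanchored source search.

```python
import re

# Assumption (not confirmed in the log): the backend treats "\s" as a
# literal "s" because its regex dialect lacks the \s shorthand class.
INTENDED = r"addPortletLink\s*\("       # what was meant: optional whitespace
AS_LITERAL_S = r"addPortletLinks*\("    # how it would behave if \s == "s"
WORKAROUND = r"addPortletLink[ ]*\("    # literal-space class suggested in channel

SAMPLES = [
    "addPortletLink(",
    "addPortletLink (",
    "addPortletLinks(",
]


def matches(pattern, samples=SAMPLES):
    """Return the samples in which the pattern is found anywhere."""
    return [s for s in samples if re.search(pattern, s)]


if __name__ == "__main__":
    for name, pattern in [
        ("intended", INTENDED),
        ("literal-s reading", AS_LITERAL_S),
        ("workaround", WORKAROUND),
    ]:
        print(name, matches(pattern))
```

Under this assumption the literal-s reading picks up `addPortletLinks(` and misses `addPortletLink (`, while the `[ ]*` form matches the two intended strings and nothing else. The bare `foo\s*` case from the original question would also match "foo", "foo " and "foos", since each contains "foo" followed by zero or more "s".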
[21:32:09] (03CR) 10Rush: [V: 032 C: 032] shinken (labs): add andrew Shinken Administrators [puppet] - 10https://gerrit.wikimedia.org/r/346883 (owner: 10Rush) [21:32:46] Maybe I'll use [ ]* for now (literal space) [21:32:51] other whitepsace is rare anyway [21:33:01] Krinkle: sorry i couldnt be more of help [21:34:39] (03PS1) 10Rush: shinken (labs): remove yuvi as it is a wikimedia address [puppet] - 10https://gerrit.wikimedia.org/r/346884 [21:35:24] (03CR) 10Rush: [V: 032 C: 032] shinken (labs): remove yuvi as it is a wikimedia address [puppet] - 10https://gerrit.wikimedia.org/r/346884 (owner: 10Rush) [21:36:15] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:37:18] (03PS3) 10Rush: admin: add a group for cloud services roots [puppet] - 10https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) [21:37:35] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 8195.590002 Seconds [21:38:17] (03PS5) 10Rush: Add a default Apache 2.0 license [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) [21:38:35] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [21:55:15] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [21:56:23] (03PS3) 10Dzahn: Gerrit: Add log4j.logger.org.apache.sshd.common.keyprovider.FileKeyPairProvider=INFO to log4j [puppet] - 10https://gerrit.wikimedia.org/r/345583 (owner: 10Paladox) [22:05:38] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, and 2 others: Reduce number of false positive alerts on postgresql lag for maps - https://phabricator.wikimedia.org/T162345#3159843 (10Volans) I think it might happen when a VACUUM is running on the master, at least today that we have a lot of delay... 
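Note on the `foo\s*` question above (a sketch, not something stated in the log): one reading consistent with what Krinkle saw is that the regex engine behind `insource:` has no `\s` whitespace shorthand and treats `\s` as an escaped literal `s`, both bare and inside `[...]`. That is an assumption; nobody in the channel confirms it. The snippet below does not call the wiki search. It only uses Python's `re` module to compare the intended pattern with the pattern as such an engine would hypothetically read it, plus the `[ ]*` workaround mentioned in the chat. The sample strings and the `matches` helper are made up for illustration.

```python
import re

# Illustrative inputs modelled on the strings discussed in the chat.
samples = ["addPortletLink(", "addPortletLink (", "addPortletLinks("]

# Intended meaning: \s is the whitespace shorthand (Python's re supports it).
intended = re.compile(r"addPortletLink\s*\(")

# Hypothetical reading by an engine without a \s shorthand:
# "\s" collapses to a literal "s", so "\s*" means "zero or more s".
literal_s = re.compile(r"addPortletLinks*\(")

# Workaround from the chat: a literal space in a character class.
workaround = re.compile(r"addPortletLink[ ]*\(")


def matches(pattern, texts):
    """Return the texts that contain a match for the pattern (unanchored)."""
    return [t for t in texts if pattern.search(t)]


print("intended  :", matches(intended, samples))
print("literal s :", matches(literal_s, samples))
print("workaround:", matches(workaround, samples))
```

Under the literal-`s` reading, `addPortletLinks(` matches and `addPortletLink (` does not, which lines up with "still getting addPortletLinks()". The `[ ]*` form avoids the escape entirely, at the cost of missing tabs and newlines.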
[22:07:13] (03CR) 10Volans: "I know it's already merged, but I came across it. I think the missing --raw was already fixed in a subsequent CR but I don't see it on the" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [22:07:55] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 10011.169165 Seconds [22:08:55] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [22:09:07] volans: what time zone are you in? Or can't sleep? [22:09:32] gehel: lol :D [22:09:42] I was in the meeting [22:10:00] and no usually I'm not here at this time, but rarely in bed at this time :-P [22:10:49] (03CR) 10Dzahn: [C: 032] Gerrit: Add log4j.logger.org.apache.sshd.common.keyprovider.FileKeyPairProvider=INFO to log4j [puppet] - 10https://gerrit.wikimedia.org/r/345583 (owner: 10Paladox) [22:10:55] thanks [22:11:14] (03CR) 10Dzahn: "per "Upstream has already approved/merged Paladox change in their repo."" [puppet] - 10https://gerrit.wikimedia.org/r/345583 (owner: 10Paladox) [22:11:36] :) [22:12:18] (03PS4) 10Dzahn: Standardize on lowercase os_version/require_os [puppet] - 10https://gerrit.wikimedia.org/r/345561 (owner: 10Faidon Liambotis) [22:12:23] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/6046/" [puppet] - 10https://gerrit.wikimedia.org/r/345561 (owner: 10Faidon Liambotis) [22:12:33] (03CR) 10Dzahn: [C: 032] Standardize on lowercase os_version/require_os [puppet] - 10https://gerrit.wikimedia.org/r/345561 (owner: 10Faidon Liambotis) [22:22:48] !log maxsem@tin Started deploy [tilerator/deploy@9cf2338]: https://gerrit.wikimedia.org/r/#/c/346913/ to test hosts only [22:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:07] !log maxsem@tin Finished deploy [tilerator/deploy@9cf2338]: https://gerrit.wikimedia.org/r/#/c/346913/ to test hosts only (duration: 00m 18s) [22:23:12] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:05] (03PS3) 10Dzahn: lxc: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345558 (owner: 10Faidon Liambotis) [22:34:58] (03CR) 10Dzahn: [C: 032] lxc: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345558 (owner: 10Faidon Liambotis) [22:38:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:43:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:53:15] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170406T2300). [23:00:04] bmansurov: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:11] here [23:04:33] Hi bmansurov, I can SWAT this evening. [23:04:42] Dereckson, hey! 
[23:04:43] Dereckson: You can do that one, but let's cap swat at that one [23:04:56] Still trying to push security patches through CI, don't wanna clog it up with a ton of other stuff [23:05:07] RainbowSprinkles: ok [23:09:46] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] [23:09:54] known ^ [23:10:09] (tbh, i'm surprised it took til now to yell about it) [23:10:43] RainbowSprinkles it did [23:11:12] 06Operations, 10Monitoring: certspotter on einsteinium has issues talking to external - https://phabricator.wikimedia.org/T162327#3162151 (10Dzahn) I can't reproduce it when manually running the command. Looks like intermittent on the remote side: root@einsteinium:/etc# sudo -u certspotter /usr/bin/certspotte... [23:11:49] Oh [23:11:50] Oh well [23:11:52] it's fine [23:11:58] It'll self-resolve in a little while [23:12:08] bmansurov: live on mwdebug1002 [23:12:25] Dereckson, ok checking [23:12:36] [22:07:15] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:12:45] oh woops [23:13:08] [22:06:45] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 78.57% of data above the critical threshold [140.0] [23:13:09] [ [23:13:34] Dereckson, PapgePreviews aren't broken, but since the thing we're fixing is sampled and needs time to check, I'd say everything is OK. [23:13:56] ack'ed [23:14:34] !log dereckson@tin Synchronized php-1.29.0-wmf.19/extensions/Popups: actions: Correctly delay FETCH_COMPLETE ([[Gerrit:346832]]) (duration: 00m 41s) [23:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:16] (03PS3) 10Dzahn: deployment::server: convert to profile/role (pt. 
1) [puppet] - 10https://gerrit.wikimedia.org/r/344728 [23:22:15] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [23:27:11] Dereckson, all set? [23:28:35] bmansurov: yup [23:28:41] Dereckson, thank you! [23:29:33] doctor writes academic paper, publishes it with a co-author. Wikipedia uses paper as reference, co-author actually reviews Wikipedia page, suggests improvements. doctor mails WMF "i never gave you permission to use my name, remove in 7 days or law suit" .. what's wrong with people [23:38:43] Do we need permission? [23:38:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:38:57] Reedy: i dont think so :) [23:39:17] Reedy: it's his last name and a paper he _published_ [23:39:45] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] [23:42:42] Reedy: lol, i actually convinced him to take it back :p [23:43:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:48:05] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues