[01:04:24] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1508720658 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8830779 keys, up 4 minutes 16 seconds - replication_delay is 1508720658
[01:04:34] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508720669 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4123878 keys, up 4 minutes 26 seconds - replication_delay is 1508720669
[01:04:55] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508720693 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4119145 keys, up 4 minutes 50 seconds - replication_delay is 1508720693
[01:04:55] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508720693 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4120285 keys, up 4 minutes 50 seconds - replication_delay is 1508720693
[01:05:15] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8821428 keys, up 5 minutes 9 seconds - replication_delay is 0
[01:05:55] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4114641 keys, up 5 minutes 50 seconds - replication_delay is 0
[01:06:05] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4116506 keys, up 5 minutes 57 seconds - replication_delay is 0
[01:06:35] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4118426 keys, up 6 minutes 29 seconds - replication_delay is 0
[02:34:09] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.4) (duration: 08m 45s)
[02:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Oct 23 02:41:04 UTC 2017 (duration 6m 55s)
[02:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:55] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 804.94 seconds
[04:06:14] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 244.86 seconds
[04:22:33] (PS1) Andrew Bogott: increase nfs mount timeouts [puppet] - https://gerrit.wikimedia.org/r/385939
[05:18:24] PROBLEM - Apache HTTP on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:19:15] RECOVERY - Apache HTTP on mw2133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time
[05:22:44] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[05:22:45] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[05:23:05] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[05:23:35] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[05:23:54] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.117 second response time
[05:24:34] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[05:24:34] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[05:25:44] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[05:26:45] <_joe_> I guess that's the usual oom?
[05:32:54] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[05:33:14] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[05:33:34] RECOVERY - Disk space on stat1005 is OK: DISK OK
[05:33:35] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[05:33:45] RECOVERY - DPKG on stat1005 is OK: All packages OK
[05:35:44] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:50:05] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.230 second response time
[06:04:24] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[06:04:45] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[06:04:54] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[06:05:04] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[06:05:05] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[06:05:14] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[06:06:58] Operations, ops-eqiad, hardware-requests, Patch-For-Review, Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3702529 (Gilles) @Dzahn perfect, thank you!
[06:07:44] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[06:09:05] !log Stop replication on db2092 to re-import and compress linter and watchlist table
[06:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:22] Operations, ops-eqiad, hardware-requests, Patch-For-Review, Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3702565 (Gilles) @Dzahn is networking restricted? I can't seem to be able download things from the outside world. I need to install...
[06:30:05] !log Stop replication in sync on db1103 and db2018 to fix data drifts - T164488
[06:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:12] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[06:32:56] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[06:33:14] RECOVERY - DPKG on stat1005 is OK: All packages OK
[06:33:15] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[06:33:34] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[06:35:05] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Mon 2017-10-23 06:35:03 UTC.
[06:37:44] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:43:11] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/385942 (https://phabricator.wikimedia.org/T164488)
[06:45:03] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/385942 (https://phabricator.wikimedia.org/T164488) (owner: Marostegui)
[06:46:16] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/385942 (https://phabricator.wikimedia.org/T164488) (owner: Marostegui)
[06:46:25] (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/385942 (https://phabricator.wikimedia.org/T164488) (owner: Marostegui)
[06:47:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1078 - T164488 (duration: 00m 47s)
[06:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:57] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[06:50:14] !log installing poppler security updates on trusty
[06:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:27] !log Stop replication in sync on db1078 and db1103 to checksum data - T164488
[06:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:10] !log installing expat security updates on trusty
[06:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:44] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.128 second response time
[07:14:39] (PS4) Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files [puppet] - https://gerrit.wikimedia.org/r/385365
[07:15:33] (CR) Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385365 (owner: Muehlenhoff)
[07:16:14] (PS1) Marostegui: db-codfw.php: Depool db2037 and db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/385943
[07:18:24] PROBLEM - Disk space on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:18:25] PROBLEM - Check whether ferm is active by checking the default input chain on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:18:25] PROBLEM - Check systemd state on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:18:25] PROBLEM - MD RAID on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:18:35] PROBLEM - Check size of conntrack table on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:18:55] PROBLEM - configured eth on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:18:55] PROBLEM - DPKG on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:19:24] PROBLEM - dhclient process on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:19:35] PROBLEM - puppet last run on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:21:03] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2037 and db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/385943 (owner: Marostegui)
[07:22:16] (Merged) jenkins-bot: db-codfw.php: Depool db2037 and db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/385943 (owner: Marostegui)
[07:22:24] (CR) jenkins-bot: db-codfw.php: Depool db2037 and db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/385943 (owner: Marostegui)
[07:23:35] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2037 and db2038 - T178359 (duration: 00m 46s)
[07:23:41] morning stat100[56], I am happy to see that you guys are having fun on Monday morning -.-
[07:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:43] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[07:26:11] (PS1) Marostegui: mariadb: Update socket location for db2038 [puppet] - https://gerrit.wikimedia.org/r/385944
[07:26:54] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2076994
[07:27:16] (CR) Marostegui: [C: 2] mariadb: Update socket location for db2038 [puppet] - https://gerrit.wikimedia.org/r/385944 (owner: Marostegui)
[07:40:25] PROBLEM - IPMI Sensor Status on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:40:35] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1006 is CRITICAL: Return code of 255 is out of bounds
[07:40:55] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.126 second response time
[07:42:04] RECOVERY - configured eth on stat1006 is OK: OK - interfaces up
[07:42:05] RECOVERY - DPKG on stat1006 is OK: All packages OK
[07:42:25] RECOVERY - dhclient process on stat1006 is OK: PROCS OK: 0 processes with command name dhclient
[07:42:34] RECOVERY - Disk space on stat1006 is OK: DISK OK
[07:42:34] RECOVERY - Check whether ferm is active by checking the default input chain on stat1006 is OK: OK ferm input default policy is set
[07:42:35] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational
[07:42:35] RECOVERY - MD RAID on stat1006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[07:42:45] RECOVERY - Check size of conntrack table on stat1006 is OK: OK: nf_conntrack is 0 % full
[07:44:35] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:57:59] (Abandoned) Giuseppe Lavagetto: profile::cache::base: add role::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/383072 (owner: Giuseppe Lavagetto)
[08:07:56] !log upgrading hhvm-wikdiff2 on mw1209-mw1220 (app servers)
[08:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:17] !log Rename wb_entity_per_page table on s5 db1092 and s3 db1077 - T177601
[08:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:26] T177601: Deploy dropping wb_entity_per_page table - https://phabricator.wikimedia.org/T177601
[08:10:25] RECOVERY - IPMI Sensor Status on stat1006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[08:10:35] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1006 is OK: OK: synced at Mon 2017-10-23 08:10:33 UTC.
[08:48:42] (CR) Alexandros Kosiaris: [C: 1] gerrit-ssh: don't listen on all interfaces, disable on slaves [puppet] - https://gerrit.wikimedia.org/r/354074 (owner: Dzahn)
[08:49:42] (PS5) Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files [puppet] - https://gerrit.wikimedia.org/r/385365
[08:49:50] (PS1) Elukey: eventlogging_sync: fix logrotate rule and change error log filename [puppet] - https://gerrit.wikimedia.org/r/385947
[08:49:58] (CR) Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385365 (owner: Muehlenhoff)
[08:50:17] (PS4) Muehlenhoff: Add thirdparty/confluent on stretch-based kafka brokers [puppet] - https://gerrit.wikimedia.org/r/385196
[08:51:10] (CR) Muehlenhoff: [C: 2] Add thirdparty/confluent on stretch-based kafka brokers [puppet] - https://gerrit.wikimedia.org/r/385196 (owner: Muehlenhoff)
[08:54:42] (CR) Alexandros Kosiaris: [C: -1] gerrit: let Apache proxy only listen on service IP (3 comments) [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[08:55:44] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/8411/" [puppet] - https://gerrit.wikimedia.org/r/385947 (owner: Elukey)
[08:55:47] (PS2) Elukey: eventlogging_sync: fix logrotate rule and change error log filename [puppet] - https://gerrit.wikimedia.org/r/385947
[08:56:05] RECOVERY - Check systemd state on relforge1002 is OK: OK - running: The system is fully operational
[08:59:53] (CR) Alexandros Kosiaris: "Not so sure about this. For starters, the cluster variable unfortunately is used in other places too, not just ganglia, namely icinga. But" [puppet] - https://gerrit.wikimedia.org/r/385385 (owner: Alexandros Kosiaris)
[09:00:45] RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational
[09:02:27] (PS1) Elukey: eventlogging_sync: fix variable substitution [puppet] - https://gerrit.wikimedia.org/r/385949
[09:03:18] (CR) Elukey: [C: 2] eventlogging_sync: fix variable substitution [puppet] - https://gerrit.wikimedia.org/r/385949 (owner: Elukey)
[09:04:35] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:06:24] (CR) Alexandros Kosiaris: Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385365 (owner: Muehlenhoff)
[09:06:35] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:09:46] !log upgrading hhvm-wikidiff2 on mw1189-mw1208 (API servers)
[09:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:05] (PS5) Gehel: maps: move kartotherian and tilerator sources to puppet [puppet] - https://gerrit.wikimedia.org/r/385394 (https://phabricator.wikimedia.org/T160215)
[09:17:12] (CR) Gehel: [C: 2] maps: move kartotherian and tilerator sources to puppet [puppet] - https://gerrit.wikimedia.org/r/385394 (https://phabricator.wikimedia.org/T160215) (owner: Gehel)
[09:21:54] RECOVERY - Disk space on stat1005 is OK: DISK OK
[09:28:14] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[09:40:24] PROBLEM - Nginx local proxy to apache on mw2127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:14] RECOVERY - Nginx local proxy to apache on mw2127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.195 second response time
[09:48:42] (PS1) MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793)
[09:52:41] (PS5) Gehel: maps: use puppet generated configs for sources. [puppet] - https://gerrit.wikimedia.org/r/385395 (https://phabricator.wikimedia.org/T160215)
[09:57:13] hi!
[09:57:40] (CR) Gehel: [C: 2] maps: use puppet generated configs for sources. [puppet] - https://gerrit.wikimedia.org/r/385395 (https://phabricator.wikimedia.org/T160215) (owner: Gehel)
[09:58:41] !log upgrading hhvm-wikidiff2 on mw1293-mw1298 (scalers)
[09:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:17] arturo: hey! :)
[10:01:11] !log restart varnish-be on cp4021 (mbox lag)
[10:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:59] ciao arturo !
[10:03:12] godog: :-)
[10:06:55] RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 0
[10:08:52] !log restart tilerator / karthotherian for config change on all maps clusters - T160215
[10:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:00] T160215: Clean up mess with Kartotherian and Tilerator configs - https://phabricator.wikimedia.org/T160215
[10:09:52] !log elasticsearch/cirrus reindexing 167 wikis from terbium (T177871)
[10:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:58] T177871: Re-index un-fallbacked languages - https://phabricator.wikimedia.org/T177871
[10:24:50] (PS1) Giuseppe Lavagetto: Fix apt_remove [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/385957
[10:28:03] !log Remove /srv from db1015 as it has been stopped for weeks now and will be decommissioned (and it is alerting low on disk space) - T173570
[10:28:05] !log upgrading hhvm-wikdiff2 on mw1180-mw1188 (app servers)
[10:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:11] T173570: Decommission db1015 - https://phabricator.wikimedia.org/T173570
[10:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:23] !log upgrading hhvm-wikdiff2 on mw1161-mw1167 (job runners)
[10:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:09] !log upgrading hhvm-wikdiff2 on remaining API servers in codfw
[10:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:44] !log Compress innodb on db2038 and db2084 - T178359
[11:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:52] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[11:11:46] !log upgrading hhvm-wikdiff2 on mw2097-mw2114
[11:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:17] Operations, PAWS, Pywikibot-Commons, Traffic, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3702993 (Chicocvenancio) While @BBlack's response does seem to make sense to me, I am wondering why pywikibot sends thes...
[11:40:42] (CR) Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385365 (owner: Muehlenhoff)
[11:41:54] (PS6) Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files [puppet] - https://gerrit.wikimedia.org/r/385365
[12:00:04] hoo: #bothumor My software never has bugs. It just develops random features. Rise for Description usage tracking. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171023T1200).
[12:00:05] No GERRIT patches in the queue for this window AFAICS.
[12:01:02] (CR) Hoo man: [C: 2] Enable description usage tracking on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/385340 (https://phabricator.wikimedia.org/T178515) (owner: Hoo man)
[12:01:57] (Merged) jenkins-bot: Enable description usage tracking on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/385340 (https://phabricator.wikimedia.org/T178515) (owner: Hoo man)
[12:02:09] (CR) jenkins-bot: Enable description usage tracking on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/385340 (https://phabricator.wikimedia.org/T178515) (owner: Hoo man)
[12:03:25] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable description usage tracking on further wikis (T178515) (duration: 00m 47s)
[12:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:33] T178515: Enable description usage tracking on further wikis - https://phabricator.wikimedia.org/T178515
[12:03:43] Wow… this one went pretty smoothly. I'm done :)
[12:07:46] Operations, PAWS, Pywikibot-Commons, Traffic, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3703068 (BBlack) Yeah there's a few different layers of issue wrapped up in this `Authorization` mess: 1. Pywikibot pro...
[12:12:37] !log bblack@neodymium conftool action : set/pooled=yes; selector: dc=ulsfo,cluster=cache_text,name=cp4029.ulsfo.wmnet
[12:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:43] !log bblack@neodymium conftool action : set/pooled=no; selector: dc=ulsfo,cluster=cache_text,name=cp4009.ulsfo.wmnet
[12:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:19] jouncebot: next
[12:19:19] In 0 hour(s) and 40 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171023T1300)
[12:20:41] !log bblack@neodymium conftool action : set/pooled=yes; selector: dc=ulsfo,cluster=cache_text,name=cp4030.ulsfo.wmnet
[12:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:04] !log upgrading hhvm-wikdiff2 on mw2254-mw2258 (app servers)
[12:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:05] (PS1) Hoo man: Remove hooserv.net from wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385964
[12:26:37] !log upgrading hhvm-wikdiff2 on mw2153-mw2162 (job runners)
[12:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:53] (CR) Hoo man: [C: 2] "Trivial" [mediawiki-config] - https://gerrit.wikimedia.org/r/385964 (owner: Hoo man)
[12:28:04] (Merged) jenkins-bot: Remove hooserv.net from wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385964 (owner: Hoo man)
[12:28:18] (CR) jenkins-bot: Remove hooserv.net from wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385964 (owner: Hoo man)
[12:29:32] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Remove hooserv.net from wgCopyUploadsDomains (duration: 00m 46s)
[12:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:08] !log bblack@neodymium conftool action : set/pooled=no; selector: dc=ulsfo,cluster=cache_text,name=cp4010.ulsfo.wmnet
[12:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:36] (PS2) BBlack: cache_upload: wipe client Authorization headers on ingress [puppet] - https://gerrit.wikimedia.org/r/385439 (https://phabricator.wikimedia.org/T178567)
[12:37:47] (CR) Filippo Giunchedi: "LGTM, minor nit in commit message" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385439 (https://phabricator.wikimedia.org/T178567) (owner: BBlack)
[12:39:05] !log upgrading hhvm-wikdiff2 on remaining video scalers in codfw
[12:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:22] (PS3) BBlack: cache_upload: wipe client Authorization+Cookie headers on ingress [puppet] - https://gerrit.wikimedia.org/r/385439 (https://phabricator.wikimedia.org/T178567)
[12:40:30] (CR) BBlack: cache_upload: wipe client Authorization+Cookie headers on ingress (1 comment) [puppet] - https://gerrit.wikimedia.org/r/385439 (https://phabricator.wikimedia.org/T178567) (owner: BBlack)
[12:43:24] PROBLEM - Apache HTTP on mw2207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:44:14] RECOVERY - Apache HTTP on mw2207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time
[12:46:03] !log upgrading hhvm-wikdiff2 on remaining job runners in codfw
[12:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:55] !log bblack@neodymium conftool action : set/pooled=yes; selector: dc=ulsfo,cluster=cache_text,name=cp4031.ulsfo.wmnet
[12:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:17] (CR) BBlack: [C: 2] cache_upload: wipe client Authorization+Cookie headers on ingress [puppet] - https://gerrit.wikimedia.org/r/385439 (https://phabricator.wikimedia.org/T178567) (owner: BBlack)
[12:53:25] !log bblack@neodymium conftool action : set/pooled=no; selector: dc=ulsfo,cluster=cache_text,name=cp4017.ulsfo.wmnet
[12:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:46] !log bblack@neodymium conftool action : set/pooled=yes; selector: dc=ulsfo,cluster=cache_text,name=cp4032.ulsfo.wmnet
[12:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:02] !log bblack@neodymium conftool action : set/pooled=no; selector: dc=ulsfo,cluster=cache_text,name=cp4018.ulsfo.wmnet
[13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171023T1300).
[13:00:05] Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:14] I can SWAT today
[13:00:28] Operations: Revisit Pybal depool thresholds for app servers - https://phabricator.wikimedia.org/T178799#3703098 (MoritzMuehlenhoff)
[13:02:45] I am not sure the URL will work though
[13:03:01] hashar: URL?
[13:03:11] https://gerrit.wikimedia.org/r/#/c/385829/2/wmf-config/InitialiseSettings.php
[13:03:12] https://gerrit.wikimedia.org/r/#/c/385829/
[13:03:59] well, Zoranzoki21 is not around for SWAT, so there will be no deployment anyway
[13:04:09] hashar: feel free to comment in gerrit
[13:04:45] I will wait for him a few more minutes, then close the swat window for now, reopening if he appears later during the window
[13:07:44] (CR) Zfilipin: "This was scheduled for EU SWAT deploy on October 23[0], but was not deployed since Zoranzoki21 was not available in #wikimedia-operations[" [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753) (owner: Zoranzoki21)
[13:08:08] !log EU SWAT finished, no deploys
[13:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:39] zeljkof: yeah that does not work
[13:12:10] (CR) Hashar: Added to $wgCopyUploadsDomains (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753) (owner: Zoranzoki21)
[13:14:43] (PS3) Hashar: Add aleph500.biblacad.ro to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753) (owner: Zoranzoki21)
[13:14:58] (PS4) Hashar: Add aleph500.biblacad.ro to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753) (owner: Zoranzoki21)
[13:15:13] (CR) Hashar: [C: 2] Add aleph500.biblacad.ro to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753) (owner: Zoranzoki21)
[13:15:51] zeljkof: $wgCopyUploadsDomains takes a list of domains . I have dropped the port part ":8991" :)
[13:16:02] hashar: great, thanks
[13:16:16] (PS2) Gehel: wdqs: garbage collection tuning [puppet] - https://gerrit.wikimedia.org/r/385364 (https://phabricator.wikimedia.org/T175919)
[13:16:28] (Merged) jenkins-bot: Add aleph500.biblacad.ro to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753) (owner: Zoranzoki21)
[13:16:37] (CR) jenkins-bot: Add aleph500.biblacad.ro to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753) (owner: Zoranzoki21)
[13:16:41] hashar: oh, I gave up on deploying it
[13:17:05] (PS3) Gehel: wdqs: garbage collection tuning [puppet] - https://gerrit.wikimedia.org/r/385364 (https://phabricator.wikimedia.org/T175919)
[13:17:12] (CR) Hashar: "Please deploy your change after merging!!!!! Thanks :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/385964 (owner: Hoo man)
[13:17:25] since the owner is not around
[13:17:31] are you deploying it? should I?
[13:17:53] for those jobs, we could just deploy them whenever spotted
[13:17:57] it is absolutely trivial :)
[13:18:11] I am deploying it
[13:18:11] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Remove hooserv.net from wgCopyUploadsDomains (duration: 00m 47s)
[13:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:19] there was another merged change that did not get deployed ^^
[13:18:42] (CR) Gehel: [C: 2] wdqs: garbage collection tuning [puppet] - https://gerrit.wikimedia.org/r/385364 (https://phabricator.wikimedia.org/T175919) (owner: Gehel)
[13:19:16] !log rolling restart of wdqs for GC tuning - T175919
[13:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:23] T175919: investigate GC times on wikidata query service - https://phabricator.wikimedia.org/T175919
[13:19:39] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Add aleph500.biblacad.ro to $wgCopyUploadsDomains - T178753 (duration: 00m 47s)
[13:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:45] T178753: Please add to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T178753
[13:19:46] (CR) Elukey: [C: 2] "Time to deploy AQS" [puppet] - https://gerrit.wikimedia.org/r/385339 (https://phabricator.wikimedia.org/T178312) (owner: Joal)
[13:19:50] (PS3) Elukey: aqs: update config template file [puppet] - https://gerrit.wikimedia.org/r/385339 (https://phabricator.wikimedia.org/T178312) (owner: Joal)
[13:20:02] Operations, ops-ulsfo, Traffic, hardware-requests: Decom cp4009,10,17,18 (4 nodes) - https://phabricator.wikimedia.org/T178801#3703144 (BBlack)
[13:20:53] hashar: thans
[13:20:55] thanks
[13:21:16] (PS1) BBlack: Remove cp4009,10,17,18 from ulsfo text cluster [puppet] - https://gerrit.wikimedia.org/r/385967 (https://phabricator.wikimedia.org/T178801)
[13:22:53] (CR) BBlack: [C: 2] Remove cp4009,10,17,18 from ulsfo text cluster [puppet] - https://gerrit.wikimedia.org/r/385967 (https://phabricator.wikimedia.org/T178801) (owner: BBlack)
[13:23:02] (PS2) BBlack: Remove cp4009,10,17,18 from ulsfo text cluster [puppet] - https://gerrit.wikimedia.org/r/385967 (https://phabricator.wikimedia.org/T178801)
[13:23:04] (CR) BBlack: [V: 2 C: 2] Remove cp4009,10,17,18 from ulsfo text cluster [puppet] - https://gerrit.wikimedia.org/r/385967 (https://phabricator.wikimedia.org/T178801) (owner: BBlack)
[13:24:12] hmm haven't seen this in a long time, during puppet-merge:
[13:24:30] I just merged and saw [INFO] conftool::cleanup: Removing node with tags ulsfo/cache_text/varnish-fe/cp4017.ulsfo.wmnet etc..
[13:24:34] Fetching new commits from https://gerrit.wikimedia.org/r/p/operations/puppet
[13:24:36] and my heath stopped for a scond :D
[13:24:37] error: Ref refs/remotes/origin/production is at 5f87f21ed7faf4788e7d46e37459fb2ed77088de but expected 8eaf8a9fa402d722fcb4f020c754ab74c2775a37
[13:24:39] *second
[13:24:41] From https://gerrit.wikimedia.org/r/p/operations/puppet ! 8eaf8a9..5f87f21 production -> origin/production (unable to update local ref)
[13:24:43] Connection to puppetmaster2001.codfw.wmnet closed.
[13:25:04] because I merged in gerrit before you merged on puppetmaster, I guess
[13:25:41] should be it yep
[13:26:25] yeah but now my change didn't make it to the 2002 master, it apparently aborted the remote syncs after that one failed
[13:26:49] anyways, seeing if it fixes itself from there
[13:29:05] Operations, ops-eqiad, DBA, Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3693095 (Cmjohnson) @marostegui the HP tech will be at the data center today to swap the controller. Is the server depooled?
[13:30:14] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 4 not-conn: cp1052_v4, cp1052_v6, cp1053_v4, cp1053_v6, cp1054_v4, cp1054_v6, cp1055_v4, cp1055_v6, cp1065_v4, cp1065_v6, cp1066_v4, cp1066_v6, cp1067_v4, cp1067_v6, cp1068_v4, cp1068_v6, cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6, cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6, cp2013_v4, cp2013_v6, cp2023_v4, cp2023_v6,kafka1012_v4,kafka1012_v6,kafka1013_v4,kafka101 [13:30:14] afka1014_v6,kafka1018_v4,kafka1018_v6,kafka1020_v4,kafka1020_v6,kafka1022_v4,kafka1022_v6 [13:30:14] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 4 not-conn: cp1052_v4, cp1052_v6, cp1053_v4, cp1053_v6, cp1054_v4, cp1054_v6, cp1055_v4, cp1055_v6, cp1065_v4, cp1065_v6, cp1066_v4, cp1066_v6, cp1067_v4, cp1067_v6, cp1068_v4, cp1068_v6, cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6, cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6, cp2013_v4, cp2013_v6, cp2023_v4, cp2023_v6,kafka1012_v4,kafka1012_v6,kafka1013_v4,kafka101 [13:30:14] afka1014_v6,kafka1018_v4,kafka1018_v6,kafka1020_v4,kafka1020_v6,kafka1022_v4,kafka1022_v6 [13:30:23] at least there will only be 4 of them [13:30:28] !log joal@tin Started deploy [analytics/aqs/deploy@ad22e0c]: Upgrade AQS node modules to restbase-latests [13:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:07] someone...refactored that ipsec alert :) awesome [13:33:34] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 404 (expecting: 200) [13:33:35] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[13:33:52] lovely [13:33:58] I'll go look at it [13:34:08] I am checking aqs1004, deployment ongoing [13:34:20] I think aqs is unrelated? [13:34:28] it's just my change to cache node metadata that's half-synced in the puppetmaster [13:34:41] completely unrelated, all analytics fault :) [13:34:44] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. [13:39:32] (03PS1) 10Giuseppe Lavagetto: Add namespace support [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/385969 [13:39:34] (03PS1) 10Giuseppe Lavagetto: Add support for nightly builds [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/385970 [13:45:38] (03PS2) 10Andrew Bogott: nfs: increase timeouts when checking mount availability [puppet] - 10https://gerrit.wikimedia.org/r/385939 [13:47:56] (03CR) 10Rush: [C: 031] "Let's see if this has an effect" [puppet] - 10https://gerrit.wikimedia.org/r/385939 (owner: 10Andrew Bogott) [13:48:46] !log Stop mysql and poweroff db1082 for HW maintenance - T178460 [13:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:54] T178460: db1082 storage crashed - https://phabricator.wikimedia.org/T178460 [13:48:54] (03CR) 10Andrew Bogott: [C: 032] nfs: increase timeouts when checking mount availability [puppet] - 10https://gerrit.wikimedia.org/r/385939 (owner: 10Andrew Bogott) [13:50:40] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802#3703216 (10Ottomata) [13:52:04] (03PS1) 10Marostegui: db1082.yaml: Update socket path [puppet] - 10https://gerrit.wikimedia.org/r/385971 (https://phabricator.wikimedia.org/T178460) [13:52:28] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3703239 (10Marostegui) @Cmjohnson db1082 is now off, feel free to power it off once the replacement has been done. Thank you! 
[13:58:14] (03CR) 10Marostegui: [C: 032] db1082.yaml: Update socket path [puppet] - 10https://gerrit.wikimedia.org/r/385971 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [14:01:41] (03PS1) 10Marostegui: db-codfw.php: Repool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385972 [14:03:03] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385972 (owner: 10Marostegui) [14:03:49] (03CR) 10Giuseppe Lavagetto: [C: 031] "Lgtm, fix the typo in the comment." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/384547 (https://phabricator.wikimedia.org/T178279) (owner: 10Volans) [14:04:32] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385972 (owner: 10Marostegui) [14:05:33] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2037 - T178359 (duration: 00m 46s) [14:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:41] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [14:06:30] (03CR) 10jenkins-bot: db-codfw.php: Repool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385972 (owner: 10Marostegui) [14:08:51] (03PS1) 10Herron: puppet: depool (via firewall) codfw puppetmaster for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/385976 (https://phabricator.wikimedia.org/T177254) [14:08:54] (03PS1) 10Muehlenhoff: Add library hint for expat [puppet] - 10https://gerrit.wikimedia.org/r/385977 [14:11:22] (03PS2) 10Muehlenhoff: Add library hint for expat [puppet] - 10https://gerrit.wikimedia.org/r/385977 [14:11:48] (03PS1) 10Ori.livneh: labs: Set $wgAutoloadAttemptLowercase to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385978 (https://phabricator.wikimedia.org/T166759) [14:13:19] (03PS2) 10Ori.livneh: labs: Set $wgAutoloadAttemptLowercase to false [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/385978 (https://phabricator.wikimedia.org/T166759) [14:13:24] (03CR) 10Muehlenhoff: [C: 032] Add library hint for expat [puppet] - 10https://gerrit.wikimedia.org/r/385977 (owner: 10Muehlenhoff) [14:14:10] (03CR) 10Ori.livneh: [C: 032] labs: Set $wgAutoloadAttemptLowercase to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385978 (https://phabricator.wikimedia.org/T166759) (owner: 10Ori.livneh) [14:15:17] (03Merged) 10jenkins-bot: labs: Set $wgAutoloadAttemptLowercase to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385978 (https://phabricator.wikimedia.org/T166759) (owner: 10Ori.livneh) [14:16:18] (03CR) 10jenkins-bot: labs: Set $wgAutoloadAttemptLowercase to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385978 (https://phabricator.wikimedia.org/T166759) (owner: 10Ori.livneh) [14:18:15] (03PS5) 10Gehel: wdqs: LVS check should reach blazegraph and do a simple query [puppet] - 10https://gerrit.wikimedia.org/r/384938 [14:19:14] (03CR) 10Gehel: [C: 032] wdqs: LVS check should reach blazegraph and do a simple query [puppet] - 10https://gerrit.wikimedia.org/r/384938 (owner: 10Gehel) [14:23:56] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385980 [14:23:59] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385980 [14:28:41] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385980 (owner: 10Marostegui) [14:30:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385980 (owner: 10Marostegui) [14:30:14] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385980 (owner: 10Marostegui) [14:30:30] (03CR) 10Filippo Giunchedi: [C: 031] Add systemd::tmpfile to configure /etc/tmpfiles.d 
configuration files [puppet] - 10https://gerrit.wikimedia.org/r/385365 (owner: 10Muehlenhoff) [14:31:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1078 - T164488 (duration: 00m 46s) [14:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:27] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488 [14:43:33] (03CR) 10Giuseppe Lavagetto: [C: 031] "Let's do it" [dns] - 10https://gerrit.wikimedia.org/r/385393 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:45:24] (03CR) 10Herron: [C: 032] puppet: depool codfw puppetmaster for upgrade [dns] - 10https://gerrit.wikimedia.org/r/385393 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:45:28] (03PS3) 10Herron: puppet: depool codfw puppetmaster for upgrade [dns] - 10https://gerrit.wikimedia.org/r/385393 (https://phabricator.wikimedia.org/T177254) [14:45:55] (03PS6) 10Ema: cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) [14:46:44] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:46:44] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:46:45] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:46:48] !log depooling codfw puppetmaster (via dns) [14:46:54] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:05] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:47:08] checking --^ [14:47:24] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:49:30] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3703355 (10BBlack) 05Open>03Resolved [14:49:55] PROBLEM - puppet 
last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:51:27] python data crunching process eating up memory [14:52:06] 10Operations, 10PAWS, 10Pywikibot-Commons, 10Traffic, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3703369 (10fgiunchedi) >>! In T178567#3700598, @BBlack wrote: > The original request did have an `Authorization` header fu... [14:52:22] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3703371 (10BBlack) [14:52:24] 10Operations, 10Traffic, 10Patch-For-Review: setup/install cp402[5-8].ulsfo.wmnet - https://phabricator.wikimedia.org/T172198#3703370 (10BBlack) 05Open>03Resolved [14:52:40] (03CR) 10Ema: [C: 032] cache: set varnish runtime parameters via hiera [puppet] - 10https://gerrit.wikimedia.org/r/385378 (https://phabricator.wikimedia.org/T159429) (owner: 10Ema) [14:53:03] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3703376 (10BBlack) 05Open>03Resolved I'm assuming there's nothing left to do here, re-open otherw... 
[14:53:40] 10Operations, 10Traffic, 10Patch-For-Review: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3481679 (10BBlack) 05Open>03Resolved [14:53:42] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (10BBlack) [14:53:58] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (10BBlack) [14:54:00] 10Operations, 10Traffic, 10Patch-For-Review: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3481656 (10BBlack) 05Open>03Resolved [14:54:24] PROBLEM - HHVM rendering on mw2205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:05] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. - https://phabricator.wikimedia.org/T169937#3703402 (10mobrovac) [14:55:07] 10Operations, 10Epic, 10Goal, 10Services (done), and 2 others: End of September milestone: Migrate first production use case - https://phabricator.wikimedia.org/T175637#3703399 (10mobrovac) 05Open>03Resolved The objective of this task has been fully achieved in September. Resolving. [14:55:14] RECOVERY - HHVM rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 77986 bytes in 0.271 second response time [14:59:27] 10Operations, 10Traffic, 10netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3420148 (10BBlack) Are these fetch-failure spikes still happening? 
[14:59:59] !log wiped resolver cache of records puppet.cofdw.wmnet and puppet.ulsfo.wmnet (T177254) [15:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:07] T177254: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254 [15:00:35] 10Operations, 10Traffic, 10netops: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3703417 (10BBlack) Anything to do here? Should we look further at whether most of these ICMPs seem related to real TCP connections to our services? [15:00:39] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3703418 (10herron) [15:01:52] 10Operations, 10Analytics, 10Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3270089 (10BBlack) Is this something we still need answers for, or have we just moved past it into a new normal? 
[15:03:25] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [15:03:37] (03PS1) 10Ema: cache: set timeout_idle on text and upload [puppet] - 10https://gerrit.wikimedia.org/r/385985 (https://phabricator.wikimedia.org/T159429) [15:03:54] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [15:03:54] RECOVERY - DPKG on stat1005 is OK: All packages OK [15:03:55] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [15:04:05] RECOVERY - Disk space on stat1005 is OK: DISK OK [15:04:14] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [15:04:55] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:05:40] 10Operations, 10Traffic: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#3703428 (10BBlack) 05Open>03Resolved a:03BBlack Resolved long ago! [15:07:01] 10Operations, 10Traffic, 10Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#3703435 (10BBlack) [15:07:03] 10Operations, 10MediaWiki-extensions-CentralNotice, 10Traffic: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#3703432 (10BBlack) 05Open>03Resolved a:03BBlack Not much left to discuss in this stale ticket. We have the information we need, and were able... [15:07:56] 10Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3703436 (10BBlack) We abandoned the original intent of this ticket, I think? 
[15:10:28] 10Operations, 10Traffic, 10netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3703437 (10ema) 05Open>03Resolved a:03ema [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text&v... [15:11:24] 10Operations, 10Traffic, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3703456 (10BBlack) [15:11:26] 10Operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#3703453 (10BBlack) 05Open>03Resolved a:03BBlack Closing this ticket as it's getting rather long in the tooth. We did reduce our TTL caps down to 1d across the board at all layers, with up to ~7d kee... [15:11:36] XioNoX: How is your AS renumbering comming along? [15:11:40] (03PS7) 10Muehlenhoff: Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files [puppet] - 10https://gerrit.wikimedia.org/r/385365 [15:11:42] 10Operations, 10Traffic: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#3703459 (10BBlack) Anything left to look at here? [15:12:14] (03PS1) 10Ottomata: Add kafka.group.id configs [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/385987 (https://phabricator.wikimedia.org/T178432) [15:12:24] 10Operations, 10Traffic, 10codfw-rollout: Enable VCL applayer datacenter-switch via confd - https://phabricator.wikimedia.org/T127485#3703467 (10BBlack) [15:12:26] 10Operations, 10Traffic, 10codfw-rollout: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482#3703468 (10BBlack) [15:12:29] 10Operations, 10Traffic, 10Patch-For-Review, 10codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#3703465 (10BBlack) 05Open>03Resolved a:03BBlack [15:12:34] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:12:52] (03CR) 10Muehlenhoff: [C: 032] Add systemd::tmpfile to configure /etc/tmpfiles.d configuration files [puppet] - 10https://gerrit.wikimedia.org/r/385365 (owner: 10Muehlenhoff) [15:13:17] (03PS7) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [15:13:35] (03CR) 10Ottomata: [C: 032] Add kafka.group.id configs [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/385987 (https://phabricator.wikimedia.org/T178432) (owner: 10Ottomata) [15:16:11] 10Operations, 10Traffic, 10Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#3703485 (10BBlack) 05Open>03Resolved a:03BBlack [15:16:46] 10Operations, 10Traffic, 10WMF-Communications, 10HTTPS, 10Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#3703488 (10BBlack) 05stalled>03Resolved a:03BBlack No updates in 11 months. I assume this is just the new normal... [15:16:57] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802#3703491 (10Nuria) Approved on my end. [15:17:49] 10Operations, 10Traffic, 10HTTPS: HTTPS RFC5077 session tickets encryption key rollovers - https://phabricator.wikimedia.org/T86671#3703493 (10BBlack) We still haven't had time to work on doing this "right". Most likely the effort is better invested doing similar things on the TLSv1.3 side at this point, ra... 
[15:18:02] 10Operations, 10Traffic, 10HTTPS: HTTPS RFC5077 session tickets encryption key rollovers - https://phabricator.wikimedia.org/T86671#3703498 (10BBlack) [15:18:04] 10Operations, 10Traffic, 10Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567#3703496 (10BBlack) [15:18:29] 10Operations, 10Traffic, 10Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578041 (10ema) @dbarratt can you please provide some examples, including request/response headers and body, the behavior you're seeing and the one you'd expect... [15:18:57] 10Operations, 10Traffic, 10HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703501 (10BBlack) It seems like we've made a lot of progress on this front since late 2015. Should we consider this resolved now? @Robh? [15:19:18] 10Operations, 10Traffic: more robust certificate chain creation in puppet - https://phabricator.wikimedia.org/T84543#3703502 (10BBlack) 05Open>03Resolved a:03BBlack [15:19:31] (03PS8) 10Muehlenhoff: Create /run/nutcracker on stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/384980 (https://phabricator.wikimedia.org/T178457) [15:19:37] multichill: pretty well, see https://phabricator.wikimedia.org/T167840 for the details [15:20:35] 10Operations, 10Traffic, 10HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703517 (10RobH) All certificates are now tracked in icinga so I think this can indeed be resolved. (We've also transitioned over to LE for the bulk of non-wildcards!) 
[15:20:37] (03PS1) 10Rush: aborrero: new opsen user and key [puppet] - 10https://gerrit.wikimedia.org/r/385988 (https://phabricator.wikimedia.org/T178807) [15:20:39] (03PS1) 10Ottomata: Add kafka.offset.store.method config [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/385989 [15:22:31] (03PS1) 10Muehlenhoff: redis: Switch to systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/385990 [15:25:56] XioNoX: ^ i put you in topic ;D [15:27:02] thx! [15:27:44] (03CR) 10Ottomata: [V: 032 C: 032] Add kafka.offset.store.method config [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/385989 (owner: 10Ottomata) [15:30:20] 10Operations, 10Ops-Access-Requests: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703542 (10chasemp) [15:30:27] 10Operations, 10Ops-Access-Requests: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703572 (10chasemp) https://gerrit.wikimedia.org/r/#/c/385988/ [15:30:48] (03PS2) 10Rush: aborrero: new opsen user and key [puppet] - 10https://gerrit.wikimedia.org/r/385988 (https://phabricator.wikimedia.org/T178807) [15:31:38] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703587 (10chasemp) p:05Triage>03Normal [15:33:44] 10Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3703590 (10Nuria) Ticket can be closed. [15:35:13] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3703592 (10daniel) I have thought a bit about ways to mitigate this. Here are three things I think could help: * {T178804} (doable in a weeks or two, i... 
[15:37:34] (03PS1) 10Muehlenhoff: graphite: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/385994 [15:39:48] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703615 (10chasemp) [15:40:55] 10Operations, 10Traffic, 10HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3703623 (10BBlack) [15:40:59] 10Operations, 10Traffic, 10Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#3703620 (10BBlack) 05Open>03Resolved a:03BBlack https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [15:41:46] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703542 (10bd808) Manager approved. Arturo is a new hire on the #cloud-services-team and will be helping us with many root tasks. [15:42:29] 10Operations, 10Traffic: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3057797 (10BBlack) Was this resolved or are we still getting failures here? [15:42:34] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:43:25] (03PS3) 10Rush: aborrero: new opsen user and key [puppet] - 10https://gerrit.wikimedia.org/r/385988 (https://phabricator.wikimedia.org/T178807) [15:43:33] 10Operations, 10Traffic: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3703643 (10BBlack) This came up again recently. We really should make the switch to `nginx-light` (carefully, to avoid mass-restart!) 
[15:43:35] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703644 (10chasemp) [15:43:46] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703542 (10chasemp) [15:44:15] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703542 (10chasemp) [15:45:19] (03PS3) 10Volans: PuppetDB backend: Class, Roles and Profiles shortcuts [software/cumin] - 10https://gerrit.wikimedia.org/r/384547 (https://phabricator.wikimedia.org/T178279) [15:45:34] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703542 (10chasemp) @aborrero first ping :) This is the access request for production shell access in the works. [15:46:42] 10Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 10HTTPS: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3703654 (10BBlack) 05Open>03Resolved a:03BBlack The immediate error noted at the start of this ticket is expected. wikispecies.org is not in ou... [15:47:09] (03CR) 10Volans: "Fixed comment, added description in the README too and added example of fact query that was missing." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/384547 (https://phabricator.wikimedia.org/T178279) (owner: 10Volans) [15:47:15] 10Operations, 10Traffic: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3703659 (10faidon) We get occasional rare failures depending on the availability of the CT log servers. I don't see a way around this unless we make our cronjobs quite a bit more sophisticated (e.g.... 
[15:47:35] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [15:47:49] 10Operations, 10Traffic, 10Patch-For-Review: Improve OCSP fetching and monitoring strategies - https://phabricator.wikimedia.org/T172116#3703660 (10BBlack) 05Open>03Resolved a:03BBlack Seems pretty robust as of the changes above. I don't think it's worth pursuing this further at this time. We might r... [15:47:57] another BBU failure? [15:48:20] XioNoX: I'll do AS1126 now [15:48:44] 10Operations, 10Traffic: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3703664 (10BBlack) 05Open>03Resolved a:03BBlack So far in all the cases I've seen, when we've hit the ratelimit it's been a useful signal to tell us we've got broken software and/... [15:48:52] 10Operations, 10Traffic: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3703668 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None [15:49:06] multichill: thank you, let me know if I can help [15:50:22] 10Operations, 10Traffic, 10HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703670 (10BBlack) 05Open>03Resolved a:03BBlack [15:50:32] 10Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3703672 (10BBlack) 05Open>03Resolved a:03BBlack [15:51:20] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3703680 (10BBlack) [15:51:23] 10Operations, 10Traffic: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137#3703677 (10BBlack) 05Open>03Resolved a:03BBlack Ok I'm gonna say it's not a pressing issue for now then. To revisit the next time it really bothers us! 
[15:53:39] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703682 (10chasemp) He signed L3 [15:55:43] XioNoX: You could send the right AS now? :-) [15:56:03] Ah, there it is [15:57:35] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [15:57:45] (03CR) 10Filippo Giunchedi: [C: 031] graphite: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/385994 (owner: 10Muehlenhoff) [15:57:47] (03PS1) 10Herron: WIP: add puppet package version paramater to puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/385999 [15:58:06] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703700 (10chasemp) [15:58:10] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-Elukey: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489#3703701 (10faidon) pmacct 1.7.0-1 (with GeoIP2 support too!) was uploaded to sid yesterday. This should be as easy as a backport-and-install now. [15:58:12] (03CR) 10jerkins-bot: [V: 04-1] WIP: add puppet package version paramater to puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/385999 (owner: 10Herron) [15:59:13] !log forced BBU learn cycle on db1046 - T166141 [15:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:22] T166141: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141 [16:00:00] (03CR) 10Filippo Giunchedi: [C: 031] redis: Switch to systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/385990 (owner: 10Muehlenhoff) [16:01:04] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-Elukey: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489#3703711 (10elukey) Could be a good candidate for the Kafka Jumbo cluster! 
In this case it could use librdkafka 0.11 to negotiate API without c... [16:03:57] 10Operations, 10netops, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3703725 (10Multichill) Did AS1126: maartend@vancis-asd01-r01> show bgp summary | match 14907 80.249.209.176 14907 6... [16:06:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3703728 (10chasemp) @Cmjohnson, can we RMA this back to oblivion yet? :D [16:08:29] XioNoX: Actually, it looks like your objects in the RIPE db are not up to date yet [16:09:14] All your Ams-ix peers are still under the old AS. [16:13:08] I'll look, thanks [16:13:19] I'll leave the links in the ticket [16:15:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3703744 (10Cmjohnson) @chasemp, no unfortunately it does not work that way. I new CPU and motherboard has been requested through Dell. I believe that will fix the issue. T... [16:18:07] 10Operations, 10ops-ulsfo, 10Traffic: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#3703753 (10BBlack) [16:19:03] 10Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3703755 (10BBlack) 05Open>03Resolved [16:20:02] 10Operations, 10Traffic: cp2017 froze and stopped serving traffic - https://phabricator.wikimedia.org/T159056#3703766 (10BBlack) 05Open>03Resolved a:03BBlack No recurrence AFAIK, closing. 
[16:21:40] 10Operations, 10ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#3703780 (10BBlack) [16:21:42] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3703779 (10BBlack) [16:21:44] 10Operations, 10ops-ulsfo, 10Traffic: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#3703775 (10BBlack) 05Open>03Resolved At this point, we'll just do the new 3-server setup on the new lvs400[567] systems in T178436 and ignore this until decom, basically. [16:23:01] 10Operations, 10netops, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3703782 (10Multichill) I was updating RIPE db and I noticed some of the records are still lagging. * Old AS record is at https://apps.d... [16:25:38] 10Operations, 10ops-eqiad, 10Traffic: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3703786 (10BBlack) Apparently this machine is back in service (since when I'm not sure, but it's been a while I think). It's still showing temp alerts in dmesg.... [16:27:04] 10Operations, 10ops-eqiad, 10Traffic: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3703789 (10BBlack) Interestingly, the IPMI sensors check in icinga is showing this machine as being fine. I wonder what the discrepancy is between that and the MCEs and dmesg? 
[16:32:10] 10Operations, 10ops-ulsfo, 10Traffic: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10RobH) [16:32:42] 10Operations, 10ops-ulsfo, 10Traffic: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10RobH) [16:33:48] 10Operations, 10ops-ulsfo, 10Traffic: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3703799 (10BBlack) [16:33:50] 10Operations, 10ops-ulsfo, 10Traffic, 10hardware-requests, 10Patch-For-Review: Decom cp4009,10,17,18 (4 nodes) - https://phabricator.wikimedia.org/T178801#3703820 (10BBlack) [16:37:17] 10Operations, 10JobRunner-Service, 10MediaWiki-extensions-WikibaseClient, 10Wikidata: Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes. - https://phabricator.wikimedia.org/T178810#3703827 (10daniel) @joe To clarify, these parameters are for ops to tweak. I can guess at good v... [16:38:53] 10Operations, 10JobRunner-Service, 10MediaWiki-extensions-WikibaseClient, 10Wikidata: Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes. - https://phabricator.wikimedia.org/T178810#3703829 (10daniel) [16:43:45] PROBLEM - Host db1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:44:35] 10Operations, 10JobRunner-Service, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10User-Joe: Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes. 
- https://phabricator.wikimedia.org/T178810#3703849 (10Joe) [16:48:14] (03PS1) 10Joal: Add http-subrequest to AQS config [puppet] - 10https://gerrit.wikimedia.org/r/386013 [16:48:55] RECOVERY - Host db1082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [16:50:58] (03CR) 10Elukey: Add http-subrequest to AQS config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386013 (owner: 10Joal) [16:51:02] (03CR) 1020after4: [C: 031] Scap prep: Clean up everything, fix up StartProfiler symlink mess [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384898 (https://phabricator.wikimedia.org/T126306) (owner: 10Chad) [16:51:37] (03PS4) 1020after4: Phabricator: Override the frog token's label [puppet] - 10https://gerrit.wikimedia.org/r/371660 (https://phabricator.wikimedia.org/T173208) (owner: 10Greg Grossmeier) [16:52:31] (03CR) 10Chad: [C: 032] Scap prep: Clean up everything, fix up StartProfiler symlink mess [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384898 (https://phabricator.wikimedia.org/T126306) (owner: 10Chad) [16:53:09] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.19 (duration: 03m 09s) [16:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:17] Hi [16:53:23] Please abandon all patches from: https://gerrit.wikimedia.org/r/#/q/owner:minhtq15%2540fsoft.com.vn+status:open [16:56:19] (03PS1) 10Ottomata: Add tbayer to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/386015 (https://phabricator.wikimedia.org/T178802) [16:56:54] (03CR) 10Ottomata: [C: 032] "Reviewed in ops meetings" [puppet] - 10https://gerrit.wikimedia.org/r/386015 (https://phabricator.wikimedia.org/T178802) (owner: 10Ottomata) [16:57:22] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? 
- https://phabricator.wikimedia.org/T178570#3703879 (10greg) [16:57:52] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802#3703882 (10Ottomata) [16:58:08] (03CR) 10Elukey: [C: 04-1] "This code also needs to have the correspondent profile changes, going to fix it in a sec :)" [puppet] - 10https://gerrit.wikimedia.org/r/386013 (owner: 10Joal) [16:59:44] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team (Backlog): Upgrade ci ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#3703892 (10greg) [17:00:05] gehel: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171023T1700). [17:00:05] Smalyshev: A patch you scheduled for Wikidata Query Service weekly deploy is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:01:05] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703542 (10RobH) Please note this access was approved in the operations meeting today, but cannot be merged live until AFTER @aborrero has signed the NDA/DATA preservatio... [17:01:17] I have one question about messages from jouncebot here [17:01:18] 10Operations, 10Gerrit, 10Release-Engineering-Team (Backlog): Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3703903 (10greg) [17:01:20] 10Operations, 10Traffic, 10HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#3703905 (10Dzahn) There are (seperate) Icinga checks for the *.planet.wikimedia.org and the *.wmfusercontent.org cert that recently alerted on upcoming expiry of the main unified cert. They have be... 
[17:01:26] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703906 (10chasemp) Approved but must FIRST follow up and agree with Legal and data retention guidelines [17:01:36] Why jouncebot always add message: Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:01:43] on which sticker he think? [17:01:50] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802#3703198 (10RobH) Please note adding Tilman to analytics-admins access was approved in today's operations meeting. [17:01:55] jouncebot: o/ [17:02:01] (03PS4) 10Rush: WIP: aborrero: new opsen user and key [puppet] - 10https://gerrit.wikimedia.org/r/385988 (https://phabricator.wikimedia.org/T178807) [17:02:07] Zoranzoki21: It's a joke. Don't take it too seriously. [17:02:23] Zoranzoki21: And it's a bot. Not a he or she. [17:02:32] Ok [17:02:41] Thank you for reply [17:03:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3703915 (10chasemp) OK, thanks @Cmjohnson. We'll hang tight for the new board. [17:03:39] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3703916 (10Cmjohnson) The controller has been replaced and the server has been powered on. @marostegui please resolve task when you're comfortable with the new controller. [17:10:15] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3703939 (10Dzahn) @gilles You'll have to use a http_proxy. see https://wikitech.wikimedia.org/wiki/HTTP_proxy Let me know if tha... 
[17:11:53] I will ask again: [18:53] Hi [18:53] Please abandon all patches from: https://gerrit.wikimedia.org/r/#/q/owner:minhtq15%2540fsoft.com.vn+status:open [17:13:08] 10Operations, 10Traffic: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3703942 (10Dzahn) Yep, at the time of writing this ticket we weren't aware that the rate-limiting issue was ultimately caused by a software issue specific to stretch machines (openssl o... [17:13:50] (03PS2) 10Elukey: aqs: add http-subrequest to hyperswitch's config [puppet] - 10https://gerrit.wikimedia.org/r/386013 (owner: 10Joal) [17:15:20] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to ops for aborrero - https://phabricator.wikimedia.org/T178809#3703943 (10Dzahn) We can start with getting him on non-public IRC channels and the desired mailing lists. Also wikitech LDAP user, then we can start with a patch to give... [17:16:08] 10Operations, 10Gerrit, 10Release-Engineering-Team (Backlog): Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774#3703944 (10Dzahn) p:05Lowest>03Low Moving up from Lowest to Low, we are getting closer since the latest systemd improvements on cobalt :) [17:16:46] Zoranzoki21: Can I ask why you archived last week's deployment calendar and then removed it from the archive? [17:17:04] I removed from archive?? [17:17:12] SMalyshev: blazegraph war has disappeared from the latest wdqs deploy repo. [17:17:18] https://wikitech.wikimedia.org/w/index.php?title=Deployments/Archive/2017/10&diff=next&oldid=1773317 [17:17:29] buh, maybe not? [17:17:36] was it doubled? [17:17:52] Because I added same twice.. [17:17:55] I removed duplicate [17:17:59] yeah, bad copy/paste, gotcha, I mis-read the diffs [17:18:05] Zoranzoki21: thanks :) [17:18:11] ok.. sorry..
All is ok:https://wikitech.wikimedia.org/wiki/Deployments/Archive/2017/10 [17:18:20] (03CR) 10Bearloga: Add profiles/roles for stats/ML on Wikimedia Cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/383916 (https://phabricator.wikimedia.org/T178096) (owner: 10Bearloga) [17:21:28] (03PS3) 10Elukey: aqs: add http-subrequest to hyperswitch's config [puppet] - 10https://gerrit.wikimedia.org/r/386013 (owner: 10Joal) [17:22:19] 10Operations, 10Deployments, 10HHVM, 10Performance-Team (Radar), 10Release-Engineering-Team (Watching / External): Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#3703958 (10demon) We've been running the reusable TC in beta for over 2 m... [17:24:10] !log gehel@tin Started deploy [wdqs/wdqs@4626acb]: deploying latest WDQS version, blazegraph + GUI updates [17:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:43] (03CR) 10Elukey: "Puppet compiler run: https://puppet-compiler.wmflabs.org/compiler02/8423/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/386013 (owner: 10Joal) [17:26:17] !log gehel@tin Finished deploy [wdqs/wdqs@4626acb]: deploying latest WDQS version, blazegraph + GUI updates (duration: 02m 07s) [17:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:02] (03PS2) 10Chad: Scap prep: Clean up everything, fix up StartProfiler symlink mess [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384898 (https://phabricator.wikimedia.org/T126306) [17:30:40] gehel: which version of the GUI did you deploy? 
[17:31:00] Lucas_WMDE: the one prepared by Stas :) lemme check [17:31:18] Lucas_WMDE: commit 2d65139 [17:31:20] it doesn’t look like https://gerrit.wikimedia.org/r/#/c/385966/ is included [17:31:25] (03CR) 10jenkins-bot: Scap prep: Clean up everything, fix up StartProfiler symlink mess [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384898 (https://phabricator.wikimedia.org/T126306) (owner: 10Chad) [17:31:54] Lucas_WMDE: it was not. We could re-build and re-deploy [17:32:17] !log demon@tin Synchronized scap/plugins/prep.py: Fixes and improvements of various sorts (duration: 00m 45s) [17:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:39] SMalyshev: that would be great, because that change should really be deployed before this weekend :) [17:32:45] (03PS4) 10Elukey: aqs: add http-subrequest to hyperswitch's config [puppet] - 10https://gerrit.wikimedia.org/r/386013 (owner: 10Joal) [17:33:27] SMalyshev / Lucas_WMDE: ping me when ready to deploy and I'll do it later this evening [17:34:26] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/8424/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/386013 (owner: 10Joal) [17:39:29] (03CR) 10Volans: [C: 031] "LGTM" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/385957 (owner: 10Giuseppe Lavagetto) [17:43:54] gehel: I've updated the gui checkout, ready to deploy [17:44:21] SMalyshev, Lucas_WMDE: I'll take a short break before doing the deployment, if it can wait 30'... [17:44:31] I think it can, thanks! [17:45:01] (03CR) 10Volans: "One conceptual doubt inline" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/385969 (owner: 10Giuseppe Lavagetto) [17:46:06] sure, thanks SMalyshev! 
[17:47:45] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [17:47:57] ufff [17:50:49] (03PS5) 10Elukey: aqs: add http-subrequest to hyperswitch's config [puppet] - 10https://gerrit.wikimedia.org/r/386013 (owner: 10Joal) [17:54:46] !log gehel@tin Started deploy [wdqs/wdqs@b84592f]: WDQS GUI update [17:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:10] 10Operations, 10Traffic: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3272649 (10BBlack) I don't think anything has changed since on Google's end. Do we try harder or just accept it? [17:56:35] SMalyshev, Lucas_WMDE: WDQS GUI updated, tests are still green... [17:56:42] cool, thanks! [17:56:44] !log gehel@tin Finished deploy [wdqs/wdqs@b84592f]: WDQS GUI update (duration: 01m 58s) [17:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:41] gehel: great, thanks! birthday present detection works now :) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171023T1800). [18:00:04] Zoranzoki21: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:42] (03CR) 10Elukey: [C: 032] aqs: add http-subrequest to hyperswitch's config [puppet] - 10https://gerrit.wikimedia.org/r/386013 (owner: 10Joal) [18:00:57] Zoranzoki21: Um, that's not a SWATable thing. [18:01:10] We don't host Translatewiki, that's an outside site :) [18:01:12] No? [18:01:44] Nope, not a WMF wiki :) [18:03:29] Ok.. Where to add it than? 
[18:03:53] you have to add translatewiki reviewers [18:04:51] Who is it? [18:04:55] *are it? [18:04:56] You'll wanna talk to Niklas and/or Siebrand [18:05:18] Ok, I will add Niklas and Siebrand to review [18:05:20] Thank you [18:05:29] Raimond Spekking [18:05:57] https://gerrit.wikimedia.org/r/#/c/385843/ [18:06:18] This one I Schedule Now For SWAT. [18:08:52] (03CR) 10Volans: "nitpick inline" (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/385970 (owner: 10Giuseppe Lavagetto) [18:08:52] 10Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 10HTTPS: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3704153 (10Framawiki) 05Resolved>03declined (the problem is not resolved, so I change the status of this task) [18:20:15] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 404 (expecting: 200) [18:20:29] 10Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3704183 (10BBlack) Bump, I had to re-dig into this ticket a bit to catch myself up, so, re-summarizing: 1) In `src/http/ngx_http_request.c` at the top of `ngx_http_ssl_... [18:20:39] aqs is me, not taking traffic, sorry :) [18:20:56] Can someone Merge My Patch? 
[18:21:44] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:44] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:44] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:44] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:44] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:45] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp 
Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:45] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:46] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:46] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:47] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:47] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead 
responds with malformed body (AttributeError: NoneType object has no attribute get) [18:21:48] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [18:24:15] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 404 (expecting: 200) [18:24:40] so aqs1004 is depooled from LVS and I am trying to fix a deployment, but the rest ? [18:25:03] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#3704189 (10BBlack) 05Open>03Resolved a:03BBlack [18:25:37] I just copy/pasted the icinga message to the -mobile channel, but /me shrugs [18:26:12] is there a non-opsen on the page list for that service? [18:26:19] a mobile apps team person, I mean [18:26:25] I'm here [18:26:35] basically me and mdholloway [18:26:52] looking at the logs (sorry still in a meeting) [18:29:45] * mdholloway is looking [18:31:19] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3704198 (10Marostegui) Thanks @Cmjohnson - I have started MySQL and will leave it running over night. If all goes well, I will close this ticket tomorrow. The crashes are not so frequen... [18:36:13] do you guys need help? What is the current impact? [18:38:25] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [18:38:31] bearND: mdholloway ^ [18:39:12] Not sure. 
I cannot repro locally and when i run the endpoint on scb1001 it looks fine to me. [18:39:31] 10Operations, 10DNS, 10Traffic: Consider DNSSec - https://phabricator.wikimedia.org/T26413#3704222 (10BBlack) 05Open>03stalled I've tried in the past to keep myself fairly open to the eventual inevitability of DNSSEC and keep my comments even-handed on the matter. I was willing to capitulate to mass op... [18:39:44] only when I run the check-mobileapps script I get an error [18:40:33] (03PS9) 10Bearloga: Add profiles/roles for stats/ML on Wikimedia Cloud [puppet] - 10https://gerrit.wikimedia.org/r/383916 (https://phabricator.wikimedia.org/T178096) [18:40:47] i don't think this should have any end user impact but this may be preventing RESTBase from updating (cc Pchelolo: seeing any errors?) [18:41:15] bearND: interestingly, one of the unit tests just broke as well (en Uranus should have pronunciation, for the mobile-sections-lead endpoint) [18:43:45] (03PS1) 10Elukey: role::aqs: fix druid endpoint whitelisted pattern [puppet] - 10https://gerrit.wikimedia.org/r/386038 [18:44:20] yup bearND, mdholloway it seems the pronunciation is missing in the response [18:45:20] mobrovac: the weird part is i can't see any recent deployment that could have caused a change in the output [18:45:59] yeah, same here. I just checked it Parsoid was recently deployed but only see it's still coming up for today [18:46:09] (03CR) 10Elukey: [C: 032] role::aqs: fix druid endpoint whitelisted pattern [puppet] - 10https://gerrit.wikimedia.org/r/386038 (owner: 10Elukey) [18:46:22] s/it/if/ [18:48:07] and that page that fails the test has been last edited 16 days ago [18:48:46] 10Operations, 10Puppet: Granular puppet version selection feature - https://phabricator.wikimedia.org/T178825#3704234 (10herron) [18:50:01] 10Operations, 10Puppet: Granular puppet version selection feature - https://phabricator.wikimedia.org/T178825#3704247 (10herron) [18:50:09] Pronunciation missing in response? 
I saw something about TextExtracts and that complaint on enwp VPT [18:50:17] 10Operations, 10Puppet: Granular puppet version installation - https://phabricator.wikimedia.org/T178825#3704234 (10herron) [18:51:28] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#IPA_pronunciation_missing_while_requesting_page_through_JSON_API - possible red herring, but you all mentioned pronunciations soooooo..... [18:51:36] (03PS2) 10Herron: WIP: add puppet package version paramater to puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/385999 (https://phabricator.wikimedia.org/T178825) [18:52:09] (03CR) 10jerkins-bot: [V: 04-1] WIP: add puppet package version paramater to puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/385999 (https://phabricator.wikimedia.org/T178825) (owner: 10Herron) [18:52:20] !log joal@tin Finished deploy [analytics/aqs/deploy@ad22e0c]: Upgrade AQS node modules to restbase-latests (duration: 321m 51s) [18:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:34] PROBLEM - AQS root url on aqs1004 is CRITICAL: connect to address 10.64.0.107 and port 7232: Connection refused [18:52:52] !log joal@tin Started deploy [analytics/aqs/deploy@ad22e0c]: Upgrade AQS node modules to restbase-latests - Redeploy after config fix [18:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:34] RECOVERY - AQS root url on aqs1004 is OK: HTTP OK: HTTP/1.1 200 - 278 bytes in 0.023 second response time [18:53:54] (03CR) 10Herron: "why is this failing?" 
[puppet] - 10https://gerrit.wikimedia.org/r/385999 (https://phabricator.wikimedia.org/T178825) (owner: 10Herron) [18:55:08] bearND: mobrovac: this looks like our culprit: https://en.wikipedia.org/w/index.php?title=Template%3AIPA_audio_link&type=revision&diff=806709058&oldid=783985633 [18:56:35] PROBLEM - AQS root url on aqs1005 is CRITICAL: connect to address 10.64.32.138 and port 7232: Connection refused [18:57:06] deployment issue, should be fixed in a few --^ [18:58:00] !log joal@tin Finished deploy [analytics/aqs/deploy@ad22e0c]: Upgrade AQS node modules to restbase-latests - Redeploy after config fix (duration: 05m 07s) [18:58:03] elukey: you need help? [18:58:04] bearND: mobrovac: https://gerrit.wikimedia.org/r/#/c/386040/ [18:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:08] !log joal@tin Started deploy [analytics/aqs/deploy@ad22e0c]: Upgrade AQS node modules to restbase-latests - Redeploy after config fix [18:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:23] mobrovac: nono we had an ops/deployer sync issue :) [18:58:25] thanks [18:58:26] mdholloway: Thanks. Waiting for the build.
[18:58:35] RECOVERY - AQS root url on aqs1005 is OK: HTTP OK: HTTP/1.1 200 - 278 bytes in 0.021 second response time [18:58:44] PROBLEM - Check Varnish expiry mailbox lag on cp4024 is CRITICAL: CRITICAL: expiry mailbox lag is 2049591 [18:59:00] bearND: mdholloway: i created T178828, let's put the findings there please :) [18:59:00] T178828: Missing pronunciation in MCS test response - https://phabricator.wikimedia.org/T178828 [18:59:38] mdholloway: ok, then please also link your patch to this task [19:00:43] !log joal@tin Finished deploy [analytics/aqs/deploy@ad22e0c]: Upgrade AQS node modules to restbase-latests - Redeploy after config fix (duration: 02m 34s) [19:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:03] yesssss [19:17:44] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [19:17:49] greg-g: we have a patch for the issue (https://phabricator.wikimedia.org/T178828). Mind if I deploy this right away instead of waiting until the top of the hour? [19:20:05] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:20:06] 21:18:46. [19:20:06] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:20:06] 21:18:46. 
[19:20:06] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:20:06] 21:18:46. [19:20:06] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:20:07] 21:18:46. [19:21:26] fix will be deployed shortly ^ [19:21:35] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:21:35] 21:21:14. [19:21:35] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:21:35] 21:21:14. 
[19:21:35] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:21:36] 21:21:14. [19:21:36] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:21:37] 21:21:14. [19:21:37] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:21:38] 21:21:14. [19:21:38] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowledgement exp [19:21:39] 21:21:14. 
[19:22:00] (ha, i guess that's what the message says) [19:22:59] ACKNOWLEDGEMENT - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowl [19:22:59] : 2017-10-24 21:22:42. [19:23:50] (PS5) Paladox: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) [19:23:55] ACKNOWLEDGEMENT - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Marko Obrovac Deploy imminent - T178828 - The acknowl [19:23:55] : 2017-10-24 21:23:44. 
[19:24:22] (PS4) Paladox: Gerrit: Replace certificates with tokens for its-phabricator [labs/private] - https://gerrit.wikimedia.org/r/384902 (https://phabricator.wikimedia.org/T178385) [19:29:15] ok, going to deploy mobileapps [19:30:06] !log bsitzmann@tin Started deploy [mobileapps/deploy@a654ce1]: Update mobileapps to 3628105 (T178828) [19:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:13] T178828: Missing pronunciation in MCS test response - https://phabricator.wikimedia.org/T178828 [19:31:12] (CR) Paladox: "recheck" [puppet] - https://gerrit.wikimedia.org/r/385999 (https://phabricator.wikimedia.org/T178825) (owner: Herron) [19:31:43] (CR) jerkins-bot: [V: -1] WIP: add puppet package version paramater to puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/385999 (https://phabricator.wikimedia.org/T178825) (owner: Herron) [19:31:44] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [19:33:58] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3704449 (LilyOfTheWest) Open>Resolved [19:34:45] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [19:34:45] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [19:34:45] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [19:34:45] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [19:34:54] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [19:35:54] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [19:35:54] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [19:35:55] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [19:36:54] 
RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [19:36:54] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [19:36:55] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [19:37:05] !log bsitzmann@tin Finished deploy [mobileapps/deploy@a654ce1]: Update mobileapps to 3628105 (T178828) (duration: 06m 59s) [19:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:13] T178828: Missing pronunciation in MCS test response - https://phabricator.wikimedia.org/T178828 [19:40:50] (CR) Chad: [V: 2 C: 2] Gerrit: Replace certificates with tokens for its-phabricator [labs/private] - https://gerrit.wikimedia.org/r/384902 (https://phabricator.wikimedia.org/T178385) (owner: Paladox) [19:41:05] thanks :). that will break the bot. [19:41:18] i can backport to stable-2.13 if you want no_justification? [19:41:46] We should upgrade to 2.14.x anyway :) [19:41:55] oh :) :) [19:42:31] Although, it's only stable-2.13 for an extension, not core gerrit [19:42:41] Could just do that easy enough [19:44:22] lots of merge conflicts [19:44:39] due to my change a few months ago that switches one conduit call to the new one :) [19:45:12] https://gerrit.googlesource.com/plugins/its-phabricator/+/a86babaae85d5c698beb9e520bbc03652b93ca24 [19:47:37] I wonder if we could compile master against stable-2.13 of core :p [19:47:38] hehee [19:48:02] heh [19:48:16] give it a try though remember it's bazel not buck anymore :) [19:49:32] no_justification you're going to have to compile against 2.14, master removed support for velocity so its-base is currently broken, i've been waiting for someone to merge my fixes including support for soy. 
[19:50:19] * no_justification is reading changelogs for 2.14 [19:50:31] We probably should bite the bullet and go to 2.14.5.1 [19:51:08] yeh, it should work, worked in my testing but as always i never seem to manage to reproduce any bugs prod finds :). [19:51:13] Operations, HHVM, User-Elukey: Provide a forward port of ICU 52 for stretch / Investigate best ICU update strategy - https://phabricator.wikimedia.org/T177498#3704506 (faidon) >>! In T177498#3672036, @MoritzMuehlenhoff wrote: > I investigated the upgrade procedure for "provide icu57 in jessie and mig... [19:51:25] zuul needs updating too [19:51:31] fixes support for 2.14+ [19:51:43] We should probably update zuul first then ;-) [19:52:05] but we can backport the fix for 2.14 as i've been running the fix :) [20:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171023T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. 
[20:02:26] no_justification this is the zuul change that needs backporting https://github.com/openstack-infra/zuul/commit/c7370284742f2ed5c136d0e5fcd08ca2d23c8fe1 :) [20:14:41] !log arlolra@tin Started deploy [parsoid/deploy@0759bd1]: Updating Parsoid to 1cefef12 [20:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:56] !log arlolra@tin Finished deploy [parsoid/deploy@0759bd1]: Updating Parsoid to 1cefef12 (duration: 11m 14s) [20:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:16] Operations, Services (watching): New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3704693 (Eevans) [20:27:23] Operations, Ops-Access-Requests, Analytics-Kanban, Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802#3704705 (Tbayer) Open>Resolved Thanks all! [20:27:31] Operations, Services (watching): New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3704708 (Eevans) p:Triage>Normal [20:28:05] Operations, Discovery, Discovery-Analysis, Discovery-Search (Current work): Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967#3704709 (mpopov) p:Normal>Lowest Ish? Until this is done, we're limited to using Ubuntu for the VMs that host our d... 
[20:33:59] !log Updated Parsoid to 1cefef12 (T178217) [20:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:06] T178217: DSR values for linter issues in extension content with wikitext content model are incorrect - https://phabricator.wikimedia.org/T178217 [20:34:19] (PS1) Chad: Remove mediawiki::users::mwdeploy_pub_key, unused [puppet] - https://gerrit.wikimedia.org/r/386065 (https://phabricator.wikimedia.org/T145495) [20:40:34] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 39 probes of 295 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [20:45:34] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 7 probes of 295 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [20:58:57] (PS1) Phedenskog: Remove zeroes for non existing values [puppet] - https://gerrit.wikimedia.org/r/386071 (https://phabricator.wikimedia.org/T178479) [21:00:04] dapatrick, bawolff, and Reedy: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171023T2100). Please do the needful. [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:00:28] (CR) Hashar: "The apache::mod::* dummy classes were probably to avoid the duplicate definitions?" [puppet] - https://gerrit.wikimedia.org/r/382343 (owner: Dzahn) [21:01:21] (CR) Hashar: "Sorry I replied too fast:" [puppet] - https://gerrit.wikimedia.org/r/382343 (owner: Dzahn) [21:03:55] (PS2) Krinkle: webperf: Remove zeroes for non existing navtiming values [puppet] - https://gerrit.wikimedia.org/r/386071 (https://phabricator.wikimedia.org/T178479) (owner: Phedenskog) [21:04:06] (CR) Krinkle: [C: 1] "LGTM. 
Ready for deployment :)" [puppet] - https://gerrit.wikimedia.org/r/386071 (https://phabricator.wikimedia.org/T178479) (owner: Phedenskog) [21:17:26] (PS1) Hashar: beta: set cache::fe_transient_gb: 0 [puppet] - https://gerrit.wikimedia.org/r/386077 [21:18:06] (CR) Paladox: [C: 1] beta: set cache::fe_transient_gb: 0 [puppet] - https://gerrit.wikimedia.org/r/386077 (owner: Hashar) [21:27:01] (CR) Hashar: "Applied on the beta cluster puppet master. That caused puppet to fail at least on deployment-cache-text04.deployment-prep.eqiad.wmflabs" [puppet] - https://gerrit.wikimedia.org/r/386077 (owner: Hashar) [21:32:53] (PS2) Hashar: beta: set cache::fe_transient_gb / be_transient_gb to 0 [puppet] - https://gerrit.wikimedia.org/r/386077 [21:39:16] (CR) Dzahn: [C: 2] webperf: Remove zeroes for non existing navtiming values [puppet] - https://gerrit.wikimedia.org/r/386071 (https://phabricator.wikimedia.org/T178479) (owner: Phedenskog) [21:40:05] !log increasing MTU on Zayo transit interfaces [21:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:49] (CR) Dzahn: "deployed on hafnium.eqiad.wmnet" [puppet] - https://gerrit.wikimedia.org/r/386071 (https://phabricator.wikimedia.org/T178479) (owner: Phedenskog) [21:41:45] (CR) Dzahn: "any context?" [puppet] - https://gerrit.wikimedia.org/r/386077 (owner: Hashar) [21:43:57] (PS2) Dzahn: beta: migrate autoupdater to a profile [puppet] - https://gerrit.wikimedia.org/r/385461 (owner: Hashar) [21:44:36] (CR) Dzahn: [C: 2] beta: migrate autoupdater to a profile [puppet] - https://gerrit.wikimedia.org/r/385461 (owner: Hashar) [21:46:20] (CR) Hashar: [C: -1] "That is a transient change. Puppet is/was broken on the instance." [puppet] - https://gerrit.wikimedia.org/r/386077 (owner: Hashar) [21:47:35] (CR) Dzahn: "the only hint i see it as "fe" and "be".. could be swift?" 
[puppet] - https://gerrit.wikimedia.org/r/386077 (owner: Hashar) [21:48:34] (CR) Dzahn: "nevermind, it's varnish. module cacheproxy" [puppet] - https://gerrit.wikimedia.org/r/386077 (owner: Hashar) [21:50:22] (CR) Dzahn: "it sets the number of Gigabytes allowed to be used for transient storage it seems" [puppet] - https://gerrit.wikimedia.org/r/386077 (owner: Hashar) [21:51:54] (CR) Dzahn: "so if setting to 0 breaks it.. instead of properly disabling it, maybe you should set it to a really high but real value" [puppet] - https://gerrit.wikimedia.org/r/386077 (owner: Hashar) [21:52:35] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 28 probes of 295 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [21:55:31] (PS10) Dzahn: contint: profile, role, and packages for R language [puppet] - https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: Hashar) [21:56:05] (CR) Dzahn: [C: 2] contint: profile, role, and packages for R language [puppet] - https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: Hashar) [21:57:35] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 0 probes of 295 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [21:58:09] (CR) Dzahn: "https://bugs.launchpad.net/ubuntu/+source/unattended-upgrades/+bug/1073138" [puppet] - https://gerrit.wikimedia.org/r/315079 (owner: Hashar) [22:06:35] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [22:08:31] Operations, Beta-Cluster-Infrastructure: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704963 (greg) Adding #operations to ask for assistance with diagnosing/resolving this. 
[22:12:17] (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/383916 (https://phabricator.wikimedia.org/T178096) (owner: Bearloga) [22:13:45] Operations, Beta-Cluster-Infrastructure: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (hashar) Since varnish is down, Puppet fails to trigger the letsencrypt certificate renewal since that attempts to reach http://beta.wmflabs.org/ :/ [22:14:06] ^ I looked and...it's transient [22:16:35] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:25:52] (PS1) Volans: Use the hiera() value in the message [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/386110 [22:26:09] (CR) jerkins-bot: [V: -1] Use the hiera() value in the message [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/386110 (owner: Volans) [22:29:01] (PS3) Dzahn: contint: use profile::labs::lvm::srv instead of role [puppet] - https://gerrit.wikimedia.org/r/385476 (owner: Hashar) [22:30:38] Operations, Beta-Cluster-Infrastructure, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704999 (hashar) Puppet fails with: Notice: tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server. /usr/local/sbin/acme_tiny.py --account-ke... 
[22:31:05] (PS2) Volans: Use the hiera() value in the message [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/386110 [22:31:20] (CR) jerkins-bot: [V: -1] Use the hiera() value in the message [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/386110 (owner: Volans) [22:32:57] (PS1) BryanDavis: toolforge: Update shinken checks [puppet] - https://gerrit.wikimedia.org/r/386112 [22:33:08] Operations, Traffic, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3705007 (dbarratt) >>! In T174960#3703499, @ema wrote: > @dbarratt can you please provide some examples, including request/response headers and body, the beha... [22:34:03] (PS3) Volans: Use the hiera() value in the message [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/386110 [22:37:02] Operations, Beta-Cluster-Infrastructure, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705021 (hashar) @Krenair / @BBlack are looking into it. They both know about Letsencrypt/Varnish. [22:38:03] Operations, Beta-Cluster-Infrastructure, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705026 (Paladox) If it uses ferm, it will not take notice of security group settings. I found that out with jenkins-slave-01 and other instances. try this sudo iptables -A INPUT -p tc... [22:44:56] Operations, Beta-Cluster-Infrastructure, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (Dzahn) > sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT please don't. it will just conflict / be reverted by ferm or ferm service will be stopped leading to more manual thi... 
[22:55:02] Operations, Traffic, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578041 (EBernhardson) I suppose i can add that the reason it has to be GET, rather than POST, is because the kibana application that receives these requests... [22:58:29] Operations, Traffic, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578041 (BBlack) I doubt Varnish in default config does anything about GET request bodies, they're a fairly non-standard thing. I think our current versions... [22:58:44] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0 [22:59:23] Operations, Traffic, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3705072 (EBernhardson) Actually on closer review, kibana is allowing some POST requests, but not your _search endpoint: https://github.com/elastic/kibana/blo... [22:59:47] !log cp4024: varnish-backend-restart for lag [22:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:56] (I swear I already said that, but now I don't see it) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171023T2300). [23:00:06] twkozlowski: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:03:59] Operations, Puppet, Beta-Cluster-Infrastructure, Technical-Debt, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#3705080 (hashar) [23:04:01] Puppet, Beta-Cluster-Infrastructure, Goal, Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#3705076 (hashar) Open>Resolved a:yuvipanda It has been nicely cleaned up by @yuvipanda There are still some classes left but not much we can imple... [23:05:39] (PS1) Mobrovac: [WIP] Improve the checking procedure and emit better messages [software/service-checker] - https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) [23:06:32] Operations, Beta-Cluster-Infrastructure, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705082 (Krenair) Between the three of us it's been brought back up. [23:06:49] (CR) jerkins-bot: [V: -1] [WIP] Improve the checking procedure and emit better messages [software/service-checker] - https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) (owner: Mobrovac) [23:07:12] (CR) Chad: [C: 2] Working Class Movement Library (Salford) throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/385377 (https://phabricator.wikimedia.org/T178689) (owner: Odder) [23:08:21] (Merged) jenkins-bot: Working Class Movement Library (Salford) throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/385377 (https://phabricator.wikimedia.org/T178689) (owner: Odder) [23:08:31] (CR) jenkins-bot: Working Class Movement Library (Salford) throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/385377 (https://phabricator.wikimedia.org/T178689) (owner: Odder) [23:09:37] !log demon@tin Synchronized wmf-config/throttle.php: Working Class Movement Library (Salford) throttle rule (duration: 00m 46s) [23:09:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:40] Operations, Beta-Cluster-Infrastructure, Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705126 (Krenair) HTTPS should now work again too. Need to commit hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml on the puppetmaster: ```profile::cache::base::varnish_v...