[00:00:15] you know we are not actually on wmf.11
[00:00:20] yes
[00:00:43] ok, just checking
[00:01:10] we just discussed this in the other channel
[00:01:46] wanted to get the change out before the cluster went back to using wmf.11
[00:01:56] Oh, sorry for forking.
[00:02:36] guess I should push out mine too
[00:04:25] !log tstarling@mira Synchronized php-1.27.0-wmf.11/includes: (no message) (duration: 01m 31s)
[00:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:27:24] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: puppet fail
[00:47:18] (PS2) Tim Landscheidt: Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934
[00:55:25] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:20:44] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:46:55] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[02:25:25] PROBLEM - HHVM rendering on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:26:24] PROBLEM - Apache HTTP on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:26:54] PROBLEM - puppet last run on mw1072 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:04] PROBLEM - dhclient process on mw1072 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:30] operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1985525 (BBlack) Well, we have 3 different stages of rate-increase in the insert graph, so it could well b...
[02:28:13] (PS1) Yuvipanda: toollabs: Add new role for flannel etcd [puppet] - https://gerrit.wikimedia.org/r/267628
[02:28:33] RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[02:28:45] RECOVERY - dhclient process on mw1072 is OK: PROCS OK: 0 processes with command name dhclient
[02:29:09] (CR) Yuvipanda: [C: 2] toollabs: Add new role for flannel etcd [puppet] - https://gerrit.wikimedia.org/r/267628 (owner: Yuvipanda)
[02:29:16] (CR) Yuvipanda: [V: 2] toollabs: Add new role for flannel etcd [puppet] - https://gerrit.wikimedia.org/r/267628 (owner: Yuvipanda)
[02:29:57] operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1985526 (BBlack) Continuing with some stuff I was saying in IRC the other day. At the "new normal", we're...
[02:34:19] operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1985528 (Legoktm) >>! In T124418#1985525, @BBlack wrote: > it's not like we gained a 5x increase in human...
[02:37:33] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:39:16] operations, Wikimedia-Mailing-lists: Need listadmin password reset for Translators-l mailing list - https://phabricator.wikimedia.org/T123163#1985530 (Jalexander) Open>Resolved Reset and password sent to both list admins.
[03:03:45] RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[03:03:53] PROBLEM - Disk space on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:03:54] PROBLEM - DPKG on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:04] PROBLEM - puppet last run on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:04] PROBLEM - RAID on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:34] PROBLEM - dhclient process on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:44] PROBLEM - salt-minion processes on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:53] PROBLEM - Check size of conntrack table on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:54] PROBLEM - configured eth on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:10:34] operations, Analytics-Kanban, hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1985547 (GWicke) In the short term, a few tweaks could also help to reduce latency: - Reduce replication for the per-article keyspace from 3 to 2. - Read with CL_ONE by default. - Pos...
[03:50:14] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[03:53:43] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[04:11:14] PROBLEM - NTP on cygnus is CRITICAL: NTP CRITICAL: No response from NTP server
[04:21:40] (CR) Santhosh: [C: -1] "Please hold this till 1.27.0-wmf.12 is in production" [mediawiki-config] - https://gerrit.wikimedia.org/r/267236 (owner: KartikMistry)
[04:26:49] >>> UNRECOVERABLE FATAL ERROR <<<
[04:26:49] Maximum execution time of 10 seconds exceeded
[04:26:49] /srv/phab/libphutil/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:131
[04:26:50] Danny_B: SyntaxError: Unexpected identifier
[04:27:13] when trying to display https://phabricator.wikimedia.org/maniphest/report/project/
[04:53:04] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[04:54:53] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:07:14] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[05:08:54] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:15:54] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[05:17:34] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:22:44] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[05:24:33] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:47:23] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[05:56:03] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[06:27:14] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0]
[06:27:23] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0]
[06:29:13] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[06:30:13] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:44] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:53] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:05] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:13] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:43] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:03] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:04] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[06:46:35] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[06:51:45] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[06:55:23] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:44] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:56:45] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:57:03] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[06:57:03] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:58:43] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:54] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[07:14:04] PROBLEM - puppet last run on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:44] RECOVERY - puppet last run on mw1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:16:14] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[07:32:03] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[07:37:14] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[07:44:05] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[07:47:05] PROBLEM - puppet last run on mw1057 is CRITICAL: CRITICAL: Puppet has 18 failures
[07:49:23] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[07:52:45] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[07:58:03] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[08:03:24] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[08:08:35] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[08:15:04] PROBLEM - Check size of conntrack table on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:16:53] RECOVERY - Check size of conntrack table on mw1057 is OK: OK: nf_conntrack is 3 % full
[08:27:53] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[08:38:34] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[08:45:10] (PS1) Muehlenhoff: Correct target distribution [debs/openssl] - https://gerrit.wikimedia.org/r/267645
[08:45:15] PROBLEM - Check size of conntrack table on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:45:39] (CR) Muehlenhoff: [C: 2 V: 2] Correct target distribution [debs/openssl] - https://gerrit.wikimedia.org/r/267645 (owner: Muehlenhoff)
[08:47:23] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[08:48:43] RECOVERY - Check size of conntrack table on mw1057 is OK: OK: nf_conntrack is 4 % full
[08:52:44] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[08:57:52] (PS11) ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750)
[08:58:03] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[08:59:17] (CR) jenkins-bot: [V: -1] dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) (owner: ArielGlenn)
[09:03:43] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[09:06:16] (PS1) Addshore: Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/267646
[09:07:26] (PS2) Addshore: Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/267646
[09:11:28] (PS1) Lokal Profil: Setting l10n-bot submissions to same as in https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki,access [dumps/dcat] (refs/meta/config) - https://gerrit.wikimedia.org/r/267647
[09:14:05] (PS2) Addshore: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/264734
[09:14:25] (CR) Lokal Profil: [C: 2 V: 2] Localisation updates from https://translatewiki.net. [dumps/dcat] - https://gerrit.wikimedia.org/r/266482 (owner: L10n-bot)
[09:15:47] (PS2) Addshore: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/264735
[09:17:21] (PS3) Addshore: Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/267646
[09:18:02] (PS3) Addshore: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/264735
[09:18:04] (PS3) Addshore: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/264734
[09:30:20] operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1985715 (ArielGlenn) @robh: I've been requested to provide a list of all hardware needs in both codfw and eqiad, both snap...
[09:31:15] (PS2) Lokal Profil: Localisation updates from https://translatewiki.net. [dumps/dcat] - https://gerrit.wikimedia.org/r/266681 (owner: L10n-bot)
[09:31:38] operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1985716 (ArielGlenn)
[09:32:14] (CR) Lokal Profil: [C: 2 V: 2] "Manual rebase was needed" [dumps/dcat] - https://gerrit.wikimedia.org/r/266681 (owner: L10n-bot)
[09:33:39] operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1985720 (ArielGlenn) When figuring out memory and cpu needs for the dataset servers, we should keep in mind: the dataset host has the following going on a...
[09:34:05] (PS2) Lokal Profil: Localisation updates from https://translatewiki.net. [dumps/dcat] - https://gerrit.wikimedia.org/r/267642 (owner: L10n-bot)
[09:34:21] (CR) Lokal Profil: [C: 2 V: 2] "rebased" [dumps/dcat] - https://gerrit.wikimedia.org/r/267642 (owner: L10n-bot)
[09:36:25] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[09:36:30] operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1985723 (ArielGlenn) @paravoid, I'm adding you to this because you had volunteered your help and know-how earlier in making our dumps download setup suffic...
[09:37:03] operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1985725 (ArielGlenn)
[09:37:05] operations, Dumps-Generation, hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1985724 (ArielGlenn)
[09:38:05] operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1792783 (ArielGlenn)
[09:38:33] (CR) Merlijn van Deen: [C: 1] Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[09:39:14] operations, Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1985731 (ArielGlenn) All hardware refresh tickets for dumps are now at T118154.
[09:41:27] (PS12) ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750)
[09:41:47] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[09:41:56] (CR) Merlijn van Deen: [C: -1] ""Allows for more flexibility this way"" [puppet] - https://gerrit.wikimedia.org/r/267402 (owner: Yuvipanda)
[09:42:12] operations: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1985735 (elukey) Updates from me (or better excuses to justify why this work hasn't been completed yet :) I worked with Joe to double check how nutcracker handles redis and memcached nodes going away from its...
[09:44:34] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:45:24] PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:46:34] PROBLEM - SSH on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:46:35] PROBLEM - DPKG on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:46:45] PROBLEM - HHVM processes on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:46:45] PROBLEM - Check size of conntrack table on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:46:54] PROBLEM - Disk space on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:47:04] PROBLEM - RAID on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:47:25] PROBLEM - configured eth on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:04] PROBLEM - nutcracker process on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:04] PROBLEM - dhclient process on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:04] PROBLEM - nutcracker port on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:04] PROBLEM - salt-minion processes on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:56] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[09:53:51] operations, Analytics-Kanban, hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1985752 (JAllemandou) @Gwicke: - The replication factor for the per-article table had been changed to 2 a few month ago. I think restbase config management for cassandra has changed i...
[10:01:45] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[10:03:34] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[10:05:14] (PS13) ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750)
[10:09:04] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[10:09:54] RECOVERY - nutcracker port on mw1057 is OK: TCP OK - 0.000 second response time on port 11212
[10:09:55] RECOVERY - dhclient process on mw1057 is OK: PROCS OK: 0 processes with command name dhclient
[10:09:55] RECOVERY - nutcracker process on mw1057 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[10:09:55] RECOVERY - salt-minion processes on mw1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:10:24] RECOVERY - SSH on mw1057 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[10:10:24] RECOVERY - DPKG on mw1057 is OK: All packages OK
[10:10:33] RECOVERY - HHVM processes on mw1057 is OK: PROCS OK: 6 processes with command name hhvm
[10:10:33] RECOVERY - Check size of conntrack table on mw1057 is OK: OK: nf_conntrack is 0 % full
[10:10:34] RECOVERY - Disk space on mw1057 is OK: DISK OK
[10:10:54] RECOVERY - RAID on mw1057 is OK: OK: no RAID installed
[10:11:14] RECOVERY - configured eth on mw1057 is OK: OK - interfaces up
[10:12:15] (CR) ArielGlenn: [C: 2] dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) (owner: ArielGlenn)
[10:14:34] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[10:18:38] (PS1) ArielGlenn: enable dumps from cron on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/267653
[10:20:03] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[10:27:35] !log partitioning revision and logging for db2037 and db2044 (s4)
[10:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:32:27] !log reboot ms-be1010, xfs
[10:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:35:54] operations, Patch-For-Review, Swift: swift capacity planning - https://phabricator.wikimedia.org/T1268#1985807 (fgiunchedi) also related to capacity planning for swift, thumbnails vs originals size metrics in eqiad: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1448470754.236&target=...
[10:36:04] (PS2) ArielGlenn: enable dumps from cron on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/267653
[10:36:44] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[10:39:44] RECOVERY - Disk space on ms-be1010 is OK: DISK OK
[10:39:44] RECOVERY - very high load average likely xfs on ms-be1010 is OK: OK - load average: 9.84, 2.28, 0.75
[10:39:44] RECOVERY - Check size of conntrack table on ms-be1010 is OK: OK: nf_conntrack is 4 % full
[10:39:44] RECOVERY - swift-container-auditor on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:39:44] RECOVERY - swift-object-server on ms-be1010 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[10:39:44] RECOVERY - swift-container-updater on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[10:39:44] RECOVERY - swift-container-replicator on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[10:39:45] RECOVERY - dhclient process on ms-be1010 is OK: PROCS OK: 0 processes with command name dhclient
[10:39:45] RECOVERY - DPKG on ms-be1010 is OK: All packages OK
[10:39:46] RECOVERY - configured eth on ms-be1010 is OK: OK - interfaces up
[10:39:46] RECOVERY - swift-account-server on ms-be1010 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[10:39:47] RECOVERY - swift-account-reaper on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[10:40:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 896
[10:42:14] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[10:45:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 1105615 Threads: 2 Questions: 6726950 Slow queries: 7504 Opens: 2795 Flush tables: 2 Open tables: 430 Queries per second avg: 6.084 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:46:31] (PS3) ArielGlenn: enable dumps from cron on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/267653
[10:48:42] (CR) ArielGlenn: [C: 2] enable dumps from cron on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/267653 (owner: ArielGlenn)
[11:02:33] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[11:07:55] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[11:12:24] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[11:12:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[11:12:53] PROBLEM - swift codfw-prod object availability on graphite1001 is CRITICAL: CRITICAL: 25.00% of data under the critical threshold [90.0]
[11:15:13] (PS1) ArielGlenn: snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658
[11:16:40] (CR) jenkins-bot: [V: -1] snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658 (owner: ArielGlenn)
[11:18:54] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[11:19:11] (PS2) ArielGlenn: snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658
[11:19:37] !log repool restbase1007
[11:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:20:34] (CR) jenkins-bot: [V: -1] snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658 (owner: ArielGlenn)
[11:20:41] Blocked-on-Operations, operations: Re-pool restbase1007 - https://phabricator.wikimedia.org/T124565#1985849 (fgiunchedi) Open>Resolved a:fgiunchedi indeed, I've repooled restbase1007
[11:20:59] it is the metrics api what it is failing
[11:24:19] operations, ops-codfw: ms-be2015 doesn't come up after reboot - https://phabricator.wikimedia.org/T125383#1985864 (MoritzMuehlenhoff) NEW a:Papaul
[11:24:24] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[11:25:54] ACKNOWLEDGEMENT - Host ms-be2015 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff hardware problem, see T125383
[11:26:14] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[11:27:10] jynus: mh?
[11:27:23] it works now
[11:27:32] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors
[11:30:31] have we multiplied by 4 the number of requests since 19 Jan?
[11:30:44] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:30:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:31:23] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:34:28] !log uploaded openssl 1.0.2f for jessie-wikimedia to carbon
[11:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:35:34] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[11:38:44] (PS1) Jcrespo: Delete eqiad masters from codfw configuration and add db weights [mediawiki-config] - https://gerrit.wikimedia.org/r/267659
[11:39:03] !log uploaded openjdk-8 8u72-b15-1~bpo8+1 for jessie-wikimedia to carbon
[11:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:40:43] I do not know who to add as reviewer of that last commit, either someone from infrastructure or from performance, I suppose
[11:46:56] apergos, I have some questions regarding mysql && dumps for codfw, I do not know if creating a ticket for it
[11:47:03] (PS1) John Vandenberg: Remove multiple spaces before operator [puppet/kafkatee] - https://gerrit.wikimedia.org/r/267660
[11:49:15] I am, I suppose having a ticket doesn't hurt, even if it was the case that it can be trivially answered
[11:49:47] jynus: let's hear it
[11:53:03] operations, Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#1985931 (ArielGlenn) I've updated my docker salt testbed to work with latest docker api and latest wmf packages: https://github.com/apergos/docker-saltcluster I'll be...
[11:53:19] operations, DBA, Dumps-Generation, codfw-rollout, codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985937 (jcrespo) NEW
[11:53:36] ^apergos
[11:54:19] basically, give me your thoughts, issues, and both current state and desired state
[11:55:35] operations, DBA, Dumps-Generation, codfw-rollout, codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985953 (ArielGlenn) We could continue to generate them out of eqiad and serve them as downloadable from c...
[11:55:58] jynus: hope that's enough to start with
[11:58:15] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:59:35] !log rolling reboot of ms-be1016 to ms-be1021 for kernel update
[11:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:01:50] the key word is "we could"
[12:02:24] yes
[12:02:25] I am not asking what we have to do now, but what we should do
[12:02:47] we could = it would be fine with me
[12:02:53] sorry, not "it is possible"
[12:03:07] yes, but there is not machines yet?
[12:03:26] in eqiad there are machines. there is not yet a dataset server in codfw
[12:03:34] ok, that is the key thing
[12:03:35] that's under discussion in the dumps hw ticket I mentioned
[12:03:45] (CR) Aude: "it would be best to enable this first on test.wikidata + test.wikipedia + test2.wikipedia, and then wikidata." [mediawiki-config] - https://gerrit.wikimedia.org/r/267405 (https://phabricator.wikimedia.org/T124931) (owner: Llyrian)
[12:05:29] one last question, what is the expected timeline for having working dumps active on codfw? Is it before the end of the quarter?
[12:05:34] under discussion is whether we will have snapshot hosts in codfw at all, or rather assume that only one dc supports generation
[12:05:44] we don't have a timeline because of the above
[12:05:52] that is ok
[12:06:02] assume the answer was positive
[12:06:10] (which we do not know yet)
[12:06:20] if we are expected to be able to run them out of both dcs then
[12:06:30] when does the quarter end?
[12:06:40] 1 May
[12:06:49] it will depend on budget essentially, when the systems would be able to be ordered
[12:06:59] and that is something ma rk will know
[12:07:01] understood
[12:07:33] the answer might be 'no we don't' in which case your task would already be done
[12:10:25] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[12:10:52] I did not need a definitive answer, I just needed short term ideas for immediate codfw configuration so failover is possible
[12:11:04] Configuration can be changed at any time
[12:11:33] (PS3) ArielGlenn: snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658
[12:12:18] (PS12) John Vandenberg: tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: Hashar)
[12:12:40] ok great
[12:12:56] (CR) jenkins-bot: [V: -1] snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658 (owner: ArielGlenn)
[12:15:00] <_joe_> !log backing up tin homes before reimaging
[12:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:15:41] (PS4) ArielGlenn: snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658
[12:15:54] PROBLEM - SSH on cygnus is CRITICAL: Server answer
[12:15:57] (CR) Hoo man: [C: 2] Setting l10n-bot submissions to same as in https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki,access [dumps/dcat] (refs/meta/config) - https://gerrit.wikimedia.org/r/267647 (owner: Lokal Profil)
[12:16:15] (CR) Hoo man: [V: 2] Setting l10n-bot submissions to same as in https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki,access [dumps/dcat] (refs/meta/config) - https://gerrit.wikimedia.org/r/267647 (owner: Lokal Profil)
[12:16:30] (CR) John Vandenberg: "those submodules patches have been merged; just trying to keep this moving. I037ebefdeb05a890 and a similar changeset for the other submo" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: Hashar)
[12:17:55] (CR) ArielGlenn: [C: 2] snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658 (owner: ArielGlenn)
[12:18:05] operations, DBA, MediaWiki-Special-pages, Wikidata, Performance: Batch updates create slave lag on s3 over WAN - https://phabricator.wikimedia.org/T122429#1985971 (JanZerebecki)
[12:19:02] (PS1) DCausse: Put more like query load back on eqiad for load testing [mediawiki-config] - https://gerrit.wikimedia.org/r/267662
[12:19:48] operations, DBA, Dumps-Generation, codfw-rollout, codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985972 (jcrespo) Open>Resolved a:jcrespo From my conversation with Ariel, this seems to have som...
[12:19:50] 6operations, 10DBA, 5Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#1985975 (10jcrespo) [12:25:52] (03PS2) 10Jcrespo: Delete eqiad masters from codfw configuration and add db weights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 [12:26:26] 6operations, 10DBA, 10Dumps-Generation, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985982 (10ArielGlenn) I agree, dumps are not planned to be part of the upcoming test switchover in any case. [12:29:02] 6operations, 10DBA, 10Dumps-Generation, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985986 (10jcrespo) I've updated https://gerrit.wikimedia.org/r/#/c/267659/ to reflect this decision. 
[12:32:32] (03CR) 10Jcrespo: "Aaron, I want you to be aware of this important mediawiki-databases change, as it will impact both performance, availability and architect" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [12:38:30] (03PS3) 10Jcrespo: Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 [12:39:44] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1985995 (10jcrespo) Related: https://gerrit.wikimedia.org/r/267659 [12:40:01] <_joe_> !log depooling cp3042 from esams uploads [12:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:25] 6operations, 10DBA, 5Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#1985997 (10jcrespo) [12:40:26] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1985996 (10jcrespo) [12:41:13] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:42:12] (03PS1) 10Giuseppe Lavagetto: ipsec: remove cp3042 [puppet] - 10https://gerrit.wikimedia.org/r/267664 (https://phabricator.wikimedia.org/T125265) [12:48:38] (03Abandoned) 10Giuseppe Lavagetto: role::deployment::salt_masters: correct a hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/266216 (owner: 10Giuseppe Lavagetto) [12:49:05] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [12:53:09] 6operations, 10Dumps-Generation: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1986010 (10ArielGlenn) These jobs are now all 
enabled. The first attempt to run will be Feb 2 early in the morning. I'll be checking to make sure everything started properly. One catch... [12:54:34] PROBLEM - SSH on cygnus is CRITICAL: Server answer [12:55:03] (03PS1) 10Muehlenhoff: Add initial entries for auth* servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267665 [12:59:30] (03PS2) 10Muehlenhoff: Add initial entries for auth* servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267665 [12:59:40] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add initial entries for auth* servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267665 (owner: 10Muehlenhoff) [13:03:40] (03PS5) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668 [13:13:33] PROBLEM - RAID on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:44] PROBLEM - configured eth on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:06] PROBLEM - nutcracker port on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:13] PROBLEM - dhclient process on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:13] PROBLEM - salt-minion processes on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:13] PROBLEM - nutcracker process on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:36] PROBLEM - SSH on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:44] PROBLEM - HHVM processes on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:44] PROBLEM - DPKG on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:44] PROBLEM - Check size of conntrack table on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:14:53] PROBLEM - Disk space on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:26:31] 6operations, 6Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#1986042 (10hashar) We will need to tweak CI configuration and the Zuul merger repos origin. They are pointing to ytterbium. [13:26:36] (03PS6) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668 [13:33:12] !log rolling reboot of xenon/cerium/praseodymium for kernel update (and updating to new openjdk-8) [13:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:18] (03PS1) 10ArielGlenn: new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [13:36:01] 7Puppet, 6operations, 10Salt, 5Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#1986056 (10ArielGlenn) Tested with a print in place of the key acceptance line. 
[13:36:29] (03CR) 10jenkins-bot: [V: 04-1] new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: 10ArielGlenn) [13:38:22] (03PS2) 10ArielGlenn: new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [13:39:23] (03CR) 10jenkins-bot: [V: 04-1] new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: 10ArielGlenn) [13:40:03] RECOVERY - RAID on mw1057 is OK: OK: no RAID installed [13:40:13] RECOVERY - Check size of conntrack table on mw1057 is OK: OK: nf_conntrack is 0 % full [13:40:19] (03PS3) 10ArielGlenn: new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [13:40:22] RECOVERY - nutcracker port on mw1057 is OK: TCP OK - 0.000 second response time on port 11212 [13:40:23] RECOVERY - SSH on mw1057 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [13:40:23] RECOVERY - dhclient process on mw1057 is OK: PROCS OK: 0 processes with command name dhclient [13:40:23] RECOVERY - salt-minion processes on mw1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:40:23] RECOVERY - nutcracker process on mw1057 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:40:43] RECOVERY - DPKG on mw1057 is OK: All packages OK [13:40:53] RECOVERY - Disk space on mw1057 is OK: DISK OK [13:41:04] RECOVERY - HHVM processes on mw1057 is OK: PROCS OK: 6 processes with command name hhvm [13:41:13] RECOVERY - configured eth on mw1057 is OK: OK - interfaces up [13:48:37] (03PS4) 10ArielGlenn: rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571) 
[13:49:07] (03PS5) 10ArielGlenn: rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571) [13:50:16] (03PS6) 10ArielGlenn: rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571) [13:51:16] (03CR) 10ArielGlenn: [C: 032] rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571) (owner: 10ArielGlenn) [14:00:15] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986077 (10BBlack) Another data point from the weekend: In one sample I took Saturday morning, when I sample... [14:03:00] 6operations, 7Monitoring: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#1986090 (10elukey) Adding a big +1 At the moment the Kafka cluster could theoretically serve corrupted data due to disk failure, delegating the responsibility to react to the consumers. If the bad disk i... [14:04:57] !log set ms-be1019 swift weight to 4000 [14:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:02] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:18:42] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:24:13] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:26:02] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:33:23] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:33:53] (03CR) 10Addshore: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore) [14:33:56] !log labstore1002 cfg scheduling [14:33:59] (03CR) 10Addshore: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore) [14:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:53] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [14:39:13] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:40:54] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:43:06] Is it possible the l10n caches are going through some churn? I have Commons people complaining about a missing message. [14:43:08] Specifically, https://commons.wikimedia.org/w/api.php?action=query&meta=allmessages&ammessages=licenses&amlang=experienced [14:43:32] cf. https://commons.wikimedia.org/wiki/MediaWiki:Licenses/experienced [14:45:24] 6operations, 10Analytics-Cluster: Complete installation of analytics1017.eqiad.wmnet - https://phabricator.wikimedia.org/T125055#1986160 (10elukey) a:3Ottomata [14:45:41] 6operations, 10Analytics-Cluster: Complete installation of analytics1017.eqiad.wmnet - https://phabricator.wikimedia.org/T125055#1973447 (10elukey) Assigning to Andrew to have a reminder of this task in our queue. 
[14:49:06] I guess that message is supposed to be empty, and fallback to the en version, but instead we have nothing showing up on Special:Upload [14:51:11] (03CR) 10Ema: [C: 031] ipsec: remove cp3042 [puppet] - 10https://gerrit.wikimedia.org/r/267664 (https://phabricator.wikimedia.org/T125265) (owner: 10Giuseppe Lavagetto) [14:52:23] 6operations, 10ops-esams, 5Patch-For-Review: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1986176 (10BBlack) The ata link down messages like: ``` [ 6.350863] ata3: SATA link down (SStatus 0 SControl 300) [ 6.675177] ata4: SATA link down (SStatus 0 SControl 300) ``` are nor... [14:53:27] MarkTraceur: A message with the value "-" typically means that the message is disabled and should *not* fall back, because leaving the message completely empty means to fall back. [14:53:33] (at least, I think so) [14:54:58] Hm. [14:55:30] anomie: The message didn't have any value on Commons until today, when odder created a copy of the English message to try and fix the upload page [14:55:38] FYI https://commons.wikimedia.org/w/index.php?title=Special:Upload&uselang=experienced [14:59:24] MarkTraceur: VE had some problems with wrong i18n messages being delivered, or something. i think Krenair knows the details [14:59:44] switching MW versions back and forth a couple times certainly didn't help with this, heh [14:59:47] uh... no, I don't think so [14:59:52] It's not even a JS bug, all of the message handling happens on the backend [14:59:54] we had issues with old JS code being delivered [14:59:57] (i might be confused) [15:00:01] hmm. 
okay [15:00:27] I also thought oh, must be a cache thing, but no, includes/License.php (which was a new find for me) handles everything [15:01:23] RECOVERY - Host cp3042 is UP: PING WARNING - Packet loss = 64%, RTA = 86.19 ms [15:01:23] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK [15:01:24] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 166 ESP OK [15:01:32] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 60 ESP OK [15:01:33] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 166 ESP OK [15:01:43] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [15:01:52] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 60 ESP OK [15:02:02] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 60 ESP OK [15:02:03] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 166 ESP OK [15:02:03] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 166 ESP OK [15:02:03] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 60 ESP OK [15:02:13] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [15:02:13] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK [15:02:33] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 60 ESP OK [15:02:33] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 60 ESP OK [15:02:53] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 60 ESP OK [15:02:53] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [15:02:54] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [15:03:03] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 166 ESP OK [15:05:54] 6operations, 10ops-esams, 5Patch-For-Review: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1986234 (10jcrespo) +1, as we have the exact message on other servers that we rebooted at the time. [15:06:03] PROBLEM - Freshness of OCSP Stapling files on cp3042 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 29100 secs old! 
[15:10:58] !log restarting hhvm on mw1057 [15:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:30] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1986248 (10elukey) Next steps before closing: 1) disk replaced 2) bring the host/service up and running again 3) evaluates the following reverts: - https://gerrit.wikimedia.org/r/2... [15:11:54] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1986249 (10elukey) a:3Cmjohnson [15:12:07] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981110 (10elukey) Temporary assigned to Cmjohnson [15:13:33] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.553 second response time [15:13:38] (03PS1) 10MarcoAurelio: Adding museumvictoria.com.au domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267677 (https://phabricator.wikimedia.org/T125387) [15:13:48] !log cp3042 repooled [15:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:08] 6operations, 10ops-esams, 5Patch-For-Review: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1986271 (10BBlack) 5Open>3Resolved a:3BBlack So to retrace my steps here: 1. ata link down is a red herring 2. the root disks seem to be fine (other than rebuilding mirror in t... 
[15:14:13] RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 70830 bytes in 0.106 second response time [15:21:23] RECOVERY - puppet last run on mw1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:51] (03CR) 10Filippo Giunchedi: [C: 031] Raise file upload limit to 2500 MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ) [15:23:51] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1986304 (10ArielGlenn) A few more comments after discussion with Mark: We thought about splitting up the dumps between dcs but this is expensive because it... [15:24:46] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986305 (10Lydia_Pintscher) Very strange. Wikidata use on templates on talk pages isn't impossible but I'd c... [15:24:54] 6operations, 10Wikimedia-Video, 5Patch-For-Review: 1gb file upload limit is too restrictive for conference presentation videos - https://phabricator.wikimedia.org/T116514#1986306 (10fgiunchedi) for sure @reedy, LGTM, on the swift side the default maximum upload size for a single object is 5GB FYI [15:25:27] (03PS1) 10Jcrespo: Pool db1018; Depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267678 (https://phabricator.wikimedia.org/T125215) [15:27:10] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1986311 (10elukey) @yuvipanda: What are the next steps for this? Do you need more reviewers to get added? 
[15:34:37] (03PS1) 10Jcrespo: Testing db jessie installer problems on db2030 [puppet] - 10https://gerrit.wikimedia.org/r/267681 (https://phabricator.wikimedia.org/T125256) [15:37:39] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1986404 (10Ottomata) I just talked to @bblack, and also looked at requests in the `webrequest_mobile` topic in Kafka. There are still real user requests from cp1060, but most of... [15:42:25] !log restarted pybal on lvs1003 [15:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:30] (03CR) 10Filippo Giunchedi: "FTR this translates to ~7% more space on a single whisper file" [puppet] - 10https://gerrit.wikimedia.org/r/266567 (owner: 10EBernhardson) [15:44:43] RECOVERY - Freshness of OCSP Stapling files on cp3042 is OK: OK [15:45:01] (03PS2) 10Giuseppe Lavagetto: ipsec: remove cp3042 [puppet] - 10https://gerrit.wikimedia.org/r/267664 (https://phabricator.wikimedia.org/T125265) [15:45:12] (03PS1) 10MarcoAurelio: Expanding transwiki import sources for be.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) [15:48:36] !log restarted pybal on lvs1004 (lvs1003 above was a bad log message!) [15:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:51] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986445 (10daniel) @BBlack can you give a few examples of such pages on srwiki? Were these non-existant talk... [15:51:55] OK, so… what's happening this morning? Are we going with wmf.11? [15:53:10] <_joe_> this morning? [15:53:29] <_joe_> uhm I guess I have no time to reimage tin then [15:54:05] _joe_: I don't know. SWAT's in 6 minutes' time. 
[15:54:25] !log krenair@mira Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 01m 52s) [15:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:39] mafk, ^ [15:54:51] :D [15:55:31] but the patch needs to be merged in order to work [15:56:04] indeed [15:56:09] Does that need a review from csteipp or someone? I don't see any non-WMF wiki there. [15:58:01] unclear [15:59:26] I don't think it's been done before [15:59:56] <_joe_> James_F: if it's a normal SWAT, I can just ask the SWATTers to hold for a bit when I acknolwedge tin is down [16:00:09] * James_F nods. [16:00:17] I also have doubts [16:00:29] romaine has been granted local rights fyi [16:00:30] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986463 (10BBlack) @daniel - Sorry I should have linked this earlier, I made a paste at the time: P2547 . N... [16:03:08] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1986468 (10elukey) p:5Triage>3Normal [16:03:31] (03CR) 10MarcoAurelio: "I have doubts here however. I can't see any non-WMF wikis added to the transwiki import sources, so I don't know if this will be approved " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: 10MarcoAurelio) [16:04:30] hmm, looks like no notification for SWAT. [16:04:49] jouncebot: yt? [16:04:56] jouncebot: refresh [16:04:58] I refreshed my knowledge about deployments. [16:05:00] Hello. 
[16:05:02] jouncebot: next [16:05:02] In 4 hour(s) and 54 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160201T2100) [16:05:06] (03CR) 10Jforrester: "This also can never hope to have reasonable histories, as it's not an SUL wiki. I'm minded to say this shouldn't be allowed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: 10MarcoAurelio) [16:05:27] well anyway, I can SWAT, _joe_ did you want to do something to tin pre-SWAT? [16:05:34] !log hhvm restarted on mw1072 [16:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:57] greg-g, thcipriani: So… [16:06:05] Also, does that actually work? IIRC, prod cluster can't communicate with most of the internet. [16:06:11] 6operations, 10Traffic, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1986480 (10BBlack) 3NEW a:3Joe [16:06:12] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.265 second response time [16:06:15] greg-g: What's the plan with wmf.11? [16:06:22] James_F: determining [16:06:40] sorry, I didn't work this weekend, and since I just got online 10 minutes ago, I don't have all my questions answered yet :) [16:06:45] greg-g: Slacker. ;-) [16:07:04] RECOVERY - HHVM rendering on mw1072 is OK: HTTP OK: HTTP/1.1 200 OK - 70843 bytes in 0.120 second response time [16:07:06] it *may* go out everywhere again today, we'll see [16:07:09] thcipriani: as Kelson isn't there, I'll take care of 262893 deployment too [16:07:22] Dereckson: sounds good, thanks. [16:09:08] thcipriani, you should read the email I sent before dealing with that patch [16:09:32] James_F: are your SWAT patches predicated on wmf.11 being out in any way or are you OK to go without? Also, re: https://gerrit.wikimedia.org/r/#/c/258206/7 changing the default to true. 
Wanted to be sure that was the intention. [16:09:38] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1986503 (10Papaul) @Robh as you mentioned "HDD - SATA Seagate ST2000DM001 7.2K 2TB 10" those drives that I have in spare are SATA and ms-be2003 is using SAS and not SATA. [16:09:38] <_joe_> thcipriani: no go on, I'll stop you in case [16:09:44] _joe_: kk, thanks [16:09:51] (03CR) 10BBlack: [C: 031] "I tested the same zonefile + config change locally on a production-config pdns_recursor machine and it works. I haven't really tested the" [puppet] - 10https://gerrit.wikimedia.org/r/267208 (https://phabricator.wikimedia.org/T125170) (owner: 10BBlack) [16:10:09] Glaisher, so I think that it will use the default proxy settings [16:10:13] thcipriani: Eh. Go for it now. [16:10:31] thcipriani: And yes, default-to-true is the intent. [16:10:35] James_F: ack. thanks! [16:11:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264765 (https://phabricator.wikimedia.org/T116523) (owner: 10Jforrester) [16:11:06] Glaisher, which isn't set up [16:11:36] (03CR) 10Glaisher: "> This also can never hope to have reasonable histories, as it's not an SUL wiki. I'm minded to say this shouldn't be allowed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: 10MarcoAurelio) [16:11:42] (03Merged) 10jenkins-bot: Enable VisualEditor by default for some other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264765 (https://phabricator.wikimedia.org/T116523) (owner: 10Jforrester) [16:12:25] Krenair: So it won't work anyway? 
[16:13:49] (03CR) 10Alex Monk: "We already allow import from wikimediafoundation.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: 10MarcoAurelio) [16:13:56] (03PS2) 10Dereckson: Namespace configuration on cu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) [16:14:20] Krenair: what's up with the interwiki.cdb file? [16:14:31] thcipriani, what's up with it? [16:14:32] (03CR) 10Daniel Kinzler: [C: 031] "Yes, we want this for Wikidata. Tested this together with Jonas, seems to work as advertised. Needs I2043353da." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [16:14:32] (saw you sync'd it this morning, modified on mira) [16:14:35] (03CR) 10Dereckson: "PS2: added the old namespace name as alias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) (owner: 10Dereckson) [16:14:38] ugh [16:14:47] is that part of the git repo again now? [16:15:08] Krenair: evidently [16:15:15] I ran updateinterwikicache earlier [16:15:59] oh I see. I'll push up a patch to gerrit. 
[16:16:19] I'm doing it now [16:16:22] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1986511 (10mark) p:5Normal>3Low [16:16:35] (03PS1) 10Alex Monk: updateinterwikicache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267687 [16:16:52] (03PS1) 10Giuseppe Lavagetto: scap: temprorarily remove tin during reimaging [puppet] - 10https://gerrit.wikimedia.org/r/267688 [16:17:01] (03CR) 10Alex Monk: [C: 032] updateinterwikicache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267687 (owner: 10Alex Monk) [16:17:12] Krenair: May 23: 03e6919608cc5aaab44dfda23d05e7f8439ba6a2 - Don't commit interwiki.cdb anymore / Dec 4: 22a00eb5f473c6822e1c19c6602db7ae55a613ac - Rvert [16:17:28] I only recently broke it and could only recover the old one, because I saved it to my home [16:17:31] before running hte script, that broke it [16:17:34] that's a pretty good reason for having it versioned, no? [16:17:37] Yeah [16:17:57] (03Merged) 10jenkins-bot: updateinterwikicache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267687 (owner: 10Alex Monk) [16:18:04] that was pretty scary... yes [16:18:09] thcipriani, should be good now [16:18:15] Krenair: cool, thanks. [16:18:35] 6operations, 10ops-codfw: ms-be2015 doesn't come up after reboot - https://phabricator.wikimedia.org/T125383#1986513 (10Papaul) @MoritzMuehlenhoff here is what I have on the system. {F3300155} [16:19:32] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [16:21:13] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[16:22:09] !log thcipriani@mira Synchronized dblists/visualeditor-default.dblist: SWAT: Enable VisualEditor by default for some other wikis [[gerrit:264765]] (duration: 01m 58s) [16:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:23] ^ James_F check please [16:23:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258206 (https://phabricator.wikimedia.org/T92661) (owner: 10Jforrester) [16:23:21] thcipriani: Yup, WFM. [16:23:44] (03Merged) 10jenkins-bot: Centralise all VisualEditor feedback pages except for a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258206 (https://phabricator.wikimedia.org/T92661) (owner: 10Jforrester) [16:24:36] 6operations, 10Traffic, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1986530 (10Joe) So, I guess this has been some sort of weird race condition. Or that etcd responded with stale/spurious data in some way (the way recursive watch works might have caused that). [16:26:20] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Centralise all VisualEditor feedback pages except for a few wikis [[gerrit:258206]] (duration: 01m 30s) [16:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:27] ^ James_F check please [16:26:37] thcipriani: Checking. [16:27:50] Hmm. [16:28:38] thcipriani: Keep going. [16:28:45] James_F: okie doke. 
[16:29:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) (owner: 10Dereckson) [16:30:11] (03Merged) 10jenkins-bot: Namespace configuration on cu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) (owner: 10Dereckson) [16:32:30] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration on cu.wikipedia [[gerrit:265885]] (duration: 01m 26s) [16:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:35] ^ Dereckson: check please [16:32:39] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986560 (10Lydia_Pintscher) Thanks. I looked at one of them and the only thing in the page is the template f... [16:34:22] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986566 (10daniel) Yea, looks like the srwiki talk pages wasn't us, but an edit to a much-used template. 
[16:35:41] thcipriani: seems to work [16:35:48] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266541 (https://phabricator.wikimedia.org/T123084) (owner: 10Dereckson) [16:36:32] (03Merged) 10jenkins-bot: Set WikidataPageBanner namespaces on fr.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266541 (https://phabricator.wikimedia.org/T123084) (owner: 10Dereckson) [16:38:55] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Set WikidataPageBanner namespaces on fr.wikivoyage [[gerrit:266541]] (duration: 01m 26s) [16:38:57] ^ Dereckson check please [16:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:54] 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986594 (10BBlack) Regardless, the average rate of HTCP these days is normally-flat-ish (a few scary spikes... [16:42:57] Doesn't work in NS_PROJECT [16:43:16] Nor through Wikidata, nor through {{PAGEBANNER}} [16:43:45] thcipriani: could you do an mwscript eval? [16:44:12] on wgWPBNamespaces for frwikivoyage [16:46:12] Dereckson: https://phabricator.wikimedia.org/P2550 [16:47:52] <_joe_> !log installing the new HHVM package to the api appserver cluster in eqiad [16:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:00] ah the setting is wrong, InitialiseSettings.php has wgWPBNamespaces, the extension wants wgWPBBannerNamespaces [16:48:31] I'm preparing a patch to fix that. [16:49:04] Dereckson: blerg. thank you. 
[16:50:20] 6operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1986611 (10BBlack) 3NEW [16:50:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: 10Dereckson) [16:51:36] (03Merged) 10jenkins-bot: Enable WikidataPageBanner on es.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: 10Dereckson) [16:52:25] <_joe_> !log restarted pybal on lvs1001 [16:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:40] (03PS1) 10Dereckson: wgWPBNamespaces → wgWPBBannerNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267694 [16:52:54] 6operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1986628 (10BBlack) See also: T125265 + T124418 [16:53:28] 6operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1986633 (10BBlack) And maybe-related: T122455 [16:54:06] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable WikidataPageBanner on es.wikivoyage [[gerrit:267195]] (duration: 01m 29s) [16:54:09] ^ Dereckson check please [16:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:53] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267194 (https://phabricator.wikimedia.org/T124614) (owner: 10Dereckson) [16:55:14] (03PS3) 10ArielGlenn: dumps: configure for parallelized runs for zhwiki, metawiki [puppet] - 10https://gerrit.wikimedia.org/r/263912 [16:55:42] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1986637 (10BBlack) cp1060 is depooled for users now. 
Once Analytics is done with their oozie thing, we can proceed on the next steps for actually stopping the cache_mobile clust... [16:55:42] (03Merged) 10jenkins-bot: Enable SandboxLink on or.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267194 (https://phabricator.wikimedia.org/T124614) (owner: 10Dereckson) [16:56:39] thcipriani: 267195 tested [16:56:49] (03CR) 10ArielGlenn: [C: 032] dumps: configure for parallelized runs for zhwiki, metawiki [puppet] - 10https://gerrit.wikimedia.org/r/263912 (owner: 10ArielGlenn) [16:57:02] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [16:57:16] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1986641 (10Ottomata) [16:58:08] 6operations, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1986647 (10Ottomata) 5Open>3Resolved [16:58:33] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures [16:59:05] (03PS4) 10Thcipriani: Use extension registration for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [16:59:15] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable SandboxLink on or.wikipedia.org [[gerrit:267194]] (duration: 01m 31s) [16:59:20] ^ Dereckson check please [17:00:11] (03PS4) 10ArielGlenn: dumps: configure for parallelized runs for zhwiki, metawiki [puppet] - 10https://gerrit.wikimedia.org/r/263912 [17:00:20] 266433 tested. [17:00:50] hmmm, not really a good way to sync https://gerrit.wikimedia.org/r/#/c/266433/ [17:01:03] (03CR) 10Dereckson: [C: 04-1] "Documentation extension lags on actual code. wgWPBNamespaces seems the correct setting." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/267694 (owner: 10Dereckson) [17:02:53] PROBLEM - SSH on cygnus is CRITICAL: Server answer [17:03:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [17:03:40] (03Merged) 10jenkins-bot: Use extension registration for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [17:03:52] 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1986670 (10TrevorParscal) I approve giving Subbu access to ruthenium. [17:05:31] (03PS2) 10Yuvipanda: toolslabs: install hunspell and libhunspell-dev to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/267513 (https://phabricator.wikimedia.org/T125193) (owner: 10Ebrahim) [17:05:37] (03CR) 10Yuvipanda: [C: 032 V: 032] toolslabs: install hunspell and libhunspell-dev to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/267513 (https://phabricator.wikimedia.org/T125193) (owner: 10Ebrahim) [17:06:53] !log thcipriani@mira Synchronized wmf-config: SWAT: Use extension registration for Graph [[gerrit:266433]] (duration: 01m 29s) [17:06:56] Dereckson: check please ^ [17:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:49] (03PS3) 10ArielGlenn: dumps: rebalance page ranges for dump jobs that run in parallel [puppet] - 10https://gerrit.wikimedia.org/r/266314 (https://phabricator.wikimedia.org/T123571) [17:08:41] Graph pages still work as expected, okay. So 266433 tested. [17:09:53] Dereckson: I'm going to bump https://gerrit.wikimedia.org/r/#/q/262893,n,z if that's ok. I've got to run to a meeting. [17:10:03] Okay. [17:10:16] Dereckson: thank you for your help. appreciated. 
[17:10:33] (03CR) 10ArielGlenn: [C: 032] dumps: rebalance page ranges for dump jobs that run in parallel [puppet] - 10https://gerrit.wikimedia.org/r/266314 (https://phabricator.wikimedia.org/T123571) (owner: 10ArielGlenn) [17:10:44] Thank you for the deploy. [17:17:28] (03Abandoned) 10Dereckson: wgWPBNamespaces → wgWPBBannerNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267694 (owner: 10Dereckson) [17:20:15] Do you know when decisions around the rollback will be taken today? I need to make a decision about what to do with this week's issue of Tech News today, and wonder if I should wait another hour or so or postpone sending it out until tomorrow UTC. [17:21:01] greg-g, bd808 ^ [17:21:49] JohanJ: greg-g will be announcing in <60m. He's in a meeting right now [17:23:06] Rightyo. Thansk. [17:23:10] Or thanks. [17:24:53] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:26:18] 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1986751 (10EBernhardson) If we were to split the cluster by wiki our disk usage should stay pretty consistent with growth over the last year, merely split between clusters. If we were to sp... [17:27:14] 7Blocked-on-Operations, 6operations, 6Services, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1986752 (10Addshore) >>! In T85451#1258826, @GWicke wrote: > So, should we > > a) get a new box with more / bigger SSDs (most 2.5" cases have space for 8 SSDs), or >... [17:31:19] <_joe_> gehel: welcome :) [17:36:45] _joe_: thx! Happy to be now part of the family ! [17:40:35] 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1986810 (10ssastry) In addition to what Trevor approved above, based on ops recommendations, I can request and get @TrevorParscal's approval for the same. 
[17:42:15] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online. [17:42:15] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online. [17:42:15] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online. [17:42:15] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online. [17:42:15] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online. [17:45:24] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 15.00% of data above the critical threshold [100000000.0] [17:50:33] dewiki is down ... [17:50:51] and up again [17:51:24] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 1 failures [17:51:31] <_joe_> thcipriani: I won't be at the deployment wg meeting, I have a conflicting meeting I need to attend [17:51:49] _joe_: ack, np, thanks for the heads up [17:53:02] https://commons.wikimedia.org/w/index.php?title=MediaWiki:ImageAnnotatorConfig.js&action=raw&ctype=text/javascript [17:53:02] ohoh [17:53:09] Ah, not the only one. [17:53:22] I have a broken homepage of Wikidata and get the same error when purging. 
[17:54:03] !log restarted hhvm on mw1253 [17:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:54:25] hoo probably fixed it [17:54:31] <_joe_> Fatal error: Call to undefined function headers_sent() in /srv/mediawiki/errorpages/hhvm-fatal-error.php on line 815 [17:54:50] <_joe_> see logstash [17:54:53] Also many redis servers seem to be unreachable [17:55:04] <_joe_> hoo: which ones? and where? [17:55:12] _joe_: only 1253 [17:55:12] rdb1001.eqiad.wmnet [17:55:17] and that has stopped [17:55:18] and rdb1007.eqiad.wmnet [17:55:30] <_joe_> hoo: what do you mean they're unreachable? [17:55:36] I see timeouts in the logs [17:55:40] but only a few [17:55:40] <_joe_> you see connection errors? [17:55:43] <_joe_> ok that's an overload [17:55:49] ah ok [17:56:24] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [17:56:51] <_joe_> thcipriani: see the fatal log [17:56:53] <_joe_> please [17:57:22] _joe_: whoa [17:57:48] <_joe_> thcipriani: could be partly an artifact of me rolling restarting all appservers [17:59:00] all one host [17:59:02] 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1986891 (10dcausse) I agree: splitting by feature is not easy and maybe not appropriate for the moment. Random thoughts: In the future I think one strategy could be: 1/ cluster for type a... [17:59:04] mw1253 [17:59:15] well "all" being 99.999% [17:59:22] greg-g: that is fixed [17:59:26] I restarted hhvm there, it stopped afterwards [17:59:39] ah, sorry, didn't read full scrollback, we were in a meeting [17:59:42] thanks all :) [17:59:56] alright, email being written/sent re wmf.11/12 now [18:00:05] greg-g: thanks. [18:02:32] it's been a fun morning :) [18:02:59] greg-g: To which ml? 
Didn't get it yet [18:03:55] Anyway, my food i sgetting cold, I'll be right back [18:05:55] (03PS1) 10ArielGlenn: dumps: allow per-wiki configuration of checkpoint time in config [puppet] - 10https://gerrit.wikimedia.org/r/267704 [18:07:56] (03CR) 10ArielGlenn: [C: 032] dumps: allow per-wiki configuration of checkpoint time in config [puppet] - 10https://gerrit.wikimedia.org/r/267704 (owner: 10ArielGlenn) [18:10:48] _joe_: i think some of the fatal errors maybe got cached in varnish [18:11:00] hoo|away: coming now [18:11:03] !log planet1001 - rebooting for upgrade [18:11:05] cp1055 miss(0), cp3040 hit(2), cp3031 frontend hit(1) [18:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:11] sometimes when i try https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge [18:11:14] <_joe_> aude: bblack already banned a url [18:11:21] hoo|away: ops@, engineering@, and wikitech-l@ [18:11:43] 7Blocked-on-Operations, 6operations, 6Services, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1986923 (10EBernhardson) I also support this change, collecting stats in graphite has shown to be quite powerful. Removing limitations around stats collection should be... [18:12:06] "Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes." [18:12:09] <_joe_> aude: try https://www.wikidata.org/w/index.php?title=Wikidata:Main_Page&cache=none [18:12:14] _joe_: ok [18:12:20] <_joe_> or, add any arbitrary key to the query string :P [18:12:30] When viewing someones userpage there is an error page but no details [18:12:33] _joe_: yeah, i did [18:12:34] <_joe_> (I already purged the page for now) [18:12:35] and that works [18:12:41] <_joe_> Bsadowski1: url? 
[18:12:43] No error details at all [18:12:46] <_joe_> in pm if needed [18:12:49] https://simple.wikipedia.org/wiki/User:Etamni [18:12:59] Someone complained about it [18:13:51] <_joe_> Bsadowski1: fixed [18:14:00] Weird. It doesn't happen on mine.. [18:14:10] <_joe_> the trick is to append ?cache=hi&action=purge :P [18:14:11] is Kelson in here? [18:14:16] <_joe_> Bsadowski1: I did purge the cache [18:15:29] https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge is still broken [18:17:13] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:19:24] possibly just https://www.wikidata.org/wiki/Wikidata:Main_Page is also cached in broken state (reported by user) [18:20:24] 6operations, 10Wikimedia-Video, 5Patch-For-Review: 1gb file upload limit is too restrictive for conference presentation videos - https://phabricator.wikimedia.org/T116514#1986959 (10BBlack) Have we actually tested a ~2.4 GB upload and seen it work? I worry that somewhere in the stack, something is using a s... [18:20:52] (03CR) 10BBlack: "https://phabricator.wikimedia.org/T116514#1986959" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ) [18:25:30] (03CR) 10Filippo Giunchedi: "thanks Daniel! puppet compiler is happy, https://puppet-compiler.wmflabs.org/1670/ I'll merge this tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/204275 (owner: 10Filippo Giunchedi) [18:25:36] Nemo_bis: do you know if the newer RC stream (stream.wm, not irc.wm) is also that critical for antivandal tools? [18:29:44] 6operations, 10Wikimedia-Video, 5Patch-For-Review: 1gb file upload limit is too restrictive for conference presentation videos - https://phabricator.wikimedia.org/T116514#1986992 (10fgiunchedi) re: testing, as of a week ago there's a swift cluster in beta as per {T64835}, mediawiki config change hasn't happe... 
[18:31:04] (03PS3) 10ArielGlenn: crap salt cleanup scripts primarily for labs use [software] - 10https://gerrit.wikimedia.org/r/236798 [18:31:34] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [18:33:04] mutante: probably is [18:33:19] * aude doesn't know for sure though [18:33:33] aude: yep, i heard it's used by cvnbot [18:33:48] (03CR) 10ArielGlenn: [C: 032 V: 032] "they're crap scripts but they might as well go in the repo instead of sitting here forever" [software] - 10https://gerrit.wikimedia.org/r/236798 (owner: 10ArielGlenn) [18:34:17] mobile web error on https://en.m.wikipedia.org/wiki/Main_Page but all other pages seem fine. [18:35:41] (03PS2) 10ArielGlenn: force salt minion to ping master every 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/219134 [18:35:56] <_joe_> quiddity: https://en.m.wikipedia.org/w/index.php?title=Main_Page&nocache=y&action=purge solved it, FYI [18:36:20] <_joe_> the trick is to add a bogus part in the query string (nocache=y does nothing) [18:36:30] ty! [18:36:47] <_joe_> so that you know what to do when something else happens [18:37:58] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, and 3 others: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1987038 (10Joe) All appservers have been upgraded, I'll perform some tests tomorrow to ensure this is solved. [18:37:58] (03CR) 10ArielGlenn: [C: 032] "This may be useful in multimaster scenarios in the future. For now it should be harmless, let's keep an eye on performance though." [puppet] - 10https://gerrit.wikimedia.org/r/219134 (owner: 10ArielGlenn) [18:38:06] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, and 3 others: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1987039 (10Joe) a:3Joe [18:40:57] _joe_: here? [18:41:01] or bblack [18:41:23] <_joe_> hoo: here-ish [18:41:26] <_joe_> what's up? 
[18:41:33] (03PS3) 10BBlack: Do not normalize_path for cxserver|citoid|rest.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/267381 (https://phabricator.wikimedia.org/T125176) (owner: 10GWicke) [18:41:51] Well, error pages caught up in varnish because they returned 200 OK and probably inappropriate cache headers [18:41:58] thus they are cached even for logged ins [18:42:03] not sure how many pages are affected [18:42:11] I know about https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge at least [18:42:29] <_joe_> hoo: read my comment to quiddity for a workaround [18:42:32] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 23.08% of data above the critical threshold [100000000.0] [18:42:45] _joe_: I know how to bypass varnish ;) [18:43:13] mutante: yes, AFAIK most cross-wiki antivandalism relies on rcstream by now [18:43:20] (03CR) 10GWicke: [C: 031] "LGTM. Thank you, @bblack!" [puppet] - 10https://gerrit.wikimedia.org/r/267381 (https://phabricator.wikimedia.org/T125176) (owner: 10GWicke) [18:43:32] _joe_: i did that [18:43:36] <_joe_> hoo: ok, there were 1500 errors, I don't know how to find a list of the affected pages easily and it's too late for me [18:43:42] a handful big wikis may survive without but several hundreds would probably be devastated :) [18:44:08] <_joe_> Nemo_bis: as long as the clients reconnect upon a connection failure, they'll be ok [18:44:40] _joe_: Fair enough [18:44:51] (03CR) 10BBlack: [C: 032] Do not normalize_path for cxserver|citoid|rest.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/267381 (https://phabricator.wikimedia.org/T125176) (owner: 10GWicke) [18:44:51] <_joe_> Nemo_bis: if they don't, they should be fixed [18:44:53] would be nice if it were easier to purge individual urls [18:45:08] <_joe_> hoo: well our problem here is to find such urls [18:45:17] * aude could purge them on beta with -X PURGE [18:45:19] <_joe_> we could ban based on the content-length maybe [18:45:23] from one of the 
hosts [18:45:33] <_joe_> on the varnishes, I mean [18:45:35] no idea all the affected hosts but at least ones i know about [18:45:54] content length = 13817 [18:45:58] _joe_, when is the techops meeting? i am trying to figure out the status of my access requests. [18:46:23] meeting is over, subbu [18:46:29] * _joe_ off [18:46:39] _joe_: thanks for helping [18:46:41] bye, joe. happy recharging [18:46:42] content lenght + time span would work, I guess [18:46:53] can someone tl;dr the above? [18:46:59] ah, ok. i'll wait to hear more about the access requests then. [18:47:02] bblack: Sure [18:47:06] <_joe_> bblack: one appserver went rogue, cached error pages [18:47:15] Server gone mad, giving cacheable broken error pages [18:47:18] ok [18:47:18] <_joe_> bblack: some we manually purged [18:47:28] eg. https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge [18:47:28] e.g. https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge (for me at least) [18:47:35] ok, so ban content-length of 13817 supposedly catches them all? [18:47:40] it should [18:47:42] bblack: probably [18:47:44] <_joe_> that is actually a double-cache effect [18:47:47] <_joe_> which is weird [18:47:50] the apperserver failed at one point in that page [18:47:57] double-cache effect? [18:48:01] Another report: https://en.wikipedia.org/wiki/Syrian_Civil_War [18:48:13] <_joe_> bblack: uhm no scratch that [18:48:41] <_joe_> bblack: so any page that was cached with an error gets re-presented if you ask it with action=purge [18:48:41] ok working on the ban for content length, we're probably ok with that alone as conditional if it works [18:48:44] RecentChanges also fires up a error sometimes. 
[18:48:49] <_joe_> bblack: +1 [18:49:34] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:50:23] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail [18:53:51] ban should be complete [18:54:12] I'm not sure how we confirm completely other than dropoff of user reports. different users will hit different frontend caches for the same URL [18:54:32] !log LDAP - added elukey to "ops" group [18:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:54] !log banned obj.http.Content-Length == 13817 on all cache_text [18:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:23] bblack: works now [18:55:38] sjoerddebruin: errors gone now? [18:55:41] Yep [18:55:46] great [18:55:53] thanks bblack [18:55:56] np! [18:57:51] (03CR) 10EBernhardson: Reduce replica count for commonswiki_file in codfw (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658 (owner: 10EBernhardson) [18:57:53] (03PS2) 10EBernhardson: Reduce replica count for commonswiki_file in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658 [19:00:25] (03PS4) 10Dereckson: Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [19:03:34] _joe_, I am adding you to a patch we discussed on the meeting, but it is not urgent [19:04:23] Hey nuria_ hive [19:04:26] oops :) [19:04:29] yes [19:04:39] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987127 (10greg) >>! In T124440#1981816, @greg wrote: > @legoktm: Update, please? Anyone, update please. 
[19:04:46] nothing yet nuria_ , my autocompletion worked too fast [19:04:51] sorry :) [19:07:29] (03Abandoned) 10MarcoAurelio: Expanding transwiki import sources for be.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: 10MarcoAurelio) [19:08:56] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:08:57] (03CR) 10Jcrespo: "The commit comment shoud say "partitioning changes that have been rolled-in", not back, will fix on my next patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [19:11:26] (03CR) 10MarcoAurelio: Change Nepali Wikibooks sitename and logo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [19:12:18] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987155 (10ArielGlenn) I see it still going on terbium. [19:12:21] 6operations, 10procurement: dataset host specification refresh - https://phabricator.wikimedia.org/T125421#1987156 (10RobH) 3NEW [19:12:48] (03CR) 10MarcoAurelio: "> I ran optipng on the logo before I pushed the patch. Is that enough or what else do I need to do?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [19:12:51] 6operations, 10procurement: dataset host specification refresh - https://phabricator.wikimedia.org/T125421#1987168 (10RobH) 5Open>3Invalid a:3RobH didnt put in private space, so rejecting this one as invalid. 
[19:13:46] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [19:13:46] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1792783 (10RobH) [19:21:16] 6operations, 10procurement: dataset host specification refresh - https://phabricator.wikimedia.org/T125421#1987223 (10RobH) [19:21:47] bah, edit the rejected one [19:25:32] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987246 (1... [19:26:10] (03PS1) 10Ori.livneh: Speed trials: add preconnect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267719 (https://phabricator.wikimedia.org/T123582) [19:26:24] (03CR) 10Ori.livneh: [C: 032 V: 032] Speed trials: add preconnect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267719 (https://phabricator.wikimedia.org/T123582) (owner: 10Ori.livneh) [19:27:04] 7Blocked-on-Operations, 6operations: Re-pool restbase1007 - https://phabricator.wikimedia.org/T124565#1987253 (10GWicke) It is still listed as disabled [1], and is not getting much traffic. [1]: http://config-master.wikimedia.org/conftool/eqiad/restbase [19:27:11] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987254 (1... 
[19:27:26] 7Blocked-on-Operations, 6operations: Re-pool restbase1007 - https://phabricator.wikimedia.org/T124565#1987255 (10GWicke) 5Resolved>3Open [19:27:43] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987257 (1... [19:28:23] !log ori@mira Synchronized docroot/wikipedia.org/speed-tests: I5b48a491390: Speed trials: add preconnect (duration: 01m 27s) [19:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:19] (03PS4) 10MtDu: Change Nepali Wikibooks sitename and logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) [19:32:06] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987284 (1... [19:37:03] hoo, did something related to wikidata config got deployed at 18:41? https://logstash.wikimedia.org/#dashboard/temp/AVKeVRrBptxhN1XatUHF [19:37:35] or labs? [19:38:51] not that I know of [19:42:33] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987317 (10jcrespo) Adding @Greg, as this (setting s2 shard as read-only) needs Release Engineering coordination, and sadly we are in a timer here. * Decide a date for the... [19:48:26] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987343 (10greg) How bad is the time crunch/when is the latest you would *comfortably* want to do this? 
For the warning users part, we'll need some help from @Johan (ping, j... [19:59:54] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987359 (10jcrespo) The plan is: doing a try this week/early next week. Best case scenario, 1-10 seconds in read-only mode, and almost no user notice. If that doesn't work,... [20:04:06] mutante, robh any update about my access requests? [20:05:08] both ended up as inconclusive during the meeting since there is some confusion on them [20:05:42] i think now its up to current clinic person to sort out who needs to followup? [20:05:51] elukey: ^ heh [20:06:03] you may recall we had a very confusing discussion about this during the ops meeting? [20:07:27] robh, anything i can do to help with clarifying the confuing pieces? I am not sure if my requests were confusing or something else was the issue. [20:07:42] or elukey. [20:13:19] I didn't think it was confusing, but somehow we werent able to discuss the parsoid-rt-admin group [20:13:26] without discussing the adding you to parsoid-roots [20:13:29] which to me were unrelated. [20:14:24] so you're going to block subbu because you can't follow through with the process you devised? [20:14:44] without even proactively giving an explanation for the delay? [20:14:46] good times [20:14:49] ori: srsly? [20:14:55] that is unduly harsh to me when im trying to help out now [20:15:10] i'm not picking on you specifically, but something is not right here [20:15:17] we are following up now [20:15:30] but you want to argue, im dropping this. [20:15:31] great, thanks [20:15:46] * robh is tired of arguing when he is simply relaying info. [20:15:49] ori, robh chill .. i can wait for things to be resolved. thanks for trying to help. [20:16:07] i am blocked, but as long as ops don't mind me poking you for geting stuff done on ruthenium, i can deal. [20:17:51] Again, I am not sure what wasn't clear during the ops meeting. 
I supported the rights being added in https://phabricator.wikimedia.org/T124701 [20:18:19] then folks started saying that access request and the parsoid-roots had to be related together [20:18:21] and they dont imo [20:18:37] mutante i think was one of those who said they were related? [20:18:44] (again, not sure) [20:18:52] no, i pointed out there were 2 separate tickets, that's all [20:18:59] those 2 requests are different. [20:19:05] i support the one for root on ruthenium and did so in meeting [20:19:07] (a) one is about getting me full access [20:19:10] i dont know about the other one [20:19:20] (b) another is about getting others in my team getting partial access sudo for specific operations [20:19:34] i will be on vacation, not around [20:19:52] and i want others in my team to be able to do stuff .. most of the itme it will be simply to deploy fresh code, restart services, etc. [20:20:04] and if they need anything that their access doesn't allow, they can ping someone here. [20:20:17] so https://phabricator.wikimedia.org/T124701 is b) team [20:20:19] and if that becomes a problem, we can revisit then. [20:20:22] im not sure why eveyrone didnt approve [20:20:32] but folks kept talking over me [20:20:35] and then it got shut down. [20:20:41] yes, T124701 is team [20:20:59] (03CR) 10MarcoAurelio: [C: 031] "Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [20:21:45] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987425 (10jcrespo) List of wikis affected: ``` mysql -e "SHOW DATABASES like '%wik%'" +------------------+ | Database (%wik%) | +------------------+ | bgwiki | |... 
[20:21:50] (PS1) Yuvipanda: tools: Setup appropriate ferm rules for etcd clients / peers [puppet] - https://gerrit.wikimedia.org/r/267728
[20:21:59] Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987427 (RobH) DB rights should likely be a different ticket, since they require @jcrespo to re...
[20:22:17] robh, ori and hey hope i didn't come across too rude by asking you two to chill ...
[20:22:19] (PS2) Yuvipanda: tools: Setup appropriate ferm rules for etcd clients / peers [puppet] - https://gerrit.wikimedia.org/r/267728
[20:22:27] (CR) Yuvipanda: [C: 2 V: 2] tools: Setup appropriate ferm rules for etcd clients / peers [puppet] - https://gerrit.wikimedia.org/r/267728 (owner: Yuvipanda)
[20:22:34] considering ori got me pretty angry, nope.
[20:22:41] that wasn't my intention at least.
[20:23:09] I dislike it when people jump in claiming I'm not working on something when I
[20:23:14] m clearly discussing it at that time ;]
[20:23:24] though i realize now ori likely didnt mean it that way =]
[20:23:35] ori: we good right?
[20:24:15] (i hope he went afk and isnt still mad)
[20:24:29] win 22
[20:26:28] operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987436 (Krenair) Interesting. I have never noticed that `l10nwiki` database before. All it contains is two empty tables (localisation and localisation_file_hash).
[20:26:28] robh, i can understand why ori was frustrated on my behalf .. he has spent a good amount of time trying to get my puppet patches reviewed and merged on ruthenium.
[20:26:28] but, i can be patient.
[20:26:43] the meeting block wasnt due to lack of trust but due to lack of clarity in the overall request during the meeting. I think my update clarifies that, but I'm not sure how to proceed
[20:27:01] I don't really want to wait for another week, as a few of us in ops are having to do things for you which is silly
[20:27:07] got it.
[20:28:47] ok, will respond on the ticket.
[20:29:25] Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987460 (ssastry) Got it. I'll split the db access into a separate ticket. And, you are right...
[20:29:50] subbu: on https://phabricator.wikimedia.org/T124701
[20:29:58] you suggest reusing parsoid admins?
[20:30:07] but that group doesnt include all the rt stuff, and is clusterwide
[20:30:22] that would mean your team would have to be approved for admin on production level parsoid stuff
[20:30:26] operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987464 (greg) Looks like the communities of the effected wikis aren't all in one part of the globe (based on my hugely hand-wavy assessment), so whatever time of day that...
[20:30:28] (seems harder to approve is all)
[20:30:43] unless they need to do all that stuff on the entire cluster or machines?
[20:31:02] robh ah, ok. never mind. i didn't understand those nuances.
[20:31:05] (or am i misunderstanding your reply?)
[20:31:20] cuz indeed, it would be easier to slap three sudo rights onto the existing group
[20:31:23] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987465 (Legoktm) Sorry missed the ping. Yes, it's still going :(
[20:31:29] but then it escalates all those others to full admin on multiple machines
[20:31:32] cool
[20:31:36] maybe call it parsoid-testing-admins instead?
[20:32:48] Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987469 (RobH) IRC Discussion Update: I pointed out how appending those rights to the existing...
[20:32:57] oh, true
[20:33:07] operations, Security-Reviews, Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1987474 (Dzahn)
[20:33:22] i suppose its not clear what the rt part of parsoid-rt is
[20:33:28] roundtrip :)
[20:33:52] but, we are running more than rt-tests on ruthenium .. so, -testing- is better.
[20:34:13] (PS2) Dzahn: releases: Fix capitalization of MediaWiki [puppet] - https://gerrit.wikimedia.org/r/267428 (owner: Legoktm)
[20:34:28] (CR) Dzahn: [C: 2] "good point! thanks" [puppet] - https://gerrit.wikimedia.org/r/267428 (owner: Legoktm)
[20:35:22] (PS1) Yuvipanda: tools: Include base::firewall on the flannel etcd hosts [puppet] - https://gerrit.wikimedia.org/r/267732
[20:37:21] (PS2) Yuvipanda: tools: Include base::firewall on the flannel etcd hosts [puppet] - https://gerrit.wikimedia.org/r/267732
[20:37:24] (PS3) Dzahn: Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[20:37:29] (CR) Yuvipanda: [C: 2 V: 2] tools: Include base::firewall on the flannel etcd hosts [puppet] - https://gerrit.wikimedia.org/r/267732 (owner: Yuvipanda)
[20:37:39] (PS4) Dzahn: Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[20:39:01] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[20:40:01] Ops-Access-Requests, operations: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1987492 (ssastry) NEW
[20:43:00] operations, Diamond, Upstream: Diamond load averages do not contain scaled versions - https://phabricator.wikimedia.org/T125411#1987514 (yuvipanda)
[20:43:22] (CR) Dzahn: [C: 2] Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[20:48:17] (PS1) Yuvipanda: tools: Allow peer nodes to acces etcd port too [puppet] - https://gerrit.wikimedia.org/r/267736
[20:48:40] Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987540 (RobH) a:RobH>None My understanding (though I could be mistaken) is with the clar...
[20:49:14] (CR) jenkins-bot: [V: -1] tools: Allow peer nodes to acces etcd port too [puppet] - https://gerrit.wikimedia.org/r/267736 (owner: Yuvipanda)
[20:49:14] Ops-Access-Requests, operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1987545 (RobH) My understanding (though I could be mistaken) is with the clarification above, this now only needs one of the following: A) Ops team meeting review (this was attempted today but there wa...
[20:49:16] operations, Diamond, Upstream: Diamond load averages do not contain scaled versions - https://phabricator.wikimedia.org/T125411#1987543 (yuvipanda) Hmm, this will add an extra metric for all hosts, prod and labs. @Fgiunchedi is that ok?
[20:49:25] (PS2) Yuvipanda: tools: Allow peer nodes to acces etcd port too [puppet] - https://gerrit.wikimedia.org/r/267736
[20:49:41] (CR) Yuvipanda: [C: 2 V: 2] tools: Allow peer nodes to acces etcd port too [puppet] - https://gerrit.wikimedia.org/r/267736 (owner: Yuvipanda)
[20:50:01] operations, Labs, Labs-Infrastructure, Tool-Labs: tools-exec: automatic php upgrade - puppet fail - https://phabricator.wikimedia.org/T125438#1987548 (Dzahn) NEW
[20:50:12] Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987556 (ssastry) p:Triage>High
[20:50:27] Ops-Access-Requests, operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1987557 (ssastry) p:Triage>High
[20:53:47] (CR) Thcipriani: "Inline comments." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/266773 (owner: Chad)
[20:55:00] operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987567 (jcrespo) Hardware maintenance (there is an LVM/fs problem that avoids the partition to grow), OS upgrade and mariadb upgrade (5.5 -> 10). I have the long version i...
[20:55:49] Ops-Access-Requests, operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1987570 (RobH) My last comment (and token) was meant for an unrelated task i had open in another tab. I removed the comment but tokens seem to stick (sorry about that!)
[20:56:52] no parsoid deploy today. we have some regressions that we need to deal with first.
[20:59:31] (PS1) Bmansurov: Remove section collapsing config [mediawiki-config] - https://gerrit.wikimedia.org/r/267776 (https://phabricator.wikimedia.org/T124220)
[21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160201T2100).
[21:00:18] no mobileapps deployment today.
[21:01:20] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987589 (Anomie) See also https://gerrit.wikimedia.org/r/#/c/267734/ and https://gerrit.wikimedia.org/r/#/c/267735/.
[21:02:22] operations, Labs, Labs-Infrastructure, Tool-Labs: tools-exec: automatic php upgrade - puppet fail - https://phabricator.wikimedia.org/T125438#1987599 (scfc) Open>Resolved a:scfc Fixed by `apt-get install php5-gd` and downgrading.
[21:02:38] (PS1) Dzahn: toollabs: don't use ensure => latest for everything [puppet] - https://gerrit.wikimedia.org/r/267778
[21:03:18] operations, Labs, Labs-Infrastructure, Tool-Labs: tools-exec: automatic php upgrade - puppet fail - https://phabricator.wikimedia.org/T125438#1987605 (Dzahn) thanks @scfc :) i uploaded https://gerrit.wikimedia.org/r/#/c/267778/
[21:03:44] (PS2) Dzahn: toollabs: don't use ensure => latest for everything [puppet] - https://gerrit.wikimedia.org/r/267778
[21:07:10] operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987621 (greg) That's good enough for me :)
[21:08:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[21:09:07] PROBLEM - RAID on es2009 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[21:12:18] where can i see the data collected by " diamond::collector { 'Httpd':"?
[21:12:18] oh, ensure => absent, duh
[21:12:18] (PS2) Dzahn: apache: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266970
[21:12:21] (CR) Dzahn: [C: 2] apache: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266970 (owner: Dzahn)
[21:13:17] Blocked-on-Operations, operations, RESTBase, hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1987631 (GWicke) @fgiunchedi, could you tackle the upgrades (disk and CPU/RAM) for 1007-9 soon?
[21:14:20] (PS1) MarcoAurelio: Enabling Extension:ShortUrl on or.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429)
[21:14:51] (PS2) MarcoAurelio: Enabling Extension:ShortUrl on od.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429)
[21:14:56] YuviPanda: is "ircyall" used anywhere? i can't seem to find it with "watroles"
[21:15:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[21:15:27] mutante: yes it's used in the ircnotifier project
[21:15:34] oh that's interesting. not sure why that's the case
[21:16:15] I'm in the middle of an etcd move, so can you file a bug so I can take a look, mutante?
[21:16:18] (CR) JanZerebecki: "The dependency is probably deployed this train on Wed, so it can be added to SWAT afterwards." [mediawiki-config] - https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: Bene)
[21:16:28] valhallasw`cloud: I'm considering letting ircnotifier die
[21:16:33] valhallasw`cloud: wikibugs is the only user now :D
[21:16:38] well
[21:16:40] :(
[21:16:50] YuviPanda: sure, ok
[21:16:56] operations, ops-codfw: es2009 degraded RAID - https://phabricator.wikimedia.org/T125442#1987649 (jcrespo) NEW
[21:18:16] YuviPanda: :(
[21:18:32] YuviPanda: well, I guess I'll have to move to wmbot then :P
[21:18:41] (PS1) Yuvipanda: tools: Point flannel to use new flannel specific etcd [puppet] - https://gerrit.wikimedia.org/r/267781 (https://phabricator.wikimedia.org/T125371)
[21:18:56] valhallasw`cloud: yeah. I don't think I've the bandwidth to push it through to its completion
[21:19:23] valhallasw`cloud: hmm, since we only want it for sal, maybe once bd808 is done with all the auth manager stuff we can make a HTTP endpoint for SAL :)
[21:19:37] (PS1) Andrew Bogott: Fix instance storage volume for labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/267782
[21:19:42] YuviPanda: uh, I guess, although I also find it practical to see it on irc
[21:19:53] fair enough
[21:20:07] valhallasw`cloud: I can probably just move it to a tools project, I guess
[21:20:11] and kill the puppet role
[21:20:15] but maybe not worth it
[21:20:55] operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987669 (jcrespo) Tuesday 9 Feb, 23:00 UTC, does that work for anyone on your team?
[21:20:55] YuviPanda: i used the wrong "syntax" for the tool, nevermind
[21:21:15] /role/role::foo vs role/foo role:foo etc
[21:21:18] aaah
[21:21:21] fun
[21:22:21] (PS2) Andrew Bogott: Fix instance storage volume for labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/267782
[21:23:46] (CR) Andrew Bogott: [C: 2] Fix instance storage volume for labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/267782 (owner: Andrew Bogott)
[21:24:39] (PS2) Dzahn: ircyall: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266964
[21:24:49] (CR) Dzahn: [C: 2] ircyall: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266964 (owner: Dzahn)
[21:26:19] thanks mutante
[21:26:28] (PS1) MarcoAurelio: Enabling Extension:ShortUrl for bhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/267783 (https://phabricator.wikimedia.org/T113348)
[21:27:05] YuviPanda: welcome, it's trying to eliminate that issue all across the repo, then remove the exception for it from global .puppet-lint.rc
[21:27:11] * YuviPanda nods
[21:30:18] operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987692 (greg) The hour before afternoon SWAT should be OK, yeah. Anything specific you need from us? We'll all be around during the time, but if you need someone's undivid...
[21:30:25] (PS1) MarcoAurelio: Removing testwiki from wmgUseShortUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/267784
[21:30:52] (PS1) Yuvipanda: toollabs: Add proxy nodes to flanel etcd access list [puppet] - https://gerrit.wikimedia.org/r/267785
[21:32:00] (PS2) Yuvipanda: toollabs: Add proxy nodes to flanel etcd access list [puppet] - https://gerrit.wikimedia.org/r/267785
[21:32:02] (PS2) Yuvipanda: tools: Point flannel to use new flannel specific etcd [puppet] - https://gerrit.wikimedia.org/r/267781 (https://phabricator.wikimedia.org/T125371)
[21:34:00] (CR) Dzahn: "${labsproject}.${hostname}.reqstats - can't find it in graphite.wmflabs" [puppet] - https://gerrit.wikimedia.org/r/266973 (owner: Dzahn)
[21:34:22] (PS2) Dzahn: dynamicproxy: fix top-scope var without namespace, lint [puppet] - https://gerrit.wikimedia.org/r/266973
[21:34:40] (CR) Dzahn: [C: 2] dynamicproxy: fix top-scope var without namespace, lint [puppet] - https://gerrit.wikimedia.org/r/266973 (owner: Dzahn)
[21:37:02] operations, Mobile-Content-Service: Improve operational documentation for the mobileapps service - https://phabricator.wikimedia.org/T123852#1987743 (Mholloway) I don't have permission to create a page on Wikitech, but I created a draft here: https://www.mediawiki.org/wiki/User:MHolloway_(WMF)/Draft:Mobi...
[21:37:03] (CR) Dzahn: "the change only touched the graphite part and i could not see those reqstats in actual graphite" [puppet] - https://gerrit.wikimedia.org/r/266973 (owner: Dzahn)
[21:41:01] operations, Mobile-Content-Service: Improve operational documentation for the mobileapps service - https://phabricator.wikimedia.org/T123852#1987760 (Dzahn) >>! In T123852#1987743, @Mholloway wrote: > I don't have permission to create a page on Wikitech, but I created a draft here: @Mholloway your user a...
[21:44:18] (CR) Tim Landscheidt: [C: -1] "The issue as shown by the tasks is not "ensure => latest", that just highlights it. On the initial Puppet run, PHP packages & Co. are ins" [puppet] - https://gerrit.wikimedia.org/r/267778 (owner: Dzahn)
[21:44:18] (CR) Dzahn: "yea, i know it touches the certificates.. but $site -> $::site is fairly common" [puppet] - https://gerrit.wikimedia.org/r/266980 (owner: Dzahn)
[21:46:11] operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987768 (jcrespo) No, the only complex thing is DBA-specific. I will need just the usual attention as if it was a deployment (logs monitoring for higher rates of errors, et...
[21:46:46] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 11.76% of data above the critical threshold [100000000.0]
[21:46:47] (Abandoned) Dzahn: toollabs: don't use ensure => latest for everything [puppet] - https://gerrit.wikimedia.org/r/267778 (owner: Dzahn)
[21:47:57] PROBLEM - puppet last run on wtp2001 is CRITICAL: CRITICAL: puppet fail
[21:51:11] operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987795 (greg) Gotcha, thanks @jcrespo. Added to the calendar: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=281828&oldid=281348
[21:53:44] (PS2) Dzahn: aptly: fix top-scope vars without namespace [puppet] - https://gerrit.wikimedia.org/r/266984
[21:55:01] operations, Mobile-Content-Service: Improve operational documentation for the mobileapps service - https://phabricator.wikimedia.org/T123852#1987816 (Mholloway) Guess it was indeed a case of user error. After the reset I can now create pages on Wikitech. Thanks @Dzahn!
[21:55:16] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 896
[21:55:34] (CR) Dzahn: [C: 2] aptly: fix top-scope vars without namespace [puppet] - https://gerrit.wikimedia.org/r/266984 (owner: Dzahn)
[21:58:34] (CR) Dzahn: "checked on: tools-services-01, toolsbeta-aptly-server-01, ores-misc-01, (they use this per "watroles" tool) - noop" [puppet] - https://gerrit.wikimedia.org/r/266984 (owner: Dzahn)
[21:58:36] (CR) Luke081515: [C: -1] "Why odwikisource? There are not wikis woth code od, only with or, like requested in T124429" [mediawiki-config] - https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429) (owner: MarcoAurelio)
[21:59:25] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987820 (Redrose64) It may have double-run - I have been force-logged out twice since this task was raised, the second one was today
[22:00:16] RECOVERY - check_mysql on lutetium is OK: Uptime: 1833206 Threads: 2 Questions: 13676857 Slow queries: 15785 Opens: 79943 Flush tables: 3 Open tables: 64 Queries per second avg: 7.460 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[22:04:18] (PS2) Dzahn: ganglia: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266965
[22:04:27] (CR) Dzahn: [C: 2] ganglia: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266965 (owner: Dzahn)
[22:04:33] operations, ops-codfw: Codfw-mw* IDRAC firmware upgrade - https://phabricator.wikimedia.org/T125088#1987838 (Papaul) Please see link below for the documentation on how to upgrade the IDRAC firmware for the PowerEdge R410. https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_R...
[22:11:38] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[22:16:27] RECOVERY - puppet last run on wtp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:17:33] operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1987865 (ori) >>! In T124418#1986594, @BBlack wrote: > Regardless, the average rate of HTCP these days is...
[22:19:26] (PS3) Yuvipanda: toollabs: Add proxy nodes to flanel etcd access list [puppet] - https://gerrit.wikimedia.org/r/267785
[22:19:33] (CR) Yuvipanda: [C: 2 V: 2] toollabs: Add proxy nodes to flanel etcd access list [puppet] - https://gerrit.wikimedia.org/r/267785 (owner: Yuvipanda)
[22:21:30] (PS3) Yuvipanda: tools: Point flannel to use new flannel specific etcd [puppet] - https://gerrit.wikimedia.org/r/267781 (https://phabricator.wikimedia.org/T125371)
[22:21:32] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987876 (Redrose64) It may have double-run - I have been force-logged out twice since this task was raised, the second one was today
[22:21:36] (CR) Yuvipanda: [C: 2 V: 2] tools: Point flannel to use new flannel specific etcd [puppet] - https://gerrit.wikimedia.org/r/267781 (https://phabricator.wikimedia.org/T125371) (owner: Yuvipanda)
[22:26:07] (PS3) Dzahn: ganglia: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266965
[22:27:48] operations: decom magnesieum (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#1987896 (Dzahn)
[22:29:40] operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1987905 (BBlack) @ori - yeah that makes sense for the initial bump, and I think there may have even been a...
[22:30:12] Ops-Access-Requests, operations, DBA: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1987910 (Dzahn)
[22:31:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0]
[22:33:40] operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1987938 (JanZerebecki) Good find. That commit was first deployed in wmf8 which was branched on Dec 8 (rMWb...
[22:35:05] (PS1) Yuvipanda: tools: Disallow k8s master etcd from being accessible to workers [puppet] - https://gerrit.wikimedia.org/r/267793 (https://phabricator.wikimedia.org/T125371)
[22:40:31] (PS2) Yuvipanda: tools: Disallow k8s master etcd from being accessible to workers [puppet] - https://gerrit.wikimedia.org/r/267793 (https://phabricator.wikimedia.org/T125371)
[22:42:07] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[22:50:38] (CR) Jdlrobson: [C: 1] "who can swat this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/267776 (https://phabricator.wikimedia.org/T124220) (owner: Bmansurov)
[22:59:31] operations, Wikimedia-DNS, Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1988007 (yuvipanda) I think the patch needs to get merged and babysat :)
[23:22:38] (PS1) Yuvipanda: tools: Add separate role for k8s etcd [puppet] - https://gerrit.wikimedia.org/r/267796
[23:22:45] (PS1) Ottomata: [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859)
[23:23:36] (PS2) Yuvipanda: tools: Add separate role for k8s etcd [puppet] - https://gerrit.wikimedia.org/r/267796
[23:23:47] (CR) Yuvipanda: [C: 2] tools: Disallow k8s master etcd from being accessible to workers [puppet] - https://gerrit.wikimedia.org/r/267793 (https://phabricator.wikimedia.org/T125371) (owner: Yuvipanda)
[23:24:09] (CR) Yuvipanda: [C: 2 V: 2] tools: Add separate role for k8s etcd [puppet] - https://gerrit.wikimedia.org/r/267796 (owner: Yuvipanda)
[23:24:27] (CR) jenkins-bot: [V: -1] [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) (owner: Ottomata)
[23:24:44] (CR) Paladox: "recheck" [software] - https://gerrit.wikimedia.org/r/169253 (owner: Tim Landscheidt)
[23:24:53] (PS2) Ottomata: [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859)
[23:26:02] (CR) jenkins-bot: [V: -1] [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) (owner: Ottomata)
[23:26:46] So, sanity check, is just, everything broken
[23:28:38] operations, netops: turn-up/implement zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#1916306 (RobH) All procurement and onsite patch tasks have been completed. (They don't show resolved since they sit in the pending invoice column for a month pending invoice.) @Faidon sho...
[23:29:51] Never mind, looks like it was temporary
[23:30:49] operations, Wikimedia-Video, Patch-For-Review: 1gb file upload limit is too restrictive for conference presentation videos - https://phabricator.wikimedia.org/T116514#1988121 (BBlack) If we have a strong reason to stick with 2.5GB, we should really test this through the whole stack somehow (I guess in...
[23:37:10] operations, Performance-Team, Graphite, Monitoring: Add monitoring for analytics-statsv service - https://phabricator.wikimedia.org/T117994#1988128 (Krinkle) a:ori
[23:38:23] Reedy: have a moment to help me with an https-everywhere rule? I hear tell that you’re an expert.
[23:48:45] (PS1) Yuvipanda: tools: Allow k8s etcd peers to access client port too [puppet] - https://gerrit.wikimedia.org/r/267801
[23:49:16] (PS2) Yuvipanda: tools: Allow k8s etcd peers to access client port too [puppet] - https://gerrit.wikimedia.org/r/267801
[23:49:23] (CR) Yuvipanda: [C: 2 V: 2] tools: Allow k8s etcd peers to access client port too [puppet] - https://gerrit.wikimedia.org/r/267801 (owner: Yuvipanda)
[23:51:34] !log restbase deploy start of c3bd864 on canary rb1001
[23:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:57:32] (PS1) Ori.livneh: Add monitoring for statsv process [puppet] - https://gerrit.wikimedia.org/r/267802 (https://phabricator.wikimedia.org/T117994)
[23:57:38] (PS1) Luke081515: Enable confirmed group at nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/267804 (https://phabricator.wikimedia.org/T125448)
[23:57:44] (PS2) Ori.livneh: Add monitoring for statsv process [puppet] - https://gerrit.wikimedia.org/r/267802 (https://phabricator.wikimedia.org/T117994)
[23:57:50] (CR) Ori.livneh: [C: 2 V: 2] Add monitoring for statsv process [puppet] - https://gerrit.wikimedia.org/r/267802 (https://phabricator.wikimedia.org/T117994) (owner: Ori.livneh)
[23:59:11] (CR) Luke081515: [C: -1] "Do not merge:" [mediawiki-config] - https://gerrit.wikimedia.org/r/267804 (https://phabricator.wikimedia.org/T125448) (owner: Luke081515)