[00:00:15] <TimStarling>	 you know we are not actually on wmf.11
[00:00:20] <Krenair>	 yes
[00:00:43] <TimStarling>	 ok, just checking
[00:01:10] <Krenair>	 we just discussed this in the other channel
[00:01:46] <Krenair>	 wanted to get the change out before the cluster went back to using wmf.11
[00:01:56] <James_F>	 Oh, sorry for forking.
[00:02:36] <TimStarling>	 guess I should push out mine too
[00:04:25] <logmsgbot>	 !log tstarling@mira Synchronized php-1.27.0-wmf.11/includes: (no message) (duration: 01m 31s)
[00:04:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:27:24] <icinga-wm>	 PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: puppet fail
[00:47:18] <grrrit-wm>	 (PS2) Tim Landscheidt: Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934
[00:55:25] <icinga-wm>	 RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:20:44] <icinga-wm>	 PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:46:55] <icinga-wm>	 RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[02:25:25] <icinga-wm>	 PROBLEM - HHVM rendering on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:26:24] <icinga-wm>	 PROBLEM - Apache HTTP on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:26:54] <icinga-wm>	 PROBLEM - puppet last run on mw1072 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:04] <icinga-wm>	 PROBLEM - dhclient process on mw1072 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:30] <wikibugs>	 operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1985525 (BBlack) Well, we have 3 different stages of rate-increase in the insert graph, so it could well b...
[02:28:13] <grrrit-wm>	 (PS1) Yuvipanda: toollabs: Add new role for flannel etcd [puppet] - https://gerrit.wikimedia.org/r/267628
[02:28:33] <icinga-wm>	 RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[02:28:45] <icinga-wm>	 RECOVERY - dhclient process on mw1072 is OK: PROCS OK: 0 processes with command name dhclient
[02:29:09] <grrrit-wm>	 (CR) Yuvipanda: [C: 2] toollabs: Add new role for flannel etcd [puppet] - https://gerrit.wikimedia.org/r/267628 (owner: Yuvipanda)
[02:29:16] <grrrit-wm>	 (CR) Yuvipanda: [V: 2] toollabs: Add new role for flannel etcd [puppet] - https://gerrit.wikimedia.org/r/267628 (owner: Yuvipanda)
[02:29:57] <wikibugs>	 operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1985526 (BBlack) Continuing with some stuff I was saying in IRC the other day.  At the "new normal", we're...
[02:34:19] <wikibugs>	 operations, MediaWiki-Cache, MediaWiki-JobQueue, MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1985528 (Legoktm) >>! In T124418#1985525, @BBlack wrote: > it's not like we gained a 5x increase in human...
[02:37:33] <icinga-wm>	 PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:39:16] <wikibugs>	 operations, Wikimedia-Mailing-lists: Need listadmin password reset for Translators-l mailing list - https://phabricator.wikimedia.org/T123163#1985530 (Jalexander) Open>Resolved Reset and password sent to both list admins.
[03:03:45] <icinga-wm>	 RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[03:03:53] <icinga-wm>	 PROBLEM - Disk space on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:03:54] <icinga-wm>	 PROBLEM - DPKG on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:04] <icinga-wm>	 PROBLEM - puppet last run on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:04] <icinga-wm>	 PROBLEM - RAID on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:34] <icinga-wm>	 PROBLEM - dhclient process on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:44] <icinga-wm>	 PROBLEM - salt-minion processes on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:53] <icinga-wm>	 PROBLEM - Check size of conntrack table on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:04:54] <icinga-wm>	 PROBLEM - configured eth on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:10:34] <wikibugs>	 operations, Analytics-Kanban, hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1985547 (GWicke) In the short term, a few tweaks could also help to reduce latency:  - Reduce replication for the per-article keyspace from 3 to 2. - Read with CL_ONE by default. - Pos...
[03:50:14] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[03:53:43] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[04:11:14] <icinga-wm>	 PROBLEM - NTP on cygnus is CRITICAL: NTP CRITICAL: No response from NTP server
[04:21:40] <grrrit-wm>	 (CR) Santhosh: [C: -1] "Please hold this till 1.27.0-wmf.12 is in production" [mediawiki-config] - https://gerrit.wikimedia.org/r/267236 (owner: KartikMistry)
[04:26:49] <Danny_B>	 >>> UNRECOVERABLE FATAL ERROR <<<
[04:26:49] <Danny_B>	 Maximum execution time of 10 seconds exceeded
[04:26:49] <Danny_B>	 /srv/phab/libphutil/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:131
[04:26:50] <ecmabot-wm>	 Danny_B: SyntaxError: Unexpected identifier
[04:27:13] <Danny_B>	 when trying to display https://phabricator.wikimedia.org/maniphest/report/project/
[04:53:04] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[04:54:53] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:07:14] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[05:08:54] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:15:54] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[05:17:34] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:22:44] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[05:24:33] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:47:23] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[05:56:03] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[06:27:14] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0]
[06:27:23] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0]
[06:29:13] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[06:30:13] <icinga-wm>	 PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:44] <icinga-wm>	 PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:53] <icinga-wm>	 PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:05] <icinga-wm>	 PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:13] <icinga-wm>	 PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:43] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:03] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:04] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[06:46:35] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[06:51:45] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[06:55:23] <icinga-wm>	 RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:13] <icinga-wm>	 RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:44] <icinga-wm>	 RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:56:45] <icinga-wm>	 RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:57:03] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[06:57:03] <icinga-wm>	 RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:14] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:58:43] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:54] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[07:14:04] <icinga-wm>	 PROBLEM - puppet last run on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:44] <icinga-wm>	 RECOVERY - puppet last run on mw1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:16:14] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[07:32:03] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[07:37:14] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[07:44:05] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[07:47:05] <icinga-wm>	 PROBLEM - puppet last run on mw1057 is CRITICAL: CRITICAL: Puppet has 18 failures
[07:49:23] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[07:52:45] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[07:58:03] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[08:03:24] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[08:08:35] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[08:15:04] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:16:53] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1057 is OK: OK: nf_conntrack is 3 % full
[08:27:53] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[08:38:34] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[08:45:10] <grrrit-wm>	 (PS1) Muehlenhoff: Correct target distribution [debs/openssl] - https://gerrit.wikimedia.org/r/267645
[08:45:15] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:45:39] <grrrit-wm>	 (CR) Muehlenhoff: [C: 2 V: 2] Correct target distribution [debs/openssl] - https://gerrit.wikimedia.org/r/267645 (owner: Muehlenhoff)
[08:47:23] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[08:48:43] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1057 is OK: OK: nf_conntrack is 4 % full
[08:52:44] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[08:57:52] <grrrit-wm>	 (PS11) ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750)
[08:58:03] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[08:59:17] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) (owner: ArielGlenn)
[09:03:43] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[09:06:16] <grrrit-wm>	 (PS1) Addshore: Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/267646
[09:07:26] <grrrit-wm>	 (PS2) Addshore: Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/267646
[09:11:28] <grrrit-wm>	 (PS1) Lokal Profil: Setting l10n-bot submissions to same as in https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki,access [dumps/dcat] (refs/meta/config) - https://gerrit.wikimedia.org/r/267647
[09:14:05] <grrrit-wm>	 (PS2) Addshore: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/264734
[09:14:25] <grrrit-wm>	 (CR) Lokal Profil: [C: 2 V: 2] Localisation updates from https://translatewiki.net. [dumps/dcat] - https://gerrit.wikimedia.org/r/266482 (owner: L10n-bot)
[09:15:47] <grrrit-wm>	 (PS2) Addshore: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/264735
[09:17:21] <grrrit-wm>	 (PS3) Addshore: Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/267646
[09:18:02] <grrrit-wm>	 (PS3) Addshore: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/264735
[09:18:04] <grrrit-wm>	 (PS3) Addshore: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/264734
[09:30:20] <wikibugs>	 operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1985715 (ArielGlenn) @robh: I've been requested to provide a list of all hardware needs in both codfw and eqiad, both snap...
[09:31:15] <grrrit-wm>	 (PS2) Lokal Profil: Localisation updates from https://translatewiki.net. [dumps/dcat] - https://gerrit.wikimedia.org/r/266681 (owner: L10n-bot)
[09:31:38] <wikibugs>	 operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1985716 (ArielGlenn)
[09:32:14] <grrrit-wm>	 (CR) Lokal Profil: [C: 2 V: 2] "Manual rebase was needed" [dumps/dcat] - https://gerrit.wikimedia.org/r/266681 (owner: L10n-bot)
[09:33:39] <wikibugs>	 operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1985720 (ArielGlenn) When figuring out memory and cpu needs for the dataset servers, we should keep in mind:  the dataset host has the following going on a...
[09:34:05] <grrrit-wm>	 (PS2) Lokal Profil: Localisation updates from https://translatewiki.net. [dumps/dcat] - https://gerrit.wikimedia.org/r/267642 (owner: L10n-bot)
[09:34:21] <grrrit-wm>	 (CR) Lokal Profil: [C: 2 V: 2] "rebased" [dumps/dcat] - https://gerrit.wikimedia.org/r/267642 (owner: L10n-bot)
[09:36:25] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[09:36:30] <wikibugs>	 operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1985723 (ArielGlenn) @paravoid, I'm adding you to this because you had volunteered your help and know-how earlier in making our dumps download setup suffic...
[09:37:03] <wikibugs>	 operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1985725 (ArielGlenn)
[09:37:05] <wikibugs>	 operations, Dumps-Generation, hardware-requests: Detail codfw snapshot/dataset requirements - https://phabricator.wikimedia.org/T118173#1985724 (ArielGlenn)
[09:38:05] <wikibugs>	 operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1792783 (ArielGlenn)
[09:38:33] <grrrit-wm>	 (CR) Merlijn van Deen: [C: 1] Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[09:39:14] <wikibugs>	 operations, Datasets-General-or-Unknown: Provide a good  download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1985731 (ArielGlenn) All hardware refresh tickets for dumps are now at T118154.
[09:41:27] <grrrit-wm>	 (PS12) ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750)
[09:41:47] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[09:41:56] <grrrit-wm>	 (CR) Merlijn van Deen: [C: -1] ""Allows for more flexibility this way"" [puppet] - https://gerrit.wikimedia.org/r/267402 (owner: Yuvipanda)
[09:42:12] <wikibugs>	 operations: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1985735 (elukey) Updates from me (or better excuses to justify why this work hasn't been completed yet :)  I worked with Joe to double check how nutcracker handles redis and memcached nodes going away from its...
[09:44:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:45:24] <icinga-wm>	 PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:46:34] <icinga-wm>	 PROBLEM - SSH on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:46:35] <icinga-wm>	 PROBLEM - DPKG on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:46:45] <icinga-wm>	 PROBLEM - HHVM processes on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:46:45] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:46:54] <icinga-wm>	 PROBLEM - Disk space on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:47:04] <icinga-wm>	 PROBLEM - RAID on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:47:25] <icinga-wm>	 PROBLEM - configured eth on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:04] <icinga-wm>	 PROBLEM - nutcracker process on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:04] <icinga-wm>	 PROBLEM - dhclient process on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:04] <icinga-wm>	 PROBLEM - nutcracker port on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:04] <icinga-wm>	 PROBLEM - salt-minion processes on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:48:56] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[09:53:51] <wikibugs>	 operations, Analytics-Kanban, hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1985752 (JAllemandou) @Gwicke:  - The replication factor for the per-article table had been changed to 2 a few month ago. I think restbase config management for cassandra has changed i...
[10:01:45] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[10:03:34] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[10:05:14] <grrrit-wm>	 (PS13) ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750)
[10:09:04] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[10:09:54] <icinga-wm>	 RECOVERY - nutcracker port on mw1057 is OK: TCP OK - 0.000 second response time on port 11212
[10:09:55] <icinga-wm>	 RECOVERY - dhclient process on mw1057 is OK: PROCS OK: 0 processes with command name dhclient
[10:09:55] <icinga-wm>	 RECOVERY - nutcracker process on mw1057 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[10:09:55] <icinga-wm>	 RECOVERY - salt-minion processes on mw1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:10:24] <icinga-wm>	 RECOVERY - SSH on mw1057 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[10:10:24] <icinga-wm>	 RECOVERY - DPKG on mw1057 is OK: All packages OK
[10:10:33] <icinga-wm>	 RECOVERY - HHVM processes on mw1057 is OK: PROCS OK: 6 processes with command name hhvm
[10:10:33] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1057 is OK: OK: nf_conntrack is 0 % full
[10:10:34] <icinga-wm>	 RECOVERY - Disk space on mw1057 is OK: DISK OK
[10:10:54] <icinga-wm>	 RECOVERY - RAID on mw1057 is OK: OK: no RAID installed
[10:11:14] <icinga-wm>	 RECOVERY - configured eth on mw1057 is OK: OK - interfaces up
[10:12:15] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] dumps: set up but don't enable script for dumps to run from cron [puppet] - https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) (owner: ArielGlenn)
[10:14:34] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[10:18:38] <grrrit-wm>	 (PS1) ArielGlenn: enable dumps from cron on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/267653
[10:20:03] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[10:27:35] <jynus>	 !log partitioning revision and logging for db2037 and db2044 (s4)
[10:27:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:32:27] <godog>	 !log reboot ms-be1010, xfs
[10:32:30] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:35:54] <wikibugs>	 operations, Patch-For-Review, Swift: swift capacity planning - https://phabricator.wikimedia.org/T1268#1985807 (fgiunchedi) also related to capacity planning for swift, thumbnails vs originals size metrics in eqiad: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1448470754.236&target=...
[10:36:04] <grrrit-wm>	 (PS2) ArielGlenn: enable dumps from cron on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/267653
[10:36:44] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[10:39:44] <icinga-wm>	 RECOVERY - Disk space on ms-be1010 is OK: DISK OK
[10:39:44] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1010 is OK: OK - load average: 9.84, 2.28, 0.75
[10:39:44] <icinga-wm>	 RECOVERY - Check size of conntrack table on ms-be1010 is OK: OK: nf_conntrack is 4 % full
[10:39:44] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:39:44] <icinga-wm>	 RECOVERY - swift-object-server on ms-be1010 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[10:39:44] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[10:39:44] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[10:39:45] <icinga-wm>	 RECOVERY - dhclient process on ms-be1010 is OK: PROCS OK: 0 processes with command name dhclient
[10:39:45] <icinga-wm>	 RECOVERY - DPKG on ms-be1010 is OK: All packages OK
[10:39:46] <icinga-wm>	 RECOVERY - configured eth on ms-be1010 is OK: OK - interfaces up
[10:39:46] <icinga-wm>	 RECOVERY - swift-account-server on ms-be1010 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[10:39:47] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[10:40:13] <icinga-wm>	 PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 896
[10:42:14] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[10:45:13] <icinga-wm>	 RECOVERY - check_mysql on db1008 is OK: Uptime: 1105615 Threads: 2 Questions: 6726950 Slow queries: 7504 Opens: 2795 Flush tables: 2 Open tables: 430 Queries per second avg: 6.084 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:46:31] <grrrit-wm>	 (PS3) ArielGlenn: enable dumps from cron on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/267653
[10:48:42] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] enable dumps from cron on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/267653 (owner: ArielGlenn)
[11:02:33] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[11:07:55] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[11:12:24] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[11:12:45] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[11:12:53] <icinga-wm>	 PROBLEM - swift codfw-prod object availability on graphite1001 is CRITICAL: CRITICAL: 25.00% of data under the critical threshold [90.0]
[11:15:13] <grrrit-wm>	 (PS1) ArielGlenn: snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658
[11:16:40] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658 (owner: ArielGlenn)
[11:18:54] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[11:19:11] <grrrit-wm>	 (PS2) ArielGlenn: snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658
[11:19:37] <godog>	 !log repool restbase1007
[11:19:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:20:34] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - https://gerrit.wikimedia.org/r/267658 (owner: ArielGlenn)
[11:20:41] <wikibugs>	 Blocked-on-Operations, operations: Re-pool restbase1007 - https://phabricator.wikimedia.org/T124565#1985849 (fgiunchedi) Open>Resolved a:fgiunchedi indeed, I've repooled restbase1007
[11:20:59] <jynus>	 it is the metrics api that is failing
[11:24:19] <wikibugs>	 operations, ops-codfw: ms-be2015 doesn't come up after reboot - https://phabricator.wikimedia.org/T125383#1985864 (MoritzMuehlenhoff) NEW a:Papaul
[11:24:24] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[11:25:54] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be2015 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff hardware problem, see T125383
[11:26:14] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[11:27:10] <godog>	 jynus: mh?
[11:27:23] <jynus>	 it works now
[11:27:32] <jynus>	 https://grafana.wikimedia.org/dashboard/db/varnish-http-errors
[11:30:31] <jynus>	 has the number of requests multiplied by 4 since 19 Jan?
[11:30:44] <icinga-wm>	 PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:30:53] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:31:23] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:34:28] <moritzm>	 !log uploaded openssl 1.0.2f for jessie-wikimedia to carbon
[11:34:31] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:35:34] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[11:38:44] <grrrit-wm>	 (PS1) Jcrespo: Delete eqiad masters from codfw configuration and add db weights [mediawiki-config] - https://gerrit.wikimedia.org/r/267659
[11:39:03] <moritzm>	 !log uploaded openjdk-8 8u72-b15-1~bpo8+1 for jessie-wikimedia to carbon
[11:39:05] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:40:43] <jynus>	 I do not know who to add as reviewer of that last commit, either someone from infrastructure or from performance, I suppose
[11:46:56] <jynus>	 apergos, I have some questions regarding mysql && dumps for codfw, I do not know whether to create a ticket for it
[11:47:03] <grrrit-wm>	 (PS1) John Vandenberg: Remove multiple spaces before operator [puppet/kafkatee] - https://gerrit.wikimedia.org/r/267660
[11:49:15] <jynus>	 I am creating one; I suppose having a ticket doesn't hurt, even if it turns out it can be trivially answered
[11:49:47] <apergos>	 jynus: let's hear it
[11:53:03] <wikibugs>	 operations, Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#1985931 (ArielGlenn) I've updated my docker salt testbed to work with latest docker api and latest wmf packages: https://github.com/apergos/docker-saltcluster  I'll be...
[11:53:19] <wikibugs>	 operations, DBA, Dumps-Generation, codfw-rollout, codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985937 (jcrespo) NEW
[11:53:36] <jynus>	 ^apergos
[11:54:19] <jynus>	 basically, give me your thoughts, issues, and both current state and desired state
[11:55:35] <wikibugs>	 operations, DBA, Dumps-Generation, codfw-rollout, codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985953 (ArielGlenn) We could continue to generate them out of eqiad and serve them as downloadable from c...
[11:55:58] <apergos>	 jynus: hope that's enough to start with
[11:58:15] <icinga-wm>	 RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:59:35] <moritzm>	 !log rolling reboot of ms-be1016 to ms-be1021 for kernel update
[11:59:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:01:50] <jynus>	 the key word is "we could"
[12:02:24] <apergos>	 yes
[12:02:25] <jynus>	 I am not asking what we have to do now, but what we should do
[12:02:47] <apergos>	 we could = it would be fine with me 
[12:02:53] <apergos>	 sorry, not "it is possible"
[12:03:07] <jynus>	 yes, but there are no machines yet?
[12:03:26] <apergos>	 in eqiad there are machines. there is not yet a dataset server in codfw
[12:03:34] <jynus>	 ok, that is the key thing
[12:03:35] <apergos>	 that's under discussion in the dumps hw ticket I mentioned
[12:03:45] <grrrit-wm>	 (CR) Aude: "it would be best to enable this first on test.wikidata + test.wikipedia + test2.wikipedia, and then wikidata." [mediawiki-config] - https://gerrit.wikimedia.org/r/267405 (https://phabricator.wikimedia.org/T124931) (owner: Llyrian)
[12:05:29] <jynus>	 one last question, what is the expected timeline for having working dumps active on codfw? Is it before the end of the quarter?
[12:05:34] <apergos>	 under discussion is whether we will have snapshot hosts in codfw at all, or rather assume that only one dc supports generation 
[12:05:44] <apergos>	 we don't have a timeline because of the above
[12:05:52] <jynus>	 that is ok
[12:06:02] <jynus>	 assume the answer was positive
[12:06:10] <jynus>	 (which we do not know yet)
[12:06:20] <apergos>	 if we are expected to be able to run them out of both dcs then
[12:06:30] <apergos>	 when does the quarter end?
[12:06:40] <jynus>	 1 May
[12:06:49] <apergos>	 it will depend on budget essentially, when the systems would be able to be ordered
[12:06:59] <apergos>	 and that is something ma rk will know
[12:07:01] <jynus>	 understood
[12:07:33] <apergos>	 the answer might be 'no we don't' in which case your task would already be done
[12:10:25] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[12:10:52] <jynus>	 I did not need a definitive answer, I just needed short term ideas for immediate codfw configuration so failover is possible
[12:11:04] <jynus>	 Configuration can be changed at any time
[12:11:33] <grrrit-wm>	 (03PS3) 10ArielGlenn: snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - 10https://gerrit.wikimedia.org/r/267658 
[12:12:18] <grrrit-wm>	 (03PS12) 10John Vandenberg: tox entry point to run pep8==1.4.6 [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar)
[12:12:40] <apergos>	 ok great
[12:12:56] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - 10https://gerrit.wikimedia.org/r/267658 (owner: 10ArielGlenn)
[12:15:00] <_joe_>	 !log backing up tin homes before reimaging
[12:15:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:15:41] <grrrit-wm>	 (03PS4) 10ArielGlenn: snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - 10https://gerrit.wikimedia.org/r/267658 
[12:15:54] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[12:15:57] <grrrit-wm>	 (03CR) 10Hoo man: [C: 032] Setting l10n-bot submissions to same as in https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki,access [dumps/dcat] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/267647 (owner: 10Lokal Profil)
[12:16:15] <grrrit-wm>	 (03CR) 10Hoo man: [V: 032] Setting l10n-bot submissions to same as in https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki,access [dumps/dcat] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/267647 (owner: 10Lokal Profil)
[12:16:30] <grrrit-wm>	 (03CR) 10John Vandenberg: "those submodules patches have been merged; just trying to keep this moving. I037ebefdeb05a890 and a similar changeset for the other submo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar)
[12:17:55] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] snapshots: enable the dumps from cron as opposed to just deploying the script [puppet] - 10https://gerrit.wikimedia.org/r/267658 (owner: 10ArielGlenn)
[12:18:05] <wikibugs>	 6operations, 10DBA, 10MediaWiki-Special-pages, 10Wikidata, 7Performance: Batch updates create slave lag on s3 over WAN - https://phabricator.wikimedia.org/T122429#1985971 (10JanZerebecki)
[12:19:02] <grrrit-wm>	 (03PS1) 10DCausse: Put more like query load back on eqiad for load testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267662 
[12:19:48] <wikibugs>	 6operations, 10DBA, 10Dumps-Generation, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985972 (10jcrespo) 5Open>3Resolved a:3jcrespo From my conversation with Ariel, this seems to have som...
[12:19:50] <wikibugs>	 6operations, 10DBA, 5Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#1985975 (10jcrespo)
[12:25:52] <grrrit-wm>	 (03PS2) 10Jcrespo: Delete eqiad masters from codfw configuration and add db weights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 
[12:26:26] <wikibugs>	 6operations, 10DBA, 10Dumps-Generation, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985982 (10ArielGlenn) I agree, dumps are not planned to be part of the upcoming test switchover in any case.
[12:29:02] <wikibugs>	 6operations, 10DBA, 10Dumps-Generation, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Clarify how mysql dumps will be architectured during codfw failover - https://phabricator.wikimedia.org/T125386#1985986 (10jcrespo) I've updated https://gerrit.wikimedia.org/r/#/c/267659/ to reflect this decision.
[12:32:32] <grrrit-wm>	 (03CR) 10Jcrespo: "Aaron, I want you to be aware of this important mediawiki-databases change, as it will impact both performance, availability and architect" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo)
[12:38:30] <grrrit-wm>	 (03PS3) 10Jcrespo: Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 
[12:39:44] <wikibugs>	 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1985995 (10jcrespo) Related: https://gerrit.wikimedia.org/r/267659
[12:40:01] <_joe_>	 !log depooling cp3042 from esams uploads
[12:40:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:40:25] <wikibugs>	 6operations, 10DBA, 5Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#1985997 (10jcrespo)
[12:40:26] <wikibugs>	 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1985996 (10jcrespo)
[12:41:13] <icinga-wm>	 RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[12:42:12] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: ipsec: remove cp3042 [puppet] - 10https://gerrit.wikimedia.org/r/267664 (https://phabricator.wikimedia.org/T125265) 
[12:48:38] <grrrit-wm>	 (03Abandoned) 10Giuseppe Lavagetto: role::deployment::salt_masters: correct a hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/266216 (owner: 10Giuseppe Lavagetto)
[12:49:05] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[12:53:09] <wikibugs>	 6operations, 10Dumps-Generation: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1986010 (10ArielGlenn) These jobs are now all enabled.  The first attempt to run will be Feb 2 early in the morning.  I'll be checking to make sure everything started properly.  One catch...
[12:54:34] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[12:55:03] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Add initial entries for auth* servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267665 
[12:59:30] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Add initial entries for auth* servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267665 
[12:59:40] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Add initial entries for auth* servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267665 (owner: 10Muehlenhoff)
[13:03:40] <grrrit-wm>	 (03PS5) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668 
[13:13:33] <icinga-wm>	 PROBLEM - RAID on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:44] <icinga-wm>	 PROBLEM - configured eth on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:06] <icinga-wm>	 PROBLEM - nutcracker port on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:13] <icinga-wm>	 PROBLEM - dhclient process on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:13] <icinga-wm>	 PROBLEM - salt-minion processes on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:13] <icinga-wm>	 PROBLEM - nutcracker process on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:36] <icinga-wm>	 PROBLEM - SSH on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:14:44] <icinga-wm>	 PROBLEM - HHVM processes on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:44] <icinga-wm>	 PROBLEM - DPKG on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:44] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:53] <icinga-wm>	 PROBLEM - Disk space on mw1057 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:26:31] <wikibugs>	 6operations, 6Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#1986042 (10hashar) We will need to tweak CI configuration and the Zuul merger repos origin. They are pointing to ytterbium.
[13:26:36] <grrrit-wm>	 (03PS6) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668 
[13:33:12] <moritzm>	 !log rolling reboot of xenon/cerium/praseodymium for kernel update (and updating to new openjdk-8)
[13:33:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:35:18] <grrrit-wm>	 (03PS1) 10ArielGlenn: new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) 
[13:36:01] <wikibugs>	 7Puppet, 6operations, 10Salt, 5Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#1986056 (10ArielGlenn) Tested with a print in place of the key acceptance line.
[13:36:29] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: 10ArielGlenn)
[13:38:22] <grrrit-wm>	 (03PS2) 10ArielGlenn: new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) 
[13:39:23] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: 10ArielGlenn)
[13:40:03] <icinga-wm>	 RECOVERY - RAID on mw1057 is OK: OK: no RAID installed
[13:40:13] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1057 is OK: OK: nf_conntrack is 0 % full
[13:40:19] <grrrit-wm>	 (03PS3) 10ArielGlenn: new salt runner to sign key for a specific minion [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) 
[13:40:22] <icinga-wm>	 RECOVERY - nutcracker port on mw1057 is OK: TCP OK - 0.000 second response time on port 11212
[13:40:23] <icinga-wm>	 RECOVERY - SSH on mw1057 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[13:40:23] <icinga-wm>	 RECOVERY - dhclient process on mw1057 is OK: PROCS OK: 0 processes with command name dhclient
[13:40:23] <icinga-wm>	 RECOVERY - salt-minion processes on mw1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:40:23] <icinga-wm>	 RECOVERY - nutcracker process on mw1057 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[13:40:43] <icinga-wm>	 RECOVERY - DPKG on mw1057 is OK: All packages OK
[13:40:53] <icinga-wm>	 RECOVERY - Disk space on mw1057 is OK: DISK OK
[13:41:04] <icinga-wm>	 RECOVERY - HHVM processes on mw1057 is OK: PROCS OK: 6 processes with command name hhvm
[13:41:13] <icinga-wm>	 RECOVERY - configured eth on mw1057 is OK: OK - interfaces up
[13:48:37] <grrrit-wm>	 (03PS4) 10ArielGlenn: rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571) 
[13:49:07] <grrrit-wm>	 (03PS5) 10ArielGlenn: rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571) 
[13:50:16] <grrrit-wm>	 (03PS6) 10ArielGlenn: rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571) 
[13:51:16] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571) (owner: 10ArielGlenn)
[14:00:15] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986077 (10BBlack) Another data point from the weekend: In one sample I took Saturday morning, when I sample...
[14:03:00] <wikibugs>	 6operations, 7Monitoring: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#1986090 (10elukey) Adding a big +1  At the moment the Kafka cluster could theoretically serve corrupted data due to disk failure, delegating the responsibility to react to the consumers. If the bad disk i...
[14:04:57] <godog>	 !log set ms-be1019 swift weight to 4000
[14:05:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:17:02] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:18:42] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:24:13] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:26:02] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:33:23] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:33:53] <grrrit-wm>	 (03CR) 10Addshore: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore)
[14:33:56] <chasemp>	 !log labstore1002 cfg scheduling
[14:33:59] <grrrit-wm>	 (03CR) 10Addshore: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore)
[14:33:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:38:53] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[14:39:13] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:40:54] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[14:43:06] <MarkTraceur>	 Is it possible the l10n caches are going through some churn? I have Commons people complaining about a missing message.
[14:43:08] <MarkTraceur>	 Specifically, https://commons.wikimedia.org/w/api.php?action=query&meta=allmessages&ammessages=licenses&amlang=experienced
[14:43:32] <MarkTraceur>	 cf. https://commons.wikimedia.org/wiki/MediaWiki:Licenses/experienced
[14:45:24] <wikibugs>	 6operations, 10Analytics-Cluster: Complete installation of analytics1017.eqiad.wmnet  - https://phabricator.wikimedia.org/T125055#1986160 (10elukey) a:3Ottomata
[14:45:41] <wikibugs>	 6operations, 10Analytics-Cluster: Complete installation of analytics1017.eqiad.wmnet - https://phabricator.wikimedia.org/T125055#1973447 (10elukey) Assigning to Andrew to have a reminder of this task in our queue.
[14:49:06] <MarkTraceur>	 I guess that message is supposed to be empty, and fall back to the en version, but instead we have nothing showing up on Special:Upload
[14:51:11] <grrrit-wm>	 (03CR) 10Ema: [C: 031] ipsec: remove cp3042 [puppet] - 10https://gerrit.wikimedia.org/r/267664 (https://phabricator.wikimedia.org/T125265) (owner: 10Giuseppe Lavagetto)
[14:52:23] <wikibugs>	 6operations, 10ops-esams, 5Patch-For-Review: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1986176 (10BBlack) The ata link down messages like: ``` [ 6.350863] ata3: SATA link down (SStatus 0 SControl 300) [ 6.675177] ata4: SATA link down (SStatus 0 SControl 300) ``` are nor...
[14:53:27] <anomie>	 MarkTraceur: A message with the value "-" typically means that the message is disabled and should *not* fall back, because leaving the message completely empty means to fall back.
[14:53:33] <anomie>	 (at least, I think so)
[14:54:58] <MarkTraceur>	 Hm.
[14:55:30] <MarkTraceur>	 anomie: The message didn't have any value on Commons until today, when odder created a copy of the English message to try and fix the upload page
[14:55:38] <MarkTraceur>	 FYI https://commons.wikimedia.org/w/index.php?title=Special:Upload&uselang=experienced
[14:59:24] <MatmaRex>	 MarkTraceur: VE had some problems with wrong i18n messages being delivered, or something. i think Krenair knows the details
[14:59:44] <MatmaRex>	 switching MW versions back and forth a couple times certainly didn't help with this, heh
[14:59:47] <Krenair>	 uh... no, I don't think so
[14:59:52] <MarkTraceur>	 It's not even a JS bug, all of the message handling happens on the backend
[14:59:54] <Krenair>	 we had issues with old JS code being delivered
[14:59:57] <MatmaRex>	 (i might be confused)
[15:00:01] <MatmaRex>	 hmm. okay
[15:00:27] <MarkTraceur>	 I also thought oh, must be a cache thing, but no, includes/License.php (which was a new find for me) handles everything
[15:01:23] <icinga-wm>	 RECOVERY - Host cp3042 is UP: PING WARNING - Packet loss = 64%, RTA = 86.19 ms
[15:01:23] <icinga-wm>	 RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK
[15:01:24] <icinga-wm>	 RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 166 ESP OK
[15:01:32] <icinga-wm>	 RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 60 ESP OK
[15:01:33] <icinga-wm>	 RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 166 ESP OK
[15:01:43] <icinga-wm>	 RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK
[15:01:52] <icinga-wm>	 RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 60 ESP OK
[15:02:02] <icinga-wm>	 RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 60 ESP OK
[15:02:03] <icinga-wm>	 RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 166 ESP OK
[15:02:03] <icinga-wm>	 RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 166 ESP OK
[15:02:03] <icinga-wm>	 RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 60 ESP OK
[15:02:13] <icinga-wm>	 RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[15:02:13] <icinga-wm>	 RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK
[15:02:33] <icinga-wm>	 RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 60 ESP OK
[15:02:33] <icinga-wm>	 RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 60 ESP OK
[15:02:53] <icinga-wm>	 RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 60 ESP OK
[15:02:53] <icinga-wm>	 RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK
[15:02:54] <icinga-wm>	 RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[15:03:03] <icinga-wm>	 RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 166 ESP OK
[15:05:54] <wikibugs>	 6operations, 10ops-esams, 5Patch-For-Review: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1986234 (10jcrespo) +1, as we have the exact message on other servers that we rebooted at the time.
[15:06:03] <icinga-wm>	 PROBLEM - Freshness of OCSP Stapling files on cp3042 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 29100 secs old!
[15:10:58] <ema>	 !log restarting hhvm on mw1057 
[15:11:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:11:30] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1986248 (10elukey) Next steps before closing:  1) disk replaced 2) bring the host/service up and running again 3) evaluates the following reverts: - https://gerrit.wikimedia.org/r/2...
[15:11:54] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1986249 (10elukey) a:3Cmjohnson
[15:12:07] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981110 (10elukey) Temporary assigned to Cmjohnson
[15:13:33] <icinga-wm>	 RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.553 second response time
[15:13:38] <grrrit-wm>	 (03PS1) 10MarcoAurelio: Adding museumvictoria.com.au domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267677 (https://phabricator.wikimedia.org/T125387) 
[15:13:48] <bblack>	 !log cp3042 repooled
[15:13:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:14:08] <wikibugs>	 6operations, 10ops-esams, 5Patch-For-Review: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1986271 (10BBlack) 5Open>3Resolved a:3BBlack So to retrace my steps here:  1. ata link down is a red herring 2. the root disks seem to be fine (other than rebuilding mirror in t...
[15:14:13] <icinga-wm>	 RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 70830 bytes in 0.106 second response time
[15:21:23] <icinga-wm>	 RECOVERY - puppet last run on mw1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:23:51] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] Raise file upload limit to 2500 MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ)
[15:23:51] <wikibugs>	 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1986304 (10ArielGlenn) A few more comments after discussion with Mark:  We thought about splitting up the dumps between dcs but this is expensive because it...
[15:24:46] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986305 (10Lydia_Pintscher) Very strange. Wikidata use on templates on talk pages isn't impossible but I'd c...
[15:24:54] <wikibugs>	 6operations, 10Wikimedia-Video, 5Patch-For-Review: 1gb file upload limit is too restrictive for conference presentation videos - https://phabricator.wikimedia.org/T116514#1986306 (10fgiunchedi) for sure @reedy, LGTM, on the swift side the default maximum upload size for a single object is 5GB FYI
[15:25:27] <grrrit-wm>	 (03PS1) 10Jcrespo: Pool db1018; Depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267678 (https://phabricator.wikimedia.org/T125215) 
[15:27:10] <wikibugs>	 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1986311 (10elukey) @yuvipanda: What are the next steps for this? Do you need more reviewers to get added?
[15:34:37] <grrrit-wm>	 (03PS1) 10Jcrespo: Testing db jessie installer problems on db2030 [puppet] - 10https://gerrit.wikimedia.org/r/267681 (https://phabricator.wikimedia.org/T125256) 
[15:37:39] <wikibugs>	 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1986404 (10Ottomata) I just talked to @bblack, and also looked at requests in the `webrequest_mobile` topic in Kafka.  There are still real user requests from cp1060, but most of...
[15:42:25] <bblack>	 !log restarted pybal on lvs1003
[15:42:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:44:30] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "FTR this translates to ~7% more space on a single whisper file" [puppet] - 10https://gerrit.wikimedia.org/r/266567 (owner: 10EBernhardson)
[15:44:43] <icinga-wm>	 RECOVERY - Freshness of OCSP Stapling files on cp3042 is OK: OK
[15:45:01] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: ipsec: remove cp3042 [puppet] - 10https://gerrit.wikimedia.org/r/267664 (https://phabricator.wikimedia.org/T125265) 
[15:45:12] <grrrit-wm>	 (03PS1) 10MarcoAurelio: Expanding transwiki import sources for be.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) 
[15:48:36] <bblack>	 !log restarted pybal on lvs1004 (lvs1003 above was a bad log message!)
[15:48:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:51:51] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986445 (10daniel) @BBlack can you give a few examples of such pages on srwiki? Were these non-existant talk...
[15:51:55] <James_F>	 OK, so… what's happening this morning? Are we going with wmf.11?
[15:53:10] <_joe_>	 this morning?
[15:53:29] <_joe_>	 uhm I guess I have no time to reimage tin then
[15:54:05] <James_F>	 _joe_: I don't know. SWAT's in 6 minutes' time.
[15:54:25] <logmsgbot>	 !log krenair@mira Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 01m 52s)
[15:54:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:54:39] <Krenair>	 mafk, ^
[15:54:51] <mafk>	 :D
[15:55:31] <mafk>	 but the patch needs to be merged in order to work
[15:56:04] <Krenair>	 indeed
[15:56:09] <Glaisher>	 Does that need a review from csteipp or someone? I don't see any non-WMF wiki there.
[15:58:01] <Krenair>	 unclear
[15:59:26] <Krenair>	 I don't think it's been done before
[15:59:56] <_joe_>	 James_F: if it's a normal SWAT, I can just ask the SWATTers to hold for a bit when I acknowledge tin is down
[16:00:09] * James_F nods.
[16:00:17] <mafk>	 I also have doubts
[16:00:29] <mafk>	 romaine has been granted local rights fyi
[16:00:30] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986463 (10BBlack) @daniel - Sorry I should have linked this earlier, I made a paste at the time: P2547 .  N...
[16:03:08] <wikibugs>	 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1986468 (10elukey) p:5Triage>3Normal
[16:03:31] <grrrit-wm>	 (03CR) 10MarcoAurelio: "I have doubts here however. I can't see any non-WMF wikis added to the transwiki import sources, so I don't know if this will be approved " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: 10MarcoAurelio)
[16:04:30] <thcipriani>	 hmm, looks like no notification for SWAT. 
[16:04:49] <greg-g>	 jouncebot: yt?
[16:04:56] <greg-g>	 jouncebot: refresh
[16:04:58] <jouncebot>	 I refreshed my knowledge about deployments.
[16:05:00] <Dereckson>	 Hello.
[16:05:02] <greg-g>	 jouncebot: next
[16:05:02] <jouncebot>	 In 4 hour(s) and 54 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160201T2100)
[16:05:06] <grrrit-wm>	 (03CR) 10Jforrester: "This also can never hope to have reasonable histories, as it's not an SUL wiki. I'm minded to say this shouldn't be allowed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: 10MarcoAurelio)
[16:05:27] <thcipriani>	 well anyway, I can SWAT, _joe_ did you want to do something to tin pre-SWAT?
[16:05:34] <ema>	 !log hhvm restarted on mw1072
[16:05:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:05:57] <James_F>	 greg-g, thcipriani: So…
[16:06:05] <Glaisher>	 Also, does that actually work? IIRC, prod cluster can't communicate with most of the internet.
[16:06:11] <wikibugs>	 6operations, 10Traffic, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1986480 (10BBlack) 3NEW a:3Joe
[16:06:12] <icinga-wm>	 RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.265 second response time
[16:06:15] <James_F>	 greg-g: What's the plan with wmf.11?
[16:06:22] <greg-g>	 James_F: determining
[16:06:40] <greg-g>	 sorry, I didn't work this weekend, and since I just got online 10 minutes ago, I don't have all my questions answered yet :)
[16:06:45] <James_F>	 greg-g: Slacker. ;-)
[16:07:04] <icinga-wm>	 RECOVERY - HHVM rendering on mw1072 is OK: HTTP OK: HTTP/1.1 200 OK - 70843 bytes in 0.120 second response time
[16:07:06] <greg-g>	 it *may* go out everywhere again today, we'll see
[16:07:09] <Dereckson>	 thcipriani: as Kelson isn't there, I'll take care of 262893 deployment too
[16:07:22] <thcipriani>	 Dereckson: sounds good, thanks.
[16:09:08] <Krenair>	 thcipriani, you should read the email I sent before dealing with that patch
[16:09:32] <thcipriani>	 James_F: are your SWAT patches predicated on wmf.11 being out in any way or are you OK to go without? Also, re: https://gerrit.wikimedia.org/r/#/c/258206/7 changing the default to true. Wanted to be sure that was the intention.
[16:09:38] <wikibugs>	 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1986503 (10Papaul) @Robh as you mentioned "HDD - SATA Seagate ST2000DM001 7.2K 2TB 10" those drives that I have in spare are SATA and ms-be2003 is using SAS and not SATA.
[16:09:38] <_joe_>	 thcipriani: no go on, I'll stop you in case
[16:09:44] <thcipriani>	 _joe_: kk, thanks
[16:09:51] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] "I tested the same zonefile + config change locally on a production-config pdns_recursor machine and it works. I haven't really tested the" [puppet] - 10https://gerrit.wikimedia.org/r/267208 (https://phabricator.wikimedia.org/T125170) (owner: 10BBlack)
[16:10:09] <Krenair>	 Glaisher, so I think that it will use the default proxy settings
[16:10:13] <James_F>	 thcipriani: Eh. Go for it now.
[16:10:31] <James_F>	 thcipriani: And yes, default-to-true is the intent.
[16:10:35] <thcipriani>	 James_F: ack. thanks!
[16:11:00] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264765 (https://phabricator.wikimedia.org/T116523) (owner: 10Jforrester)
[16:11:06] <Krenair>	 Glaisher, which isn't set up
[16:11:36] <grrrit-wm>	 (03CR) 10Glaisher: "> This also can never hope to have reasonable histories, as it's not an SUL wiki. I'm minded to say this shouldn't be allowed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: 10MarcoAurelio)
[16:11:42] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable VisualEditor by default for some other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264765 (https://phabricator.wikimedia.org/T116523) (owner: 10Jforrester)
[16:12:25] <Glaisher>	 Krenair: So it won't work anyway?
[16:13:49] <grrrit-wm>	 (03CR) 10Alex Monk: "We already allow import from wikimediafoundation.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: 10MarcoAurelio)
[16:13:56] <grrrit-wm>	 (03PS2) 10Dereckson: Namespace configuration on cu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) 
[16:14:20] <thcipriani>	 Krenair: what's up with the interwiki.cdb file?
[16:14:31] <Krenair>	 thcipriani, what's up with it?
[16:14:32] <grrrit-wm>	 (03CR) 10Daniel Kinzler: [C: 031] "Yes, we want this for Wikidata. Tested this together with Jonas, seems to work as advertised. Needs I2043353da." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene)
[16:14:32] <thcipriani>	 (saw you sync'd it this morning, modified on mira)
[16:14:35] <grrrit-wm>	 (03CR) 10Dereckson: "PS2: added the old namespace name as alias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) (owner: 10Dereckson)
[16:14:38] <Krenair>	 ugh
[16:14:47] <Krenair>	 is that part of the git repo again now?
[16:15:08] <thcipriani>	 Krenair: evidently
[16:15:15] <Krenair>	 I ran updateinterwikicache earlier
[16:15:59] <thcipriani>	 oh I see. I'll push up a patch to gerrit.
[16:16:19] <Krenair>	 I'm doing it now
[16:16:22] <wikibugs>	 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1986511 (10mark) p:5Normal>3Low
[16:16:35] <grrrit-wm>	 (03PS1) 10Alex Monk: updateinterwikicache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267687 
[16:16:52] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: scap: temprorarily remove tin during reimaging [puppet] - 10https://gerrit.wikimedia.org/r/267688 
[16:17:01] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] updateinterwikicache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267687 (owner: 10Alex Monk)
[16:17:12] <Dereckson>	 Krenair: May 23: 03e6919608cc5aaab44dfda23d05e7f8439ba6a2 - Don't commit interwiki.cdb anymore / Dec 4: 22a00eb5f473c6822e1c19c6602db7ae55a613ac - Revert
[16:17:28] <Dereckson>	 <hoo> I only recently broke it and could only recover the old one, because I saved it to my home
[16:17:31] <Dereckson>	 <hoo> before running the script, that broke it
[16:17:34] <Dereckson>	 <ori> that's a pretty good reason for having it versioned, no?
[16:17:37] <Dereckson>	 <hoo> Yeah
[16:17:57] <grrrit-wm>	 (03Merged) 10jenkins-bot: updateinterwikicache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267687 (owner: 10Alex Monk)
[16:18:04] <hoo>	 that was pretty scary... yes
[16:18:09] <Krenair>	 thcipriani, should be good now
[16:18:15] <thcipriani>	 Krenair: cool, thanks.
[16:18:35] <wikibugs>	 6operations, 10ops-codfw: ms-be2015 doesn't come up after reboot - https://phabricator.wikimedia.org/T125383#1986513 (10Papaul) @MoritzMuehlenhoff here is what I have on the system. {F3300155}
[16:19:32] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[16:21:13] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[16:22:09] <logmsgbot>	 !log thcipriani@mira Synchronized dblists/visualeditor-default.dblist: SWAT: Enable VisualEditor by default for some other wikis [[gerrit:264765]] (duration: 01m 58s)
[16:22:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:22:23] <thcipriani>	 ^ James_F check please
[16:23:00] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258206 (https://phabricator.wikimedia.org/T92661) (owner: 10Jforrester)
[16:23:21] <James_F>	 thcipriani: Yup, WFM.
[16:23:44] <grrrit-wm>	 (03Merged) 10jenkins-bot: Centralise all VisualEditor feedback pages except for a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258206 (https://phabricator.wikimedia.org/T92661) (owner: 10Jforrester)
[16:24:36] <wikibugs>	 6operations, 10Traffic, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1986530 (10Joe) So, I guess this has been some sort of weird race condition. Or that etcd responded with stale/spurious data in some way (the way recursive watch works might have caused that).
[16:26:20] <logmsgbot>	 !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Centralise all VisualEditor feedback pages except for a few wikis [[gerrit:258206]] (duration: 01m 30s)
[16:26:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:26:27] <thcipriani>	 ^ James_F check please
[16:26:37] <James_F>	 thcipriani: Checking.
[16:27:50] <James_F>	 Hmm.
[16:28:38] <James_F>	 thcipriani: Keep going.
[16:28:45] <thcipriani>	 James_F: okie doke.
[16:29:19] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) (owner: 10Dereckson)
[16:30:11] <grrrit-wm>	 (03Merged) 10jenkins-bot: Namespace configuration on cu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) (owner: 10Dereckson)
[16:32:30] <logmsgbot>	 !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration on cu.wikipedia [[gerrit:265885]] (duration: 01m 26s)
[16:32:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:32:35] <thcipriani>	 ^ Dereckson: check please
[16:32:39] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986560 (10Lydia_Pintscher) Thanks. I looked at one of them and the only thing in the page is the template f...
[16:34:22] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986566 (10daniel) Yea, looks like the srwiki talk pages wasn't us, but an edit to a much-used template.
[16:35:41] <Dereckson>	 thcipriani: seems to work
[16:35:48] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266541 (https://phabricator.wikimedia.org/T123084) (owner: 10Dereckson)
[16:36:32] <grrrit-wm>	 (03Merged) 10jenkins-bot: Set WikidataPageBanner namespaces on fr.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266541 (https://phabricator.wikimedia.org/T123084) (owner: 10Dereckson)
[16:38:55] <logmsgbot>	 !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Set WikidataPageBanner namespaces on fr.wikivoyage [[gerrit:266541]] (duration: 01m 26s)
[16:38:57] <thcipriani>	 ^ Dereckson check please
[16:38:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:42:54] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1986594 (10BBlack) Regardless, the average rate of HTCP these days is normally-flat-ish (a few scary spikes...
[16:42:57] <Dereckson>	 Doesn't work in NS_PROJECT
[16:43:16] <Dereckson>	 Nor through Wikidata, nor through {{PAGEBANNER}} 
[16:43:45] <Dereckson>	 thcipriani: could you do an mwscript eval?
[16:44:12] <Dereckson>	 on wgWPBNamespaces for frwikivoyage
[16:46:12] <thcipriani>	 Dereckson: https://phabricator.wikimedia.org/P2550
[16:47:52] <_joe_>	 !log installing the new HHVM package to the api appserver cluster in eqiad
[16:47:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:48:00] <Dereckson>	 ah the setting is wrong, InitialiseSettings.php has wgWPBNamespaces, the extension wants wgWPBBannerNamespaces
[16:48:31] <Dereckson>	 I'm preparing a patch to fix that.
[16:49:04] <thcipriani>	 Dereckson: blerg. thank you.
[16:50:20] <wikibugs>	 6operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1986611 (10BBlack) 3NEW
[16:50:44] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: 10Dereckson)
[16:51:36] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable WikidataPageBanner on es.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: 10Dereckson)
[16:52:25] <_joe_>	 !log restarted pybal on lvs1001
[16:52:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:52:40] <grrrit-wm>	 (03PS1) 10Dereckson: wgWPBNamespaces → wgWPBBannerNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267694 
[16:52:54] <wikibugs>	 6operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1986628 (10BBlack) See also: T125265 + T124418
[16:53:28] <wikibugs>	 6operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1986633 (10BBlack) And maybe-related: T122455
[16:54:06] <logmsgbot>	 !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable WikidataPageBanner on es.wikivoyage [[gerrit:267195]] (duration: 01m 29s)
[16:54:09] <thcipriani>	 ^ Dereckson check please
[16:54:09] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:54:53] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267194 (https://phabricator.wikimedia.org/T124614) (owner: 10Dereckson)
[16:55:14] <grrrit-wm>	 (03PS3) 10ArielGlenn: dumps: configure for parallelized runs for zhwiki, metawiki [puppet] - 10https://gerrit.wikimedia.org/r/263912 
[16:55:42] <wikibugs>	 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1986637 (10BBlack) cp1060 is depooled for users now.  Once Analytics is done with their oozie thing, we can proceed on the next steps for actually stopping the cache_mobile clust...
[16:55:42] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable SandboxLink on or.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267194 (https://phabricator.wikimedia.org/T124614) (owner: 10Dereckson)
[16:56:39] <Dereckson>	 thcipriani: 267195 tested
[16:56:49] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] dumps: configure for parallelized runs for zhwiki, metawiki [puppet] - 10https://gerrit.wikimedia.org/r/263912 (owner: 10ArielGlenn)
[16:57:02] <icinga-wm>	 RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[16:57:16] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1986641 (10Ottomata)
[16:58:08] <wikibugs>	 6operations, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1986647 (10Ottomata) 5Open>3Resolved
[16:58:33] <icinga-wm>	 PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:59:05] <grrrit-wm>	 (03PS4) 10Thcipriani: Use extension registration for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson)
[16:59:15] <logmsgbot>	 !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable SandboxLink on or.wikipedia.org [[gerrit:267194]] (duration: 01m 31s)
[16:59:20] <thcipriani>	 ^ Dereckson check please
[17:00:11] <grrrit-wm>	 (03PS4) 10ArielGlenn: dumps: configure for parallelized runs for zhwiki, metawiki [puppet] - 10https://gerrit.wikimedia.org/r/263912 
[17:00:20] <Dereckson>	 266433 tested.
[17:00:50] <thcipriani>	 hmmm, not really a good way to sync https://gerrit.wikimedia.org/r/#/c/266433/
[17:01:03] <grrrit-wm>	 (03CR) 10Dereckson: [C: 04-1] "Documentation extension lags on actual code. wgWPBNamespaces seems the correct setting." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267694 (owner: 10Dereckson)
[17:02:53] <icinga-wm>	 PROBLEM - SSH on cygnus is CRITICAL: Server answer
[17:03:12] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson)
[17:03:40] <grrrit-wm>	 (03Merged) 10jenkins-bot: Use extension registration for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson)
[17:03:52] <wikibugs>	 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1986670 (10TrevorParscal) I approve giving Subbu access to ruthenium.
[17:05:31] <grrrit-wm>	 (03PS2) 10Yuvipanda: toolslabs: install hunspell and libhunspell-dev to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/267513 (https://phabricator.wikimedia.org/T125193) (owner: 10Ebrahim)
[17:05:37] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] toolslabs: install hunspell and libhunspell-dev to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/267513 (https://phabricator.wikimedia.org/T125193) (owner: 10Ebrahim)
[17:06:53] <logmsgbot>	 !log thcipriani@mira Synchronized wmf-config: SWAT: Use extension registration for Graph [[gerrit:266433]] (duration: 01m 29s)
[17:06:56] <thcipriani>	 Dereckson: check please ^
[17:06:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:07:49] <grrrit-wm>	 (03PS3) 10ArielGlenn: dumps: rebalance page ranges for dump jobs that run in parallel [puppet] - 10https://gerrit.wikimedia.org/r/266314 (https://phabricator.wikimedia.org/T123571) 
[17:08:41] <Dereckson>	 Graph pages still work as expected, okay. So 266433 tested.
[17:09:53] <thcipriani>	 Dereckson: I'm going to bump https://gerrit.wikimedia.org/r/#/q/262893,n,z if that's ok. I've got to run to a meeting.
[17:10:03] <Dereckson>	 Okay.
[17:10:16] <thcipriani>	 Dereckson: thank you for your help. appreciated.
[17:10:33] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] dumps: rebalance page ranges for dump jobs that run in parallel [puppet] - 10https://gerrit.wikimedia.org/r/266314 (https://phabricator.wikimedia.org/T123571) (owner: 10ArielGlenn)
[17:10:44] <Dereckson>	 Thank you for the deploy.
[17:17:28] <grrrit-wm>	 (03Abandoned) 10Dereckson: wgWPBNamespaces → wgWPBBannerNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267694 (owner: 10Dereckson)
[17:20:15] <JohanJ>	 Do you know when decisions around the rollback will be taken today? I need to make a decision about what to do with this week's issue of Tech News today, and wonder if I should wait another hour or so or postpone sending it out until tomorrow UTC.
[17:21:01] <Krenair>	 greg-g, bd808 ^
[17:21:49] <bd808>	 JohanJ: greg-g will be announcing in <60m. He's in a meeting right now
[17:23:06] <JohanJ>	 Rightyo. Thansk.
[17:23:10] <JohanJ>	 Or thanks.
[17:24:53] <icinga-wm>	 RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[17:26:18] <wikibugs>	 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1986751 (10EBernhardson) If we were to split the cluster by wiki our disk usage should stay pretty consistent with growth over the last year, merely split between clusters. If we were to sp...
[17:27:14] <wikibugs>	 7Blocked-on-Operations, 6operations, 6Services, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1986752 (10Addshore) >>! In T85451#1258826, @GWicke wrote: > So, should we  >  > a) get a new box with more / bigger SSDs (most 2.5" cases have space for 8 SSDs), or >...
[17:31:19] <_joe_>	 gehel: welcome :)
[17:36:45] <gehel>	 _joe_: thx! Happy to be now part of the family !
[17:40:35] <wikibugs>	 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1986810 (10ssastry) In addition to what Trevor approved above, based on ops recommendations, I can request and get @TrevorParscal's approval for the same.
[17:42:15] <icinga-wm>	 ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online.
[17:42:15] <icinga-wm>	 ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online.
[17:42:15] <icinga-wm>	 ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online.
[17:42:15] <icinga-wm>	 ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online.
[17:42:15] <icinga-wm>	 ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] ottomata Will be this way until kafka1012 is back online.
[17:45:24] <icinga-wm>	 PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 15.00% of data above the critical threshold [100000000.0]
[17:50:33] <doctaxon>	 dewiki is down ...
[17:50:51] <doctaxon>	 and up again
[17:51:24] <icinga-wm>	 PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:51:31] <_joe_>	 thcipriani: I won't be at the deployment wg meeting, I have a conflicting meeting I need to attend
[17:51:49] <thcipriani>	 _joe_: ack, np, thanks for the heads up
[17:53:02] <Steinsplitter>	 https://commons.wikimedia.org/w/index.php?title=MediaWiki:ImageAnnotatorConfig.js&action=raw&ctype=text/javascript
[17:53:02] <Steinsplitter>	 ohoh
[17:53:09] <sjoerddebruin>	 Ah, not the only one.
[17:53:22] <sjoerddebruin>	 I have a broken homepage of Wikidata and get the same error when purging.
[17:54:03] <hoo>	 !log restarted hhvm on mw1253
[17:54:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:54:25] <aude>	 hoo probably fixed it
[17:54:31] <_joe_>	 Fatal error: Call to undefined function headers_sent() in /srv/mediawiki/errorpages/hhvm-fatal-error.php on line 815
[17:54:50] <_joe_>	 see logstash
[17:54:53] <hoo>	 Also many redis servers seem to be unreachable
[17:55:04] <_joe_>	 hoo: which ones? and where?
[17:55:12] <aude>	 _joe_: only 1253
[17:55:12] <hoo>	 rdb1001.eqiad.wmnet
[17:55:17] <aude>	 and that has stopped
[17:55:18] <hoo>	 and rdb1007.eqiad.wmnet
[17:55:30] <_joe_>	 hoo: what do you mean they're unreachable?
[17:55:36] <hoo>	 I see timeouts in the logs
[17:55:40] <hoo>	 but only a few
[17:55:40] <_joe_>	 you see connection errors?
[17:55:43] <_joe_>	 ok that's an overload
[17:55:49] <hoo>	 ah ok
[17:56:24] <icinga-wm>	 RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[17:56:51] <_joe_>	 thcipriani: see the fatal log
[17:56:53] <_joe_>	 please
[17:57:22] <thcipriani>	 _joe_: whoa
[17:57:48] <_joe_>	 thcipriani: could be partly an artifact of me rolling restarting all appservers
[17:59:00] <greg-g>	 all one host
[17:59:02] <wikibugs>	 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1986891 (10dcausse) I agree: splitting by feature is not easy and maybe not appropriate for the moment.  Random thoughts:  In the future I think one strategy could be: 1/ cluster for type a...
[17:59:04] <greg-g>	 mw1253
[17:59:15] <greg-g>	 well "all" being 99.999%
[17:59:22] <aude>	 greg-g: that is fixed
[17:59:26] <hoo>	 I restarted hhvm there, it stopped afterwards
[17:59:39] <greg-g>	 ah, sorry, didn't read full scrollback, we were in a meeting
[17:59:42] <greg-g>	 thanks all :)
[17:59:56] <greg-g>	 alright, email being written/sent re wmf.11/12 now
[18:00:05] <JohanJ>	 greg-g: thanks.
[18:02:32] <greg-g>	 it's been a fun morning :)
[18:02:59] <hoo>	 greg-g: To which ml? Didn't get it yet
[18:03:55] <hoo|away>	 Anyway, my food is getting cold, I'll be right back
[18:05:55] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: allow per-wiki configuration of checkpoint time in config [puppet] - 10https://gerrit.wikimedia.org/r/267704 
[18:07:56] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] dumps: allow per-wiki configuration of checkpoint time in config [puppet] - 10https://gerrit.wikimedia.org/r/267704 (owner: 10ArielGlenn)
[18:10:48] <aude>	 _joe_: i think some of the fatal errors maybe got cached in varnish
[18:11:00] <greg-g>	 hoo|away: coming now
[18:11:03] <mutante>	 !log planet1001 - rebooting for upgrade
[18:11:05] <aude>	 cp1055 miss(0), cp3040 hit(2), cp3031 frontend hit(1) 
[18:11:06] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:11:11] <aude>	 sometimes when i try https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge
[18:11:14] <_joe_>	 aude: bblack already banned a url
[18:11:21] <greg-g>	 hoo|away: ops@, engineering@, and wikitech-l@
[18:11:43] <wikibugs>	 7Blocked-on-Operations, 6operations, 6Services, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1986923 (10EBernhardson) I also support this change, collecting stats in graphite has shown to be quite powerful. Removing limitations around stats collection should be...
[18:12:06] <Bsadowski1>	 "Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes."
[18:12:09] <_joe_>	 aude: try https://www.wikidata.org/w/index.php?title=Wikidata:Main_Page&cache=none
[18:12:14] <aude>	 _joe_: ok
[18:12:20] <_joe_>	 or, add any arbitrary key to the query string :P
[18:12:30] <Bsadowski1>	 When viewing someone's userpage there is an error page but no details
[18:12:33] <aude>	 _joe_: yeah, i did
[18:12:34] <_joe_>	 (I already purged the page for now)
[18:12:35] <aude>	 and that works
[18:12:41] <_joe_>	 Bsadowski1: url?
[18:12:43] <Bsadowski1>	 No error details at all
[18:12:46] <_joe_>	 in pm if needed
[18:12:49] <Bsadowski1>	 https://simple.wikipedia.org/wiki/User:Etamni
[18:12:59] <Bsadowski1>	 Someone complained about it
[18:13:51] <_joe_>	 Bsadowski1: fixed
[18:14:00] <Bsadowski1>	 Weird. It doesn't happen on mine..
[18:14:10] <_joe_>	 the trick is to append ?cache=hi&action=purge :P
[18:14:11] <greg-g>	 is Kelson in here?
[18:14:16] <_joe_>	 Bsadowski1: I did purge the cache
[18:15:29] <aude>	 https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge is still broken
[18:17:13] <icinga-wm>	 RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[18:19:24] <aude>	 possibly just https://www.wikidata.org/wiki/Wikidata:Main_Page is also cached in broken state (reported by user)
[18:20:24] <wikibugs>	 6operations, 10Wikimedia-Video, 5Patch-For-Review: 1gb file upload limit is too restrictive for conference presentation videos - https://phabricator.wikimedia.org/T116514#1986959 (10BBlack) Have we actually tested a ~2.4 GB upload and seen it work?  I worry that somewhere in the stack, something is using a s...
[18:20:52] <grrrit-wm>	 (03CR) 10BBlack: "https://phabricator.wikimedia.org/T116514#1986959" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ)
[18:25:30] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "thanks Daniel! puppet compiler is happy, https://puppet-compiler.wmflabs.org/1670/ I'll merge this tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/204275 (owner: 10Filippo Giunchedi)
[18:25:36] <mutante>	 Nemo_bis: do you know if the newer RC stream (stream.wm, not irc.wm) is also that critical for antivandal tools?
[18:29:44] <wikibugs>	 6operations, 10Wikimedia-Video, 5Patch-For-Review: 1gb file upload limit is too restrictive for conference presentation videos - https://phabricator.wikimedia.org/T116514#1986992 (10fgiunchedi) re: testing, as of a week ago there's a swift cluster in beta as per {T64835}, mediawiki config change hasn't happe...
[18:31:04] <grrrit-wm>	 (03PS3) 10ArielGlenn: crap salt cleanup scripts primarily for labs use [software] - 10https://gerrit.wikimedia.org/r/236798 
[18:31:34] <icinga-wm>	 PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail
[18:33:04] <aude>	 mutante: probably is
[18:33:19] * aude doesn't know for sure though
[18:33:33] <mutante>	 aude: yep, i heard it's used by cvnbot 
[18:33:48] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032 V: 032] "they're crap scripts but they might as well go in the repo instead of sitting here forever" [software] - 10https://gerrit.wikimedia.org/r/236798 (owner: 10ArielGlenn)
[18:34:17] <quiddity>	 mobile web error on https://en.m.wikipedia.org/wiki/Main_Page but all other pages seem fine.
[18:35:41] <grrrit-wm>	 (03PS2) 10ArielGlenn: force salt minion to ping master every 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/219134 
[18:35:56] <_joe_>	 quiddity: https://en.m.wikipedia.org/w/index.php?title=Main_Page&nocache=y&action=purge solved it, FYI
[18:36:20] <_joe_>	 the trick is to add a bogus part in the query string (nocache=y does nothing)
[18:36:30] <quiddity>	 ty!
[18:36:47] <_joe_>	 so that you know what to do when something else happens
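
The workaround _joe_ describes above (add a bogus query-string parameter so the request no longer matches the cached object, plus action=purge) can be sketched in Python. This is only an illustration: the helper name `add_cache_buster` and its defaults are assumptions, not an existing tool, and as _joe_ notes the parameter name itself means nothing to the servers.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def add_cache_buster(url, key="nocache", value="y", purge=True):
    """Return `url` with an extra, otherwise meaningless query parameter.

    Hypothetical helper: the key/value pair does nothing on the server,
    it only makes the URL differ from the one cached in a broken state.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((key, value))
    # Also request a purge unless the URL already carries an action.
    if purge and not any(k == "action" for k, _ in query):
        query.append(("action", "purge"))
    # Keep ':' and '/' readable so page titles like Wikidata:Main_Page survive.
    new_query = urlencode(query, safe=":/")
    return urlunsplit(parts._replace(query=new_query))


if __name__ == "__main__":
    print(add_cache_buster("https://en.m.wikipedia.org/w/index.php?title=Main_Page"))
```

Applied to the index.php form of the mobile Main_Page URL, this yields the same URL _joe_ posted.
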
[18:37:58] <wikibugs>	 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, and 3 others: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1987038 (10Joe) All appservers have been upgraded, I'll perform some tests tomorrow to ensure this is solved.
[18:37:58] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] "This may be useful in multimaster scenarios in the future. For now it should be harmless, let's keep an eye on performance though." [puppet] - 10https://gerrit.wikimedia.org/r/219134 (owner: 10ArielGlenn)
[18:38:06] <wikibugs>	 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, and 3 others: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1987039 (10Joe) a:3Joe
[18:40:57] <hoo>	 _joe_: here?
[18:41:01] <hoo>	 or bblack 
[18:41:23] <_joe_>	 hoo: here-ish
[18:41:26] <_joe_>	 what's up?
[18:41:33] <grrrit-wm>	 (03PS3) 10BBlack: Do not normalize_path for cxserver|citoid|rest.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/267381 (https://phabricator.wikimedia.org/T125176) (owner: 10GWicke)
[18:41:51] <hoo>	 Well, error pages caught up in varnish because they returned 200 OK and probably inappropriate cache headers
[18:41:58] <hoo>	 thus they are cached even for logged-in users
[18:42:03] <hoo>	 not sure how many pages are affected
[18:42:11] <hoo>	 I know about https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge at least
[18:42:29] <_joe_>	 hoo: read my comment to quiddity for a workaround
[18:42:32] <icinga-wm>	 PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 23.08% of data above the critical threshold [100000000.0]
[18:42:45] <hoo>	 _joe_: I know how to bypass varnish ;)
[18:43:13] <Nemo_bis>	 mutante: yes, AFAIK most cross-wiki antivandalism relies on rcstream by now
[18:43:20] <grrrit-wm>	 (03CR) 10GWicke: [C: 031] "LGTM. Thank you, @bblack!" [puppet] - 10https://gerrit.wikimedia.org/r/267381 (https://phabricator.wikimedia.org/T125176) (owner: 10GWicke)
[18:43:32] <aude>	 _joe_: i did that
[18:43:36] <_joe_>	 hoo: ok, there were 1500 errors, I don't know how to find a list of the affected pages easily and it's too late for me
[18:43:42] <Nemo_bis>	 a handful of big wikis may survive without, but several hundred would probably be devastated :)
[18:44:08] <_joe_>	 Nemo_bis: as long as the clients reconnect upon a connection failure, they'll be ok
[18:44:40] <hoo>	 _joe_: Fair enough
[18:44:51] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] Do not normalize_path for cxserver|citoid|rest.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/267381 (https://phabricator.wikimedia.org/T125176) (owner: 10GWicke)
[18:44:51] <_joe_>	 Nemo_bis: if they don't, they should be fixed
[18:44:53] <hoo>	 would be nice if it were easier to purge individual urls
[18:45:08] <_joe_>	 hoo: well our problem here is to find such urls
[18:45:17] * aude could purge them on beta with -X PURGE
[18:45:19] <_joe_>	 we could ban based on the content-length maybe
[18:45:23] <aude>	 from one of the hosts
[18:45:33] <_joe_>	 on the varnishes, I mean
[18:45:35] <aude>	 no idea all the affected hosts but at least ones i know about
[18:45:54] <aude>	 content length = 13817
[18:45:58] <subbu>	 _joe_, when is the techops meeting? i am trying to figure out the status of my access requests.
[18:46:23] <apergos>	 meeting is over, subbu
[18:46:29] * _joe_ off
[18:46:39] <aude>	 _joe_: thanks for helping
[18:46:41] <apergos>	 bye, joe. happy recharging
[18:46:42] <hoo>	 content length + time span would work, I guess
[18:46:53] <bblack>	 can someone tl;dr the above?
[18:46:59] <subbu>	 ah, ok. i'll wait to hear more about the access requests then.
[18:47:02] <hoo>	 bblack: Sure
[18:47:06] <_joe_>	 bblack: one appserver went rogue, cached error pages 
[18:47:15] <hoo>	 Server gone mad, giving cacheable broken error pages
[18:47:18] <bblack>	 ok
[18:47:18] <_joe_>	 bblack: some we manually purged
[18:47:28] <hoo>	 eg. https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge
[18:47:28] <aude>	 e.g. https://www.wikidata.org/wiki/Wikidata:Main_Page?action=purge (for me at least)
[18:47:35] <bblack>	 ok, so ban content-length of 13817 supposedly catches them all?
[18:47:40] <hoo>	 it should
[18:47:42] <aude>	 bblack: probably
[18:47:44] <_joe_>	 that is actually a double-cache effect
[18:47:47] <_joe_>	 which is weird
[18:47:50] <hoo>	 the appserver failed at one point in that page
[18:47:57] <bblack>	 double-cache effect?
[18:48:01] <sjoerddebruin>	 Another report: https://en.wikipedia.org/wiki/Syrian_Civil_War
[18:48:13] <_joe_>	 bblack: uhm no scratch that
[18:48:41] <_joe_>	 bblack: so any page that was cached with an error gets re-presented if you ask it with action=purge
[18:48:41] <bblack>	 ok working on the ban for content length, we're probably ok with that alone as conditional if it works
[18:48:44] <sjoerddebruin>	 RecentChanges also fires up an error sometimes.
[18:48:49] <_joe_>	 bblack: +1
[18:49:34] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[18:50:23] <icinga-wm>	 PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail
[18:53:51] <bblack>	 ban should be complete
[18:54:12] <bblack>	 I'm not sure how we confirm completely other than dropoff of user reports.  different users will hit different frontend caches for the same URL
[18:54:32] <mutante>	 !log LDAP - added elukey to "ops" group
[18:54:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:54:54] <bblack>	 !log banned obj.http.Content-Length == 13817 on all cache_text
[18:54:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:55:23] <aude>	 bblack: works now
[18:55:38] <aude>	 sjoerddebruin: errors gone now?
[18:55:41] <sjoerddebruin>	 Yep
[18:55:46] <aude>	 great
[18:55:53] <aude>	 thanks bblack 
[18:55:56] <bblack>	 np!
[18:57:51] <grrrit-wm>	 (CR) EBernhardson: Reduce replica count for commonswiki_file in codfw (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/266658 (owner: EBernhardson)
[18:57:53] <grrrit-wm>	 (PS2) EBernhardson: Reduce replica count for commonswiki_file in codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/266658
[19:00:25] <grrrit-wm>	 (PS4) Dereckson: Prepare db-codfw.php for a live deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/267659 (owner: Jcrespo)
[19:03:34] <jynus>	 _joe_, I am adding you to a patch we discussed on the meeting, but it is not urgent
[19:04:23] <joal>	 Hey nuria_ hive
[19:04:26] <joal>	 oops :)
[19:04:29] <nuria_>	 yes
[19:04:39] <wikibugs>	 Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987127 (greg) >>! In T124440#1981816, @greg wrote: > @legoktm: Update, please?  Anyone, update please.
[19:04:46] <joal>	 nothing yet nuria_ , my autocompletion worked too fast
[19:04:51] <joal>	 sorry :)
[19:07:29] <grrrit-wm>	 (Abandoned) MarcoAurelio: Expanding transwiki import sources for be.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/267685 (https://phabricator.wikimedia.org/T125390) (owner: MarcoAurelio)
[19:08:56] <icinga-wm>	 RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:08:57] <grrrit-wm>	 (CR) Jcrespo: "The commit comment shoud say "partitioning changes that have been rolled-in", not back, will fix on my next patch." [mediawiki-config] - https://gerrit.wikimedia.org/r/267659 (owner: Jcrespo)
[19:11:26] <grrrit-wm>	 (CR) MarcoAurelio: Change Nepali Wikibooks sitename and logo (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: MtDu)
[19:12:18] <wikibugs>	 Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987155 (ArielGlenn) I see it still going on terbium.
[19:12:21] <wikibugs>	 operations, procurement: dataset host specification refresh - https://phabricator.wikimedia.org/T125421#1987156 (RobH) NEW
[19:12:48] <grrrit-wm>	 (CR) MarcoAurelio: "> I ran optipng on the logo before I pushed the patch. Is that enough or what else do I need to do?" [mediawiki-config] - https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: MtDu)
[19:12:51] <wikibugs>	 operations, procurement: dataset host specification refresh - https://phabricator.wikimedia.org/T125421#1987168 (RobH) Open>Invalid a:RobH didnt put in private space, so rejecting this one as invalid.
[19:13:46] <icinga-wm>	 RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[19:13:46] <wikibugs>	 operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#1792783 (RobH)
[19:21:16] <wikibugs>	 operations, procurement: dataset host specification refresh - https://phabricator.wikimedia.org/T125421#1987223 (RobH)
[19:21:47] <robh>	 bah, edit the rejected one
[19:25:32] <wikibugs>	 operations, Incident-20160126-WikimediaDomainRedirection, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987246 (1...
[19:26:10] <grrrit-wm>	 (PS1) Ori.livneh: Speed trials: add preconnect [mediawiki-config] - https://gerrit.wikimedia.org/r/267719 (https://phabricator.wikimedia.org/T123582)
[19:26:24] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] Speed trials: add preconnect [mediawiki-config] - https://gerrit.wikimedia.org/r/267719 (https://phabricator.wikimedia.org/T123582) (owner: Ori.livneh)
[19:27:04] <wikibugs>	 Blocked-on-Operations, operations: Re-pool restbase1007 - https://phabricator.wikimedia.org/T124565#1987253 (GWicke) It is still listed as disabled [1], and is not getting much traffic.  [1]: http://config-master.wikimedia.org/conftool/eqiad/restbase
[19:27:11] <wikibugs>	 operations, Incident-20160126-WikimediaDomainRedirection, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987254 (1...
[19:27:26] <wikibugs>	 Blocked-on-Operations, operations: Re-pool restbase1007 - https://phabricator.wikimedia.org/T124565#1987255 (GWicke) Resolved>Open
[19:27:43] <wikibugs>	 operations, Incident-20160126-WikimediaDomainRedirection, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987257 (1...
[19:28:23] <logmsgbot>	 !log ori@mira Synchronized docroot/wikipedia.org/speed-tests: I5b48a491390: Speed trials: add preconnect (duration: 01m 27s)
[19:28:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:31:19] <grrrit-wm>	 (PS4) MtDu: Change Nepali Wikibooks sitename and logo [mediawiki-config] - https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881)
[19:32:06] <wikibugs>	 operations, Incident-20160126-WikimediaDomainRedirection, Wikimedia-Apache-configuration, netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1987284 (1...
[19:37:03] <jynus>	 hoo, did something related to wikidata config get deployed at 18:41? https://logstash.wikimedia.org/#dashboard/temp/AVKeVRrBptxhN1XatUHF
[19:37:35] <jynus>	 or labs?
[19:38:51] <hoo>	 not that I know of
[19:42:33] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987317 (jcrespo) Adding @Greg,  as this (setting s2 shard as read-only) needs Release Engineering coordination, and sadly we are in a timer here.  * Decide a date for the...
[19:48:26] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987343 (greg) How bad is the time crunch/when is the latest you would *comfortably* want to do this?  For the warning users part, we'll need some help from @Johan (ping, j...
[19:59:54] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987359 (jcrespo) The plan is: doing a try this week/early next week. Best case scenario, 1-10 seconds in read-only mode, and almost no user notice.  If that doesn't work,...
[20:04:06] <subbu>	 mutante, robh any update about my access requests?
[20:05:08] <robh>	 both ended up as inconclusive during the meeting since there is some confusion on them
[20:05:42] <robh>	 i think now its up to current clinic person to sort out who needs to followup?
[20:05:51] <robh>	 elukey: ^ heh
[20:06:03] <robh>	 you may recall we had a very confusing discussion about this during the ops meeting?
[20:07:27] <subbu>	 robh, anything i can do to help with clarifying the confusing pieces? I am not sure if my requests were confusing or something else was the issue.
[20:07:42] <subbu>	 or elukey.
[20:13:19] <robh>	 I didn't think it was confusing, but somehow we werent able to discuss the parsoid-rt-admin group
[20:13:26] <robh>	 without discussing the adding you to parsoid-roots
[20:13:29] <robh>	 which to me were unrelated.
[20:14:24] <ori>	 so you're going to block subbu because you can't follow through with the process you devised?
[20:14:44] <ori>	 without even proactively giving an explanation for the delay?
[20:14:46] <ori>	 good times
[20:14:49] <robh>	 ori: srsly?
[20:14:55] <robh>	 that is unduly harsh to me when im trying to help out now
[20:15:10] <ori>	 i'm not picking on you specifically, but something is not right here
[20:15:17] <robh>	 we are following up now
[20:15:30] <robh>	 but you want to argue, im dropping this.
[20:15:31] <ori>	 great, thanks
[20:15:46] * robh is tired of arguing when he is simply relaying info.
[20:15:49] <subbu>	 ori, robh chill .. i can wait for things to be resolved. thanks for trying to help.
[20:16:07] <subbu>	 i am blocked, but as long as ops don't mind me poking you for getting stuff done on ruthenium, i can deal.
[20:17:51] <robh>	 Again, I am not sure what wasn't clear during the ops meeting.  I supported the rights being added in https://phabricator.wikimedia.org/T124701
[20:18:19] <robh>	 then folks started saying that access request and the parsoid-roots had to be related together
[20:18:21] <robh>	 and they dont imo
[20:18:37] <robh>	 mutante i think was one of those who said they were related?
[20:18:44] <robh>	 (again, not sure)
[20:18:52] <mutante>	 no, i pointed out there were 2 separate tickets, that's all
[20:18:59] <subbu>	 those 2 requests are different.
[20:19:05] <mutante>	 i support the one for root on ruthenium and did so in meeting
[20:19:07] <subbu>	 (a) one is about getting me full access
[20:19:10] <mutante>	 i dont know about the other one
[20:19:20] <subbu>	 (b) another is about getting others in my team getting partial access sudo for specific operations
[20:19:34] <subbu>	 i will be on vacation, not around
[20:19:52] <subbu>	 and i want others in my team to be able to do stuff .. most of the time it will be simply to deploy fresh code, restart services, etc.
[20:20:04] <subbu>	 and if they need anything that their access doesn't allow, they can ping someone here.
[20:20:17] <robh>	 so https://phabricator.wikimedia.org/T124701 is b) team
[20:20:19] <subbu>	 and if that becomes a problem, we can revisit then.
[20:20:22] <robh>	 im not sure why everyone didnt approve
[20:20:32] <robh>	 but folks kept talking over me
[20:20:35] <robh>	 and then it got shut down.
[20:20:41] <subbu>	 yes, T124701 is team
[20:20:59] <grrrit-wm>	 (CR) MarcoAurelio: [C: 1] "Thank you." [mediawiki-config] - https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: MtDu)
[20:21:45] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987425 (jcrespo) List of wikis affected:  ``` mysql -e "SHOW DATABASES like '%wik%'" +------------------+ | Database (%wik%) | +------------------+ | bgwiki           | |...
[20:21:50] <grrrit-wm>	 (PS1) Yuvipanda: tools: Setup appropriate ferm rules for etcd clients / peers [puppet] - https://gerrit.wikimedia.org/r/267728
[20:21:59] <wikibugs>	 Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987427 (RobH) DB rights should likely be a different ticket, since they require @jcrespo to re...
[20:22:17] <subbu>	 robh, ori and hey hope i didn't come across too rude by asking you two to chill ... 
[20:22:19] <grrrit-wm>	 (PS2) Yuvipanda: tools: Setup appropriate ferm rules for etcd clients / peers [puppet] - https://gerrit.wikimedia.org/r/267728
[20:22:27] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] tools: Setup appropriate ferm rules for etcd clients / peers [puppet] - https://gerrit.wikimedia.org/r/267728 (owner: Yuvipanda)
[20:22:34] <robh>	 considering ori got me pretty angry, nope.
[20:22:41] <subbu>	 that wasn't my intention at least.
[20:23:09] <robh>	 I dislike it when people jump in claiming I'm not working on something when I'm clearly discussing it at that time ;]
[20:23:24] <robh>	 though i realize now ori likely didnt mean it that way =]
[20:23:35] <robh>	 ori: we good right?
[20:24:15] <robh>	 (i hope he went afk and isnt still mad)
[20:24:29] <mutante>	 win 22
[20:26:28] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987436 (Krenair) Interesting. I have never noticed that `l10nwiki` database before. All it contains is two empty tables (localisation and localisation_file_hash).
[20:26:28] <subbu>	 robh, i can understand why ori was frustrated on my behalf .. he has spent a good amount of time trying to get my puppet patches reviewed and merged on ruthenium.
[20:26:28] <subbu>	 but, i can be patient.
[20:26:43] <robh>	 the meeting block wasnt due to lack of trust but due to lack of clarity in the overall request during the meeting.  I think my update clarifies that, but I'm not sure how to proceed
[20:27:01] <robh>	 I don't really want to wait for another week, as a few of us in ops are having to do things for you which is silly
[20:27:07] <subbu>	 got it.
[20:28:47] <subbu>	 ok, will respond on the ticket.
[20:29:25] <wikibugs>	 Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987460 (ssastry) Got it. I'll split the db access into a separate ticket.  And, you are right...
[20:29:50] <robh>	 subbu: on https://phabricator.wikimedia.org/T124701
[20:29:58] <robh>	 you suggest reusing parsoid admins?
[20:30:07] <robh>	 but that group doesnt include all the rt stuff, and is clusterwide
[20:30:22] <robh>	 that would mean your team would have to be approved for admin on production level parsoid stuff
[20:30:26] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987464 (greg) Looks like the communities of the effected wikis aren't all in one part of the globe (based on my hugely hand-wavy assessment), so whatever time of day that...
[20:30:28] <robh>	 (seems harder to approve is all)
[20:30:43] <robh>	 unless they need to do all that stuff on the entire cluster or machines?
[20:31:02] <subbu>	 robh ah, ok. never mind. i didn't understand those nuances.
[20:31:05] <robh>	 (or am i misunderstanding your reply?)
[20:31:20] <robh>	 cuz indeed, it would be easier to slap three sudo rights onto the existing group
[20:31:23] <wikibugs>	 Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987465 (Legoktm) Sorry missed the ping. Yes, it's still going :(
[20:31:29] <robh>	 but then it escalates all those others to full admin on multiple machines
[20:31:32] <robh>	 cool
[20:31:36] <subbu>	 maybe call it parsoid-testing-admins instead?
[20:32:48] <wikibugs>	 Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987469 (RobH) IRC Discussion Update: I pointed out how appending those rights to the existing...
[20:32:57] <robh>	 oh, true
[20:33:07] <wikibugs>	 operations, Security-Reviews, Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1987474 (Dzahn)
[20:33:22] <robh>	 i suppose its not clear what the rt part of parsoid-rt is
[20:33:28] <subbu>	 roundtrip :)
[20:33:52] <subbu>	 but, we are running more than rt-tests on ruthenium .. so, -testing- is better.
[20:34:13] <grrrit-wm>	 (PS2) Dzahn: releases: Fix capitalization of MediaWiki [puppet] - https://gerrit.wikimedia.org/r/267428 (owner: Legoktm)
[20:34:28] <grrrit-wm>	 (CR) Dzahn: [C: 2] "good point! thanks" [puppet] - https://gerrit.wikimedia.org/r/267428 (owner: Legoktm)
[20:35:22] <grrrit-wm>	 (PS1) Yuvipanda: tools: Include base::firewall on the flannel etcd hosts [puppet] - https://gerrit.wikimedia.org/r/267732
[20:37:21] <grrrit-wm>	 (PS2) Yuvipanda: tools: Include base::firewall on the flannel etcd hosts [puppet] - https://gerrit.wikimedia.org/r/267732
[20:37:24] <grrrit-wm>	 (PS3) Dzahn: Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[20:37:29] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] tools: Include base::firewall on the flannel etcd hosts [puppet] - https://gerrit.wikimedia.org/r/267732 (owner: Yuvipanda)
[20:37:39] <grrrit-wm>	 (PS4) Dzahn: Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[20:39:01] <grrrit-wm>	 (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[20:40:01] <wikibugs>	 Ops-Access-Requests, operations: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1987492 (ssastry) NEW
[20:43:00] <wikibugs>	 operations, Diamond, Upstream: Diamond load averages do not contain scaled versions - https://phabricator.wikimedia.org/T125411#1987514 (yuvipanda)
[20:43:22] <grrrit-wm>	 (CR) Dzahn: [C: 2] Tools: Fix double file resource for jlocal [puppet] - https://gerrit.wikimedia.org/r/266934 (owner: Tim Landscheidt)
[20:48:17] <grrrit-wm>	 (PS1) Yuvipanda: tools: Allow peer nodes to acces etcd port too [puppet] - https://gerrit.wikimedia.org/r/267736
[20:48:40] <wikibugs>	 Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987540 (RobH) a:RobH>None My understanding (though I could be mistaken) is with the clar...
[20:49:14] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] tools: Allow peer nodes to acces etcd port too [puppet] - https://gerrit.wikimedia.org/r/267736 (owner: Yuvipanda)
[20:49:14] <wikibugs>	 Ops-Access-Requests, operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1987545 (RobH) My understanding (though I could be mistaken) is with the clarification above, this now only needs one of the following:  A) Ops team meeting review (this was attempted today but there wa...
[20:49:16] <wikibugs>	 operations, Diamond, Upstream: Diamond load averages do not contain scaled versions - https://phabricator.wikimedia.org/T125411#1987543 (yuvipanda) Hmm, this will add an extra metric for all hosts, prod and labs. @Fgiunchedi is that ok?
[20:49:25] <grrrit-wm>	 (PS2) Yuvipanda: tools: Allow peer nodes to acces etcd port too [puppet] - https://gerrit.wikimedia.org/r/267736
[20:49:41] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] tools: Allow peer nodes to acces etcd port too [puppet] - https://gerrit.wikimedia.org/r/267736 (owner: Yuvipanda)
[20:50:01] <wikibugs>	 operations, Labs, Labs-Infrastructure, Tool-Labs: tools-exec: automatic php upgrade  - puppet fail - https://phabricator.wikimedia.org/T125438#1987548 (Dzahn) NEW
[20:50:12] <wikibugs>	 Ops-Access-Requests, operations, Parsing-Team, Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1987556 (ssastry) p:Triage>High
[20:50:27] <wikibugs>	 Ops-Access-Requests, operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1987557 (ssastry) p:Triage>High
[20:53:47] <grrrit-wm>	 (CR) Thcipriani: "Inline comments." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/266773 (owner: Chad)
[20:55:00] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987567 (jcrespo) Hardware maintenance (there is an LVM/fs problem that avoids the partition to grow), OS upgrade and mariadb upgrade (5.5 -> 10). I have the long version i...
[20:55:49] <wikibugs>	 Ops-Access-Requests, operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1987570 (RobH) My last comment (and token) was meant for an unrelated task i had open in another tab. I removed the comment but tokens seem to stick (sorry about that!)
[20:56:52] <subbu>	 no parsoid deploy today. we have some regressions that we need to deal with first.
[20:59:31] <grrrit-wm>	 (PS1) Bmansurov: Remove section collapsing config [mediawiki-config] - https://gerrit.wikimedia.org/r/267776 (https://phabricator.wikimedia.org/T124220)
[21:00:04] <jouncebot>	 gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160201T2100).
[21:00:18] <mdholloway>	 no mobileapps deployment today.
[21:01:20] <wikibugs>	 Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987589 (Anomie) See also https://gerrit.wikimedia.org/r/#/c/267734/ and https://gerrit.wikimedia.org/r/#/c/267735/.
[21:02:22] <wikibugs>	 operations, Labs, Labs-Infrastructure, Tool-Labs: tools-exec: automatic php upgrade  - puppet fail - https://phabricator.wikimedia.org/T125438#1987599 (scfc) Open>Resolved a:scfc Fixed by `apt-get install php5-gd` and downgrading.
[21:02:38] <grrrit-wm>	 (PS1) Dzahn: toollabs: don't use ensure => latest for everything [puppet] - https://gerrit.wikimedia.org/r/267778
[21:03:18] <wikibugs>	 operations, Labs, Labs-Infrastructure, Tool-Labs: tools-exec: automatic php upgrade  - puppet fail - https://phabricator.wikimedia.org/T125438#1987605 (Dzahn) thanks @scfc :)  i uploaded https://gerrit.wikimedia.org/r/#/c/267778/
[21:03:44] <grrrit-wm>	 (PS2) Dzahn: toollabs: don't use ensure => latest for everything [puppet] - https://gerrit.wikimedia.org/r/267778
[21:07:10] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987621 (greg) That's good enough for me :)
[21:08:07] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[21:09:07] <icinga-wm>	 PROBLEM - RAID on es2009 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[21:12:18] <mutante>	 where can i see the data collected by " diamond::collector { 'Httpd':"?
[21:12:18] <mutante>	 oh, ensure   => absent,  duh
[21:12:18] <grrrit-wm>	 (PS2) Dzahn: apache: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266970
[21:12:21] <grrrit-wm>	 (CR) Dzahn: [C: 2] apache: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266970 (owner: Dzahn)
[21:13:17] <wikibugs>	 Blocked-on-Operations, operations, RESTBase, hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1987631 (GWicke) @fgiunchedi, could you tackle the upgrades (disk and CPU/RAM) for 1007-9 soon?
[21:14:20] <grrrit-wm>	 (PS1) MarcoAurelio: Enabling Extension:ShortUrl on or.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429)
[21:14:51] <grrrit-wm>	 (PS2) MarcoAurelio: Enabling Extension:ShortUrl on od.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429)
[21:14:56] <mutante>	 YuviPanda: is "ircyall" used anywhere? i can't seem to find it with "watroles"
[21:15:27] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[21:15:27] <YuviPanda>	 mutante: yes it's used in the ircnotifier project
[21:15:34] <YuviPanda>	 oh that's interesting. not sure why that's the case
[21:16:15] <YuviPanda>	 I'm in the middle of an etcd move, so can you file a bug so I can take a look, mutante?
[21:16:18] <grrrit-wm>	 (CR) JanZerebecki: "The dependency is probably deployed this train on Wed, so it can be added to SWAT afterwards." [mediawiki-config] - https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: Bene)
[21:16:28] <YuviPanda>	 valhallasw`cloud: I'm considering letting ircnotifier die
[21:16:33] <YuviPanda>	 valhallasw`cloud: wikibugs is the only user now :D
[21:16:38] <YuviPanda>	 well
[21:16:40] <YuviPanda>	 :(
[21:16:50] <mutante>	 YuviPanda: sure, ok
[21:16:56] <wikibugs>	 operations, ops-codfw: es2009 degraded RAID - https://phabricator.wikimedia.org/T125442#1987649 (jcrespo) NEW
[21:18:16] <valhallasw`cloud>	 YuviPanda: :(
[21:18:32] <valhallasw`cloud>	 YuviPanda: well, I guess I'll have to move to wmbot then :P
[21:18:41] <grrrit-wm>	 (PS1) Yuvipanda: tools: Point flannel to use new flannel specific etcd [puppet] - https://gerrit.wikimedia.org/r/267781 (https://phabricator.wikimedia.org/T125371)
[21:18:56] <YuviPanda>	 valhallasw`cloud: yeah. I don't think I've the bandwidth to push it through to its completion
[21:19:23] <YuviPanda>	 valhallasw`cloud: hmm, since we only want it for sal, maybe once bd808 is done with all the auth manager stuff we can make a HTTP endpoint for SAL :)
[21:19:37] <grrrit-wm>	 (PS1) Andrew Bogott: Fix instance storage volume for labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/267782
[21:19:42] <valhallasw`cloud>	 YuviPanda: uh, I guess, although I also find it practical to see it on irc
[21:19:53] <YuviPanda>	 fair enough
[21:20:07] <YuviPanda>	 valhallasw`cloud: I can probably just move it to a tools project, I guess
[21:20:11] <YuviPanda>	 and kill the puppet role
[21:20:15] <YuviPanda>	 but maybe not worth it
[21:20:55] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987669 (jcrespo) Tuesday 9 Feb, 23:00 UTC, does that work for anyone on your team?
[21:20:55] <mutante>	 YuviPanda: i used the wrong "syntax" for the tool, nevermind
[21:21:15] <mutante>	  /role/role::foo   vs  role/foo   role:foo  etc
[21:21:18] <YuviPanda>	 aaah
[21:21:21] <YuviPanda>	 fun
[21:22:21] <grrrit-wm>	 (PS2) Andrew Bogott: Fix instance storage volume for labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/267782
[21:23:46] <grrrit-wm>	 (CR) Andrew Bogott: [C: 2] Fix instance storage volume for labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/267782 (owner: Andrew Bogott)
[21:24:39] <grrrit-wm>	 (PS2) Dzahn: ircyall: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266964
[21:24:49] <grrrit-wm>	 (CR) Dzahn: [C: 2] ircyall: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266964 (owner: Dzahn)
[21:26:19] <YuviPanda>	 thanks mutante
[21:26:28] <grrrit-wm>	 (PS1) MarcoAurelio: Enabling Extension:ShortUrl for bhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/267783 (https://phabricator.wikimedia.org/T113348)
[21:27:05] <mutante>	 YuviPanda: welcome, it's trying to eliminate that issue all across the repo, then remove the exception for it from global .puppet-lint.rc
[21:27:11] * YuviPanda nods
[21:30:18] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987692 (greg) The hour before afternoon SWAT should be OK, yeah. Anything specific you need from us? We'll all be around during the time, but if you need someone's undivid...
[21:30:25] <grrrit-wm>	 (PS1) MarcoAurelio: Removing testwiki from wmgUseShortUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/267784
[21:30:52] <grrrit-wm>	 (PS1) Yuvipanda: toollabs: Add proxy nodes to flanel etcd access list [puppet] - https://gerrit.wikimedia.org/r/267785
[21:32:00] <grrrit-wm>	 (PS2) Yuvipanda: toollabs: Add proxy nodes to flanel etcd access list [puppet] - https://gerrit.wikimedia.org/r/267785
[21:32:02] <grrrit-wm>	 (PS2) Yuvipanda: tools: Point flannel to use new flannel specific etcd [puppet] - https://gerrit.wikimedia.org/r/267781 (https://phabricator.wikimedia.org/T125371)
[21:34:00] <grrrit-wm>	 (CR) Dzahn: "${labsproject}.${hostname}.reqstats - can't find it in graphite.wmflabs" [puppet] - https://gerrit.wikimedia.org/r/266973 (owner: Dzahn)
[21:34:22] <grrrit-wm>	 (PS2) Dzahn: dynamicproxy: fix top-scope var without namespace, lint [puppet] - https://gerrit.wikimedia.org/r/266973
[21:34:40] <grrrit-wm>	 (CR) Dzahn: [C: 2] dynamicproxy: fix top-scope var without namespace, lint [puppet] - https://gerrit.wikimedia.org/r/266973 (owner: Dzahn)
[21:37:02] <wikibugs>	 operations, Mobile-Content-Service: Improve operational documentation for the mobileapps service - https://phabricator.wikimedia.org/T123852#1987743 (Mholloway) I don't have permission to create a page on Wikitech, but I created a draft here:   https://www.mediawiki.org/wiki/User:MHolloway_(WMF)/Draft:Mobi...
[21:37:03] <grrrit-wm>	 (CR) Dzahn: "the change only touched the graphite part and i could not see those reqstats in actual graphite" [puppet] - https://gerrit.wikimedia.org/r/266973 (owner: Dzahn)
[21:41:01] <wikibugs>	 operations, Mobile-Content-Service: Improve operational documentation for the mobileapps service - https://phabricator.wikimedia.org/T123852#1987760 (Dzahn) >>! In T123852#1987743, @Mholloway wrote: > I don't have permission to create a page on Wikitech, but I created a draft here:   @Mholloway your user a...
[21:44:18] <grrrit-wm>	 (CR) Tim Landscheidt: [C: -1] "The issue as shown by the tasks is not "ensure => latest", that just highlights it. On the initial Puppet run, PHP packages & Co. are ins" [puppet] - https://gerrit.wikimedia.org/r/267778 (owner: Dzahn)
[21:44:18] <grrrit-wm>	 (CR) Dzahn: "yea, i know it touches the certificates.. but $site -> $::site is fairly common" [puppet] - https://gerrit.wikimedia.org/r/266980 (owner: Dzahn)
[21:46:11] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987768 (jcrespo) No, the only complex thing is DBA-specific. I will need just the usual attention as if it was a deployment (logs monitoring for higher rates of errors, et...
[21:46:46] <icinga-wm>	 PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 11.76% of data above the critical threshold [100000000.0]
[21:46:47] <grrrit-wm>	 (Abandoned) Dzahn: toollabs: don't use ensure => latest for everything [puppet] - https://gerrit.wikimedia.org/r/267778 (owner: Dzahn)
[21:47:57] <icinga-wm>	 PROBLEM - puppet last run on wtp2001 is CRITICAL: CRITICAL: puppet fail
[21:51:11] <wikibugs>	 operations, DBA, Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1987795 (greg) Gotcha, thanks @jcrespo.  Added to the calendar: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=281828&oldid=281348
[21:53:44] <grrrit-wm>	 (PS2) Dzahn: aptly: fix top-scope vars without namespace [puppet] - https://gerrit.wikimedia.org/r/266984
[21:55:01] <wikibugs>	 operations, Mobile-Content-Service: Improve operational documentation for the mobileapps service - https://phabricator.wikimedia.org/T123852#1987816 (Mholloway) Guess it was indeed a case of user error.  After the reset I can now create pages on Wikitech.  Thanks @Dzahn!
[21:55:16] <icinga-wm>	 PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 896
[21:55:34] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] aptly: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266984 (owner: 10Dzahn)
[21:58:34] <grrrit-wm>	 (03CR) 10Dzahn: "checked on: tools-services-01, toolsbeta-aptly-server-01, ores-misc-01, (they use this per "watroles" tool) - noop" [puppet] - 10https://gerrit.wikimedia.org/r/266984 (owner: 10Dzahn)
[21:58:36] <grrrit-wm>	 (03CR) 10Luke081515: [C: 04-1] "Why odwikisource? There are no wikis with code od, only with or, as requested in T124429" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429) (owner: 10MarcoAurelio)
[21:59:25] <wikibugs>	 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987820 (10Redrose64) It may have double-run - I have been force-logged out twice since this task was raised, the second one was today
[22:00:16] <icinga-wm>	 RECOVERY - check_mysql on lutetium is OK: Uptime: 1833206 Threads: 2 Questions: 13676857 Slow queries: 15785 Opens: 79943 Flush tables: 3 Open tables: 64 Queries per second avg: 7.460 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[22:04:18] <grrrit-wm>	 (03PS2) 10Dzahn: ganglia: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266965 
[22:04:27] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] ganglia: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266965 (owner: 10Dzahn)
[22:04:33] <wikibugs>	 6operations, 10ops-codfw: Codfw-mw* IDRAC firmware upgrade - https://phabricator.wikimedia.org/T125088#1987838 (10Papaul) Please see link below for the documentation on how to upgrade the IDRAC firmware for the PowerEdge R410. https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_R...
[22:11:38] <icinga-wm>	 RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[22:16:27] <icinga-wm>	 RECOVERY - puppet last run on wtp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:17:33] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1987865 (10ori) >>! In T124418#1986594, @BBlack wrote: > Regardless, the average rate of HTCP these days is...
[22:19:26] <grrrit-wm>	 (03PS3) 10Yuvipanda: toollabs: Add proxy nodes to flanel etcd access list [puppet] - 10https://gerrit.wikimedia.org/r/267785 
[22:19:33] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Add proxy nodes to flanel etcd access list [puppet] - 10https://gerrit.wikimedia.org/r/267785 (owner: 10Yuvipanda)
[22:21:30] <grrrit-wm>	 (03PS3) 10Yuvipanda: tools: Point flannel to use new flannel specific etcd [puppet] - 10https://gerrit.wikimedia.org/r/267781 (https://phabricator.wikimedia.org/T125371) 
[22:21:32] <wikibugs>	 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1987876 (10Redrose64) It may have double-run - I have been force-logged out twice since this task was raised, the second one was today
[22:21:36] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Point flannel to use new flannel specific etcd [puppet] - 10https://gerrit.wikimedia.org/r/267781 (https://phabricator.wikimedia.org/T125371) (owner: 10Yuvipanda)
[22:26:07] <grrrit-wm>	 (03PS3) 10Dzahn: ganglia: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266965 
[22:27:48] <wikibugs>	 6operations: decom magnesieum (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#1987896 (10Dzahn)
[22:29:40] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1987905 (10BBlack) @ori - yeah that makes sense for the initial bump, and I think there may have even been a...
[22:30:12] <wikibugs>	 10Ops-Access-Requests, 6operations, 10DBA: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1987910 (10Dzahn)
[22:31:36] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0]
[22:33:40] <wikibugs>	 6operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1987938 (10JanZerebecki) Good find. That commit was first deployed in wmf8 which was branched on Dec 8 (rMWb...
[22:35:05] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Disallow k8s master etcd from being accessible to workers [puppet] - 10https://gerrit.wikimedia.org/r/267793 (https://phabricator.wikimedia.org/T125371) 
[22:40:31] <grrrit-wm>	 (03PS2) 10Yuvipanda: tools: Disallow k8s master etcd from being accessible to workers [puppet] - 10https://gerrit.wikimedia.org/r/267793 (https://phabricator.wikimedia.org/T125371) 
[22:42:07] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[22:50:38] <grrrit-wm>	 (03CR) 10Jdlrobson: [C: 031] "who can swat this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267776 (https://phabricator.wikimedia.org/T124220) (owner: 10Bmansurov)
[22:59:31] <wikibugs>	 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1988007 (10yuvipanda) I think the patch needs to get merged and babysat :)
[23:22:38] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Add separate role for k8s etcd [puppet] - 10https://gerrit.wikimedia.org/r/267796 
[23:22:45] <grrrit-wm>	 (03PS1) 10Ottomata: [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) 
[23:23:36] <grrrit-wm>	 (03PS2) 10Yuvipanda: tools: Add separate role for k8s etcd [puppet] - 10https://gerrit.wikimedia.org/r/267796 
[23:23:47] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] tools: Disallow k8s master etcd from being accessible to workers [puppet] - 10https://gerrit.wikimedia.org/r/267793 (https://phabricator.wikimedia.org/T125371) (owner: 10Yuvipanda)
[23:24:09] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add separate role for k8s etcd [puppet] - 10https://gerrit.wikimedia.org/r/267796 (owner: 10Yuvipanda)
[23:24:27] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata)
[23:24:44] <grrrit-wm>	 (03CR) 10Paladox: "recheck" [software] - 10https://gerrit.wikimedia.org/r/169253 (owner: 10Tim Landscheidt)
[23:24:53] <grrrit-wm>	 (03PS2) 10Ottomata: [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) 
[23:26:02] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata)
[23:26:46] <MarkTraceur>	 So, sanity check: is just everything broken?
[23:28:38] <wikibugs>	 6operations, 10netops: turn-up/implement zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#1916306 (10RobH) All procurement and onsite patch tasks have been completed.  (They don't show resolved since they sit in the pending invoice column for a month pending invoice.)  @Faidon sho...
[23:29:51] <MarkTraceur>	 Never mind, looks like it was temporary
[23:30:49] <wikibugs>	 6operations, 10Wikimedia-Video, 5Patch-For-Review: 1gb file upload limit is too restrictive for conference presentation videos - https://phabricator.wikimedia.org/T116514#1988121 (10BBlack) If we have a strong reason to stick with 2.5GB, we should really test this through the whole stack somehow (I guess in...
[23:37:10] <wikibugs>	 6operations, 6Performance-Team, 7Graphite, 7Monitoring: Add monitoring for analytics-statsv service - https://phabricator.wikimedia.org/T117994#1988128 (10Krinkle) a:3ori
[23:38:23] <andrewbogott>	 Reedy: have a moment to help me with an https-everywhere rule?  I hear tell that you’re an expert.
[23:48:45] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Allow k8s etcd peers to access client port too [puppet] - 10https://gerrit.wikimedia.org/r/267801 
[23:49:16] <grrrit-wm>	 (03PS2) 10Yuvipanda: tools: Allow k8s etcd peers to access client port too [puppet] - 10https://gerrit.wikimedia.org/r/267801 
[23:49:23] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Allow k8s etcd peers to access client port too [puppet] - 10https://gerrit.wikimedia.org/r/267801 (owner: 10Yuvipanda)
[23:51:34] <mobrovac>	 !log restbase deploy start of c3bd864 on canary rb1001
[23:51:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:57:32] <grrrit-wm>	 (03PS1) 10Ori.livneh: Add monitoring for statsv process [puppet] - 10https://gerrit.wikimedia.org/r/267802 (https://phabricator.wikimedia.org/T117994) 
[23:57:38] <grrrit-wm>	 (03PS1) 10Luke081515: Enable confirmed group at nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267804 (https://phabricator.wikimedia.org/T125448) 
[23:57:44] <grrrit-wm>	 (03PS2) 10Ori.livneh: Add monitoring for statsv process [puppet] - 10https://gerrit.wikimedia.org/r/267802 (https://phabricator.wikimedia.org/T117994) 
[23:57:50] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] Add monitoring for statsv process [puppet] - 10https://gerrit.wikimedia.org/r/267802 (https://phabricator.wikimedia.org/T117994) (owner: 10Ori.livneh)
[23:59:11] <grrrit-wm>	 (03CR) 10Luke081515: [C: 04-1] "Do not merge:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267804 (https://phabricator.wikimedia.org/T125448) (owner: 10Luke081515)