[00:10:59] <wikibugs>	 6operations, 7HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#1940286 (10matmarex) After every ICU upgrade, we need to run a long-running (takes a few days on largest wikis, IIRC) maintenance script on a couple dozen wikis (see task description), which is why w...
[00:18:17] <icinga-wm>	 RECOVERY - puppet last run on mw2039 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[00:20:37] <wikibugs>	 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#1940296 (10Jalexander) Also adding: Callanecc Key: https://en.wikipedia.org/wiki/User:Callanecc/Key
[01:30:32] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Add aliases for labtest websites: [dns] - 10https://gerrit.wikimedia.org/r/264703 
[01:34:33] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Add aliases for labtest websites: [dns] - 10https://gerrit.wikimedia.org/r/264703 (owner: 10Andrew Bogott)
[01:42:49] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Specify webserver_hostname for the openstack_manager class. [puppet] - 10https://gerrit.wikimedia.org/r/264704 
[01:43:47] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Specify webserver_hostname for the openstack_manager class. [puppet] - 10https://gerrit.wikimedia.org/r/264704 (owner: 10Andrew Bogott)
[01:45:22] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Specify webserver_hostname for the openstack_manager class. [puppet] - 10https://gerrit.wikimedia.org/r/264704 
[01:47:48] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Specify webserver_hostname for the openstack_manager class. [puppet] - 10https://gerrit.wikimedia.org/r/264704 (owner: 10Andrew Bogott)
[01:50:00] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/264705 
[01:51:10] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Typo fix [puppet] - 10https://gerrit.wikimedia.org/r/264705 (owner: 10Andrew Bogott)
[01:54:38] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Add labtestwikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/264706 
[01:55:52] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Add labtestwikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/264706 (owner: 10Andrew Bogott)
[02:19:45] <wikibugs>	 6operations, 7Mail: remove exim alias feedbacktest@ - https://phabricator.wikimedia.org/T123665#1940338 (10Jdlrobson) I'd assume so.  I believe this is the old feedback form we used to run on web (which was mostly noise)
[02:30:21] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 11m 38s)
[02:30:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:49:09] <YuviPanda>	 !log updated annualreport for foks
[02:49:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:50:33] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 08m 39s)
[02:50:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:57:41] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 18 02:57:41 UTC 2016 (duration 7m 8s)
[02:57:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:59:48] <icinga-wm>	 PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail
[03:07:16] <wikibugs>	 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#1940361 (10yuvipanda) I've done all these things and @Jalexander is verifying and GPG-mailing the passwords to these people.  If only we didn't use software from the early 90s...
[03:27:18] <icinga-wm>	 RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[03:44:17] <icinga-wm>	 PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:09:29] <icinga-wm>	 RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[04:36:17] <icinga-wm>	 PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:38:08] <icinga-wm>	 RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 69433 bytes in 0.081 second response time
[04:53:12] <wikibugs>	 7Puppet, 10MediaWiki-Vagrant, 7Easy, 5Patch-For-Review: MediaWiki-Vagrant guest OS clock gets out of sync - https://phabricator.wikimedia.org/T116507#1940437 (10Tgr) That links seems to be about running the NTP daemon on the guest, but if Virtualbox can do that, that's certainly more convenient.
[06:31:37] <icinga-wm>	 PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] <icinga-wm>	 PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] <icinga-wm>	 PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:07] <icinga-wm>	 PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:08] <icinga-wm>	 PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:08] <icinga-wm>	 PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] <icinga-wm>	 PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:48] <icinga-wm>	 PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:38] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:42:52] <grrrit-wm>	 (03PS1) 10KartikMistry: cxserver: Enable all source languages for Yandex [puppet] - 10https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) 
[06:56:38] <icinga-wm>	 RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:56:59] <icinga-wm>	 RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:08] <icinga-wm>	 RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:18] <icinga-wm>	 RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:57:18] <icinga-wm>	 RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:57:38] <icinga-wm>	 RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:38] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:57:38] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:57] <icinga-wm>	 RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:47] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:48] <icinga-wm>	 RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:14:56] <grrrit-wm>	 (03PS2) 10KartikMistry: cxserver: Enable all source languages for Yandex [puppet] - 10https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) 
[08:00:04] <jouncebot>	 Deploy window US Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160118T0800)
[08:53:48] <icinga-wm>	 PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[08:57:48] <icinga-wm>	 PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /a 326801 MB (3% inode=99%)
[08:57:58] <icinga-wm>	 RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[09:02:09] <icinga-wm>	 RECOVERY - Disk space on stat1002 is OK: DISK OK
[09:05:17] <icinga-wm>	 PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 640
[09:15:17] <icinga-wm>	 RECOVERY - check_mysql on db1008 is OK: Uptime: 500150 Threads: 2 Questions: 3742864 Slow queries: 3079 Opens: 1350 Flush tables: 2 Open tables: 396 Queries per second avg: 7.483 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:57:59] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdl failed T123830
[10:13:18] <icinga-wm>	 RECOVERY - RAID on ms-be2007 is OK: OK: optimal, 14 logical, 14 physical
[10:13:20] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[10:13:20] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[10:13:20] <icinga-wm>	 RECOVERY - dhclient process on ms-be2007 is OK: PROCS OK: 0 processes with command name dhclient
[10:13:57] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[10:13:58] <icinga-wm>	 RECOVERY - Check size of conntrack table on ms-be2007 is OK: OK: nf_conntrack is 0 % full
[10:13:58] <icinga-wm>	 RECOVERY - swift-container-server on ms-be2007 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[10:13:58] <icinga-wm>	 RECOVERY - DPKG on ms-be2007 is OK: All packages OK
[10:14:47] <icinga-wm>	 RECOVERY - configured eth on ms-be2007 is OK: OK - interfaces up
[10:14:48] <icinga-wm>	 RECOVERY - swift-account-server on ms-be2007 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[10:14:48] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[10:14:48] <icinga-wm>	 RECOVERY - Disk space on ms-be2007 is OK: DISK OK
[10:15:17] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2007 is OK: OK - load average: 7.76, 2.95, 1.30
[10:15:17] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:15:57] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[10:16:08] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[10:16:08] <icinga-wm>	 RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[10:16:37] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[10:16:42] <godog>	 !log dist-upgrade ms-be3002 to trusty
[10:16:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:16:57] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[10:16:58] <icinga-wm>	 RECOVERY - swift-object-server on ms-be2007 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[10:18:58] <icinga-wm>	 PROBLEM - RAID on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:19:38] <icinga-wm>	 PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:20:18] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: CRITICAL - load average: 416.85, 300.17, 141.92
[10:21:38] <icinga-wm>	 RECOVERY - Swift HTTP backend on ms-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.084 second response time
[10:28:08] <icinga-wm>	 PROBLEM - Swift HTTP backend on ms-fe2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:30:27] <icinga-wm>	 RECOVERY - Swift HTTP backend on ms-fe2004 is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.095 second response time
[10:30:48] <icinga-wm>	 RECOVERY - salt-minion processes on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:32:38] <icinga-wm>	 PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:33:48] <icinga-wm>	 PROBLEM - puppet last run on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:41:57] <icinga-wm>	 PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:42:51] <godog>	 !log powercycle ms-be2016, high load avg
[10:42:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:45:48] <icinga-wm>	 PROBLEM - Host ms-be2016 is DOWN: PING CRITICAL - Packet loss = 100%
[10:47:38] <icinga-wm>	 RECOVERY - Host ms-be2016 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms
[10:47:39] <icinga-wm>	 RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[10:48:49] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 7.41, 3.62, 1.43
[10:48:57] <icinga-wm>	 RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 57 minutes ago with 0 failures
[10:49:29] <icinga-wm>	 RECOVERY - RAID on ms-be2016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[11:08:09] <icinga-wm>	 RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:36:37] <grrrit-wm>	 (03PS3) 10KartikMistry: cxserver: Add all available source languages for Russian in Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) 
[11:36:56] <grrrit-wm>	 (03PS4) 10KartikMistry: cxserver: Add all available source languages for Russian in Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) 
[11:42:02] <grrrit-wm>	 (03Draft1) 10Addshore: wgRCWatchCategoryMembership true on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264732 
[11:51:26] <grrrit-wm>	 (03Draft1) 10Addshore: wgRCWatchCategoryMembership true on wikipedias & commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264733 
[11:58:16] <wikibugs>	 6operations, 7Swift: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918#1940833 (10fgiunchedi) 3NEW
[12:13:08] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[12:13:28] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[12:13:29] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[12:13:29] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[12:13:29] <icinga-wm>	 RECOVERY - swift-container-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[12:14:27] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be3001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[12:14:27] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[12:14:27] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[12:56:25] <grrrit-wm>	 (03CR) 10Thiemo Mättig (WMDE): [C: 031] Correct HTML code for WMF image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264461 (owner: 10Suriyaa Kudo)
[13:09:28] <icinga-wm>	 PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /a 329644 MB (3% inode=99%)
[13:57:27] <icinga-wm>	 PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:57:27] <icinga-wm>	 PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:58:47] <icinga-wm>	 PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:58:59] <icinga-wm>	 PROBLEM - puppet last run on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:28] <icinga-wm>	 PROBLEM - configured eth on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:28] <icinga-wm>	 PROBLEM - cassandra-a service on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:48] <icinga-wm>	 PROBLEM - DPKG on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:06:08] <icinga-wm>	 PROBLEM - RAID on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:06:37] <icinga-wm>	 PROBLEM - salt-minion processes on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:08:47] <icinga-wm>	 RECOVERY - salt-minion processes on praseodymium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:09:17] <icinga-wm>	 PROBLEM - Restbase root url on praseodymium is CRITICAL: Connection refused
[14:14:43] <godog>	 looks like praseodymium is down, looking
[14:15:18] <icinga-wm>	 PROBLEM - salt-minion processes on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:19:28] <icinga-wm>	 PROBLEM - Check size of conntrack table on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:19:36] <godog>	 mhh acpi_pad kernel thread using a lot of cpu, never seen this before
[14:22:57] <icinga-wm>	 PROBLEM - dhclient process on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:23:57] <icinga-wm>	 RECOVERY - Check size of conntrack table on praseodymium is OK: OK: nf_conntrack is 0 % full
[14:24:17] <icinga-wm>	 PROBLEM - Disk space on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:24:44] <godog>	 !log powercycle praseodymium 
[14:24:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:25:58] <icinga-wm>	 RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[14:28:45] <wikibugs>	 6operations: acpi_pad runaway processes on praseodymium - https://phabricator.wikimedia.org/T123924#1941039 (10fgiunchedi)
[14:29:37] <icinga-wm>	 RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
[14:32:17] <icinga-wm>	 PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100%
[14:32:58] <icinga-wm>	 RECOVERY - salt-minion processes on praseodymium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:33:07] <icinga-wm>	 RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms
[14:33:17] <icinga-wm>	 RECOVERY - Disk space on praseodymium is OK: DISK OK
[14:33:37] <icinga-wm>	 RECOVERY - DPKG on praseodymium is OK: All packages OK
[14:34:08] <icinga-wm>	 RECOVERY - dhclient process on praseodymium is OK: PROCS OK: 0 processes with command name dhclient
[14:34:47] <icinga-wm>	 RECOVERY - RAID on praseodymium is OK: OK: no disks configured for RAID
[14:41:38] <icinga-wm>	 PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:42:52] <grrrit-wm>	 (03PS2) 10Ottomata: Remove MobileWebSectionUsage from blacklist [puppet] - 10https://gerrit.wikimedia.org/r/264465 (owner: 10Milimetric)
[14:43:39] <icinga-wm>	 RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[14:50:27] <icinga-wm>	 PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:51:08] <icinga-wm>	 RECOVERY - Restbase root url on praseodymium is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.042 second response time
[14:52:29] <icinga-wm>	 RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[14:52:37] <icinga-wm>	 RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy
[14:52:57] <icinga-wm>	 RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active
[14:57:23] <wikibugs>	 6operations: acpi_pad runaway processes on praseodymium - https://phabricator.wikimedia.org/T123924#1941098 (10fgiunchedi) I've "fixed" it by turning off and on "logical processor" in bios (i.e. hyperthreading) without needing to physically drain power. I'm not sure about the exact functions of acpi_pad but we c...
[15:20:49] <dr0ptp4kt>	 akosiaris: say, https://etherpad.wikimedia.org/p/ReadingWebQ3Planning doesn't seem to be loading, whereas other etherpads are loading and can be started. are you able to get at https://etherpad.wikimedia.org/p/ReadingWebQ3Planning or able to extract its contents and email them to me?
[15:27:05] <dr0ptp4kt>	 anybody able to take a look at the etherpad issue? there's apparently a websocket 400 going on. cc phuedx 
[15:27:09] <dr0ptp4kt>	 cc bd808 
[15:38:40] <grrrit-wm>	 (03PS1) 10Ottomata: Use hiera to configure hive and oozie server hostnames [puppet] - 10https://gerrit.wikimedia.org/r/264742 (https://phabricator.wikimedia.org/T110090) 
[15:45:31] <akosiaris>	 dr0ptp4kt: I am taking a look, but as far as the extracting stuff out of an etherpad manually... I doubt it will make ansy sense to you
[15:45:40] <ema>	 it looks like the 400 WS error also happens on pads working properly
[15:45:43] <akosiaris>	 it's storing json changesets
[15:46:02] <dr0ptp4kt>	 akosiaris: k. json changesets, woah.
[15:46:04] <akosiaris>	 Invalid changeset (checkRep failed)
[15:46:11] <akosiaris>	 so, corrupt pad
[15:46:51] <akosiaris>	 ema: the websocket thing is meant to fail, we don't support websockets yet
[15:47:14] <dr0ptp4kt>	 akosiaris: got it. is it possible to restore it to the latest non-corrupt state?
[15:48:13] <akosiaris>	 dr0ptp4kt: maybe... depends on how the bad changeset is stored
[15:48:23] <akosiaris>	 not promising anything though... it's a long shot
[15:48:31] <dr0ptp4kt>	 akosiaris: understood
[15:52:11] <wikibugs>	 6operations, 10Continuous-Integration-Infrastructure, 10Gerrit, 10GitHub-Mirrors, and 3 others: [Task] Redirect unused extensions/ValueView repository to data-values/value-view - https://phabricator.wikimedia.org/T123624#1941263 (10thiemowmde) Personally I very much prefer redirects because they do not lea...
[15:52:45] <grrrit-wm>	 (03PS1) 10WMDE-leszek: Phragile: Ensure clone before creating storage dir [puppet] - 10https://gerrit.wikimedia.org/r/264745 
[16:03:22] <dr0ptp4kt>	 bd808: phuedx ^ akosiaris is looking into restoration to a non-corrupted pad state
[16:03:50] <phuedx>	 cool
[16:03:50] <phuedx>	 ta
[16:04:14] <bd808>	 dr0ptp4kt: maybe we can submit a FOIA request to the NSA for the contents of the pad ;)
[16:04:45] <dr0ptp4kt>	 bd808: KEYWORD_DETECTION_ACTIVATED
[16:04:50] <akosiaris>	 10018 revisions ? sigh... 
[16:05:07] <akosiaris>	 or so I think at least... maybe more
[16:06:37] <dr0ptp4kt>	 akosiaris: i wonder if somebody vandalized it. the last major edits were friday, jan 15 0100 utc approximately
[16:06:54] <dr0ptp4kt>	 (legitimate edits)
[16:07:48] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Use hiera to configure hive and oozie server hostnames [puppet] - 10https://gerrit.wikimedia.org/r/264742 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata)
[16:07:53] <akosiaris>	 dr0ptp4kt: well, vandalize is not a good word for this. Somehow, someone, most probably not maliciously, triggered a bug and corrupted a pad. That's not the first time and given etherpad's track record it will not be the last
[16:08:09] <dr0ptp4kt>	 akosiaris: /me cries
[16:08:25] <akosiaris>	 look at this for example 
[16:08:27] <akosiaris>	 pad:ReadingWebQ3Planning:revs:10018 | {"changeset":"Z:hbb<9|9e=f9m-a*2+1$O","meta":{"author":"a.CM2Hc5CAvS1dZuDw","timestamp":1452819512958}} 
[16:09:06] <akosiaris>	 that's revision 10018 ... just a change from the previous one I assume.... which one actually has useful data in it ... still looking
[16:09:15] <bd808>	 dr0ptp4kt: https://etherpad.wikimedia.org/p/ReadingWebQ3Planning/timeslider#10018
[16:09:30] <akosiaris>	 wow, that works ?
[16:09:32] <akosiaris>	 bd808: nice
[16:09:41] <akosiaris>	 I did not expect the timeslider to work
[16:10:23] <dr0ptp4kt>	 bd808: nice!
[16:10:55] <dr0ptp4kt>	 akosiaris: thx also for looking into it. no point mucking with it any further. i've saved off a copy
[16:12:55] <akosiaris>	 dr0ptp4kt: ok, I 'll much around a bit more with it in case I find something... this is interesting... the freaking timeslider working but not the pad itself is weird
[16:13:05] <akosiaris>	 oh and that changeset looks like it's the last one
[16:13:17] <akosiaris>	 can't find anything after that
[16:15:15] <akosiaris>	 heh, https://github.com/ether/etherpad-lite/issues/2107
[16:15:26] <akosiaris>	 still open, 1,5 year later
[16:15:56] <bd808>	 akosiaris: that's the bug that made me try timeslider
[16:17:05] <bd808>	 there is a dump&reimport script that some folk had success with -- https://github.com/ether/etherpad-lite/pull/2210/files -- but it requires taking the whole server offline
[16:18:14] <akosiaris>	 that's probably to guarantee the transaction
[16:18:54] <akosiaris>	 I 've tried it .. no change
[16:23:13] <dr0ptp4kt>	 akosiaris: :)
[16:33:58] <grrrit-wm>	 (03CR) 10Subramanya Sastry: "integration/visualdiff has now been instantiated ... this is ready for review and merge. I can then apply on ruthenium and test it." [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry)
[16:45:58] <grrrit-wm>	 (03CR) 10Anomie: "I put this on SWAT for tomorrow morning, FYI." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264437 (owner: 10Anomie)
[16:50:42] <grrrit-wm>	 (03CR) 10Bartosz Dziewoński: "(typo: "dupliacte")" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264437 (owner: 10Anomie)
[16:54:35] <grrrit-wm>	 (03CR) 10Nuria: "Thanks for doing this. Did you re-started EL so changes take effect?" [puppet] - 10https://gerrit.wikimedia.org/r/264465 (owner: 10Milimetric)
[17:00:13] <grrrit-wm>	 (03PS1) 10Luke081515: Apply global blocks for meta, not for deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264758 (https://phabricator.wikimedia.org/T123936) 
[17:04:30] <grrrit-wm>	 (03PS3) 10Anomie: Centralize and add rights and grants in preparation for grants moving into core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264437 
[17:04:43] <grrrit-wm>	 (03PS2) 10Luke081515: Apply global blocks at meta, not at deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264758 (https://phabricator.wikimedia.org/T123936) 
[17:06:40] <grrrit-wm>	 (03PS1) 10Ottomata: Move hive-server, hive-metastore and oozie from analytics1027 to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/264760 (https://phabricator.wikimedia.org/T110090) 
[17:06:47] <grrrit-wm>	 (03CR) 10Luke081515: [C: 031] Centralize and add rights and grants in preparation for grants moving into core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264437 (owner: 10Anomie)
[17:07:33] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Move hive-server, hive-metastore and oozie from analytics1027 to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/264760 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata)
[17:27:07] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] "looks like global blocks in beta have come from deployment instead of meta for quite a while" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264758 (https://phabricator.wikimedia.org/T123936) (owner: 10Luke081515)
[17:27:31] <grrrit-wm>	 (03Merged) 10jenkins-bot: Apply global blocks at meta, not at deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264758 (https://phabricator.wikimedia.org/T123936) (owner: 10Luke081515)
[17:30:31] <grrrit-wm>	 (03PS2) 10Ottomata: Move hive-server, hive-metastore and oozie from analytics1027 to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/264760 (https://phabricator.wikimedia.org/T110090) 
[17:30:51] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/264758 - labs-only change (duration: 00m 36s)
[17:30:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:30:59] <Krenair>	 Luke081515, ^
[17:31:29] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Move hive-server, hive-metastore and oozie from analytics1027 to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/264760 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata)
[17:33:12] <grrrit-wm>	 (03PS1) 10Jforrester: Enable VisualEditor by default for some other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264765 (https://phabricator.wikimedia.org/T116523) 
[17:34:02] <Luke081515>	 Krenair: Thanks
[17:34:32] <grrrit-wm>	 (03CR) 10Jforrester: "Scheduled for 25 January." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264765 (https://phabricator.wikimedia.org/T116523) (owner: 10Jforrester)
[17:51:59] <wikibugs>	 7Blocked-on-Operations, 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim erbium, gadolinium into spares - https://phabricator.wikimedia.org/T123029#1941591 (10akosiaris) Are those going to be reinstalled soon ? I see erbium still in DNS btw
[17:54:36] <wikibugs>	 6operations, 10DBA, 5Patch-For-Review: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1941615 (10jcrespo) 5Open>3stalled
[19:01:55] <andrewbogott>	 mobrovac: did anyone follow up with you about updating your contact info?
[19:12:38] <icinga-wm>	 PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:14:29] <icinga-wm>	 RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 0.010 second response time
[19:25:29] <icinga-wm>	 PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail
[19:40:08] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1001 is CRITICAL: CRITICAL - load average: 234.71, 171.11, 86.19
[19:52:59] <icinga-wm>	 RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:06:36] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Add labtestwiki, a testing version of labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 
[20:08:16] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Add labtestwiki, a testing version of labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 
[20:24:55] <mobrovac>	 hm, what happened to rb on praseodymium today?
[20:25:19] <RoanKattouw>	 Oh God we still have prasdfjsdlkfjdslkfium?
[20:25:42] <RoanKattouw>	 I was talking about that box just the other day as an example of terrible server names from the past
[20:25:46] <RoanKattouw>	 But now I see it's not from the past
[20:31:15] <mobrovac>	 RoanKattouw: i don't have a problem with it, as long as my shell can auto-complete it :P
[20:47:42] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 04-1] "Shouldn't this be labstestwiki, with an s?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 (owner: 10Andrew Bogott)
[20:53:31] <grrrit-wm>	 (03CR) 10Alex Monk: Add labtestwiki, a testing version of labswiki (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 (owner: 10Andrew Bogott)
[20:54:59] <grrrit-wm>	 (03CR) 10Alex Monk: Add labtestwiki, a testing version of labswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 (owner: 10Andrew Bogott)
[20:55:35] <josephine>	 Hey, is there anyone on the ops team that can help me out?
[20:55:54] <wikibugs>	 6operations, 10Analytics, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1942024 (10mobrovac)
[20:56:24] <RoanKattouw>	 josephine: I'm not ops, but if you say what you need I may know who can help
[20:57:37] <josephine>	 RoanKattouw: I created a google group for someone, but I keep getting a bounceback email every time I try to send a "Test" email to it :/
[20:58:33] <RoanKattouw>	 josephine: Hmm, I think the person I'd ask about that is mutante but I'm not sure if he's around today
[20:59:04] <RoanKattouw>	 andrewbogott is clearly active but I don't know if he knows anything about email stuff
[20:59:14] <josephine>	 RoanKattouw: Yea, I thought it was a long shot since its a holiday today.
[21:00:02] <josephine>	 RoanKattouw: I'll just file a ticket on phab :)
[21:00:15] <RoanKattouw>	 Yeah that works
[21:00:32] <RoanKattouw>	 It's pretty quiet here in the office
[21:06:47] <wikibugs>	 6operations, 10Analytics, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1942098 (10Ottomata) > RESTBase is stateless per se, but it relies on Cassandra, for which a cross-DC repli...
[21:09:15] <wikibugs>	 6operations, 10Analytics, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1942101 (10Ottomata) If Services doesn't need events from eqiad to show up in codfw (and vice versa), then...
[21:41:55] <Luke081515>	 ori: Can you take a look at https://phabricator.wikimedia.org/T115584 ?
[21:43:26] <grrrit-wm>	 (03CR) 10Andrew Bogott: "'labtestwiki' is consistent with the 'labtest' domain and the host names which are all 'labtest*'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 (owner: 10Andrew Bogott)
[21:45:05] <grrrit-wm>	 (03CR) 10Alex Monk: "ok. also, ignore my comment earlier about multiversion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 (owner: 10Andrew Bogott)
[21:50:05] <grrrit-wm>	 (03PS3) 10Andrew Bogott: Add labtestwiki, a testing version of labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 
[22:20:43] <grrrit-wm>	 (03PS4) 10Alex Monk: Add labtestwiki, a testing version of labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 (owner: 10Andrew Bogott)
[22:35:24] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Tidy up wiki crons for labtestwiki: [puppet] - 10https://gerrit.wikimedia.org/r/264882 
[22:45:07] <ori>	 Luke081515|away: on it
[22:45:46] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 031] Tidy up wiki crons for labtestwiki: [puppet] - 10https://gerrit.wikimedia.org/r/264882 (owner: 10Andrew Bogott)
[22:48:59] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Tidy up wiki crons for labtestwiki: [puppet] - 10https://gerrit.wikimedia.org/r/264882 
[22:50:30] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Tidy up wiki crons for labtestwiki: [puppet] - 10https://gerrit.wikimedia.org/r/264882 (owner: 10Andrew Bogott)
[22:51:26] <Krenair>	 andrewbogott, should I do the mediawiki-config change then?
[22:51:43] <andrewbogott>	 Krenair: yes please, if you feel confident about it.
[22:51:57] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Add labtestwiki, a testing version of labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 (owner: 10Andrew Bogott)
[22:52:21] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add labtestwiki, a testing version of labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264786 (owner: 10Andrew Bogott)
[22:53:33] <logmsgbot>	 !log krenair@tin Synchronized w/static/images/project-logos/wikitech.png: https://gerrit.wikimedia.org/r/#/c/264786/ (duration: 00m 31s)
[22:53:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:55:02] <logmsgbot>	 !log krenair@tin Synchronized dblists: (no message) (duration: 00m 31s)
[22:55:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:55:22] <logmsgbot>	 !log krenair@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message)
[22:55:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:56:10] <logmsgbot>	 !log krenair@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/264786/ (duration: 00m 32s)
[23:00:13] <Krenair>	 andrewbogott, uh... so that might have stopped normal wikitech from working
[23:00:22] <icinga-wm>	 PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:00:46] <andrewbogott>	 hm, so I see :(
[23:01:49] <Krenair>	 but why...
[23:02:38] <andrewbogott>	 I don’t think this is related, but I am loving this error message:  "GC cache entry 'labswiki:cirrussearch-morelikethis-settings:2' was on gc-list for 3602 seconds in /srv/mediawiki/php-1.27.0-wmf.9/includes/libs/objectcache/APCBagOStuff.php"
[23:05:12] <Krenair>	 Ohh, wait
[23:05:15] <andrewbogott>	 Krenair: it’s looking for labswiki on db1029.eqiad.wmnet instead of on silver
[23:05:21] <Krenair>	 I might know what this is
[23:05:31] <Krenair>	         'extension1' => array(
[23:05:31] <Krenair>	                 '10.64.16.18' => 10, # db1029
[23:05:31] <Krenair>	                 '10.64.16.20' => 20, # db1031 snapshot host
[23:05:36] <Krenair>	 it's looking for one of those two servers
[23:06:12] <Krenair>	 So maybe this is wmgEchoCluster
[23:06:53] <Krenair>	 Yep.
[23:06:59] <Krenair>	 krenair@tin:/srv/mediawiki-staging (master)$ mwscript eval.php labswiki
[23:06:59] <Krenair>	 > var_dump( $wmgEchoCluster );
[23:06:59] <Krenair>	 bool(false)
[23:07:00] <Krenair>	 Whereas:
[23:07:05] <Krenair>	 krenair@silver:~$ mwscript eval.php labswiki
[23:07:05] <Krenair>	 > var_dump( $wmgEchoCluster );
[23:07:05] <Krenair>	 string(10) "extension1"
[23:07:18] <Krenair>	 running sync-common on silver
[23:07:59] <chasemp>	 I am not home atm but i see you guys are in the thick of it
[23:08:10] <Krenair>	 you got a page about wikitech chasemp?
[23:08:15] <andrewbogott>	 chasemp: two unrelated things I think
[23:08:31] <andrewbogott>	 Krenair: tools paged at the same time, but it was back up by the time I looked.
[23:08:31] <Krenair>	 Okay
[23:08:36] <chasemp>	 NFS
[23:08:36] <Krenair>	 It's back up
[23:09:16] <Krenair>	 I'm not clear on why that fixed it
[23:10:00] <andrewbogott>	 Krenair: https://labtestwikitech.wikimedia.org/wiki/ is still saying ‘no wiki found’ but I expect that’s because the db is empty?
[23:11:22] <Krenair>	 No
[23:11:33] <Krenair>	 It's because I haven't figured out how to sync files to the test server yet
[23:12:05] <andrewbogott>	 oh?  I ran ‘sync common’ there already and it seemed ok
[23:12:09] <Krenair>	 That error is about being unable to map the host properly to something in all.dblist
[23:12:12] <andrewbogott>	 but the db is still a mess
[23:13:29] <andrewbogott>	 heh, it doesn’t even have mysql-server installed
[23:13:40] <Krenair>	 Huh.
[23:14:03] <Krenair>	 Oh, I know why.
[23:14:28] <Krenair>	 Remember when I said to ignore my comment about multiversion?
[23:14:38] <Krenair>	 Well... I was right in the first place
[23:15:22] <andrewbogott>	 what was wrong about my entry?
[23:15:24] <Krenair>	 simple command on wikitech to make that server sync:
[23:15:25] <Krenair>	 SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@labtestwikitech /srv/deployment/scap/scap/bin/sync-common
[23:15:38] <Krenair>	 andrewbogott: you didn't add one
[23:15:46] <Krenair>	 on tin* even
[23:16:25] <andrewbogott>	 this is different from wikiversions.json I take it
[23:16:35] <Krenair>	 yes
[23:16:42] <Krenair>	 I'll upload the patch
[23:16:48] <andrewbogott>	 ok, thanks
[23:17:23] <andrewbogott>	 Krenair: this is unrelated to the ‘no database server’ thing right?
[23:17:26] <Krenair>	 (that I already pushed, now it shows the DB error)
[23:17:37] <andrewbogott>	 ok
[23:17:38] <icinga-wm>	 PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: puppet fail
[23:17:41] <Krenair>	 right, the no DB server error is because mysql is not installed
[23:17:49] <andrewbogott>	 yeah, that seems bad
[23:17:56] <andrewbogott>	 I guess it’s not puppetized properly on silver :(
[23:18:03] <Krenair>	 surprise surprise :P
[23:18:10] <Reedy>	 or it's from a different role
[23:18:24] <Reedy>	 I guess that's the same thing, effectively
[23:18:30] <Krenair>	 no, the roles match I think
[23:19:23] <Krenair>	 andrewbogott, oh, before I forget: other puppet change needed is dealing with silver.yaml
[23:19:32] <andrewbogott>	 so probably we need role::mariadb 
[23:21:22] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Allow deployer access to labtestweb [puppet] - 10https://gerrit.wikimedia.org/r/264894 
[23:21:23] <andrewbogott>	 Krenair: Like ^ you mean?
[23:22:18] <Krenair>	 I suppose that works, although ideally the admin::groups and apache::logrotate::rotate from both hosts would be moved out into a common file
[23:22:24] <Krenair>	 and the node blocks in site.pp merged
[23:23:02] <andrewbogott>	 this box isn’t going to == silver in the long run.  I’m hoping to host horizon there too...
[23:23:12] <Krenair>	 oh, ok
[23:23:13] <andrewbogott>	 so I definitely don’t want to merge the node definitions
[23:23:57] <grrrit-wm>	 (03PS1) 10Alex Monk: labtestwikitech.wikimedia.org -> labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264895 
[23:24:20] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] labtestwikitech.wikimedia.org -> labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264895 (owner: 10Alex Monk)
[23:24:28] <YuviPanda>	 hey
[23:24:31] <YuviPanda>	 andrewbogott: I got an NFS page
[23:24:38] <YuviPanda>	 is everything ok?
[23:24:52] <grrrit-wm>	 (03Merged) 10jenkins-bot: labtestwikitech.wikimedia.org -> labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264895 (owner: 10Alex Monk)
[23:24:54] <YuviPanda>	 seems ok except for things being fucking slow but that's business as usual
[23:24:57] <YuviPanda>	 ok
[23:25:43] <andrewbogott>	 YuviPanda: I think that nfs had a stutter which I haven’t investigated.
[23:25:53] <andrewbogott>	 Alex and I are messing with wikitech but hopefully the ‘broken’ phase of that is over now
[23:25:57] <YuviPanda>	 ok
[23:26:18] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Include role::mariadb on silver and labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/264896 
[23:26:27] <Krenair>	 I think wikitech was fixed, btw
[23:26:43] <logmsgbot>	 !log krenair@tin Synchronized multiversion/MWMultiVersion.php: https://gerrit.wikimedia.org/r/264895 (duration: 00m 31s)
[23:26:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:30:16] <andrewbogott>	 well… I wish I could test 264896 with the puppet compiler
[23:30:28] <andrewbogott>	 otherwise who knows what will happen to silver
[23:31:22] <Krenair>	 why can't you test it with that?
[23:31:41] <YuviPanda>	 andrewbogott: did you get a recovery SMS for labs?
[23:31:45] <YuviPanda>	 for NFS
[23:32:00] <andrewbogott>	 YuviPanda: I got no sms at all
[23:32:01] <Krenair>	 there wasn't a recovery from icinga
[23:33:08] <YuviPanda>	 ok
[23:33:09] <andrewbogott>	 Krenair: because of https://puppet-compiler.wmflabs.org/1623/ which seems more like ‘puppet compiler can’t handle this’ than ‘actual error'
[23:33:11] <andrewbogott>	 but I’m still reading
[23:34:02] <Krenair>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162&nostatusheader still shows NFS as critical
[23:34:07] <andrewbogott>	 Hm, actually, YuviPanda http://tools-checker.wmflabs.org/nfs/home
[23:34:25] <Krenair>	 and indeed if I go to http://tools-checker.wmflabs.org/ I get 502 Bad Gateway
[23:34:42] <icinga-wm>	 RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.170 second response time
[23:34:45] <YuviPanda>	 andrewbogott: yeah, I just fixed it
[23:34:50] <YuviPanda>	 andrewbogott: tools checker was stuck
[23:34:57] <andrewbogott>	 did you just restart nginx?
[23:35:26] <YuviPanda>	 uwsgi
[23:35:28] <YuviPanda>	 I know the cause too
[23:35:32] <YuviPanda>	 webservice gets stuck sometimes
[23:35:34] <YuviPanda>	 need to add timeouts
[23:36:29] <andrewbogott>	 Krenair: I think it’s because we don’t allow exported resources in labs
[23:36:57] <Krenair>	 k, I think this is beyond my level of puppet knowledge :p
[23:37:38] <YuviPanda>	 I've a fix coming up
[23:37:54] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Allow deployer access to labtestweb [puppet] - 10https://gerrit.wikimedia.org/r/264894 (owner: 10Andrew Bogott)
[23:38:10] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Include role::mariadb on labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/264896 
[23:39:45] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Include role::mariadb on labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/264896 (owner: 10Andrew Bogott)
[23:40:08] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Setup timeouts for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/264898 (https://phabricator.wikimedia.org/T123987) 
[23:41:27] <grrrit-wm>	 (03PS2) 10Yuvipanda: tools: Setup timeouts for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/264898 (https://phabricator.wikimedia.org/T123987) 
[23:42:19] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Setup timeouts for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/264898 (https://phabricator.wikimedia.org/T123987) (owner: 10Yuvipanda)
[23:43:28] <icinga-wm>	 RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:44:37] <YuviPanda>	 andrewbogott: that should take care of the tools checker paging problem
[23:44:45] <andrewbogott>	 great, thanks!
[23:49:32] <andrewbogott>	 ok, I am doing something dumb… role::mariadb is included on labtestweb2001 but none of its packages are getting installed
[23:52:16] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Added icinga check for 'showmount' on tools instances. [puppet] - 10https://gerrit.wikimedia.org/r/264049 (https://phabricator.wikimedia.org/T123588) 
[23:53:49] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Added icinga check for 'showmount' on tools instances. [puppet] - 10https://gerrit.wikimedia.org/r/264049 (https://phabricator.wikimedia.org/T123588) (owner: 10Andrew Bogott)
[23:58:35] <andrewbogott>	 ok… YuviPanda do you have time to explain my mistake?  labtestweb2001 includes role::mariadb which includes mariadb which includes mariadb::packages which installs mariadb-server-5.5
[23:58:41] <andrewbogott>	 …so why isn’t that package on labtestweb2001?
[23:59:18] <YuviPanda>	 looking
[23:59:54] <YuviPanda>	 at least one reason is
[23:59:56] <YuviPanda>	 >         dist       => 'precise-wikimedia',
[23:59:59] <YuviPanda>	 and I bet you aren't on precise :)