[01:27:00] (CR) VolkerE: [C: -1] Remove $wgCopyrightIcon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:36:44] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:50:33] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: puppet fail
[02:02:53] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[02:12:02] PROBLEM - Hadoop NodeManager on analytics1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:16:12] RECOVERY - Hadoop NodeManager on analytics1035 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:17:22] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[02:25:17] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 05s)
[02:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 4 02:32:10 UTC 2016 (duration 6m 53s)
[02:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:50:04] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: puppet fail
[03:18:57] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:09:42] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail
[04:35:37] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:01:06] PROBLEM - Hadoop NodeManager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:25:46] RECOVERY - Hadoop NodeManager on analytics1041 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:49:31] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:51:32] RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:07:44] (CR) BBlack: [C: -1] "(1) Unless the rest API URLs *never* provide content that varies per user, this would be incorrect." [puppet] - https://gerrit.wikimedia.org/r/261662 (https://phabricator.wikimedia.org/T122673) (owner: GWicke)
[06:25:16] PROBLEM - Disk space on elastic1004 is CRITICAL: DISK CRITICAL - free space: / 965 MB (3% inode=95%)
[06:27:25] RECOVERY - Disk space on elastic1004 is OK: DISK OK
[06:31:25] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:39] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:39] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:48] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:58:18] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:09] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:48:09] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: puppet fail
[09:09:28] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:15:07] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:54:11] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:06:09] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:27:52] (PS5) Hashar: contint: remove maven webproxy [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[10:29:45] (CR) Hashar: "webproxy.eqiad.wmnet can no more be reached by labs instance since Dec 24th. I have dropped the configuration for MediaWiki https://gerri" [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[10:33:07] operations, Continuous-Integration-Infrastructure, Patch-For-Review: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1914078 (hashar)
[10:33:22] operations, Continuous-Integration-Infrastructure, Patch-For-Review: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1908476 (hashar) webproxy.eqiad.wmnet is no more available to labs instances ( T122368 ). I did fix it for MediaWiki https://gerrit.wikimedia.org/r/#/c...
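The Hadoop NodeManager and salt-minion alerts above come from the standard Nagios/Icinga check_procs plugin, which counts processes matching a given command name and argument string and goes CRITICAL when the count falls outside the allowed range. Below is a minimal sketch of that kind of probe run by hand; the plugin path and the 1:1 threshold are illustrative assumptions, not copied from the Wikimedia configuration:

    # Require exactly one java process whose arguments contain the YARN
    # NodeManager main class; exit CRITICAL when zero (or more than one) match.
    /usr/lib/nagios/plugins/check_procs -c 1:1 -C java \
        -a 'org.apache.hadoop.yarn.server.nodemanager.NodeManager'

When the NodeManager dies the count drops to 0 and the bot reports "PROCS CRITICAL: 0 processes ...", exactly as in the alerts above; the check recovers on its own once the process is restarted.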
[10:34:15] operations, Continuous-Integration-Infrastructure: Webproxy on carbon unreachable from labs instances since Dec 24 roughly 1am - https://phabricator.wikimedia.org/T122461#1914084 (hashar) `maven` was still being routed via webproxy: T122594
[10:34:26] operations, Continuous-Integration-Infrastructure, Patch-For-Review: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1908476 (hashar) Same as T122461
[10:35:19] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 843
[10:40:09] RECOVERY - check_mysql on db1008 is OK: Uptime: 1188202 Threads: 55 Questions: 40090554 Slow queries: 14414 Opens: 58874 Flush tables: 2 Open tables: 416 Queries per second avg: 33.740 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:59:00] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[11:01:09] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[11:56:49] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[12:00:19] PROBLEM - Hadoop NodeManager on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:03:32] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[12:18:42] RECOVERY - Hadoop NodeManager on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:19:42] PROBLEM - puppet last run on elastic1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[12:22:53] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:45:57] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: puppet fail
[12:47:47] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:14:26] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:22:00] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, and 3 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1914298 (hashar) Nodepool relies on the wmflabs OpenStack API. Whenever OpenS...
[13:44:13] (PS1) Hashar: nodepool: set Nova API timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/262343 (https://phabricator.wikimedia.org/T122731)
[13:45:33] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, and 4 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1914303 (hashar) a: hashar https://gerrit.wikimedia.org/r/#/c/262343/ sets...
[13:45:59] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, Patch-For-Review: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1914305 (hashar)
[13:46:46] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, Patch-For-Review: Nodepool deadlocks when querying unresponsive OpenStack API (was: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins) - https://phabricator.wikimedia.org/T122731#1914308...
[14:15:44] PROBLEM - Hadoop NodeManager on analytics1037 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[14:15:54] * Steinsplitter pokes jynus
[14:34:57] RECOVERY - Hadoop NodeManager on analytics1037 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[15:42:51] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: puppet fail
[15:52:14] (Abandoned) Paladox: gitblit: Fix "Sorry, the repository $1 does not have a $2 branch" [puppet] - https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: Paladox)
[15:52:33] (Abandoned) Paladox: Re enable git.enableGitServlet [puppet] - https://gerrit.wikimedia.org/r/250450 (owner: Paladox)
[16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160104T1600).
[16:09:55] (PS1) Shanmugamp7: Add en wiki as transwiki import source for ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808)
[16:11:47] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:36:50] (CR) Hashar: [C: 1] contint: remove maven webproxy [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[16:39:34] (CR) Hashar: [C: 1] RuboCop: fixed Style/CaseIndentation offense [puppet] - https://gerrit.wikimedia.org/r/259699 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[16:59:29] (CR) Luke081515: [C: -1] Add en wiki as transwiki import source for ta.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808) (owner: Shanmugamp7)
[17:00:00] (PS6) Faidon Liambotis: contint: remove maven webproxy [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[17:00:11] (CR) Faidon Liambotis: [C: 2] contint: remove maven webproxy [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[17:03:50] (PS2) Shanmugamp7: Add en wiki as transwiki import source for ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808)
[17:04:07] ccccccelhiklvvbcknnhktfrvugfeejeluvhuljfvedt
[17:04:13] garg.
[17:04:36] stupid window focus. good thing it's a otp...
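For context on the Nodepool patch above (https://gerrit.wikimedia.org/r/262343): Nodepool talks to the wmflabs OpenStack (Nova) API, and when that API stops responding the calls block forever, leaving no ci-jessie-wikimedia slaves attached to Jenkins; the fix bounds those HTTP calls at 60 seconds. The same idea can be sketched from a shell, assuming the python-novaclient CLI with its --timeout option is installed and the usual OS_* credential variables are set:

    # Give up after 60 seconds instead of hanging indefinitely when the
    # Nova API is unresponsive; the failure can then be logged and retried.
    nova --timeout 60 list

With a bounded call, a stuck API surfaces as an error that can be handled, rather than the silent deadlock described in T122731.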
[17:14:10] (PS4) Florianschmidtwelzow: Remove $wgCopyrightIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754)
[17:14:17] (CR) Florianschmidtwelzow: Remove $wgCopyrightIcon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[17:34:30] (PS1) Yuvipanda: base: Do not do add host nagios monitoring in labs [puppet] - https://gerrit.wikimedia.org/r/262359 (https://phabricator.wikimedia.org/T122757)
[17:34:33] valhallasw`cloud: ^
[17:34:41] at least some of Luke081515's failures were related to arcconf
[17:36:47] YuviPanda: isn't the real issue something with the package server?
[17:37:13] valhallasw`cloud: possibly, am still digging.
[17:37:17] valhallasw`cloud: but this is a good thing anyway
[17:37:20] removes crap we don't use
[17:37:26] *nod*
[17:37:33] !log yurik@tin Synchronized php-1.27.0-wmf.9/extensions/Graph/: Deployed Graph ext - gerrit 262357 (duration: 00m 33s)
[17:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:37:58] valhallasw`cloud: it's also too early aaaaaaaaaaaaaaaaaaaaaaaaa
[17:38:49] (CR) Merlijn van Deen: [C: 1] base: Do not do add host nagios monitoring in labs [puppet] - https://gerrit.wikimedia.org/r/262359 (https://phabricator.wikimedia.org/T122757) (owner: Yuvipanda)
[18:14:38] (PS3) Luke081515: Add enwiki as transwiki import source for ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808) (owner: Shanmugamp7)
[18:15:14] (CR) Luke081515: [C: 1] "Thanks for the patch (You can use the dbnames at the commit messages too, there are shorter)" [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808) (owner: Shanmugamp7)
[18:39:25] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/2: down - Transit: Zayo (IPYX/125449/004/ZYO) {#?} [10Gbps]
[18:45:14] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/2: down - Transit: Zayo (IPYX/125449/002/ZYO) {#?} [10Gbps]
[18:46:08] operations, ops-codfw, netops, procurement: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1914742 (RobH) NEW a: Papaul
[18:46:18] operations, ops-codfw, netops: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1914742 (RobH)
[18:46:31] operations, ops-codfw, netops: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1914742 (RobH)
[18:49:24] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/3/1: down - Transit: Zayo (IPYX/125449/001/ZYO) {#?} [10Gbps]
[18:50:45] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/3/1: down - Transit: Zayo (IPYX/125449/003/ZYO) {#?} [10Gbps]
[18:52:39] (CR) Dzahn: [C: 1] "the ticket has approval meanwhile and i think the 3 days have passed as well" [puppet] - https://gerrit.wikimedia.org/r/261217 (https://phabricator.wikimedia.org/T122524) (owner: Jcrespo)
[18:52:54] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 120, down: 0, dormant: 0, excluded: 1, unused: 0
[18:53:34] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0
[18:53:34] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0
[18:54:05] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 1, unused: 0
[18:56:16] (PS2) Andrew Bogott: nodepool: set Nova API timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/262343 (https://phabricator.wikimedia.org/T122731) (owner: Hashar)
[18:56:55] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, and 2 others: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1914783 (Dzahn) added +1 for https://gerrit.wikimedia.org/r/#/c/261217/ i can merge this (tomorrow then), i'm here
[18:57:58] (CR) Andrew Bogott: [C: 2] nodepool: set Nova API timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/262343 (https://phabricator.wikimedia.org/T122731) (owner: Hashar)
[18:58:50] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914796 (Dzahn) a: Dzahn
[19:03:32] operations, ops-codfw, fundraising-tech-ops: bellatrix hardware RAID predictive failure - https://phabricator.wikimedia.org/T122026#1914805 (Jgreen) Open>Resolved done
[19:06:36] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914817 (Dzahn) p: Triage>Normal With the changes above the timeout has been raised from 10 to 20 and made configurable. It can be adjusted by editing `check_command => 'check_http...
[19:07:06] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914823 (Dzahn) Open>Resolved
[19:07:31] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914827 (yuvipanda) \o/ Thanks for doing that! I think we can re-enable the SMS notification now too
[19:09:21] (PS1) Dzahn: Revert "toollabs: disable paging for tools-home/NFS" [puppet] - https://gerrit.wikimedia.org/r/262365 (https://phabricator.wikimedia.org/T122615)
[19:10:16] (PS2) Dzahn: Revert "toollabs: disable paging for tools-home/NFS" [puppet] - https://gerrit.wikimedia.org/r/262365 (https://phabricator.wikimedia.org/T122615)
[19:10:37] operations, Services, Wikimedia-Developer-Summit-2016: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#1914833 (mobrovac) NEW
[19:12:52] (PS3) Dzahn: Revert "toollabs: disable paging for tools-home/NFS" [puppet] - https://gerrit.wikimedia.org/r/262365 (https://phabricator.wikimedia.org/T122615)
[19:14:25] (CR) Dzahn: [C: 2] "per T122615#1914827 and checks have been OK for 19h / 1d6h and per Yuvi on" [puppet] - https://gerrit.wikimedia.org/r/262365 (https://phabricator.wikimedia.org/T122615) (owner: Dzahn)
[19:16:29] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914888 (Dzahn) >>! In T122615#1914827, @yuvipanda wrote: > \o/ Thanks for doing that! I think we can re-enable the SMS notification now too alright, done with the revert above. will be...
[19:18:14] operations, Labs: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914892 (Dzahn)
[19:18:59] operations, Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#1914898 (Dzahn) p: Normal>Low
[19:23:20] operations, ores, Icinga: change ores monitoring to avoid icinga reload on puppet runs - https://phabricator.wikimedia.org/T122830#1914916 (Dzahn) NEW
[19:23:31] operations, ores, Icinga: change ores monitoring to avoid icinga reload on puppet runs - https://phabricator.wikimedia.org/T122830#1914923 (Dzahn) a: Dzahn
[19:28:04] !log elastic1006 - out of disk - gzip eqiad_index_search_slowlog.log files
[19:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:34:44] RECOVERY - Disk space on elastic1006 is OK: DISK OK
[19:38:57] operations, Continuous-Integration-Infrastructure, Patch-For-Review, User-bd808: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1914933 (hashar) a: bd808 Patched by Stas deployed by Bryan on the CI puppet master. I have amended the change to cleanup all refere...
[19:39:04] operations, Continuous-Integration-Infrastructure, Patch-For-Review, User-bd808: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1914936 (hashar) Open>Resolved All good now! Thank you for the quick fix!
[19:50:23] RECOVERY - puppet last run on elastic1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:56:58] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, Patch-For-Review: Nodepool deadlocks when querying unresponsive OpenStack API (was: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins) - https://phabricator.wikimedia.org/T122731#1914967...
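The T122615 fix discussed above raised the tools-home check timeout from 10 to 20 seconds and made it configurable through the check_command definition. As a rough illustration with the stock check_http plugin (the hostname, URL and plugin path here are placeholders, not the exact Wikimedia check):

    # Fetch the tools front page and allow the probe itself up to 20 seconds
    # (-t) before it reports CRITICAL; the previous 10-second limit produced
    # false alerts when the NFS-backed page was slow to respond.
    /usr/lib/nagios/plugins/check_http -H tools.wmflabs.org -u / -t 20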
[19:57:16] operations, Discovery, Elasticsearch: elastic - large slow query logs / disk space - https://phabricator.wikimedia.org/T122832#1914969 (Dzahn) NEW
[19:58:45] ACKNOWLEDGEMENT - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: puppet fail daniel_zahn per labtest this cant be critical
[19:58:45] ACKNOWLEDGEMENT - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: puppet fail daniel_zahn per labtest this cant be critical
[19:58:45] ACKNOWLEDGEMENT - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: puppet fail daniel_zahn per labtest this cant be critical
[20:00:08] !log mw1123 - start HHVM (was 503 and service stopped)
[20:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:01:36] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 69877 bytes in 4.105 second response time
[20:01:48] ^ that's fun how that actually worked
[20:02:07] because Uncaught exception: HHVM no longer supports the built-in webserver as of 3.0.0 always looks so bad first
[20:02:17] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.083 second response time
[20:02:57] operations, Analytics-Backlog, HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1914981 (leila)
[20:04:54] operations, Discovery, Elasticsearch: elastic - large slow query logs / server runs out of disk space - https://phabricator.wikimedia.org/T122832#1914995 (Dzahn)
[20:05:45] operations, Analytics-Backlog, HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1914998 (leila) @ori we need it for the reader research, for matching QuickSurvey responses to the webrequest logs. @Ottomata, can we look into this ticket?
[20:07:47] ACKNOWLEDGEMENT - Apache HTTP on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time daniel_zahn https://phabricator.wikimedia.org/T122005
[20:07:47] ACKNOWLEDGEMENT - HHVM processes on mw1228 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm daniel_zahn https://phabricator.wikimedia.org/T122005
[20:07:47] ACKNOWLEDGEMENT - HHVM rendering on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time daniel_zahn https://phabricator.wikimedia.org/T122005
[20:07:47] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1228 is CRITICAL: Host mw1228 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T122005
[20:07:47] ACKNOWLEDGEMENT - puppet last run on mw1228 is CRITICAL: CRITICAL: Puppet last ran 15 days ago daniel_zahn https://phabricator.wikimedia.org/T122005
[20:10:47] RECOVERY - HTTP-peopleweb on rutherfordium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 520 bytes in 0.128 second response time
[20:10:47] RECOVERY - Check size of conntrack table on rutherfordium is OK: OK: nf_conntrack is 0 % full
[20:10:56] RECOVERY - DPKG on rutherfordium is OK: All packages OK
[20:11:25] what...it fixes itself right when i try to login but was down for 3 days??
[20:11:48] RECOVERY - NTP on rutherfordium is OK: NTP OK: Offset 0.05729794502 secs
[20:12:32] !log rutherfordium (people.wm) was down for days per icinga - then magically fixes itself when i connect to console but before even loggin in (ganeti VM)
[20:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:14:07] mutante: Ha.
[20:14:34] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1915020 (MoritzMuehlenhoff) a: MoritzMuehlenhoff
[20:17:16] PROBLEM - Check size of conntrack table on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:17:17] PROBLEM - DPKG on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:19:17] PROBLEM - HTTP-peopleweb on rutherfordium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:19:26] !log rutherfordium - attempt to restart with gnt-instance
[20:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:19:37] PROBLEM - puppet last run on mw2019 is CRITICAL: CRITICAL: puppet fail
[20:20:57] RECOVERY - salt-minion processes on rutherfordium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:21:16] RECOVERY - HTTP-peopleweb on rutherfordium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 520 bytes in 0.024 second response time
[20:21:16] RECOVERY - Check size of conntrack table on rutherfordium is OK: OK: nf_conntrack is 0 % full
[20:21:17] RECOVERY - DPKG on rutherfordium is OK: All packages OK
[20:21:37] RECOVERY - Disk space on rutherfordium is OK: DISK OK
[20:21:47] RECOVERY - configured eth on rutherfordium is OK: OK - interfaces up
[20:22:17] RECOVERY - dhclient process on rutherfordium is OK: PROCS OK: 0 processes with command name dhclient
[20:22:36] RECOVERY - RAID on rutherfordium is OK: OK: no RAID installed
[20:22:37] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[20:22:37] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[20:23:27] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100%
[20:23:55] ok is that expected?
[20:24:03] betelgeuse thatis
[20:24:10] no
[20:24:38] ok, attempting serial
[20:24:40] i dont think i have seen that name before
[20:24:45] cool
[20:25:13] is it fundraising?
[20:25:17] looking
[20:25:17] hrmm
[20:25:20] has to be yep
[20:25:21] because i dont see it in our repo
[20:25:24] cuz i cannot route to it normally, yep!
[20:25:29] ah, great, hi Jeff
[20:25:29] Jeff_Green is looking =]
[20:25:35] it's a logger for eqiad
[20:25:41] located in codfw though
[20:25:49] i did a kernel update and reboot, it should have come up already
[20:26:12] (i scheduled icinga downtime but apparently that ran out)
[20:26:16] im in the office and silenced my phone... if not eeryone could have heard gir yell 'oh no we're doomed!'
[20:26:39] ha
[20:27:58] Jeff_Green: you probably scheduled downtime for all services on the host, except the host itself, because icinga makes that super easy to miss and requires that extra checkbox for it
[20:28:14] so there was just the "host down" left while all the services did not report as intended
[20:28:15] no i finally learned my lesson on that
[20:28:26] i used the CLI tool
[20:29:42] ah, i guess it has forgotten downtimes then after a restart, actually there were others as well that looked like that
[20:30:04] grrr. the management password isn't working
[20:30:39] maybe it got nuked by a ILO firmware update?
[20:30:51] Jeff_Green: usually when mgmt doesn’t work for me it’s because I forgot to do root@
[20:30:56] (or admin@ if it’s a cisco)
[20:31:06] * andrewbogott is either helpful or insulting, depending
[20:31:12] andrewbogott: yp, double-checked that
[20:31:21] Jeff_Green: PMed you a possible default
[20:35:54] double grrr
[20:36:11] !log mw2019 - puppet run (icinga claimed it failed but just here)
[20:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:37:16] OH! i figured it out
[20:37:32] i forgot that the mgmt interface is now behind 2FA for codfw
[20:37:46] ah! interesting
[20:37:51] using yubikey?
[20:38:07] yeah
[20:39:15] RECOVERY - puppet last run on mw2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:42:58] !log ms-be2007 - powercycle (was status: on but all frozen) (i assume xfs like be2006 appears in SAL recently)
[20:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:43:53] !log ms-be2007 - System halted!Error: Integrated RAID
[20:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:45:25] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 37.25 ms
[20:46:28] operations, ops-codfw: ms-be2007 - System halted!Error: Integrated RAID - https://phabricator.wikimedia.org/T122844#1915129 (Dzahn) NEW
[20:47:14] ACKNOWLEDGEMENT - Host ms-be2007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T122844
[20:50:09] !log ms-be1011 - powercycled, was frozen
[20:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:51:35] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[20:52:45] RECOVERY - very high load average likely xfs on ms-be1011 is OK: OK - load average: 53.81, 17.48, 6.16
[20:55:38] unfortunate wording :) "very high load likely ..OK"
[21:05:54] (PS3) Dzahn: [English Planet] Add Greg Sabino Mullane [puppet] - https://gerrit.wikimedia.org/r/261626 (owner: Nemo bis)
[21:06:05] (CR) Dzahn: [C: 2] [English Planet] Add Greg Sabino Mullane [puppet] - https://gerrit.wikimedia.org/r/261626 (owner: Nemo bis)
[21:14:53] (CR) Addshore: "Is there a way to move this forward? :)" [puppet] - https://gerrit.wikimedia.org/r/253594 (https://phabricator.wikimedia.org/T118739) (owner: ArielGlenn)
[21:16:08] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: puppet fail
[21:42:42] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail
[21:44:41] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:58:27] operations, ores, Icinga: change ores monitoring to avoid icinga reload on puppet runs - https://phabricator.wikimedia.org/T122830#1915285 (Dzahn) p: Triage>Normal
[22:08:51] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[23:07:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[23:09:42] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[23:10:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[23:13:37] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1915421 (MoritzMuehlenhoff) I think I've found the problem, will be folded in the next round of 3.19 kernel updates.
[23:13:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:13:52] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:14:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:16:32] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1915428 (jcrespo) @MoritzMuehlenhoff I upgraded a bunch of new servers to the latest kernel (-2). Does it impact those or I should not worry if it works for me?
[23:21:36] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1915442 (MoritzMuehlenhoff) @jcrespo: These are all fine, this is only a cornercase scenario for installations which don't use linux-meta (which takes care of updating the initrd)
[23:47:20] I got to explain zmodem/ymodem/xmodem and kermit to some people today
[23:47:31] pull out the old man hat...