[01:27:00] (CR) VolkerE: [C: -1] Remove $wgCopyrightIcon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:36:44] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:50:33] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: puppet fail
[02:02:53] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[02:12:02] PROBLEM - Hadoop NodeManager on analytics1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:16:12] RECOVERY - Hadoop NodeManager on analytics1035 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[02:17:22] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[02:25:17] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 05s)
[02:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 4 02:32:10 UTC 2016 (duration 6m 53s)
[02:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:50:04] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: puppet fail
[03:18:57] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:09:42] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail
[04:35:37] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:01:06] PROBLEM - Hadoop NodeManager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:25:46] RECOVERY - Hadoop NodeManager on analytics1041 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:49:31] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:51:32] RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:07:44] (CR) BBlack: [C: -1] "(1) Unless the rest API URLs *never* provide content that varies per user, this would be incorrect." [puppet] - https://gerrit.wikimedia.org/r/261662 (https://phabricator.wikimedia.org/T122673) (owner: GWicke)
[06:25:16] PROBLEM - Disk space on elastic1004 is CRITICAL: DISK CRITICAL - free space: / 965 MB (3% inode=95%)
[06:27:25] RECOVERY - Disk space on elastic1004 is OK: DISK OK
[06:31:25] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:39] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:39] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:48] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:58:18] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:09] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:48:09] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: puppet fail
[09:09:28] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:15:07] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:54:11] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:06:09] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:27:52] (PS5) Hashar: contint: remove maven webproxy [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[10:29:45] (CR) Hashar: "webproxy.eqiad.wmnet can no more be reached by labs instance since Dec 24th. I have dropped the configuration for MediaWiki https://gerri" [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[10:33:07] operations, Continuous-Integration-Infrastructure, Patch-For-Review: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1914078 (hashar)
[10:33:22] operations, Continuous-Integration-Infrastructure, Patch-For-Review: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1908476 (hashar) webproxy.eqiad.wmnet is no more available to labs instances ( T122368 ). I did fix it for MediaWiki https://gerrit.wikimedia.org/r/#/c...
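The Hadoop NodeManager and salt-minion alerts above come from the standard Nagios/Icinga check_procs plugin, which counts processes matching a given command name and argument string and goes CRITICAL when the count falls outside the allowed range. Below is a minimal sketch of that kind of probe run by hand; the plugin path and the 1:1 threshold are illustrative assumptions, not copied from the Wikimedia configuration:

    # Require exactly one java process whose arguments contain the YARN
    # NodeManager main class; exit CRITICAL when zero (or more than one) match.
    /usr/lib/nagios/plugins/check_procs -c 1:1 -C java \
        -a 'org.apache.hadoop.yarn.server.nodemanager.NodeManager'

When the NodeManager dies the count drops to 0 and the bot reports "PROCS CRITICAL: 0 processes ...", exactly as in the alerts above; the check recovers on its own once the process is restarted.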
[10:34:15] operations, Continuous-Integration-Infrastructure: Webproxy on carbon unreachable from labs instances since Dec 24 roughly 1am - https://phabricator.wikimedia.org/T122461#1914084 (hashar) `maven` was still being routed via webproxy: T122594
[10:34:26] operations, Continuous-Integration-Infrastructure, Patch-For-Review: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1908476 (hashar) Same as T122461
[10:35:19] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 843
[10:40:09] RECOVERY - check_mysql on db1008 is OK: Uptime: 1188202 Threads: 55 Questions: 40090554 Slow queries: 14414 Opens: 58874 Flush tables: 2 Open tables: 416 Queries per second avg: 33.740 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:59:00] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[11:01:09] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[11:56:49] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[12:00:19] PROBLEM - Hadoop NodeManager on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:03:32] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[12:18:42] RECOVERY - Hadoop NodeManager on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:19:42] PROBLEM - puppet last run on elastic1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[12:22:53] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:45:57] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: puppet fail
[12:47:47] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:14:26] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:22:00] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, and 3 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1914298 (hashar) Nodepool relies on the wmflabs OpenStack API. Whenever OpenS...
[13:44:13] (PS1) Hashar: nodepool: set Nova API timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/262343 (https://phabricator.wikimedia.org/T122731)
[13:45:33] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, and 4 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1914303 (hashar) a: hashar https://gerrit.wikimedia.org/r/#/c/262343/ sets...
[13:45:59] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, Patch-For-Review: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1914305 (hashar)
[13:46:46] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, Patch-For-Review: Nodepool deadlocks when querying unresponsive OpenStack API (was: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins) - https://phabricator.wikimedia.org/T122731#1914308...
[14:15:44] PROBLEM - Hadoop NodeManager on analytics1037 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[14:15:54] * Steinsplitter pokes jynus
[14:34:57] RECOVERY - Hadoop NodeManager on analytics1037 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[15:42:51] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: puppet fail
[15:52:14] (Abandoned) Paladox: gitblit: Fix "Sorry, the repository $1 does not have a $2 branch" [puppet] - https://gerrit.wikimedia.org/r/250697 (https://phabricator.wikimedia.org/T117459) (owner: Paladox)
[15:52:33] (Abandoned) Paladox: Re enable git.enableGitServlet [puppet] - https://gerrit.wikimedia.org/r/250450 (owner: Paladox)
[16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160104T1600).
[16:09:55] (PS1) Shanmugamp7: Add en wiki as transwiki import source for ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808)
[16:11:47] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:36:50] (CR) Hashar: [C: 1] contint: remove maven webproxy [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[16:39:34] (CR) Hashar: [C: 1] RuboCop: fixed Style/CaseIndentation offense [puppet] - https://gerrit.wikimedia.org/r/259699 (https://phabricator.wikimedia.org/T112651) (owner: Zfilipin)
[16:59:29] (CR) Luke081515: [C: -1] Add en wiki as transwiki import source for ta.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808) (owner: Shanmugamp7)
[17:00:00] (PS6) Faidon Liambotis: contint: remove maven webproxy [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[17:00:11] (CR) Faidon Liambotis: [C: 2] contint: remove maven webproxy [puppet] - https://gerrit.wikimedia.org/r/261476 (https://phabricator.wikimedia.org/T122449) (owner: Smalyshev)
[17:03:50] (PS2) Shanmugamp7: Add en wiki as transwiki import source for ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808)
[17:04:07] ccccccelhiklvvbcknnhktfrvugfeejeluvhuljfvedt
[17:04:13] garg.
[17:04:36] stupid window focus. good thing it's a otp...
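For context on the Nodepool patch above (https://gerrit.wikimedia.org/r/262343): Nodepool talks to the wmflabs OpenStack (Nova) API, and when that API stops responding the calls block forever, leaving no ci-jessie-wikimedia slaves attached to Jenkins; the fix bounds those HTTP calls at 60 seconds. The same idea can be sketched from a shell, assuming the python-novaclient CLI with its --timeout option is installed and the usual OS_* credential variables are set:

    # Give up after 60 seconds instead of hanging indefinitely when the
    # Nova API is unresponsive; the failure can then be logged and retried.
    nova --timeout 60 list

With a bounded call, a stuck API surfaces as an error that can be handled, rather than the silent deadlock described in T122731.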
[17:14:10] (PS4) Florianschmidtwelzow: Remove $wgCopyrightIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754)
[17:14:17] (CR) Florianschmidtwelzow: Remove $wgCopyrightIcon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[17:34:30] (PS1) Yuvipanda: base: Do not do add host nagios monitoring in labs [puppet] - https://gerrit.wikimedia.org/r/262359 (https://phabricator.wikimedia.org/T122757)
[17:34:33] valhallasw`cloud: ^
[17:34:41] at least some of Luke081515's failures were related to arcconf
[17:36:47] YuviPanda: isn't the real issue something with the package server?
[17:37:13] valhallasw`cloud: possibly, am still digging.
[17:37:17] valhallasw`cloud: but this is a good thing anyway
[17:37:20] removes crap we don't use
[17:37:26] *nod*
[17:37:33] !log yurik@tin Synchronized php-1.27.0-wmf.9/extensions/Graph/: Deployed Graph ext - gerrit 262357 (duration: 00m 33s)
[17:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:37:58] valhallasw`cloud: it's also too early aaaaaaaaaaaaaaaaaaaaaaaaa
[17:38:49] (CR) Merlijn van Deen: [C: 1] base: Do not do add host nagios monitoring in labs [puppet] - https://gerrit.wikimedia.org/r/262359 (https://phabricator.wikimedia.org/T122757) (owner: Yuvipanda)
[18:14:38] (PS3) Luke081515: Add enwiki as transwiki import source for ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808) (owner: Shanmugamp7)
[18:15:14] (CR) Luke081515: [C: 1] "Thanks for the patch (You can use the dbnames at the commit messages too, there are shorter)" [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808) (owner: Shanmugamp7)
[18:39:25] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/2: down - Transit: Zayo (IPYX/125449/004/ZYO) {#?} [10Gbps]
[18:45:14] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/2: down - Transit: Zayo (IPYX/125449/002/ZYO) {#?} [10Gbps]
[18:46:08] operations, ops-codfw, netops, procurement: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1914742 (RobH) NEW a: Papaul
[18:46:18] operations, ops-codfw, netops: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1914742 (RobH)
[18:46:31] operations, ops-codfw, netops: patch/implement new zayo wave (579171) codfw-ulsfo cr1-codfw:xe-5/0/2 - https://phabricator.wikimedia.org/T122823#1914742 (RobH)
[18:49:24] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/3/1: down - Transit: Zayo (IPYX/125449/001/ZYO) {#?} [10Gbps]
[18:50:45] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/3/1: down - Transit: Zayo (IPYX/125449/003/ZYO) {#?} [10Gbps]
[18:52:39] (CR) Dzahn: [C: 1] "the ticket has approval meanwhile and i think the 3 days have passed as well" [puppet] - https://gerrit.wikimedia.org/r/261217 (https://phabricator.wikimedia.org/T122524) (owner: Jcrespo)
[18:52:54] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 120, down: 0, dormant: 0, excluded: 1, unused: 0
[18:53:34] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0
[18:53:34] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0
[18:54:05] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 1, unused: 0
[18:56:16] (PS2) Andrew Bogott: nodepool: set Nova API timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/262343 (https://phabricator.wikimedia.org/T122731) (owner: Hashar)
[18:56:55] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, and 2 others: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1914783 (Dzahn) added +1 for https://gerrit.wikimedia.org/r/#/c/261217/ i can merge this (tomorrow then), i'm here
[18:57:58] (CR) Andrew Bogott: [C: 2] nodepool: set Nova API timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/262343 (https://phabricator.wikimedia.org/T122731) (owner: Hashar)
[18:58:50] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914796 (Dzahn) a: Dzahn
[19:03:32] operations, ops-codfw, fundraising-tech-ops: bellatrix hardware RAID predictive failure - https://phabricator.wikimedia.org/T122026#1914805 (Jgreen) Open>Resolved done
[19:06:36] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914817 (Dzahn) p: Triage>Normal With the changes above the timeout has been raised from 10 to 20 and made configurable. It can be adjusted by editing `check_command => 'check_http...
[19:07:06] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914823 (Dzahn) Open>Resolved
[19:07:31] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914827 (yuvipanda) \o/ Thanks for doing that! I think we can re-enable the SMS notification now too
[19:09:21] (PS1) Dzahn: Revert "toollabs: disable paging for tools-home/NFS" [puppet] - https://gerrit.wikimedia.org/r/262365 (https://phabricator.wikimedia.org/T122615)
[19:10:16] (PS2) Dzahn: Revert "toollabs: disable paging for tools-home/NFS" [puppet] - https://gerrit.wikimedia.org/r/262365 (https://phabricator.wikimedia.org/T122615)
[19:10:37] operations, Services, Wikimedia-Developer-Summit-2016: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#1914833 (mobrovac) NEW
[19:12:52] (PS3) Dzahn: Revert "toollabs: disable paging for tools-home/NFS" [puppet] - https://gerrit.wikimedia.org/r/262365 (https://phabricator.wikimedia.org/T122615)
[19:14:25] (CR) Dzahn: [C: 2] "per T122615#1914827 and checks have been OK for 19h / 1d6h and per Yuvi on" [puppet] - https://gerrit.wikimedia.org/r/262365 (https://phabricator.wikimedia.org/T122615) (owner: Dzahn)
[19:16:29] operations, Labs, Patch-For-Review: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914888 (Dzahn) >>! In T122615#1914827, @yuvipanda wrote: > \o/ Thanks for doing that! I think we can re-enable the SMS notification now too alright, done with the revert above. will be...
[19:18:14] operations, Labs: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1914892 (Dzahn)
[19:18:59] operations, Wikimedia-Mailing-lists: import old staff list archives ? - https://phabricator.wikimedia.org/T109395#1914898 (Dzahn) p: Normal>Low
[19:23:20] operations, ores, Icinga: change ores monitoring to avoid icinga reload on puppet runs - https://phabricator.wikimedia.org/T122830#1914916 (Dzahn) NEW
[19:23:31] operations, ores, Icinga: change ores monitoring to avoid icinga reload on puppet runs - https://phabricator.wikimedia.org/T122830#1914923 (Dzahn) a: Dzahn
[19:28:04] !log elastic1006 - out of disk - gzip eqiad_index_search_slowlog.log files
[19:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:34:44] RECOVERY - Disk space on elastic1006 is OK: DISK OK
[19:38:57] operations, Continuous-Integration-Infrastructure, Patch-For-Review, User-bd808: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1914933 (hashar) a: bd808 Patched by Stas deployed by Bryan on the CI puppet master. I have amended the change to cleanup all refere...
[19:39:04] operations, Continuous-Integration-Infrastructure, Patch-For-Review, User-bd808: WDQS builds fail due to network issues - https://phabricator.wikimedia.org/T122594#1914936 (hashar) Open>Resolved All good now! Thank you for the quick fix!
[19:50:23] RECOVERY - puppet last run on elastic1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:56:58] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, Patch-For-Review: Nodepool deadlocks when querying unresponsive OpenStack API (was: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins) - https://phabricator.wikimedia.org/T122731#1914967...
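The T122615 fix discussed above raised the tools-home check timeout from 10 to 20 seconds and made it configurable through the check_command definition. As a rough illustration with the stock check_http plugin (the hostname, URL and plugin path here are placeholders, not the exact Wikimedia check):

    # Fetch the tools front page and allow the probe itself up to 20 seconds
    # (-t) before it reports CRITICAL; the previous 10-second limit produced
    # false alerts when the NFS-backed page was slow to respond.
    /usr/lib/nagios/plugins/check_http -H tools.wmflabs.org -u / -t 20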
[19:57:16] operations, Discovery, Elasticsearch: elastic - large slow query logs / disk space - https://phabricator.wikimedia.org/T122832#1914969 (Dzahn) NEW
[19:58:45] ACKNOWLEDGEMENT - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: puppet fail daniel_zahn per labtest this cant be critical
[19:58:45] ACKNOWLEDGEMENT - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: puppet fail daniel_zahn per labtest this cant be critical
[19:58:45] ACKNOWLEDGEMENT - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: puppet fail daniel_zahn per labtest this cant be critical
[20:00:08] !log mw1123 - start HHVM (was 503 and service stopped)
[20:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:01:36] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 69877 bytes in 4.105 second response time
[20:01:48] ^ that's fun how that actually worked
[20:02:07] because Uncaught exception: HHVM no longer supports the built-in webserver as of 3.0.0 always looks so bad first
[20:02:17] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.083 second response time
[20:02:57] operations, Analytics-Backlog, HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1914981 (leila)
[20:04:54] operations, Discovery, Elasticsearch: elastic - large slow query logs / server runs out of disk space - https://phabricator.wikimedia.org/T122832#1914995 (Dzahn)
[20:05:45] operations, Analytics-Backlog, HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1914998 (leila) @ori we need it for the reader research, for matching QuickSurvey responses to the webrequest logs. @Ottomata, can we look into this ticket?
[20:07:47] ACKNOWLEDGEMENT - Apache HTTP on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.007 second response time daniel_zahn https://phabricator.wikimedia.org/T122005
[20:07:47] ACKNOWLEDGEMENT - HHVM processes on mw1228 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm daniel_zahn https://phabricator.wikimedia.org/T122005
[20:07:47] ACKNOWLEDGEMENT - HHVM rendering on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time daniel_zahn https://phabricator.wikimedia.org/T122005
[20:07:47] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1228 is CRITICAL: Host mw1228 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T122005
[20:07:47] ACKNOWLEDGEMENT - puppet last run on mw1228 is CRITICAL: CRITICAL: Puppet last ran 15 days ago daniel_zahn https://phabricator.wikimedia.org/T122005
[20:10:47] RECOVERY - HTTP-peopleweb on rutherfordium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 520 bytes in 0.128 second response time
[20:10:47] RECOVERY - Check size of conntrack table on rutherfordium is OK: OK: nf_conntrack is 0 % full
[20:10:56] RECOVERY - DPKG on rutherfordium is OK: All packages OK
[20:11:25] what...it fixes itself right when i try to login but was down for 3 days??
[20:11:48] RECOVERY - NTP on rutherfordium is OK: NTP OK: Offset 0.05729794502 secs
[20:12:32] !log rutherfordium (people.wm) was down for days per icinga - then magically fixes itself when i connect to console but before even loggin in (ganeti VM)
[20:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:14:07] mutante: Ha.
[20:14:34] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1915020 (MoritzMuehlenhoff) a: MoritzMuehlenhoff
[20:17:16] PROBLEM - Check size of conntrack table on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:17:17] PROBLEM - DPKG on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:19:17] PROBLEM - HTTP-peopleweb on rutherfordium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:19:26] !log rutherfordium - attempt to restart with gnt-instance
[20:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:19:37] PROBLEM - puppet last run on mw2019 is CRITICAL: CRITICAL: puppet fail
[20:20:57] RECOVERY - salt-minion processes on rutherfordium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:21:16] RECOVERY - HTTP-peopleweb on rutherfordium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 520 bytes in 0.024 second response time
[20:21:16] RECOVERY - Check size of conntrack table on rutherfordium is OK: OK: nf_conntrack is 0 % full
[20:21:17] RECOVERY - DPKG on rutherfordium is OK: All packages OK
[20:21:37] RECOVERY - Disk space on rutherfordium is OK: DISK OK
[20:21:47] RECOVERY - configured eth on rutherfordium is OK: OK - interfaces up
[20:22:17] RECOVERY - dhclient process on rutherfordium is OK: PROCS OK: 0 processes with command name dhclient
[20:22:36] RECOVERY - RAID on rutherfordium is OK: OK: no RAID installed
[20:22:37] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[20:22:37] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[20:23:27] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100%
[20:23:55] ok is that expected?
[20:24:03] betelgeuse thatis
[20:24:10] no
[20:24:38] ok, attempting serial
[20:24:40] i dont think i have seen that name before
[20:24:45] cool
[20:25:13] is it fundraising?
[20:25:17] looking
[20:25:17] hrmm
[20:25:20] has to be yep
[20:25:21] because i dont see it in our repo
[20:25:24] cuz i cannot route to it normally, yep!
[20:25:29] ah, great, hi Jeff
[20:25:29] Jeff_Green is looking =]
[20:25:35] it's a logger for eqiad
[20:25:41] located in codfw though
[20:25:49] i did a kernel update and reboot, it should have come up already
[20:26:12] (i scheduled icinga downtime but apparently that ran out)
[20:26:16] im in the office and silenced my phone... if not eeryone could have heard gir yell 'oh no we're doomed!'
[20:26:39] ha
[20:27:58] Jeff_Green: you probably scheduled downtime for all services on the host, except the host itself, because icinga makes that super easy to miss and requires that extra checkbox for it
[20:28:14] so there was just the "host down" left while all the services did not report as intended
[20:28:15] no i finally learned my lesson on that
[20:28:26] i used the CLI tool
[20:29:42] ah, i guess it has forgotten downtimes then after a restart, actually there were others as well that looked like that
[20:30:04] grrr. the management password isn't working
[20:30:39] maybe it got nuked by a ILO firmware update?
[20:30:51] Jeff_Green: usually when mgmt doesn’t work for me it’s because I forgot to do root@
[20:30:56] (or admin@ if it’s a cisco)
[20:31:06] * andrewbogott is either helpful or insulting, depending
[20:31:12] andrewbogott: yp, double-checked that
[20:31:21] Jeff_Green: PMed you a possible default
[20:35:54] double grrr
[20:36:11] !log mw2019 - puppet run (icinga claimed it failed but just here)
[20:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:37:16] OH! i figured it out
[20:37:32] i forgot that the mgmt interface is now behind 2FA for codfw
[20:37:46] ah! interesting
[20:37:51] using yubikey?
[20:38:07] yeah
[20:39:15] RECOVERY - puppet last run on mw2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:42:58] !log ms-be2007 - powercycle (was status: on but all frozen) (i assume xfs like be2006 appears in SAL recently)
[20:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:43:53] !log ms-be2007 - System halted!Error: Integrated RAID
[20:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:45:25] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 37.25 ms
[20:46:28] operations, ops-codfw: ms-be2007 - System halted!Error: Integrated RAID - https://phabricator.wikimedia.org/T122844#1915129 (Dzahn) NEW
[20:47:14] ACKNOWLEDGEMENT - Host ms-be2007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T122844
[20:50:09] !log ms-be1011 - powercycled, was frozen
[20:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:51:35] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[20:52:45] RECOVERY - very high load average likely xfs on ms-be1011 is OK: OK - load average: 53.81, 17.48, 6.16
[20:55:38] unfortunate wording :) "very high load likely ..OK"
[21:05:54] (PS3) Dzahn: [English Planet] Add Greg Sabino Mullane [puppet] - https://gerrit.wikimedia.org/r/261626 (owner: Nemo bis)
[21:06:05] (CR) Dzahn: [C: 2] [English Planet] Add Greg Sabino Mullane [puppet] - https://gerrit.wikimedia.org/r/261626 (owner: Nemo bis)
[21:14:53] (CR) Addshore: "Is there a way to move this forward? :)" [puppet] - https://gerrit.wikimedia.org/r/253594 (https://phabricator.wikimedia.org/T118739) (owner: ArielGlenn)
[21:16:08] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: puppet fail
[21:42:42] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail
[21:44:41] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:58:27] operations, ores, Icinga: change ores monitoring to avoid icinga reload on puppet runs - https://phabricator.wikimedia.org/T122830#1915285 (Dzahn) p: Triage>Normal
[22:08:51] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[23:07:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[23:09:42] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[23:10:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[23:13:37] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1915421 (MoritzMuehlenhoff) I think I've found the problem, will be folded in the next round of 3.19 kernel updates.
[23:13:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:13:52] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:14:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:16:32] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1915428 (jcrespo) @MoritzMuehlenhoff I upgraded a bunch of new servers to the latest kernel (-2). Does it impact those or I should not worry if it works for me?
[23:21:36] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1915442 (MoritzMuehlenhoff) @jcrespo: These are all fine, this is only a cornercase scenario for installations which don't use linux-meta (which takes care of updating the initrd)
[23:47:20] I got to explain zmodem/ymodem/xmodem and kermit to some people today
[23:47:31] pull out the old man hat...