[00:01:43] bd808: the nodes are showing now, and jobs are processing, i think
[00:01:51] but jenkins reports they have 0 bytes of available swap space
[00:02:52] w00t. yeah I see jobs running
[00:03:12] I don't know if the swap space thing is important or not
[00:03:16] I would hope the jobs don't push things into swap normally
[00:04:48] !log (at 23:46 UTC) restarted nova-compute on labvirt1002
[00:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:05:35] so it's probably the nova-compute restart that is fixing it?
[00:05:50] Seems likely.
[00:06:42] That also kind of makes sense. They would disappear from the Jenkins UI entirely due to network reachability issues
[00:07:07] hello
[00:07:11] nodepoold on labnodepool1001 was hung, did not respond to SIGKILL
[00:07:28] I see others have already fixed stuff :)
[00:08:06] bd808: it's also safe to restart nova-compute, it being down only prevents new instance scheduling / restarts / logging(?), so provided it comes back up it's ok
[00:08:31] strace showed it (nodepoold) was waiting on a lock; could not get a useful stack trace from gdb
[00:09:50] i think it was waiting for a reply and was stuck in a blocking read
[00:10:11] anyways
[00:10:13] \o
[00:10:17] thanks ori. Should we add some notes to that phab task? Twice in 72 hours seems like there is something systemic that needs to be looked into
[00:10:17] bye
[00:11:01] bd808: i doubt i discovered anything andrewbogott didn't already know
[00:11:24] but yeah, re: something systemic that needs to be looked into
[00:11:40] labs has systemic issues? noway!
[00:11:47] I'll add some notes.
thanks again for using your superpowers
[00:20:20] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, and 3 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1912365 (bd808) Restored ~2016-01-011T23:58 @ori and @legoktm both attempted...
[00:53:03] (CR) BryanDavis: "Filed T122734 for memory cgroups" [puppet] - https://gerrit.wikimedia.org/r/245920 (owner: BryanDavis)
[00:57:12] bd808: do you want me to import the jessie deb now?
[00:57:35] sure!
[00:57:54] kkk
[00:58:44] I need to test the latest upstream version in labs sometime soon too
[00:59:00] * bd808 makes a task to remind himself to do that
[01:00:38] bd808: do you want the exact version we have for trusty?
[01:00:41] for jessie?
[01:00:45] or shall I just import the latest
[01:01:11] Getting the latest would be fine
[01:01:42] I need to test that 1.8.1 doesn't break something on trusty before we switch that
[01:01:53] but nothing is using it for jessie yet
[01:01:59] ok
[01:04:13] !log imported vagrant 1.8.1 for jessie per bd808
[01:04:15] bd808: done
[01:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:04:29] thanks YuviPanda
[01:08:38] (CR) BryanDavis: "Yuvi imported Vagrant 1.8.1 into the jessie apt repo" [puppet] - https://gerrit.wikimedia.org/r/245920 (owner: BryanDavis)
[01:54:22] PROBLEM - Disk space on elastic1006 is CRITICAL: DISK CRITICAL - free space: / 1061 MB (3% inode=95%)
[02:24:30] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 09s)
[02:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jan 2 02:31:28 UTC 2016 (duration 6m 58s)
[02:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:00:33] (PS2) Luke081515: Changed user group
rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710)
[03:01:08] (CR) jenkins-bot: [V: -1] Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710) (owner: Luke081515)
[03:02:10] (PS3) Luke081515: Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710)
[03:02:36] (CR) jenkins-bot: [V: -1] Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710) (owner: Luke081515)
[03:03:38] (PS4) Luke081515: Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710)
[03:04:00] (CR) jenkins-bot: [V: -1] Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710) (owner: Luke081515)
[03:05:10] (PS5) Luke081515: Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710)
[03:05:30] (CR) jenkins-bot: [V: -1] Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710) (owner: Luke081515)
[03:09:28] (PS6) Luke081515: Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710)
[03:15:16] (PS1) Base: Added noindex rule for uawikimedia's ns2 Bug: T122732 [mediawiki-config] - https://gerrit.wikimedia.org/r/261902 (https://phabricator.wikimedia.org/T122732)
[03:34:46] !log deploying https://gerrit.wikimedia.org/r/261725, restarted apache2 on iridium
[03:34:51] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:23:32] PROBLEM - Disk space on elastic1004 is CRITICAL: DISK CRITICAL - free space: / 1062 MB (3% inode=95%)
[04:48:43] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:10:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 27079 bytes in 0.092 second response time
[06:28:13] RECOVERY - Disk space on elastic1004 is OK: DISK OK
[06:31:13] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:22] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:42] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:11] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:52] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:12] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:33] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:21] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:51] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:52] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:36:22] PROBLEM - puppet last run on mw2088 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:12] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: puppet fail
[06:55:22] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:56:23] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:57:02] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently
enabled, last run 1 minute ago with 0 failures
[06:57:02] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:57:03] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:51] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:11] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:21] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:02] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:41] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:00:01] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:34] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:35:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 650
[10:40:20] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 951
[10:45:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 1015702 Threads: 143 Questions: 39014068 Slow queries: 12603 Opens: 58844 Flush tables: 2 Open tables: 416 Queries per second avg: 38.410 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[12:00:39] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [24.0]
[12:03:10] PROBLEM - puppet last run on
auth2001 is CRITICAL: CRITICAL: puppet fail
[12:04:09] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[12:12:09] PROBLEM - Hadoop NodeManager on analytics1036 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:20:18] PROBLEM - puppet last run on elastic1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[12:22:29] RECOVERY - Hadoop NodeManager on analytics1036 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:32:01] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[12:33:40] RECOVERY - puppet last run on auth2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:10:44] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[14:11:09] ^ yes, seems positively dead
[14:11:59] or...
very very slow, at least
[14:12:34] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 973513 bytes in 5.487 second response time
[14:25:15] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[14:27:14] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 973454 bytes in 12.739 second response time
[14:28:54] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: puppet fail
[14:39:23] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[14:56:42] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:21:37] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: puppet fail
[15:29:38] (CR) Alex Monk: [C: 1] Added noindex rule for uawikimedia's ns2 Bug: T122732 [mediawiki-config] - https://gerrit.wikimedia.org/r/261902 (https://phabricator.wikimedia.org/T122732) (owner: Base)
[15:29:55] (PS2) Alex Monk: Added noindex rule for uawikimedia's user namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/261902 (https://phabricator.wikimedia.org/T122732) (owner: Base)
[15:49:27] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:31:43] (PS1) Hoo man: Provide a latest link for the Wikidata JSON dumps [puppet] - https://gerrit.wikimedia.org/r/261949 (https://phabricator.wikimedia.org/T72247)
[18:50:11] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: puppet fail
[19:04:48] (PS5) Tim Landscheidt: Avoid breaking full phabricator URLs [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: Thiemo Mättig (WMDE))
[19:12:10] (PS6) Nemo bis: Avoid breaking full phabricator URLs [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: Thiemo Mättig (WMDE))
[19:17:34] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[21:24:04] (PS1) RLuts: Enable WikidataPageBanner extension on Ukrainian Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/261994 (https://phabricator.wikimedia.org/T121999)
[21:28:20] (CR) Base: [C: 1] Enable WikidataPageBanner extension on Ukrainian Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/261994 (https://phabricator.wikimedia.org/T121999) (owner: RLuts)
[21:38:20] (PS1) Florianschmidtwelzow: Remove $wgCopyrightIcons [mediawiki-config] - https://gerrit.wikimedia.org/r/261999
[21:39:30] (CR) Reedy: [C: 1] Remove $wgCopyrightIcons [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (owner: Florianschmidtwelzow)
[21:49:53] (PS2) Florianschmidtwelzow: Remove $wgCopyrightIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/261999
[22:27:03] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: puppet fail
[22:53:03] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures