[01:02:10] PROBLEM - HHVM busy threads on mw1226 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2]
[01:03:25] PROBLEM - HHVM busy threads on mw1233 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:03:47] PROBLEM - HHVM busy threads on mw1231 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:04:41] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:04:52] PROBLEM - HHVM busy threads on mw1225 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:04:52] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2]
[01:04:53] PROBLEM - HHVM busy threads on mw1228 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:04:53] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2]
[01:05:04] that sounds bad
[01:05:08] PROBLEM - HHVM busy threads on mw1234 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2]
[01:05:53] Bad timing too
[01:05:56] 10 is less than 22!
[01:06:05] Did someone die?
[01:06:31] PROBLEM - HHVM busy threads on mw1221 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:06:35] RECOVERY - HHVM busy threads on mw1233 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:06:51] PROBLEM - HHVM busy threads on mw1224 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[01:08:06] RECOVERY - HHVM busy threads on mw1226 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:08:06] RECOVERY - HHVM busy threads on mw1234 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:09:36] RECOVERY - HHVM busy threads on mw1221 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:09:48] RECOVERY - HHVM busy threads on mw1224 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:09:48] RECOVERY - HHVM busy threads on mw1231 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:10:27] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:10:38] RECOVERY - HHVM busy threads on mw1225 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:10:39] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:10:54] RECOVERY - HHVM busy threads on mw1228 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:10:54] RECOVERY - HHVM busy threads on mw1232 is OK: OK: Less than 1.00% above the threshold [76.8]
[01:11:22] gj hhvm
[02:10:58] !log l10nupdate Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 00m 01s)
[02:11:02] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-30 02:11:01+00:00
[02:11:08] Logged the message, Master
[02:11:13] Logged the message, Master
[02:18:19] !log l10nupdate Synchronized php-1.25wmf10/cache/l10n: (no message) (duration: 00m 02s)
[02:18:22] !log LocalisationUpdate completed (1.25wmf10) at 2014-11-30 02:18:22+00:00
[02:18:23] Logged the message, Master
[02:18:26] Logged the message, Master
[02:46:03] PROBLEM - HHVM busy threads on mw1224 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2]
[02:46:44] PROBLEM - HHVM busy threads on mw1226 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[02:48:23] PROBLEM - HHVM busy threads on mw1233 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[02:48:38] PROBLEM - HHVM busy threads on mw1231 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[02:48:39] RECOVERY - HHVM busy threads on mw1224 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:49:24] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[02:49:34] RECOVERY - HHVM busy threads on mw1226 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:49:53] PROBLEM - HHVM busy threads on mw1235 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2]
[02:51:16] RECOVERY - HHVM busy threads on mw1231 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:52:06] RECOVERY - HHVM busy threads on mw1232 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:52:29] RECOVERY - HHVM busy threads on mw1235 is OK: OK: Less than 1.00% above the threshold [76.8]
[02:53:42] RECOVERY - HHVM busy threads on mw1233 is OK: OK: Less than 1.00% above the threshold [76.8]
[03:36:45] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:37:06] PROBLEM - puppet last run on search1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:37:13] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:37:43] PROBLEM - puppet last run on mw1087 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:39:08] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:41:04] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Nov 30 03:41:03 UTC 2014 (duration 41m 2s)
[03:41:08] Logged the message, Master
[03:48:28] PROBLEM - MySQL Processlist on db1059 is CRITICAL: CRIT 101 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[03:51:09] RECOVERY - MySQL Processlist on db1059 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[03:51:19] PROBLEM - check if salt-minion is running on stat1003 is CRITICAL: Timeout while attempting connection
[03:51:58] PROBLEM - Host hafnium is DOWN: PING CRITICAL - Packet loss = 100%
[03:51:58] PROBLEM - Host gadolinium is DOWN: PING CRITICAL - Packet loss = 100%
[03:51:58] PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100%
[03:51:58] PROBLEM - Host logstash1003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:51:58] PROBLEM - Host osmium is DOWN: PING CRITICAL - Packet loss = 100%
[03:52:44] PROBLEM - Host erbium is DOWN: CRITICAL - Plugin timed out after 15 seconds
[03:52:44] PROBLEM - Host lead is DOWN: CRITICAL - Plugin timed out after 15 seconds
[03:52:44] PROBLEM - Host radon is DOWN: CRITICAL - Plugin timed out after 15 seconds
[03:52:44] PROBLEM - Host ssl1005 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[03:52:44] PROBLEM - Host stat1003 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[03:52:45] PROBLEM - Host platinum is DOWN: CRITICAL - Plugin timed out after 15 seconds
[03:52:45] PROBLEM - Host logstash1002 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[03:52:46] PROBLEM - Host labsdb1007 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[03:52:46] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:07] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:26] RECOVERY - puppet last run on search1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:26] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:54:01] RECOVERY - puppet last run on mw1087 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:54:02] PROBLEM - Host logstash1001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:54:02] PROBLEM - Host gold is DOWN: PING CRITICAL - Packet loss = 100%
[03:54:02] PROBLEM - Host rcs1001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:54:02] PROBLEM - Host caesium is DOWN: PING CRITICAL - Packet loss = 100%
[03:54:02] PROBLEM - Host ssl1009 is DOWN: PING CRITICAL - Packet loss = 100%
[03:54:02] PROBLEM - Host rdb1001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:54:09] PROBLEM - Host labsdb1006 is DOWN: PING CRITICAL - Packet loss = 100%
[04:15:28] that does not look good…
[04:17:16] so... any ops awake?
[04:19:41] i can page. is there any user impact?
[04:19:54] phab is down
[04:20:10] ok, i'll page
[04:20:11] i'm awake, but on a useless mobile conn
[04:20:30] seems like network partition
[04:20:31] springle: what do you think, is this worth paging?
[04:20:55] worth paging someone for, I mean
[04:21:02] ori: yes, some gdash graphs seem stopped
[04:21:13] so i really have no idea what the non-DB layers are doing
[04:21:24] * ori does
[04:21:28] tnx
[04:23:41] bblack's coming
[04:23:46] jobrunner load down. redis connections failing
[04:24:28] hey
[04:25:00] so icinga reports a bunch of hosts as being down (gold, rcs1001, caseium, ssl1009, rdb1001, a few others). springle suspects network partition.
[04:25:15]
[04:25:21] ok
[04:25:28] poking around some
[04:27:15] at first glance, it looks like we lost cabinet C4 in eqiad. still matching things up
[04:29:57] have we lost any public services indirectly?
[04:30:03] phabricator
[04:30:17] the job queue and recent changes stream are redundant and seem to have failed over gracefully
[04:30:41] I haven't noticed anything else go. Not sure what those ssl hosts serve exactly
[04:30:56] we just stopped using those ssl100x earlier this week
[04:30:59] (1005, 1009)
[04:31:00] ok
[04:31:07] phabricator is iridium
[04:38:12] iridium's last heartbeat was ~48 minutes ago
[04:56:06] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /var/log/udp2log/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[04:57:55] that's because gadolinium is down
[05:03:17] phabricator is giving 503
[05:03:32] known; see /topic
[05:03:36] thanks for reporting, though.
[05:03:52] (it's being investigated)
[05:04:05] oh
[05:04:06] great
[05:50:46] !log 3:50 UTC: switch asw-c-eqiad lost connectivity with cabinet C4. Impact: phabricator down; gap in web request logs and some perf monitoring. Job queue and Recent Changes stream OK b/c redundant servers are up.
[05:50:51] Logged the message, Master
[06:34:14] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:30] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:37] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:37] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:47] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:51] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:36:08] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:42] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:24] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:46:04] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:47:35] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:47:37] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:48:59] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:49:07] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:49:07] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:49:31] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:49:42] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:50:09] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:09:31] PROBLEM - HHVM busy threads on mw1226 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2]
[07:11:59] PROBLEM - HHVM busy threads on mw1225 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2]
[07:12:52] (03PS1) 10Giuseppe Lavagetto: jobqueue: switch redis server due to outage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176492
[07:12:58] <_joe_> bblack: ^^
[07:13:13] <_joe_> bblack: "high availability, the wikimedia way" :/
[07:13:35] (03CR) 10BBlack: [C: 031] jobqueue: switch redis server due to outage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176492 (owner: 10Giuseppe Lavagetto)
[07:13:40] :)
[07:14:09] (03CR) 10Giuseppe Lavagetto: [C: 032] jobqueue: switch redis server due to outage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176492 (owner: 10Giuseppe Lavagetto)
[07:14:48] RECOVERY - HHVM busy threads on mw1225 is OK: OK: Less than 1.00% above the threshold [76.8]
[07:15:18] RECOVERY - HHVM busy threads on mw1226 is OK: OK: Less than 1.00% above the threshold [76.8]
[07:15:47] !log oblivian Synchronized wmf-config/jobqueue-eqiad.php: (no message) (duration: 00m 05s)
[07:15:56] Logged the message, Master
[07:27:32] <_joe_> !log restarted the jobrunner service on all jobrunners
[07:27:36] Logged the message, Master
[08:13:02] (03PS1) 10Giuseppe Lavagetto: jobrunner: failover the redis server in use [puppet] - 10https://gerrit.wikimedia.org/r/176493
[08:14:55] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: failover the redis server in use [puppet] - 10https://gerrit.wikimedia.org/r/176493 (owner: 10Giuseppe Lavagetto)
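Editor's note: the changes above repoint the job queue at a different Redis host after rdb1001 (named in the log as the server that dropped off the network) became unreachable. The real edits live in mediawiki-config and Puppet; purely as an illustration of the decision being made, here is a minimal Python sketch, using the redis-py client, of the kind of reachability check an operator might run before switching. The standby host name is a placeholder, since the log does not say which server the queue was failed over to.

```python
# Sketch only: confirm the primary Redis host is really unreachable and a
# standby answers before repointing the job queue at it.
import redis

PRIMARY = "rdb1001.eqiad.wmnet"   # the master the log says went down
STANDBY = "rdb1003.eqiad.wmnet"   # hypothetical standby; not named in the log


def reachable(host, port=6379, timeout=2.0):
    """Return True if the Redis instance answers PING within the timeout."""
    try:
        client = redis.StrictRedis(host=host, port=port,
                                   socket_connect_timeout=timeout,
                                   socket_timeout=timeout)
        return bool(client.ping())
    except redis.RedisError:
        return False


if __name__ == "__main__":
    if not reachable(PRIMARY) and reachable(STANDBY):
        print("primary unreachable, standby healthy -> switch jobqueue config")
    else:
        print("no failover warranted (or the standby is also unreachable)")
```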
[08:21:13] (03PS1) 10Giuseppe Lavagetto: jobrunner: fix the jobrunners, not just the videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/176494
[08:21:35] <_joe_> this ^^ is something that wouldn't have happened if we used hiera
[08:22:00] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: fix the jobrunners, not just the videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/176494 (owner: 10Giuseppe Lavagetto)
[09:03:17] (03PS1) 10Giuseppe Lavagetto: jobrunner: change the aggregator IP as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176495
[09:05:40] (03PS2) 10Giuseppe Lavagetto: jobrunner: change the aggregator IP as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176495
[09:05:45] Is there other place where one can find info saying which things are down? http://status.wikimedia.org/ doesn't mention any issues even though Phabricator is not working
[09:06:01] (03CR) 10Legoktm: [C: 031] jobrunner: change the aggregator IP as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176495 (owner: 10Giuseppe Lavagetto)
[09:06:12] <_joe_> Helder: phabricator is down, apart from that nothing user-facing should be down
[09:06:27] <_joe_> Helder: look at the topic here
[09:06:42] yeah, I was just wondering why the status page doesn't reflect that
[09:06:44] <_joe_> (I guess status has not been updated to monitor phabricator)
[09:06:58] probably... it has something for Bugzilla
[09:07:23] <_joe_> I'll tell chase to do that, thanks for telling us
[09:07:28] :-)
[09:07:33] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: change the aggregator IP as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176495 (owner: 10Giuseppe Lavagetto)
[09:07:47] you're welcome.
[09:07:49] BTW: good morning!
[09:08:03] <_joe_> good morning to you too :)
[09:08:50] !log oblivian Synchronized wmf-config/jobqueue-eqiad.php: changing the aggregator address as well (duration: 00m 05s)
[09:08:56] Logged the message, Master
[09:09:18] _joe_: do you've any idea when phab will be back up?
[09:09:40] <_joe_> Glaisher: not at the moment, I am firefighting the job queues for the wikis
[09:10:22] <_joe_> Glaisher: I don't think that would happen this morning (european morning)
[09:10:57] <_joe_> it's an hardware issue, since nothing too crucial is down
[09:12:53] <_joe_> I don't think we're going to wake up chris in the middle of the night
[09:13:09] <_joe_> !log jobsqueues work again
[09:13:12] Logged the message, Master
[09:13:45] heh
[09:20:53] (03PS1) 10Legoktm: Stop sending rcfeed to rcs1001 since it's down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176496
[09:21:02] _joe_: ^
[09:21:09] <_joe_> on it
[09:21:39] (03CR) 10Ori.livneh: [C: 04-1] "What's the point?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176496 (owner: 10Legoktm)
[09:21:51] <_joe_> ori: to stop spam in the logs
[09:22:05] (03CR) 10Legoktm: "It's spamming redis.log on fluorine with error messages." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176496 (owner: 10Legoktm)
[09:22:50] (03CR) 10Ori.livneh: "Is fluorine running out of disk space?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176496 (owner: 10Legoktm)
[09:24:28] (03Abandoned) 10Legoktm: Stop sending rcfeed to rcs1001 since it's down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176496 (owner: 10Legoktm)
[09:58:45] legoktm: hey, sorry if i was snarky there. less snarky reply: there is a standard unit of risk that goes with any configuration change, and this would require two of them (one to remove rcs1001 and one to add it back). and there is the additional risk of forgetting to revert the change, which is compounded by the fact that in such a case the logs would be silent.
[09:58:58] lastly, it is desirable for failover to be automatic, so instead of performing manual work, if the log volume were seriously problematic, the solution should have been to fix the logging setup, which would obviate once and forever the need for manual intervention (for this particular reason at least).
[10:01:38] yeah, that makes sense
[10:05:30] plus in this case when the switch is rebooted or whatever the problem is fully resolved, versus requiring an additional follow-up step
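Editor's note: ori's argument above is that a destination outage should be absorbed by the feed code itself rather than handled with a pair of manual config changes. The sketch below is not MediaWiki's actual rcfeed implementation; it is a minimal Python illustration, assuming a Redis-backed feed like the one to rcs1001, of how error logging can be rate-limited so an unreachable endpoint neither floods redis.log nor requires a config flip.

```python
# Sketch only: publish RC events to a Redis endpoint that may be down,
# logging at most one error per interval instead of one per failed event.
import logging
import time

import redis

log = logging.getLogger("rcfeed")


class TolerantPublisher:
    def __init__(self, host, error_interval=60.0):
        # Short timeouts so a dead host does not stall the caller.
        self.client = redis.StrictRedis(host=host, socket_timeout=1.0,
                                        socket_connect_timeout=1.0)
        self.error_interval = error_interval   # at most one error line per minute
        self._last_error = 0.0

    def publish(self, channel, event):
        try:
            self.client.publish(channel, event)
        except redis.RedisError as exc:
            now = time.monotonic()
            if now - self._last_error >= self.error_interval:
                log.warning("rc feed target unreachable (%s); suppressing repeats", exc)
                self._last_error = now


# Usage: events resume automatically when the host comes back, with no config change.
feed = TolerantPublisher("rcs1001.eqiad.wmnet")
feed.publish("rc.enwiki", '{"type": "edit", "title": "Example"}')
```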
[10:13:25] !log Rebooted asw-c4-eqiad
[10:13:27] Logged the message, Master
[10:13:28] RECOVERY - Host radon is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms
[10:13:29] RECOVERY - Host labsdb1006 is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms
[10:13:29] RECOVERY - Host logstash1003 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms
[10:13:29] RECOVERY - Host caesium is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[10:13:29] RECOVERY - Host iridium is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[10:13:29] RECOVERY - Host gadolinium is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms
[10:13:29] RECOVERY - Host platinum is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[10:13:30] RECOVERY - Host hafnium is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[10:13:30] RECOVERY - Host rcs1001 is UP: PING OK - Packet loss = 0%, RTA = 3.80 ms
[10:13:31] RECOVERY - Host gold is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[10:13:31] RECOVERY - Host erbium is UP: PING OK - Packet loss = 0%, RTA = 6.97 ms
[10:13:32] RECOVERY - Host stat1003 is UP: PING OK - Packet loss = 0%, RTA = 6.61 ms
[10:13:32] RECOVERY - Host logstash1002 is UP: PING OK - Packet loss = 0%, RTA = 8.14 ms
[10:13:33] RECOVERY - Host ssl1009 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[10:13:33] RECOVERY - Host rdb1001 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[10:13:34] RECOVERY - Host lead is UP: PING OK - Packet loss = 0%, RTA = 4.03 ms
[10:13:37] RECOVERY - Host ssl1005 is UP: PING OK - Packet loss = 0%, RTA = 2.00 ms
[10:13:39] RECOVERY - Host osmium is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[10:13:39] RECOVERY - Host logstash1001 is UP: PING OK - Packet loss = 0%, RTA = 3.28 ms
[10:14:10] oh nice
[10:14:18] RECOVERY - Host labsdb1007 is UP: PING OK - Packet loss = 0%, RTA = 4.29 ms
[10:14:34] there's always a straggler
[10:15:17] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active
[10:19:50] (03PS1) 10Giuseppe Lavagetto: job-queue: revert to rdb1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176497
[10:19:54] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: puppet fail
[10:20:34] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: puppet fail
[10:21:05] PROBLEM - puppet last run on ssl1009 is CRITICAL: CRITICAL: puppet fail
[10:21:07] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: puppet fail
[10:21:08] PROBLEM - puppet last run on caesium is CRITICAL: CRITICAL: puppet fail
[10:21:08] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: puppet fail
[10:21:08] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: puppet fail
[10:21:08] PROBLEM - puppet last run on gold is CRITICAL: CRITICAL: puppet fail
[10:21:08] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: puppet fail
[10:21:46] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: puppet fail
[10:22:16] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: puppet fail
[10:22:16] PROBLEM - puppet last run on platinum is CRITICAL: CRITICAL: puppet fail
[10:22:17] PROBLEM - puppet last run on ssl1005 is CRITICAL: CRITICAL: puppet fail
[10:22:27] (03PS1) 10Giuseppe Lavagetto: jobrunners: revert to rdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/176498
[10:22:37] PROBLEM - puppet last run on gadolinium is CRITICAL: CRITICAL: puppet fail
[10:22:50] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] jobrunners: revert to rdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/176498 (owner: 10Giuseppe Lavagetto)
[10:24:08] RECOVERY - puppet last run on caesium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:24:12] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:24:12] RECOVERY - puppet last run on gold is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[10:25:17] RECOVERY - puppet last run on platinum is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[10:26:07] (03CR) 10Giuseppe Lavagetto: [C: 032] job-queue: revert to rdb1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176497 (owner: 10Giuseppe Lavagetto)
[10:26:15] (03Merged) 10jenkins-bot: job-queue: revert to rdb1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176497 (owner: 10Giuseppe Lavagetto)
[10:26:27] RECOVERY - puppet last run on lead is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:27:42] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:27:49] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 {channel:frontend.error,request:{id:1417343263032-52236},error:{message:Status check failed (redis failure?)}} - 232 bytes in 1.533 second response time
[10:28:01] !log oblivian Synchronized wmf-config/jobqueue-eqiad.php: reverting to rdb1001 (duration: 00m 05s)
[10:28:03] Logged the message, Master
[10:28:16] RECOVERY - puppet last run on ssl1005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:28:39] <_joe_> mmmh ocg critical
[10:30:02] RECOVERY - puppet last run on ssl1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:30:02] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[10:30:54] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.004 second response time
[10:31:13] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:31:37] RECOVERY - puppet last run on gadolinium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:31:44] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:32:05] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:32:26] PROBLEM - check if salt-minion is running on cp3017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:32:56] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:33:07] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:38:02] RECOVERY - check if salt-minion is running on cp3017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:46:11] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:58:37] _joe_: just saw it too
[10:58:55] _joe_: any idea what is going on ?
[11:01:23] <_joe_> akosiaris: yes, everything is ok
[11:01:26] <_joe_> :)
[11:01:37] oh, I just got the OK page
[11:02:01] <_joe_> akosiaris: rdb1001 was down due to a network failure
[11:02:17] failure ?
[11:02:31] <_joe_> yes asw-c4-eqiad went down
[11:02:41] ouch
[12:51:15] PROBLEM - HHVM queue size on mw1230 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [80.0]
[12:52:12] PROBLEM - HHVM busy threads on mw1221 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2]
[12:52:38] PROBLEM - HHVM busy threads on mw1224 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[12:53:16] PROBLEM - HHVM busy threads on mw1227 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2]
[12:53:22] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2]
[12:53:41] PROBLEM - HHVM busy threads on mw1225 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[12:54:13] PROBLEM - HHVM queue size on mw1223 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [80.0]
[12:54:33] _joe_: are these alerts over-sensitive or is there an actual cause for alarm?
[12:54:52] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2]
[12:54:52] PROBLEM - HHVM queue size on mw1227 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [80.0]
[12:55:01] does it correlate with actual or impending failure?
[12:55:32] PROBLEM - HHVM busy threads on mw1222 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2]
[12:56:35] RECOVERY - HHVM busy threads on mw1225 is OK: OK: Less than 1.00% above the threshold [76.8]
[12:57:13] RECOVERY - HHVM queue size on mw1230 is OK: OK: Less than 1.00% above the threshold [10.0]
[12:57:14] RECOVERY - HHVM queue size on mw1223 is OK: OK: Less than 1.00% above the threshold [10.0]
[12:57:32] RECOVERY - HHVM queue size on mw1227 is OK: OK: Less than 1.00% above the threshold [10.0]
[12:57:33] RECOVERY - HHVM busy threads on mw1223 is OK: OK: Less than 1.00% above the threshold [76.8]
[12:57:51] RECOVERY - HHVM busy threads on mw1221 is OK: OK: Less than 1.00% above the threshold [76.8]
[12:58:26] RECOVERY - HHVM busy threads on mw1222 is OK: OK: Less than 1.00% above the threshold [76.8]
[12:58:32] RECOVERY - HHVM busy threads on mw1224 is OK: OK: Less than 1.00% above the threshold [76.8]
[12:59:05] RECOVERY - HHVM busy threads on mw1227 is OK: OK: Less than 1.00% above the threshold [76.8]
[12:59:07] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 1.00% above the threshold [76.8]
[14:34:33] <_joe_> ori: they need tuning, as in we should monitor a longer time span or (probably) an higher percentage of data
[14:34:56] <_joe_> say if 50% of data over 15 minutes is over the threshold, we need to restart hhvm probably
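Editor's note: the HHVM alerts above fire when a given fraction of recent Graphite datapoints exceeds a threshold, and _joe_'s proposed tuning is to require a larger fraction over a longer window. The sketch below is not the actual check plugin, just a minimal Python illustration of that logic, reusing the 115.2 busy-thread threshold from the alerts; the sample window is made up.

```python
# Sketch only: classify a window of datapoints the way the alerts above do.
def fraction_above(datapoints, threshold):
    """Fraction of non-null datapoints above the threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    return sum(1 for v in values if v > threshold) / len(values)


def check(datapoints, threshold=115.2, critical_fraction=0.10):
    """Return an Icinga-style status line based on the over-threshold fraction."""
    frac = fraction_above(datapoints, threshold)
    state = "CRITICAL" if frac >= critical_fraction else "OK"
    return "%s: %.2f%% of data above the threshold [%s]" % (state, 100 * frac, threshold)


# One brief spike in nine samples trips a 10% rule but not the proposed 50% rule,
# which is the over-sensitivity being discussed.
window = [80, 85, 90, 200, 95, 88, 84, 90, 86]
print(check(window))                            # CRITICAL at a 10% setting
print(check(window, critical_fraction=0.50))    # OK at the proposed 50% setting
```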
[14:36:38] _joe__: can you fix this in the db https://phabricator.wikimedia.org/T76289? hi;)
[14:37:20] <_joe_> Steinsplitter: uh I wouldn't know how to do that out of the box, sorry
[14:37:36] ok, thanks aniway
[14:37:52] <_joe_> sorry :)
[14:38:28] <_joe_> I've already had to deal with an outage this morning, I was just passing by, tbh
[14:39:26] oh
[14:48:20] _joe_: good work with the outage this morning btw :)
[15:31:13] (03PS1) 10Glaisher: Modify abusefilter configuration for metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176504
[15:35:07] (03PS5) 10KartikMistry: Add ContentTranslation in wikishared DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175979
[16:36:31] (03PS1) 10Glaisher: Redirect wikimedia.community to www.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/176508
[16:58:29] (03PS2) 10Glaisher: Redirect wikimedia.community to www.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/176508
[17:04:38] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /srv 18536 MB (3% inode=97%):
[17:33:56] (03PS2) 10Glaisher: Modify abusefilter configuration for metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176504
[18:33:25] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago
[18:34:46] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet last ran 4 days ago
[18:37:18] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:53:23] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:08:05] jenkins is stalled
[19:12:29] What else is new
[19:16:18] Krinkle: well, that got rid of the stuck job, but it still refuses to run any new ones
[19:16:33] I'm aware
[19:17:03] !log Disabling and relauching Gearman connection from Jenkins.
[19:17:11] Logged the message, Master
[20:51:37] !log restarted eventlogging mysql-m2-master consumer. It seems it could no longer write to the database.
[20:51:40] Logged the message, Master
[21:18:45] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail
[21:33:09] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[22:21:25] (03Draft2) 10Dereckson: Extra language names configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610 (https://bugzilla.wikimedia.org/53472)
[22:23:33] (03PS3) 10Dereckson: Extra language names configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610
[22:29:56] (03PS4) 10Legoktm: Extra language names configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610 (owner: 10Dereckson)
[22:56:48] (03CR) 10Aude: Extra language names configuration for Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610 (owner: 10Dereckson)
[23:17:17] !log Updated EventLogging to 19c23698bc03694017d764af33307d6f035fc224 (on 22:51) and restarted it
[23:17:23] Logged the message, Master
[23:23:36] (03PS5) 10Dereckson: Extra language names configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610 (https://bugzilla.wikimedia.org/53472)
[23:24:56] (03CR) 10Dereckson: "PS5: merged arrays, per Aude comment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610 (https://bugzilla.wikimedia.org/53472) (owner: 10Dereckson)
[23:26:15] (03PS6) 10Dereckson: Extra language names configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610