[00:00:53] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 46 below the confidence bounds
[00:05:34] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:16:34] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:26:43] RECOVERY - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected
[00:31:34] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:33:33] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:43:03] PROBLEM - tcpircbot_service_running on neon is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py
[01:59:15] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[02:00:54] RECOVERY - tcpircbot_service_running on neon is OK: PROCS OK: 1 process with command name python, args tcpircbot.py
[02:29:07] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun May 15 02:29:07 UTC 2016 (duration 8m 44s)
[03:06:17] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:08:17] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[03:36:39] PROBLEM - puppet last run on mw2051 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:39:17] PROBLEM - configured eth on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:39:27] PROBLEM - Check size of conntrack table on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:39:28] PROBLEM - salt-minion processes on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:39:29] PROBLEM - HTTP-peopleweb on rutherfordium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:39:37] PROBLEM - dhclient process on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:39:39] PROBLEM - Disk space on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:39:59] PROBLEM - puppet last run on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:40:37] PROBLEM - DPKG on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:40:48] PROBLEM - RAID on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:02:28] RECOVERY - puppet last run on mw2051 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[04:23:28] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[04:27:28] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:33:08] PROBLEM - NTP on rutherfordium is CRITICAL: NTP CRITICAL: No response from NTP server
[04:57:18] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[05:01:19] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[05:17:38] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=83%)
[05:43:07] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[05:44:58] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[05:50:57] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[05:52:57] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[05:58:49] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:00:49] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[06:22:38] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:27:18] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0]
[06:30:37] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[06:30:48] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:49] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:58] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:07] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:38] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:58] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:39] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:39:27] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[06:40:38] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[06:44:44] Operations, media-storage, Documentation: Document how to handle 'inconsistent state within the internal storage backends' issues - https://phabricator.wikimedia.org/T135318#2295341 (Peachey88)
[06:52:38] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[06:56:49] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:58] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:59] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:57:47] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:58] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:38] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[06:58:58] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:48] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[07:07:38] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[07:26:37] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[07:27:57] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[07:31:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:32:48] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[07:34:48] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[07:36:08] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:37:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:40:39] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[07:53:28] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[08:11:29] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[08:33:47] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[08:53:19] RECOVERY - Disk space on ms-be2012 is OK: DISK OK
[09:03:08] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[09:15:08] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[09:29:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[09:35:37] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[09:49:39] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[09:55:47] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[10:02:07] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[10:15:50] Operations: Special pages on cswiki have not received updates for 3 days - https://phabricator.wikimedia.org/T135326#2295515 (Peachey88) These are controlled via a cron job IIRC; that might need to be checked on
[10:16:09] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[10:20:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[10:26:18] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[10:28:19] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[10:31:52] Operations: rutherfordium (powers people.wikimedia.org) flapping in channel - https://phabricator.wikimedia.org/T135330#2295529 (Peachey88)
[10:34:38] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[10:48:48] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[10:54:48] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[11:01:08] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:07:08] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[11:09:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:17:15] Operations, Project-Admins: Create #IRCecho project - https://phabricator.wikimedia.org/T134961#2295574 (Aklapper) Agreeing but should discuss name and description with its maintainers, e.g. one idea is to merge ircecho with tcpircbot...
[11:27:27] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[12:20:44] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 5 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2295616 (Aklapper)
[12:45:08] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[12:51:17] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[12:51:29] Operations, Project-Admins, Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2295661 (BBlack) I think maybe some of the confusion here is about the various roles of tags here. Most of our tags are more about management of teams and projects. This is very different from the tra...
[13:13:48] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:19:48] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[13:34:08] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:40:09] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[13:42:17] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:54:28] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[14:02:48] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:11:23] Operations, Discovery, Traffic, Wikidata, Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2295722 (ema) Update: setting default_grace to 0 makes the issue unreproducible. 4.1.0 is not affected, while...
[14:12:48] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[15:22:59] Operations, Discovery, Traffic, Wikidata, Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2295734 (ema) I haven't been able to reproduce the bug so far after reverting the commit mentioned above. cp10...
[15:23:25] Operations, MediaWiki-extensions-BounceHandler, Patch-For-Review: Need an administrative front end for BounceHandler - https://phabricator.wikimedia.org/T114020#2295735 (01tonythomas) This seems done?
[17:14:27] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: puppet fail
[17:42:37] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[17:52:28] RECOVERY - SSH on rutherfordium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[17:58:37] PROBLEM - SSH on rutherfordium is CRITICAL: Server answer
[18:18:38] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail
[18:44:47] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:23:54] (PS3) Rush: icinga tools.checker Tools paging update [puppet] - https://gerrit.wikimedia.org/r/288603
[19:25:56] (CR) Rush: [C: 2 V: 2] icinga tools.checker Tools paging update [puppet] - https://gerrit.wikimedia.org/r/288603 (owner: Rush)
[20:57:14] host db1026 lags since 17:50 UTC, what's going on?
[20:58:23] PROBLEM - puppet last run on mw2076 is CRITICAL: CRITICAL: puppet fail
[21:26:13] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:44:13] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:10:13] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[22:39:10] (CR) Rush: "I think it would be better to do in hiera, we try to keep if $realm checks to a minimum in general and I'm not a big fan of them in templa" [labs/private] - https://gerrit.wikimedia.org/r/288736 (owner: Alex Monk)
[22:56:52] RECOVERY - MariaDB Slave Lag: s3 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 272.01 seconds
[23:05:04] (PS2) Alex Monk: Add my (Alex M's) key to labtest-instances root [labs/private] - https://gerrit.wikimedia.org/r/288736
[23:09:58] is Wiki*eda down for anyone else?
[23:10:14] Getting ERR_CONNECTION_TIMED_OUT on commons and Wikipedia...
[23:12:17] So... I can't access any page atm
[23:12:30] what's wiki*eda?
[23:12:52] wikimedia and wikipedia, with * as a wildcard, to indicate all wikis
[23:13:18] Wiki[mp]edia
[23:14:06] Trying https://commons.wikimedia.org/wiki/Image:Eurovision_Song_Contest_2017_Ukraine_Logo.png and https://en.wikipedia.org/wiki/Special:Random just loads and loads and loads, and is stopped by the browser after X number of minutes
[23:15:35] Was anything recently deployed, or were servers switched, or... is it just me...? I can access all other websites online just fine
[23:17:40] WFM
[23:18:06] can't access any pages on wikimedia sites?
[23:18:17] exactly
[23:18:30] or wait...
[23:18:35] I can access phabricator
[23:19:07] but not wikipedia, or Commons, or meta, or otrs-wiki
[23:19:20] okay
[23:19:43] can you ping phabricator and then wikipedia/commons/meta/otrs-wiki?
[23:19:56] umm... how do I go about doing that...?
[23:20:12] depends on your OS. command line/terminal?
[23:21:04] I am on a Chromebook, and only have access to whatever is inside Chrome
[23:21:11] (unless I boot my W10)
[23:21:50] in that case I suggest you google "chromebook ping" :)
[23:22:06] :p
[23:22:34] PROBLEM - RAID on db1034 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[23:22:51] wait... now phab is down too...
[23:23:21] https://www.irccloud.com/pastebin/OeFU4yNh/
[23:23:35] https://www.irccloud.com/pastebin/fSAtR6zR/
[23:24:05] hmmm wait... that's not right...
[23:24:15] works fine for me
[23:26:15] pastebins deleted?
[23:26:18] those links above were via an add-on and not from my IP
[23:26:33] so, I deleted them, since they did it from off-computer servers...
[23:26:42] doing it for real via crosh now
[23:26:46] ok
[23:27:02] been ~60 secs now... still no pingback
[23:27:27] I got a pingback for twitter.com within milliseconds
[23:28:21] https://www.irccloud.com/pastebin/a3tLwwg1/
[23:28:56] still no answer to my ping though
[23:29:26] oh wait... now I'm getting stuff again! :D
[23:29:55] now it wfm again
[23:30:10] weird
[23:35:53] Operations, ops-eqiad: db1034 degraded RAID - https://phabricator.wikimedia.org/T135353#2296141 (Volans)
[23:36:34] ACKNOWLEDGEMENT - RAID on db1034 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Volans T135353
[23:41:57] (CR) Andrew Bogott: [C: 2] Add my (Alex M's) key to labtest-instances root [labs/private] - https://gerrit.wikimedia.org/r/288736 (owner: Alex Monk)
[23:42:55] (CR) Andrew Bogott: [V: 2] Add my (Alex M's) key to labtest-instances root [labs/private] - https://gerrit.wikimedia.org/r/288736 (owner: Alex Monk)