[00:24:57] PROBLEM - SSH on kraz is CRITICAL: Server answer
[00:26:57] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[00:40:56] PROBLEM - NTP on kraz is CRITICAL: NTP CRITICAL: No response from NTP server
[00:49:16] PROBLEM - SSH on kraz is CRITICAL: Server answer
[00:53:16] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[01:15:36] PROBLEM - SSH on kraz is CRITICAL: Server answer
[01:19:28] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[01:35:56] PROBLEM - SSH on kraz is CRITICAL: Server answer
[01:43:48] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[01:49:47] PROBLEM - SSH on kraz is CRITICAL: Server answer
[01:53:47] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[02:06:07] PROBLEM - SSH on kraz is CRITICAL: Server answer
[02:08:07] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[02:22:29] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 09m 57s)
[02:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun May 22 02:31:13 UTC 2016 (duration 8m 45s)
[02:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:16] PROBLEM - SSH on kraz is CRITICAL: Server answer
[02:48:07] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[03:08:17] PROBLEM - SSH on kraz is CRITICAL: Server answer
[03:12:17] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[03:24:07] PROBLEM - SSH on kraz is CRITICAL: Server answer
[03:26:07] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[03:36:17] PROBLEM - SSH on kraz is CRITICAL: Server answer
[03:40:16] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:00:47] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:02:17] PROBLEM - SSH on kraz is CRITICAL: Server answer
[04:08:16] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:14:08] PROBLEM - SSH on kraz is CRITICAL: Server answer
[04:14:34] growing replication lag:
[04:14:36] "host": "db1056",
[04:14:36] "lag": 408
[04:14:42] commons.wikimedia.org
[04:16:16] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:22:08] PROBLEM - SSH on kraz is CRITICAL: Server answer
[04:24:47] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[04:36:18] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:42:17] PROBLEM - SSH on kraz is CRITICAL: Server answer
[04:46:18] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:49:57] PROBLEM - puppet last run on mw2009 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:52:17] PROBLEM - SSH on kraz is CRITICAL: Server answer
[05:01:12] 06Operations, 10Wikimedia-IRC-RC-Server: Kraz (irc.wikimedia.org) has been flapping on IRC most of day - https://phabricator.wikimedia.org/T135930#2315645 (10Peachey88)
[05:10:37] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[05:14:16] RECOVERY - puppet last run on mw2009 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[05:16:37] PROBLEM - SSH on kraz is CRITICAL: Server answer
[05:25:05] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2259160 (10Peachey88) looks like {T135930} might be another possible case
[05:46:36] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[05:52:27] PROBLEM - SSH on kraz is CRITICAL: Server answer
[05:56:27] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[06:02:37] PROBLEM - SSH on kraz is CRITICAL: Server answer
[06:14:36] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[06:24:27] PROBLEM - SSH on kraz is CRITICAL: Server answer
[06:28:36] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[06:30:07] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:17] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:27] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:46] PROBLEM - SSH on kraz is CRITICAL: Server answer
[06:35:37] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:44:37] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[06:50:48] PROBLEM - SSH on kraz is CRITICAL: Server answer
[06:54:50] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2315697 (10Nemo_bis) "Secure redirect service" is grammatically unclear to me, I don't understand what is verb/noun/adjective. Does the summary just mean "Swit...
[06:56:18] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:56:36] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:37] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:17] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:57] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[07:10:48] PROBLEM - SSH on kraz is CRITICAL: Server answer
[07:18:47] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[07:24:46] PROBLEM - SSH on kraz is CRITICAL: Server answer
[07:26:46] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[07:32:57] PROBLEM - SSH on kraz is CRITICAL: Server answer
[07:36:56] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[07:42:47] PROBLEM - SSH on kraz is CRITICAL: Server answer
[07:46:48] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[08:00:56] PROBLEM - SSH on kraz is CRITICAL: Server answer
[08:33:26] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[08:39:26] PROBLEM - SSH on kraz is CRITICAL: Server answer
[08:52:27] PROBLEM - Disk space on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:52:28] PROBLEM - Check size of conntrack table on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:52:36] PROBLEM - RAID on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:52:56] PROBLEM - configured eth on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:16] PROBLEM - puppet last run on planet2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:53:17] PROBLEM - DPKG on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:18] PROBLEM - dhclient process on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:37] PROBLEM - salt-minion processes on planet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:10:16] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 611
[09:20:16] RECOVERY - check_mysql on lutetium is OK: Uptime: 994398 Threads: 1 Questions: 17096431 Slow queries: 14667 Opens: 99511 Flush tables: 2 Open tables: 64 Queries per second avg: 17.192 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:25:48] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[09:32:07] PROBLEM - SSH on kraz is CRITICAL: Server answer
[09:42:36] PROBLEM - NTP on planet2001 is CRITICAL: NTP CRITICAL: No response from NTP server
[09:46:47] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[09:47:47] PROBLEM - puppet last run on elastic1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:48:47] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[09:58:07] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[10:08:18] PROBLEM - SSH on kraz is CRITICAL: Server answer
[10:14:06] RECOVERY - puppet last run on elastic1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[10:29:17] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[10:31:36] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[10:43:47] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[10:56:07] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:02:36] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[11:04:37] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:10:47] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[11:18:56] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:20:33] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2315876 (10Southparkfan) @Joe did you already install one of those servers? noticed https://ganglia.wikimedia.org/latest/?c=Application%20servers%20eqiad&h=mw1305.eqiad.wmnet&m=cpu_report&...
[11:25:38] PROBLEM - RAID on ms-be2012 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[11:27:17] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdk1 is not accessible: Input/output error
[11:35:36] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[11:41:46] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:45:16] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:02:37] RECOVERY - Disk space on ms-be2012 is OK: DISK OK
[12:06:26] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[12:08:26] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[12:09:02] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2315930 (10BBlack) It means a distinct service, running separately from our normal infrastructure for the canonical domains, which does nothing but handle the...
[12:30:48] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[12:34:48] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:18:04] 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2315983 (10jcrespo) This would be a major change though I know, but I would still push for it. For the time being this is a non-issue (aka low) because dbs are never restart...
[13:28:07] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[13:30:26] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:36:37] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[14:02:40] !log performing schema change on s6 T130692
[14:02:41] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692
[14:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:05:17] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:11:18] PROBLEM - SSH on planet2001 is CRITICAL: Server answer
[14:33:46] RECOVERY - SSH on planet2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:38:17] !log trying to restart kraz and planet2001 (both service and console unresponsive)
[14:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:42:06] RECOVERY - configured eth on planet2001 is OK: OK - interfaces up
[14:42:17] RECOVERY - dhclient process on planet2001 is OK: PROCS OK: 0 processes with command name dhclient
[14:42:17] RECOVERY - DPKG on planet2001 is OK: All packages OK
[14:42:27] RECOVERY - salt-minion processes on planet2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:42:47] RECOVERY - NTP on planet2001 is OK: NTP OK: Offset -0.002482891083 secs
[14:43:37] RECOVERY - Disk space on planet2001 is OK: DISK OK
[14:43:37] RECOVERY - RAID on planet2001 is OK: OK: no RAID installed
[14:43:37] RECOVERY - Check size of conntrack table on planet2001 is OK: OK: nf_conntrack is 0 % full
[14:44:26] PROBLEM - Host kraz is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:57] RECOVERY - RAID on kraz is OK: OK: no RAID installed
[14:45:57] RECOVERY - salt-minion processes on kraz is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:46:07] RECOVERY - Host kraz is UP: PING OK - Packet loss = 0%, RTA = 37.02 ms
[14:46:26] RECOVERY - puppet last run on planet2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:46:36] RECOVERY - Check size of conntrack table on kraz is OK: OK: nf_conntrack is 0 % full
[14:46:36] RECOVERY - Disk space on kraz is OK: DISK OK
[14:46:56] RECOVERY - DPKG on kraz is OK: All packages OK
[14:46:56] RECOVERY - configured eth on kraz is OK: OK - interfaces up
[14:46:57] RECOVERY - dhclient process on kraz is OK: PROCS OK: 0 processes with command name dhclient
[14:47:18] RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:48:06] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[15:23:21] irc.wikimedia.org is broken or something
[15:24:18] 06Operations, 10Phabricator, 10Phabricator-Upstream: PHD ensuring umask goodness - https://phabricator.wikimedia.org/T91648#2316165 (10Aklapper) Anyone knows if this is actually still an issue or if T128009 fixed this? @akosiaris, @chasemp or anyone else? Asking as this task hasn't seen an update since Dec...
[15:30:34] 06Operations, 10Wikimedia-IRC-RC-Server: irc.wikimedia.org is not sending out changes - https://phabricator.wikimedia.org/T135948#2316189 (10Glaisher) Adding #operations as I'm not sure who else has access to this.
[15:46:57] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2316205 (10jcrespo) I've restarted planet2001 and kraz today. The symptoms were different this time: they did not wake up on connection. SSH and console were down/overloaded. I restarted them and the services came to...
[15:54:07] Is anyone looking into https://phabricator.wikimedia.org/T135948 this UBN? It has broken almost all the anti-vandalism bots we have.
[15:54:53] _joe_: akosiaris paravoid ^
[16:00:18] Glaisher, check reconnecting now
[16:02:13] jynus: working now, thanks
[16:02:28] can I resolve T135948?
[16:02:28] T135948: irc.wikimedia.org is not sending out changes - https://phabricator.wikimedia.org/T135948
[16:02:44] I think so
[16:03:49] 06Operations, 10Wikimedia-IRC-RC-Server: irc.wikimedia.org is not sending out changes - https://phabricator.wikimedia.org/T135948#2316214 (10jcrespo) 05Open>03Resolved a:03jcrespo It seems the server overloaded/bugged out, then it was hit by T134875.
[16:03:51] 06Operations, 10Wikimedia-IRC-RC-Server: irc.wikimedia.org is not sending out changes - https://phabricator.wikimedia.org/T135948#2316171 (10Dzahn) I checked and the IRC bot is running as of now. Either somebody restarted the service (it's T134875 that it needed that) or it came back a little delayed. It's nor...
[16:04:13] heh ;)
[16:04:51] 06Operations, 10Wikimedia-IRC-RC-Server: irc.wikimedia.org is not sending out changes - https://phabricator.wikimedia.org/T135948#2316221 (10jcrespo) No, I have to do it manually, we need to resolve that.
[16:06:46] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2316223 (10jcrespo) Now it does, it hit T134875.
[16:15:19] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2316227 (10Dzahn) >>! In T134242#2316205, @jcrespo wrote: > the instructions are not up to date: https://wikitech.wikimedia.org/wiki/IRCD#Starting_the_bot updated.
[16:16:22] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2316228 (10jcrespo) Actually, I added them too, more complete: IRCD#Services
[16:18:36] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail
[16:31:49] 06Operations, 06Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2316232 (10Aklapper)
[16:33:16] 06Operations, 06Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634588 (10Aklapper)
[16:33:37] 06Operations, 06Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634588 (10Aklapper)
[16:34:30] 06Operations, 06Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2316238 (10Aklapper) @Dvorapa: I do not know more than what's written in T84950, sorry. :(
[16:45:16] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:16:14] !log defragmenting db1028
[17:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:36:13] <_joe_> Glaisher: sorry, I wasn't around
[17:36:17] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 676 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6488837 keys - replication_delay is 676
[17:36:33] _joe_: np, it has been fixed now
[17:37:11] <_joe_> Glaisher: I know, I am just sorry I didn't get the ping
[17:38:16] people can't always be around on IRC ;-)
[18:19:17] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6480537 keys - replication_delay is 0
[18:35:57] (03PS1) 10Dzahn: ircd/ircecho: add icinga process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948)
[18:36:08] you are right, i'm adding monitoring ^
[18:37:40] (03CR) 10jenkins-bot: [V: 04-1] ircd/ircecho: add icinga process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[18:40:18] (03PS2) 10Dzahn: ircd/ircecho: add icinga process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948)
[18:41:33] (03CR) 10jenkins-bot: [V: 04-1] ircd/ircecho: add icinga process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[18:42:13] (03CR) 10Jcrespo: "Actually, both processes were running when down, I would do a more functional approach by trying to join #enwiki, etc. (or easier, checkin" [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[18:42:15] (03PS3) 10Dzahn: ircd/ircecho: add icinga process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948)
[18:45:33] (03PS4) 10Dzahn: ircd/ircecho: add icinga process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948)
[18:48:43] (03CR) 10Jcrespo: "See comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[18:56:14] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2316347 (10Steinsplitter) See also https://commons.wikimedia.org/wiki/Category:Dele...
[18:57:47] (03CR) 10Dzahn: ircd/ircecho: add icinga process monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[18:58:36] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 663 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6485884 keys - replication_delay is 663
[19:01:57] (03PS1) 10Dzahn: ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948)
[19:03:32] (03CR) 10jenkins-bot: [V: 04-1] ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[19:06:09] (03PS2) 10Dzahn: ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948)
[19:06:32] (03PS3) 10Dzahn: ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948)
[19:07:41] (03CR) 10jenkins-bot: [V: 04-1] ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[19:08:45] (03PS4) 10Dzahn: ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948)
[19:13:06] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6482866 keys - replication_delay is 0
[19:18:29] (03PS5) 10Dzahn: ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948)
[19:18:44] (03PS6) 10Dzahn: ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948)
[19:19:33] (03CR) 10jenkins-bot: [V: 04-1] ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[19:19:40] (03CR) 10Dzahn: "there is also https://gerrit.wikimedia.org/r/#/c/135074/7" [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[19:22:12] 06Operations, 10Wikimedia-IRC-RC-Server: Kraz (irc.wikimedia.org) has been flapping on IRC most of day - https://phabricator.wikimedia.org/T135930#2316371 (10Dzahn) these were effects of T134242 it's not happening anymore since the VM got restarted i'd consider it merged into the above and more or less a dup...
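The review exchange on Gerrit change 290077 above contrasts a process-presence check with the more functional approach Jcrespo suggests: actually speak the IRC protocol and confirm the server answers (or even join a channel). The sketch below illustrates that idea under stated assumptions — plain sockets, Nagios-style exit codes, and an invented script name and nick; it is not the plugin under review:

```python
#!/usr/bin/env python
"""Hypothetical functional check for an ircd, in the spirit of the review
comment above. Illustrative only; not the reviewed Icinga plugin."""
import socket
import sys

OK, CRITICAL = 0, 2  # conventional Nagios/Icinga plugin exit codes


def registration_complete(line):
    """True when a server reply carries numeric 001 (RPL_WELCOME),
    i.e. the ircd actually registered our connection rather than
    merely accepting the TCP handshake."""
    parts = line.split()
    return len(parts) >= 2 and parts[1] == "001"


def check_ircd(host="irc.wikimedia.org", port=6667, timeout=10):
    """Connect, register a throwaway client, and wait for RPL_WELCOME."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
        conn.sendall(b"NICK icinga-check\r\nUSER icinga 0 * :icinga check\r\n")
        data = b""
        while len(data) < 65536:  # bounded read so the check always ends
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
            for line in data.decode("utf-8", "replace").splitlines():
                if registration_complete(line):
                    return OK
        return CRITICAL
    except (socket.error, socket.timeout):
        return CRITICAL


if __name__ == "__main__":
    sys.exit(check_ircd())
```

Going one step further and joining #en.wikipedia to wait for RC traffic, as the review suggests, would also catch the failure mode seen earlier in the day, where both processes were running but no changes were being relayed.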
[19:24:14] 06Operations, 10Wikimedia-IRC-RC-Server: udpmxircecho should write stats of messages processed and we should alert when that drops to zero - https://phabricator.wikimedia.org/T134326#2316374 (10Dzahn) We have this https://gerrit.wikimedia.org/r/#/c/135074/7 that already sends user and channel count to statsd...
[19:29:25] (03CR) 10Luke081515: "Seems like gerrit restored PS3." [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn)
[19:31:49] (03PS7) 10Dzahn: ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948)
[19:44:34] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2316391 (10Riley_Huntley) Well, that's a record.. 32 images in a row that I could not...
[20:36:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 665 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6496467 keys - replication_delay is 665
[21:13:36] (03PS7) 10Dereckson: Adjust groups permissions on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290000 (https://phabricator.wikimedia.org/T135774) (owner: 10Urbanecm)
[21:13:56] (03CR) 10Dereckson: "PS7: +signed-off-by per PS1 comment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290000 (https://phabricator.wikimedia.org/T135774) (owner: 10Urbanecm)
[21:19:26] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6489689 keys - replication_delay is 0
[21:19:44] Dereckson: Hi, do you have permission to approve translations for this template https://www.mediawiki.org/wiki/Template:WikimediaDownload please?
[21:24:07] paladox: done I think
[21:24:34] Dereckson: Thanks
[21:25:02] You're welcome.
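T134326 above asks for udpmxircecho to report message/user/channel stats so an alert can fire when they drop to zero, and Dzahn points at a patch that already sends user and channel counts to statsd. As a toy illustration of that reporting path, here is a minimal sketch of emitting such counts as statsd gauges over UDP; the metric names and statsd host are invented for illustration and are not taken from the linked patch:

```python
"""Toy sketch: push ircd stats to statsd as gauges (illustrative names)."""
import socket


def statsd_gauge(metric, value):
    """Format one statsd gauge datagram, e.g. 'ircd.users:42|g'."""
    return "%s:%d|g" % (metric, value)


def send_stats(stats, host="statsd.example.wmnet", port=8125):
    # statsd is fire-and-forget UDP: one small datagram per metric,
    # so a dead statsd never blocks or crashes the bot being measured.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for metric, value in stats.items():
            sock.sendto(statsd_gauge(metric, value).encode("ascii"),
                        (host, port))
    finally:
        sock.close()
```

With gauges like these in place, the alert the task asks for reduces to an Icinga threshold on the reported value hitting zero.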
[21:31:52] Dereckson: Would you also be able to enable https://www.mediawiki.org/wiki/Template:WikimediaDownloadOld for translation, please?
[21:31:52] :)
[21:32:58] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2316535 (10matmarex) @Riley_Huntley That sounds like a separate bug, Wikidata-relat...
[21:33:02] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2316537 (10Gilles)
[21:33:43] No changes to review. Marking this page for translation will not edit the page nor any existing translation unit.
[21:34:13] The page Template:WikimediaDownloadOld has been marked up for translation with 13 translation units. The page can now be translated. Please import any pre-existing translations: you can use Special:PageMigration for this purpose.
[21:35:43] https://www.mediawiki.org/wiki/Special:AllPages?from=WikimediaDownloadOld&to=&namespace=10 there wasn't any
[21:37:37] Thanks Dereckson
[21:52:32] yw
[21:53:57] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[22:18:17] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[22:28:34] 06Operations, 14Spam: invalid ops task - https://phabricator.wikimedia.org/T78602#2316595 (10Danny_B)
[22:50:37] PROBLEM - Disk space on kafka1022 is CRITICAL: DISK CRITICAL - free space: /var/spool/kafka/b 73805 MB (3% inode=99%)
[23:17:14] 06Operations, 10Wikimedia-IRC-RC-Server: Kraz (irc.wikimedia.org) has been flapping on IRC most of day - https://phabricator.wikimedia.org/T135930#2316812 (10Peachey88) 05Open>03Resolved a:03jcrespo > !log trying to restart kraz and planet2001 (both service and console unresponsive) >>! In T1359...
[23:25:06] 06Operations, 10ArchCom-RfC, 06Services, 07RfC: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#2316826 (10Danny_B)
[23:25:33] 06Operations, 10Analytics, 10ArchCom-RfC, 06Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2316831 (10Danny_B)
[23:25:53] 06Operations, 10ArchCom-RfC, 10Architecture, 10Incident-20150423-Commons, and 7 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#2316840 (10Danny_B)