[00:39:53] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:08:53] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[01:46:13] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:49:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:03] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:13] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:03] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:13] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:52:03] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[01:52:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[01:52:13] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:53:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[01:53:03] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[01:54:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[01:55:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[01:55:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:55:13] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:56:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[01:56:04] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[01:56:04] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:56:04] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:56:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[01:58:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:59:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[01:59:03] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[01:59:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[02:00:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:03:03] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[02:03:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:03:13] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:04:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:05:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[02:05:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[02:07:04] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[02:14:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[02:39:43] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2963632 keys, up 23 days 10 hours - replication_delay is 653
[02:51:43] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2948411 keys, up 23 days 10 hours - replication_delay is 0
[03:04:37] (CR) Juniorsys: [C: 1] standardize "include ::profile:*", "include ::nrpe" [puppet] - https://gerrit.wikimedia.org/r/347023 (owner: Dzahn)
[03:22:48] (CR) Juniorsys: [C: 1] "> I could not explain the syntax error. but it's gone without that. strange or i was blind to see an additional : or something. i stared a" [puppet] - https://gerrit.wikimedia.org/r/347023 (owner: Dzahn)
[03:32:34] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.test],File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:57:33] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:58:03] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:00:23] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:00:43] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[04:00:53] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[04:27:25] (CR) Chad: Move contribution tracking config to CommonSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: Chad)
[04:36:20] Operations, Labs, Tool-Labs, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3184860 (zhuyifei1999)
[04:42:43] Operations, Labs, Tool-Labs, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3184861 (zhuyifei1999) >>! In T102367#3021211, @scfc wrote: > Is there any resistance to redirecting `GET` requests from `http` to `https` at the pro...
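The "MediaWiki exceptions and fatals per minute" check that flapped earlier (CRITICAL at 01:56:33, RECOVERY at 02:14:33) is a percentage-over-threshold check against recent Graphite datapoints: it went critical with 90.00% of samples above 50 fatals/min and recovered once fewer than 70.00% were above 25/min. A rough Python sketch of that kind of decision follows; the sample values and the 70% trigger are made up for illustration and this is not the actual check_graphite implementation.

    def percent_above(datapoints, limit):
        """Fraction of non-null datapoints strictly above a limit, as a percentage."""
        valid = [p for p in datapoints if p is not None]
        if not valid:
            return 0.0
        return 100.0 * sum(1 for p in valid if p > limit) / len(valid)

    # Illustrative per-minute fatal counts, not real data from graphite1001.
    samples = [80, 75, 90, 60, 55, 70, 85, 40, 95, 66]
    critical = percent_above(samples, 50.0) >= 70.0   # critical threshold [50.0]
    warning = percent_above(samples, 25.0) >= 70.0    # warning threshold [25.0], as in the recovery text
    print("CRITICAL" if critical else ("WARNING" if warning else "OK"))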
[04:52:23] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1343.30 Read Requests/Sec=481.10 Write Requests/Sec=12.00 KBytes Read/Sec=39796.80 KBytes_Written/Sec=68.40
[04:58:23] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=129.50 Read Requests/Sec=269.60 Write Requests/Sec=2.60 KBytes Read/Sec=5021.60 KBytes_Written/Sec=385.60
[07:36:33] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 127.42 seconds
[08:00:47] PROBLEM - MariaDB Slave SQL: s6 on db1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:01:37] RECOVERY - MariaDB Slave SQL: s6 on db1037 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:03:57] PROBLEM - MariaDB Slave IO: s6 on db1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:04:18] Checking
[08:04:33] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[08:04:46] RECOVERY - MariaDB Slave IO: s6 on db1037 is OK: OK slave_io_state Slave_IO_Running: Yes
[08:05:57] storage and BBU look good
[08:06:06] how is that even an allowed query?
[08:07:31] and why ruwiki in particular?
[08:08:37] that query is just mad
[08:09:33] I do not know why long-running queries are not killed - it has the watchdog
[08:09:51] is this an ores cron of some type or kicked off by user interaction, I wonder
[08:10:19] oh, I know
[08:10:20] hey, what's wrong with ores?
[08:10:30] because it starts with (SELECT
[08:10:47] and it checks ^\ *SELECT
[08:10:54] Can this be somehow related to: https://phabricator.wikimedia.org/T134976 ?
[08:11:00] Amir1: why do you think something is wrong with ORES?
[08:11:21] hello people!
[08:11:29] I'm watching the word "ores" and it came up here
[08:11:36] by apergos :)
[08:11:40] 'ello
[08:12:20] Amir1: some long running queries that are ores-related
[08:13:01] hmm, these look like they come from the ores extension
[08:13:07] let me take a look
[08:15:39] apergos: marostegui: If these long queries only happen in SpecialRecentChangesLinked, I can disable ores in it for now
[08:15:45] Do you want that?
[08:17:08] Amir1: wait a sec, as it looks like jynus said it is fixed, so let's wait until he has confirmed he's done his magic
[08:17:26] okay, keep me posted
[08:17:45] (PS1) Jcrespo: Database slave watchdog- kill all queries, not only the selects [software] - https://gerrit.wikimedia.org/r/348349 (https://phabricator.wikimedia.org/T163063)
[08:18:33] (CR) Jcrespo: [V: 2 C: 2] Database slave watchdog- kill all queries, not only the selects [software] - https://gerrit.wikimedia.org/r/348349 (https://phabricator.wikimedia.org/T163063) (owner: Jcrespo)
[08:24:38] Amir1: if you can try to find out why this query appeared it would be nice: https://phabricator.wikimedia.org/T163063
[08:24:56] The issue has been fixed by jynus by changing the query killer to catch that query
[08:26:45] marostegui: It happened using SpecialRecentChangesLinked. The thing is ChangesList has several subclasses like watchlist and recent changes
[08:27:19] I checked for the important ones so ores doesn't cause issues, but this one doesn't seem to be used at all and it's slow already
[08:27:29] ores put some more pressure on it
[08:28:01] I can remove that functionality without anyone noticing
[08:28:33] Sure, if you think that can be done and prevent future issues like this, that'd be nice! :)
[08:29:02] Okay, I'll make a patch in ten minutes
[08:29:41] I'd rather not deploy anything today though, as the patch Jaime sent fixes the issue for now :)
[08:31:04] Going to log off now. Thanks everyone for the help!
[08:31:11] Let's continue the discussion on that ticket
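The exchange above pins down why the long-running ORES-related query escaped the watchdog: the killer only considered statements matching ^\ *SELECT, and this query began with an opening parenthesis, "(SELECT ...", so it was never a kill candidate; the 08:17:45 patch makes the watchdog kill all long-running queries, not only the selects. A minimal Python sketch of that matching logic follows; the function names, runtime threshold, and sample query text are illustrative assumptions, not the actual watchdog code in operations/software.

    import re

    # Old behaviour (assumed for illustration): only plain SELECTs were kill candidates.
    OLD_PATTERN = re.compile(r'^\ *SELECT', re.IGNORECASE)

    def should_kill_old(query, runtime_s, threshold_s=300):
        return runtime_s > threshold_s and OLD_PATTERN.match(query) is not None

    def should_kill_new(query, runtime_s, threshold_s=300):
        # After the change: any query over the runtime threshold is killed,
        # regardless of how its text starts.
        return runtime_s > threshold_s

    # Hypothetical stand-in for the query from T163063, not the real statement.
    slow_query = "(SELECT rc_id FROM recentchanges ...) UNION (SELECT ...)"
    print(should_kill_old(slow_query, 900))   # False: the leading "(" defeats the regex
    print(should_kill_new(slow_query, 900))   # True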
[10:34:35] Operations, Commons, media-storage: More missing 'original' files on Commons - https://phabricator.wikimedia.org/T163068#3185065 (Revent)
[12:32:03] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 1701 MB (3% inode=76%): /srv/deployment/ocg/output 10513 MB (5% inode=96%)
[14:54:53] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:22:29] there was a peak in OCG requests plus ocg1003 has a smaller lvm partition for /srv/deployment/ocg/output :(
[15:22:53] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[15:28:35] mmmm there is space in the pv, the lvm volume is smaller than on the other hosts
[15:30:20] reading https://phabricator.wikimedia.org/T162780
[15:33:30] options are 1) increase the lv and the fs
[15:33:40] 2) just clean up some files
[15:33:58] I'd go for cleaning up old pdfs with mtime +3
[15:35:25] !log executing sudo find -name *.pdf -mtime +3 -exec rm {} \; on ocg1003's /srv/deployment/ocg/output to clean up some disk space - T162780
[15:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:35] T162780: ocg1003 partitions are severely misconfigured - https://phabricator.wikimedia.org/T162780
[15:37:52] ah also / is full
[15:37:55] -.-
[15:44:43] !log restart ocg on ocg1003 to clean up deleted files in lsof
[15:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:28] ah now I found it, the "post-mortem" dir in /srv is full (since it is mounted under /)
[15:50:13] RECOVERY - Disk space on ocg1003 is OK: DISK OK
[15:50:47] didn't do anything more than restarting
[15:54:12] looks good now, but poor OCG is definitely not healthy lately :)
[15:54:21] * elukey brb again
[17:19:41] <_joe_> elukey ftr, I just created the lvm and I left space for growth
[17:20:13] <_joe_> since I wasn't sure we needed to move post-mortem too
[17:20:26] <_joe_> btw post-mortems can be removed almost safely
[17:40:27] thanks _joe_!
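The cleanup logged at 15:35 was a one-off find over /srv/deployment/ocg/output removing PDFs older than three days. An equivalent sketch in Python follows; the directory path and three-day cutoff mirror the !log entry, while the function name and dry-run default are illustrative assumptions.

    import os
    import time

    OUTPUT_DIR = "/srv/deployment/ocg/output"   # from the !log entry
    MAX_AGE_S = 3 * 24 * 3600                    # "-mtime +3": older than three days

    def cleanup(root=OUTPUT_DIR, max_age_s=MAX_AGE_S, dry_run=True):
        """Remove *.pdf files whose mtime is older than the cutoff."""
        cutoff = time.time() - max_age_s
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if not name.endswith(".pdf"):
                    continue
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < cutoff:
                    print("would remove" if dry_run else "removing", path)
                    if not dry_run:
                        os.remove(path)

    # cleanup(dry_run=True)  # list candidates first; flip to False to actually delete

Note that deleting files still held open by a process does not free the space until the process closes them, which is why the 15:44 restart of ocg was needed to reclaim the "deleted files in lsof".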
[18:02:53] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:03:43] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:55:43] PROBLEM - HTTPS on ms1001 is CRITICAL: SSL CRITICAL - Certificate download.wikimedia.org valid until 2017-04-19 18:55:12 +0000 (expires in 2 days)
[18:56:43] PROBLEM - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.178 and port 9042: Connection refused
[18:56:53] PROBLEM - Check systemd state on restbase2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:57:03] PROBLEM - cassandra-c service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[18:57:03] PROBLEM - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:07:05] this one is surely an OOM due to the tombstones --^
[19:08:50] yep seems so (Cc: urandom o/)
[19:08:53] RECOVERY - Check systemd state on restbase2007 is OK: OK - running: The system is fully operational
[19:09:03] RECOVERY - cassandra-c service on restbase2007 is OK: OK - cassandra-c is active
[19:09:11] (just ran puppet)
[19:12:13] RECOVERY - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-c valid until 2017-09-12 15:35:55 +0000 (expires in 148 days)
[19:12:43] RECOVERY - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.178 port 9042
[19:51:53] Operations, MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), MW-1.29-release-notes, Patch-For-Review, Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3185399 (...
[20:15:53] PROBLEM - zotero on sca1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:16:43] RECOVERY - zotero on sca1003 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.009 second response time
[20:56:03] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 770802 msg: ocg_render_job_queue 0 msg
[20:56:03] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 769385 msg: ocg_render_job_queue 0 msg
[21:04:47] (PS1) Dereckson: Enable responsive references on el.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/348400 (https://phabricator.wikimedia.org/T163074)
[21:37:03] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[21:38:04] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2907479 keys, up 24 days 5 hours - replication_delay is 0
[22:51:43] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[22:52:43] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set
[23:33:53] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues