[00:39:53] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:08:53] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[01:46:13] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:49:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:03] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:13] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:03] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:51:13] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:52:03] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[01:52:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[01:52:13] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:53:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[01:53:03] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[01:54:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[01:55:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[01:55:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:55:13] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:56:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[01:56:04] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[01:56:04] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:56:04] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:56:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[01:58:03] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:59:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[01:59:03] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[01:59:03] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[02:00:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:03] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:03:03] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[02:03:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:03:13] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:04:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:05:03] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[02:05:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[02:07:04] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:03] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[02:14:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[02:39:43] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2963632 keys, up 23 days 10 hours - replication_delay is 653
[02:51:43] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2948411 keys, up 23 days 10 hours - replication_delay is 0
[03:04:37] (CR) Juniorsys: [C: 1] standardize "include ::profile:*", "include ::nrpe" [puppet] - https://gerrit.wikimedia.org/r/347023 (owner: Dzahn)
[03:22:48] (CR) Juniorsys: [C: 1] "> I could not explain the syntax error. but it's gone without that. strange or i was blind to see an additional : or something. i stared a" [puppet] - https://gerrit.wikimedia.org/r/347023 (owner: Dzahn)
[03:32:34] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.test],File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:57:33] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:58:03] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:00:23] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:00:43] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[04:00:53] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[04:27:25] (CR) Chad: Move contribution tracking config to CommonSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: Chad)
[04:36:20] Operations, Labs, Tool-Labs, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3184860 (zhuyifei1999)
[04:42:43] Operations, Labs, Tool-Labs, Traffic, HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3184861 (zhuyifei1999) >>! In T102367#3021211, @scfc wrote: > Is there any resistance to redirecting `GET` requests from `http` to `https` at the pro...
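The "MediaWiki exceptions and fatals per minute" check that flapped earlier (CRITICAL at 01:56:33, RECOVERY at 02:14:33) is a percentage-over-threshold check against recent Graphite datapoints: it went critical with 90.00% of samples above 50 fatals/min and recovered once fewer than 70.00% were above 25/min. A rough Python sketch of that kind of decision follows; the sample values and the 70% trigger are made up for illustration and this is not the actual check_graphite implementation.

    def percent_above(datapoints, limit):
        """Fraction of non-null datapoints strictly above a limit, as a percentage."""
        valid = [p for p in datapoints if p is not None]
        if not valid:
            return 0.0
        return 100.0 * sum(1 for p in valid if p > limit) / len(valid)

    # Illustrative per-minute fatal counts, not real data from graphite1001.
    samples = [80, 75, 90, 60, 55, 70, 85, 40, 95, 66]
    critical = percent_above(samples, 50.0) >= 70.0   # critical threshold [50.0]
    warning = percent_above(samples, 25.0) >= 70.0    # warning threshold [25.0], as in the recovery text
    print("CRITICAL" if critical else ("WARNING" if warning else "OK"))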
[04:52:23] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1343.30 Read Requests/Sec=481.10 Write Requests/Sec=12.00 KBytes Read/Sec=39796.80 KBytes_Written/Sec=68.40
[04:58:23] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=129.50 Read Requests/Sec=269.60 Write Requests/Sec=2.60 KBytes Read/Sec=5021.60 KBytes_Written/Sec=385.60
[07:36:33] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 127.42 seconds
[08:00:47] PROBLEM - MariaDB Slave SQL: s6 on db1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:01:37] RECOVERY - MariaDB Slave SQL: s6 on db1037 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:03:57] PROBLEM - MariaDB Slave IO: s6 on db1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:04:18] Checking
[08:04:33] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[08:04:46] RECOVERY - MariaDB Slave IO: s6 on db1037 is OK: OK slave_io_state Slave_IO_Running: Yes
[08:05:57] storage and BBU look good
[08:06:06] how is that even an allowed query?
[08:07:31] and why ruwiki in particular?
[08:08:37] that query is just mad
[08:09:33] I do not know why long-running queries are not killed - it has the watchdog
[08:09:51] is this an ores cron of some type or kicked off by user interaction, I wonder
[08:10:19] oh, I know
[08:10:20] hey, what's wrong with ores?
[08:10:30] because it starts with (SELECT
[08:10:47] and it checks ^\ *SELECT
[08:10:54] Can this be somehow related to: https://phabricator.wikimedia.org/T134976 ?
[08:11:00] Amir1: why do you think something is wrong with ORES?
[08:11:21] hello people!
[08:11:29] I'm watching the word "ores" and it came up here
[08:11:36] by apergos :)
[08:11:40] 'ello
[08:12:20] Amir1: some long running queries that are ores-related
[08:13:01] hmm, these look like they come from the ores extension
[08:13:07] let me take a look
[08:15:39] apergos: marostegui: If these long queries only happen in SpecialRecentChangesLinked, I can disable ores in it for now
[08:15:45] Do you want that?
[08:17:08] Amir1: wait a sec, as it looks like jynus said it is fixed, so let's wait until he has confirmed he's done his magic
[08:17:26] okay, keep me posted
[08:17:45] (PS1) Jcrespo: Database slave watchdog- kill all queries, not only the selects [software] - https://gerrit.wikimedia.org/r/348349 (https://phabricator.wikimedia.org/T163063)
[08:18:33] (CR) Jcrespo: [V: 2 C: 2] Database slave watchdog- kill all queries, not only the selects [software] - https://gerrit.wikimedia.org/r/348349 (https://phabricator.wikimedia.org/T163063) (owner: Jcrespo)
[08:24:38] Amir1: if you can try to find out why this query appeared it would be nice: https://phabricator.wikimedia.org/T163063
[08:24:56] The issue has been fixed by jynus by changing the query killer to catch that query
[08:26:45] marostegui: It happened using SpecialRecentChangesLinked. The thing is ChangesList has several subclasses like watchlist and recent changes
[08:27:19] I checked for the important ones so ores doesn't cause issues, but this one doesn't seem to be used at all and it's slow already
[08:27:29] ores put some more pressure on it
[08:28:01] I can remove that functionality without anyone noticing
[08:28:33] Sure, if you think that can be done and prevent future issues like this, that'd be nice! :)
[08:29:02] Okay, I'll make a patch in ten minutes
[08:29:41] I'd rather not deploy anything today though, as the patch Jaime sent fixes the issue for now :)
[08:31:04] Going to log off now. Thanks everyone for the help!
[08:31:11] Let's continue the discussion on that ticket
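The exchange above pins down why the long-running ORES-related query escaped the watchdog: the killer only considered statements matching ^\ *SELECT, and this query began with an opening parenthesis, "(SELECT ...", so it was never a kill candidate; the 08:17:45 patch makes the watchdog kill all long-running queries, not only the selects. A minimal Python sketch of that matching logic follows; the function names, runtime threshold, and sample query text are illustrative assumptions, not the actual watchdog code in operations/software.

    import re

    # Old behaviour (assumed for illustration): only plain SELECTs were kill candidates.
    OLD_PATTERN = re.compile(r'^\ *SELECT', re.IGNORECASE)

    def should_kill_old(query, runtime_s, threshold_s=300):
        return runtime_s > threshold_s and OLD_PATTERN.match(query) is not None

    def should_kill_new(query, runtime_s, threshold_s=300):
        # After the change: any query over the runtime threshold is killed,
        # regardless of how its text starts.
        return runtime_s > threshold_s

    # Hypothetical stand-in for the query from T163063, not the real statement.
    slow_query = "(SELECT rc_id FROM recentchanges ...) UNION (SELECT ...)"
    print(should_kill_old(slow_query, 900))   # False: the leading "(" defeats the regex
    print(should_kill_new(slow_query, 900))   # True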
[10:34:35] Operations, Commons, media-storage: More missing 'original' files on Commons - https://phabricator.wikimedia.org/T163068#3185065 (Revent)
[12:32:03] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 1701 MB (3% inode=76%): /srv/deployment/ocg/output 10513 MB (5% inode=96%)
[14:54:53] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:22:29] there was a peak in OCG requests plus ocg1003 has a smaller lvm partition for /srv/deployment/ocg/output :(
[15:22:53] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[15:28:35] mmmm there is space in the pv, the lvm volume is smaller than on the other hosts
[15:30:20] reading https://phabricator.wikimedia.org/T162780
[15:33:30] options are 1) increase the lv and the fs
[15:33:40] 2) just clean up some files
[15:33:58] I'd go for cleaning up old pdfs with mtime +3
[15:35:25] !log executing sudo find -name *.pdf -mtime +3 -exec rm {} \; on ocg1003's /srv/deployment/ocg/output to clean up some disk space - T162780
[15:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:35] T162780: ocg1003 partitions are severely misconfigured - https://phabricator.wikimedia.org/T162780
[15:37:52] ah also / is full
[15:37:55] -.-
[15:44:43] !log restart ocg on ocg1003 to clean up deleted files in lsof
[15:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:28] ah now I found it, the "post-mortem" dir in /srv is full (since it is mounted under /)
[15:50:13] RECOVERY - Disk space on ocg1003 is OK: DISK OK
[15:50:47] didn't do anything more than restarting
[15:54:12] looks good now, but poor OCG is definitely not healthy lately :)
[15:54:21] * elukey brb again
[17:19:41] <_joe_> elukey ftr, I just created the lvm and I left space for growth
[17:20:13] <_joe_> since I wasn't sure we needed to move post-mortem too
[17:20:26] <_joe_> btw post-mortems can be removed almost safely
[17:40:27] thanks _joe_!
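The cleanup logged at 15:35 was a one-off find over /srv/deployment/ocg/output removing PDFs older than three days. An equivalent sketch in Python follows; the directory path and three-day cutoff mirror the !log entry, while the function name and dry-run default are illustrative assumptions.

    import os
    import time

    OUTPUT_DIR = "/srv/deployment/ocg/output"   # from the !log entry
    MAX_AGE_S = 3 * 24 * 3600                    # "-mtime +3": older than three days

    def cleanup(root=OUTPUT_DIR, max_age_s=MAX_AGE_S, dry_run=True):
        """Remove *.pdf files whose mtime is older than the cutoff."""
        cutoff = time.time() - max_age_s
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if not name.endswith(".pdf"):
                    continue
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < cutoff:
                    print("would remove" if dry_run else "removing", path)
                    if not dry_run:
                        os.remove(path)

    # cleanup(dry_run=True)  # list candidates first; flip to False to actually delete

Note that deleting files still held open by a process does not free the space until the process closes them, which is why the 15:44 restart of ocg was needed to reclaim the "deleted files in lsof".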
[18:02:53] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:03:43] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:55:43] PROBLEM - HTTPS on ms1001 is CRITICAL: SSL CRITICAL - Certificate download.wikimedia.org valid until 2017-04-19 18:55:12 +0000 (expires in 2 days)
[18:56:43] PROBLEM - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.178 and port 9042: Connection refused
[18:56:53] PROBLEM - Check systemd state on restbase2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:57:03] PROBLEM - cassandra-c service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[18:57:03] PROBLEM - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:07:05] this one is surely an OOM due to the tombstones --^
[19:08:50] yep seems so (Cc: urandom o/)
[19:08:53] RECOVERY - Check systemd state on restbase2007 is OK: OK - running: The system is fully operational
[19:09:03] RECOVERY - cassandra-c service on restbase2007 is OK: OK - cassandra-c is active
[19:09:11] (just ran puppet)
[19:12:13] RECOVERY - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-c valid until 2017-09-12 15:35:55 +0000 (expires in 148 days)
[19:12:43] RECOVERY - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.178 port 9042
[19:51:53] Operations, MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), MW-1.29-release-notes, Patch-For-Review, Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3185399 (...
[20:15:53] PROBLEM - zotero on sca1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:16:43] RECOVERY - zotero on sca1003 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.009 second response time
[20:56:03] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 770802 msg: ocg_render_job_queue 0 msg
[20:56:03] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 769385 msg: ocg_render_job_queue 0 msg
[21:04:47] (PS1) Dereckson: Enable responsive references on el.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/348400 (https://phabricator.wikimedia.org/T163074)
[21:37:03] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[21:38:04] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2907479 keys, up 24 days 5 hours - replication_delay is 0
[22:51:43] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[22:52:43] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1002 is OK: OK ferm input default policy is set
[23:33:53] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues