[00:03:56] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:04:46] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.002 second response time [00:15:04] mh another case of T151851 I think, checking now [00:15:05] T151851: Thumbor resource consumption is spiky - https://phabricator.wikimedia.org/T151851 [00:24:39] silenced until monday, we might as well remove load from it since it isin't in production [00:25:08] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Urbanecm: Metro Mad Linea 7.png file half-disappeared - it can't be used - https://phabricator.wikimedia.org/T153540#2883589 (10Tgr) Yay for mirrors! http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/commons/c/cf/Metro_Mad_Linea_7.png... [00:45:03] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:52:43] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Urbanecm: Metro Mad Linea 7.png file half-disappeared - it can't be used - https://phabricator.wikimedia.org/T153540#2884363 (10Tgr) Apparently if the file had been uploaded a year later, we would be out of luck: {T53001} Filed {T153565} abo... [00:55:07] 06Operations, 10Datasets-Archiving, 10Datasets-General-or-Unknown: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001#571615 (10Tgr) Is this specifically about the tarballs or is http://ftpmirror.your.org/pub/wikimedia/images/ similarly affected? Given our tend... [01:15:03] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [01:23:07] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884411 (10Smalyshev) We previously discussed this and the tradition for entity identifiers is to use http. E.g. such commonly known prefi... [01:25:03] PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:30:13] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:30:23] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:30:23] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:30:33] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:30:33] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:31:03] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:31:23] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [01:31:23] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [01:31:43] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:31:53] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [01:31:53] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:31:54] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [01:32:00] mutante: ^ [01:32:33] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:53:03] RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [02:18:51] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 06m 39s) [02:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:12] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Dec 18 02:23:11 UTC 2016 (duration 4m 20s) [02:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:23] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:30:47] I see l10n is back to normal [02:36:18] yeah [02:36:40] the sync-l10n scap package update was deployed to the servers [02:41:11] What was wrong with it? 
[02:52:23] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [02:53:59] there was a thing [02:54:06] it broke stuff [02:54:27] so it broke l10nupdate-1 [02:55:41] Zppix, https://phabricator.wikimedia.org/T152390 [03:16:07] There was a thing it broke stuff so it broke... thats the best thing ive heard on irc Krenair [03:23:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.06 seconds [03:24:03] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:29:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 161.21 seconds [03:33:03] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz] [03:35:45] Is Mathoid having issues? [03:35:50] Getting... [03:36:59] getting...? [03:40:05] Yvette? [03:47:48] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884514 (10MZMcBride) >>! In T153563#2884409, @Smalyshev wrote: > We previously discussed this and the tradition for entity identifiers is... [03:48:02] looks like Yvette's thing is in #mediawiki [03:49:10] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884515 (10MZMcBride) In this specific context, the query service is outputting URLs (yes, URLs, right here in River City) such as this on... [03:49:11] > Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "/mathoid/local/v1/":): {\displaystyle \left(\frac {dG}{d\xi}\right)_{T,p} = RT \ln \left(\frac {Q_\mathrm{r}}{K_\mathrm{eq}}\right)~} [03:49:22] Sorry, got distracted with that other task. [03:50:32] From https://en.wikipedia.org/w/index.php?title=Chemical_equilibrium&action=history [03:51:27] https://en.wikipedia.org/wiki/Chemical_equilibrium#Addition_of_reactants_or_products [03:52:03] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:52:04] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [04:01:03] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [04:04:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [04:05:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [04:07:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
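[editor's note: for readability, the expression from the Mathoid render failure quoted at 03:49 is repeated below as a plain LaTeX block; it is copied from the error text (the reaction free-energy change written in terms of the reaction quotient Q_r and the equilibrium constant K_eq), not re-derived.]

    \left(\frac{dG}{d\xi}\right)_{T,p} = RT \ln\left(\frac{Q_\mathrm{r}}{K_\mathrm{eq}}\right)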
[04:08:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [04:20:03] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [04:39:53] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=746.30 Read Requests/Sec=248.10 Write Requests/Sec=0.90 KBytes Read/Sec=31713.60 KBytes_Written/Sec=20.40 [04:50:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [04:50:53] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=4.10 Read Requests/Sec=0.10 Write Requests/Sec=0.30 KBytes Read/Sec=0.40 KBytes_Written/Sec=10.40 [04:55:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [04:56:54] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884289 (10Esc3300) For the reasons WMF uses https shouldn't be make sure that users don't access http ? If people are given sufficient... [04:58:48] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2884546 (10Tgr) After thinking more about this and looking at the code I am getting more and more confused about what exactly we are trying to do. Medi... [05:34:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [05:35:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [05:37:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:38:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [06:02:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [1000.0] [06:06:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [06:19:03] https://meta.wikimedia.org/wiki/Special:GlobalUserRights --> [WFYqGgpAAEQAAjAVLSkAAAAW] 2016-12-18 06:18:02: Fatal exception of type MWException [06:19:07] Logged in. [06:25:56] PROBLEM - MariaDB disk space on db1047 is CRITICAL: DISK CRITICAL - free space: / 398 MB (5% inode=54%) [06:26:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] [06:26:56] RECOVERY - MariaDB disk space on db1047 is OK: DISK OK [06:28:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:29:07] got paged for db1047 space, can takr a look on 20 min or so [06:42:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [06:42:33] PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[tmux],Package[pv] [06:45:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
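[editor's note: for the db1047 disk-space page at 06:25/06:29, first-pass triage usually looks something like the read-only sketch below; the directories are generic guesses rather than db1047's actual layout, and in this case the alert cleared on its own about a minute later.]

    # on the alerting host (db1047 here); read-only checks only
    df -h /                                                            # confirm which filesystem is nearly full
    sudo du -xsh /var/log /srv/* 2>/dev/null | sort -h | tail -n 10    # biggest consumers on that filesystem
    ls -lSh /var/log | head                                            # unrotated or oversized logs are a common culprit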
[06:47:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:48:42] recovered by itself shortly afterwards heh [06:56:28] (03PS2) 10Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/326892 [06:56:30] (03PS1) 10Tim Landscheidt: Tools: Generate node sets dynamically [puppet] - 10https://gerrit.wikimedia.org/r/328030 [06:59:47] (03CR) 10Tim Landscheidt: [] "Tested on Toolsbeta." [puppet] - 10https://gerrit.wikimedia.org/r/328030 (owner: 10Tim Landscheidt) [07:02:54] 07Puppet, 06Labs, 10Tool-Labs: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2884609 (10scfc) [07:04:49] I filed the exception as https://phabricator.wikimedia.org/T153578 [07:05:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [07:06:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [07:07:09] !log force git-fat pull for twcs on restbase1* to restore twcs jar [07:07:13] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [07:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:24] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.002 second response time on 10.64.32.207 port 9042 [07:07:33] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [07:07:33] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [07:07:33] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [07:07:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [07:07:34] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [07:10:33] RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:12:04] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [07:12:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [07:12:34] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Urbanecm: Metro Mad Linea 7.png file half-disappeared - it can't be used - https://phabricator.wikimedia.org/T153540#2884634 (10zhuyifei1999) >>! In T153540#2884321, @Tgr wrote: > Yay for mirrors! http://ftpmirror.your.org/pub/wikimedia/image... 
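[editor's note: a hedged sketch of the 07:07 "force git-fat pull" entry above. The checkout path is an assumption, and in practice the command would be fanned out to every restbase1* host (e.g. with cumin or salt) rather than typed by hand on each one.]

    cd /srv/deployment/restbase/deploy   # assumed location of the RESTBase deploy checkout
    git fat pull                         # re-download the git-fat-managed binaries, restoring the missing twcs jar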
[07:13:13] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [07:13:24] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.004 second response time on 10.64.0.119 port 9042 [07:52:42] (03PS1) 10Tim Landscheidt: WIP [puppet] - 10https://gerrit.wikimedia.org/r/328031 (https://phabricator.wikimedia.org/T150726) [08:26:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:26:23] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:26:24] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [08:27:03] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [08:27:03] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:27:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [08:28:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:28:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:30:03] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:03] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:34:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [08:35:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [08:36:23] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [08:36:24] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.003 second response time on 10.64.32.207 port 9042 [08:39:33] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:39:33] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:39:34] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:39:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:39:34] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:39:43] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:40:13] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:40:23] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [08:40:24] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [08:40:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:41:03] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:41:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:42:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:42:04] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [08:42:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [08:43:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [08:43:03] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [08:43:06] !log forced puppet on restbase1009 to bring up cassandra-a (stopped due to OOM issues) [08:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:23] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [08:43:23] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.003 second response time on 10.64.0.119 port 9042 [08:43:33] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [08:43:33] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [08:43:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [08:43:33] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [08:43:33] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [08:44:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [08:44:33] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [08:46:33] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:46:33] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:46:33] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:46:33] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:46:33] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:47:03] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:47:13] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:47:14] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect 
to address 10.64.48.120 and port 9042: Connection refused [08:47:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:47:24] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [08:47:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:48:33] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [08:49:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [08:49:38] !log forced restart for cassandra-a on restbase1009 (still OOMs) [08:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:50] !log forced restart of cassandra-b/c on restbase1013 (b not really needed, my error) [08:51:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [08:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [08:52:24] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.002 second response time on 10.64.32.207 port 9042 [08:52:33] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [08:52:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [08:52:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [08:52:43] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [08:52:43] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [08:53:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [08:57:03] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [08:57:20] !log forced restart of cassandra-c on restbase1011 [08:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [08:58:03] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [08:58:13] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [08:58:23] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.005 second response time on 10.64.0.119 port 9042 [09:00:03] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
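[editor's note: the flapping above follows a repeating pattern — a Cassandra instance dies from an OOM, its systemd unit goes to "failed", and it is brought back either by a forced puppet run (08:43) or a manual restart (08:49, 08:51). A generic shell sketch of that triage/restart cycle follows; the host and unit names are taken from the log, everything else is illustrative rather than the exact commands used.]

    # on e.g. restbase1009, for the failed cassandra-a instance
    systemctl status cassandra-a                                                          # confirm the unit is in the "failed" state
    journalctl -u cassandra-a --since '1 hour ago' | grep -i -e OutOfMemory -e killed     # find the OOM that stopped it
    systemctl restart cassandra-a                                                         # manual restart, as in the !log entries
    puppet agent --test                                                                   # alternatively, a puppet run ensures the service is started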
[09:00:14] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [09:00:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:00:24] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [09:00:33] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:00:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:01:03] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [09:01:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [09:03:03] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:05:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [09:06:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [09:06:43] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [09:07:24] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [09:15:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [09:15:03] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [09:15:33] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [09:16:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [09:24:39] urandom: ---^ [09:24:56] a big cassandra instaces flapping event [09:46:51] * elukey afk! [10:05:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [10:05:43] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:06:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [10:06:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:06:54] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Urbanecm) [10:07:03] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [10:07:13] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:07:34] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [10:07:43] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [10:07:46] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Urbanecm) [10:08:01] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Urbanecm) [10:12:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [10:12:53] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:14:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:14:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [10:17:13] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [10:18:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [10:20:33] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:20:34] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:20:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:20:34] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:20:43] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:21:13] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:21:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [10:21:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:21:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [10:22:04] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [10:22:13] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:35:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [10:36:04] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [10:36:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [10:36:33] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [10:36:43] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [10:36:43] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [10:36:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [10:36:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [10:36:53] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [10:42:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [10:42:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [10:43:14] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [10:43:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042 [10:48:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [10:48:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [10:48:14] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [10:48:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [10:54:44] (03PS1) 10Urbanecm: Add ftpmirror.your.org to whitelist of commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328036 (https://phabricator.wikimedia.org/T153569) [11:00:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [11:00:53] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:01:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [11:01:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:01:23] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Urbanecm: Metro Mad Linea 7.png file half-disappeared - it can't be used - https://phabricator.wikimedia.org/T153540#2884869 (10Urbanecm) 05Open>03Resolved a:03zhuyifei1999 Thanks for resolving, seems it works, marking as resolved. 
[11:02:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:02:04] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:02:13] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:02:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:03:14] (03PS1) 10Zhuyifei1999: videoscaler: Reduce runners_transcode from 5 to 2 [puppet] - 10https://gerrit.wikimedia.org/r/328037 (https://phabricator.wikimedia.org/T153488) [11:05:03] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [11:05:07] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, 13Patch-For-Review: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2884876 (10zhuyifei1999) 500% load. Seems that each runner cause 100% load. [11:05:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [11:06:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.001 second response time on 10.64.32.207 port 9042 [11:06:53] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [11:17:04] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [11:17:13] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [11:18:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [11:18:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [11:22:03] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:22:04] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:22:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [11:22:14] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:22:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:22:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [11:23:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:23:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:33:33] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:36:03] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [11:36:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [11:36:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [11:37:03] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [11:44:33] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:44:34] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:44:34] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:44:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:44:34] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:45:13] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:45:14] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:45:23] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:45:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [11:45:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [11:46:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:46:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:47:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:47:04] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [11:47:13] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [11:47:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:48:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [11:48:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [12:02:33] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:05:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [12:06:03] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [12:06:23] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [12:06:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [12:06:43] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [12:06:43] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [12:06:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [12:06:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:06:44] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:08:22] !log disabling puppet on restbase1009, restbase1011 and restbase1013 due to cassandra OOMs [12:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:23] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:09:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:09:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [12:09:34] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:09:34] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:09:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:09:43] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:09:43] PROBLEM - mobileapps endpoints health 
on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:10:04] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [12:10:14] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [12:10:14] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:11:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:11:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [12:13:23] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:16:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [12:17:03] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [12:17:23] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [12:17:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [12:17:38] !log started back cassandra restbase1013-c [12:17:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [12:17:43] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [12:17:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:17:43] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [12:17:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [12:27:29] !log started back cassandra restbase1011-c [12:27:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [12:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:13] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [12:29:34] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.001 second response time on 10.64.0.119 port 9042 [12:37:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [12:37:13] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [12:37:33] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [12:38:12] !log started back cassandra restbase1009-a [12:38:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [12:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:13] PROBLEM - puppet last run on mc1017 is 
CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:13] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:34:44] \o [13:35:13] not sure if this is where to ask, but I've been getting tons of spam irc messages [13:35:48] and I don't know how to stop it [13:38:24] so have most people [13:38:30] #freenode perhaps [13:38:50] nothing we can do? :( [13:38:55] no [13:39:03] you can set umode +R [13:39:39] ok, thanks :( [14:16:33] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [14:20:33] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:37:14] elukey: around ? [15:37:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [15:37:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:37:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [15:37:23] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:37:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:37:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [15:38:13] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:38:24] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:55:53] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:43] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [16:36:23] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:45:14] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [16:45:23] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [16:45:41] !log starting cassandra instances on restbase1009, restbase1011 and restbase1013 (one at the time) - T153588 [16:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:54] T153588: Cassandra OOMs on restbase1009-a, restbase1011-c and restbase1013-c - https://phabricator.wikimedia.org/T153588 [16:46:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [16:46:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 267 days) [16:46:38] urandom: you there? 
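[editor's note: the 13:39 advice above refers to the freenode user mode that rejects private messages from users not identified with NickServ; from most IRC clients it is set with the line below, substituting your own nick.]

    /mode YourNick +R    # drop PMs from unregistered users (the spam source in question)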
[16:49:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [16:49:28] ok 1009 is up, proceeding with 1011 [16:49:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [16:50:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042 [16:50:43] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days) [16:52:43] PROBLEM - puppet last run on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:54:33] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:54:34] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [16:55:12] nope, 1011 doesn't want to come up [16:55:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [16:55:14] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:55:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:55:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [16:55:24] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:55:42] and now again 1009 [16:55:43] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:55:44] sigh [16:55:48] mobrovac: you there? [16:56:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [16:56:33] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days) [16:56:34] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042 [16:57:19] ah saw your email [16:57:35] ops people: Marko is commuting and will take care of the instances in ~25 mins [17:00:33] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:01:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [17:01:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:34] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [17:04:23] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:19:54] (03CR) 10Dereckson: [] "The current '5' value has been introduced by commit 95b52b9b48." 
[puppet] - 10https://gerrit.wikimedia.org/r/328037 (https://phabricator.wikimedia.org/T153488) (owner: 10Zhuyifei1999) [17:23:43] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:27:33] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:28:24] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [17:28:24] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [17:29:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [17:29:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 267 days) [17:30:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [17:30:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [17:30:43] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days) [17:31:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.001 second response time on 10.64.0.119 port 9042 [18:17:53] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:53] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:53] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:53] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:53] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:18:43] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [18:19:43] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [18:19:44] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [18:19:44] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [18:19:44] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [18:23:45] ^ ? 
[18:28:18] (Draft1) Paladox: Contint: Make sure /mnt/home/jenkins-deploy/tmpfs is mounted before starting MySQL [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450)
[18:28:23] (Draft2) Paladox: Contint: Make sure /mnt/home/jenkins-deploy/tmpfs is mounted before starting MySQL [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450)
[18:32:59] !log Testing
[18:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:15] mobrovac: i wonder if that wasn't a consequence of the node outages, that client library doesn't seem very robust in the face of lost connections
[18:34:49] ah right
[18:34:52] totally possible
[18:35:03] either that or gremlins
[18:35:28] * urandom lights a black candle
[18:35:51] haha
[18:37:13] urandom: trying to update https://commons.wikimedia.org/wiki/User%3AJ_budissin%2FUploads%2FBiH%2F2016_December_11-20 produces an error in cass
[18:37:35] which isn't a surprise seeing the size of the page
[18:38:59] (PS3) Paladox: Contint: Notify Service mysql to restart [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450)
[18:39:21] (PS4) Paladox: Contint: Notify Service mysql to restart [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450)
[18:40:28] mobrovac: timeout?
[18:41:43] not clear from the log i have, but smells like it
[18:41:48] chrome can't even load the page for me
[18:42:12] it's still trying here :)
[18:42:17] haha
[18:42:54] heh, i was just asked if i wanted to kill it or wait
[18:43:29] * urandom is in a killing mood
[18:43:42] i waited twice then gave up on it
[18:44:02] ok here's another candidate for blacklisting - https://commons.wikimedia.org/w/index.php?title=User:OgreBot/Uploads_by_new_users&action=history
[18:44:14] editing a huge log every hour
[18:46:10] * mobrovac is in a blacklisting mood
[18:46:22] whack-a-mole
[18:47:23] yup
[18:50:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:51:44] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[19:12:23] PROBLEM - puppet last run on db1083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:38:15] Operations, Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841#2885268 (Aklapper) >>! In T146841#2728793, @faidon wrote: > This is most likely related to Yahoo's DMARC policy, cf. T66818....
[19:40:23] RECOVERY - puppet last run on db1083 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[20:25:43] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
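The two pages being discussed (a user's very large upload gallery and OgreBot's hourly-edited upload log) are too big to re-render before the request times out, hence the talk of blacklisting them. As a purely hypothetical illustration of that kind of filter, not RESTBase's actual blacklist mechanism, a title blacklist can be little more than a list of regexes consulted before a re-render is attempted; the patterns and function names below are invented for the example.

# hypothetical sketch of a re-render title blacklist; patterns and names are illustrative only
import re

RERENDER_BLACKLIST = [
    # drawn from the titles mentioned above (URL-style underscores assumed)
    re.compile(r"^User:J_budissin/Uploads/"),
    re.compile(r"^User:OgreBot/Uploads_by_new_users"),
]

def should_rerender(title):
    """Return False for titles known to blow the re-render timeout."""
    return not any(pattern.match(title) for pattern in RERENDER_BLACKLIST)

if __name__ == "__main__":
    print(should_rerender("User:OgreBot/Uploads_by_new_users"))  # False: skip it
    print(should_rerender("Main_Page"))                          # True: render as usual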
[20:39:53] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 18 failures. Last run 2 minutes ago with 18 failures. Failed resources (up to 3 shown): Exec[ip addr add 2620:0:860:102:10:192:16:30/64 dev eth0],Service[ferm],Service[diamond],Service[prometheus-node-exporter]
[20:47:50] (PS1) Mobrovac: Conftool: Add restbase101[678] and restbase201[012] [puppet] - https://gerrit.wikimedia.org/r/328059 (https://phabricator.wikimedia.org/T151086)
[20:49:43] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[21:01:53] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[21:07:53] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[21:29:23] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:29:53] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[21:57:23] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[22:16:59] !log ariel@tin Starting deploy [dumps/dumps@2a35e23]: fix checkpoint prefetch jobs
[22:17:02] !log ariel@tin Finished deploy [dumps/dumps@2a35e23]: fix checkpoint prefetch jobs (duration: 00m 02s)
[22:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:29] Operations, GlobalRename, MediaWiki-extensions-CentralAuth: Rename user TextworkerBot to VladiBot - https://phabricator.wikimedia.org/T153602#2885442 (Peachey88)
[22:32:21] (PS1) ArielGlenn: if one wiki can't be monitored, don't except out, do the rest [dumps] - https://gerrit.wikimedia.org/r/328108
[22:33:14] (CR) ArielGlenn: [C: 2] if one wiki can't be monitored, don't except out, do the rest [dumps] - https://gerrit.wikimedia.org/r/328108 (owner: ArielGlenn)
[22:34:00] !log ariel@tin Starting deploy [dumps/dumps@92946f0]: make monitoring more robust
[22:34:02] !log ariel@tin Finished deploy [dumps/dumps@92946f0]: make monitoring more robust (duration: 00m 01s)
[22:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
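The final deploy ("make monitoring more robust", change 328108) is about the pattern its commit message names: if one wiki can't be monitored, skip it and continue with the rest instead of letting a single exception abort the whole run. The sketch below is a hypothetical illustration of that pattern, not the actual dumps monitoring code; check_wiki and the wiki list are invented stand-ins.

# hypothetical sketch of per-wiki error isolation; not the actual dumps monitoring code
import logging

def check_wiki(wiki):
    """Invented stand-in for whatever per-wiki status gathering the monitor does."""
    if wiki == "brokenwiki":
        raise RuntimeError("status file missing")
    return "ok"

def monitor_all(wikis):
    """Collect status for every wiki, skipping the ones that fail rather than bailing out."""
    results = {}
    for wiki in wikis:
        try:
            results[wiki] = check_wiki(wiki)
        except Exception:
            # one broken wiki should not stop the rest from being reported
            logging.exception("could not monitor %s, continuing with the rest", wiki)
    return results

if __name__ == "__main__":
    # 'brokenwiki' is logged and skipped; the other two still get reported
    print(monitor_all(["enwiki", "brokenwiki", "commonswiki"]))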