[00:17:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[00:20:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[00:47:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[00:47:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[00:50:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:50:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[00:51:12] <matt_flaschen>	 greg-g, robh, et al, should we be using terbium right now, or is there a different one in codfw?
[01:01:10] <wikibugs___>	 06Operations, 10Ops-Access-Requests: Request to add phuedx to "researchers" group - https://phabricator.wikimedia.org/T164060#3223835 (10Tbayer) @Dzahn BTW, if you could take a look at https://wikitech.wikimedia.org/wiki/SWAP#Access and check whether anything important is missing there (from the Ops perspectiv...
[01:04:20] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db1035 is OK: OK slave_sql_lag Replication lag: 0.09 seconds
[01:04:37] <volans>	 matt_flaschen: wasat.codfw.wmnet
[01:09:27] <matt_flaschen>	 volans, thanks
[01:17:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[01:20:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[01:40:58] <wikibugs___>	 (03PS6) 10BryanDavis: Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott)
[01:42:50] <wikibugs>	 (03CR) 10BryanDavis: "I started cleaning up the requests api usage to make it more readable and kind of went crazy with the changes by the end. I think this is " [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott)
[01:47:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[01:47:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[01:50:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:50:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[02:17:20] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[02:17:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[02:20:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:20:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[02:43:20] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db1044 is OK: OK slave_sql_lag Replication lag: 0.51 seconds
[02:47:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[02:47:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[02:50:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:50:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[03:17:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[03:17:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[03:20:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:20:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[03:26:27] <wikibugs>	 (03CR) 10Krinkle: [C: 031] Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott)
[03:28:57] <wikibugs___>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#3223942 (10Krinkle)
[03:35:41] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#3223950 (10Krinkle)
[03:43:12] <wikibugs___>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#3223956 (10Krinkle)
[03:47:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[03:47:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[03:50:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:50:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[03:54:32] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1271503 (10Krinkle)
[03:54:59] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1283075 (10Krinkle)
[03:55:06] <wikibugs___>	 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 05Goal: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) - https://phabricator.wikimedia.org/T10217#3223977 (10Krinkle)
[03:55:28] <wikibugs___>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1287631 (10Krinkle)
[03:55:58] <wikibugs___>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1289110 (10Krinkle)
[03:56:07] <wikibugs>	 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 05Goal: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) - https://phabricator.wikimedia.org/T10217#135543 (10Krinkle) 05Open>03stalled Stalled since renaming (T21986) i...
[03:56:22] <wikibugs___>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1293798 (10Krinkle)
[03:56:29] <wikibugs>	 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 05Goal: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) - https://phabricator.wikimedia.org/T10217#3224001 (10Krinkle)
[03:56:50] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1293798 (10Krinkle)
[03:56:57] <wikibugs___>	 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 05Goal: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) - https://phabricator.wikimedia.org/T10217#135543 (10Krinkle)
[03:57:27] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1293798 (10Krinkle)
[03:57:38] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1293798 (10Krinkle)
[04:09:00] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=369.90 Read Requests/Sec=296.00 Write Requests/Sec=745.80 KBytes Read/Sec=37863.60 KBytes_Written/Sec=6269.60
[04:16:37] <wikibugs___>	 (03PS1) 10Krinkle: private: Update example [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351100
[04:17:00] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=27.40 Read Requests/Sec=4.30 Write Requests/Sec=7.50 KBytes Read/Sec=148.80 KBytes_Written/Sec=155.60
[04:17:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[04:19:09] <wikibugs___>	 (03CR) 10Krinkle: [C: 032] private: Update example [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351100 (owner: 10Krinkle)
[04:20:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:20:18] <wikibugs>	 (03Merged) 10jenkins-bot: private: Update example [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351100 (owner: 10Krinkle)
[04:20:27] <wikibugs___>	 (03CR) 10jenkins-bot: private: Update example [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351100 (owner: 10Krinkle)
[05:13:18] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#3224068 (10Liuxinyu970226)
[05:13:32] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1380165 (10Liuxinyu970226)
[05:14:02] <wikibugs___>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1424592 (10Liuxinyu970226)
[05:14:23] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1424592 (10Liuxinyu970226) 05Open>03stalled per krinkle
[05:14:44] <wikibugs___>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#3224081 (10Liuxinyu970226)
[05:14:58] <wikibugs___>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1424592 (10Liuxinyu970226)
[05:15:04] <wikibugs___>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1424592 (10Liuxinyu970226)
[05:15:07] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1424592 (10Liuxinyu970226)
[05:17:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[05:20:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:22:45] <wikibugs___>	 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#3224095 (10Xqt)
[05:22:47] <wikibugs>	 06Operations, 10Pywikibot-core, 10Traffic, 07HTTPS, and 2 others: Prepare pywikibot for http -> https switch in entity uri - https://phabricator.wikimedia.org/T159956#3224094 (10Xqt) 05Open>03Resolved
[06:17:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[06:17:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[06:20:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:20:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[06:47:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[06:47:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[06:50:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:50:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[07:07:00] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[07:10:00] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[07:17:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[07:17:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[07:20:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:20:22] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[07:34:12] <elukey>	 CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file
[07:34:19] <elukey>	 a bit weird..
[07:34:22] <elukey>	 urandom: ---^
[07:36:29] <elukey>	 so yesterday restbase1009-a failed for OOM, then now the CommitLog 
[07:37:26] <elukey>	 so the file is /srv/cassandra-a/commitlog/CommitLog-5-1490738321543.log, but it is empty..
[07:37:47] <elukey>	 maybe some weird inconsistency created when it failed
[07:39:51] <elukey>	 ah yes the mtime seems to be more or less the one of the OOM
[07:42:47] <elukey>	 found https://issues.apache.org/jira/browse/CASSANDRA-11595 that is a similar issue
[07:42:59] <elukey>	 Cassandra refuses to start to be on the safe side
[07:43:33] <elukey>	 it creates the commitlog file first, then it adds data to it.. if anything happens in the meantime, we have this problem
[07:45:46] <elukey>	 !log deleted /srv/cassandra-a/commitlog/CommitLog-5-1490738321543.log from restbase1009-a (empty commit log file created before OOM - backup in /home/elukey) 
[07:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[07:46:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[07:46:31] <elukey>	 tailing logs, it is replaying the commit log now
[07:46:41] <elukey>	 aaand handshaked with the other nodes
[07:46:50] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.036 second response time on 10.64.48.120 port 9042
[07:47:30] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 135 days)
[07:57:37] <elukey>	 checking graphs and https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-read-repair?panelId=28&fullscreen&orgId=1 showed some read repair happened for 1009-a
[08:06:58] <elukey>	 there is only one UNKNOWN in icinga at the moment for pending compactions but it should resolve in a bit (also really weird, it does only alarm for missing datapoints) 
[08:11:40] <_joe_>	 elukey: means there are no data in graphite for the span of time
[08:12:24] <_joe_>	 elukey: yes, check_graphite could be better
[08:12:42] <elukey>	 :)
[08:14:12] <elukey>	 will re-check later, everything looks good from https://grafana.wikimedia.org/dashboard/db/cassandra-restbase-eqiad?orgId=1&from=now-6h&to=now
[08:35:26] <Raymond_>	 watchlist on Commonswiki is broken for me (blank screen): https://commons.wikimedia.org/wiki/Special:Watchlist
[08:37:00] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[08:40:00] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[10:32:50] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[10:33:50] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[11:31:20] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 590352
[13:10:50] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[13:11:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:11:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[13:11:30] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:17:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[13:17:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[13:18:30] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 135 days)
[13:18:50] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.037 second response time on 10.64.48.120 port 9042
[13:22:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200)
[13:23:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[13:24:15] <elukey>	 this one is the "usual" java.lang.OutOfMemoryError: Java heap space that occurs for restbase-async
[13:24:30] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:24:51] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[13:25:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:25:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[13:26:18] <elukey>	 yeah there seems to be a problem with a link that generates too many tombstones collected 
[13:26:22] <elukey>	 urandom: ---^
[13:27:56] <elukey>	 this one might be a good use case to apply https://phabricator.wikimedia.org/P5165 on restbase1009
[13:28:35] <elukey>	 cc mobrovac too
[13:32:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[13:32:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[13:32:21] <elukey>	 (just ran puppet)
[13:32:58] <elukey>	 (going afk for a bit, will re-check later on)
[13:33:30] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 135 days)
[13:33:50] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.036 second response time on 10.64.48.120 port 9042
[13:37:00] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[13:40:00] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[13:41:50] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[13:42:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:42:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[13:42:30] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:47:11] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[13:47:21] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[13:48:30] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 135 days)
[13:48:51] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.036 second response time on 10.64.48.120 port 9042
[13:52:50] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[13:53:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:53:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[13:53:30] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[14:01:21] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add stage for restarting Redis. [switchdc] - 10https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337)
[14:01:39] <wikibugs___>	 (03CR) 10jerkins-bot: [V: 04-1] Add stage for restarting Redis. [switchdc] - 10https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337) (owner: 10Giuseppe Lavagetto)
[14:02:11] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add stage for restarting Redis. [switchdc] - 10https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337)
[14:09:36] <wikibugs>	 06Operations, 10Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#2962044 (10Shangkuanlc) Hi I am wondering is it a done deal that WMF is decided to set the new node for Asia in Singapore as Denny indicated [[ https://www.facebook.com/groups/wikipediaweekly/permalink/...
[14:17:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[14:17:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[14:18:30] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 135 days)
[14:18:50] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.036 second response time on 10.64.48.120 port 9042
[14:21:30] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[14:21:50] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[14:22:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:22:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[14:22:50] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp2014 is CRITICAL: CRITICAL: expiry mailbox lag is 741382
[14:30:00] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 254.43 seconds
[14:39:06] <wikibugs___>	 (03PS1) 10Giuseppe Lavagetto: Parallelize url fetching [software/service-checker] - 10https://gerrit.wikimedia.org/r/351110
[14:39:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Parallelize url fetching [software/service-checker] - 10https://gerrit.wikimedia.org/r/351110 (owner: 10Giuseppe Lavagetto)
[14:47:20] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[14:47:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[14:48:50] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 7.059 second response time on 10.64.48.120 port 9042
[14:49:20] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 135 days)
[14:52:10] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 662758
[14:54:40] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[14:54:50] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[14:55:20] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:55:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[15:10:20] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp2022 is CRITICAL: CRITICAL: expiry mailbox lag is 588147
[15:17:20] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[15:17:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[15:18:40] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 135 days)
[15:18:50] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.036 second response time on 10.64.48.120 port 9042
[15:21:40] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:21:50] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[15:22:16] <elukey>	 that's not really resolving, I believe that we should try to temporarily lower down the tombstone threshold
[15:23:40] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 135 days)
[15:23:50] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 3.044 second response time on 10.64.48.120 port 9042
[15:24:21] <elukey>	 !log set tombstone_failure_threshold=10000 to restbase1009-a with P5165 on restbase1009-a - T160759
[15:24:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:31] <stashbot>	 T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759
[15:25:22] <elukey>	 now what is should happen is that the request(s) causing OOMs will get a 500
[15:26:29] <elukey>	 from the log it seems working
[15:27:04] <elukey>	 ERROR [SharedPool-Worker-11] 2017-04-30 15:26:22,441 MessageDeliveryTask.java:77 - Scanned over 10001 tombstones in etc..
[15:27:27] <elukey>	 so 80k tombstones were previously requested for a given page
[15:29:50] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[15:30:20] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:30:20] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[15:30:28] <elukey>	 no 10000 is too much
[15:30:40] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:31:19] <elukey>	 running puppet and settings 1000
[15:31:20] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[15:31:34] <urandom>	 elukey: ciao!
[15:31:38] <elukey>	 ciao urandom! 
[15:31:47] <elukey>	 !log set tombstone_failure_threshold=1000 to restbase1009-a with P5165 on restbase1009-a - T160759
[15:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:55] <stashbot>	 T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759
[15:32:20] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[15:32:40] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 135 days)
[15:32:50] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.038 second response time on 10.64.48.120 port 9042
[15:33:58] <urandom>	 elukey: wow, it's blocking all sorts of queries
[15:34:49] <urandom>	 of course, the one i am seeing is mobileapps, and it seems to have a lot of warnings on any given day that aren't causing issues
[15:34:50] <elukey>	 urandom: I can see only one atm that keeps being blocked no?
[15:35:28] <urandom>	 elukey: i'm tailing the log and it's going to fast to see anything other than that one
[15:35:39] <urandom>	 elukey: was 10k not enough to have this one blocked?
[15:35:56] <elukey>	 urandom: I tried but it died anyway for ooms :(
[15:36:04] <elukey>	 so I thought to use the hammer
[15:36:37] <urandom>	 was this current title with all of the logged blocks, being blocked at 10k?
[15:36:45] <elukey>	 yeah
[15:37:00] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[15:37:09] <elukey>	 but maybe hosting 10k in the heap wasn't good anyway
[15:37:17] <elukey>	 so I thought to use 1k
[15:38:19] <elukey>	 urandom: today I had to delete CommitLog-5-1490738321543.log in restbase1009-a's dir since it was a 0 size file blocking cassandra while starting
[15:38:28] <urandom>	 elukey: ok
[15:38:29] <elukey>	 it was created more or less during a OOM
[15:38:43] <elukey>	 I logged everything in sal but I thought to mention
[15:38:55] <urandom>	 elukey: yeah, that shouldn't happen, but when the JVM is in that state, i guess all bets are off
[15:39:18] <elukey>	 yeah
[15:40:00] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[15:40:32] <elukey>	 urandom: can we block that specific request from happening?
[15:41:48] <elukey>	 anyhow, the block seems having a positive effect on metrics
[15:42:55] <urandom>	 elukey: so the short answer is 'yes', the longer answer is that it requires a restbase deploy
[15:43:10] <elukey>	 ah yes right I forgot that bit
[15:44:33] <urandom>	 but yeah, i'm seeing warning for that title all over the place
[15:48:34] <urandom>	 me might want to do that
[15:48:39] * urandom doesn't deploy very often
[15:48:59] <elukey>	 where does it come from? ChangeProp?
[15:49:24] <elukey>	 I am trying to understand why it keeps happening
[15:49:53] <urandom>	 elukey: the requests are coming from ChangeProp, yes
[15:50:50] <elukey>	 (brb)
[15:51:48] <urandom>	 OK, and the restbase deploy has been moved to scap, and i don't think the docs are there yet
[15:51:56] <urandom>	 mobrovac: ping?
[15:57:15] <urandom>	 Pchelolo: ping?
[16:03:50] <urandom>	 elukey: https://github.com/wikimedia/restbase/pull/811
[16:07:42] <urandom>	 elukey: thank you so much for looking into this!
[16:08:46] <elukey>	 urandom: nice! sorry to ping you on a sunday :(
[16:10:00] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[16:10:00] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:10:43] <urandom>	 elukey: it's sunday for you too!
[16:12:05] <Pchelolo>	 here urandom
[16:12:13] <elukey>	 Pchelolo: o/
[16:12:20] <Pchelolo>	 want me to deploy that?
[16:12:46] <urandom>	 Pchelolo: yo!
[16:12:52] <urandom>	 Pchelolo: ummm, ideally
[16:13:00] <Pchelolo>	 kk
[16:13:18] <Pchelolo>	 Our camping trip didn't work out, some bridge on the road collapsed
[16:13:26] <Pchelolo>	 deploying
[16:13:31] <urandom>	 Pchelolo: i was just sitting here trying to decide the balance of things, if it warranted a weekend deploy
[16:14:03] <urandom>	 the answer was 'No' if i was doing it; i'm not sure my own personal comfort level is high enough (need to fix that)
[16:14:09] <urandom>	 Pchelolo: sorry to hear about your trip
[16:14:19] <Pchelolo>	 is it the one that's causing all the alerts?
[16:14:35] <urandom>	 Pchelolo: this morning's alerts, yeah
[16:15:10] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms
[16:15:10] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 39.30 ms
[16:15:15] <urandom>	 Pchelolo: elukey enacted the work-around as a nearer-term fix, and that seems to have done the job
[16:15:50] <wikibugs>	 06Operations, 10Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3224367 (10Aklapper) @Shangkuanlc: That sounds like a question for the [[ https://lists.wikimedia.org/mailman/listinfo/wikimedia-l | wikimedia-l mailing list ]] as the question is out of scope for this...
[16:15:53] <urandom>	 Pchelolo: but there will likely be some collateral damage in the form of background updates (the retention policy) that do not succeed, and would have otherwise
[16:16:23] <urandom>	 Pchelolo: on 1009 anyway
[16:21:49] <logmsgbot>	 !log ppchelko@naos Started deploy [restbase/deploy@4f96ae3]: Blacklist a zhwiki page that's causing issues
[16:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:08] <elukey>	 nice
[16:26:28] <mobrovac>	 urandom: Pchelolo: deploying for blacklists?
[16:26:50] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:27:03] <Pchelolo>	 si mobrovac
[16:27:04] <urandom>	 mobrovac: Pchelolo is
[16:27:24] <Pchelolo>	 I want that automatic blacklisting feature so bad
[16:27:54] <urandom>	 +11
[16:28:18] <urandom>	 that was supposed to be '+1', but the typo actually works better here
[16:28:36] <mobrovac>	 ok, Pchelolo we will have to monitor the situation then because we have those parsoid version bumps in there too
[16:28:52] <mobrovac>	 yeah urandom, +11 make more sense here actually :)
[16:28:55] <Pchelolo>	 nono that's already deployed before
[16:29:02] <Zppix>	 urandom:  hmm  gerrit feature request ?
[16:29:17] <logmsgbot>	 !log ppchelko@naos Finished deploy [restbase/deploy@4f96ae3]: Blacklist a zhwiki page that's causing issues (duration: 07m 27s)
[16:29:24] <mobrovac>	 Pchelolo: i don't recall it being deployed
[16:29:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:38] <Pchelolo>	 mobrovac: this change is blacklist solo: https://gerrit.wikimedia.org/r/351114
[16:29:50] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[16:30:11] <mobrovac>	 oh ok even better
[16:30:37] <Pchelolo>	 Thu Apr 27 15:34:12
[16:30:42] <urandom>	 Pchelolo: i guess that is live now
[16:30:56] <Pchelolo>	 up urandom just finished
[16:31:05] <urandom>	 Pchelolo: the query aborts of ceased on restbase1009
[16:31:10] <urandom>	 so, \o/
[16:31:46] <Pchelolo>	 Ok, I'll be here for a bit, but I'll switch to eating my noodles
[16:32:42] <urandom>	 i guess i'll undo the tombstone threshold changes on 1009 then
[16:35:28] <mobrovac>	 yup, sounds good
[16:35:50] <urandom>	 !log T160759: Restoring default tombstone_threshold on restbase1009
[16:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:59] <stashbot>	 T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759
[16:41:45] <elukey>	 nice work people!
[16:41:46] <elukey>	 thanks :)
[16:43:00] <elukey>	 urandom: going afk then, everything seems fine atm
[16:43:21] <urandom>	 elukey: yup; thanks again!
[16:43:26] <elukey>	 o/
[16:43:28] <urandom>	 elukey: enjoy the remainder of your weekend
[16:43:51] <elukey>	 I'll try!
[16:44:31] <mobrovac>	 thnx elukey!
[16:44:49] <mobrovac>	 for future ref, sending me a mail is more efficient than pinging me here
[16:45:13] <mobrovac>	 :)
[16:45:28] * mobrovac going afk too
[17:10:20] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp2022 is OK: OK: expiry mailbox lag is 13307
[17:20:20] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.97 seconds
[17:22:10] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 2795
[17:23:20] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.86 seconds
[17:28:20] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.01 seconds
[17:28:24] <chicocvenancio>	 is something still a problem?
[17:30:28] <chicocvenancio>	 I got a few "wiki in read only mode" messages while translating, but seems it went away
[17:57:12] <Zppix>	 chicocvenancio:  I cannot think of any reason any wiki would be read noly
[17:57:14] <Zppix>	 only*
[17:57:32] <Zppix>	 chicocvenancio:  about what time?
[17:58:21] <chicocvenancio>	 about 1 minute before my message, it was brief. Might be an error in the translation interface
[18:00:02] <Zppix>	 chicocvenancio:  what wiki?
[18:00:17] <chicocvenancio>	 Meta
[18:00:29] <Zppix>	 okay let me take a look and see if i encounter the same
[18:00:40] <chicocvenancio>	 Zppix, it was brief
[18:00:55] <chicocvenancio>	 I have saved after that already
[18:01:11] <Zppix>	 chicocvenancio:  I understand but I wanna be sure and see if I can't reproduce it somehow
[18:01:48] <chicocvenancio>	 ok, but I don't think you'll manage to. Unless other people report it I wouldn't worry about it
[18:02:02] <Zppix>	 chicocvenancio:  okay, for future reference what page did this appear on
[18:02:19] <chicocvenancio>	 I was on the translation interface
[18:02:43] <Zppix>	 oh yes my bad I forgot
[18:02:50] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp2014 is OK: OK: expiry mailbox lag is 269434
[18:03:06] <chicocvenancio>	 I'm not quite sure which message I was saving, but it was after Translations:Wikimedia Conference 2017/Documentation/Movement Strategy track/Day 2/423/pt for sure
[18:03:19] * chicocvenancio goes to look for the contributions timestamps
[18:04:45] <chicocvenancio>	 I think it was around Translations:Wikimedia Conference 2017/Documentation/Movement Strategy track/Day 2/432/pt ‎
[18:05:11] <chicocvenancio>	 a few of the messages came back with the error (I think it was 3)
[18:05:23] <chicocvenancio>	 I managed to save them a few minutes latter
[18:06:26] * chicocvenancio went to check the calendar if the server switch was today... but that didn't make sense
[18:09:47] <Zppix>	 chicocvenancio:  the switch over is may 3rd at 14:00 utc
[18:10:07] <chicocvenancio>	 yeah, I saw that
[18:10:24] <chicocvenancio>	 thanks for the help Zppix 
[18:10:39] <Zppix>	 no problem chicocvenancio  for further reference #wikimedia-tech is the place to report issues with WMF wikis
[18:10:57] <chicocvenancio>	 ok
[18:30:21] <eddiegp>	 Zppix: Servers in eqiad are currently ro (as can be seen by using X-Wikimedia-Debug with some eqiad server: It'll show ro-messages when trying to edit). So my guess is that those translation things somehow managed to be routed to eqiad instead of codfw, having this error returned, though that certainly shouldn't happen.
[18:31:09] <Zppix>	 eddiegp:  if it happens again then well do further issue reporting, but considering it resolved itself i  think it was maybe a cache problem
[18:33:42] <eddiegp>	 Zppix: Yeah, didn't want to investigate further now anyway ;) 
[18:47:50] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[18:48:50] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[18:48:55] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown: Investigate why firejails break PdfHandler - https://phabricator.wikimedia.org/T164145#3224445 (10Tgr) ``` tgr@wasat:~$ /usr/bin/gs -sDEVICE=jpeg -sOutputFile=/home/tgr/Special_301_Report_2014.jpeg -dFirstPage=2 -dLastPage=2 -dSAFER -r150 -dBATCH -dNOPAUSE -q /home...
[18:55:20] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[19:00:47] <wikibugs___>	 06Operations, 06Performance-Team: UPDATE /* Title::invalidateCache  */ coming from the JobQueue are causing slowdown on masters and lag on several wikis - https://phabricator.wikimedia.org/T164173#3224448 (10jcrespo)
[19:01:06] <wikibugs>	 06Operations, 10DBA, 06Performance-Team: UPDATE /* Title::invalidateCache  */ coming from the JobQueue are causing slowdown on masters and lag on several wikis - https://phabricator.wikimedia.org/T164173#3224460 (10jcrespo)
[19:05:22] <wikibugs___>	 06Operations, 10DBA, 06Performance-Team: UPDATE /* Title::invalidateCache  */ coming from the JobQueue are causing slowdown on masters and lag on several wikis - https://phabricator.wikimedia.org/T164173#3224461 (10jcrespo) Lots of bots starting aroudn the same time, could it be some kind of labs change/othe...
[19:07:41] <wikibugs___>	 06Operations, 10Wikimedia-General-or-Unknown: Investigate why firejails break PdfHandler - https://phabricator.wikimedia.org/T164145#3224462 (10Tgr) Never mind that. That's expected behavior from the profile blacklisting `/home`.  The actual error is that ghostscript outputs some junk: when I redirect the stan...
[19:20:10] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:29:50] <icinga-wm>	 PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[19:30:10] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:30:57] <wikibugs___>	 06Operations, 10Wikimedia-General-or-Unknown: Investigate why firejails break PdfHandler - https://phabricator.wikimedia.org/T164145#3224491 (10Tgr) Or rather, firejail mixes its own output with the gs output. All that's needed is to make it shut up: ``` tgr@wasat:~$ '/usr/bin/firejail' '--profile=/etc/firejai...
[19:49:10] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:54:33] <Zppix>	 eddiegp:  in regards to your reply to my email I think we could somehow find out where that banner lives and edit it?
[20:02:32] <wikibugs>	 (03PS1) 10Gergő Tisza: Make mediawiki-firejail-ghostscript quiet [puppet] - 10https://gerrit.wikimedia.org/r/351123 (https://phabricator.wikimedia.org/T164145)
[20:07:00] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[20:09:07] <eddiegp>	 Zppix: Good idea.
[20:09:24] <Zppix>	 eddiegp:  I would assume its on meta?
[20:09:29] <eddiegp>	 I think it is set with https://www.mediawiki.org/wiki/Manual:$wgReadOnly
[20:09:49] <eddiegp>	 So should live somewhere in the settings repository
[20:10:00] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[20:10:10] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:10:38] <Zppix>	 eddiegp:  No, wikitext is also allowed so searching meta would be a good idea
[20:12:44] <eddiegp>	 So, I'm talking about the ro message when trying to edit a page. It's built into mediawiki. It's a system message. See: https://translatewiki.net/wiki/MediaWiki:Readonlywarning It get's the reason to be added there ($1) from a configuration variable, $wgReadOnly
[20:12:55] <Zppix>	 oh
[20:13:04] <Zppix>	 how will get to that then?
[20:13:11] <Zppix>	 i would assume i18n?
[20:14:10] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:14:26] <eddiegp>	 I don't really know. It must live somewhere in the switchdc tool I think ...
[20:14:36] <Zppix>	 wheres that located?
[20:14:45] <Zppix>	 i found this eddiegp  https://gerrit.wikimedia.org/r/#/c/346251/3/wmf-config/db-codfw.php
[20:15:10] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:18:10] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:20:05] <eddiegp>	 Zppix: Well that is for the current mediawiki-config, right. It's basically there to prevent writing to equiad while we are on codfw. But that is not the part that puts everything into ro when actually in the step of switching.
[20:20:28] <eddiegp>	 That's in https://github.com/wikimedia/operations-switchdc/tree/master/switchdc/stages
[20:20:46] <eddiegp>	 The t02_start_mediawiki_readonly.py and t08_stop_mediawiki_readonly.py
[20:22:29] <Zppix>	 brb
[20:23:10] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:24:25] <eddiegp>	 That is directly within the switchdc tool, which is used always when switching DC's. It's there to automate things, so I don't think it's a great idea to hardcode that message there. If you need to merge a patch for the switchdc tool and deploy that patch each time you want the switchdc tool to run, that breaks it's sense (which is to remove the need to add a patch for the files in mediawiki-config, merge & deploy them).
[20:25:17] <eddiegp>	 So what it does now makes sense in some way: Just set a generic message that doesn't need to be changed.
[20:26:05] <eddiegp>	 A more specific message would be better, but we can't just replace it in the source code. A parameter to the python tool which takes the message to be shown might be a better idea.
[20:27:50] <icinga-wm>	 RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:41:34] <Zppix>	 eddiegp:  I know, so i dont know what to do
[20:42:02] <eddiegp>	 If in doubt, file a phab task ;-)
[20:42:19] <eddiegp>	 I'll do that next week.
[20:42:21] <Zppix>	 your the one with the idea so why don't you 
[20:42:30] <Zppix>	 the switchover is may 3rd
[20:42:34] <Zppix>	 its the 30th
[20:43:08] <eddiegp>	 That was what I meant, didn't mean that _you_ should
[20:43:17] <eddiegp>	 Rather that I should
[20:44:16] <eddiegp>	 I'm busy at the moment, can't really look into this right now. I don't think we'll get the tool to change within the few days, may just keep it that way for now and change it for the future. It's really urgent to have this next week, is it?
[20:44:16] <Zppix>	 oh
[20:44:48] <Zppix>	 eddiegp:  considering the one we used when we switch to codfw was really misleading yes its kinda urgent
[20:45:04] <Zppix>	 (this really should of been brought up sooner)
[20:48:43] <eddiegp>	 I'll create a task for it in a moment, just can't look into how to fix this the next few days.
[20:49:59] <Zppix>	 Maybe someone else will be able to?
[20:50:04] <Zppix>	 eddiegp:  CC me please
[20:52:02] <eddiegp>	 Yeah, just didn't thought it would be that urgent. I mean before you click to the edit tab of a page, you'll be on that page and see the Centralnotice (that will be shown on top of every page) about the switchover anyway. But anyway, filing it won't do bad.
[20:53:45] <Zppix>	 eddiegp:  thing is that notice is misleading
[20:53:52] <Zppix>	 it isnt up-to-date for one
[21:05:42] <Zppix>	 eddiegp:  Let me know when the task is created
[21:05:55] <eddiegp>	 I'm currently writing it.
[21:06:22] <Zppix>	 eddiegp:  ack thank you
[21:15:07] <eddiegp>	 Zppix: T164177
[21:15:08] <stashbot>	 T164177: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177
[21:15:16] <Zppix>	 thanks eddiegp 
[21:20:03] <wikibugs___>	 (03PS1) 10BryanDavis: toollabs: ensure default hhvm service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/351124 (https://phabricator.wikimedia.org/T78783)
[22:01:55] <c>	 ?/act
[22:05:50] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[22:06:50] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[22:14:10] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.140 and port 9042: Connection refused
[22:14:40] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[22:16:00] <icinga-wm>	 PROBLEM - Check systemd state on restbase1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:16:20] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[22:38:00] <icinga-wm>	 RECOVERY - Check systemd state on restbase1015 is OK: OK - running: The system is fully operational
[22:38:20] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1015 is OK: OK - cassandra-c is active
[22:39:10] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is OK: TCP OK - 0.036 second response time on 10.64.48.140 port 9042
[22:39:40] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-c valid until 2017-09-12 15:34:41 +0000 (expires in 134 days)
[23:16:53] <wikibugs___>	 06Operations, 06Operations-Software-Development, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3224571 (10EddieGP)
[23:23:23] <wikibugs>	 (03CR) 1020after4: [C: 031] List disabled user accounts with associated open tasks in weekly Phab email [puppet] - 10https://gerrit.wikimedia.org/r/351011 (https://phabricator.wikimedia.org/T157740) (owner: 10Aklapper)