[00:03:31] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:26:40] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[00:27:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[00:32:30] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[01:04:50] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199
[01:05:40] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[01:14:02] (PS1) Krinkle: varnish: Make errorpage.html balanced and use placeholder [puppet] - https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114)
[01:21:06] (PS2) Krinkle: varnish: Convert errorpage into re-usable template [puppet] - https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114)
[01:21:08] (PS5) Krinkle: dynamicproxy: Make use of errorpage template [puppet] - https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[01:21:31] (CR) Krinkle: "Tried out regsub() in separate commit before this one: Id77d23a442ab9b" [puppet] - https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[01:22:00] Operations, ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3223019 (Dzahn) It would be nice to keep this open until ocg1001 is actually back in service.
[01:22:34] (CR) Krinkle: "@BBlack: Would like feedback on this approach." [puppet] - https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[01:40:05] Operations, Ops-Access-Requests: Request access to SWAP - https://phabricator.wikimedia.org/T164060#3220328 (Dzahn) Hi, could you add some detail what this is for? What is SWAP please and what do you need access to stat1002 for? Are you sure it's stat1002 and not stat1003? The requested group "researcher...
[02:58:40] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[03:00:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[03:04:35] (CR) Tim Starling: "The difficulty is that this was meant to be used in production for the codfw->eqiad switch scheduled for May 3. Other options for implemen" [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[03:11:50] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Connect
[03:35:10] (PS10) Krinkle: Use EtcdConfig in beta cluster only [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[03:36:21] (PS11) Krinkle: Use EtcdConfig in beta cluster only [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[03:36:34] (CR) Krinkle: "Added file comment and added noc symlink." [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[03:55:30] (CR) Krinkle: [C: 1] "(one nit). @Tim: Checked the logs. It did mention the two connections side-effect, but I'm not sure if that was still obvious at the point" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[03:58:50] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 10, down: 0, shutdown: 2
[04:10:00] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=342.60 Read Requests/Sec=1389.50 Write Requests/Sec=0.40 KBytes Read/Sec=34910.80 KBytes_Written/Sec=20.40
[04:16:00] (PS12) Krinkle: Use EtcdConfig in beta cluster only [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[04:19:00] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.50 Read Requests/Sec=0.30 Write Requests/Sec=11.20 KBytes Read/Sec=1.20 KBytes_Written/Sec=99.60
[04:23:40] RECOVERY - MariaDB Slave Lag: s3 on db1015 is OK: OK slave_sql_lag Replication lag: 0.42 seconds
[05:05:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[05:07:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[05:10:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[05:27:49] Operations, DBA, DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3223213 (Marostegui) At least it has not gone up: ``` Temperature: 78 C Battery State: Optimal BBU Firmware Status: Charging Status : None...
[05:58:38] Operations, Ops-Access-Requests: Request to add phuedx to "researchers" group - https://phabricator.wikimedia.org/T164060#3223217 (phuedx)
[05:59:27] Operations, Ops-Access-Requests: Request to add phuedx to "researchers" group - https://phabricator.wikimedia.org/T164060#3220328 (phuedx) @Dzahn: I've added more detail and tried to clarify my request (mainly by dropping mentions of server names irrelevant to this request!).
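The recurring "nova instance creation test" alerts in this log come from an Icinga NRPE process check that simply counts running processes matching a command name and argument string. A minimal sketch of such a check, assuming the stock Nagios check_procs plugin; the exact plugin path and thresholds used on labnet1001 are not in the log and are illustrative only:

    # critical if fewer than one python process with "nova-fullstack" in its arguments is running
    /usr/lib/nagios/plugins/check_procs -c 1: -C python -a nova-fullstack

The plugin's output has the same shape as the PROCS OK / PROCS CRITICAL lines above and below.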
[06:53:20] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 180055.14 seconds
[07:05:37] ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 180585.05 seconds Jcrespo replica catching up after alter table
[08:07:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[08:10:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[08:37:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[08:40:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[08:44:10] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100%
[08:56:27] Operations, DBA, DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3223305 (Marostegui) I have been trying to increase the fans speed via ipmitool raw commands but apparently on the R720 it is not possible to do that. The d...
[09:21:20] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 56.43 ms
[09:22:20] apparently turkey blocked wikipedia again
[09:25:29] or actually.. that might be a first....
[09:46:17] (PS1) Reedy: Revert "Run Pdf Processors in firejails" [mediawiki-config] - https://gerrit.wikimedia.org/r/350981 (https://phabricator.wikimedia.org/T164045)
[09:46:35] <_joe_> thedj: I don't remember that happening before, no
[09:46:45] <_joe_> at least in the last 3 years, I'd remember
[09:47:45] _joe_: i think they might have blocked an individual page pre-https times... but i'm not sure
[09:48:10] other projects still available according to another irc user (torak)
[09:54:48] Operations, Wikimedia-General-or-Unknown: Investigate why firejails break PdfHandler - https://phabricator.wikimedia.org/T164145#3223332 (Reedy)
[09:55:02] (CR) Reedy: [C: 2] Revert "Run Pdf Processors in firejails" [mediawiki-config] - https://gerrit.wikimedia.org/r/350981 (https://phabricator.wikimedia.org/T164045) (owner: Reedy)
[09:55:07] (PS2) Reedy: Revert "Run Pdf Processors in firejails" [mediawiki-config] - https://gerrit.wikimedia.org/r/350981 (https://phabricator.wikimedia.org/T164045)
[09:55:14] (CR) Reedy: [C: 2] Revert "Run Pdf Processors in firejails" [mediawiki-config] - https://gerrit.wikimedia.org/r/350981 (https://phabricator.wikimedia.org/T164045) (owner: Reedy)
[09:56:16] (Merged) jenkins-bot: Revert "Run Pdf Processors in firejails" [mediawiki-config] - https://gerrit.wikimedia.org/r/350981 (https://phabricator.wikimedia.org/T164045) (owner: Reedy)
[09:56:25] (CR) jenkins-bot: Revert "Run Pdf Processors in firejails" [mediawiki-config] - https://gerrit.wikimedia.org/r/350981 (https://phabricator.wikimedia.org/T164045) (owner: Reedy)
[09:59:23] !log reedy@naos Synchronized wmf-config/CommonSettings.php: Revert pdf processor firejails T164045 (duration: 02m 41s)
[09:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:33] T164045: PDF thumbnails fail to render on newly-uploaded PDF files - https://phabricator.wikimedia.org/T164045
[10:02:30] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (206708s 200000s)
[10:02:30] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (206708s 200000s)
[10:10:50] Operations, Wikimedia-General-or-Unknown: Investigate why firejails break PdfHandler - https://phabricator.wikimedia.org/T164145#3223357 (Reedy)
[10:25:25] (PS2) Reedy: Promote CollaborationKit to the big leagues; deploy on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343697 (https://phabricator.wikimedia.org/T138326)
[10:37:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[10:39:32] !log start ferm on kafka1020/18 (nodes were previously down for maintenance, not sure why ferm wasn't started)
[10:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:00] RECOVERY - Check systemd state on kafka1020 is OK: OK - running: The system is fully operational
[10:40:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[10:40:22] ahh DNS query for 'prometheus1003.eqiad.wmnet' failed: query timed out
[10:40:50] RECOVERY - Check whether ferm is active by checking the default input chain on kafka1020 is OK: OK ferm input default policy is set
[10:41:00] RECOVERY - Check systemd state on kafka1018 is OK: OK - running: The system is fully operational
[10:41:20] RECOVERY - Check whether ferm is active by checking the default input chain on kafka1018 is OK: OK ferm input default policy is set
[10:50:18] !log set sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 to kafka[1018,1020,1022].eqiad.wmnet (was 120 - maybe related to T136094 ?)
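The sysctl change in the !log entry just above can be applied and verified with standard tooling; a minimal sketch using the value from that log line (the persistence snippet and its path are assumptions, not taken from the log):

    # apply the shorter TIME_WAIT conntrack timeout at runtime, as in the !log entry
    sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65
    # confirm the running value
    sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait
    # to survive a reboot it would typically also be written to a sysctl.d snippet, e.g.
    #   echo 'net.netfilter.nf_conntrack_tcp_timeout_time_wait = 65' | sudo tee /etc/sysctl.d/70-conntrack.conf
    # (illustrative path only; see T136094 linked below for the boot-time ordering issue)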
[10:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:28] T136094: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094
[10:50:29] moritzm: --^
[10:51:48] kafka1018/1020 were shutdown for row-d maintenance, not sure if it was the same problem or not
[10:52:57] kafka metrics look good
[10:55:50] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:50] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[11:25:30] (PS1) Giuseppe Lavagetto: Move most redis code to a library [switchdc] - https://gerrit.wikimedia.org/r/351003
[11:25:32] (PS1) Giuseppe Lavagetto: Add stage for restarting Redis. [switchdc] - https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337)
[11:25:54] (CR) jerkins-bot: [V: -1] Add stage for restarting Redis. [switchdc] - https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337) (owner: Giuseppe Lavagetto)
[11:26:22] <_joe_> yeah, yeah jenkins, whatever. It's Saturday and I'm working, cut me some slack :P
[11:32:32] It must be jenkins on his weekend holiday.
[11:35:03] Operations, Wikimedia-General-or-Unknown: Production error message (when servers are down) points users to donate link which is likely to produce the same error message - https://phabricator.wikimedia.org/T154627#3223458 (Aklapper)
[11:39:40] Operations, MediaWiki-JobQueue, MediaWiki-JobRunner, Patch-For-Review, and 2 others: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3196736 (Ciencia_Al_Poder) Please see {T157545}
[12:07:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[12:10:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[12:16:33] is that meant to happen ^^?
[12:18:20] RECOVERY - MariaDB Slave Lag: s3 on db1038 is OK: OK slave_sql_lag Replication lag: 0.19 seconds
[12:34:20] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 686180
[13:01:32] Operations, ops-eqiad, DBA: db1062 (s7 master eqiad) in a reboot cycle - https://phabricator.wikimedia.org/T164092#3223557 (jcrespo) Open>Resolved a:jcrespo No longer ongoing.
[13:14:20] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 8
[13:38:50] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[13:39:30] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:40:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
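The "Check systemd state ... degraded" and "cassandra-a is failed" alerts that follow describe a host whose overall systemd state is degraded because a single unit keeps failing. A rough sketch of how such a state is usually inspected by hand, using standard systemd commands (not taken from the log):

    systemctl is-system-running        # prints "degraded" while any unit is in the failed state
    systemctl --failed                 # lists the failed unit(s), e.g. cassandra-a.service
    systemctl status cassandra-a       # current state and last exit status of the instance
    journalctl -u cassandra-a -n 100   # recent log output from the failing unit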
[13:40:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[13:41:40] PROBLEM - cassandra-a SSL 10.64.32.205:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:41:50] PROBLEM - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.205 and port 9042: Connection refused
[13:43:00] PROBLEM - cassandra-a service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[13:43:40] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:47:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[13:47:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[13:50:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:50:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[13:51:30] elukey: mhh, also seems if ferm didn't come up properly. let's do controlled reboot of a kafka broker next week to investigate this further
[13:55:50] Operations, Wikimedia-General-or-Unknown: Investigate why firejails break PdfHandler - https://phabricator.wikimedia.org/T164145#3223616 (Reedy)
[14:04:40] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational
[14:05:00] RECOVERY - cassandra-a service on restbase1013 is OK: OK - cassandra-a is active
[14:06:40] RECOVERY - cassandra-a SSL 10.64.32.205:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-a valid until 2017-09-12 15:34:18 +0000 (expires in 136 days)
[14:06:50] RECOVERY - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is OK: TCP OK - 0.036 second response time on 10.64.32.205 port 9042
[14:17:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[14:20:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[14:47:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[14:47:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[14:50:10] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:50:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[15:17:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[15:17:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[15:20:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
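Following up on the 13:51 remark about ferm not coming up after the kafka maintenance reboots: the "Check whether ferm is active" recoveries earlier in the log correspond to the default INPUT chain policy being set again once ferm was started. A rough sketch of how that can be verified by hand; the commands are standard systemd/iptables tooling and an assumption about what the Icinga check looks at, not taken from the log:

    systemctl status ferm               # did the ferm unit come up at boot?
    sudo iptables -S INPUT | head -n1   # shows the default policy ferm should have installed
    sudo systemctl start ferm           # start it manually if it did not come up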
[15:20:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[15:21:20] PROBLEM - Check Varnish expiry mailbox lag on cp2022 is CRITICAL: CRITICAL: expiry mailbox lag is 644312
[15:47:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[15:47:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[15:50:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:50:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[16:31:20] RECOVERY - Check Varnish expiry mailbox lag on cp2022 is OK: OK: expiry mailbox lag is 0
[16:46:18] (PS1) Aklapper: List disabled user accounts with associated open tasks in weekly Phab email [puppet] - https://gerrit.wikimedia.org/r/351011 (https://phabricator.wikimedia.org/T157740)
[16:47:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[16:47:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[16:50:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:50:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[17:17:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[17:20:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[17:32:32] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#3223675 (matmarex) I don't actually see t...
[17:47:11] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[17:47:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[17:50:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:50:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[18:13:20] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 684946
[18:17:11] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[18:17:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[18:20:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:20:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[18:37:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[18:40:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[18:41:40] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[18:42:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[18:47:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[18:50:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:53:20] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 263674
[19:17:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[19:17:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[19:20:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:20:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[19:22:20] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 786038
[19:47:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[19:47:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[19:50:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:50:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[20:17:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[20:17:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[20:20:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:20:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[20:42:20] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 58
[20:44:41] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[20:46:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[21:31:40] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[21:33:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[21:47:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[21:47:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[21:50:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:50:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[22:07:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[22:10:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[22:11:50] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range
[22:13:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[22:47:11] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[22:47:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[22:50:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:50:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[23:17:11] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[23:17:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[23:20:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:20:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[23:47:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[23:47:20] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[23:50:11] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:50:20] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed