[01:40:43] (CR) Madhuvishy: [C: 1] "Nice write up! Thanks for putting this together :)" [software/tools-webservice] - https://gerrit.wikimedia.org/r/420238 (https://phabricator.wikimedia.org/T189974) (owner: Nehajha)
[03:25:47] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 834.01 seconds
[03:34:48] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:53:27] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-Connection-Type.mmdb],File[/usr/share/GeoIP/GeoIPCity.dat]
[03:58:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 240.23 seconds
[04:04:48] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:46:47] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:48:27] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[05:16:47] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:30:57] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 62161 MB (12% inode=99%)
[05:37:57] RECOVERY - Disk space on elastic1024 is OK: DISK OK
[07:17:24] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4079025 (Liuxinyu970226) What about Antartica (AD)?
[09:38:37] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[09:40:58] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[09:40:58] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:41:07] PROBLEM - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.168 and port 9042: Connection refused
[10:03:58] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[10:04:07] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[10:04:47] RECOVERY - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-a valid until 2018-07-20 15:08:10 +0000 (expires in 117 days)
[10:05:08] RECOVERY - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.168 port 9042
[11:22:48] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[11:24:48] PROBLEM - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.169 and port 9042: Connection refused
[11:25:08] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:25:37] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[11:33:37] RECOVERY - cassandra-b service on restbase-dev1006 is OK: OK - cassandra-b is active
[11:34:08] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[11:34:57] RECOVERY - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-b valid until 2018-07-20 15:08:11 +0000 (expires in 117 days)
[11:35:48] RECOVERY - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.169 port 9042
[11:55:27] PROBLEM - HHVM rendering on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:56:17] RECOVERY - HHVM rendering on mw2262 is OK: HTTP OK: HTTP/1.1 200 OK - 75893 bytes in 0.282 second response time
[11:59:37] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:01:37] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:22:48] PROBLEM - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:23:08] PROBLEM - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is CRITICAL: connect to address 10.64.0.168 and port 9042: Connection refused
[15:23:38] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:23:57] PROBLEM - cassandra-b service on restbase-dev1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[15:35:38] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational
[15:35:57] RECOVERY - cassandra-b service on restbase-dev1004 is OK: OK - cassandra-b is active
[15:36:57] RECOVERY - cassandra-b SSL 10.64.0.168:7001 on restbase-dev1004 is OK: SSL OK - Certificate restbase-dev1004-b valid until 2018-07-20 15:08:05 +0000 (expires in 116 days)
[15:37:08] RECOVERY - cassandra-b CQL 10.64.0.168:9042 on restbase-dev1004 is OK: TCP OK - 0.003 second response time on 10.64.0.168 port 9042
[15:59:23] (CR) Thiemo Kreuz (WMDE): [C: 1] Use namespaced PHPUnit\Framework\TestCase [mediawiki-config] - https://gerrit.wikimedia.org/r/421588 (https://phabricator.wikimedia.org/T188166) (owner: Umherirrender)
[20:30:53] (PS1) Kevin py: My understanding of the webservice script [software/tools-webservice] - https://gerrit.wikimedia.org/r/421798
[20:44:42] (PS1) Kevin py: my understanding of webservice with start and stop actions [software/tools-webservice] - https://gerrit.wikimedia.org/r/421799
[21:26:57] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[21:29:08] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:29:27] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[21:29:38] PROBLEM - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.168 and port 9042: Connection refused
[21:33:08] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[21:33:27] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[21:34:38] RECOVERY - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.168 port 9042
[21:34:57] RECOVERY - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-a valid until 2018-07-20 15:08:10 +0000 (expires in 116 days)
[21:56:47] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[21:58:58] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[21:59:17] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:59:27] PROBLEM - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.169 and port 9042: Connection refused
[22:04:07] RECOVERY - cassandra-b service on restbase-dev1006 is OK: OK - cassandra-b is active
[22:04:17] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[22:05:17] RECOVERY - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-b valid until 2018-07-20 15:08:11 +0000 (expires in 116 days)
[22:05:27] RECOVERY - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is OK: TCP OK - 0.002 second response time on 10.64.48.169 port 9042
[23:26:27] PROBLEM - HHVM rendering on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:27:17] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 75344 bytes in 0.304 second response time
[23:53:51] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:54:41] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.009 second response time
[23:56:45] Hmm I wonder did that hit a blip?