[01:02:10] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499562124 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9229041 keys, up 2 minutes 2 seconds - replication_delay is 1499562124
[01:02:20] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499562137 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9322319 keys, up 2 minutes 15 seconds - replication_delay is 1499562137
[01:03:00] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[01:03:10] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[01:03:20] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9318435 keys, up 3 minutes 16 seconds - replication_delay is 0
[01:04:00] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9318649 keys, up 3 minutes 53 seconds - replication_delay is 0
[01:04:10] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9321480 keys, up 3 minutes 59 seconds - replication_delay is 0
[01:04:10] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9224469 keys, up 4 minutes 5 seconds - replication_delay is 0
[01:07:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:08:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:09:20] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:10:20] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:14:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:15:20] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:15:20] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:16:00] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:33:00] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:50] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.test],File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[04:00:30] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[04:01:00] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[04:08:40] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=449.50 Read Requests/Sec=3833.00 Write Requests/Sec=0.20 KBytes Read/Sec=35971.20 KBytes_Written/Sec=12.00
[04:15:40] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=55.30 Read Requests/Sec=4.40 Write Requests/Sec=37.80 KBytes Read/Sec=17.60 KBytes_Written/Sec=5955.60
[05:34:40] PROBLEM - nova-compute process on labvirt1013 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[05:35:40] RECOVERY - nova-compute process on labvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[05:46:30] Operations, DBA, MediaWiki-extensions-WikibaseClient, Performance-Team, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3418483 (Ladsgroup) a:Ladsgroup
[05:50:22] Operations, DBA, MediaWiki-extensions-WikibaseClient, Performance-Team, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3418489 (Ladsgroup) >>! In T164173#3413805, @Krinkle wrote: > * PageUpdater::purgeParse...
[06:48:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:50:30] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap]
[06:50:30] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:55:30] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:56:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:16:50] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[07:37:00] (PS2) Smalyshev: Index deletes everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/363669 (https://phabricator.wikimedia.org/T163235)
[07:47:00] PROBLEM - cassandra-a SSL 10.192.16.176:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[07:47:00] PROBLEM - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.176 and port 9042: Connection refused
[07:48:50] PROBLEM - cassandra-a service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[07:48:50] PROBLEM - Check systemd state on restbase2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:01:50] RECOVERY - Check systemd state on restbase2007 is OK: OK - running: The system is fully operational
[08:02:50] RECOVERY - cassandra-a service on restbase2007 is OK: OK - cassandra-a is active
[08:04:10] RECOVERY - cassandra-a SSL 10.192.16.176:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-a valid until 2017-09-12 15:35:50 +0000 (expires in 65 days)
[08:05:00] RECOVERY - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.176 port 9042
[10:17:30] PROBLEM - cassandra-c SSL 10.192.48.70:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[10:18:00] PROBLEM - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.70 and port 9042: Connection refused
[10:18:11] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:18:20] PROBLEM - cassandra-c service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[10:24:20] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational
[10:24:30] RECOVERY - cassandra-c service on restbase2012 is OK: OK - cassandra-c is active
[10:24:50] RECOVERY - cassandra-c SSL 10.192.48.70:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-c valid until 2017-11-17 00:54:34 +0000 (expires in 130 days)
[10:25:00] RECOVERY - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.70 port 9042
[10:51:40] (CR) Ladsgroup: "Daniel: It includes these rules in line 95 of https://gerrit.wikimedia.org/r/#/c/361801/2/modules/mediawiki/files/apache/sites/main.conf :" [puppet] - https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: Ladsgroup)
[12:40:35] (PS1) Elukey: Remove ladsgroup from production access [puppet] - https://gerrit.wikimedia.org/r/364102
[12:42:37] (CR) Elukey: [C: 2] Remove ladsgroup from production access [puppet] - https://gerrit.wikimedia.org/r/364102 (owner: Elukey)
[13:25:37] (PS1) Elukey: Set ladsgroup as absented user [puppet] - https://gerrit.wikimedia.org/r/364104
[13:27:31] (CR) Elukey: [C: 2] Set ladsgroup as absented user [puppet] - https://gerrit.wikimedia.org/r/364104 (owner: Elukey)
[15:56:39] (PS1) Framawiki: Set $wgUploadNavigationUrl for fr.wikt [mediawiki-config] - https://gerrit.wikimedia.org/r/364115 (https://phabricator.wikimedia.org/T170083)
[16:30:20] PROBLEM - Apache HTTP on mw2235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:31:10] RECOVERY - Apache HTTP on mw2235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.110 second response time
[16:35:20] Operations, vm-requests, Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3418902 (Dzahn) can be closed as resolved now?
[16:44:11] (Abandoned) Framawiki: Set $wgUploadNavigationUrl for fr.wikt [mediawiki-config] - https://gerrit.wikimedia.org/r/364115 (https://phabricator.wikimedia.org/T170083) (owner: Framawiki)
[17:24:30] PROBLEM - HHVM rendering on mw2199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:25:20] RECOVERY - HHVM rendering on mw2199 is OK: HTTP OK: HTTP/1.1 200 OK - 74898 bytes in 0.341 second response time
[18:08:12] (PS1) Framawiki: Set $wgUploadNavigationUrl for few wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/364121 (https://phabricator.wikimedia.org/T170083)
[20:42:14] Operations, Wikimedia-General-or-Unknown: Icinga has httpauth on (not accessible for public) - https://phabricator.wikimedia.org/T62112#661810 (Luke081515) Did something changed here in over two years since icinga is login-only?
[21:32:24] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3419139 (Urbanecm) Hello, I don't think so. There is no blocked AFAIK. @dereckson, do you know what is next step? I think it is reserving an window a...
[21:35:58] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3419141 (Dereckson) Thanks @Urbanecm for the update, I haven't seen this ticket yet. >>! In T168518#3413280, @Amire80 wrote: > Hi, > > It seems to...
[21:44:58] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review, User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3419153 (Urbanecm)
[22:05:14] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3419162 (Dereckson) Open>stalled
[22:16:40] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3419180 (Dereckson) Discussions occurred [[ https://lists.wikimedia.org/pipermail/langcom/2017-June/thread.html | end June ]] and [[ https://lists.wikim...
[22:27:03] (PS1) Urbanecm: Add import sources for specieswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/364131 (https://phabricator.wikimedia.org/T170094)
[22:35:55] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3419191 (Koavf) How can I ensure that the Committee sees my concerns? Posting to Meta, here, the mailing list?
[22:52:09] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3376126 (Urbanecm) >>! In T168765#3419191, @Koavf wrote: > How can I ensure that the Committee sees my concerns? Posting to Meta, here, the mailing list...