[01:02:45] (CR) Reedy: [C: -1] "CR -1 to prevent accidental merging because of dependency on I011db0e9a2d9da825cf3ac02bfba23b562e052f6 (in operations/puppet repo)" [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T171372) (owner: Ebe123)
[01:05:28] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:06:48] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:06:58] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:14] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[01:07:15] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[01:07:15] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:15] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:16] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:16] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:27] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:07:37] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:07:37] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:08:17] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[01:09:18] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:09:27] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy
[01:09:27] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy
[01:09:37] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[01:09:38] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[01:09:57] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[01:10:18] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[01:10:27] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[01:11:17] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[01:11:18] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy
[01:13:44] Operations, Domains, Traffic, Wikimedia Resource Center, Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3497968 (Reedy) I don't disagree with Timo above, and I'm guessing #operations will agree. It then fixes the "issues" that arise...
[01:15:17] Reedy: I don't personally object to wm.o/resources but it's not up to me
[01:15:29] Who is it upto? :)
[01:16:00] Ofcourse, you can ask for foobar.wikimedia.org, but if ops say no, it's a no ;)
[01:22:27] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[01:24:34] harej if i may, I think going the "slash" route is more convential and more usual, and not to mention if for some reason someone needs to setup a dev-related thing under resources.wikimedia.org we arent jumping through hoops to find a different subdomain
[01:26:37] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[01:28:05] !log conf2002:~# service etcdmirror-conftool-eqiad-wmnet restart (not sure what else to do the service failed)
[01:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:35] Zppix: What's dev related got to do with it?
[01:31:55] Chances are, it's only gonna be used for Wikimania, so can probably be scrapped at some point in the future anyway
[01:31:56] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504048 (chasemp)
[01:32:05] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504061 (chasemp) p:Triage>Normal
[01:33:47] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:34:07] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[01:34:47] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[01:36:04] Reedy idk the meta page looks like it could be long term potientally, but i dont know i just read the task to find out what that was about
[01:36:12] (PS2) Ebe123: Run Lilypond from Firejail [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T171372)
[01:36:28] (PS2) Ebe123: Run Lilypond from Firejail [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T171372)
[01:36:34] and i was just relaying the concerns I had that was also mentioned on the task
[01:37:54] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504064 (chasemp) Seems like this is dying really soon post restart ```root@conf2002:~# service etcdmirror-conftool-eqiad-wmnet status ● etcdmirror-conftool-eqiad-wmnet.service - Etcd mirrormaker Loaded:...
[01:38:03] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504065 (chasemp) p:Normal>Unbreak!
[01:39:43] the shortcut would be used beyond wikimania. in any case, as i said i am fine with wikimedia.org/resources but I need to ask others first
[01:40:40] Operations, Domains, Traffic, Wikimedia Resource Center, Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3504068 (Reedy) The other question is whether you're going to be wanting more redirects for other similar purposes... In which c...
[01:41:16] Reedy: what do you mean by the education subdomain?
[01:41:26] harej i'd assume wikiedu
[01:41:30] https://github.com/wikimedia/puppet/blob/7ac5f9a959924a6b51625b713e95c44ed7560ee8/modules/mediawiki/files/apache/sites/redirects/redirects.dat#L112-L140
[01:41:49] huh, interesting
[01:42:03] I don't mean you using the education subdomain, ofc
[01:42:09] but that pattern may make more sense than
[01:42:16] wikimedia.org/resourcesfoobar
[01:42:19] wikimedia.org/resourcesfoobaz
[01:42:22] wikimedia.org/resourcesbarfoo
[01:42:34] so, resources.wikimedia.org/contributors, stuff like that?
[01:42:46] potentially, yup
[01:42:48] that sounds interesting
[01:43:08] ofc, that detracts from Timos originally point/suggestion...
[01:43:25] But if it's more than a single use... It may have merit doing so
[01:44:34] Thing is, the sub-categorization is liable to change
[01:44:35] The current four-audience setup isn't going to stay forever
[01:44:40] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504071 (chasemp) I think the relevant portion is probably "...or if the lag is large enough that we're losing etcd events"
[02:01:27] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[02:02:27] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set
[02:34:07] PROBLEM - MD RAID on labtestservices2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:35:07] RECOVERY - MD RAID on labtestservices2003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[02:47:17] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[02:47:27] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[02:47:50] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.074 second response time
[02:54:17] (PS3) Ebe123: Run Lilypond from Firejail [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T172582)
[02:58:10] (PS3) Ebe123: Run Lilypond from Firejail [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T172582)
[03:08:07] PROBLEM - Check whether ferm is active by checking the default input chain on kubestage1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:09:07] RECOVERY - Check whether ferm is active by checking the default input chain on kubestage1002 is OK: OK ferm input default policy is set
[03:12:08] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:13:07] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set
[03:25:57] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 817.38 seconds
[03:54:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 208.11 seconds
[04:56:17] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[04:56:59] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3841 bytes in 0.022 second response time
[05:50:07] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:17] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:27] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:27] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:28] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:28] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:37] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:37] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:50:37] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:51:37] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[05:52:37] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:53:18] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[05:54:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:54:39] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy
[05:55:37] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[05:55:37] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[05:57:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[05:58:02] Operations, Wikimedia-Site-requests, I18n, Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#3504125 (Zoranzoki21)
[05:58:37] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:58:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[05:59:37] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:02:47] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[06:03:47] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy
[06:04:57] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[06:05:27] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[06:05:48] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:06:47] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:07:48] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:08:27] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:08:47] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[06:08:57] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[06:09:57] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[06:09:58] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy
[06:10:47] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[06:11:07] (PS1) EBernhardson: Decrease size of cirrussearch pool counters to reduce load during spikes [mediawiki-config] - https://gerrit.wikimedia.org/r/370362
[06:11:22] (CR) EBernhardson: [C: 2] Decrease size of cirrussearch pool counters to reduce load during spikes [mediawiki-config] - https://gerrit.wikimedia.org/r/370362 (owner: EBernhardson)
[06:11:48] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:11:57] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:12:52] (Merged) jenkins-bot: Decrease size of cirrussearch pool counters to reduce load during spikes [mediawiki-config] - https://gerrit.wikimedia.org/r/370362 (owner: EBernhardson)
[06:12:57] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:12:58] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:13:11] (CR) jenkins-bot: Decrease size of cirrussearch pool counters to reduce load during spikes [mediawiki-config] - https://gerrit.wikimedia.org/r/370362 (owner: EBernhardson)
[06:14:37] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[06:15:07] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[06:15:57] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:17:07] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[06:17:37] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:18:07] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[06:18:07] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy
[06:18:08] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy
[06:18:08] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[06:18:34] !log ebernhardson@tin Synchronized wmf-config/PoolCounterSettings.php: T169498: Reduce cirrus search pool counter to 200 parallel requests cluster wide (duration: 02m 54s)
[06:18:38] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy
[06:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:47] T169498: Investigate load spikes on the elasticsearch cluster in eqiad - https://phabricator.wikimedia.org/T169498
[06:19:58] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy
[06:20:07] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy
[06:20:08] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy
[06:22:07] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[06:22:09] oddly reducing the pool size from 380 to 200 didn't increase the number of rejected searches...
[06:22:17] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy
[06:27:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[06:32:38] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[08:38:02] (PS1) Revi: Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630)
[08:44:06] (PS2) Revi: Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630)
[10:58:38] Operations, ops-eqiad, Analytics: Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s - https://phabricator.wikimedia.org/T172633#3504238 (elukey)
[11:07:06] Operations: conf2002 etcdmirror-conftool-eqiad-wmnet died - https://phabricator.wikimedia.org/T172628#3504266 (elukey) p:Unbreak!>High
[13:04:07] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[13:05:59] seems already fixed --^
[13:06:06] peak of 502s for upload
[13:08:02] seems all */thumb/* so I'll Cc: godog just in case :)
[13:11:07] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.005 second response time
[13:12:17] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:13:10] !log restart pdfrender on scb1002
[13:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:13] !log powercycle mw2256 - com2 frozen - T163346
[13:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:24] T163346: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346
[13:19:07] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[13:20:17] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3504378 (elukey) @Papaul the host keep getting in a frozen state, we'd need to re-check what's wrong :(
[15:06:57] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2056889
[16:46:57] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 1
[17:46:18] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3504660 (fgiunchedi) p:Low>High Since thumbor is in production now I'm bumping the priority becaus...
[17:46:38] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3504664 (fgiunchedi)
[19:21:28] PROBLEM - Nginx local proxy to apache on mw1269 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[19:21:38] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:22:28] RECOVERY - Nginx local proxy to apache on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.136 second response time
[19:22:37] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.045 second response time
[21:49:57] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 356.64 seconds
[21:50:07] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 363.76 seconds
[21:50:07] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 363.85 seconds
[21:50:08] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 365.36 seconds
[21:50:28] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 378.78 seconds
[21:52:57] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 9.49 seconds
[21:53:07] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 13.53 seconds
[21:53:08] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 14.74 seconds
[21:53:08] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 16.14 seconds
[21:53:37] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds