[00:08:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:11:11] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[00:12:01] RECOVERY - Check Varnish expiry mailbox lag on cp2022 is OK: OK: expiry mailbox lag is 11
[00:28:02] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:30:11] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:34:00] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=varnish-be
[00:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:11] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:37:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[00:40:51] PROBLEM - Check Varnish expiry mailbox lag on cp2026 is CRITICAL: CRITICAL: expiry mailbox lag is 720252
[00:46:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:46:11] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:09:51] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 0
[02:16:36] (CR) Krinkle: Use EtcdConfig (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[02:17:02] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[02:18:49] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=varnish-be
[02:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:25] (PS2) Krinkle: mwgrep: Add --etitle option [puppet] - https://gerrit.wikimedia.org/r/349352
[02:28:11] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[02:36:15] (CR) Krinkle: phpunit: factor out logic to handle globals vars (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/349413 (owner: Hashar)
[02:43:06] (CR) Krinkle: [C: 2] phpunit: automatically backup globals between tests [mediawiki-config] - https://gerrit.wikimedia.org/r/349210 (owner: Hashar)
[02:44:16] (Merged) jenkins-bot: phpunit: automatically backup globals between tests [mediawiki-config] - https://gerrit.wikimedia.org/r/349210 (owner: Hashar)
[02:44:58] (PS2) Krinkle: phpunit: factor out logic to handle globals vars [mediawiki-config] - https://gerrit.wikimedia.org/r/349413 (owner: Hashar)
[02:45:07] (CR) jerkins-bot: [V: -1] phpunit: factor out logic to handle globals vars [mediawiki-config] - https://gerrit.wikimedia.org/r/349413 (owner: Hashar)
[02:45:22] (PS3) Krinkle: phpunit: factor out logic to handle globals vars [mediawiki-config] - https://gerrit.wikimedia.org/r/349413 (owner: Hashar)
[02:45:38] (CR) jenkins-bot: phpunit: automatically backup globals between tests [mediawiki-config] - https://gerrit.wikimedia.org/r/349210 (owner: Hashar)
[02:46:17] (CR) jerkins-bot: [V: -1] phpunit: factor out logic to handle globals vars [mediawiki-config] - https://gerrit.wikimedia.org/r/349413 (owner: Hashar)
[02:56:40] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp2024.codfw.wmnet,service=varnish-be
[02:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:11] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:10:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:10:11] RECOVERY - Check Varnish expiry mailbox lag on cp2024 is OK: OK: expiry mailbox lag is 149407
[03:18:51] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 738741
[03:21:01] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp2024.codfw.wmnet,service=varnish-be
[03:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:55:56] (PS4) Krinkle: phpunit: factor out logic to handle globals vars [mediawiki-config] - https://gerrit.wikimedia.org/r/349413 (owner: Hashar)
[04:00:51] RECOVERY - Check Varnish expiry mailbox lag on cp2026 is OK: OK: expiry mailbox lag is 21
[04:12:51] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=572.90 Read Requests/Sec=532.90 Write Requests/Sec=500.10 KBytes Read/Sec=39599.60 KBytes_Written/Sec=3535.60
[04:18:51] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 0
[04:20:51] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=49.00 Read Requests/Sec=5.00 Write Requests/Sec=0.50 KBytes Read/Sec=28.80 KBytes_Written/Sec=15.20
[06:56:51] PROBLEM - Host es2019 is DOWN: PING CRITICAL - Packet loss = 100%
[07:12:30] did we lose an external storage server?
[07:15:47] (PS1) Jcrespo: Emergency depool of es2019 (crashed?) [mediawiki-config] - https://gerrit.wikimedia.org/r/349727
[07:17:16] (CR) Jcrespo: [C: 2] Emergency depool of es2019 (crashed?) [mediawiki-config] - https://gerrit.wikimedia.org/r/349727 (owner: Jcrespo)
[07:18:15] (Merged) jenkins-bot: Emergency depool of es2019 (crashed?) [mediawiki-config] - https://gerrit.wikimedia.org/r/349727 (owner: Jcrespo)
[07:18:23] (CR) jenkins-bot: Emergency depool of es2019 (crashed?) [mediawiki-config] - https://gerrit.wikimedia.org/r/349727 (owner: Jcrespo)
[07:21:03] !log jynus@naos Synchronized wmf-config/db-codfw.php: Depool es2019 (duration: 02m 16s)
[07:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:28] jynus: o/ - seems frozen in the console, can I powercycle? (I saw you depooled it)
[07:23:47] not yet
[07:24:13] sure, I'll let you do it whenever you are ready
[07:27:09] did you guys get paged?
[07:28:32] es2019 again?
[07:28:33] https://phabricator.wikimedia.org/T149526
[07:28:34] :(
[07:29:18] Operations, ops-codfw, DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3203266 (jcrespo) es2019 crashed on 2017-04-22
[07:31:40] it crashed the same day one year ago
[07:31:40] anything on the console jynus ?
[07:31:45] what?!
[07:31:57] console is not highly responsive
[07:32:18] did you guys get paged? I didn't and I am worried my sms are not working?
[07:32:32] hosts going down doesn't page
[07:33:38] I got host down alerts this week for some hosts
[07:33:46] Not DBs though
[07:35:16] I told you these hosts scared me for the switchover :(
[07:35:21] let's hope it is only 2019
[07:36:57] racadm getsel for today indicates Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
[07:37:27] lovely
[07:37:36] and also DIMM_B4.
[07:37:37] :D
[07:37:48] I was checking the graphs and there were no spikes or anything before it stopped graphing
[07:38:06] elukey: can you paste what you are seeing on: https://phabricator.wikimedia.org/T149526
[07:38:11] please?
[07:38:12] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3203268 (jcrespo) Resolved>Open
[07:38:56] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3203269 (Marostegui) Server crashed again as per: T130702#3203266
[07:39:05] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3203270 (jcrespo) ``` Normal","Mon Nov 21 2016 16:22:07","Log cleared. Critical","Sat Apr 22 2017 07:52:10","CPU 2 has an internal error (IERR). Normal","Sat Apr 22 2017 06:56:24","A problem was detected related to th...
[07:39:24] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2755141 (elukey) `racadm getsel` for es2019: ``` Severity: Critical Description: CPU 2 has an internal error (IERR). ------------------------------------------------------------------------------- Record: 3 D...
[07:39:33] there you go :)
[07:40:03] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3203273 (jcrespo) ``` Severity Date and Time Message ID Summary Comment 2017-04-22T07:37:48-0500 USR0032 The session for root from 10.64.32.20 using SSH is logged off. 2017-04-22T07:37:41-0500 USR0030...
[07:40:52] thanks!
[07:41:00] looks like it had that module complaining in the past
[07:41:12] you have duplicated my comment, please delete it, it is confusing
[07:41:41] sure, I only added because Manuel asked :)
[07:41:56] removed
[07:43:37] !log powercycling es2019.codfw.wmnet, unresponsive
[07:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:11] As expected, nothing on logs
[07:48:20] don't put that
[07:48:46] they think it means on the hardware logs and then they say it is a software problem
[07:49:21] what?
[07:51:20] Operations, ops-codfw, DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3203278 (jcrespo) p:Normal>High
[07:51:43] let's start mysql without replication first?
[07:51:47] no
[07:51:50] let me restart first
[07:51:59] ok
[07:53:26] !log restarting es2019.codfw.wmnet after upgrade
[07:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:08] ah, upgrade :)
[07:56:24] the "aviate navigate communicate" of server means:
[07:56:35] 1) stop creating errors (depool)
[07:56:51] 2) investigate what really happened (hw logs in this case)
[07:57:15] 3) explain it on a ticket and log it here
[07:57:21] hehe good analogy
[07:57:48] I am starting mysql
[07:58:09] ok
[07:59:22] Clean start apparently
[08:00:01] is heartbeat still in myisam?
[08:00:26] yes
[08:00:38] we will convert that
[08:00:55] let's get the max_id from all tables
[08:02:24] I have converted heartbeat to innodb
[08:05:11] I'm running $ cat *.dblist | while read db; do mysql -BN -A -h es2019.codfw.wmnet $db -e "SELECT '$db', max(blob_id) FROM blobs_cluster25"; done
[08:05:37] good
[08:05:51] so we can run compare.py around those values
[08:06:17] yes, I know GTID should work
[08:06:24] but I want to test anyway
[08:06:50] sure
[08:09:11] PROBLEM - Check Varnish expiry mailbox lag on cp2024 is CRITICAL: CRITICAL: expiry mailbox lag is 574516
[08:11:17] you can start replication if you want
[08:11:22] I have what I wanted
[08:11:26] ok
[08:11:51] done and looking good
[08:12:02] it caught up quickly
[08:13:08] i think we should leave it depooled till monday
[08:13:18] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3203283 (jcrespo) For checking later: {P5309}
[08:13:20] I agree
[08:13:43] I will want to run compare.py on the values around the last edits
[08:13:52] that sounds sane
[08:13:52] to be 100% sure no edit was lost
[08:14:11] before pooling it again
[08:14:21] can you check the config is ok?
[08:14:40] innodb flush logs, binlog sync, gtid, ssl, etc
[08:15:04] checking
[08:16:19] it all looks normal yes
[08:16:48] root@db2062[enwiki]> ANALYZE TABLE revision; -> Terminated
[08:17:56] that was fast
[08:18:04] no, Terminated not good
[08:18:18] ah, killed??
[08:18:25] the one on db1080 finished in 3 hours
[08:18:30] I do not know, I had never seen Terminated
[08:18:50] did it finish successfully? Saying it has correctly populated the tables?
[08:19:04] yep
[08:19:10] it didn't here
[08:19:18] | enwiki.revision | analyze | status | Engine-independent statistics collected |
[08:19:21] | enwiki.revision | analyze | status | OK
[08:19:35] i would say run it again and see what happens
[08:19:40] did the other ones (page and logging) finish fine?
[08:19:50] yes
[08:20:09] it could be the query killer
[08:20:15] I was testing it there
[08:20:16] would it kill root?
[08:20:18] ah
[08:20:24] no, if it works correctly
[08:20:37] but you know, it is being tested for a reason
[08:20:42] yeah yeah
[08:20:54] leave it running and let's see if it is done by monday
[08:21:12] I am going to log off now and everything looks under control now
[08:21:17] *as
[08:21:34] thank you for responding!
[08:21:36] bye!
[08:22:47] you had it all under control when I arrived!
[08:22:56] See you monday (and hopefully not before monday!)
[11:40:01] (PS1) Volans: IRC logging, make messages more human-friendly [switchdc] - https://gerrit.wikimedia.org/r/349731 (https://phabricator.wikimedia.org/T163367)
[12:00:55] (PS1) Volans: Make more evident when a submenu was completed [switchdc] - https://gerrit.wikimedia.org/r/349732 (https://phabricator.wikimedia.org/T163371)
[12:41:56] (PS1) TheDJ: Disable mp3 uploads for now [mediawiki-config] - https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170)
[12:44:01] (CR) Paladox: [C: 1] Disable mp3 uploads for now [mediawiki-config] - https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: TheDJ)
[13:30:02] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[13:41:27] !log ema@neodymium conftool action : set/pooled=no; selector: name=cp2024.codfw.wmnet,service=varnish-be
[13:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:41] PROBLEM - Check Varnish expiry mailbox lag on cp2014 is CRITICAL: CRITICAL: expiry mailbox lag is 682138
[13:49:11] RECOVERY - Check Varnish expiry mailbox lag on cp2024 is OK: OK: expiry mailbox lag is 0
[13:53:41] RECOVERY - Check Varnish expiry mailbox lag on cp2014 is OK: OK: expiry mailbox lag is 0
[13:55:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:01:01] PROBLEM - Check Varnish expiry mailbox lag on cp2022 is CRITICAL: CRITICAL: expiry mailbox lag is 608941
[14:12:50] (PS2) Volans: IRC logging, make messages more human-friendly [switchdc] - https://gerrit.wikimedia.org/r/349731 (https://phabricator.wikimedia.org/T163367)
[14:12:51] (PS2) Volans: Make more evident when a submenu was completed [switchdc] - https://gerrit.wikimedia.org/r/349732 (https://phabricator.wikimedia.org/T163371)
[14:12:54] (PS1) Volans: Mediawiki: refactor stop/start maintenance [switchdc] - https://gerrit.wikimedia.org/r/349737 (https://phabricator.wikimedia.org/T163372)
[14:28:35] (PS2) Ema: Revert "cache_upload: lower keep from 3d to 1d on upload backends" [puppet] - https://gerrit.wikimedia.org/r/348698 (https://phabricator.wikimedia.org/T162035)
[14:28:51] (CR) Ema: [V: 2 C: 2] Revert "cache_upload: lower keep from 3d to 1d on upload backends" [puppet] - https://gerrit.wikimedia.org/r/348698 (https://phabricator.wikimedia.org/T162035) (owner: Ema)
[15:37:42] Operations, Commons, Multimedia: Some thumbnails / fullscreen images on Commons show either HTTP 503 errors or other issues - https://phabricator.wikimedia.org/T163610#3203417 (Aklapper)
[16:10:24] (PS1) Volans: DNS Discovery: add a check for the resolved address [switchdc] - https://gerrit.wikimedia.org/r/349738 (https://phabricator.wikimedia.org/T163364)
[18:57:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:05:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:13:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:14:11] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:21:11] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:24:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:11:01] RECOVERY - Check Varnish expiry mailbox lag on cp2022 is OK: OK: expiry mailbox lag is 0
[21:41:22] (CR) Urbanecm: [C: -1] "Add wgRemoveGroups definition too. Otherwise sysops would be able only to assign the rights." [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[21:41:44] (PS8) Urbanecm: Fix EducationProgram user rights so that they can be assigned/removed by sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[21:41:51] (CR) Zppix: "> Add wgRemoveGroups definition too. Otherwise sysops would be able" [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[21:42:04] (CR) Urbanecm: [C: -1] "PS8: Not every bureaucrat must be a sysop so -bureaucrats from commit MSG." [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[21:44:14] (PS9) Zppix: Fix EducationProgram user rights so that they can be assigned/removed by sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167)
[21:44:28] (PS10) Zppix: Fix EducationProgram user rights so that they can be assigned/removed by sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167)
[21:44:48] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[22:09:19] (CR) EddieGP: [C: 1] Fix EducationProgram user rights so that they can be assigned/removed by sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)