[00:00:09] <_joe_> unless this is something new
[00:00:48] hey
[00:01:10] marostegui: hola
[00:01:14] <_joe_> marostegui: we had a crash on db1098
[00:01:18] <_joe_> we just depooled it for now
[00:01:19] got the call but wasn't fast enough to pick up
[00:01:22] <_joe_> but it's a rc host
[00:01:36] sorry, just sent an SMS to jaime too (I thought your phone was out of reach according to the voice)
[00:01:39] <_joe_> it's ok to just depool one, without substitution?
[00:01:49] sure
[00:01:50] <_joe_> change we deployed is https://gerrit.wikimedia.org/r/429642
[00:01:52] let me recheck the file
[00:02:12] we just have one left for rc & co for s6 and s7
[00:02:24] what is a bit worrying is that this time too no logs on the host, just HW logs, see T193331
[00:02:25] T193331: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331
[00:02:43] <_joe_> same error as db2081 this morning
[00:02:46] volans|off: like db2081 earlier I guess
[00:02:47] yeah
[00:04:36] <_joe_> marostegui: we mainly wanted one of you to validate our actions
[00:04:39] let's leave it depooled till monday
[00:04:41] yeah
[00:04:51] I'll silence it on icinga
[00:05:00] _joe_: it is correctly depooled
[00:05:04] volans|off: thanks
[00:05:12] <_joe_> 27 minutes of bad service on s6/s6 :/
[00:05:22] <_joe_> s6/s7, I mean
[00:05:28] yeah, on rc
[00:06:13] <_joe_> marostegui: well whatever does waitforslaves is affected
[00:06:20] <_joe_> you know, the usual bug
[00:06:39] downtimed until Wed. mid-eu day
[00:06:40] just in case
[00:06:43] <_joe_> if you look at fatalmonitor, we had ~ 1000 fatals/minute for that time
[00:07:02] how many are just logs and the software retries though?
[00:07:03] <_joe_> but until the dbloadbalancer in mediawiki is fixed
[00:07:09] <_joe_> volans|off: fatals
[00:07:11] yeah, the usual balancer issue
[00:07:32] <_joe_> this should've been "not an issue", really
[00:07:45] <_joe_> it's a shame we never get to prioritize this work
[00:08:34] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:08:41] jynus: sorry for the trouble, as soon as I sent you the SMS manuel got online
[00:08:46] I can give you the TL;DR
[00:08:59] <_joe_> ok, enough rants. I was literally going to bed when this happened. I will get back in that direction
[00:09:02] 2 hosts on the same day?
[00:09:09] a bit worrying yeah, same error
[00:09:12] jynus: yep
[00:09:14] smartctl killing hosts?
[00:09:35] Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166266 (Volans) I've downtimed db1098 on Icinga until Wed mid EU day and disabled notifications.
[00:09:50] I saw some disk errors on the other one
[00:15:50] Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166270 (Marostegui) This is the same error as db2081 earlier today: T193325 ``` The Intel Management Engine has recovered the ability to utilize the PECI over DMI facility. If the PWR2262 "internal system erro...
[00:19:12] Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166274 (Marostegui) T175973#3615656 db1100 suffered it too which is the same batch as db1098
[00:19:19] Given it is now under control, and it is 2am, I am going to go back to bed and will debug more tomorrow/monday
[00:19:20] I didn't see any disk error in getraclogs or get-raid-status-megacli -a
[00:19:28] thanks volans|off for phoning :)
[00:19:30] Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166276 (jcrespo) ``` 2018-04-28T23:28:04-0500 LOG007 The previous log entry was repeated 1 times. 2018-04-29T00:13:43-0500 SYS1003 System CPU Resetting. 2018-04-29T00:13:42-0500 SYS1000...
[00:20:38] marostegui: sorry to bother, mostly wanted to double check that was ok to leave it with only one slave for the specific roles
[00:20:51] thanks for checking in, both of you!
[00:21:04] I think I'll head off to bed too at this point
[02:18:15] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1968 bytes in 0.106 second response time
[02:40:24] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1963 bytes in 0.089 second response time
[03:16:07] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:16:57] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 1.321 second response time
[03:26:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.93 seconds
[03:26:50] (PS1) ArielGlenn: disable nfs file attr caching on the last of the snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/429647 (https://phabricator.wikimedia.org/T191177)
[03:28:56] (CR) ArielGlenn: [C: 2] disable nfs file attr caching on the last of the snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/429647 (https://phabricator.wikimedia.org/T191177) (owner: ArielGlenn)
[03:55:02] Operations, Datasets-General-or-Unknown, User-ArielGlenn: Reboots of dumps/snapshot hosts - https://phabricator.wikimedia.org/T188242#4166349 (ArielGlenn) Open>Resolved snapshot1007 and dumpsdata1001 have been rebooted at last.
[04:10:54] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 23.26 seconds
[04:18:09] (PS1) ArielGlenn: Revert "disable cron for partial dumps temporarily" [puppet] - https://gerrit.wikimedia.org/r/429650
[04:18:17] (PS2) ArielGlenn: Revert "disable cron for partial dumps temporarily" [puppet] - https://gerrit.wikimedia.org/r/429650
[04:21:27] (CR) ArielGlenn: [C: 2] Revert "disable cron for partial dumps temporarily" [puppet] - https://gerrit.wikimedia.org/r/429650 (owner: ArielGlenn)
[04:50:24] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1972 bytes in 0.095 second response time
[04:52:05] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[04:52:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[06:27:25] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.117 second response time
[06:45:16] Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166362 (Marostegui) a:Cmjohnson @Cmjohnson can we do the same thing we did to db1100? (which had never had another crash ever since): - Check if there are BIOS/firmware updates available - Power drain the h...
[06:45:40] Operations, ops-eqiad, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166365 (Marostegui)
[06:50:26] (PS1) Marostegui: db1098.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/429652 (https://phabricator.wikimedia.org/T193331)
[06:51:07] (CR) Marostegui: [C: 2] db1098.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/429652 (https://phabricator.wikimedia.org/T193331) (owner: Marostegui)
[07:01:39] Operations, ops-eqiad, DBA, Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166371 (Marostegui) I have started MySQL on db1098 to: - Make sure nothing is corrupted and replication can flow - Avoid leaving the host to fall behind replication for 2 da...
[07:06:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:06:15] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:16:48] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166373 (alanajjar) **Note:** All name changes are turned off until this problem is fixed! so...
[07:17:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[07:17:16] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[07:20:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:20:25] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:24:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[07:24:26] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[07:25:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:25:26] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:30:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[07:30:35] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[07:40:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:40:36] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[09:22:43] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166434 (MarcoAurelio) I only see one stuck global rename right now. Yes, it is true however...
[09:30:32] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166435 (Tgr) >>! In T193254#4165780, @Nirmos wrote: > Is this because of https://gerrit.wiki...
[09:32:22] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166436 (MarcoAurelio) Any idea why that may be happening? Issues with the meta job queue? Th...
[09:33:26] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166437 (alanajjar) >>! In T193254#4165761, @1997kB wrote: > [[https://meta.wikimedia.org/wik...
[09:57:01] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166457 (Tgr) @mobrovac do you know if LocalRenameUserJob jobs on meta (and only there) could...
[10:06:15] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166459 (Tgr) It seems the last successful non-CLI rename on meta [[https://logstash.wikimedi...
[11:35:59] Operations, ops-eqiad, DBA, Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166479 (Marostegui) As a side note. Either db1098 (T193331), db2081 (T193325) and db1100 (T175973) (they were all coming from the same batch of purchases (T162159 and T162233...
[11:36:08] Operations, ops-codfw, DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166483 (Marostegui) As a side note. Either db1098 (T193331), db2081 (T193325) and db1100 (T175973) (they were all coming from the same batch of purchases (T162159 and T16...
[13:53:45] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[13:54:36] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy
[14:02:29] (PS1) Jcrespo: Add mysql.py wrapper [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/429654
[16:08:55] (PS5) ArielGlenn: Wikidata JSON dump: Only dump batches of ~400,000 pages at once [puppet] - https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: Hoo man)
[16:09:54] (CR) ArielGlenn: [C: 2] Wikidata JSON dump: Only dump batches of ~400,000 pages at once [puppet] - https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: Hoo man)
[16:10:01] :)
[16:18:11] so we didn't get affected by the AMS power outage, right
[16:18:38] said "near Shiphol" massive power outage in the Amsterdam region.. and services like Telegram very affected
[16:19:25] (PS5) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[16:37:45] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1965 bytes in 0.109 second response time
[16:38:01] Finally :)
[16:48:17] (PS6) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[17:05:19] (PS7) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[17:07:57] (PS1) Hoo man: Increase dispatching resources by about 50% [puppet] - https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349)
[17:23:04] (PS2) Hoo man: Increase dispatching resources by about 50% [puppet] - https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349)
[17:23:50] (CR) Hoo man: Increase dispatching resources by about 50% (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349) (owner: Hoo man)
[17:24:23] (PS8) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[17:25:26] (PS9) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[17:26:22] (CR) ArielGlenn: [C: 2] pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: ArielGlenn)
[17:46:52] !log rebuilding image metadata for PDFs on commons on terbium
[17:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:45] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1970 bytes in 0.112 second response time
[19:08:43] (PS1) ArielGlenn: fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726)
[19:09:06] (CR) jerkins-bot: [V: -1] fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726) (owner: ArielGlenn)
[19:10:29] (PS2) ArielGlenn: fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726)
[19:10:53] (CR) jerkins-bot: [V: -1] fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726) (owner: ArielGlenn)
[19:14:00] (PS3) ArielGlenn: fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726)
[19:16:37] (CR) ArielGlenn: [C: 2] fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726) (owner: ArielGlenn)
[19:46:04] (PS1) Urbanecm: Enable flood flag on sourceswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/429668 (https://phabricator.wikimedia.org/T193350)
[20:24:46] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1946 bytes in 0.103 second response time
[20:36:46] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.100 second response time
[22:41:45] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.096 second response time