[00:00:09] <_joe_>	 unless this is something new
[00:00:48] <marostegui>	 hey
[00:01:10] <volans|off>	 marostegui: hola
[00:01:14] <_joe_>	 marostegui: we had a crash on db1098
[00:01:18] <_joe_>	 we just depooled it for now
[00:01:19] <marostegui>	 got the call but wasn't fast enough to pick up
[00:01:22] <_joe_>	 but it's a rc host
[00:01:36] <volans|off>	 sorry, just sent an SMS to jaime too (I thought your phone was out of reach according to the voice)
[00:01:39] <_joe_>	 it's ok to just depool one, without substitution?
[00:01:49] <marostegui>	 sure
[00:01:50] <_joe_>	 change we deployed is https://gerrit.wikimedia.org/r/429642
[00:01:52] <marostegui>	 let me recheck the file
[00:02:12] <volans|off>	 we just have one left for rc & co for s6 and s7
[00:02:24] <volans|off>	 what is a bit worrying is that this time too there are no logs on the host, just HW logs, see T193331
[00:02:25] <stashbot>	 T193331: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331
[00:02:43] <_joe_>	 same error as db2081 this morning
[00:02:46] <marostegui>	 volans|off: like db2081 earlier I guess
[00:02:47] <marostegui>	 yeah
[00:04:36] <_joe_>	 marostegui: we mainly wanted one of you to validate our actions
[00:04:39] <marostegui>	 let's leave it depooled till monday
[00:04:41] <marostegui>	 yeah
[00:04:51] <volans|off>	 I'll silence it on icinga
[00:05:00] <marostegui>	 _joe_: it is correctly depooled
[00:05:04] <marostegui>	 volans|off: thanks
[00:05:12] <_joe_>	 27 minutes of bad service on s6/s6 :/
[00:05:22] <_joe_>	 s6/s7, I mean
[00:05:28] <marostegui>	 yeah, on rc
[00:06:13] <_joe_>	 marostegui: well whatever does waitforslaves is affected
[00:06:20] <_joe_>	 you know, the usual bug
[00:06:39] <volans|off>	 downtimed until Wed. mid-EU day
[00:06:40] <volans|off>	 just in case
[00:06:43] <_joe_>	 if you look at fatalmonitor, we had ~ 1000 fatals/minute for that time
[00:07:02] <volans|off>	 how many are just logs and the software retries though?
[00:07:03] <_joe_>	 but until the dbloadbalancer in mediawiki is fixed
[00:07:09] <_joe_>	 volans|off: fatals
[00:07:11] <marostegui>	 yeah, the usual balancer issue
[00:07:32] <_joe_>	 this should've been "not an issue", really
[00:07:45] <_joe_>	 it's a shame we never get to prioritize this work
[00:08:34] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:08:41] <volans|off>	 jynus: sorry for the trouble, as soon as I sent you the SMS manuel got online
[00:08:46] <volans|off>	 I can give you the TL;DR
[00:08:59] <_joe_>	 ok, enough rants. I was literally going to bed when this happened. I will get back in that direction
[00:09:02] <jynus>	 2 hosts on the same day?
[00:09:09] <volans|off>	 a bit worrying yeah, same error
[00:09:12] <marostegui>	 jynus: yep
[00:09:14] <jynus>	 smartctl killing hosts?
[00:09:35] <wikibugs_>	 Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166266 (Volans) I've downtimed db1098 on Icinga until Wed mid EU day and disabled notifications.
[00:09:50] <jynus>	 I saw some disk errors on the other one
[00:15:50] <wikibugs_>	 Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166270 (Marostegui) This is the same error as db2081 earlier today: T193325 ```  The Intel Management Engine has recovered the ability to utilize the PECI over DMI facility.  If the PWR2262 "internal system erro...
[00:19:12] <wikibugs_>	 Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166274 (Marostegui) T175973#3615656 db1100 suffered it too which is the same batch as db1098
[00:19:19] <marostegui>	 Given it is now under control, and it is 2am, I am going to go back to bed and will debug more tomorrow/monday
[00:19:20] <volans|off>	 I didn't see any disk error in getraclogs or get-raid-status-megacli -a
[00:19:28] <marostegui>	 thanks volans|off for phoning :)
[00:19:30] <wikibugs_>	 Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166276 (jcrespo) ```      2018-04-28T23:28:04-0500 LOG007  The previous log entry was repeated 1 times.          2018-04-29T00:13:43-0500 SYS1003  System CPU Resetting.          2018-04-29T00:13:42-0500 SYS1000...
[00:20:38] <volans|off>	 marostegui: sorry to bother, mostly wanted to double check that was ok to leave it with only one slave for the specific roles
[00:20:51] <volans|off>	 thanks for checking in, both of you!
[00:21:04] <volans|off>	 I think I'll head off to bed too at this point
[02:18:15] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1968 bytes in 0.106 second response time
[02:40:24] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1963 bytes in 0.089 second response time
[03:16:07] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:16:57] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 1.321 second response time
[03:26:35] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.93 seconds
[03:26:50] <wikibugs_>	 (PS1) ArielGlenn: disable nfs file attr caching on the last of the snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/429647 (https://phabricator.wikimedia.org/T191177)
[03:28:56] <wikibugs_>	 (CR) ArielGlenn: [C: 2] disable nfs file attr caching on the last of the snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/429647 (https://phabricator.wikimedia.org/T191177) (owner: ArielGlenn)
[03:55:02] <wikibugs_>	 Operations, Datasets-General-or-Unknown, User-ArielGlenn: Reboots of dumps/snapshot hosts - https://phabricator.wikimedia.org/T188242#4166349 (ArielGlenn) Open>Resolved snapshot1007 and dumpsdata1001 have been rebooted at last.
[04:10:54] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 23.26 seconds
[04:18:09] <wikibugs_>	 (PS1) ArielGlenn: Revert "disable cron for partial dumps temporarily" [puppet] - https://gerrit.wikimedia.org/r/429650
[04:18:17] <wikibugs_>	 (PS2) ArielGlenn: Revert "disable cron for partial dumps temporarily" [puppet] - https://gerrit.wikimedia.org/r/429650
[04:21:27] <wikibugs_>	 (CR) ArielGlenn: [C: 2] Revert "disable cron for partial dumps temporarily" [puppet] - https://gerrit.wikimedia.org/r/429650 (owner: ArielGlenn)
[04:50:24] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1972 bytes in 0.095 second response time
[04:52:05] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[04:52:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[06:27:25] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.117 second response time
[06:45:16] <wikibugs_>	 Operations, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166362 (Marostegui) a:Cmjohnson @Cmjohnson can we do the same thing we did to db1100? (which had never had another crash ever since):  - Check if there are BIOS/firmware updates available - Power drain the h...
[06:45:40] <wikibugs_>	 Operations, ops-eqiad, DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166365 (Marostegui)
[06:50:26] <wikibugs_>	 (PS1) Marostegui: db1098.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/429652 (https://phabricator.wikimedia.org/T193331)
[06:51:07] <wikibugs_>	 (CR) Marostegui: [C: 2] db1098.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/429652 (https://phabricator.wikimedia.org/T193331) (owner: Marostegui)
[07:01:39] <wikibugs_>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166371 (Marostegui) I have started MySQL on db1098 to:  - Make sure nothing is corrupted and replication can flow - Avoid leaving the host to fall behind replication for 2 da...
[07:06:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:06:15] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:16:48] <wikibugs_>	 Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166373 (alanajjar) **Note:** All name changes are turned off until this problem is fixed! so...
[07:17:16] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[07:17:16] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[07:20:25] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:20:25] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:24:25] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[07:24:26] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[07:25:25] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:25:26] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:30:26] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[07:30:35] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[07:40:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:40:36] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[09:22:43] <wikibugs_>	 Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166434 (MarcoAurelio) I only see one stuck global rename right now. Yes, it is true however...
[09:30:32] <wikibugs_>	 Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166435 (Tgr) >>! In T193254#4165780, @Nirmos wrote: > Is this because of https://gerrit.wiki...
[09:32:22] <wikibugs_>	 Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166436 (MarcoAurelio) Any idea why that may be happening? Issues with the meta job queue? Th...
[09:33:26] <wikibugs_>	 Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166437 (alanajjar) >>! In T193254#4165761, @1997kB wrote: > [[https://meta.wikimedia.org/wik...
[09:57:01] <wikibugs_>	 Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166457 (Tgr) @mobrovac do you know if LocalRenameUserJob jobs on meta (and only there) could...
[10:06:15] <wikibugs_>	 Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4166459 (Tgr) It seems the last successful non-CLI rename on meta [[https://logstash.wikimedi...
[11:35:59] <wikibugs_>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166479 (Marostegui) As a side note. Either db1098 (T193331), db2081 (T193325) and db1100 (T175973) (they were all coming from the same batch of purchases (T162159 and T162233...
[11:36:08] <wikibugs_>	 Operations, ops-codfw, DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166483 (Marostegui) As a side note. Either db1098 (T193331), db2081 (T193325) and db1100 (T175973) (they were all coming from the same batch of purchases (T162159 and T16...
[13:53:45] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[13:54:36] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy
[14:02:29] <wikibugs_>	 (PS1) Jcrespo: Add mysql.py wrapper [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/429654
[16:08:55] <wikibugs_>	 (PS5) ArielGlenn: Wikidata JSON dump: Only dump batches of ~400,000 pages at once [puppet] - https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: Hoo man)
[16:09:54] <wikibugs_>	 (CR) ArielGlenn: [C: 2] Wikidata JSON dump: Only dump batches of ~400,000 pages at once [puppet] - https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: Hoo man)
[16:10:01] <hoo>	 :)
[16:18:11] <mutante>	 so we didnt get affected by the AMS power outage, right
[16:18:38] <mutante>	 said "near Schiphol" massive power outage in the Amsterdam region.. and services like Telegram very affected
[16:19:25] <wikibugs_>	 (PS5) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[16:37:45] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1965 bytes in 0.109 second response time
[16:38:01] <hoo>	 Finally :)
[16:48:17] <wikibugs_>	 (PS6) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[17:05:19] <wikibugs_>	 (PS7) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[17:07:57] <wikibugs_>	 (PS1) Hoo man: Increase dispatching resources by about 50% [puppet] - https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349)
[17:23:04] <wikibugs_>	 (PS2) Hoo man: Increase dispatching resources by about 50% [puppet] - https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349)
[17:23:50] <wikibugs_>	 (CR) Hoo man: Increase dispatching resources by about 50% (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349) (owner: Hoo man)
[17:24:23] <wikibugs_>	 (PS8) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[17:25:26] <wikibugs_>	 (PS9) ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[17:26:22] <wikibugs_>	 (CR) ArielGlenn: [C: 2] pull phabricator dumps from phab server to dumps web server [puppet] - https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: ArielGlenn)
[17:46:52] <brion>	 !log rebuilding image metadata for PDFs on commons on terbium
[17:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:45] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1970 bytes in 0.112 second response time
[19:08:43] <wikibugs_>	 (PS1) ArielGlenn: fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726)
[19:09:06] <wikibugs_>	 (CR) jerkins-bot: [V: -1] fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726) (owner: ArielGlenn)
[19:10:29] <wikibugs_>	 (PS2) ArielGlenn: fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726)
[19:10:53] <wikibugs_>	 (CR) jerkins-bot: [V: -1] fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726) (owner: ArielGlenn)
[19:14:00] <wikibugs_>	 (PS3) ArielGlenn: fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726)
[19:16:37] <wikibugs_>	 (CR) ArielGlenn: [C: 2] fix up source dir for sync of phab dumps to public webserver [puppet] - https://gerrit.wikimedia.org/r/429666 (https://phabricator.wikimedia.org/T188726) (owner: ArielGlenn)
[19:46:04] <wikibugs_>	 (PS1) Urbanecm: Enable flood flag on sourceswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/429668 (https://phabricator.wikimedia.org/T193350)
[20:24:46] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1946 bytes in 0.103 second response time
[20:36:46] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.100 second response time
[22:41:45] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.096 second response time