[01:56:07] PROBLEM - MegaRAID on db1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[01:56:08] ACKNOWLEDGEMENT - MegaRAID on db1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T183708
[01:56:12] Operations, ops-eqiad: Degraded RAID on db1001 - https://phabricator.wikimedia.org/T183708#3861352 (ops-monitoring-bot)
[02:06:46] Operations, ops-eqiad, DBA: Degraded RAID on db1001 - https://phabricator.wikimedia.org/T183708#3861355 (Peachey88)
[03:24:36] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 789.72 seconds
[03:50:37] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 116.17 seconds
[05:54:36] PROBLEM - Check Varnish expiry mailbox lag on cp4024 is CRITICAL: CRITICAL: expiry mailbox lag is 2082652
[07:23:06] Operations, ops-eqiad, DBA: Degraded RAID on db1001 - https://phabricator.wikimedia.org/T183708#3861434 (Marostegui) a: Cmjohnson Even though this server will be decommissioned (hopefully) during next Q, let's get the disk replaced when possible. We should have plenty of 300G disks from the old dec...
[07:29:27] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:17] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 75025 bytes in 0.314 second response time
[07:41:17] (PS1) Marostegui: install_server: Allow reinstall db1113,db1114 [puppet] - https://gerrit.wikimedia.org/r/400268 (https://phabricator.wikimedia.org/T182896)
[07:41:22] Operations, DBA, Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3861442 (Marostegui)
[07:41:55] (Draft2) Jayprakash12345: Add new namespace aliases on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/400267
[07:42:21] (PS3) Jayprakash12345: Add new namespace aliases on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/400267 (https://phabricator.wikimedia.org/T183711)
[07:43:26] (CR) Marostegui: [C: +2] install_server: Allow reinstall db1113,db1114 [puppet] - https://gerrit.wikimedia.org/r/400268 (https://phabricator.wikimedia.org/T182896) (owner: Marostegui)
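The "RAID handler auto-ack" at 01:56 above is an Icinga event handler that acknowledges the alert and has ops-monitoring-bot open a Phabricator task. The sketch below shows the general pattern only, assuming Icinga's NAGIOS_* environment macros and a Conduit API token; the real handler lives in the operations/puppet repo and differs in detail, and the endpoint URL and token variable here are placeholders.

```python
#!/usr/bin/env python3
"""Sketch of a RAID-alert event handler: file a Phabricator task through
the Conduit API on a hard CRITICAL. Illustrative only; env var names
assume Nagios/Icinga environment macros, PHAB_URL/PHAB_TOKEN are fake."""
import os
import requests

PHAB_URL = "https://phabricator.example.org/api/maniphest.createtask"

def file_raid_task(host: str, check_output: str) -> dict:
    """Create a 'Degraded RAID on <host>' task; the Conduit result
    includes the new task's id and uri."""
    resp = requests.post(PHAB_URL, data={
        "api.token": os.environ["PHAB_TOKEN"],  # placeholder token source
        "title": f"Degraded RAID on {host}",
        "description": f"Icinga output:\n```\n{check_output}\n```",
    }, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    if body.get("error_code"):
        raise RuntimeError(body["error_info"])
    return body["result"]

if __name__ == "__main__":
    # Event handlers fire on every state change; only file on hard CRITICAL.
    if (os.environ.get("NAGIOS_SERVICESTATE") == "CRITICAL"
            and os.environ.get("NAGIOS_SERVICESTATETYPE") == "HARD"):
        task = file_raid_task(os.environ["NAGIOS_HOSTNAME"],
                              os.environ.get("NAGIOS_SERVICEOUTPUT", ""))
        print(f"filed task {task.get('id')}: {task.get('uri')}")
```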
[08:08:28] (CR) TerraCodes: [C: +1] "Wouldn't it break things like global userpages having "hi, you can contact me via my email if its private"?" [mediawiki-config] - https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) (owner: EddieGP)
[08:15:40] (CR) Marostegui: [C: +1] mariadb: Repool db1055 & db1056 as x1 replicas [mediawiki-config] - https://gerrit.wikimedia.org/r/399782 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo)
[08:22:23] (CR) Marostegui: mariadb: Decommissioning proposal (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/399792 (https://phabricator.wikimedia.org/T134476) (owner: Jcrespo)
[09:09:02] (CR) Thiemo Kreuz (WMDE): [C: +1] Fix linewrap issue on wikimedia error page (1 comment) [puppet] - https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: Phantom42)
[09:13:26] PROBLEM - Apache HTTP on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:14:17] RECOVERY - Apache HTTP on mw2125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.119 second response time
[09:23:34] Operations, DBA, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3861506 (Marostegui)
[09:46:19] (CR) 星耀晨曦: [C: +1] Add new namespace aliases on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/400267 (https://phabricator.wikimedia.org/T183711) (owner: Jayprakash12345)
[09:59:28] (PS4) ArielGlenn: move ferm rules for nfs out from dumps module to a profile [puppet] - https://gerrit.wikimedia.org/r/400244
[10:00:01] (CR) jerkins-bot: [V: -1] move ferm rules for nfs out from dumps module to a profile [puppet] - https://gerrit.wikimedia.org/r/400244 (owner: ArielGlenn)
[10:02:39] (PS5) ArielGlenn: move ferm rules for nfs out from dumps module to a profile [puppet] - https://gerrit.wikimedia.org/r/400244
[10:03:13] (CR) jerkins-bot: [V: -1] move ferm rules for nfs out from dumps module to a profile [puppet] - https://gerrit.wikimedia.org/r/400244 (owner: ArielGlenn)
[10:04:44] (PS6) ArielGlenn: move ferm rules for nfs out from dumps module to a profile [puppet] - https://gerrit.wikimedia.org/r/400244
[10:06:26] PROBLEM - HHVM rendering on mw2138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:07:17] RECOVERY - HHVM rendering on mw2138 is OK: HTTP OK: HTTP/1.1 200 OK - 75073 bytes in 0.296 second response time
[10:38:09] (PS7) ArielGlenn: move ferm rules for nfs out from dumps module to a profile [puppet] - https://gerrit.wikimedia.org/r/400244
[10:40:26] (CR) ArielGlenn: [C: +2] move ferm rules for nfs out from dumps module to a profile [puppet] - https://gerrit.wikimedia.org/r/400244 (owner: ArielGlenn)
[10:47:16] (PS1) ArielGlenn: don't export dumps web server filesystems to snapshots, they don't use it [puppet] - https://gerrit.wikimedia.org/r/400386
[11:05:05] (CR) ArielGlenn: [C: +2] don't export dumps web server filesystems to snapshots, they don't use it [puppet] - https://gerrit.wikimedia.org/r/400386 (owner: ArielGlenn)
[11:09:31] (PS1) ArielGlenn: allow dumps nfs server to be configured without clients if needed [puppet] - https://gerrit.wikimedia.org/r/400387
[11:09:39] (CR) EddieGP: "> Wouldn't it break things like global userpages having "hi, you can" [mediawiki-config] - https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) (owner: EddieGP)
[11:14:37] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0
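The expiry mailbox lag alert from 05:54 on cp4024 clears here. The lag is the backlog of expiry messages that Varnish worker threads have mailed to the expiry thread but that it has not yet processed; a value in the millions means the thread has fallen far behind. Below is a minimal sketch of such a check, assuming the lag can be computed as varnishstat's MAIN.exp_mailed minus MAIN.exp_received and that `varnishstat -j` emits the flat Varnish 4-era JSON layout; the thresholds are illustrative, not the production ones.

```python
#!/usr/bin/env python3
"""Nagios-style sketch of the 'Varnish expiry mailbox lag' check above.
Assumption: lag = MAIN.exp_mailed - MAIN.exp_received, read from the flat
JSON of Varnish 4-era `varnishstat -j`; thresholds are made up."""
import json
import subprocess
import sys

WARN, CRIT = 10_000, 100_000  # illustrative thresholds

def mailbox_lag() -> int:
    stats = json.loads(subprocess.check_output(["varnishstat", "-j"]))
    # Messages mailed to the expiry thread minus messages it has picked up.
    return stats["MAIN.exp_mailed"]["value"] - stats["MAIN.exp_received"]["value"]

def main() -> int:
    lag = mailbox_lag()
    if lag >= CRIT:
        print(f"CRITICAL: expiry mailbox lag is {lag}")
        return 2
    if lag >= WARN:
        print(f"WARNING: expiry mailbox lag is {lag}")
        return 1
    print(f"OK: expiry mailbox lag is {lag}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A lag that keeps growing, as on cp4024 earlier, generally points to a stuck expiry thread rather than transient load, which is consistent with the later drop straight back to 0.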
[11:16:12] (CR) ArielGlenn: [C: +2] allow dumps nfs server to be configured without clients if needed [puppet] - https://gerrit.wikimedia.org/r/400387 (owner: ArielGlenn)
[11:31:10] (PS1) ArielGlenn: create a profile for nginx-extras package for dumps [puppet] - https://gerrit.wikimedia.org/r/400391
[11:32:49] Operations, HHVM, Patch-For-Review, User-Elukey: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3861568 (elukey) mw2246 today reported a failure in logrotate: ``` /etc/cron.daily/logrotate: Job for apache2.service failed because the control process exited with er...
[11:33:00] Cc: volans --^ :)
[11:33:51] Operations, Mail, MediaWiki-Watchlist: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#3861569 (hoo) >>! In T121105#3860761, @Aklapper wrote: > @hoo, @Lydia_Pintscher : Still an issue, 18 months later? Or should this task be closed? I haven't had any (ap...
[11:35:36] elukey: ?
[11:36:28] volans: cronspam from some videoscalers, I thought to ping you since the last time we had a chat about it
[11:36:42] Operations, Mail, MediaWiki-Watchlist: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#3861570 (Lydia_Pintscher) Open→Resolved a: Lydia_Pintscher Yeah it seems ok now.
[11:37:03] it was a FYI, didn't mean to page you :P
[11:37:13] ah got it, and those are stretch right?
[11:37:30] no "page", don't worry ;)
[11:38:28] good to know, let's see if it's common to all of them or just a race condition
[11:40:07] (CR) ArielGlenn: [C: +2] create a profile for nginx-extras package for dumps [puppet] - https://gerrit.wikimedia.org/r/400391 (owner: ArielGlenn)
[11:40:29] elukey: if you need to page volans just write cumin
[11:40:32] :p
[11:42:38] marostegui: rotfl... that's a lie :-P
[11:43:20] volans: we both know you have a notification for any cumin word, you just refuse to admit it
[11:44:00] who knows, you know irssi notifications are so hard to implement
[11:44:25] send me your config and I can check it for you :-p
[11:44:30] ahahahah
[11:45:04] it's not safe, there could be PII in it :-P
[11:45:25] ~/.irssi# wc -l config
[11:45:26] 434 config
[11:47:00] 669 here, but there might be some boilerplate autoadded
[11:47:14] I just need to remove 3 lines to be perfect
[11:47:23] hahaha
[11:47:59] inb4 removes nickserv password
[11:48:52] volans: cat config | grep hilights for me?
[11:49:51] lol
[11:50:57] 2018 will be the year I will find out whether you have a notification for it or not (you do)
[11:51:38] marostegui: hilights = (
[11:51:53] :-P
[11:53:38] ok in 2018 I'll tell you
[11:54:39] (PS1) ArielGlenn: move ipv6 setup for dump web servers to the appropriate profiles [puppet] - https://gerrit.wikimedia.org/r/400394
[11:59:02] (CR) ArielGlenn: [C: +2] move ipv6 setup for dump web servers to the appropriate profiles [puppet] - https://gerrit.wikimedia.org/r/400394 (owner: ArielGlenn)
[13:28:20] (PS1) ArielGlenn: get rid of redundant code in dumps web server manifests [puppet] - https://gerrit.wikimedia.org/r/400403
[13:36:25] Operations, Developer-Relations, cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3861687 (Qgil)
[13:36:33] (PS2) ArielGlenn: get rid of redundant code in dumps web server manifests [puppet] - https://gerrit.wikimedia.org/r/400403
[13:39:37] Operations, Developer-Relations: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3861689 (Qgil)
[14:02:20] Operations, Developer-Relations, cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3861707 (Qgil) > Guidelines for pre-SSO usernames i.e. "use your Wikimedia username"? If I am reading [[ https://meta.discourse.org/t/is...
[14:04:49] (CR) ArielGlenn: [C: +2] get rid of redundant code in dumps web server manifests [puppet] - https://gerrit.wikimedia.org/r/400403 (owner: ArielGlenn)
[14:13:47] (PS1) ArielGlenn: get rid of redundant code in dumps nfs server manifests [puppet] - https://gerrit.wikimedia.org/r/400405
[14:24:31] (CR) ArielGlenn: [C: +2] get rid of redundant code in dumps nfs server manifests [puppet] - https://gerrit.wikimedia.org/r/400405 (owner: ArielGlenn)
[15:35:06] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4027 is CRITICAL: connect to address 10.128.0.127 and port 3128: Connection refused
[15:36:06] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4027 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time
[15:41:33] (PS1) Urbanecm: Add suppressredirect to autoreview/editor at ruwikt [mediawiki-config] - https://gerrit.wikimedia.org/r/400409 (https://phabricator.wikimedia.org/T183719)
[15:44:26] PROBLEM - Long running screen/tmux on analytics1003 is CRITICAL: CRIT: Long running SCREEN process. (PID: 5624, 1733528s 1728000s).
[15:49:27] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused
[15:53:20] on it ^
[15:53:33] thanks, I was just looking
[15:54:06] it seems to have been restarted a few minutes ago (?)
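The pdfrender and Varnish backend alerts above ("connect to address ... Connection refused") are plain TCP connect probes with Nagios exit semantics. A minimal sketch of such a probe, assuming the 10-second timeout the checks in this log use:

```python
#!/usr/bin/env python3
"""Minimal TCP connect probe in the spirit of the Icinga checks above:
exit 0 (OK) if the port accepts a connection, 2 (CRITICAL) otherwise."""
import socket
import sys

def check_tcp(host: str, port: int, timeout: float = 10.0) -> int:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"OK: connected to {host}:{port}")
            return 0
    except socket.timeout:
        print(f"CRITICAL: socket timeout after {timeout:g} seconds")
        return 2
    except OSError as exc:  # e.g. ECONNREFUSED, as seen for pdfrender above
        print(f"CRITICAL: connect to {host}:{port} failed: {exc}")
        return 2

if __name__ == "__main__":
    host, port = sys.argv[1], int(sys.argv[2])
    sys.exit(check_tcp(host, port))
```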
[15:54:30] !log mobrovac@tin Started restart [electron-render/deploy@94d27d7]: Bounce Electron, stuck - T174916
[15:54:54] heh
[15:55:26] PROBLEM - https://phabricator.wikimedia.org on phab1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:55:36] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[15:55:42] ah ha
[15:56:17] RECOVERY - https://phabricator.wikimedia.org on phab1001 is OK: HTTP OK: HTTP/1.1 200 OK - 34525 bytes in 0.238 second response time
[15:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:57] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[15:57:10] <_joe_> what did happen with phab?
[15:57:15] <_joe_> anyone have any ideas?
[15:57:32] nope
[15:57:54] i was in here because of pdfrender, but phab went and came back before I could even look
[15:58:12] <_joe_> ok
[15:58:25] <_joe_> pdfrender was just one machine?
[15:58:34] yep just the one this time
[15:58:47] <_joe_> ok
[15:58:50] perhaps i should knock on wood or something...
[16:01:07] phab transient error during holidays? sigh
[16:01:12] yeah
[16:01:40] it had already come back when I got the pages and realized that would interrupt everyone's relaxing vacation evening
[16:01:41] meh
[16:01:43] <_joe_> akosiaris: seems so
[17:00:59] !log restarting apache on phab1001
[17:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:08] phab error is not entirely transient - there is a problem where something is eating up workers and keeping them marked as 'busy' until eventually it runs out of available workers - see the 'apache connections' section on https://grafana.wikimedia.org/dashboard/db/phabricator?orgId=1
[17:02:26] I think the problem has gone unnoticed because I usually restart apache once a week for updates
[17:23:04] oh? huh
[17:23:06] thank you
[17:23:07] PROBLEM - Disk space on ms-be1033 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdk1 is not accessible: Input/output error
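The worker-exhaustion pattern described at 17:02 can be watched directly via Apache's mod_status scoreboard, which reports busy and idle worker counts. A sketch follows; the /server-status?auto endpoint on localhost is an assumption (mod_status must be enabled and reachable there), while the BusyWorkers/IdleWorkers fields are standard mod_status output:

```python
#!/usr/bin/env python3
"""Sketch: read Apache's mod_status scoreboard to spot the 'workers stuck
busy until the pool is exhausted' pattern described above. Assumes
mod_status is enabled and /server-status?auto is reachable locally."""
from urllib.request import urlopen

STATUS_URL = "http://localhost/server-status?auto"  # assumed endpoint

def worker_counts():
    with urlopen(STATUS_URL, timeout=5) as resp:
        text = resp.read().decode()
    # ?auto output is one 'Key: value' pair per line, e.g. 'BusyWorkers: 12'.
    fields = dict(line.split(": ", 1) for line in text.splitlines() if ": " in line)
    return int(fields["BusyWorkers"]), int(fields["IdleWorkers"])

if __name__ == "__main__":
    busy, idle = worker_counts()
    print(f"busy={busy} idle={idle}")
    if idle == 0:
        print("worker pool exhausted: new requests will queue or time out")
```

Graphing these two numbers over time is essentially what the 'apache connections' panel on the Grafana dashboard linked above shows: busy workers climbing steadily until the weekly restart resets them.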
[17:27:27] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sdk1]
[17:28:24] that's legit, sdk shows errors in dmesg
[17:37:05] Operations, ops-eqiad: failed disk on ms-be1033 - https://phabricator.wikimedia.org/T183723#3861840 (ArielGlenn) p: Triage→Normal
[17:42:06] PROBLEM - HP RAID on ms-be1033 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 2I:2:1 - Controller: OK - Battery/Capacitor: OK
[17:42:11] ACKNOWLEDGEMENT - HP RAID on ms-be1033 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 2I:2:1 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T183724
[17:42:14] Operations, ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183724#3861857 (ops-monitoring-bot)
[17:44:13] apergos: I would merge both tasks probably
[17:45:54] meh, mine can be deleted, I forgot about the autogenerated ones
[17:45:58] Operations, ops-eqiad: failed disk on ms-be1033 - https://phabricator.wikimedia.org/T183723#3861865 (Marostegui)
[17:46:01] Operations, ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183724#3861867 (Marostegui)
[17:46:15] although I did go looking to see if there was already a task for some reason
[17:46:20] I always merge the wrong direction XD
[17:46:23] let me fix it
[17:46:26] thanks
[17:46:37] Operations, ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183724#3861857 (Marostegui) duplicate→Open
[17:46:42] I am trying to decide whether to fiddle with removing the device and rebalancing the rings etc
[17:47:18] Operations, ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183724#3861857 (Marostegui)
[17:47:20] Operations, ops-eqiad: failed disk on ms-be1033 - https://phabricator.wikimedia.org/T183723#3861840 (Marostegui)
[17:47:35] done
[17:47:50] Operations, ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183724#3861857 (Marostegui) p: Triage→Normal
[17:48:37] it seems like no one has done that via the documented method for many months though, so
[17:48:43] not sure if that's the approved method now
[17:48:48] any thoughts?
[17:49:24] Never done it so... I cannot help :)
[17:49:45] well the last time I did it there was no puppet repo for rings, so I'm very out of date
[17:50:01] Oh wow
[17:50:20] If we shut it down it will just get out from the LB right?
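The "documented method" being weighed above is to remove the failed device from the affected ring with swift-ring-builder and rebalance, per the Swift admin guide (linked a few lines further down). A sketch of that procedure driven from Python, with the builder file and device id as placeholders; as the discussion later concludes, it is usually skipped when a replacement disk arrives within days:

```python
#!/usr/bin/env python3
"""Sketch of dropping a failed device from a Swift ring and rebalancing,
following the swift-ring-builder workflow from the Swift admin guide.
Builder path and device id are placeholders; run this on the host holding
the ring builder files, then distribute the regenerated ring files."""
import subprocess

BUILDER = "object.builder"  # placeholder: builder file for the affected ring
DEVICE = "d42"              # placeholder: device id from `swift-ring-builder object.builder`

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

if __name__ == "__main__":
    # For a dead disk the device can be removed outright; partitions it
    # held are reassigned to the remaining devices on rebalance.
    run("swift-ring-builder", BUILDER, "remove", DEVICE)
    run("swift-ring-builder", BUILDER, "rebalance")
```

For a planned decommission, by contrast, the device would be drained gradually by lowering its weight with set_weight before removal, to avoid moving all of its partitions at once.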
[17:51:38] well I think it's better to leave it up, if swift itself will just lower the weight of the device
[17:51:42] but I can't remember that either
[17:53:01] Yeah, I would assume it would do that by itself
[17:57:49] well it says here (random web page on OpenStack) that it's good to unmount the device because that will help swift work around the replication failure
[17:57:55] I suppose puppet would undo that
[17:58:16] https://docs.openstack.org/swift/newton/admin_guide.html#handling-drive-failure we run 2.10 which is this version
[18:01:17] RECOVERY - Disk space on ms-be1033 is OK: DISK OK
[18:02:03] there have been no changes to the swift rings in that repo for the last reported bad disk I saw in phab, so going to leave it be for now
[18:04:00] well it's unmounted automagically so that's that
[18:11:55] ah, I see there is a swift drive audit script that must do it
[18:12:10] the umount is blocked but it took the device out of the mount table at least
[19:42:07] apergos: thanks for taking a look! if the umount is blocked we can reboot
[19:42:23] godog: I bet it is not
[19:42:40] let me have a look
[19:43:24] can't tell
[19:43:33] sec
[19:43:53] root 10977 0.0 0.0 23624 1224 ? D 18:01 0:01 umount -fl /srv/swift-storage/sdk1
[19:43:56] nope still stuck
[19:44:05] oh. and I misread, heh
[19:44:22] it is indeed still blocked, I thought you were asking if it was unblocked yet
[19:44:38] do you want to do the honors?
[19:45:00] yeah I'll kick it
[19:45:03] sweet
[19:45:25] I'm looking into why sdk wasn't commented out of fstab though, swift-drive-audit should be able to
[19:45:35] maybe that's after the umount?
[19:46:12] yeah likely, I would have hoped not
[19:46:14] yep
[19:46:22] the umount must finish, then it comments out :-P
[19:47:42] !log reboot ms-be1033 - T183724
[19:47:43] welcome back irc-cloud users :-P
[19:47:48] yeah I commented it manually
[19:47:51] k
[19:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:55] T183724: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183724
[19:48:11] I take it we don't rebalance the swift rings for dead disks these days?
[19:49:12] no, it is usually a matter of 2/3 days
[19:49:21] and not worth it
[19:49:23] gotcha
[19:49:46] I wonder how that will play out this week
[19:51:19] true, yeah might be more like a week
[19:54:18] !log power reset ms-be1033
[19:54:28] meh, it wasn't coming back
[19:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:31] ugh
[19:54:42] oh it probably never finished powering down, because it was waiting for that disk somehow
[19:56:36] quite possible yeah
[19:58:02] yeah it is back
[20:00:06] ah so
[20:00:21] nm, I already asked you
[20:00:25] have a good vacation
[20:00:45] apergos: you too!
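The "swift drive audit script" identified at 18:11 is swift-drive-audit, which scans the kernel log for I/O errors on Swift devices, unmounts the affected ones, and comments them out of /etc/fstab. A condensed sketch of that remediation step (the detection half is elided; the mount point matches this incident), illustrating the ordering pitfall the discussion uncovers, namely that the fstab edit only happens once the umount returns:

```python
#!/usr/bin/env python3
"""Condensed sketch of swift-drive-audit's remediation step: force-unmount
a device the kernel reports I/O errors for, then comment it out of
/etc/fstab so it stays unmounted across reboots. The log-scanning half of
the real script is elided; paths match the ms-be1033 incident above."""
import subprocess

FSTAB = "/etc/fstab"

def comment_out(mount_point: str) -> None:
    # Prefix matching, still-active fstab lines with '#'.
    with open(FSTAB) as f:
        lines = f.readlines()
    with open(FSTAB, "w") as f:
        for line in lines:
            if mount_point in line and not line.lstrip().startswith("#"):
                f.write("#" + line)
            else:
                f.write(line)

def retire_device(mount_point: str) -> None:
    # Force + lazy unmount, as seen in the stuck process listing above.
    subprocess.run(["umount", "-fl", mount_point], check=False)
    # Only reached once umount returns: on a truly dead disk the umount can
    # block in D state forever, and then the fstab edit never happens,
    # which is why it had to be done manually in this incident.
    comment_out(mount_point)

if __name__ == "__main__":
    retire_device("/srv/swift-storage/sdk1")
```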
[20:02:16] RECOVERY - HP RAID on ms-be1033 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK
[20:02:27] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[20:47:20] (PS8) Krinkle: Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: Ottomata)
[20:57:22] (CR) Krinkle: [C: +1] Add wikidata and mediawiki.org to $wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) (owner: TerraCodes)