[01:05:45] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:07:26] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10501 bytes in 0.108 second response time
[01:50:36] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:55:56] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10501 bytes in 0.094 second response time
[02:23:14] !log l10nupdate@tin Synchronized php-1.26wmf24/cache/l10n: l10nupdate for 1.26wmf24 (duration: 07m 05s)
[02:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:29] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1678820 (10tli) 5Resolved>3Open Hi, Re-opened this ticket. We'd like to change the link to direct to this page in the public policy portal: https://policy.wikimedia...
[03:04:45] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:09:36] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail
[03:29:45] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[03:35:05] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:07] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:36] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:56] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:27] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:36] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:50:07] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: puppet fail
[04:00:57] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[04:01:55] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[04:02:06] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[04:02:35] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:03:27] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:05:36] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:07:36] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [100000000.0]
[04:18:55] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:43:26] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[06:24:53] 6operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash: Set up a service IP for logstash - https://phabricator.wikimedia.org/T113104#1678882 (10bd808) The need for udp "stream" re-assembly for GELF is a major reason that I never bothered to try and get LVS setup in front of the Logstash cluster. The way...
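For context on bd808's point above: a single GELF event sent over UDP can be split into several chunks that share an 8-byte message ID, and the receiver has to hold chunks in memory until the full set has arrived. Below is a minimal reassembly sketch in Python, assuming the standard GELF 1.1 chunked-datagram header (0x1e 0x0f magic, 8-byte message ID, sequence number, sequence count); accept_datagram and the in-memory pending table are illustrative only, not part of the Wikimedia Logstash setup.

    # Minimal sketch of chunked-GELF reassembly (GELF 1.1 chunked layout assumed).
    from collections import defaultdict

    GELF_CHUNK_MAGIC = b"\x1e\x0f"
    pending = defaultdict(dict)  # message id -> {sequence number: chunk payload}

    def accept_datagram(data):
        """Return a complete GELF payload, or None while chunks are still missing."""
        if not data.startswith(GELF_CHUNK_MAGIC):
            return data  # unchunked datagram, already complete
        msg_id, seq, total = data[2:10], data[10], data[11]
        pending[msg_id][seq] = data[12:]
        if len(pending[msg_id]) < total:
            return None  # more chunks must arrive at this same receiver
        chunks = pending.pop(msg_id)
        return b"".join(chunks[i] for i in range(total))

Because the reassembly state lives in one receiver process, a stateless load balancer that sprays chunks of the same message across different Logstash backends would leave every backend with an incomplete set, which is the obstacle to simply putting LVS in front of the cluster.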
[06:28:38] hooray a bd808 is here
[06:28:45] and there was much rejoicing etc
[06:28:49] :0
[06:29:11] I see that you managed to run logstash out of disk while I was gone
[06:29:17] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:52] I should be sleeping now to get over my jetlag :/
[06:30:15] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: puppet fail
[06:30:29] spending 10 days at +18:00 from your home timezone does strange things to sleeping and eating patterns
[06:30:35] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] bd808: inorite
[06:30:56] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:05] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:09] bd808: I didn't personally run it out of disk space, I only did an unnecessary ES rebuild
[06:31:15] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:16] heh
[06:31:17] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:25] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:34] probably some change in runJobs volume ate the space
[06:31:45] PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:04] I'll poke at it this week and see if there are "cheap" wins to be had
[06:32:05] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:06] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:09] \o/
[06:32:14] I should write an incident report
[06:32:26] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:55] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:55] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:25] bd808: I have also repented and moved off IRCCloud
[06:36:07] yeah? not as awesome as you'd hoped?
[06:36:23] bd808: they did a redesign that made it take about 30% CPU on my underpowered laptop
[06:36:34] lame
[06:36:37] indeed
[06:36:52] it's just some javascript right?
[06:36:56] yeah
[06:37:15] I tried going bouncerless but that lasted only a day....
[06:37:19] am on weechat now
[06:37:30] znc ftw
[06:37:40] I used to be on ZNC
[06:37:51] I switched away because I hated the web interface
[06:38:07] they all continue to suck... I should spend some time actually building a nice IRCCloud open source version...
[06:38:44] bd808: I'm also using slack (K8s is on slack)
[06:39:01] bd808: they actually have no 'real' improvements over IRC other than 'ease of setup'
[06:39:15] which was surprising to me, since it had so many passionate supporters...
[06:39:17] icr for dummies
[06:39:42] I still think IRCCloud is great for other folks since it gives you a large chunk of what slack gives you
[06:39:47] the minimal barriers to use are one of the reasons I like irc
[06:40:21] fewer spam bots and such compared chat networks I used to use
[06:40:43] oh god, if I respond to that 'minimal barriers' quote this will be the third time I'm having this discussion for the day!
[06:40:57] * bd808 still has an icq account
[06:41:05] Instead I shall go away and continue attempting to move off WordPress to jekyll
[06:41:05] eh
[06:41:14] I have thought 'eh, it is a gem, how bad can it be'
[06:41:17] I suppose I'll find out soon
[06:43:04] my seldom used blog is built with octopress (Jekyll + some other crap)
[06:43:16] yah I'm lookin at that too
[06:46:12] * bd808 tries to get some sleep at a near timezone appropriate hour
[06:46:34] bd808: good luck!
[06:46:36] and welcome back!
[06:56:26] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:56:35] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:37] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:56:46] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:56:47] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:26] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:57:26] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:57:47] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:57:55] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:16] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:16] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:25] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:05] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:04:55] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
[08:17:25] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
[08:37:00] (03CR) 10Hashar: [C: 031] "Looks legit?" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/241254 (owner: 10Alex Monk)
[10:05:03] !leaving innodb compression tests on es2005 running (could affect lag on that host)
[10:05:14] !log leaving innodb compression tests on es2005 running (could affect lag on that host)
[10:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:55:26] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:22:36] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:20:06] 6operations, 10Wikimedia-Mailing-lists: TTL back up to normal 1H - https://phabricator.wikimedia.org/T110141#1679054 (10faidon)
[12:20:07] 6operations, 10Wikimedia-Mailing-lists: Upgrade to Mailman 3.0 - https://phabricator.wikimedia.org/T97492#1679055 (10faidon)
[12:20:08] 6operations: mailman - replace lighttpd - https://phabricator.wikimedia.org/T84053#1679056 (10faidon)
[12:20:10] 6operations, 7Mail: Upgrade Exim to >=4.73 - https://phabricator.wikimedia.org/T83541#1679057 (10faidon)
[12:20:12] 6operations: Get rid of all Ubuntu Lucid (10.04) installs - https://phabricator.wikimedia.org/T80945#1679058 (10faidon)
[12:20:14] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1679052 (10faidon) 5Resolved>3Open Shutting it down without wiping it is really dangerous — it means that it could come back up at any point due to e.g. power flapping, take its old IP and -at best...
[13:53:36] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: puppet fail
[13:55:08] (03PS2) 10Alex Monk: [WIP] Move from ircecho to tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/240945
[13:55:10] (03PS1) 10Alex Monk: Fix inclusion of labs proxyagent password in shinkengen [puppet] - 10https://gerrit.wikimedia.org/r/241526
[13:57:46] (03PS2) 10Alex Monk: shinken: Fix inclusion of labs proxyagent password in shinkengen [puppet] - 10https://gerrit.wikimedia.org/r/241526
[14:02:56] * Krenair grumbles about gerrit dependencies
[14:07:48] (03PS3) 10Alex Monk: [WIP] Move from ircecho to tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/240945
[14:12:42] (03CR) 10Alex Monk: [C: 04-1] [WIP] Move from ircecho to tcpircbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240945 (owner: 10Alex Monk)
[14:13:15] (03CR) 10Alex Monk: [WIP] Move from ircecho to tcpircbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240945 (owner: 10Alex Monk)
[14:20:37] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[14:36:18] (03PS3) 10Alex Monk: tcpircbot: Allow per-infile channel lists [puppet] - 10https://gerrit.wikimedia.org/r/240939
[14:37:06] (03CR) 10Alex Monk: [WIP] Move from ircecho to tcpircbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240945 (owner: 10Alex Monk)
[15:18:49] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1679216 (10jcrespo)
[15:18:51] 6operations, 10hardware-requests, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1679214 (10jcrespo) 5Open>3Resolved Actually, closing, tracking codfw separately.
[15:22:11] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1679221 (10jcrespo) p:5Normal>3High We should start by creating a new cluster (enwiki table has surpased the 1 TB size). Inserting to a separate table will be faster. I can own this, but I need...
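Regarding the compression tests on es2005 and T106386 above: InnoDB table compression is usually evaluated by rebuilding a table with ROW_FORMAT=COMPRESSED and comparing its footprint before and after. A rough Python sketch of such a test follows; the host, schema, and table names are purely hypothetical, and this is not necessarily how the es2005 tests were actually run.

    # Hypothetical before/after size check for InnoDB compression on a test table.
    import pymysql

    # DB_HOST, "test" and "blobs_test" are placeholders, not real production names.
    conn = pymysql.connect(host="DB_HOST", user="root", database="test")

    def table_size_mb(cur, table):
        # information_schema figures are approximate, but fine for a before/after trend
        cur.execute(
            "SELECT (data_length + index_length) / 1024 / 1024"
            " FROM information_schema.tables"
            " WHERE table_schema = DATABASE() AND table_name = %s",
            (table,),
        )
        return cur.fetchone()[0]

    with conn.cursor() as cur:
        before = table_size_mb(cur, "blobs_test")
        # ROW_FORMAT=COMPRESSED needs innodb_file_per_table (and, before MySQL 5.7,
        # innodb_file_format=Barracuda); KEY_BLOCK_SIZE=8 targets 8 KB compressed pages.
        cur.execute("ALTER TABLE blobs_test ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8")
        after = table_size_mb(cur, "blobs_test")
        print(f"blobs_test: {before:.0f} MB before, {after:.0f} MB after")

    conn.close()

KEY_BLOCK_SIZE=8 asks InnoDB to fit each default 16 KB page into 8 KB on disk, which is the usual first setting to try when measuring whether compression pays off.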
[15:33:10] (03CR) 10MarcoAurelio: "Thanks for noticing this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241354 (https://phabricator.wikimedia.org/T113848) (owner: 10Se4598)
[15:41:24] (03PS2) 10MarcoAurelio: Enable Education Program extension at srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619)
[15:41:56] (03CR) 10MarcoAurelio: "Do not deploy until issues with EP are resolved." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio)
[15:42:21] (03PS2) 10MarcoAurelio: Enable Extension:EducationProgram on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236418 (https://phabricator.wikimedia.org/T111630)
[15:49:35] (03CR) 10Alex Monk: [C: 032] Restore previous custom AbuseFilter IP Block durations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241354 (https://phabricator.wikimedia.org/T113848) (owner: 10Se4598)
[15:49:42] (03Merged) 10jenkins-bot: Restore previous custom AbuseFilter IP Block durations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241354 (https://phabricator.wikimedia.org/T113848) (owner: 10Se4598)
[15:50:38] * Krenair grumbles about snapshot1001
[15:50:41] !log krenair@tin Synchronized wmf-config/abusefilter.php: https://gerrit.wikimedia.org/r/#/c/241354/ - fix AbuseFilter block durations (duration: 00m 18s)
[15:50:43] that machine is still full
[15:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:51:33] (03CR) 10Alex Monk: "Thanks for fixing this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241354 (https://phabricator.wikimedia.org/T113848) (owner: 10Se4598)
[16:00:25] maybe I should file a ticket for ops
[16:00:52] you should
[16:01:40] 6operations: scap to snapshot1001 failing due to full disk - https://phabricator.wikimedia.org/T113888#1679266 (10Krenair) 3NEW
[16:02:27] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail
[16:06:21] 6operations: scap to snapshot1001 failing due to full disk - https://phabricator.wikimedia.org/T113888#1679274 (10Krenair) I guess we could get rid of i18n cache for 1.26wmf19, 1.26wmf20, and 1.26wmf21?
[16:16:15] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[16:30:36] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:30:41] Isn't snapshot1001 due for reinstall?
[16:30:50] Wasn't there something about removing it from mediawiki-installation?
[16:32:21] I was going to say I doubt a reinstall would help if it's all being taken up by /srv/mediawiki
[16:32:23] but I found https://gerrit.wikimedia.org/r/#/c/240323/1
[16:32:37] heh
[16:32:52] yay apergos :)
[16:33:58] Krenair: I'd kill off the old i18n cache though, yeah
[16:34:02] easy fix
[16:34:37] and yeah, there was talk of reinstall in here actually
[16:37:21] Reedy, https://phabricator.wikimedia.org/P2103
[16:37:45] Yeah, that sounds familiar
[16:37:47] I guess it's not been removed
[16:38:04] Do it and ask paravoid nicely to merge? :)
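The disk-full thread above ends with a plan to drop the old i18n caches, and the !log entries for the actual purge follow below. As a hypothetical illustration only, assuming a deployed layout of /srv/mediawiki/php-<version>/cache/l10n, one might first size up what such a purge would reclaim; this sketch is not the tooling actually used on tin or snapshot1001.

    # Hypothetical helper: report how much space old-branch l10n caches occupy.
    import os

    STAGING = "/srv/mediawiki"  # assumed layout: php-<version>/cache/l10n
    OLD_BRANCHES = ["php-1.26wmf19", "php-1.26wmf20", "php-1.26wmf21"]

    def dir_size_bytes(path):
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # file vanished or unreadable; skip it
        return total

    for branch in OLD_BRANCHES:
        l10n = os.path.join(STAGING, branch, "cache", "l10n")
        if os.path.isdir(l10n):
            print(f"{l10n}: {dir_size_bytes(l10n) / 1024**3:.1f} GiB reclaimable")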
[16:38:06] but see the ops list
[16:38:14] heh
[16:42:05] !log krenair@tin Purged l10n cache for 1.26wmf19
[16:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:42:37] !log krenair@tin Purged l10n cache for 1.26wmf20
[16:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:43:06] RECOVERY - Disk space on snapshot1001 is OK: DISK OK
[16:43:20] !log krenair@tin Purged l10n cache for 1.26wmf21
[16:43:23] (not touching wmf22/wmf23 because I'm pretty sure wikidata or something was stuck on one of them)
[16:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:43:52] 3 versions sohuld clear up a lot of space
[16:48:20] 6operations: scap to snapshot1001 failing due to full disk - https://phabricator.wikimedia.org/T113888#1679311 (10Krenair) !log krenair@tin Purged l10n cache for 1.26wmf19 !log krenair@tin Purged l10n cache for 1.26wmf20 RECOVERY - Disk space on snapshot1001 is OK: DISK OK
(03PS1) 10ArielGlenn: disable check for utf8 normalize ext for dumps, no longer used [puppet] - 10https://gerrit.wikimedia.org/r/241537
[16:49:52] yay me, I'm finally not sick. geez
[16:50:49] (03CR) 10ArielGlenn: [C: 032] disable check for utf8 normalize ext for dumps, no longer used [puppet] - 10https://gerrit.wikimedia.org/r/241537 (owner: 10ArielGlenn)
[16:51:55] !log ran sync-common on snapshot1001 to bring it up to date
[16:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:56:08] yay apergos
[17:18:12] (03PS1) 10ArielGlenn: dumps: never check for utf8 normalize ext, it's no longer used [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241540
[17:22:07] ok. step number one to getting completely well: eat food
[17:22:31] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: never check for utf8 normalize ext, it's no longer used [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241540 (owner: 10ArielGlenn)
[20:35:29] (03PS1) 10John F. Lewis: admin: add asherman to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/241562
[20:39:54] (03CR) 10John F. Lewis: "johnflewis@bast1001:~$ id asherman" [puppet] - 10https://gerrit.wikimedia.org/r/241562 (owner: 10John F. Lewis)
[20:48:49] (03PS1) 10Base: Modified redirects config concerning outreachwiki aliases [puppet] - 10https://gerrit.wikimedia.org/r/241564
[21:21:36] PROBLEM - RAID on ms-be1012 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[21:21:56] PROBLEM - Disk space on ms-be1012 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdf1 is not accessible: Input/output error
[21:43:36] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:57:12] godog: :(
[22:51:05] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:02:27] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1679712 (10JeroenDeDauw) September itself is about to be gone...
[23:08:22] Little does he know we only just got rid of the last 10.04 host
[23:08:25] * Reedy grins
[23:12:19] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1679736 (10Reedy) >>! In T94277#1679712, @JeroenDeDauw wrote: > September itself is about to be gone... We've only just got rid of the last 10.04 host; 5.5 years after...
[23:13:28] 6operations, 10RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1679749 (10mobrovac)
[23:17:23] !log ori@tin Synchronized php-1.26wmf24/./resources/src/mediawiki.toolbar/toolbar.less: I94ced06178: mediawiki.toolbar: temporary workaround for T113868 (duration: 00m 17s)
[23:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:18:07] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures