[00:00:22] PROBLEM - Disk space on eventlog1001 is CRITICAL: DISK CRITICAL - free space: / 104 MB (1% inode=86%)
[00:00:31] hmm
[00:02:01] RECOVERY - Disk space on eventlog1001 is OK: DISK OK
[00:08:09] paravoid, hey. you there?
[00:42:42] operations, Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1217574 (Jrogers-WMF) declined→Open
[00:45:04] operations, Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1217576 (Jrogers-WMF) Hi @Krenair, I took a look at this and legal would like it removed. Apologies for the delay from the original legal tag and for the technical annoyance from this answer.
[00:45:07] operations, Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1217578 (JohnLewis) a: Krenair→None @jrogers-WMF could you accompany reopens with an explanation of why it is being reopened? From the response from Stephen it seems no action is requ...
[00:45:35] just as I reply too
[00:46:38] haha yeah that's bad timing
[00:46:53] operations, Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1217581 (Jrogers-WMF) @JohnLewis, I'll switch my order to comment and reopen next time. :)
[00:49:34] I then strike my comment and he replies
[00:49:38] why >.>
[01:02:28] operations, Wikimedia-Mailing-lists: Update mailman listinfo.txt template - https://phabricator.wikimedia.org/T96108#1217593 (Dzahn) https://lists.wikimedia.org/mailman/listinfo/arbcom-appeals-en https://lists.wikimedia.org/mailman/listinfo/arbcom-l https://lists.wikimedia.org/mailman/listinfo/advisory ht...
[01:02:45] operations, Wikimedia-Mailing-lists: Update mailman listinfo.txt template - https://phabricator.wikimedia.org/T96108#1217594 (Dzahn) Open→Resolved
[01:09:40] PROBLEM - puppet last run on db1034 is CRITICAL Puppet has 1 failures
[01:15:49] (CR) Legoktm: [C: +2] Fix PEP-8 style [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203985 (owner: BryanDavis)
[01:15:53] (PS1) Dzahn: make carbon a ganglia_new aggregator for eqiad [puppet] - https://gerrit.wikimedia.org/r/204978 (https://phabricator.wikimedia.org/T93776)
[01:17:43] (Merged) jenkins-bot: Fix PEP-8 style [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203985 (owner: BryanDavis)
[01:20:44] (CR) John F. Lewis: [C: +1] "Plus made sense when I suggested it a few days ago." [puppet] - https://gerrit.wikimedia.org/r/204978 (https://phabricator.wikimedia.org/T93776) (owner: Dzahn)
[01:25:51] RECOVERY - puppet last run on db1034 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[01:31:50] PROBLEM - HHVM queue size on mw1096 is CRITICAL 100.00% of data above the critical threshold [80.0]
[01:32:56] (CR) Aaron Schulz: [C: +2] Tweaked "recentchangeslinked" comments [mediawiki-config] - https://gerrit.wikimedia.org/r/204875 (owner: Aaron Schulz)
[01:33:03] (Merged) jenkins-bot: Tweaked "recentchangeslinked" comments [mediawiki-config] - https://gerrit.wikimedia.org/r/204875 (owner: Aaron Schulz)
[01:41:21] PROBLEM - HHVM queue size on mw1181 is CRITICAL 100.00% of data above the critical threshold [80.0]
[01:57:56] (CR) Hoo man: "Nit pick, looks fine despite" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/204728 (owner: Aude)
[02:06:00] PROBLEM - HHVM queue size on mw1181 is CRITICAL 100.00% of data above the critical threshold [80.0]
[02:06:22] (PS1) Ori.livneh: coal: use set_wakeup_fd & poll for interval timer [puppet] - https://gerrit.wikimedia.org/r/204984
[02:11:52] (PS2) Ori.livneh: coal: use set_wakeup_fd & poll for interval timer [puppet] - https://gerrit.wikimedia.org/r/204984
[02:12:41] PROBLEM - HHVM queue size on mw1096 is CRITICAL 100.00% of data above the critical threshold [80.0]
[02:21:31] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.139 second response time
[02:21:40] PROBLEM - HHVM busy threads on mw1181 is CRITICAL 100.00% of data above the critical threshold [115.2]
[02:22:10] RECOVERY - HHVM rendering on mw1181 is OK: HTTP OK: HTTP/1.1 200 OK - 65924 bytes in 0.127 second response time
[02:22:10] RECOVERY - HHVM rendering on mw1096 is OK: HTTP OK: HTTP/1.1 200 OK - 65924 bytes in 0.274 second response time
[02:22:32] !log Restarted HHVM on mw1181 and mw1096 after total lock-up; backtrace mw1096:/var/log/hhvm/hhvm.28914.bt
[02:22:41] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.075 second response time
[02:22:41] Logged the message, Master
[02:23:06] (CR) Ori.livneh: [C: +2] coal: use set_wakeup_fd & poll for interval timer [puppet] - https://gerrit.wikimedia.org/r/204984 (owner: Ori.livneh)
[02:27:37] !log l10nupdate Synchronized php-1.26wmf1/cache/l10n: (no message) (duration: 05m 56s)
[02:27:43] Logged the message, Master
[02:30:22] RECOVERY - HHVM queue size on mw1181 is OK Less than 30.00% above the threshold [10.0]
[02:30:42] RECOVERY - HHVM queue size on mw1096 is OK Less than 30.00% above the threshold [10.0]
[02:31:21] RECOVERY - HHVM busy threads on mw1181 is OK Less than 30.00% above the threshold [76.8]
[02:32:07] !log LocalisationUpdate completed (1.26wmf1) at 2015-04-18 02:31:04+00:00
[02:32:12] Logged the message, Master
[02:44:30] (CR) Krinkle: contint: Put mysql db on tmpfs for role::ci::slave::labs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: coren)
[02:48:51] RECOVERY - HHVM busy threads on mw1096 is OK Less than 30.00% above the threshold [57.6]
[02:52:20] !log l10nupdate Synchronized php-1.26wmf2/cache/l10n: (no message) (duration: 05m 05s)
[02:52:28] Logged the message, Master
[02:56:21] !log LocalisationUpdate completed (1.26wmf2) at 2015-04-18 02:55:17+00:00
[02:56:27] Logged the message, Master
[03:16:30] PROBLEM - puppet last run on mw2064 is CRITICAL puppet fail
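The coal change merged at 02:23 pairs signal.set_wakeup_fd with poll(): rather than doing the periodic work inside a SIGALRM handler, the handler only wakes a pipe that the main loop polls, so the real work runs in ordinary main-loop context where blocking calls are safe. A minimal sketch of that general pattern in Python (illustrative only, not the actual coal code; do_periodic_work is a hypothetical stand-in):

    import os
    import select
    import signal

    def do_periodic_work():
        # Hypothetical stand-in for the real per-interval work.
        print("tick")

    # Self-pipe: the interpreter writes a byte to wfd whenever a signal
    # arrives, so poll() on rfd wakes up even if the signal lands while
    # the loop is blocked elsewhere.
    rfd, wfd = os.pipe()
    os.set_blocking(rfd, False)
    os.set_blocking(wfd, False)  # set_wakeup_fd requires non-blocking
    signal.set_wakeup_fd(wfd)

    # A handler must still be installed: SIGALRM's default action would
    # kill the process. The handler itself does nothing.
    signal.signal(signal.SIGALRM, lambda signum, frame: None)

    # Deliver SIGALRM once a second.
    signal.setitimer(signal.ITIMER_REAL, 1.0, 1.0)

    poller = select.poll()
    poller.register(rfd, select.POLLIN)

    while True:
        poller.poll()  # blocks until the next timer tick
        try:
            os.read(rfd, 512)  # drain the wakeup byte(s)
        except BlockingIOError:
            pass
        do_periodic_work()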
[03:16:50] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[03:34:31] RECOVERY - puppet last run on mw2064 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[03:37:50] (PS1) Yuvipanda: tools: Setup replication from master to all slave proxies [puppet] - https://gerrit.wikimedia.org/r/204990 (https://phabricator.wikimedia.org/T96334)
[03:38:22] (PS2) Yuvipanda: tools: Setup replication from master to all slave proxies [puppet] - https://gerrit.wikimedia.org/r/204990 (https://phabricator.wikimedia.org/T96334)
[03:56:20] PROBLEM - puppet last run on db1064 is CRITICAL Puppet has 1 failures
[03:58:33] (PS3) Yuvipanda: tools: Setup replication from master to all slave proxies [puppet] - https://gerrit.wikimedia.org/r/204990 (https://phabricator.wikimedia.org/T96334)
[03:58:55] (PS4) Yuvipanda: tools: Setup replication from master to all slave proxies [puppet] - https://gerrit.wikimedia.org/r/204990 (https://phabricator.wikimedia.org/T96334)
[03:59:08] (CR) Yuvipanda: [C: +2 V: +2] tools: Setup replication from master to all slave proxies [puppet] - https://gerrit.wikimedia.org/r/204990 (https://phabricator.wikimedia.org/T96334) (owner: Yuvipanda)
[04:06:01] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 57.14% of data above the critical threshold [24.0]
[04:10:51] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[04:12:41] RECOVERY - puppet last run on db1064 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[04:22:21] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[04:30:21] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[04:53:23] (PS1) Legoktm: admin: Turn data_admin into a real unit test [puppet] - https://gerrit.wikimedia.org/r/204992
[04:53:25] (PS1) Legoktm: tox: Remove 'data_admin_lint' environment [puppet] - https://gerrit.wikimedia.org/r/204993
[04:53:27] (PS1) Legoktm: redirects: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068)
[04:54:14] (CR) jenkins-bot: [V: -1] admin: Turn data_admin into a real unit test [puppet] - https://gerrit.wikimedia.org/r/204992 (owner: Legoktm)
[04:54:27] (CR) jenkins-bot: [V: -1] tox: Remove 'data_admin_lint' environment [puppet] - https://gerrit.wikimedia.org/r/204993 (owner: Legoktm)
[04:54:38] (CR) jenkins-bot: [V: -1] redirects: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: Legoktm)
[04:54:50] shush
[04:58:06] (PS2) Legoktm: admin: Turn data_admin into a real unit test [puppet] - https://gerrit.wikimedia.org/r/204992
[04:58:08] (PS2) Legoktm: tox: Remove 'data_admin_lint' environment [puppet] - https://gerrit.wikimedia.org/r/204993
[04:58:10] (PS2) Legoktm: redirects: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068)
[04:59:06] (CR) jenkins-bot: [V: -1] tox: Remove 'data_admin_lint' environment [puppet] - https://gerrit.wikimedia.org/r/204993 (owner: Legoktm)
[04:59:22] (CR) jenkins-bot: [V: -1] redirects: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: Legoktm)
[05:01:26] (CR) Legoktm: "While writing this test, I noticed that redirects.conf is in fact out of date and does need to be regenerated." [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: Legoktm)
[05:03:48] (CR) Yuvipanda: "why have data_admin at all?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/204992 (owner: Legoktm)
[05:04:51] (CR) Legoktm: "No idea, but I didn't want to remove any functionality, just refactor." [puppet] - https://gerrit.wikimedia.org/r/204992 (owner: Legoktm)
[05:05:38] (PS3) Legoktm: admin: Turn data_admin into a real unit test [puppet] - https://gerrit.wikimedia.org/r/204992
[05:05:40] (PS3) Legoktm: tox: Remove 'data_admin_lint' environment [puppet] - https://gerrit.wikimedia.org/r/204993
[05:05:42] (PS3) Legoktm: redirects: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068)
[05:06:01] (CR) Legoktm: admin: Turn data_admin into a real unit test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/204992 (owner: Legoktm)
[05:06:43] (CR) jenkins-bot: [V: -1] tox: Remove 'data_admin_lint' environment [puppet] - https://gerrit.wikimedia.org/r/204993 (owner: Legoktm)
[05:07:03] (CR) jenkins-bot: [V: -1] redirects: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: Legoktm)
[05:12:44] (CR) Yuvipanda: [C: +2] admin: Turn data_admin into a real unit test [puppet] - https://gerrit.wikimedia.org/r/204992 (owner: Legoktm)
[05:12:58] (CR) Yuvipanda: "recheck" [puppet] - https://gerrit.wikimedia.org/r/204993 (owner: Legoktm)
[05:18:39] (CR) Legoktm: "recheck" [puppet] - https://gerrit.wikimedia.org/r/204993 (owner: Legoktm)
[05:18:51] (CR) Legoktm: "recheck" [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: Legoktm)
[05:20:26] (CR) Legoktm: [C: +1] "jenkins updated in f76e3d4dc17021bd4f83b6fa67612496687679cf" [puppet] - https://gerrit.wikimedia.org/r/204993 (owner: Legoktm)
[05:20:35] YuviPanda: ^ is ready
[05:20:57] (CR) Yuvipanda: [C: +2] tox: Remove 'data_admin_lint' environment [puppet] - https://gerrit.wikimedia.org/r/204993 (owner: Legoktm)
[05:21:25] legoktm: what does updating redirects.conf entail?
[05:21:52] let me upload that commit
[05:22:31] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0]
[05:23:09] (PS4) Legoktm: redirects: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068)
[05:23:11] (PS1) Legoktm: mediawiki: Update redirects.conf using refreshDomainRedirects [puppet] - https://gerrit.wikimedia.org/r/204996
[05:23:30] YuviPanda: ^
[05:23:51] this might not be the best time for that one :P
[05:23:55] yeah totally :)
[05:23:58] that's what I meant :)
[05:24:09] I'm thinking I'll let that one be
[05:24:13] (PS5) Legoktm: mediawiki: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068)
[05:24:46] YuviPanda: +1 on the test case? ^
[05:25:34] legoktm: needs more comments I think
[05:25:37] lots of magic joining
[05:25:54] joining?
[05:26:15] it's just non-relative paths
[05:26:31] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 71.43% of data above the critical threshold [35.0]
[05:26:54] well
[05:27:03] just in general. more comments as to what that test is testing for :)
[05:27:06] ok
[05:27:24] 'what is refreshDomainRedirects? why does it need to be run? what is this dat file? what is this conf file?' etc
[05:27:48] specifically, 'why this test? what does it mean if this test is failing?'
[05:27:56] something like linting is fairly obvious while this isn't I think
[05:29:20] (PS6) Legoktm: mediawiki: Add test to verify redirects.conf has been regenerated from redirects.dat [puppet] - https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068)
[05:30:35] legoktm: I'm heading back home now tho
[05:30:47] that's fine, thanks for the reviews :)
[05:30:57] yw! thanks for the patches :)
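The patch under review here asserts that the committed redirects.conf matches what regenerating it from redirects.dat would produce, so hand edits (or a forgotten regeneration) get caught by CI. A sketch of that shape of test, assuming, hypothetically, that the generator script can be pointed at an arbitrary output path; the real refreshDomainRedirects invocation may differ:

    import subprocess
    import tempfile
    import unittest

    class RedirectsConfTest(unittest.TestCase):
        """redirects.conf is generated from redirects.dat; editing one
        without regenerating the other makes them drift apart.  If this
        test fails, re-run the generator and commit the result."""

        def test_conf_matches_regenerated_output(self):
            with tempfile.NamedTemporaryFile(mode="r") as regenerated:
                # Hypothetical invocation: the real script name and
                # argument order may differ.
                subprocess.check_call(
                    ["./refreshDomainRedirects", "redirects.dat",
                     regenerated.name])
                with open("redirects.conf") as committed:
                    self.assertEqual(committed.read(), regenerated.read())

    if __name__ == "__main__":
        unittest.main()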
[05:34:34] jamesofur :)
[05:51:00] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[06:01:41] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[06:24:30] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[06:29:20] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures
[06:30:22] PROBLEM - puppet last run on mw1166 is CRITICAL Puppet has 1 failures
[06:30:40] PROBLEM - puppet last run on cp3004 is CRITICAL puppet fail
[06:31:00] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 1 failures
[06:31:00] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 1 failures
[06:32:01] PROBLEM - puppet last run on elastic1022 is CRITICAL Puppet has 1 failures
[06:34:41] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 1 failures
[06:34:51] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures
[06:35:10] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:35:11] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures
[06:35:11] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 2 failures
[06:35:11] PROBLEM - puppet last run on mw1092 is CRITICAL Puppet has 1 failures
[06:35:42] PROBLEM - puppet last run on mw2146 is CRITICAL Puppet has 1 failures
[06:35:51] PROBLEM - puppet last run on mw1025 is CRITICAL Puppet has 1 failures
[06:35:51] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:35:51] PROBLEM - puppet last run on mw1235 is CRITICAL Puppet has 1 failures
[06:36:11] PROBLEM - puppet last run on mw1251 is CRITICAL Puppet has 1 failures
[06:36:31] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures
[06:36:41] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:40:51] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 57.14% of data above the critical threshold [24.0]
[06:45:40] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:45:41] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:46:31] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw1092 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:42] RECOVERY - puppet last run on elastic1022 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:42] RECOVERY - puppet last run on mw1166 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:47:12] RECOVERY - puppet last run on mw2146 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:47:21] RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:21] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:47:21] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:21] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:30] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:41] RECOVERY - puppet last run on mw1251 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:51] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:40] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:48:51] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[07:20:10] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0]
[08:45:23] do we have an internal issue? From tools-login I am seeing the error **Closing Link: 10.68.17.228 (Connection timed out)
[08:45:23] ** and SULWatcher is no longer echoing account creations
[08:46:05] _joe_ ^
[08:46:29] jgage ^
[08:55:12] tools-login and tools-dev work for me
[08:56:20] yes, didn't say they didn't
[08:56:35] said that from tools-login I am getting the error Closing Link: 10.68.17.228 (Connection timed out)
[08:56:58] and that SULWatcher is not getting its feed of account creations
[08:57:34] * sDrewth doesn't know which server is "10.68.17.228"
[08:59:40] ohh, someone has fixed it, thanks
[09:05:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Apr 18 09:04:53 UTC 2015 (duration 4m 52s)
[09:06:04] Logged the message, Master
[09:07:11] PROBLEM - RAID on db1060 is CRITICAL 1 failed LD(s) (Degraded)
[09:29:48] (CR) Mobrovac: [C: +1] "This could have been a nice opportunity to use https://github.com/wikimedia/service-runner/pull/30 ;)" [dumps/html/deploy] - https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: GWicke)
[09:32:11] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[09:32:40] (CR) Faidon Liambotis: "This is a non-refreshonly exec with no onlyif/unless. This means it gets executed on every puppet run and logged like that -- that's not g" [puppet] - https://gerrit.wikimedia.org/r/204932 (https://phabricator.wikimedia.org/T96045) (owner: coren)
[10:01:46] Krenair: ?
[11:11:06] paravoid, I was wondering what it takes to get our mail servers to accept mail for a new domain
[11:11:09] and wondered if you might know
[11:11:11] Is it a simple case of editing files/exim/wikimedia_domains?
[11:12:49] well, "new": it was registered over a decade ago, but its MX records seem useless because someone who tried to mail it got rejected by polonium
[11:51:31] PROBLEM - puppet last run on cp4018 is CRITICAL puppet fail
[12:09:40] RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:23:11] (PS1) Glaisher: Add DNS for 'gom' sites [dns] - https://gerrit.wikimedia.org/r/205009 (https://phabricator.wikimedia.org/T96468)
[16:12:39] (CR) GWicke: "The dumper is just a set of scripts and not a long-running service. As such, I don't see much use in pulling in service-runner." [dumps/html/deploy] - https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: GWicke)
[17:47:41] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[17:53:33] (PS1) Ricordisamoa: Add categories for quality badges on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/205013
[18:15:41] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[18:18:13] (PS1) Ori.livneh: coal: fix args [puppet] - https://gerrit.wikimedia.org/r/205014
[18:18:24] (PS2) Ori.livneh: coal: fix args [puppet] - https://gerrit.wikimedia.org/r/205014
[18:18:33] (CR) Ori.livneh: [C: +2 V: +2] coal: fix args [puppet] - https://gerrit.wikimedia.org/r/205014 (owner: Ori.livneh)
[18:21:20] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[19:40:21] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00332225913621
[19:40:41] PROBLEM - puppet last run on cp4011 is CRITICAL puppet fail
[19:51:22] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[19:58:31] RECOVERY - puppet last run on cp4011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:27:27] (PS1) Yuvipanda: tools: Overcommit memory for redis instances [puppet] - https://gerrit.wikimedia.org/r/205016
[20:27:28] valhallasw`cloud: ^
[20:30:07] (PS2) Yuvipanda: tools: Overcommit memory for redis instances [puppet] - https://gerrit.wikimedia.org/r/205016
[20:30:16] (CR) Yuvipanda: [C: +2 V: +2] tools: Overcommit memory for redis instances [puppet] - https://gerrit.wikimedia.org/r/205016 (owner: Yuvipanda)
[20:30:35] valhallasw`cloud: that should prevent bgsave issues
[20:31:37] YuviPanda: sounds good
[20:31:42] reboot it! :P
[20:31:55] valhallasw`cloud: what, why? :)
[20:32:04] for the overcommit mem?
[20:32:08] or doesn't that need a reboot
[20:32:11] valhallasw`cloud: not needed, I hand-set it.
[20:32:13] nope
[20:32:14] ah
[20:32:20] valhallasw`cloud: I hand-set it earlier; that's why bgsave succeeded
[20:32:27] aaah
[20:34:04] valhallasw`cloud: I wonder if two redis instances would be an appropriate solution until we can try out cloud...
[20:34:11] valhallasw`cloud: tools-redis-cache and tools-redis-????
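Background on the overcommit change above: BGSAVE makes redis fork() a child that shares the dataset copy-on-write, so the child needs very little extra memory, but under the default vm.overcommit_memory=0 the kernel's heuristic can refuse the fork when the dataset approaches physical RAM. Setting the sysctl to 1 ("always overcommit") lets the fork succeed. The puppet change presumably manages this; a sketch of the manual equivalent in Python (writing requires root):

    OVERCOMMIT = "/proc/sys/vm/overcommit_memory"

    with open(OVERCOMMIT) as f:
        current = f.read().strip()
    print("vm.overcommit_memory =", current)

    # 0 = heuristic: fork() of a process holding ~all of RAM may be
    #     refused, even though copy-on-write means little is copied.
    # 1 = always overcommit: BGSAVE's fork succeeds.
    # 2 = strict accounting.
    if current != "1":
        with open(OVERCOMMIT, "w") as f:  # needs root
            f.write("1\n")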
[20:34:16] and -cache won't be backed up
[20:39:28] YuviPanda: dunno
[20:39:33] YuviPanda: maybe redis just sucks
[20:39:33] :P
[20:40:06] but different hosts with different activated commands would be an option, I guess
[20:40:17] and cache would just evict keys after some time
[20:40:34] redis-broker which just does pub/sub :P
[20:40:52] but those are all stupid workarounds for redis not doing its job
[20:41:39] valhallasw`cloud: what is redis' job you are thinking of that it isn't doing?
[20:41:45] not dying
[20:42:12] if only things were that simple :)
[20:42:22] anyway, it's an in-memory key-value store with certain semantics
[20:42:29] and was never meant to be used in a shared situation like ours
[20:42:39] so we're already abusing it in many ways :)
[20:42:47] maybe
[20:42:55] it does evict keys
[20:42:56] but even in a non-shared situation you'll get this crap
[20:43:08] someone accidentally using too much memory and oh everything dies
[20:43:10] we just don't have enough people setting sane TTLs
[20:43:28] I mean, why does it even try to keep everything in memory?!
[20:43:36] seriously, mysql would do a better job
[20:43:37] because it's an in-memory database?
[20:43:43] err, mysql is not a cache...
[20:43:53] and people can use mysql too, y'know. we have that available as well
[20:44:18] I know :P
[20:44:40] they're two different things
[20:44:53] 'why use memcached? might as well use mysql' doesn't make sense in the same way :)
[20:45:01] you shouldn't be using redis for things that aren't ephemeral
[20:45:25] also if this is about wikibugs - there's a reason I switched grrrit-wm away from redis.
[20:45:33] nobody else was using the queue other than grrrit-wm itself
[20:45:36] so no point in using redis
[20:45:37] at all
[20:45:46] when I could just pass the data internally in the script
[20:45:51] * bd808 uses flat files on nfs for everything ;)
[20:46:05] * YuviPanda books bd808 a 29h flight on United
[20:46:29] I'm wandering around the country on Delta today
[20:46:33] it's also about wikibugs, but it's about infra that fails to work in a consistent manner
[20:46:50] alright, so 'redis does not work in a consistent manner'.
[20:46:58] well, it crashes consistently :P
[20:46:59] it went down today. what do you think we should do to fix this?
[20:47:09] well, it gets *full* consistently
[20:47:28] when you put mysql on a host and then it crashes because you put too much data on it for the disk, you don't blame mysql..
[20:47:34] you put it on a bigger disk
[20:47:41] yes, and because it's redis there is absolutely zilch tooling for figuring out why
[20:47:47] in the case of mysql that would be trivial
[20:47:51] well, it does have tooling
[20:47:54] we just have it disabled
[20:48:14] which one, then?
[20:48:19] the random sampling?
[20:48:31] yeah
[20:48:56] --bigkeys
[20:49:17] we tried that, remember?
[20:49:24] https://phabricator.wikimedia.org/T96489
[20:49:25] it didn't give us a hint of who or what was the issue
[20:49:30] anyway
[20:49:43] do we agree that the problem is 'too much usage'?
[20:50:04] well
[20:50:07] well, mainly non-sensible handling of the too much usage
[20:50:08] too much usage for our current setup :)
[20:50:11] evicting keys would be sensible
[20:50:15] it does evict keys
[20:50:19] let me find you the graph
[20:50:22] telling the client 'sod off, I'm full' would be sensible
[20:50:27] well, not enough then, clearly?
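The --bigkeys mentioned above is redis-cli's built-in sampler. When it doesn't point at a culprit, a hand-rolled SCAN that aggregates by key prefix can help attribute memory to individual tools. A rough redis-py sketch; the host name and the colon-prefix ownership heuristic are assumptions:

    import redis

    r = redis.StrictRedis(host="tools-redis", port=6379)  # assumed host

    # Per-type size probes, in the spirit of redis-cli --bigkeys.
    # Note the units differ: bytes for strings, element counts for the
    # container types.
    sizers = {
        b"string": r.strlen,
        b"list": r.llen,
        b"set": r.scard,
        b"zset": r.zcard,
        b"hash": r.hlen,
    }

    biggest = {}
    for key in r.scan_iter(count=1000):  # cursor-based, non-blocking
        size = sizers.get(r.type(key), lambda k: 0)(key)
        prefix = key.split(b":", 1)[0]   # crude "which tool owns this"
        if size > biggest.get(prefix, (0, b""))[0]:
            biggest[prefix] = (size, key)

    for prefix, (size, key) in sorted(
            biggest.items(), key=lambda kv: -kv[1][0])[:20]:
        print(prefix, size, key)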
[20:50:46] it's like your file system just stops responding when it's full instead of telling you it's full
[20:51:17] are you going to patch redis now?
[20:51:25] this 'oh redis is so terrible' isn't very productive
[20:51:44] you asked me if I agreed with your problem statement, and I don't agree :P
[20:51:48] alright
[20:51:51] so what do you think is the problem?
[20:52:06] 22:49 well, mainly non-sensible handling of the too much usage
[20:52:08] is it 'redis is a piece of shit'?
[20:53:34] anyway, I think redis-cluster is a reasonable long-term solution
[20:53:57] valhallasw`cloud: http://redis.io/topics/lru-cache lists our eviction policies. feel free to play around with those :)
[20:53:59] err
[20:54:02] 'our' as in redis'
[20:54:09] it's set to allkeys-lru
[20:54:36] it used to be volatile-lru but that wasn't enough
[20:54:37] ok
[20:54:39] because nobody set ttl
[20:55:28] so should we just set maxmemory lower?
[20:55:38] so that the host has a bit more breathing space?
[20:55:47] valhallasw`cloud: yeah, it used to be 15G for some reason (not me!) and is 12 now
[20:55:51] valhallasw`cloud: let's reduce it to 10?
[20:56:07] 3G of extra space seems ok but maybe we can reduce it further
[20:56:09] mm, or figure out why the memory is full still
[20:56:24] valhallasw`cloud: I'm also wondering if it just got 'stuck' due to OpenStack, actually
[20:56:31] causing diamond to fail as well
[20:56:34] hmm
[20:56:35] but maybe that's just the result of OOM
[20:57:01] valhallasw`cloud: you can reduce memory yourself! 'toollabs::redis::maxmemory: 10G' on Hiera:tools :D
[20:57:07] :P
[20:57:12] and run puppet / reboot redis
[20:57:13] err
[20:57:14] restart
[20:57:40] !log running forceRenameUsers.php (SUL finalization) on medium wikis starting with mgwiki. skipping mediawikiwiki for now due to T96489
[20:57:46] Logged the message, Master
[20:59:14] YuviPanda: also, there are two redis-server processes that both use 12G?!
[20:59:20] huh?
[20:59:32] 6329 redis 20 0 12.393g 0.012t 436 R 99.7 78.4 1:46.09 redis-server 0 604 12.380g
[20:59:32] 1339 redis 20 0 12.393g 0.012t 1108 S 0.0 78.4 2:57.71 redis-server 0 604 12.379g
[20:59:49] wtf
[21:00:50] oh maybe when it's dumping
[21:00:53] it's gone now
[21:01:02] but it fork()s when it does a BGSAVE
[21:02:00] valhallasw`cloud: ah, maybe. ps auxf helps
[21:05:40] but maybe linux is smart and shares memory on fork()?
[21:07:45] valhallasw`cloud: yup, COW
[21:07:54] oh, should be OK then
[21:10:44] valhallasw`cloud: yeah, that's why it forks I think
[21:10:46] so it can just read
[21:10:57] let me check what the free memory does
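For reference, the two knobs discussed above map directly onto redis config: maxmemory (fed here from the toollabs::redis::maxmemory hiera key) and maxmemory-policy. volatile-lru only ever evicts keys that carry a TTL, which is why it "wasn't enough" when nobody set TTLs; allkeys-lru may evict anything. A small redis-py sketch of inspecting the settings and of the TTL discipline clients would need (the host name is an assumption):

    import redis

    r = redis.StrictRedis(host="tools-redis")  # assumed host

    # Current limit and eviction policy; CONFIG SET would change them
    # live, without a restart.
    print(r.config_get("maxmemory"))         # e.g. {'maxmemory': '12884901888'}
    print(r.config_get("maxmemory-policy"))  # e.g. {'maxmemory-policy': 'allkeys-lru'}

    # volatile-lru can only evict keys like this one, which has a TTL:
    r.setex("mytool:session:abc", 3600, "value")  # gone after an hour

    # ...and never keys like this one; only allkeys-lru (or an explicit
    # delete) will ever reclaim it:
    r.set("mytool:forever", "value")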
[22:10:40] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1218232 (Nemo_bis)
[22:50:21] (PS1) Ori.livneh: coal: fix archive def and file path [puppet] - https://gerrit.wikimedia.org/r/205066
[22:50:35] (CR) Ori.livneh: [C: +2 V: +2] coal: fix archive def and file path [puppet] - https://gerrit.wikimedia.org/r/205066 (owner: Ori.livneh)
[23:01:07] (PS1) QChris: Add alerts for missing hours in pagecounts_all_sites and pagecounts_raw [puppet] - https://gerrit.wikimedia.org/r/205067
[23:03:01] (CR) QChris: Add alerts for missing hours in pagecounts_all_sites and pagecounts_raw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/205067 (owner: QChris)
[23:32:21] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 836.118956377
[23:51:01] operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1218272 (MF-Warburg) >>! In T96468#1218022, @Glaisher wrote: > Currently statistics at translatewiki.net for gom-deva. > | Message group | Completion...