[00:14:34] (PS1) Dzahn: mailman: old maintenance script for list report [puppet] - https://gerrit.wikimedia.org/r/237865
[00:17:27] (PS2) Dzahn: mailman: old maintenance script for list report [puppet] - https://gerrit.wikimedia.org/r/237865 (https://phabricator.wikimedia.org/T83158)
[00:19:27] operations: New report for Mailman mailing lists configuration - https://phabricator.wikimedia.org/T83158#1633093 (Dzahn)
[00:28:08] PROBLEM - puppet last run on analytics1017 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:34:05] (CR) Dzahn: [C: -1] "that script is still just in root and that user doesn't have the permission to run it, gotta puppetize that too" [puppet] - https://gerrit.wikimedia.org/r/235959 (https://phabricator.wikimedia.org/T107398) (owner: Dzahn)
[00:41:25] (CR) BBlack: [C: 2] Grafana: allow unauthenticated GET requests [puppet] - https://gerrit.wikimedia.org/r/237761 (owner: Ori.livneh)
[00:54:30] RECOVERY - puppet last run on analytics1017 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[01:02:28] PROBLEM - puppet last run on mw2024 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:06:43] (PS1) GWicke: Slightly increase RESTBase job runner concurrency [puppet] - https://gerrit.wikimedia.org/r/237868
[01:08:37] (PS2) GWicke: Slightly increase RESTBase job runner concurrency [puppet] - https://gerrit.wikimedia.org/r/237868
[01:28:39] RECOVERY - puppet last run on mw2024 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[01:45:29] (PS31) Gergő Tisza: Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[02:01:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=364.72 Read Requests/Sec=418.13 Write Requests/Sec=136.05 KBytes Read/Sec=3176.64 KBytes_Written/Sec=730.31
[02:03:18] RECOVERY - mailman I/O stats on fermium is OK: OK
- I/O stats: Transfers/Sec=1.30 Read Requests/Sec=0.00 Write Requests/Sec=1.30 KBytes Read/Sec=0.00 KBytes_Written/Sec=5.61
[02:32:01] !log l10nupdate@tin Synchronized php-1.26wmf22/cache/l10n: l10nupdate for 1.26wmf22 (duration: 06m 54s)
[02:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:59] (PS1) Tim Landscheidt: Tools: Use LDAP for mail queries [puppet] - https://gerrit.wikimedia.org/r/237871
[02:35:36] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf22) at 2015-09-12 02:35:36+00:00
[02:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:37] (CR) Tim Landscheidt: "(This is just meant as an improvement to the existing setup. I have uploaded a move to LDAP as I0699c0281a593f7aca5ad991921e21e9eed90df6." [puppet] - https://gerrit.wikimedia.org/r/148917 (owner: Tim Landscheidt)
[02:44:37] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint, WorkType-Maintenance: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1633249 (Deskana) Open>Resolved
[02:46:37] (CR) Deskana: "In T100500#1443407, Erik said this issue was resolved. Should this patch be abandoned?" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/223202 (owner: EBernhardson)
[02:46:53] operations, Discovery, Wikimedia-Logstash, Elasticsearch, Graphite: Deploy statsd plugin for production elasticsearch & logstash - https://phabricator.wikimedia.org/T90889#1633274 (Deskana)
[02:54:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[02:58:21] (CR) Tim Landscheidt: [C: -1] "I tested this on Toolsbeta (and it is currently deployed there), and it works for users, servicegroups and nested servicegroups.
But:" [puppet] - https://gerrit.wikimedia.org/r/237871 (owner: Tim Landscheidt)
[03:01:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[03:05:08] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[03:32:09] (PS2) Andrew Bogott: Added a check for labs internal dns [puppet] - https://gerrit.wikimedia.org/r/237735 (https://phabricator.wikimedia.org/T107453)
[03:33:16] (CR) Andrew Bogott: [C: 2] Added a check for labs internal dns [puppet] - https://gerrit.wikimedia.org/r/237735 (https://phabricator.wikimedia.org/T107453) (owner: Andrew Bogott)
[03:59:42] operations, Wikimedia-General-or-Unknown, Database: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1633349 (Krenair) ```mysql> use enwiki; Database changed mysql> select page_namespace, page_title from page left join revision on (rev_page = page_id) where rev_page is null;...
[04:15:39] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[04:28:29] (PS32) Gergő Tisza: Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[04:52:02] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Sep 12 04:52:01 UTC 2015 (duration 52m 0s)
[04:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:54:20] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[06:00:29] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[06:18:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds
[06:22:39] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[06:28:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[06:30:49] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:58] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:49] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:59] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:49] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:39] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[06:48:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly
detected: 10 data above and 8 below the confidence bounds
[06:56:49] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:56:50] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:42] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:49] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:01] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:12:59] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=143.27 Read Requests/Sec=984.91 Write Requests/Sec=101.11 KBytes Read/Sec=15050.70 KBytes_Written/Sec=404.83
[08:29:00] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=190.11 Read Requests/Sec=55.87 Write Requests/Sec=45.04 KBytes Read/Sec=1994.78 KBytes_Written/Sec=180.14
[08:50:30] (PS1) Zfilipin: WIP Move Ruby related packages to a separate file [puppet] - https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865)
[08:53:20] (CR) Jcrespo: "@QChris I do not say that we should import the git sources of every package we use, but we should make sure that a) They are open source b" [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[08:58:22] operations, Datasets-General-or-Unknown: At peak usage, dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1633442 (Nemo_bis) > I
wouldn't be surprised if this led to a recent traffic increase on dumps. Traffic doesn't look exceptional tho...
[09:00:25] operations, Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1633443 (Nemo_bis)
[09:03:14] (CR) Jcrespo: "So only useful on mw1017, then?" [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo)
[09:04:06] operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1633445 (jrobell) @BBlack We have a banner campaign planned to go up in Luxembourg and Belgium on Monday morning UTC time. Should we postpone this lau...
[10:26:23] (CR) Merlijn van Deen: [C: 1] Tools: Accept mail for all submit hosts [puppet] - https://gerrit.wikimedia.org/r/237863 (https://phabricator.wikimedia.org/T63484) (owner: Tim Landscheidt)
[10:50:08] operations, Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1633610 (jcrespo) Please note that on the original ticket, I noted that I do not think this is a network bandwidth problem...
[11:19:14] (PS1) Merlijn van Deen: toollabs: remove redis Sysctl[vm.overcommit_memory] [puppet] - https://gerrit.wikimedia.org/r/237895
[11:35:13] YuviPanda: ^
[12:21:02] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1633652 (Nemo_bis) >>! In T112025#1632613, @QChris wrote: > I verified that even for our old gerrit, adding BouncyCastle is sufficient t...
[12:23:29] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:49:39] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:10:58] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1633685 (Aklapper) @Paladox: Other arguments (//=unrelated to this very task//, like random features in 2.11) for upgrading to Gerrit 2....
[13:47:06] Did the SSL cert for phab.wmfusercontent.org just expire, 6 minutes ago? LOOl
[13:47:52] that's an oddly short-lived certificate
[13:48:28] Phabricator is now unstyled so profit
[13:49:19] Valid from Tue, 14 Jul 2015 00:32:07 UTC
[13:49:27] just two months?
[13:50:06] jynus: ^
[13:50:36] yes, I see it
[13:54:48] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL: CRITICAL - load average: 103.60, 100.49, 96.79
[13:57:00] the dates seem close to https://gerrit.wikimedia.org/r/#/c/224552/
[13:59:06] It would be very useful if the cert that was issued was valid for at least a year.
[13:59:22] but, perhaps in the future.
[14:01:18] nice rainbow
[14:02:15] when it is generated for 2 months it is because it is temporary for a reason
[14:04:56] There is a setting to make phabricator use its main domain for the styles, btw.
[14:05:00] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL: CRITICAL - load average: 106.01, 100.32, 97.82
[14:06:20] which it should be doing anyways. the point of wmfusercontent.org was just to host uploaded things attached to tickets and such
[14:06:41] anyways, I'll start the cert renewal. apparently they send reminder emails to individuals heh.
[14:07:17] I will check the config
[14:07:29] https://phabricator.wikimedia.org/T104730 is related I guess
[14:07:31] so that if it happens again, only images/files are affected
[14:08:58] there's a bug filed about phab loading CSS/JS from wmfusercontent.org.
[14:09:45] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1633764 (jcrespo) p:Normal>High This is making phabricator unusable due to the lack of styles as phab.wmfusercontent.org is unavailable at the moment.
[14:09:54] thanks, SPF|Cloud
[14:09:59] np
[14:10:09] (oh, you linked it)
[14:10:09] js works
[14:10:20] it is css only
[14:11:03] jynus: both CSS and JS here are loaded via phab.wmfusercontent.org
[14:11:04] or at least the basic js for preview, etc.
[14:11:13] https://gyazo.com/f730a80a3bc271162faa3b1220024c4e
[14:11:48] mmm, maybe it is cached on my system, then
[14:11:59] In fact you are able to load the css/js files over http, because https isn't forced - but phabricator itself is https-only so..
[14:21:05] from what I can see in the phab docs, I think this is "normal" for phab
[14:21:30] when you set the security.alternate-file-domain for uploaded content, it also uses that for all js/css
[14:21:43] (they're not separate settings)
[14:22:03] :-)
[14:22:04] Yes, that's the problem.
[14:25:16] we could just flip it all over to phab.wm.o for now. I mean, there is a security reason for that, but I imagine it's one of many layers of defense?
[14:25:29] it could be disabled temporarily <<--- I was going to say the same
[14:25:31] I donno, would be better to talk to someone who knows it better
[14:25:47] i.e. how exposed are we to some injection attack by doing that
[14:26:10] yes, exactly my train of thought, can we trust that?
[14:26:17] we could also reconfigure everything to use something temporary in another of our wildcarded domains to mitigate this
[14:26:26] e.g.
set up a phab.wikibooks.org for it or whatever
[14:26:50] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1633777 (QChris) >>! In T112025#1633652, @Nemo_bis wrote: > 1.44? If my notes at T65847#699055 are correct, [[https://gerrit-documentati...
[14:26:51] well, maybe not wikibooks
[14:27:09] the point is to separate it from wikimedia.org
[14:27:12] yes
[14:27:16] phabricator-static.wikipedia.org?
[14:27:19] idk
[14:27:39] it'd be a temporary solution anyways, until this renewal completes
[14:27:47] I'd hope the renewal completes in a matter of hours anyways, but still
[14:28:37] it is actually good that it failed on a saturday
[14:28:50] why?
[14:28:52] well, better not failing at all
[14:29:24] it has less traffic on weekend
[14:29:43] still more than at 4am UTC
[14:31:52] so, for that we would have to change the SSL configuration on cache for that domain, the apache virtual domain, and the phabricator configuration? Did I miss something?
[14:32:29] jynus: none I can think about really, looks good
[14:32:30] also the dns for the new domain
[14:32:39] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[14:32:48] no
[14:32:59] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 34.58 ms
[14:33:03] anyways, I'm pushing a related misc-web change that will let me hack my local DNS and see if the rest works
[14:33:10] if so, we can follow that up with a real DNS change
[14:33:26] ok, will wait
[14:33:26] (PS1) BBlack: misc-web: temporarily broaden user content domain match for phab [puppet] - https://gerrit.wikimedia.org/r/237912
[14:34:05] (CR) BBlack: [C: 2 V: 2] "Note this is temporary - will be reverted after cert issues fixed" [puppet] - https://gerrit.wikimedia.org/r/237912 (owner: BBlack)
[14:36:44] woah, is phabricator broken for others?
[14:37:23] aude: yeah, ops is working on it, see above
[14:37:56] ok
[14:38:03] it has a 'new' skin :P
[14:38:25] we're calling it the light and clean look
[14:38:28] it's Web 5.0
[14:38:29] heh
[14:38:31] lol
[14:38:53] Who needs CSS?
[14:39:49] ok looks like it will work with very little hackery, and I'm going to switch it to phab.wikivoyage.org for now until the renewal goes through
[14:40:20] +1
[14:40:33] no security compromised with that
[14:40:38] [/me hates renewals] looks good
[14:41:00] (PS1) BBlack: Temporarily move phab altdom into wikivoyage.org [puppet] - https://gerrit.wikimedia.org/r/237913
[14:41:58] (CR) Luke081515: [C: 1] Temporarily move phab altdom into wikivoyage.org [puppet] - https://gerrit.wikimedia.org/r/237913 (owner: BBlack)
[14:42:55] (PS1) BBlack: Temporarily create phab.wikivoyage.org [dns] - https://gerrit.wikimedia.org/r/237914
[14:43:16] (CR) BBlack: [C: 2] Temporarily create phab.wikivoyage.org [dns] - https://gerrit.wikimedia.org/r/237914 (owner: BBlack)
[14:43:46] (CR) BBlack: [C: 2] Temporarily move phab altdom into wikivoyage.org [puppet] - https://gerrit.wikimedia.org/r/237913 (owner: BBlack)
[14:46:03] should be fixed now
[14:46:07] (I purged caches, etc)
[14:46:11] jep
[14:46:13] looks good
[14:46:18] I confirm.
[14:47:53] (CR) JanZerebecki: "Using the Debian binary as the source for the binary jar will be able to provide the chain of trust (and ensure that the source is sti" [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[14:50:02] !log phab.wmfusercontent.org has been temporarily switched to phab.wikivoyage.org due to cert issues
[14:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:50:51] operations: Undo phab.wikivoyage.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1633810 (BBlack) NEW
[15:12:29] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL: CRITICAL - load average: 101.44, 100.93, 99.10
[15:12:36] (CR) Jcrespo: "Maybe it already exists on beta? You should check that." [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[15:13:56] (CR) QChris: "@Jcrespo: We're on the same side." [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[15:15:14] (CR) QChris: "@Jcrespo: I see our messages crossed one another. Ok. So let's" [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[15:18:29] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL: CRITICAL - load average: 111.53, 101.47, 99.59
[15:23:46] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1633899 (Nemo_bis) I know. I was wondering if installing BouncyCastle 1.44 will prevent the upgrade to gerrit 2.8.5+.
[15:24:39] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL: CRITICAL - load average: 103.95, 100.87, 99.70
[15:35:27] (PS1) QChris: Add jar for BouncyCastle 1.44 from Debian wheezy [gerrit/plugins] - https://gerrit.wikimedia.org/r/237918 (https://phabricator.wikimedia.org/T112025)
[15:37:03] (CR) QChris: [C: 1] Add jar for BouncyCastle 1.44 from Debian wheezy [gerrit/plugins] - https://gerrit.wikimedia.org/r/237918 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[15:43:28] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL: CRITICAL - load average: 107.71, 102.35, 100.27
[15:44:38] (PS1) BBlack: add phab.wikidata.org temporarily T112381 [dns] - https://gerrit.wikimedia.org/r/237919
[15:44:59] (CR) BBlack: [C: 2] add phab.wikidata.org temporarily T112381 [dns] - https://gerrit.wikimedia.org/r/237919 (owner: BBlack)
[15:45:29] (PS2) QChris: Make gerrit offer newer key exchange algorithms for new sshs [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025)
[15:46:21] (PS1) BBlack: switch phab altdom to phab.wikidata.org T112381 [puppet] - https://gerrit.wikimedia.org/r/237920
[15:46:40] (CR) QChris: [C: -1] "CR-1 to mark dependence on yet unmerged change" [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[15:46:42] (CR) BBlack: [C: 2 V: 2] switch phab altdom to phab.wikidata.org T112381 [puppet] - https://gerrit.wikimedia.org/r/237920 (owner: BBlack)
[15:49:39] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1633969 (QChris) >>! In T112025#1633899, @Nemo_bis wrote: > I was wondering if installing BouncyCastle 1.44 will prevent the upgrade to...
[15:50:09] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1633971 (jcrespo) QChris: It is trivial to move it from Debian to our repo: https://wikitech.wikimedia.org/wiki/APT_repository#Updating_...
[15:50:26] operations: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1633972 (BBlack)
[15:51:30] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL: CRITICAL - load average: 101.30, 100.44, 100.06
[15:53:12] jynus: ^ may be worth powercycling. been like that for a few hours it seems
[15:53:35] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1633981 (greg) Then it should be fixed, the dependency/separation needs to stay for security reasons, as stated by our security team.
[15:56:05] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1633984 (BBlack) The lack of style issue earlier today is relatively-unrelated (cert expiry issue, but the impact would've been smaller if this separation issue here had already be...
[15:58:30] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1633988 (jcrespo) Phabricator has been temporarily fixed. Yes, separation should stay @Greg. I am not suggesting merging both domains, but report and/or fix the load of phabricato...
[16:01:58] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1633992 (QChris) >>! In T112025#1633971, @jcrespo wrote: > QChris: It is trivial to move it from Debian to our repo: https://wikitech.wi...
[16:05:06] operations, Phabricator, Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1633993 (greg) I think the purpose of this ticket should be clarified in the description (as agreed by ops and releng); there are multiple opinions floated here.
[16:10:30] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1633995 (jcrespo) Yes, I was offering my help here to make things "the right way", and I would take the "supposed overhead", but it seem...
[16:16:48] operations, Wikimedia-Git-or-Gerrit, Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1633998 (QChris) >>! In T112025#1633995, @jcrespo wrote: > Yes, I was offering my help here to make things "the right way", and I would...
[16:22:18] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL: CRITICAL - load average: 108.50, 101.25, 100.24
[16:41:35] operations: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1634000 (Luke081515) Why switched between wikivoyage and wikidata?
[16:44:49] PROBLEM - very high load average likely xfs on ms-be2006 is CRITICAL: CRITICAL - load average: 98.44, 99.95, 100.01
[16:47:28] JohnFLewis, it is https://phabricator.wikimedia.org/T112242, will downtime it
[16:47:53] mutante: ^ ; jynus okay :)
[16:56:05] operations: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1634007 (jcrespo) @Luke081515 There were some concerns about wikivoyage isolation between subdomains; we believe wikidata domains are more independent from each other and could create l...
[17:08:13] operations, Release-Engineering-Team: tin disk space at 5% - https://phabricator.wikimedia.org/T112391#1634038 (jcrespo) NEW
[17:09:03] operations, Release-Engineering-Team: tin disk space at 5% - https://phabricator.wikimedia.org/T112391#1634045 (jcrespo)
[17:09:31] operations, Release-Engineering-Team: tin disk space at 5% - https://phabricator.wikimedia.org/T112391#1634038 (jcrespo)
[17:13:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[17:17:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[17:21:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[17:29:50] operations, Release-Engineering-Team: tin disk space at 5% - https://phabricator.wikimedia.org/T112391#1634070 (Dzahn) The thing that takes all the space here mostly is `/var/lib/l10nupdate/caches` is 53G.
[17:53:12] operations, Release-Engineering-Team: tin disk space at 5% - https://phabricator.wikimedia.org/T112391#1634084 (Legoktm) @Krenair is also using tin to upload files (~92G) - {T111941} because terbium doesn't have enough space.
[18:09:19] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[18:19:28] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=605.14 Read Requests/Sec=1045.66 Write Requests/Sec=1.74 KBytes Read/Sec=4182.64 KBytes_Written/Sec=6.95
[18:21:29] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.50 Read Requests/Sec=0.90 Write Requests/Sec=2.10 KBytes Read/Sec=3.60 KBytes_Written/Sec=8.39
[18:31:49] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=538.07 Read Requests/Sec=563.72 Write Requests/Sec=1.11 KBytes Read/Sec=2254.87 KBytes_Written/Sec=4.42
[18:37:51] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=124.70 Read Requests/Sec=1.20 Write Requests/Sec=0.60 KBytes Read/Sec=4.80 KBytes_Written/Sec=2.40
[18:41:47] operations, Release-Engineering-Team: tin disk space at 5% - https://phabricator.wikimedia.org/T112391#1634151 (Krenair) I deleted the 46GB original archive, other files will go soon.
[18:49:59] operations: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1634165 (AxelBoldt) NEW
[18:52:26] (PS1) Kaldari: Adding comment on disabling anon page creation on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/237960
[18:57:38] (CR) Nemo bis: Adding comment on disabling anon page creation on English Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/237960 (owner: Kaldari)
[19:10:49] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail
[19:21:55] !log performing Cassandra repair on restbase1002 (nodetool repair -pr)
[19:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:37:29] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[19:38:14] operations: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1634224 (Krenair) Just like *.wikipedia.org, putting it at *.wikivoyage.org would have given Phabricator access to production MediaWiki cookies. *.wikidata.org is not trusted like that A...
[19:51:38] operations: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1634236 (Southparkfan) @BBlack: is the renewal still not done yet?
[19:53:50] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[19:57:38] operations: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1634237 (BBlack) Nope
[20:15:19] !log Rolling back Echo to 1.26wmf21 branch on mw1017 (testwiki) to measure increase in render-blocking CSS size
[20:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:21:40] operations, netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1634262 (Krenair)
[20:28:04] operations, netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1634266 (faidon) p:Triage>High
[20:29:14] operations, netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1634165 (faidon) Please provide your IP for a reverse traceroute; you can omit the last number for privacy reasons if you want. Also, it'd be helpful to know if the site is completely un...
[20:38:40] operations, netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1634275 (faidon) I tried ping/traceroute from CenturyLink's Minneapolis PoP using their [[ https://kai04.centurylink.com/PtapRpts/Public/BackboneReport.aspx | looking glass ]]. It looks no...
[20:50:04] ori: ? we're now loading oojs-ui on every page view
[20:51:48] operations, netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1634293 (faidon) I also created a RIPE Atlas measurement from 24 probes that are in CenturyLink's network (AS209). That's [[ https://atlas.ripe.net/measurements/2407575/ | measurement 24...
[20:52:02] legoktm: https://phabricator.wikimedia.org/T112401
[20:52:20] * legoktm looks
[20:55:10] ori: seen https://phabricator.wikimedia.org/T112347 >
[20:55:11] ?
[20:57:55] nope.
that should have been a blocker, probably.
[21:01:39] it won't help as much as it might seem, since we're also loading all the modules with icons CSS and stuff.
[21:01:55] which is about as big as the regular CSS.
[21:02:15] ori: hm, so i guess you don't read bugmail? :/
[21:03:11] MatmaRex: I try to, but this was posted an hour before the end of the workweek, and less than 24 hours ago
[21:03:50] I frequently fall behind and sometimes I cut my losses and archive things in the interest of not lagging permanently behind
[21:04:31] well yeah, it was in response to the echo split icons deployment
[21:04:50] anyway, okay. it just bugs me when people file duplicates of bugs that i cc'd them on, and such
[21:07:04] operations, Datasets-General-or-Unknown, JavaScript: Instability on fr.wikiversity server - https://phabricator.wikimedia.org/T112069#1634326 (Lionel_Scheepmans)
[21:07:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0]
[21:08:15] MatmaRex: fair point. For now, though, I think it makes sense to keep both of them open, and make T112401 about the immediate issue of how to deal with this being deployed, and T112347 about longer-term fixes to OOjs UI.
[21:08:33] MatmaRex: either way, kudos for spotting it.
[21:09:55] yeah, you're right
[21:10:25] operations, JavaScript: Instability on fr.wikiversity project - https://phabricator.wikimedia.org/T112069#1634328 (Peachey88)
[21:11:02] operations, JavaScript: Instability on fr.wikiversity project - https://phabricator.wikimedia.org/T112069#1624124 (Peachey88) Open>Resolved a:Peachey88 Closing as per Lionel_Scheepmans's latest edits to the task.
[21:17:02] operations, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, Services: Apertium Failed to load resource: net::ERR_SPDY_PROTOCOL_ERROR - https://phabricator.wikimedia.org/T112403#1634353 (MarcoAurelio) Adding some projects found on other CT bugs.
[21:19:01] 6operations, 10netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1634365 (10AxelBoldt) It is now 4:18 PM CDT and I can reach all Mediawiki sites again. [21:20:57] 6operations, 7JavaScript: Instability on fr.wikiversity project - https://phabricator.wikimedia.org/T112069#1634370 (10Lionel_Scheepmans) [21:21:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:18:17] 6operations, 7JavaScript: Instability on fr.wikiversity project - https://phabricator.wikimedia.org/T112069#1634456 (10Nemo_bis) [23:02:46] Hello, I saw a post by Mark about Icinga's lack of scalability. I was curious to know more about it and what was the constraint that makes Icinga not scalable enough [23:06:10] Hey Toordog [23:06:29] Mark is in the Netherlands and probably not around right now [23:09:42] YuviPanda: yeah, I was chatting with Toordog on #wikimedia-tech, and redirected here. I figured *someone* here might know about the scalability issues Mark saw with Icinga [23:09:58] Hi YuviPanda, yeah I will have to try again early morning in america [23:10:40] Toordog: robla my understanding of that is that we just have too many checks for one box and there is immense skepticism of icinga's horizontal scalability [23:11:30] Toordog: akosiaris and godog (both in CEST tz) probably know more details [23:11:34] you are using only one icinga server? [23:11:38] Yes [23:11:51] maybe i can give some input then [23:12:09] Indeed. [23:12:30] (I'm very tangentially involved only - don't know the full details) [23:12:37] looks like with my very very limited understanding of your environment that it might be an architecture design issue of your icinga infrastructure. [23:13:09] That's possible [23:13:11] I was reviewing and auditing Icinga for one of the major banks of Canada and scalability was one of the main scope we evaluated.
[23:13:44] We have nrpe from 4 dcs reporting to one icinga server as well [23:14:00] other than Sensu *and sensu didn't clear because of the project wasn't mature, supported enough and stability of the product* [23:14:31] you are using icinga 1 or 2? [23:15:17] 1 [23:15:39] Icinga2 is a possible evaluation candidate [23:15:52] icinga 1 is only a rewritten nagios [23:16:09] the architecture is flawed just like Nagios, only the language and a few small fixes made it faster. [23:16:14] Yup [23:16:21] icinga didn't clear either [23:16:24] only icinga 2 [23:16:24] Icinga2 might fare much better [23:16:29] it is [23:16:43] and it integrates very well with puppet and is designed to be more programmatic friendly [23:16:49] Yay! [23:16:58] meaning your config files accept iteration and conditionals [23:16:59] I haven't fully looked at it yet (none of us have) [23:17:02] if, for, ... [23:17:08] Yeah I really like the file format [23:17:16] Much better than nagios compatible config [23:17:24] i think you guys will fall in love with it, particularly if you are using puppet extensively [23:18:04] Prometheus has been considered a candidate as well [23:18:08] have you seen the puppet repo yet? [23:18:15] one of the very good things of icinga2, is also that everything is encrypted between the client and the server ... an agent is just an icinga with some modules not enabled. That means, if you know how the Icinga master works, it is the same thing for the agent. [23:18:16] As a replacement for graphite+icinga [23:18:23] you can also use an agent as a proxy or *second master * without effort [23:18:28] very flexible [23:18:38] Toordog: by 'agent' you mean an nrpe replacement?
[23:18:55] (Or ncsa, I can never keep those straight in my head) [23:18:57] NRPE is the worst thing ever happening [23:19:02] Ha-ha yes [23:19:04] flawed in everything [23:19:14] so yes it not only replaces it, but you use the icinga engine as agent [23:19:23] meaning you just need to enable modules to get the same functionality as the master. [23:19:30] Nice [23:19:35] you can switch an agent as a second master and so on just by config [23:19:51] if an outage happens, it is easy to retrieve a GUI and all the features that the master was taking care of. [23:20:13] Yeah it has a new web thing too [23:21:14] one negative for people that like GUI, it is not designed to be configured via the web interface. [23:21:22] the web interface is mostly informational. [23:21:37] *we were comparing NagiosXI vs Icinga 2 * [23:21:51] but i think in wikimedia case, that won't be any negative at all [23:22:11] i don't know prometheus [23:23:17] Toordog: are you involved with the icinga project? [23:23:45] I think godog and maybe akosiaris will be interested when their day rolls around :) [23:23:54] ah ok prometheus is more like a graphic tooling. *like cacti, graphite, ...* [23:24:26] i'm not involved, i was team lead of a project auditing icinga only. I talked to Friedrich dev lead at icinga2 [23:25:24] for everything related to icinga 2 architecture, design, deployment, i'm pretty savvy now. Operational, I didn't operate it yet. It stayed on the tablet, delivered the report and passed to the next challenge. [23:25:43] i'm an IT consultant in Montreal [23:26:18] I think Mark should get a chat with Friedrich, the team are in germany and netherlands [23:26:39] Toordog: aaah nice :l [23:26:40] Err [23:26:42] :) [23:27:04] Toordog: we have tried out shinken in our labs environment [23:27:08] It isn't too bad either [23:27:18] But still has nagios compat config [23:28:39] shinken is a skin on top of nagios [23:28:50] just like any other nagios ports [23:29:00] they all have the same problem. 
: the core of nagios is flawed [23:29:25] one of the big challenges of those platforms is : try to have many monitoring servers and centralise the reports [23:29:35] even Thruk is more a hack than a solution [23:33:23] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Puppet failing on deployment-prep caches - https://phabricator.wikimedia.org/T104076#1634570 (10Luke081515) [23:33:42] netways.de are very welcoming and great people. When we talked to Nagios, the guys almost told me to go fuck yourself *i represent a bank willing to purchase for 150 000$ of license* and at best they offered an as-is product. When I talked to netways, they don't sell licenses and they were ready to fly someone to our location for a very basic price for consultancy. [23:34:39] *they are from germany and we are in canada*, for me it was a winner because as a company they care and they are available and the product is very good too. [23:36:41] YuviPanda is the operation engineer required to be very proficient with Puppet? [23:36:54] Toordog: it isn't too hard to pick up [23:37:14] so I wouldn't say 'very' proficient [23:38:46] I'm familiar with puppet but very basic stuff [23:40:29] Toordog: yeah, the rest is just going 'what the fuck, puppet?' every day for a few months :D [23:48:51] lol [23:49:25] YuviPanda can you tell me about what is LVS ^ [23:49:26] ? [23:50:02] I'm seeing it extensively as for the load balancing and as a cluster [23:52:12] Toordog: I don't fully know myself, I must admit. bblack knows way more :) [23:52:42] Toordog: I guess it's https://en.wikipedia.org/wiki/Linux_Virtual_Server [23:53:04] thx for the link [23:54:27] damn how so I never heard of that project before :) [23:56:20] :) [23:57:10] At the same time, the front end cluster never been as big as wikimedia wiki. So a bunch of nginx or haproxy *(ssl offload)* was doing the job good enough.
[23:57:47] if I understand it, you have an LVS cluster running nginx and the LVS layer will distribute the load on the different Nginx? [23:58:33] I *think* it's LVS -> nginx -> varnish -> lvs -> apache [23:58:49] i could be wrong [23:59:37] that's more or less my understanding too