[00:03:57] hold on. gerrit is being worked on
[00:04:06] PROBLEM - gerrit process on ytterbium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[00:04:19] ebernhardson: ^ mutante did it
[00:04:36] Yes yes
[00:04:36] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:04:37] Stupid gerrit
[00:05:06] !log gerrit: disabled puppet for a minute so I can unbreak gerrit so I can fix gerrit in puppet.
[00:05:08] * bd808 shakes fist
[00:05:10] inception!
[00:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:06:01] a small syntax issue in that config change
[00:06:04] Yep
[00:06:17] RECOVERY - gerrit process on ytterbium is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[00:06:36] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 13 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[00:06:41] !log gerrit is back
[00:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:07:51] Ok, chain incoming
[00:08:16] Oh yeah bot died.
[00:08:18] Figures
[00:08:19] lol
[00:08:38] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[00:08:38] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:08:48] mutante: https://gerrit.wikimedia.org/r/#/c/298116/
[00:08:50] Start there
[00:08:52] (brb)
[00:08:53] :) yea, always on restart
[00:08:57] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[00:08:58] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[00:09:07] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures
[00:09:46] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:10:35] that will be another restart but gotta
[00:11:07] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:11:38] ostriches: it's on master now
[00:11:55] are you re-enabling it or more than 1 patch
[00:12:01] Shouldn't restart
[00:12:06] I livehacked
[00:12:10] ah, cool
[00:12:15] oh right
[00:12:17] So turn on puppet and we should get the same file
[00:12:28] Operations, Deployment-Systems, Scap3 (Scap3-MediaWiki-MVP): Completely port l10nupdate to scap - https://phabricator.wikimedia.org/T133913#2443475 (bd808) >>! In T133913#2443360, @mmodell wrote: > Full scap of a new branch is ~30 minutes. Updating localization without syncing a new branch is probabl...
[00:12:33] ok, turning on
[00:13:26] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[00:13:28] PROBLEM - puppet last run on ytterbium is CRITICAL: CRITICAL: Puppet has 1 failures
[00:13:38] icinga-wm: not anymore, it's fine
[00:13:52] yep, no change
[00:13:57] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:14:55] mutante: Running the ssl refactor through puppet compiler now
[00:15:32] error, ugh
[00:15:47] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:15:51] Oh, I have to still pass host.
[00:15:59] That sucks.
[00:16:16] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[00:20:36] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:22:22] Operations, Cassandra, Services, Patch-For-Review: High storage utilization on restbase1014.eqiad.wmnet - https://phabricator.wikimedia.org/T139362#2443543 (Eevans) Open>Resolved
[00:25:46] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[00:28:25] !log analytics1027 - icinga said puppet fail, just ran it, recovery, same on neon.. something kafka graphite checks
[00:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:29:31] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:30:10] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[00:30:32] ACKNOWLEDGEMENT - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 14 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). daniel_zahn new puppetmaster being set up (https://phabricator.wikimedia.org/T98173)
[00:30:32] ACKNOWLEDGEMENT - puppet last run on rhodium is CRITICAL: CRITICAL: Puppet has 4 failures daniel_zahn new puppetmaster being set up (https://phabricator.wikimedia.org/T98173)
[00:35:02] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:37:50] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:38:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:39:00] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[00:39:10] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:42:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:43:31] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:43:41] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:46:30] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:47:40] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:50:19] (CR) Alex Monk: "(testing the gerrit irc bot)" [dns] - https://gerrit.wikimedia.org/r/284491 (https://phabricator.wikimedia.org/T115491) (owner: Andrew Bogott)
[00:50:22] mutante, ^ works
[00:50:51] thank you!
[00:51:03] i just noticed that it moved to kubernetes
[00:51:09] and wasnt like before
[00:51:21] it moved a while ago
[00:51:38] took a while before we non-ops/tools admins got back control of it
[00:52:06] ah
[00:52:10] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[00:53:20] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[00:58:42] (PS1) Dzahn: admin: add eventlogging-admins on hafnium [puppet] - https://gerrit.wikimedia.org/r/298120 (https://phabricator.wikimedia.org/T139202)
[01:04:43] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:22:15] Operations, Fundraising-Backlog, fundraising-tech-ops: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2443660 (dpatrick) Since this allows us to gather the data mentioned at T94052#1171831 and elsewhere in that ticket, and it provides FR with in...
[01:27:39] (PS2) Dzahn: (WIP) make gerrit compatible with Apache 2.4 [puppet] - https://gerrit.wikimedia.org/r/298041
[02:04:57] Operations, Ops-Access-Requests, Patch-For-Review: access: eventlogging-admins -> hafnium - https://phabricator.wikimedia.org/T139202#2443663 (Dzahn) a:MoritzMuehlenhoff>Dzahn
[02:05:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:10:24] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[02:11:43] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 4.836 second response time
[02:23:04] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.032 second response time
[02:23:47] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.9) (duration: 08m 21s)
[02:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:53] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jul 9 02:29:53 UTC 2016 (duration 6m 7s)
[02:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:08:54] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:11:04] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[03:36:04] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.712 second response time
[03:45:04] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 301 bytes in 0.218 second response time
[03:56:24] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 5.953 second response time
[04:05:44] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.882 second response time
[04:33:03] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.281 second response time
[04:48:55] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 13.635 second response time
[05:25:14] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.324 second response time
[05:34:47] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.060 second response time
[05:37:47] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:39:58] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[05:43:28] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.639 second response time
[05:50:18] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.176 second response time
[06:03:13] PROBLEM - puppet last run on mw2061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:16:22] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: puppet fail
[06:16:23] (PS2) Dzahn: admin: add eventlogging-admins on hafnium [puppet] - https://gerrit.wikimedia.org/r/298120 (https://phabricator.wikimedia.org/T139202)
[06:31:33] RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:31:43] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: puppet fail
[06:31:53] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:37] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:07] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:36] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:56] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:43:46] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:27] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:57:06] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:57:26] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:36] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:16] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:17] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:26] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:37] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:06] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:09:40] Operations, Wikimedia-Apache-configuration, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2443969 (Joe) >>! In T73487#2442995, @elukey wrote: > I found a repro (If-Mod...
[09:25:01] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:27:20] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[09:32:40] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:34:59] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[10:02:58] (CR) Merlijn van Deen: [C: -1] "We should install the package for both Trusty and Jessie." [puppet] - https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: Dereckson)
[10:39:45] (PS3) Dereckson: Install arcanist in toollabs::dev_environ [puppet] - https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738)
[10:44:58] Hello, there is a denial of service attack against gerrit that allows a single attacker with only 1 connection to fill the server RAM. The only thing required is to allow cloning without authentication, which is the case for wikimedia.
[10:45:46] Please note building a python exploit can be done in less than a hundred lines of code without any third party library.
[10:46:20] Here it is : https://bugs.eclipse.org/bugs/show_bug.cgi?id=497604
[10:47:33] I initially tried to deal with Google vrp and eclipse for it. But due to the lack of response
[10:47:50] the bug is public without any fixes available
[10:49:07] (PS1) Ladsgroup: Add yubikey ssh key for ladsgroup [puppet] - https://gerrit.wikimedia.org/r/298130
[10:49:22] ytrezq: i've filed a ticket in our private security area
[10:50:32] (CR) Merlijn van Deen: [C: 1] Install arcanist in toollabs::dev_environ [puppet] - https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: Dereckson)
[10:53:24] (CR) Dereckson: [C: -1] "phabricator::arcanist triggers only the installation of the arcanist package currently." [puppet] - https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: Dereckson)
[10:53:42] p858snake: consider than attack can come at any time
[10:54:08] p858snake: please consider that an than attack can come at any time
[10:59:29] (CR) Nemo bis: [C: 1] Add Cape Verdean Creole (kea) as extra language for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/297556 (https://phabricator.wikimedia.org/T127435) (owner: Thiemo Mättig (WMDE))
[11:03:46] (CR) Nemo bis: Add Cape Verdean Creole (kea) as extra language for wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/297556 (https://phabricator.wikimedia.org/T127435) (owner: Thiemo Mättig (WMDE))
[11:06:47] PROBLEM - puppet last run on mw2214 is CRITICAL: CRITICAL: puppet fail
[11:09:48] p858snake: can you cc me on that ticket?
[11:10:47] Operations, Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#2444213 (Nemo_bis) So, can we set up this proxy in Europe please? If it's too hard to do in a WMF datacentre, can you direct me to the most appropriate way to get...
[11:11:21] zhuyifei1999_: why?
[11:11:40] to see if it gets fixed
[11:11:44] Operations, Datasets-General-or-Unknown, netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#2444215 (Nemo_bis) Yes AFAICT, from a quick check. What makes you think it wouldn't be? Ops, if some specific testing is (still) needed please s...
[11:13:33] if you refuse, then whatever
[11:22:29] i'm not 100% comfortable adding users I don't know to private tickets, when its fixed the report will be made public
[11:22:56] or merged/linked to other relevant tickets if they exist
[11:27:35] zhuyifei1999_: the report is public
[11:27:53] zhuyifei1999_: the report is public https://bugs.eclipse.org/bugs/show_bug.cgi?id=497604
[11:28:10] zhuyifei1999_: and no fix is available
[11:28:44] no not that one, (and honestly I don't really care about eclipse as I'm not a user of it)
[11:29:04] I was saying about the report of wikimedia gerrit
[11:31:46] RECOVERY - puppet last run on mw2214 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[11:35:55] Yeah, they're just a client, zhuyifei1999_.
[11:36:07] Operations, DBA, Wikimedia-Etherpad, Patch-For-Review, User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2444241 (Nemo_bis) In retrospect, it would have been nice to make a list of etherpad titles which existed in etherpad-restore but not etherpad. (Given it wa...
[11:36:12] https://en.wikipedia.org/wiki/Gerrit_(software)#Notable_users
[11:36:41] k
[11:44:54] Operations, DBA, Wikimedia-Etherpad, Patch-For-Review, User-notice: etherpad database issues - https://phabricator.wikimedia.org/T138516#2444243 (jcrespo) > it would have been nice to make a list of etherpad titles Have you ever looked at etherpad MySQL schema? Hint: It has one single table...
[12:08:09] (PS1) Dereckson: Fix Compact Language Links availability issue [mediawiki-config] - https://gerrit.wikimedia.org/r/298133 (https://phabricator.wikimedia.org/T138524)
[12:09:34] (PS2) Dereckson: Fix Compact Language Links availability issue [mediawiki-config] - https://gerrit.wikimedia.org/r/298133 (https://phabricator.wikimedia.org/T138524)
[12:11:15] (CR) Dereckson: "This needs feedback from ULS maintainer. Ideally from KartikMistry who prepared the original e0083e41 commit." [mediawiki-config] - https://gerrit.wikimedia.org/r/298133 (https://phabricator.wikimedia.org/T138524) (owner: Dereckson)
[12:28:37] (CR) Nikerabbit: "It currently works as intended: T136677" [mediawiki-config] - https://gerrit.wikimedia.org/r/298133 (https://phabricator.wikimedia.org/T138524) (owner: Dereckson)
[12:41:48] (CR) Dereckson: "So we need instead to fix configuration comments." [mediawiki-config] - https://gerrit.wikimedia.org/r/298133 (https://phabricator.wikimedia.org/T138524) (owner: Dereckson)
[12:43:44] (Abandoned) Dereckson: Fix Compact Language Links availability issue [mediawiki-config] - https://gerrit.wikimedia.org/r/298133 (https://phabricator.wikimedia.org/T138524) (owner: Dereckson)
[12:45:18] Operations, Performance-Team, Wikimedia-General-or-Unknown: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#2444301 (aaron) a:aaron>None
[12:49:12] Operations, Cassandra, Services: Remove obsolete metrics - https://phabricator.wikimedia.org/T139792#2444303 (Eevans) I just discovered the "Hide series with only nulls" option in Grafana, making this much less of an issue.
[12:49:18] (PS1) Dereckson: State Compact Language Links isn't beta anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/298134
[12:49:20] Operations, Cassandra, Services: Remove obsolete metrics - https://phabricator.wikimedia.org/T139792#2444304 (Eevans) p:Triage>Low
[12:52:03] (PS2) Dereckson: State Compact Language Links isn't beta anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/298134 (https://phabricator.wikimedia.org/T136677)
[13:16:22] (CR) KartikMistry: [C: 1] State Compact Language Links isn't beta anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/298134 (https://phabricator.wikimedia.org/T136677) (owner: Dereckson)
[13:25:16] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:27:35] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[13:47:58] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: puppet fail
[14:15:36] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:31:38] (PS2) BryanDavis: Fix de_dot to process keys with falsey values [software/logstash/plugins] - https://gerrit.wikimedia.org/r/298115 (https://phabricator.wikimedia.org/T136001)
[14:37:00] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:39:10] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[15:13:30] (PS3) BryanDavis: Fix de_dot to process keys with falsey values [software/logstash/plugins] - https://gerrit.wikimedia.org/r/298115 (https://phabricator.wikimedia.org/T136001)
[15:22:30] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: puppet fail
[15:43:59] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.778 second response time
[15:49:49] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[15:50:58] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.239 second response time
[15:54:16] (CR) BryanDavis: [V: 1] "It took a couple of tries to get right, but this seems to be working now as tested in the beta cluster. Before and after tests done via ev" [software/logstash/plugins] - https://gerrit.wikimedia.org/r/298115 (https://phabricator.wikimedia.org/T136001) (owner: BryanDavis)
[15:55:19] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: puppet fail
[15:58:25] Operations, Discovery, Wikimedia-Logstash, Discovery-Search-Sprint, and 2 others: [EPIC] Upgrade elasticsearch cluster supporting logging to 2.3 - https://phabricator.wikimedia.org/T136001#2444475 (bd808) De-dot fix test results: ``` $ mwscript eval.php enwiki > wfDebugLog('redis', 'de-dot test',...
[16:02:57] Operations, Commons, media-storage, User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2444476 (Menner) @MoritzMuehlenhoff: Please consider remarks about Pango in T40010 as well. To me they sound more plausible.
[16:22:39] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[17:04:58] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:07:07] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[17:16:28] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 3.938 second response time
[17:22:58] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: puppet fail
[17:23:19] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 185 bytes in 11.096 second response time
[17:48:47] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[19:36:11] (PS4) BryanDavis: Fix de_dot to process keys with falsey values [software/logstash/plugins] - https://gerrit.wikimedia.org/r/298115 (https://phabricator.wikimedia.org/T136001)
[19:37:39] (CR) EBernhardson: [C: 1] "Looks like I don't have +2 here, but we should ship this before next days index is created" [software/logstash/plugins] - https://gerrit.wikimedia.org/r/298115 (https://phabricator.wikimedia.org/T136001) (owner: BryanDavis)
[19:42:27] (CR) BryanDavis: [C: 2 V: 1] "Retested PS 4 on beta successfully." [software/logstash/plugins] - https://gerrit.wikimedia.org/r/298115 (https://phabricator.wikimedia.org/T136001) (owner: BryanDavis)
[19:43:43] (CR) BryanDavis: [V: 2] "Forgot that there is no zuul integration here." [software/logstash/plugins] - https://gerrit.wikimedia.org/r/298115 (https://phabricator.wikimedia.org/T136001) (owner: BryanDavis)
[19:46:41] !log Updated logstash/plugins to 18b3f1f (Fix de_dot to process keys with falsey values)
[19:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:46:55] (CR) Dereckson: [C: 1] Update logo settings for the Nepali Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/297134 (https://phabricator.wikimedia.org/T139240) (owner: Odder)
[19:50:37] !log restarted logstash on logstash1001 for de-dot plugin update (T136001)
[19:50:38] T136001: [EPIC] Upgrade elasticsearch cluster supporting logging to 2.3 - https://phabricator.wikimedia.org/T136001
[19:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:52:16] !log restarted logstash on logstash1002 for de-dot plugin update (T136001)
[19:52:17] T136001: [EPIC] Upgrade elasticsearch cluster supporting logging to 2.3 - https://phabricator.wikimedia.org/T136001
[19:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:54:10] !log restarted logstash on logstash1003 for de-dot plugin update (T136001)
[19:54:11] T136001: [EPIC] Upgrade elasticsearch cluster supporting logging to 2.3 - https://phabricator.wikimedia.org/T136001
[19:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:54:35] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:56:55] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[20:09:14] Operations, Wikimedia-Site-requests, Community-consensus-needed, Patch-For-Review: Add the Kartographer extension to Metawiki - https://phabricator.wikimedia.org/T139787#2444674 (Dereckson) [ Reassigning to Gehel per Yurik comment. @Gehel: please reassign this to @Tpt if the request can be proces...
[20:16:03] Operations, Wikimedia-Site-requests, Community-consensus-needed, Patch-For-Review: Add the Kartographer extension to Metawiki - https://phabricator.wikimedia.org/T139787#2444679 (Urbanecm) @Dereckson Ehm, this is still assigned to @Tpt. Is this a mistake?
[20:16:50] Operations, Wikimedia-Site-requests, Community-consensus-needed, Patch-For-Review: Add the Kartographer extension to Metawiki - https://phabricator.wikimedia.org/T139787#2444684 (Dereckson) a:Tpt>Gehel Fixed.
[20:17:16] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 21.114 second response time
[20:52:10] Blocked-on-Operations, Operations, Wikimedia-Site-requests, Community-consensus-needed, Patch-For-Review: Add the Kartographer extension to Metawiki - https://phabricator.wikimedia.org/T139787#2444738 (Urbanecm) Thanks.
[21:44:13] (CR) Paladox: [C: -1] "I tested this but I get a syntax error" [puppet] - https://gerrit.wikimedia.org/r/298041 (owner: Dzahn)
[21:49:59] (CR) Paladox: "negative Require directive has no effect in directive" [puppet] - https://gerrit.wikimedia.org/r/298041 (owner: Dzahn)
[21:58:05] (CR) Paladox: (WIP) make gerrit compatible with Apache 2.4 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/298041 (owner: Dzahn)