[00:05:58] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[00:06:08] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:09:53] ACKNOWLEDGEMENT - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn debugging in progress
[00:09:53] ACKNOWLEDGEMENT - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn debugging in progress
[00:14:07] Operations: Copper root (/) 95% full - https://phabricator.wikimedia.org/T172409#3620089 (Dzahn) Resolved>Open
[00:14:18] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[00:14:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:15:07] Operations: Copper root (/) 95% full - https://phabricator.wikimedia.org/T172409#3497614 (Dzahn) Found this ticket from the comment on Icinga when i noticed we have this alert now: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=copper&service=Disk+space DISK CRITICAL - /var/lib/docker/...
[00:16:36] Operations: Copper root (/) 95% full - https://phabricator.wikimedia.org/T172409#3620105 (Dzahn) So it's not actually about disk space now, but it's about how to deal with the false positive from Icinga due to permissions. The disk check should exclude these mounts.
[00:37:53] RoanKattouw, just had a timeout in my contributions
[00:38:09] Just your normal contribs, or did you use filters?
[00:38:22] Also we didn't add fancy filters or UI to the contribs page :/
[00:38:33] normal contribs didn't touch anything
[00:38:38] weird
[00:39:28] RC query errors still continue, BTW
[00:41:01] What RC query errors?
[00:41:42] timeouts
[00:41:54] On normal RC or with filters?
[00:42:17] in logs :)
[00:42:18] Is there a phab task for the recent 503 errors that result in the "Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes." message.
[00:42:34] Example: Request from via cp4016 cp4016, Varnish XID 427983032
[00:42:34] Error: 503, Backend fetch failed at Wed, 20 Sep 2017 00:38:34 GMT
[00:42:40] that's what I was reporting
[00:42:41] Maybe we're having a DB issue?
[00:43:20] 17:14:40 <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:43:27] thcipriani: Did your SWAT changes cause this perhaps? ---^^
[00:43:51] Let's look at logstash to see what all the 5xxs are about
[00:44:14] It started before today, but it seems to have increased in frequency.
[00:44:54] (PS4) Smalyshev: Generate daily diffs for categories RDF [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774)
[00:45:15] (CR) jerkins-bot: [V: -1] Generate daily diffs for categories RDF [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) (owner: Smalyshev)
[00:45:17] hmm, someone's scraping us
[00:45:44] hah, are they?
[00:47:26] (PS5) Smalyshev: Generate daily diffs for categories RDF [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774)
[00:51:39] ACKNOWLEDGEMENT - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn T167299
[00:54:48] !log cp4016 - backend restart, mailbox lag, text
[00:54:59] log cp1066 - backend restart, mailbox lag, text
[00:55:02] !log cp1066 - backend restart, mailbox lag, text
[00:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:03] JJMC89: https://phabricator.wikimedia.org/T175803
[00:58:44] (but we've already pushed a couple of related fixups in the past 24h, one of which may take a few days of runtime before it has real effects. it's also possible that this evening's 503s are externally-induced but confusing us into thinking they're related)
[01:00:39] Thanks bblack. Is there anything ops needs from those encountering the 503s, or should we just ride it out?
[01:01:10] not really at this point, except to wake up someone who might wake up someone who might wake up someone who can look at them :)
[01:02:25] * MaxSem holds bblack's number tight
[01:03:18] RECOVERY - Check Varnish expiry mailbox lag on cp1066 is OK: OK: expiry mailbox lag is 0
[01:06:58] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:08:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:29:10] Operations, Fundraising-Backlog, fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3620193 (Jgreen) nginx and memcache metric collectors are done
[01:39:29] Operations, Fundraising-Backlog, fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3620235 (Jgreen) @fgiunchedi as of now we're done with ganglia! I shut down the aggregators yesterday. I'm going to leave this task open while we finish some minor cle...
[01:40:41] Operations, Fundraising-Backlog, fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3620237 (Jgreen) p:Normal>Low
[02:31:53] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.18) (duration: 08m 06s)
[02:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:54] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.19) (duration: 15m 57s)
[03:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Sep 20 03:16:25 UTC 2017 (duration 7m 31s)
[03:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:38] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:13:28] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 78179 bytes in 0.145 second response time
[05:51:38] RECOVERY - Wikitech and wt-static content in sync on labtestweb2001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (52475 200000s)
[05:53:29] RECOVERY - Wikitech and wt-static content in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (52475 200000s)
[05:55:27] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Fixes to the build script: [docker-images/production-images] - https://gerrit.wikimedia.org/r/378952 (owner: Giuseppe Lavagetto)
[05:55:37] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Makefile: make "clean" fault-tolerant [docker-images/production-images] - https://gerrit.wikimedia.org/r/378953 (owner: Giuseppe Lavagetto)
[06:09:48] RECOVERY - Disk space on copper is OK: DISK OK
[06:51:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:51:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[06:52:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:53:28] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200): /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/feed/featur
[06:53:28] (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[06:55:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0]
[06:56:18] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[06:57:34] (CR) Alexandros Kosiaris: [C: 1] base: send syslog over TLS [puppet] - https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[06:59:44] (CR) Alexandros Kosiaris: [C: 1] rsyslog: add support to receive syslog over TLS [puppet] - https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[07:00:27] (CR) Alexandros Kosiaris: [C: -1] "-1 after all, I think there's a typo in the port declaration" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[07:04:44] (CR) Alexandros Kosiaris: [C: 1] Adding a list of bogus ifNames [puppet] - https://gerrit.wikimedia.org/r/379122 (owner: Ayounsi)
[07:06:26] (CR) Alexandros Kosiaris: [C: 1] Whitelist wtp10[25-48] for Linter [mediawiki-config] - https://gerrit.wikimedia.org/r/378073 (https://phabricator.wikimedia.org/T165520) (owner: Legoktm)
[07:06:38] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200)
[07:07:38] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[07:11:31] !log cp3030 varnish-be restart due to 503s
[07:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:49] (PS1) Giuseppe Lavagetto: Add apt_options to apt-get update as well [docker-images/production-images] - https://gerrit.wikimedia.org/r/379173
[07:13:25] ema: tyty, its odd, the spikes have been recently caused by cp1053 more than the others
[07:15:29] TheresNoTime: so our current theory is that the spikes affecting cp1053 were mostly due to a change of ours which increases the objects keep time, essentially how long they're kept in cache after expiration for conditional requests (If-Modified-Since and such)
[07:16:15] we've reverted that change, but it takes a while till we can see the full effect of it (keep time has to expire basically!)
[07:16:30] Well fingers crossed! and cp3030?
[07:17:17] also, yesterday we changed in a fundamental way the way objects are cached, mostly for what concerns eqiad
[07:17:53] which seemed like a huge win on many aspects
[07:18:36] also, during yesterday's EU evening, some unexpected issues started to surface in ulsfo (cp4*), but we thought they must have been unrelated to our changes
[07:18:59] now todays cp3030's issues really are puzzling
[07:22:48] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:23:33] ema: was it still cp3030's expiry mailbox lag? The check in Icinga has only been OK 9mins
[07:24:36] (PS1) Giuseppe Lavagetto: profile::docker::builder: add proxy settings to build config [puppet] - https://gerrit.wikimedia.org/r/379175
[07:24:38] (PS1) Giuseppe Lavagetto: profile::docker::builder: add build script for production-images [puppet] - https://gerrit.wikimedia.org/r/379176
[07:24:55] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add apt_options to apt-get update as well [docker-images/production-images] - https://gerrit.wikimedia.org/r/379173 (owner: Giuseppe Lavagetto)
[07:25:10] (CR) jerkins-bot: [V: -1] profile::docker::builder: add proxy settings to build config [puppet] - https://gerrit.wikimedia.org/r/379175 (owner: Giuseppe Lavagetto)
[07:25:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:27:09] TheresNoTime: the type of behavior we've started seeing yesterday also does induce a quick growth in mailbox lag, but it's so sudden that the icinga check doesn't catch it (and indeed the 503s begin *before* the lag)
[07:27:37] see https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3030&var-datasource=esams%20prometheus%2Fops&from=1505868122432&to=1505892299916
[07:27:56] (PS2) Giuseppe Lavagetto: profile::docker::builder: add proxy settings to build config [puppet] - https://gerrit.wikimedia.org/r/379175
[07:27:58] (PS2) Giuseppe Lavagetto: profile::docker::builder: add build script for production-images [puppet] - https://gerrit.wikimedia.org/r/379176
[07:29:03] the "Fetch failed" graph basically equals 503
[07:29:18] (PS2) Muehlenhoff: Remove deployment::redis [puppet] - https://gerrit.wikimedia.org/r/378925
[07:30:23] (PS1) Muehlenhoff: Update contact for one NDA user [puppet] - https://gerrit.wikimedia.org/r/379177
[07:31:06] (CR) Muehlenhoff: [C: 2] Remove deployment::redis [puppet] - https://gerrit.wikimedia.org/r/378925 (owner: Muehlenhoff)
[07:31:28] (PS2) Muehlenhoff: Update contact for one NDA user [puppet] - https://gerrit.wikimedia.org/r/379177
[07:31:55] Operations, Maps-Sprint, Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3620405 (Gehel) @Pnorman: not sure what you mean by >>! In T171707#3619951, @Pnorman wrote: > @Gehel, can you make maps-test2004 the same as one of the other b...
[07:32:23] (CR) Muehlenhoff: [C: 2] Update contact for one NDA user [puppet] - https://gerrit.wikimedia.org/r/379177 (owner: Muehlenhoff)
[07:33:04] (PS2) Muehlenhoff: Drop Hiera setting for trebuchet server in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/378917
[07:34:14] ema: jeez :/ interesting to see what else is affected during the 503s though
[07:34:18] (CR) Muehlenhoff: [C: 2] Drop Hiera setting for trebuchet server in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/378917 (owner: Muehlenhoff)
[07:35:05] Operations, Maps-Sprint, Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3620410 (Gehel) >>! In T171707#3619901, @Yurik wrote: > So it seems the sources & variables file specified in the /etc/tilerator/config.yaml has incorrectly spec...
[07:37:19] (PS3) Muehlenhoff: Remove hack configuring eventlogging/eventlogging to use trebuchet [puppet] - https://gerrit.wikimedia.org/r/378964
[07:38:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[07:39:38] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[07:40:37] the majority seems to come from cp3032, but it is not yet recovered
[07:41:00] ah mr Ema already on top of it <3
[07:41:09] didn't see the entries :)
[07:42:15] (CR) Muehlenhoff: [C: 2] Remove hack configuring eventlogging/eventlogging to use trebuchet [puppet] - https://gerrit.wikimedia.org/r/378964 (owner: Muehlenhoff)
[07:42:48] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200)
[07:43:13] this is a side effect --^
[07:43:31] (CR) Alexandros Kosiaris: [C: 1] profile::docker::builder: add build script for production-images [puppet] - https://gerrit.wikimedia.org/r/379176 (owner: Giuseppe Lavagetto)
[07:43:38] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[07:43:54] (CR) Alexandros Kosiaris: [C: 1] profile::docker::builder: add proxy settings to build config [puppet] - https://gerrit.wikimedia.org/r/379175 (owner: Giuseppe Lavagetto)
[07:43:56] (CR) Muehlenhoff: "It certainly makes sense to add this to scap and to deprecate the standalone git-fat package, but let's disentangle this from the Trebuche" [puppet] - https://gerrit.wikimedia.org/r/378931 (owner: Muehlenhoff)
[07:44:15] so cp3032's mailbox lag is skyrocketing
[07:44:20] ema --^
[07:44:42] elukey: yep, we've seen a similar behavior yesterday night in ulsfo, but thought it was due to de/repools
[07:44:54] I wouldn't restart the backends yet
[07:45:34] I was trying to figure out whether some specific traffic pattern happening today might have triggered this
[07:45:48] in particular, POSTs to /w/api.php
[07:47:28] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag not a slave
[07:47:57] cp3041 joined the party
[07:48:35] this doesn't look good at all
[07:50:48] yesterday, new filters to recentchanges (or watchlists) were added
[07:51:01] (PS2) Muehlenhoff: Add git-fat to standard packages [puppet] - https://gerrit.wikimedia.org/r/378931
[07:54:58] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200)
[07:55:58] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[07:58:33] (CR) Muehlenhoff: [C: 2] Add git-fat to standard packages [puppet] - https://gerrit.wikimedia.org/r/378931 (owner: Muehlenhoff)
[07:58:40] so the `Restbase edge` thing is "new" eh?
[07:59:23] side effect of the issue :(
[07:59:28] Request from 80.176.129.180 via cp3041 cp3041, Varnish XID 747470851
[07:59:29] Error: 503, Backend fetch failed at Wed, 20 Sep 2017 07:57:32 GMT
[07:59:36] Getting MANY of these this morning
[07:59:45] But I am understanding it's a KNOWN issue.
[08:01:03] (PS1) Ema: Depool text esams [dns] - https://gerrit.wikimedia.org/r/379178
[08:01:08] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200)
[08:01:57] (CR) Giuseppe Lavagetto: [C: 1] Depool text esams [dns] - https://gerrit.wikimedia.org/r/379178 (owner: Ema)
[08:02:08] (PS2) Ema: Depool text esams [dns] - https://gerrit.wikimedia.org/r/379178
[08:02:09] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[08:02:16] (CR) Ema: [V: 2 C: 2] Depool text esams [dns] - https://gerrit.wikimedia.org/r/379178 (owner: Ema)
[08:02:21] <_joe_> ShakespeareFan00: yeah we're working on it
[08:05:18] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200): /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 ret
[08:05:18] d status 503 (expecting: 200): /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 503 (expecting: 200): /api/rest_v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 503 (expecting: 200)
[08:05:53] coffee & cookies for ema
[08:07:18] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[08:09:04] _joe_: Anny idea on when thigns will be stable?
[08:09:28] <_joe_> ShakespeareFan00: in a few minutes you should be routed through a different datacenter
[08:10:14] for me i got it on cp3043
[08:10:18] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3032 is CRITICAL: HTTP CRITICAL - No data received from host
[08:11:19] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3032 is OK: HTTP OK: HTTP/1.1 200 OK - 180 bytes in 0.168 second response time
[08:11:37] <_joe_> I'm already routed to eqiad FWIW
[08:11:50] <_joe_> most of you should be too
[08:12:02] seems so
[08:12:28] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3043 is CRITICAL: HTTP CRITICAL - No data received from host
[08:14:28] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3043 is OK: HTTP OK: HTTP/1.1 200 OK - 180 bytes in 0.168 second response time
[08:19:38] PROBLEM - Check Varnish expiry mailbox lag on cp1054 is CRITICAL: CRITICAL: expiry mailbox lag is 2240267
[08:23:42] not sure if T175803 is still the most relevant task here, but if so, I've pointed the concerned enwp folks towards it on the Village Pump
[08:23:42] T175803: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803
[08:23:53] (PS1) Ema: Revert "Depool text esams" [dns] - https://gerrit.wikimedia.org/r/379179
[08:24:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:24:28] (PS1) Muehlenhoff: Drop trebuchet::packages [puppet] - https://gerrit.wikimedia.org/r/379180
[08:24:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:26:15] (PS1) Ema: Revert "VCL: stabilize backend storage patterns" [puppet] - https://gerrit.wikimedia.org/r/379181
[08:28:38] PROBLEM - Check Varnish expiry mailbox lag on cp1065 is CRITICAL: CRITICAL: expiry mailbox lag is 2007004
[08:29:18] PROBLEM - Check Varnish expiry mailbox lag on cp1068 is CRITICAL: CRITICAL: expiry mailbox lag is 2208727
[08:31:17] (PS1) Alexandros Kosiaris: Add thirdparty/ci component to jessie and stretch [puppet] - https://gerrit.wikimedia.org/r/379182 (https://phabricator.wikimedia.org/T175293)
[08:32:15] (PS5) Gehel: Set elasticsearch servers to use 128kB readahead [puppet] - https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: EBernhardson)
[08:33:21] (CR) Muehlenhoff: [C: 1] Add thirdparty/ci component to jessie and stretch [puppet] - https://gerrit.wikimedia.org/r/379182 (https://phabricator.wikimedia.org/T175293) (owner: Alexandros Kosiaris)
[08:34:37] (PS2) Alexandros Kosiaris: Add thirdparty/ci component to jessie and stretch [puppet] - https://gerrit.wikimedia.org/r/379182 (https://phabricator.wikimedia.org/T175293)
[08:34:39] (PS1) Alexandros Kosiaris: Enable thirdparty/ci on role::ci::slave [puppet] - https://gerrit.wikimedia.org/r/379183 (https://phabricator.wikimedia.org/T175293)
[08:37:13] (PS1) Muehlenhoff: Remove deployment::redis ferm service [puppet] - https://gerrit.wikimedia.org/r/379185
[08:38:38] RECOVERY - Check Varnish expiry mailbox lag on cp1065 is OK: OK: expiry mailbox lag is 0
[08:39:36] (CR) Ema: [C: 2] Revert "Depool text esams" [dns] - https://gerrit.wikimedia.org/r/379179 (owner: Ema)
[08:40:36] !log restarted varnish-be on cp1065
[08:40:42] !log restarted varnish-be on cp1054
[08:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:52] !log restarted varnish-be on cp1068
[08:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:55] !log restarted varnish-be on cp1053
[08:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:24] (PS6) Gehel: Set elasticsearch servers to use 128kB readahead [puppet] - https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: EBernhardson)
[08:49:19] RECOVERY - Check Varnish expiry mailbox lag on cp1068 is OK: OK: expiry mailbox lag is 0
[08:49:38] RECOVERY - Check Varnish expiry mailbox lag on cp1054 is OK: OK: expiry mailbox lag is 0
[08:52:24] (CR) ArielGlenn: [C: 1] "This is exactly what I had in mind. I'm giving it a +1 for the moment, because I've only skimmed the code. If I see no other issues afte" [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) (owner: Smalyshev)
[08:52:27] (PS7) Gehel: Set elasticsearch servers to use 128kB readahead [puppet] - https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: EBernhardson)
[08:56:11] (PS3) Giuseppe Lavagetto: profile::docker::builder: add proxy settings to apt_config [puppet] - https://gerrit.wikimedia.org/r/379175
[08:56:40] (PS8) Gehel: Set elasticsearch servers to use 128kB readahead [puppet] - https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: EBernhardson)
[09:03:22] (CR) Gehel: "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/7938/" [puppet] - https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: EBernhardson)
[09:07:02] (PS1) Hashar: contint: deploy slave_scripts via git::clone [puppet] - https://gerrit.wikimedia.org/r/379187
[09:07:23] (PS1) Muehlenhoff: Remove role::deployment::config and the related Hiera data [puppet] - https://gerrit.wikimedia.org/r/379188
[09:07:25] (PS1) Muehlenhoff: Remove deployment::salt_master/role::deployment::salt_masters and related files [puppet] - https://gerrit.wikimedia.org/r/379189
[09:09:54] (PS2) Hashar: contint: deploy slave_scripts via git::clone [puppet] - https://gerrit.wikimedia.org/r/379187
[09:11:34] (CR) Hashar: "https://puppet-compiler.wmflabs.org/compiler02/7941/ :]" [puppet] - https://gerrit.wikimedia.org/r/379187 (owner: Hashar)
[09:11:38] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[09:12:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[09:12:46] (CR) Hashar: [C: 1] contint: deploy slave_scripts via git::clone [puppet] - https://gerrit.wikimedia.org/r/379187 (owner: Hashar)
[09:14:03] !log restarted varnish-be on cp3040
[09:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:42] Operations, Analytics, ChangeProp, EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3620607 (mobrovac) >>! In T175210#3618572, @GWicke wrote: > I honestly don't have a strong preference between the other "hearted"...
[09:19:25] !log restarted varnish-be on cp3032
[09:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:08] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200)
[09:21:08] !log restarted varnish-be on cp3041
[09:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:34] esams is sinking
[09:22:01] it is not, it just need a bit of encouragment :)
[09:22:08] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[09:22:13] *encouragement
[09:22:21] (PS1) Muehlenhoff: Remove service-restart and deploy-info [puppet] - https://gerrit.wikimedia.org/r/379191
[09:22:22] it is well-positioned to sink, geographically :)
[09:22:51] <_joe_> lol
[09:22:52] * TheresNoTime starts inflating arm bands
[09:23:14] !log restarted varnish-be on cp3043
[09:23:15] (CR) Alexandros Kosiaris: [C: 2] Add thirdparty/ci component to jessie and stretch [puppet] - https://gerrit.wikimedia.org/r/379182 (https://phabricator.wikimedia.org/T175293) (owner: Alexandros Kosiaris)
[09:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:28] (CR) Filippo Giunchedi: base: send syslog over TLS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[09:25:29] (PS1) Volans: WMCS: fix Cumin configuration [puppet] - https://gerrit.wikimedia.org/r/379192 (https://phabricator.wikimedia.org/T175712)
[09:26:19] (PS3) Filippo Giunchedi: base: send syslog over TLS [puppet] - https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312)
[09:29:07] !log restarted varnish-be on cp1052
[09:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:04] (CR) Hashar: "role::ci::slave is representing a jenkins slave on the production machines. That is merely to deploy artifacts to doc.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/379183 (https://phabricator.wikimedia.org/T175293) (owner: Alexandros Kosiaris)
[09:36:13] (CR) Volans: "Compiler results available here: https://puppet-compiler.wmflabs.org/compiler02/7942/" [puppet] - https://gerrit.wikimedia.org/r/379192 (https://phabricator.wikimedia.org/T175712) (owner: Volans)
[09:37:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:37:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:42:28] (CR) DCausse: [C: 1] Set elasticsearch servers to use 128kB readahead [puppet] - https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: EBernhardson)
[09:43:11] (CR) Hashar: [C: -1] "On beta (and integration) we still need a salt master until a replacement is setup (either cumin or cluster shell?)." [puppet] - https://gerrit.wikimedia.org/r/379189 (owner: Muehlenhoff)
[09:44:24] (CR) Alexandros Kosiaris: [C: 1] base: send syslog over TLS [puppet] - https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: Filippo Giunchedi)
[09:50:02] (CR) Hashar: [C: 1] "We used to have .htaccess files for integration.wikimedia.org and it is apparently no more the case (based on a file search on the host). " [puppet] - https://gerrit.wikimedia.org/r/378853 (owner: Elukey)
[09:50:38] hashar: o/ - any issue if we merge + test --^
[09:50:52] elukey: I think it will be fine but I can not babysit it today
[09:51:07] all right so let's do it whenever you have a bit of time
[09:51:10] elukey: at leasdt on contint1001.wikimedia.org /srv/org/wikimedia/integration/ there is no .httaccess file anymore
[09:51:16] yeah
[09:52:11] kids && lunch && etc
[09:52:19] will be back for swat
[09:53:50] (CR) Elukey: "Is it always used? I don't find it on netmon1002:" [puppet] - https://gerrit.wikimedia.org/r/378858 (owner: Elukey)
[09:55:27] !log restarted varnish-be on cp3042
[09:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:49] (PS1) Alexandros Kosiaris: Rename role::test::system to role::test [puppet] - https://gerrit.wikimedia.org/r/379195
[09:58:38] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/296af4efe8f6f57432c8905b9d09a558eca6ba4214f40b1183f1d7a794976745/shm is not accessible: Permission denied
[10:01:28] Operations, Fundraising-Backlog, fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3620676 (fgiunchedi) @Jgreen that's awesome news! I think we can finally shut down ganglia for good !
[10:08:51] (CR) Muehlenhoff: "Yeah, but this change only removes role::deployment:salt_masters, not role::salt::masters::labs. Also, Cumin support for WMCS is currently" [puppet] - https://gerrit.wikimedia.org/r/379189 (owner: Muehlenhoff)
[10:09:52] (PS1) ArielGlenn: Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528)
[10:10:56] (PS18) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902)
[10:12:30] Operations: Restarts of ganglia-monitor are unreliable - https://phabricator.wikimedia.org/T135723#3620695 (fgiunchedi) Open>declined Ganglia is indeed going away
[10:12:32] Operations, Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#3620697 (fgiunchedi)
[10:13:23] (CR) Phedenskog: "With the navtiming change out we could push this right? @krinkle @gilles do you have anything else we should fix? Really looking forward t" [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog)
[10:14:29] (CR) Muehlenhoff: [C: 1] WMCS: fix Cumin configuration [puppet] - https://gerrit.wikimedia.org/r/379192 (https://phabricator.wikimedia.org/T175712) (owner: Volans)
[10:14:43] (PS2) ArielGlenn: Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528)
[10:15:21] (PS3) Muehlenhoff: contint: deploy slave_scripts via git::clone [puppet] - https://gerrit.wikimedia.org/r/379187 (owner: Hashar)
[10:19:29] (CR) Muehlenhoff: [C: 2] contint: deploy slave_scripts via git::clone [puppet] - https://gerrit.wikimedia.org/r/379187 (owner: Hashar)
[10:19:57] Operations, CirrusSearch, Discovery, Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3620709 (thiemowmde)
[10:20:52] (PS2) Volans: WMCS: fix Cumin configuration [puppet] - https://gerrit.wikimedia.org/r/379192 (https://phabricator.wikimedia.org/T175712)
[10:23:58] (CR) Volans: [C: 2] WMCS: fix Cumin configuration [puppet] - https://gerrit.wikimedia.org/r/379192 (https://phabricator.wikimedia.org/T175712) (owner: Volans)
[10:24:30] (PS1) BBlack: VCL: do not randomize backends for MISS2PASS traffic [puppet] - https://gerrit.wikimedia.org/r/379199
[10:26:28] (PS3) ArielGlenn: Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528)
[10:26:43] (PS4) ArielGlenn: Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528)
[10:33:46] Operations, Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3620729 (akosiaris) >>! In T165520#3618530, @Arlolra wrote: > There're a couple problems with this change, > https://gerrit.wikimedia.org/r/#/c/377966/ > > `/etc/dsh/group/parsoid` contains `ruthenium....
[10:35:22] (PS1) Alexandros Kosiaris: Remove ruthenium from scap::dsh::groups::parsoid [puppet] - https://gerrit.wikimedia.org/r/379200 (https://phabricator.wikimedia.org/T165520)
[10:41:31] (PS2) BBlack: VCL: do not randomize backends for MISS2PASS traffic [puppet] - https://gerrit.wikimedia.org/r/379199
[10:42:13] (PS2) Alexandros Kosiaris: Rename role::test::system to role::test [puppet] - https://gerrit.wikimedia.org/r/379195
[10:42:21] (CR) Alexandros Kosiaris: [V: 2 C: 2] Rename role::test::system to role::test [puppet] - https://gerrit.wikimedia.org/r/379195 (owner: Alexandros Kosiaris)
[10:49:44] (CR) Ema: [C: 1] VCL: do not randomize backends for MISS2PASS traffic [puppet] - https://gerrit.wikimedia.org/r/379199 (owner: BBlack)
[10:55:21] (PS1) Jcrespo: mariadb: update s2 pager for contributions replicas [software] - https://gerrit.wikimedia.org/r/379206
[10:55:29] (PS3) BBlack: VCL: do not randomize backends for MISS2PASS traffic [puppet] - https://gerrit.wikimedia.org/r/379199
[10:56:19] (CR) Ema: [C: 1] VCL: do not randomize backends for MISS2PASS traffic [puppet] - https://gerrit.wikimedia.org/r/379199 (owner: BBlack)
[10:56:50] (CR) BBlack: [C: 2] VCL: do not randomize backends for MISS2PASS traffic [puppet] - https://gerrit.wikimedia.org/r/379199 (owner: BBlack)
[10:56:56] (PS4) BBlack: VCL: do not randomize backends for MISS2PASS traffic [puppet] - https://gerrit.wikimedia.org/r/379199
[10:57:00] (CR) BBlack: [V: 2 C: 2] VCL: do not randomize backends for MISS2PASS traffic [puppet] - https://gerrit.wikimedia.org/r/379199 (owner: BBlack)
[11:00:31] !log mobrovac@tin Started deploy
[cassandra/metrics-collector@d0169ee]: (no justification provided) [11:00:35] !log mobrovac@tin Finished deploy [cassandra/metrics-collector@d0169ee]: (no justification provided) (duration: 00m 04s) [11:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:18] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [11:01:31] !log mobrovac@tin Started deploy [cassandra/metrics-collector@d0169ee]: (no justification provided) [11:01:35] !log mobrovac@tin Finished deploy [cassandra/metrics-collector@d0169ee]: (no justification provided) (duration: 00m 03s) [11:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:50] !log mobrovac@tin Started deploy [cassandra/metrics-collector@d0169ee]: (no justification provided) [11:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:54] !log mobrovac@tin Finished deploy [cassandra/metrics-collector@d0169ee]: (no justification provided) (duration: 00m 04s) [11:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:09] RECOVERY - Check systemd state on restbase1008 is OK: OK - running: The system is fully operational [11:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:18] !log restarted varnish-be on cp4008 [11:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:38] !log mobrovac@tin Started deploy [cassandra/metrics-collector@d0169ee]: (no justification provided) [11:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:57] !log mobrovac@tin Finished deploy [cassandra/metrics-collector@d0169ee]: (no justification provided) (duration: 00m 19s) [11:02:58] RECOVERY - Check systemd state on restbase1010 is OK: OK - running: The system is fully operational 
[11:02:58] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [11:03:03] !log rolling out new debdeploy release across the fleet [11:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:37] (03CR) 10Jcrespo: [C: 032] mariadb: update s2 pager for contributions replicas [software] - 10https://gerrit.wikimedia.org/r/379206 (owner: 10Jcrespo) [11:03:58] RECOVERY - Check systemd state on restbase2005 is OK: OK - running: The system is fully operational [11:24:25] 10Operations, 10DBA, 10Patch-For-Review: decommission db1018 - https://phabricator.wikimedia.org/T176215#3620825 (10jcrespo) [11:24:28] 10Operations, 10DBA, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3620824 (10jcrespo) [11:24:34] (03PS1) 10Volans: WMCS: fine tuning Cumin configuration for proxies [puppet] - 10https://gerrit.wikimedia.org/r/379209 (https://phabricator.wikimedia.org/T175712) [11:24:40] 10Operations, 10DBA, 10Patch-For-Review: decommission db1018 - https://phabricator.wikimedia.org/T176215#3617573 (10jcrespo) [11:24:42] 10Operations, 10DBA, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3620828 (10jcrespo) [11:24:58] (03CR) 10jerkins-bot: [V: 04-1] WMCS: fine tuning Cumin configuration for proxies [puppet] - 10https://gerrit.wikimedia.org/r/379209 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [11:25:34] !log mobrovac@tin Started deploy [restbase/deploy@dea2b41]: (no justification provided) [11:25:36] !log mobrovac@tin Finished deploy [restbase/deploy@dea2b41]: (no justification provided) (duration: 00m 02s) [11:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:38] _joe_: Hey, the jobqueue seems steady for a while: https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=now-24h&to=now or we need to something else? [11:27:05] 10Operations, 10DBA: decommission db1036 - https://phabricator.wikimedia.org/T176311#3620829 (10jcrespo) [11:27:13] now the jobs are small and fast (but numerous) [11:28:12] (03PS1) 10Jcrespo: mariadb: Depool db1101 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379212 (https://phabricator.wikimedia.org/T176311) [11:28:46] (03PS2) 10Volans: WMCS: fine tuning Cumin configuration for proxies [puppet] - 10https://gerrit.wikimedia.org/r/379209 (https://phabricator.wikimedia.org/T175712) [11:29:14] <_joe_> Amir1: yeah it's not really steady, but I'm off to lunch now [11:30:25] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3620853 (10daniel) [11:32:05] _joe_: Bon appetit :) [11:32:12] (03CR) 10Volans: [C: 032] WMCS: fine tuning Cumin configuration for proxies [puppet] - 10https://gerrit.wikimedia.org/r/379209 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [11:34:24] !log restarted varnish-be on cp4010 [11:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:37] (03PS1) 10Mobrovac: RESTBase: Separate eqiad and codfw seeds [puppet] - 10https://gerrit.wikimedia.org/r/379217 (https://phabricator.wikimedia.org/T169940) [11:39:32] (03PS1) 10Gilles: Upgrade to 1.5 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/379218 (https://phabricator.wikimedia.org/T173804) [11:39:37] jouncebot: refresh [11:39:39] I refreshed my knowledge about deployments. 
[11:39:57] jouncebot: next [11:39:57] In 0 hour(s) and 20 minute(s): Wikidata usage tracking (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T1200) [11:40:16] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1101 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379212 (https://phabricator.wikimedia.org/T176311) (owner: 10Jcrespo) [11:42:49] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler02/7944/" [puppet] - 10https://gerrit.wikimedia.org/r/379217 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [11:43:09] "Unable to open file ./wmf-config/PrivateSettings.php" [11:43:15] that is on CI composer [11:43:29] (03Merged) 10jenkins-bot: mariadb: Depool db1101 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379212 (https://phabricator.wikimedia.org/T176311) (owner: 10Jcrespo) [11:43:43] (03CR) 10jenkins-bot: mariadb: Depool db1101 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379212 (https://phabricator.wikimedia.org/T176311) (owner: 10Jcrespo) [11:43:45] (03CR) 10Hashar: [C: 031] "On contint1001:" [puppet] - 10https://gerrit.wikimedia.org/r/378853 (owner: 10Elukey) [11:45:45] !log jynus@tin Synchronized wmf-config/db-eqiad.php: depool db1101 (duration: 00m 58s) [11:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:03] (03PS2) 10Elukey: contint::apache: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/378853 [11:48:24] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3620886 (10Volans) [11:49:33] !log stopping replication on db1101 for faster repartition work T176311 [11:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:47] T176311: decommission db1036 - https://phabricator.wikimedia.org/T176311 [11:51:11] !log mobrovac@tin Started deploy 
[restbase/deploy@dea2b41]: New storage schema for mobile-sections, canary deploy for schema creation - T169940 [11:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:24] T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940 [11:52:40] 10Operations, 10DBA, 10Patch-For-Review: decommission db1036 - https://phabricator.wikimedia.org/T176311#3620896 (10jcrespo) Repartitioning db1101 is ongoing (while replication is down) so that it can substitute db1036 role. [11:53:38] PROBLEM - Restbase root url on restbase2002 is CRITICAL: connect to address 10.192.16.153 and port 7231: Connection refused [11:55:50] known ^ [11:56:52] !log restarted varnish-be on cp4018 [11:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:32] !log mobrovac@tin Finished deploy [restbase/deploy@dea2b41]: New storage schema for mobile-sections, canary deploy for schema creation - T169940 (duration: 07m 20s) [11:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:45] T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940 [12:00:04] hoo: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata usage tracking. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T1200). [12:00:05] No patches in the queue for this window. Wheeee! [12:00:38] PROBLEM - Check systemd state on restbase2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:01:53] (03PS3) 10Hoo man: Enable statement usage tracking on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: 10Eranroz) [12:03:09] (03CR) 10Hoo man: [C: 032] Enable statement usage tracking on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: 10Eranroz) [12:04:21] 10Operations, 10Traffic, 10Browser-Support-Apple-Safari, 10Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3620910 (10TheDJ) FYI: Seems the change missed the cut-off for the Safari 11 release. iOS 11 Safari 11 not yet fixed: Mozilla/5... [12:05:48] (03Merged) 10jenkins-bot: Enable statement usage tracking on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: 10Eranroz) [12:07:21] (03CR) 10jenkins-bot: Enable statement usage tracking on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: 10Eranroz) [12:08:38] !log hoo@tin Synchronized wmf-config/Wikibase-production.php: Enable statement usage tracking on elwiki (T151717) (duration: 00m 49s) [12:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:53] T151717: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717 [12:16:15] (03PS2) 10Filippo Giunchedi: RESTBase: Separate eqiad and codfw seeds [puppet] - 10https://gerrit.wikimedia.org/r/379217 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [12:17:05] (03CR) 10Filippo Giunchedi: [C: 032] RESTBase: Separate eqiad and codfw seeds [puppet] - 10https://gerrit.wikimedia.org/r/379217 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [12:17:48] mobrovac: ^ merged [12:18:10] !log uploaded apache 2.4.10-10+deb8u11+wmf1 to jessie-wikimedia (rebuild of latest 
security update with our local patches on top) [12:18:13] grazie godog! [12:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:11] !log restarted varnish-be on cp4027 [12:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:28] PROBLEM - DPKG on eventlog2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:23:28] RECOVERY - DPKG on eventlog2001 is OK: All packages OK [12:28:20] !log upgrading nodejs to 6.11 on maps-test2003 for testing - T171707 [12:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:34] T171707: Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707 [12:35:36] (03PS5) 10ArielGlenn: Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528) [12:37:26] (03PS3) 10Elukey: contint::apache: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/378853 [12:37:55] (03CR) 10Elukey: [C: 032] contint::apache: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/378853 (owner: 10Elukey) [12:39:59] hashar: merged --^ lemme know if you any weirdness later on, I can't see any now [12:40:19] (03PS2) 10Elukey: smokeping::apache: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/378854 [12:40:44] (03CR) 10Elukey: [C: 032] smokeping::apache: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/378854 (owner: 10Elukey) [12:45:37] (03PS1) 10BBlack: VCL: insure against coalesce in MISS2PASS pathway [puppet] - 10https://gerrit.wikimedia.org/r/379221 [12:46:07] elukey: I guess it will be just fine [12:48:41] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3621017 (10mark) [12:48:44] 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: 
Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3621014 (10mark) 05Open>03Resolved a:03mark This will do for our aggregate power stats. Thanks! [12:50:11] (03CR) 10Ema: [C: 031] VCL: insure against coalesce in MISS2PASS pathway [puppet] - 10https://gerrit.wikimedia.org/r/379221 (owner: 10BBlack) [12:51:19] !log restarted varnish-be on cp4028 [12:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:55] (03PS1) 10Mobrovac: Cassandra: Allow extra hosts to access the cluster via CQL [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) [12:52:58] (03CR) 10BBlack: [C: 032] VCL: insure against coalesce in MISS2PASS pathway [puppet] - 10https://gerrit.wikimedia.org/r/379221 (owner: 10BBlack) [12:53:24] (03CR) 10jerkins-bot: [V: 04-1] Cassandra: Allow extra hosts to access the cluster via CQL [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [12:55:45] (03PS2) 10Mobrovac: Cassandra: Allow extra hosts to access the cluster via CQL [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) [12:56:13] (03CR) 10jerkins-bot: [V: 04-1] Cassandra: Allow extra hosts to access the cluster via CQL [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [12:59:02] (03PS3) 10Mobrovac: Cassandra: Allow extra hosts to access the cluster via CQL [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T1300). 
[13:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker. [13:00:24] o/ [13:01:18] Hello [13:01:24] (03PS1) 10Dereckson: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) [13:01:24] I can SWAT. [13:01:31] thanks! [13:01:32] (I'll add this change too ^) [13:01:49] I am floating around if needed [13:02:08] o/ [13:02:11] I'm around too [13:03:03] (03PS3) 10Dereckson: Switch elasticsearch active cluster to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378850 (owner: 10DCausse) [13:03:10] (03CR) 10jerkins-bot: [V: 04-1] Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:03:27] dcausse: I wonder what's the point of a test you need to change when DC change [13:04:02] perhaps instead you should check is an expected value, like a datacenter array containing codfw eqiad [13:04:23] (03CR) 10Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/compiler02/7946/" [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [13:04:27] Dereckson: yes... it's annoying, this test is mostly useful to check that we follow the default master DC [13:04:49] but fails we specify a custom DC for elastic [13:04:59] s/we/when we/ [13:05:12] (03CR) 10Dereckson: [C: 032] Switch elasticsearch active cluster to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378850 (owner: 10DCausse) [13:06:23] Dereckson: when we can create hi.wikivoyage? 
[13:06:57] I want my gerrit review dashboard clean ;) [13:07:24] (03PS2) 10Elukey: tendril::apache: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/378855 [13:07:51] (03Merged) 10jenkins-bot: Switch elasticsearch active cluster to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378850 (owner: 10DCausse) [13:08:00] (03CR) 10jenkins-bot: Switch elasticsearch active cluster to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378850 (owner: 10DCausse) [13:08:14] (03CR) 10Elukey: [C: 032] tendril::apache: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/378855 (owner: 10Elukey) [13:08:55] tabbycat: this need a dedicated window, not in SWAT [13:09:09] tabbycat: Apache and DNS are ok? [13:09:21] Dereckson: I know. But greg-g and myself have asked in the task if you could take care of it [13:09:26] Dereckson: I think yes [13:09:36] let me check [13:09:59] dcausse: live on mwdebug1002 [13:10:03] Dereckson: looking [13:14:13] (03PS2) 10Zoranzoki21: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:14:57] Dereckson: looks good, maybe expect a small burst of pool counter failures [13:15:04] (03CR) 10Zoranzoki21: [C: 031] "Now is all ok. 
CR: +1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:15:47] ok [13:15:59] looks good too in the logs currently [13:16:11] 10Operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 10Operations-Software-Development, and 2 others: Replace salt on integration and deployment-prep projects - https://phabricator.wikimedia.org/T176314#3621041 (10hashar) [13:17:31] !log dereckson@tin Synchronized tests/cirrusTest.php: Switch elasticsearch active cluster to codfw ([[Gerrit:378850]], 1/3, no-op in prod) (duration: 00m 49s) [13:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:41] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Switch elasticsearch active cluster to codfw ([[Gerrit:378850]], 2/3) (duration: 00m 49s) [13:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:48] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Switch elasticsearch active cluster to codfw ([[Gerrit:378850]], 3/3) (duration: 00m 48s) [13:20:56] (03PS2) 10Elukey: noc::apache: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/378857 [13:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:56] !log begin rolling codfw varnish-be restarts, 20m spacing [13:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:22] (03PS3) 10Dereckson: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) [13:25:29] (03CR) 10Elukey: [C: 032] "Checked /srv/mediawiki/docroot/noc on wasat/terbium, no traces of .htaccess" [puppet] - 10https://gerrit.wikimedia.org/r/378857 (owner: 10Elukey) [13:25:49] (03Restored) 10Ottomata: Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:25:51] (03PS3) 10Ottomata: Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:27:02] (03CR) 10jerkins-bot: [V: 04-1] Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:27:09] (03CR) 10jerkins-bot: [V: 04-1] Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:27:39] Dereckson: my irc client misbehaved (e.g. I did not see the sal log for CS.php) [13:27:55] (03CR) 10Zoranzoki21: "Sorry Dereckson.. I edited because jenkins put -1.. When I edited he put +1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:28:10] Dereckson: the deploy looks good to me (in case you've asked and I did not see your ping) [13:30:48] (03PS4) 10Zoranzoki21: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:31:11] (03CR) 10Elukey: [C: 04-1] "It is, I am stupid, since I was checking the default vhost's document root:" [puppet] - 10https://gerrit.wikimedia.org/r/378858 (owner: 10Elukey) [13:31:39] (03PS4) 10Krinkle: Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:31:58] (03CR) 10Zoranzoki21: [C: 031] "This is ok, because is only one IP adress. If you disagree with my edit, write and then I will stop." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:32:31] (03CR) 10Krinkle: [C: 031] "Yeah, either it's not set up in labs, or it's working fine through the default ones. CommonSettings.php isn't all prod-specific. A lot of " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:35:33] (03CR) 10Ottomata: "Ah great, yeah it def works in labs, I checked before I abandoned this. Ok, so can we just merge this as is then?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:35:49] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [13:35:51] (03CR) 10Ottomata: "AH, you +1ed, merging." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:35:53] (03CR) 10Ottomata: [C: 032] Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:35:56] (03PS5) 10Ottomata: Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:35:58] (03CR) 10Ottomata: [V: 032 C: 032] Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:36:35] (03CR) 10jenkins-bot: Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:37:12] ottomata: hasharAway: ping me when you're done, I was deploying 
another change [13:37:44] Dereckson: you can proceed if you like, it is a no op change in labs [13:37:55] thing is pretty much already removed, that's just a follow up [13:38:17] ottomata: deploy it to prod please, we want labs/prod equivalence in files, even the -labs one [13:38:44] ok [13:42:39] !log otto@tin Synchronized /srv/mediawiki-staging/wmf-config/CommonSettings-labs.php: (no justification provided) (duration: 00m 49s) [13:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:54] Thanks. So, next time, please avoid that during SWAT. [13:43:49] (03CR) 10Dereckson: "Standardization vs microoptimization for cached values. Your call." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:43:56] ok Dereckson sorry about that. [13:43:57] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:44:16] I probably shouldn't have merged so hastily. [13:45:31] (03CR) 10jerkins-bot: [V: 04-1] Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:49:11] (03CR) 10Dereckson: "@Zoran Please run phpcs before submit changes correction by the way (you can get it through `composer install`)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:49:57] (03CR) 10Dereckson: [C: 04-1] "Cancelling this change for SWAT, as it needs to be reformatted correctly." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:50:19] Hmm not related to Wikipedia issues but Cloudflare seems to be problematic right now [13:51:00] (03CR) 10Zoranzoki21: "Ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [13:51:18] More hrer - https://blog.cloudflare.com/todays-outage-post-mortem-82515/ [13:51:22] *here [13:51:31] Does anything WMF route via cloudflare? [13:51:59] Oh wait that's from 2013 [13:52:02] Sorry :( [13:52:07] Brain not engaged this morning [13:52:27] rofl [13:52:50] though to be fair, with distributed systems you never know [13:53:17] Anyway one site I view a lot was down, with a cloudflare error [13:54:12] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3621166 (10Arlolra) > We kind of have that already in the force of inactive but it is not sufficient as setting a host to inactive means scap will skip it during deploys which is not desirable in this cas... 
[13:54:24] (03CR) 10Filippo Giunchedi: [C: 031] Cassandra: Allow extra hosts to access the cluster via CQL [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [13:54:30] (03PS4) 10Filippo Giunchedi: Cassandra: Allow extra hosts to access the cluster via CQL [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [13:55:20] (03CR) 10Filippo Giunchedi: [C: 032] Cassandra: Allow extra hosts to access the cluster via CQL [puppet] - 10https://gerrit.wikimedia.org/r/379223 (https://phabricator.wikimedia.org/T169940) (owner: 10Mobrovac) [13:55:26] 10Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 5 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#3621168 (10Reedy) [13:55:58] dcausse: logs are still good, no error spike for your change :) [13:56:24] Dereckson: yes all good for me, thanks for the deploy! [13:57:09] You're welcome. [14:01:04] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3621184 (10akosiaris) >>! In T165520#3621166, @Arlolra wrote: >> We kind of have that already in the force of inactive but it is not sufficient as setting a host to inactive means scap will skip it during... 
[14:01:12] (03PS3) 10Alexandros Kosiaris: Whitelist wtp10[25-48] for Linter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378073 (https://phabricator.wikimedia.org/T165520) (owner: 10Legoktm) [14:01:20] (03CR) 10Alexandros Kosiaris: [C: 032] Whitelist wtp10[25-48] for Linter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378073 (https://phabricator.wikimedia.org/T165520) (owner: 10Legoktm) [14:01:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Whitelist wtp10[25-48] for Linter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378073 (https://phabricator.wikimedia.org/T165520) (owner: 10Legoktm) [14:01:44] (03CR) 10Zoranzoki21: [C: 031] Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [14:02:08] (03CR) 10Zoranzoki21: [C: 031] "I can not remove jenkins-bot to recheck.. All is ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [14:02:29] (03CR) 10jenkins-bot: Whitelist wtp10[25-48] for Linter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378073 (https://phabricator.wikimedia.org/T165520) (owner: 10Legoktm) [14:03:52] !log akosiaris@tin Synchronized wmf-config/InitialiseSettings.php: Whitelist wtp10[25-48] for Linter (duration: 00m 49s) [14:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:30] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3621199 (10akosiaris) https://gerrit.wikimedia.org/r/#/c/378073/ has just been deployed [14:09:12] (03PS5) 10Zoranzoki21: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [14:11:07] (03PS1) 10Ottomata: flock before attempting to run rsync of published-datasets [puppet] - 
10https://gerrit.wikimedia.org/r/379234 (https://phabricator.wikimedia.org/T174756) [14:13:43] 10Operations, 10fundraising-tech-ops, 10netops: remove fundraising firewall rules related to ganglia - https://phabricator.wikimedia.org/T176319#3621226 (10Jgreen) [14:14:52] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3621252 (10Arlolra) > It's just a different argument to pooled. Instead of pooled=yes or pooled=no you say pooled=inactive. Up to now we use it exclusively for hosts going through extensive maintenance (R... [14:15:04] 10Operations, 10fundraising-tech-ops, 10netops: remove fundraising firewall rules related to ganglia - https://phabricator.wikimedia.org/T176319#3621259 (10Jgreen) [14:15:07] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3621258 (10Jgreen) [14:15:07] godog: I have some grafana queries that give me 30-day uptime, for example: sum_over_time(probe_success{instance="labservices1001.wikimedia.org:9001"}[30d]) / count_over_time(probe_success{instance="labservices1001.wikimedia.org:9001"}[30d]) [14:15:26] 10Operations, 10fundraising-tech-ops, 10netops: remove fundraising firewall rules related to ganglia - https://phabricator.wikimedia.org/T176319#3621226 (10Jgreen) [14:15:31] Is there a way to replace that '30d' with something that's responsive to the time-interval selector on the grafana UI? 
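Editor's sketch: the uptime query quoted at 14:15:07 divides the sum of the `probe_success` samples in the window by the number of samples, which is the fraction of probes that succeeded. A minimal Python illustration of that arithmetic, assuming every sample is 0 or 1; `uptime_ratio` is a hypothetical helper written for this note, not part of Prometheus or Grafana.

```python
def uptime_ratio(samples):
    """Fraction of successful probes in a window.

    Mirrors sum_over_time(x[w]) / count_over_time(x[w]) for a series
    whose samples are 1 (probe succeeded) or 0 (probe failed).
    Hypothetical helper, for illustration only.
    """
    if not samples:
        # No samples in the window: the ratio is undefined.
        return None
    return sum(samples) / len(samples)


# Four probes in the window, one of which failed.
window = [1, 1, 0, 1]
print(uptime_ratio(window))
```

Changing the window length only changes which samples are passed in; the ratio itself is computed the same way.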
[14:17:02] (03PS9) 10Gehel: Set elasticsearch servers to use 128kB readahead [puppet] - 10https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [14:17:57] 10Operations, 10fundraising-tech-ops: Long term storage for frack prometheus data - https://phabricator.wikimedia.org/T175738#3621268 (10Jgreen) [14:18:09] andrewbogott: yeah I think grafana templating can do that, there should be a special type of templating variable for that [14:18:23] * andrewbogott googles grafana template variable [14:18:25] (03CR) 10Gehel: [C: 032] Set elasticsearch servers to use 128kB readahead [puppet] - 10https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [14:30:07] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3621325 (10daniel) [14:32:38] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3621331 (10Halfak) Just checking in here because I've been OoO. Did any tests happen after my last test? 
[14:39:53] (03PS1) 10Herron: Add rate limiting to toollabs::mailrelay with warn action [puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) [14:40:19] (03CR) 10jerkins-bot: [V: 04-1] Add rate limiting to toollabs::mailrelay with warn action [puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: 10Herron) [14:41:29] RECOVERY - Check systemd state on restbase2002 is OK: OK - running: The system is fully operational [14:43:42] 10Operations, 10Goal, 10Kubernetes: Operations Q1 goal: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108#3621363 (10akosiaris) [14:44:13] 10Operations, 10Goal, 10Kubernetes: Experiment with ingress solutions (stretch) - https://phabricator.wikimedia.org/T170121#3621364 (10akosiaris) [14:47:52] Hello! I need some help with database access at Toolforge. I want to access the databases with php (mysqli_connect("localhost","my_user","my_password","my_db");) but after migrating Tool Labs to Toolforge, old connection data doesnt seem to work. Could someone help me? [14:48:46] Sanyi4: lets talk about it in the #wikimedia-cloud channel [14:49:31] ok [14:50:23] localhost looks pretty wrong for a start [14:53:36] (03PS4) 10Krinkle: webperf: Limit by-country navtiming breakdown to those with 5+ hits/min [puppet] - 10https://gerrit.wikimedia.org/r/377806 (https://phabricator.wikimedia.org/T166390) [14:55:08] (03PS2) 10Herron: Add rate limiting to toollabs::mailrelay with warn action [puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) [14:55:11] (03CR) 10Chad: "@Moritz: Yeah that makes sense. 
Thx for the merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/378931 (owner: 10Muehlenhoff) [14:55:59] (03CR) 10Chad: [C: 031] Remove deployment::redis ferm service [puppet] - 10https://gerrit.wikimedia.org/r/379185 (owner: 10Muehlenhoff) [14:56:07] 10Operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 10Operations-Software-Development, and 2 others: Replace salt on integration and deployment-prep projects - https://phabricator.wikimedia.org/T176314#3621406 (10faidon) beta, CI and other WMCS VPS projects are not environ... [14:56:49] (03CR) 10Chad: [C: 031] Remove role::deployment::config and the related Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/379188 (owner: 10Muehlenhoff) [14:57:12] (03CR) 10Chad: [C: 031] Remove service-restart and deploy-info [puppet] - 10https://gerrit.wikimedia.org/r/379191 (owner: 10Muehlenhoff) [14:57:54] (03CR) 10Chad: [C: 031] Remove deployment::salt_master/role::deployment::salt_masters and related files [puppet] - 10https://gerrit.wikimedia.org/r/379189 (owner: 10Muehlenhoff) [14:58:38] no_justification: thanks for the +1, not sure if that's in line with T176314 :) [14:58:39] T176314: Replace salt on integration and deployment-prep projects - https://phabricator.wikimedia.org/T176314 [14:58:44] (03CR) 10Chad: [C: 031] Gerrit: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379136 (owner: 10Paladox) [14:58:57] although I guess these are trebuchet-specific [14:59:54] :) [15:00:10] Tearing down salt > having a salt replacement in beta imho ;-) [15:02:27] That being said, I can't imagine setting up cumin in beta would be all that hard? 
[15:02:40] those are trebuchet-specific, the remaining salt bits will be ripped out soon [15:02:52] 10Operations, 10Traffic, 10Browser-Support-Apple-Safari, 10Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3621439 (10Nuria) Super thanks for following up [15:02:55] i see we have an optimist in the room [15:03:03] volans is almost done with Cumin for WMCS, only misses the openstack backend at this point [15:03:16] CR in few minutes for that ;) [15:03:47] moritzm: Btw, seeing that pile of changes in my e-mail brought a smile to my face :) [15:03:58] PROBLEM - Check systemd state on restbase2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:04:00] * no_justification sipped his coffee and went on a review spree [15:04:04] known ^ [15:05:04] anyway I think that in general what people do inside a cloud project is out of scope, if someone wants to have salt in their project, fine, the same as we don't check/enforce any other software/technology [15:05:33] that being said, being CI and beta "our" projects, it would be nice to have them in sync [15:07:46] Beta's supposed to ostensibly track production (more or less, for values of track that do not include actually matching :P) [15:08:35] with cumin enabled for WMCS all the people that can run cumin in prod will be able to run it in deployment-prep hosts [15:08:58] Nobody in releng can run cumin in prod ;-) [15:09:09] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 634.91 seconds [15:09:11] no_justification: you said track production ;) [15:09:19] Touché! 
[15:09:30] (03CR) 10Thcipriani: [C: 031] Remove service-restart and deploy-info [puppet] - 10https://gerrit.wikimedia.org/r/379191 (owner: 10Muehlenhoff) [15:09:32] but I see the point ofc [15:09:44] I'll look into it once the goal-stuff is done [15:10:07] I bet it's either pretty easy or super hard [15:10:20] Gotta be one of those! :) [15:12:01] (03CR) 10Thcipriani: [C: 031] Remove deployment::redis ferm service [puppet] - 10https://gerrit.wikimedia.org/r/379185 (owner: 10Muehlenhoff) [15:13:30] 10Operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 10Operations-Software-Development, and 2 others: Replace salt on integration and deployment-prep projects - https://phabricator.wikimedia.org/T176314#3621041 (10demon) I don't really feel like nitpicking over projects or... [15:13:37] paravoid: Responded on the task ^ [15:14:15] no_justification: thanks :) [15:14:37] didn't mean to nitpick projects or anything, I was just trying to understand if there was some hidden meaning behind tagging it e.g. as #Goal [15:14:48] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10Krinkle) [15:14:49] like "it's part of the goal" [15:15:10] I bet it was because the "Create subtask" button copies all CCs and projects by default. 
[15:15:13] Not my favorite behavior, tbh [15:19:19] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.41 seconds [15:20:48] (03PS1) 10Volans: Backends: add OpenStack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/379247 (https://phabricator.wikimedia.org/T175711) [15:22:18] we can ignore db1047, normally that is some researcher blocking replication with large queries, but that is "normal" [15:22:34] thanks jynus, makes sense [15:22:41] unless someone wants to investigate or it is up for a long time [15:23:07] I am also running eventlogging_cleaner up to 2016 (1000 updates every 5s) [15:23:17] cool [15:23:32] it will take a bit to clean up but in this way it doesn't hammer the disks [15:26:28] (03CR) 10Krinkle: "@Peter LGTM, but in running tests I'm noticing that the test samples are from before the new client patch, which means we're not testing t" [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [15:28:15] zuul seems to have a bit of a queue or is it just my impression? [15:32:30] no_justification ^^^ [15:33:29] Looking [15:33:50] thx [15:34:50] Bunch of core gate and submits, plus that long running job for Marvin whatever that is [15:34:55] But seems to be moving [15:34:58] Just volume [15:35:49] * volans would send the waiting for jenkins meme but cdn.meme.am returns server error :D [15:36:19] ack [15:36:29] !log upgrading elasticsearch plugins on elasticsearch eqiad, including cold restart of the cluster - T173231 [15:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:42] T173231: Wikidata Elastic search drops results with matches in different language label - https://phabricator.wikimedia.org/T173231 [15:36:57] pybal will alert when all elasticsearch nodes restart, but do not worry... 
[15:37:18] * volans starts worrying [15:38:05] volans: Main thing to remember is that the queues have priorities -- gate-and-submit(-swat) are going to beat out test any day [15:38:23] So if you've got a good number of executors being held up for say a string of mw/core changes, just gotta wait your turn [15:38:41] (03PS1) 10EBernhardson: Switch CirrusSearch MLR model for enwiki to older model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379252 [15:39:07] Two questions to ask when looking at the zuul status page are: "what exactly is slow," and "are things moving?" [15:39:18] Long as the answer is "lower priority stuff" and "yes" then we're ok :) [15:40:45] :-) maybe also underpowered? [15:42:22] (03PS1) 10Chad: group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379253 [15:42:33] (03CR) 10Chad: [C: 04-2] "For later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379253 (owner: 10Chad) [15:45:57] (03CR) 1020after4: [C: 031] Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 (https://phabricator.wikimedia.org/T147938) (owner: 10Chad) [15:46:28] 10Operations: recommended ssh ciphers/kexalgorithms combination doesn't work for ilo - https://phabricator.wikimedia.org/T111698#3621549 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Tracked via T171041 (somewhat a duplicate, closing) [15:49:18] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:50:06] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#3621572 (10herron) [15:50:08] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1024.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1051.eqiad.wmnet because of too many down! 
[15:50:28] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1030.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1051.eqiad.wmnet because of too many down! [15:50:58] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1021.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1041.eqiad.wmnet because of too many down! [15:51:19] gehel: ^ [15:51:23] (03CR) 10Matthias Mullie: "Is this good for tomorrow's puppet SWAT?" [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185) (owner: 10Matthias Mullie) [15:51:28] yep, that's me [15:51:40] ok :) [15:51:58] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([elastic1021.eqiad.wmnet, elastic1034.eqiad.wmnet, elastic1044.eqiad.wmnet, elastic1018.eqiad.wmnet, elastic1017.eqiad.wmnet, elastic1029.eqiad.wmnet, elastic1026.eqiad.wmnet, elastic1039.eqiad.wmnet, elastic1022.eqiad.wmnet, elastic1046.eqiad.wmnet, elastic1043.eqiad.wmnet, elastic1052.eqiad.wmnet, elastic1050.eqiad.wmnet, [15:51:58] wmnet, elastic1023.eqiad.wmnet, elastic1032.eqiad.wmnet, elastic1027.eqiad.wmnet, elastic1036.eqiad.wmnet, elastic1049.eqiad.wmnet, elastic1048.eqiad.wmnet, elastic1019.eqiad.wmnet, elastic1031.eqiad.wmnet, elastic1051.eqiad.wmnet, elastic1041.eqiad.wmnet, elastic1028.eqiad.wmnet, elastic1045.eqiad.wmnet, elastic1047.eqiad.wmnet, elastic1024.eqiad.wmnet]) [15:52:11] cold restart of elasticsearch eqiad in progress, I did not want to completely silence pybal (I don't think I can tell it to just ignore ES) [15:52:18] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([elastic1030.eqiad.wmnet, elastic1021.eqiad.wmnet, elastic1025.eqiad.wmnet, elastic1042.eqiad.wmnet, 
elastic1044.eqiad.wmnet, elastic1017.eqiad.wmnet, elastic1026.eqiad.wmnet, elastic1039.eqiad.wmnet, elastic1022.eqiad.wmnet, elastic1046.eqiad.wmnet, elastic1043.eqiad.wmnet, elastic1052.eqiad.wmnet, elastic1050.eqiad.wmnet, [15:52:18] wmnet, elastic1023.eqiad.wmnet, elastic1027.eqiad.wmnet, elastic1036.eqiad.wmnet, elastic1049.eqiad.wmnet, elastic1048.eqiad.wmnet, elastic1038.eqiad.wmnet, elastic1019.eqiad.wmnet, elastic1051.eqiad.wmnet, elastic1041.eqiad.wmnet, elastic1045.eqiad.wmnet, elastic1047.eqiad.wmnet, elastic1024.eqiad.wmnet]) [15:53:18] PROBLEM - PyBal IPVS diff check on lvs1009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([elastic1030.eqiad.wmnet, elastic1021.eqiad.wmnet, elastic1034.eqiad.wmnet, elastic1025.eqiad.wmnet, elastic1042.eqiad.wmnet, elastic1044.eqiad.wmnet, elastic1017.eqiad.wmnet, elastic1029.eqiad.wmnet, elastic1022.eqiad.wmnet, elastic1046.eqiad.wmnet, elastic1043.eqiad.wmnet, elastic1052.eqiad.wmnet, elastic1050.eqiad.wmnet, [15:53:18] wmnet, elastic1023.eqiad.wmnet, elastic1018.eqiad.wmnet, elastic1027.eqiad.wmnet, elastic1036.eqiad.wmnet, elastic1049.eqiad.wmnet, elastic1048.eqiad.wmnet, elastic1040.eqiad.wmnet, elastic1038.eqiad.wmnet, elastic1019.eqiad.wmnet, elastic1031.eqiad.wmnet, elastic1051.eqiad.wmnet, elastic1041.eqiad.wmnet, elastic1028.eqiad.wmnet, elastic1045.eqiad.wmnet, elastic1047.eqiad.wmnet, elastic1024.eqiad.wmnet]) [15:53:28] (03CR) 10Pmiazga: "maybe this is the time to create a WMF role. I would use the GlobalUsers/staff group as it's the closest to what we want to achieve. Filte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [15:53:38] Ouch, this is taking longer than expected... 
sorry for the noise [15:54:29] (03CR) 10Filippo Giunchedi: [C: 031] "Yep, good for tomorrow's puppet SWAT" [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185) (owner: 10Matthias Mullie) [15:54:31] (03Draft2) 10Jayprakash12345: Import sources on hr.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379254 (https://phabricator.wikimedia.org/T176320) [15:54:38] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([elastic1030.eqiad.wmnet, elastic1021.eqiad.wmnet, elastic1025.eqiad.wmnet, elastic1044.eqiad.wmnet, elastic1017.eqiad.wmnet, elastic1029.eqiad.wmnet, elastic1026.eqiad.wmnet, elastic1022.eqiad.wmnet, elastic1046.eqiad.wmnet, elastic1043.eqiad.wmnet, elastic1052.eqiad.wmnet, elastic1050.eqiad.wmnet, elastic1023.eqiad.wmnet, [15:54:38] wmnet, elastic1027.eqiad.wmnet, elastic1036.eqiad.wmnet, elastic1049.eqiad.wmnet, elastic1048.eqiad.wmnet, elastic1019.eqiad.wmnet, elastic1035.eqiad.wmnet, elastic1031.eqiad.wmnet, elastic1051.eqiad.wmnet, elastic1041.eqiad.wmnet, elastic1028.eqiad.wmnet, elastic1045.eqiad.wmnet, elastic1047.eqiad.wmnet, elastic1024.eqiad.wmnet]) [15:55:32] q! [15:56:36] (03CR) 10Filippo Giunchedi: [C: 031] "I won't be able to attend puppet SWAT tomorrow, though any one of "the usuals" should be able to merge and essentially apply https://wikit" [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185) (owner: 10Matthias Mullie) [15:56:37] no_justification: it took 32 minutes... not sure if can be defined "ok" from the user point of view ;-) [15:56:54] (03PS1) 10Elukey: partman: fix mw-raid1-lvm and set it as default recipe for mw13* [puppet] - 10https://gerrit.wikimedia.org/r/379257 (https://phabricator.wikimedia.org/T165519) [15:57:12] As a user, I'm very used jenkins taking annoyingly long time ;) [15:57:12] (03PS1) 10Giuseppe Lavagetto: Add explicit management of http proxy for apt. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379258 [15:57:29] (03PS2) 10Ayounsi: Adding a list of bogus ifNames [puppet] - 10https://gerrit.wikimedia.org/r/379122 [15:58:04] (03CR) 10Elukey: [C: 032] partman: fix mw-raid1-lvm and set it as default recipe for mw13* [puppet] - 10https://gerrit.wikimedia.org/r/379257 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [15:58:34] (03PS1) 10Ema: setup.py: ship pybal.bgp package [debs/pybal] - 10https://gerrit.wikimedia.org/r/379259 [15:58:40] volans: Not much can be done in the short term about it really. There's a quota, and we don't enforce quotas on the number of changes you can submit for review [15:58:46] (beyond basic DOS-preventing quotas) [15:59:17] There's a quota on nodepool, that is, and if # of things to test > that quota then somebody's gotta wait their turn :( [15:59:27] There's medium term and long term work planned here, ofc [15:59:42] (03PS4) 10Giuseppe Lavagetto: profile::docker::builder: add proxy settings to build config [puppet] - 10https://gerrit.wikimedia.org/r/379175 [16:00:11] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379254 (https://phabricator.wikimedia.org/T176320) (owner: 10Jayprakash12345) [16:00:34] (03CR) 10Ayounsi: [C: 032] Adding a list of bogus ifNames [puppet] - 10https://gerrit.wikimedia.org/r/379122 (owner: 10Ayounsi) [16:00:45] (03PS3) 10Ayounsi: Adding a list of bogus ifNames [puppet] - 10https://gerrit.wikimedia.org/r/379122 [16:01:19] I was wondering if we should do (or we already do) QoS maybe leaving like one node always for small/quick jobs or things like that [16:02:00] (03CR) 10Ema: [V: 032 C: 032] setup.py: ship pybal.bgp package [debs/pybal] - 10https://gerrit.wikimedia.org/r/379259 (owner: 10Ema) [16:02:08] (03PS1) 10Ema: setup.py: ship pybal.bgp package [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/379260 [16:02:28] (03CR) 10Ema: [V: 
032 C: 032] setup.py: ship pybal.bgp package [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/379260 (owner: 10Ema) [16:02:55] but yeah I didn't mean to criticize, just that although all looks ok from the CI point of view, from the user point of view it might seem that CI is stuck/overload [16:03:06] sorry also in a meeting [16:03:22] (03PS5) 10Giuseppe Lavagetto: profile::docker::builder: add proxy settings to build config [puppet] - 10https://gerrit.wikimedia.org/r/379175 [16:03:40] !log depooling all nodes in elasticsearch eqiad to investigate failed cluster restart (traffic is already directed to codfw) [16:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:04] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=elasticsearch [16:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:38] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [16:04:58] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 1, unused: 0 [16:06:43] (03CR) 10MarkTraceur: [C: 031] "+1 from me (I looked at this before but didn't +1 because I figured someone from ops would help out, thanks Filippo)" [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185) (owner: 10Matthias Mullie) [16:06:49] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [16:07:18] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [16:08:08] RECOVERY - Router interfaces on pfw3-codfw is OK: Smartmatch is experimental at /usr/lib/nagios/plugins/check_ifstatus_nomon line 178. 
[16:08:18] RECOVERY - PyBal IPVS diff check on lvs1009 is OK: OK: no difference between hosts in IPVS/PyBal [16:08:38] <_joe_> gehel: I doubt that would work with pybal [16:08:49] (03PS6) 10Giuseppe Lavagetto: profile::docker::builder: add proxy settings to build config [puppet] - 10https://gerrit.wikimedia.org/r/379175 [16:09:04] _joe_: that what would work? [16:09:04] <_joe_> pybal won't depool all the elements in a pool if you set pooled=no [16:09:25] <_joe_> it will depool until you reach the depool_threshold [16:09:33] <_joe_> what you want is pooled=inactive [16:09:41] ok, it at least reduced the log noise on the cluster, which is what we were trying to achieve... [16:10:07] !log restarting elasticsearch masters on eqiad (elastic1030, 36 and 40) [16:10:08] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [16:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:17] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::docker::builder: add proxy settings to build config [puppet] - 10https://gerrit.wikimedia.org/r/379175 (owner: 10Giuseppe Lavagetto) [16:12:09] 10Operations, 10MediaWiki-API, 10Traffic, 10monitoring, 10Services (watching): Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#3621638 (10GWicke) FTR, this is the graph with the alert I mentioned: https://grafana.wikimedia.org/dashboard/db/restbase?pane... [16:12:15] (03CR) 10Smalyshev: "@ArielGlenn: as soon as I9867ad566c0619b55a48a011bd3c55321b1bfcff is merged, yes." 
[puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) (owner: 10Smalyshev) [16:15:24] (03PS1) 10Ayounsi: check_ifstatus: ignore experimental warnings [puppet] - 10https://gerrit.wikimedia.org/r/379263 [16:16:23] (03CR) 10Ayounsi: [C: 032] check_ifstatus: ignore experimental warnings [puppet] - 10https://gerrit.wikimedia.org/r/379263 (owner: 10Ayounsi) [16:16:39] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add explicit management of http proxy for apt. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379258 (owner: 10Giuseppe Lavagetto) [16:19:54] !log starting elasticsearch on all nodes on eqiad [16:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:38] 10Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3621689 (10phuedx) Deterministic bucketing is also available in MediaWiki core via [[ https://github.com/wikimedia/mediawiki/blob/00c769eb8d7746dfddff525ccc813f276046dea8/resources/src/mediawiki/medi... [16:26:16] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3621692 (10elukey) I came up with the following layout for mw13*: ``` root@mw1319:~# df -h Filesystem Size Used Avail Use% Mounted on udev... [16:27:18] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:30:07] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3117417 (10Cmjohnson) [16:30:14] volans hi, i am getting cumin_openstack pub keys errors on a jessie puppet master. 
[16:30:15] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: secret(): invalid secret keyholder/cumin_openstack_master.pub at /etc/puppet/modules/profile/manifests/openstack/main/cumin/target.pp:16 on node puppet-phabricator.phabricator.eqiad.wmflabs [16:30:20] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3117417 (10Cmjohnson) Disks are wiped [16:31:02] paladox: checking [16:31:28] ah, sorry false alarm. I found the labs private repo was not updated there for strange reasons [16:31:46] ok, great, is exactly what I was checking ;) [16:32:01] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission WMF3248 (old R510) - https://phabricator.wikimedia.org/T172323#3621703 (10Cmjohnson) The disks area wiped, this server has been off-line since 2015. Only on-site work is needed. [16:32:08] RECOVERY - Check systemd state on restbase2002 is OK: OK - running: The system is fully operational [16:32:29] 10Operations, 10ops-eqiad, 10hardware-requests, 10monitoring: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3621704 (10Cmjohnson) The disks have been wiped. [16:32:43] (03PS1) 10Ema: pybaltest: Fix BGP port number in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/379265 [16:32:48] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [16:34:09] !log cleanup jar hell and restart elastic1051 [16:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:25] !log elasticsearch eqiad is finally recovering, after the restart of elastic1051 [16:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:53] some lesson learned here, and documentation to update... 
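Editor's sketch: _joe_'s point at 16:09 is that setting pooled=no on every host does not empty the pool, because pybal stops depooling once it reaches the depool threshold, whereas pooled=inactive takes hosts out of consideration. The toy model below only illustrates that described behaviour. It is not pybal's code: the function name `still_pooled`, the threshold value 0.5, the host names, and the assumption that inactive hosts are simply not counted are all illustrative.

```python
import math


def still_pooled(states, depool_threshold):
    """Return the hosts that keep serving traffic in this toy model.

    states maps host -> "yes", "no" or "inactive" (conftool-style values).
    depool_threshold is the fraction of known hosts that must stay pooled.
    Hypothetical model of the behaviour described in the log, not pybal itself.
    """
    # Assumption: inactive hosts are dropped from the pool definition entirely.
    known = [host for host, state in states.items() if state != "inactive"]
    min_pooled = math.ceil(len(known) * depool_threshold)

    pooled = set(known)
    for host in known:
        # A pooled=no request is honoured only while the pool stays
        # at or above the minimum size implied by the threshold.
        if states[host] == "no" and len(pooled) - 1 >= min_pooled:
            pooled.remove(host)
    return pooled


hosts = ["elastic%d" % n for n in range(1017, 1027)]  # ten example hosts
all_no = {host: "no" for host in hosts}
all_inactive = {host: "inactive" for host in hosts}

print(len(still_pooled(all_no, 0.5)))        # some hosts stay pooled
print(len(still_pooled(all_inactive, 0.5)))  # nothing left to pool
```

In this model the all-"no" case leaves half of the hosts pooled, which matches the remark that pooled=no reduced the noise but did not fully depool the cluster.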
[16:34:53] (03CR) 10Ema: [C: 032] pybaltest: Fix BGP port number in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/379265 (owner: 10Ema) [16:35:13] 10Operations, 10Analytics, 10monitoring, 10Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3621710 (10Zoranzoki21) [16:36:33] (03Abandoned) 10Ema: Revert "VCL: stabilize backend storage patterns" [puppet] - 10https://gerrit.wikimedia.org/r/379181 (owner: 10Ema) [16:39:33] 10Operations, 10ops-eqiad: decommission beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T147934#3621724 (10Cmjohnson) Disks are wiping [16:42:23] !log rolling restart of logstash ingesters after cirrus eqiad recovery [16:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:09] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=elasticsearch [16:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:00] !log Upgraded and restarted apache2 on labs-puppetmaster.wikimedia.org [16:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:34] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3621843 (10Gehel) [16:56:06] (03CR) 10Mobrovac: [C: 031] "PCC looks ok - https://puppet-compiler.wmflabs.org/compiler02/7952/" [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup) [16:56:08] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:58:17] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#3621879 (10herron) Looking more closely at how to pull mx2001 out of service for an OS reload it is more complicated than I 
originally thought. We have ~100 dns zones referencing the fqdn of mx2001 in MX and SPF... [16:58:48] Does anyone in here have any experience with running refreshLinks on a lot of pages (whole of elwiki) [16:58:52] legoktm: maybe? ^ [16:59:09] I've done it before [16:59:28] I know… is there anything crazy to consider? [16:59:38] it's medium sized, it should be fine [16:59:53] why do you want to run it? [17:00:03] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3621893 (10mobrovac) This is currently occurring on RESTBase and Parsoid hosts and SCB, impacting mos... [17:00:05] Re-Generate the wbc_entity_usage table basically [17:00:36] I think that should be fine [17:00:49] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3621900 (10Halfak) [17:01:03] !log otto@tin Started deploy [eventstreams/deploy@e62ab64]: (no justification provided) [17:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:16] 10Operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 10Operations-Software-Development, and 2 others: Replace salt on integration and deployment-prep projects - https://phabricator.wikimedia.org/T176314#3621911 (10bd808) https://github.com/bd808/wikimedia-cloud-vps-hostgrou... 
[17:03:41] 10Operations, 10Analytics, 10monitoring, 10Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3621915 (10Ottomata) [17:05:47] !log otto@tin Finished deploy [eventstreams/deploy@e62ab64]: (no justification provided) (duration: 04m 44s) [17:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:18] (03PS1) 10Andrew Bogott: labtest: add stub node definitions for some currently unused hosts [puppet] - 10https://gerrit.wikimedia.org/r/379268 [17:10:44] (03CR) 10Andrew Bogott: [C: 032] labtest: add stub node definitions for some currently unused hosts [puppet] - 10https://gerrit.wikimedia.org/r/379268 (owner: 10Andrew Bogott) [17:11:02] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3621935 (10RobH) [17:11:08] PROBLEM - Check systemd state on restbase2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:12:09] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3621953 (10mobrovac) FTR, all of the aforementioned services use `logstash1001` directly. That ought... [17:13:22] !log Ran mwscript refreshLinks.php --wiki elwiki --namespace 0 -e 30 for testing purposes [17:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:31] greg-g: Is it ok with you if I sign up running refreshLinks for elwiki (articles only) now? This is coordinated with the DBA [17:15:18] RECOVERY - Check systemd state on restbase2002 is OK: OK - running: The system is fully operational [17:15:20] (03CR) 10Gergő Tisza: "We do control usernames, see https://meta.wikimedia.org/wiki/Title_blacklist. 
staff is a very powerful group (basically the rough equivale" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [17:17:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3621976 (10RobH) [17:18:20] PROBLEM - Check systemd state on restbase2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:23:09] hoo: sure [17:23:25] paladox: ehm.. is "ExecRestart" really a thing? i dont see it on https://www.freedesktop.org/software/systemd/man/systemd.service.html [17:23:29] ExecReload yes [17:23:51] (03CR) 10Brian Wolff: "I agree with Pmiazga, this should be permission based. I don't think that it should use the "staff" group specificly (Since that group is " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [17:23:52] oh [17:24:16] mutante i guess that reload acts the same as restart, if not. We can still update the command in reload to use restart [17:24:33] yea, the command on the right-hand side should be updated regardless [17:24:41] (03PS3) 10Herron: Lists: Add acl to warn of invalid/forged HELO messages [puppet] - 10https://gerrit.wikimedia.org/r/372174 (https://phabricator.wikimedia.org/T173338) [17:24:41] but the left-hand side,, doesnt exist? [17:25:37] yep seems execrestart does not exist [17:25:39] from a quick google search [17:25:53] _joe_: I still see a pybal alert in icinga, but all elasticsearch nodes are pooled again as far as I can see... 
any idea why [17:26:41] (03PS3) 10Paladox: Gerrit: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379136 [17:27:14] greg-g: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1770576&oldid=1770575 [17:28:58] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [17:29:09] (03CR) 10Herron: [C: 032] Lists: Add acl to warn of invalid/forged HELO messages [puppet] - 10https://gerrit.wikimedia.org/r/372174 (https://phabricator.wikimedia.org/T173338) (owner: 10Herron) [17:29:10] !log Started mwscript refreshLinks.php --wiki elwiki --namespace 0 in a screen on elwiki. Can be KILLed at any time, if needed. [17:29:15] (03PS4) 10Herron: Lists: Add acl to warn of invalid/forged HELO messages [puppet] - 10https://gerrit.wikimedia.org/r/372174 (https://phabricator.wikimedia.org/T173338) [17:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:27] !log Started mwscript refreshLinks.php --wiki elwiki --namespace 0 in a screen on terbium. Can be KILLed at any time, if needed. [17:29:28] doh [17:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:38] (03PS2) 10Muehlenhoff: Remove deployment::redis ferm service [puppet] - 10https://gerrit.wikimedia.org/r/379185 [17:31:08] PROBLEM - puppet last run on lvs1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:31:18] PROBLEM - PyBal backends health check on lvs1007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [17:31:44] FailureAction= Configure the action to take when the service enters a failed state < we could use this for monitoring. if the action would be "send packet to Icinga".. 
then those would all be passive checks, taking load off of Icinga server [17:31:58] PROBLEM - pybal on lvs1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [17:32:12] (03CR) 10Muehlenhoff: [C: 032] Remove deployment::redis ferm service [puppet] - 10https://gerrit.wikimedia.org/r/379185 (owner: 10Muehlenhoff) [17:33:18] (03CR) 10Dzahn: [C: 032] Gerrit: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379136 (owner: 10Paladox) [17:33:32] mutante you wont be able to merge ^^ [17:33:36] until we merge the parent [17:33:46] https://gerrit.wikimedia.org/r/#/c/378768/ [17:34:10] (03PS16) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 [17:34:12] ok, just saw, *nod* [17:34:26] (03PS4) 10Paladox: Gerrit: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379136 [17:34:37] (03PS2) 10Muehlenhoff: Remove role::deployment::config and the related Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/379188 [17:40:07] (03CR) 10Muehlenhoff: [C: 032] Remove role::deployment::config and the related Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/379188 (owner: 10Muehlenhoff) [17:44:18] RECOVERY - pybal on lvs1007 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [17:44:28] RECOVERY - puppet last run on lvs1007 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:44:36] (03CR) 10Dzahn: "yea, nothing specific i have against this right now, but we definitely want to get the issue on gerrit2001 fixed and get to the bottom of " [puppet] - 10https://gerrit.wikimedia.org/r/378768 (owner: 10Paladox) [17:44:38] RECOVERY - PyBal backends health check on lvs1007 is OK: PYBAL OK - All pools are healthy [17:46:33] (03PS2) 10Muehlenhoff: Remove service-restart and deploy-info [puppet] - 10https://gerrit.wikimedia.org/r/379191 [17:49:09] (03CR) 10Muehlenhoff: [C: 032] Remove service-restart and deploy-info [puppet] - 
10https://gerrit.wikimedia.org/r/379191 (owner: 10Muehlenhoff) [17:51:03] (03CR) 10Dzahn: "Do i need to find a sponsor for this change and put it on a calendar?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) (owner: 10Dzahn) [17:55:20] (03PS1) 10Muehlenhoff: Remove a few obsolete references to trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/379281 [17:58:41] Spoiler alert: I kinda took over the 11am SWAT with a bunch of patches, including 2 that needs scap, so I'll do the SWAT [17:59:19] (03CR) 10Bartosz Dziewoński: "> There is no way to show an error with SpecialPageBeforeExecute" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [17:59:26] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3622171 (10RobH) I cannot find wdqs1001 on the network switch stack for row d (either old asw or new asw2 stack). @cmjohnson will need to u... [17:59:52] (03CR) 10Ottomata: role::kafka::jumbo::broker: enable Prometheus JMX monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T175922) (owner: 10Elukey) [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T1800). [18:00:06] RoanKattouw: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker. 
[18:00:10] I got it [18:00:13] o/ [18:00:31] RoanKattouw, there are also patches by me [18:00:41] Ah I see them [18:00:44] Thanks for pointing them out [18:01:22] (03PS1) 10RobH: decom wdqs100[12] [dns] - 10https://gerrit.wikimedia.org/r/379282 (https://phabricator.wikimedia.org/T175595) [18:01:25] OK I just +2ed all six patches, we'll have to let Jenkins work on that for a bit [18:02:06] (03CR) 10Dereckson: [C: 031] copy squid.php->reverse-proxy.php, squid-labs->reverse-proxy-staging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) (owner: 10Dzahn) [18:02:48] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3622188 (10RobH) [18:04:11] (03PS2) 10RobH: decom wdqs100[12] [dns] - 10https://gerrit.wikimedia.org/r/379282 (https://phabricator.wikimedia.org/T175595) [18:04:28] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3622193 (10Cmjohnson) @robh updated ge-3/0/15 still had asset tag listed.
[18:04:30] (03CR) 10RobH: [C: 032] decom wdqs100[12] [dns] - 10https://gerrit.wikimedia.org/r/379282 (https://phabricator.wikimedia.org/T175595) (owner: 10RobH) [18:05:41] RECOVERY - Check systemd state on restbase2002 is OK: OK - running: The system is fully operational [18:06:58] (03PS1) 10Andrew Bogott: novastats: minor tweaks to the diskspace tool [puppet] - 10https://gerrit.wikimedia.org/r/379286 [18:07:34] (03PS1) 10RobH: decom of wdqs100[12] [puppet] - 10https://gerrit.wikimedia.org/r/379287 (https://phabricator.wikimedia.org/T175595) [18:08:05] (03CR) 10RobH: [C: 032] decom of wdqs100[12] [puppet] - 10https://gerrit.wikimedia.org/r/379287 (https://phabricator.wikimedia.org/T175595) (owner: 10RobH) [18:08:16] (03Abandoned) 10RobH: further tweaks to kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/374645 (https://phabricator.wikimedia.org/T174457) (owner: 10RobH) [18:10:57] 10Operations, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3622207 (10RobH) a:05RobH>03Cmjohnson Ok, systems are all done and puppet node clean/deactivate has been done. However, wdqs1001 doesn'... [18:11:20] 10Operations, 10ops-eqiad, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3597404 (10RobH) [18:12:16] (03CR) 10Andrew Bogott: [C: 032] novastats: minor tweaks to the diskspace tool [puppet] - 10https://gerrit.wikimedia.org/r/379286 (owner: 10Andrew Bogott) [18:12:22] (03PS2) 10Andrew Bogott: novastats: minor tweaks to the diskspace tool [puppet] - 10https://gerrit.wikimedia.org/r/379286 [18:14:07] 10Operations, 10netops: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3622219 (10ayounsi) [18:14:55] (03CR) 10Andrew Bogott: [C: 031] "I haven't given a very close read but the openstack client bits look right to me." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/379247 (https://phabricator.wikimedia.org/T175711) (owner: 10Volans) [18:15:08] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3622221 (10debt) p:05Triage>03High [18:15:24] 10Operations, 10ops-eqiad, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3622224 (10RobH) [18:15:36] (03PS1) 10Ottomata: Include jmx_exporter_config to make prometheus query Kafka jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/379290 (https://phabricator.wikimedia.org/T175922) [18:15:40] 10Operations, 10ops-eqiad, 10hardware-requests, 10Discovery-Wikidata-Query-Service-Sprint: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3597404 (10RobH) p:05Triage>03Low [18:18:21] (03CR) 10Ottomata: role::kafka::jumbo::broker: enable Prometheus JMX monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T175922) (owner: 10Elukey) [18:20:20] 10Operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 10Operations-Software-Development, and 2 others: Replace salt on integration and deployment-prep projects - https://phabricator.wikimedia.org/T176314#3622234 (10greg) For the record, the choice of tags is automatic by the... [18:25:07] PROBLEM - Check systemd state on restbase2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:25:59] known, disabled ^ [18:31:39] MaxSem: I don't suppose you really need to test your changes before I deploy them, since they're in a population maintenace script? 
[18:31:51] yup [18:31:56] OK cool [18:32:01] I'll test after the sync [18:32:09] Just waiting for Jenkins to merge one more and then I'll scap [18:35:29] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3622292 (10RobH) [18:40:21] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3622317 (10RobH) I tried to connect to cr3-esams and pull chassis hardware info, with no joy: ``` robh@re0.cr3-esams> show chassis hardware... [18:40:46] !log catrope@tin Started scap: Patches for T176302, T175962, T173533 [18:40:50] There we go [18:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:02] T175962: Issue with maintenance script: SELECTing revisions with high rev_id is painfully slow - https://phabricator.wikimedia.org/T175962 [18:41:02] T176302: Mentions not displaying properly when replying in flow - https://phabricator.wikimedia.org/T176302 [18:41:03] T173533: No longer block user from clicking results when page is loading. - https://phabricator.wikimedia.org/T173533 [18:42:55] 10Operations, 10ops-esams, 10DC-Ops, 10netops, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3622339 (10RobH) [19:00:05] no_justification: (Dis)respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T1900). Please do the needful. [19:00:05] No patches in the queue for this window. Wheeee! [19:00:16] THERE IS A PATCH YOU INSOLENT BOT [19:00:17] no_justification: My scap is aaaalmost done [19:00:27] So please wait just a minute [19:00:33] scap-cdb-rebuild: 99% (ok: 301; fail: 0; left: 1) [19:00:41] * no_justification uses his secret backdoors to steal scap [19:00:42] kidding! [19:00:44] (or am I?) 
[19:01:30] Hah it's mwdebug1002 of all boxes that's the holdout [19:02:04] Looks like it's working hard [19:02:15] !log catrope@tin Finished scap: Patches for T176302, T175962, T173533 (duration: 21m 29s) [19:02:21] Yay [19:02:24] no_justification: All yours [19:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:32] T175962: Issue with maintenance script: SELECTing revisions with high rev_id is painfully slow - https://phabricator.wikimedia.org/T175962 [19:02:32] T176302: Mentions not displaying properly when replying in flow - https://phabricator.wikimedia.org/T176302 [19:02:33] T173533: No longer block user from clicking results when page is loading. - https://phabricator.wikimedia.org/T173533 [19:04:09] RoanKattouw: Actually, I've got my full 2h window and probably gonna wait a bit anyway [19:04:13] I'm finishing my lunch :) [19:04:14] OK cool [19:04:17] And writing an e-mail :p [19:07:15] MaxSem: Your thing should be deployed now [19:07:23] thanks RoanKattouw [19:09:37] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3622407 (10Dzahn) So i ran a fresh cumin command to generate a current list (after quite a few have already been closed after mailing people). I used `[neodymium:~] $ sudo -i c... [19:17:29] (03CR) 10Chad: [C: 031] Drop trebuchet::packages [puppet] - 10https://gerrit.wikimedia.org/r/379180 (owner: 10Muehlenhoff) [19:19:29] (03CR) 10Chad: [C: 032] group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379253 (owner: 10Chad) [19:21:25] 2017-09-20T19:20:23 hhvm INFO - mw1243 /bin/bash: svn: command not found [19:21:26] Wut? [19:21:28] svn? 
[19:21:51] (03Merged) 10jenkins-bot: group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379253 (owner: 10Chad) [19:22:02] (03CR) 10jenkins-bot: group1 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379253 (owner: 10Chad) [19:24:56] !log demon@tin Synchronized php: symlink swap for wmf.19 (duration: 00m 49s) [19:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:03] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.19 [19:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:15] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3622472 (10Dzahn) ``` [neodymium:~] $ sudo cat /root/screen-hosts ``` for a list of _just_ the hostnames that have one or more screens currently, much easier to read to get a... [19:39:07] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4008.ulsfo.wmnet [19:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:49] (03PS7) 10Dzahn: icinga: initial whitelist for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) [19:41:59] (03PS1) 10Eevans: Update collector to 4.1.0 [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/379302 (https://phabricator.wikimedia.org/T171772) [19:43:05] (03PS3) 10MarcoAurelio: Create a 'patroller' user group at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378695 (https://phabricator.wikimedia.org/T176079) [19:43:08] (03CR) 10Eevans: [V: 032 C: 032] Update collector to 4.1.0 [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/379302 (https://phabricator.wikimedia.org/T171772) (owner: 10Eevans) [19:47:36] !log eevans@tin Started deploy [cassandra/metrics-collector@df909a1]: Pushing out 4.1.0 release [19:47:53] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:23] !log eevans@tin Finished deploy [cassandra/metrics-collector@df909a1]: Pushing out 4.1.0 release (duration: 00m 47s) [19:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:57] (03PS6) 10ArielGlenn: Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528) [19:50:22] (03CR) 10jerkins-bot: [V: 04-1] Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [19:50:34] 10Operations, 10ops-ulsfo, 10netops: connect new office link to asw-ulsfo - https://phabricator.wikimedia.org/T176350#3622485 (10RobH) [19:53:09] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4006.ulsfo.wmnet [19:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:26] !log eevans@tin Started deploy [cassandra/metrics-collector@df909a1] (dev-cluster): Deploy 4.1.0 artifact to dev cluster [19:54:36] !log eevans@tin Finished deploy [cassandra/metrics-collector@df909a1] (dev-cluster): Deploy 4.1.0 artifact to dev cluster (duration: 00m 10s) [19:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:21] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4016.ulsfo.wmnet [19:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:35] (03PS7) 10ArielGlenn: Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528) [19:57:50] (03PS1) 10Eevans: Link-in upgraded cassandra-metrics-collector jar [puppet] - 
10https://gerrit.wikimedia.org/r/379305 (https://phabricator.wikimedia.org/T171772) [19:58:04] (03PS2) 10Ottomata: flock before attempting to run rsync of published-datasets [puppet] - 10https://gerrit.wikimedia.org/r/379234 (https://phabricator.wikimedia.org/T174756) [19:58:08] (03CR) 10Ottomata: [V: 032 C: 032] flock before attempting to run rsync of published-datasets [puppet] - 10https://gerrit.wikimedia.org/r/379234 (https://phabricator.wikimedia.org/T174756) (owner: 10Ottomata) [19:58:38] (03PS8) 10Dzahn: icinga: initial whitelist for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) [20:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Time to pull up your socks and deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / …. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T2000). [20:00:07] No patches in the queue for this window. Wheeee! [20:00:20] no parsoid deploy today [20:00:27] No ORES today [20:01:57] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:02:34] (03PS2) 10Eevans: Link-in upgraded cassandra-metrics-collector jar [puppet] - 10https://gerrit.wikimedia.org/r/379305 (https://phabricator.wikimedia.org/T171772) [20:03:46] bah, how do I get rid of these 'Empty Space' blocks in grafana? [20:04:58] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:06:24] (03CR) 10Dzahn: "hey, so i reduced this list to: build hosts, all mariadb:* roles, puppetmaster frontend-only and restbase-dev/test. ok as first step?" 
[puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [20:08:41] (03CR) 10Eevans: [C: 031] "The jar file is already deployed on the restbase-ng and dev clusters, this changeset will just link it in place; [PC output](http://puppet" [puppet] - 10https://gerrit.wikimedia.org/r/379305 (https://phabricator.wikimedia.org/T171772) (owner: 10Eevans) [20:08:51] (03PS1) 10Ottomata: [WIP] Port statsv from kafka analytics to kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/379308 (https://phabricator.wikimedia.org/T176352) [20:09:06] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3622579 (10Dzahn) I reduced the initial whitelist to: build hosts, all mariadb:* roles, puppetmaster frontend-only and restbase-dev/test. ok as first step? Of course we can (an... [20:09:12] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Port statsv from kafka analytics to kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/379308 (https://phabricator.wikimedia.org/T176352) (owner: 10Ottomata) [20:11:06] (03CR) 10MarcoAurelio: [C: 031] Import sources on hr.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379254 (https://phabricator.wikimedia.org/T176320) (owner: 10Jayprakash12345) [20:11:28] (03CR) 10MarcoAurelio: [C: 031] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379254 (https://phabricator.wikimedia.org/T176320) (owner: 10Jayprakash12345) [20:12:09] (03PS3) 10Dzahn: cassandra: Link-in upgraded cassandra-metrics-collector jar [puppet] - 10https://gerrit.wikimedia.org/r/379305 (https://phabricator.wikimedia.org/T171772) (owner: 10Eevans) [20:12:48] (03PS8) 10ArielGlenn: Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528) [20:13:41] (03PS2) 10Ottomata: [WIP] Port statsv from kafka analytics to kafka jumbo 
[puppet] - 10https://gerrit.wikimedia.org/r/379308 (https://phabricator.wikimedia.org/T176352) [20:20:43] (03PS6) 10Zoranzoki21: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [20:21:25] (03PS7) 10Zoranzoki21: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: 10Dereckson) [20:30:06] (03PS10) 10MarcoAurelio: Cloud VPS configuration for hi.wikivoyage [puppet] - 10https://gerrit.wikimedia.org/r/371096 (https://phabricator.wikimedia.org/T173013) [20:31:18] (03Abandoned) 10Zoranzoki21: Add new throttle rules.. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378393 (https://phabricator.wikimedia.org/T176037) (owner: 10Zoranzoki21) [20:31:50] andrewbogott: hi, does https://gerrit.wikimedia.org/r/#/c/371096/ need to wait until the wiki is created? [20:32:38] tabbycat: I think so, yes [20:32:43] ok :) [20:33:31] (03PS1) 10Framawiki: Enable Extension:Newsletter on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379316 (https://phabricator.wikimedia.org/T176199) [20:33:42] !log demon@tin Synchronized php-1.30.0-wmf.19/includes/diff/DifferenceEngine.php: fix some warnings (duration: 00m 51s) [20:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:46] !log bsitzmann@tin Started deploy [mobileapps/deploy@e23bf66]: Update mobileapps to d46860e (T175759 T176263) [20:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:02] T175759: Add scowiki TFA to Explore feed - https://phabricator.wikimedia.org/T175759 [20:35:02] T176263: Certain image URL schemes are being (re)written to 'http' in production - https://phabricator.wikimedia.org/T176263 [20:35:19] !log increase max recovery bytes/s to 80mb in eqiad elasitcsearch cluster [20:35:32] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:04] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4014.ulsfo.wmnet [20:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:36] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379316 (https://phabricator.wikimedia.org/T176199) (owner: 10Framawiki) [20:37:41] (03PS1) 10Smalyshev: Adding mlwiki to categories, by request. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379323 [20:39:03] (03CR) 10Zoranzoki21: [C: 031] Create a 'patroller' user group at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378695 (https://phabricator.wikimedia.org/T176079) (owner: 10MarcoAurelio) [20:40:51] !log bsitzmann@tin Finished deploy [mobileapps/deploy@e23bf66]: Update mobileapps to d46860e (T175759 T176263) (duration: 06m 05s) [20:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:06] T175759: Add scowiki TFA to Explore feed - https://phabricator.wikimedia.org/T175759 [20:41:06] T176263: Certain image URL schemes are being (re)written to 'http' in production - https://phabricator.wikimedia.org/T176263 [20:47:52] (03CR) 10Luke081515: [C: 031] Create a 'patroller' user group at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378695 (https://phabricator.wikimedia.org/T176079) (owner: 10MarcoAurelio) [20:48:15] (03PS1) 10Chad: Stop sending CSP reports to logstash for now, spams my graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379366 [20:53:34] (03CR) 10jerkins-bot: [V: 04-1] Stop sending CSP reports to logstash for now, spams my graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379366 (owner: 10Chad) [20:58:08] (03CR) 10Jcrespo: "The idea of this patch is to have a proof of concept (very simple, but working) on dbstore2001 (plus dbstore2002) that basically does the " [puppet] - 
10https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [20:58:38] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4007.ulsfo.wmnet [20:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:26] (03PS1) 10Andrew Bogott: fullstack: optionally clean up leaked VMs after a point [puppet] - 10https://gerrit.wikimedia.org/r/379388 (https://phabricator.wikimedia.org/T167556) [20:59:52] (03CR) 10Jcrespo: "typo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [21:02:13] (03PS2) 10Andrew Bogott: fullstack: optionally clean up leaked VMs after a point [puppet] - 10https://gerrit.wikimedia.org/r/379388 (https://phabricator.wikimedia.org/T167556) [21:03:57] 10Operations, 10Ops-Access-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3622856 (10Framawiki) [21:09:49] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4015.ulsfo.wmnet [21:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:25] 10Operations, 10ops-ulsfo, 10Traffic: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3622887 (10BBlack) [21:15:06] 10Operations, 10ops-ulsfo, 10Traffic: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3622901 (10BBlack) [just pre-creating the task, we're not quite ready to take action yet. These systems are now depooled, but we'll wait a few days before un-configuring in case a reason to repoo... 
[21:16:18] (03CR) 10Brian Wolff: [C: 031] "Just +1 to say i agree with making csp log to text file only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379366 (owner: 10Chad) [21:19:34] (03CR) 10Dzahn: [C: 032] "compiler output confirms it only affects the "prod ng" nodes and not the regular "prod" nodes" [puppet] - 10https://gerrit.wikimedia.org/r/379305 (https://phabricator.wikimedia.org/T171772) (owner: 10Eevans) [21:21:33] mutante: thanks! [21:21:48] yw :) [21:25:22] (03CR) 10Dzahn: "re "while we fix the scap repos" not sure if that is already fixed now or not" [puppet] - 10https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [21:27:02] 10Operations, 10DBA, 10Patch-For-Review: decommission db1036 - https://phabricator.wikimedia.org/T176311#3622925 (10jcrespo) partitioning finished, db1101 should be ready to be pooled as the new special slave. [21:27:47] (03CR) 10Dzahn: [C: 04-1] "added jcrespo. this is more "fyi", so i wanted to suggest this option at first to turn it into warn-only for everything but now i would in" [puppet] - 10https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [21:28:03] (03CR) 10BryanDavis: [C: 031] "+1 for concept and configuration flexibility. I don't have the exim skills to evaluate the implementation." [puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: 10Herron) [21:29:59] (03CR) 10Dzahn: "does it have a little value in itself to make it _possible_ to skip monitoring even if we don't use it ..? 
or should i just abandon it now" [puppet] - https://gerrit.wikimedia.org/r/368124 (https://phabricator.wikimedia.org/T151632) (owner: Dzahn) [21:30:52] jouncebot: next [21:30:52] In 1 hour(s) and 29 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T2300) [21:31:10] jouncebot: num patches [21:33:09] (PS2) Dzahn: copy squid.php->reverse-proxy.php, squid-labs->reverse-proxy-staging [mediawiki-config] - https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) [21:33:39] Operations, Performance-Team, Traffic, Varnish, Wikimedia-Incident: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#3622947 (greg) [21:34:32] no_justification, i wonder should we enable the ui for the slave? [21:34:33] (PS1) Dmaza: Enable user email blacklist on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/379418 (https://phabricator.wikimedia.org/T174694) [21:34:54] there's a flag for that. [21:35:17] ah ! [21:35:22] would that get us the readonly view [21:35:27] instead of the error page [21:36:05] yeh [21:36:31] that would be nice [21:36:39] * paladox submit patch [21:36:44] i was about to change the error page but in that case i won't bother [21:38:06] (CR) Dzahn: "thanks Dereckson, i will add it to a SWAT window like tomorrow" [mediawiki-config] - https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) (owner: Dzahn) [21:39:14] (Draft1) Paladox: Gerrit: Enable ui for slaves [puppet] - https://gerrit.wikimedia.org/r/379420 [21:39:17] (PS2) Paladox: Gerrit: Enable ui for slaves [puppet] - https://gerrit.wikimedia.org/r/379420 [21:39:18] mutante ^^ [21:41:07] hmm, the ui starts, but shows https://gerrit2.git.wmflabs.org/r/ [21:41:28] hmm, what is it trying to find ?:) [21:42:37] it is trying to load status:open [21:50:03] (CR) Dzahn: [C: 2] "compiled.
lgtm http://puppet-compiler.wmflabs.org/7956/ it adds all the things needed for aphlict but also stays " aphlict_enabled => Fal" [puppet] - https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) (owner: 20after4) [21:50:38] (PS5) Dzahn: Phabricator: configure notification server [puppet] - https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) (owner: 20after4) [21:51:01] (PS6) 20after4: Phabricator: configure notification server [puppet] - https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) [21:51:38] (CR) 20after4: "Indeed it stays disabled, though I think we should try enabling it so that there will be an endpoint to point the proxies at." [puppet] - https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) (owner: 20after4) [21:52:54] (CR) 20after4: [C: 1] Phabricator: Override the frog token's label [puppet] - https://gerrit.wikimedia.org/r/371660 (https://phabricator.wikimedia.org/T173208) (owner: Greg Grossmeier) [21:52:57] (CR) Dzahn: [C: 2] "ok, i'm submitting this one first :)" [puppet] - https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) (owner: 20after4) [21:54:18] !log phab: disable puppet on phab1001 for a minute, apply notification server change on phab2001 [21:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:52] (CR) Dzahn: "Error: Could not create user aphlict: Execution of '/usr/sbin/useradd -g aphlict -d /var/run/aphlict -s /bin/false -r aphlict' returned 6:" [puppet] - https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) (owner: 20after4) [21:56:06] Operations, Maps-Sprint, Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3622975 (Pnorman) We were unable to reproduce the errors in the tilerator log on 2004, and 2003 worked without
them, so I think 6.11 is good to go. @gehel found... [21:58:12] heh, the systemd service for the phab change just merged has this [21:58:14] /srv/phab/phabricator//bin/aphlict [21:58:18] two slashes [21:58:22] it still works though [21:58:54] a) not applied on phab1001 b) has another issue, cant create aphlict user because group aphlict doesnt exist [22:00:02] but eh. yea, there is a group{} in the manifest too [22:00:21] thanks for catching the slashes though [22:02:25] Resources only in the new catalog: Group[aphlict] [22:02:35] it's right there. ..yet it isnt [22:03:06] Operations, Performance-Team, Traffic, Wikimedia-Incident: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#3622996 (Krinkle) [22:03:30] and the user has a require on the group [22:09:45] jouncebot: next [22:09:46] In 0 hour(s) and 50 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T2300) [22:09:55] 50 minutes still, geez [22:10:06] I'm falling asleep :S [22:10:16] heh [22:10:33] slave starts ssh for me [22:10:45] so it must be something on gerrit2001 preventing ssh from starting [22:10:48] systemd script works [22:12:09] paladox: on port 29148 ?
[22:12:11] 29418 [22:12:15] yewh [22:12:16] yeh [22:12:20] logs show it starting [22:12:23] and not showing it failed [22:12:45] (PS1) Dzahn: phab/aphlict: use groups instead of gid, remove require [puppet] - https://gerrit.wikimedia.org/r/379422 (https://phabricator.wikimedia.org/T765) [22:12:50] ok, now try restarting the gerrit service with systemctl [22:13:01] cause that's when it started breaking [22:13:11] (CR) jerkins-bot: [V: -1] phab/aphlict: use groups instead of gid, remove require [puppet] - https://gerrit.wikimedia.org/r/379422 (https://phabricator.wikimedia.org/T765) (owner: Dzahn) [22:13:21] ok [22:14:16] wrks [22:14:17] works [22:14:18] mutante ^^ [22:14:19] systemctl restart gerrit works [22:14:47] Operations, HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3623003 (bd808) [22:14:49] Operations, HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3623004 (bd808) [22:15:08] paladox: and gerrit-ssh is still running too? [22:15:16] on same port [22:15:34] sudo netstat -plnt | grep ':29148' [22:15:38] shows no ports [22:15:43] but logs say something different [22:15:55] 29418 ? [22:15:55] [2017-09-20 22:13:54,651] [main] INFO com.google.gerrit.sshd.SshDaemon : Started Gerrit SSHD-CORE-1.2.0 on *:29418 [22:15:55] [2017-09-20 22:13:54,652] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 2.13.9-2-g99a8c8bc51-dirty ready [22:15:57] yeh [22:16:13] so it's in netstat output both before and after the restart? [22:16:39] nice @ "-dirty" at the end of a version string ...
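[editor's note] The netstat check above greps for ':29148', while the Gerrit log reports SSHD on *:29418, so the empty result may simply reflect the transposed digits. As a minimal sketch of a way to cross-check without depending on netstat output, the snippet below attempts a TCP connection to each port. The helper name port_is_listening and the choice of 127.0.0.1 are assumptions made for illustration; they are not part of any Gerrit or Wikimedia tooling.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration only.
    """
    try:
        # create_connection raises OSError (refused, timeout, unreachable)
        # when nothing accepts connections on the target port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 29418 is the Gerrit SSH port named in the log line above;
    # 29148 is the transposed number used in the grep.
    for port in (29418, 29148):
        state = "listening" if port_is_listening("127.0.0.1", port) else "not listening"
        print(f"127.0.0.1:{port} {state}")
```

A successful connect only shows that something accepts TCP connections on that port; it does not confirm that the listener is Gerrit's SSH daemon.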
[22:17:06] (CR) Chad: Gerrit: Enable ui for slaves (1 comment) [puppet] - https://gerrit.wikimedia.org/r/379420 (owner: Paladox) [22:17:23] mutante we have a custom patch applied [22:17:28] to prevent login problems [22:18:07] (PS3) Paladox: Gerrit: Enable ui for slaves [puppet] - https://gerrit.wikimedia.org/r/379420 [22:18:11] (CR) Paladox: Gerrit: Enable ui for slaves (1 comment) [puppet] - https://gerrit.wikimedia.org/r/379420 (owner: Paladox) [22:18:24] mutante need to check before [22:18:48] (PS2) Dzahn: phab/aphlict: remove require for group by user [puppet] - https://gerrit.wikimedia.org/r/379422 (https://phabricator.wikimedia.org/T765) [22:19:52] Operations, MediaWiki-Platform-Team, TechCom-RfC: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3623015 (tstarling) [22:20:17] Operations, MediaWiki-Platform-Team, TechCom-RfC: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3623029 (tstarling) [22:20:44] mutante doesn't work before either with netstat [22:20:47] though ssh works [22:20:49] i think [22:21:02] paladox: i doubt it if it doesnt show up in netstat output [22:21:41] works [22:21:44] mutante ssh works [22:21:51] i just cloned [22:21:51] https://gerrit2.git.wmflabs.org/r/#/admin/projects/test [22:21:54] using my ssh key [22:22:01] oh [22:22:07] lol it was set to anon [22:22:08] https isnt ssh:// [22:22:33] works [22:22:48] !log elasticsearch eqiad completely recovery, all green. Returning recovery max_bytes_per_sec to 20mb [22:22:49] paladox: on port 29418 ? [22:22:56] (PS3) Dzahn: phab/aphlict: remove require for group by user [puppet] - https://gerrit.wikimedia.org/r/379422 (https://phabricator.wikimedia.org/T765) [22:23:02] https://phabricator.wikimedia.org/P6032 [22:23:04] yeh [22:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:06] mutante ^^ [22:23:52] paladox: ok, yes.
restart gerrit service again ..confirm another time [22:23:58] ok [22:24:24] also, do you have /etc/init.d/gerrit existing or removed on that instance [22:24:38] still works [22:24:38] :) [22:24:40] and did you use "systemctl restart" [22:24:52] ok :) [22:25:00] /etc/init.d/gerrit should have been installed by the package [22:25:04] and yeh [22:25:09] systemctl restart gerrit [22:25:24] alright [22:25:40] it must be something blocking the port on gerrit2001 [22:28:31] paladox: i can use that port just fine when i listen on it myself with netcat [22:29:03] hmm [22:29:08] netcat [22:29:14] nc -l -p 29418 = server [22:33:10] (CR) Dzahn: [C: 2] phab/aphlict: remove require for group by user [puppet] - https://gerrit.wikimedia.org/r/379422 (https://phabricator.wikimedia.org/T765) (owner: Dzahn) [22:34:20] Operations, MediaWiki-Platform-Team, Performance-Team, HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786#3623086 (MaxSem) This probably needs to be closed due to {T176370} - unless someone thinks that HHVM's PHP7 mode can... [22:36:58] Operations, MediaWiki-Platform-Team, Performance-Team, HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786#3623091 (Jdforrester-WMF) Duped in or sub-tree'ed, yes. [22:37:23] (PS1) Smalyshev: Make using CirrusSearch engine default for wbsearchentities [mediawiki-config] - https://gerrit.wikimedia.org/r/379426 (https://phabricator.wikimedia.org/T175741) [22:37:46] aha [22:39:37] (CR) Gergő Tisza: "> I agree with Pmiazga, this should be permission based. I don't" [mediawiki-config] - https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: Gergő Tisza) [22:40:17] paladox: where does that wrong IP come from ?:) good catch [22:40:20] Hiera?
[22:40:44] hmm not sure [22:40:45] https://github.com/wikimedia/puppet/search?utf8=✓&q=208.80.153.74&type= [22:41:48] (CR) Gergő Tisza: "> You can always throw an ErrorPageError." [mediawiki-config] - https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: Gergő Tisza) [22:42:08] yea, tegmen. so far so good, but where does gerrit2001 take that from [22:42:11] looking [22:42:22] hmm [22:42:36] no_justification ^^ [22:42:40] i think we found the problem [22:42:43] 208.80 rings WMF Labs to me [22:45:02] mutante it's trying to connect to gerrit [22:45:19] so may not be a specific gerrit config [22:45:29] but rather a config telling tegmen to look at gerrit [22:45:51] though it could be the icinga checking the port [22:46:00] like, Icinga monitoring if gerrit-ssh is up :) [22:46:19] yeh [22:46:41] also it was like 10 minutes before the shutdown happened [22:46:43] like this one [22:46:43] https://github.com/wikimedia/puppet/blob/3112c8b002996e228474361b9cd54755718e04ab/modules/profile/manifests/gerrit/server.pp#L30 [22:47:13] jouncebot: next [22:47:14] In 0 hour(s) and 12 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T2300) [22:47:20] oh men [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170920T2300). [23:00:05] tabbycat, Smalyshev, and DMaza: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker.
[23:00:16] HI [23:00:18] oops [23:00:24] o/ [23:00:26] o/ [23:00:56] here [23:03:02] Here [23:03:15] There is something a little worrying [23:03:17] (CR) 20after4: "odd, but ok. :) Thanks for fixing this." [puppet] - https://gerrit.wikimedia.org/r/379422 (https://phabricator.wikimedia.org/T765) (owner: Dzahn) [23:03:23] we've an event tomorrow [23:03:52] (PS4) Gergő Tisza: Temporarily prevent users from accessing Special:RenderBook/test [mediawiki-config] - https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) [23:03:55] I prepared a change for it, but some new contributor wanted to improve it, and bumped to phpcs errors, so it wasn't swattable [23:04:29] RoanKattouw: will you SWAT on this window? :) [23:04:31] I imagined they would follow the full procedure and subscribe the change to the morning or evening SWAT. [23:04:41] If no one else is going to, I can do it [23:05:02] Dereckson: phpcs errors are easy to fix though right? [23:05:16] It's because my eyes are falling apart and I'd love to go to sleep. Sorry for being so selfish. [23:05:19] Yes, but I was mostly annoyed of the new patchset during a deployment [23:05:27] and surprise I CR+2 I got a -1 [23:06:09] (CR) Dereckson: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: Dereckson) [23:06:29] PS7, wonderful, thanks Zoranzoki21 [23:06:36] OK, I'll SWAT [23:06:51] tabbycat's change first [23:07:05] tabbycat: could you take care of Zoranzoki21 [23:07:13] er... of https://gerrit.wikimedia.org/r/379224 ?
[23:07:18] (CR) Catrope: [C: 2] Create a 'patroller' user group at Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/378695 (https://phabricator.wikimedia.org/T176079) (owner: MarcoAurelio) [23:07:25] lol [23:07:39] Dereckson: we can discuss it in private ;) [23:09:02] tabbycat: yes but that was a copy paste error ^^ I meant the change, could you add it to the Deployments page and ask for deploy at this SWAT please? I'm not currently available. [23:09:10] (CR) jerkins-bot: [V: -1] Add John Michael Kohler Art Center throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: Dereckson) [23:09:15] ^^ [23:09:22] * Dereckson shrugs [23:09:45] It's annoying not to be able to rebase changes from somebody else [23:09:53] maybe some of us should be able to [23:10:09] I think I can [23:10:16] I'm a Gerrit admin though, so maybe that's why :) [23:10:34] (Merged) jenkins-bot: Create a 'patroller' user group at Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/378695 (https://phabricator.wikimedia.org/T176079) (owner: MarcoAurelio) [23:10:36] Dereckson: You mean the Kohler Art center thing? [23:10:41] yep, gerrit admins and deployers can [23:10:44] (CR) jenkins-bot: Create a 'patroller' user group at Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/378695 (https://phabricator.wikimedia.org/T176079) (owner: MarcoAurelio) [23:11:03] RoanKattouw: let me know on which mwdebug you'll set it so I can test it :) [23:11:11] (CR) Dereckson: "I'm going to revert this to PS1, as it was correct for phpcs, before Zoran tried to improve this."
[mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: Dereckson) [23:11:15] tabbycat: Your metawiki patroller change is now on 1002 [23:11:23] !log phab2001 - groupadd aphlict,temp adduser aphlict to debug user creation issue [23:11:25] testing [23:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:51] RoanKattouw: ah, sorry, I'm here too [23:12:13] RoanKattouw: looks good to me [23:12:16] RoanKattouw: for https://gerrit.wikimedia.org/r/#/c/379323/ :) really simple one [23:12:59] tabbycat: OK syncing [23:13:01] (PS8) Dereckson: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) [23:13:08] Next is SMalyshev [23:13:17] (CR) Catrope: [C: 2] Adding mlwiki to categories, by request. [mediawiki-config] - https://gerrit.wikimedia.org/r/379323 (owner: Smalyshev) [23:13:19] \o/ [23:13:25] Is that patch testable on 1002? [23:13:34] Or is it to do with indexing scripts? [23:13:38] not really...
it's just a dblist change [23:13:40] OK [23:13:43] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add patroller group on metawiki (T176079) (duration: 00m 48s) [23:13:45] it influences weekly dumps [23:13:49] I'll just sync it straight out then [23:13:52] yep [23:13:54] thanks [23:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:58] T176079: New 'patroller' user group for Meta-Wiki - https://phabricator.wikimedia.org/T176079 [23:14:45] !log phab2001 - deluser aphlict, delgroup aphlict, try to let puppet create it again, still fails, just won't create the group [23:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:10] Urgh we really need mw-config changes to be in gate-and-submit-swat [23:15:33] Or just more nodepool capacity, just 3 half-finished jobs are starving out everything else [23:16:34] (CR) jerkins-bot: [V: -1] Add John Michael Kohler Art Center throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: Dereckson) [23:16:46] (Draft1) Paladox: phab/aphlict: fix puppet [puppet] - https://gerrit.wikimedia.org/r/379436 [23:16:48] (PS2) Paladox: phab/aphlict: move group up a bit in code [puppet] - https://gerrit.wikimedia.org/r/379436 [23:17:06] and works as expected as well [23:18:07] Operations, Analytics, Analytics-Cluster, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3623169 (dr0ptp4kt) @Ottomata do you know if it was an s9150, or was it a W9100? I'm seeing the `04:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/AT... [23:18:15] (PS9) Dereckson: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) [23:18:36] (Merged) jenkins-bot: Adding mlwiki to categories, by request.
[mediawiki-config] - https://gerrit.wikimedia.org/r/379323 (owner: Smalyshev) [23:18:50] (CR) jenkins-bot: Adding mlwiki to categories, by request. [mediawiki-config] - https://gerrit.wikimedia.org/r/379323 (owner: Smalyshev) [23:18:55] I think IP do not need to be put into [ ] dereck [23:18:58] Dereckson: [23:19:12] That was the edit Zoran wanted to do. [23:19:37] I always used IP => '123.456.78.90' and it worked fine [23:19:42] * Dereckson would delete the IP string and always make that an array, to simplify throttle code and decrease the complexity [23:19:44] see the removed rules, for example [23:20:17] in any case, I'm sorry but I cannot take that patch now, I really need to go [23:20:21] A data structure should have a unified format. [23:20:48] sure, I just followed what other rules used [23:21:02] if the structure changes, I'll follow the new structure [23:21:02] Something sometimes a string, sometimes an array of string isn't strongly typed, and that makes the code more complex, without any real reason. [23:21:11] (PS2) Catrope: Enable user email blacklist on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/379418 (https://phabricator.wikimedia.org/T174694) (owner: Dmaza) [23:21:13] !log catrope@tin Synchronized dblists/categories-rdf.dblist: Add mlwiki to categories-rdf.dblist (duration: 00m 48s) [23:21:16] (CR) Catrope: [C: 2] Enable user email blacklist on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/379418 (https://phabricator.wikimedia.org/T174694) (owner: Dmaza) [23:21:24] Good night everybody [23:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:29] tabbycat: don't worry, currently, the two formats work. There is a check made to determine if it's a string or an array [23:21:49] Dereckson: what I don't understand is what throttle-analyze is for [23:21:59] Dereckson, tabbycat : Do you want that throttle change deployed now?
Happy to do it [23:22:06] RoanKattouw: yes, you can deploy it [23:22:15] RoanKattouw: what Dereckson says [23:22:17] as it's for tomorrow, it's probably best to do at this SWAT [23:22:27] it passed jenkins? [23:22:39] yes yes [23:23:02] (Merged) jenkins-bot: Enable user email blacklist on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/379418 (https://phabricator.wikimedia.org/T174694) (owner: Dmaza) [23:23:11] (CR) jenkins-bot: Enable user email blacklist on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/379418 (https://phabricator.wikimedia.org/T174694) (owner: Dmaza) [23:23:19] okay so we can do it [23:24:27] DMaza: Your user email blacklist patch is on mwdebug1002, please test [23:24:42] (PS10) Catrope: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: Dereckson) [23:24:48] testing [23:24:54] (CR) Catrope: [C: 2] Add John Michael Kohler Art Center throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: Dereckson) [23:25:45] well, that rule is not testable RoanKattouw so if logstash/fatalmonitor doesn't complain I'd sync it right away [23:25:50] unless Dereckson says otherwise [23:25:56] Cool [23:26:02] I'll note that patch on wikitech [23:26:06] Thanks [23:26:18] no, thanks to you both [23:26:49] looks good to me [23:27:30] (Merged) jenkins-bot: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: Dereckson) [23:27:43] (CR) jenkins-bot: Add John Michael Kohler Art Center throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/379224 (https://phabricator.wikimedia.org/T176287) (owner: Dereckson) [23:29:13] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable user email blacklist on
meta (T174694) (duration: 00m 48s) [23:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:29] T174694: Enable Special:EmailUser User Prohibit on Meta - https://phabricator.wikimedia.org/T174694 [23:29:32] Dereckson: I added the patch on wikitech [23:30:49] DMaza: how can we use that new feature, via preferences? [23:30:55] Thanks [23:30:58] yes [23:31:12] Does it mean that users you've blacklisted from notifications now also can't email you any more? [23:31:15] tabbycat, preferences, email options section [23:31:18] (PS1) Dzahn: phab/aphlict: ensure aphlict group always exists [puppet] - https://gerrit.wikimedia.org/r/379439 (https://phabricator.wikimedia.org/T765) [23:31:22] Or is it a separate blacklist? [23:31:30] it is a separate blacklist [23:31:34] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add throttle rule for John Michael Kohler Art Center (T176287) (duration: 00m 48s) [23:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:50] T176287: Lift IP cap for account creation for John Michael Kohler Art Center - Thur Sept 21, Sun Sept 24 & Tues Sept 26.
- https://phabricator.wikimedia.org/T176287 [23:31:56] DMaza: see it, thanks :) [23:31:57] (PS3) Catrope: Enable jQuery 3 on commons.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/378803 (https://phabricator.wikimedia.org/T124742) (owner: Krinkle) [23:32:02] (CR) Catrope: [C: 2] Enable jQuery 3 on commons.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/378803 (https://phabricator.wikimedia.org/T124742) (owner: Krinkle) [23:32:03] It prevents receiving emails via SpecialPage:EmailUser [23:32:16] :) [23:32:22] thank you guys [23:32:22] (CR) Dzahn: [C: 2] phab/aphlict: ensure aphlict group always exists [puppet] - https://gerrit.wikimedia.org/r/379439 (https://phabricator.wikimedia.org/T765) (owner: Dzahn) [23:33:25] (CR) Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/379439/" [puppet] - https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) (owner: 20after4) [23:33:57] DMaza: very useful indeed [23:34:03] thanks to you [23:34:12] and I'm really off to bed right f. now [23:34:15] :) [23:34:57] (Merged) jenkins-bot: Enable jQuery 3 on commons.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/378803 (https://phabricator.wikimedia.org/T124742) (owner: Krinkle) [23:35:36] tabbycat, thanks.
the credit goes to davidbarratt, he is the one that worked on it [23:35:59] (CR) jenkins-bot: Enable jQuery 3 on commons.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/378803 (https://phabricator.wikimedia.org/T124742) (owner: Krinkle) [23:36:04] !log phab1001 - re-enable puppet, run after follow-up fix, adding notification server config [23:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:24] Notice: /Stage[main]/Phabricator::Aphlict/Base::Service_unit[aphlict]/Service[aphlict]/ensure: ensure changed 'stopped' to 'running' [23:36:27] Info: /Stage[main]/Phabricator::Aphlict/Base::Service_unit[aphlict]/Service[aphlict]: Unscheduling refresh on Service[aphlict] [23:36:30] twentyafterfour: ^ [23:39:58] Krinkle: ebernhardson: Your changes are live on mwdebug1002, please test [23:40:42] !log phabricator: aphlict (notification service) now running on prod server [23:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:30] RoanKattouw: looking [23:45:14] RoanKattouw: looks great [23:46:20] Krinkle: What about you? [23:47:42] !log catrope@tin Synchronized php-1.30.0-wmf.19/extensions/WikimediaEvents/: Log JS errors during search satisfaction tests (duration: 00m 49s) [23:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:41] !log catrope@tin Synchronized php-1.30.0-wmf.18/extensions/WikimediaEvents/: Log JS errors during search satisfaction tests (duration: 00m 48s) [23:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:05] Hah I completely forgot my own patch [23:53:47] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.