[00:01:06] (03PS11) 10EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) [00:01:06] (03PS4) 10EBernhardson: Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) [00:01:08] (03PS22) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) [00:01:38] (03PS1) 10Dzahn: netbox: have Bacula backups of netbox hosts [puppet] - 10https://gerrit.wikimedia.org/r/447744 (https://phabricator.wikimedia.org/T190184) [00:05:35] (03PS2) 10Dzahn: netbox: have Bacula backups of netbox hosts [puppet] - 10https://gerrit.wikimedia.org/r/447744 (https://phabricator.wikimedia.org/T190184) [00:20:13] (03CR) 10Dzahn: [C: 031] "+1 afaict. as long as we test that mail still arrives after merging" [puppet] - 10https://gerrit.wikimedia.org/r/441131 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [00:28:36] (03PS1) 10Dzahn: planet: drop feed templates for planet-venus/stretch [puppet] - 10https://gerrit.wikimedia.org/r/447746 [00:29:10] (03CR) 10jerkins-bot: [V: 04-1] planet: drop feed templates for planet-venus/stretch [puppet] - 10https://gerrit.wikimedia.org/r/447746 (owner: 10Dzahn) [00:29:51] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [00:34:16] (03PS2) 10Dzahn: planet: drop feed templates for planet-venus/stretch [puppet] - 10https://gerrit.wikimedia.org/r/447746 [00:35:55] (03PS23) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) [00:35:57] (03PS23) 10EBernhardson: convert role::logstash::elasticsearch to profiles 
[puppet] - 10https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351) [00:37:15] (03CR) 10jerkins-bot: [V: 04-1] Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [00:38:40] (03CR) 10jerkins-bot: [V: 04-1] convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [00:44:27] (03PS1) 10Dzahn: netbox: define a Bacula fileset and apply it [puppet] - 10https://gerrit.wikimedia.org/r/447747 (https://phabricator.wikimedia.org/T190184) [00:55:49] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:59:08] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [01:00:10] (03PS24) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) [01:00:12] (03PS24) 10EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351) [01:01:03] (03CR) 10Dzahn: [C: 032] netbox: have Bacula backups of netbox hosts [puppet] - 10https://gerrit.wikimedia.org/r/447744 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [01:04:37] (03CR) 10Dzahn: [C: 032] "the needed profile was already applied via role "netmon" which includes "netbox", but i kept it anyways because it doesn't hurt and means " [puppet] - 10https://gerrit.wikimedia.org/r/447744 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [01:05:17] (03PS2) 10Dzahn: netbox: define a Bacula fileset and apply it [puppet] - 10https://gerrit.wikimedia.org/r/447747 
(https://phabricator.wikimedia.org/T190184) [01:05:48] (03PS27) 10EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 (https://phabricator.wikimedia.org/T198351) [01:05:50] (03PS30) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) [01:05:52] (03PS58) 10EBernhardson: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) [01:05:54] (03PS3) 10EBernhardson: Cleanup ensure => absent after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/444765 (https://phabricator.wikimedia.org/T198351) [01:09:23] (03PS3) 10Dzahn: netbox: define a Bacula fileset and apply it [puppet] - 10https://gerrit.wikimedia.org/r/447747 (https://phabricator.wikimedia.org/T190184) [01:10:48] (03CR) 10jerkins-bot: [V: 04-1] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [01:12:31] (03CR) 10jerkins-bot: [V: 04-1] Cleanup ensure => absent after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/444765 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [01:15:56] (03PS4) 10Dzahn: netbox: define a new Bacula fileset and apply it [puppet] - 10https://gerrit.wikimedia.org/r/447747 (https://phabricator.wikimedia.org/T190184) [01:36:21] (03CR) 10Paladox: [C: 031] planet: tune feed name, description, owneremail, maxarticles [puppet] - 10https://gerrit.wikimedia.org/r/447743 (owner: 10Dzahn) [02:10:37] 10Operations, 10Domains, 10Traffic, 10WikimediaUI Style Guide: Redirect design.wikimedia.org/style-guide/wiki/* to design.wikimedia.org/style-guide/ - https://phabricator.wikimedia.org/T200304 (10Prtksxna) p:05Triage>03Normal [02:11:30] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 
others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282 (10Prtksxna) >>! In T185282#4448556, @Dzahn wrote: > @Prtksxna Yes, i think so. Please feel free to create that subtask and assign... [02:32:46] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.13) (duration: 13m 28s) [02:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:05] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Jul 25 02:43:05 UTC 2018 (duration 10m 19s) [02:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:47:55] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/11846/" [puppet] - 10https://gerrit.wikimedia.org/r/447747 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [04:35:17] !log tstarling@deploy1001 Synchronized php-1.32.0-wmf.13/includes/api/ApiMain.php: record all API requests in statsd (duration: 00m 49s) [04:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:07] (03PS2) 10Jcrespo: mariadb: Promote es1017 as the master of es3-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) [05:04:42] (03CR) 10Jcrespo: "What is strange?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [05:05:27] (03CR) 10Marostegui: ""es1017 will be the new master of es1017-eqiad"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [05:05:58] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447751 [05:06:13] (03CR) 10Marostegui: [C: 04-1] "Wait for the failover" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447751 (owner: 10Marostegui) [05:07:09] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1091, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447751 (owner: 10Marostegui) [05:09:01] (03PS2) 10Marostegui: db-eqiad.php: Depool db1091, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447751 [05:14:23] (03CR) 10Jcrespo: "Ok, you meant the body, I was only looking at the title." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [05:15:53] (03PS3) 10Jcrespo: mariadb: Promote es1017 as the master of es3-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) [05:16:34] (03PS3) 10Jcrespo: mariadb: Promote es1017 as the master of es3-eqiad (instead of es1014) [puppet] - 10https://gerrit.wikimedia.org/r/447584 (https://phabricator.wikimedia.org/T197073) [05:26:52] (03CR) 10Jcrespo: "Please don't assume what it is immediate obvious to you it is to me- I was looking at the title and did not see anything wrong. 
Please be " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [05:27:23] (03PS2) 10Jcrespo: Setup es1017 as the backend for the es3-eqiad master [dns] - 10https://gerrit.wikimedia.org/r/447587 (https://phabricator.wikimedia.org/T197073) [05:28:08] (03CR) 10Marostegui: "> Please don't assume what it is immediate obvious to you it is to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [05:51:57] (03CR) 10Marostegui: [C: 032] mariadb: Promote es1017 as the master of es3-eqiad (instead of es1014) [puppet] - 10https://gerrit.wikimedia.org/r/447584 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [05:52:57] We are going to take over deploy1001 for the es failover - if there is anything that needs deployment, coordinate with us [05:55:39] (03CR) 10Marostegui: [C: 032] mariadb: Promote es1017 as the master of es3-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [05:56:52] (03Merged) 10jenkins-bot: mariadb: Promote es1017 as the master of es3-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [05:57:45] I will rebase on deploy1001 [05:57:50] I did already [05:57:55] oh, thanks [05:58:03] so it is now all ready [05:58:14] I can see it [05:58:44] so I guess the plan is to run on neodymium as root [05:58:57] ./switchover.py --skip-slave-move es1014 es1017 [05:59:03] sounds good! [05:59:14] and of it is successful, deploy with force [05:59:26] if it is not, I will paste the error on etherpad [05:59:32] cool [05:59:46] if it is, I will paste DONE while I force-merge [06:00:04] jynus and marostegui: #bothumor I � Unicode. All rise for Database Maintenance deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180725T0600). 
[06:00:11] \o/
[06:00:40] will start as soon as I log
[06:00:45] go!
[06:01:08] check alerts and monitoring, please, doing
[06:01:14] yeah, will do
[06:01:20] !log switchover es3 eqiad master from es1014 to es1017
[06:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:47] success
[06:02:03] errors still coming
[06:02:07] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Switchover es3 master eqiad from es1014 to es1017 (duration: 00m 24s)
[06:02:09] I know
[06:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:13] what errors?
[06:02:21] errors gone
[06:02:29] 108 errors only
[06:02:33] heartbeat started correctly on es1017
[06:02:35] so this is the mistake
[06:02:43] I ran scap --force
[06:02:59] it should have been scap sync-file --force
[06:03:06] I lost 2-4 seconds there
[06:03:31] it has been pretty impressive though :)
[06:04:07] running puppet
[06:04:17] I've been playing with the API dashboard in grafana for the last hour, after I deployed a metric change
[06:04:55] First error at 06:01:27 and last error 06:02:05
[06:04:59] TimStarling: this is the first use in production of a fully automatic failover script
[06:05:15] the API module has had a p99 time of about 1 minute since 05:32, a sudden jump
[06:05:18] (except mediawiki config change, that still requires code deployment)
[06:05:32] TimStarling: impossible to avoid due to es server migration
[06:05:52] ok, no problem
[06:05:57] but we reduced it to just seconds
[06:06:01] instead of 5-10 minutes
[06:06:14] AND now it is fully automatic
[06:06:19] so no human error
[06:06:29] applying SRE methodologies
[06:06:43] when etc backend it is in place
[06:06:50] it will be just seconds
[06:07:14] keep checking for anything wrong
[06:08:22] TimStarling: the reason it is inevitable is that there is no way to put es servers in "read only" from mediawiki side
[06:08:29] as that would imply read only everywhere
[06:08:52] marostegui: I updated tendril
[06:09:09] replication topology looking good
[06:09:15] Yeah and no errors or anything
[06:09:25] deploying dns
[06:09:29] (CR) Jcrespo: [C: 2] Setup es1017 as the backend for the es3-eqiad master [dns] - https://gerrit.wikimedia.org/r/447587 (https://phabricator.wikimedia.org/T197073) (owner: Jcrespo)
[06:09:36] well, you can change the active write cluster
[06:09:55] $wgDefaultExternalStore
[06:09:56] so you mean disabling temporarily es3?
[06:10:04] so all writes go to es2?
[06:10:21] yes, bearing in mind that I might not understand the problem and what you're doing exactly
[06:10:21] I think is cluster 24 vs cluster 25?
[06:10:33] TimStarling: we need to do regular maintenance on all servers
[06:10:45] that means shutting them down
[06:11:08] so you did a master switch of the es3 cluster
[06:11:12] yes
[06:11:27] it takes a few seconds (around 1 second) to do that
[06:11:39] so we figured it should be ok to do it at this time
[06:12:18] it should have way less impact than a normal train deploy and revert
[06:12:58] TimStarling: the problem with switching the cluster
[06:13:17] is that in practice clients will continue hitting head with the wrong server
[06:13:29] because they haven't updated their config
[06:13:39] clients from the point of view of the database
[06:13:43] mid-session
[06:14:03] they should get a read-only error and throw an exception
[06:14:31] so mostly the same as they did this way- but with 1 less mediawiki deploy
[06:14:39] which is the painfil part for us
[06:14:53] switch- 1-2 seconds, deploy= 1 minute
[06:15:00] (At the moment)
[06:15:11] *painful
[06:15:32] and yes, you and giuseppe are/have worked on etc to alleviate that :-)
[06:15:34] so thank you
[06:16:46] so why are API edits slow
[06:16:55] are they now?
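Editorial note: the error window quoted at 06:04:55 (first error 06:01:27, last error 06:02:05) and the count of 108 errors given at 06:02:29 can be turned into a duration and an average rate. The sketch below only does that arithmetic; the function name is illustrative and not part of any tool mentioned in the log.

```python
from datetime import datetime


def error_window_seconds(first: str, last: str) -> int:
    """Return the number of seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    return int(delta.total_seconds())


# Values taken from the log: first and last error timestamps, and the error count.
window = error_window_seconds("06:01:27", "06:02:05")
errors = 108

# Average rate over the window; this says nothing about how the errors were distributed.
print(f"{window}s window, {errors / window:.1f} errors/s on average")
```

With the figures from the log this gives a 38-second window, which fits the "just seconds instead of 5-10 minutes" remark above.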
[06:17:23] I can see the same rate as they were [06:18:14] if you are looking at https://grafana.wikimedia.org/dashboard/db/save-timing?panelId=15&fullscreen&orgId=1 [06:18:29] that spike started well before 6 UTC [06:19:00] And the amount have not decreased: https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1&from=now-1h&to=now [06:19:06] and it is the p99, which goes up to that level normally [06:19:14] it started at 05:32, I said so already [06:19:19] And: https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&orgId=1 [06:19:24] I was wondering since there was no SAL entry at that time [06:19:26] then not us [06:19:38] we started doing things at 6 [06:20:04] my guess is an api client starting heavy edits at that time [06:20:35] I can check, but would like to finish the deployment window we reserved first, then I can help with that [06:21:18] did you deploy the dns change? [06:21:22] (I am going thru the checklist) [06:21:24] not yet [06:21:29] oki [06:21:53] !log deploy es3-master dns change [06:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:04] I see it [06:22:29] let's check semisync [06:24:13] I am going to enable semi-sunc on es1017 [06:24:18] (master) [06:24:25] yeah, probably no big deal [06:24:39] TimStarling: I think it is good you brought it out [06:24:59] but will research later :-), a bit busy here now [06:27:51] !log enabling semi-sync master on es1017, disabling it as client [06:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:07] Rpl_semi_sync_master_clients | 2 [06:29:09] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] [06:29:34] marostegui: you said you checked gtid, you want me to check? 
[06:29:39] (double check) [06:29:40] please do [06:30:39] if is on 14 [06:30:44] but not on 19 [06:30:56] I need another pair of eyes? [06:30:58] not sure [06:31:00] Oh! [06:31:01] No [06:31:03] You are right :) [06:31:08] I am glad you double checked [06:31:10] it is not on 17, as expected [06:31:12] I will enabled it now [06:31:14] 10Operations, 10ops-codfw, 10Analytics, 10Analytics-Kanban, and 5 others: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813 (10elukey) For posterity, it is easy to spot when Marko deployed and the effects of the new code: {F24087167} {F24087168} [06:31:22] done! [06:31:28] I like to double check, the same as I like to be double checked [06:31:38] yeah, it is always do to do cross checks! [06:31:39] remember this can be automated, too [06:31:52] yeah, but we decided to do it in a second phase :) [06:31:58] I have the set_gtid_mode(mode) prepared [06:33:23] I think we should be ok now [06:33:54] !log finished es1014 -> es1017 switch T197073 [06:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:58] T197073: switchover es1014 to es1017 - https://phabricator.wikimedia.org/T197073 [06:34:04] we can do the rest not in a hurry [06:34:10] Congratulations :) [06:34:11] I want to check the api thing [06:34:23] (03PS3) 10Marostegui: db-eqiad.php: Depool db1091, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447751 [06:35:29] TimStarling: I can see some MediaStatisticsPage::reallyDoQuery [06:35:38] doing >60 seconds queries [06:36:01] I think it is not abnormal (in the sense that those are always there, not that they should happen) [06:36:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447751 (owner: 10Marostegui) [06:36:26] but the p99 is biased on this (for europe and americas) [06:36:48] there is also dumps running, but that should not count for api 
latency [06:37:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447751 (owner: 10Marostegui) [06:38:23] 10Operations, 10DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674 (10jcrespo) I caught one slow query while doing other monitoring, putting it here for extra debugging: ``` db1063 22457446 puppet dbproxy1001 puppet 31s UPDATE hosts SET environment = 'product... [06:38:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 db1091 (duration: 00m 48s) [06:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:48] !log Stop replication in sync on db1091 and db1097:3314 [06:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:36] I am checking other performance metrics [06:45:33] there is an upwards trend on .12 and .13 since yesterday [06:46:11] then certainly a p99 spike at 5:22 [06:46:46] while p95 and median seem unchanged [06:50:31] also apprently Special:Uploads doesn't work on wikitech [06:53:18] !log resetting postgres data on maps1004 after failing replication - T200228 [06:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:22] T200228: disk space alert on maps1001 - https://phabricator.wikimedia.org/T200228 [06:59:39] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:00:26] I am almost sure the performance spike is purely client-related- edit rate also spiked at that time [07:05:55] (03PS1) 10Jcrespo: mariadb: Depool es1014 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447754 [07:12:08] !log Deploy schema change on db1091 T144010 T51190 T199368 [07:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:14] T51190: Truncate SHA-1 indexes - 
https://phabricator.wikimedia.org/T51190 [07:12:15] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [07:12:15] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [07:31:51] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10jcrespo) [07:32:06] (03PS1) 10Ema: 5.1.3-1wm9: fix POST requests [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/447755 [07:35:12] !log rolling restart of elasticsearch / cirrus / codfw to disable G1 - T156137 [07:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:17] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [07:36:55] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1014 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447754 (owner: 10Jcrespo) [07:38:03] (03Merged) 10jenkins-bot: mariadb: Depool es1014 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447754 (owner: 10Jcrespo) [07:57:47] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091, db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447759 [07:57:52] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1091, db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447759 [07:58:16] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1014 (duration: 00m 48s) [07:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is CRITICAL: 2.889e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [08:00:02] there seem to be something going on since 7:36 
[08:00:08] lots of produced topics
[08:00:50] damn, that might be the elasticsearch cluster restart, checking
[08:00:54] gehel: could it be related to your deploy?
[08:00:58] ah :-)
[08:01:30] both incoming and producer seem high, cannot keep up
[08:01:55] I can confirm it is elastic cirrussearchwrite
[08:02:12] yep, this is me...
[08:02:28] !log pausing elasticsearch cluster restart on codfw - T156137
[08:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:32] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137
[08:02:41] it looks like Kafka hates me :(
[08:02:43] I guess it shouldn't affect the other topics, so not as a large impact?
[08:02:57] checking...
[08:02:58] except I guess search indexing getting delayed
[08:03:04] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1091, db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/447759 (owner: Marostegui)
[08:03:15] ah it is lag
[08:03:38] delaying the search index is the whole point of the operation, I am pausing writes to elasticsearch during cluster restarts
[08:03:47] yes, lag only on elastic
[08:03:51] the others are mostly 0
[08:03:54] gehel: iirc your script uses mwscript showJobs, last time I checked it did not talk to kafka
[08:03:58] ah
[08:04:06] then technically no issue?
[08:04:10] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1091, db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/447759 (owner: Marostegui)
[08:04:29] dcausse: yep, I removed that part since then, but I probably need to add something smarter to wait for the queue to drain on kafka
[08:04:33] I will shut up and let you handle it
[08:04:35] sorry
[08:04:38] jynus: thanks!
[08:05:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 db1091 (duration: 00m 47s)
[08:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:10] I think the sole issue is that we need to pause the restarts based on the lag caused on cirrusElasticaWrite
[08:05:25] elukey: is there a good way to programmatically check the lag on that job? From a python script?
[08:05:35] dcausse: agreed!
[08:06:18] gehel: if not internally, you can query prometheus metric
[08:06:43] jynus: yep, that seems like a good option
[08:06:45] gehel: the metric is on prometheus, so getting it would be great.. there might be the possibility to query burrow directly
[08:07:05] but not sure if we have opened the port for the http api
[08:07:10] ok, let's get a coffee before digging into this, brb
[08:07:24] prometheus should be fine
[08:07:46] remember that the only thing that alarmed is mirror maker
[08:07:58] (PS1) Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/447761
[08:08:07] so the consumer group has its prefix
[08:08:21] (check the metric that it is alarming to get the full prometheus name)
[08:08:39] elukey: what is mirror maker?
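Editorial note: the suggestion above (check the consumer lag from a Python script by querying Prometheus) could look roughly like the sketch below. It uses the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL, the metric name and the `group` label are placeholders, not the real names used in production; as noted in the conversation, the real metric name would have to be taken from the alerting check. The 1e+05 threshold mirrors the critical level shown in the MirrorMaker alert, and using it as the pause level is an assumption.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: replace with the real Prometheus endpoint and metric name.
PROMETHEUS_URL = "http://prometheus.example.org/ops"
LAG_QUERY = 'max(kafka_consumer_group_lag{group="example-group"})'


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def max_lag(payload: dict) -> float:
    """Return the largest sample in an instant-vector response, or 0.0 if it is empty."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    samples = payload["data"]["result"]
    # Each sample looks like {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return max((float(s["value"][1]) for s in samples), default=0.0)


def fetch_max_lag(base_url: str = PROMETHEUS_URL, promql: str = LAG_QUERY) -> float:
    """Query Prometheus over HTTP and return the current maximum lag."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return max_lag(json.load(resp))


def should_pause(lag: float, threshold: float = 1e5) -> bool:
    """Decide whether a rolling restart should wait for the queue to drain."""
    return lag > threshold
```

A restart script could call `fetch_max_lag()` between nodes and sleep while `should_pause()` returns true. Parsing is kept separate from the HTTP call so the decision logic can be checked without network access.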
[08:08:54] dcausse: the synchronization tool between the 2 kafka clusters [08:09:02] ok [08:09:06] dcausse: hi :) It is basically a kafka consumer producer that replicates topics between DCs [08:09:19] thanks and hi :) [08:09:27] in this case, "main-eqiad_to_main-codfw" [08:09:31] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447761 (owner: 10Marostegui) [08:10:13] !log start of ladsgroup@mwmaint1001:~$ foreachwikiindblist s4 populateChangeTagDef.php --sleep 2 (T193873) [08:10:14] (running on kafka2* nodes, consuming from main-eqiad and producing to main-codfw) [08:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:17] T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis) - https://phabricator.wikimedia.org/T193873 [08:10:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447761 (owner: 10Marostegui) [08:11:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 (duration: 00m 47s) [08:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:52] (03PS2) 10DCausse: [cirrus] allow term_freq and remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445399 [08:14:31] !log Deploy schema change on db1121 with replication, this will generate lag on labsdb hosts for s4 T144010 T51190 T199368 [08:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:37] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [08:14:37] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [08:14:37] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [08:20:26] 10Operations, 10DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674 (10Volans) 
FYI servermon will probably be decommissioned soon, see T198939. [08:21:19] (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm9: fix POST requests [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/447755 (owner: 10Ema) [08:22:04] (03CR) 10Ema: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/447755 (owner: 10Ema) [08:25:12] 10Operations, 10netops: OSPF metrics - https://phabricator.wikimedia.org/T200277 (10mark) > The eqdfw-knams needs have a lower metric than the current primary (codfw-eqiad + eqiad-esams) links so traffic from codfw to esams prefer that link. Could you explain that premise? What are we trying to optimize for?... [08:26:03] (03PS1) 10Jcrespo: mariadb: Move es1014 socket to the default postion and disable notif [puppet] - 10https://gerrit.wikimedia.org/r/447762 (https://phabricator.wikimedia.org/T148507) [08:27:00] (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm9: fix POST requests [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/447755 (owner: 10Ema) [08:29:03] (03CR) 10Ema: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/447755 (owner: 10Ema) [08:29:34] (03PS2) 10Jcrespo: mariadb: Prepare es1014 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/447762 (https://phabricator.wikimedia.org/T148507) [08:36:29] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: Jenkins build stuck at "Not enough random bytes available" - https://phabricator.wikimedia.org/T200307 (10ema) [08:36:42] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: Jenkins builds using autopkgtest stuck at "Not enough random bytes available" - https://phabricator.wikimedia.org/T200307 (10ema) [08:40:17] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: Jenkins builds using autopkgtest stuck at "Not enough random bytes available" - https://phabricator.wikimedia.org/T200307 (10ema) p:05Triage>03Normal [08:41:06] !log stop es1014 for reimage [08:41:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:46] (03CR) 10Jcrespo: [C: 032] mariadb: Prepare es1014 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/447762 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [08:43:06] (03PS1) 10Ema: package_builder: install rng-tools [puppet] - 10https://gerrit.wikimedia.org/r/447763 (https://phabricator.wikimedia.org/T200307) [08:43:37] (03CR) 10Ema: [V: 032 C: 032] 5.1.3-1wm9: fix POST requests [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/447755 (owner: 10Ema) [08:43:58] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 7311 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [08:53:46] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.15 seconds [09:02:29] https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/Voando_num_mar_de_areia.jpg/3000px-Voando_num_mar_de_areia.jpg [09:02:46] this gives repeatedly an error [09:04:42] It seems to be failing to render when it is not in cache [09:05:08] other sizes work [09:05:21] precached ones and original works [09:05:55] https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/Voando_num_mar_de_areia.jpg/2000px-Voando_num_mar_de_areia.jpg OK [09:06:29] (03PS1) 10Marostegui: dbtools: check_tables.sh [software] - 10https://gerrit.wikimedia.org/r/447764 (https://phabricator.wikimedia.org/T104459) [09:06:42] https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/Voando_num_mar_de_areia.jpg/1000px-Voando_num_mar_de_areia.jpg failed [09:07:37] is this a known issue? 
[09:07:49] I cannot reproduce on other images- I would recommed you to report it in a ticket [09:07:54] (03PS2) 10Marostegui: dbtools: check_tables.sh [software] - 10https://gerrit.wikimedia.org/r/447764 (https://phabricator.wikimedia.org/T104459) [09:07:58] yannf: interesting, the 429 seems to be generated by varnish [09:08:00] ok [09:08:01] I'm taking a look [09:08:02] with the exat image and url, yannf do you know how? [09:08:06] *exact [09:08:13] yes [09:08:22] thank you that will be very helpful [09:08:31] thank you for the report [09:08:51] (03CR) 10Marostegui: [C: 032] dbtools: check_tables.sh [software] - 10https://gerrit.wikimedia.org/r/447764 (https://phabricator.wikimedia.org/T104459) (owner: 10Marostegui) [09:09:25] (03PS2) 10Mark Bergsma: Remove Server.modified [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 [09:09:38] (03Merged) 10jenkins-bot: dbtools: check_tables.sh [software] - 10https://gerrit.wikimedia.org/r/447764 (https://phabricator.wikimedia.org/T104459) (owner: 10Marostegui) [09:10:32] so the 429 error is actually coming from swift [09:11:13] it looks like it's coming from varnish because we have VCL code to turn origin server responses with status code 429 and Content-Length 0 into synthetic 429s from varnish [09:11:30] ema: actually swift itself or thumbor? [09:11:54] https://phabricator.wikimedia.org/T200313 [09:12:01] yannf: thank you [09:12:07] jynus: thumbor actually [09:12:11] here's a repro: https://phabricator.wikimedia.org/P7386 [09:12:47] 10Operations, 10Commons, 10Thumbor, 10media-storage: Thumbnails not created: Error: 429, Too Many Requests - https://phabricator.wikimedia.org/T200313 (10jcrespo) [09:12:53] add it to the ticket above [09:12:54] thumbor 429 rate doesn't look weird: https://grafana.wikimedia.org/dashboard/db/thumbor?panelId=7&fullscreen&orgId=1&from=now-24h&to=now [09:13:05] jynus: will do, thanks! 
[09:13:42] I am a bit distracted with some ongoing operations [09:13:52] if someone else can have a look, if not, I will do it later [09:16:26] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 49.95 seconds [09:17:21] (03PS1) 10Mark Bergsma: Don't recalculate server.up in refreshPreexistingServers [debs/pybal] - 10https://gerrit.wikimedia.org/r/447766 [09:18:09] 10Operations, 10Commons, 10Thumbor, 10media-storage: Thumbnails not created: Error: 429, Too Many Requests - https://phabricator.wikimedia.org/T200313 (10ema) Here's the full thumbor response: ``` $ curl -v http://ms-fe.svc.eqiad.wmnet/wikipedia/commons/thumb/2/2a/Voando_num_mar_de_areia.jpg/3000px-Voando... [09:18:35] 10Operations, 10Commons, 10Thumbor, 10media-storage: Thumbnails not created: Error: 429, Too Many Requests - https://phabricator.wikimedia.org/T200313 (10Aklapper) [09:18:43] taking a look too [09:20:25] !log restarting elasticsearch cluster restart on codfw - T156137 [09:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:29] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [09:22:52] unrelated but grafana dashboards loaded from disk appear to be gone? i.e. the swift dashboards [09:31:25] !log upload varnish 5.1.3-1wm9 to apt.w.o (fixing POST requests w/ separate VCL) T164609 [09:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:29] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [09:39:11] 10Operations, 10Commons, 10Thumbor, 10media-storage: Thumbnails not created: Error: 429, Too Many Requests - https://phabricator.wikimedia.org/T200313 (10fgiunchedi) Looks like thumbor/imagemagick are running into resource exhaustion when trying to scale this image (error below) resulting in 500s. Then poo...
[09:40:17] 10Operations, 10Commons, 10Thumbor, 10media-storage: cache resources exhausted while scaling File:Voando_num_mar_de_areia.jpg - https://phabricator.wikimedia.org/T200313 (10fgiunchedi) [09:40:50] yannf: ^ thanks for the report! [09:41:05] ;) [09:44:35] 10Operations, 10Commons, 10Thumbor, 10media-storage: cache resources exhausted while scaling File:Voando_num_mar_de_areia.jpg - https://phabricator.wikimedia.org/T200313 (10ema) >>! In T200313#4449804, @fgiunchedi wrote: > Then poolcounter kicks in for this original due to repeated 500s while scaling and 4... [09:48:28] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447768 [09:49:15] (03PS1) 10Mark Bergsma: Don't depool pooledDownServers in refreshPreexistingServers [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) [09:49:17] (03PS1) 10Mark Bergsma: Extend testConfigServerRemoval test case. [debs/pybal] - 10https://gerrit.wikimedia.org/r/447770 [09:52:35] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447768 (owner: 10Marostegui) [09:53:17] !log bounce grafana on krypton [09:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:00] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447768 (owner: 10Marostegui) [09:55:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 (duration: 00m 47s) [09:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:38] (03PS1) 10Jcrespo: mariadb: Reenable notifications for es1014 [puppet] - 10https://gerrit.wikimedia.org/r/447771 [09:57:14] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications for es1014 [puppet] - 10https://gerrit.wikimedia.org/r/447771 (owner: 10Jcrespo) [09:58:14] (03PS9) 10Vgutierrez: WIP: provide 
ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [09:58:42] (03CR) 10Vgutierrez: "Thanks for the initial review Riccardo, your comments should be addressed on PS9" (034 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [09:58:56] (03CR) 10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [09:59:23] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1014 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447772 [09:59:29] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1014 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447772 [10:00:14] 10Operations, 10monitoring: grafana fails to load dashboards from disk - https://phabricator.wikimedia.org/T200317 (10fgiunchedi) [10:00:32] found out why grafana wasn't loading dashboards from disk ^ [10:00:57] hehe [10:01:06] (03PS1) 10Jcrespo: mariadb: Repool es1014 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447773 [10:02:35] addshore: and it was me that introduced the invalid json! [10:02:41] * godog gets brown paperbag [10:02:49] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [10:03:53] I just CCed you on a ticket that I created during Wikimania regarding some confusing graphite data ( https://phabricator.wikimedia.org/T199968 ) It could be that everyone that was looking it just doesn't understand what is happening, if you get 5 mins would you be able to have a look? 
[10:03:55] (03PS1) 10Filippo Giunchedi: grafana: fix server-dashboard json [puppet] - 10https://gerrit.wikimedia.org/r/447774 (https://phabricator.wikimedia.org/T200317) [10:05:08] !log upgrade varnish to 5.1.3-1wm9 on text-eqiad T164609 [10:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:12] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [10:05:34] https://phabricator.wikimedia.org/T200121 [10:05:42] what did change? ^ [10:07:08] !log Deploy schema change on db2040 (s7 codfw master) with replication, this will generate lag on s7 codfw T144010 T51190 T199368 [10:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:14] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [10:07:14] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [10:07:14] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [10:07:16] yannf: I think your best chance at getting an answer is the ticket itself- not everybody in the know will be available at this timezone [10:07:27] ok [10:07:34] (03PS1) 10Mark Bergsma: Modernize and cleanup Coordinator [debs/pybal] - 10https://gerrit.wikimedia.org/r/447775 [10:07:35] addshore: for sure, I'll take a look today [10:07:43] godog: thanks :) [10:08:07] (03CR) 10Filippo Giunchedi: [C: 032] grafana: fix server-dashboard json [puppet] - 10https://gerrit.wikimedia.org/r/447774 (https://phabricator.wikimedia.org/T200317) (owner: 10Filippo Giunchedi) [10:10:53] (03PS10) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [10:11:06] (03CR) 10Jcrespo: [C: 032] mariadb: Repool es1014 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447773 (owner: 10Jcrespo) [10:11:32] (03CR) 
10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [10:12:23] (03Merged) 10jenkins-bot: mariadb: Repool es1014 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447773 (owner: 10Jcrespo) [10:17:48] (03PS1) 10Ema: cache_text: re-enable alternate domains [puppet] - 10https://gerrit.wikimedia.org/r/447776 (https://phabricator.wikimedia.org/T164609) [10:21:35] (03CR) 10Alexandros Kosiaris: "Don't rng-tools require a hardware TRNG (supported by a hw_random linux kernel module like amd-rng.ko/intel-rng.ko) ? Cause boron is a VM," [puppet] - 10https://gerrit.wikimedia.org/r/447763 (https://phabricator.wikimedia.org/T200307) (owner: 10Ema) [10:21:49] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1014 with low load after maintenance (duration: 00m 47s) [10:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:56] (03PS3) 10Jcrespo: Revert "mariadb: Depool es1014 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447772 [10:28:11] (03PS1) 10Jcrespo: mariadb: Stop being able to reimage es1014, add es1015 [puppet] - 10https://gerrit.wikimedia.org/r/447777 [10:29:12] (03CR) 10Jcrespo: [C: 032] mariadb: Stop being able to reimage es1014, add es1015 [puppet] - 10https://gerrit.wikimedia.org/r/447777 (owner: 10Jcrespo) [10:30:34] (03CR) 10Alexandros Kosiaris: "I had a closer look to the ticket and realized we are not talking about boron alone but also the jenkins slaves. 
Which from what I know ar" [puppet] - 10https://gerrit.wikimedia.org/r/447763 (https://phabricator.wikimedia.org/T200307) (owner: 10Ema) [10:33:18] (03CR) 10Alexandros Kosiaris: [C: 04-1] phabricator: Use the mysql native driver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443045 (owner: 10Alexandros Kosiaris) [10:35:01] anybody knows what's happening with 5xx errors? https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?refresh=1m&orgId=1&from=now-12h&to=now cc robh [10:35:45] 503 actually [10:36:06] yes, ores [10:43:11] (03PS2) 10Ema: package_builder: install haveged [puppet] - 10https://gerrit.wikimedia.org/r/447763 (https://phabricator.wikimedia.org/T200307) [10:44:14] (03PS1) 10Filippo Giunchedi: grafana: introduce dashboard json validator [puppet] - 10https://gerrit.wikimedia.org/r/447778 (https://phabricator.wikimedia.org/T200317) [10:44:16] (03PS1) 10Filippo Giunchedi: tox: add nosetests for grafana [puppet] - 10https://gerrit.wikimedia.org/r/447779 (https://phabricator.wikimedia.org/T200317) [10:45:40] (03CR) 10jerkins-bot: [V: 04-1] tox: add nosetests for grafana [puppet] - 10https://gerrit.wikimedia.org/r/447779 (https://phabricator.wikimedia.org/T200317) (owner: 10Filippo Giunchedi) [10:45:50] (03CR) 10Ema: "> Don't rng-tools require a hardware TRNG (supported by a hw_random" [puppet] - 10https://gerrit.wikimedia.org/r/447763 (https://phabricator.wikimedia.org/T200307) (owner: 10Ema) [10:50:53] (03PS1) 10Zfilipin: Group0 to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447780 [10:56:59] (03PS11) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [10:57:41] (03CR) 10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) 
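Note on the "dashboard json validator" change mentioned above: a minimal sketch of what such a check could look like, assuming the dashboards are stored as `*.json` files in a single directory. The function name `find_invalid_dashboards` and the directory layout are assumptions made for illustration; the actual change under review may work differently.

```python
import json
from pathlib import Path


def find_invalid_dashboards(directory):
    """Return (path, error message) pairs for every *.json file that fails to parse."""
    failures = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # Record the failure and keep going so that one run reports all broken files.
            failures.append((str(path), str(exc)))
    return failures
```

Run from a test suite, a check along these lines would flag a syntactically broken dashboard at review time instead of when the dashboards fail to load from disk.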
[10:59:43] (03PS12) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [11:00:03] (03CR) 10Filippo Giunchedi: [C: 032] grafana: introduce dashboard json validator [puppet] - 10https://gerrit.wikimedia.org/r/447778 (https://phabricator.wikimedia.org/T200317) (owner: 10Filippo Giunchedi) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180725T1100). [11:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] (03PS2) 10Filippo Giunchedi: grafana: introduce dashboard json validator [puppet] - 10https://gerrit.wikimedia.org/r/447778 (https://phabricator.wikimedia.org/T200317) [11:00:14] o/ [11:00:25] (03CR) 10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [11:01:02] (03CR) 10Filippo Giunchedi: [C: 032] "> Patch Set 1: Verified-1" [puppet] - 10https://gerrit.wikimedia.org/r/447779 (https://phabricator.wikimedia.org/T200317) (owner: 10Filippo Giunchedi) [11:01:18] !log zfilipin@deploy1001 Pruned MediaWiki: 1.32.0-wmf.10 [keeping static files] (duration: 05m 13s) [11:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:30] zeljkof: I'm going to swat my config patch, any objections? 
[11:02:34] (03PS2) 10Filippo Giunchedi: tox: add nosetests for grafana [puppet] - 10https://gerrit.wikimedia.org/r/447779 (https://phabricator.wikimedia.org/T200317) [11:03:15] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] tox: add nosetests for grafana [puppet] - 10https://gerrit.wikimedia.org/r/447779 (https://phabricator.wikimedia.org/T200317) (owner: 10Filippo Giunchedi) [11:05:40] dcausse: go ahead :) [11:05:53] (sorry, forgot about swat) [11:06:07] np :) [11:06:11] swating [11:06:16] there was a network spike on mw2188 [11:06:28] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445399 (owner: 10DCausse) [11:06:34] https://grafana.wikimedia.org/dashboard/db/host-overview?refresh=300s&orgId=1&var-server=mw2188&var-datasource=codfw%20prometheus%2Fops&var-cluster=appserver [11:06:54] dcausse: please let me know when you are done, so I can continue with train [11:06:58] sure [11:07:39] 10Operations, 10Pybal, 10Traffic: Unhandled pybal error: OpenSSL.SSL.Error - ssl handshake failure - https://phabricator.wikimedia.org/T168539 (10mark) @ema: Has this been seen again? Does this need any work in Pybal? 
[11:07:44] (03Merged) 10jenkins-bot: [cirrus] allow term_freq and remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445399 (owner: 10DCausse) [11:08:42] zeljkof: wikiversion.json is modified on deploy1001, I can't rebase [11:08:55] dcausse: ah, sorry, will revert [11:09:24] zeljkof: I can postpone my swat, it's not urgent at all [11:09:30] I can send it tomorrow [11:09:34] dcausse: no, go ahead, it's all fixed now [11:09:36] ok [11:09:43] finish the swat, I'll continue with train later [11:09:50] forgot to clean up [11:14:09] !log dcausse@deploy1001 Synchronized ./wmf-config/CirrusSearch-common.php: [cirrus] allow term_freq and remove deprecated settings (duration: 00m 48s) [11:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:21] 10Operations, 10ops-codfw, 10Analytics, 10Analytics-Kanban, and 5 others: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813 (10akosiaris) Wow, nice work! [11:15:59] zeljkof: I'm done [11:16:09] !log EU SWAT done [11:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:37] 10Operations, 10SCB, 10Services (watching): Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 (10akosiaris) 05Open>03stalled T199813 was closed today (nice work on it). I am thking (and hoping) it was the root cause. I 'll stall this task for a week just for monitorin... [11:16:58] dcausse: thanks! 
[11:17:02] 10Operations, 10SCB, 10Services (watching): Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 (10akosiaris) p:05High>03Normal [11:17:08] continuing with train then [11:18:24] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Add Proton's URI [puppet] - 10https://gerrit.wikimedia.org/r/444835 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [11:18:34] (03PS2) 10Alexandros Kosiaris: RESTBase: Add Proton's URI [puppet] - 10https://gerrit.wikimedia.org/r/444835 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [11:18:40] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] RESTBase: Add Proton's URI [puppet] - 10https://gerrit.wikimedia.org/r/444835 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [11:20:19] !log zfilipin@deploy1001 Started scap: testwiki to php-1.32.0-wmf.14 and rebuild l10n cache [11:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:35] zeljkof: can i get a mw-config change in before? [11:20:39] ah just seen the !log [11:20:41] ok, nevermind then [11:21:40] mobrovac: ah, a second too late :) will be done in an hour or so [11:21:48] no worries :) [11:22:36] zeljkof: just to get it right, you are now doing what was supposed to happen 2h from now or will that window still be used? 
[11:23:16] mobrovac: I'm doing what should happen yesterday morning, but .14 was blocked on .13 :/ [11:23:31] now that .13 is everywhere, I'm catching up on .14 [11:23:36] ah i see [11:23:38] kk [11:24:10] this is just getting ready for the deploy window, it will still be used, doing this currently https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Sync_to_cluster_and_verify_on_testwiki [11:37:26] 10Operations, 10ops-codfw, 10Analytics, 10Analytics-Kanban, and 5 others: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813 (10elukey) 05Resolved>03Open There seems to be only one weirdness remaining, namely: {F24089464} For some reason, right... [11:52:29] 10Operations, 10Pybal, 10Traffic: Unhandled pybal error: OpenSSL.SSL.Error - ssl handshake failure - https://phabricator.wikimedia.org/T168539 (10ema) 05Open>03Resolved a:03ema Nope, I haven't seen this since. Closing. [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180725T1200) [12:19:47] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.14 and rebuild l10n cache (duration: 59m 27s) [12:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:46] !log start of ladsgroup@mwmaint1001:~$ foreachwikiindblist s5 populateChangeTagDef.php --sleep 2 (T193873) [12:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:50] T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis) - https://phabricator.wikimedia.org/T193873 [12:27:29] (03PS1) 10Filippo Giunchedi: tox: remove webperf nosetests [puppet] - 10https://gerrit.wikimedia.org/r/447789 [12:29:14] (03CR) 10jerkins-bot: [V: 04-1] tox: remove webperf nosetests [puppet] - 10https://gerrit.wikimedia.org/r/447789 (owner: 10Filippo Giunchedi) [12:30:45] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Verified-1" 
[puppet] - 10https://gerrit.wikimedia.org/r/447789 (owner: 10Filippo Giunchedi) [12:39:31] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10Cmjohnson) @ayounsi I was not able to add the ports in row A to the public vlan. Can you check the following and add to public vlan. Also, adding the servers to the vla... [12:42:35] (03PS1) 10Filippo Giunchedi: cumin: ignore flake8 syntax until pep8 uses python 3 [puppet] - 10https://gerrit.wikimedia.org/r/447793 (https://phabricator.wikimedia.org/T184435) [12:43:29] (03CR) 10jerkins-bot: [V: 04-1] cumin: ignore flake8 syntax until pep8 uses python 3 [puppet] - 10https://gerrit.wikimedia.org/r/447793 (https://phabricator.wikimedia.org/T184435) (owner: 10Filippo Giunchedi) [12:43:59] (03CR) 10Volans: [C: 031] "ACK, that was my fault when I removed the import from future once migrated to py3." [puppet] - 10https://gerrit.wikimedia.org/r/447793 (https://phabricator.wikimedia.org/T184435) (owner: 10Filippo Giunchedi) [12:44:09] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10mark) >>! In T195923#4450204, @Cmjohnson wrote: > @ayounsi I was not able to add the ports in row A to the public vlan. Can you check the following and add to public v... [12:44:38] (03PS2) 10Filippo Giunchedi: cumin: ignore flake8 syntax until pep8 uses python 3 [puppet] - 10https://gerrit.wikimedia.org/r/447793 (https://phabricator.wikimedia.org/T184435) [12:45:23] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Cmjohnson) a:03RobH @RobH Can you help with this by reaching out to our Dell Rep. the repair request was denied because the service tag shows as not belonging to our organization. (ST is 5MCLDH2). If... [12:46:37] volans: thanks for the quick review! when you removed the future import jenkins didn't complain? 
I'm asking because I only noticed another, unrelated failure for webperf because I touched tox.ini and thus the tests ran [12:46:49] (03CR) 10Krinkle: [C: 031] tox: remove webperf nosetests [puppet] - 10https://gerrit.wikimedia.org/r/447789 (owner: 10Filippo Giunchedi) [12:47:23] (03CR) 10Filippo Giunchedi: [C: 032] tox: remove webperf nosetests [puppet] - 10https://gerrit.wikimedia.org/r/447789 (owner: 10Filippo Giunchedi) [12:47:31] (03PS2) 10Filippo Giunchedi: tox: remove webperf nosetests [puppet] - 10https://gerrit.wikimedia.org/r/447789 [12:47:38] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10Cmjohnson) Thanks @mark fixing now. I looked up one other and it must've been for something else. I believe it was cp1008 [12:47:40] godog: I think it complained and I might have overridden it; it was in the middle of the whole py3 migration of all the stuff and I didn't have the time to debug further at that time [12:47:56] basically it's the parameter for the print function that is not present in py2 [12:48:02] unless imported from future [12:48:09] we could re-add the import, doesn't hurt in py3 [12:48:14] it's just redundant [12:48:40] (03CR) 10jerkins-bot: [V: 04-1] tox: remove webperf nosetests [puppet] - 10https://gerrit.wikimedia.org/r/447789 (owner: 10Filippo Giunchedi) [12:48:43] ah, yeah I don't feel strongly one way or another [12:49:06] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] tox: remove webperf nosetests [puppet] - 10https://gerrit.wikimedia.org/r/447789 (owner: 10Filippo Giunchedi) [12:50:13] if this works I'm ok with it [12:50:28] yeah it does, flake8 is happy with it [12:50:36] ok I'll merge the change [12:50:45] ack [12:50:46] (03CR) 10Filippo Giunchedi: [C: 032] cumin: ignore flake8 syntax until pep8 uses python 3 [puppet] - 10https://gerrit.wikimedia.org/r/447793 (https://phabricator.wikimedia.org/T184435) (owner: 10Filippo Giunchedi) [12:50:53] (03PS3) 10Filippo
Giunchedi: cumin: ignore flake8 syntax until pep8 uses python 3 [puppet] - 10https://gerrit.wikimedia.org/r/447793 (https://phabricator.wikimedia.org/T184435) [12:51:25] for added fun we should randomly run "full tox" to catch errors [12:55:37] 10Operations, 10vm-requests, 10Performance-Team (Radar): Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti) - https://phabricator.wikimedia.org/T199853 (10akosiaris) >>! In T199853#4442280, @Krinkle wrote: > @herron That would reduce this request to needing ~150 GB (for XHGui's Mongo). Is tha... [12:58:39] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10BBlack) I think they did something, as the password for mgmt ssh appears to be reset (can't get in anymore) [13:00:04] hashar: Your horoscope predicts another unfortunate MediaWiki train - European version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180725T1300). [13:00:53] !log resetting postgres data on maps1002 after failing replication - T200228 [13:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:58] T200228: disk space alert on maps1001 - https://phabricator.wikimedia.org/T200228 [13:14:51] godog: eheheh [13:20:51] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447797 [13:21:55] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447797 (owner: 10Marostegui) [13:22:21] (03CR) 10Zfilipin: [C: 032] Group0 to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447780 (owner: 10Zfilipin) [13:23:17] (03PS2) 10Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447797 [13:23:19] (03CR) 10Ema: [C: 032] cache_text: re-enable alternate domains [puppet] - 10https://gerrit.wikimedia.org/r/447776 
(https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [13:23:26] (03PS2) 10Ema: cache_text: re-enable alternate domains [puppet] - 10https://gerrit.wikimedia.org/r/447776 (https://phabricator.wikimedia.org/T164609) [13:23:29] (03Merged) 10jenkins-bot: Group0 to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447780 (owner: 10Zfilipin) [13:24:21] 10Operations, 10Analytics, 10WMDE-Analytics-Engineering: Cannot SSH to stat1004 - https://phabricator.wikimedia.org/T200330 (10GoranSMilovanovic) [13:24:36] 10Operations, 10Analytics, 10WMDE-Analytics-Engineering: Cannot SSH to stat1004 - https://phabricator.wikimedia.org/T200330 (10GoranSMilovanovic) p:05Triage>03High [13:26:03] !log depool cp1067 and test alternate domains patch with varnish 5.1.3-1wm9 T164609 [13:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:07] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [13:26:45] 10Operations, 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Cannot SSH to stat1004 - https://phabricator.wikimedia.org/T200330 (10GoranSMilovanovic) [13:27:03] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.14 [13:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:46] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/ [13:27:49] or maybe this blog by freenode staff member Matthew 'mst' Trout https://MattSTrout.com/ [13:27:52] Read what IRC investigative journalists have uncovered on the freenode pedophilia scandal https://encyclopediadramatica.rs/Freenodegate [13:27:55] Voice your opinions at https://webchat.freenode.net/?channels=#freenode [13:34:06] !log repool cp1067 w/ alternate domains patch and varnish 5.1.3-1wm9 T164609 [13:34:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:11] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [13:44:41] !log restarting ores celery workers on codfw [13:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:17] (03PS1) 10Elukey: profile::hadoop::common: add ':' to hadoop trash parameters [puppet] - 10https://gerrit.wikimedia.org/r/447800 [13:48:49] !log text-eqiad: test alternate domains patch with varnish 5.1.3-1wm9 T164609 [13:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:53] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [13:49:23] For anyone having clicked any of the previous links – This appears to be a fakenews scandal intended to harm "kloeri", "Exherbo" and/or Freenode. Several sources state that these are not the people's actual blogs, but sites created by others in their name with fake content. [13:49:43] Those are fake pages by someone who has been blocked on Freenode and taking revenge. [13:49:52] They also heavily spammed other IRC networks yesterday. [13:49:56] Yeah, been going on for a while.
[13:51:28] (03PS1) 10Zfilipin: group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447802 [13:51:30] (03CR) 10Zfilipin: [C: 032] group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447802 (owner: 10Zfilipin) [13:52:46] (03CR) 10jerkins-bot: [V: 04-1] group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447802 (owner: 10Zfilipin) [13:53:11] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447802 (owner: 10Zfilipin) [13:54:52] !log running compare.py on all es3 databases [13:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:51] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11849/ - no op as expected." [puppet] - 10https://gerrit.wikimedia.org/r/447800 (owner: 10Elukey) [13:56:58] (03PS1) 10BBlack: tmpfs privkeys [1/3]: sslcert tmpfs bits [puppet] - 10https://gerrit.wikimedia.org/r/447803 [13:57:00] (03PS1) 10BBlack: tmpfs privkeys [2/3]: agent first run via systemd [puppet] - 10https://gerrit.wikimedia.org/r/447804 [13:57:02] (03PS1) 10BBlack: tmpfs privkeys [3/3]: use for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/447805 [13:59:12] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.14 [13:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:08] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.14 (duration: 00m 55s) [14:00:17] elukey, just had a look at the load on ORES. It's not me. I've pinged my UW collaborators. Can we block this IP from ORES in the meantime? 
[14:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:26] akosiaris, ^ [14:00:52] RECOVERY - Check systemd state on ms-be2036 is OK: OK - running: The system is fully operational [14:01:10] Not sure if this is advisable, but I'd like the researchers to stop ASAP. All considered, ORES is handling this OK it seems. [14:01:43] yeah we can block it [14:01:51] It's hard to tell what proportion of 503's are going to the UW bomb and which are going to legitimate users. [14:02:12] (03CR) 10Faidon Liambotis: "Placing a tmpfs under /etc is a bit counter-intuitive -- have you considered placing this under /run or /var?" [puppet] - 10https://gerrit.wikimedia.org/r/447803 (owner: 10BBlack) [14:02:14] Cool. I think we should block it until we can figure out what the cause is. [14:02:20] (03PS2) 10BBlack: tmpfs privkeys [2/3]: agent first run via systemd [puppet] - 10https://gerrit.wikimedia.org/r/447804 [14:02:22] (03PS2) 10BBlack: tmpfs privkeys [3/3]: use for tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/447805 [14:02:38] akosiaris, also, if you could send me the IP address, I might be able to have someone on the UW network do some detective work for us. 
[14:04:39] halfak: yeah lemme have a look at blocking it [14:09:26] (03CR) 10BBlack: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/447803 (owner: 10BBlack) [14:11:34] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/ [14:11:34] or maybe this blog by freenode staff member Matthew 'mst' Trout https://MattSTrout.com/ [14:11:38] Read what IRC investigative journalists have uncovered on the freenode pedophilia scandal https://encyclopediadramatica.rs/Freenodegate [14:11:38] Voice your opinions at https://webchat.freenode.net/?channels=#freenode [14:12:15] heh, I know Matt Trout personally, and that's definitely not him [14:13:07] 10Operations, 10Wikimedia-Logstash, 10Goal: Logstash/Kibana architecture review - https://phabricator.wikimedia.org/T198754 (10fgiunchedi) >>! In T198754#4422095, @fgiunchedi wrote: > Non exhaustive list of things that we'll need to address: > * More insight into logstash/kibana activity via prometheus metri... [14:14:07] bblack, they are even showing up on a few SlashNET channels I am in. This smear campaign is wide-reaching. [14:15:06] halfak: IP blocked. I 'll unblock it in a few hours however [14:15:33] unless we hear back from them and they turn out to be some huge PITA [14:18:12] Perfect! Thank you akosiaris [14:18:15] (03PS1) 10Mark Bergsma: Move Attribute constants from attributes to constants [debs/pybal] - 10https://gerrit.wikimedia.org/r/447808 [14:18:17] (03PS1) 10Mark Bergsma: Use absolute imports for all BGP modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/447809 [14:18:18] * halfak gets to documenting all of this. 
[14:18:19] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447617 (owner: 10Zfilipin) [14:18:21] (03CR) 10jenkins-bot: Group 2 back to php-1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447623 (https://phabricator.wikimedia.org/T191059) (owner: 10Zfilipin) [14:18:23] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447626 (owner: 10Marostegui) [14:18:25] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447635 (owner: 10Marostegui) [14:18:27] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447650 (owner: 10Zfilipin) [14:18:29] (03CR) 10jenkins-bot: mariadb: Promote es1017 as the master of es3-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [14:18:31] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447751 (owner: 10Marostegui) [14:18:33] (03CR) 10jenkins-bot: mariadb: Depool es1014 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447754 (owner: 10Jcrespo) [14:18:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091, db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447759 (owner: 10Marostegui) [14:18:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447761 (owner: 10Marostegui) [14:18:39] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447768 (owner: 10Marostegui) [14:18:41] (03CR) 10jenkins-bot: mariadb: Repool es1014 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447773 (owner: 10Jcrespo) [14:18:43] (03CR) 10jenkins-bot: [cirrus] allow term_freq and remove deprecated settings 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/445399 (owner: 10DCausse) [14:18:45] (03CR) 10jenkins-bot: Group0 to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447780 (owner: 10Zfilipin) [14:18:47] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447802 (owner: 10Zfilipin) [14:21:04] (03PS4) 10Jcrespo: Revert "mariadb: Depool es1014 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447772 [14:23:00] akosiaris & elukey: https://phabricator.wikimedia.org/T200338 [14:23:18] Amir1, addshore: does this sound familiar? `Wikibase\DataModel\Entity\EntityIdParsingException from line 49 of /srv/mediawiki/php-1.32.0-wmf.14/vendor/wikibase/data-model/src/Entity/DispatchingEntityIdParser.php: $serialization must not be an empty string` [14:24:05] a lot of those at https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor [14:25:49] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es1014 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447772 (owner: 10Jcrespo) [14:26:21] zeljkof: Thanks, was just gonna file a report for that one. Seems new in wmf.14 indeed. [14:26:48] Krinkle: did not find it in phab, creating ticket right now [14:26:59] (03Merged) 10jenkins-bot: Revert "mariadb: Depool es1014 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447772 (owner: 10Jcrespo) [14:27:03] I was searching phab for it, so I would not create a duplicate [14:27:29] Seems to be causing certain api.php user requests to fail [14:28:04] Might be safest to revert until we know more. 
[14:28:40] Krinkle: this is the task, will add more data, blocking the train T200340 [14:28:41] T200340: [{exception_id}] {exception_url} Wikibase\DataModel\Entity\EntityIdParsingException from line 49 of /srv/mediawiki/php-1.32.0-wmf.14/vendor/wikibase/data-model/src/Entity/DispatchingEntityIdParser.php: $serialization must not be an empty string - https://phabricator.wikimedia.org/T200340 [14:29:17] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1014 fully (duration: 00m 55s) [14:29:20] (03PS1) 10Marostegui: mariadb: Add db1120 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/447811 (https://phabricator.wikimedia.org/T196376) [14:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:35] Krinkle: does not seem to produce a lot of errors, my plan was to leave it blocking the train going further, do you think I should revert? we are at group 1 now, should I revert to group 0, or all the way to .13? [14:33:32] (03CR) 10Marostegui: [C: 032] mariadb: Add db1120 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/447811 (https://phabricator.wikimedia.org/T196376) (owner: 10Marostegui) [14:34:24] ah, the instructions say to roll back if there is a new error, rolling back then [14:34:36] (03PS1) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 [14:35:17] (03CR) 10Jcrespo: "Remember there is now a transfer.py and a recover_section.py :-)" [puppet] - 10https://gerrit.wikimedia.org/r/447811 (https://phabricator.wikimedia.org/T196376) (owner: 10Marostegui) [14:35:35] halfak: thanks a lot! My bad, this morning I saw where the IP was coming from and your UA (with the email) and I thought to not block it but just to send the email. Next time I'll be more proactive and will stop it straight away [14:35:48] zeljkof: For reference, if you get wikidata issues like that, Tag them Wikidata-Campsite [14:35:50] (03CR) 10Marostegui: [C: 032] "Yeah! 
I was going to use recover_section :)" [puppet] - 10https://gerrit.wikimedia.org/r/447811 (https://phabricator.wikimedia.org/T196376) (owner: 10Marostegui) [14:35:53] (03CR) 10Herron: "> +1 afaict. as long as we test that mail still arrives after merging" [puppet] - 10https://gerrit.wikimedia.org/r/441131 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [14:36:28] Reedy: thanks, will do! [14:36:31] (03CR) 10Gergő Tisza: [C: 031] Set MCR write-both-read-old on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447638 (https://phabricator.wikimedia.org/T197817) (owner: 10Anomie) [14:36:35] It's their "on call" group [14:36:45] (03CR) 10Gergő Tisza: [C: 031] Set MCR read-old-write-both on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447639 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [14:36:54] (03CR) 10Gergő Tisza: [C: 031] Set MCR read-new-write-both on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447640 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [14:37:06] 10Operations, 10vm-requests, 10Performance-Team (Radar), 10User-herron: Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti) - https://phabricator.wikimedia.org/T199853 (10herron) a:03herron [14:37:39] (03PS3) 10Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447797 [14:39:21] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [14:39:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447797 (owner: 10Marostegui) [14:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:49] 10Operations, 10vm-requests, 10Performance-Team (Radar), 10User-herron: Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti) - https://phabricator.wikimedia.org/T199853 (10Imarlier) @herron yep, that would work 
fine. [14:40:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447797 (owner: 10Marostegui) [14:42:51] heh. Totally reasonable elukey. I need to stop putting my own email address in example code! [14:43:01] (03PS2) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 [14:43:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3317 (duration: 00m 54s) [14:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:34] 10Operations, 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Cannot SSH to stat1004 - https://phabricator.wikimedia.org/T200330 (10Reedy) When did you last use it? >Debian GNU/Linux 9 auto-installed on Tue May 22 15:58:49 UTC 2018. Just remove the old key with the command it told yo... [14:43:40] (03CR) 10Vgutierrez: [C: 031] Extend NaiveBGPPeering unit testing [debs/pybal] - 10https://gerrit.wikimedia.org/r/436807 (owner: 10Mark Bergsma) [14:43:41] > !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [14:44:05] that is me reverting group 1 from .14 to .13 [14:44:44] (03CR) 10Alexandros Kosiaris: [C: 031] package_builder: install haveged [puppet] - 10https://gerrit.wikimedia.org/r/447763 (https://phabricator.wikimedia.org/T200307) (owner: 10Ema) [14:45:09] !log Deploy schema change on db1098:3317 T144010 T51190 T199368 [14:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:15] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [14:45:15] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [14:45:16] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [14:46:31] (03PS1) 10Zfilipin: Revert "group1 wikis to 1.32.0-wmf.14" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/447815 (https://phabricator.wikimedia.org/T200340) [14:47:13] =o [14:48:23] zeljkof: given that it's only happening in wikidata I think rolling back makes sense as wikidata is in group1. [14:48:39] Amir1: done [14:49:03] addshore, Amir1 please comment on the task, if you have any insight, it's blocking the train :( [14:49:20] (03PS1) 10Cmjohnson: Add ipv4/ipv6 dns cp1075-1090 [dns] - 10https://gerrit.wikimedia.org/r/447816 (https://phabricator.wikimedia.org/T195923) [14:49:51] (03CR) 10Alexandros Kosiaris: "is this still required ? The issue described in the task also presented itself in production and was kind of resolved with Ia9a9efb0031dfe" [puppet] - 10https://gerrit.wikimedia.org/r/447487 (https://phabricator.wikimedia.org/T171173) (owner: 10Thcipriani) [14:49:53] (03CR) 10Zfilipin: [C: 032] Revert "group1 wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447815 (https://phabricator.wikimedia.org/T200340) (owner: 10Zfilipin) [14:51:08] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447815 (https://phabricator.wikimedia.org/T200340) (owner: 10Zfilipin) [14:53:18] 10Operations, 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Cannot SSH to stat1004 - https://phabricator.wikimedia.org/T200330 (10GoranSMilovanovic) @Reedy Thanks - all is fine now. I do not use stat1004 often at all - there is only one of my scripts running there, everything else... [14:53:31] 10Operations, 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Cannot SSH to stat1004 - https://phabricator.wikimedia.org/T200330 (10GoranSMilovanovic) 05Open>03Resolved a:03GoranSMilovanovic [14:54:02] addshore, Amir1 now that .14 is rolled back, I see the same error on .13 :/ [14:57:15] 10Operations, 10netops: OSPF metrics - https://phabricator.wikimedia.org/T200277 (10ayounsi) > Could you explain that premise? 
What are we trying to optimize for? > > If a path with an extra hop in eqiad is the lowest latency path, that could just become our preferred path, despite not being direct? Also sinc... [14:58:41] zeljkof: what are these errors? [14:58:45] *reads up* [14:59:09] addshore: I have no clue, an empty string somewhere [14:59:23] is there a ticket? or logstash link? [14:59:30] (03PS1) 10Aaron Schulz: Revert "Revert "Make all non-test wikis write to both nutcracker and mcrouter again"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 [14:59:40] Yes [14:59:48] Wikibase\DataModel\Entity\EntityIdParsingException from line 49 of /srv/mediawiki/php-1.32.0-wmf.13/vendor/wikibase/data-model/src/Entity/DispatchingEntityIdParser.php: $serialization must not be an empty string [14:59:48] https://phabricator.wikimedia.org/T200340 [14:59:55] *looks at the ticket* [15:00:02] No stack though [15:00:42] it's coming from the api [15:00:58] it could just be a client making bad requests [15:01:38] nandub please leave. 
[15:01:54] (03CR) 10Ayounsi: [C: 031] "I didn't check if the servers were in the matching rows than their dns allocation, but checked the dns syntax and v4/v6 mapping and they a" [dns] - 10https://gerrit.wikimedia.org/r/447816 (https://phabricator.wikimedia.org/T195923) (owner: 10Cmjohnson) [15:02:17] added a stack to the ticket [15:03:00] (03CR) 10jenkins-bot: Revert "mariadb: Depool es1014 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447772 (owner: 10Jcrespo) [15:03:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447797 (owner: 10Marostegui) [15:03:04] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447815 (https://phabricator.wikimedia.org/T200340) (owner: 10Zfilipin) [15:03:23] (03CR) 10Vgutierrez: [C: 031] Test UPDATE generation of the NaiveBGPPeering [debs/pybal] - 10https://gerrit.wikimedia.org/r/436808 (owner: 10Mark Bergsma) [15:04:35] (03CR) 10Dzahn: [C: 031] "unfortunately no, i don't know about the app itself, just did something on the puppet side once" [puppet] - 10https://gerrit.wikimedia.org/r/441131 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [15:05:16] 10Operations, 10ops-eqiad: Relabel labnet1003.eqiad.wmnet as cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T199524 (10Cmjohnson) 05Open>03Resolved [15:05:33] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Epic: Relabel labcontrol1003.wikimedia.org as cloudcontrol1003.wikimedia.org - https://phabricator.wikimedia.org/T200080 (10Cmjohnson) 05Open>03Resolved [15:05:49] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Epic: Relabel labcontrol1004.wikimedia.org as cloudcontrol1004.wikimedia.org - https://phabricator.wikimedia.org/T199782 (10Cmjohnson) 05Open>03Resolved [15:06:16] 10Operations, 10ops-eqiad, 10cloud-services-team: Relabel 
labnet1004.eqiad.wmnet as cloudnet1004.eqiad.wmnet - https://phabricator.wikimedia.org/T199921 (10Cmjohnson) 05Open>03Resolved [15:06:43] Reedy: addshore: added stack trace to T200340 [15:06:44] T200340: Wikibase\DataModel\Entity\EntityIdParsingException $serialization must not be an empty string - https://phabricator.wikimedia.org/T200340 [15:06:53] ugh, beat me to it :) [15:06:55] lol [15:07:38] that should probably be caught, rather than just dying an evil exception death [15:09:44] (03CR) 10Vgutierrez: [C: 031] Fix handling of withdrawals for Inet Unicast [debs/pybal] - 10https://gerrit.wikimedia.org/r/436809 (owner: 10Mark Bergsma) [15:13:04] (03CR) 10Mark Bergsma: [C: 032] Extend NaiveBGPPeering unit testing [debs/pybal] - 10https://gerrit.wikimedia.org/r/436807 (owner: 10Mark Bergsma) [15:13:42] (03Merged) 10jenkins-bot: Extend NaiveBGPPeering unit testing [debs/pybal] - 10https://gerrit.wikimedia.org/r/436807 (owner: 10Mark Bergsma) [15:14:52] (03CR) 10Mark Bergsma: [C: 032] Test UPDATE generation of the NaiveBGPPeering [debs/pybal] - 10https://gerrit.wikimedia.org/r/436808 (owner: 10Mark Bergsma) [15:15:34] (03Merged) 10jenkins-bot: Test UPDATE generation of the NaiveBGPPeering [debs/pybal] - 10https://gerrit.wikimedia.org/r/436808 (owner: 10Mark Bergsma) [15:17:10] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447638 (https://phabricator.wikimedia.org/T197817) (owner: 10Anomie) [15:18:22] (03CR) 10Vgutierrez: [C: 031] Split off BGP factory/peering classes into a separate module [debs/pybal] - 10https://gerrit.wikimedia.org/r/436822 (owner: 10Mark Bergsma) [15:18:43] (03Merged) 10jenkins-bot: Set MCR write-both-read-old on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447638 (https://phabricator.wikimedia.org/T197817) (owner: 10Anomie) [15:19:07] (03CR) 10Mark Bergsma: [C: 032] Fix handling of withdrawals for Inet Unicast [debs/pybal] - 10https://gerrit.wikimedia.org/r/436809 
(owner: 10Mark Bergsma) [15:20:09] (03Merged) 10jenkins-bot: Fix handling of withdrawals for Inet Unicast [debs/pybal] - 10https://gerrit.wikimedia.org/r/436809 (owner: 10Mark Bergsma) [15:21:31] (03CR) 10Mark Bergsma: [C: 032] Split off BGP factory/peering classes into a separate module [debs/pybal] - 10https://gerrit.wikimedia.org/r/436822 (owner: 10Mark Bergsma) [15:21:48] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wgMultiContentRevisionSchemaMigrationStage to write-both read-old on testwiki (T197817) (duration: 00m 55s) [15:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:52] T197817: Enable MCR migration stage "write both, read old" on testwiki - https://phabricator.wikimedia.org/T197817 [15:22:18] (03Merged) 10jenkins-bot: Split off BGP factory/peering classes into a separate module [debs/pybal] - 10https://gerrit.wikimedia.org/r/436822 (owner: 10Mark Bergsma) [15:23:50] (03PS3) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 [15:24:02] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447639 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [15:25:10] (03CR) 10Vgutierrez: [C: 031] Cleanup setServers and clarify its use [debs/pybal] - 10https://gerrit.wikimedia.org/r/446565 (owner: 10Mark Bergsma) [15:25:28] (03Merged) 10jenkins-bot: Set MCR read-old-write-both on Beta Cluster 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/447639 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [15:26:52] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Sync labs config file, no prod impact (duration: 00m 55s) [15:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:06] (03Abandoned) 10Thcipriani: Beta: Fix for service dependency loops [puppet] - 10https://gerrit.wikimedia.org/r/447487 (https://phabricator.wikimedia.org/T171173) (owner: 10Thcipriani) [15:28:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447822 [15:30:00] Does anyone here have Google Webmaster Tools configured for the various wikipedia domains? (Specifically, it.wikipedia.org in this case.) [15:30:32] I have it.wikinews [15:30:33] lol [15:30:54] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447822 (owner: 10Marostegui) [15:31:33] marlier: ask Deskana, the last few times I have seen permissions handled for this it has gone through him in some respect :) [15:31:45] !log running populateContentTables.php on testwiki for T183488 [15:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:49] T183488: MCR schema migration stage 2: populate new fields - https://phabricator.wikimedia.org/T183488 [15:31:57] 10Operations, 10ORES, 10Scoring-platform-team (Current): Address mass overload errors in ORES (July 2018, UW origin) - https://phabricator.wikimedia.org/T200338 (10faidon) [15:32:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447822 (owner: 10Marostegui) [15:33:06] (03PS2) 10Mark Bergsma: Cleanup setServers and clarify its use [debs/pybal] - 10https://gerrit.wikimedia.org/r/446565 [15:33:07] 10Operations, 10Release Pipeline, 
10Release-Engineering-Team (Kanban): Get helm test to dump more information - https://phabricator.wikimedia.org/T200348 (10thcipriani) p:05Triage>03Normal [15:33:09] (03PS4) 10Mark Bergsma: Test Server invariants [debs/pybal] - 10https://gerrit.wikimedia.org/r/445207 (https://phabricator.wikimedia.org/T184715) [15:33:11] (03PS3) 10Mark Bergsma: Remove Server.modified [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 [15:33:13] (03PS2) 10Mark Bergsma: Don't recalculate server.up in refreshPreexistingServers [debs/pybal] - 10https://gerrit.wikimedia.org/r/447766 [15:33:15] (03PS2) 10Mark Bergsma: Don't depool pooledDownServers in refreshPreexistingServers [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) [15:33:17] (03PS2) 10Mark Bergsma: Extend testConfigServerRemoval test case. [debs/pybal] - 10https://gerrit.wikimedia.org/r/447770 [15:33:19] (03PS2) 10Mark Bergsma: Modernize and cleanup Coordinator [debs/pybal] - 10https://gerrit.wikimedia.org/r/447775 [15:33:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3317 (duration: 00m 55s) [15:33:23] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Get helm test to dump more information - https://phabricator.wikimedia.org/T200348 (10thcipriani) a:03thcipriani [15:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:12] (03CR) 10Vgutierrez: Remove Server.modified (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 (owner: 10Mark Bergsma) [15:36:16] (03CR) 10Mark Bergsma: Remove Server.modified (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 (owner: 10Mark Bergsma) [15:36:30] (03CR) 10Mark Bergsma: [C: 032] Cleanup setServers and clarify its use [debs/pybal] - 
10https://gerrit.wikimedia.org/r/446565 (owner: 10Mark Bergsma) [15:37:05] (03Merged) 10jenkins-bot: Cleanup setServers and clarify its use [debs/pybal] - 10https://gerrit.wikimedia.org/r/446565 (owner: 10Mark Bergsma) [15:38:43] (03CR) 10Paladox: [C: 031] planet: drop feed templates for planet-venus/stretch [puppet] - 10https://gerrit.wikimedia.org/r/447746 (owner: 10Dzahn) [15:38:47] (03PS3) 10Dzahn: planet: tune feed name, description, owneremail, maxarticles [puppet] - 10https://gerrit.wikimedia.org/r/447743 [15:40:19] (03CR) 10Vgutierrez: Remove Server.modified (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 (owner: 10Mark Bergsma) [15:41:11] (03PS4) 10Dzahn: planet: tune feed name, description, owneremail, maxarticles [puppet] - 10https://gerrit.wikimedia.org/r/447743 [15:41:23] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10LGoto) [15:44:32] !log joal@deploy1001 Started deploy [analytics/refinery@9390b63]: Regular weekly deploy of Analytics-Hadoop scripts [15:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:30] (03CR) 10Vgutierrez: Don't recalculate server.up in refreshPreexistingServers (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/447766 (owner: 10Mark Bergsma) [15:45:33] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog, 10Traffic: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10LGoto) [15:45:42] (03PS3) 10Dzahn: planet: drop feed templates for planet-venus/jessie [puppet] - 10https://gerrit.wikimedia.org/r/447746 (https://phabricator.wikimedia.org/T180498) [15:46:30] (03CR) 10Dzahn: [C: 032] planet: tune feed name, description, owneremail, maxarticles [puppet] - 10https://gerrit.wikimedia.org/r/447743 (owner: 10Dzahn) [15:47:52] 10Operations, 10ops-eqsin, 10Traffic: cp5006 
unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) Ok, so the email back from them when I woke up this AM was a bit confusing, but boils down to this: * They seem to have replaced the mainboard, and set the temp drac password as requested. * @robh... [15:48:20] (03CR) 10Mark Bergsma: Remove Server.modified (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 (owner: 10Mark Bergsma) [15:49:00] !log elasticsearch cluster restart on codfw completed - T156137 [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:04] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [15:49:13] (03CR) 10jenkins-bot: Set MCR write-both-read-old on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447638 (https://phabricator.wikimedia.org/T197817) (owner: 10Anomie) [15:49:15] (03CR) 10jenkins-bot: Set MCR read-old-write-both on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447639 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [15:49:17] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447822 (owner: 10Marostegui) [15:53:26] (03PS4) 10Dzahn: planet: drop feed templates for planet-venus/jessie [puppet] - 10https://gerrit.wikimedia.org/r/447746 (https://phabricator.wikimedia.org/T180498) [15:54:50] (03CR) 10Dzahn: "change looks bigger than it is because we move templates around but compiler confirms: http://puppet-compiler.wmflabs.org/11854/" [puppet] - 10https://gerrit.wikimedia.org/r/447746 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [15:54:56] (03CR) 10Dzahn: [C: 032] planet: drop feed templates for planet-venus/jessie [puppet] - 10https://gerrit.wikimedia.org/r/447746 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [15:55:16] (03PS1) 10Rush: icinga: changing rush notifications for vacation time [puppet] - 
10https://gerrit.wikimedia.org/r/447827 [15:57:22] (03PS2) 10Rush: icinga: changing rush notifications for vacation time [puppet] - 10https://gerrit.wikimedia.org/r/447827 [15:58:05] (03CR) 10Rush: [C: 032] icinga: changing rush notifications for vacation time [puppet] - 10https://gerrit.wikimedia.org/r/447827 (owner: 10Rush) [15:58:32] (03PS2) 10Cmjohnson: Add ipv4/ipv6 dns cp1075-1090 [dns] - 10https://gerrit.wikimedia.org/r/447816 (https://phabricator.wikimedia.org/T195923) [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180725T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:02:23] !log joal@deploy1001 Finished deploy [analytics/refinery@9390b63]: Regular weekly deploy of Analytics-Hadoop scripts (duration: 17m 51s) [16:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:01] (03CR) 10Dzahn: [C: 031] "all my comments on the same change for "iegreview" as opposed to scholarships app are also true here" [puppet] - 10https://gerrit.wikimedia.org/r/441133 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [16:11:53] chasemp: we both made changes to icinga contacts and i noticed this error now: Error: Could not find any contact matching 'rush' [16:12:28] looks like it's still in a contactgroup while the contact itself is gone [16:20:31] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [16:21:15] mutante: that's weird...I stepped out to lunch can you revert as needed? [16:22:54] chasemp: yes, i will. 
the issue is probably between "rush" and "rush-wmcs" [16:23:27] (03PS1) 10Dzahn: Revert "icinga: changing rush notifications for vacation time" [puppet] - 10https://gerrit.wikimedia.org/r/447832 [16:24:44] Tx mutante, I can look in just a bit [16:25:08] yep, no worries. enjoy lunch [16:25:23] (03CR) 10Dzahn: [C: 032] Revert "icinga: changing rush notifications for vacation time" [puppet] - 10https://gerrit.wikimedia.org/r/447832 (owner: 10Dzahn) [16:29:07] wth, now it complains about the _other_ contact not found.. looking [16:31:57] ah, it uses a template to create a contact [16:35:37] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10Cmjohnson) a:05Cmjohnson>03RobH @RobH can you take over the installs from here. I did do production dns, please review and merge if okay. I am not seeing a physic... [16:37:25] (03PS1) 10Ladsgroup: Enable reading from the new backend for Special:Tags in several large wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447834 (https://phabricator.wikimedia.org/T199334) [16:38:03] Any objections if I deploy this right now? 
It's SWAT window right now ^ [16:38:16] (03PS1) 10Bstorm: gridengine: correct a few issues with stretch exec packages [puppet] - 10https://gerrit.wikimedia.org/r/447835 (https://phabricator.wikimedia.org/T199276) [16:38:24] (03PS1) 10Ema: vcl: add cluster_{fe,be}_vcl_switch hooks [puppet] - 10https://gerrit.wikimedia.org/r/447836 (https://phabricator.wikimedia.org/T164609) [16:38:32] (03Abandoned) 10Dzahn: add bast3003 to site and network constants [puppet] - 10https://gerrit.wikimedia.org/r/405226 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [16:39:32] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447834 (https://phabricator.wikimedia.org/T199334) (owner: 10Ladsgroup) [16:40:25] !log remove ORES abuser blocking T200338, let's reevaluate [16:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:29] T200338: Address mass overload errors in ORES (July 2018, UW origin) - https://phabricator.wikimedia.org/T200338 [16:41:11] (03Merged) 10jenkins-bot: Enable reading from the new backend for Special:Tags in several large wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447834 (https://phabricator.wikimedia.org/T199334) (owner: 10Ladsgroup) [16:44:18] (03CR) 10Bstorm: [C: 032] gridengine: correct a few issues with stretch exec packages [puppet] - 10https://gerrit.wikimedia.org/r/447835 (https://phabricator.wikimedia.org/T199276) (owner: 10Bstorm) [16:45:26] works in mwdebug1002, moving forward [16:48:13] mutante: things settle down? [16:49:06] chasemp: no, i waited for a puppet run for a while. 
but maybe now after that is done and we re-revert that [16:49:13] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:447834|Enable reading from the new backend for Special:Tags in several large wikis (T199334)]] (duration: 00m 56s) [16:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:17] T199334: Temporarily add config and use it to use change_tag_def table instead of change_tag table for Special:Tags - https://phabricator.wikimedia.org/T199334 [16:49:22] ^ done [16:49:24] the error changed from not finding rush to not finding rush-wmcs [16:49:33] so i will re-revert [16:49:35] !log morning SWAT is done [16:49:35] mutante: it was two commits that were interdependent so I wonder if you didn't catch it in a bad cycle initially and then it got more confused :) [16:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:16] chasemp: also the part where running puppet on einsteinium first failed with "error 503 on Server" ;/ [16:50:23] and yes [16:50:25] even better! :) [16:50:50] (03PS1) 10Dzahn: Revert "Revert "icinga: changing rush notifications for vacation time"" [puppet] - 10https://gerrit.wikimedia.org/r/447838 [16:51:22] (03PS2) 10Dzahn: Revert "Revert "icinga: changing rush notifications for vacation time"" [puppet] - 10https://gerrit.wikimedia.org/r/447838 [16:51:49] (03CR) 10Dzahn: [C: 032] "ran puppet on einsteinium. < mutante> the error changed from not finding rush to not finding rush-wmcs" [puppet] - 10https://gerrit.wikimedia.org/r/447838 (owner: 10Dzahn) [16:53:34] ok, running puppet again... 
in progress [16:55:08] chasemp: this puppet run changed contactgroup members again to before [16:55:12] and the error is back to "- members irc-cloud-feed,aborrero-wmcs,andrew-wmcs,bd808,bstorm-wmcs,rush-wmcs [16:55:15] + members irc-cloud-feed,rush,aborrero-wmcs,andrew-wmcs,bd808,bstorm-wmcs [16:55:20] Could not find any contact matching 'rush' [16:55:44] mutante: yes that only works if the private change is merged as well [16:55:51] which changed rush-template => rush [16:55:56] iiuc [16:56:56] yea, but that shows up in git log [16:58:01] mutante: is there some issue with syncing private to the icinga server? when I look at HEAD now I do see [16:58:04] define contact{ [16:58:04] name rush [16:58:41] but I can take on debugging if you want no worries [16:59:29] my own change after that has been synced to einsteinium.. at least now [16:59:58] * chasemp nods [16:59:59] kk [17:00:08] 10Operations, 10ORES, 10Scoring-platform-team (Current): Address mass overload errors in ORES (July 2018, UW origin) - https://phabricator.wikimedia.org/T200338 (10Halfak) Was a UW researcher. I'll work with her to continue :) [17:01:25] (03PS4) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 [17:01:31] chasemp: yes, please try. maybe it had a temp problem [17:01:41] mutante: cool, no worries, I'll dig in [17:01:46] thanks [17:08:21] (03CR) 10jenkins-bot: Enable reading from the new backend for Special:Tags in several large wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447834 (https://phabricator.wikimedia.org/T199334) (owner: 10Ladsgroup) [17:15:43] 10Operations, 10Commons, 10Patch-For-Review: Improve mwmaint servers (e.g. 
mwmain1001) userland to process server side uploads - https://phabricator.wikimedia.org/T159661 (10Krinkle) [17:20:05] * Krinkle staging on mwdeploy1001/mwdebug1002 to roll-out log-spam fix https://phabricator.wikimedia.org/T182929#4451062 to wmf.14 [17:21:59] (03PS5) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 [17:23:32] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.074 second response time [17:31:46] (03PS6) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 [17:38:39] (03PS1) 10Dzahn: netbox: add psql dump cron and back it up [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) [17:39:16] (03CR) 10jerkins-bot: [V: 04-1] netbox: add psql dump cron and back it up [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [17:41:29] (03PS2) 10Dzahn: netbox: add psql dump cron and back it up [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) [17:42:31] (03PS7) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 [17:43:52] 10Operations, 10Traffic, 
10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Tgr) The domain was removed in {0593daa89b07982b67121bb6d14f05974d3e5914}. I... [17:45:41] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/Cite/includes/Cite.php: (no justification provided) (duration: 00m 55s) [17:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:14] bstorm_: fyi I think the above toolschecker alert is probably a canary for NFS fallout^^^ [17:47:59] Fair enough. [17:49:33] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447640 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [17:49:44] (03CR) 10jerkins-bot: [V: 04-1] Set MCR read-new-write-both on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447640 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [17:50:06] (03PS2) 10Anomie: Set MCR read-new-write-both on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447640 (https://phabricator.wikimedia.org/T198311) [17:50:06] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Tgr) I guess short term fix is to disable thumbnail prerendering since it is... 
[17:50:21] (03CR) 10Anomie: Set MCR read-new-write-both on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447640 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [17:50:28] (03CR) 10Anomie: [C: 032] "Try again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447640 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [17:51:29] 10Operations, 10ORES, 10Scoring-platform-team (Current): Address mass overload errors in ORES (July 2018, UW origin) - https://phabricator.wikimedia.org/T200338 (10elukey) @Halfak as another follow up step, I'd also add more monitoring to catch these situations. We noticed the issue because Jaime was watchin... [17:52:35] (03Merged) 10jenkins-bot: Set MCR read-new-write-both on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447640 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [17:53:46] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Sync labs config file, no prod impact (duration: 00m 54s) [17:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:07] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Imarlier) @Tgr I think that's right. Do you mind doing so? [17:54:26] (03CR) 10jenkins-bot: Set MCR read-new-write-both on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447640 (https://phabricator.wikimedia.org/T198311) (owner: 10Anomie) [18:00:01] 10Operations, 10ORES, 10Scoring-platform-team (Current): Address mass overload errors in ORES (July 2018, UW origin) - https://phabricator.wikimedia.org/T200338 (10awight) Just a minor note: in the past, overload events like this have resulted in the collapse of ORES worker nodes, but in this case the worker... 
[18:01:01] (03PS1) 10Imarlier: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) [18:01:31] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Imarlier) Actually, I take that back. We should be abl... [18:02:33] (03PS2) 10Imarlier: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) [18:04:16] (03CR) 10jerkins-bot: [V: 04-1] wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:07:26] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Mooeypoo) I just want to make a point about the specific term "Judgment" -- it has a big potential of setting the t... 
[18:08:33] (03PS3) 10Imarlier: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) [18:09:04] (03PS1) 10Dzahn: postgresql: add defined type to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) [18:09:43] (03CR) 10jerkins-bot: [V: 04-1] postgresql: add defined type to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [18:10:06] (03CR) 10jerkins-bot: [V: 04-1] wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:10:32] (03CR) 10Imarlier: "Verified that this setting has an appropriate default (set to 'false' in DefaultSettings.php), and that it's not referenced anywhere in Me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:11:09] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @Mooeypoo Very interesting, thanks for flagging this! I think the term may have originally come about due... 
[18:11:44] (03PS2) 10Dzahn: postgresql: add defined type to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) [18:12:34] (03CR) 10jerkins-bot: [V: 04-1] postgresql: add defined type to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [18:13:26] (03CR) 10Dzahn: "upon further thought i should probably make this a feature of the postgresql module itself, please see here now: i would then use that def" [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [18:16:09] (03PS3) 10Dzahn: netbox: add psql dump cron and back it up [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) [18:18:13] (03PS4) 10Imarlier: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) [18:19:11] (03CR) 10Gergő Tisza: "AIUI this would make the request go through Varnish, and the custom 
domain was used to avoid that. Not sure how important that is. To disa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:20:09] (03PS3) 10Dzahn: postgresql: add defined type to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) [18:22:19] (03CR) 10Gergő Tisza: "Also, no point in setting a custom host if the domain is non-custom (ie. it is upload.wikimedia.org)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:25:13] (03CR) 10Imarlier: "> AIUI this would make the request go through Varnish, and the custom" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:26:39] (03CR) 10Dzahn: "It seems like $atavars_host is undefined and not actually set to something but then used in Apache as ServerName. That would probably brea" [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [18:28:05] (03CR) 10Imarlier: "> Also, no point in setting a custom host if the domain is non-custom" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:33:17] (03CR) 10Dzahn: [C: 031] "this comment sounds like it's ready to be merged now, ack?" [puppet] - 10https://gerrit.wikimedia.org/r/431042 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [18:36:11] 10Operations, 10Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (10Dzahn) >>! 
In T192092#4422135, @jcrespo wrote: > There is an undocumented grant from californium.wikimedia.org to striker @bd808 > No more grants on m5 referenc... [18:41:46] (03CR) 10Gergő Tisza: "> As far as I can tell, that's right - the request will end up going to varnish. Is that really a problem, though?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:42:39] (03CR) 10Gergő Tisza: [C: 031] wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:45:36] (03PS5) 10Imarlier: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) [18:48:57] (03PS17) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) [18:49:04] (03PS18) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) [18:50:01] (03PS7) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to apache [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) [18:50:27] (03PS5) 10Paladox: Gerrit: Clone avatars repo into /var/www/avatars [puppet] - 10https://gerrit.wikimedia.org/r/440104 [18:51:09] 10Operations, 10Wikimedia-Logstash, 10monitoring: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10herron) p:05Triage>03Normal [18:51:14] (03PS1) 10Thcipriani: Merge tag 'v2.15.3' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/447846 [18:52:30] (03CR) 10Krinkle: "Codesearch shows usage in wmf-config for upload.svc" [dns] - 10https://gerrit.wikimedia.org/r/228801 (owner: 10Faidon Liambotis) [18:52:53] (03CR) 10Krinkle: "Reported at T200346 instead. 
(too old for fixme I guess)" [dns] - 10https://gerrit.wikimedia.org/r/228801 (owner: 10Faidon Liambotis) [18:54:40] (03CR) 10Krinkle: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [18:57:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150 (10cooltey) Hi @RobH, I tried to access the release server to upload an alpha APK to it, but I cannot access it successfully. It returns `ssh_exchange_iden... [18:58:07] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) We did recently just rename the namespace from `Jade` to `Judgment` since `Judgment` is better semantics (na... [18:58:22] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.009 second response time [18:58:45] andrewbogott: bstorm_: taht canary just came back to life^ [18:58:58] I see that [18:59:08] Good to note [18:59:42] !log Added new disk shelf to labstore1007 [18:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180725T1900) [19:00:09] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) Before this gets out of hand, let's please discuss the name in a subtask, it's important that the deeper, s... 
[19:00:42] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Halfak) {T200365} [19:02:24] (03CR) 10Krinkle: [C: 031] wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [19:02:40] (03PS6) 10Imarlier: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) [19:03:18] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Krinkle) At risk of asking the obvious – have we decide... [19:03:22] (03CR) 10Imarlier: "Per Krinkle's comment, revised to just remove the setting entirely." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [19:04:58] (03PS7) 10Imarlier: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) [19:10:11] (03CR) 10Krinkle: [C: 031] wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [19:11:42] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Tgr) 200 is the default value for that property; overri... [19:12:48] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Anomie) >>! In T200346#4451345, @Krinkle wrote: > At ri... 
[19:14:52] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Krinkle) >>! In T200346#4451362, @Anomie wrote: > "0 is... [19:15:06] 10Operations, 10Wikimedia-Logstash, 10monitoring: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10herron) Spent some time setting up and experimenting with logstash_exporter (https://github.com/BonnierNews/logstash_exporter) It's now up and running for testing on h... [19:15:31] 10Operations, 10Wikimedia-Logstash, 10monitoring, 10User-herron: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10herron) [19:15:56] 10Operations, 10Wikimedia-Logstash, 10monitoring, 10User-herron: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10herron) [19:15:58] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10herron) [19:27:37] (03CR) 10Paladox: [C: 031] "LGTM (reviewed all the files)" [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/447846 (owner: 10Thcipriani) [19:33:11] (03CR) 10BBlack: [C: 031] vcl: add cluster_{fe,be}_vcl_switch hooks [puppet] - 10https://gerrit.wikimedia.org/r/447836 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [19:35:23] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) a:05RobH>03BBlack I'll take over these from here. It's a very new hardware config we'll have to develop some puppet-level fixups for as we test how the inst... 
[19:50:21] (03PS1) 10BBlack: cp5006: uncomment in upload hieradata list [puppet] - 10https://gerrit.wikimedia.org/r/447849 (https://phabricator.wikimedia.org/T187157) [19:50:38] (03CR) 10BBlack: [C: 032] cp5006: uncomment in upload hieradata list [puppet] - 10https://gerrit.wikimedia.org/r/447849 (https://phabricator.wikimedia.org/T187157) (owner: 10BBlack) [19:50:48] (03CR) 10BBlack: [V: 032 C: 032] cp5006: uncomment in upload hieradata list [puppet] - 10https://gerrit.wikimedia.org/r/447849 (https://phabricator.wikimedia.org/T187157) (owner: 10BBlack) [19:52:28] (03PS1) 10Bstorm: dumps distribution: add labstore1007 back to list of nfs servers [puppet] - 10https://gerrit.wikimedia.org/r/447850 [19:52:47] greg-g: zeljkof: If no train this hour, OK to roll-out this fix that would unblock wmf.14? [19:52:48] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/447843/ [19:54:16] (03CR) 10Bstorm: [C: 032] dumps distribution: add labstore1007 back to list of nfs servers [puppet] - 10https://gerrit.wikimedia.org/r/447850 (owner: 10Bstorm) [19:54:28] (03PS2) 10Bstorm: dumps distribution: add labstore1007 back to list of nfs servers [puppet] - 10https://gerrit.wikimedia.org/r/447850 [19:55:17] (03PS1) 10MSantos: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) [19:55:49] (03CR) 10jerkins-bot: [V: 04-1] Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [19:57:17] (03PS4) 10EBernhardson: Split elasticsearch::log::hot_threads into two pieces [puppet] - 10https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) [19:57:19] (03PS8) 10EBernhardson: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) [19:57:21] (03PS6) 10EBernhardson: Split per-cluster 
config out of elasticsearch::curator [puppet] - 10https://gerrit.wikimedia.org/r/447567 (https://phabricator.wikimedia.org/T180807) [19:57:23] (03PS12) 10EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) [19:57:25] (03PS5) 10EBernhardson: Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) [19:57:27] (03PS25) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) [19:57:29] (03PS25) 10EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351) [19:57:39] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Imarlier) At this point, just waiting on someone with a... [19:58:35] (03PS2) 10MSantos: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) [19:58:47] k, I'll roll it out now [19:58:49] (03CR) 10Krinkle: [C: 032] wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [19:59:13] * Krinkle staging on mwdeploy1001/mwdebug1002 [19:59:51] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5006_v4, cp5006_v6 [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180725T2000). [20:01:04] (03PS1) 10Gehel: wdqs: don't load kafka config if we're not using kafka [puppet] - 10https://gerrit.wikimedia.org/r/447853 [20:01:19] ipsec alert is me, I don't think there will be many others, it was a close race! [20:01:32] (03PS1) 10Andrew Bogott: designate: pass region to wmf_sink [puppet] - 10https://gerrit.wikimedia.org/r/447854 [20:02:01] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp5006_v4, cp5006_v6 [20:02:23] (03CR) 10jerkins-bot: [V: 04-1] Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [20:03:01] (03CR) 10jerkins-bot: [V: 04-1] designate: pass region to wmf_sink [puppet] - 10https://gerrit.wikimedia.org/r/447854 (owner: 10Andrew Bogott) [20:03:02] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [20:03:05] Krinkle: it's really late here, but sounds good to me [20:03:11] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK [20:03:25] (03Merged) 10jenkins-bot: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [20:03:27] (03CR) 10Smalyshev: [C: 031] wdqs: don't load kafka config if we're not using kafka [puppet] - 10https://gerrit.wikimedia.org/r/447853 (owner: 10Gehel) [20:04:00] Nothing for ORES today. 
[20:04:38] (03PS2) 10Andrew Bogott: designate: pass region to wmf_sink [puppet] - 10https://gerrit.wikimedia.org/r/447854 [20:05:15] !log krinkle@deploy1001 Synchronized wmf-config/: Remove wgUploadThumbnailRenderHttpCustomDomain override - T200346 (duration: 00m 57s) [20:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:19] T200346: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 [20:06:13] (03PS3) 10MSantos: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) [20:06:51] (03PS2) 10Gehel: wdqs: don't load kafka config if we're not using kafka [puppet] - 10https://gerrit.wikimedia.org/r/447853 [20:06:53] (03CR) 10jerkins-bot: [V: 04-1] Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [20:08:00] (03CR) 10Gehel: [C: 032] wdqs: don't load kafka config if we're not using kafka [puppet] - 10https://gerrit.wikimedia.org/r/447853 (owner: 10Gehel) [20:08:15] (03PS3) 10Andrew Bogott: designate: pass region to wmf_sink [puppet] - 10https://gerrit.wikimedia.org/r/447854 [20:08:58] 10Operations, 10Traffic, 10netops: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi) Looking at doing this Wednesday August 1st, 3 PM UTC, 1h expected. 1 link at a time, only on the primary of the redundant ones, and outside link maintenance. 
[20:09:14] Krinkle: yes (sorry for delay) [20:12:15] (03PS4) 10MSantos: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) [20:12:37] (03PS4) 10Andrew Bogott: designate: pass region to wmf_sink [puppet] - 10https://gerrit.wikimedia.org/r/447854 [20:12:41] (03PS1) 10Cooltey: Correct the ssh_keys format under user: cooltey [puppet] - 10https://gerrit.wikimedia.org/r/447859 [20:12:47] (03CR) 10jerkins-bot: [V: 04-1] Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [20:13:13] (03PS1) 10Bstorm: gridengine: include the right profile for the right OS [puppet] - 10https://gerrit.wikimedia.org/r/447860 (https://phabricator.wikimedia.org/T199276) [20:13:21] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Krinkle) 05Open>03Resolved a:03Krinkle Tentativel... 
[20:13:42] (03CR) 10jerkins-bot: [V: 04-1] gridengine: include the right profile for the right OS [puppet] - 10https://gerrit.wikimedia.org/r/447860 (https://phabricator.wikimedia.org/T199276) (owner: 10Bstorm) [20:15:11] (03Abandoned) 10Bstorm: gridengine: include the right profile for the right OS [puppet] - 10https://gerrit.wikimedia.org/r/447860 (https://phabricator.wikimedia.org/T199276) (owner: 10Bstorm) [20:15:43] (03PS5) 10Andrew Bogott: designate: pass region to wmf_sink [puppet] - 10https://gerrit.wikimedia.org/r/447854 [20:16:31] (03CR) 10Andrew Bogott: [C: 032] designate: pass region to wmf_sink [puppet] - 10https://gerrit.wikimedia.org/r/447854 (owner: 10Andrew Bogott) [20:18:11] (03CR) 10jenkins-bot: wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447843 (https://phabricator.wikimedia.org/T200346) (owner: 10Imarlier) [20:20:19] (03CR) 10MSantos: "I created the cron job inside the tilerator module because it makes more sense. I fixed some bugs but I have no idea what could be causing" [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [20:23:03] !log pooling cp5006 - T187157 [20:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:07] T187157: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 [20:24:44] 10Operations, 10ops-eqsin, 10Traffic, 10Patch-For-Review: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557 (10BBlack) [20:24:46] 10Operations, 10ops-eqsin, 10Traffic, 10Patch-For-Review: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10BBlack) 05Open>03Resolved cp5006 is now installed and puppeted and in-service, should be all fixed up assuming nothing bursts into flames in the near future. 
[20:25:24] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10BBlack) p:05High>03Normal [20:31:56] (03PS1) 10BBlack: varnishkafka drerr monitoring: Fix duplicate descriptions in icinga [puppet] - 10https://gerrit.wikimedia.org/r/447913 [20:33:20] (03CR) 10BBlack: [C: 032] varnishkafka drerr monitoring: Fix duplicate descriptions in icinga [puppet] - 10https://gerrit.wikimedia.org/r/447913 (owner: 10BBlack) [20:33:31] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Krinkle) >>! In T196547#4446016, @awight wrote: > Here are the notes from our meeting, plus some more discussion af... [20:33:34] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [20:33:58] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/ [20:34:01] or maybe this blog by freenode staff member Matthew 'mst' Trout https://MattSTrout.com/ [20:34:05] Read what IRC investigative journalists have uncovered on the freenode pedophilia scandal https://encyclopediadramatica.rs/Freenodegate [20:34:11] This message was brought to you by Private Internet Access [20:36:18] (03PS1) 10Bstorm: gridengine: switch back to extended locales for debian [puppet] - 
10https://gerrit.wikimedia.org/r/447914 (https://phabricator.wikimedia.org/T199276) [20:36:53] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Milimetric) Just discussed this in TechCom, saw the minutes of the meeting with @mark and @jcrespo. Would you like... [20:37:33] (03CR) 10Bstorm: [C: 032] gridengine: switch back to extended locales for debian [puppet] - 10https://gerrit.wikimedia.org/r/447914 (https://phabricator.wikimedia.org/T199276) (owner: 10Bstorm) [20:37:41] (03PS2) 10Bstorm: gridengine: switch back to extended locales for debian [puppet] - 10https://gerrit.wikimedia.org/r/447914 (https://phabricator.wikimedia.org/T199276) [20:41:47] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @Milimetric Great, I'd love to have an IRC meeting any time that's convenient, and happy to also discuss ra... 
[20:44:48] (03PS13) 10EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) [20:44:50] (03PS6) 10EBernhardson: Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) [20:44:52] (03PS26) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) [20:44:54] (03PS26) 10EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351) [20:44:56] (03PS28) 10EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 (https://phabricator.wikimedia.org/T198351) [20:44:58] (03PS31) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) [20:46:06] (03CR) 10jerkins-bot: [V: 04-1] Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [20:46:44] (03CR) 10Paladox: "Needs @Thcipriani to review please." [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [20:46:47] (03CR) 10Paladox: "Needs @Thcipriani to review please." [puppet] - 10https://gerrit.wikimedia.org/r/439808 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [20:46:49] (03CR) 10jerkins-bot: [V: 04-1] Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [20:46:51] (03CR) 10Paladox: "Needs @Thcipriani to review please." 
[puppet] - 10https://gerrit.wikimedia.org/r/440104 (owner: 10Paladox) [20:47:23] (three different changes, not the same one :)) [20:50:42] (03CR) 10jerkins-bot: [V: 04-1] prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [20:52:03] (03CR) 10jerkins-bot: [V: 04-1] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [20:55:32] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 41 probes of 308 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:57:57] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/ [20:57:57] or maybe this blog by freenode staff member Matthew 'mst' Trout https://MattSTrout.com/ [20:58:31] i'm sure thats 100% legit, how could it be anything elsE? [20:58:39] /s [20:59:31] lol [20:59:42] (03PS1) 10RobH: fixing typo in cooltey's ssh key entry [puppet] - 10https://gerrit.wikimedia.org/r/447915 (https://phabricator.wikimedia.org/T190150) [21:00:17] lol they tried to embed malware via a html script tag in -dev [21:01:08] cooltey: yeahhhhhhhh, typo had ssh-rsa listed twice in your ssh key entry [21:01:15] which would explain why things wouldnt work [21:01:18] fixing it now! [21:02:29] robh: thanks!! [21:02:33] (03CR) 10RobH: [C: 032] fixing typo in cooltey's ssh key entry [puppet] - 10https://gerrit.wikimedia.org/r/447915 (https://phabricator.wikimedia.org/T190150) (owner: 10RobH) [21:02:56] cooltey: what bastion (do you access tunnel through) and what system did you want to login to? 
[21:03:03] ill manually run puppet on them so you dont have to wait 30 minutes to test [21:03:19] (03Abandoned) 10Cooltey: Correct the ssh_keys format under user: cooltey [puppet] - 10https://gerrit.wikimedia.org/r/447859 (owner: 10Cooltey) [21:03:56] or we wait 30 and it will have run everywhere, but i feel bad making you wait ;D [21:04:35] * robh kicks bast1002.wikimedia.org to run puppet now [21:04:54] yes bast1002 [21:05:14] ok, it just got your updated ssh key [21:05:21] go ahead and ssh into bast1002.wikimedia.org, it should work now [21:05:26] (this can be first test) [21:05:50] when i updated it the first time i introduced a stupid typo =P [21:05:52] sorry about that! [21:06:03] yeah! can access it now! Thanks 😂 [21:06:15] no problem :) [21:06:20] cool, which analytics machine did you need to get to, or releases? whichever one it was [21:06:26] i can manually kick puppet on it now for you if you like [21:06:52] seeing emoticons in irc is still so very strange for me [21:06:56] also makes me think of no_justification [21:07:05] releases1001.eqiad.wmnet [21:07:31] ok, puppet is runnign now, it takes a couple of minutes [21:08:01] ok, it has it [21:08:06] you should be ok to login there now as well [21:09:00] yes, logged in [21:09:08] (03PS1) 10Andrew Bogott: region-migrate: add dns validation [puppet] - 10https://gerrit.wikimedia.org/r/447918 [21:09:47] thanks! it works [21:09:53] (03PS2) 10Andrew Bogott: region-migrate: add dns validation [puppet] - 10https://gerrit.wikimedia.org/r/447918 [21:11:12] (03CR) 10Andrew Bogott: [C: 032] region-migrate: add dns validation [puppet] - 10https://gerrit.wikimedia.org/r/447918 (owner: 10Andrew Bogott) [21:11:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150 (10RobH) Ok, fixed the typo and synced with @cooltey via irc. Login is now working. 
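The failure discussed above (the key type `ssh-rsa` appearing twice in an ssh key entry) is the kind of typo a small pre-merge check could catch. What follows is only a minimal sketch: the function name `check_ssh_key_entry` and the `KEY_TYPES` set are hypothetical, and the input is assumed to be a single OpenSSH-style public key string ("type base64-data comment"). The layout of the actual puppet data file is not shown in this log, so nothing here reflects how the repository stores or validates keys.

```python
from typing import List

# Common OpenSSH public key type tokens (illustrative, not exhaustive).
KEY_TYPES = {
    "ssh-rsa",
    "ssh-dss",
    "ssh-ed25519",
    "ecdsa-sha2-nistp256",
    "ecdsa-sha2-nistp384",
    "ecdsa-sha2-nistp521",
}


def check_ssh_key_entry(entry: str) -> List[str]:
    """Return a list of problems found in one public key string.

    Hypothetical helper: assumes the entry looks like
    "<key-type> <base64-data> [comment]".
    """
    fields = entry.split()
    if not fields:
        return ["empty entry"]

    problems = []
    if fields[0] not in KEY_TYPES:
        problems.append(f"unknown key type: {fields[0]}")
    if len(fields) < 2:
        problems.append("missing key data")
    elif fields[1] in KEY_TYPES:
        # The typo from the log: the key type token repeated before the data.
        problems.append(f"key type listed twice: {fields[0]} {fields[1]}")
    return problems


if __name__ == "__main__":
    # Placeholder key data, not a real key.
    print(check_ssh_key_entry("ssh-rsa ssh-rsa AAAAB3Nza user@host"))
```

An empty result means the entry passed these few checks. It does not mean the key is valid, since the base64 data is never decoded.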
[21:15:04] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10TheDJ) > I see. So all network/transport level errors,... [21:19:20] (03PS14) 10EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) [21:19:22] (03PS7) 10EBernhardson: Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) [21:19:24] (03PS27) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) [21:19:26] (03PS27) 10EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351) [21:19:28] (03PS29) 10EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 (https://phabricator.wikimedia.org/T198351) [21:19:30] (03PS32) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) [21:19:32] (03PS59) 10EBernhardson: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) [21:19:34] (03PS4) 10EBernhardson: Cleanup ensure => absent after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/444765 (https://phabricator.wikimedia.org/T198351) [21:23:28] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/ [21:27:25] (03CR) 10BBlack: [C: 032] Add ipv4/ipv6 dns cp1075-1090 [dns] - 
10https://gerrit.wikimedia.org/r/447816 (https://phabricator.wikimedia.org/T195923) (owner: 10Cmjohnson) [21:28:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:30:25] (03PS1) 10BBlack: cp1075: set installer macaddr [puppet] - 10https://gerrit.wikimedia.org/r/447920 [21:31:54] (03CR) 10BBlack: [C: 032] cp1075: set installer macaddr [puppet] - 10https://gerrit.wikimedia.org/r/447920 (owner: 10BBlack) [21:40:58] (03PS1) 10BBlack: cp1075: try stretch installer, for now [puppet] - 10https://gerrit.wikimedia.org/r/447921 [21:41:35] (03CR) 10BBlack: [V: 032 C: 032] cp1075: try stretch installer, for now [puppet] - 10https://gerrit.wikimedia.org/r/447921 (owner: 10BBlack) [21:41:41] PROBLEM - Long running screen/tmux on analytics1003 is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 46200, 1736040s 1728000s). 
[21:48:00] <92AADK2WL> Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/ [21:48:00] <92AADK2WL> or maybe this blog by freenode staff member Matthew 'mst' Trout https://MattSTrout.com/ [21:49:31] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:53:10] (03PS1) 10Smalyshev: Create wikidata ntriples dump from ttl dump [puppet] - 10https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) [21:57:54] (03PS60) 10EBernhardson: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) [21:57:56] (03PS5) 10EBernhardson: Cleanup ensure => absent after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/444765 (https://phabricator.wikimedia.org/T198351) [21:58:30] (03PS2) 10Smalyshev: Create wikidata ntriples dump from ttl dump [puppet] - 10https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) [22:07:11] PROBLEM - Device not healthy -SMART- on labstore1007 is CRITICAL: cluster=misc device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,21,cciss,22,cciss,23} instance=labstore1007:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad%2520prometheus%252Fops [22:18:55] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) What I know so far from testing on cp1075: * The various BIOS settings seem fine so far, I didn't have to change anything in BIOS or NIC or controller firmware s... 
[22:23:26] (03CR) 10Jforrester: [C: 04-2] Install but don't enable the WikibaseMediaInfo extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446841 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [22:24:27] (03PS2) 10Jforrester: Delete multiversion/submodules.json, putatively unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446847 [22:26:19] PS60? woah [22:31:30] (03CR) 10Legoktm: [C: 031] "Introduced in 70651109f3f264dd2f1e966e66bdbd6927df709b to add a "scap branch" plugin. The plugin was never finished and removed in 940e45a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446847 (owner: 10Jforrester) [22:55:56] (03PS33) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) [22:55:58] (03PS61) 10EBernhardson: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) [22:56:00] (03PS6) 10EBernhardson: Cleanup ensure => absent after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/444765 (https://phabricator.wikimedia.org/T198351) [22:58:00] (03CR) 10jerkins-bot: [V: 04-1] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180725T2300). [23:00:04] MatmaRex: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:01:28] hi [23:02:56] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/ [23:02:56] or maybe this blog by freenode staff member Matthew 'mst' Trout https://MattSTrout.com/ [23:03:09] I can SWAT [23:03:54] huh, better not click those links above [23:03:57] thanks thcipriani [23:05:04] spammers have been at it all day [23:05:06] !log added dchen to LDAP group "wmf" (was already in admins as shell user, so didn't have to be added in puppet repo) (T200366) [23:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:11] T200366: Add Daisy Chen to the wmf LDAP group - https://phabricator.wikimedia.org/T200366 [23:08:52] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 308 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:16:25] (03PS34) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) [23:18:34] (03CR) 10Dzahn: "talked briefly on IRC about it with bd808. 
there is a plan to use probably admin.toolforge.org instead which seems even better" [dns] - 10https://gerrit.wikimedia.org/r/441817 (https://phabricator.wikimedia.org/T189637) (owner: 10MarcoAurelio) [23:20:31] (03PS35) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) [23:20:33] (03PS62) 10EBernhardson: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) [23:20:35] (03PS7) 10EBernhardson: Cleanup ensure => absent after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/444765 (https://phabricator.wikimedia.org/T198351) [23:21:41] (03Abandoned) 10EBernhardson: Cleanup ensure => absent after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/444765 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [23:25:17] (03CR) 10Dzahn: [C: 04-1] "if $avatars_host is always equal to gerrit::server::master_host , then let's just use that instead of adding the new variable?" 
[puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [23:26:19] wowza jenkins [23:29:20] oof [23:29:25] it only now went through [23:29:41] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.17 seconds [23:30:09] MatmaRex: live on mwdebug1002, check please [23:30:11] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.44 seconds [23:32:21] thcipriani: thanks, looks good [23:32:30] cool, going live [23:34:03] (03CR) 10Paladox: "> if $avatars_host is always equal to gerrit::server::master_host ," [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [23:35:02] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.13/extensions/VisualEditor/modules/ve-mw/dm/nodes/ve.dm.MWInlineImageNode.js: SWAT: [[gerrit:447653|ve.dm.MWInlineImageNode: Fix undefined data-mw in toDomElements output]] T198941 T200214 (duration: 00m 57s) [23:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:07] T200214: Editing a page using visual editor removes captions of inline images - https://phabricator.wikimedia.org/T200214 [23:35:07] T198941: SyntaxError: Unexpected token u in JSON at position 0 - https://phabricator.wikimedia.org/T198941 [23:35:13] ^ MatmaRex should be live [23:35:43] yes, it is. thanks [23:35:45] ! 
[23:35:46] :) [23:37:40] yw :) [23:39:33] (03PS36) 10EBernhardson: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) [23:39:35] (03PS63) 10EBernhardson: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) [23:43:26] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/11865/" [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [23:47:05] (03PS3) 10Dzahn: network::monitor: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/439554 [23:50:54] (03CR) 10Dzahn: "we talked about the remoteip part of this change, it's because of https://phabricator.wikimedia.org/T114014 so that makes sense to me. cha" [puppet] - 10https://gerrit.wikimedia.org/r/439783 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [23:58:16] (03CR) 10Dzahn: [C: 031] "compiles fine on netmon* hosts, has been no-op on everything before - part of a long string of changes, see topic - http://puppet-compiler" [puppet] - 10https://gerrit.wikimedia.org/r/439554 (owner: 10Dzahn)