[00:40:32] (03PS1) 10Smalyshev: Assign user directory to blazegraph user [puppet] - 10https://gerrit.wikimedia.org/r/467849 [00:42:26] (03PS2) 10Smalyshev: Assign user directory to blazegraph user [puppet] - 10https://gerrit.wikimedia.org/r/467849 [01:15:32] (03CR) 10Jforrester: [C: 031] "Let's do it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467830 (owner: 10Legoktm) [01:17:41] (03CR) 10Legoktm: [C: 032] Add REL1_32 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467830 (owner: 10Legoktm) [01:18:47] (03Merged) 10jenkins-bot: Add REL1_32 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467830 (owner: 10Legoktm) [01:24:41] !log legoktm@deploy1001 Synchronized wmf-config/CommonSettings.php: Add REL1_32 to ExtensionDistributor (duration: 00m 59s) [01:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:22] (03CR) 10jenkins-bot: Add REL1_32 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467830 (owner: 10Legoktm) [02:14:12] (03PS2) 10Dduvall: Project clone URLs based on access control [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/467843 [03:30:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 717.12 seconds [03:33:48] (03PS1) 10KartikMistry: Enable cx2outreach campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) [03:40:08] (03PS1) 10KartikMistry: apertium-separable: New upstream release [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/467871 (https://phabricator.wikimedia.org/T206441) [03:40:37] (03CR) 10jerkins-bot: [V: 04-1] apertium-separable: New upstream release [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/467871 (https://phabricator.wikimedia.org/T206441) (owner: 10KartikMistry) [04:01:18] (03PS1) 
10KartikMistry: apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/467875 (https://phabricator.wikimedia.org/T206440) [04:01:51] (03CR) 10jerkins-bot: [V: 04-1] apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/467875 (https://phabricator.wikimedia.org/T206440) (owner: 10KartikMistry) [04:06:26] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 291.36 seconds [04:26:26] RECOVERY - Memory correctable errors -EDAC- on db1069 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops [05:10:24] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [05:10:28] 10Operations, 10ops-eqiad, 10DBA: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (10Marostegui) 05Open>03Resolved a:03Marostegui As it happened before, this recovered itself - closing for now: ``` 04:26 < icinga-wm> RECOVERY - Memory correctable errors -EDAC- on db1069 is... 
[05:14:50] (03PS2) 10Giuseppe Lavagetto: mediawiki: install and start php-fpm on the mwdebug* servers [puppet] - 10https://gerrit.wikimedia.org/r/467570 (https://phabricator.wikimedia.org/T201140) [05:14:52] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving content from php7 [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) [05:15:35] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::vhost: allow serving content from php7 [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) (owner: 10Giuseppe Lavagetto) [05:24:33] (03CR) 10Alexandros Kosiaris: [C: 04-2] "That's an upstream module (https://github.com/puppetlabs/puppetlabs-stdlib), we import it as-is and don't modify unless strictly necessary" [puppet] - 10https://gerrit.wikimedia.org/r/467826 (owner: 10Thifranc) [06:05:44] (03PS6) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) [06:08:47] (03PS7) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) [06:12:21] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12982/mwdebug1001.eqiad.wmnet/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:18:01] <_joe_> jouncebot: next [06:18:01] In 2 hour(s) and 41 minute(s): Extra SRE deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T0900) [06:28:33] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_established_connections] [06:32:44] (03CR) 10Giuseppe Lavagetto: [C: 031] "looks good to me now, AFAICT" [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:38:02] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Legoktm) After discussion in our meeting and on IRC, we're deciding to stick with PHP 7.2 for the initial migration,... [06:38:15] (03PS2) 10Muehlenhoff: Clean up removed rsyncd configs [puppet] - 10https://gerrit.wikimedia.org/r/465583 (https://phabricator.wikimedia.org/T205618) [06:39:02] RECOVERY - MariaDB Slave SQL: s8 on db1124 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:39:38] (03CR) 10Muehlenhoff: [C: 032] Clean up removed rsyncd configs [puppet] - 10https://gerrit.wikimedia.org/r/465583 (https://phabricator.wikimedia.org/T205618) (owner: 10Muehlenhoff) [06:40:51] (03CR) 10Elukey: [C: 031] "The diff in pcc looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:52:00] !log fixing s8 master drifts T206743 [06:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:04] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [06:59:12] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:11:15] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: clean deprecation warnings on prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/467677 (owner: 10Gehel) [07:18:42] (03CR) 10Nikerabbit: Enable cx2outreach campaign (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [07:19:42] 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Banyek) p:05Triage>03Normal [07:20:22] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [07:21:16] banyek: we normally tell papaul or chris on those tickets that they can proceed to replace the disk, otherwise they normally wait for the green light as per what the task says: Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware. [07:21:23] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [07:21:47] 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Banyek) a:03Papaul Hi, can we ask for a disk replacement please? 
[07:22:01] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Banyek) [07:22:35] (03PS8) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) [07:22:37] (03PS4) 10Giuseppe Lavagetto: mediawiki: add httpd class, alternative to mediawiki::web [puppet] - 10https://gerrit.wikimedia.org/r/467643 [07:22:39] (03PS4) 10Giuseppe Lavagetto: mediawiki::webserver: introduce profile, use it on mwdebug* [puppet] - 10https://gerrit.wikimedia.org/r/467644 [07:22:41] (03PS4) 10Giuseppe Lavagetto: profile::mediawiki::common: stop using class 'mediawiki' [puppet] - 10https://gerrit.wikimedia.org/r/467642 [07:22:54] ack [07:23:13] <_joe_> argh, no, what did I do [07:24:03] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2051 is CRITICAL: cluster=mysql device=cciss,11 instance=db2051:9100 job=node site=codfw Banyek ACK T207212 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2051&var-datasource=codfw%2520prometheus%252Fops [07:24:05] <_joe_> hah just rebased [07:32:11] 10Operations, 10Patch-For-Review: Allow directing users to PHP7 based on a cookie - https://phabricator.wikimedia.org/T206338 (10fgiunchedi) Agreed on the behaviors we want, on the behaviors that are desirable (i.e. what to do on engine down or unresponsive) I think we should stick to what the (absence of) the... 
[07:38:15] (03PS1) 10Filippo Giunchedi: statsite: remove ignored Journal section from unit [puppet] - 10https://gerrit.wikimedia.org/r/467896 [07:43:42] (03PS2) 10Gehel: prometheus: clean deprecation warnings on prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/467677 [07:44:15] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) [07:45:26] 10Operations, 10ops-codfw, 10DBA: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) [07:45:45] 10Operations, 10ops-codfw, 10DBA: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) p:05Triage>03Normal [07:46:01] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) p:05Triage>03Normal [07:46:31] 10Operations, 10ops-codfw, 10DBA: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) [07:46:47] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) [07:47:15] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) after cleaning up parsercache data yesterday on pc2004, I checked the disk usage of the 'parsercache' database and the binlogs. Th... 
[07:50:19] (03CR) 10Gehel: [C: 032] prometheus: clean deprecation warnings on prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/467677 (owner: 10Gehel) [07:51:32] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [07:51:52] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) Keep in mind that the disk space is always constant on the pc hosts: https://grafana.wikimedia.org/dashboard/file/server-board.... [07:52:54] 10Operations, 10ops-eqiad, 10netops: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 (10ayounsi) 05Resolved>03Open Re-opening as we're seeing errors again (at a lower rate, but errors nonetheless) https://librenms.wikimedia.org/graphs/to=1536910800/id=6828/type=port_er... [07:53:54] (03CR) 10KartikMistry: Enable cx2outreach campaign (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [07:57:05] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) the 'parsercache' database takes around 1,5T data on pc1004 the binlogs are 7Gb on pc1004; but the binlogs are cleaned up in every... [07:58:20] 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) 05Open>03Resolved p:05Triage>03High [07:59:15] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) >>! 
In T206740#4673427, @Banyek wrote: > the 'parsercache' database takes around 1,5T data on pc1004 > the binlogs are 7Gb on p... [08:00:09] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) ok, I stop the purgers there too. [08:00:36] !log truncating parsercache tables on pc2005 (T206740) [08:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:39] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, looks like we're not applying the class anywhere already" [puppet] - 10https://gerrit.wikimedia.org/r/467103 (owner: 10Krinkle) [08:00:40] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [08:01:54] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) I double checked, pc1005 is NOT replicating from pc2005 (`show slave hosts` is empty in pc2005, and `show slave status` is empty on... [08:03:53] (03PS1) 10Filippo Giunchedi: smart: log failed physical disk enumeration instead of output [puppet] - 10https://gerrit.wikimedia.org/r/467901 [08:06:48] 10Operations, 10Beta-Cluster-Infrastructure, 10Mail, 10Wikimedia-Mailing-lists, and 2 others: Jenkins mail delivery failure to betacluster-alerts@list.wikimedia.org - https://phabricator.wikimedia.org/T207260 (10hashar) [08:09:22] !log stopping binlog purgers on the parsercache hosts (the binlogs will be kept for 24hrs) - T206740 [08:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:25] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [08:10:17] Is it OK to do cxserver deployment now()? We need to push 2 patches. 
[08:10:51] _joe_: ^ [08:11:39] <_joe_> kart_: you have 50 minutes for release and rollback in case something goes wrong [08:12:10] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [08:12:26] 10Operations, 10Traffic: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10ayounsi) > Oct 14 19:05:27 asw1-eqsin craftd[1962]: Minor alarm cleared, FPC 0 PEM 0 is not powered > Oct 14 19:05:28 asw1-eqsin craftd[1962]: Minor alarm cleared, FPC 1 PEM 0 is not powered So... [08:12:53] <_joe_> kart_: +1 for me [08:13:29] 10Operations, 10Beta-Cluster-Infrastructure, 10Mail, 10Wikimedia-Mailing-lists, and 3 others: Jenkins mail delivery failure to betacluster-alerts@list.wikimedia.org - https://phabricator.wikimedia.org/T207260 (10hashar) 05Open>03Resolved a:03hashar The email domain was wrong in the Jenkins configurat... [08:14:30] _joe_: OK. Thanks. [08:15:33] (03CR) 10Muehlenhoff: [C: 031] "Looks fantastic" [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [08:26:59] !log kartik@deploy1001 Started deploy [cxserver/deploy@b30a323]: Update cxserver to 29e01e4 (T206305, T204668) [08:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:03] T206305: CX2: Missing links in source article shown as blue instead of red - https://phabricator.wikimedia.org/T206305 [08:27:04] T204668: Unwanted HTML content sent to HTML MT systems often crosses character limit. - https://phabricator.wikimedia.org/T204668 [08:28:04] 10Operations, 10netops: relabel switch interfaces formerly saiph.frack.codfw.wmnet to frpig2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T207035 (10ayounsi) 05Open>03Resolved a:03ayounsi Renamed. 
[08:28:16] (03CR) 10Nikerabbit: Enable cx2outreach campaign (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [08:30:40] (03CR) 10Giuseppe Lavagetto: [C: 031] "long overdue IMHO" [puppet] - 10https://gerrit.wikimedia.org/r/467742 (owner: 10Alexandros Kosiaris) [08:30:52] !log kartik@deploy1001 Finished deploy [cxserver/deploy@b30a323]: Update cxserver to 29e01e4 (T206305, T204668) (duration: 03m 54s) [08:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:22] (03PS2) 10KartikMistry: Enable cx2outreach campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) [08:34:55] (03CR) 10KartikMistry: ">" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [08:35:16] _joe_: done with cxserver update. Thanks. [08:35:31] <_joe_> kart_: thanks [08:42:50] (03CR) 10Addshore: "jenkins says no!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [08:43:30] !log mobrovac@deploy1001 Started restart [proton/deploy@a657059]: (no justification provided) [08:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:32] (03PS1) 10Addshore: Increase wikidata dispatch randomness to 30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467911 [08:50:18] (03PS1) 10Addshore: Remove Wikidaat RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467913 (https://phabricator.wikimedia.org/T205330) [08:51:01] (03PS2) 10Addshore: Remove Wikidata RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467913 (https://phabricator.wikimedia.org/T205330) [08:51:17] (03PS9) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) [08:51:35] <_joe_> jouncebot: be merciful [08:51:40] <_joe_> jouncebot: next [08:51:41] In 0 hour(s) and 8 minute(s): Extra SRE deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T0900) [08:58:59] 10Operations, 10Wikidata, 10Wikidata-Campsite, 10Wikimedia-General-or-Unknown, and 5 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520 (10Lydia_Pintscher) [09:00:05] _joe_: That opportune time is upon us again. Time for a Extra SRE deployment deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T0900). [09:00:05] _joe_: A patch you scheduled for Extra SRE deployment is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [09:00:23] <_joe_> jouncebot is clearly teasing me [09:00:26] <_joe_> anyways [09:00:42] <_joe_> elukey, moritzm let's go? 
[09:01:17] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [09:01:39] ack! [09:02:56] <_joe_> ok starting [09:03:03] ack [09:03:06] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10mobrovac) These are fairly static node package... [09:03:19] (03PS1) 10Giuseppe Lavagetto: Revert "mediawiki::web::prod_sites: convert wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/467918 [09:03:23] <_joe_> better have it ready [09:04:32] <_joe_> !log puppet disabled on the appservers, now merging the wikipedia.org conversion to mediawiki::web::vhost [09:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:39] <_joe_> !log running puppet on mwdebug1001, then testing again wikipedia.org for regressions [09:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:01] <_joe_> ok, test is green [09:07:11] <_joe_> I'd run puppet through eqiad first [09:07:44] looks fine on mwdebug1001 [09:08:06] <_joe_> ok going [09:08:33] <_joe_> !log running puppet on all apaches (appserver/api) in eqiad to pick up the wikipedia.org vhost refactor [09:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:52] <_joe_> 45/99 done [09:11:30] <_joe_> 80/99 [09:12:39] <_joe_> !log change applied to all appservers serving traffic [09:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:54] <_joe_> !log reenabling puppet (not running it) in codfw [09:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:28] <_joe_> ok some smoke testing makes me think things 
didn't go horribly wrong [09:15:04] ack, looks all fine to me [09:15:32] (03PS1) 10DCausse: Fix typo in README.rst [software/spicerack] - 10https://gerrit.wikimedia.org/r/467922 [09:16:25] <_joe_> ok so... let's declare the change done [09:16:39] <_joe_> this means we've converted all vhosts [09:16:52] <_joe_> we have quite some cleanup to do [09:16:56] (03CR) 10jerkins-bot: [V: 04-1] Fix typo in README.rst [software/spicerack] - 10https://gerrit.wikimedia.org/r/467922 (owner: 10DCausse) [09:20:12] great work, this is a major leap forward [09:21:19] <_joe_> clear reference to https://en.wikipedia.org/wiki/Great_Leap_Forward [09:22:45] (03PS3) 10Volans: Revert "admins: add Cas Rusnov to admins as ldap_only" [puppet] - 10https://gerrit.wikimedia.org/r/467745 [09:23:32] (03CR) 10Volans: [C: 032] Revert "admins: add Cas Rusnov to admins as ldap_only" [puppet] - 10https://gerrit.wikimedia.org/r/467745 (owner: 10Volans) [09:24:41] _joe_ great work! \o/ [09:25:43] <_joe_> elukey: let's wait tomorrow [09:26:13] <_joe_> but yeah, seems things are not too horribly wrong [09:26:18] (03Abandoned) 10Volans: Rebuild requirements to pick security upgrades [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/449992 (owner: 10Volans) [09:26:27] (03Abandoned) 10Volans: Rebuild wheels with upgraded dependencies [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/449993 (owner: 10Volans) [09:26:42] (03Abandoned) 10Volans: Fix submodule directory [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/449991 (owner: 10Volans) [09:30:27] !log truncating parsercache tables on pc2006 (T206740) [09:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:31] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [09:32:09] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/467896 (owner: 10Filippo Giunchedi) [09:34:37] (03CR) 10Volans: [C: 031] "LGTM, minor nitpick inline, no 
need to wait for re-review if fixing it." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467901 (owner: 10Filippo Giunchedi) [09:34:43] !log update interfaces and BGP IPs for office-DC link (DC side, interfaces still disabled) - T205985 [09:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:47] T205985: Renumber office-DC interconnect link - https://phabricator.wikimedia.org/T205985 [09:35:59] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:53:07] (03PS1) 10Volans: Revert "Allow custom fields in the Device CSV form" [software/netbox] - 10https://gerrit.wikimedia.org/r/467936 (https://phabricator.wikimedia.org/T205896) [09:56:04] (03PS1) 10Volans: Revert "Add Wikimedia's initial data" [software/netbox] - 10https://gerrit.wikimedia.org/r/467937 (https://phabricator.wikimedia.org/T205896) [10:00:09] (03PS1) 10Vgutierrez: admin: Add liw user account [puppet] - 10https://gerrit.wikimedia.org/r/467938 (https://phabricator.wikimedia.org/T206612) [10:00:11] (03PS1) 10Vgutierrez: admin: Add liw to deployment, contint-admins, labnet-users and contint-docker [puppet] - 10https://gerrit.wikimedia.org/r/467939 (https://phabricator.wikimedia.org/T206612) [10:01:07] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/467901 (owner: 10Filippo Giunchedi) [10:02:14] (03CR) 10Muehlenhoff: [C: 031] "I think it's fine to proceed with the current set of hosts, all the remaining roles will be addressed later on anyway." 
[puppet] - 10https://gerrit.wikimedia.org/r/466904 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:08:30] (03PS3) 10Addshore: Wikidata: add setting for setting the enabled entity data forms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [10:08:45] (03CR) 10Addshore: "merge conflict" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [10:08:50] (03CR) 10Addshore: ":(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [10:10:07] (03PS2) 10Addshore: Wikidata: enable JSON-LD data format on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [10:10:22] (03CR) 10Volans: [V: 032 C: 032] Revert "Allow custom fields in the Device CSV form" [software/netbox] - 10https://gerrit.wikimedia.org/r/467936 (https://phabricator.wikimedia.org/T205896) (owner: 10Volans) [10:10:34] (03CR) 10Volans: [V: 032 C: 032] Revert "Add Wikimedia's initial data" [software/netbox] - 10https://gerrit.wikimedia.org/r/467937 (https://phabricator.wikimedia.org/T205896) (owner: 10Volans) [10:10:55] (03CR) 10jerkins-bot: [V: 04-1] Wikidata: enable JSON-LD data format on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [10:11:09] (03CR) 10Vgutierrez: [C: 04-2] "To be approved in next Monday SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/467939 (https://phabricator.wikimedia.org/T206612) (owner: 10Vgutierrez) [10:12:18] (03PS3) 10Addshore: Wikidata: enable JSON-LD data format on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [10:12:38] 10Operations, 
10SRE-Access-Requests, 10Patch-For-Review: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10Vgutierrez) Access will be granted after approval on next Monday SRE meeting [10:13:30] _joe_: did your slot all go well? [10:13:50] seems so up to now :) [10:14:19] <_joe_> fingers crossed, but yeah [10:16:43] (03PS4) 10Muehlenhoff: rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 [10:21:30] (03CR) 10Vgutierrez: [C: 04-1] "Publishing the ns[0-2] IPs only using IPv4 is intended. But as we use DNS as our source of truth for IP assignments, I guess that the AAAA" [dns] - 10https://gerrit.wikimedia.org/r/467702 (owner: 10Volans) [10:24:02] vgutierrez: do you mean the PTRs? [10:24:30] /o\ of course I do [10:24:37] :) [10:24:46] (03Abandoned) 10Giuseppe Lavagetto: mediawiki: convert more apache defines to httpd [puppet] - 10https://gerrit.wikimedia.org/r/467572 (owner: 10Giuseppe Lavagetto) [10:24:59] (03CR) 10Vgutierrez: [C: 04-1] "> Publishing the ns[0-2] IPs only using IPv4 is intended. But as we" [dns] - 10https://gerrit.wikimedia.org/r/467702 (owner: 10Volans) [10:25:16] sigh :) [10:37:22] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 42 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [10:40:43] (03PS5) 10Muehlenhoff: rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 [10:43:23] 10Operations, 10netops, 10Patch-For-Review: Renumber office-DC interconnect link - https://phabricator.wikimedia.org/T205985 (10ayounsi) I installed Quagga in a VM to verify the commands, but there will most likely be differences with the office Quagga. Only things that might need to be added is updating ipt... 
[10:45:30] (03Abandoned) 10Ladsgroup: Add centralauth.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457459 (https://phabricator.wikimedia.org/T201009) (owner: 10Ladsgroup) [10:46:42] (03CR) 10Muehlenhoff: [C: 032] rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 (owner: 10Muehlenhoff) [10:51:51] (03PS1) 10Ladsgroup: Enable reading from new backend of change tag everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467946 (https://phabricator.wikimedia.org/T194164) [10:56:58] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) All done, the parsercache tables are truncated on all the codfw hosts, and the disk usage of binlogs seems normalized in eqiad. Th... [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T1100). [11:00:04] kart_, Addshore, raynor, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] dear god there are lots of patches in swat today... [11:00:29] addshore: only 6 allowed :) [11:00:42] o/ If it's too much, you probably can ignore mine [11:00:45] kart_: I'll hit +2 on yours now, but deploy some of mine while it merges [11:00:56] addshore: yes. [11:01:01] (03PS2) 10Addshore: Add constraint-suggestions to wgBetaFeaturesWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467691 (https://phabricator.wikimedia.org/T207019) [11:01:04] addshore: you're in charge of swat today? 
[11:01:11] (03CR) 10Addshore: [C: 032] Add constraint-suggestions to wgBetaFeaturesWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467691 (https://phabricator.wikimedia.org/T207019) (owner: 10Addshore) [11:01:13] zeljkof: as I have 6 patches in it, yes :D [11:01:24] addshore: :D great [11:01:41] soo no time for me ;( ? [11:01:49] raynor: we will see, mine should all be quick [11:01:59] but there are 9 patches in the window... [11:02:15] (03Merged) 10jenkins-bot: Add constraint-suggestions to wgBetaFeaturesWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467691 (https://phabricator.wikimedia.org/T207019) (owner: 10Addshore) [11:02:16] kart_, Addshore, raynor, and Amir1 - please self organize and try to sort the most urgent patches at the top [11:02:31] mine can wait [11:02:52] sure [11:02:55] nothing urgent, we want to start counting errors on Minerva, I can deploy it later today if there is not time in current window [11:03:03] (03CR) 10ArielGlenn: [C: 032] use lbzip2 for recombining page content dumps, if available and configured [dumps] - 10https://gerrit.wikimedia.org/r/466344 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn) [11:03:09] mine can go last, it's a new feature [11:04:30] (03PS5) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) [11:04:38] !log ariel@deploy1001 Started deploy [dumps/dumps@ed7eed9]: use lbzip2 for recombine steps if configured [11:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:42] !log ariel@deploy1001 Finished deploy [dumps/dumps@ed7eed9]: use lbzip2 for recombine steps if configured (duration: 00m 03s) [11:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:56] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:467691]] Add 
constraint-suggestions to wgBetaFeaturesWhitelist (duration: 01m 10s) [11:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:24] (03CR) 10Addshore: [C: 032] Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) (owner: 10Addshore) [11:06:02] (03PS2) 10Addshore: Increase wikidata dispatch randomness to 30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467911 [11:06:06] (03CR) 10Addshore: [C: 032] Increase wikidata dispatch randomness to 30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467911 (owner: 10Addshore) [11:06:17] (03PS3) 10KartikMistry: Enable cx2outreach campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) [11:06:25] (03Merged) 10jenkins-bot: Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) (owner: 10Addshore) [11:07:11] (03Merged) 10jenkins-bot: Increase wikidata dispatch randomness to 30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467911 (owner: 10Addshore) [11:07:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:08:05] (03CR) 10jenkins-bot: Add constraint-suggestions to wgBetaFeaturesWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467691 (https://phabricator.wikimedia.org/T207019) (owner: 10Addshore) [11:08:07] (03CR) 10jenkins-bot: Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) (owner: 10Addshore) [11:08:09] (03CR) 10jenkins-bot: Increase wikidata dispatch randomness to 30 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/467911 (owner: 10Addshore) [11:08:24] (03PS3) 10ArielGlenn: dumps config settings to use lbzip2 for recombining page content files [puppet] - 10https://gerrit.wikimedia.org/r/466554 (https://phabricator.wikimedia.org/T179059) [11:08:52] !log addshore@deploy1001 Synchronized wmf-config/Wikibase-production.php: SWAT: T207019 [[gerrit:467343]] Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki (duration: 00m 56s) [11:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:55] (03PS3) 10Addshore: Remove Wikidata RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467913 (https://phabricator.wikimedia.org/T205330) [11:08:55] T207019: Deploy Beta feature for constraint suggestions - https://phabricator.wikimedia.org/T207019 [11:08:57] (03CR) 10Addshore: [C: 032] Remove Wikidata RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467913 (https://phabricator.wikimedia.org/T205330) (owner: 10Addshore) [11:09:18] (03CR) 10ArielGlenn: [C: 032] dumps config settings to use lbzip2 for recombining page content files [puppet] - 10https://gerrit.wikimedia.org/r/466554 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn) [11:10:37] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) [11:11:01] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: Increase wikidata dispatch randomness to 30 (duration: 00m 56s) [11:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:07] kart_: your CX changes is failing CI [11:12:23] (03CR) 10Addshore: [C: 032] Remove Wikidata RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467913 (https://phabricator.wikimedia.org/T205330) (owner: 10Addshore) [11:13:25] (03Merged) 10jenkins-bot: Remove Wikidata RejectParserCacheValue hook 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/467913 (https://phabricator.wikimedia.org/T205330) (owner: 10Addshore) [11:14:03] addshore: err :/ Just saw that. [11:14:33] addshore: I'll try to debug, but that won't fit into SWAT window. [11:15:02] (03CR) 10Alexandros Kosiaris: "To answer the questions a bit" [puppet] - 10https://gerrit.wikimedia.org/r/467742 (owner: 10Alexandros Kosiaris) [11:15:09] (03CR) 10Addshore: [C: 032] Wikidata: add setting for setting the enabled entity data forms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [11:15:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 64 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:15:13] kart_: ack [11:15:13] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: T205611 T205330 Remove Wikidata RejectParserCacheValue hook [[gerrit:467913]] (duration: 00m 56s) [11:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:22] T205330: Adding a value to a statement prompts for a property - https://phabricator.wikimedia.org/T205330 [11:15:23] T205611: Remove parser cache purging hook from mediawiki-config - https://phabricator.wikimedia.org/T205611 [11:15:39] (03PS4) 10Addshore: Wikidata: add setting for setting the enabled entity data forms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [11:15:43] (03CR) 10Addshore: [C: 032] Wikidata: add setting for setting the enabled entity data forms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [11:16:45] (03Merged) 10jenkins-bot: Wikidata: add setting for setting the enabled entity data forms [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [11:17:37] !log ladsgroup@mwmaint1002:~$ mwscript deleteLocalPasswords.php --wiki=enwiki --delete --batch-size 200 (This will cause lag on codfw) [11:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:22] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) @Cmjohnson all yours to populate with optics and ship to eqord, @Papaul will be there on Oct. 23rd. If we don't have enough optics, please use the ones received for cr2-eqsin in T205487 and let m... [11:19:31] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: T207196 Wikidata: add setting for setting the enabled entity data forms [[gerrit:467735]] PT 1/2 (duration: 00m 57s) [11:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:35] T207196: Temporarily turn off JSON-LD entity data form on wikidata.org - https://phabricator.wikimedia.org/T207196 [11:21:04] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: T207196 Wikidata: add setting for setting the enabled entity data forms [[gerrit:467735]] PT 2/2 (duration: 00m 56s) [11:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:20] (03PS4) 10Addshore: Wikidata: enable JSON-LD data format on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [11:22:24] (03CR) 10Addshore: [C: 032] Wikidata: enable JSON-LD data format on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [11:23:29] (03Merged) 10jenkins-bot: Wikidata: enable JSON-LD data format on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 
(https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [11:23:42] * addshore is seeing bursts of exceptions due to transactions taking to long on urwiki, doesn't look related to swat though [11:24:28] (03CR) 10jenkins-bot: Remove Wikidata RejectParserCacheValue hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467913 (https://phabricator.wikimedia.org/T205330) (owner: 10Addshore) [11:24:30] (03CR) 10jenkins-bot: Wikidata: add setting for setting the enabled entity data forms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [11:24:32] (03CR) 10jenkins-bot: Wikidata: enable JSON-LD data format on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [11:25:34] raynor: Amir1 just your 2 left [11:26:06] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: T207196 [[gerrit:467736]] Wikidata: enable JSON-LD data format on test.wikidata.org (duration: 00m 56s) [11:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:09] T207196: Temporarily turn off JSON-LD entity data form on wikidata.org - https://phabricator.wikimedia.org/T207196 [11:26:12] (03PS3) 10Addshore: Enable client side error counting on Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467760 (https://phabricator.wikimedia.org/T206702) (owner: 10Pmiazga) [11:26:19] raynor: still around? 
[11:26:45] yes [11:26:50] (03CR) 10Addshore: [C: 032] Enable client side error counting on Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467760 (https://phabricator.wikimedia.org/T206702) (owner: 10Pmiazga) [11:26:53] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) p:05High>03Lowest As the ticket is finished, and just kept open for a few days, there is not need to keep it at high priority [11:27:02] raynor: is it testable on a mwdebug server? :) [11:27:51] (03Merged) 10jenkins-bot: Enable client side error counting on Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467760 (https://phabricator.wikimedia.org/T206702) (owner: 10Pmiazga) [11:27:52] kinda yes, I would have to go to mwdebug, try to create some js errors and check that it logs [11:28:00] raynor: ack [11:28:07] but it may take some time, I think it's safe to skip it [11:28:19] (03PS5) 10Giuseppe Lavagetto: profile::mediawiki::common: stop using class 'mediawiki' [puppet] - 10https://gerrit.wikimedia.org/r/467642 [11:28:20] and go to the prod, we already tested that on beta cluster [11:28:29] raynor: ack [11:28:35] ill just make sure it hasn't made anything explode [11:29:40] raynor: syncing [11:29:46] great, thx [11:29:58] Amir1: can I let you do your change? :) [11:30:34] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: T206702 [[gerrit:467760|Enable client side error counting on Minerva]] (duration: 00m 57s) [11:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:38] T206702: Enable client side error counting on Minerva production (wikipedia only) - https://phabricator.wikimedia.org/T206702 [11:30:40] lovely, done! 
[11:31:04] addshore: sure [11:31:11] 7 things in 30 mins, not bad [11:31:54] (03PS2) 10Ladsgroup: Enable reading from new backend of change tag everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467946 (https://phabricator.wikimedia.org/T194164) [11:32:02] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467946 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [11:32:05] !log installing graphicsmagick security updates [11:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:05] (03Merged) 10jenkins-bot: Enable reading from new backend of change tag everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467946 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [11:37:41] (03CR) 10BBlack: [C: 031] "I think this is long overdue too :)" [puppet] - 10https://gerrit.wikimedia.org/r/467742 (owner: 10Alexandros Kosiaris) [11:39:17] (03CR) 10jenkins-bot: Enable client side error counting on Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467760 (https://phabricator.wikimedia.org/T206702) (owner: 10Pmiazga) [11:39:19] (03CR) 10jenkins-bot: Enable reading from new backend of change tag everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467946 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [11:39:21] (03CR) 10BBlack: [C: 031] "Typo at the very end above, s/esams RB/eqiad RB/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/467742 (owner: 10Alexandros Kosiaris) [11:40:01] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467946|Enable reading from new backend of change tag everywhere (T194164)]] (duration: 00m 57s) [11:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:04] T194164: Start reading from change_tag_def in production - https://phabricator.wikimedia.org/T194164 [11:40:57] !log EU SWAT is done [11:40:58] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:06] I will keep an eye on monitoring things [11:47:10] (03CR) 10Vgutierrez: [C: 031] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/447775 (owner: 10Mark Bergsma) [11:47:53] (03PS1) 10Giuseppe Lavagetto: add two new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/467952 [11:48:54] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] add two new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/467952 (owner: 10Giuseppe Lavagetto) [11:50:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 25 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:52:37] (03PS6) 10Giuseppe Lavagetto: profile::mediawiki::common: stop using class 'mediawiki' [puppet] - 10https://gerrit.wikimedia.org/r/467642 [11:58:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 71 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:58:34] (03PS1) 10Volans: Upgrade Netbox to upstream v2.4.6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/467953 (https://phabricator.wikimedia.org/T207009) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T1200) [12:02:14] (03CR) 10Thifranc: "> Patch Set 1: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/467826 (owner: 10Thifranc) [12:02:35] (03CR) 10Filippo Giunchedi: [C: 032] statsite: remove ignored Journal section from unit [puppet] - 10https://gerrit.wikimedia.org/r/467896 (owner: 10Filippo Giunchedi) [12:02:42] (03PS2) 10Filippo Giunchedi: statsite: remove ignored Journal section from unit [puppet] - 10https://gerrit.wikimedia.org/r/467896 [12:06:57] (03PS1) 10Banyek: mariadb: enable db2096 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/467954 (https://phabricator.wikimedia.org/T206593) [12:08:39] (03PS7) 10Giuseppe Lavagetto: profile::mediawiki::common: stop using class 'mediawiki' [puppet] - 10https://gerrit.wikimedia.org/r/467642 [12:09:03] (03PS2) 10Filippo Giunchedi: smart: log failed physical disk enumeration instead of output [puppet] - 10https://gerrit.wikimedia.org/r/467901 [12:11:01] (03CR) 10Filippo Giunchedi: [C: 032] smart: log failed physical disk enumeration instead of output [puppet] - 10https://gerrit.wikimedia.org/r/467901 (owner: 10Filippo Giunchedi) [12:11:12] (03PS3) 10Filippo Giunchedi: smart: log failed physical disk enumeration instead of output [puppet] - 10https://gerrit.wikimedia.org/r/467901 [12:15:44] (03PS1) 10Muehlenhoff: Bail out if debdeploy-deploy isn't run as root [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/467955 [12:17:21] 10Operations, 10netops, 10Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 (10ayounsi) 05Open>03Resolved As this was the troubleshooting task and follow up tasks have been open for the software upgrades thems... [12:17:38] hey, I have a quick question, if I have a config that I want to enable to all wikis [12:17:57] what key should I use in the InitializeSettings.php file? [12:18:00] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12994/ does the right thing." 
[puppet] - 10https://gerrit.wikimedia.org/r/467642 (owner: 10Giuseppe Lavagetto) [12:18:16] like -> if I want to enable something to english wiki - I use `enwiki` [12:18:20] default [12:18:37] but the default is for all wikis, I mean all wikipedias [12:18:55] Well, be clear what you want ;P [12:19:04] I don't want to enable that key for wikidata or wikivoyage [12:19:13] you should be able to do 'wikipedia' => true [12:19:28] sorry, my bad, by wikis I meant "wikipedias" :) [12:20:25] wikis generally means all wikis [12:20:31] sometimes people just call them pedias to be obvious [12:21:14] yeah, I used to call pedias just wikis ;) [12:21:26] ok nvm, thx for the info [12:21:35] so now, the next question [12:21:35] (03PS1) 10Elukey: profile::hadoop::balancer: move to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/467956 (https://phabricator.wikimedia.org/T172532) [12:21:35] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/467760/3/wmf-config/InitialiseSettings.php [12:21:54] this is on prod atm [12:22:01] but I don't see that config change :( [12:22:37] reedy@deploy1001:~$ mwscript eval.php enwiki [12:22:37] > var_dump( $wgMinervaCountErrors ); [12:22:37] bool(true) [12:22:56] reedy@deploy1001:~$ mwscript eval.php mediawikiwiki [12:22:56] > var_dump( $wgMinervaCountErrors ); [12:22:56] bool(false) [12:22:57] WFM [12:24:05] (03CR) 10BBlack: [C: 031] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [12:24:30] (03PS2) 10Elukey: profile::hadoop::balancer: move to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/467956 (https://phabricator.wikimedia.org/T172532) [12:24:35] Reedy! I really need your feedback on the security review plan for https://phabricator.wikimedia.org/T202295 please, would be very much appreciated. ;-) [12:25:27] Also a "I don't know yet because this and that." would be helpful so we can plan with that. 
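Note on the 12:17-12:22 exchange above about which key to use in InitialiseSettings.php: the answers given there ("default" for every wiki, 'wikipedia' => true for Wikipedias only, `enwiki` for a single wiki) imply a lookup that prefers the most specific key. The following is a minimal Python sketch of that idea, not the actual MediaWiki implementation. The `SETTINGS` values are an assumption chosen to match the `var_dump` output quoted in the log, and the `WIKI_FAMILY` mapping and `resolve` helper are hypothetical names introduced only for illustration.

```python
# Simplified, hypothetical model of per-wiki setting resolution.
# Assumed values: chosen to match the var_dump output in the log
# (true on enwiki, false on mediawikiwiki); the real file is not shown.
SETTINGS = {
    "wgMinervaCountErrors": {
        "default": False,    # applies to every wiki unless overridden
        "wikipedia": True,   # applies to all wikis in the wikipedia family
    },
}

# Hypothetical mapping from database name to project family (illustrative only).
WIKI_FAMILY = {
    "enwiki": "wikipedia",
    "dewiki": "wikipedia",
    "wikidatawiki": "wikidata",
    "enwikivoyage": "wikivoyage",
    "mediawikiwiki": "special",
}


def resolve(setting, dbname):
    """Return the value for `setting` on wiki `dbname`.

    Lookup order in this sketch: exact database name, then family, then 'default'.
    """
    values = SETTINGS[setting]
    if dbname in values:
        return values[dbname]
    family = WIKI_FAMILY.get(dbname)
    if family in values:
        return values[family]
    return values["default"]


if __name__ == "__main__":
    for wiki in ("enwiki", "mediawikiwiki", "wikidatawiki"):
        print(wiki, resolve("wgMinervaCountErrors", wiki))
```

Under these assumptions, a family key switches the setting on for the Wikipedias while Wikidata and Wikivoyage keep the default, which is what was asked for in the channel.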
[12:27:20] (03Abandoned) 10Thifranc: Correct alongside Paladox review [puppet] - 10https://gerrit.wikimedia.org/r/467826 (owner: 10Thifranc) [12:27:44] (03CR) 10Ayounsi: [C: 031] "Can't review it all but looks good to me" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/467953 (https://phabricator.wikimedia.org/T207009) (owner: 10Volans) [12:28:18] (03CR) 10Nikerabbit: [C: 031] Enable cx2outreach campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [12:28:29] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review: Allow directing users to PHP7 based on a cookie - https://phabricator.wikimedia.org/T206338 (10CCicalese_WMF) [12:28:48] (03PS3) 10Elukey: profile::hadoop::balancer: move to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/467956 (https://phabricator.wikimedia.org/T172532) [12:31:08] Reedy - found [12:32:12] (03PS4) 10Elukey: profile::hadoop::balancer: move to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/467956 (https://phabricator.wikimedia.org/T172532) [12:33:25] the MinervaSkin is old, because of that nothing tracks the errors as it didn't hit the production yet ;/ [12:33:51] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12998/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/467956 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [12:33:53] (03CR) 10Alexandros Kosiaris: "> Looking at (https://github.com/puppetlabs/puppetlabs-stdlib/tree/master/spec/fixtures/test/manifests) I don't see the files I've edited," [puppet] - 10https://gerrit.wikimedia.org/r/467826 (owner: 10Thifranc) [12:39:17] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Per the request on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467826, continuing the discussion here." 
[puppet] - 10https://gerrit.wikimedia.org/r/467824 (owner: 10Thifranc) [12:45:28] (03CR) 10Alexandros Kosiaris: "And I managed to misinterpret the graph I sent, forgetting to count the caching pops. brandon correct about the average ratio." [puppet] - 10https://gerrit.wikimedia.org/r/467742 (owner: 10Alexandros Kosiaris) [12:45:30] (03CR) 10BBlack: [C: 04-1] "Yeah, the state of affairs is definitely intentional here. We've talked about enabling V6 traffic for our authdns, but there has to be so" [dns] - 10https://gerrit.wikimedia.org/r/467702 (owner: 10Volans) [12:46:39] (03CR) 10Marostegui: [C: 031] mariadb: enable notifications for db2096 [puppet] - 10https://gerrit.wikimedia.org/r/467722 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [12:46:57] (03PS1) 10Muehlenhoff: Switch carbon rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467958 [12:48:57] (03CR) 10Volans: "> Patch Set 1: Code-Review-1" [dns] - 10https://gerrit.wikimedia.org/r/467702 (owner: 10Volans) [12:49:19] (03Abandoned) 10Volans: Add missing AAAA records for nameservers [dns] - 10https://gerrit.wikimedia.org/r/467702 (owner: 10Volans) [12:53:47] (03PS1) 10Banyek: mariadb: enable replication check on Parsercache hosts [puppet] - 10https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) [12:54:55] (03CR) 10Banyek: [C: 032] mariadb: enable notifications for db2096 [puppet] - 10https://gerrit.wikimedia.org/r/467722 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [12:55:09] (03PS2) 10Banyek: mariadb: enable notifications for db2096 [puppet] - 10https://gerrit.wikimedia.org/r/467722 (https://phabricator.wikimedia.org/T206593) [12:55:25] !log enabling notifications on db2096 [12:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:10] !log enabling notifications on db2096 (T206593) [12:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:15] T206593: 
Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 [12:56:49] (03CR) 10Banyek: [V: 032 C: 032] mariadb: enable notifications for db2096 [puppet] - 10https://gerrit.wikimedia.org/r/467722 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [12:57:02] (03CR) 10Muehlenhoff: "https://puppet-compiler.wmflabs.org/compiler1002/12999/" [puppet] - 10https://gerrit.wikimedia.org/r/467958 (owner: 10Muehlenhoff) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T1300) [13:00:24] (03CR) 10Muehlenhoff: [C: 031] "I doublechecked dashboard and there's nothing for scb referencing Diamond, this should be good to go." [puppet] - 10https://gerrit.wikimedia.org/r/466906 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [13:01:02] (03PS4) 10Gehel: wdqs: spread IRQ from NIC over multiple CPUs [puppet] - 10https://gerrit.wikimedia.org/r/465624 (https://phabricator.wikimedia.org/T206105) [13:02:49] (03CR) 10Gehel: [C: 032] wdqs: spread IRQ from NIC over multiple CPUs [puppet] - 10https://gerrit.wikimedia.org/r/465624 (https://phabricator.wikimedia.org/T206105) (owner: 10Gehel) [13:03:41] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/467820 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [13:04:56] (03PS2) 10Banyek: wiki replicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/467820 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [13:05:03] (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/467820 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [13:05:44] !log deplooling labsdb1010 (T181650) [13:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:47] T181650: Change views for the new columns of the refactored 
comment storage - https://phabricator.wikimedia.org/T181650 [13:08:15] !log applying rps NIC config for all wdqs nodes - T206105 [13:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:18] T206105: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 [13:10:35] (03PS12) 10Vgutierrez: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [13:13:19] RECOVERY - Device not healthy -SMART- on heze is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=heze&var-datasource=codfw%2520prometheus%252Fops [13:14:16] (03CR) 10Vgutierrez: Certcentral-authdns integration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [13:14:27] (03CR) 10Banyek: [C: 031] "looks like a proper regex to me" [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: 10Jcrespo) [13:18:50] 10Operations, 10DBA, 10MediaWiki-Cache, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) 05Open>03Resolved p:05Lowest>03Normal Let's close it and if something breaks we can reopen. Good job! [13:19:38] (03CR) 10Jcrespo: "Banyek, to properly verify, could you run the puppet compiler over a repesentative subset of databases (one for each role and/or datacente" [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: 10Jcrespo) [13:20:23] (03CR) 10Banyek: "sure I can!" 
[puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: 10Jcrespo) [13:22:33] (03CR) 10Marostegui: [C: 031] mariadb: enable db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467954 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [13:24:05] (03CR) 10Gehel: [C: 031] "LGTM, the infamous `handleError` linting issue is back, I'll let volans deal with it or merge as-is" [software/spicerack] - 10https://gerrit.wikimedia.org/r/467922 (owner: 10DCausse) [13:24:18] damn :( [13:25:30] volans: yep [13:27:03] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/467953 (https://phabricator.wikimedia.org/T207009) (owner: 10Volans) [13:27:31] (03Abandoned) 10Volans: Upgrade Netbox to upstream v2.4.6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/467953 (https://phabricator.wikimedia.org/T207009) (owner: 10Volans) [13:29:39] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:30:18] (03CR) 10Gehel: [C: 032] "Not sure if this changes anything, but why not..." [puppet] - 10https://gerrit.wikimedia.org/r/467849 (owner: 10Smalyshev) [13:30:28] (03PS3) 10Gehel: Assign user directory to blazegraph user [puppet] - 10https://gerrit.wikimedia.org/r/467849 (owner: 10Smalyshev) [13:34:01] (03PS1) 10Mathew.onipe: elasticsearch: pseudo cookbook for JVM upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T202885) [13:34:41] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: pseudo cookbook for JVM upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:36:49] PROBLEM - puppet last run on wdqs1004 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[blazegraph] [13:36:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 39 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:37:43] (03PS1) 10Volans: Upgrade Netbox to upstream v2.4.6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/467965 (https://phabricator.wikimedia.org/T207009) [13:39:59] PROBLEM - puppet last run on wdqs2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): User[blazegraph] [13:41:42] (03CR) 10Ayounsi: [C: 031] "Can't review it all but looks good to me" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/467965 (https://phabricator.wikimedia.org/T207009) (owner: 10Volans) [13:42:14] (03CR) 10Muehlenhoff: [C: 032] Bail out if debdeploy-deploy isn't run as root [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/467955 (owner: 10Muehlenhoff) [13:42:20] puppet failures on wdqs are probably me, checking [13:42:35] (03PS13) 10Vgutierrez: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [13:42:49] PROBLEM - puppet last run on wdqs1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): User[blazegraph] [13:43:42] (03PS1) 10Gehel: Revert "Assign user directory to blazegraph user" [puppet] - 10https://gerrit.wikimedia.org/r/467967 [13:43:58] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): User[blazegraph] [13:44:17] (03PS2) 10Gehel: Revert "Assign user directory to blazegraph user" [puppet] - 10https://gerrit.wikimedia.org/r/467967 [13:45:07] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/467922 (owner: 10DCausse) [13:45:21] (03CR) 10Volans: [C: 031] "Facepalm to myself for the typo, thanks for spotting it!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/467922 (owner: 10DCausse) [13:45:40] (03CR) 10Gehel: [C: 032] Revert "Assign user directory to blazegraph user" [puppet] - 10https://gerrit.wikimedia.org/r/467967 (owner: 10Gehel) [13:46:15] (03CR) 10jerkins-bot: [V: 04-1] Fix typo in README.rst [software/spicerack] - 10https://gerrit.wikimedia.org/r/467922 (owner: 10DCausse) [13:47:27] (03PS2) 10Mathew.onipe: elasticsearch: pseudo cookbook for JVM upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T202885) [13:47:58] (03CR) 10Volans: [C: 031] "Something fishy in CI, it failed to load a plugin this time... not sure why" [software/spicerack] - 10https://gerrit.wikimedia.org/r/467922 (owner: 10DCausse) [13:48:02] (03CR) 10Volans: [C: 031] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/467922 (owner: 10DCausse) [13:49:08] PROBLEM - puppet last run on wdqs2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): User[blazegraph] [13:49:10] (03CR) 10jerkins-bot: [V: 04-1] Fix typo in README.rst [software/spicerack] - 10https://gerrit.wikimedia.org/r/467922 (owner: 10DCausse) [13:50:18] (03CR) 10Muehlenhoff: elasticsearch: pseudo cookbook for JVM upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:50:50] (03CR) 10Mobrovac: [C: 031] hiera: remove diamond from scb role [puppet] - 10https://gerrit.wikimedia.org/r/466906 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [13:51:08] PROBLEM - puppet last run on wdqs2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): User[blazegraph] [13:51:39] PROBLEM - puppet last run on wdqs2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): User[blazegraph] [13:52:49] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::common: stop using class 'mediawiki' [puppet] - 10https://gerrit.wikimedia.org/r/467642 (owner: 10Giuseppe Lavagetto) [13:53:02] (03PS8) 10Giuseppe Lavagetto: profile::mediawiki::common: stop using class 'mediawiki' [puppet] - 10https://gerrit.wikimedia.org/r/467642 [13:54:33] (03PS1) 10Vgutierrez: secret: Add authdns-certcentral dummy SSH key [labs/private] - 10https://gerrit.wikimedia.org/r/467968 (https://phabricator.wikimedia.org/T194962) [13:55:19] RECOVERY - puppet last run on wdqs2005 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [13:56:03] (03CR) 10Vgutierrez: [V: 032 C: 032] secret: Add authdns-certcentral dummy SSH key [labs/private] - 10https://gerrit.wikimedia.org/r/467968 (https://phabricator.wikimedia.org/T194962) (owner: 10Vgutierrez) [13:56:09] (03PS1) 10Bstorm: labstore: correct the service name for stretch [puppet] - 10https://gerrit.wikimedia.org/r/467969 
(https://phabricator.wikimedia.org/T203254) [13:56:45] gehel: blazegraph are all yours? [13:57:12] volans: the puppet failures? [13:57:13] (03CR) 10GTirloni: [C: 032] labstore: correct the service name for stretch [puppet] - 10https://gerrit.wikimedia.org/r/467969 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [13:57:17] yes [13:57:33] puppet failures are mine, and patch is reverted [13:57:39] ack [13:57:40] thx [13:58:57] (03PS1) 10Bstorm: wiki replicas: repool labsdb1010 and depool labsdb1011 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/467970 (https://phabricator.wikimedia.org/T181650) [13:59:35] (03CR) 10Vgutierrez: Certcentral-authdns integration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [13:59:43] (03PS4) 10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254) [13:59:45] (03CR) 10Marostegui: [C: 032] wiki replicas: repool labsdb1010 and depool labsdb1011 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/467970 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [14:01:08] (03CR) 10Vgutierrez: "pcc seems happy with PS13: https://puppet-compiler.wmflabs.org/compiler1002/13005/" [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [14:01:08] (03CR) 10Bstorm: [C: 032] openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [14:01:10] (03PS5) 10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254) [14:01:36] !log Repool labsdb1010, depool labsdb1011 - T181650 [14:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:40] T181650: Change views for the new columns of the refactored comment 
storage - https://phabricator.wikimedia.org/T181650 [14:02:07] (03CR) 10Mathew.onipe: "> Patch Set 2:" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [14:06:34] (03CR) 10Faidon Liambotis: [C: 031] "OK, this looks good overall, but I still have questions about how we would parameterize this class for WMCS (see inline). Jenkins' -1 woul" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [14:07:28] RECOVERY - puppet last run on wdqs1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:09:29] RECOVERY - puppet last run on wdqs1009 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:09:40] (03PS2) 10Faidon Liambotis: Fix PTR for mr1-esams [dns] - 10https://gerrit.wikimedia.org/r/467712 (owner: 10Volans) [14:09:48] (03PS3) 10Faidon Liambotis: Fix PTR for mr1-esams [dns] - 10https://gerrit.wikimedia.org/r/467712 (owner: 10Volans) [14:10:06] (03CR) 10Faidon Liambotis: [C: 032] Fix PTR for mr1-esams [dns] - 10https://gerrit.wikimedia.org/r/467712 (owner: 10Volans) [14:13:29] RECOVERY - puppet last run on wdqs1007 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [14:14:38] RECOVERY - puppet last run on wdqs2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:16:39] RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:17:18] RECOVERY - puppet last run on wdqs2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:29] (03PS1) 10Muehlenhoff: Use auto_ferm to properly restrict to rsyncd on stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/467973 [14:17:41] 10Operations, 10ops-eqiad: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (10ayounsi) p:05Triage>03Normal 
[14:18:00] (03CR) 10Jforrester: "Is the code good enough to drop wgChangeTagsSchemaMigrationStage ahead of REL1_32?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467946 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [14:19:36] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10netops: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10ayounsi) [14:19:38] 10Operations, 10ops-eqiad: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (10ArielGlenn) Adding @hoo to see if we can work out timing; we could work out sometime on Oct 30th or 31st but the wikidata weeklies would be interrupted and need restarting manually. [14:19:47] 10Operations, 10ops-eqiad, 10Dumps-Generation: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (10ArielGlenn) [14:19:49] (03CR) 10GTirloni: [C: 032] Fix labs/cloud records [dns] - 10https://gerrit.wikimedia.org/r/467708 (owner: 10Volans) [14:20:40] (03PS1) 10Muehlenhoff: Use auto_ferm for eventlogging rsync module [puppet] - 10https://gerrit.wikimedia.org/r/467974 [14:23:27] (03PS3) 10Gehel: tlsproxy: allow multiple default servers on different ports [puppet] - 10https://gerrit.wikimedia.org/r/467684 (https://phabricator.wikimedia.org/T198352) [14:24:55] (03CR) 10BBlack: [C: 031] "LGTM, go forth and break things!" 
[puppet] - 10https://gerrit.wikimedia.org/r/467684 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [14:25:59] (03CR) 10Elukey: [C: 031] Use auto_ferm to properly restrict to rsyncd on stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/467973 (owner: 10Muehlenhoff) [14:28:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:29:17] (03PS1) 10Muehlenhoff: Restrict ferm service package_builder_rsync to production networks [puppet] - 10https://gerrit.wikimedia.org/r/467976 [14:30:44] (03CR) 10Ottomata: [C: 031] "nice" [puppet] - 10https://gerrit.wikimedia.org/r/467956 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:31:44] (03PS4) 10GTirloni: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [14:31:47] (03PS6) 10GTirloni: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [14:33:16] (03PS1) 10Muehlenhoff: Use auto_ferm for hdfs-archive rsyncd module [puppet] - 10https://gerrit.wikimedia.org/r/467977 [14:33:38] !log upload prometheus-statsd-exporter 0.7.0+ds1-2 - T205870 [14:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:42] T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 [14:36:28] (03CR) 10Elukey: [C: 031] Use net_topology script content rather than erb path [puppet/cdh] - 10https://gerrit.wikimedia.org/r/467766 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [14:36:50] (03PS1) 10Muehlenhoff: Switch srvdumps rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467978 
[14:36:52] (03CR) 10Elukey: [C: 031] "if pcc agrees looks good :)" [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [14:37:06] 10Operations, 10Wikimedia-Mailing-lists: New list request - https://phabricator.wikimedia.org/T207283 (10AVasanth_WMF) [14:37:07] (03PS1) 10Filippo Giunchedi: thumbor: fix missing statsd_exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/467980 (https://phabricator.wikimedia.org/T205870) [14:37:32] (03CR) 10jerkins-bot: [V: 04-1] thumbor: fix missing statsd_exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/467980 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [14:37:54] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967 (10mpopov) >>! In T168967#4662942, @Legoktm wrote: > Who made the package for Ubuntu Trusty / where did it come from? The Ubunty Trusty package was made by [[ https://www.rstudi... [14:38:19] (03PS2) 10Filippo Giunchedi: thumbor: fix missing statsd_exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/467980 (https://phabricator.wikimedia.org/T205870) [14:39:24] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: fix missing statsd_exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/467980 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [14:39:41] (03PS1) 10Muehlenhoff: Switch aptrepo::rsync to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467982 [14:40:41] (03CR) 10Elukey: [C: 031] "So I don't love the profile::hadoop::common::hadoop_cluster_name namespace, buuut for the sake of DRYness it should be ok." 
[puppet] - 10https://gerrit.wikimedia.org/r/467821 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [14:40:55] (03CR) 10GTirloni: [C: 032] hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [14:41:11] (03CR) 10Elukey: [C: 031] Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [14:41:27] (03PS2) 10Muehlenhoff: Switch aptrepo::rsync to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467982 [14:44:49] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1195 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [14:45:58] PROBLEM - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:46:08] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:46:19] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:46:32] thumbor doesn't like filippo :) [14:47:04] mhh might be the statsd-exporter upgrade, I'll check [14:49:18] RECOVERY - Check systemd state on thumbor1003 is OK: OK - running: The system is fully operational [14:50:03] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Doing), and 3 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10bmansurov) Looks like this is done, @mobrovac? 
[14:50:30] (03PS1) 10Muehlenhoff: Use auto_ferm for nfs::misc rsyncd modules [puppet] - 10https://gerrit.wikimedia.org/r/467985 [14:50:50] (03PS1) 10Filippo Giunchedi: thumbor: set name for all statsd_exporter metrics [puppet] - 10https://gerrit.wikimedia.org/r/467986 (https://phabricator.wikimedia.org/T205870) [14:51:00] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: set name for all statsd_exporter metrics [puppet] - 10https://gerrit.wikimedia.org/r/467986 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [14:51:07] (03CR) 10jerkins-bot: [V: 04-1] Use auto_ferm for nfs::misc rsyncd modules [puppet] - 10https://gerrit.wikimedia.org/r/467985 (owner: 10Muehlenhoff) [14:52:49] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational [14:53:05] (03PS1) 10Giuseppe Lavagetto: role::deployment::mediawiki: use profile::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/467987 [14:53:08] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational [14:53:30] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Doing), and 3 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10mobrovac) >>! In T205452#4674273, @bmansurov wrote: > Looks like this is done, @mobrovac? We still need @jcrespo for these tw... 
[14:55:14] jouncebot: now [14:55:14] For the next 0 hour(s) and 4 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T1300) [14:55:16] jouncebot: next [14:55:16] In 1 hour(s) and 4 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T1600) [14:56:09] 10Operations, 10Puppet: wmf-style adds 'has no call to hiera' violations for parameters already containing hiera calls - https://phabricator.wikimedia.org/T207285 (10herron) p:05Triage>03Normal [14:56:11] (03PS1) 10Filippo Giunchedi: thumbor: use statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/467988 (https://phabricator.wikimedia.org/T205870) [14:59:31] (03PS2) 10Muehlenhoff: Use auto_ferm for nfs::misc rsyncd modules [puppet] - 10https://gerrit.wikimedia.org/r/467985 [15:01:47] (03PS67) 10Vgutierrez: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:02:21] (03CR) 10Ottomata: [C: 032] Use net_topology script content rather than erb path [puppet/cdh] - 10https://gerrit.wikimedia.org/r/467766 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [15:02:44] (03PS6) 10Ottomata: Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) [15:02:48] (03CR) 10Vgutierrez: [C: 032] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:03:09] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/13008/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/467986 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [15:03:24] (03PS6) 10Ottomata: Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 
(https://phabricator.wikimedia.org/T204951) [15:03:26] (03CR) 10jerkins-bot: [V: 04-1] Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [15:03:37] (03PS1) 10Muehlenhoff: Switch prometheus-ops rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467990 [15:03:39] (03PS1) 10Muehlenhoff: Disable prometheus rsyncd module for now [puppet] - 10https://gerrit.wikimedia.org/r/467991 [15:04:01] (03PS2) 10Volans: Upgrade Netbox to upstream v2.4.6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/467965 (https://phabricator.wikimedia.org/T205896) [15:04:08] PROBLEM - Maps - OSM synchronization lag - eqiad on einsteinium is CRITICAL: 7.531e+07 ge 1.728e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [15:04:34] gehel: ^ [15:06:54] (03CR) 10Ottomata: [C: 032] Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [15:08:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] Restrict ferm service package_builder_rsync to production networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467976 (owner: 10Muehlenhoff) [15:08:30] (03PS5) 10GTirloni: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:08:32] (03PS7) 10GTirloni: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:08:47] (03PS7) 10Ottomata: Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) [15:10:09] (03PS1) 10Alex Monk: dirs: Create /etc/certcentral/accounts [software/certcentral] (debian) - 
10https://gerrit.wikimedia.org/r/467992 [15:10:19] akosiaris: looking [15:11:22] (03PS2) 10Alex Monk: Create /etc/certcentral/accounts [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/467992 [15:12:04] (03CR) 10Filippo Giunchedi: [C: 031] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/467958 (owner: 10Muehlenhoff) [15:12:42] maps is a false alarm, the check is wrong [15:12:49] I'll deal with it (cc onimisionipe) [15:13:01] ok thanks [15:13:43] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - eqiad on einsteinium is CRITICAL: 7.531e+07 ge 1.728e+05 Gehel check is wrong and needs to be fixed (it checks also for maps1004 which is being reimaged and reimported) https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [15:16:16] (03PS1) 10Alex Monk: certcentral: Add sslcert::dhparam requirement [puppet] - 10https://gerrit.wikimedia.org/r/467997 (https://phabricator.wikimedia.org/T194962) [15:17:19] (03PS8) 10Ottomata: Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) [15:17:26] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Add sslcert::dhparam requirement [puppet] - 10https://gerrit.wikimedia.org/r/467997 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:17:40] (03PS1) 10Filippo Giunchedi: smart: switch to syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/467998 [15:19:17] (03PS7) 10Herron: New Kafka cluster logging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:20:12] (03CR) 10Vgutierrez: [V: 032 C: 032] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/13012/" [puppet] - 10https://gerrit.wikimedia.org/r/467997 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:20:49] (03PS2) 10Banyek: mariadb: enable replication check on Parsercache hosts [puppet] - 
10https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) [15:22:15] (03CR) 10Ottomata: New Kafka cluster logging-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:23:29] 10Operations, 10monitoring: Review prometheus_nodes params - https://phabricator.wikimedia.org/T207292 (10herron) p:05Triage>03Normal [15:23:45] (03CR) 10Ottomata: [C: 032] "Looks good! https://puppet-compiler.wmflabs.org/compiler1002/13013/stat1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [15:23:51] (03PS9) 10Ottomata: Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) [15:24:23] (03CR) 10Ottomata: [V: 032 C: 032] Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [15:28:11] !log enabling db2096 for cluster x1 (T206593) [15:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:14] T206593: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 [15:28:45] (03PS1) 10Paladox: Planet: Fix cron to update feeds [puppet] - 10https://gerrit.wikimedia.org/r/468002 [15:29:31] (03PS2) 10Paladox: Planet: Fix cron to update feeds [puppet] - 10https://gerrit.wikimedia.org/r/468002 [15:30:27] (03PS1) 10Ottomata: Temporarily revert the change to net-topology.py.erb [puppet] - 10https://gerrit.wikimedia.org/r/468004 (https://phabricator.wikimedia.org/T204951) [15:30:31] PROBLEM - Check systemd state on certcentral1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:30:51] PROBLEM - Check systemd state on certcentral2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:30:59] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T206593: Enabling db2096 for x1 (duration: 00m 56s) [15:31:01] PROBLEM - puppet last run on certcentral2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/certcentral/accounts/6e01c693ed6e9d9a6b5930923ecef104] [15:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:07] (03CR) 10Ottomata: [V: 032 C: 032] Temporarily revert the change to net-topology.py.erb [puppet] - 10https://gerrit.wikimedia.org/r/468004 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [15:31:11] 10Operations, 10Certcentral, 10monitoring: Create icinga checks - https://phabricator.wikimedia.org/T207294 (10Krenair) [15:31:16] 10Operations, 10Certcentral, 10monitoring: Create icinga checks for certcentral - https://phabricator.wikimedia.org/T207294 (10Krenair) [15:32:24] I have to redo the last one, as I forgot to merge. 
:( [15:32:30] (03CR) 10Filippo Giunchedi: New Kafka cluster logging-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:32:47] (03CR) 10Banyek: [C: 032] mariadb: enable db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467954 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [15:33:58] (03Merged) 10jenkins-bot: mariadb: enable db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467954 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [15:34:15] (03PS2) 10GTirloni: Fix labs/cloud records [dns] - 10https://gerrit.wikimedia.org/r/467708 (owner: 10Volans) [15:34:30] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T206593: Enabling db2096 for x1 (duration: 00m 56s) [15:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:33] T206593: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 [15:35:20] (03PS8) 10Filippo Giunchedi: New Kafka cluster logging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) [15:35:22] (03PS8) 10Filippo Giunchedi: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) [15:36:07] (03CR) 10Filippo Giunchedi: New Kafka cluster logging-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [15:38:02] (03CR) 10jenkins-bot: mariadb: enable db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467954 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [15:39:11] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:42:22] (03CR) 10Filippo Giunchedi: [C: 032] smart: switch to syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/467998 (owner: 10Filippo Giunchedi) [15:42:29] (03PS2) 10Filippo Giunchedi: smart: switch to syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/467998 [15:44:19] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] smart: switch to syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/467998 (owner: 10Filippo Giunchedi) [15:44:52] (03Abandoned) 10Banyek: mariadb: depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465634 (owner: 10Banyek) [15:46:23] (03CR) 10Alex Monk: [C: 04-1] "rebase, see comments, add T194962 to commit message, make changes to deployment-prep cherry-pick" [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [15:47:46] (03Abandoned) 10Faidon Liambotis: Edit Project Config [software/netbox] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/433754 (owner: 10Faidon Liambotis) [15:49:25] 10Operations, 10Wikimedia-Logstash: Rationalize default logrotate "rotated" file extensions - https://phabricator.wikimedia.org/T207296 (10herron) p:05Triage>03Normal [15:50:05] (03PS1) 10Cwhite: move statsd cname to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/468009 (https://phabricator.wikimedia.org/T196484) [15:50:16] (03CR) 10Ladsgroup: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467946 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [15:52:35] (03PS14) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) [15:53:44] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.26/extensions/AbuseFilter/: sync AbuseFilter revision 4e2a6b665825e0850138487ee395a3f55f3dec96 to 1.32.0-wmf.26 refs T207220 (duration: 00m 58s) [15:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:53:48] T207220: AFComputedVariable.php: Argument to getLinksFromDB() must be an instance of Article - https://phabricator.wikimedia.org/T207220 [15:58:44] elukey, hey, I notice that you have two cherry-picks on deployment-puppetmaster03:/var/lib/git/labs/private where one is the revert of the other [15:58:52] currently HEAD~4 and HEAD~5 [15:59:11] please can you drop those commits if you don't need them? [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T1600). Please do the needful. [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:02:41] (03PS3) 10Dduvall: Project clone URLs based on access control [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/467843 [16:03:03] Krenair: sure [16:03:45] (03PS1) 10Alex Monk: authdns-certcentral key files [labs/private] - 10https://gerrit.wikimedia.org/r/468012 [16:04:13] elukey, thanks. just trying to get rid of stuff we don't need sitting there as cherry-picks :) [16:08:32] (03CR) 10Eevans: [C: 031] "I agree with this in principle (it was always supposed to be active-active), but ATM we're tight on IO (thanks to those awful Samsung SSDs" [puppet] - 10https://gerrit.wikimedia.org/r/467742 (owner: 10Alexandros Kosiaris) [16:09:06] (03PS4) 10Dduvall: Project clone URLs based on access control [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/467843 [16:11:56] (03PS5) 10Dduvall: Project clone URLs based on access control [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/467843 [16:12:53] (03CR) 10Ottomata: [C: 031] "Great documentation :)" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [16:13:42] Krenair: did you already remove them? 
I don't find anything mine [16:13:55] ah wait, /var/lib/git/labs/private ?? [16:14:01] yes [16:15:21] super weird sorry, didn't remember to have added anything like that [16:15:26] just removed them [16:15:58] (03CR) 10Gehel: "puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/13014/" [puppet] - 10https://gerrit.wikimedia.org/r/467684 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [16:17:11] (03PS2) 10Alex Monk: Add authdns-certcentral public key file [labs/private] - 10https://gerrit.wikimedia.org/r/468012 [16:17:25] (03PS4) 10Gehel: tlsproxy: allow multiple default servers on different ports [puppet] - 10https://gerrit.wikimedia.org/r/467684 (https://phabricator.wikimedia.org/T198352) [16:17:46] elukey, ty! [16:18:07] (03CR) 10Gehel: [C: 032] tlsproxy: allow multiple default servers on different ports [puppet] - 10https://gerrit.wikimedia.org/r/467684 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [16:19:42] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.21 seconds [16:20:41] (03CR) 10Vgutierrez: [V: 032 C: 032] Add authdns-certcentral public key file [labs/private] - 10https://gerrit.wikimedia.org/r/468012 (owner: 10Alex Monk) [16:24:32] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 610.28 seconds [16:27:31] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10RobH) [16:27:37] 10Operations, 10Traffic, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10RobH) 05Open>03Resolved synced with @faidon, sv9 is the virtual site for the eq peering link and thus won't have real notices. the notices for... 
[16:32:08] 10Operations, 10monitoring, 10User-fgiunchedi: Review prometheus_nodes params - https://phabricator.wikimedia.org/T207292 (10fgiunchedi) [16:35:52] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.978 second response time [16:39:21] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:20] (03PS1) 10Vgutierrez: secret: Move authdns-certcentral keys to keyholder directory [labs/private] - 10https://gerrit.wikimedia.org/r/468025 [16:49:13] (03CR) 10Vgutierrez: [V: 032 C: 032] secret: Move authdns-certcentral keys to keyholder directory [labs/private] - 10https://gerrit.wikimedia.org/r/468025 (owner: 10Vgutierrez) [16:50:01] (03PS1) 10Bstorm: wiki replicas: repool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/468028 [16:52:50] (03CR) 10Banyek: [C: 032] wiki replicas: repool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/468028 (owner: 10Bstorm) [16:53:10] !log repooling labsdb1011 [16:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:31] (03PS2) 10Banyek: wiki replicas: repool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/468028 (owner: 10Bstorm) [16:53:34] (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: repool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/468028 (owner: 10Bstorm) [16:53:44] chasemp, bawolff, is it OK if I migrate the 'security-tools' project sometime later today? Any active services running in there?
[16:59:33] I have no idea what's in that project [16:59:43] i think it was darian's thing [17:00:09] yeah, I'd think it was abandoned but Chase spoke up in its defense on the purge page [17:00:21] bawolff: I can remove you as a member/admin if you don't want to hear about it in the future :) [17:00:43] Well it's probably good to have access just in case [17:01:11] 'k [17:01:17] i certainly have no objection to it being migrated [17:02:02] hm, Reedy is also a project admin, as is bd808, johnben (IRC handle unknown), csteipp (no longer at the foundation), chasemp [17:02:06] that's everyone :) [17:02:53] I'm probably only there by accident of creating the project [17:03:03] That is john's irc handle [17:03:15] he is online but not in this channel [17:04:41] (03PS1) 10Andrew Bogott: Horizon: move some more projects to eqiad1: [puppet] - 10https://gerrit.wikimedia.org/r/468034 [17:04:43] (03PS1) 10Bstorm: wiki replicas: depool labsdb1009 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/468035 (https://phabricator.wikimedia.org/T181650) [17:05:33] (03PS2) 10Andrew Bogott: Horizon: move some more projects to eqiad1: [puppet] - 10https://gerrit.wikimedia.org/r/468034 [17:07:27] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1009 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/468035 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [17:07:57] (03CR) 10Andrew Bogott: [C: 032] Horizon: move some more projects to eqiad1: [puppet] - 10https://gerrit.wikimedia.org/r/468034 (owner: 10Andrew Bogott) [17:08:18] !log depooling labsdb1009 (T181650) [17:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:22] T181650: Change views for the new columns of the refactored comment storage - https://phabricator.wikimedia.org/T181650 [17:10:01] (03PS15) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) [17:10:01] PROBLEM -
MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:12:12] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:13:21] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.706 second response time [17:16:41] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:37] (PS16) Alex Monk: Certcentral-authdns integration [puppet] - https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) [17:20:59] ufffff checking memcached metrics [17:21:40] WANCache:t:commonswiki:gadgets-definition:9:2 again... [17:22:00] (PS6) Dduvall: Project clone URLs based on access control [software/gerrit/plugins/go-import] (stable-2.15) - https://gerrit.wikimedia.org/r/467843 [17:22:33] but this time no connections yielded.. 
[17:25:16] but the issue was of course mc1035 [17:29:09] (PS1) Vgutierrez: secret: Fix authdns_certcentral key filename [labs/private] - https://gerrit.wikimedia.org/r/468043 [17:29:35] (CR) Vgutierrez: [V: 2 C: 2] secret: Fix authdns_certcentral key filename [labs/private] - https://gerrit.wikimedia.org/r/468043 (owner: Vgutierrez) [17:29:44] (PS1) Elukey: profile::analytics::refinery::job::camus: temp disable eventlogging-client-side [puppet] - https://gerrit.wikimedia.org/r/468044 (https://phabricator.wikimedia.org/T206542) [17:32:25] (PS2) Elukey: profile::analytics::refinery::job::camus: temp disable eventlogging-client-side [puppet] - https://gerrit.wikimedia.org/r/468044 (https://phabricator.wikimedia.org/T206542) [17:33:15] (CR) Elukey: [C: 2] profile::analytics::refinery::job::camus: temp disable eventlogging-client-side [puppet] - https://gerrit.wikimedia.org/r/468044 (https://phabricator.wikimedia.org/T206542) (owner: Elukey) [17:34:54] (PS17) Alex Monk: Certcentral-authdns integration [puppet] - https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) [17:35:29] (PS5) Elukey: profile::hadoop::balancer: move to systemd timer [puppet] - https://gerrit.wikimedia.org/r/467956 (https://phabricator.wikimedia.org/T172532) [17:35:34] (PS1) Ottomata: Use net_topology_script_path to configure net.topology.script.file.name [puppet/cdh] - https://gerrit.wikimedia.org/r/468046 (https://phabricator.wikimedia.org/T204951) [17:36:08] (CR) Ottomata: [V: 2 C: 2] Use net_topology_script_path to configure net.topology.script.file.name [puppet/cdh] - https://gerrit.wikimedia.org/r/468046 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [17:36:14] (CR) Elukey: [C: 2] profile::hadoop::balancer: move to systemd timer [puppet] - https://gerrit.wikimedia.org/r/467956 (https://phabricator.wikimedia.org/T172532) (owner: Elukey) [17:36:18] (PS3) Banyek: mariadb: 
enable replication check on Parsercache hosts [puppet] - https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) [17:40:29] (PS3) Alex Monk: Create /etc/certcentral/accounts [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/467992 [17:41:22] (PS4) Alex Monk: Create /etc/certcentral/accounts [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/467992 [17:43:27] (CR) Vgutierrez: [V: 2 C: 2] Create /etc/certcentral/accounts [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/467992 (owner: Alex Monk) [17:43:52] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 29.16 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:44:02] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:44:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:44:45] (PS1) Ottomata: Fix net-topology.py.erb script to render proper python dict [puppet] - https://gerrit.wikimedia.org/r/468049 (https://phabricator.wikimedia.org/T204951) [17:45:32] (CR) jenkins-bot: Create /etc/certcentral/accounts [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/467992 (owner: Alex Monk) [17:47:36] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (RobH) a:Cmjohnson>RobH Chris installed the new raid controller, I'm taking for installation. 
[17:48:05] (PS2) Ottomata: Fix net-topology.py.erb script to render proper python dict [puppet] - https://gerrit.wikimedia.org/r/468049 (https://phabricator.wikimedia.org/T204951) [17:50:00] (PS2) Bstorm: wiki replicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/468035 (https://phabricator.wikimedia.org/T181650) [17:50:12] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.255 second response time [17:50:31] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 73.92 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:50:35] (CR) Ottomata: [C: 2] "Good https://puppet-compiler.wmflabs.org/compiler1002/13019/stat1004.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/468049 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [17:50:41] (CR) Ottomata: [C: 2] Fix net-topology.py.erb script to render proper python dict [puppet] - https://gerrit.wikimedia.org/r/468049 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [17:51:16] (CR) ArielGlenn: "What happens when the same ferm rule is generated from different rsync::server::module invocations?" [puppet] - https://gerrit.wikimedia.org/r/467985 (owner: Muehlenhoff) [17:52:14] (PS3) Bstorm: wiki replicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/468035 (https://phabricator.wikimedia.org/T181650) [17:53:42] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:10] (CR) Ottomata: [C: 2] "I'm pretty sure that the label will not matter to the dashboard until we actually add a second label value with cloud-analytics-eqiad. 
I'" [puppet] - https://gerrit.wikimedia.org/r/467821 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [17:54:17] (PS2) Ottomata: Label Hadoop prometheus metrics with the hadoop_cluster_name [puppet] - https://gerrit.wikimedia.org/r/467821 (https://phabricator.wikimedia.org/T204951) [17:54:20] (CR) Ottomata: [V: 2 C: 2] Label Hadoop prometheus metrics with the hadoop_cluster_name [puppet] - https://gerrit.wikimedia.org/r/467821 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [17:59:46] (PS9) Herron: New Kafka cluster logging-eqiad [puppet] - https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: Filippo Giunchedi) [18:01:27] (CR) Herron: [C: 2] New Kafka cluster logging-eqiad [puppet] - https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) (owner: Filippo Giunchedi) [18:03:09] Operations: skylake CPU numa clustering settting discussion - https://phabricator.wikimedia.org/T207312 (RobH) p:Triage>Normal [18:04:05] (PS4) Banyek: wiki replicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/468035 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm) [18:04:09] (CR) Banyek: [V: 2 C: 2] wiki replicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/468035 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm) [18:04:46] Operations, Wikimedia-Mailing-lists: New list request for 1lib1ref - https://phabricator.wikimedia.org/T207283 (Aklapper) [18:07:42] !log depooling labsdb1009 (T181650) [18:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:45] T181650: Change views for the new columns of the refactored comment storage - https://phabricator.wikimedia.org/T181650 [18:09:50] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units 
failed. [18:11:39] Operations: skylake CPU numa clustering settting discussion - https://phabricator.wikimedia.org/T207312 (bd808) Some reading on OpenStack and NUMA: * https://www.stratoscale.com/blog/openstack/cpu-pinning-and-numa-awareness/ * https://docs.openstack.org/nova/pike/admin/cpu-topologies.html * https://access.re... [18:12:11] Operations, Electron-PDFs, Proton, Patch-For-Review, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Pchelolo) [18:14:35] (PS2) Dzahn: nagios_common: convert check_sslxNN to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/467831 (https://phabricator.wikimedia.org/T202782) [18:15:56] (CR) Filippo Giunchedi: [C: 1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/468009 (https://phabricator.wikimedia.org/T196484) (owner: Cwhite) [18:16:32] (CR) Cwhite: [C: 2] move statsd cname to graphite1004 [dns] - https://gerrit.wikimedia.org/r/468009 (https://phabricator.wikimedia.org/T196484) (owner: Cwhite) [18:17:50] !log moving statsd cname to graphite1004 [18:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:06] (PS1) Ottomata: Move analytics cluster druid deep storage out of profile::hadoop::master [puppet] - https://gerrit.wikimedia.org/r/468054 [18:22:30] (PS2) Ottomata: Move analytics cluster druid deep storage out of profile::hadoop::master [puppet] - https://gerrit.wikimedia.org/r/468054 [18:24:59] (CR) Ottomata: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13020/an-master1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/468054 (owner: Ottomata) [18:27:04] Operations, Performance-Team, Traffic: Investigate 200-300ms increase in responseStart.p75 - https://phabricator.wikimedia.org/T207315 (Krinkle) [18:28:31] Operations, ops-eqiad, ops-ulsfo, DC-Ops: connect atlas-ulsfo to scs-ulsfo - https://phabricator.wikimedia.org/T206185 
(RobH) Ahh, @Cmjohnson can you pull one of those adapters and send it to me please? Just to my home address since its so tiny. [18:29:10] (CR) Dzahn: "i don't see any services actually using this checkcommand and then i found 991dea7f7055896b6dba -> "Replace check_sslxNN with check_ssl_u" [puppet] - https://gerrit.wikimedia.org/r/467831 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn) [18:30:01] PROBLEM - statsd UDP receive errors are elevated on graphite1001 is CRITICAL: 4.456 ge 2 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [18:32:23] looking ^ [18:33:35] (PS3) Dzahn: nagios_common: remove check_sslxN [puppet] - https://gerrit.wikimedia.org/r/467831 (https://phabricator.wikimedia.org/T202782) [18:36:18] mutante: typo [18:36:32] (PS4) Dzahn: nagios_common: remove check_sslxN [puppet] - https://gerrit.wikimedia.org/r/467831 (https://phabricator.wikimedia.org/T202782) [18:36:34] (PS1) Ottomata: Add more labels to Hadoop daemon JMX prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/468056 (https://phabricator.wikimedia.org/T204951) [18:37:12] this ?:) [18:37:13] (PS5) Dzahn: nagios_common: remove check_sslxNN [puppet] - https://gerrit.wikimedia.org/r/467831 (https://phabricator.wikimedia.org/T202782) [18:37:14] looks like a spike in udp errors and going down now [18:37:19] (CR) Ottomata: Add more labels to Hadoop daemon JMX prometheus metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/468056 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [18:37:31] (CR) jerkins-bot: [V: -1] Add more labels to Hadoop daemon JMX prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/468056 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [18:37:50] RECOVERY - statsd UDP receive errors are elevated on graphite1001 is OK: (C)2 ge (W)1 ge 0.7649 
https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [18:38:06] (CR) Faidon Liambotis: [C: 2] "As the original author of this... LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/467831 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn) [18:38:28] (PS6) Dzahn: nagios_common: remove check_sslxNN [puppet] - https://gerrit.wikimedia.org/r/467831 (https://phabricator.wikimedia.org/T202782) [18:38:30] thanks:) [18:38:42] (PS2) Ottomata: Add more labels to Hadoop daemon JMX prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/468056 (https://phabricator.wikimedia.org/T204951) [18:38:50] ottomata: reminder that adding labels will create new metrics [18:38:58] just in case! [18:39:43] Operations, ops-eqiad, Dumps-Generation: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (hoo) Well, last week Sunday afternoon was the only possible time for this :( … mid-term we will (hopefully) solve T206535, but for now we should either disable the cron(s) or I can carefully stop... [18:39:49] !log restart jmxtrans on kafka hosts [18:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:30] (PS3) Dzahn: nagios_common: convert check_jnx_alarms to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/467816 (https://phabricator.wikimedia.org/T202782) [18:40:41] mutante: also typo, two spaces instead of one :P [18:40:55] godog: ya we know [18:40:58] but we are adding a 2nd hadoop cluster for presto/cloud/datalake stuff, so we need labels to make dashboards better [18:41:15] (PS4) Dzahn: nagios_common: convert check_jnx_alarms to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/467816 (https://phabricator.wikimedia.org/T202782) [18:41:58] ottomata: yeah that makes sense [18:42:00] (CR) Ottomata: "Also, elukey, do we have icinga alerts based on these metrics that might need updated?" 
[puppet] - https://gerrit.wikimedia.org/r/468056 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [18:44:51] (PS1) Hoo man: Wikidata entity dumps: Halve pagesPerBatch [puppet] - https://gerrit.wikimedia.org/r/468059 (https://phabricator.wikimedia.org/T147169) [18:46:10] (PS2) Hoo man: Wikidata entity dumps: Halve pagesPerBatch [puppet] - https://gerrit.wikimedia.org/r/468059 (https://phabricator.wikimedia.org/T147169) [18:46:17] (PS1) Bstorm: Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/468060 [18:46:26] (CR) Dzahn: [C: 1] "tested on einsteinium:" [puppet] - https://gerrit.wikimedia.org/r/467816 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn) [18:47:08] paravoid: typo fixed and tested directly in prod from /tmp/check_jnx_alarms .. works [18:47:31] (CR) Banyek: [C: 2] Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/468060 (owner: Bstorm) [18:47:40] (PS2) Banyek: Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/468060 (owner: Bstorm) [18:47:44] (CR) Banyek: [V: 2 C: 2] Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/468060 (owner: Bstorm) [18:48:07] !log restart navtiming on webperf nodes [18:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:20] !log repooling labsdb1009 (T181650) [18:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:23] T181650: Change views for the new columns of the refactored comment storage - https://phabricator.wikimedia.org/T181650 [18:49:56] (PS1) Andrew Bogott: labvirt1018: rename to cloudvirt1018 [puppet] - https://gerrit.wikimedia.org/r/468062 [18:50:05] (PS1) Volans: Add Cas Rusnov to ops group (root) [puppet] - https://gerrit.wikimedia.org/r/468063 
(https://phabricator.wikimedia.org/T207009) [18:50:21] * Krinkle staging on mwdebug1001 [18:51:02] sigh ores isn't refreshing dns records, will need a rolling restart [18:52:02] awight: around? anything in particular to know wrt restarting ores? [18:52:43] godog: I can restart it now, thanks for the note. [18:53:16] awight: thanks! happy to help too if I can, will file a task too [18:54:03] godog: for future reference, here's the command I run on ores*, in serial, not parallel: [18:54:10] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.507 second response time [18:54:15] Operations, Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (Volans) [18:54:23] sudo service celery-ores-worker restart [18:54:26] (PS1) Andrew Bogott: labvirt1018: rename to cloudvirt1018 [dns] - https://gerrit.wikimedia.org/r/468064 (https://phabricator.wikimedia.org/T207317) [18:54:28] (PS1) Andrew Bogott: cloudvirt1018: remove obsolete labvirt1018.mgmt entries [dns] - https://gerrit.wikimedia.org/r/468065 (https://phabricator.wikimedia.org/T207317) [18:54:41] (PS2) Andrew Bogott: labvirt1018: rename to cloudvirt1018 [puppet] - https://gerrit.wikimedia.org/r/468062 (https://phabricator.wikimedia.org/T207317) [18:54:57] godog: If there's already a broader task, I can tag the SAL entry... [18:55:35] (PS1) Dzahn: icinga: convert check_lonqueries.pl to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/468066 (https://phabricator.wikimedia.org/T202782) [18:55:41] awight: yeah T88997 that would be it [18:55:41] T88997: Improve graphite failover - https://phabricator.wikimedia.org/T88997 [18:55:47] ty! 
[18:56:52] andrewbogott: btw, when I was looking at the VLAN situation last week, I found a number of switch ports named as lab* [18:57:14] !log Restarting ORES cluster to refresh DNS, T88997 [18:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:19] andrewbogott: so when you're renaming hosts, it might be a good idea to also file tasks for a) switch ports b) netbox entries and c) physical labels to be renamed as well, if you aren't already! [18:57:29] Operations, Graphite, Patch-For-Review, Performance-Team (Radar), Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (Krinkle) [18:57:31] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:57:32] andrewbogott: unless you'd like to do this in one go at the end of your renaming process? not sure [18:57:39] paravoid: Right now we have hosts running with both names (lab* in the old region, cloud* in the new region) [18:57:40] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:57:40] godog: I'll ping you when the cluster finishes. [18:57:47] so as I move hosts I've been opening tasks to relabel [18:57:52] awight: sweet, thanks! [18:58:08] were there labvirt things > 1018? [18:58:11] andrewbogott: ah, didn't know that. 
apparently we weren't doing the switch ports as part of those tasks then [18:58:11] (CR) Dzahn: [C: 2] nagios_common: convert check_jnx_alarms to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/467816 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn) [18:58:25] (PS5) Dzahn: nagios_common: convert check_jnx_alarms to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/467816 (https://phabricator.wikimedia.org/T202782) [18:58:25] paravoid: ok, I'll make sure to mention that in my next rename task [18:58:25] yeah there were, I fixed them all last week I think [18:58:30] (which is coming right up :) ) [18:58:40] Operations, Graphite, Patch-For-Review, Performance-Team (Radar), Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (colewhite) > godog: for future reference, here's the command I run on ores*, in serial, not parallel: > sudo service celery-ores-work... [18:58:41] heh [18:58:56] godog: What old hostnames are you seeing? I should double-check the ORES config. [18:59:27] (PS3) Andrew Bogott: labvirt1018: rename to cloudvirt1018 [puppet] - https://gerrit.wikimedia.org/r/468062 (https://phabricator.wikimedia.org/T207317) [18:59:40] (PS2) Volans: Add Cas Rusnov to ops group (root) [puppet] - https://gerrit.wikimedia.org/r/468063 (https://phabricator.wikimedia.org/T207009) [18:59:46] andrewbogott: fwiw, last week I renamed on the switch cloudvirt1019/20/21/22 + cloudcontrol1004 + cloudnet1004 [18:59:52] awight: statsd.eqiad.wmnet has changed address, with ores hosts still sending to the old address [19:00:04] twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T1900). 
[19:00:16] (PS4) Andrew Bogott: labvirt1018: rename to cloudvirt1018 [puppet] - https://gerrit.wikimedia.org/r/468062 (https://phabricator.wikimedia.org/T207317) [19:00:18] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.26/includes/cache/: T193271 - I25aa0e27200a0 (duration: 01m 01s) [19:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:21] T193271: Refactor MessageCache to deal with NS_MEDIAWIKI pages that aren't standard interface messages - https://phabricator.wikimedia.org/T193271 [19:00:31] paravoid: thanks! Hopefully we can rename them as we go but worst case we can grep for stragglers after everything is moved [19:00:34] godog: okay got it, so not config. [19:00:38] (CR) Volans: [C: 2] Add Cas Rusnov to ops group (root) [puppet] - https://gerrit.wikimedia.org/r/468063 (https://phabricator.wikimedia.org/T207009) (owner: Volans) [19:00:40] (CR) Dzahn: "we should add to LDAP group "ops" together with this being merged" [puppet] - https://gerrit.wikimedia.org/r/468063 (https://phabricator.wikimedia.org/T207009) (owner: Volans) [19:00:44] andrewbogott: nod, that makes sense [19:01:05] (PS3) Volans: Add Cas Rusnov to ops group (root) [puppet] - https://gerrit.wikimedia.org/r/468063 (https://phabricator.wikimedia.org/T207009) [19:01:10] (CR) Volans: [V: 2 C: 2] Add Cas Rusnov to ops group (root) [puppet] - https://gerrit.wikimedia.org/r/468063 (https://phabricator.wikimedia.org/T207009) (owner: Volans) [19:01:14] damn gerrit ui [19:01:15] sorry [19:01:34] volans: should i do the LDAP edit? [19:01:39] ottomata: looks like the rename affected a bunch of alarms in icinga btw, all NaN [19:01:40] mutante: already done [19:01:41] thanks [19:01:44] alright [19:02:05] paravoid: while you're here… the af-test.automation-framework.eqiad.wmflabs VM is giving me trouble with migrating (probably because its puppet is broken.) 
It was in a SHUTOFF state before I started… can I just delete it or should I try to salvage? [19:02:18] I have no idea what that is :) [19:02:21] volans may know? [19:02:23] (Looks like you created it 2017-07-20 [19:02:24] ( [19:02:27] haha [19:02:41] paravoid: was yours ;) [19:02:49] no idea what it is, kill it [19:02:56] great, thanks :) [19:03:25] have you thought about setting some kind of expiry time on VMs when they're initially created, and emailing people at that date to ask them if they still need them? [19:03:44] sometimes we spawn short-lived stuff and forget them, could save you a lot of trouble! [19:03:49] volans: the "+2 on gerrit" checkbox should be automatically solved [19:03:55] yes [19:03:57] it is [19:03:58] already checked [19:03:59] great [19:04:33] Operations, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (RobH) [19:04:54] uh oh godog with labels/ looking [19:05:13] paravoid: we do periodic project-by-project purges. Automatic lifespans for VMs is a thing I think about pretty often but I'm not sure it would catch very many: opt-in and I'd delete things by mistake, opt-out and people will just filter the emails. [19:05:16] Worth revisiting though [19:05:33] hmm, i wouldn't expect that to change the result, there aren't other labels [19:05:34] hm [19:05:46] volans: that project is all moved now. [19:06:20] (CR) Nuria: [C: 1] profile::hadoop::balancer: move to systemd timer [puppet] - https://gerrit.wikimedia.org/r/467956 (https://phabricator.wikimedia.org/T172532) (owner: Elukey) [19:06:47] andrewbogott: great thanks! [19:06:52] ottomata: do you (or does anyone) know about the 'services' project? It used to be Gabriel's thing I think... 
[19:07:03] currently contains one VM, 'appservice' [19:07:08] ohh because it is now returning multiple metrics [19:07:09] yar [19:07:22] (CR) Dzahn: [C: 2] "yep, i also don't see anything using this" [puppet] - https://gerrit.wikimedia.org/r/467103 (owner: Krinkle) [19:07:42] Operations, Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (Volans) [19:07:50] andrewbogott: ah! interesting. I'd imagine something like send two warning emails, then shutoff, then a month later if it's still shutoff delete, or something like that [19:08:07] paravoid: that's true, 'shutoff' is maybe a useful warning state [19:08:10] (PS2) Dzahn: mediawiki: Remove unused 'role::logging::mediawiki::errors' [puppet] - https://gerrit.wikimedia.org/r/467103 (owner: Krinkle) [19:08:13] that's how we handle seemingly-abandoned projects. [19:08:23] ah nice [19:08:48] (PS5) Andrew Bogott: labvirt1018: rename to cloudvirt1018 [puppet] - https://gerrit.wikimedia.org/r/468062 (https://phabricator.wikimedia.org/T207317) [19:09:56] !log roll-restart eventbus for statsd DNS change - T88997 [19:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:59] (PS2) Dzahn: icinga: convert check_longqueries.pl to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/468066 (https://phabricator.wikimedia.org/T202782) [19:10:00] T88997: Improve graphite failover - https://phabricator.wikimedia.org/T88997 [19:10:50] (PS1) Ottomata: Use hadoop_cluster label in icinga alerts [puppet] - https://gerrit.wikimedia.org/r/468068 (https://phabricator.wikimedia.org/T204951) [19:10:57] (CR) Andrew Bogott: [C: 2] labvirt1018: rename to cloudvirt1018 [dns] - https://gerrit.wikimedia.org/r/468064 (https://phabricator.wikimedia.org/T207317) (owner: Andrew Bogott) [19:12:00] !log scb1003 - restart pdfrender [19:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:04] 
(PS6) Andrew Bogott: labvirt1018: rename to cloudvirt1018 [puppet] - https://gerrit.wikimedia.org/r/468062 (https://phabricator.wikimedia.org/T207317) [19:12:33] (CR) Ottomata: [C: 2] Use hadoop_cluster label in icinga alerts [puppet] - https://gerrit.wikimedia.org/r/468068 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [19:12:39] (PS2) Ottomata: Use hadoop_cluster label in icinga alerts [puppet] - https://gerrit.wikimedia.org/r/468068 (https://phabricator.wikimedia.org/T204951) [19:12:41] (CR) Ottomata: [V: 2 C: 2] Use hadoop_cluster label in icinga alerts [puppet] - https://gerrit.wikimedia.org/r/468068 (https://phabricator.wikimedia.org/T204951) (owner: Ottomata) [19:12:46] Operations, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (RobH) a:RobH>Andrew [19:13:06] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.007 second response time [19:13:55] I apologize to whoever I'm fighting for this gerrit merge [19:16:08] (PS7) Andrew Bogott: labvirt1018: rename to cloudvirt1018 [puppet] - https://gerrit.wikimedia.org/r/468062 (https://phabricator.wikimedia.org/T207317) [19:16:10] (PS1) Andrew Bogott: cloudvirt1024: Make a cloudvirt node [puppet] - https://gerrit.wikimedia.org/r/468069 [19:16:15] I'm going to restart zuul shortly btw [19:16:38] actually trying reload first [19:17:08] (CR) Andrew Bogott: [C: 2] labvirt1018: rename to cloudvirt1018 [puppet] - https://gerrit.wikimedia.org/r/468062 (https://phabricator.wikimedia.org/T207317) (owner: Andrew Bogott) [19:17:32] (CR) Andrew Bogott: [C: 2] cloudvirt1024: Make a cloudvirt node [puppet] - https://gerrit.wikimedia.org/r/468069 (owner: Andrew Bogott) [19:17:35] yeah, no [19:19:06] !log restart zuul for statsd DNS change - T88997 [19:19:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:10] T88997: Improve graphite failover - https://phabricator.wikimedia.org/T88997 [19:20:15] (PS1) Ottomata: [WIP] Configure cloud-analytics-eqiad Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/468070 (https://phabricator.wikimedia.org/T204951) [19:20:36] (PS3) Dzahn: icinga: convert check_longqueries.pl to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/468066 (https://phabricator.wikimedia.org/T202782) [19:21:25] (CR) Dzahn: "in this case i also don't see the command actually being used by a service" [puppet] - https://gerrit.wikimedia.org/r/468066 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn) [19:21:38] ottomata: I hadn't realized you were building a cloud-analytics cluster, with a very cursory look I think it needs a bit more thinking [19:22:11] ottomata: one of the issues is that we're going towards a direction that wouldn't allow WMCS to access internal 10/8 address space [19:22:45] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:23:06] !log Mediawiki train is still blocked by T207288 [19:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:09] T207288: Text in the Sidebar does no longer show the message text, only the message name - https://phabricator.wikimedia.org/T207288 [19:24:50] (PS1) Ottomata: Conditionally pass in zookeeper_hosts to cdh::hive [puppet] - https://gerrit.wikimedia.org/r/468071 (https://phabricator.wikimedia.org/T204951) [19:24:58] ottomata: and as far as I can see you're putting this in the analytics-specific network? 
that is not a great choice anyway -- basically analytics networks exist as to be more secure/separated from prod, and you're combining it with WMCS networks that are considered less secure than prod [19:25:16] paravoid: it doesn't have to go in analytics network [19:25:24] just analytics stuff needs to push data there [19:25:36] no, but we need to figure out where it should go, and I don't have an answer off-hand :) [19:25:48] aye [19:25:49] I think it needs to be thought through [19:25:54] paravoid: what needs more thought? where to put it? or if we should do it? [19:26:09] !log restart eventlogging for statsd DNS change - T88997 [19:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:12] T88997: Improve graphite failover - https://phabricator.wikimedia.org/T88997 [19:26:14] where to put it, I'm assuming that the "if we should do it" question is already answered :) [19:26:18] ok phew [19:26:20] yeah long ago :) [19:26:50] ok, yeah, i'm unopinionated as to where it goes, it just needs to be queryable from cloud VPS, and needs to be writeable by Analytics VLAN stuff [19:27:22] but we need to be extra careful when we're talking about a cross-over our most secure data storage with our least secure open-to-everyone network [19:27:38] cross-over of* [19:27:47] the data that will be stored on this cluster is all public [19:27:47] nothing pii there etc. [19:27:54] things like mediawiki history data [19:28:03] which is computed from labsdb [19:28:27] I figured, but I was talking from an (inter)networking perspective [19:28:30] aye [19:28:46] can we use the same model we have for labsdb? [19:28:47] or do you want to do something different/better [19:28:48] ? 
[19:29:10] I think so, but I don't even remember the latest there tbh :) [19:29:18] 10Operations, 10ops-eqiad, 10Cloud-Services, 10DC-Ops: labvirt1018 -> cloudvirt1018: update physical label, network port description, netbox - https://phabricator.wikimedia.org/T207319 (10Andrew) p:05Triage>03Normal [19:30:11] (03PS4) 10Dzahn: icinga: remove check_lonqueries.pl [puppet] - 10https://gerrit.wikimedia.org/r/468066 (https://phabricator.wikimedia.org/T202782) [19:31:03] ok paravoid who should I ping and where [19:31:05] ? [19:31:08] T207194 ? [19:31:09] T207194: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 [19:31:17] file a task to determine this specifically [19:31:30] arzhel and moritz would probably need to get involved [19:31:34] PROBLEM - Check systemd state on cloudvirt1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:31:37] and I can chip-in :) [19:31:41] 10Operations, 10Wikimedia-General-or-Unknown: Wikimedia Foundation to host Wikimedia South Africa sites - https://phabricator.wikimedia.org/T195926 (10greg) Adding #operations as they'd be making the calls/changes to DNS/certs/whatever. 
[19:32:07] dammit, almost caught it in time [19:32:15] cloudvirt1024 alerts can be ignored for today [19:32:42] ok cool [19:36:41] (03PS1) 10Dzahn: add za.wikimedia.org and za.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/468073 (https://phabricator.wikimedia.org/T195926) [19:38:18] 10Operations, 10Analytics, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) p:05Triage>03High [19:38:29] 10Operations, 10Analytics, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) [19:38:32] paravoid: ^ :) [19:40:20] oh ganeti too, you're really pushing the boundaries eh :) [19:40:30] (03CR) 10Ottomata: "no-op https://puppet-compiler.wmflabs.org/compiler1002/13021/" [puppet] - 10https://gerrit.wikimedia.org/r/468071 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:40:32] (03CR) 10Ottomata: [C: 032] Conditionally pass in zookeeper_hosts to cdh::hive [puppet] - 10https://gerrit.wikimedia.org/r/468071 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:41:05] (03PS1) 10Dzahn: mediawiki/apache: add za.wikimeida.org as prod_site [puppet] - 10https://gerrit.wikimedia.org/r/468074 (https://phabricator.wikimedia.org/T195926) [19:41:52] mutante: typo! 
:P [19:41:58] sorry :) [19:42:11] already saw it, ack [19:42:31] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 177.02 seconds [19:42:59] (03PS2) 10Dzahn: mediawiki/apache: add za.wikimedia.org as prod_site [puppet] - 10https://gerrit.wikimedia.org/r/468074 (https://phabricator.wikimedia.org/T195926) [19:44:18] chaomodus: if you want we can fix the Icinga permissions thing for onboarding [19:44:32] PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003073, end_log_pos 582738698 [19:44:43] RECOVERY - Check systemd state on cloudvirt1024 is OK: OK - running: The system is fully operational [19:45:46] (03PS1) 10Zoranzoki21: Enable suppressredirect and markbotedit rights to rollbackers on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) [19:47:10] 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10hashar) [19:47:12] (03PS2) 10Dzahn: add za.wikimedia.org and za.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/468073 (https://phabricator.wikimedia.org/T195926) [19:52:23] (03CR) 10Gehel: "Very minor comments inline (some of which would benefit from @mobrovac opinion). All those are minor enough and don't deserve a -1." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [19:53:48] (03CR) 10Ottomata: "We do! 
fixed: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468068/" [puppet] - 10https://gerrit.wikimedia.org/r/468056 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:55:14] robh, cmjohnson1, paravoid, VM networking works on cloudvirt1024! [19:55:50] \o/ \o/ \o/ [19:57:06] well at least the order of 6 more should go smoother, since it is now replicating the working config of 1024 ;D [19:57:47] * andrewbogott hopes [19:57:52] (03PS1) 10Zoranzoki21: Enable autopatroller, patroller and rollbacker rights on srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936) [19:59:36] (03PS1) 10Zoranzoki21: Enable rollbacker right on srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T2000). [20:01:18] (03CR) 10Dzahn: [C: 032] add za.wikimedia.org and za.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/468073 (https://phabricator.wikimedia.org/T195926) (owner: 10Dzahn) [20:01:24] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Andrew) 05Open>03Resolved Both these hosts are now up and running VMs. [20:04:06] PROBLEM - DPKG on cloudvirt1018 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[20:05:25] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:07:36] (03CR) 10ArielGlenn: "Happy to merge whenever; should I wait for anyone else's review?" [puppet] - 10https://gerrit.wikimedia.org/r/468059 (https://phabricator.wikimedia.org/T147169) (owner: 10Hoo man) [20:12:51] (03CR) 10Gehel: [C: 032] wdqs: cleanup rspec test [puppet] - 10https://gerrit.wikimedia.org/r/467687 (https://phabricator.wikimedia.org/T204240) (owner: 10Mathew.onipe) [20:13:14] (03PS2) 10Gehel: wdqs: cleanup rspec test [puppet] - 10https://gerrit.wikimedia.org/r/467687 (https://phabricator.wikimedia.org/T204240) (owner: 10Mathew.onipe) [20:13:47] (03PS6) 10Cwhite: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) [20:13:49] (03PS8) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) [20:14:16] RECOVERY - DPKG on cloudvirt1018 is OK: All packages OK [20:16:00] (03CR) 10Gehel: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/467692 (https://phabricator.wikimedia.org/T204240) (owner: 10Mathew.onipe) [20:17:40] !log mobrovac@deploy1001 Started restart [proton/deploy@a657059]: (no justification provided) [20:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:17] (03PS2) 10Gehel: enable rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/467692 (https://phabricator.wikimedia.org/T204240) (owner: 10Mathew.onipe) [20:22:53] (03CR) 
10Gehel: "@hashar: could you rechek locally? I have an issue when running the tests, but it looks more related to my setup." [puppet] - 10https://gerrit.wikimedia.org/r/467692 (https://phabricator.wikimedia.org/T204240) (owner: 10Mathew.onipe) [20:25:53] XioNoX: not urgent, but do you know what's involved in https://phabricator.wikimedia.org/T207327? I assume it's roughly the same as what was done for cloudvirt1023/1024 [20:26:20] 10Operations, 10ops-codfw: es2017 and es2019 have an idrac ethernet interface in Linux - https://phabricator.wikimedia.org/T207328 (10BBlack) [20:26:20] (trying to avoid pinging Faid.on since it's getting late in Athens) [20:26:37] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Gehel) >>! In T206114#4669383, @ayounsi wrote: > What should be the runbook/actions when this alert goes off? I don't think we... [20:28:21] andrewbogott: XioNoX is on EU time this week too (and no, I don't have any idea about that ticket!) [20:28:35] dang, ok, will follow up on phab [20:29:02] mobrovac: safe to deploy parsoid? [20:29:34] yup arlolra, afaik it is [20:30:07] ok, just asking because you were restarting something above ... 
but maybe that wasn't restbase [20:32:13] !log arlolra@deploy1001 Started deploy [parsoid/deploy@babf1da]: Updating Parsoid to e6b708b [20:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:31] (03PS2) 10Andrew Bogott: cloudvirt1018: remove obsolete labvirt1018.mgmt entries [dns] - 10https://gerrit.wikimedia.org/r/468065 (https://phabricator.wikimedia.org/T207317) [20:35:00] (03CR) 10Andrew Bogott: [C: 032] cloudvirt1018: remove obsolete labvirt1018.mgmt entries [dns] - 10https://gerrit.wikimedia.org/r/468065 (https://phabricator.wikimedia.org/T207317) (owner: 10Andrew Bogott) [20:36:26] (03PS12) 10Herron: smarthost: create mail smarthost profile and wmcs smarthost role [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [20:37:21] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost profile and wmcs smarthost role [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:37:45] andrewbogott: I can have a look in an hour or so [20:38:17] thanks! Don't lose any sleep over it, tomorrow is also fine [20:40:27] (03CR) 10Herron: smarthost: create mail smarthost profile and wmcs smarthost role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:40:54] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@babf1da]: Updating Parsoid to e6b708b (duration: 08m 41s) [20:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:51] I'm trying to debug why the clearWatchlistJob doesn't seem to run as intended (https://phabricator.wikimedia.org/T207329). This page https://wikitech.wikimedia.org/wiki/Logs#mwlog1001:/srv/mw-log/ says to look at `/srv/mw-log/runJobs.log` but @RoanKattouw says this only contains labswiki entries. Is there somewhere else we should be looking? 
[20:43:15] _joe_: Would you happen to know where the Redis logs are these days? ---^^ [20:43:22] (or know who would know?) [20:45:28] (03CR) 10Hashar: [C: 031] "Ah I see https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467687/2 :] Both should have been a single change to have CI to run the t" [puppet] - 10https://gerrit.wikimedia.org/r/467692 (https://phabricator.wikimedia.org/T204240) (owner: 10Mathew.onipe) [20:45:42] (03PS13) 10Herron: smarthost: create mail smarthost profile and wmcs smarthost role [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [20:46:25] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost profile and wmcs smarthost role [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:46:46] 10Operations, 10SRE-Access-Requests: Requesting access to production servers for kharlan - https://phabricator.wikimedia.org/T207330 (10kostajh) [20:48:02] (03CR) 10Herron: [V: 032 C: 032] "V+2 manually due to bug T207285" [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:48:23] <_joe_> RoanKattouw: on the server themselves and on the syslog host I guess [20:48:33] <_joe_> RoanKattouw: what issue are you seeing? [20:49:38] _joe_: kostajh is trying to debug a job deduplication issue (T207329), and while his request for prod access is pending he asked me to help him get some log files... but then the files the docs point to turned out not to exist [20:49:38] T207329: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 [20:50:09] <_joe_> ahem, we're talking about jobqueue logs? [20:50:28] !log Updated Parsoid to e6b708b (T204622, T187848, T207093) [20:50:28] <_joe_> or the deduplication function of changeprop? 
[20:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:34] T204622: Use native Javascript (ES6) classes instead of prototype-based definition pattern in the Parsoid codebase - https://phabricator.wikimedia.org/T204622 [20:50:41] T207093: AssertionError: undefined - https://phabricator.wikimedia.org/T207093 [20:50:42] T187848: Fix token transformer return types - https://phabricator.wikimedia.org/T187848 [20:50:45] <_joe_> lemme see the task [20:51:47] <_joe_> ok, what does the docs state? [20:53:44] <_joe_> RoanKattouw: the redises used by changeprop for deduplication are rdb1001:6382 and rdb1003:6382 [20:54:17] <_joe_> their logs are on the servers under /var/log/redis/tcp_6382.log [20:55:02] <_joe_> you can also look at the nutcracker logs on the scb* servers, but I doubt it has a lot of useful information [20:55:36] <_joe_> RoanKattouw: for troubleshooting jobqueue issues, also poke Pchelolo [20:56:13] geniice is reporting getting weird errors from varnish when browsing commons: Error: 404, Requested domainname does not exist on this server at Wed, 17 Oct 2018 20:48:04 GMT [20:56:30] Hmm I can't ssh into the rdb servers it looks like [20:56:34] <_joe_> bawolff: oh, that's bad [20:56:41] <_joe_> bawolff: url please [20:56:42] kostajh: Maybe you can work with pchelolo directly? [20:57:08] _joe_: he claims https://commons.wikimedia.org/w/index.php?search=vega&title=Special%3ASearch&go=Go [20:57:12] (Petr Pchelko) [20:57:18] but if that was happening for everyone the world would have exploded [20:57:49] <_joe_> can we get some more info from the error page? 
[20:58:04] cp3046 [20:58:35] Not getting error page now images just not loading [20:58:39] <_joe_> bawolff: I get correctly reddirected [20:58:41] (03PS2) 10Mathew.onipe: cumin: added wdqs-autodeploy alias [puppet] - 10https://gerrit.wikimedia.org/r/467346 [21:01:08] (03PS3) 10Gehel: cumin: added wdqs-autodeploy alias [puppet] - 10https://gerrit.wikimedia.org/r/467346 (owner: 10Mathew.onipe) [21:02:12] (03CR) 10Gehel: [C: 032] cumin: added wdqs-autodeploy alias [puppet] - 10https://gerrit.wikimedia.org/r/467346 (owner: 10Mathew.onipe) [21:02:17] (03CR) 10Smalyshev: [C: 031] Wikidata entity dumps: Halve pagesPerBatch [puppet] - 10https://gerrit.wikimedia.org/r/468059 (https://phabricator.wikimedia.org/T147169) (owner: 10Hoo man) [21:04:49] RoanKattouw: sounds good [21:28:01] ACKNOWLEDGEMENT - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003073, end_log_pos 582738698 Banyek ack T206743 [21:32:49] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) Smarthosts `mx-out01.wmflabs.org` and `mx-out02.wmflabs.org` (wmcs instances within the cloudinfra project) are now configured using `role::ma... [21:50:55] <_joe_> geniice: still having the same issues? [21:51:12] checking [21:52:26] _joe_ no error but also no images (search was for "theory") [21:52:53] <_joe_> geniice: are you on windows? [21:53:00] yes [21:53:14] <_joe_> ok, do you know how to use nslookup ? [21:53:17] now this is interesting its trying to view the images at https://upload.wikimedia.org/wikipedia/commons/c/c7/Complex_systems_thoery.jpg [21:53:32] nslookup? 
no [21:53:53] <_joe_> I would like to find out what your dns responds for "upload.wikimedia.org" [21:55:19] <_joe_> if you open a command prompt (I think... alt+f2 and exec cmd.exe) and type "nslookup upload.wikimedia.org" and paste me the result, that might shed some light on the issue [21:55:26] ok I've got command prompt open [21:55:41] <_joe_> nslookup upload.wikimedia.org [21:56:50] Server: UnKnown [21:56:51] Address: fdad:67f6:7b4:0:c23e:fff:fe97:98a4 [21:56:53] *** UnKnown can't find upload.wikimeida.org: No response from serve [21:57:03] <_joe_> you have a typo :) [21:57:12] <_joe_> wikimeida [21:58:19] <_joe_> geniice: alternatively, can you see the image with another browser? [21:58:52] checking [21:59:14] <_joe_> but please also re-do the nslookup writing the correct domain [21:59:20] <_joe_> you mistyped it [22:03:42] nslookup upload.wikimedia.org [22:03:44] Server: UnKnown [22:03:45] Address: fdad:67f6:7b4:0:c23e:fff:fe97:98a4 [22:03:47] *** UnKnown can't find upload.wikimedia.org: No response from server [22:03:59] however it works fine in seamonkey [22:04:11] but I'm not logged in on seamonkey [22:04:16] <_joe_> same url that is broken in your other browser? [22:04:31] <_joe_> that should not count if we're talking about images [22:04:58] <_joe_> so it's probably a browser issue [22:05:43] <_joe_> as far as we can tell, your browser is sending some requests to the wrong server. I'm not sure about the reasons though.
[22:06:02] yeah same URL [22:06:16] going to log into seamonkey to check [22:07:34] yup browser issue [22:08:26] <_joe_> geniice: ok, nothing we can do for you right now - it's not an error in the html of the page [22:09:00] Well thank you for putting up with me [22:09:29] <_joe_> but people are taking a look at the issue to see if there is something to be done [22:09:51] <_joe_> geniice: also, I'm going offline - it's past midnight here :) [22:10:13] well goodnight [22:15:24] (03PS1) 1020after4: group0 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468176 [22:15:26] (03CR) 1020after4: [C: 032] group0 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468176 (owner: 1020after4) [22:16:43] (03Merged) 10jenkins-bot: group0 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468176 (owner: 1020after4) [22:17:33] (03CR) 10jenkins-bot: group0 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468176 (owner: 1020after4) [22:18:17] !log ppchelko@deploy1001 Started deploy [restbase/deploy@88c8f26] (dev-cluster): Spread requests beetween MCS nodes for onthisday [22:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:17] geniice, did you sort your problem? [22:21:08] looks like a cache issue at my end or something odd with the hotel proxy [22:21:12] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@88c8f26] (dev-cluster): Spread requests beetween MCS nodes for onthisday (duration: 02m 54s) [22:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:42] geniice, can you 'nslookup upload.wikimedia.org 8.8.8.8' ? [22:23:04] 62.252.172.241 [22:23:17] WTF [22:24:28] geniice, is your ISP Virgin Media?
[22:24:44] andrewbogott: ok, I'll do it tomorrow then, thanks [22:24:47] the hotel's appears to be, yes [22:26:25] 241.172.252.62.in-addr.arpa domain name pointer know-sspiprxy-vip.network.virginmedia.net. [22:26:29] Well that's interesting [22:27:47] google shows some people complaining about that proxy back in 2014 [22:28:00] yeah [22:28:46] Seems to be related to WebSafe ( https://my.virginmedia.com/customer-news/articles/online-safety.html ) [22:29:01] which is supposed to block porny things and malware [22:29:12] 10Operations, 10Traffic, 10Performance-Team (Radar): Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) - https://phabricator.wikimedia.org/T207340 (10Krinkle) [22:29:12] So i suppose fair enough, commons is porny :P [22:29:56] except it doesn't block it when i switched to seamonkey [22:30:41] but yeah it blocks pornhub under childsafe stuff [22:30:57] It may be an interaction between their proxy and HTTP/2 coalescing [22:31:17] although I'd (naively) assume that seamonkey and firefox would use the same network engine [22:32:06] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [22:33:04] 10Operations, 10Traffic, 10Performance-Team (Radar): Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) - https://phabricator.wikimedia.org/T207340 (10Bawolff) Further comments on irc seem to suggest that its related to Virgin Media's "WebSafe" proxy feature that's supposed to... [22:33:27] godog: Oops! The ORES reboot was too silent and I wandered away from the tiller. Everything looks healthy from our side, are you able to confirm that the DNS was refreshed?
[22:33:46] bawolff, in any case with a network interfering with stuff like this [22:34:02] I'm inclined to say that we should serve an error telling people to complain to their ISP [22:34:11] 10Operations, 10ops-ulsfo, 10decommission: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10RobH) wipe complete on all 4 systems. [22:34:18] 10Operations, 10ops-ulsfo, 10decommission: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10RobH) 05Open>03Resolved [22:34:20] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327 (10RobH) [22:34:40] robh, did you also get bast4001? it just alerted [22:34:46] i didnt touch it =/ [22:34:57] unless by accidnet checking [22:35:02] its power cable may have been fubar [22:35:19] Krenair: yeah... that was my fault [22:35:24] its power cables werent seated well [22:35:26] Krenair: Be exciting once we get encrypted SNI (I assume this type of attack would be impossible to do with encrypted SNI, if i understand how the transparent proxy of https content would work) [22:35:26] its powering back up [22:35:29] :D [22:35:48] bawolff my bet would be that firefox was the browser I used to go through the pasword on the hotel router. seamonkey was not [22:36:14] !log bast4001 reboot is my fault, power cables were justled when i was decommssioning lvs4002 right above it in the rack [22:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:26] bawolff, yeah, except in networks where the clients are also questionable e.g. corporate networks [22:36:27] jostled even whatever [22:37:10] geniice: given what we know, I guess using DNS over TLS (with e.g. 
cloudflare's resolver) would be one work around [22:37:19] ok, bast4001 is back [22:37:26] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 73.93 ms [22:37:32] sorry for anyone fubared by my mistake =[ [22:37:42] on that note [22:37:47] where are we on bast4002, heh [22:38:02] bawolff eh its non critial. not doing any uploading of images until I get home [22:38:07] ohhh, i can totally decom according to filippo, awesome [22:38:30] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.32.0-wmf.26 refs T191072 [22:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:33] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 [22:38:36] * robh plugs in wipe stuff to do remotely [22:51:23] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Andrew) shinken-01.shinken.eqiad.wmflabs might be a good test. [22:55:37] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) Yeah. Me and I think @valhallasw get fairly frequent emails from that host ^ [22:56:56] !log Restarting ORES uwsgi service for T88997 [22:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:00] T88997: Improve graphite failover - https://phabricator.wikimedia.org/T88997 [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181017T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:02:04] (03PS1) 10Andrew Bogott: base: include ::base::auto_restarts on Trusty instances [puppet] - 10https://gerrit.wikimedia.org/r/468179 [23:03:03] (03CR) 10Andrew Bogott: "This is necessary to get puppet running again on some cloud instances." [puppet] - 10https://gerrit.wikimedia.org/r/468179 (owner: 10Andrew Bogott) [23:25:41] (03PS1) 10Awight: Use the newer statsd name for ORES nodes [puppet] - 10https://gerrit.wikimedia.org/r/468182 (https://phabricator.wikimedia.org/T88997) [23:36:58] (03CR) 10Krinkle: Use the newer statsd name for ORES nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468182 (https://phabricator.wikimedia.org/T88997) (owner: 10Awight) [23:39:36] RECOVERY - Check systemd state on certcentral1001 is OK: OK - running: The system is fully operational [23:42:57] PROBLEM - Check systemd state on certcentral1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:48:41] vgutierrez, volans: what's going on there? ^ [23:49:19] why did it recover? [23:51:07] 10Operations, 10Analytics, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10bd808) Most of the cloud infrastructure hosts either are in the public vlan or are moving there as we update and replace hardware. The labsdb10...