[00:19:56] (CR) Gergő Tisza: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/546221 (https://phabricator.wikimedia.org/T236455) (owner: Gergő Tisza)
[00:20:02] (Abandoned) Gergő Tisza: [WIP] Make lxc work on buster in Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/546221 (https://phabricator.wikimedia.org/T236455) (owner: Gergő Tisza)
[03:11:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:15:05] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:40:35] Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (Bawolff) >>! In T211881#5607568, @dr0ptp4kt wrote: > @Yair_rand agreed the relatively larger JS component...
[06:32:40] twentyafterfour: Pls deploy the latest!
[06:53:47] (CR) Giuseppe Lavagetto: [C: +1] "I think this should be merged immediately, as we're de-facto blocking any new deployment, and the threat model here is something we accept" [deployment-charts] - https://gerrit.wikimedia.org/r/544158 (https://phabricator.wikimedia.org/T235821) (owner: Alexandros Kosiaris)
[07:12:18] (PS21) Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297)
[07:12:20] (PS28) Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297)
[07:12:22] (PS26) Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297)
[07:12:24] (PS24) Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297)
[07:12:26] (PS25) Mathew.onipe: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297)
[07:12:28] (PS25) Mathew.onipe: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297)
[07:21:00] Operations, ops-ulsfo, Traffic, Wikidata, and 2 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (jijiki)
[07:22:41] !log mobrovac@deploy1001 Started deploy [restbase/deploy@c500d7a] (dev-cluster): Add the Parsoid proxy for JS/PHP variants, add top mediarequests end point and add mnwwiki and ge.wm.org
[07:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:58] (PS29) Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297)
[07:23:00] (PS27) Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297)
[07:23:02] (PS25) Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297)
[07:23:04] (PS26) Mathew.onipe: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297)
[07:23:06] (PS26) Mathew.onipe: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297)
[07:25:18] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@c500d7a] (dev-cluster): Add the Parsoid proxy for JS/PHP variants, add top mediarequests end point and add mnwwiki and ge.wm.org (duration: 02m 37s)
[07:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:27] (CR) Mathew.onipe: query_service: prepare query_service for reusbility (2 comments) [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[07:27:02] !log mobrovac@deploy1001 Started deploy [restbase/deploy@c500d7a]: Add the Parsoid proxy for JS/PHP variants, add top mediarequests end point and add mnwwiki and ge.wm.org - T230791 T235744 T236389
[07:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:09] T235744: Add mnwwiki to restbase - https://phabricator.wikimedia.org/T235744
[07:27:09] T236389: Create a wiki for Wikimedia Community User Group Georgia - https://phabricator.wikimedia.org/T236389
[07:27:10] T230791: Have a Mechanism for Storing and Retrieving Parsoid HTML from JS and PHP - https://phabricator.wikimedia.org/T230791
[07:28:56] (CR) Elukey: [C: +2] Skip installing the default archiva.xml file [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/544156 (owner: Elukey)
[07:37:49] !log upload archiva 2.2.4-1 to wikimedia-stretch (fix to avoid overriding archiva.xml upon install)
[07:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:02] !log elukey@deploy1001 Started deploy [eventlogging/analytics@0f1ad6d]: Move codebase to Python3 - second attempt
[07:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:07] !log elukey@deploy1001 Finished deploy [eventlogging/analytics@0f1ad6d]: Move codebase to Python3 - second attempt (duration: 00m 05s)
[07:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:46] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@c500d7a]: Add the Parsoid proxy for JS/PHP variants, add top mediarequests end point and add mnwwiki and ge.wm.org - T230791 T235744 T236389 (duration: 13m 44s)
[07:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:53] T235744: Add mnwwiki to restbase - https://phabricator.wikimedia.org/T235744
[07:40:54] T236389: Create a wiki for Wikimedia Community User Group Georgia - https://phabricator.wikimedia.org/T236389
[07:40:54] T230791: Have a Mechanism for Storing and Retrieving Parsoid HTML from JS and PHP - https://phabricator.wikimedia.org/T230791
[07:42:52] eventlogging seems to be running fine on py3 this time, fingers crossed
[07:46:09] Operations, ops-esams, Traffic: Degraded RAID on cp3048 - https://phabricator.wikimedia.org/T198784 (Volans) Open→Declined Closing as the host has been decommissioned as part of T236454
[07:49:44] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/546139 (https://phabricator.wikimedia.org/T235655) (owner: Jbond)
[07:50:34] Operations, ops-eqiad, Patch-For-Review: Degraded RAID on analytics1049 - https://phabricator.wikimedia.org/T234785 (elukey) Forgot to answer sorry! It is fine to leave this host as it is, we'll refresh it during the next months so no need to swap the disk!
[07:56:27] (PS1) Ema: vcl: block requests with Host header set to an IP [puppet] - https://gerrit.wikimedia.org/r/546399 (https://phabricator.wikimedia.org/T236130)
[07:59:39] (CR) Vgutierrez: [C: +1] vcl: block requests with Host header set to an IP [puppet] - https://gerrit.wikimedia.org/r/546399 (https://phabricator.wikimedia.org/T236130) (owner: Ema)
[08:02:56] !log mobrovac@deploy1001 Started deploy [restbase/deploy@447981b]: Parsoid: Shim content-language and vary headers only for the PHP variant - T230791
[08:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:02] T230791: Have a Mechanism for Storing and Retrieving Parsoid HTML from JS and PHP - https://phabricator.wikimedia.org/T230791
[08:03:51] (PS7) Volans: metamonitoring: add sync of Icinga contacts [puppet] - https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074)
[08:05:16] (PS2) Ema: vcl: block requests with Host header set to an IP [puppet] - https://gerrit.wikimedia.org/r/546399 (https://phabricator.wikimedia.org/T236130)
[08:06:28] (CR) Volans: [C: +2] metamonitoring: add sync of Icinga contacts [puppet] - https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: Volans)
[08:08:38] (CR) Ema: [C: +2] vcl: block requests with Host header set to an IP [puppet] - https://gerrit.wikimedia.org/r/546399 (https://phabricator.wikimedia.org/T236130) (owner: Ema)
[08:15:15] !log swift eqiad-prod: final weight to ms-be105[1-6] - T232367
[08:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:20] T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367
[08:15:20] (PS3) Jforrester: Switch to wmf specific run mode for $wgDisableQueryPageUpdate [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[08:15:59] (PS1) Volans: metamonitoring: skip logdir for timer [puppet] - https://gerrit.wikimedia.org/r/546407 (https://phabricator.wikimedia.org/T222074)
[08:16:07] (CR) jerkins-bot: [V: -1] Switch to wmf specific run mode for $wgDisableQueryPageUpdate [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[08:16:38] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@447981b]: Parsoid: Shim content-language and vary headers only for the PHP variant - T230791 (duration: 13m 42s)
[08:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:43] T230791: Have a Mechanism for Storing and Retrieving Parsoid HTML from JS and PHP - https://phabricator.wikimedia.org/T230791
[08:20:37] (CR) Volans: [C: +2] "Compiler results looks good:" [puppet] - https://gerrit.wikimedia.org/r/546407 (https://phabricator.wikimedia.org/T222074) (owner: Volans)
[08:23:08] Operations, Traffic: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (ema) Open→Resolved a:ema We're now returning 403 to those requests. Availability in text@ulsfo looks much better.
[08:25:18] !log installing file/libmagic security updates
[08:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:14] !log manually cleanup changes reverted in https://gerrit.wikimedia.org/r/546407 on icinga[12]001 - T222074
[08:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:20] T222074: Icinga meta-monitoring: automatically sync contact list - https://phabricator.wikimedia.org/T222074
[08:37:09] (CR) Ema: [C: +2] Varnish: don't decode/encode slashes for core REST API paths [puppet] - https://gerrit.wikimedia.org/r/545369 (https://phabricator.wikimedia.org/T235779) (owner: Ppchelko)
[08:38:36] Operations, Traffic, CPT Initiatives (Core REST API in PHP), Core Platform Team Workboards (Green), Patch-For-Review: Implement basic routing for rest.php - https://phabricator.wikimedia.org/T235779 (ema) >>! In T235779#5604796, @WDoranWMF wrote: > @BBlack @ema would you have anytime to revie...
[08:40:56] Puppet, Cloud-VPS: geoipupdate missing on buster on Cloud VPS - https://phabricator.wikimedia.org/T236487 (MoritzMuehlenhoff) All packages in contrib are free software, but in contrast to the main Debian repository they rely on an additional non-free component/service to be useful (either a package from...
[08:41:37] Operations, DBA, serviceops, Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (jcrespo)
[08:48:13] !log bump udp_localhost kafka-logging topics to 6 partitions and roll-restart logstash and rsyslog - T215904
[08:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:18] T215904: Better understanding of Logstash performance - https://phabricator.wikimedia.org/T215904
[08:49:18] (PS2) Jbond: puppet-facts-export: Support multiple puppetdb uri's [puppet] - https://gerrit.wikimedia.org/r/546139 (https://phabricator.wikimedia.org/T235655)
[08:52:12] (CR) 20after4: [C: +1] admins: create new deploy group for design, add 3 users [puppet] - https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) (owner: Dzahn)
[08:53:06] (PS3) Giuseppe Lavagetto: conftool::scripts: add a helper script to initialize a node [puppet] - https://gerrit.wikimedia.org/r/545838
[08:53:08] (PS2) Giuseppe Lavagetto: profile::cache::base: include initialize script [puppet] - https://gerrit.wikimedia.org/r/546015
[08:56:02] (CR) Giuseppe Lavagetto: [C: +2] conftool::scripts: add a helper script to initialize a node [puppet] - https://gerrit.wikimedia.org/r/545838 (owner: Giuseppe Lavagetto)
[08:57:59] (CR) Giuseppe Lavagetto: [C: +2] profile::cache::base: include initialize script [puppet] - https://gerrit.wikimedia.org/r/546015 (owner: Giuseppe Lavagetto)
[09:00:07] (CR) Giuseppe Lavagetto: Varnish: don't decode/encode slashes for core REST API paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/545369 (https://phabricator.wikimedia.org/T235779) (owner: Ppchelko)
[09:00:41] (PS1) Ema: envoyproxy: set timeout to 65s [puppet] - https://gerrit.wikimedia.org/r/546420 (https://phabricator.wikimedia.org/T236500)
[09:02:42] (CR) jerkins-bot: [V: -1] envoyproxy: set timeout to 65s [puppet] - https://gerrit.wikimedia.org/r/546420 (https://phabricator.wikimedia.org/T236500) (owner: Ema)
[09:03:14] Operations, ops-ulsfo, Traffic, Wikidata, and 3 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (ema) It is envoy here that times out after 15 seconds (CC @Joe).
[09:05:30] (CR) Jbond: [C: +2] puppet-facts-export: Support multiple puppetdb uri's [puppet] - https://gerrit.wikimedia.org/r/546139 (https://phabricator.wikimedia.org/T235655) (owner: Jbond)
[09:08:09] (CR) Giuseppe Lavagetto: "LGTM, but update the tests accordingly" [puppet] - https://gerrit.wikimedia.org/r/546420 (https://phabricator.wikimedia.org/T236500) (owner: Ema)
[09:08:42] (PS2) Ema: envoyproxy: set timeout to 65s [puppet] - https://gerrit.wikimedia.org/r/546420 (https://phabricator.wikimedia.org/T236500)
[09:09:20] (PS2) Jbond: check_puppetrun: alert critical after 24 hours [puppet] - https://gerrit.wikimedia.org/r/546195
[09:10:53] Operations, observability: Icinga last puppet run check: re-enable relaxed per-host check - https://phabricator.wikimedia.org/T236345 (jbond)
[09:10:55] Operations, Puppet, observability, Patch-For-Review: update failed puppet checkes so that they go critical 24 hours - https://phabricator.wikimedia.org/T236478 (jbond)
[09:11:02] (CR) jerkins-bot: [V: -1] envoyproxy: set timeout to 65s [puppet] - https://gerrit.wikimedia.org/r/546420 (https://phabricator.wikimedia.org/T236500) (owner: Ema)
[09:11:39] Operations, Puppet, observability, Patch-For-Review: update failed puppet checkes so that they go critical 24 hours - https://phabricator.wikimedia.org/T236478 (jbond) @Volans merged, sorry i thought i had the action for the ticket :)
[09:15:33] (PS3) Ema: envoyproxy: set timeout to 65s [puppet] - https://gerrit.wikimedia.org/r/546420 (https://phabricator.wikimedia.org/T236500)
[09:17:23] (PS1) Giuseppe Lavagetto: cache: add default
weights for various objects [puppet] - https://gerrit.wikimedia.org/r/546423
[09:18:50] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[09:18:50] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[09:20:11] * volans unsure if it could be related to my previous change, was just adding a timer, checking
[09:21:20] (CR) Arturo Borrero Gonzalez: [C: +2] cloud vps: Add support for Buster to LXC module [puppet] - https://gerrit.wikimedia.org/r/546373 (https://phabricator.wikimedia.org/T236455) (owner: BryanDavis)
[09:22:04] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[09:23:37] (CR) Arturo Borrero Gonzalez: "Did you try PCC? I wonder if this would produce duplicate resources or anything like that." [puppet] - https://gerrit.wikimedia.org/r/545081 (https://phabricator.wikimedia.org/T235252) (owner: Alex Monk)
[09:23:38] no, seems unrelated
[09:23:40] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[09:24:04] godog: FYI ^^^
[09:24:11] there was a spike of errors
[09:24:17] * volans brb
[09:25:12] (PS1) Papaul: DNS: Remove mgmt DNS fro multatuli [dns] - https://gerrit.wikimedia.org/r/546426
[09:26:21] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS fro multatuli [dns] - https://gerrit.wikimedia.org/r/546426 (owner: Papaul)
[09:26:45] (CR) Arturo Borrero Gonzalez: [C: +2] ceph: add etcd and k8s master profile for rook [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden)
[09:27:12] Operations, DC-Ops, Traffic, decommission, Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (Papaul)
[09:27:29] Operations, DC-Ops, Traffic, decommission, Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (Papaul) Open→Resolved complete
[09:27:32] Operations, ops-esams, DC-Ops, Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (Papaul)
[09:28:55] (CR) Arturo Borrero Gonzalez: [C: +1] maintain-kubeusers: add ability to merge and update configs [puppet] - https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) (owner: Bstorm)
[09:29:47] (CR) Elukey: "Looks good from my end, but adding a new network constant might not be supported/wanted by SRE.
Let's ask Alex/Moritz to review/approve be" [puppet] - https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: Ottomata)
[09:30:07] (PS1) Papaul: DNS: Remove mgmt DNS for bast3002 [dns] - https://gerrit.wikimedia.org/r/546428
[09:30:54] volans: thanks, yeah known when roll-restarting rsyslog also the central log hosts are affected
[09:31:02] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for bast3002 [dns] - https://gerrit.wikimedia.org/r/546428 (owner: Papaul)
[09:32:14] Operations, DC-Ops, decommission, Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (Papaul)
[09:32:26] Operations, ops-esams, Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (Papaul)
[09:32:28] Operations, DC-Ops, decommission, Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (Papaul) Open→Resolved Complete
[09:32:30] Operations, ops-esams, DC-Ops, Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (Papaul)
[09:35:04] (PS1) Papaul: DNS: Remove mgmt DNS for nescio and maerlant [dns] - https://gerrit.wikimedia.org/r/546429
[09:35:56] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for nescio and maerlant [dns] - https://gerrit.wikimedia.org/r/546429 (owner: Papaul)
[09:35:58] (CR) Ema: [C: +2] envoyproxy: set timeout to 65s [puppet] - https://gerrit.wikimedia.org/r/546420 (https://phabricator.wikimedia.org/T236500) (owner: Ema)
[09:37:02] Operations, DC-Ops, Traffic, decommission, Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (Papaul)
[09:37:04] (CR) Filippo Giunchedi: [C: +2] aptrepo: add elastic 7 [puppet] - https://gerrit.wikimedia.org/r/545786 (https://phabricator.wikimedia.org/T234854) (owner: Filippo Giunchedi)
[09:37:31] Operations, ops-esams, DC-Ops, Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (Papaul)
[09:37:40] Operations, DC-Ops, Traffic, decommission, Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (Papaul) Open→Resolved complete
[09:40:03] (PS5) Elukey: aqs: replace logstash host/port with rsyslog localhost/port [puppet] - https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928)
[09:40:41] (PS1) Papaul: DNS: Remove mgmt DNS for bast3001 [dns] - https://gerrit.wikimedia.org/r/546430
[09:42:00] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for bast3001 [dns] - https://gerrit.wikimedia.org/r/546430 (owner: Papaul)
[09:42:21] (CR) Filippo Giunchedi: [C: +1] wmftest.org: add wpt-graphite [dns] - https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: CDanis)
[09:42:36] (CR) Filippo Giunchedi: [C: +2] wmftest.org: add wpt-graphite [dns] - https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: CDanis)
[09:42:43] (PS5) Filippo Giunchedi: wmftest.org: add wpt-graphite [dns] - https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: CDanis)
[09:42:55] Operations, ops-esams, decommission, Patch-For-Review: Decommission bast3001 - https://phabricator.wikimedia.org/T159480 (Papaul)
[09:43:08] Operations, ops-esams, Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (Papaul)
[09:43:11] Operations, ops-esams, decommission, Patch-For-Review: Decommission bast3001 - https://phabricator.wikimedia.org/T159480 (Papaul) Open→Resolved Complete
[09:43:14] Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603 (Papaul)
[09:43:15] Operations, hardware-requests, Patch-For-Review: Replace bast3001 -
https://phabricator.wikimedia.org/T156506 (Papaul)
[09:43:17] Operations, ops-esams, hardware-requests, Patch-For-Review: redeploy hooft as bast3002 - https://phabricator.wikimedia.org/T131560 (Papaul)
[09:44:36] Operations, Analytics, Analytics-Kanban, Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (elukey) @mobrovac I just noticed in the description the `newest version of service-runner` part. We currently run 2.6.7, is it enough?
[09:45:58] Operations, ops-esams, Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (Papaul) @Vgutierrez @BBlack if there is nothing else to do on these servers as far as racking and setting up, can we resolve this task?
[09:46:27] Operations, ops-esams, DNS, Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (Papaul) @Vgutierrez @BBlack if there is nothing else to do on these servers as far as racking and setting up, can we resolve this task?
[09:52:16] (CR) Filippo Giunchedi: [C: -1] "LGTM! see inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite)
[09:53:20] (PS1) Ema: envoyproxy: set timeout for non-SNI too [puppet] - https://gerrit.wikimedia.org/r/546432 (https://phabricator.wikimedia.org/T236500)
[09:58:05] (PS2) Giuseppe Lavagetto: cache: add default weights for various objects [puppet] - https://gerrit.wikimedia.org/r/546423
[09:58:48] Operations, DC-Ops, Traffic, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (Papaul)
[10:03:45] (PS2) Ema: envoyproxy: configurable route_timeout [puppet] - https://gerrit.wikimedia.org/r/546432 (https://phabricator.wikimedia.org/T236500)
[10:05:33] (PS3) Ema: envoyproxy: configurable route_timeout [puppet] - https://gerrit.wikimedia.org/r/546432 (https://phabricator.wikimedia.org/T236500)
[10:06:24] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1003/19091/cp1077.eqiad.wmnet/change.cp1077.eqiad.wmnet.pson does what I'd expect."
[puppet] - https://gerrit.wikimedia.org/r/546423 (owner: Giuseppe Lavagetto)
[10:07:12] (CR) Ema: "pcc seems fine: https://puppet-compiler.wmflabs.org/compiler1003/19093/" [puppet] - https://gerrit.wikimedia.org/r/546432 (https://phabricator.wikimedia.org/T236500) (owner: Ema)
[10:07:41] (CR) jerkins-bot: [V: -1] envoyproxy: configurable route_timeout [puppet] - https://gerrit.wikimedia.org/r/546432 (https://phabricator.wikimedia.org/T236500) (owner: Ema)
[10:07:48] (PS1) Papaul: DNS: Remove mgmt DNS for cp3030-cp3049 and the old cp3001-cp3022 [dns] - https://gerrit.wikimedia.org/r/546436
[10:08:32] (CR) Ema: [C: +1] cache: add default weights for various objects [puppet] - https://gerrit.wikimedia.org/r/546423 (owner: Giuseppe Lavagetto)
[10:08:42] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for cp3030-cp3049 and the old cp3001-cp3022 [dns] - https://gerrit.wikimedia.org/r/546436 (owner: Papaul)
[10:11:05] (PS4) Ema: envoyproxy: configurable route_timeout [puppet] - https://gerrit.wikimedia.org/r/546432 (https://phabricator.wikimedia.org/T236500)
[10:11:34] (PS1) Volans: check_icinga: ensure at least 5 contacts are set [software/external-monitoring] - https://gerrit.wikimedia.org/r/546437 (https://phabricator.wikimedia.org/T222074)
[10:12:44] Operations, DC-Ops, Traffic, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (Papaul)
[10:12:59] Operations, DC-Ops, Traffic, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (Papaul) Open→Resolved complete
[10:13:04] Operations, ops-esams, DC-Ops, Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (Papaul)
[10:17:13] (CR) Giuseppe Lavagetto: [C: +2] cache: add default weights for various objects [puppet] - https://gerrit.wikimedia.org/r/546423 (owner: Giuseppe Lavagetto)
[10:18:28] (PS1) Arturo Borrero Gonzalez: toolforge: k8s: delete kubeadm keyword from things not related to kubeadm [puppet] - https://gerrit.wikimedia.org/r/546439 (https://phabricator.wikimedia.org/T236074)
[10:18:45] (PS1) Muehlenhoff: Fix directive used in keyholder proxy [puppet] - https://gerrit.wikimedia.org/r/546440
[10:21:35] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: k8s: delete kubeadm keyword from things not related to kubeadm [puppet] - https://gerrit.wikimedia.org/r/546439 (https://phabricator.wikimedia.org/T236074) (owner: Arturo Borrero Gonzalez)
[10:30:04] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191028T1030).
[10:35:04] (CR) Giuseppe Lavagetto: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/546432 (https://phabricator.wikimedia.org/T236500) (owner: Ema)
[10:36:02] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/546442 (https://phabricator.wikimedia.org/T128546)
[10:36:19] Critical Alert for device cr3-esams.wikimedia.org - Juniper alarm active
[10:37:07] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/546442 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:37:53] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/546442 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:39:01] Operations, ops-eqiad, Analytics, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) We can definitely reimage if it is the best path suggested by SRE, but if possible I'd do it manually (so commenting temporarily the partman recipe) to av...
[10:39:33] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:546442| Bumping portals to master (T128546)]] (duration: 00m 54s)
[10:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:39] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:40:27] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:546442| Bumping portals to master (T128546)]] (duration: 00m 53s)
[10:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:50] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/546440 (owner: Muehlenhoff)
[10:40:54] (CR) Ema: [C: +2] envoyproxy: configurable route_timeout [puppet] - https://gerrit.wikimedia.org/r/546432 (https://phabricator.wikimedia.org/T236500) (owner: Ema)
[10:41:25] Operations, Analytics, Analytics-Kanban, Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (mobrovac) >>! In T219928#5610521, @elukey wrote: > @mobrovac I just noticed in the description the `newest version of service-runner` pa...
[10:41:27] (Abandoned) Elukey: superset: set /tmp as upload directory [puppet] - https://gerrit.wikimedia.org/r/479408 (owner: Elukey)
[10:41:39] (Abandoned) Elukey: hive::metastore::sql: simply mysql commands [puppet/cdh] - https://gerrit.wikimedia.org/r/485189 (owner: Elukey)
[10:42:10] (Abandoned) Elukey: role::mediawiki::maintenance: raise the mcrouter's conn to 5 [puppet] - https://gerrit.wikimedia.org/r/499714 (owner: Elukey)
[10:46:19] Operations, Release Pipeline, Release-Engineering-Team-TODO, Core Platform Team Legacy (Watching / External), and 3 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (Ladsgroup)
[10:51:40] Operations, ops-ulsfo, Traffic, Wikidata, and 3 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (ema) @Bugreporter timeout raised to 65 seconds, this should fix the 504 errors.
[10:55:01] Operations, Puppet, Cloud-Services, Release-Engineering-Team, puppet-compiler: labstore1006.wikimedia.org failing to compile on compiler1003 - https://phabricator.wikimedia.org/T236672 (jbond)
[10:55:05] (CR) Giuseppe Lavagetto: [C: -1] "LGTM overall, see the comment." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191028T1100).
[11:00:05] No GERRIT patches in the queue for this window AFAICS.
[11:00:24] * Urbanecm claims the window [11:00:26] hi [11:00:32] i actually have a patch, just adding it [11:00:40] (03CR) 10Urbanecm: [C: 03+2] Revert "Restrict uploads on azwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546127 (https://phabricator.wikimedia.org/T236307) (owner: 10MarcoAurelio) [11:00:43] if that's okay with you [11:00:48] (03CR) 10Urbanecm: [C: 03+2] Adjust wgUploadNavigationUrl for azwiki to point to commons' UpWiz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546129 (owner: 10MarcoAurelio) [11:00:56] MatmaRex: absolutely [11:01:32] (03Merged) 10jenkins-bot: Revert "Restrict uploads on azwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546127 (https://phabricator.wikimedia.org/T236307) (owner: 10MarcoAurelio) [11:01:40] (03Merged) 10jenkins-bot: Adjust wgUploadNavigationUrl for azwiki to point to commons' UpWiz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546129 (owner: 10MarcoAurelio) [11:02:03] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/546444 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/546445 . 
i can wait so please do yours first :) [11:02:09] !log installing OpenJDK security updates on elastic* [11:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:42] MatmaRex: +2'ed, waiting on CI and doing the config patches while CI is working [11:05:29] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: SWAT: 7e26ef4: Revert "Restrict uploads on azwiki" (T236307) (duration: 00m 53s) [11:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:34] T236307: Link "Fayl yüklə" (Upload file) in AzWiki's Tools' Bar to Upload Wizard in Commons - https://phabricator.wikimedia.org/T236307 [11:06:43] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10elukey) Fine for me, will have to coordinate with my team on upgrading service-runner first! [11:07:07] thanks mobrovac ! [11:07:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ff17666: Adjust wgUploadNavigationUrl for azwiki to point to commons UpWiz (T236307) (duration: 00m 53s) [11:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:59] yw elukey :) [11:08:34] (03PS4) 10Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) [11:09:04] (03CR) 10Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:09:18] (03PS2) 10Urbanecm: Add Translate channel for the Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544114 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [11:09:44] (03CR) 10Urbanecm: [C: 03+2] Add Translate channel for the Translate extension [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/544114 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [11:10:35] (03Merged) 10jenkins-bot: Add Translate channel for the Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544114 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [11:10:40] (03CR) 10jerkins-bot: [V: 04-1] hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:12:36] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: dd2f06c: Add Translate channel for the Translate extension (T221119) (duration: 00m 53s) [11:12:39] (03PS1) 10Muehlenhoff: Add Cumin alias for druid canary [puppet] - 10https://gerrit.wikimedia.org/r/546447 [11:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:42] T221119: "This namespace is reserved for content page translations" when trying to translate a recently created translation unit - https://phabricator.wikimedia.org/T221119 [11:16:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Remove whitespace (also why CI is failing)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:16:30] I know :p [11:17:09] MatmaRex: your patch is ready at mwdebug1001 [11:17:14] (both patches, actually) [11:17:50] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::httpd: set a SERVERGROUP env variable [puppet] - 10https://gerrit.wikimedia.org/r/546448 (https://phabricator.wikimedia.org/T235899) [11:18:00] Urbanecm: thanks, looking [11:18:52] (03PS5) 10Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) [11:18:57] (03PS1) 10Ema: cache: reimage cp5007 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546449 
(https://phabricator.wikimedia.org/T227432) [11:18:59] (03PS1) 10Ema: cache_text eqsin: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/546450 (https://phabricator.wikimedia.org/T227432) [11:19:54] MatmaRex: ERROR reported in logstash [11:19:55] (03CR) 10Elukey: [C: 03+1] Add Cumin alias for druid canary [puppet] - 10https://gerrit.wikimedia.org/r/546447 (owner: 10Muehlenhoff) [11:19:56] [XbbOvApAIHsAAFBrsjwAAAAD] /w/api.php?action=visualeditor&format=json&paction=parse&page=Draft%3AAsdfsdfasdfasdffdaaafdfdfdfdaa&badetag=&uselang=en&editintro=Template%3AAfC%20draft%20editintro&preload=Template%3AAfc%20preload%2Fdraft&formatversion=2 ErrorException from line 625 of /srv/mediawiki/php-1.35.0-wmf.3/extensions/VisualEditor/includes/ApiVisualEditor.php: PHP Notice: Undefined index: etag [11:19:58] (03CR) 10jerkins-bot: [V: 04-1] cache_text eqsin: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/546450 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:20:17] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for druid canary [puppet] - 10https://gerrit.wikimedia.org/r/546447 (owner: 10Muehlenhoff) [11:20:54] hmm [11:21:31] (03PS2) 10Ema: cache_text eqsin: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/546450 (https://phabricator.wikimedia.org/T227432) [11:23:35] (that is actually a request i just made myself) [11:23:53] MatmaRex: hmm doesn't tell much :). I think we should revert the bad patch and resolve outside of SWAT. [11:23:58] Do you know which patch is the offending one? 
[11:24:10] If not, I can revert one-by-one, or simply revert both [11:24:41] yes, it's caused by https://gerrit.wikimedia.org/r/c/546445/ [11:24:44] let's do that [11:24:56] 10Operations, 10Patch-For-Review: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (10faidon) RT seems to be currently pointed at moscovium, and is currently broken: the frontpage doesn't load properly (mixed content messages) and the login doesn't work. If this can't... [11:25:19] okay, reverting 546445 [11:25:32] Urbanecm: hm, is that even a new issue for sure? [11:26:02] not sure, I'm just watching the mwdebug dashboard [11:26:10] i guess it is [11:26:15] i see no older results for "PHP Notice: Undefined index: etag" [11:26:36] not sure how that happens, so let's revert [11:26:55] doing [11:28:11] MatmaRex: could you test 546444 (the unreverted patch) separately at mwdebug1001, so we can be sure it works w/o errors? [11:28:18] (03PS1) 10Jbond: dumps::web::fetches::job: update require to use directory [puppet] - 10https://gerrit.wikimedia.org/r/546452 (https://phabricator.wikimedia.org/T236672) [11:28:50] Urbanecm: yes, seems good [11:30:19] MatmaRex: I see https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-mediawiki-2019.10.28/mediawiki?id=AW4SHtYTghP2xm4vJQ3M&_g=h@44136fa [11:30:29] not sure if that's new or not [11:30:51] Urbanecm: not new [11:31:12] ok [11:31:29] (03PS2) 10Jbond: dumps::web::fetches::job: update require to use directory [puppet] - 10https://gerrit.wikimedia.org/r/546452 (https://phabricator.wikimedia.org/T236672) [11:32:39] MatmaRex: syncing [11:32:58] jouncebot next [11:32:58] In 5 hour(s) and 27 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191028T1700) [11:33:28] !log Disable puppet on mw* for 545652 - T229792 [11:33:30] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: SWAT: 
8caf681: Dont log missing ETags when creating a new page, thats normal (T233320) (duration: 00m 54s) [11:33:30] (03PS3) 10Jbond: dumps::web::fetches::job: update require to use directory [puppet] - 10https://gerrit.wikimedia.org/r/546452 (https://phabricator.wikimedia.org/T236672) [11:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:34] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [11:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:38] T233320: VisualEditor <-> RESTBase communication and ETags - https://phabricator.wikimedia.org/T233320 [11:34:03] MatmaRex: done [11:34:25] Urbanecm: thanks [11:34:34] yw [11:34:36] !log EU SWAT done [11:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:39] (03PS7) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) [11:38:04] (03PS8) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) [11:38:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) (owner: 10Giuseppe Lavagetto) [11:38:58] (03Merged) 10jenkins-bot: scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) (owner: 10Giuseppe Lavagetto) [11:41:24] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10JAllemandou) From looking at the dashboards, it looks like the entire set of values we wasnt to collect is what is currently display... 
[11:42:15] (03CR) 10Giuseppe Lavagetto: scaffold: only expose one port as a service by default (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 (owner: 10Giuseppe Lavagetto) [11:46:34] (03PS6) 10Giuseppe Lavagetto: scaffold: only expose one port as a service by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 [11:50:45] (03PS6) 10Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) [11:51:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:53:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (033 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546146 (https://phabricator.wikimedia.org/T236468) (owner: 10Jbond) [11:55:38] RECOVERY - snapshot of s2 in codfw on db1115 is OK: snapshot for s2 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-10-28 02:56:57 from db2098.codfw.wmnet:3312 (780 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [11:56:33] !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: testing deployment of phabricator to phab1001 [11:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:37] !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: testing deployment of phabricator to phab1001 (duration: 00m 05s) [11:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:13] (03CR) 10Effie Mouzeli: [C: 03+2] hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:06:23] (03PS1) 10Jcrespo: Add percona support, and standarize xtrabackup reference [software] - 10https://gerrit.wikimedia.org/r/546455 [12:10:48] 
PROBLEM - PHP7 jobrunner on mw2150 is CRITICAL: connect to address 10.192.32.38 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [12:10:59] ^ that is me [12:11:56] PROBLEM - Nginx local proxy to jobrunner on mw2150 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:12:10] PROBLEM - PHP7 rendering on mw2150 is CRITICAL: connect to address 10.192.32.38 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:12:16] PROBLEM - Nginx local proxy to videoscaler on mw2150 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:13:32] 10Operations, 10Traffic, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green), 10Patch-For-Review: Implement basic routing for rest.php - https://phabricator.wikimedia.org/T235779 (10WDoranWMF) Awesome, thanks @ema ! 
[12:18:09] (03Abandoned) 10Mathew.onipe: tlsproxy: add prometheus option [puppet] - 10https://gerrit.wikimedia.org/r/491972 (https://phabricator.wikimedia.org/T216681) (owner: 10Mathew.onipe) [12:18:23] (03Abandoned) 10Mathew.onipe: refactor lua support option [puppet] - 10https://gerrit.wikimedia.org/r/493154 (https://phabricator.wikimedia.org/T216681) (owner: 10Mathew.onipe) [12:18:35] (03Abandoned) 10Mathew.onipe: nginx: add lua support regardless of version [puppet/nginx] - 10https://gerrit.wikimedia.org/r/492711 (https://phabricator.wikimedia.org/T216681) (owner: 10Mathew.onipe) [12:19:34] (03PS2) 10Jcrespo: Add percona support, and standarize xtrabackup reference [software] - 10https://gerrit.wikimedia.org/r/546455 [12:19:58] (03PS2) 10Jcrespo: Update stretch mariadb wmf package to 10.1.41 [software] - 10https://gerrit.wikimedia.org/r/535810 [12:20:00] (03PS3) 10Jcrespo: Add percona support, and standarize xtrabackup reference [software] - 10https://gerrit.wikimedia.org/r/546455 [12:22:48] !log depool mw2150 [12:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:28] !log depool cp5007 and reimage as text_ats T227432 [12:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:34] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [12:30:47] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: use docker image from internal registry [puppet] - 10https://gerrit.wikimedia.org/r/546459 (https://phabricator.wikimedia.org/T236249) [12:33:41] (03CR) 10Ema: [C: 03+2] cache: reimage cp5007 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546449 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [12:35:14] (03PS1) 10Effie Mouzeli: jobrunner: rename hhvm_jobrunner_port to jobrunner_port [puppet] - 10https://gerrit.wikimedia.org/r/546461 (https://phabricator.wikimedia.org/T229792) [12:35:30] (03PS2) 10Jforrester: MCR: Set testwiki to 
use the new MCR-only schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543007 (https://phabricator.wikimedia.org/T198558) (owner: 10Daniel Kinzler) [12:35:42] (03PS3) 10Jforrester: MCR: Set testwiki to use the new MCR-only schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543007 (https://phabricator.wikimedia.org/T198558) (owner: 10Daniel Kinzler) [12:37:17] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5007.eqsin.wmnet'] ` The log can be found in `/var/log/wm... [12:44:09] (03CR) 10Effie Mouzeli: "LGTM https://puppet-compiler.wmflabs.org/compiler1001/19094/mw1337.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/546461 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:45:58] (03CR) 10Filippo Giunchedi: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [12:48:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/546461 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:48:47] (03CR) 10Effie Mouzeli: [C: 03+2] jobrunner: rename hhvm_jobrunner_port to jobrunner_port [puppet] - 10https://gerrit.wikimedia.org/r/546461 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:48:50] (03CR) 10CDanis: [C: 03+1] check_icinga: ensure at least 5 contacts are set [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/546437 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [12:53:13] RECOVERY - Nginx local proxy to jobrunner on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 338 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:55:07] RECOVERY - Nginx local proxy to videoscaler on mw2150 is OK: HTTP 
OK: HTTP/1.1 200 OK - 339 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:57:45] (03PS1) 10Jbond: wmflib::secret: add a new secret function which supports binary files [puppet] - 10https://gerrit.wikimedia.org/r/546464 (https://phabricator.wikimedia.org/T236481) [12:57:47] (03PS1) 10Jbond: apereo_cas: migrate keystor to wmflib::secret [puppet] - 10https://gerrit.wikimedia.org/r/546465 (https://phabricator.wikimedia.org/T236481) [12:58:57] RECOVERY - PHP7 jobrunner on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [12:59:04] (03CR) 10jerkins-bot: [V: 04-1] wmflib::secret: add a new secret function which supports binary files [puppet] - 10https://gerrit.wikimedia.org/r/546464 (https://phabricator.wikimedia.org/T236481) (owner: 10Jbond) [13:00:51] RECOVERY - PHP7 rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:01:45] !log stop db1114 for testing [13:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:11] (03PS2) 10Jbond: puppet_compiler: Add puppet version to the PCC report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546146 (https://phabricator.wikimedia.org/T236468) [13:05:22] (03CR) 10Jbond: "thanks, all corrected" (033 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546146 (https://phabricator.wikimedia.org/T236468) (owner: 10Jbond) [13:05:47] !log enable puppet on mw2* servers, depool and repool to reload apache - T229792 [13:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:53] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [13:07:15] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [13:07:16] !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime 
(exit_code=99) [13:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:20] (03PS2) 10Jbond: wmflib::secret: add a new secret function which supports binary files [puppet] - 10https://gerrit.wikimedia.org/r/546464 (https://phabricator.wikimedia.org/T236481) [13:11:24] 10Operations, 10SRE-tools: sre.hosts.downtime fails with "No hosts provided" - https://phabricator.wikimedia.org/T236684 (10ema) [13:11:47] 10Operations, 10SRE-tools: sre.hosts.downtime fails with "No hosts provided" - https://phabricator.wikimedia.org/T236684 (10ema) p:05Triage→03Normal [13:13:06] !log enable puppet on mw[1261-1265].eqiad.wmnet (mw canaries), depool and repool to reload apache - T229792 [13:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:11] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [13:16:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546146 (https://phabricator.wikimedia.org/T236468) (owner: 10Jbond) [13:19:48] (03PS5) 10Jcrespo: bacula: Add verbose & single job modes for backup freshness check [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) [13:19:50] (03PS1) 10Jcrespo: WIP:A working percona configuration for mysql [puppet] - 10https://gerrit.wikimedia.org/r/546471 [13:20:18] (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/546471 (owner: 10Jcrespo) [13:20:57] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:20:57] PROBLEM - Ensure traffic_server is running for instance tls on cp5007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.107: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:20:57] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp5007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.107: Connection reset by peer https://wikitech.wikimedia.org/wiki/Confd [13:22:27] ^^ expected, it's being reimaged [13:22:33] PROBLEM - Host cp5007 is DOWN: PING CRITICAL - Packet loss = 100% [13:23:08] !log enable puppet on mw1*, depool and repool to reload apache - T229792 [13:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:13] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [13:23:29] vgutierrez: I'm wondering why we're getting the alerts though, wmf-reimage should downtime the host but didn't ? [13:23:54] godog: I'm assuming that's related to https://phabricator.wikimedia.org/T236684 [13:24:19] RECOVERY - Host cp5007 is UP: PING OK - Packet loss = 0%, RTA = 235.25 ms [13:24:59] (03PS2) 10Filippo Giunchedi: logstash: remove deprecated elasticsearch options [puppet] - 10https://gerrit.wikimedia.org/r/545236 (https://phabricator.wikimedia.org/T235891) [13:25:24] vgutierrez: lol, yeah very likely [13:26:31] 10Operations, 10ops-esams: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (10ema) [13:26:37] 10Operations, 10ops-esams: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (10ema) p:05Triage→03Normal [13:26:55] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:26:55] PROBLEM - check_trafficserver_log_fifo_notpurge_backend on cp5007 is CRITICAL: connect to address 10.132.0.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:27:09] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp5007 is CRITICAL: SSL CRITICAL - failed to connect or SSL 
handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:27:15] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp5007 is CRITICAL: connect to address 10.132.0.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:27:15] PROBLEM - DPKG on cp5007 is CRITICAL: connect to address 10.132.0.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:27:15] PROBLEM - check_trafficserver_backend_config_status on cp5007 is CRITICAL: connect to address 10.132.0.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:27:27] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp5007 is CRITICAL: connect to address 10.132.0.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:27:27] PROBLEM - Varnish HTCP daemon on cp5007 is CRITICAL: connect to address 10.132.0.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:27:29] PROBLEM - Confd vcl based reload on cp5007 is CRITICAL: connect to address 10.132.0.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:27:34] /o\ [13:27:45] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp5007 is CRITICAL: connect to address 10.132.0.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Confd [13:27:45] PROBLEM - Ensure traffic_server is running for instance tls on cp5007 is CRITICAL: connect to address 10.132.0.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:28:02] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: remove deprecated elasticsearch options [puppet] - 10https://gerrit.wikimedia.org/r/545236 (https://phabricator.wikimedia.org/T235891) (owner: 10Filippo Giunchedi) [13:28:38] sorry for the spam, cp5007 is me [13:28:54] ema: I've added a 60m downtime for the host [13:29:03] 
well.. for all the services on the host actually [13:29:07] thanks [13:30:33] (03CR) 10Jhedden: [C: 03+1] toolforge: k8s: use docker image from internal registry [puppet] - 10https://gerrit.wikimedia.org/r/546459 (https://phabricator.wikimedia.org/T236249) (owner: 10Arturo Borrero Gonzalez) [13:30:59] !log roll restart logstash in codfw/eqiad to apply new config [13:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:07] RECOVERY - check_trafficserver_backend_config_status on cp5007 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:35:07] RECOVERY - DPKG on cp5007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:35:07] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp5007 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:35:23] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp5007 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:35:23] RECOVERY - Varnish HTCP daemon on cp5007 is OK: PROCS OK: 1 process with UID = 115 (vhtcpd), args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish [13:35:25] RECOVERY - Confd vcl based reload on cp5007 is OK: reload-vcl has not been executed yet. https://wikitech.wikimedia.org/wiki/Varnish [13:35:36] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5007.eqsin.wmnet'] ` and were **ALL** successful. 
[13:35:43] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp5007 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [13:35:43] RECOVERY - Ensure traffic_server is running for instance tls on cp5007 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:36:11] RECOVERY - check_trafficserver_log_fifo_notpurge_backend on cp5007 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /var/log/trafficserver/notpurge.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:45:40] (03CR) 10Ema: [C: 03+2] cache_text eqsin: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/546450 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:47:10] (03PS1) 10Papaul: DNS: Remove mgmt DNS for lvs300[1-4] [dns] - 10https://gerrit.wikimedia.org/r/546479 [13:47:39] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition=3 site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging- [13:47:39] ll&var-consumer_group=All [13:48:01] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC) - https://phabricator.wikimedia.org/T227542 (10herron) [13:48:41] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:48:43] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for lvs300[1-4] [dns] - 10https://gerrit.wikimedia.org/r/546479 (owner: 10Papaul) [13:49:56] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (10Papaul) [13:50:23] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (10Papaul) 05Open→03Resolved complete [13:50:25] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul) [13:50:33] !log pool cp5007 with ATS backend T227432 [13:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:39] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [13:53:03] 10Operations, 10Discovery-Search, 10vm-requests: setup/install airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10elukey) ` elukey@ganeti1003:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A 4 42 preferred ovs=False, ssh_port=22, ovs_link=, sp... 
[13:53:45] (03PS1) 10Papaul: DNS: Remove mgmt DNS for eeden [dns] - 10https://gerrit.wikimedia.org/r/546488 [13:55:18] (03PS1) 10Ema: conftool::scripts: ensure initialize is quiet [puppet] - 10https://gerrit.wikimedia.org/r/546497 [13:55:21] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for eeden [dns] - 10https://gerrit.wikimedia.org/r/546488 (owner: 10Papaul) [13:57:43] (03CR) 10Herron: [C: 03+1] mailman: add alias and redirect for multimedia-team [puppet] - 10https://gerrit.wikimedia.org/r/545122 (https://phabricator.wikimedia.org/T235550) (owner: 10CRusnov) [13:59:06] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission eeden - https://phabricator.wikimedia.org/T235770 (10Papaul) [14:01:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] conftool::scripts: ensure initialize is quiet [puppet] - 10https://gerrit.wikimedia.org/r/546497 (owner: 10Ema) [14:01:19] RECOVERY - mediawiki-installation DSH group on mw1317 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:02:32] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/1; [edit interfaces interface-range disabled] mem... [14:03:10] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10Papaul) [14:03:59] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall LGTM as an initial version." (034 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (owner: 10RLazarus) [14:04:27] (03PS1) 10Elukey: Add AAAA/A/PTR records for airflow1001 [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) [14:05:16] (03CR) 10Ottomata: [C: 03+1] "Ah cool, I was about to submit the same patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/546452 (https://phabricator.wikimedia.org/T236672) (owner: 10Jbond) [14:06:59] (03CR) 10Jbond: [C: 03+2] dumps::web::fetches::job: update require to use directory [puppet] - 10https://gerrit.wikimedia.org/r/546452 (https://phabricator.wikimedia.org/T236672) (owner: 10Jbond) [14:08:02] (03CR) 10Herron: [C: 03+1] "+1 for trying this, looks to be a quick revert if it does not work as expected" [puppet] - 10https://gerrit.wikimedia.org/r/543252 (https://phabricator.wikimedia.org/T235458) (owner: 10Brian Wolff) [14:08:38] (03CR) 10Elukey: "Since naming is very important, I'll wait to merge :)" [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [14:10:02] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (10Papaul) ` papaul@asw-a-codfw# show | compare [edit interfaces interface-range disabled] member ge-5/0/16 { ... } + member ge-6/0/16;... 
[14:10:34] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (10Papaul) [14:12:37] 10Operations, 10Puppet, 10Cloud-Services, 10Release-Engineering-Team, and 2 others: labstore1006.wikimedia.org failing to compile on compiler1003 - https://phabricator.wikimedia.org/T236672 (10jbond) 05Open→03Resolved Think this is resolved now, please reopen if there are still issues [14:12:41] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Patch-For-Review: add compiler1003 to jenkins - https://phabricator.wikimedia.org/T236468 (10jbond) [14:12:56] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission eeden - https://phabricator.wikimedia.org/T235770 (10Papaul) [14:13:38] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission eeden - https://phabricator.wikimedia.org/T235770 (10Papaul) 05Open→03Resolved complete [14:15:29] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Ottomata) Agree with Luca! As long as Kafka's data is maintained, re-imaging should be the same as a downtime for Kafka. [14:16:05] ema, vgutierrez: you could have pinged me, having a look at the downtime issue now [14:16:29] (03CR) 10Jbond: [C: 03+2] puppet_compiler: Add puppet version to the PCC report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546146 (https://phabricator.wikimedia.org/T236468) (owner: 10Jbond) [14:16:29] volans: thanks! 
[14:16:34] volans: <3 [14:20:13] (03CR) 10Ema: [C: 03+2] conftool::scripts: ensure initialize is quiet [puppet] - 10https://gerrit.wikimedia.org/r/546497 (owner: 10Ema) [14:20:18] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2066 [dns] - 10https://gerrit.wikimedia.org/r/546503 [14:20:20] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for db2066 [dns] - 10https://gerrit.wikimedia.org/r/546503 (owner: 10Papaul) [14:20:32] (03CR) 10Ottomata: "Hm! Do we want to name it something indicative of the fact that this is for search? Maybe not, maybe we can just keep it in their house " [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [14:20:37] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (10Papaul) [14:20:46] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [14:20:51] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (10Papaul) 05Open→03Resolved Complete [14:22:14] 10Operations, 10ops-esams, 10decommission: Decommission cp300[3456] - https://phabricator.wikimedia.org/T167376 (10Papaul) [14:23:06] (03CR) 10Elukey: "I really like an-airflow1001, +1" [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [14:23:51] 10Operations, 10ops-esams, 10decommission: Decommission cp300[3456] - https://phabricator.wikimedia.org/T167376 (10Papaul) 05Open→03Resolved Removed mgmt DNS on https://gerrit.wikimedia.org/r/#/c/operations/dns/+/546436/ [14:23:53] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Papaul) [14:24:22] ema, vgutierrez :found the issue, patch coming in few minutes, sorry about that :( [14:24:29] thanks :D [14:24:56] (03CR) 10Ottomata: 
"Let's go with an-airflow then; there might be more of them in a cluster of workers one day." [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [14:28:47] (03PS1) 10Jbond: bump version 0.5.1 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546504 [14:29:51] actually, I'm not sure anymore [14:30:13] (03CR) 10Jbond: [C: 03+2] bump version 0.5.1 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546504 (owner: 10Jbond) [14:30:31] (03CR) 10Herron: [C: 03+1] puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [14:32:30] (03PS2) 10Elukey: Add AAAA/A/PTR records for airflow1001 [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) [14:40:38] (03CR) 10Volans: [C: 03+2] check_icinga: ensure at least 5 contacts are set [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/546437 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [14:41:09] (03Merged) 10jenkins-bot: check_icinga: ensure at least 5 contacts are set [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/546437 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [14:46:09] PROBLEM - Check the Netbox report management for fail status. on netbox1001 is CRITICAL: management.ManagementConsole CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:50:28] ACKNOWLEDGEMENT - Check the Netbox report management for fail status. 
on netbox1001 is CRITICAL: management.ManagementConsole CRITICAL Cas Rusnov esams work https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:54:38] (03PS1) 10Jbond: puppetdb: only filter job_id on old puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/546508 [14:57:00] (03CR) 10Muehlenhoff: puppetdb: only filter job_id on old puppet masters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546508 (owner: 10Jbond) [14:59:07] (03PS5) 10CRusnov: Add script to generate DNS records from Netbox [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) [14:59:20] (03CR) 10CRusnov: Add script to generate DNS records from Netbox (034 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [15:01:51] (03PS2) 10Jbond: puppetdb: only filter job_id on old puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/546508 [15:02:46] (03PS3) 10Jbond: puppetdb: only filter job_id on old puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/546508 [15:03:00] (03CR) 10Jbond: "thanks for the quick review, updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546508 (owner: 10Jbond) [15:07:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546508 (owner: 10Jbond) [15:08:35] 10Operations, 10Traffic, 10observability: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10CDanis) 05Open→03Resolved a:03CDanis [15:10:19] (03CR) 10Jbond: [C: 03+2] puppetdb: only filter job_id on old puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/546508 (owner: 10Jbond) [15:10:48] (03PS2) 10Elukey: Add deployment-memc08 to the mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/543456 (https://phabricator.wikimedia.org/T213089) [15:11:11] (03CR) 10EBernhardson: [C: 03+1] "Makes sense to 
me. And indeed one day it may require a cluster of workers, although today it's going to run everything on the single machi" [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [15:13:03] (03CR) 10Elukey: [C: 03+2] Add deployment-memc08 to the mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/543456 (https://phabricator.wikimedia.org/T213089) (owner: 10Elukey) [15:13:44] 10Operations, 10Patch-For-Review: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (10Dzahn) @faidon I am aware. I'll get it fixed today or revert to the previous one, ack. [15:13:57] PROBLEM - Long running screen/tmux on snapshot1008 is CRITICAL: CRIT: Long running SCREEN process. (user: ariel PID: 11681, 1734750s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [15:21:36] 10Operations, 10observability: Icinga meta-monitoring: automatically sync contact list - https://phabricator.wikimedia.org/T222074 (10Volans) 05Open→03Resolved This is all done, resolving. Feel free to re-open if any issue is found. 
[15:26:11] 10Operations, 10observability, 10Performance-Team (Radar): Upgrade grafana to 6.x - https://phabricator.wikimedia.org/T220838 (10CDanis) [15:27:01] 10Operations, 10Wikimedia-Apache-configuration, 10serviceops: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (10RLazarus) [15:36:41] 10Operations, 10Traffic, 10observability: 'LVS connections' graph on Load Balancers dashboard takes a rate of a gauge - https://phabricator.wikimedia.org/T236700 (10CDanis) [15:40:02] 10Operations, 10Traffic, 10observability: 'LVS connections' graph on Load Balancers dashboard takes a rate of a gauge - https://phabricator.wikimedia.org/T236700 (10CDanis) p:05Triage→03Normal [15:47:37] PROBLEM - Host tools.wmflabs.org is DOWN: PING CRITICAL - Packet loss = 100% [15:48:31] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [15:48:54] (03PS1) 10Jakob: Update termbox test service to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/546627 [15:49:05] 10Operations, 10SRE-tools: sre.hosts.downtime fails with "No hosts provided" - https://phabricator.wikimedia.org/T236684 (10Volans) In this case the query to puppetdb returned no matching host. After a first look I think that it might be related to the queue size in Puppetdb that apparently has grown quite a l... [15:55:57] 10Operations, 10ops-ulsfo, 10Traffic, 10Wikidata, and 2 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (10sbassett) >>! In T236500#5609046, @Bugreporter wrote: > @jijiki The Custom Policy does not make sense since #Traffic is currently a public-joinable project.... [15:56:02] (03CR) 10Jakob: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/546627 (owner: 10Jakob) [15:56:22] (03PS7) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) [15:58:43] (03PS17) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [16:00:05] (03CR) 10CRusnov: backends: add Netbox backend (035 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [16:05:45] (03CR) 10jerkins-bot: [V: 04-1] backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [16:11:24] (03PS1) 10Muehlenhoff: Pass the Groovy script with a file URI [puppet] - 10https://gerrit.wikimedia.org/r/546638 [16:11:26] (03PS1) 10Muehlenhoff: Fix attribute matching in Groovy script [puppet] - 10https://gerrit.wikimedia.org/r/546639 [16:13:11] (03PS2) 10Muehlenhoff: Fix attribute matching in Groovy script [puppet] - 10https://gerrit.wikimedia.org/r/546639 [16:16:03] (03CR) 10Jforrester: [C: 03+2] MCR: Set testwiki to use the new MCR-only schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543007 (https://phabricator.wikimedia.org/T198558) (owner: 10Daniel Kinzler) [16:16:51] (03PS1) 10Arturo Borrero Gonzalez: toollabs: delete unused proxy code [puppet] - 10https://gerrit.wikimedia.org/r/546640 (https://phabricator.wikimedia.org/T235627) [16:17:01] (03Merged) 10jenkins-bot: MCR: Set testwiki to use the new MCR-only schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543007 (https://phabricator.wikimedia.org/T198558) (owner: 10Daniel Kinzler) [16:17:03] (03CR) 10Bstorm: [C: 04-1] "See my comments on the task. I don't want to use the local registry for calico or anything kubeadm managed outside of pause. I made sure" [puppet] - 10https://gerrit.wikimedia.org/r/546459 (https://phabricator.wikimedia.org/T236249) (owner: 10Arturo Borrero Gonzalez) [16:17:31] Testing on mwdebug1001. 
[16:17:50] (03PS1) 10Krinkle: Remove unused $hostName variable in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546641 [16:18:46] (03CR) 10Arturo Borrero Gonzalez: "I believe everything this patch is deleting is already covered in `role::wmcs::toolforge::proxy`." [puppet] - 10https://gerrit.wikimedia.org/r/546640 (https://phabricator.wikimedia.org/T235627) (owner: 10Arturo Borrero Gonzalez) [16:22:54] (03PS1) 10Jforrester: Revert "MCR: Set testwiki to use the new MCR-only schema" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546643 [16:22:58] (03CR) 10Jforrester: [C: 03+2] Revert "MCR: Set testwiki to use the new MCR-only schema" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546643 (owner: 10Jforrester) [16:23:31] (03CR) 10Bstorm: [C: 03+1] toollabs: delete unused proxy code [puppet] - 10https://gerrit.wikimedia.org/r/546640 (https://phabricator.wikimedia.org/T235627) (owner: 10Arturo Borrero Gonzalez) [16:24:04] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10Gehel) @wiki_willy any idea on a timeline to get those servers racked? At the moment, we have 4 servers down in t... [16:24:06] (03Merged) 10jenkins-bot: Revert "MCR: Set testwiki to use the new MCR-only schema" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546643 (owner: 10Jforrester) [16:24:08] (03CR) 10Jforrester: [C: 03+1] Remove unused $hostName variable in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546641 (owner: 10Krinkle) [16:24:10] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Update termbox test service to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/546627 (owner: 10Jakob) [16:26:02] I'm out of prod. 
[16:28:11] (03PS18) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [16:28:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "This is a NOOP for the current servers tools-proxy-05 and -06. https://puppet-compiler.wmflabs.org/compiler1001/19099/" [puppet] - 10https://gerrit.wikimedia.org/r/546640 (https://phabricator.wikimedia.org/T235627) (owner: 10Arturo Borrero Gonzalez) [16:32:34] (03PS2) 10Matthias Mullie: Increase rate limits for newbie non-ip users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541532 (https://phabricator.wikimedia.org/T231463) [16:33:55] akosiaris: Hey, I get this when deploying to k8s: [16:33:58] https://www.irccloud.com/pastebin/VmlMaduw/ [16:34:04] That seems scary [16:34:22] Amir1: he is off today [16:34:23] (03PS3) 10Elukey: Add AAAA/A/PTR records for an-airflow1001 [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) [16:34:38] jynus: oh thanks [16:36:16] !log restart puppetdb on pupetdb1001 to remove queue [16:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:08] 10Puppet, 10Cloud-VPS: geoipupdate missing on buster on Cloud VPS - https://phabricator.wikimedia.org/T236487 (10bd808) Looking at a jessie host, we seem to pull in contrib with our jessie-backports sources config. And the same for stretch and buster actually. So we would get this package in a theoretical futu... 
[16:40:19] 10Operations, 10Icinga, 10observability: Monitoring for puppetdb queue size - https://phabricator.wikimedia.org/T236707 (10MoritzMuehlenhoff) [16:40:25] 10Operations, 10Icinga, 10observability: Monitoring for puppetdb queue size - https://phabricator.wikimedia.org/T236707 (10MoritzMuehlenhoff) p:05Triage→03Normal [16:41:47] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01094 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:43:23] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.0007294 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:44:52] 10Operations, 10ops-esams, 10netops: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (10Dzahn) [16:45:27] (03PS4) 10Elukey: Add AAAA/A/PTR records for an-airflow1001 [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) [16:46:20] (03PS1) 10Andrew Bogott: add snakeoil key for instance-puppet-user [labs/private] - 10https://gerrit.wikimedia.org/r/546648 [16:50:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (nit inline) also please consider adding tests" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [16:50:50] 10Operations, 10ops-esams, 10netops: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (10BBlack) a:03BBlack I'll poke at this today since Arzhel's not here (may take a couple hours, squeezing it around meetings) [16:52:45] (03PS1) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [16:53:21] (03CR) 10Elukey: [C: 03+2] Add AAAA/A/PTR records for 
an-airflow1001 [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [16:53:29] (03CR) 10Filippo Giunchedi: "My understanding from reading T236478 and T236345 is that we'd like to address hosts with failed puppet and not disabled. For those hosts " [puppet] - 10https://gerrit.wikimedia.org/r/546165 (https://phabricator.wikimedia.org/T236478) (owner: 10Jbond) [16:53:46] (03CR) 10Ottomata: [C: 03+1] Add AAAA/A/PTR records for an-airflow1001 [dns] - 10https://gerrit.wikimedia.org/r/546498 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [16:54:50] (03CR) 10jerkins-bot: [V: 04-1] Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [16:54:59] !log mr1-esams: fix bast3004 access for esams mgmt network - T236686 [16:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:03] T236686: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 [16:55:37] 10Operations, 10Discovery-Search, 10vm-requests, 10Patch-For-Review: setup/install an-airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10Ottomata) [16:56:22] (03PS2) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [16:56:57] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [16:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:55] !log mr1-eqsin: fix bast3004 access for eqsin mgmt network - T236686 [16:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:20] (03CR) 10jerkins-bot: [V: 04-1] Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 
(https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [16:59:49] (03PS3) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [17:00:04] gehel and onimisionipe: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191028T1700). [17:00:19] no deploy [17:00:31] (03PS4) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [17:00:32] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/546165 (https://phabricator.wikimedia.org/T236478) (owner: 10Jbond) [17:01:06] (03CR) 10jerkins-bot: [V: 04-1] Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [17:02:23] (03PS5) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [17:03:34] (03PS6) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [17:04:23] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10Gehel) >>! In T230746#5544673, @EBernhardson wrote: > The servers today will not be able to utilize 10G, so they... 
[17:10:10] !log mr1-ulsfo: fix bast3004 access for ulsfo mgmt network - T236686 [17:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:15] T236686: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 [17:11:17] !log mr1-codfw: fix bast3004 access for codfw mgmt network - T236686 [17:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:41] <_joe_> !log starting rolling restart of memcached servers in eqiad, beginning with mc1019 T235188 [17:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:46] T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache - https://phabricator.wikimedia.org/T235188 [17:12:34] !log mr1-eqiad: fix bast3004 access for eqiad mgmt network - T236686 [17:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:22] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] add snakeoil key for instance-puppet-user [labs/private] - 10https://gerrit.wikimedia.org/r/546648 (owner: 10Andrew Bogott) [17:14:04] 10Operations, 10ops-esams, 10netops: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (10BBlack) 05Open→03Resolved Turns out it was simpler than I thought! Should be done here, re-open if it's still not working. 
[17:14:37] (03PS7) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [17:20:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [17:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:37] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10wiki_willy) a:03Cmjohnson [17:22:14] 10Operations, 10ops-eqiad, 10DC-Ops: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) - https://phabricator.wikimedia.org/T227543 (10wiki_willy) a:03Cmjohnson [17:24:50] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10JHedden) a:05Cmjohnson→03None [17:25:36] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10JHedden) [17:27:21] 10Operations, 10Discovery-Search, 10vm-requests: setup/install an-airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_C an-airflow1001.eqiad.wmnet --vcpus 4 --memory 8 --disk 50 --link analytics START - Cookbook... [17:31:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] logstash: support both mediawiki and parsoid-php types [puppet] - 10https://gerrit.wikimedia.org/r/545534 (owner: 10Giuseppe Lavagetto) [17:32:28] (03PS1) 10Elukey: Introduce an-airflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/546660 (https://phabricator.wikimedia.org/T236181) [17:33:38] <_joe_> subbu: did you merge your logging change today? 
[17:33:54] (03CR) 10Elukey: [C: 03+2] Introduce an-airflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/546660 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [17:33:55] <_joe_> with my patch above we should be whitelisting parsoid-php too as a log type [17:36:00] <_joe_> subbu: O I see still not merged [17:36:10] <_joe_> do it at your earliest convenience :) [17:38:33] (03PS1) 10Dzahn: ssl: re-issue certificate for RT [puppet] - 10https://gerrit.wikimedia.org/r/546662 (https://phabricator.wikimedia.org/T180641) [17:39:38] (03CR) 10Dzahn: [C: 03+2] ssl: re-issue certificate for RT [puppet] - 10https://gerrit.wikimedia.org/r/546662 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [17:41:21] (03PS1) 10BryanDavis: puppetmaster: Add feature flag for geoip provisioning [puppet] - 10https://gerrit.wikimedia.org/r/546664 (https://phabricator.wikimedia.org/T236487) [17:41:44] (03CR) 10Jeena Huneidi: [C: 03+1] scaffold: only expose one port as a service by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 (owner: 10Giuseppe Lavagetto) [17:43:31] (03PS1) 10Elukey: Add an-airflow to the ganeti's partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/546665 (https://phabricator.wikimedia.org/T236181) [17:44:47] (03CR) 10Elukey: [C: 03+2] Add an-airflow to the ganeti's partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/546665 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [17:45:26] jouncebot: next [17:45:26] In 0 hour(s) and 14 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191028T1800) [17:45:53] i just noticed that the SWATs are now pinned to UTC instead of local time in SF. 
cool :D [17:55:03] (03PS1) 10Herron: ipsec: remove check_strongswan in favor of prometheus check [puppet] - 10https://gerrit.wikimedia.org/r/546666 (https://phabricator.wikimedia.org/T230236) [17:57:40] (03PS2) 10Herron: ipsec: remove check_strongswan in favor of prometheus check [puppet] - 10https://gerrit.wikimedia.org/r/546666 (https://phabricator.wikimedia.org/T230236) [17:58:38] 10Operations, 10ops-eqiad, 10Discovery-Search: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10Gehel) This server is scheduled to be replaced, let's not fix anything. [17:58:47] (03CR) 10BryanDavis: "I tried to use PCC to verify that this would end up as a noop for prod puppetmasters, but the run failed with "Unable to find facts for ho" [puppet] - 10https://gerrit.wikimedia.org/r/546664 (https://phabricator.wikimedia.org/T236487) (owner: 10BryanDavis) [17:58:57] 10Operations, 10ops-eqiad, 10Discovery-Search: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10wiki_willy) a:03Gehel Per my conversation with Guillaume, this system will be decommissioned, so assigning it to @Gehel for now. [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191028T1800). [18:00:04] urandom and MatmaRex: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:15] hi [18:00:15] I can SWAT today! 
[18:00:23] * urandom is present [18:01:14] (03PS9) 10Urbanecm: Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [18:01:36] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [18:02:31] MatmaRex: +2'ed your patches, will ping you once you can test [18:03:42] (03Merged) 10jenkins-bot: Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [18:04:01] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10wiki_willy) a:03Cmjohnson [18:04:18] urandom: could you check your patch at mwdebug1001, please? [18:04:28] Urbanecm: yup! [18:04:51] (i'm away for 5 minutes, sorry) [18:05:11] MatmaRex: np, I'll wait :) [18:05:55] it...works [18:06:02] * urandom tries to act less suprised [18:06:17] thanks urandom, syncing :) [18:07:35] (03PS1) 10Urbanecm: Rename author NS at thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546669 (https://phabricator.wikimedia.org/T236640) [18:07:45] !log urbanecm@deploy1001 Synchronized wmf-config/: SWAT: ddaa534: Config changes for Echo kask migration (T222851) (duration: 00m 55s) [18:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:51] T222851: Improve Echo seentime code for multi-DC access - https://phabricator.wikimedia.org/T222851 [18:08:00] urandom: done [18:08:05] RoanKattouw: you around? [18:08:10] Urbanecm: cool; thanks! 
[18:08:16] yw urandom [18:08:34] (03PS2) 10Urbanecm: Rename author NS at thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546669 (https://phabricator.wikimedia.org/T236640) [18:08:38] (03CR) 10Urbanecm: [C: 03+2] Rename author NS at thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546669 (https://phabricator.wikimedia.org/T236640) (owner: 10Urbanecm) [18:09:33] (03Merged) 10jenkins-bot: Rename author NS at thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546669 (https://phabricator.wikimedia.org/T236640) (owner: 10Urbanecm) [18:10:32] (i'm back, sorry) [18:11:07] MatmaRex: ack, the patches are still making their way through CI [18:11:26] (03CR) 10Herron: [C: 03+1] "LGTM overall! Couple minor comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [18:12:31] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:14:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ea927dd: Rename author NS at thwikisource (T236640) (duration: 00m 53s) [18:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:07] T236640: Rename namespace on Thai Wikisource - https://phabricator.wikimedia.org/T236640 [18:15:50] !log Run mwscript namespaceDupes.php --wiki=thwikisource --fix (T236640) [18:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:00] (03CR) 10BryanDavis: "Jbond got PCC to test puppetmaster100{1,2}.eqiad -- https://puppet-compiler.wmflabs.org/compiler1002/10/ -- and it shows a functional noop" [puppet] - 10https://gerrit.wikimedia.org/r/546664 (https://phabricator.wikimedia.org/T236487) (owner: 10BryanDavis) [18:17:41] 10Operations, 10Puppet, 10puppet-compiler: PCC failing with ERROR: Unable to find facts for 
host - https://phabricator.wikimedia.org/T236717 (10jbond) p:05Triage→03Normal [18:18:09] MatmaRex: your patches are at mwdebug1001. Could you test them and let me know? [18:18:27] 10Operations, 10Puppet, 10puppet-compiler: PCC failing with ERROR: Unable to find facts for host - https://phabricator.wikimedia.org/T236717 (10jbond) I am syncing facts now just to rule it out [18:18:32] yeah [18:18:54] 10Operations, 10Puppet, 10puppet-compiler: PCC failing with ERROR: Unable to find facts for host - https://phabricator.wikimedia.org/T236717 (10jbond) [18:19:06] 10Operations, 10Discovery-Search, 10vm-requests: setup/install an-airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10elukey) For some reason that I don't get, the debian install gets stuck at the partman step since it doesn't find the correct recipe.. [18:21:09] Urbanecm: looks good, and i don't see the error from earlier this time [18:21:20] me neither, so syncing [18:21:21] thanks MatmaRex [18:21:39] 10Operations, 10Discovery-Search, 10vm-requests: setup/install an-airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10Dzahn) >>! In T236181#5612621, @elukey wrote: > For some reason that I don't get, the debian install gets stuck at the partman step since it doesn't find the corr... 
[18:23:12] 10Operations, 10Puppet, 10puppet-compiler: PCC failing with ERROR: Unable to find facts for host - https://phabricator.wikimedia.org/T236717 (10jbond) 05Open→03Resolved a:03jbond syncing facts seems to have resolved the issue [18:23:47] Urbanecm: I think we're going to need to roll this back [18:24:08] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/VisualEditor/includes/ApiVisualEditor.php: SWAT: b19ad5f: Revert "Revert "ApiVisualEditor: Return etag with content for preloaded content""; 4f3b724: ApiVisualEditor: Fix preload handling further (T233320) (duration: 00m 53s) [18:24:08] Urbanecm: it's working fine, but seems to have applied more broadly than testwiki (which was our intention) [18:24:12] 10Operations, 10ops-esams, 10netops: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (10Dzahn) confirmed working now. Thanks! ` [bast3004:~] $ ping -c1 -w1 cp5007.mgmt.eqsin.wmnet PING cp5007.mgmt.eqsin.wmnet (10.132.129.107) 56(84) bytes of data. 64 bytes from cp5007.mgmt.eq... [18:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:13] T233320: VisualEditor <-> RESTBase communication and ETags - https://phabricator.wikimedia.org/T233320 [18:24:31] urandom: okay, will do [18:24:49] (03PS1) 10Urbanecm: Revert "Config changes for Echo kask migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546672 (https://phabricator.wikimedia.org/T222851) [18:24:59] (03PS2) 10Urbanecm: Revert "Config changes for Echo kask migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546672 (https://phabricator.wikimedia.org/T222851) [18:25:06] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "SWAT, revert" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546672 (https://phabricator.wikimedia.org/T222851) (owner: 10Urbanecm) [18:25:44] MatmaRex: your patch is synced [18:25:51] thanks! 
[18:25:55] you're welcome [18:26:28] !log urbanecm@deploy1001 Synchronized wmf-config/: SWAT: c48271d: Revert "Config changes for Echo kask migration" (T222851) (duration: 00m 53s) [18:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:33] T222851: Improve Echo seentime code for multi-DC access - https://phabricator.wikimedia.org/T222851 [18:26:43] urandom: rolled back [18:27:12] Urbanecm: great, thanks again [18:27:17] you're welcome [18:27:30] (03PS1) 10Urbanecm: Enable mapframe at kawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546673 (https://phabricator.wikimedia.org/T229726) [18:28:05] !log moscovium - deleting /etc/request-tracker4/RT_SiteConfig.d/ 50-debconf.pm and 51-dbconfig-common.pm which duplicate the same files without .pm extension with wrong values, probably due to some package change (T180641) [18:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:10] T180641: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 [18:28:39] (03CR) 10Urbanecm: [C: 03+2] Enable mapframe at kawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546673 (https://phabricator.wikimedia.org/T229726) (owner: 10Urbanecm) [18:29:54] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) Reporting a chat with Rob on IRC. We could do the following as test: 1) start with kafka-jumbo1001, schedule downtime and stop kafka. Also systemctl mask... [18:30:03] _joe_, I figured someone else would be merging my patch ... i don't have +2 on that repo. 
[18:30:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 30111f3: Enable mapframe at kawiki (T229726) (duration: 00m 53s) [18:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:16] T229726: Activate Kartographer in Georgian wikipedia - https://phabricator.wikimedia.org/T229726 [18:30:17] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) When we are ready we can coordinate to add the new NIC to kafka-jumbo1001 :) [18:30:43] (03CR) 10Jbond: [C: 03+1] "lgtm" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546664 (https://phabricator.wikimedia.org/T236487) (owner: 10BryanDavis) [18:30:47] <_joe_> subbu: oh dang, ask assistance to releng / sre in a better TZ than me [18:31:51] (03PS1) 10RobH: adding esams pdus to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/546674 (https://phabricator.wikimedia.org/T184066) [18:32:16] (03CR) 10Volans: [C: 04-1] "Nice initial version, as discussed offline some comments inline. I skipped the tests for now." (0310 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [18:32:27] !log moscovium - rename all files in /etc/request-tracker4/RT_SiteConfig.d to have a .pm extension - this fixed RT - login works again - puppet patch coming up (T180641) [18:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:59] _joe_, will do .. but did you say you already have the whitelisting patch deployed? [18:33:07] puppet patch [18:35:30] 10Operations, 10Patch-For-Review: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (10Dzahn) >>! In T180641#5610987, @faidon wrote: > RT seems to be currently pointed at moscovium, and is currently broken: the frontpage doesn't load properly (mixed content messages) an... 
[18:36:15] Urbanecm, can i add a config patch to your swat deploy? [18:36:29] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/545944 [18:36:48] subbu: sure! [18:37:09] subbu: are you able to deploy it yourself, or do you want me to do it for you? [18:37:24] can you? [18:37:37] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM. PCC: https://puppet-compiler.wmflabs.org/compiler1001/19112/" [puppet] - 10https://gerrit.wikimedia.org/r/546664 (https://phabricator.wikimedia.org/T236487) (owner: 10BryanDavis) [18:37:43] (03CR) 10RobH: [C: 03+2] adding esams pdus to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/546674 (https://phabricator.wikimedia.org/T184066) (owner: 10RobH) [18:37:44] subbu: yes :-) [18:37:52] ty. [18:37:56] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545944 (https://phabricator.wikimedia.org/T235899) (owner: 10Subramanya Sastry) [18:38:41] (03Merged) 10jenkins-bot: Direct Parsoid/PHP logs to a parsoid-php log "type" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545944 (https://phabricator.wikimedia.org/T235899) (owner: 10Subramanya Sastry) [18:39:15] subbu: could you please test your patch at mwdebug1001? [18:40:57] parsoid/php doesn't yet receive traffic and isn't public .. so cannot test it quite yet from mwdebug1001. will have to be live on the parsoid cluster for me to test it. [18:41:12] subbu: okay, syncing then [18:41:16] k. [18:41:22] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [18:41:36] !log restarted memcached on mc1020 T235188 [18:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:41] T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache - https://phabricator.wikimedia.org/T235188 [18:42:18] <_joe_> subbu: yes I did (sorry, I'm really out now) [18:42:36] _joe_, ty. 
[18:42:38] !log urbanecm@deploy1001 Synchronized wmf-config/logging.php: SWAT: 1a09e2a: Direct Parsoid/PHP logs to a parsoid-php log "type" (T235899) (duration: 00m 52s) [18:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:43] T235899: Direct all Parsoid/PHP logs from wtp* servers to the parsoid-php channel instead of the mediawiki channel - https://phabricator.wikimedia.org/T235899 [18:42:50] subbu: synced [18:43:05] ty .. will test it out. [18:44:54] o/ [18:45:28] * Urbanecm waves to hauskater [18:45:43] 10Operations, 10ops-eqiad, 10Discovery-Search: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10Gehel) >>! In T236601#5612513, @Gehel wrote: > This server is scheduled to be replaced, let's not fix anything. Oops, we're actually only replacing elastic1017-1031, so we'll need to ke... [18:46:22] (03PS1) 10Dzahn: requesttracker: rename config files to have a .pm extension [puppet] - 10https://gerrit.wikimedia.org/r/546676 (https://phabricator.wikimedia.org/T180641) [18:47:49] 10Operations, 10ops-eqiad, 10Discovery-Search: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10MoritzMuehlenhoff) @Gehel : See the SRE meeting doc from today, there's now a new form for these requests: https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ [18:49:04] _joe_, Urbanecm, works. all good. the logs are showing up under parsoid-php. [18:49:12] <_joe_> subbu: oh great! [18:49:13] thanks subbu [18:49:22] nice! [18:49:44] <_joe_> I think we're 99% done with our part for now mutante [18:50:09] _joe_: :) cool, i can update the OKR to 99, hehe [18:50:15] is everyone done deploying .. if so, we would like to do a parsoid deploy. [18:50:35] Pchelolo, after the parsoid deploy, you can enable 10% mirroring. [18:50:46] subbu: go ahead [18:51:03] <_joe_> Pchelolo: what url are you using for mirroring though? 
[18:51:32] <_joe_> Pchelolo: it's a separated url from parsoid, you should call https://parsoid-php.discovery.wmnet [18:51:38] 10Operations, 10ops-eqiad, 10Discovery-Search: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10wiki_willy) [18:52:25] subbu: _joe_ ok. just got to this, lemme research a bit what state has marko left it in [18:52:52] !log Morning SWAT done [18:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:45] !log updating PHP on people1001 [18:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:03] 10Operations, 10ops-eqiad, 10Discovery-Search: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10wiki_willy) Thanks @MoritzMuehlenhoff - no worries though, since this task looks like it was autogenerated. (I'll have to talk to Ricardo on how we can modify the autogenerated ones) @G... [18:54:48] 10Operations, 10ops-eqiad, 10Discovery-Search: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (10wiki_willy) a:05Gehel→03Jclark-ctr [18:56:30] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) [18:56:46] _joe_: hm, it was actually set up for https://parsoid.svc.eqiad.wmnet/w/rest.php [18:59:46] 10Operations, 10Discovery-Search, 10vm-requests: setup/install an-airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10elukey) >>! In T236181#5612633, @Dzahn wrote: >>>! In T236181#5612621, @elukey wrote: >> For some reason that I don't get, the debian install gets stuck at the pa... 
[19:01:35] (03CR) 10Cwhite: [C: 03+1] admins: create new deploy group for design, add 3 users [puppet] - 10https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [19:02:13] (03CR) 10Ladsgroup: [C: 03+1] admins: create new deploy group for design, add 3 users [puppet] - 10https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [19:03:01] 10Operations, 10Puppet, 10observability, 10Patch-For-Review: update failed puppet checks so that they go critical 24 hours - https://phabricator.wikimedia.org/T236478 (10herron) [19:03:11] _joe_: I'm getting some SSL errors for parsoid-php if curling from RESTBase hosts.. [19:10:44] (03PS1) 10Elukey: Fix partman configuration for an-airflow1001 in netboot [puppet] - 10https://gerrit.wikimedia.org/r/546680 (https://phabricator.wikimedia.org/T236181) [19:11:02] (03CR) 10Elukey: [C: 03+2] Fix partman configuration for an-airflow1001 in netboot [puppet] - 10https://gerrit.wikimedia.org/r/546680 (https://phabricator.wikimedia.org/T236181) (owner: 10Elukey) [19:18:03] <_joe_> Pchelolo: that would be wrong though [19:18:25] <_joe_> Pchelolo: can you talk to mutante? I'm off now [19:18:36] _joe_: sure. thank you, have a nice evening [19:19:12] <_joe_> Pchelolo: wait, parsoid.svc.eqiad.wmnet is ok [19:19:28] <_joe_> I suggested parsoid-php.discovery.wmnet which resolves to the same IP [19:19:50] <_joe_> but I want you to use the discovery record so that when we switchover MediaWiki it will work automagically [19:20:10] 10Operations, 10Discovery-Search, 10vm-requests, 10Patch-For-Review: setup/install an-airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10elukey) The VM is ready, running buster with basic puppet settings (so only ops for now can ssh to it). @EBernhardson I just realized that B... [19:20:14] yeah. I've checked, that's why I made the patch that ignores it for now. 
I'll file a ticket about it [19:24:56] (03CR) 10VolkerE: [C: 03+1] admins: create new deploy group for design, add 3 users [puppet] - 10https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518) (owner: 10Dzahn) [19:30:23] (03CR) 10Jforrester: "recheck" [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [19:31:45] (03CR) 10jerkins-bot: [V: 04-1] Initial version of httpbb, the HTTP black box testing tool. [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [19:32:32] James_F: thanks! [19:32:42] rlazarus: Happy to help out. [19:38:26] !log ssastry@deploy1001 Started deploy [parsoid/deploy@d932d6a]: Update parsoid to 089bf28d [19:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:34] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:41:08] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@d932d6a]: Update parsoid to 089bf28d (duration: 02m 42s) [19:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:19] mutante, rolled back parsoid deploy [19:41:36] can you verify if servers are pooled after rollback? [19:42:12] mutante, because i ran into this: "2019-10-28 19:41:03,462 [WARNING] Service restart failed. NOT repooling" because of php7.2-fpm restart failing [19:46:03] (03PS1) 10MarcoAurelio: Allow AbuseFilter to issue blocks on es.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546700 (https://phabricator.wikimedia.org/T236730) [19:49:20] subbu: https://config-master.wikimedia.org/pybal/eqiad/parsoid-php [19:49:32] looks like wtp1025 and 1026 aren't pooled [19:49:58] cdanis, ah .. right, we figured it is parsoid/php that is depooled and not parsoid/js because those are still getting the js requests. 
[19:50:11] https://config-master.wikimedia.org/pybal/eqiad/parsoid looks all pooled [19:50:30] !log restarted memcached on mc1021 (T235188) [19:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:36] T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache - https://phabricator.wikimedia.org/T235188 [19:50:52] mutante, https://phabricator.wikimedia.org/T236275#5613115 /cc cdanis [19:51:16] alright, we'll pick this up tomorrow again .. calling off the restbase traffic mirroring to parsoid/php today. [19:51:38] (03PS3) 10Cwhite: prometheus, profile: add file count feature and enable lists queue tracking [puppet] - 10https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) [19:51:53] >Failed to restart php7.2-fpm.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files [19:52:04] Pchelolo, we'll get back to our offsite-ing now. :) [19:52:07] I think this is usually gibberish for 'you don't have sudo access'? 
[19:52:28] (03CR) 10Cwhite: prometheus, profile: add file count feature and enable lists queue tracking (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [19:53:36] (03CR) 10jerkins-bot: [V: 04-1] prometheus, profile: add file count feature and enable lists queue tracking [puppet] - 10https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [19:56:46] (03PS1) 10Ottomata: EventLogging refine = Unblacklist ChangesListHighlights schema, it has been fixed [puppet] - 10https://gerrit.wikimedia.org/r/546703 (https://phabricator.wikimedia.org/T212367) [19:58:44] (03CR) 10jerkins-bot: [V: 04-1] EventLogging refine = Unblacklist ChangesListHighlights schema, it has been fixed [puppet] - 10https://gerrit.wikimedia.org/r/546703 (https://phabricator.wikimedia.org/T212367) (owner: 10Ottomata) [20:00:04] cscott, arlolra, subbu, halfak, and accraze: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191028T2000). Please do the needful. 
[20:03:49] (03CR) 10DannyS712: [C: 03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546700 (https://phabricator.wikimedia.org/T236730) (owner: 10MarcoAurelio) [20:06:47] (03PS2) 10Cwhite: logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [20:12:32] (03PS4) 10Cwhite: prometheus, profile: add file count feature and enable lists queue tracking [puppet] - 10https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) [20:14:40] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10colewhite) a:05colewhite→03CGlenn [20:24:40] twentyafterfour: Did you see my last message with the friendly deploy request? [20:26:08] (03PS2) 10Ottomata: EventLogging refine - Unblacklist ChangesListHighlights, it has been fixed [puppet] - 10https://gerrit.wikimedia.org/r/546703 (https://phabricator.wikimedia.org/T212367) [20:28:23] mutante: lol yeah okay I thought I recognized that 15:52:07 <@cdanis> I think this is usually gibberish for 'you don't have sudo access'? [20:29:23] urandom: I'm here now, what's up? 
[20:29:29] (Sorry for the late response, had meetings and then lunch) [20:29:56] RoanKattouw: we deployed that config change to enable the new storage of echo seen times, then rolled it back [20:30:21] RoanKattouw: https://phabricator.wikimedia.org/T222851#5612730 [20:31:05] TL;DR we were thinking this would pertain to testwiki only, and were a little surprised when it ended up with a greater reach [20:31:15] it all seemed to be working fine though [20:31:21] Well CommonSettings.php line 3004 sets it, and then InitialiseSettings.php also sets it [20:31:23] I'm not sure what that does [20:31:35] However testwiki would certainly be expected to try to access global keys [20:32:02] and the other keys? enwiki, ruwiki, etc? [20:32:38] in total, it was about 1k/s requests [20:32:46] (03CR) 10Ottomata: [C: 03+2] EventLogging refine - Unblacklist ChangesListHighlights, it has been fixed [puppet] - 10https://gerrit.wikimedia.org/r/546703 (https://phabricator.wikimedia.org/T212367) (owner: 10Ottomata) [20:32:57] I wouldn't expect the enwiki, ruwiki etc, unless the conflicting definitions caused all wikis to get kask [20:33:12] That would have been easier to debug if the change had still been live, but I can try to reapply it on a test server and check [20:33:46] At the very least the config change was confusingly written, it shouldn't be this hard to figure out whether non-testwiki wikis got kask or not. The CommonSettings.php line should have been removed [20:34:24] (03PS1) 10DLynch: Re-enable mobile editor A/B testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546724 (https://phabricator.wikimedia.org/T236337) [20:34:26] Do you have an idea of what the ratio was between e.g. enwiki requests and testwiki requests? 
[20:34:46] not really [20:34:51] a lot more enwiki [20:35:01] Right OK, then you probably accidentally deployed it everywhere [20:35:09] (03PS1) 10Jgreen: switch fundraising-write.wmnet to point to frdb1002 [dns] - 10https://gerrit.wikimedia.org/r/546727 [20:35:31] I only ever spotted the two when I was testing against mwdebug1001, and if there were any afterward, they were lost in the noise [20:35:42] (03CR) 10DLynch: [C: 04-1] "This patch shouldn't be merged until the depends-on commit has rolled out (barring mishaps, should be 20191031)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546724 (https://phabricator.wikimedia.org/T236337) (owner: 10DLynch) [20:35:44] I ordinarily wouldn't recommend deploying this to only one wiki, because as you noticed the vast majority of the keys are global, but the worst that would happen is testwiki behaves a bit strangely which is fine [20:35:59] (03CR) 10Jgreen: [C: 03+2] switch fundraising-write.wmnet to point to frdb1002 [dns] - 10https://gerrit.wikimedia.org/r/546727 (owner: 10Jgreen) [20:37:09] RoanKattouw: I was really quite tempted to just leave it be and not roll it back at all [20:37:13] !log authdns update to switch fundraising db service hostname [20:37:15] 10Operations, 10Icinga, 10observability: Monitoring for puppetdb queue size - https://phabricator.wikimedia.org/T236707 (10Peachey88) [20:37:16] it seemed to be working fine [20:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:25] Yeah that would probably have been OK [20:37:38] it wasn't intentional, but we were already there [20:37:43] I'm doing some debugging now to confirm my theory [20:37:45] (03PS2) 10BryanDavis: puppetmaster: Add feature flag for geoip provisioning [puppet] - 10https://gerrit.wikimedia.org/r/546664 (https://phabricator.wikimedia.org/T236487) [20:37:58] (03CR) 10BryanDavis: puppetmaster: Add feature flag for geoip provisioning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546664 
(https://phabricator.wikimedia.org/T236487) (owner: 10BryanDavis) [20:38:24] Yeah, see: [20:38:27] catrope@mwdebug1002:~$ mwscript eval.php enwiki [20:38:27] > echo $wgEchoSeenTimeCacheType; [20:38:27] kask-echoseen-transition [20:38:48] Because of a stray line in the config patch, it was enabled everywhere [20:39:03] CommonSettings.php:3004, yes? [20:40:30] RoanKattouw: could you expound on that recommendation *against* only deploying to testwiki? what problems would that create? [20:41:21] Yes line 3004. That line conflicts with the per-wiki settings in InitialiseSettings.php, and it appears that the CommonSettings line wins [20:42:47] (03PS6) 10MarcoAurelio: mediawiki::web:prod_sites.pp: Apache config for ge.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) [20:43:17] (03PS10) 10MarcoAurelio: Initial configuration for ge.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [20:43:41] Re testwiki-only problems: most users have global notifications enabled (it's the default), and so they have one global seentime cache entry, rather than one for each wiki [20:43:58] As you noticed, the vast majority of the cache keys start with global: rather than enwiki: or ruwiki: or whatever [20:44:26] If that user exists on both testwiki and, say, enwiki, then both of those will share the same global key for that user [20:44:53] So if you point those two wikis to different caches, then they won't actually be looking at the same cache entry, and that will cause mildly confusing behavior [20:45:00] But since testwiki is not a real wiki, it won't cause big problems [20:45:26] Basically it split-brains testwiki away from the rest of the wikis, so if they view their notifs on testwiki that won't be reflected on other wikis, and if they view them on some other wiki that won't be reflected on testwiki [20:45:42] yeah [20:46:02] hrmm [20:46:53] why is wikitech 
memcached-pecl ? [20:47:05] I think this must have been copied from the sessionstore example [20:47:29] wikitech might be different because it lives on different servers, or it used to? [20:47:30] Wikitech is a random beast with random config. [20:47:34] Either way, wikitech is not part of the global user system [20:47:43] It runs on its own servers with its own databases and its own problems. [20:47:49] that's what I meant to ask, OK [20:47:54] Another strategy we could use here is test it on a wiki that's also not part of the global user system, like officewiki [20:48:03] (None of the private wikis are, for semi-obvious reasons) [20:48:56] Then it wouldn't be trying to access any global: keys, only officewiki: keys [20:49:31] where is wgEchoSeenTimeCacheType for officewiki going to be set at? [20:50:10] Same way it's set for anything else, you could just add an 'officewiki' => 'kask-echoseen-transition' line to InitialiseSettings.php [20:51:02] oh, so it does fall under 'default', it's just not included in global notifications [20:51:05] In the config patch you briefly deployed (if it hadn't been broken) it would have fallen back to 'default' => 'redis_local' [20:51:11] Yes that's right [20:51:31] Even super special snowflake wikitech falls under default, which is probably why it was customized [20:51:34] so we could almost just s/testwiki/officewiki/ there, and be good [20:51:42] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:42] Yeah [20:51:46] And drop that CommonSettings line [20:51:57] right [20:52:17] I can't believe I didn't think of this before, I tend to forget that there are a number of wikis outside the global user system that have real users [20:52:37] (03PS1) 10Eevans: Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546731 (https://phabricator.wikimedia.org/T222851) [20:52:47] doh, need to fix the commit message [20:53:29] (03PS2) 10Eevans: Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546731 (https://phabricator.wikimedia.org/T222851) [20:54:23] (03PS8) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [20:56:36] !log restart memcached on mc1022 T235188 [20:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:41] T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache - https://phabricator.wikimedia.org/T235188 [20:58:12] (03PS9) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [21:00:04] Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191028T2100). 
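[editor's note] The config conflict discussed above (per-wiki values in InitialiseSettings.php versus a stray unconditional assignment in CommonSettings.php) can be sketched roughly as follows. This is a minimal Python illustration, not the actual MediaWiki configuration code: the function names `resolve_per_wiki` and `effective_value` are hypothetical, and the only values taken from the log are the setting name, 'redis_local' as the default, and 'kask-echoseen-transition' for the wiki under test.

```python
# Hypothetical model of per-wiki settings with a 'default' fallback.
# Only the setting name and the two cache-type strings come from the log.
PER_WIKI_SETTINGS = {
    "wgEchoSeenTimeCacheType": {
        "default": "redis_local",
        "testwiki": "kask-echoseen-transition",
    }
}


def resolve_per_wiki(setting, wiki, per_wiki=PER_WIKI_SETTINGS):
    """Return the wiki-specific value, falling back to 'default'."""
    values = per_wiki[setting]
    return values.get(wiki, values["default"])


def effective_value(setting, wiki, global_overrides):
    """Model a later unconditional assignment winning over per-wiki values.

    This mirrors what was observed in the log (the CommonSettings line
    appeared to win); it is an assumption about ordering, not a spec.
    """
    if setting in global_overrides:
        return global_overrides[setting]
    return resolve_per_wiki(setting, wiki)


# With the stray global line present, every wiki gets the new cache type.
stray = {"wgEchoSeenTimeCacheType": "kask-echoseen-transition"}
print(effective_value("wgEchoSeenTimeCacheType", "enwiki", stray))

# With the stray line dropped, only the explicitly listed wiki gets it.
print(effective_value("wgEchoSeenTimeCacheType", "enwiki", {}))
print(effective_value("wgEchoSeenTimeCacheType", "testwiki", {}))
```

Under these assumptions, dropping the global assignment is what restores the intended single-wiki scope; swapping the 'testwiki' key for another wiki changes only which wiki opts in.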
[21:00:14] (03CR) 10jerkins-bot: [V: 04-1] Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [21:01:34] (03PS10) 10Andrew Bogott: Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) [21:03:47] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: set up access to a git repo to archive instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/546651 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [21:07:43] (03PS1) 10Andrew Bogott: horizon: create directory to hold the instance-puppet-user key [puppet] - 10https://gerrit.wikimedia.org/r/546734 (https://phabricator.wikimedia.org/T235708) [21:08:27] (03CR) 10jerkins-bot: [V: 04-1] horizon: create directory to hold the instance-puppet-user key [puppet] - 10https://gerrit.wikimedia.org/r/546734 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [21:09:36] (03PS2) 10Andrew Bogott: horizon: create directory to hold the instance-puppet-user key [puppet] - 10https://gerrit.wikimedia.org/r/546734 (https://phabricator.wikimedia.org/T235708) [21:10:14] (03CR) 10Andrew Bogott: [C: 03+1] puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [21:10:44] (03CR) 10Andrew Bogott: [C: 03+2] horizon: create directory to hold the instance-puppet-user key [puppet] - 10https://gerrit.wikimedia.org/r/546734 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [21:16:07] (03PS1) 10Andrew Bogott: horizon local_settings: quote PUPPET_GIT_REPO_USER [puppet] - 10https://gerrit.wikimedia.org/r/546737 (https://phabricator.wikimedia.org/T235708) [21:17:14] (03CR) 10Andrew Bogott: [C: 03+2] horizon local_settings: quote PUPPET_GIT_REPO_USER [puppet] - 
10https://gerrit.wikimedia.org/r/546737 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [21:18:24] (03CR) 10Catrope: [C: 03+1] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546731 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [21:18:36] urandom: At 4pm Pacific do you wanna try deploying this change? [21:18:46] Then I can test that the Echo functionality works correctly [21:18:55] RoanKattouw: yeah [21:19:26] RoanKattouw: want me to add it to Deployments? [21:19:32] I'm on it [21:19:46] I was about to do that for tomorrow, not knowing if I'd have anyone look it today [21:19:54] at it, that is [21:20:41] Wow you got 546731 for the new change and 540731 for the old one, impressive [21:20:50] OK I've put it on the schedule [21:21:04] 10Operations, 10Traffic: track NIC firmware version numbers across the fleet - https://phabricator.wikimedia.org/T236744 (10CDanis) [21:21:13] ha! [21:21:15] I'm usually more available for the 4pm SWAT window than any of the others, because of sleep and meetings with Europeans [21:25:38] (03CR) 10Andrew Bogott: [C: 03+1] puppetmaster::frontend serve volatile uri from the locale site frontend [puppet] - 10https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) (owner: 10Jbond) [21:36:09] 10Operations, 10Discovery-Search, 10vm-requests: setup/install an-airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10EBernhardson) I've only tested 3.5, but their github and pypi pages list 3.7 compatibility so it should be ok. I think I can go ahead and deploy the software and... [21:41:11] cdanis: hehe, indeed. i thought "this seems oddly weird but familiar" and then the stackexchange link was already in the browser history [21:41:43] soon my laptop will be out of battery and i still haven't found a place to charge it. even the co-working space at the local Internet provider. 
"we have wifi but no power" [21:51:54] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10Dzahn) gerrit prod has switched from cobalt to gerrit1001, it will be decom'ed but we need to create the ticket... [22:16:31] 10Operations, 10DC-Ops, 10decommission: decommission cobalt.wikimedia.org - https://phabricator.wikimedia.org/T236747 (10Dzahn) [22:17:12] 10Operations, 10DC-Ops, 10decommission: decommission cobalt.wikimedia.org - https://phabricator.wikimedia.org/T236747 (10Dzahn) [22:17:15] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10Dzahn) [22:18:40] 10Operations: decom ununpentium - https://phabricator.wikimedia.org/T236748 (10Dzahn) [22:18:52] 10Operations: decom ununpentium - https://phabricator.wikimedia.org/T236748 (10Dzahn) [22:18:56] 10Operations, 10Patch-For-Review: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (10Dzahn) [22:21:31] (03PS1) 10Alex Monk: Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py" [puppet] - 10https://gerrit.wikimedia.org/r/546755 [22:22:25] (03CR) 10jerkins-bot: [V: 04-1] Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py" [puppet] - 10https://gerrit.wikimedia.org/r/546755 (owner: 10Alex Monk) [22:23:50] (03PS2) 10Alex Monk: Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py" [puppet] - 10https://gerrit.wikimedia.org/r/546755 (https://phabricator.wikimedia.org/T235627) [22:24:43] (03CR) 10jerkins-bot: [V: 04-1] Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py" [puppet] - 10https://gerrit.wikimedia.org/r/546755 (https://phabricator.wikimedia.org/T235627) (owner: 10Alex Monk) [22:25:14] (03PS3) 10Alex Monk: Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py" [puppet] - 
10https://gerrit.wikimedia.org/r/546755 (https://phabricator.wikimedia.org/T235627) [22:36:57] (03PS1) 10Alex Monk: Fix labsaliaser script to be executable [puppet] - 10https://gerrit.wikimedia.org/r/546756 [22:38:52] (03CR) 10Jhedden: Fix labsaliaser script to be executable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546756 (owner: 10Alex Monk) [22:39:15] (03PS2) 10Dzahn: site/install_server: decom ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/544080 (https://phabricator.wikimedia.org/T180641) [22:39:35] (03CR) 10jerkins-bot: [V: 04-1] site/install_server: decom ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/544080 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [22:40:59] (03PS2) 10Alex Monk: Fix labsaliaser script to be executable [puppet] - 10https://gerrit.wikimedia.org/r/546756 [22:52:04] (03CR) 10Jhedden: [C: 03+1] Fix labsaliaser script to be executable [puppet] - 10https://gerrit.wikimedia.org/r/546756 (owner: 10Alex Monk) [22:53:38] (03PS3) 10Alex Monk: Fix labsaliaser script to be executable [puppet] - 10https://gerrit.wikimedia.org/r/546756 (https://phabricator.wikimedia.org/T235627) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191028T2300). [23:00:04] urandom: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:20] o/ [23:00:43] * urandom may already be entitled to a sticker [23:04:28] urandom: Why are you testing on officewiki in that patch? 
[23:04:47] MaxSem: because it does not use global notifications [23:05:18] Hmm [23:05:28] Okay, I guess I'll deploy today [23:05:36] MaxSem: RoanKattouw pointed out earlier that if we enable on testwiki, we'll have a split-brain scenario [23:06:02] where testwiki won't update the seentime of notifications elsewhere, and vice-versa [23:06:12] probably not the end of the world, but weird [23:06:13] MaxSem: I can also do it if you like [23:06:17] Sorry for missing the initial ping [23:06:26] (03PS1) 10Dzahn: parsoid: fix path to systemctl for php-restart sudo command line [puppet] - 10https://gerrit.wikimedia.org/r/546758 (https://phabricator.wikimedia.org/T236275) [23:06:38] Since you're the Echo guy, go ahead [23:10:23] (03CR) 10CDanis: [C: 03+1] parsoid: fix path to systemctl for php-restart sudo command line [puppet] - 10https://gerrit.wikimedia.org/r/546758 (https://phabricator.wikimedia.org/T236275) (owner: 10Dzahn) [23:11:43] Alright [23:11:44] (03CR) 10Dzahn: [C: 03+2] parsoid: fix path to systemctl for php-restart sudo command line [puppet] - 10https://gerrit.wikimedia.org/r/546758 (https://phabricator.wikimedia.org/T236275) (owner: 10Dzahn) [23:12:01] (03CR) 10Catrope: [C: 03+2] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546731 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [23:12:49] (03Merged) 10jenkins-bot: Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546731 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [23:14:34] (03CR) 10Dzahn: [C: 03+2] requesttracker: rename config files to have a .pm extension [puppet] - 10https://gerrit.wikimedia.org/r/546676 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [23:18:18] !log re-enabling puppet on moscovium (RT) [23:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:26] !log catrope@deploy1001 Synchronized wmf-config/ProductionServices.php: 
Deploy Echo kask migration to officewiki for testing, part 1 (T222851) (duration: 00m 54s) [23:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:30] T222851: Improve Echo seentime code for multi-DC access - https://phabricator.wikimedia.org/T222851 [23:20:02] (03PS1) 10Eevans: echostore: Set TTL to 1 year (31536000) [deployment-charts] - 10https://gerrit.wikimedia.org/r/546760 (https://phabricator.wikimedia.org/T222851) [23:20:32] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Deploy Echo kask migration to officewiki for testing, part 2 (T222851) (duration: 00m 52s) [23:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:28] (03CR) 10Eevans: [V: 03+2 C: 03+2] echostore: Set TTL to 1 year (31536000) [deployment-charts] - 10https://gerrit.wikimedia.org/r/546760 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [23:22:40] (03Merged) 10jenkins-bot: echostore: Set TTL to 1 year (31536000) [deployment-charts] - 10https://gerrit.wikimedia.org/r/546760 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [23:23:41] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Deploy Echo kask migration to officewiki for testing, part 3 (T222851) (duration: 00m 52s) [23:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:04] anyone know what this helmfile error means? 
[23:25:08] in ./helmfile.yaml: failed processing release production: helm exited with status 1: [23:25:08] Error: forwarding ports: error upgrading connection: pods "tiller-deploy-97d7ddd67-2kgh4" is forbidden: User "echostore" cannot create resource "pods/portforward" in API group "" in the namespace "echostore" [23:27:03] RoanKattouw: FYI, TTLs are currently set to 1 hour (copy pasta from the session store k8s deployment charts) [23:27:16] I don't guess that's a problem because it'll fall back to redis [23:27:32] Does the fallback thing write to both backends? [23:27:44] umm... pretty sure [23:27:50] or I was until you just asked that [23:27:54] Also, we should be live on officewiki now and you should be seeing officewiki traffic [23:28:02] I am [23:28:19] a grand total of 3 keys so far :) [23:28:47] I would deploy a config change for the right TTL, but helmfile is mad at me for some reason [23:30:23] Yeah, MultiWriteBagOStuff should write to all listed backends [23:30:39] And read from them in order until it finds something [23:41:01] 10Operations: R packages to be added to frdev1001 server - https://phabricator.wikimedia.org/T236750 (10EYener) [23:44:36] (03PS1) 10Jgreen: switch fundraising queue monitoring from frdb1001 to frdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/546762 (https://phabricator.wikimedia.org/T236739) [23:50:02] (03CR) 10Jgreen: [C: 03+2] switch fundraising queue monitoring from frdb1001 to frdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/546762 (https://phabricator.wikimedia.org/T236739) (owner: 10Jgreen) [23:50:26] (03PS1) 10Bstorm: toolforge-k8s: enable the settings API and PodPreset [puppet] - 10https://gerrit.wikimedia.org/r/546764 (https://phabricator.wikimedia.org/T215678)