[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200114T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:03:58] (PS2) Dzahn: releases: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/563548 [00:04:06] (PS4) Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/554576 [00:05:28] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/20331/releases1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/563548 (owner: Dzahn) [00:08:21] (PS1) Nray: Temporarily turn off AmcOutreach until T242491 regression is resolved [mediawiki-config] - https://gerrit.wikimedia.org/r/564157 (https://phabricator.wikimedia.org/T242491) [00:09:37] (PS2) Dzahn: microsites: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/563552 [00:11:31] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/20332/bromine.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/563552 (owner: Dzahn) [00:14:46] (CR) Dzahn: [C: +2] codesearch: Install docker-ce from thirdparty/kubeadm-k8s component [puppet] - https://gerrit.wikimedia.org/r/563633 (owner: Legoktm) [00:21:58] (PS1) Dzahn: codesearch: fix parameters of apt::package_from:component [puppet] - https://gerrit.wikimedia.org/r/564167 [00:27:23] (CR) Dzahn: [C: +2] codesearch: fix parameters of apt::package_from:component [puppet] - https://gerrit.wikimedia.org/r/564167 (owner: Dzahn) [00:34:34] (Abandoned) Nray: Temporarily turn off AmcOutreach until T242491 regression is resolved [mediawiki-config] - https://gerrit.wikimedia.org/r/564157 (https://phabricator.wikimedia.org/T242491) (owner: Nray) [00:38:21] Operations, 
Design-Research, Domains, Traffic: Register wikipersonas.org and redirect URL - https://phabricator.wikimedia.org/T241944 (Dzahn) [00:43:34] (PS1) Dzahn: admin: upgrade Hugh Nowlan to root shell user (ops) [puppet] - https://gerrit.wikimedia.org/r/564171 (https://phabricator.wikimedia.org/T242309) [00:45:25] (CR) Dzahn: "follow-up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/564167" [puppet] - https://gerrit.wikimedia.org/r/563633 (owner: Legoktm) [00:46:14] (CR) Dzahn: "E: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/stretch-wikimedia/InRelease Unable to find expected entry 'thirdparty/kubeadm" [puppet] - https://gerrit.wikimedia.org/r/563633 (owner: Legoktm) [00:47:11] (CR) Bstorm: "Bryan noticed that individual jobs do not have the rerun bit set (per defaults). This is correct." [software/tools-webservice] - https://gerrit.wikimedia.org/r/564095 (https://phabricator.wikimedia.org/T242397) (owner: Bstorm) [00:47:34] (Abandoned) Bstorm: gridengine: Make webservices "not rerunable" [software/tools-webservice] - https://gerrit.wikimedia.org/r/564095 (https://phabricator.wikimedia.org/T242397) (owner: Bstorm) [00:59:07] (CR) Dzahn: "works on buster without puppet errors now! you can go ahead and create a new buster instance in your cloud VPS project, then just apply "r" [puppet] - https://gerrit.wikimedia.org/r/563633 (owner: Legoktm) [01:00:20] legoktm: ready to create the new buster instance in "codesearch" project [01:02:19] then just click puppet config, apply "role::codesearch" as class and run puppet agent -tv and it should have no errors [01:03:11] if it says something about not finding base_dir then put "profile::codesearch::base_dir: '/srv'" in the Hiera form. But it shouldn't because we already added it in the repo too.. 
(hmm) [01:03:34] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.14/extensions/GrowthExperiments/: Various topic search-related cherry-picks (duration: 00m 57s) [01:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:07] (PS1) Bstorm: gridengine: set the webgrid queues to not rerunable [puppet] - https://gerrit.wikimedia.org/r/564174 (https://phabricator.wikimedia.org/T242397) [01:14:58] mutante: the hieradata/labs patch is applied in beta, but puppet still fails the same way [01:17:20] Operations, Puppet, Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (Krinkle) The above patch is live on the beta cluster puppet master: `name=deployment-puppetmaster03... [01:17:36] Krinkle: ack, it's almost like ./project/common.yaml does not get applied [01:17:54] (PS1) Dzahn: define 2 API appservers per row in codfw as canary API appservers [puppet] - https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) [01:18:36] mutante: I'm not familiar with that file. 
afaik deployment-prep.yaml is the highest level file relevant to beta [01:19:40] ah yeah, that's the one [01:19:40] it's [01:19:41] hieradata/labs/deployment-prep/common.yaml [01:19:45] deployment-prep/common.yaml [01:19:49] yeah that should be fine [01:20:04] yea, and the value it is missing is in there [01:20:20] so right now i have no idea why it's still missing it [01:20:59] it is also the same as in prod hieradata/common.yaml [01:21:25] on another project i also noticed something wasn't applied that was in $projectname/common.yaml [01:21:43] but if that was the case we'd have other issues too [01:22:50] gotta stare at it again tomorrow, bbl [01:23:41] mutante: found it [01:23:43] https://horizon.wikimedia.org/project/puppet/ [01:23:50] etcd_client_srv_domain: '' [01:23:56] (PS5) EBernhardson: Perform weekly dumps of all public media urls [puppet] - https://gerrit.wikimedia.org/r/561356 (https://phabricator.wikimedia.org/T240520) [01:24:08] I guess someone put it there to fix an issue they may have seen with it being undefined [01:24:11] which obscured the issue [01:24:40] (CR) jerkins-bot: [V: -1] Perform weekly dumps of all public media urls [puppet] - https://gerrit.wikimedia.org/r/561356 (https://phabricator.wikimedia.org/T240520) (owner: EBernhardson) [01:25:20] * Krinkle removes it [01:25:50] https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/HEAD/deployment-prep/_.yaml [01:29:27] Operations, Puppet, Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (Krinkle) It didn't work because at the Horizon layer there was a project-level override for this Hie... 
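The behaviour discussed above (a project-level `etcd_client_srv_domain: ''` set in Horizon hiding the value from `hieradata/labs/deployment-prep/common.yaml`) can be illustrated with a small sketch. This is not how Puppet or Horizon implement lookups; it only assumes a first-match-wins search in which the Horizon project data is consulted before the repository file. The `lookup` helper, the layer order, and the value `example.invalid` are hypothetical; only the key name and the empty-string override come from the log.

```python
from typing import Any, Mapping, Sequence, Tuple


def lookup(key: str, layers: Sequence[Tuple[str, Mapping[str, Any]]]) -> Tuple[str, Any]:
    """Return (layer_name, value) from the first layer that defines the key."""
    for name, data in layers:
        # An empty string still counts as "defined", so it shadows later layers.
        if key in data:
            return name, data[key]
    raise KeyError(key)


# Hypothetical layer contents; 'example.invalid' is a placeholder value.
horizon_project = {"etcd_client_srv_domain": ""}
repo_common = {"etcd_client_srv_domain": "example.invalid"}

# Assumed search order: Horizon project-level data first, then the repo file.
layers = [
    ("horizon project override", horizon_project),
    ("hieradata/labs/deployment-prep/common.yaml", repo_common),
]

# With the override in place, the empty string wins.
before = lookup("etcd_client_srv_domain", layers)

# Removing the override lets the repository value through.
del horizon_project["etcd_client_srv_domain"]
after = lookup("etcd_client_srv_domain", layers)

print(before)
print(after)
```

Under these assumptions, the repository value is only seen once the higher-priority override is removed, which matches the order of events in the log.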
[01:29:40] Operations, Puppet, Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (Krinkle) Open→Resolved a:Krinkle Puppet agent now runs cleanly in Beta. [01:29:45] Operations, Puppet, Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (Krinkle) a:Krinkle→Dzahn [01:33:51] (CR) BryanDavis: [C: +1] gridengine: set the webgrid queues to not rerunable [puppet] - https://gerrit.wikimedia.org/r/564174 (https://phabricator.wikimedia.org/T242397) (owner: Bstorm) [01:37:41] (PS6) EBernhardson: Perform weekly dumps of all public media urls [puppet] - https://gerrit.wikimedia.org/r/561356 (https://phabricator.wikimedia.org/T240520) [01:39:14] mutante: sweet, will try in a few :D [01:39:51] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1080.66 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:44:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:46:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:00:02] (PS1) Papaul: DNS: Add mgmt and production DNS for logstash202[6-9] [dns] - https://gerrit.wikimedia.org/r/564181 [02:04:13] (CR) BryanDavis: k8s: Don't restart all k8s machinery to reboot a basic webservice (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/563624 (https://phabricator.wikimedia.org/T228499) (owner: Bstorm) [02:11:05] Operations, ops-codfw, DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (Papaul) [02:11:26] (PS1) Catrope: GrowthExperiments: Enable topic search, behind a hidden preference [mediawiki-config] - https://gerrit.wikimedia.org/r/564183 (https://phabricator.wikimedia.org/T242698) [02:23:19] (CR) BryanDavis: [C: +2] "> Note that it does get through quite a few images before this" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/563610 (owner: Bstorm) [03:12:37] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:44:58] (PS1) Andrew Bogott: Upgrade cloudservices nodes (designate) to OpenStack Pike [puppet] - https://gerrit.wikimedia.org/r/564280 (https://phabricator.wikimedia.org/T241348) [03:48:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:55:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:08:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:17:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:26:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:27:09] (CR) Andrew Bogott: [C: +2] Upgrade cloudservices nodes (designate) to OpenStack Pike [puppet] - https://gerrit.wikimedia.org/r/564280 (https://phabricator.wikimedia.org/T241348) (owner: Andrew Bogott) [04:30:53] (PS1) Andrew Bogott: Fix a VERY OBVIOUS typo setting the designate version [puppet] - https://gerrit.wikimedia.org/r/564331 (https://phabricator.wikimedia.org/T241348) [04:31:57] (CR) Andrew Bogott: [C: +2] Fix a VERY OBVIOUS typo setting the designate version [puppet] - https://gerrit.wikimedia.org/r/564331 (https://phabricator.wikimedia.org/T241348) (owner: Andrew Bogott) [04:39:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:42:55] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:56:15] (PS1) Andrew Bogott: designate: include designate-mdns package [puppet] - https://gerrit.wikimedia.org/r/564362 [04:56:56] Operations, cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (Andrew) Resolved→Open I think these are resolved now (I just reinstalled some packages; not sure what went wrong originally.) [04:57:28] (CR) Andrew Bogott: [C: +2] designate: include designate-mdns package [puppet] - https://gerrit.wikimedia.org/r/564362 (owner: Andrew Bogott) [05:04:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:08:18] (PS1) Andrew Bogott: designate monitoring: allow for different python versions in service monitoring [puppet] - https://gerrit.wikimedia.org/r/564366 (https://phabricator.wikimedia.org/T241348) [05:09:33] (CR) Andrew Bogott: [C: +2] designate monitoring: allow for different python versions in service monitoring [puppet] - https://gerrit.wikimedia.org/r/564366 (https://phabricator.wikimedia.org/T241348) (owner: Andrew Bogott) [05:10:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:12:35] (PS1) Andrew Bogott: nova monitoring: allow for different python versions in service [puppet] - https://gerrit.wikimedia.org/r/564373 (https://phabricator.wikimedia.org/T241347) [05:29:05] !log rebooting cloudservices1004 to make sure all my upgrades are sustainable [05:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:43] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:48:39] Operations: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (faidon) Splitting the internal apt repository from the install roles/servers sounds good -- it's more of a historical artifact than anything else. You probably know this already but do note that the inst... [05:50:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:59:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:00:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3312 after removing partitions from revision table', diff saved to https://phabricator.wikimedia.org/P10140 and previous config saved to /var/cache/conftool/dbconfig/20200114-060003-marostegui.json [06:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103:3312 - T239453', diff saved to https://phabricator.wikimedia.org/P10141 and previous config saved to /var/cache/conftool/dbconfig/20200114-060116-marostegui.json [06:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:25] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:01:33] !log Remove partitions from revision table on db1103:3312 [06:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:15:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:19:31] Looks like a spike of https://phabricator.wikimedia.org/T242437 [06:20:03] !log Deploy schema change on s3 master for officewiki and techconductwiki T242688 [06:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:10] T242688: Extend flow_ext_ref.ref_src_wiki - https://phabricator.wikimedia.org/T242688 [06:23:22] !log Deploy schema change on labswiki (wikitech) T242688 [06:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:48] !log Deploy schema change on flowdb (x1) directly on the master T242688 [06:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:51] T242688: Extend flow_ext_ref.ref_src_wiki - https://phabricator.wikimedia.org/T242688 [06:26:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:28:49] PROBLEM - Check size of conntrack table on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:28:51] PROBLEM - puppet last run on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:28:55] PROBLEM - ores uWSGI web app on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:28:57] PROBLEM - configured eth on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:59] 
PROBLEM - ores uWSGI web app on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:28:59] PROBLEM - Disk space on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:29:01] PROBLEM - Check whether ferm is active by checking the default input chain on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:01] PROBLEM - Check whether ferm is active by checking the default input chain on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:01] PROBLEM - Check systemd state on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:01] PROBLEM - dhclient process on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:05] PROBLEM - Check systemd state on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:07] PROBLEM - Check size of conntrack table on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:07] PROBLEM - Check size of conntrack table on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:09] PROBLEM - DPKG on ores2001 is CRITICAL: connect to address 10.192.0.12 port 
5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:13] PROBLEM - Check systemd state on ores2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:17] PROBLEM - Check systemd state on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:17] PROBLEM - Check systemd state on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:19] PROBLEM - dhclient process on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:21] PROBLEM - configured eth on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:23] PROBLEM - MD RAID on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:25] PROBLEM - Disk space on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:29:27] PROBLEM - puppet last run on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:29] PROBLEM - Disk space on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2006&var-datasource=codfw+prometheus/ops [06:29:31] PROBLEM - Check whether ferm is active by checking the default input chain on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:31] PROBLEM - Check size of conntrack table on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:33] PROBLEM - ores uWSGI web app on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:35] PROBLEM - configured eth on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:39] PROBLEM - MD RAID on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:39] PROBLEM - MD RAID on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:41] PROBLEM - Check systemd state on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:45] PROBLEM - MD RAID on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:45] PROBLEM - configured eth on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:53] PROBLEM - Check whether ferm is active by checking the default input chain on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:55] PROBLEM - configured eth on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:57] PROBLEM - configured eth on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:57] PROBLEM - Check systemd state on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:57] PROBLEM - Check size of conntrack table on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:03] PROBLEM - ores uWSGI web app on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:05] PROBLEM - Check systemd state on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:07] PROBLEM - MD RAID on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:07] PROBLEM - Check whether ferm is active by checking the default input chain on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:07] PROBLEM - Check whether ferm is active by checking the default input chain on ores2001 is 
CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:11] PROBLEM - Disk space on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:30:11] PROBLEM - dhclient process on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:13] PROBLEM - Disk space on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:30:13] PROBLEM - DPKG on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:15] PROBLEM - ores uWSGI web app on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:17] PROBLEM - DPKG on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:21] PROBLEM - dhclient process on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:23] PROBLEM - ores uWSGI web app on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:23] PROBLEM - MD RAID on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:29] PROBLEM - configured eth on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:30:33] PROBLEM - dhclient process on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:33] PROBLEM - Disk space on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [06:30:33] PROBLEM - Disk space on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:30:35] PROBLEM - dhclient process on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:37] errrr? 
[06:30:37] PROBLEM - Check size of conntrack table on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:37] PROBLEM - Check size of conntrack table on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:37] PROBLEM - DPKG on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:39] what's going on? [06:30:43] PROBLEM - DPKG on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:43] PROBLEM - DPKG on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:49] RECOVERY - configured eth on ores2006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:57] RECOVERY - Check systemd state on ores2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:57] RECOVERY - Check size of conntrack table on ores2006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:31:03] PROBLEM - puppet last run on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:31:19] RECOVERY - Disk space on ores2006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2006&var-datasource=codfw+prometheus/ops [06:31:47] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/NTP [06:31:55] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:01] RECOVERY - Disk space on ores2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:32:05] PROBLEM - puppet last run on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:05] RECOVERY - DPKG on ores2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:32:09] PROBLEM - puppet last run on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:21] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (DannyS712) [06:32:23] RECOVERY - dhclient process on ores2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:32:27] RECOVERY - Check size of conntrack table on ores2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:32:39] RECOVERY - Check whether ferm is active by checking the default input chain on ores2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:33:19] RECOVERY - MD RAID on ores2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:33:33] RECOVERY - configured eth on ores2002 is OK: OK - interfaces up 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:45] RECOVERY - Check whether ferm is active by checking the default input chain on ores2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:33:47] RECOVERY - Disk space on ores2008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:33:57] RECOVERY - dhclient process on ores2008 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:15] RECOVERY - Check size of conntrack table on ores2008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:34:19] RECOVERY - DPKG on ores2008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:34:25] RECOVERY - Disk space on ores2009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:34:47] RECOVERY - dhclient process on ores2009 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:49] RECOVERY - configured eth on ores2008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:49] RECOVERY - MD RAID on ores2008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:07] RECOVERY - MD RAID on ores2009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:09] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:13] RECOVERY - configured eth on ores2009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:35:19] RECOVERY - Check whether ferm is active by checking the default input chain on ores2009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:35:23] RECOVERY - Check systemd state on ores2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:25] RECOVERY - Check size of conntrack table on ores2009 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:41] RECOVERY - DPKG on ores2009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:36:51] RECOVERY - puppet last run on ores2009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:37:59] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:39:00] RECOVERY - configured eth on ores2003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:39:11] RECOVERY - MD RAID on ores2003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:39:35] RECOVERY - dhclient process on ores2003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:39:37] RECOVERY - Disk space on ores2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [06:39:41] RECOVERY - Check size of conntrack table on ores2003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:39:47] RECOVERY - DPKG on ores2003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:39:55] RECOVERY - Check whether ferm is active by checking the default input chain on ores2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:39:55] RECOVERY - dhclient process on ores2005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:40:11] RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:11] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:17] RECOVERY - Disk space on ores2005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:40:25] RECOVERY - Check whether ferm is active by checking the default input chain on ores2003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:40:25] RECOVERY - Check size of conntrack table on ores2005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:40:29] RECOVERY - puppet last run on ores2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:40:29] RECOVERY - configured eth on ores2005 is OK: OK - interfaces up 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:40:39] RECOVERY - MD RAID on ores2005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:41:17] RECOVERY - MD RAID on ores2001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:41:21] RECOVERY - configured eth on ores2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:41:27] RECOVERY - Disk space on ores2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:41:29] RECOVERY - DPKG on ores2005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:41:43] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:49] RECOVERY - Check size of conntrack table on ores2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:41:51] RECOVERY - DPKG on ores2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:42:11] PROBLEM - ores_workers_running on ores2001 is CRITICAL: PROCS CRITICAL: 66 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:42:49] RECOVERY - Check whether ferm is active by checking the default input chain on ores2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:42:53] PROBLEM - ores_workers_running on ores2004 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:42:53] RECOVERY - dhclient process on ores2001 is OK: 
PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:43:43] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:44:01] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:46:53] RECOVERY - puppet last run on ores2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:48:34] (PS1) Marostegui: install_server: Allow reimage of db1107 [puppet] - https://gerrit.wikimedia.org/r/564445 [06:48:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:49:54] (CR) Marostegui: [C: +2] install_server: Allow reimage of db1107 [puppet] - https://gerrit.wikimedia.org/r/564445 (owner: Marostegui) [06:51:13] Operations, ops-codfw, DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (Marostegui) p:Triage→High [06:51:40] Operations, ops-eqiad, DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (Marostegui) p:Triage→Normal [06:52:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:54:43] RECOVERY - Check systemd state on ores2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:55:37] RECOVERY - ores_workers_running on ores2004 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [07:02:33] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2005 is OK: OK: synced at Tue 2020-01-14 07:02:32 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:06:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:08:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:08:25] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:15:27] this is Telia's transport -^ [07:15:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:16:02] there is a planned outage about it, all good [07:19:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, 
unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:19:19] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:21:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:25:39] (PS1) Legoktm: codesearch: Ensure /srv/hound is writable by codesearch user [puppet] - https://gerrit.wikimedia.org/r/564466 (https://phabricator.wikimedia.org/T242319) [07:26:23] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:26:35] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:31:14] Operations, ORES, Scoring-platform-team: Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (elukey) p:Triage→High [07:31:27] marostegui: --^ if you want to add more info [07:33:02] !log add peering to AS26744 in eqiad, eqord and eqdfw [07:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:41:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, 
interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:43:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:46:02] Puppet, VPS-project-codesearch, Patch-For-Review: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (Legoktm) The codesearch5 instance is now running Debian Buster plus the `role::codesearch` puppet role with lots of help from @Dzahn Remaining todos: * Make /srv/hound writable by cod... [07:55:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:59:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:02:12] (CR) Filippo Giunchedi: [C: +2] DNS: Add mgmt and production DNS for logstash202[6-9] [dns] - https://gerrit.wikimedia.org/r/564181 (owner: Papaul) [08:05:16] (CR) Filippo Giunchedi: "LGTM, although please add tests using existing examples in modules/mtail" [puppet] - https://gerrit.wikimedia.org/r/564129 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite) [08:13:05] (PS6) Joal: Add mediawiki-history-dumps rsync to labstore [puppet] - https://gerrit.wikimedia.org/r/564066 [08:27:56] (PS1) 
Muehlenhoff: Add dpifke to exception list, uses Yubikey backed key [puppet] - https://gerrit.wikimedia.org/r/564524 [08:29:22] (CR) Muehlenhoff: [C: +2] Add dpifke to exception list, uses Yubikey backed key [puppet] - https://gerrit.wikimedia.org/r/564524 (owner: Muehlenhoff) [08:29:50] Operations: Anycast for webproxies - https://phabricator.wikimedia.org/T242715 (ayounsi) p:Triage→Normal [08:31:01] (CR) Muehlenhoff: [C: +2] Remove debug proxy roles/classes [puppet] - https://gerrit.wikimedia.org/r/564044 (https://phabricator.wikimedia.org/T224567) (owner: Muehlenhoff) [08:32:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:38:08] Operations: Anycast for webproxies - https://phabricator.wikimedia.org/T242715 (Joe) One problem I see with this is - proxy IPs regularly get banned by third-party services by accident. So having multiple *external* IPs, and being able to switch between them, is a plus. I think you're right that the proxies... [08:38:11] bblack: very valid question, the documentation says it can go at any time before the wiki creation. 
[08:38:30] but I can't say 100% it would break anything if you merge the patch and I hit ng.wikimedia.org [08:39:15] <_joe_> Amir1: if you still haven't configured apache and/or mediawiki, that would cause nothing [08:39:29] (CR) Elukey: "Now it renders as:" [puppet] - https://gerrit.wikimedia.org/r/564066 (owner: Joal) [08:40:25] _joe_: oh thanks [08:41:13] <_joe_> I think ng.wikimedia.org matches the virtualhost for *.wikimedia.org [08:41:48] <_joe_> so it goes to the www portal [08:41:51] The DNS record exists already https://phabricator.wikimedia.org/diffusion/ODNS/browse/master/templates/wikimedia.org [08:41:56] <_joe_> https://ng.wikimedia.org/ <- it works [08:42:00] https://phabricator.wikimedia.org/diffusion/ODNS/browse/master/templates/wikimedia.org$764 [08:42:19] <_joe_> ok so the only problem is that now the caches have something about that site memorized [08:42:35] <_joe_> we might need to purge them once the setup is done [08:50:32] Operations, Traffic, netops, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (ayounsi) a:ayounsi→BBlack That sounds like a good idea to me, @BBlack for a final opinion, and I can take care of it this Q if good to go. [09:03:08] (PS1) Muehlenhoff: Switch conf/codfw and notebook* servers to standard Partman recipe [puppet] - https://gerrit.wikimedia.org/r/564550 (https://phabricator.wikimedia.org/T156955) [09:04:24] (CR) Urbanecm: "Is this necessary? 
I've already enabled partial blocks at enwiki per T242569, and given also commons asked for partial blocks itself, they" [mediawiki-config] - https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: Tchanders) [09:06:43] Operations, netops: Stale LibreNMS ports - https://phabricator.wikimedia.org/T242318 (ayounsi) Open→Resolved `root@cumin1001:~# for i in `mysql.py -hdb1135 -e "select table_name from information_schema.columns where column_name like 'device_id'" -BN`; do echo $i; mysql.py -hdb1135 librenms -e "de... [09:10:45] Operations, Release Pipeline, Release-Engineering-Team, serviceops, Patch-For-Review: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (MoritzMuehlenhoff) p:Triage→Normal [09:12:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:14:17] quickly going to deploy this: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/564555 [09:15:51] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:35:28] (CR) Filippo Giunchedi: [C: +1] Switch conf/codfw and notebook* servers to standard Partman recipe (1 comment) [puppet] - https://gerrit.wikimedia.org/r/564550 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff) [09:38:08] (PS1) Elukey: Increase Spark's crypto settings in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/564562 (https://phabricator.wikimedia.org/T240934) [09:40:15] (CR) Elukey: [C: +1] Switch conf/codfw and notebook* servers to standard Partman recipe (1 comment) [puppet] - https://gerrit.wikimedia.org/r/564550 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff) [09:40:25] (CR) Elukey: [C: +2] Increase Spark's crypto settings in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/564562 (https://phabricator.wikimedia.org/T240934) (owner: Elukey) [09:43:10] Operations, ops-eqiad, serviceops: (No Need By Date Provided) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (jijiki) @Jclark-ctr Can you provide a date that is convenient for you for racking these? Thank you! 
[09:44:04] (PS2) Filippo Giunchedi: ores: ship to logstash via the kafka logging pipeline [puppet] - https://gerrit.wikimedia.org/r/502527 (https://phabricator.wikimedia.org/T213899) (owner: Herron) [09:44:19] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.14/extensions/Wikibase/lib/includes/Store/Sql/Terms: [[gerrit:564555|wbterms: Add Statsd metrics in critical parts of the new term store]] (duration: 00m 57s) [09:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:07] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/20335/ores1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/502527 (https://phabricator.wikimedia.org/T213899) (owner: Herron) [09:46:50] (PS2) Muehlenhoff: Switch conf/codfw and notebook* servers to standard Partman recipe [puppet] - https://gerrit.wikimedia.org/r/564550 (https://phabricator.wikimedia.org/T156955) [09:47:09] (CR) Muehlenhoff: Switch conf/codfw and notebook* servers to standard Partman recipe (1 comment) [puppet] - https://gerrit.wikimedia.org/r/564550 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff) [09:50:17] (PS1) Ayounsi: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 [homer/public] - https://gerrit.wikimedia.org/r/564564 (https://phabricator.wikimedia.org/T207753) [09:50:45] akosiaris: I think the ores logging change is good to go, re: deployment I was thinking puppet-merge then https://wikitech.wikimedia.org/wiki/ORES/Deployment#Puppet-managed_config_changes [09:51:40] (CR) Filippo Giunchedi: [C: +1] Switch conf/codfw and notebook* servers to standard Partman recipe [puppet] - https://gerrit.wikimedia.org/r/564550 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff) [09:53:11] (CR) Ayounsi: "Diff for 3 devices: ['cr2-esams.wikimedia.org', 'cr3-esams.wikimedia.org', 'cr3-knams.wikimedia.org']" [homer/public] - https://gerrit.wikimedia.org/r/564564 
(https://phabricator.wikimedia.org/T207753) (owner: Ayounsi) [09:54:32] (Abandoned) Ayounsi: Depool esams for esams/knams work [dns] - https://gerrit.wikimedia.org/r/552792 (owner: Ayounsi) [10:06:29] awight: Hey, I see lots of Cite-related fatals and errors in logs, is it on the radar? https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor [10:07:09] oh we have this: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/564002 [10:07:23] !log installing remaining cyrus-sasl security updates [10:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:38] (PS1) Vgutierrez: Release 8.0.5-1wm12 [debs/trafficserver] - https://gerrit.wikimedia.org/r/564584 (https://phabricator.wikimedia.org/T242620) [10:09:48] (CR) jerkins-bot: [V: -1] Release 8.0.5-1wm12 [debs/trafficserver] - https://gerrit.wikimedia.org/r/564584 (https://phabricator.wikimedia.org/T242620) (owner: Vgutierrez) [10:09:54] that was fast :/ [10:10:12] <_joe_> yeah we added a condition to jerkins-bot [10:10:28] <_joe_> if ZUUL_SUBMITTER == "vgutierrez" fail [10:10:37] oh cool [10:10:44] /nick _joe_ [10:12:37] <_joe_> lol [10:13:42] O:) [10:21:47] (CR) Faidon Liambotis: [C: -1] esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/564564 (https://phabricator.wikimedia.org/T207753) (owner: Ayounsi) [10:24:56] (PS2) Ayounsi: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 [homer/public] - https://gerrit.wikimedia.org/r/564564 (https://phabricator.wikimedia.org/T207753) [10:25:15] (CR) Ayounsi: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/564564 (https://phabricator.wikimedia.org/T207753) (owner: Ayounsi) [10:27:41] Operations, Wikimedia-Logstash, observability, 
User-fgiunchedi: Move thumbor to the logging pipeline - https://phabricator.wikimedia.org/T242609 (MoritzMuehlenhoff) p:Triage→Normal [10:27:57] Operations, serviceops, Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (MoritzMuehlenhoff) p:Triage→Normal [10:40:08] !log upgrade ats to 8.0.5-1wm12 in cp4026 and cp4032 - T242620 [10:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:11] T242620: ats-tls is having issues when varnish-fe goes away - https://phabricator.wikimedia.org/T242620 [10:51:04] (PS1) Muehlenhoff: Switch url-downloader.codfw to urldownloader2001 [dns] - https://gerrit.wikimedia.org/r/564588 (https://phabricator.wikimedia.org/T224551) [10:52:50] Operations, Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (Mholloway) [10:52:52] Operations, MassMessage, User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (Mholloway) [10:52:55] Operations, Machine vision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (Mholloway) Open→Resolved a:Mholloway Incident report is in review. 
[10:54:33] (CR) Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/20325/" [puppet] - https://gerrit.wikimedia.org/r/564046 (https://phabricator.wikimedia.org/T238900) (owner: Vgutierrez) [10:54:57] (CR) Vgutierrez: [C: +2] redirects.dat: Funnel fixcopyright.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/564020 (https://phabricator.wikimedia.org/T239141) (owner: Vgutierrez) [11:00:46] (PS1) Vgutierrez: lvs: Set realserver_ips on ncredir ulsfo instances [puppet] - https://gerrit.wikimedia.org/r/564598 (https://phabricator.wikimedia.org/T242321) [11:04:41] (CR) Muehlenhoff: [C: +1] "Looks great!" [software/debmonitor] - https://gerrit.wikimedia.org/r/564138 (owner: Volans) [11:05:57] (PS1) Vgutierrez: lvs: Add ulsfo ncredir configuration [puppet] - https://gerrit.wikimedia.org/r/564603 (https://phabricator.wikimedia.org/T242321) [11:09:01] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 106385384 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:09:11] (CR) Volans: [C: +2] binary packages: optimize queries (part 2) [software/debmonitor] - https://gerrit.wikimedia.org/r/564138 (owner: Volans) [11:09:13] (CR) Ema: [C: +1] Add ncredir-lb.ulsfo.wikimedia.org DNS records [dns] - https://gerrit.wikimedia.org/r/564051 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez) [11:09:50] (CR) Ema: "pcc output would be good!" 
[puppet] - https://gerrit.wikimedia.org/r/564598 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez) [11:10:41] (CR) Ema: "pcc would be great here too" [puppet] - https://gerrit.wikimedia.org/r/564603 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez) [11:10:51] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2312 and 97 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:11:18] Operations, Traffic, Patch-For-Review: ats-tls is having issues when varnish-fe goes away - https://phabricator.wikimedia.org/T242620 (ema) p:Triage→High [11:11:22] (Merged) jenkins-bot: binary packages: optimize queries (part 2) [software/debmonitor] - https://gerrit.wikimedia.org/r/564138 (owner: Volans) [11:13:41] (CR) Tchanders: "Urbanecm - the banner is now a post-deployment banner (see the final bullet points in the description of T240300)" [mediawiki-config] - https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: Tchanders) [11:15:20] !log Updating puppet-compiler facts [11:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:27] (PS1) Volans: host packages: optimize table [software/debmonitor] - https://gerrit.wikimedia.org/r/564614 [11:17:56] (CR) Volans: "I've tested this on the codfw slave of m2 and has the expected effect." 
[software/debmonitor] - https://gerrit.wikimedia.org/r/564614 (owner: Volans) [11:24:04] (PS2) Matthias Mullie: Add 3d-patents page to wgForceUIMsgAsContentMsg [mediawiki-config] - https://gerrit.wikimedia.org/r/416730 [11:25:17] (Abandoned) Matthias Mullie: [WIP] Add 3d2png scap targets [puppet] - https://gerrit.wikimedia.org/r/406997 (owner: Matthias Mullie) [11:28:36] (CR) Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/20338/" [puppet] - https://gerrit.wikimedia.org/r/564598 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez) [11:30:24] (CR) Vgutierrez: "pcc is happy here as well: https://puppet-compiler.wmflabs.org/compiler1001/20339/" [puppet] - https://gerrit.wikimedia.org/r/564603 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez) [11:31:29] Operations, ops-eqiad, serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (elukey) Hi @Jclark-ctr, any timeline for these hosts to be racked? 
[11:34:41] (03PS2) 10Volans: hosts/images packages: optimize tables [software/debmonitor] - 10https://gerrit.wikimedia.org/r/564614 [11:37:12] (03PS2) 10Vgutierrez: Add ncredir-lb.ulsfo.wikimedia.org DNS records [dns] - 10https://gerrit.wikimedia.org/r/564051 (https://phabricator.wikimedia.org/T242321) [11:37:47] (03PS1) 10Ema: cache: raise vm.max_map_count sysctl [puppet] - 10https://gerrit.wikimedia.org/r/564616 (https://phabricator.wikimedia.org/T242417) [11:39:06] (03CR) 10Vgutierrez: [C: 03+2] Add ncredir-lb.ulsfo.wikimedia.org DNS records [dns] - 10https://gerrit.wikimedia.org/r/564051 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [11:40:03] (03CR) 10Ema: [C: 03+1] lvs: Set realserver_ips on ncredir ulsfo instances [puppet] - 10https://gerrit.wikimedia.org/r/564598 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [11:40:26] (03CR) 10Ema: [C: 03+1] lvs: Add ulsfo ncredir configuration [puppet] - 10https://gerrit.wikimedia.org/r/564603 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [11:41:01] (03CR) 10Volans: "Once applied to the test instance in cloud the relevant executed queries were:" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/564614 (owner: 10Volans) [11:41:39] (03CR) 10Vgutierrez: [C: 03+2] lvs: Set realserver_ips on ncredir ulsfo instances [puppet] - 10https://gerrit.wikimedia.org/r/564598 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [11:42:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/564614 (owner: 10Volans) [11:42:53] (03CR) 10Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1002/20340/" [puppet] - 10https://gerrit.wikimedia.org/r/564616 (https://phabricator.wikimedia.org/T242417) (owner: 10Ema) [11:46:52] (03CR) 10Vgutierrez: [C: 03+2] lvs: Add ulsfo ncredir configuration [puppet] - 10https://gerrit.wikimedia.org/r/564603 (https://phabricator.wikimedia.org/T242321) (owner: 
10Vgutierrez) [11:47:58] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,name=ncredir4001.ulsfo.wmnet [11:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:04] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,name=ncredir4002.ulsfo.wmnet [11:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:22] !log restarting pybal on lvs4007 (secondary LVS) - T242321 [11:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:25] T242321: Provide non-canonical-redirect service from every datacenter - https://phabricator.wikimedia.org/T242321 [11:49:51] (03PS1) 10Jbond: puppet-compiler: fix double owner definition [puppet] - 10https://gerrit.wikimedia.org/r/564617 [11:51:52] !log restarting pybal on lvs4005 (high-traffic1 LVS) - T242321 [11:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:28] willikins:~ vgutierrez$ curl --resolve en.wikipedia.com:443:$(dig +short ncredir-lb.ulsfo.wikimedia.org) https://en.wikipedia.com -o /dev/null -v 2>&1 |grep location: [11:53:28] < location: https://en.wikipedia.org/ [11:53:29] :D [11:53:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/564617 (owner: 10Jbond) [11:54:36] (03CR) 10Jbond: [C: 03+2] puppet-compiler: fix double owner definition [puppet] - 10https://gerrit.wikimedia.org/r/564617 (owner: 10Jbond) [11:57:55] (03PS1) 10Vgutierrez: Pool ulsfo for ncredir service [dns] - 10https://gerrit.wikimedia.org/r/564627 (https://phabricator.wikimedia.org/T242321) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200114T1200). 
[12:00:04] awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:39] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ - https://phabricator.wikimedia.org/T239141 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez ` vgutierrez@mw1321:~$ curl --resolve fixcopyr... [12:00:54] 10Operations, 10Cleanup, 10Traffic, 10fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Vgutierrez) [12:01:28] (03CR) 10Muehlenhoff: [C: 03+1] "That seems sensible. Elasticsearch is also bumping this sysctl via its init script and Cassandra hosts also raise the default via the cas" [puppet] - 10https://gerrit.wikimedia.org/r/564616 (https://phabricator.wikimedia.org/T242417) (owner: 10Ema) [12:01:32] * Urbanecm around, but leaving awight to do his own SWAT [12:01:32] o/ [12:01:53] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [12:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:15] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:23] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [12:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:29] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:01] (03PS1) 10Jbond: add default ops for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/564636 [12:04:10] (03CR) 10Jbond: [C: 03+2] add default ops for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/564636 (owner: 
10Jbond) [12:10:27] (03PS1) 10Jbond: profile::puppetdb fix jvm_opts in labs.yaml [puppet] - 10https://gerrit.wikimedia.org/r/564647 [12:11:20] (03CR) 10Jbond: [C: 03+2] profile::puppetdb fix jvm_opts in labs.yaml [puppet] - 10https://gerrit.wikimedia.org/r/564647 (owner: 10Jbond) [12:11:46] (03CR) 10Vgutierrez: [C: 03+1] cache: raise vm.max_map_count sysctl [puppet] - 10https://gerrit.wikimedia.org/r/564616 (https://phabricator.wikimedia.org/T242417) (owner: 10Ema) [12:13:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nova monitoring: allow for different python versions in service [puppet] - 10https://gerrit.wikimedia.org/r/564373 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [12:14:51] !log vgutierrez@puppetmaster1001 conftool action : set/weight=1; selector: service=nginx,name=ncredir4002.ulsfo.wmnet [12:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:58] !log vgutierrez@puppetmaster1001 conftool action : set/weight=1; selector: service=nginx,name=ncredir4001.ulsfo.wmnet [12:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:25] !log vgutierrez@puppetmaster1001 conftool action : set/weight=1; selector: service=nginx,name=ncredir3001.esams.wmnet [12:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:30] !log vgutierrez@puppetmaster1001 conftool action : set/weight=1; selector: service=nginx,name=ncredir3002.esams.wmnet [12:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:28] (03PS1) 10Jbond: profile::puppetdb add defaults to labs.yaml [puppet] - 10https://gerrit.wikimedia.org/r/564652 [12:21:11] (03PS1) 10Vgutierrez: Add ncredir500[12] DNS records [dns] - 10https://gerrit.wikimedia.org/r/564655 (https://phabricator.wikimedia.org/T242321) [12:21:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] haproxy for neutron: As of pike, the healthcheck url returns 405. 
[puppet] - 10https://gerrit.wikimedia.org/r/561806 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [12:22:03] (03PS1) 10Andrew Bogott: Horizon: put in maintenance mode for the ocata=>pike upgrade [puppet] - 10https://gerrit.wikimedia.org/r/564656 (https://phabricator.wikimedia.org/T241347) [12:22:05] (03CR) 10Jbond: [C: 03+2] "PCC no-op:" [puppet] - 10https://gerrit.wikimedia.org/r/564652 (owner: 10Jbond) [12:22:07] (03PS1) 10Andrew Bogott: Openstack: move eqiad1 to version 'pike' [puppet] - 10https://gerrit.wikimedia.org/r/564657 (https://phabricator.wikimedia.org/T241347) [12:22:09] (03PS1) 10Andrew Bogott: Revert "Horizon: put in maintenance mode for the ocata=>pike upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/564658 (https://phabricator.wikimedia.org/T241347) [12:24:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Horizon: put in maintenance mode for the ocata=>pike upgrade [puppet] - 10https://gerrit.wikimedia.org/r/564656 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [12:24:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Openstack: move eqiad1 to version 'pike' [puppet] - 10https://gerrit.wikimedia.org/r/564657 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [12:25:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Horizon: put in maintenance mode for the ocata=>pike upgrade [puppet] - 10https://gerrit.wikimedia.org/r/564656 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [12:28:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Openstack: move eqiad1 to version 'pike' [puppet] - 10https://gerrit.wikimedia.org/r/564657 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [12:31:45] (03PS1) 10Andrew Bogott: Remove hieradata/common/openstack.yaml [puppet] - 10https://gerrit.wikimedia.org/r/564662 [12:31:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] haproxy for neutron: As of pike, the healthcheck url returns 405. 
[puppet] - 10https://gerrit.wikimedia.org/r/561806 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [12:36:38] (03PS1) 10Jbond: profile::puppetdb::microsite add default for labs [puppet] - 10https://gerrit.wikimedia.org/r/564666 [12:37:35] (03CR) 10Jbond: [C: 03+2] profile::puppetdb::microsite add default for labs [puppet] - 10https://gerrit.wikimedia.org/r/564666 (owner: 10Jbond) [12:43:55] (03PS1) 10Jbond: profile::puppet_compiler fix call to conftool [puppet] - 10https://gerrit.wikimedia.org/r/564669 [12:44:34] (03CR) 10ArielGlenn: "There will need to be an entry added to the cleanup manifests too, so that these don't accumulate forever. See https://gerrit.wikimedia.or" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/561356 (https://phabricator.wikimedia.org/T240520) (owner: 10EBernhardson) [12:46:03] (03CR) 10Jbond: [C: 03+2] profile::puppet_compiler fix call to conftool [puppet] - 10https://gerrit.wikimedia.org/r/564669 (owner: 10Jbond) [13:35:07] (03PS2) 10Muehlenhoff: Switch url-downloader.codfw to urldownloader2001 [dns] - 10https://gerrit.wikimedia.org/r/564588 (https://phabricator.wikimedia.org/T224551) [13:37:55] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10BBlack) +1 from me, this was one of the many things we made the ganeti clusters for :) [13:41:22] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) @Papaul can you double check (maybe even with the vendor) if there is a way to disable the 10G port for now? 
[13:44:07] (03CR) 10Urbanecm: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [13:44:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "Horizon: put in maintenance mode for the ocata=>pike upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/564658 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [13:48:23] (03CR) 10Ema: [C: 03+2] cache: raise vm.max_map_count sysctl [puppet] - 10https://gerrit.wikimedia.org/r/564616 (https://phabricator.wikimedia.org/T242417) (owner: 10Ema) [13:49:52] 10Operations: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10MoritzMuehlenhoff) >>! In T242602#5800549, @faidon wrote: > Splitting the internal apt repository from the install roles/servers sounds good -- it's more of a historical artifact than anything else. You... [13:52:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1080 for upgrade', diff saved to https://phabricator.wikimedia.org/P10142 and previous config saved to /var/cache/conftool/dbconfig/20200114-135238-marostegui.json [13:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:49] (03CR) 10Tchanders: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [13:54:25] !log Upgrade db1080 [13:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:42] (03PS1) 10Andrew Bogott: nova-fullstack: update to track changes in novaclient bindings [puppet] - 10https://gerrit.wikimedia.org/r/564677 (https://phabricator.wikimedia.org/T241347) [13:58:55] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron l3 agent: pike: disable ravd [puppet] - 10https://gerrit.wikimedia.org/r/564678 (https://phabricator.wikimedia.org/T241347) [14:00:04] liw and brennen: It is that lovely time of the day again! 
You are hereby commanded to deploy Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200114T1400). [14:00:26] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [14:00:49] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: update to track changes in novaclient bindings [puppet] - 10https://gerrit.wikimedia.org/r/564677 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [14:01:10] (03PS1) 10Filippo Giunchedi: prometheus: bump 'global' retention to 2.25 years [puppet] - 10https://gerrit.wikimedia.org/r/564679 [14:01:12] (03PS1) 10Filippo Giunchedi: prometheus: bump 'ops' retention to 4.5 months [puppet] - 10https://gerrit.wikimedia.org/r/564680 [14:02:10] PROBLEM - nova-compute proc minimum on cloudvirt1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:16] PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:25] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:30] PROBLEM - nova-compute proc minimum on cloudvirt1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:33] PROBLEM - nova-compute proc maximum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:35] PROBLEM - nova-compute proc maximum on cloudvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:43] PROBLEM - nova-compute proc minimum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:49] PROBLEM - nova-compute proc minimum on cloudvirt1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:53] PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:54] PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:55] PROBLEM - nova-compute proc minimum on cloudvirt1012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:56] PROBLEM - nova-compute proc minimum on cloudvirt1014 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:56] PROBLEM - nova-compute proc maximum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[14:02:57] PROBLEM - nova-compute proc maximum on cloudvirt1008 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:59] PROBLEM - nova-compute proc maximum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:03:00] PROBLEM - nova-compute proc maximum on cloudvirt1012 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:03:00] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [14:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:01] PROBLEM - nova-compute proc minimum on cloudvirt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:03:03] PROBLEM - nova-compute proc minimum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:03:08] PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:03:08] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:03:08] (03CR) 10jerkins-bot: [V: 04-1] prometheus: bump 'global' retention to 2.25 years [puppet] - 10https://gerrit.wikimedia.org/r/564679 (owner: 
10Filippo Giunchedi) [14:03:11] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:25] train: I'm running late, starting with branch cut [14:06:53] (03CR) 10Andrew Bogott: [C: 03+1] openstack: neutron l3 agent: pike: disable ravd [puppet] - 10https://gerrit.wikimedia.org/r/564678 (https://phabricator.wikimedia.org/T241347) (owner: 10Arturo Borrero Gonzalez) [14:07:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron l3 agent: pike: disable ravd [puppet] - 10https://gerrit.wikimedia.org/r/564678 (https://phabricator.wikimedia.org/T241347) (owner: 10Arturo Borrero Gonzalez) [14:09:48] !log upgrade ats to 8.0.5-1wm12 in cp5006 and cp5012 - T242620 [14:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:57] T242620: ats-tls is having issues when varnish-fe goes away - https://phabricator.wikimedia.org/T242620 [14:12:34] !log branch cut for 1.35.0-wmf.15 [14:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:40] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron l3 agent: pike: fix radvd mask [puppet] - 10https://gerrit.wikimedia.org/r/564682 (https://phabricator.wikimedia.org/T241347) [14:13:14] (03PS1) 10Andrew Bogott: nova-compute monitoring: support multiple python versions [puppet] - 10https://gerrit.wikimedia.org/r/564683 (https://phabricator.wikimedia.org/T241347) [14:13:35] (03PS1) 10BBlack: Set up transparency-archive microsite [puppet] - 10https://gerrit.wikimedia.org/r/564684 (https://phabricator.wikimedia.org/T230638) [14:13:41] (03PS1) 10BBlack: Add transparency-archive.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/564685 (https://phabricator.wikimedia.org/T230638) [14:14:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron l3 agent: pike: fix radvd mask [puppet] - 10https://gerrit.wikimedia.org/r/564682 
(https://phabricator.wikimedia.org/T241347) (owner: 10Arturo Borrero Gonzalez) [14:14:31] (03CR) 10BBlack: [C: 03+2] Add transparency-archive.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/564685 (https://phabricator.wikimedia.org/T230638) (owner: 10BBlack) [14:15:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nova-compute monitoring: support multiple python versions [puppet] - 10https://gerrit.wikimedia.org/r/564683 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [14:15:39] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute monitoring: support multiple python versions [puppet] - 10https://gerrit.wikimedia.org/r/564683 (https://phabricator.wikimedia.org/T241347) (owner: 10Andrew Bogott) [14:15:58] !log push firewall policies to pfw3-codfw - T242681 [14:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1100 - https://phabricator.wikimedia.org/T241506 (10Jclark-ctr) Replaced Disk #0 [14:17:02] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:08] RECOVERY - nova-compute proc minimum on cloudvirt1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:11] RECOVERY - nova-compute proc maximum on cloudvirt1026 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:22] RECOVERY - nova-compute proc minimum on cloudvirt1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:30] RECOVERY - nova-compute proc minimum on cloudvirt1019 is 
OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:31] RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:33] RECOVERY - nova-compute proc minimum on cloudvirt1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:34] (03PS3) 10Muehlenhoff: Switch url-downloader.codfw to urldownloader2001 [dns] - 10https://gerrit.wikimedia.org/r/564588 (https://phabricator.wikimedia.org/T224551) [14:17:34] RECOVERY - nova-compute proc maximum on cloudvirt1025 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:34] RECOVERY - nova-compute proc minimum on cloudvirt1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:35] RECOVERY - nova-compute proc maximum on cloudvirt1008 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:37] RECOVERY - nova-compute proc maximum on cloudvirt1028 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:38] RECOVERY - nova-compute proc maximum on cloudvirt1012 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:39] RECOVERY - nova-compute proc 
minimum on cloudvirt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:41] RECOVERY - nova-compute proc minimum on cloudvirt1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:45] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:46] RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:18:13] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) a:05BBlack→03ayounsi [14:18:45] RECOVERY - nova-compute proc minimum on cloudvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:18:51] RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:19:10] RECOVERY - nova-compute proc maximum on cloudvirt1007 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:21:03] (03CR) 10Muehlenhoff: [C: 03+2] Switch url-downloader.codfw to urldownloader2001 [dns] - 10https://gerrit.wikimedia.org/r/564588 (https://phabricator.wikimedia.org/T224551) (owner: 10Muehlenhoff) [14:21:37] !log push firewall policies to pfw3-eqiad - T242681 
[14:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:47] (03CR) 10BBlack: [C: 03+2] Set up transparency-archive microsite [puppet] - 10https://gerrit.wikimedia.org/r/564684 (https://phabricator.wikimedia.org/T230638) (owner: 10BBlack) [14:22:22] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1100 - https://phabricator.wikimedia.org/T241506 (10Marostegui) Thanks - it is now rebuilding. I will close the task once it is finished ` PD: 0 Information Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 0, Span: 0, Arm: 0 Enclosure po... [14:22:28] (03CR) 10Vgutierrez: [C: 03+1] ATS: Deploy acme-chief version of unified certificate on text [puppet] - 10https://gerrit.wikimedia.org/r/561883 (https://phabricator.wikimedia.org/T234803) (owner: 10Ema) [14:23:05] RECOVERY - nova-compute proc minimum on cloudvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:24:42] !log Stop db1080 and db1107 replication in sync [14:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:02] !log Move db1114 under db1080 [14:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:07] (03PS1) 10BBlack: transparency-archive: correct template name [puppet] - 10https://gerrit.wikimedia.org/r/564689 (https://phabricator.wikimedia.org/T230638) [14:26:18] (03CR) 10BBlack: [V: 03+2 C: 03+2] transparency-archive: correct template name [puppet] - 10https://gerrit.wikimedia.org/r/564689 (https://phabricator.wikimedia.org/T230638) (owner: 10BBlack) [14:26:50] (03PS10) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. 
[puppet] - 10https://gerrit.wikimedia.org/r/558620 [14:26:52] (03PS1) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [14:27:18] (03CR) 10jerkins-bot: [V: 04-1] wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [14:27:33] (03CR) 10jerkins-bot: [V: 04-1] lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 (owner: 10Giuseppe Lavagetto) [14:37:40] (03PS11) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 [14:37:42] (03PS2) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [14:39:42] (03CR) 10jerkins-bot: [V: 04-1] lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 (owner: 10Giuseppe Lavagetto) [14:40:14] (03CR) 10jerkins-bot: [V: 04-1] wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [14:41:22] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Jgreen) [14:42:16] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) >>! In T242481#5801763, @MoritzMuehlenhoff wrote: > So, I digged into this a little: Interface auto setup not working if on... 
[14:43:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P10143 and previous config saved to /var/cache/conftool/dbconfig/20200114-144341-marostegui.json [14:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:07] RECOVERY - MegaRAID on db1100 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:45:14] (03PS1) 10Lars Wirzenius: Group0 to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564694 [14:47:02] (03PS1) 10Marostegui: mariadb: Place db1107 in s1 [puppet] - 10https://gerrit.wikimedia.org/r/564695 (https://phabricator.wikimedia.org/T242702) [14:47:30] !log liw@deploy1001 Started scap: testwiki to php-1.35.0-wmf.15 and rebuild l10n cache [14:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:32] (03PS1) 10Ema: prometheus: collect varnishd_mmap_count for varnish-frontend [puppet] - 10https://gerrit.wikimedia.org/r/564696 (https://phabricator.wikimedia.org/T242417) [14:48:50] (03CR) 10CDanis: [C: 03+1] prometheus: bump 'global' retention to 2.25 years [puppet] - 10https://gerrit.wikimedia.org/r/564679 (owner: 10Filippo Giunchedi) [14:48:53] (03CR) 10CDanis: [C: 03+1] prometheus: bump 'ops' retention to 4.5 months [puppet] - 10https://gerrit.wikimedia.org/r/564680 (owner: 10Filippo Giunchedi) [14:51:25] !log liw@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_44869219" --threads=30 --lang en --quiet' returned non-zero exit status 1 (duration: 03m 55s) [14:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:31] (03PS2) 10Ema: prometheus: collect varnishd_mmap_count for varnish-frontend [puppet] - 10https://gerrit.wikimedia.org/r/564696 (https://phabricator.wikimedia.org/T242417) [14:59:01] (03PS1) 
10BBlack: Update webserver-misc-static cert [puppet] - 10https://gerrit.wikimedia.org/r/564697 [14:59:28] (03PS2) 10BBlack: Update webserver-misc-static cert [puppet] - 10https://gerrit.wikimedia.org/r/564697 (https://phabricator.wikimedia.org/T230638) [15:00:17] (03CR) 10BBlack: [C: 03+2] Update webserver-misc-static cert [puppet] - 10https://gerrit.wikimedia.org/r/564697 (https://phabricator.wikimedia.org/T230638) (owner: 10BBlack) [15:00:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [15:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1080 for tranfer', diff saved to https://phabricator.wikimedia.org/P10144 and previous config saved to /var/cache/conftool/dbconfig/20200114-150223-marostegui.json [15:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:40] !log Copy data from db1080 to db1107 T242702 [15:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:43] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [15:02:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:27] (03PS12) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. 
[puppet] - 10https://gerrit.wikimedia.org/r/558620 [15:06:29] (03PS3) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [15:06:53] (03PS1) 10Ema: cache: enable geoiplookup in labs [puppet] - 10https://gerrit.wikimedia.org/r/564700 (https://phabricator.wikimedia.org/T241239) [15:09:08] (03PS2) 10Bstorm: gridengine: set the webgrid queues to not rerunable [puppet] - 10https://gerrit.wikimedia.org/r/564174 (https://phabricator.wikimedia.org/T242397) [15:10:26] (03CR) 10Ema: [C: 03+2] cache: enable geoiplookup in labs [puppet] - 10https://gerrit.wikimedia.org/r/564700 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:11:03] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1100 - https://phabricator.wikimedia.org/T241506 (10Marostegui) 05Open→03Resolved All good - thank you! ` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level Qualifie... [15:13:49] (03PS4) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [15:14:45] (03PS1) 10BBlack: Support missing /historical on transparency sites [puppet] - 10https://gerrit.wikimedia.org/r/564704 (https://phabricator.wikimedia.org/T230638) [15:14:50] (03PS1) 10BBlack: Redirect transparency.wm.o -> foundation site [puppet] - 10https://gerrit.wikimedia.org/r/564705 (https://phabricator.wikimedia.org/T230638) [15:17:03] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) So adding the `01-` did the trick and es2020 is installing: ` append initrd=debian-installer/amd64/initrd.gz vga=normal aut... 
[15:18:01] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) Forgot to thank @MoritzMuehlenhoff for all the help and time with the troubleshooting [15:19:38] (03CR) 10BBlack: [C: 03+2] Support missing /historical on transparency sites [puppet] - 10https://gerrit.wikimedia.org/r/564704 (https://phabricator.wikimedia.org/T230638) (owner: 10BBlack) [15:19:58] Hi, what's happening with Jenkins? https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/30013/console [15:20:02] (03PS13) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 [15:20:04] (03PS5) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [15:21:35] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, and 2 others: Port varnishlog consumers to log to syslog / logging infra - https://phabricator.wikimedia.org/T227108 (10ema) [15:27:27] (03CR) 10Marostegui: [C: 03+1] "The queries look good to me!" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/564614 (owner: 10Volans) [15:28:48] 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [15:28:52] Zoranzoki21: Not sure. Investigating. [15:30:39] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Papaul) On the phone with Dell support. 
[15:32:06] (03PS2) 10BBlack: Redirect transparency.wm.o -> foundation site [puppet] - 10https://gerrit.wikimedia.org/r/564705 (https://phabricator.wikimedia.org/T230638) [15:32:08] (03PS1) 10BBlack: Fixup for historical redirect [puppet] - 10https://gerrit.wikimedia.org/r/564706 (https://phabricator.wikimedia.org/T230638) [15:33:25] James_F: I saw it on few patches, and always is same agent-docker [15:33:40] Yeah, I'm depooling integration-agent-docker-1003. [15:34:48] (03PS3) 10Bstorm: gridengine: set the webgrid queues to not rerunable [puppet] - 10https://gerrit.wikimedia.org/r/564174 (https://phabricator.wikimedia.org/T242397) [15:37:00] (03CR) 10BBlack: [C: 03+2] Fixup for historical redirect [puppet] - 10https://gerrit.wikimedia.org/r/564706 (https://phabricator.wikimedia.org/T230638) (owner: 10BBlack) [15:40:04] (03PS1) 10Vgutierrez: ATS: Set connect timeout and TTFB timeouts to different values [puppet] - 10https://gerrit.wikimedia.org/r/564708 (https://phabricator.wikimedia.org/T242620) [15:41:22] (03CR) 10Ema: varnish: format log consumer stdout as cee+json (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/563430 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [15:42:29] Zoranzoki21: If you see it again, please shout. [15:45:33] (03PS1) 10Lars Wirzenius: Group0 to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564709 [15:47:43] !log liw@deploy1001 Started scap: testwiki to php-1.34.0-wmf.15 and rebuild l10n cache (try 2) [15:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:36] (03PS1) 10Herron: mx: increase exim queue check monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/564710 [15:48:40] (03PS14) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. 
[puppet] - 10https://gerrit.wikimedia.org/r/558620 [15:48:42] (03PS6) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [15:48:50] (03CR) 10Volans: [C: 03+2] hosts/images packages: optimize tables [software/debmonitor] - 10https://gerrit.wikimedia.org/r/564614 (owner: 10Volans) [15:48:52] o/ elukey. Could you chmod those files on ores2001 so I can read them? :) [15:48:58] /home/elukey/14012020_celery_oom/ [15:49:05] halfak: sure sorry! [15:49:16] I thought they were other-readable [15:49:22] (03CR) 10BBlack: [C: 03+2] Redirect transparency.wm.o -> foundation site [puppet] - 10https://gerrit.wikimedia.org/r/564705 (https://phabricator.wikimedia.org/T230638) (owner: 10BBlack) [15:49:24] No worries ^_^ [15:49:38] It's a persistent problem that our main.log is not readable :\ [15:50:11] (03PS2) 10Vgutierrez: ATS: Set connect timeout and TTFB timeouts to different values [puppet] - 10https://gerrit.wikimedia.org/r/564708 (https://phabricator.wikimedia.org/T242620) [15:50:12] ah because it is owned by www-data and 660 [15:50:13] (03PS1) 10Vgutierrez: ATS cp40[26|32], cp50[06|12]: Set connect timeout and TTFB timeouts to different values [puppet] - 10https://gerrit.wikimedia.org/r/564711 (https://phabricator.wikimedia.org/T242620) [15:50:19] 640 actually [15:51:08] (03CR) 10jerkins-bot: [V: 04-1] ATS cp40[26|32], cp50[06|12]: Set connect timeout and TTFB timeouts to different values [puppet] - 10https://gerrit.wikimedia.org/r/564711 (https://phabricator.wikimedia.org/T242620) (owner: 10Vgutierrez) [15:51:26] (03Merged) 10jenkins-bot: hosts/images packages: optimize tables [software/debmonitor] - 10https://gerrit.wikimedia.org/r/564614 (owner: 10Volans) [15:51:27] halfak: try nw [15:51:30] thanks [15:51:47] Now I can't even ls the directory :P [15:51:52] elukey, ^ [15:52:10] I also can't read any files in it. 
[15:52:53] I just sudoed as you and I can see [15:53:04] (03PS7) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [15:53:24] ah no wait I can ls but not read, lemme fix [15:53:43] (03PS2) 10Vgutierrez: ATS: Set connect timeout and TTFB timeouts to different values (test hosts only) [puppet] - 10https://gerrit.wikimedia.org/r/564711 (https://phabricator.wikimedia.org/T242620) [15:53:47] (03PS3) 10Vgutierrez: ATS: Set connect timeout and TTFB timeouts to different values [puppet] - 10https://gerrit.wikimedia.org/r/564708 (https://phabricator.wikimedia.org/T242620) [15:55:07] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Papaul) Dell said that it is not possible to disable the 10Gb interface. [15:55:13] halfak: I just apt-get installed basic-file-permission in my brain, hope it works now [15:55:31] (03PS8) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [15:55:41] Works now! Thank you [15:55:48] sorry I need a coffe :D [15:56:07] I was giving rw instead of r-x [15:56:11] Coffee is fuel :) [15:56:29] we are more in beer than coffee tiem [15:56:31] *time [15:56:52] https://xkcd.com/323/ [15:57:00] Yes, and beer is good, you're right [15:57:10] (03CR) 10Urbanecm: [C: 03+1] "My comments are not in any way meant to block this - I'm just asking. I see now the two new wikis are in English anyway, so translations d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [15:57:40] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10ayounsi) >>! 
In T242481#5798551, @Marostegui wrote: > - Enable to 10G even though it will go to a 1G switch port? Is that even possible... [15:58:18] (03CR) 10Vgutierrez: "pcc shows the expected changes on the right hosts: https://puppet-compiler.wmflabs.org/compiler1001/20351/" [puppet] - 10https://gerrit.wikimedia.org/r/564711 (https://phabricator.wikimedia.org/T242620) (owner: 10Vgutierrez) [15:59:21] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10MoritzMuehlenhoff) >>! In T242481#5802185, @ayounsi wrote: >>>! In T242481#5798551, @Marostegui wrote: >> - Enable to 10G even though i... [16:01:19] (03PS6) 10CDanis: puppet-merge.py: SHA1 or explicit FETCH_HEAD is mandatory [puppet] - 10https://gerrit.wikimedia.org/r/559944 (https://phabricator.wikimedia.org/T241277) [16:01:34] (03PS1) 10Volans: Release v0.2.4 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/564717 [16:02:10] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10ayounsi) >>! In T242481#5802186, @MoritzMuehlenhoff wrote: > Is that because of different cables/connectors? Indeed, 1G switch ports ar... 
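A minimal sketch of the permission mix-up discussed above between elukey and halfak (a 640 file, then a directory given rw instead of r-x). It assumes the reader falls into the "other" permission class, as the "other-readable" remark suggests; the function names are hypothetical and only illustrate how the mode bits combine.

```python
import stat


def other_can_read_file(file_mode):
    """True if the 'other' class has the read bit on a file."""
    return bool(file_mode & stat.S_IROTH)


def other_can_traverse_dir(dir_mode):
    """True if the 'other' class has the execute (search) bit on a directory."""
    return bool(dir_mode & stat.S_IXOTH)


def other_can_read_file_in_dir(dir_mode, file_mode):
    """Reading a file requires search permission on its directory
    and read permission on the file itself."""
    return other_can_traverse_dir(dir_mode) and other_can_read_file(file_mode)


# 640: owner rw-, group r--, other ---  -> not readable by others
print(other_can_read_file(0o640))
# Directory with rw- for others (no x): files inside stay unreachable
print(other_can_read_file_in_dir(0o756, 0o644))
# Directory with r-x for others and an other-readable file: works
print(other_can_read_file_in_dir(0o755, 0o644))
```

On a directory the execute bit means "may search/traverse", which is why granting rw without x still leaves the files inside unreadable.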
[16:02:17] (03CR) 10Volans: [V: 03+2 C: 03+2] "releasing the last 2 changes" [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/564717 (owner: 10Volans) [16:02:51] (03CR) 10Bstorm: [C: 03+2] gridengine: set the webgrid queues to not rerunable [puppet] - 10https://gerrit.wikimedia.org/r/564174 (https://phabricator.wikimedia.org/T242397) (owner: 10Bstorm) [16:03:35] (03CR) 10Ayounsi: [C: 03+1] "I don't think it's false positive, but it's a non-actionable alert anyway, so +1" [puppet] - 10https://gerrit.wikimedia.org/r/564710 (owner: 10Herron) [16:04:36] (03CR) 10CDanis: [C: 03+2] puppet-merge.py: SHA1 or explicit FETCH_HEAD is mandatory [puppet] - 10https://gerrit.wikimedia.org/r/559944 (https://phabricator.wikimedia.org/T241277) (owner: 10CDanis) [16:04:43] (03PS1) 10Ayounsi: Add conditional for vcp-snmp-statistics [homer/public] - 10https://gerrit.wikimedia.org/r/564718 [16:04:45] (03PS1) 10Ayounsi: Add tenant support for vlans [homer/public] - 10https://gerrit.wikimedia.org/r/564719 [16:05:12] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Papaul) So I asked Dell if it was possible to replace the NIC card we have now with 2 separate NiC cards ( 1 x10Gb NiC and 1 x1GB NIC).... [16:05:54] bstorm_: okay to merge your gridengine: set the webgrid queues to not rerunable (19c44fa2c1) ? [16:06:04] Please do, I was just about to :) [16:06:37] (03PS15) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. 
[puppet] - 10https://gerrit.wikimedia.org/r/558620 [16:06:38] (03PS9) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [16:06:47] thanks [16:07:04] !log volans@deploy1001 Started deploy [debmonitor/deploy@e72911c]: Release v0.2.4 [16:07:04] * cdanis just made some changes to puppet-merge; expected no-op aside from cleanups but please lmk if you have issues [16:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:29] James_F: See this https://integration.wikimedia.org/ci/job/quibble-composer-mysql-php72-docker/9616/console [16:07:44] (03CR) 10Herron: [C: 03+2] mx: increase exim queue check monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/564710 (owner: 10Herron) [16:08:13] !log volans@deploy1001 Finished deploy [debmonitor/deploy@e72911c]: Release v0.2.4 (duration: 01m 09s) [16:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:30] Zoranzoki21: Is that from an LDAP extension that's expecting configuration? [16:08:37] (03CR) 10Marostegui: [C: 03+2] mariadb: Place db1107 in s1 [puppet] - 10https://gerrit.wikimedia.org/r/564695 (https://phabricator.wikimedia.org/T242702) (owner: 10Marostegui) [16:09:04] herron: ok to merge your change? [16:09:16] marostegui: sure thanks! [16:09:22] merging [16:09:51] James_F: Looks so [16:10:22] Zoranzoki21: There's at least one (maybe more?) extensions in gerrit that fundamentally don't pass in master. 
:-( [16:10:54] (03PS1) 10Marostegui: mariadb: es20[0-5] [puppet] - 10https://gerrit.wikimedia.org/r/564723 (https://phabricator.wikimedia.org/T241336) [16:11:14] 10Operations, 10Puppet, 10Patch-For-Review: puppet-merge can't accept an explicit SHA1 for an --ops merge - https://phabricator.wikimedia.org/T241277 (10CDanis) 05Open→03Resolved a:03CDanis [16:11:43] (03CR) 10Marostegui: [C: 03+2] mariadb: es20[0-5] [puppet] - 10https://gerrit.wikimedia.org/r/564723 (https://phabricator.wikimedia.org/T241336) (owner: 10Marostegui) [16:15:02] (03PS1) 10Marostegui: install_server: Changing es2023 MAC to the 10G one [puppet] - 10https://gerrit.wikimedia.org/r/564724 (https://phabricator.wikimedia.org/T242481) [16:16:31] (03CR) 10Papaul: [C: 03+2] install_server: Changing es2023 MAC to the 10G one [puppet] - 10https://gerrit.wikimedia.org/r/564724 (https://phabricator.wikimedia.org/T242481) (owner: 10Marostegui) [16:17:11] (03CR) 10Marostegui: [C: 03+2] install_server: Changing es2023 MAC to the 10G one [puppet] - 10https://gerrit.wikimedia.org/r/564724 (https://phabricator.wikimedia.org/T242481) (owner: 10Marostegui) [16:18:06] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10MoritzMuehlenhoff) pxelinux has a generic option to pass the MAC in place when receiving the boot image as BOOTIF... 
[16:21:01] James_F: Ok, ty [16:24:34] (03PS2) 10Filippo Giunchedi: varnish: use syslog for varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/563977 (https://phabricator.wikimedia.org/T227108) [16:24:40] (03CR) 10Filippo Giunchedi: varnish: format log consumer stdout as cee+json (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/563430 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [16:25:42] (03PS1) 10Ema: cache: add CAP_KILL to varnish-frontend capabilities [puppet] - 10https://gerrit.wikimedia.org/r/564726 (https://phabricator.wikimedia.org/T242411) [16:26:17] !log Disable temporarily puppet on install1002 and install2002 - T242481 [16:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:32] T242481: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 [16:27:59] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` es20... 
[16:28:20] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/564719 (owner: 10Ayounsi) [16:28:49] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/564718 (owner: 10Ayounsi) [16:29:52] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add tenant support for vlans [homer/public] - 10https://gerrit.wikimedia.org/r/564719 (owner: 10Ayounsi) [16:30:57] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add conditional for vcp-snmp-statistics [homer/public] - 10https://gerrit.wikimedia.org/r/564718 (owner: 10Ayounsi) [16:31:12] !log liw@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.15 and rebuild l10n cache (try 2) (duration: 43m 29s) [16:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:16] (03PS10) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [16:31:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:34:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:34:14] (03PS1) 10Marostegui: Revert "install_server: Changing es2023 MAC to the 10G one" [puppet] - 10https://gerrit.wikimedia.org/r/564727 [16:34:46] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) >>! 
In T242481#5802260, @MoritzMuehlenhoff wrote: > pxelinux has a generic option to pass the MAC in... [16:35:12] (03CR) 10Filippo Giunchedi: "LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564696 (https://phabricator.wikimedia.org/T242417) (owner: 10Ema) [16:35:56] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Changing es2023 MAC to the 10G one" [puppet] - 10https://gerrit.wikimedia.org/r/564727 (owner: 10Marostegui) [16:37:47] (03CR) 10Ema: [C: 03+1] ATS: Set connect timeout and TTFB timeouts to different values (test hosts only) [puppet] - 10https://gerrit.wikimedia.org/r/564711 (https://phabricator.wikimedia.org/T242620) (owner: 10Vgutierrez) [16:39:33] (03PS3) 10Filippo Giunchedi: varnish: format log consumer stdout as cee+json [puppet] - 10https://gerrit.wikimedia.org/r/563430 (https://phabricator.wikimedia.org/T227108) [16:39:35] (03PS3) 10Filippo Giunchedi: varnish: use syslog for varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/563977 (https://phabricator.wikimedia.org/T227108) [16:40:48] (03PS3) 10Ema: prometheus: collect varnishd_mmap_count for varnish-frontend [puppet] - 10https://gerrit.wikimedia.org/r/564696 (https://phabricator.wikimedia.org/T242417) [16:41:24] !log Enable puppet back on install1002 and install2002 - T242481 [16:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:27] T242481: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 [16:41:36] (03CR) 10Ema: [C: 03+1] varnish: format log consumer stdout as cee+json [puppet] - 10https://gerrit.wikimedia.org/r/563430 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [16:42:13] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: collect varnishd_mmap_count for varnish-frontend [puppet] - 10https://gerrit.wikimedia.org/r/564696 (https://phabricator.wikimedia.org/T242417) (owner: 10Ema) [16:42:24] (03CR) 
10Ema: prometheus: collect varnishd_mmap_count for varnish-frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564696 (https://phabricator.wikimedia.org/T242417) (owner: 10Ema) [16:42:28] (03CR) 10Vgutierrez: [C: 03+2] ATS: Set connect timeout and TTFB timeouts to different values (test hosts only) [puppet] - 10https://gerrit.wikimedia.org/r/564711 (https://phabricator.wikimedia.org/T242620) (owner: 10Vgutierrez) [16:42:35] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10thcipriani) [16:44:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [16:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:31] !log branch is cut for 1.35.0-wmv.15; train window is closed, but I'll continue train since the next time slot seems to not have anything [16:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:52] (03PS1) 10Muehlenhoff: Pass down MAC address of to installing system via BOOTIF [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) [16:46:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:47] (03Abandoned) 10Lars Wirzenius: Group0 to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564694 (owner: 10Lars Wirzenius) [16:46:49] (03CR) 10Herron: [C: 03+1] prometheus: bump 'global' retention to 2.25 years (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564679 (owner: 10Filippo Giunchedi) [16:47:22] (03CR) 10Lars Wirzenius: [C: 03+2] Group0 to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564709 (owner: 10Lars Wirzenius) [16:47:41] (03CR) 10Herron: [C: 03+1] prometheus: bump 'ops' retention to 4.5 months [puppet] - 
10https://gerrit.wikimedia.org/r/564680 (owner: 10Filippo Giunchedi) [16:48:21] (03Merged) 10jenkins-bot: Group0 to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564709 (owner: 10Lars Wirzenius) [16:48:23] (03CR) 10Marostegui: [C: 03+1] "es2020 went fine indeed" [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [16:53:03] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2020.codfw.wmnet'] ` and were **ALL** successful. [16:53:45] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.15 [16:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:21] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Marostegui) [16:57:38] 10Operations, 10ops-codfw, 10Core Platform Team: (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10Papaul) [16:58:35] (03CR) 10Tchanders: "Thanks for checking Urbanecm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [17:00:04] godog and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200114T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. 
[17:01:07] https://gerrit.wikimedia.org/r/c/559073 would be nice to be deployed if anyone can do so 🙂 [17:01:47] <_joe_> rlazarus: ^^ :P [17:03:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add a registryctl command-line utility [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/563482 (https://phabricator.wikimedia.org/T242604) (owner: 10Giuseppe Lavagetto) [17:04:47] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10fgiunchedi) Something else I realized today: with ELK7 we dropped our custom logstash template in favor of logstash's default, although we'll need to bu... [17:05:05] (03Merged) 10jenkins-bot: Add a registryctl command-line utility [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/563482 (https://phabricator.wikimedia.org/T242604) (owner: 10Giuseppe Lavagetto) [17:09:01] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 94693184 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:09:16] 10Operations, 10cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) 05Open→03Resolved a:05aborrero→03None [17:09:34] (03CR) 10Krinkle: [C: 04-1] "Interesting failure upon running puppet agent in Beta:" [puppet] - 10https://gerrit.wikimedia.org/r/559262 (https://phabricator.wikimedia.org/T241097) (owner: 10Krinkle) [17:10:41] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:10:42] (03PS1) 10Giuseppe Lavagetto: New debian version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/564732 [17:11:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New debian version [docker-images/docker-report] - 
10https://gerrit.wikimedia.org/r/564732 (owner: 10Giuseppe Lavagetto) [17:12:43] (03Merged) 10jenkins-bot: New debian version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/564732 (owner: 10Giuseppe Lavagetto) [17:13:44] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Catrope) [17:14:41] (03PS2) 10Krinkle: mediawiki: Capture shutdown/destruct backtrace in php7-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/559262 (https://phabricator.wikimedia.org/T241097) [17:18:29] 10Puppet, 10cloud-services-team (Kanban): Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (10Andrew) p:05Triage→03Normal This is still important but lacking a good way to move forward. [17:21:36] <_joe_> !log upload docker-report 0.0.2 to {buster,stretch}-wikimedia T242604 [17:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:40] T242604: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 [17:22:01] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (10Krinkle) [17:24:06] (03CR) 10Krinkle: [C: 04-1] "I tried to reproduce that error in Puppet compiler, but to my surprise, it not only has no error, it says this change is a no-op for MW ap" [puppet] - 10https://gerrit.wikimedia.org/r/559262 (https://phabricator.wikimedia.org/T241097) (owner: 10Krinkle) [17:24:50] effie: ^ Hm. can you think of a reason why puppet compiler does not see the edit to php7-fatal-error.php as a real change for mwdebug*/mw* servers? 
[17:39:20] !log depooling cp4027 for some ats-tls parent balancing tests [17:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:11] Urbanecm, _joe_: sure, will do [17:41:08] thanks rlazarus [17:41:40] (03CR) 10RLazarus: [C: 03+2] Add ng.wikimedia.org as chapter site [puppet] - 10https://gerrit.wikimedia.org/r/559073 (https://phabricator.wikimedia.org/T240771) (owner: 10IAmNetx) [17:42:34] (03CR) 10Dzahn: [C: 03+2] "fyi, technically you don't need to use 755 on directories because puppet always adds the exec bit automatically, so 644 would be the same." [puppet] - 10https://gerrit.wikimedia.org/r/564466 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [17:43:33] !log repooling cp4027 [17:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:38] Urbanecm: merged, tested at mwdebug1001, will be deployed everywhere within 30m [17:46:21] (03PS1) 10saper: Wikistats v2 need no symbolic link [puppet] - 10https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) [17:47:38] 10Operations, 10Release-Engineering-Team, 10serviceops: Hundreds of tags for `wikimedia/mediawiki-core` image - https://phabricator.wikimedia.org/T242775 (10Joe) p:05Triage→03High [17:50:01] 10Puppet, 10VPS-project-codesearch, 10Patch-For-Review: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (10Dzahn) Yea, it should be possible. You can run any command with the puppet [[ https://puppet.com/docs/puppet/latest/types/exec.html | exec resource type ]] and one way to do stuff only... [17:50:38] 10Operations, 10ops-codfw, 10Core Platform Team: (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10Papaul) [17:51:23] (03CR) 10saper: "Hello - since I have tested it the Apache config on my personal server, I thought - why not propose the change here?" 
[puppet] - 10https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [17:51:50] (03CR) 10Dzahn: "or should i do one server per row as canary ? do you agree with the ticket we should have them in codfw?" [puppet] - 10https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [17:54:58] (03CR) 10Dzahn: Wikistats v2 need no symbolic link (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [17:59:00] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (10Dzahn) @Krinkle cool, thanks for finding the additional override :) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200114T1800). [18:00:17] (03PS2) 10saper: Wikistats v2 need no symbolic link [puppet] - 10https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) [18:00:19] (03PS1) 10saper: Wikistats v2 go live [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) [18:03:25] thansk a lot rlazarus [18:06:50] !log depool cp5012 for some ats parent select debugging [18:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:23] (03PS3) 10saper: Wikistats v2 need no symbolic link [puppet] - 10https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) [18:07:30] 10Operations, 10WMF-Legal, 10serviceops, 10Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Jdforrester-WMF) 05Open→03Resolved a:03BBlack This looks fully done. Thank you! 
[18:09:26] (03CR) 10saper: "Small change suggested by dzahn@" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [18:10:28] (03PS2) 10saper: Wikistats v2 go live [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) [18:11:09] 10Operations, 10ORES, 10Scoring-platform-team: Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10Halfak) I had a look at the request log on ores2001 and I can't find any requests that look concerning. Hypotheses: 1. celery got into a weird state and went crazy. It may not happ... [18:11:13] !log repooling cp5012 [18:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:18] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for puppetmaster2003 [dns] - 10https://gerrit.wikimedia.org/r/564746 [18:14:42] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969 (10tizianopiccardi) 05Resolved→03Open Hi all, I have a problem to login via ssh. I did not change anything recently, but I'm abl... [18:16:01] (03CR) 10saper: "Hello, this is the change that would bring wikistats v2 onto the stats.wikimedia.org homepage." [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [18:16:48] 10Operations, 10cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Dzahn) All notifications for these hosts are (permanently?) disabled. Wondering if that is desired or maybe they should just not be in monitoring in the f... 
[18:17:00] (03PS1) 10Elukey: admin: update user piccardi's ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/564747 (https://phabricator.wikimedia.org/T151969) [18:18:56] mutante: o/ if you have time can you check --^ ? I didn't find a better way to solve Tiziano's issue, but I may have missed something obvious :( [18:21:10] (03PS3) 10Krinkle: mediawiki: Capture shutdown/destruct backtrace in php7-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/559262 (https://phabricator.wikimedia.org/T241097) [18:22:13] (03CR) 10Krinkle: Wikistats v2 go live (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [18:22:20] elukey: looks like he is using the wrong username [18:22:27] correct would be: piccardi [18:22:45] oh wait.. no..checking [18:23:33] Krinkle: because we tell puppet to take the file as is and copy it on the server [18:24:03] (03CR) 10Krinkle: "Directing at /v2/ as Saper's alternative patch might be better long-term as it means we hand out canonical/permalinks. This means that whe" [puppet] - 10https://gerrit.wikimedia.org/r/563508 (https://phabricator.wikimedia.org/T237752) (owner: 10Elukey) [18:24:37] Krinkle: if it were a template, we would see the changes, so this is normal [18:25:24] effie: should it not at least say that File[/etc/php7-fatal-error.php] has different content and/or different md5 hash? [18:25:39] Or does that only work for templates? [18:26:28] Interesting, OK. I thought I did something wrong, but I guess all files I previously edited happen to be templates then. 
Thanks :) [18:32:48] 10Operations, 10Traffic: ATS strict round robin parent select policy doesn't work as expected - https://phabricator.wikimedia.org/T242778 (10Vgutierrez) [18:35:09] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969 (10Dzahn) hi @tizianopiccardi i can confirm your user exists on bast1002 and notebook1003 and your key has not been revoked. It is:... [18:39:01] (03CR) 10saper: Wikistats v2 go live (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [18:43:33] (03CR) 10Krinkle: Wikistats v2 go live (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [18:45:08] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969 (10Dzahn) I see the SHA256 of the public key you are attempting to use is: ` SHA256:wNTUKNPfq5Wyubriy6VGxmqrPq3m9l6GSiyF0SV/ywE `... [18:50:13] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969 (10Dzahn) a:05RobH→03Muehlenhoff [18:50:15] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging data for knissen - https://phabricator.wikimedia.org/T241838 (10Dzahn) a:05Dzahn→03None [18:50:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging data for knissen - https://phabricator.wikimedia.org/T241838 (10Dzahn) a:03Muehlenhoff [18:51:27] (03CR) 10saper: "Sure, have a look at how I have tested it https://phabricator.wikimedia.org/T237752#5802478 - there is a tarball with all I needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [18:53:14] Krinkle: we will see that when we run puppet, but not on the compiler [18:54:54] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Papaul) [18:55:26] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969 (10tizianopiccardi) Hi Daniel, yes, I'm quite sure I didn't change anything (unless OSX updated and changed files somewhere). Here a... [18:57:26] 10Operations, 10ops-eqiad, 10SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (10Cmjohnson) I am working on this now, a ticket has been created with HPE Case ID: 5344411330 Case title: Degraded RAID Severity 3-Normal [18:57:43] 10Operations, 10ops-eqiad, 10SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [18:58:05] What's happening with mwext-phpunit-coverage-docker-publish?? https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-docker-publish/16751/console [18:59:54] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10Cmjohnson) @Jgreen I have the batteries...when can we schedule to do this? [19:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200114T1900). [19:00:05] Ammarpad: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:28] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [19:05:31] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Cmjohnson) @Jclark-ctr where are you with these? [19:08:36] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Jclark-ctr) @Cmjohnson no nic installed or host moved yet. @RobH had helped with 10g interfaces [19:13:19] jouncebot: next [19:13:20] In 0 hour(s) and 46 minute(s): Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200114T2000) [19:13:25] jouncebot: now [19:13:25] For the next 0 hour(s) and 46 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200114T1900) [19:13:36] are we still deploying? [19:14:14] hauskatze: Ammarpad isn't here. [19:14:26] Urbanecm: mind if I add a patch? [19:14:34] not at all :) [19:14:35] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/564053/ [19:14:42] it's for wikitech [19:14:50] you can deploy there right? 
[19:16:23] yeah :) [19:17:09] (03PS3) 10Urbanecm: [wikitech] Restore contentadmin ability to manage abuse filters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564053 (https://phabricator.wikimedia.org/T242593) (owner: 10MarcoAurelio) [19:17:33] added to the calendar [19:17:42] thanks [19:17:53] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564053 (https://phabricator.wikimedia.org/T242593) (owner: 10MarcoAurelio) [19:18:52] I'd babysit Ammarpad's patches but I don't know about Minerva/MobileFrontend [19:18:59] (03Merged) 10jenkins-bot: [wikitech] Restore contentadmin ability to manage abuse filters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564053 (https://phabricator.wikimedia.org/T242593) (owner: 10MarcoAurelio) [19:20:05] i see [19:21:09] hauskatze: syncing [19:22:10] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: e400916: [wikitech] Restore contentadmin ability to manage abuse filters (duration: 01m 05s) [19:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:37] hauskatze: done! Lmk if you need anything else [19:23:11] I'll check it works [19:23:18] wikitech is so weird... ;) [19:23:42] a wiki on its own [19:23:53] yup, fixed [19:23:56] thanky :) [19:23:58] nice! [19:32:47] (03PS4) 10Dzahn: installserver: Convert tftp/dhcp ferm rules to ferm services [puppet] - 10https://gerrit.wikimedia.org/r/564010 (owner: 10Muehlenhoff) [19:38:27] Hello Urbanecm [19:38:35] I can help with Ammarpad's patches if you want? [19:38:52] (03CR) 10Dzahn: [C: 03+2] installserver: Convert tftp/dhcp ferm rules to ferm services [puppet] - 10https://gerrit.wikimedia.org/r/564010 (owner: 10Muehlenhoff) [19:45:38] (03CR) 10Dzahn: "ferm config changed on install1002/2002 and iptables -L output is unchanged (as expected)" [puppet] - 10https://gerrit.wikimedia.org/r/564010 (owner: 10Muehlenhoff) [19:47:18] (03CR) 10Dzahn: "should not be needed anymore. 
we talked on IRC and the issue is somewhere in the local ssh config. Tiziano could confirm it worked with th" [puppet] - 10https://gerrit.wikimedia.org/r/564747 (https://phabricator.wikimedia.org/T151969) (owner: 10Elukey) [19:54:10] (03CR) 10Dzahn: Wikistats v2 go live (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [19:56:50] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969 (10Dzahn) We talked on IRC debugged a bit more and Tiziano could confirm logging in works with the existing key after moving the ssh... [20:00:05] liw and brennen: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200114T2000). [20:02:23] (european window deployment seems stable; nothing to be done at present.) 
[20:07:00] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@1cf0530]: Increment service-runner to latest version [20:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:15] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 187360016 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:11:49] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@1cf0530]: Increment service-runner to latest version (duration: 04m 48s) [20:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:03] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 72208 and 107 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:18:53] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969 (10tizianopiccardi) 05Open→03Resolved The problem was in the config file. `ForwardAgent no IdentitiesOnly yes IdentityFile ~/.ss... [20:21:30] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) The mgmt passwords have been updated. [20:22:25] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson The mgmt passwords have been updated. 
Expect these to be ready this week [20:25:59] (03CR) 10Cwhite: [C: 03+1] "LGTM, assuming you've verified that is Hugh's key" [puppet] - 10https://gerrit.wikimedia.org/r/564171 (https://phabricator.wikimedia.org/T242309) (owner: 10Dzahn) [20:27:18] 10Operations, 10DC-Ops, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (10RobH) So this is a LOT of hardware churn that is non-desired by #dc-ops, at... [20:27:25] 10Operations, 10DC-Ops, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (10RobH) a:05RobH→03thcipriani [20:29:50] 10Operations, 10DC-Ops, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (10MoritzMuehlenhoff) @robh: You're completely right, see https://phabricator.w... [20:31:40] (03CR) 10Krinkle: [C: 03+1] "Done testing :)" [puppet] - 10https://gerrit.wikimedia.org/r/559262 (https://phabricator.wikimedia.org/T241097) (owner: 10Krinkle) [20:31:47] effie: good to go whenever :) [20:33:17] PROBLEM - Host lvs1015 is DOWN: PING CRITICAL - Packet loss = 100% [20:33:33] 10Operations, 10DC-Ops, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (10RobH) @thcipriani: Please comment with additional reasoning on why we need t... 
[20:34:13] 10Operations, 10DC-Ops, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (10RobH) >>! In T239880#5803332, @MoritzMuehlenhoff wrote: > @robh: You're comp... [20:37:10] 10Operations, 10ops-codfw, 10Wikimedia-Logstash: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10Papaul) [20:39:23] 10Operations, 10ops-codfw, 10Wikimedia-Logstash: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10Papaul) ` papaul@asw-c-codfw> show interfaces descriptions | match logstash2028 xe-7/0/11 up up logstash2028 [20:41:50] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) [20:48:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: Relabel labmon1001.eqiad.wmnet to cloudmetrics1001eqiad.wmnet and labmon1002.eqiad.wmnet to cloudmetrics1002eqiad.wmnet - https://phabricator.wikimedia.org/T241155 (10Cmjohnson) 05Open→03Resolved updated physical label and switch label [20:53:56] (03PS1) 10Cmjohnson: Updating hostname to reflect requested change [dns] - 10https://gerrit.wikimedia.org/r/564786 (https://phabricator.wikimedia.org/T239250) [20:55:23] (03PS2) 10Cmjohnson: Updating hostname to reflect requested change [dns] - 10https://gerrit.wikimedia.org/r/564786 (https://phabricator.wikimedia.org/T239250) [20:55:50] thanks cmjohnson1! 
[20:55:55] (03CR) 10Cmjohnson: [C: 03+2] Updating hostname to reflect requested change [dns] - 10https://gerrit.wikimedia.org/r/564786 (https://phabricator.wikimedia.org/T239250) (owner: 10Cmjohnson) [20:57:40] (03PS1) 10Papaul: DHCP: Add MAC address entires for logstash202[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/564788 (https://phabricator.wikimedia.org/T240882) [20:58:07] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for puppetmaster2003 [dns] - 10https://gerrit.wikimedia.org/r/564746 (owner: 10Papaul) [20:58:25] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for puppetmaster2003 [dns] - 10https://gerrit.wikimedia.org/r/564746 [20:59:21] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 16.23 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [20:59:51] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for puppetmaster2003 [dns] - 10https://gerrit.wikimedia.org/r/564746 (owner: 10Papaul) [21:00:29] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install censorship1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Cmjohnson) updated hostname to cescout1001 [x] physical label [x] netbox [x] network switch [x] mgmt and production DNS updated [21:00:57] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install censcout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Cmjohnson) [21:01:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Cmjohnson) [21:05:51] 10Operations, 10ops-codfw, 10Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10Papaul) [21:08:24] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 4.64 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore 
https://grafana.wikimedia.org/dashboard/db/labs-monitoring [21:09:46] 10Operations, 10ops-codfw, 10Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-private1-b-codfw] member ge-8/0/3 {... [21:09:59] 10Operations, 10ops-codfw, 10Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10Papaul) [21:13:15] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Papaul) [21:16:28] thcipriani: yo, did new scap end up rolling out? [21:18:04] ori: no, not yet, I still need to do the packaging :( [21:18:38] ok np, i'm just excited to see what the impact will be. there's no rush [21:19:25] I will be happy to see that change go out [21:41:01] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member ge-0/0/21; - member ge-1/0/21; [edit interfaces interface-r... [21:41:12] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Papaul) [22:35:48] (03CR) 10Dzahn: [C: 03+2] "actually checked the MACs on DRAC" [puppet] - 10https://gerrit.wikimedia.org/r/564788 (https://phabricator.wikimedia.org/T240882) (owner: 10Papaul) [22:38:00] 10Operations, 10ops-codfw, 10Wikimedia-Logstash, 10Patch-For-Review: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10Dzahn) ready for OS install [22:38:01] (03CR) 10Dzahn: "@papaul you can start installing. 
also ran puppet on install2002 already" [puppet] - 10https://gerrit.wikimedia.org/r/564788 (https://phabricator.wikimedia.org/T240882) (owner: 10Papaul) [22:39:07] 10Operations, 10Phabricator, 10Product-Analytics, 10WMF-NDA-Requests: Access to view #WMF-NDA tasks on Phabricator for jwang - https://phabricator.wikimedia.org/T242805 (10Dzahn) [22:53:45] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for jennifer wang (jwang) - https://phabricator.wikimedia.org/T242807 (10jwang) [23:04:38] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10RobH) a:05RobH→03Cmjohnson [23:17:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:19:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:41:48] 10Operations, 10Cleanup, 10Traffic, 10fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) 05Stalled→03Open [23:43:41] (03PS1) 10Arlolra: Bump Parsoid/PHP cluster memory_limit again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564805 (https://phabricator.wikimedia.org/T239806) [23:45:15] (03PS2) 10Arlolra: Bump Parsoid/PHP cluster memory_limit again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564805 (https://phabricator.wikimedia.org/T239806) [23:49:11] (03PS2) 10Cwhite: mtail: track new subscription requests in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/564129 (https://phabricator.wikimedia.org/T236505) [23:49:41] (03CR) 10Reedy: "How high can we go!?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/564805 (https://phabricator.wikimedia.org/T239806) (owner: 10Arlolra) [23:50:57] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:52:47] RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:27] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:03] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state