[00:07:32] (PS1) Dzahn: aphlict: set envoy-proxy upstream port to 22280, no SNI [puppet] - https://gerrit.wikimedia.org/r/616630 (https://phabricator.wikimedia.org/T238593)
[00:07:37] (PS1) Ebernhardson: airflow: Reduce scheduler execution rate to one minute [puppet] - https://gerrit.wikimedia.org/r/616631
[00:17:24] !log andrew@cumin1001 conftool action : set/pooled=no; selector: name=cloudcephmon1003.eqiad.wmnet
[00:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:59] !log andrew@cumin1001 conftool action : set/pooled=inactive; selector: name=cloudcephmon1003.eqiad.wmnet
[00:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:30:59] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:35:21] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35916208 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:55] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 232976 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:37] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:9283]) andrew bogott https://phabricator.wikimedia.org/T258826#6339427 https://wikitech.wikimedia.org/wiki/PyBal
[00:46:37] ACKNOWLEDGEMENT - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudceph_9283: Servers cloudcephmon1003.eqiad.wmnet are marked down but pooled andrew bogott https://phabricator.wikimedia.org/T258826#6339427 https://wikitech.wikimedia.org/wiki/PyBal
[00:46:37] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:9283]) andrew bogott https://phabricator.wikimedia.org/T258826#6339427 https://wikitech.wikimedia.org/wiki/PyBal
[00:46:37] ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudceph_9283: Servers cloudcephmon1003.eqiad.wmnet are marked down but pooled andrew bogott https://phabricator.wikimedia.org/T258826#6339427 https://wikitech.wikimedia.org/wiki/PyBal
[00:52:17] Operations, Performance-Team, serviceops, Sustainability (Incident Followup): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle) >>! In T240684#6204688, @Joe wrote: > - mcrouter + gutter pool: better consistency (because deletes get...
[00:52:51] (CR) Dzahn: [C: +2] aphlict: set envoy-proxy upstream port to 22280, no SNI [puppet] - https://gerrit.wikimedia.org/r/616630 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[00:54:10] (CR) Tim Starling: "Please review and give +1" [mediawiki-config] - https://gerrit.wikimedia.org/r/616628 (owner: Tim Starling)
[01:03:41] (PS1) Dzahn: arclamp: send stderr of new arclamp compress cron job to a logfile [puppet] - https://gerrit.wikimedia.org/r/616634 (https://phabricator.wikimedia.org/T257931)
[01:08:11] (CR) Dave Pifke: [C: -1] "This is because I screwed up the cron specification, and it's running every minute instead of every hour, and stepping on itself." [puppet] - https://gerrit.wikimedia.org/r/616634 (https://phabricator.wikimedia.org/T257931) (owner: Dzahn)
[01:09:23] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 87 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:13:27] (PS1) Dave Pifke: arclamp: fix cron specification [puppet] - https://gerrit.wikimedia.org/r/616635 (https://phabricator.wikimedia.org/T235456)
[01:13:37] (PS1) Dzahn: arclamp: run log compression cron job once an hour, not every minute [puppet] - https://gerrit.wikimedia.org/r/616636
[01:14:10] (CR) Dzahn: [V: +2 C: +2] "did the same thing at minute 30 with the same comment not everything should run at 0 :)" [puppet] - https://gerrit.wikimedia.org/r/616635 (https://phabricator.wikimedia.org/T235456) (owner: Dave Pifke)
[01:14:37] (Abandoned) Dzahn: arclamp: send stderr of new arclamp compress cron job to a logfile [puppet] - https://gerrit.wikimedia.org/r/616634 (https://phabricator.wikimedia.org/T257931) (owner: Dzahn)
[01:14:54] (Abandoned) Dzahn: arclamp: run log compression cron job once an hour, not every minute [puppet] - https://gerrit.wikimedia.org/r/616636 (owner: Dzahn)
[01:17:06] (CR) Dzahn: "Notice: /Stage[main]/Arclamp/Cron[arclamp_compress_logs]/minute: defined 'minute' as ['17']" [puppet] - https://gerrit.wikimedia.org/r/616635 (https://phabricator.wikimedia.org/T235456) (owner: Dave Pifke)
[01:17:56] (CR) Dave Pifke: "Thanks, and sorry for not catching/fixing sooner." [puppet] - https://gerrit.wikimedia.org/r/616635 (https://phabricator.wikimedia.org/T235456) (owner: Dave Pifke)
[01:21:53] (PS3) Jared Blumer: eslint: Update to eslint-config-wikimedia 0.16.0 and eslint 7.5.0 [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/616183 (https://phabricator.wikimedia.org/T254495)
[01:24:27] (CR) Dzahn: "no worries at all. also did not catch that in review" [puppet] - https://gerrit.wikimedia.org/r/616635 (https://phabricator.wikimedia.org/T235456) (owner: Dave Pifke)
[01:24:32] (CR) Jared Blumer: "All set. Thanks, Zeljko." (1 comment) [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/616183 (https://phabricator.wikimedia.org/T254495) (owner: Jared Blumer)
[01:44:43] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 29125 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=eventlog1002&var-datasource=eqiad+prometheus/ops
[02:02:18] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:05:31] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.2 [core] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/616638
[02:43:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:44:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:16:41] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=n
[03:16:41] tasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[03:19:02] (PS1) Catrope: Remove unused setting $wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage [mediawiki-config] - https://gerrit.wikimedia.org/r/616645
[03:38:11] (CR) Ebe123: [C: +1] Re-enable LilyPond/Score in safe mode [mediawiki-config] - https://gerrit.wikimedia.org/r/616628 (owner: Tim Starling)
[03:55:33] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: otrs1001.eqiad.wmnet, contint2001.wikimedia.org, wdqs1009.eqiad.wmnet, an-tool1009.eqiad.wmnet, contint1001.wikimedia.org, testreduce1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:14:09] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:19:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:43:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:45:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:50:08] (PS1) Bodhisattwa: Add "work" namespace to search results for Bengali Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/616609
[05:04:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:05:54] (CR) Bodhisattwa: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/616609 (owner: Bodhisattwa)
[05:06:09] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[05:06:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:06:27] (PS2) Bodhisattwa: Add "work" namespace to search results for Bengali Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/616609
[05:09:03] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[05:12:54] (CR) 20after4: [C: +1] admins: remove demon from gerrit and phab root users [puppet] - https://gerrit.wikimedia.org/r/616164 (owner: Dzahn)
[05:14:10] (CR) 20after4: [C: +2] eslint: Update to eslint-config-wikimedia 0.16.0 and eslint 7.5.0 [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/616183 (https://phabricator.wikimedia.org/T254495) (owner: Jared Blumer)
[05:15:44] (CR) 20after4: [C: +2] Selenium: Update to WebdriverIO v6 [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471) (owner: Vidhi-Mody)
[05:16:00] (CR) 20after4: [V: +2 C: +2] Selenium: Update to WebdriverIO v6 [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471) (owner: Vidhi-Mody)
[05:16:29] (CR) 20after4: [V: +2 C: +2] eslint: Update to eslint-config-wikimedia 0.16.0 and eslint 7.5.0 [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/616183 (https://phabricator.wikimedia.org/T254495) (owner: Jared Blumer)
[05:18:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1144:3314 and restore db1146:3314 original weight', diff saved to https://phabricator.wikimedia.org/P12064 and previous config saved to /var/cache/conftool/dbconfig/20200728-051813-marostegui.json
[05:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3314 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12065 and previous config saved to /var/cache/conftool/dbconfig/20200728-051928-marostegui.json
[05:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:05] (CR) 20after4: [C: +1] Add basic doc for python-build* images [docker-images/production-images] - https://gerrit.wikimedia.org/r/605649 (owner: Hashar)
[05:25:09] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[05:27:36] (PS1) Marostegui: db1082: Clarify BBU status [puppet] - https://gerrit.wikimedia.org/r/616649 (https://phabricator.wikimedia.org/T258336)
[05:28:28] (CR) Marostegui: [C: +2] db1082: Clarify BBU status [puppet] - https://gerrit.wikimedia.org/r/616649 (https://phabricator.wikimedia.org/T258336) (owner: Marostegui)
[05:29:01] PROBLEM - puppet last run on otrs1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:39:36] (CR) Giuseppe Lavagetto: [C: +1] nutcracker: drop puppet < 3.5 support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) (owner: Jbond)
[06:07:52] (PS4) Privacybatm: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601)
[06:12:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:14:27] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:17:33] PROBLEM - dump of zarcillo in codfw on icinga1001 is CRITICAL: Last dump for zarcillo at codfw (db2093.codfw.wmnet) taken on 2020-07-28 05:55:05 is 14 GB, but previous one was 0 GB, a change of 3633670.7% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[06:29:33] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[06:31:21] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[06:31:54] (PS2) Muehlenhoff: Add CAS support to Hue (WIP) [puppet] - https://gerrit.wikimedia.org/r/616541
[06:40:36] Operations: Review lists of config/sysctl recommendations by "kernel self-protection project" - https://phabricator.wikimedia.org/T142984 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[06:40:53] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[06:42:45] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[06:51:24] (PS8) Ema: ATS: stop responding to varnishcheck/status [puppet] - https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015)
[06:52:17] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[06:53:27] (Abandoned) Jforrester: [DNM] CI verification commit [deployment-charts] - https://gerrit.wikimedia.org/r/519266 (owner: Jforrester)
[06:54:08] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[06:57:49] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 29309 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=eventlog1002&var-datasource=eqiad+prometheus/ops
[06:58:28] (CR) Ema: [C: +2] ATS: stop responding to varnishcheck/status [puppet] - https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015) (owner: Ema)
[07:01:11] Operations, Performance-Team, serviceops, Sustainability (Incident Followup): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (elukey) >>! In T240684#6339437, @Krinkle wrote: >>>! In T240684#6204688, @Joe wrote: >> - mcrouter + gutter pool:...
[07:02:16] (CR) Privacybatm: "(Sorry to send this big comment :O)" [software/transferpy] - https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601) (owner: Privacybatm)
[07:02:37] (CR) Privacybatm: "2.2" [software/transferpy] - https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601) (owner: Privacybatm)
[07:03:05] (CR) Legoktm: [C: +1] "+1 to re-enabling. It would be nice if we could deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/609840/ too." [mediawiki-config] - https://gerrit.wikimedia.org/r/616628 (owner: Tim Starling)
[07:03:31] !log 1.36.0-wmf.2 was branched at 04e863fdf3646ee6ed5c05b784f85c9f323e1f19 for T257970
[07:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:38] T257970: 1.36.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T257970
[07:03:44] (CR) Lars Wirzenius: [C: +2] Branch commit for wmf/1.36.0-wmf.2 [core] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/616638 (owner: TrainBranchBot)
[07:03:55] (PS6) Elukey: Add analytics-product system user [puppet] - https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga)
[07:04:27] (CR) Legoktm: mediawiki: Create /etc/firejail/mediawiki.local (1 comment) [puppet] - https://gerrit.wikimedia.org/r/609840 (owner: Legoktm)
[07:07:01] (CR) Elukey: [C: +2] Add analytics-product system user [puppet] - https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga)
[07:07:21] (CR) Elukey: [C: +2] "Discussed and approved by the SRE team meeting happened yesterday" [puppet] - https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga)
[07:11:37] (CR) Muehlenhoff: [C: +1] "Sounds good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/616628 (owner: Tim Starling)
[07:13:42] (PS1) Marostegui: mariadb: Reimage dbproxy1015 to Buster [puppet] - https://gerrit.wikimedia.org/r/616703 (https://phabricator.wikimedia.org/T255408)
[07:14:21] (CR) Marostegui: [C: +2] mariadb: Reimage dbproxy1015 to Buster [puppet] - https://gerrit.wikimedia.org/r/616703 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[07:18:57] (CR) Ema: [C: -1] "The DNS discovery record does not seem to be there right now:" [puppet] - https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[07:25:37] (PS1) Lars Wirzenius: testwikis wikis to 1.36.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/616704
[07:25:39] (CR) Lars Wirzenius: [C: +2] testwikis wikis to 1.36.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/616704 (owner: Lars Wirzenius)
[07:26:20] (PS1) Elukey: role::statistics::explorer: add analyitcs-product keytab [puppet] - https://gerrit.wikimedia.org/r/616705 (https://phabricator.wikimedia.org/T255039)
[07:26:22] (Merged) jenkins-bot: testwikis wikis to 1.36.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/616704 (owner: Lars Wirzenius)
[07:27:09] !log liw@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.2
[07:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:38] (CR) Elukey: [C: +2] role::statistics::explorer: add analyitcs-product keytab [puppet] - https://gerrit.wikimedia.org/r/616705 (https://phabricator.wikimedia.org/T255039) (owner: Elukey)
[07:28:00] (PS2) Elukey: role::statistics::explorer: add analyitcs-product keytab [puppet] - https://gerrit.wikimedia.org/r/616705 (https://phabricator.wikimedia.org/T255039)
[07:28:50] Operations, Fundraising-Backlog: New wiki for TY pages with same content as donatewiki - https://phabricator.wikimedia.org/T259002 (AndyRussG)
[07:30:34] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (AndyRussG) Thanks so much @Krinkle for the information on this! Please see {T259002}.
[07:31:01] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet
[07:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[07:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:03] Operations, Analytics-Radar, SRE-Access-Requests, Patch-For-Review, and 2 others: Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (elukey) Open→Resolved
[07:40:48] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2001 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[07:41:17] (CR) DCausse: [C: +1] "looks good, verified the signatures" (1 comment) [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/616602 (owner: Ebernhardson)
[07:42:01] (CR) DCausse: [C: -1] "in I fact I think you forgot to bump the deb version in the rules file" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/616602 (owner: Ebernhardson)
[07:42:08] (PS1) Marostegui: dbproxy1015: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/616708 (https://phabricator.wikimedia.org/T255408)
[07:42:35] (CR) Muehlenhoff: [C: +2] Enable CAS for Superset [puppet] - https://gerrit.wikimedia.org/r/615754 (owner: Muehlenhoff)
[07:42:52] (PS1) Filippo Giunchedi: wikimedia: failover to netmon2001 [dns] - https://gerrit.wikimedia.org/r/616709 (https://phabricator.wikimedia.org/T247967)
[07:43:01] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:23] Operations, Performance-Team, serviceops, Sustainability (Incident Followup): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Joe) >! In T240684#6339437, @Krinkle wrote: > > I might be missing something, but that doesnt' seem great. Per @...
[07:44:26] Operations, ops-eqsin: Decommission cr1-eqsin - https://phabricator.wikimedia.org/T256947 (ayounsi)
[07:46:11] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[07:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:31] Operations, serviceops: All wtp servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1026.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202007280746_j...
[07:46:47] (CR) DCausse: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/616593 (https://phabricator.wikimedia.org/T258739) (owner: Dzahn)
[07:46:53] (PS1) Filippo Giunchedi: Failover to netmon2001 [puppet] - https://gerrit.wikimedia.org/r/616710 (https://phabricator.wikimedia.org/T247967)
[07:47:05] (CR) Marostegui: [C: +2] dbproxy1015: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/616708 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[07:47:43] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:22] !log switched superset to CAS
[07:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:47] !log depooled wtp1026.eqiad.wmnet for reimage
[07:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:49] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2001 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[07:52:28] (PS2) Vidhi-Mody: Selenium: Update to WebdriverIO v6 [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471)
[07:53:19] (CR) Vidhi-Mody: "PS2 resolves the merge conflict in PS1" [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471) (owner: Vidhi-Mody)
[07:53:34] Operations, ops-eqsin: Decommission cr1-eqsin - https://phabricator.wikimedia.org/T256947 (ayounsi) a:ayounsi→RobH Wiped and powered-off.
[07:54:34] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/616710 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[07:56:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:07] (PS1) Marostegui: dbproxy1019: Reduce labsdb1009 weight [puppet] - https://gerrit.wikimedia.org/r/616711
[08:02:55] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/24157/" [puppet] - https://gerrit.wikimedia.org/r/616710 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:03:31] (CR) Marostegui: [C: +2] dbproxy1019: Reduce labsdb1009 weight [puppet] - https://gerrit.wikimedia.org/r/616711 (owner: Marostegui)
[08:04:05] !log Reduce labsdb1009 weight
[08:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:30] (CR) Ayounsi: [C: +1] "Quick look at the diff and PCC lgtm!" [puppet] - https://gerrit.wikimedia.org/r/616710 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:06:04] (CR) Ayounsi: [C: +1] wikimedia: failover to netmon2001 [dns] - https://gerrit.wikimedia.org/r/616709 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:06:09] !log failover librenms/smokeping to netmon2001 - T247967
[08:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:14] T247967: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967
[08:06:19] (CR) Filippo Giunchedi: [C: +2] Failover to netmon2001 [puppet] - https://gerrit.wikimedia.org/r/616710 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:06:27] (PS2) Filippo Giunchedi: Failover to netmon2001 [puppet] - https://gerrit.wikimedia.org/r/616710 (https://phabricator.wikimedia.org/T247967)
[08:07:38] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime
[08:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:59] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[08:09:45] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:09:48] (CR) Vgutierrez: [C: -1] "ats-tls needs a wss remap rule as well for phabricator, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[08:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:09] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[08:13:32] (CR) Filippo Giunchedi: [C: +2] wikimedia: failover to netmon2001 [dns] - https://gerrit.wikimedia.org/r/616709 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:13:43] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:49] (PS1) Ema: Revert "ATS: force cache revalidation on a few selected wikis" [puppet] - https://gerrit.wikimedia.org/r/616614 (https://phabricator.wikimedia.org/T256750)
[08:16:04] (CR) jerkins-bot: [V: -1] Revert "ATS: force cache revalidation on a few selected wikis" [puppet] - https://gerrit.wikimedia.org/r/616614 (https://phabricator.wikimedia.org/T256750) (owner: Ema)
[08:16:06] (CR) Ema: [C: +1] Revert "ATS: force cache revalidation on secure.wm.o" [puppet] - https://gerrit.wikimedia.org/r/616493 (https://phabricator.wikimedia.org/T151977) (owner: CDanis)
[08:20:20] !log liw@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.2 (duration: 53m 11s)
[08:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:25] (PS3) Ema: Revert "ATS: force cache revalidation on secure.wm.o" [puppet] - https://gerrit.wikimedia.org/r/616493 (https://phabricator.wikimedia.org/T151977) (owner: CDanis)
[08:22:02] (CR) Ema: [C: +2] Revert "ATS: force cache revalidation on secure.wm.o" [puppet] - https://gerrit.wikimedia.org/r/616493 (https://phabricator.wikimedia.org/T151977) (owner: CDanis)
[08:22:51] (PS2) Ema: Revert "ATS: force cache revalidation on a few selected wikis" [puppet] - https://gerrit.wikimedia.org/r/616614 (https://phabricator.wikimedia.org/T256750)
[08:23:44] (PS1) Muehlenhoff: Create component/cloudera for buster-wikimedia [puppet] - https://gerrit.wikimedia.org/r/616712 (https://phabricator.wikimedia.org/T258768)
[08:24:13] (CR) Ema: [C: +2] Revert "ATS: force cache revalidation on a few selected wikis" [puppet] - https://gerrit.wikimedia.org/r/616614 (https://phabricator.wikimedia.org/T256750) (owner: Ema)
[08:27:58] (CR) Elukey: [C: +1] Create component/cloudera for buster-wikimedia [puppet] - https://gerrit.wikimedia.org/r/616712 (https://phabricator.wikimedia.org/T258768) (owner: Muehlenhoff)
[08:29:14] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[08:30:32] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[08:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P12066 and previous config saved to /var/cache/conftool/dbconfig/20200728-083209-marostegui.json
[08:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12067 and previous config saved to /var/cache/conftool/dbconfig/20200728-083336-marostegui.json
[08:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:24] (PS2) Muehlenhoff: profile::superset: Remove stretch support [puppet] - https://gerrit.wikimedia.org/r/615724
[08:36:59] (CR) Muehlenhoff: [C: +2] Create component/cloudera for buster-wikimedia [puppet] - https://gerrit.wikimedia.org/r/616712 (https://phabricator.wikimedia.org/T258768) (owner: Muehlenhoff)
[08:37:50] (CR) Jbond: [C: +2] diamond: remove unused file [puppet] - https://gerrit.wikimedia.org/r/616518 (https://phabricator.wikimedia.org/T258943) (owner: Jbond)
[08:38:38] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:38:54] !log temporary downgrade prometheus-snmp-exporter on netmon2001 [08:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:27] !log standardize cr2-eqsin interfaces [08:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:36] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616593 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [08:40:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:42:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:42:33] 10Operations, 10observability, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10MoritzMuehlenhoff) [08:44:08] (03CR) 10Jbond: nutcracker: drop puppet < 3.5 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) (owner: 10Jbond) [08:44:10] (03CR) 10Jbond: [C: 03+2] nutcracker: drop puppet < 3.5 support [puppet] - 10https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) (owner: 10Jbond) [08:48:24] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10JMeybohm) [08:48:37] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: hiera_lookup failing to preform lookups after hiera5 upgrade - https://phabricator.wikimedia.org/T258931 (10jbond) After applying [[ https://gerrit.wikimedia.org/r/616469 | 616469 ]] the following lookup command is now working ` $ sudo puppet looku... [08:48:58] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [08:49:06] (03PS2) 10Jbond: nutcracker: ensure verbosity is an integer [puppet] - 10https://gerrit.wikimedia.org/r/616479 (https://phabricator.wikimedia.org/T258931) [08:50:09] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642 (owner: 10Ayounsi) [08:50:50] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [08:52:48] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10JMeybohm) @Dzahn as discussed yesterday I'm going to reimage the eqiad nodes. Would be nice if you could do the codfw ones. We should also reimage the new (still role(insetup)) "par... [08:54:58] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 (10Vgutierrez) 05Stalled→03Resolved a:03Vgutierrez Thanks for pinging me @wiki_willy, we can close to this task, everything seems good in cp3053 so far. 
I'll reopen the task if needed [08:55:53] (03CR) 10Jbond: [C: 03+2] nutcracker: ensure verbosity is an integer [puppet] - 10https://gerrit.wikimedia.org/r/616479 (https://phabricator.wikimedia.org/T258931) (owner: 10Jbond) [09:03:37] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1026.eqiad.wmnet'] ` and were **ALL** successful. [09:06:45] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [09:07:02] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [09:07:13] !log cp3050: restart varnishmtail.service, stuck on "Condition(c->offset <= c->vtx->len) not true." [09:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:17] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) Failover happened and the active netmon host is now netmon2001. Issues identified: 1. Polling time for eqiad devices increased signif... 
[09:10:01] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1026.eqiad.wmnet [09:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:09] (03CR) 10Muehlenhoff: [C: 03+2] profile::superset: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/615724 (owner: 10Muehlenhoff) [09:11:14] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [09:18:42] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1027.eqiad.wmnet [09:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P12069 and previous config saved to /var/cache/conftool/dbconfig/20200728-091849-marostegui.json [09:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:27] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1027.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [09:19:46] !log standardize cr3-eqsin interfaces [09:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:07] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1028.eqiad.wmnet [09:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:50] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1028.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [09:26:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P12070 and previous config saved to /var/cache/conftool/dbconfig/20200728-092606-marostegui.json [09:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:19] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1029.eqiad.wmnet [09:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:41] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1029.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [09:31:23] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1030.eqiad.wmnet [09:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:00] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1030.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[09:32:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:35:07] !log imported libmysqlclient18 to component/cloudera T258768 [09:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:12] T258768: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 [09:35:15] (03PS3) 10DCausse: [wcqs] use correct UriScheme in blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/616110 (https://phabricator.wikimedia.org/T251497) (owner: 10ZPapierski) [09:35:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:36:11] (03CR) 10DCausse: [C: 03+1] [wcqs] use correct UriScheme in blazegraph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616110 (https://phabricator.wikimedia.org/T251497) (owner: 10ZPapierski) [09:38:35] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [09:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:13] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) Adding some tags for visibility. I'm going to be away from a computer for the next two weeks but @Tgr... 
[09:40:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:53] (03PS1) 10Muehlenhoff: Install libmysqlclient18 from component/cloudera [puppet] - 10https://gerrit.wikimedia.org/r/616715 (https://phabricator.wikimedia.org/T258768) [09:43:07] liw: Still busy in prod or can I sling out a quick config fix? [09:43:55] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [09:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:04] (03CR) 10Elukey: [C: 03+1] Install libmysqlclient18 from component/cloudera [puppet] - 10https://gerrit.wikimedia.org/r/616715 (https://phabricator.wikimedia.org/T258768) (owner: 10Muehlenhoff) [09:44:27] (03CR) 10Jforrester: "Planned for w/c 2020-08-02." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612348 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [09:44:39] James_F, I've deployed to testwikis, and waiting for the deploy window in a few hours before going to group0, so you're OK to go ahead [09:45:59] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:08] Thanks! [09:46:13] (03PS3) 10Jforrester: Fix VE-RealTime CSP entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615728 (owner: 10Esanders) [09:46:20] (03CR) 10Jforrester: [C: 03+2] Fix VE-RealTime CSP entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615728 (owner: 10Esanders) [09:47:03] (03Merged) 10jenkins-bot: Fix VE-RealTime CSP entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615728 (owner: 10Esanders) [09:47:27] (All done.) 
[09:47:52] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [09:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:30] RECOVERY - Disk space on eventlog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=eventlog1002&var-datasource=eqiad+prometheus/ops [09:49:45] (03PS2) 10DCausse: [cirrus] use more neutral config var names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612655 [09:49:55] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:02] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [09:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:53] (03CR) 10Muehlenhoff: [C: 03+2] Install libmysqlclient18 from component/cloudera [puppet] - 10https://gerrit.wikimedia.org/r/616715 (https://phabricator.wikimedia.org/T258768) (owner: 10Muehlenhoff) [09:52:01] (03PS1) 10Filippo Giunchedi: prometheus: upgrade snmp-exporter config [puppet] - 10https://gerrit.wikimedia.org/r/616716 (https://phabricator.wikimedia.org/T247967) [09:52:05] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:05] !log standardize cr2-esams interfaces [09:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:58] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:57:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:38] (03PS1) 10Jbond: sudo: create new command to create safe wildcard sudo commands [puppet] - 
10https://gerrit.wikimedia.org/r/616717 [09:59:40] (03PS1) 10Jbond: labstore::fileserver::exports: use sudo::safe_wildcard_cmd [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) [10:02:04] (03PS2) 10Jbond: sudo: create new command to create safe wildcard sudo commands [puppet] - 10https://gerrit.wikimedia.org/r/616717 (https://phabricator.wikimedia.org/T258943) [10:02:12] (03PS2) 10Jbond: labstore::fileserver::exports: use sudo::safe_wildcard_cmd [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) [10:04:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P12072 and previous config saved to /var/cache/conftool/dbconfig/20200728-100412-marostegui.json [10:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:49] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [10:09:01] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24161/" [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond) [10:10:19] (03PS2) 10Jbond: thanos-query - lvs: update service to monitoring_setup while we update [puppet] - 10https://gerrit.wikimedia.org/r/615726 [10:10:29] (03PS4) 10Jbond: lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) [10:10:39] (03PS3) 10Jbond: thanos-query - lvs: update service to production state [puppet] - 10https://gerrit.wikimedia.org/r/615727 [10:17:39] (03PS2) 10Tim Starling: Re-enable LilyPond/Score in safe mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616628 (https://phabricator.wikimedia.org/T257091) [10:23:09] (03PS1) 10Filippo Giunchedi: rancid: ship .gitconfig 
[puppet] - 10https://gerrit.wikimedia.org/r/616720 (https://phabricator.wikimedia.org/T247967) [10:23:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1082', diff saved to https://phabricator.wikimedia.org/P12074 and previous config saved to /var/cache/conftool/dbconfig/20200728-102342-marostegui.json [10:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:11] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10Marostegui) 05Open→03Resolved This host has been fully repooled. The BBU is now gone {T258910} so this same crash shouldn't happen again. This host will be decommissioned in Q2 {T258361} [10:25:13] 10Operations, 10DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (10Marostegui) [10:25:16] 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:27:17] 10Operations, 10Puppet, 10User-jbond: puppet: drop legacy validat_ functions - https://phabricator.wikimedia.org/T259013 (10jbond) [10:27:25] 10Operations, 10Puppet, 10User-jbond: puppet: drop legacy validat_ functions - https://phabricator.wikimedia.org/T259013 (10jbond) p:05Triage→03Low [10:27:27] (03PS1) 10Filippo Giunchedi: rancid: use active/failover netmon server for rsync [puppet] - 10https://gerrit.wikimedia.org/r/616722 (https://phabricator.wikimedia.org/T247967) [10:27:37] 10Operations, 10Puppet, 10User-jbond: puppet: drop legacy validate_ functions - https://phabricator.wikimedia.org/T259013 (10jbond) [10:30:42] (03PS1) 10Filippo Giunchedi: cache: remove smokeping, not proxied [puppet] - 10https://gerrit.wikimedia.org/r/616723 (https://phabricator.wikimedia.org/T247967) [10:31:58] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1027.eqiad.wmnet'] ` and were **ALL** successful. [10:32:46] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10MoritzMuehlenhoff) Hue fails to start due to some conflicts between the system Python and the modules bundled by Hue: > Jul 28 10:17:38 an-tool1009 systemd[1]: Failed to start LSB: H... [10:32:54] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1027.eqiad.wmnet [10:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:34] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1031.eqiad.wmnet [10:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:38] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) [10:34:03] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1031.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[10:34:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616722 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [10:34:46] (03CR) 10Ema: [C: 03+1] cache: remove smokeping, not proxied [puppet] - 10https://gerrit.wikimedia.org/r/616723 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [10:37:20] 10Operations, 10Traffic, 10Patch-For-Review: ats-tls is having issues when varnish-fe goes away - https://phabricator.wikimedia.org/T242620 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [10:37:47] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1028.eqiad.wmnet'] ` and were **ALL** successful. [10:38:09] RECOVERY - dump of zarcillo in codfw on icinga1001 is OK: Last dump for zarcillo at codfw (db2093.codfw.wmnet) taken on 2020-07-28 10:34:20 (0 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:38:54] 10Operations, 10Traffic: ATS logs aren't being rotated - https://phabricator.wikimedia.org/T238724 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [10:39:21] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 11.88 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [10:40:59] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:42:17] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1029.eqiad.wmnet'] ` and were **ALL** successful. [10:42:31] (03PS1) 10Vgutierrez: Release 8.0.8-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/616724 [10:43:05] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.2167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [10:43:06] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1030.eqiad.wmnet'] ` and were **ALL** successful. [10:44:43] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:47:04] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1028.eqiad.wmnet [10:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:37] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1032.eqiad.wmnet [10:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:50] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1032.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[10:48:04] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1029.eqiad.wmnet [10:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:18] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1030.eqiad.wmnet [10:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:25] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Nahid) Thank you all kindly for taking the time to resolve this issue. works fine no... [10:50:39] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1033.eqiad.wmnet [10:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:04] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1033.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[10:53:12] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [10:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:20] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:33] !log hashar@deploy1001 Started deploy [integration/docroot@ba85bdf]: Catch up with HEAD and support DOCUMENT_ROOT being a symbolic link for T149924 [10:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:38] T149924: Clear /srv/.git on contint1001; move integration.wikimedia.org docroot to new location - https://phabricator.wikimedia.org/T149924 [10:56:40] !log hashar@deploy1001 Finished deploy [integration/docroot@ba85bdf]: Catch up with HEAD and support DOCUMENT_ROOT being a symbolic link for T149924 (duration: 00m 06s) [10:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:44] (03CR) 10Hashar: [C: 03+1] "The fix has been merged https://gerrit.wikimedia.org/r/c/integration/docroot/+/611377" [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [10:59:43] (03PS1) 10Ema: ATS: force fawiki, hewiki cache revalidation [puppet] - 10https://gerrit.wikimedia.org/r/616726 (https://phabricator.wikimedia.org/T256750) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200728T1100). [11:00:04] jan_drewniak and dcausse: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:01:03] hm, I don’t see a dcausse entry in the calendar [11:01:10] oh no nevermind it’s there [11:01:12] sorry [11:01:15] :) [11:01:16] o/ [11:01:22] o/ [11:01:38] \o/ [11:01:59] jan_drewniak: want to self-deploy, or should I? [11:03:22] I can self-deploy :) [11:03:25] cool! [11:03:43] @Urbanecm would you be able to do mine from last night as well? [11:03:57] sure! [11:04:19] could you please add it to the calendar, Seddon? [11:04:39] (03CR) 10Jdrewniak: [C: 03+2] Enable desktop improvements by default for testing group (round 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614890 (https://phabricator.wikimedia.org/T254227) (owner: 10Jdlrobson) [11:05:10] @Urbanecm done! [11:05:13] thanks! [11:05:29] (03Merged) 10jenkins-bot: Enable desktop improvements by default for testing group (round 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614890 (https://phabricator.wikimedia.org/T254227) (owner: 10Jdlrobson) [11:06:08] (03PS2) 10Ema: ATS: force cache revalidation on a few wikis [puppet] - 10https://gerrit.wikimedia.org/r/616726 (https://phabricator.wikimedia.org/T256750) [11:07:00] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [11:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:05] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:13] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [11:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:53] (03CR) 10Vgutierrez: [C: 03+1] ATS: force cache revalidation on a few wikis [puppet] - 10https://gerrit.wikimedia.org/r/616726 (https://phabricator.wikimedia.org/T256750) (owner: 10Ema) [11:10:07] (03PS1) 10Jbond: diamond: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616728 [11:10:09] (03PS1) 10Jbond: logstash::input::tcp: drop legacy validate function [puppet] - 
10https://gerrit.wikimedia.org/r/616729 [11:10:11] (03PS1) 10Jbond: kartotherian: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616730 [11:10:13] (03PS1) 10Jbond: nginx: remove legacy validat functions [puppet] - 10https://gerrit.wikimedia.org/r/616731 [11:10:15] (03PS1) 10Jbond: cassandra::metrics: drop legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/616732 (https://phabricator.wikimedia.org/T259013) [11:10:17] (03PS1) 10Jbond: casandra::instance: drop lgacey _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) [11:10:43] !log jdrewniak@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:614890 desktop improvements by default for testing group (round 2) (T254227)]] (duration: 01m 06s) [11:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:48] T254227: Switch test wikis to new version of vector by default - https://phabricator.wikimedia.org/T254227 [11:11:19] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:51] (03CR) 10jerkins-bot: [V: 04-1] casandra::instance: drop lgacey _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [11:12:04] Urbanecm: My deploy is done :) [11:12:21] thanks! [11:12:27] dcausse: want to self-deploy? 
:-) [11:12:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P12075 and previous config saved to /var/cache/conftool/dbconfig/20200728-111226-marostegui.json [11:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:40] (03CR) 10Ema: [C: 03+2] ATS: force cache revalidation on a few wikis [puppet] - 10https://gerrit.wikimedia.org/r/616726 (https://phabricator.wikimedia.org/T256750) (owner: 10Ema) [11:12:42] Urbanecm: sure :) [11:13:32] okay, go ahead then :) [11:14:28] (03CR) 10DCausse: [C: 03+2] [cirrus] use more neutral config var names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612655 (owner: 10DCausse) [11:15:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P12076 and previous config saved to /var/cache/conftool/dbconfig/20200728-111522-marostegui.json [11:15:23] (03CR) 10Muehlenhoff: "Let's rather kill Diamond entirely.. I'll followup on T210993" [puppet] - 10https://gerrit.wikimedia.org/r/616728 (owner: 10Jbond) [11:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:12] (03PS3) 10DCausse: [cirrus] use more neutral config var names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612655 [11:16:17] (03CR) 10DCausse: [cirrus] use more neutral config var names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612655 (owner: 10DCausse) [11:16:54] (03CR) 10DCausse: [C: 03+2] [cirrus] use more neutral config var names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612655 (owner: 10DCausse) [11:17:36] 10Operations, 10observability, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10MoritzMuehlenhoff) This leaves VarnishStatus (added by @mmodell in 2015) running on deployment-cache-upload06.deployment-prep.eqiad.wmflabs as the last Diamo... 
[11:17:46] (03Merged) 10jenkins-bot: [cirrus] use more neutral config var names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612655 (owner: 10DCausse) [11:20:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P12077 and previous config saved to /var/cache/conftool/dbconfig/20200728-112046-marostegui.json [11:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:46] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] use more neutral config var names (duration: 01m 06s) [11:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:51] Urbanecm: I'm done [11:22:55] thanks! [11:23:15] Seddon: ready to do yours now [11:23:22] (03PS2) 10Urbanecm: Undeploy graphoid for phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614915 (https://phabricator.wikimedia.org/T258463) (owner: 10Seddon) [11:23:34] \o/ [11:23:44] (03CR) 10Urbanecm: [C: 03+2] "B&C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614915 (https://phabricator.wikimedia.org/T258463) (owner: 10Seddon) [11:25:12] !log A:cp-text varnish ban fa.wikipedia.org T256750 [11:25:43] (03Merged) 10jenkins-bot: Undeploy graphoid for phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614915 (https://phabricator.wikimedia.org/T258463) (owner: 10Seddon) [11:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:01] T256750: CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 [11:26:13] Seddon: ready for you to test at mwdebug1001 [11:27:00] Testing [11:27:48] @Urbanecm confirmed! 
[11:28:02] thanks [11:28:04] syncing [11:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1143', diff saved to https://phabricator.wikimedia.org/P12078 and previous config saved to /var/cache/conftool/dbconfig/20200728-112850-marostegui.json [11:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 04c7ef94bb7901668f2a8df3289b6a59d42f0a7e: Undeploy graphoid for phase 2 wikis (T258463) (duration: 01m 00s) [11:29:40] Seddon: done! [11:30:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1148 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12079 and previous config saved to /var/cache/conftool/dbconfig/20200728-113009-marostegui.json [11:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:12] T258463: Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 [11:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:50] !log Deploy MCR change on db1143, db1148, db1146:3314 [11:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:29] (03PS1) 10Jbond: mtail: drop legacy validate functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616735 (https://phabricator.wikimedia.org/T259013) [11:31:31] (03PS1) 10Jbond: sysfs::confile: drop legacy functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616736 (https://phabricator.wikimedia.org/T259013) [11:31:33] (03PS1) 10Jbond: udev::rule: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616737 (https://phabricator.wikimedia.org/T259013) [11:31:35] (03PS1) 10Jbond: graphite::web: drop validate_functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) [11:31:37] (03PS1) 10Jbond: 
profile::redis::master: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616739 (https://phabricator.wikimedia.org/T259013) [11:31:41] (03PS1) 10Jbond: profile::mariadb::eventloggin: drop legacy validate_ function [puppet] - 10https://gerrit.wikimedia.org/r/616740 (https://phabricator.wikimedia.org/T259013) [11:32:16] (03CR) 10jerkins-bot: [V: 04-1] sysfs::confile: drop legacy functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616736 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [11:32:18] (03PS1) 10Urbanecm: Revert "Revert "Move footer logos to /static/images/footer"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616623 (https://phabricator.wikimedia.org/T257732) [11:32:25] (03PS2) 10Urbanecm: Revert "Revert "Move footer logos to /static/images/footer"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616623 (https://phabricator.wikimedia.org/T257732) [11:32:30] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Move footer logos to /static/images/footer"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616623 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [11:32:32] !log A:cp-text varnish ban he.wikipedia.org T256750 [11:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:37] T256750: CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 [11:33:53] (03Merged) 10jenkins-bot: Revert "Revert "Move footer logos to /static/images/footer"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616623 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [11:33:54] 10Operations, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jseddon) [11:34:33] !log A:cp-text varnish ban eu.wikipedia.org T256750 [11:34:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:54] !log urbanecm@deploy1001 Synchronized static/images/footer: df9b9acf0876dad9b11d5641fe6fa174c7066f8b: Move footer logos to /static/images/footer (T257732) (duration: 01m 05s) [11:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:59] T257732: Change the footer logos in Turkish Wikipedia - https://phabricator.wikimedia.org/T257732 [11:36:40] !log A:cp-text varnish ban fr.wiktionary.org T256750 [11:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:27] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: df9b9acf0876dad9b11d5641fe6fa174c7066f8b: Move footer logos to /static/images/footer (T257732) (duration: 00m 58s) [11:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:50] (03CR) 10Tim Starling: [C: 03+2] mediawiki: Create /etc/firejail/mediawiki.local [puppet] - 10https://gerrit.wikimedia.org/r/609840 (owner: 10Legoktm) [11:38:15] !log A:cp-text varnish ban pt.wikiversity.org T256750 [11:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:19] T256750: CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 [11:38:24] !log Deploy schema change on s3 codfw, this will generate lag on codfw T256682 [11:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:29] T256682: page_restrictions indexes have been majestically drifting from code - https://phabricator.wikimedia.org/T256682 [11:41:12] (03PS1) 10Jbond: profile::openstack: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) [11:41:14] (03PS1) 10Jbond: tmpreaper: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616743 (https://phabricator.wikimedia.org/T259013) [11:41:17] (03PS1) 10Jbond: 
postgresql: drop validate_functions and add type enforcment [puppet] - 10https://gerrit.wikimedia.org/r/616744 (https://phabricator.wikimedia.org/T259013) [11:43:17] (03PS1) 10Urbanecm: Add Turkish powered by MW and Wikimedia project icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616746 (https://phabricator.wikimedia.org/T257732) [11:43:40] !log urbanecm@deploy1001 Synchronized static/images: df9b9acf0876dad9b11d5641fe6fa174c7066f8b: Move footer logos to /static/images/footer (T257732) (duration: 01m 02s) [11:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:46] T257732: Change the footer logos in Turkish Wikipedia - https://phabricator.wikimedia.org/T257732 [11:43:49] (03CR) 10Urbanecm: [C: 03+2] Add Turkish powered by MW and Wikimedia project icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616746 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [11:44:35] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1031.eqiad.wmnet'] ` and were **ALL** successful. 
[11:44:38] (03Merged) 10jenkins-bot: Add Turkish powered by MW and Wikimedia project icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616746 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [11:46:44] !log urbanecm@deploy1001 Synchronized static/images/footer/: 1a5672628b82709350ca74bb784197e7ff5fdc19: Add Turkish powered by MW and Wikimedia project icons (T257732) (duration: 01m 01s) [11:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:49] (03CR) 10Ema: [C: 03+1] Release 8.0.8-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/616724 (owner: 10Vgutierrez) [11:49:43] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 1a5672628b82709350ca74bb784197e7ff5fdc19: Add Turkish powered by MW and Wikimedia project icons (T257732) (duration: 00m 59s) [11:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:48] T257732: Change the footer logos in Turkish Wikipedia - https://phabricator.wikimedia.org/T257732 [11:50:25] !log EU B&C window done [11:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:51] (03CR) 10Tim Starling: [C: 03+2] "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616628 (https://phabricator.wikimedia.org/T257091) (owner: 10Tim Starling) [11:54:37] (03Merged) 10jenkins-bot: Re-enable LilyPond/Score in safe mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616628 (https://phabricator.wikimedia.org/T257091) (owner: 10Tim Starling) [11:55:34] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1032.eqiad.wmnet'] ` and were **ALL** successful. 
[11:56:47] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: re-enabling Score in safe mode (duration: 01m 04s) [11:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:09] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10elukey) ` elukey@an-tool1009:~$ /usr/lib/hue/build/env/bin/python2.7 --version Python 2.7.9 elukey@an-tool1009:~$ /usr/lib/hue/build/env/bin/python2.7 Python 2.7.9 (default, Mar 1 20... [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200728T1200) [12:00:15] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1033.eqiad.wmnet'] ` and were **ALL** successful. [12:02:15] (03PS1) 10Tim Starling: Revert "Re-enable LilyPond/Score in safe mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616767 [12:02:24] (03CR) 10Tim Starling: [C: 03+2] Revert "Re-enable LilyPond/Score in safe mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616767 (owner: 10Tim Starling) [12:03:32] (03Merged) 10jenkins-bot: Revert "Re-enable LilyPond/Score in safe mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616767 (owner: 10Tim Starling) [12:03:37] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616717 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond) [12:04:51] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: disabling lilypond rendering in Score again due to error running gs (duration: 01m 05s) [12:04:51] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1034.eqiad.wmnet [12:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:06] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1034.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [12:05:56] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1031.eqiad.wmnet [12:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:18] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1033.eqiad.wmnet [12:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:44] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1032.eqiad.wmnet [12:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:19] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1035.eqiad.wmnet [12:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:44] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1035.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[12:10:25] (03CR) 10Filippo Giunchedi: [C: 03+2] cache: remove smokeping, not proxied [puppet] - 10https://gerrit.wikimedia.org/r/616723 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [12:10:32] (03PS2) 10Filippo Giunchedi: cache: remove smokeping, not proxied [puppet] - 10https://gerrit.wikimedia.org/r/616723 (https://phabricator.wikimedia.org/T247967) [12:11:19] Urbanecm: when do you think we should do T257943 [12:11:20] T257943: Create Wikipedia Kotava - https://phabricator.wikimedia.org/T257943 [12:11:49] Amir1: depends on how confident you are that using dewiki won't blow dewiki up :) [12:12:03] we should be ready at any time, technically speaking [12:12:39] (I mean, no blockers) [12:12:51] let's go with a smaller wiki in s5 [12:13:04] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [12:13:14] enwikivoyage? [12:13:16] cebwiki? no one notices, you can hide a dead body in those bot created articles [12:13:34] cebwiki? 
it's quite big, but not like dewiki [12:13:39] heh, we share thoughts :D [12:13:39] (03CR) 10Filippo Giunchedi: [C: 03+2] rancid: use active/failover netmon server for rsync [puppet] - 10https://gerrit.wikimedia.org/r/616722 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [12:13:44] (03PS2) 10Filippo Giunchedi: rancid: use active/failover netmon server for rsync [puppet] - 10https://gerrit.wikimedia.org/r/616722 (https://phabricator.wikimedia.org/T247967) [12:13:45] it's big but not in readership [12:13:46] cebwiki LGTM [12:13:51] exactly [12:14:04] so if things break, the impact will be much smaller [12:14:09] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1036.eqiad.wmnet [12:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:26] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1036.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [12:14:51] Amir1: so, tomorrow, right after EU B&C? Does that work for you? [12:15:00] Amir1: s/if things break/when things break/ [12:15:12] xd [12:15:24] we don't _expect_ things will break, usually :D [12:16:38] Urbanecm: that's an impressive commitment to optimism! :) [12:16:39] Urbanecm: your optimism is adorable [12:17:04] Urbanecm: sure [12:17:17] (03PS2) 10Filippo Giunchedi: rancid: ship .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/616720 (https://phabricator.wikimedia.org/T247967) [12:17:33] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1037.eqiad.wmnet [12:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:51] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1037.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [12:18:07] (03CR) 10Filippo Giunchedi: [C: 03+2] rancid: ship .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/616720 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [12:19:17] Amir1: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1875617&oldid=1875580 :) [12:19:29] Awesome! [12:19:44] (03PS2) 10Filippo Giunchedi: prometheus: upgrade snmp-exporter config [puppet] - 10https://gerrit.wikimedia.org/r/616716 (https://phabricator.wikimedia.org/T247967) [12:23:59] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [12:24:12] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [12:24:12] (03PS1) 10Jbond: security::pam: drop validate_functions and add type enforcment [puppet] - 10https://gerrit.wikimedia.org/r/616747 (https://phabricator.wikimedia.org/T259013) [12:24:14] (03PS1) 10Jbond: kmod: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616748 (https://phabricator.wikimedia.org/T259013) [12:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:17] (03PS1) 10Jbond: keyholder: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616749 (https://phabricator.wikimedia.org/T259013) [12:24:19] (03PS1) 10Jbond: sslcert: drop validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616750 (https://phabricator.wikimedia.org/T259013) [12:24:21] (03PS1) 10Jbond: redis: fix legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616751 (https://phabricator.wikimedia.org/T259013) 
[12:24:23] (03PS1) 10Jbond: tillerator: drop legacu validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616752 (https://phabricator.wikimedia.org/T259013) [12:24:25] (03PS1) 10Jbond: prometheus: drop legacy validate_functions [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) [12:24:27] (03PS1) 10Jbond: zuul: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616754 (https://phabricator.wikimedia.org/T259013) [12:24:29] (03PS1) 10Jbond: statsd_proxy: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616755 (https://phabricator.wikimedia.org/T259013) [12:25:17] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [12:26:09] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [12:26:09] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [12:26:21] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:40] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [12:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:26] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10MoritzMuehlenhoff) This commit landed in v2.7.14rc1 (and Stretch has 2.7.13): https://github.com/python/cpython/commit/3e37f4a11547a226c3c2f8bd612510465db397b9 [12:28:46] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:31] (03CR) 10Urbanecm: Initial configuration for avkwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [12:30:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: upgrade snmp-exporter config [puppet] - 10https://gerrit.wikimedia.org/r/616716 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [12:31:13] (03PS7) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) [12:32:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075', diff saved to https://phabricator.wikimedia.org/P12081 and previous config saved to /var/cache/conftool/dbconfig/20200728-123201-marostegui.json [12:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:10] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed 
out before a response was received https://wikitech.wikimedia.org/wiki/Proton [12:33:16] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [12:33:31] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [12:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:38] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:35:56] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [12:36:08] jayme@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [12:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:01] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:11] jayme: one of your messages never logged [12:39:20] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:39:40] It was 13:35:39 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) that failed [12:40:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={pdu,pdu_sentry4} site={codfw,eqiad,eqsin,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:41:07] that's me ^ [12:41:11] !log standardize cr2-esams interfaces [12:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:09] (03PS1) 10Filippo Giunchedi: Revert "prometheus: upgrade snmp-exporter config" [puppet] - 10https://gerrit.wikimedia.org/r/616756 [12:42:35] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "prometheus: upgrade snmp-exporter config" [puppet] - 10https://gerrit.wikimedia.org/r/616756 (owner: 10Filippo Giunchedi) [12:43:27] (03PS1) 10Jbond: sudo: drop validate 
functions [puppet] - 10https://gerrit.wikimedia.org/r/616757 (https://phabricator.wikimedia.org/T259013) [12:43:29] (03PS1) 10Jbond: service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) [12:44:16] (03CR) 10jerkins-bot: [V: 04-1] service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [12:44:45] (03CR) 10jerkins-bot: [V: 04-1] sudo: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616757 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [12:45:07] (03PS1) 10Urbanecm: labs: Turn beta cswiki to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616759 (https://phabricator.wikimedia.org/T259004) [12:45:56] 10Operations, 10Traffic: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (10ema) [12:49:09] 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10Kormat) [12:49:16] 10Operations, 10DBA, 10User-Kormat: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 (10Kormat) p:05Triage→03Medium [12:49:25] (03CR) 10Jbond: [C: 03+2] thanos-query - lvs: update service to monitoring_setup while we update [puppet] - 10https://gerrit.wikimedia.org/r/615726 (owner: 10Jbond) [12:49:40] RECOVERY - Host ripe-atlas-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [12:49:40] RECOVERY - Host ripe-atlas-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [12:49:46] \o/ [12:49:59] (03PS1) 10VulpesVulpes825: Change the logo for Wu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616760 (https://phabricator.wikimedia.org/T259005) [12:50:10] cdanis: you broke it enough that it wrapped around to working? :) [12:50:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:51:02] kormat: I think that's an accurate way to describe `kexec` tbh [12:53:09] :) [12:53:56] nice! [12:55:09] (03CR) 10Jbond: [C: 03+2] lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:55:17] (03PS5) 10Jbond: lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) [12:57:31] PROBLEM - gdnsd checkconf on authdns2001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [12:58:35] (03PS7) 10Elukey: Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) [12:58:53] PROBLEM - gdnsd checkconf on dns3002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [12:58:58] 👀 [12:58:59] (03PS4) 10Jbond: thanos-query - lvs: update service to production state [puppet] - 10https://gerrit.wikimedia.org/r/615727 [12:59:21] did someone just do a authdns deploy? [12:59:39] maybe that should announce on IRC :) [13:00:04] liw and brennen: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200728T1300). 
[13:00:42] * liw follows orders like clockwork [13:00:49] (03PS1) 10Lars Wirzenius: group0 wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616761 [13:00:51] (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616761 (owner: 10Lars Wirzenius) [13:00:58] mark: actually, alarmingly, it correlates with a puppet run [13:01:36] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616761 (owner: 10Lars Wirzenius) [13:01:51] error: plugin_geoip: Invalid resource name 'disc-thanos-query' detected from zonefile lookup [13:01:53] error: Name 'thanos-query.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-thanos-query' [13:02:00] godog: ^^ [13:02:04] cdanis: i am making (what i thought was) an unrelated change on thanos [13:02:08] oh [13:02:11] and just noticed this on an icinga run "- service_description Confd template for /var/lib/gdnsd/discovery-thanos-query.state [13:02:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/615720 << the change [13:02:42] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.53:443]) https://wikitech.wikimedia.org/wiki/PyBal [13:02:48] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 21524 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=eventlog1002&var-datasource=eqiad+prometheus/ops [13:02:53] mmhh probably not all bits and bobs are where they should be [13:03:00] this is baffling [13:03:22] PROBLEM - gdnsd checkconf on dns1002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:03:40] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.53:443]) 
https://wikitech.wikimedia.org/wiki/PyBal [13:03:43] * jbond42 preparing a revert [13:03:49] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.2 [13:04:00] (03PS1) 10Jbond: Revert "lvs - thanos-query: update to use port 443 instead of port 80" [puppet] - 10https://gerrit.wikimedia.org/r/616799 [13:04:00] ok so for sure pybal would need a restart, but yeah I'm baffled too re: gdns [13:04:08] yeh [13:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:18] !log standardize cr3-esams interfaces [13:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:56] the check for gdns looks like it has been updated to use the 443 port which is failing [13:06:09] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler1002/24164/" [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) (owner: 10Elukey) [13:06:55] 10Operations, 10Traffic: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (10ema) p:05Triage→03Medium [13:07:03] mmhh ok so with a pybal restart in theory things should start working [13:07:17] godog: have you done the restart? 
[13:07:36] PROBLEM - gdnsd checkconf on dns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:07:40] no I haven't, I can though, I didn't realize I was the one doing it [13:08:10] will do now [13:08:18] thanks [13:08:42] fyi just noticing that only lvs2009 and 2010 are alerting on pybal (not lvs1*) [13:09:11] !log roll-restart pybal on lvs low-traffic to apply thanos-query changes [13:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:54] 10Operations, 10Traffic, 10Patch-For-Review: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) [13:10:06] 10Operations, 10Traffic, 10Patch-For-Review: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 (10ema) 05Open→03Resolved a:03ema All done! [13:10:10] PROBLEM - gdnsd checkconf on dns2002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:10:14] 10Operations, 10netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (10CDanis) 05Stalled→03Resolved With the serial console now attached, I found myself in a rescue shell. I poked around some, got `/` and `/boot` mounted under the empty `/sysroot`, looked at the failed ke... [13:11:20] PROBLEM - gdnsd checkconf on dns3001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:11:57] jbond42: running puppet on thanos-fe too [13:12:04] ack thanks [13:12:07] I see pybal on lvs2010 failing healthchecks [13:12:48] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of inline comments. Overall this is pretty neat and a path forward from our current state. 
I am not totally sold on the environment" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:14:49] jbond42: mmhh still failing the healthchecks, can't figure out why yet though, this works: thanos-fe2002:~$ curl https://thanos-query.discovery.wmnet/-/ready [13:15:35] i get connection refused still when trying from cumin1001 [13:16:24] yes that's expected, I have bounced pybal on lvs2010 only [13:16:30] ahh [13:16:36] ok fixing the healthcheck url in puppet [13:17:26] (03PS1) 10Filippo Giunchedi: hieradata: fix thanos-query https healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/616806 [13:17:31] ^ [13:17:32] PROBLEM - gdnsd checkconf on dns4001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:17:45] nope that's wrong, sigh [13:17:50] (03CR) 10Jbond: [C: 03+1] hieradata: fix thanos-query https healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/616806 (owner: 10Filippo Giunchedi) [13:17:59] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1035.eqiad.wmnet'] ` and were **ALL** successful. [13:18:11] (03PS2) 10Filippo Giunchedi: hieradata: fix thanos-query https healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/616806 [13:18:31] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1034.eqiad.wmnet'] ` and were **ALL** successful. 
[13:18:44] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers thanos-fe2002.codfw.wmnet, thanos-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:18:48] (03CR) 10Jbond: [C: 03+1] hieradata: fix thanos-query https healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/616806 (owner: 10Filippo Giunchedi) [13:18:50] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix thanos-query https healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/616806 (owner: 10Filippo Giunchedi) [13:19:16] godog: looks good: ✔️ cdanis@lvs2010.codfw.wmnet ~ 🕤☕ curl -v https://thanos-query.discovery.wmnet/-/ready --resolve thanos-query.discovery.wmnet:443:$(dig +short thanos-fe2003.codfw.wmnet) [13:19:30] woot woot [13:19:46] PROBLEM - gdnsd checkconf on dns5001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:19:49] ok I'll puppet + pybal restart + check etc [13:20:00] i have puppet running on the lvs now [13:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1075 with less weight', diff saved to https://phabricator.wikimedia.org/P12082 and previous config saved to /var/cache/conftool/dbconfig/20200728-132023-marostegui.json [13:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:38] I'm ok with whichever but I think it is confusing if more than one person is working on the problem [13:21:00] godog: sure thing I'll back away, better you take the lead thanks [13:21:25] sounds good! thanks jbond42 [13:21:46] just let me know if i can help :) [13:22:04] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:22:46] PROBLEM - gdnsd checkconf on authdns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:23:45] for sure!
if gdns/discovery doesn't recover by itself that warrants more investigation for sure [13:25:18] getting this on the dns hosts [13:25:19] error: plugin_geoip: Invalid resource name 'disc-thanos-query' detected from zonefile lookup [13:25:19] ok in codfw I can reach curl https://thanos-query.discovery.wmnet [13:25:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1075', diff saved to https://phabricator.wikimedia.org/P12083 and previous config saved to /var/cache/conftool/dbconfig/20200728-132520-marostegui.json [13:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:45] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1036.eqiad.wmnet'] ` and were **ALL** successful. [13:26:03] nice effect I didn't consider of pybal not cleaning up ipvs, now both http and https are available even though the former shouldn't [13:27:39] PROBLEM - gdnsd checkconf on dns4002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:29:04] !log roll-restart pybal on eqiad lvs low-traffic to change port for thanos-query [13:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:20] PROBLEM - LVS thanos-query eqiad port 443/tcp - Prometheus long-term storage- query service IPv4 on thanos-query.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.53 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:29:21] godog: i think that the gdns is affected by the `state: monitoring_setup` [13:29:37] do you think im good to switch that back to production? 
[13:30:13] specifically https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/dns/auth/discovery.pp#L6 [13:30:25] RECOVERY - LVS thanos-query eqiad port 443/tcp - Prometheus long-term storage- query service IPv4 on thanos-query.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 1.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:30:41] jbond42: yeah I think we're good to switch back now [13:30:52] ack merging now [13:31:00] (03PS5) 10Jbond: thanos-query - lvs: update service to production state [puppet] - 10https://gerrit.wikimedia.org/r/615727 [13:31:09] (03CR) 10Jbond: [V: 03+2 C: 03+2] thanos-query - lvs: update service to production state [puppet] - 10https://gerrit.wikimedia.org/r/615727 (owner: 10Jbond) [13:32:14] ok I'm merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/615733 too [13:32:21] ack thanks [13:32:29] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: move thanos-query clients to https [puppet] - 10https://gerrit.wikimedia.org/r/615733 (https://phabricator.wikimedia.org/T151009) (owner: 10Filippo Giunchedi) [13:33:21] RECOVERY - gdnsd checkconf on dns1001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:33:49] RECOVERY - gdnsd checkconf on dns1002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:34:17] RECOVERY - gdnsd checkconf on dns4001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:34:19] RECOVERY - gdnsd checkconf on dns5001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:34:21] \o/ [13:34:21] RECOVERY - gdnsd checkconf on dns3001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:34:33] RECOVERY - gdnsd checkconf on dns4002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:34:33] RECOVERY - gdnsd checkconf on dns3002 is OK: OK: gdnsd -S checkconf success
https://wikitech.wikimedia.org/wiki/gdnsd [13:34:52] :D yay [13:35:59] eventlog1002 is basically out of disk space, not sure who's the best contact ? [13:36:09] RECOVERY - gdnsd checkconf on authdns2001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:36:10] no space on the vg to allocate btw [13:36:21] RECOVERY - gdnsd checkconf on dns2002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:36:28] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1034.eqiad.wmnet [13:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:07] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1038.eqiad.wmnet [13:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:22] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1038.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [13:37:49] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1035.eqiad.wmnet [13:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:20] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1039.eqiad.wmnet [13:38:21] jbond42: I'm finishing a puppet run on grafana and icinga to switch thanos-query to https, after that I'll remove the stale ipvs service so http stops working [13:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:27] godog: elukey: or ottomata may be able to help with eventlog [13:38:40] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1039.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [13:38:41] sound good to me thanks [13:38:56] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1036.eqiad.wmnet [13:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:10] yep definitely out of disk space, cc ottomata elukey [13:39:10] oh looking [13:39:17] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10Joe) My first note here is that we are actively discouraging shelling out from MediaWiki in production for a s... [13:39:20] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1040.eqiad.wmnet [13:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:51] RECOVERY - gdnsd checkconf on authdns1001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [13:39:52] (03Abandoned) 10Jbond: Revert "lvs - thanos-query: update to use port 443 instead of port 80" [puppet] - 10https://gerrit.wikimedia.org/r/616799 (owner: 10Jbond) [13:40:31] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1040.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[13:42:35] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) > I think the ideal model for this is having a very simple service that basically accepts post reques... [13:43:34] deleted some files, i think maybe we should stop saving data on eventlog1002. going to make a ticket. [13:45:15] (03PS2) 10Jbond: statsd_proxy: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616755 (https://phabricator.wikimedia.org/T259013) [13:45:44] (03CR) 10Jbond: [C: 03+2] zuul: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616754 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:45:51] (03PS2) 10Jbond: zuul: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616754 (https://phabricator.wikimedia.org/T259013) [13:46:01] (03PS2) 10Jbond: tillerator: drop legacu validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616752 (https://phabricator.wikimedia.org/T259013) [13:46:40] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: hiera_lookup failing to preform lookups after hiera5 upgrade - https://phabricator.wikimedia.org/T258931 (10colewhite) `sudo puppet lookup` works! Thanks! [13:46:42] (03PS2) 10Jbond: redis: fix legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616751 (https://phabricator.wikimedia.org/T259013) [13:46:53] 10Operations, 10serviceops: Clean up the /*/mw/ mcrouter routing prefix - https://phabricator.wikimedia.org/T256291 (10RLazarus) 05Open→03Invalid [13:47:11] (03PS2) 10Jbond: kmod: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616748 (https://phabricator.wikimedia.org/T259013) [13:47:11] godog: again? 
I cleaned it up this morning /o\ [13:47:14] (03PS1) 10Cwhite: hiera_lookup: clarify suggestion to use puppet lookup on the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/616809 (https://phabricator.wikimedia.org/T258931) [13:47:33] (03PS2) 10Jbond: security::pam: drop validate_functions and add type enforcment [puppet] - 10https://gerrit.wikimedia.org/r/616747 (https://phabricator.wikimedia.org/T259013) [13:47:52] (03PS2) 10Jbond: postgresql: drop validate_functions and add type enforcment [puppet] - 10https://gerrit.wikimedia.org/r/616744 (https://phabricator.wikimedia.org/T259013) [13:48:58] (03PS2) 10Jbond: mtail: drop legacy validate functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616735 (https://phabricator.wikimedia.org/T259013) [13:49:35] elukey: it is crapping itself faster than we can clean :| [13:50:18] !log remove stale ipvs thanos-query service on port 80 [13:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:40] (03CR) 10Jbond: [C: 03+2] tillerator: drop legacu validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616752 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:50:52] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1037.eqiad.wmnet'] ` and were **ALL** successful. 
[13:50:54] (03CR) 10Jbond: [C: 03+2] redis: fix legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616751 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:51:06] (03CR) 10Jbond: [C: 03+2] kmod: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616748 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:51:19] (03CR) 10Jbond: [C: 03+2] security::pam: drop validate_functions and add type enforcment [puppet] - 10https://gerrit.wikimedia.org/r/616747 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:51:34] (03CR) 10Jbond: [C: 03+2] postgresql: drop validate_functions and add type enforcment [puppet] - 10https://gerrit.wikimedia.org/r/616744 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:51:55] (03CR) 10Jbond: [C: 03+2] mtail: drop legacy validate functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616735 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:52:08] (03CR) 10Jbond: [C: 03+2] udev::rule: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616737 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:52:15] (03PS2) 10Jbond: udev::rule: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616737 (https://phabricator.wikimedia.org/T259013) [13:54:46] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:55:02] (03PS2) 10Jbond: sudo: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616757 (https://phabricator.wikimedia.org/T259013) [13:55:24] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1037.eqiad.wmnet [13:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:40] (03PS2) 10Jbond: service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 
(https://phabricator.wikimedia.org/T259013) [13:55:42] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:55:51] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1041.eqiad.wmnet [13:55:53] 10Operations, 10Graphoid, 10serviceops, 10Chinese-Sites, 10Platform Engineering (Icebox): Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 (10Aklapper) [13:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:08] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1041.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [13:56:12] (03CR) 10jerkins-bot: [V: 04-1] service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:56:23] (03CR) 10jerkins-bot: [V: 04-1] sudo: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616757 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [13:56:26] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [13:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:44] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [13:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:56] 10Operations, 10Fundraising-Backlog: New wiki for TY pages with same content as donatewiki - https://phabricator.wikimedia.org/T259002 (10Pcoombe) I would actually prefer if we didn't copy all the content from donatewiki. There's a lot of stuff there, and having duplicate versions of it all will just be confus... 
[13:57:40] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Pcoombe) [13:58:07] !log installing perl security updates [13:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:30] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:31] (03PS1) 10Cwhite: hiera: specify tlsproxy configuration for grafana [puppet] - 10https://gerrit.wikimedia.org/r/616811 (https://phabricator.wikimedia.org/T222826) [13:59:39] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [13:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:39] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:57] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) [14:01:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] ores: add envoy-proxy for TLS termination behind ATS [puppet] - 10https://gerrit.wikimedia.org/r/615569 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [14:01:12] (03PS1) 10Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - 10https://gerrit.wikimedia.org/r/616812 [14:02:05] (03CR) 10Jcrespo: "> So clearly, the problem could happen at any point in code with remote_execution. 
This makes me feel like there is some issue with our Cu" [software/transferpy] - 10https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [14:02:07] (03PS3) 10Jbond: sudo: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616757 (https://phabricator.wikimedia.org/T259013) [14:02:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1148', diff saved to https://phabricator.wikimedia.org/P12084 and previous config saved to /var/cache/conftool/dbconfig/20200728-140207-marostegui.json [14:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1148', diff saved to https://phabricator.wikimedia.org/P12085 and previous config saved to /var/cache/conftool/dbconfig/20200728-140220-marostegui.json [14:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:29] (03CR) 10jerkins-bot: [V: 04-1] helmfile: add data for enabling service proxy in k8s [puppet] - 10https://gerrit.wikimedia.org/r/616812 (owner: 10Giuseppe Lavagetto) [14:02:45] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1148', diff saved to https://phabricator.wikimedia.org/P12086 and previous config saved to /var/cache/conftool/dbconfig/20200728-140249-marostegui.json [14:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1147', diff saved to https://phabricator.wikimedia.org/P12087 and previous config saved to /var/cache/conftool/dbconfig/20200728-140313-marostegui.json [14:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:20] PROBLEM - Disk space on eventlog1002 is CRITICAL: 
DISK CRITICAL - free space: /srv 27369 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=eventlog1002&var-datasource=eqiad+prometheus/ops [14:06:25] (03PS3) 10Jbond: service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) [14:06:43] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616755 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:06:57] (03CR) 10Jbond: [C: 03+2] sudo: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616757 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:06:59] (03CR) 10jerkins-bot: [V: 04-1] service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:08:09] (03PS1) 10JMeybohm: Add a new action to helm-chartctl to upload prebuild tgz [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616813 [14:08:11] (03PS1) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/616814 [14:09:22] (03PS4) 10Jbond: service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) [14:10:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616809 (https://phabricator.wikimedia.org/T258931) (owner: 10Cwhite) [14:11:07] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:11:09] (03PS2) 10Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - 10https://gerrit.wikimedia.org/r/616812 [14:11:33] (03CR) 10Jbond: [C: 03+2] statsd_proxy: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616755 
(https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:11:41] (03PS3) 10Jbond: statsd_proxy: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616755 (https://phabricator.wikimedia.org/T259013) [14:13:17] (03PS2) 10Jbond: tmpreaper: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616743 (https://phabricator.wikimedia.org/T259013) [14:13:29] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616743 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:13:39] (03PS2) 10Jbond: sysfs::confile: drop legacy functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616736 (https://phabricator.wikimedia.org/T259013) [14:14:10] (03CR) 10jerkins-bot: [V: 04-1] sysfs::confile: drop legacy functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616736 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:14:37] (03CR) 10Mholloway: [C: 03+2] Proton: Remove unneeded APP_ENABLE_CANCELLABLE_PROMISES env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/616539 (owner: 10Mholloway) [14:15:14] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [14:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:24] (03PS3) 10Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - 10https://gerrit.wikimedia.org/r/616812 [14:15:38] (03PS3) 10Jbond: sysfs::confile: drop legacy functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616736 (https://phabricator.wikimedia.org/T259013) [14:15:56] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616560 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [14:15:58] (03Merged) 10jenkins-bot: Proton: Remove unneeded APP_ENABLE_CANCELLABLE_PROMISES env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/616539 (owner: 10Mholloway) [14:16:00] (03Merged) 
10jenkins-bot: Update Proton to 2020-07-27-123712-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/616540 (owner: 10Mholloway) [14:16:34] (03CR) 10jerkins-bot: [V: 04-1] helmfile: add data for enabling service proxy in k8s [puppet] - 10https://gerrit.wikimedia.org/r/616812 (owner: 10Giuseppe Lavagetto) [14:16:56] (03CR) 10Jbond: [C: 03+2] tmpreaper: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616743 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:17:52] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24169/" [puppet] - 10https://gerrit.wikimedia.org/r/616736 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:17:55] (03CR) 10Jbond: [C: 03+2] sysfs::confile: drop legacy functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616736 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:19:14] (03CR) 10Urbanecm: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616759 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [14:19:25] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:37] (03PS2) 10Jbond: keyholder: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616749 (https://phabricator.wikimedia.org/T259013) [14:19:41] (03PS2) 10Jbond: sslcert: drop validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616750 (https://phabricator.wikimedia.org/T259013) [14:19:57] (03PS2) 10Jbond: graphite::web: drop validate_functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) [14:19:59] (03Merged) 10jenkins-bot: labs: Turn beta cswiki to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616759 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [14:21:15] 
PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:21:15] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [14:23:02] !log bounced centrallog rsyslog services in codfw/eqiad [14:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:41] (03PS4) 10Giuseppe Lavagetto: helmfile: add data for enabling service proxy in k8s [puppet] - 10https://gerrit.wikimedia.org/r/616812 [14:26:38] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:27:11] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616749 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:27:46] (03PS2) 10Jbond: prometheus: drop legacy validate_functions [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) [14:27:59] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:27:59] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [14:28:10] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:29:33] PROBLEM - Ensure local MW versions match expected deployment on wtp1034 is CRITICAL: CRITICAL: 127 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:29:39] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [14:30:09] PROBLEM - Ensure local MW versions match expected deployment on wtp1035 is CRITICAL: CRITICAL: 127 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:31:57] (03PS1) 10Andrew Bogott: Move cloudceph.svc.eqiad.wmnet service name to cloudceph.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/616817 (https://phabricator.wikimedia.org/T258826) [14:32:14] (03PS1) 10Andrew Bogott: cloudceph: don't use lvs for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/616818 (https://phabricator.wikimedia.org/T258826) [14:34:43] (03PS3) 10Jbond: graphite::web: drop validate_functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) [14:35:20] (03PS3) 10Herron: admins: add all members of wdqs-admins to wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/616593 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [14:37:03] (03PS4) 10Jbond: graphite::web: drop validate_functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) [14:37:05] (03CR) 10Herron: [C: 03+2] admins: add all members of wdqs-admins to wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/616593 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [14:39:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] 
charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [14:40:03] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10akosiaris) [14:40:07] (03PS5) 10Jbond: graphite::web: drop validate_functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) [14:41:11] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24174/" [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:41:23] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:46:59] 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10herron) https://gerrit.wikimedia.org/r/616593 has been merged, and I've ran puppet on the wdqs* hosts... [14:47:17] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1038.eqiad.wmnet'] ` and were **ALL** successful. [14:47:37] (03CR) 10Herron: [C: 03+2] mx: add paniclog to exim logrotate [puppet] - 10https://gerrit.wikimedia.org/r/616524 (https://phabricator.wikimedia.org/T257016) (owner: 10Herron) [14:48:24] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [14:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:13] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1040.eqiad.wmnet'] ` and were **ALL** successful. [14:50:09] (03PS4) 10Ayounsi: Routers interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/613641 [14:50:45] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:50:45] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [14:51:47] (03PS4) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) [14:52:17] (03PS1) 10Jbond: add missing keytab [labs/private] - 10https://gerrit.wikimedia.org/r/616821 [14:52:34] (03CR) 10Jbond: [V: 03+2 C: 03+2] add missing keytab [labs/private] - 10https://gerrit.wikimedia.org/r/616821 (owner: 10Jbond) [14:52:52] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [14:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:48] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1038.eqiad.wmnet [14:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:05] (03CR) 10Volans: [C: 03+1] "Looks sane to me, I know has been thoroughly tested." 
[homer/public] - 10https://gerrit.wikimedia.org/r/613641 (owner: 10Ayounsi) [14:55:07] (03PS3) 10Jbond: prometheus: drop legacy validate_functions [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) [14:55:11] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1042.eqiad.wmnet [14:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:30] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1042.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [14:57:06] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [14:57:09] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [14:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:37] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Te [14:57:37] page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [14:57:56] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [14:58:05] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1040.eqiad.wmnet [14:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:31] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1043.eqiad.wmnet [14:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:58] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1043.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [14:59:29] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:59:48] (03PS2) 10VulpesVulpes825: Change the logo for Wu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616760 (https://phabricator.wikimedia.org/T259005) [15:00:11] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add routers interfaces support to wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642 (owner: 10Ayounsi) [15:00:43] (03PS1) 10Kormat: Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 [15:00:50] (03PS3) 10Jbond: sslcert: drop validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616750 (https://phabricator.wikimedia.org/T259013) [15:01:05] (03PS2) 10Kormat: [WIP] Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 [15:01:14] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616750 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:01:17] (03CR) 10Herron: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/616116 (https://phabricator.wikimedia.org/T248181) (owner: 10Herron) [15:01:35] (03PS2) 10Andrew Bogott: cloudceph: don't use lvs for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/616818 (https://phabricator.wikimedia.org/T258826) [15:01:37] (03PS1) 10Andrew Bogott: Make the rest of the cloudcephosd hosts into osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616847 (https://phabricator.wikimedia.org/T251619) [15:01:47] !log ayounsi@deploy1001 Started deploy [homer/deploy@fcf4332]: CR613642 [15:01:47] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) If we want to play this very safe, we could do the following steps: * step 1 - stop redis on mc1036 and wait a day to see if anything is reported or 
i... [15:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:57] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Te [15:01:57] page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [15:01:58] !log ayounsi@deploy1001 Finished deploy [homer/deploy@fcf4332]: CR613642 (duration: 00m 11s) [15:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:27] (03PS2) 10Jbond: cassandra::metrics: drop legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/616732 (https://phabricator.wikimedia.org/T259013) [15:02:37] (03CR) 10Ayounsi: [C: 03+2] Routers interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/613641 (owner: 10Ayounsi) [15:03:05] (03Merged) 10jenkins-bot: Routers interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/613641 (owner: 10Ayounsi) [15:03:53] (03PS2) 10Jbond: casandra::instance: drop lgacey _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) [15:04:24] (03CR) 10jerkins-bot: [V: 04-1] casandra::instance: drop lgacey _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:04:51] (03PS3) 10Jbond: casandra: drop lgacey _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) [15:05:20] (03CR) 10jerkins-bot: [V: 04-1] casandra: drop lgacey _validate_ functions [puppet] - 
10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:05:48] (03PS2) 10Jbond: profile::redis::master: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616739 (https://phabricator.wikimedia.org/T259013) [15:05:54] (03CR) 10Bstorm: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/616818 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [15:06:29] !log ayounsi@deploy1001 Started deploy [homer/deploy@fcf4332]: CR613642 [15:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:10] (03CR) 10Jbond: [C: 03+2] profile::redis::master: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616739 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:07:27] (03CR) 10Andrew Bogott: [C: 03+2] Make the rest of the cloudcephosd hosts into osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616847 (https://phabricator.wikimedia.org/T251619) (owner: 10Andrew Bogott) [15:07:42] (03PS1) 10Mholloway: Proton: Restore APP_ENABLE_CANCELLABLE_PROMISES [deployment-charts] - 10https://gerrit.wikimedia.org/r/616848 [15:07:52] (03PS2) 10Jbond: profile::mariadb::eventloggin: drop legacy validate_ function [puppet] - 10https://gerrit.wikimedia.org/r/616740 (https://phabricator.wikimedia.org/T259013) [15:07:54] (03PS3) 10VulpesVulpes825: Change the logo for Wu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616760 (https://phabricator.wikimedia.org/T259005) [15:08:43] !log ayounsi@deploy1001 Finished deploy [homer/deploy@fcf4332]: CR613642 (duration: 02m 14s) [15:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:17] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616753 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:09:23] (03CR) 10Mholloway: [C: 03+2] Proton: Restore APP_ENABLE_CANCELLABLE_PROMISES [deployment-charts] - 
10https://gerrit.wikimedia.org/r/616848 (owner: 10Mholloway) [15:09:31] (03PS3) 10Jbond: profile::mariadb::eventloggin: drop legacy validate_ function [puppet] - 10https://gerrit.wikimedia.org/r/616740 (https://phabricator.wikimedia.org/T259013) [15:09:53] (03CR) 10Bstorm: Move cloudceph.svc.eqiad.wmnet service name to cloudceph.eqiad.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/616817 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [15:10:25] (03Merged) 10jenkins-bot: Proton: Restore APP_ENABLE_CANCELLABLE_PROMISES [deployment-charts] - 10https://gerrit.wikimedia.org/r/616848 (owner: 10Mholloway) [15:10:40] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1039.eqiad.wmnet'] ` and were **ALL** successful. [15:11:17] (03CR) 10Jbond: [C: 03+2] profile::mariadb::eventloggin: drop legacy validate_ function [puppet] - 10https://gerrit.wikimedia.org/r/616740 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:11:47] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . 
[15:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:01] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [15:12:01] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [15:12:21] (03PS2) 10Jbond: profile::openstack: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) [15:12:34] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:12:37] (03CR) 10jerkins-bot: [V: 04-1] profile::openstack: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:13:23] (03PS3) 10Jbond: profile::openstack: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) [15:13:32] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [15:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:45] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . 
[15:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:38] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:12] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [15:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:02] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [15:17:03] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [15:17:03] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [15:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:25] (03PS5) 10Jbond: service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) [15:17:53] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:18:00] (03PS1) 10Volans: Force Python compilation of the plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/616849 [15:18:26] (03CR) 10Jbond: [C: 03+2] keyholder: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616749 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:19:05] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:10] (03PS5) 10Hashar: Explicitly 
mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [15:19:33] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1041.eqiad.wmnet'] ` and were **ALL** successful. [15:19:44] (03CR) 10Hashar: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [15:21:27] (03PS1) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) [15:21:45] (03PS2) 10Andrew Bogott: Remove cloudcepf.svc.eqiad.wmnet service name [dns] - 10https://gerrit.wikimedia.org/r/616817 (https://phabricator.wikimedia.org/T258826) [15:22:40] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [15:23:00] (03PS4) 10Hashar: scap::sources stop assuming mediawiki/services as a prefix [puppet] - 10https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) [15:23:18] (03PS2) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) [15:23:34] (03PS3) 10Andrew Bogott: Remove cloudceph.svc.eqiad.wmnet service name [dns] - 10https://gerrit.wikimedia.org/r/616817 (https://phabricator.wikimedia.org/T258826) [15:23:43] (03CR) 10Hashar: "Also updated the entries in hieradata/cloud/eqiad1/deployment-prep/common.yaml for the beta cluster." 
[puppet] - 10https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [15:24:35] (03PS6) 10Jbond: service: drop legacy validate functions [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) [15:25:05] (03CR) 10Andrew Bogott: [C: 03+2] cloudceph: don't use lvs for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/616818 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [15:26:54] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Force Python compilation of the plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/616849 (owner: 10Volans) [15:27:18] (03PS1) 10Urbanecm: Revert "labs: Turn beta cswiki to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616839 (https://phabricator.wikimedia.org/T259004) [15:27:43] (03CR) 10Urbanecm: [C: 03+2] Revert "labs: Turn beta cswiki to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616839 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [15:27:53] (03PS1) 10Andrew Bogott: Remove lvs from cloudcephmon nodes [puppet] - 10https://gerrit.wikimedia.org/r/616852 (https://phabricator.wikimedia.org/T258826) [15:28:32] !log ayounsi@deploy1001 Started deploy [homer/deploy@5e999c8]: CR613642 [15:28:35] (03Merged) 10jenkins-bot: Revert "labs: Turn beta cswiki to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616839 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [15:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:59] (03CR) 10Andrew Bogott: [C: 03+2] Remove lvs from cloudcephmon nodes [puppet] - 10https://gerrit.wikimedia.org/r/616852 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [15:29:33] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1039.eqiad.wmnet [15:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:51] (03CR) 
10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:30:09] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1044.eqiad.wmnet [15:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:27] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1044.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [15:30:30] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Monitor for mariadb backups of matomo&analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/616452 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [15:30:43] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup db1108 as the source of backups for analytics dbs [puppet] - 10https://gerrit.wikimedia.org/r/616453 (https://phabricator.wikimedia.org/T234826) (owner: 10Jcrespo) [15:30:51] (03CR) 10Andrew Bogott: [C: 03+2] Remove cloudceph.svc.eqiad.wmnet service name [dns] - 10https://gerrit.wikimedia.org/r/616817 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [15:30:52] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1041.eqiad.wmnet [15:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:39] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=wtp1045.eqiad.wmnet [15:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:51] (03PS4) 10Jbond: profile::openstack: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) [15:32:02] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1045.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [15:32:11] !log ayounsi@deploy1001 Finished deploy [homer/deploy@5e999c8]: CR613642 (duration: 03m 38s) [15:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:51] !log ayounsi@deploy1001 Started deploy [homer/deploy@5e999c8]: once more [15:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:36] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:33:49] (03PS5) 10Jbond: profile::openstack: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) [15:35:57] !log ayounsi@deploy1001 Finished deploy [homer/deploy@5e999c8]: once more (duration: 03m 06s) [15:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:45] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [15:41:08] (03PS2) 10Hnowlan: aptrepo: add component for future envoy packages [puppet] - 10https://gerrit.wikimedia.org/r/616560 (https://phabricator.wikimedia.org/T254908) [15:42:37] (03CR) 10Hnowlan: [C: 03+2] aptrepo: add component for future envoy packages [puppet] - 10https://gerrit.wikimedia.org/r/616560 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [15:42:48] (03PS1) 10Andrew Bogott: Add a couple of dummy ceph keyrings [labs/private] - 
10https://gerrit.wikimedia.org/r/616853 [15:43:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:44:56] !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=wtp1034.* [15:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:07] !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=wtp1035.* [15:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:21] (03PS4) 10Jbond: casandra: drop lgacey _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) [15:47:30] (03PS5) 10Jbond: casandra: drop legacy _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) [15:47:54] (03PS1) 10Andrew Bogott: Ceph osd nodes: install bootstrap keyring [puppet] - 10https://gerrit.wikimedia.org/r/616855 (https://phabricator.wikimedia.org/T251619) [15:48:28] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [15:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:37] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add a couple of dummy ceph keyrings [labs/private] - 10https://gerrit.wikimedia.org/r/616853 (owner: 10Andrew Bogott) [15:48:38] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:13] (03CR) 10Jbond: [C: 03+2] sslcert: drop validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616750 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:50:36] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:50:39] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:10] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [15:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:23] (03CR) 10Cwhite: "PCC checks out https://puppet-compiler.wmflabs.org/compiler1002/24185/" [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [15:51:40] (03PS3) 10Kormat: Create debian packages. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/616846 [15:52:03] (03PS1) 10Filippo Giunchedi: prometheus: upgrade snmp-exporter config [puppet] - 10https://gerrit.wikimedia.org/r/616857 (https://phabricator.wikimedia.org/T247967) [15:52:43] (03PS2) 10Andrew Bogott: Ceph osd nodes: install bootstrap keyring [puppet] - 10https://gerrit.wikimedia.org/r/616855 (https://phabricator.wikimedia.org/T251619) [15:52:46] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:38] (03CR) 10Andrew Bogott: [C: 03+2] Ceph osd nodes: install bootstrap keyring [puppet] - 10https://gerrit.wikimedia.org/r/616855 (https://phabricator.wikimedia.org/T251619) (owner: 10Andrew Bogott) [15:54:50] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:58] (03PS6) 10Jbond: casandra: drop legacy _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) [15:59:19] (03PS6) 10Jbond: graphite::web: drop validate_functions and add type validation [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) [16:00:04] godog and _joe_: Dear deployers, time to do the Puppet request window(Max 6 patches) deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200728T1600). [16:00:35] (03Abandoned) 10Jbond: cassandra::metrics: drop legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/616732 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [16:01:53] (03PS1) 10DCausse: remove warning about prefixes on mediainfo dumps README [puppet] - 10https://gerrit.wikimedia.org/r/616860 [16:02:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:02:32] (03PS1) 10Filippo Giunchedi: logstash: fix thanos-query https url [puppet] - 10https://gerrit.wikimedia.org/r/616862 [16:02:38] (03PS2) 10Cwhite: hiera: specify tlsproxy configuration for grafana [puppet] - 10https://gerrit.wikimedia.org/r/616811 (https://phabricator.wikimedia.org/T222826) [16:03:38] (03PS7) 10Jbond: casandra: drop legacy _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) [16:03:59] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: fix thanos-query https url [puppet] - 10https://gerrit.wikimedia.org/r/616862 (owner: 10Filippo Giunchedi) [16:04:03] if someone can help https://phabricator.wikimedia.org/T259023 be resolved in a near-future backport window, that would be wonderful [16:04:19] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on wtp1034 is CRITICAL: CRITICAL: 127 mismatched wikiversions JMeybohm Problem during reimage, host is depooled for investigation https://wikitech.wikimedia.org/wiki/Application_servers [16:04:19] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on wtp1035 is CRITICAL: CRITICAL: 127 mismatched wikiversions JMeybohm Problem during reimage, host is depooled for investigation 
https://wikitech.wikimedia.org/wiki/Application_servers [16:05:01] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1043.eqiad.wmnet'] ` and were **ALL** successful. [16:06:18] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Krinkle) >>! In T259002#6340904, @Pcoombe wrote: > Config notes: > - needs `$wgRawHtml = true;` > - […] I'm looking at this page as an example (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [16:06:47] (03PS2) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) [16:07:05] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1042.eqiad.wmnet'] ` and were **ALL** successful. 
[16:07:15] (03PS2) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - IV: Enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609215 (https://phabricator.wikimedia.org/T251279) [16:09:21] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1043.eqiad.wmnet [16:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:51] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 (10wiki_willy) Thanks @Vgutierrez [16:11:59] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1042.eqiad.wmnet [16:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:14] there will be 2 new backup alerts that may fail, I will ack them [16:12:52] jynus: wow what a stupid mistake that I made for db1108'd AAAA record, sigh, sending a patch [16:12:55] thanks a lot for spotting it [16:15:33] PROBLEM - MariaDB Replica Lag: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 711.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:16:21] (03PS1) 10Elukey: Fix db1108's AAAA records [dns] - 10https://gerrit.wikimedia.org/r/616864 [16:17:19] volans: --^ huge pebcak from my side, maybe CI could alert before approving IPs already allocated? 
[16:17:38] jynus: if you have a moment to review --^ [16:17:50] PROBLEM - dump of analytics_meta in eqiad on icinga1001 is CRITICAL: We could not find any completed dump for analytics_meta at eqiad https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:17:52] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [16:19:11] or anybody that wants to sanity check the DNS change above [16:19:16] elukey: I can look in 5 [16:19:38] PROBLEM - dump of matomo in eqiad on icinga1001 is CRITICAL: We could not find any completed dump for matomo at eqiad https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:19:45] <3 [16:20:47] elukey: I think the eventual plan is to have it all in netbox anyway? [16:21:09] cdanis: yes makes sense [16:21:26] !log imported envoyproxy 1.15.0-1 deb into component/envoy-future for buster-wikimedia [16:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:52] (03PS1) 10Brennen Bearnes: filebackend: Fix index error in SwiftFileBackend [core] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/616842 (https://phabricator.wikimedia.org/T259023) [16:22:03] elukey: (that's a load-bearing "eventual", btw) [16:22:30] elukey: that could be the backups running [16:23:33] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:52] RECOVERY - MariaDB Replica Lag: analytics_meta on db1108 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:24:05] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10nettrom_WMF) >>! 
In T252391#6337508, @kostajh wrote: > That said, I am pinging @nettrom_WMF and @MMiller_WMF to see if EditorJourney is something we are still... [16:25:39] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:46] elukey: interesting, it did it in the first run [16:29:55] I'm looking what happened with the following PS [16:30:16] RECOVERY - Ensure local MW versions match expected deployment on wtp1034 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:30:20] RECOVERY - Ensure local MW versions match expected deployment on wtp1035 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:31:05] !log mr1-eqiad# delete security nat source rule-set mgmt-to-untrust (unused, no matching ACL) [16:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:23] i'd like to deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/616842/ - production generally clear at the moment? [16:31:28] (03CR) 10Ppchelko: [C: 03+2] Limit concurrency for processMediaModeration job [deployment-charts] - 10https://gerrit.wikimedia.org/r/615572 (https://phabricator.wikimedia.org/T258653) (owner: 10Ppchelko) [16:32:23] (03CR) 10Jcrespo: "I would review, but I have no idea what is the policy for asigning ips. 
I only repored the error because it failed to me and Xionox found " [dns] - 10https://gerrit.wikimedia.org/r/616864 (owner: 10Elukey) [16:32:29] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1034.eqiad.wmnet [16:32:30] (03Merged) 10jenkins-bot: Limit concurrency for processMediaModeration job [deployment-charts] - 10https://gerrit.wikimedia.org/r/615572 (https://phabricator.wikimedia.org/T258653) (owner: 10Ppchelko) [16:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:56] jynus: the ip4 is already assigned, it is just to see if the ipv6 one matches :) [16:33:04] I'll merge in a bit after meetings [16:33:06] (03PS1) 10Hnowlan: envoy-future: new image for future versions of Envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/616865 (https://phabricator.wikimedia.org/T254906) [16:33:37] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1035.eqiad.wmnet [16:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:53] brennen: should be good to go [16:33:59] rzl: ack, thx. [16:34:02] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [16:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:18] (03CR) 10Brennen Bearnes: [C: 03+2] filebackend: Fix index error in SwiftFileBackend [core] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/616842 (https://phabricator.wikimedia.org/T259023) (owner: 10Brennen Bearnes) [16:36:05] (03CR) 10Jcrespo: [C: 03+1] Fix db1108's AAAA records [dns] - 10https://gerrit.wikimedia.org/r/616864 (owner: 10Elukey) [16:36:56] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[16:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:54] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1045.eqiad.wmnet'] ` and were **ALL** successful. [16:39:45] (03CR) 10Cwhite: [C: 03+2] hiera_lookup: clarify suggestion to use puppet lookup on the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/616809 (https://phabricator.wikimedia.org/T258931) (owner: 10Cwhite) [16:39:58] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [16:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:01] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1045.eqiad.wmnet [16:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:28] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-consumer@client-side-events-log https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging [16:43:49] ^ is us [16:43:54] no biggy in meeting will fix shortly [16:44:27] (03CR) 10Jbol: [C: 03+1] "Looks good to me!" 
[phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [16:45:15] !log remove mr1-codfw source NAT (not used) [16:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:31] (03CR) 10Ebernhardson: "ahh, i was under the impression the version number would be detected from the changelog, but the README clearly says rules has to be updat" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/616602 (owner: 10Ebernhardson) [16:46:38] (03PS2) 10Ebernhardson: increment extra plugin to 6.5.4-wmf-11 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/616602 [16:47:07] (03PS1) 10Ayounsi: Add NAT stanza for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/616870 [16:47:13] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@0982d4e]: convert_to_esbulk: repair variable ref before assign [16:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:38] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:49] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10JMeybohm) All hosts but `wtp104[6-8].eqiad.wmnet` completed. Unfortunately wtp1044.eqiad.wmnet is still waiting for the puppet run after reboot. Checking back on that later. 
[16:48:32] (03CR) 10Ayounsi: [C: 03+2] Add NAT stanza for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/616870 (owner: 10Ayounsi) [16:49:01] (03Merged) 10jenkins-bot: Add NAT stanza for management routers [homer/public] - 10https://gerrit.wikimedia.org/r/616870 (owner: 10Ayounsi) [16:49:20] elukey: I'm still debugging, it was reported as TOO_MANY_PUBLIC_NAMES that is a warning because we have other valid use cases, I'm checking why public [16:49:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:44] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10herron) 05Open→03Stalled [16:51:36] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/616738 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [16:51:42] (03PS5) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) [16:51:44] (03PS2) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609210 (https://phabricator.wikimedia.org/T251279) [16:51:46] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@0982d4e]: convert_to_esbulk: repair variable ref before assign (duration: 04m 33s) [16:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:05] elukey: we use the ipaddress module and check the is_private, and ofc IPv6 are public, so yeah it's a bug, it would have been reported as Error.TOO_MANY_NAMES instead [16:52:28] (03CR) 10Cwhite: [C: 03+1] hiera3: remove old hiera backend files [puppet] - 10https://gerrit.wikimedia.org/r/615161 (owner: 
10Jbond) [16:53:23] volans: yeah also my big pebcak, I didn't check warnings :( [16:53:34] CI doesn't report them, just the counter [16:53:42] and traffic didn't want to implement the delta check [16:53:43] yes but I run tox before sending [16:53:46] that's supported in zone_validator [16:53:52] tox doesn't show it [16:54:02] (03PS3) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) [16:54:04] you have to modify deploy-check and change the call to zone_validator from -e to -w [16:54:05] ah okok the counters [16:54:12] yes I agree difficult to spot a +1 [16:54:14] (03Merged) 10jenkins-bot: filebackend: Fix index error in SwiftFileBackend [core] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/616842 (https://phabricator.wikimedia.org/T259023) (owner: 10Brennen Bearnes) [16:54:16] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10herron) 05Stalled→03Resolved Assuming no news is good news, and transitioning this to resolved. If any assistance is still needed please re-open. Thanks!
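Editor's sketch of the private/public detection issue discussed above. It assumes only what volans describes: the validator uses Python's `ipaddress` module and checks `is_private`. The `classify` helper and both addresses are hypothetical, chosen for illustration; this is not the zone_validator code.

```python
import ipaddress


def classify(addr: str) -> str:
    """Label an address 'private' or 'public' based on ipaddress.is_private."""
    return "private" if ipaddress.ip_address(addr).is_private else "public"


# Hypothetical addresses for a single internal host (not taken from the log).
ipv4 = "10.64.0.10"          # RFC 1918 space, so is_private is True
ipv6 = "2620:0:861:101::10"  # globally routable IPv6, so is_private is False

print(ipv4, classify(ipv4))
print(ipv6, classify(ipv6))
```

A check keyed only on `is_private` would put the A and AAAA records of the same internal host in different buckets, which is consistent with the duplicate being counted under TOO_MANY_PUBLIC_NAMES (a warning) rather than raised as an error.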
[16:54:18] there is a mode to save and then compare [16:54:19] (03PS3) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - IV: Enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609215 (https://phabricator.wikimedia.org/T251279) [16:54:35] that could be used by CI to run on master with --save and then on the patch with --compare [16:54:49] and it fails if any counter increases or if there is any error [16:55:47] anyhow most of this will be automated very soon, so that's a relief [16:55:55] but I'll look into a fix if it's not too complex [16:59:31] (03CR) 10Bstorm: [C: 03+2] dynamicproxy: update error pages [puppet] - 10https://gerrit.wikimedia.org/r/616585 (owner: 10BryanDavis) [17:00:04] halfak and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200728T1700). [17:00:12] (03PS1) 10Ottomata: eventlogging Stop outputing to local files on eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/616871 (https://phabricator.wikimedia.org/T259030) [17:00:41] (03PS1) 10CDanis: puppetmaster: clearer edit message when you can't rewrite history [puppet] - 10https://gerrit.wikimedia.org/r/616872 [17:02:04] !log brennen@deploy1001 Started scap: (no justification provided) [17:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:18] !log prior scap sync for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/616842 (T259023) [17:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:24] T259023: PHP Notice: Undefined index: content-disposition - https://phabricator.wikimedia.org/T259023 [17:03:35] 10Operations, 10serviceops: All wtp and parse servers have a bad partition scheme.
- https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1044.eqiad.wmnet'] ` and were **ALL** successful. [17:03:41] (03CR) 10RLazarus: [C: 03+1] "about time we started upholding the Temporal Prime Directive around here" [puppet] - 10https://gerrit.wikimedia.org/r/616872 (owner: 10CDanis) [17:03:58] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10herron) p:05Triage→03Medium [17:04:28] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10herron) p:05Triage→03Medium [17:04:52] (03CR) 10CDanis: [C: 03+2] puppetmaster: clearer edit message when you can't rewrite history [puppet] - 10https://gerrit.wikimedia.org/r/616872 (owner: 10CDanis) [17:07:08] (03PS25) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [17:07:59] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [17:09:19] RECOVERY - Disk space on eventlog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=eventlog1002&var-datasource=eqiad+prometheus/ops [17:09:29] (03PS1) 10Volans: zone_validator: fix private/public detection [dns] - 10https://gerrit.wikimedia.org/r/616873 [17:09:34] elukey: ^^^ [17:09:47] lemme merge my patch first :P [17:09:54] (03CR) 10jerkins-bot: [V: 04-1] zone_validator: fix private/public detection [dns] - 10https://gerrit.wikimedia.org/r/616873 (owner: 10Volans) [17:10:40] (03CR) 10Elukey: [C: 03+2] "Thanks for the review Jaime!" 
[dns] - 10https://gerrit.wikimedia.org/r/616864 (owner: 10Elukey) [17:11:38] (03CR) 10Volans: "The CI failure is expected and demonstrates that it would have catch the error as Ifcb4329ee708a23d169249d600dd19fe4bb7bc31 was not yet me" [dns] - 10https://gerrit.wikimedia.org/r/616873 (owner: 10Volans) [17:11:41] https://integration.wikimedia.org/ci/job/operations-dns-lint-docker/2514/console is perfect volans [17:11:43] (03PS2) 10Volans: zone_validator: fix private/public detection [dns] - 10https://gerrit.wikimedia.org/r/616873 [17:12:01] thanks for catching the bug! [17:12:20] (03CR) 10Zfilipin: [C: 03+1] "@Mukunda Modell looks like there was a merge conflict in PS1 so this wasn't merged." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [17:13:37] (03CR) 10Elukey: [C: 03+1] zone_validator: fix private/public detection [dns] - 10https://gerrit.wikimedia.org/r/616873 (owner: 10Volans) [17:14:41] (03PS4) 10BryanDavis: dynamicproxy: Add custom response for missing proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/616586 (https://phabricator.wikimedia.org/T258730) [17:27:41] hrm: mid sync-apaches: https://phabricator.wikimedia.org/P12090 [17:30:58] !log brennen@deploy1001 sync aborted: (no justification provided) (duration: 28m 53s) [17:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:45] (03PS2) 10Ottomata: eventlogging - Stop outputing to local files on eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/616871 (https://phabricator.wikimedia.org/T259030) [17:32:50] i could use an assist here. was seeing a recurrence of `rsync: write failed on "/srv/mediawiki/php-1.36.0-wmf.2/cache/l10n/upstream/l10n_cache-bcl.cdb.json": No space left on device (28)` during a full scap sync. [17:33:09] did it tell you which host? 
[17:33:25] !log standardize mr1-esams interfaces [17:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:17] Reedy: wtp2001, and wtp1048 [17:34:49] I don't think I can login to those... [17:36:25] brennen: the wtp hosts are being reimaged because of this issue (low space) [17:36:41] was given precedence to eqiad today [17:37:00] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging [17:37:07] Is the one that brennen saw fail now depooled? [17:37:35] indeed (2) wtp[1046,1048].eqiad.wmnet and wtp2001.codfw.wmnet are reporting 100% / [17:38:29] let me see if I can free something for the small time being [17:38:51] (03PS1) 10Elukey: Add missing fake keytabs for stat100x hosts [labs/private] - 10https://gerrit.wikimedia.org/r/616878 [17:39:11] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add missing fake keytabs for stat100x hosts [labs/private] - 10https://gerrit.wikimedia.org/r/616878 (owner: 10Elukey) [17:39:51] Krinkle, brennenL got back ~2GB with apt-get clean [17:40:14] they should be good for now, hopefully until the reimage [17:40:25] cc rzl for awareness in serviceops ^^^ [17:40:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:41:04] volans: thanks. [17:41:17] !log run apt-get clean on wtp[1046,1048].eqiad.wmnet and wtp2001.codfw.wmnet to free ~`2GB as they were 100% - T258775 [17:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:22] T258775: All wtp and parse servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 [17:42:12] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) 05Stalled→03Open [17:44:27] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1003/24193/" [puppet] - 10https://gerrit.wikimedia.org/r/616871 (https://phabricator.wikimedia.org/T259030) (owner: 10Ottomata) [17:44:37] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) [17:45:17] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) Is there any information about when it will happen (timeframe) and how it would last? [17:46:03] (03CR) 10Elukey: [C: 03+1] eventlogging - Stop outputing to local files on eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/616871 (https://phabricator.wikimedia.org/T259030) (owner: 10Ottomata) [17:46:23] (03CR) 10Ottomata: [C: 03+2] eventlogging - Stop outputing to local files on eventlogging server [puppet] - 10https://gerrit.wikimedia.org/r/616871 (https://phabricator.wikimedia.org/T259030) (owner: 10Ottomata) [17:46:35] !log volker-e@deploy1001 Started deploy [design/style-guide@e3fda83]: Deploy design/style-guide: [17:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:40] !log volker-e@deploy1001 Finished deploy [design/style-guide@e3fda83]: Deploy design/style-guide: (duration: 00m 05s) [17:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:48:57] (03CR) 10Bstorm: [C: 03+2] dynamicproxy: Add custom response for missing proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/616586 (https://phabricator.wikimedia.org/T258730) (owner: 10BryanDavis) [17:49:25] (03PS1) 10BryanDavis: Remove ::role::deprecated::labsvagrant [puppet] - 10https://gerrit.wikimedia.org/r/616879 (https://phabricator.wikimedia.org/T258943) [17:50:39] so i believe the full `scap sync` i was in the midst of just now was strictly unnecessary - i should have done `scap sync-file` - and thus mostly just amounts to a no-op. unless failed rsync messed something up with the l10n files? [17:50:55] brennen: volans: is this possibly related to the reimages that jayme was doing on those hosts to fix partitioning? [17:50:59] (just catching up) [17:51:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:51:36] oh, are these hosts not gotten to yet? [17:51:58] (03PS1) 10Andrew Bogott: New IPs for cloudcephosd100[1-3] in cloud-hosts1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/616880 (https://phabricator.wikimedia.org/T259057) [17:52:00] (03PS1) 10Andrew Bogott: Remove obsolete IPs for cloudcephosd100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/616881 (https://phabricator.wikimedia.org/T259057) [17:52:07] yes, they are [17:52:18] ok nvm [17:54:48] cdanis: yes, they are are and those were not yet reimaged [17:54:50] see the related task [17:54:53] the thing i want to make sure of at the moment is that i haven't left anything in a botched state... i think those parsoid errors are unrelated as they're on different hosts. 
[17:54:55] yeah I just looked [17:54:59] T258775 [17:55:00] T258775: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 [17:55:01] ok :) [17:55:04] wasn't sure if this was some new issue on freshly-reimaged servers [17:55:10] not yet [17:55:13] yeah :) [17:55:14] wait for that :D [17:55:44] if there is an urgency to free space maybe we could remove some older MW version from there? [17:56:27] ah no I was fooled by the leftover directories, most of them are empty already [17:57:06] (03PS1) 10Andrew Bogott: wmcs: standardize old ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616882 (https://phabricator.wikimedia.org/T259057) [17:57:50] i don't _think_ there is a great urgency. i have a single-file backport to sync and there are a few things for the window in a couple of minutes, but i'm guessing with a few gigs freed those should go fine? [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200728T1800). [18:00:04] CindyCicaleseWMF: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:30] PROBLEM - mediawiki-installation DSH group on wtp1044 is CRITICAL: Host wtp1044 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:00:34] I'm here 👋 [18:00:45] I'll do the deployment [18:00:51] Thank you!
[18:01:04] (03PS1) 10BryanDavis: domainproxy: fix error page for /.error/noproxy/ [puppet] - 10https://gerrit.wikimedia.org/r/616883 [18:01:34] RoanKattouw: please check in with brennen and also, some of the wtp hosts might give you trouble [18:01:59] (03PS1) 10Urbanecm: Fix reference to MentorChangeLogFormatter in extension.json [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/616843 (https://phabricator.wikimedia.org/T259041) [18:02:22] Oh I missed brennen's comment from a few minutes before the hour, thanks for pointing that out [18:02:24] RoanKattouw: i think you should be clear to go ahead, modulo any weirdness from the wtp hosts. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/616842 still needs slung out. [18:02:34] (03CR) 10Bstorm: [C: 03+2] domainproxy: fix error page for /.error/noproxy/ [puppet] - 10https://gerrit.wikimedia.org/r/616883 (owner: 10BryanDavis) [18:02:52] (the wtp hosts were all partitioned badly on first install; they're being reimaged to fix that, but we can't just do that all at once) [18:02:52] OK, I can take care of that one for you too if you like [18:03:14] that would be lovely. [18:03:54] OK I see you've already pulled it, just not synced it [18:04:25] RoanKattouw: yeah, correct. i was mid-sync and ran into space issues on a couple of wtp boxen. 
[18:04:31] Ah gotcha [18:04:38] Let's see how it goes now [18:04:58] RoanKattouw: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/616843 fixes a regression in GE, fyi [18:05:31] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.2/includes/libs/filebackend/SwiftFileBackend.php: Fix index error in SwiftFileBackend (T259023) (duration: 01m 07s) [18:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:37] T259023: PHP Notice: Undefined index: content-disposition - https://phabricator.wikimedia.org/T259023 [18:05:41] Thanks Urbanecm, I remember seeing a notification on my phone when I woke up this morning that there was a GE thing I needed to deploy later, thanks for saving me the trouble of finding it :) [18:05:48] I'll tack that one on too, after CindyCicaleseWMF's patch [18:05:57] thanks! [18:05:57] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:11] brennen: There you go, completely uneventful sync, no error messages or anything [18:06:36] RoanKattouw: I'm happy to test it if you want me to [18:07:16] RoanKattouw: beauty. [18:07:26] Oof I think Cindy's series of patches may need a full scap because she's adding a new extension [18:08:06] shouldn't adding extensions be done elsewhere than in a B&C window?
[18:08:09] (03CR) 10Catrope: [C: 03+2] Fix reference to MentorChangeLogFormatter in extension.json [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/616843 (https://phabricator.wikimedia.org/T259041) (owner: 10Urbanecm) [18:08:19] With that in mind I'll do Urbanecm's patch first [18:08:25] thx [18:08:36] Eh, I don't mind, there are no other patches in the window anyway [18:09:00] i meant policy side [18:09:04] CindyCicaleseWMF: Just to set your expectations, you're probably in for a 30-45 minute wait before this is done [18:09:11] Not sure if it makes a difference, but the extensions are already in place. The patches are for the config. But, I'm never sure what requires a full scap. Regardless, I'm happy to wait if there are higher priority or faster things to do first. [18:09:46] through i might be wrong, if RoanKattouw is happy to do it, I am happy as well - just thinking out loud [18:10:03] Yeah I'm doing the higher prio thing first (Urbanecm's patch) but that's not the part that will take that long. The full scap is what will, and it's needed because we need to rebuild the i18n cache (to add the i18n for the new extension) [18:10:26] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host db1097 [dns] - 10https://gerrit.wikimedia.org/r/616884 (https://phabricator.wikimedia.org/T257406) [18:10:26] Thanks for the heads up on timing and for managing this! I will be waiting patiently at my keyboard ;-) [18:10:42] idk Urbanecm you might be right policy-wise, but I'm pretty OK bending the rules in cases where it doesn't hurt much [18:10:51] Ah, gotcha. 
[18:10:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:02] In this case Cindy was the only one who signed up for the B&C window, so there's not even really a difference between that and her getting her own window for this [18:11:14] up to you :) [18:11:42] Good to know for the future! I'll make a note of it. [18:11:51] (03CR) 10Catrope: [C: 03+2] Install WikimediaApiPortal/WikimediaApiPortalOAuth - I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [18:11:53] (03CR) 10Catrope: [C: 03+2] Install WikimediaApiPortal/WikimediaApiPortalOAuth - II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609210 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [18:11:55] (03CR) 10Catrope: [C: 03+2] Install WikimediaApiPortal/WikimediaApiPortalOAuth - III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [18:11:57] (03CR) 10Catrope: [C: 03+2] Install WikimediaApiPortal/WikimediaApiPortalOAuth - IV: Enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609215 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [18:12:05] Oh wait a minute [18:12:09] You're only enabling it on the beta cluster [18:12:14] Yup. [18:12:16] (03CR) 10Ebernhardson: Move mjolnir's daemons to search-loader hosts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) (owner: 10Elukey) [18:12:17] In that case we're fine, and we don't need a full scap in production [18:12:22] Excellent! 
[18:12:32] Just don't try to enable it in production before the next train runs (a week from today) [18:12:40] fwiw, the code seems to be already ready at mwdebug1001 [18:12:47] (03Merged) 10jenkins-bot: Install WikimediaApiPortal/WikimediaApiPortalOAuth - I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [18:12:56] (03Merged) 10jenkins-bot: Install WikimediaApiPortal/WikimediaApiPortalOAuth - II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609210 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [18:13:00] (03Merged) 10jenkins-bot: Install WikimediaApiPortal/WikimediaApiPortalOAuth - III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [18:13:03] Urbanecm: The code for what, your patch? [18:13:04] (03Merged) 10jenkins-bot: Install WikimediaApiPortal/WikimediaApiPortalOAuth - IV: Enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609215 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [18:13:11] RoanKattouw: no, Cindy's extension [18:13:15] We won't be enabling on production for at least a few weeks, probably. The wiki that will host this is still being prepped, AFAIK. 
[18:13:19] Oh yes the code itself is already there [18:13:20] (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt dns for decom host db1097 [dns] - 10https://gerrit.wikimedia.org/r/616884 (https://phabricator.wikimedia.org/T257406) (owner: 10Cmjohnson) [18:13:24] OK great [18:14:23] !log backup pybal restart: ✔️ cdanis@lvs1016.eqiad.wmnet ~ 🕑☕ sudo systemctl restart pybal.service [18:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:42] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:14:50] Next week's train conductor will do a full scap as part of the Tuesday train deployment, and that will rebuild all the i18n stuff we need for this extension to be deployable [18:14:57] So you can enable it in production any time after that has happened [18:15:19] (03PS1) 10Herron: kibana: CVE-2020-7016 / CVE-2020-7017 mitigations [puppet] - 10https://gerrit.wikimedia.org/r/616885 (https://phabricator.wikimedia.org/T259000) [18:15:58] Perfect. Lots of details to keep on top of. 
[18:16:01] (03PS2) 10Andrew Bogott: wmcs: standardize old ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616882 (https://phabricator.wikimedia.org/T259057) [18:16:03] (03PS1) 10Andrew Bogott: ceph osd: rename bootstrap key [puppet] - 10https://gerrit.wikimedia.org/r/616886 [18:16:32] (03Merged) 10jenkins-bot: Fix reference to MentorChangeLogFormatter in extension.json [extensions/GrowthExperiments] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/616843 (https://phabricator.wikimedia.org/T259041) (owner: 10Urbanecm) [18:16:39] Sorry for the initial confusion there, installing new extensions is common enough that I've done it before but uncommon enough that I have to remind myself of how it works every time I do it [18:16:47] !log primary pybal restart ✔️ cdanis@lvs1015.eqiad.wmnet ~ 🕑☕ sudo systemctl restart pybal.service [18:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:58] np :-) [18:17:02] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: No-op sync for wmgUseWikimediaApiPortal and wmgUseWikimediaApiPortalOAuth (1 of 2) (duration: 01m 05s) [18:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:51] (03CR) 10Andrew Bogott: [C: 03+2] ceph osd: rename bootstrap key [puppet] - 10https://gerrit.wikimedia.org/r/616886 (owner: 10Andrew Bogott) [18:18:04] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:19] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:19:01] lookin at the netbox error. 
[18:19:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:19:54] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:56] Uh oh what did I do [18:20:09] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: No-op sync for wmgUseWikimediaApiPortal and wmgUseWikimediaApiPortalOAuth (2 of 2) (duration: 00m 58s) [18:20:12] Ughhhh I did them in the wrong order [18:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:31] I'm such an idiot, I carefully analyzed what the correct order was and then I did it exactly backwards [18:20:55] Oh, bummer! Did I have them listed in the correct order? [18:21:13] (03PS3) 10Andrew Bogott: wmcs: standardize old ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616882 (https://phabricator.wikimedia.org/T259057) [18:21:15] (03PS1) 10Andrew Bogott: ceph osds: rename bootstrap key, again [puppet] - 10https://gerrit.wikimedia.org/r/616887 [18:21:18] Hmm no wait it was the right order! 
[18:21:23] IS first then CS is actually correct [18:21:26] so, wtf [18:21:35] (03CR) 10jerkins-bot: [V: 04-1] wmcs: standardize old ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616882 (https://phabricator.wikimedia.org/T259057) (owner: 10Andrew Bogott) [18:21:40] (03CR) 10jerkins-bot: [V: 04-1] ceph osds: rename bootstrap key, again [puppet] - 10https://gerrit.wikimedia.org/r/616887 (owner: 10Andrew Bogott) [18:22:24] RoanKattouw: if you are fretting about the exceptions spike, that might also have been caused by my pybal restart -- sorry [18:22:32] Oh I see [18:22:43] (it's not a graceful operation and does interrupt existing connections) [18:23:21] (03PS2) 10Andrew Bogott: ceph osds: rename bootstrap key, again [puppet] - 10https://gerrit.wikimedia.org/r/616887 [18:24:13] I haven't yet figured out what the spike was but I see there are some issues with the REST API code [18:24:33] There was an earlier spike of: TypeError from line 181 of /srv/mediawiki/php-1.36.0-wmf.1/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php: Return value of MWParsoid\Rest\Handler\ParsoidHandler::getParsedBody() must be of the type array, null returned [18:24:50] Where my guess would be that connection/network errors aren't handled correctly [18:25:18] (03CR) 10Andrew Bogott: [C: 03+2] ceph osds: rename bootstrap key, again [puppet] - 10https://gerrit.wikimedia.org/r/616887 (owner: 10Andrew Bogott) [18:25:30] Also an ongoing series of: Wikimedia\Assert\InvariantException from line 224 of /srv/mediawiki/php-1.36.0-wmf.1/vendor/wikimedia/assert/src/Assert.php: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) [18:25:37] the former seems very likely to be a manifestation of in-flight requests getting their connections reset, yeah [18:25:48] Which sounds like either a failure to deal with bad input, or incorrect truncation code [18:25:50] sorry, I did the pybal restart and then remembered there was a deploy going on :) [18:25:53] (03PS4) 
10Andrew Bogott: wmcs: standardize old ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616882 (https://phabricator.wikimedia.org/T259057) [18:26:12] Urbanecm: Meanwhile, your patch is on mwdebug1001 and 1002 for testing [18:26:26] Sorry that took forever because of the extension-list change [18:26:52] RoanKattouw: the latter error sounds like T242298 [18:26:53] T242298: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T242298 [18:26:56] Yeah but that Parsoid error spike was earlier [18:27:38] RoanKattouw: the GE patch works for me, thanks! [18:28:01] (03PS20) 10CRusnov: Modified-by: Cas Rusnov [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [18:28:48] OK syncing [18:29:31] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.2/extensions/GrowthExperiments/extension.json: Fix reference to MentorChangeLogFormatter (T259041) (duration: 01m 05s) [18:29:31] !log ❌cdanis@lvs1016.eqiad.wmnet ~ 🕝☕ sudo ipvsadm -D -t 10.2.2.51:9283 [18:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:36] T259041: Beta growthexperiments log doesn't have proper log entry messages - https://phabricator.wikimedia.org/T259041 [18:29:38] cdanis: Looking at the graph more closely I think it was probably you, the peak is right at the time I deployed but it was going up for a minute or 2 before then [18:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:18] RoanKattouw: works for me w/o mwdebug, thx!
[18:31:01] Oh and it is the Parsoid REST errors after all [18:32:08] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:32:23] OK yes Parsoid is doing return json_decode( $request->getBody()->getContents(), true ); [18:32:30] In a function that's typed to return an array [18:33:08] So if $request->getBody()->getContents() returns nothing, or even just returns something that's invalid JSON, json_decode() will return null, and that's a TypeError because the function is supposed to return an array [18:33:11] I'll file a task [18:34:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:35:23] !log ✔️ cdanis@lvs1015.eqiad.wmnet ~ 🕝☕ sudo ipvsadm -D -t 10.2.2.51:9283 [18:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:02] I see the config changes in the files on beta, but I'm not seeing the skin/extension enabled on https://api.wikimedia.beta.wmflabs.org/. I'm wondering if apiportalwiki is the incorrect identifier for the wiki. In interwiki-labs.php I see it used for wikis such as https://apiportal.wikimedia.beta.wmflabs.org. [18:36:03] cdanis: sorry, I'm just curious...what do those checks/crosses mean in your log statements? [18:36:26] Urbanecm: the success or failure of the command I executed before the prompt was printed :) [18:37:09] ah, so it's actually your prompt! [18:37:11] neat [18:37:16] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:37:31] yeah, and the clock and the beverage emojis are suitable for whatever my local time is :) [18:37:48] nice!
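Editor's sketch for the json_decode() diagnosis at 18:32-18:33 above: a decoder that can yield "nothing" sits inside a function that promises a specific return type, so an empty or malformed body surfaces as a type error instead of a clean request error. The following is a Python analogue only, not the Parsoid code and not the fix tracked in T259063. The names get_parsed_body and BadRequestBody are hypothetical, and restricting the result to a JSON object (dict) is an assumption made for this illustration.

```python
import json


class BadRequestBody(ValueError):
    """Hypothetical error type: the body could not be decoded into a JSON object."""


def get_parsed_body(raw: str) -> dict:
    """Decode a request body, failing loudly instead of returning the wrong type."""
    # An empty body (for example, after a connection reset mid-request) is
    # reported explicitly rather than handed to the decoder.
    if not raw.strip():
        raise BadRequestBody("empty request body")
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Malformed JSON becomes a request-level error, not a crash.
        raise BadRequestBody(f"invalid JSON: {exc.msg}") from exc
    # Valid JSON such as "null" or "[1, 2]" still is not the promised type.
    if not isinstance(parsed, dict):
        raise BadRequestBody(
            f"expected a JSON object, got {type(parsed).__name__}"
        )
    return parsed
```

A caller would presumably catch BadRequestBody and answer with a client-error response, which keeps interrupted in-flight requests (such as those cut off by a load balancer restart) out of the fatal-exception counters.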
[18:37:52] in another 23 minutes that will be 🍵 instead of ☕ [18:38:58] CindyCicaleseWMF: I do see the skin/extension enabled at apiportalwiki [18:39:16] I'm looking at https://api.wikimedia.beta.wmflabs.org/wiki/Special:Version [18:39:35] https://api.wikimedia.beta.wmflabs.org/w/index.php?title=Main_Page&useskin=wikimediaapiportal also works for me [18:39:37] They are now :-) I guess I was just too impatient. Thank you! [18:39:41] I've filed T259063 for the Parsoid error spike [18:39:42] T259063: ParsoidHandler::getParsedBody() needs error handling for json_decode - https://phabricator.wikimedia.org/T259063 [18:39:56] No problem! [18:41:07] Cool. It looks good, so I'm all set. Thanks for all of the help! [18:41:23] (03CR) 10Andrew Bogott: [C: 03+2] New IPs for cloudcephosd100[1-3] in cloud-hosts1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/616880 (https://phabricator.wikimedia.org/T259057) (owner: 10Andrew Bogott) [18:41:27] (03PS2) 10Andrew Bogott: New IPs for cloudcephosd100[1-3] in cloud-hosts1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/616880 (https://phabricator.wikimedia.org/T259057) [18:43:04] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: standardize old ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616882 (https://phabricator.wikimedia.org/T259057) (owner: 10Andrew Bogott) [18:45:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:51:30] 10Operations, 10SRE-Access-Requests: Requesting access to Gerrit for snowick - https://phabricator.wikimedia.org/T259066 (10SNowick_WMF) [18:52:05] 10Operations, 10SRE-Access-Requests: Requesting access to Gerrit for snowick - https://phabricator.wikimedia.org/T259066 (10SNowick_WMF) [18:52:16] (03PS1) 10Dzahn: aphlict: add 
parameter/ferm rule to let phab server talk to admin port [puppet] - 10https://gerrit.wikimedia.org/r/616890 (https://phabricator.wikimedia.org/T238593) [18:53:14] 10Operations, 10SRE-Access-Requests: Requesting access to Gerrit for snowick - https://phabricator.wikimedia.org/T259066 (10SNowick_WMF) [18:54:35] 10Operations, 10SRE-Access-Requests: Requesting access to Gerrit for snowick - https://phabricator.wikimedia.org/T259066 (10Dzahn) > Already have server access and group membership needed, just need working login for Gerrit, thanks! Hi, you should be able to login on Gerrit with your existing Wikitech wiki u... [18:58:26] 10Operations, 10Graphoid, 10serviceops, 10Platform Engineering (Icebox): Undeploy graphoid for phase 1 wiki's - https://phabricator.wikimedia.org/T257402 (10Pppery) [18:58:47] (03PS1) 10Jeena Huneidi: Revert "Update blubberoid to 2020-07-24-194337-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/616844 [18:59:34] (03CR) 10Ahmon Dancy: [C: 03+2] Revert "Update blubberoid to 2020-07-24-194337-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/616844 (owner: 10Jeena Huneidi) [18:59:58] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24194/aphlict1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/616890 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [19:00:04] liw and brennen: Dear deployers, time to do the Mediawiki train - European+American Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200728T1900). 
[19:00:18] (nothing for this window) [19:00:32] (03Merged) 10jenkins-bot: Revert "Update blubberoid to 2020-07-24-194337-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/616844 (owner: 10Jeena Huneidi) [19:01:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1147', diff saved to https://phabricator.wikimedia.org/P12093 and previous config saved to /var/cache/conftool/dbconfig/20200728-190137-marostegui.json [19:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:29] !log jhuneidi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [19:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1147', diff saved to https://phabricator.wikimedia.org/P12094 and previous config saved to /var/cache/conftool/dbconfig/20200728-190517-marostegui.json [19:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:32] (03PS2) 10Herron: logstash-next: change backend naming from kibana-next to kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/616124 [19:06:14] !log jhuneidi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [19:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1147', diff saved to https://phabricator.wikimedia.org/P12095 and previous config saved to /var/cache/conftool/dbconfig/20200728-190933-marostegui.json [19:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:38] !log jhuneidi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[19:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:19] (03PS1) 10CRusnov: Update netbox to v2.8.8-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/616892 (https://phabricator.wikimedia.org/T258942) [19:10:46] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/616892 (https://phabricator.wikimedia.org/T258942) (owner: 10CRusnov) [19:11:03] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@69bbbbb]: airflow: drop_old_data_daily: top_queries table renamed to fulltext_head_queries [19:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1097.eqiad.wmnet - https://phabricator.wikimedia.org/T257406 (10Cmjohnson) [19:11:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1097.eqiad.wmnet - https://phabricator.wikimedia.org/T257406 (10Cmjohnson) 05Open→03Resolved [19:11:55] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@69bbbbb]: airflow: drop_old_data_daily: top_queries table renamed to fulltext_head_queries (duration: 00m 53s) [19:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:00] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:12:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1147', diff saved to https://phabricator.wikimedia.org/P12096 and previous config saved to /var/cache/conftool/dbconfig/20200728-191237-marostegui.json [19:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:42] RECOVERY - rsyslog in 
eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:15:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:16:30] (03PS1) 10Dzahn: phabricator: move phabricator_name key to common.yaml in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/616894 (https://phabricator.wikimedia.org/T238593) [19:18:53] (03CR) 1020after4: [C: 03+1] phabricator: move phabricator_name key to common.yaml in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/616894 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [19:19:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1142', diff saved to https://phabricator.wikimedia.org/P12097 and previous config saved to /var/cache/conftool/dbconfig/20200728-191926-marostegui.json [19:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:09] (03PS2) 10Dzahn: phabricator: move phabricator_name key to common.yaml in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/616894 (https://phabricator.wikimedia.org/T238593) [19:22:49] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24195/" [puppet] - 10https://gerrit.wikimedia.org/r/616894 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [19:23:15] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [19:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:02] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [19:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:18] !log andrew@cumin1001 START - 
Cookbook sre.hosts.decommission [19:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:42] (03CR) 10Dzahn: "noop on phab1001 - added missing host name to ferm snippet on aphlict1001" [puppet] - 10https://gerrit.wikimedia.org/r/616894 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [19:24:59] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [19:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:09] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [19:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:51] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [19:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:01] (03PS1) 10Jbond: validate_$type: add checks to prevent legacy stdlib functions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) [19:28:22] (03CR) 10jerkins-bot: [V: 04-1] validate_$type: add checks to prevent legacy stdlib functions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [19:30:33] (03PS2) 10Jbond: validate_$type: add checks to prevent legacy stdlib functions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/616895 (https://phabricator.wikimedia.org/T259013) [19:35:16] 10Operations, 10SRE-Access-Requests: Requesting access to Gerrit for snowick - https://phabricator.wikimedia.org/T259066 (10SNowick_WMF) I just tried using all my logins and 'Shay Nowick' worked this time, cancelling request. 
[19:35:30] 10Operations, 10SRE-Access-Requests: Requesting access to Gerrit for snowick - https://phabricator.wikimedia.org/T259066 (10SNowick_WMF) 05Open→03Resolved [19:37:47] (03CR) 10Herron: "please see a question inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [19:39:46] (03PS1) 10Dzahn: aphlict: make port and IP for the admin interface configurable [puppet] - 10https://gerrit.wikimedia.org/r/616896 (https://phabricator.wikimedia.org/T238593) [19:41:20] (03PS1) 10Andrew Bogott: wmcs eqiad1: change ceph nodes to use the 'eqiad1-compute' ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/616897 (https://phabricator.wikimedia.org/T253365) [19:42:05] (03CR) 10Andrew Bogott: [C: 03+2] wmcs eqiad1: change ceph nodes to use the 'eqiad1-compute' ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/616897 (https://phabricator.wikimedia.org/T253365) (owner: 10Andrew Bogott) [19:42:21] 10Operations, 10SRE-Access-Requests: Requesting access to Gerrit for snowick - https://phabricator.wikimedia.org/T259066 (10Dzahn) Cool! Now go to "Settings -> SSH Keys" and upload your own SSH key. This can be a separate key just for Gerrit or the existing prod key. It will be used when you clone/push from/... 
[19:45:54] (03PS2) 10Dzahn: aphlict: make port and IP for the admin interface configurable [puppet] - 10https://gerrit.wikimedia.org/r/616896 (https://phabricator.wikimedia.org/T238593) [19:48:42] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [19:48:43] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [19:48:46] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:54] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:50] (03PS3) 10Dzahn: aphlict: make port and IP for the admin interface configurable [puppet] - 10https://gerrit.wikimedia.org/r/616896 (https://phabricator.wikimedia.org/T238593) [19:54:21] (03PS4) 10Dzahn: aphlict: make port and IP for the admin interface configurable [puppet] - 10https://gerrit.wikimedia.org/r/616896 (https://phabricator.wikimedia.org/T238593) [19:55:39] Vote for the Gerrit favicon at https://phabricator.wikimedia.org/V21 [19:59:07] ABDE (i like how we use the actual voting app in Phabricator, glad we have one, remember Loomio) [20:04:00] (03PS3) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) [20:04:59] (03PS4) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) [20:05:20] (03CR) 10Cwhite: provision loki on grafana-next (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:05:40] (03PS5) 10Dzahn: aphlict: make port and IP for the admin interface configurable 
[puppet] - 10https://gerrit.wikimedia.org/r/616896 (https://phabricator.wikimedia.org/T238593) [20:07:48] (03CR) 10Cwhite: "> Patch Set 2:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:09:21] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:09:51] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24200/" [puppet] - 10https://gerrit.wikimedia.org/r/616896 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:10:22] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [20:10:36] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [20:12:46] (03CR) 10Dzahn: "noop in prod" [puppet] - 10https://gerrit.wikimedia.org/r/616896 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:13:36] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) I suspect that @Ottomata is the person to notify when these are ready for handoff, so I've added him as a subscriber. @Ottomata: if this should be som... 
[20:15:03] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:19:04] (03CR) 10CRusnov: "LGTM except for one inline caveat" (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/615793 (owner: 10Jbond) [20:19:25] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Ottomata) Let's add @elukey too! [20:19:57] (03PS1) 10Dzahn: aphlict: let the service listen on %{::ipaddress} [puppet] - 10https://gerrit.wikimedia.org/r/616906 (https://phabricator.wikimedia.org/T238593) [20:22:42] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24201/aphlict1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/616906 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:30:10] (03PS1) 10Dzahn: add aphlict.discovery.wmnet with CNAME aphlict1001 [dns] - 10https://gerrit.wikimedia.org/r/616909 (https://phabricator.wikimedia.org/T238593) [20:36:18] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:00] (03PS1) 10Dzahn: add backends for misc services with multiple backends but without geoip [dns] - 10https://gerrit.wikimedia.org/r/616911 [20:40:29] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:45:44] (03PS3) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) [20:49:08] (03CR) 10Dzahn: [C: 03+2] add aphlict.discovery.wmnet with CNAME aphlict1001 [dns] - 10https://gerrit.wikimedia.org/r/616909 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:49:13] (03PS2) 10Dzahn: add aphlict.discovery.wmnet with CNAME aphlict1001 [dns] - 10https://gerrit.wikimedia.org/r/616909 (https://phabricator.wikimedia.org/T238593) [20:52:11] 10Operations, 10MassMessage, 10MediaWiki-JobQueue: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10AMooney) If Platform team is needed, retag us. [20:54:25] (03CR) 10Andrew Bogott: [C: 03+2] Remove obsolete IPs for cloudcephosd100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/616881 (https://phabricator.wikimedia.org/T259057) (owner: 10Andrew Bogott) [20:54:28] (03PS2) 10Andrew Bogott: Remove obsolete IPs for cloudcephosd100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/616881 (https://phabricator.wikimedia.org/T259057) [20:56:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:57:40] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:57:58] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:59:16] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:01:21] (03CR) 10Dzahn: "> Patch Set 3: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [21:03:32] (03PS4) 10Dzahn: ATS: add new backend for phabricator aphlict [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) [21:03:40] (03CR) 10Dzahn: "> Patch Set 3: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [21:04:10] (03CR) 10Cwhite: [C: 03+1] "I haven't tested the config, but it LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616885 (https://phabricator.wikimedia.org/T259000) (owner: 10Herron) [21:06:04] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [1000.0] Ryan Kemper suspected to be related to a config value that determines the number of results that are run through the ML models per-shard on a search request. patch will be opened up https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia. 
[21:06:04] elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [21:07:52] I'm going to be debugging on mwdebug1002 [21:12:29] (03PS1) 10Ahmon Dancy: blubberoid: Update to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/616915 (https://phabricator.wikimedia.org/T259069) [21:13:42] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) @20after4 Current status is now - `aphlict1001.eqiad.wmnet` up and running - phabricator-roots admin gro... [21:13:47] (03CR) 10Dduvall: [C: 03+2] blubberoid: Update to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/616915 (https://phabricator.wikimedia.org/T259069) (owner: 10Ahmon Dancy) [21:14:54] (03Merged) 10jenkins-bot: blubberoid: Update to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/616915 (https://phabricator.wikimedia.org/T259069) (owner: 10Ahmon Dancy) [21:16:58] RECOVERY - Long running screen/tmux on netbox1001 is OK: OK: Tmux detected but not long running. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [21:17:03] !log dancy@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [21:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:18] (03PS1) 10Ebernhardson: cirrus: Reduce MLR window size on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616916 (https://phabricator.wikimedia.org/T256928) [21:24:43] !log dancy@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[21:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:18] (03PS2) 10Ebernhardson: cirrus: Reduce MLR window size on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616916 (https://phabricator.wikimedia.org/T256928) [21:27:17] !log dancy@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [21:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:47] (03PS1) 10MarcoAurelio: Add .gitreview [debs/prometheus-es-exporter] - 10https://gerrit.wikimedia.org/r/616845 [21:49:07] (03PS1) 10Dzahn: aphlict: add second envoy TLS terminator for client port [puppet] - 10https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593) [21:49:28] (03PS2) 10MarcoAurelio: Add .gitreview [debs/prometheus-es-exporter] - 10https://gerrit.wikimedia.org/r/616845 [21:49:40] (03CR) 10MarcoAurelio: [V: 03+2 C: 03+2] Add .gitreview [debs/prometheus-es-exporter] - 10https://gerrit.wikimedia.org/r/616845 (owner: 10MarcoAurelio) [21:50:23] (03CR) 10jerkins-bot: [V: 04-1] aphlict: add second envoy TLS terminator for client port [puppet] - 10https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [21:51:17] (03PS1) 10Ebernhardson: cirrus: Dont ship user testing config to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616919 (https://phabricator.wikimedia.org/T253271) [21:56:58] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [21:57:54] (03CR) 10Bstorm: [C: 03+2] "Merging before someone gets a funny idea and launches a VM with it 😊" [puppet] - 
https://gerrit.wikimedia.org/r/616879 (https://phabricator.wikimedia.org/T258943) (owner: BryanDavis)
[21:58:13] (PS1) MarcoAurelio: Edit Repo Config [debs/prometheus-es-exporter] (refs/meta/config) - https://gerrit.wikimedia.org/r/616926
[21:58:59] (Abandoned) MarcoAurelio: Edit Repo Config [debs/prometheus-es-exporter] (refs/meta/config) - https://gerrit.wikimedia.org/r/616926 (owner: MarcoAurelio)
[22:10:45] An error has occurred while searching: Search is currently too busy. Please try again later.
[22:11:16] That's a new one. Trying again worked though.
[22:15:46] PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1087.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:19:32] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1044.eqiad.wmnet
[22:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:32] Operations, Mail, OTRS, Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (jrbs) >>! In T255733#6338860, @Dzahn wrote: > gentle ping. any updates here from Privacy team? Sorry for the delay, slipped off m...
[22:24:17] Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (JMeybohm) > Unfortunately wtp1044.eqiad.wmnet is still waiting for the puppet run after reboot. Checking back on that later. wtp1044.eqiad.wmnet complete and repooled.
[22:24:32] Operations, Mail, OTRS, Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (jrbs) >>! In T255733#6342878, @jrbs wrote: > I am happy to send a link to this ticket to them to confirm that is the case.
{{done}}
[22:29:14] (PS1) Dzahn: installserver: use correct partman recipe for parse* [puppet] - https://gerrit.wikimedia.org/r/616920
[22:36:30] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[22:36:33] (PS2) Dzahn: installserver: use correct partman recipe for parse* [puppet] - https://gerrit.wikimedia.org/r/616920 (https://phabricator.wikimedia.org/T258775)
[22:37:10] (PS3) Dzahn: installserver: use correct partman recipe for parse* [puppet] - https://gerrit.wikimedia.org/r/616920 (https://phabricator.wikimedia.org/T258775)
[22:39:24] (PS4) Jdlrobson: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227)
[22:52:37] (PS1) Dzahn: aphlict: make client port and IP also configurable, rename parameters [puppet] - https://gerrit.wikimedia.org/r/616922
[22:53:51] (CR) jerkins-bot: [V: -1] aphlict: make client port and IP also configurable, rename parameters [puppet] - https://gerrit.wikimedia.org/r/616922 (owner: Dzahn)
[22:55:50] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=wtp1046.eqiad.wmnet
[22:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200728T2300).
[23:00:05] RoanKattouw and ebernhardson: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:20] I'll deploy
[23:00:40] (CR) Catrope: [C: +2] Remove unused setting $wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage [mediawiki-config] - https://gerrit.wikimedia.org/r/616645 (owner: Catrope)
[23:00:44] (PS2) Catrope: Remove unused setting $wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage [mediawiki-config] - https://gerrit.wikimedia.org/r/616645
[23:00:51] (CR) Catrope: [C: +2] Remove unused setting $wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage [mediawiki-config] - https://gerrit.wikimedia.org/r/616645 (owner: Catrope)
[23:01:06] ebernhardson: Are you here for your deployment?
[23:01:43] (Merged) jenkins-bot: Remove unused setting $wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage [mediawiki-config] - https://gerrit.wikimedia.org/r/616645 (owner: Catrope)
[23:02:09] (CR) Catrope: [C: +2] cirrus: Dont ship user testing config to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/616919 (https://phabricator.wikimedia.org/T253271) (owner: Ebernhardson)
[23:03:14] RECOVERY - mediawiki-installation DSH group on wtp1044 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[23:03:30] (Merged) jenkins-bot: cirrus: Dont ship user testing config to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/616919 (https://phabricator.wikimedia.org/T253271) (owner: Ebernhardson)
[23:04:19] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove unused setting $wgGEHomepageSuggestedEditsNewAccountInitiatedPercentage (no-op) (duration: 01m 06s)
[23:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:12] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme.
- https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp1046.eqiad.wmmet ` The log can be found in `/var/log...
[23:06:14] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1046.eqiad.wmmet'] ` Of which those **FAILED**: ` ['wtp1046.eqiad.wmmet'] `
[23:06:24] (PS3) Catrope: cirrus: Reduce MLR window size on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/616916 (https://phabricator.wikimedia.org/T256928) (owner: Ebernhardson)
[23:07:10] ebernhardson: I'm going do my dishes, please ping me if you appear here later and want me to deploy your enwiki MLR window size patch
[23:07:39] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp1046.eqiad.wmnet ` The log can be found in `/var/log...
[23:13:49] RoanKattouw: i'll deploy it, thanks though!
[23:15:07] (CR) Ebernhardson: [C: +2] cirrus: Reduce MLR window size on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/616916 (https://phabricator.wikimedia.org/T256928) (owner: Ebernhardson)
[23:15:45] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp1047.eqiad.wmnet ` The log can be found in `/var/log...
[23:16:00] (Merged) jenkins-bot: cirrus: Reduce MLR window size on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/616916 (https://phabricator.wikimedia.org/T256928) (owner: Ebernhardson)
[23:20:18] actually, realized this isn't BC enough to prevent failing on deploy. Needs like 3 extra characters...one more patch
[23:20:39] well, failing while the files aren't in sync i mean
[23:24:39] (PS1) Ebernhardson: cirrus: Add BC compat for MLR model definition [mediawiki-config] - https://gerrit.wikimedia.org/r/616952
[23:27:06] (CR) Ebernhardson: [C: +2] cirrus: Add BC compat for MLR model definition [mediawiki-config] - https://gerrit.wikimedia.org/r/616952 (owner: Ebernhardson)
[23:27:51] (Merged) jenkins-bot: cirrus: Add BC compat for MLR model definition [mediawiki-config] - https://gerrit.wikimedia.org/r/616952 (owner: Ebernhardson)
[23:29:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[23:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:03] !log ebernhardson@deploy1001 Synchronized wmf-config/CirrusSearch-common.php: cirrus: reduce mlr window size on enwiki (duration: 01m 06s)
[23:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[23:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:21] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cirrus: reduce mlr window size on enwiki (duration: 01m 05s)
[23:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:38:59] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:14] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:52:53] Operations, serviceops, Patch-For-Review: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp1048.eqiad.wmnet ` The log can be found in `/var/log...
[23:57:10] RECOVERY - MariaDB Replica Lag: s4 on db1145 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica