[00:38:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:41:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:43:33] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 80721648 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:53] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1242840 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:04:00] (CR) Nskaggs: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/678642 (https://phabricator.wikimedia.org/T279555) (owner: Andrew Bogott)
[01:17:39] (CR) Andrew Bogott: [C: +2] nfs-mounts: provide dumps access to spi-tools [puppet] - https://gerrit.wikimedia.org/r/678642 (https://phabricator.wikimedia.org/T279555) (owner: Andrew Bogott)
[01:37:32] (PS1) Andrew Bogott: wmcs-policy-tests: add some neutron tests [puppet] - https://gerrit.wikimedia.org/r/678669 (https://phabricator.wikimedia.org/T279845)
[01:37:34] (PS1) Andrew Bogott: wmcs-policy-tests: add a few more cinder tests [puppet] - https://gerrit.wikimedia.org/r/678670 (https://phabricator.wikimedia.org/T279845)
[01:37:36] (PS1) Andrew Bogott: wmcs-policy-tests.py: added tests for adding/removing security groups [puppet] - https://gerrit.wikimedia.org/r/678671 (https://phabricator.wikimedia.org/T279845)
[01:38:37] (CR) jerkins-bot: [V: -1] wmcs-policy-tests: add some neutron tests [puppet] - https://gerrit.wikimedia.org/r/678669 (https://phabricator.wikimedia.org/T279845) (owner: Andrew Bogott)
[01:38:54] (CR) jerkins-bot: [V: -1] wmcs-policy-tests: add a few more cinder tests [puppet] - https://gerrit.wikimedia.org/r/678670 (https://phabricator.wikimedia.org/T279845) (owner: Andrew Bogott)
[01:53:05] (PS2) Andrew Bogott: wmcs-policy-tests: more tests! [puppet] - https://gerrit.wikimedia.org/r/678669 (https://phabricator.wikimedia.org/T279845)
[01:53:11] (Abandoned) Andrew Bogott: wmcs-policy-tests: add a few more cinder tests [puppet] - https://gerrit.wikimedia.org/r/678670 (https://phabricator.wikimedia.org/T279845) (owner: Andrew Bogott)
[01:53:32] (Abandoned) Andrew Bogott: wmcs-policy-tests.py: added tests for adding/removing security groups [puppet] - https://gerrit.wikimedia.org/r/678671 (https://phabricator.wikimedia.org/T279845) (owner: Andrew Bogott)
[01:54:39] (CR) Andrew Bogott: [C: +2] wmcs-policy-tests: more tests! [puppet] - https://gerrit.wikimedia.org/r/678669 (https://phabricator.wikimedia.org/T279845) (owner: Andrew Bogott)
[02:07:56] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.39 [core] (wmf/1.36.0-wmf.39) - https://gerrit.wikimedia.org/r/678677
[02:07:59] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.36.0-wmf.39 [core] (wmf/1.36.0-wmf.39) - https://gerrit.wikimedia.org/r/678677 (owner: TrainBranchBot)
[02:28:07] PROBLEM - snapshot of s7 in codfw on alert1001 is CRITICAL: snapshot for s7 at codfw taken more than 3 days ago: Most recent backup 2021-04-10 02:06:52 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:30:41] (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.39 [core] (wmf/1.36.0-wmf.39) - https://gerrit.wikimedia.org/r/678677 (owner: TrainBranchBot)
[03:17:49] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:47] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.201 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:30:01] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:07:03] (PS1) Andrew Bogott: wmcs-policy-tests: fix security-group tests [puppet] - https://gerrit.wikimedia.org/r/678685 (https://phabricator.wikimedia.org/T279845)
[04:08:18] (CR) Andrew Bogott: [C: +2] wmcs-policy-tests: fix security-group tests [puppet] - https://gerrit.wikimedia.org/r/678685 (https://phabricator.wikimedia.org/T279845) (owner: Andrew Bogott)
[04:12:48] (PS1) Andrew Bogott: OpenStack Nova policies: add explicit rules about security groups [puppet] - https://gerrit.wikimedia.org/r/678708
[04:14:31] (CR) Andrew Bogott: [C: +2] OpenStack Nova policies: add explicit rules about security groups [puppet] - https://gerrit.wikimedia.org/r/678708 (owner: Andrew Bogott)
[04:14:39] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:14:49] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 50, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:17:09] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:17:17] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:57:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P15278 and previous config saved to /var/cache/conftool/dbconfig/20210413-045708-marostegui.json
[04:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:31] SRE, DBA, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (Marostegui) >>! In T278614#6985029, @Ladsgroup wrote: > With the wikitech-l imported my last offer is now: 34GB. This is a pretty decent size and 4GB per year is also ok with the...
[05:24:28] SRE, DBA, Platform Engineering, Performance-Team (Radar), Sustainability (Incident Followup): Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (Marostegui) Thanks @Krinkle! Yes, we are well aware of the trends parsercache is having lately a...
[05:32:34] SRE, DBA, Platform Engineering, Performance-Team (Radar), Sustainability (Incident Followup): Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (Krinkle) Open→Resolved 👍
[05:47:39] SRE, DBA, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (Ladsgroup) Thanks. We will likely bother you in two or three weeks. Most of the work is done.
[05:56:34] <_joe_> !log restarting blazegraph on wdqs1013
[05:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:15] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:58:23] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.117 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:00:20] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (Joe) p:Medium→High I don't realistically see it possible to switch memcached to TLS in the remaining time before we need to renew the certificates, hence raising priority. It will be raised...
[06:00:32] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (ayounsi) Please don't forget to update the switches by running Homer when you update Netbox, otherwise there are outstanding cha...
[06:00:47] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:04:04] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (ayounsi) Please don't forget to update the switches by running Homer when you update Netbox, otherwise there are outstanding changes...
[06:08:25] RECOVERY - Disk space on Hadoop worker on analytics1061 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:10:57] this is me --^
[06:11:04] I am working on the other alert
[06:14:57] SRE, netops: Lumen link between cr2-eqiad and cr2-esams down - https://phabricator.wikimedia.org/T279820 (ayounsi) For the record: Ticket opened at **2021-04-10 11:29:36 GMT** > **2021-04-10 12:50:02 GMT **- Hello from Lumen, > We are observing a network Local Fault from our Subsea portion. We will send...
[06:19:55] RECOVERY - Disk space on Hadoop worker on analytics1070 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:20:57] (PS1) Ladsgroup: Disable legacy javascript variables in zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/678714 (https://phabricator.wikimedia.org/T72470)
[06:39:48] (PS1) Elukey: Update the build process for 4.9 [debs/hue] - https://gerrit.wikimedia.org/r/678717
[06:40:16] (CR) Elukey: [C: +2] Update the build process for 4.9 [debs/hue] - https://gerrit.wikimedia.org/r/678717 (owner: Elukey)
[06:50:27] SRE, OTRS, Security, User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (NoFWDaddress) >>! In T275294#6972950, @Keegan wrote: >>>! In T275294#6972938, @Aklapper wrote: >>>>! In T275294#6972793, @Keegan wrote: >>> some...
[06:59:22] morning
[06:59:39] is OTRS->Znuny migration discussion happening in here?
[07:01:46] (Traffic bill over quota) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org
[07:02:55] (CR) Effie Mouzeli: replace mwlog1001 with new mwlog[12]002 hosts (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: Herron)
[07:04:51] SRE, DBA, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (Marostegui) Excellent!
[07:06:21] SRE, OTRS, Security, User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (Krenair) >>! In T275294#6972793, @Keegan wrote: >>>! In T275294#6972574, @Krenair wrote: >> >> I can be around during that window I think. > >...
[07:06:46] (Traffic bill over quota) resolved: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org
[07:09:56] (PS1) ArielGlenn: remove obsolete html files from snapshot manifests for dumps [puppet] - https://gerrit.wikimedia.org/r/678719 (https://phabricator.wikimedia.org/T279661)
[07:11:01] (CR) Muehlenhoff: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/677931 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff)
[07:15:13] (PS2) ArielGlenn: remove obsolete html files from snapshot manifests for dumps [puppet] - https://gerrit.wikimedia.org/r/678719 (https://phabricator.wikimedia.org/T279661)
[07:20:26] (CR) ArielGlenn: [C: +2] remove obsolete html files from snapshot manifests for dumps [puppet] - https://gerrit.wikimedia.org/r/678719 (https://phabricator.wikimedia.org/T279661) (owner: ArielGlenn)
[07:26:00] !log shutdown all OTRS components on otrs1001, prep for OTRS -> Znuny migration. T279303
[07:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:10] T279303: Migrate OTRS CE 6 to Znuny LTS fork - https://phabricator.wikimedia.org/T279303
[07:28:07] SRE, OTRS, Security, User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (akosiaris) >>! In T275294#6993001, @Krenair wrote: >>>! In T275294#6972793, @Keegan wrote: >>>>! In T275294#6972574, @Krenair wrote: >>> >>> I...
[07:30:18] !log migrating to Znuny-6.0.33, release 2021-03-10 . T279303
[07:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:25] PROBLEM - OTRS SMTP on otrs1001 is CRITICAL: connect to address 10.64.16.39 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting
[07:33:43] ACKNOWLEDGEMENT - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 198.3 gt 100 ayounsi https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelast
[07:33:43] ACKNOWLEDGEMENT - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 329.5 gt 100 ayounsi https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelast
[07:33:43] ACKNOWLEDGEMENT - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is CRITICAL: 324.8 gt 100 ayounsi https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelast
[07:37:38] SRE, serviceops, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2002.codfw.wmnet', 'parse2003.codfw.wmnet', 'parse200...
[07:38:03] !log jiji@cumin1001 conftool action : set/weight=10; selector: cluster=jobrunner,name=mw1334.eqiad.wmnet
[07:38:09] SRE, Traffic, User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (Legoktm) >>! In T224891#6983370, @ayounsi wrote: > Even with the current rate limiting, some crawling are regularly causing issues, wasting precious SRE time...
[07:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:16] !log jiji@cumin1001 conftool action : set/weight=10; selector: cluster=jobrunner,name=mw1318.eqiad.wmnet
[07:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:18] SRE, serviceops: High load on jobrunners (12 Apr 2021) - https://phabricator.wikimedia.org/T279893 (jijiki) Open→Resolved Closing this task, no further issues were observed
[07:41:50] (CR) Thiemo Kreuz (WMDE): [C: +1] "Yes, please, to allow people to test it in all namespaces." [mediawiki-config] - https://gerrit.wikimedia.org/r/678606 (https://phabricator.wikimedia.org/T267911) (owner: Awight)
[07:42:59] RECOVERY - OTRS SMTP on otrs1001 is OK: SMTP OK - 0.005 sec. response time https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting
[07:44:56] (CR) Awight: [C: +2] "labs-only config" [mediawiki-config] - https://gerrit.wikimedia.org/r/678606 (https://phabricator.wikimedia.org/T267911) (owner: Awight)
[07:45:42] (CR) ArielGlenn: [C: +1] "I like this approach, but someone else ought to give the final thumbs-up." [puppet] - https://gerrit.wikimedia.org/r/678336 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:45:55] (Merged) jenkins-bot: [beta] Enable line numbering on all namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/678606 (https://phabricator.wikimedia.org/T267911) (owner: Awight)
[07:46:23] !log Start up all components on otrs1001. T279303
[07:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:32] T279303: Migrate OTRS CE 6 to Znuny LTS fork - https://phabricator.wikimedia.org/T279303
[07:47:31] (CR) ArielGlenn: [C: +1] "As long as the command itself has the full path specified, this should do what's needed. I think." [puppet] - https://gerrit.wikimedia.org/r/678337 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:48:41] (CR) ArielGlenn: "Thanks for all your work on this. I want to do some manual testing in beta with the results of the output from the puppet compiler, just t" [puppet] - https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:49:33] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: cluster=videoscaler,name=(mw1311.eqiad.wmnet|mw1318.eqiad.wmnet|mw1334.eqiad.wmnet)
[07:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:59] SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (Legoktm) https://aws.amazon.com/blogs/opensource/introducing-opensearch/ > Today, we are introducing the OpenSearch pro...
[07:56:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2002.codfw.wmnet with reason: REIMAGE
[07:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2003.codfw.wmnet with reason: REIMAGE
[07:58:42] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2002.codfw.wmnet with reason: REIMAGE
[07:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:58] (CR) Muehlenhoff: [C: +2] Switch to iptables legacy alternative provider on bullseye [puppet] - https://gerrit.wikimedia.org/r/677931 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff)
[07:59:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:59:52] SRE, LDAP-Access-Requests: Grant access to Superset for Mikeraish - https://phabricator.wikimedia.org/T279147 (ema) Open→Resolved
[08:00:21] (PS1) Alexandros Kosiaris: Bump version to show we are now targetting Znuny 6.0.x [software/otrs] - https://gerrit.wikimedia.org/r/678786 (https://phabricator.wikimedia.org/T279303)
[08:00:41] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2004.codfw.wmnet with reason: REIMAGE
[08:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2003.codfw.wmnet with reason: REIMAGE
[08:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:06] (CR) Alexandros Kosiaris: [V: +2 C: +2] Bump version to show we are now targetting Znuny 6.0.x [software/otrs] - https://gerrit.wikimedia.org/r/678786 (https://phabricator.wikimedia.org/T279303) (owner: Alexandros Kosiaris)
[08:02:53] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2004.codfw.wmnet with reason: REIMAGE
[08:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:28] (PS1) Muehlenhoff: Remove obsolete requires [puppet] - https://gerrit.wikimedia.org/r/678787
[08:03:38] (CR) jerkins-bot: [V: -1] Remove obsolete requires [puppet] - https://gerrit.wikimedia.org/r/678787 (owner: Muehlenhoff)
[08:05:46] (PS2) Muehlenhoff: Remove obsolete requires [puppet] - https://gerrit.wikimedia.org/r/678787
[08:07:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:09:09] !log Remove system maintenance message from OTRS. Migration to Znuny 6.0.33 done. T279303
[08:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:17] T279303: Migrate OTRS CE 6 to Znuny LTS fork - https://phabricator.wikimedia.org/T279303
[08:14:24] (CR) Muehlenhoff: [C: +2] Remove obsolete requires [puppet] - https://gerrit.wikimedia.org/r/678787 (owner: Muehlenhoff)
[08:14:36] (CR) JMeybohm: [C: +2] calico: Add defauls for container resources [deployment-charts] - https://gerrit.wikimedia.org/r/677906 (https://phabricator.wikimedia.org/T277877) (owner: JMeybohm)
[08:16:07] (CR) Marostegui: [C: +2] Add growthexperiments_mentee_data to private tables [puppet] - https://gerrit.wikimedia.org/r/677653 (https://phabricator.wikimedia.org/T279587) (owner: Urbanecm)
[08:16:15] (Merged) jenkins-bot: calico: Add defauls for container resources [deployment-charts] - https://gerrit.wikimedia.org/r/677906 (https://phabricator.wikimedia.org/T277877) (owner: JMeybohm)
[08:16:37] !log Restart sanitarium hosts db1124, db1125, db1154, db1155, db2094, db2095 T279587
[08:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:47] T279587: Create database table to cache data about mentees - https://phabricator.wikimedia.org/T279587
[08:16:49] (PS1) Ema: cache: explicitly list object sizes in exp_policy.py [puppet] - https://gerrit.wikimedia.org/r/678788 (https://phabricator.wikimedia.org/T275809)
[08:18:40] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[08:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:57] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[08:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:15] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 317, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:21:22] (PS1) Alexandros Kosiaris: admin: Introduce the cluster_group concept [deployment-charts] - https://gerrit.wikimedia.org/r/678789
[08:21:32] SRE, serviceops, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2002.codfw.wmnet', 'parse2003.codfw.wmnet', 'parse2004.codfw.wmnet'] ` and were **ALL** successful.
[08:25:02] SRE, ops-eqiad, DC-Ops, serviceops, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (jijiki) >>! In T274925#6981855, @Jclark-ctr wrote: > @jijiki Racking these host i only have 2 available spots in D4 will any of the ones in t...
[08:26:24] (PS1) Ema: cache: enable exp caching policy on upload@eqsin [puppet] - https://gerrit.wikimedia.org/r/678790 (https://phabricator.wikimedia.org/T275809)
[08:26:47] (PS1) Urbanecm: mswiki: Fix help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/678791 (https://phabricator.wikimedia.org/T277562)
[08:27:34] jouncebot: now
[08:27:34] No deployments scheduled for the next 2 hour(s) and 32 minute(s)
[08:27:49] (CR) Urbanecm: [C: +2] mswiki: Fix help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/678791 (https://phabricator.wikimedia.org/T277562) (owner: Urbanecm)
[08:28:06] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 64, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:28:09] (CR) Alexandros Kosiaris: Switch to iptables legacy alternative provider on bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/677931 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff)
[08:28:38] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/678790 (https://phabricator.wikimedia.org/T275809) (owner: Ema)
[08:28:41] (PS2) Marostegui: db1184: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/677420 (https://phabricator.wikimedia.org/T275633)
[08:29:01] (PS1) Marostegui: production-m2.sql.erb: Add ALTER grant [puppet] - https://gerrit.wikimedia.org/r/678792 (https://phabricator.wikimedia.org/T279053)
[08:29:16] (Merged) jenkins-bot: mswiki: Fix help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/678791 (https://phabricator.wikimedia.org/T277562) (owner: Urbanecm)
[08:29:22] (CR) Alexandros Kosiaris: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/677931 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff)
[08:29:44] (CR) Marostegui: [C: +2] db1184: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/677420 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[08:30:27] (PS1) JMeybohm: Remove old tiller specs [deployment-charts] - https://gerrit.wikimedia.org/r/678793
[08:31:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7ca767322b4469da05656b8acdedfc5101be2703: mswiki: Fix help panel links (T277562) (duration: 00m 58s)
[08:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:23] T277562: Deploy Growth features on Malay Wikipedia - https://phabricator.wikimedia.org/T277562
[08:31:45] SRE, OTRS, Security, User-notice: Migrate OTRS CE 6 to Znuny LTS fork - https://phabricator.wikimedia.org/T279303 (Krenair) Looks like it's all okay. Thanks Alexandros
[08:33:12] SRE, OTRS, Security, User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (akosiaris)
[08:34:09] SRE, OTRS, Security, User-notice: Migrate OTRS CE 6 to Znuny LTS fork - https://phabricator.wikimedia.org/T279303 (akosiaris) Open→Resolved I am gonna be bold and resolve this as it pertains to the actual migration and that part seems to have gone quite well. I am pretty sure we have a to...
[08:35:25] SRE, netops, Patch-For-Review, cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (ayounsi) PCC result: https://puppet-compiler.wmflabs.org/compiler1003/28963/
[08:35:39] (CR) Kosta Harlan: [C: +1] "LGTM! As far as I know, we don't need DROP (Gergo could you confirm please?) but that could be left for another patch." [puppet] - https://gerrit.wikimedia.org/r/678792 (https://phabricator.wikimedia.org/T279053) (owner: Marostegui)
[08:39:08] (CR) Alexandros Kosiaris: [C: +2] Remove old tiller specs [deployment-charts] - https://gerrit.wikimedia.org/r/678793 (owner: JMeybohm)
[08:40:28] (Merged) jenkins-bot: Remove old tiller specs [deployment-charts] - https://gerrit.wikimedia.org/r/678793 (owner: JMeybohm)
[08:41:15] (CR) Alexandros Kosiaris: [C: +1] "OTRS wise, this is safe." [puppet] - https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: Ayounsi)
[08:45:19] (CR) Ayounsi: [C: -1] "https://puppet-compiler.wmflabs.org/compiler1003/28963/" [puppet] - https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: Ayounsi)
[08:46:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repool db1112 after schema change', diff saved to https://phabricator.wikimedia.org/P15280 and previous config saved to /var/cache/conftool/dbconfig/20210413-084657-root.json
[08:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:47] (PS1) Marostegui: db1180: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/678795 (https://phabricator.wikimedia.org/T275633)
[08:49:27] (PS2) JMeybohm: admin: Introduce the cluster_group concept [deployment-charts] - https://gerrit.wikimedia.org/r/678789 (owner: Alexandros Kosiaris)
[08:49:55] (CR) Marostegui: [C: +2] db1180: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/678795 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[08:50:55] (CR) Muehlenhoff: Switch to iptables legacy alternative provider on bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/677931 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff)
[08:50:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1180 with minimal weight on s6 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15281 and previous config saved to /var/cache/conftool/dbconfig/20210413-085057-marostegui.json
[08:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:06] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[08:51:17] SRE, serviceops, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2005.codfw.wmnet', 'parse2006.codfw.wmnet', 'parse200...
[08:54:34] (CR) Ema: [C: +2] cache: explicitly list object sizes in exp_policy.py [puppet] - https://gerrit.wikimedia.org/r/678788 (https://phabricator.wikimedia.org/T275809) (owner: Ema)
[08:57:06] (PS1) Kosta Harlan: linkrecommendation: Revert to earlier version [deployment-charts] - https://gerrit.wikimedia.org/r/678797
[08:57:16] (CR) Kosta Harlan: [C: +2] linkrecommendation: Revert to earlier version [deployment-charts] - https://gerrit.wikimedia.org/r/678797 (owner: Kosta Harlan)
[08:57:26] SRE, Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (MoritzMuehlenhoff) > I have stopped apache on dbmonitor1001 (and done chmod -x to apache2 binary so puppet doesn't bring it up), let's leave it till next week and if nothing breaks, let's decommission it...
[08:58:50] (Merged) jenkins-bot: linkrecommendation: Revert to earlier version [deployment-charts] - https://gerrit.wikimedia.org/r/678797 (owner: Kosta Harlan)
[08:58:57] (PS1) WMDE-Fisch: [beta] Enable suggested values paramter in TemplateData and VisualEditor [mediawiki-config] - https://gerrit.wikimedia.org/r/678798 (https://phabricator.wikimedia.org/T271825)
[08:59:18] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[08:59:18] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[08:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: Repool db1112 after schema change', diff saved to https://phabricator.wikimedia.org/P15282 and previous config saved to /var/cache/conftool/dbconfig/20210413-090201-root.json
[09:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:05] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbmonitor1001.wikimedia.org
[09:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:25] (CR) JMeybohm: [C: +2] admin: Introduce the cluster_group concept [deployment-charts] - https://gerrit.wikimedia.org/r/678789 (owner: Alexandros Kosiaris)
[09:07:53] (Merged) jenkins-bot: admin: Introduce the cluster_group concept [deployment-charts] - https://gerrit.wikimedia.org/r/678789 (owner: Alexandros Kosiaris)
[09:10:15] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2005.codfw.wmnet with reason: REIMAGE
[09:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:33] (PS1) Muehlenhoff: Remove dbmonitor1001 from Puppet [puppet] - https://gerrit.wikimedia.org/r/678799
[09:10:35] (PS1) Muehlenhoff: Remove grant for dbmonitor1001 [puppet] - https://gerrit.wikimedia.org/r/678800 (https://phabricator.wikimedia.org/T224589)
[09:10:37] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:41] (CR) jerkins-bot: [V: -1] Remove dbmonitor1001 from Puppet [puppet] - https://gerrit.wikimedia.org/r/678799 (owner: Muehlenhoff)
[09:12:17] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2006.codfw.wmnet with reason: REIMAGE
[09:12:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2005.codfw.wmnet with reason: REIMAGE
[09:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:43] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:56] (03PS2) 10Muehlenhoff: Remove dbmonitor1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/678799 [09:14:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1180 with minimal weight on s6 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15283 and previous config saved to /var/cache/conftool/dbconfig/20210413-091414-marostegui.json [09:14:19] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2007.codfw.wmnet with reason: REIMAGE [09:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:23] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [09:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:34] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2006.codfw.wmnet with reason: REIMAGE [09:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:02] (03CR) 10jerkins-bot: [V: 04-1] Remove dbmonitor1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/678799 (owner: 10Muehlenhoff) [09:16:30] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[09:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:38] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2007.codfw.wmnet with reason: REIMAGE [09:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:03] (03PS9) 10Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [09:17:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repool db1112 after schema change', diff saved to https://phabricator.wikimedia.org/P15284 and previous config saved to /var/cache/conftool/dbconfig/20210413-091704-root.json [09:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:01] (03CR) 10jerkins-bot: [V: 04-1] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [09:19:58] (03PS3) 10Muehlenhoff: Remove dbmonitor1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/678799 (https://phabricator.wikimedia.org/T224589) [09:20:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: grid-configurator: fix help invocation [puppet] - 10https://gerrit.wikimedia.org/r/677850 (owner: 10Arturo Borrero Gonzalez) [09:21:15] (03PS4) 10Arturo Borrero Gonzalez: sonofgridnegine: grid-configurator: run black autoformater [puppet] - 10https://gerrit.wikimedia.org/r/677860 [09:21:18] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[09:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridnegine: grid-configurator: run black autoformater [puppet] - 10https://gerrit.wikimedia.org/r/677860 (owner: 10Arturo Borrero Gonzalez) [09:22:15] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:41] (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: include defaults in help message [puppet] - 10https://gerrit.wikimedia.org/r/677861 [09:23:17] 10SRE, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10ema) >>! In T265864#6993281, @ayounsi wrote: > Now to figure out the actual service impact and if it's safe to merge. > To re-iterate, 185.15.56.0... [09:23:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: grid-configurator: include defaults in help message [puppet] - 10https://gerrit.wikimedia.org/r/677861 (owner: 10Arturo Borrero Gonzalez) [09:23:35] (03PS1) 10Marostegui: mariadb: Promote db1159 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/678801 (https://phabricator.wikimedia.org/T276448) [09:23:57] (03PS5) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: rework --domains option [puppet] - 10https://gerrit.wikimedia.org/r/677862 [09:24:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: grid-configurator: rework --domains option [puppet] - 10https://gerrit.wikimedia.org/r/677862 (owner: 10Arturo Borrero Gonzalez) [09:24:51] (03PS2) 10Muehlenhoff: dbmonitor: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/675081 [09:24:54] (03CR) 10David Caro: ceph: add ceph repo and parameter to all client modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677911 
(https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [09:25:26] (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: error if running in toolsbeta if no --beta [puppet] - 10https://gerrit.wikimedia.org/r/677865 [09:26:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: grid-configurator: error if running in toolsbeta if no --beta [puppet] - 10https://gerrit.wikimedia.org/r/677865 (owner: 10Arturo Borrero Gonzalez) [09:26:46] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [puppet] - 10https://gerrit.wikimedia.org/r/678801 (https://phabricator.wikimedia.org/T276448) (owner: 10Marostegui) [09:30:40] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1159 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/678801 (https://phabricator.wikimedia.org/T276448) (owner: 10Marostegui) [09:31:25] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28993/" [puppet] - 10https://gerrit.wikimedia.org/r/675081 (owner: 10Muehlenhoff) [09:32:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repool db1112 after schema change', diff saved to https://phabricator.wikimedia.org/P15285 and previous config saved to /var/cache/conftool/dbconfig/20210413-093208-root.json [09:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:55] (03CR) 10Giuseppe Lavagetto: Helm chart to run MediaWiki (0341 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [09:34:05] (03CR) 10Ema: [C: 03+2] cache: enable exp caching policy on upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/678790 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [09:34:14] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/675081 (owner: 10Muehlenhoff) [09:34:43] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts 
dbmonitor1001.wikimedia.org [09:34:50] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `dbmonitor1001.wikimedia.org` - dbmonitor1001.wikimedia.org (**PASS**) - Downtimed host on Icinga - Found Gan... [09:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:02] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [09:35:06] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2005.codfw.wmnet', 'parse2006.codfw.wmnet', 'parse2007.codfw.wmnet'] ` and were **ALL** successful. [09:35:51] 10SRE, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10Majavah) >>! In T265864#6993418, @ema wrote: > On cache hosts, the result of that change would be removing the IP from **wikimedia_nets** and **wi... 
[09:36:42] (03PS10) 10Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [09:37:16] (03CR) 10Jbond: [C: 03+2] "LGTM ill merge" [puppet] - 10https://gerrit.wikimedia.org/r/678336 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [09:39:23] (03CR) 10Jbond: "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/678337 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [09:39:41] (03CR) 10Jbond: [C: 03+2] systemd: Add ability to set working directory in the timer job [puppet] - 10https://gerrit.wikimedia.org/r/678337 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [09:39:52] (03PS3) 10Jbond: systemd: Add ability to set working directory in the timer job [puppet] - 10https://gerrit.wikimedia.org/r/678337 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [09:41:39] !log cp[5002-5006]: rolling varnish-frontend-restart to apply exp policy settings changes starting from empty caches T275809 [09:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:47] T275809: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 [09:42:08] (03PS8) 10Jbond: snapshot: Migrate cronjobs in pagetitles to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [09:44:17] (03CR) 10Jbond: [C: 03+1] "LGTM, minor nit." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [09:45:17] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/678806 [09:46:30] (03PS1) 10Muehlenhoff: Drop bastion role from bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/678807 (https://phabricator.wikimedia.org/T276399) [09:46:56] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/678806 (owner: 10Kosta Harlan) [09:48:37] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/678806 (owner: 10Kosta Harlan) [09:53:40] 10SRE, 10Traffic, 10Patch-For-Review: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10ema) cp5001 has been running with the exp policy for 5 days now: compared to other upload nodes in eqsin, its [[https://grafana.wikimedia.org/d/A__2L7eWz/cache-hosts-... 
[09:54:13] (03PS2) 10Jbond: O:gitlab: add config for backup sets [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) [09:54:20] (03CR) 10Ladsgroup: snapshot: Migrate cronjobs in pagetitles to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [09:57:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1180 with minimal weight on s6 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15286 and previous config saved to /var/cache/conftool/dbconfig/20210413-095717-marostegui.json [09:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:28] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:04:57] 10SRE, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10ayounsi) They're currently coming from the 172.16.0.0/12 space but T209011 is going to change it so they will come from the 185.15.56.0/24 space.... [10:05:59] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 29.49 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [10:10:05] (03CR) 10Volans: "Looks sane to me, I didn't test it but is an easily revertible change in case of issues. See also a question inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677292 (owner: 10Jbond) [10:14:23] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/677506 (owner: 10Jbond) [10:15:54] (03PS1) 10Jbond: apt::package_from_component: ad toggle to make installing packages optional [puppet] - 10https://gerrit.wikimedia.org/r/678811 [10:16:07] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) (owner: 10Jbond) [10:16:49] (03CR) 10Volans: [C: 03+1] "LGTM I'm not sure how much the nginx class support absent and will cleanup all the pieces though." [puppet] - 10https://gerrit.wikimedia.org/r/677514 (owner: 10Jbond) [10:18:43] (03CR) 10Jbond: ceph: add ceph repo and parameter to all client modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [10:22:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/678811 (owner: 10Jbond) [10:22:45] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1004.eqiad.wmnet [10:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:52] (03CR) 10Jbond: [C: 03+2] apt::package_from_component: ad toggle to make installing packages optional [puppet] - 10https://gerrit.wikimedia.org/r/678811 (owner: 10Jbond) [10:23:58] (03CR) 10Jbond: ceph: add ceph repo and parameter to all client modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [10:25:15] !log hnowlan@deploy1002 Started deploy [analytics/aqs/deploy@3227eea]: (no justification provided) [10:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 20%: Slowly pool db1180 for the first time in s6 T275633', diff saved to 
https://phabricator.wikimedia.org/P15287 and previous config saved to /var/cache/conftool/dbconfig/20210413-102617-root.json [10:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:27] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:27:36] 10SRE, 10serviceops, 10Wikimedia-Incident: High load on jobrunners (12 Apr 2021) - https://phabricator.wikimedia.org/T279893 (10jijiki) [10:28:23] !log hnowlan@deploy1002 Finished deploy [analytics/aqs/deploy@3227eea]: (no justification provided) (duration: 03m 08s) [10:28:29] (03CR) 10Jbond: P:debmonitor::server: move the internal server function to apache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677292 (owner: 10Jbond) [10:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10aborrero) ok, thanks! Should I add this information myself to netbox? 
[10:29:41] (03CR) 10David Caro: ceph: add ceph repo and parameter to all client modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [10:31:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1004.eqiad.wmnet [10:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:56] !log restarting FPM on mw canaries to pick up OpenSSL updates [10:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:36] (03CR) 10Alexandros Kosiaris: Switch to iptables legacy alternative provider on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677931 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [10:34:46] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: move the internal server function to apache [puppet] - 10https://gerrit.wikimedia.org/r/677292 (owner: 10Jbond) [10:34:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] New upstream version 0.13.1 [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/677996 (owner: 10JMeybohm) [10:34:50] (03PS1) 10Jbond: Revert "P:debmonitor::server: move the internal server function to apache" [puppet] - 10https://gerrit.wikimedia.org/r/678692 [10:35:36] !log switch debmonitor internal interface to use to use apache [10:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:12] (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: Add ALTER grant [puppet] - 10https://gerrit.wikimedia.org/r/678792 (https://phabricator.wikimedia.org/T279053) (owner: 10Marostegui) [10:39:41] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/677506 (owner: 10Jbond) [10:39:46] (03PS2) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/677506 [10:39:51] !log kharlan@deploy1002 helmfile 
[staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [10:39:55] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/677506 (owner: 10Jbond) [10:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:17] (03PS1) 10Jbond: Revert "P:debmonitor::server: Drop nginx in favour of tlsproxy" [puppet] - 10https://gerrit.wikimedia.org/r/678693 [10:40:37] PROBLEM - Check systemd state on debmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:51] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is CRITICAL: connect to address 10.64.16.72 and port 7443: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor [10:41:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 30%: Slowly pool db1180 for the first time in s6 T275633', diff saved to https://phabricator.wikimedia.org/P15288 and previous config saved to /var/cache/conftool/dbconfig/20210413-104121-root.json [10:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:30] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:41:59] jbond42: needs a hand to test the changes? [10:42:25] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1002 is CRITICAL: connect to address 10.64.16.72 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor [10:43:05] volans: I'm running some tests when the second patch it live, but more tests won't hurt [10:43:20] I'm seeing AH00526: Syntax error on line 5 of /etc/apache2/sites-enabled/50-debmonitor-discovery-wmnet.conf: [10:43:20] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[10:43:20] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [10:43:23] from syslog [10:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:52] (03CR) 10Jcrespo: "This looks fine." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) (owner: 10Jbond) [10:43:56] (03CR) 10Jcrespo: [C: 03+1] O:gitlab: add config for backup sets [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) (owner: 10Jbond) [10:46:45] PROBLEM - Check systemd state on db1139 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:45] PROBLEM - Check systemd state on cp3054 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:47] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is CRITICAL: connect to address 10.192.32.42 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor [10:46:49] PROBLEM - Check systemd state on db2108 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:01] PROBLEM - Check systemd state on es1029 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:01] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:05] PROBLEM - Check systemd state on aqs1007 is CRITICAL: 
CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:09] PROBLEM - Check systemd state on cp5011 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:10] jbond42: are you reverting or fixing? [10:47:11] PROBLEM - Check systemd state on mc1020 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:19] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 7443: HTTP/1.1 200 OK https://wikitech.wikimedia.org/wiki/Debmonitor [10:47:20] volans: ill revert now [10:47:23] PROBLEM - Check systemd state on mw2378 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:23] PROBLEM - Check systemd state on db2146 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:33] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:35] PROBLEM - Check systemd state on druid1002 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:41] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:41] PROBLEM - Check systemd state on mw2273 is CRITICAL: CRITICAL - degraded: The 
following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:47] PROBLEM - Check systemd state on mw2266 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:57] PROBLEM - Check systemd state on ml-etcd1002 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:03] PROBLEM - Check systemd state on restbase2020 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:09] PROBLEM - Check systemd state on kubernetes1009 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:09] PROBLEM - Check systemd state on wtp1043 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:09] PROBLEM - Check systemd state on mw1404 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:21] PROBLEM - Check systemd state on kubemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:37] PROBLEM - Check systemd state on restbase1021 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:43] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:51] PROBLEM - Check systemd state on db2124 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:59] PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:59] PROBLEM - Check systemd state on maps2003 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:03] (03CR) 10Jbond: [C: 03+2] Revert "P:debmonitor::server: move the internal server function to apache" [puppet] - 10https://gerrit.wikimedia.org/r/678692 (owner: 10Jbond) [10:49:05] PROBLEM - Check systemd state on mw1337 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:35] (03CR) 10Jbond: [C: 03+2] Revert "P:debmonitor::server: Drop nginx in favour of tlsproxy" [puppet] - 10https://gerrit.wikimedia.org/r/678693 (owner: 10Jbond) [10:49:46] (03PS2) 10Jbond: Revert "P:debmonitor::server: move the internal server function to apache" [puppet] - 10https://gerrit.wikimedia.org/r/678692 [10:49:53] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:debmonitor::server: move the internal server function to apache" [puppet] - 10https://gerrit.wikimedia.org/r/678692 (owner: 10Jbond) [10:53:38] RECOVERY - Check systemd state on debmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:24] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[10:55:24] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [10:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:38] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1002 is OK: HTTP OK: Status line output matched HTTP/1.1 400 - 405 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [10:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:42] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [10:56:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 40%: Slowly pool db1180 for the first time in s6 T275633', diff saved to https://phabricator.wikimedia.org/P15289 and previous config saved to /var/cache/conftool/dbconfig/20210413-105625-root.json [10:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:33] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:56:56] RECOVERY - Check systemd state on druid1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:03] jbond42: if it can helps we could run 'systemctl is-failed --quiet debmonitor-client.service && systemctl restart debmonitor-client.service' [10:57:39] volans: if you could that would be great [10:57:52] sure [10:57:57] thx [10:58:32] RECOVERY - Check systemd state on mw2273 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:36] 
10SRE, 10Traffic: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (10ayounsi) [10:58:46] jbond42: {done} [10:58:50] RECOVERY - Check systemd state on ml-etcd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:50] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 518 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [10:58:54] volans: thanks [10:58:56] RECOVERY - Check systemd state on restbase2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:00] RECOVERY - Check systemd state on an-worker1114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:02] RECOVERY - Check systemd state on kubernetes1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:02] RECOVERY - Check systemd state on wtp1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:04] RECOVERY - Check systemd state on mw1404 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:24] RECOVERY - Check systemd state on kubemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:38] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:39] (03CR) 10Muehlenhoff: [C: 03+2] dbmonitor: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/675081 (owner: 10Muehlenhoff) [10:59:42] RECOVERY - 
Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:52] RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:58] RECOVERY - Check systemd state on mw2266 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy [[Backport windows|European mid-day backport window]]
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210413T1100). Please do the needful. [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:08] RECOVERY - Check systemd state on maps2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:18] RECOVERY - Check systemd state on mw2378 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:18] o/ [11:00:52] PROBLEM - Check no envoy runtime configuration is left persistent on debmonitor2002 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:01:03] (03PS1) 10Jbond: P:debmonitor::server: move the internal server function to apache [puppet] - 10https://gerrit.wikimedia.org/r/678695 [11:01:07] (03PS2) 10Muehlenhoff: ircecho: Install python-prometheus-client [puppet] - 10https://gerrit.wikimedia.org/r/677834 [11:01:14] 10SRE, 10observability, 10Security, 10User-jbond: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590 (10ayounsi) [11:01:24] PROBLEM - Check that envoy is running on debmonitor2002 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:02:16] (03PS1) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/678696 [11:02:18] RECOVERY - Check systemd state on mc1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:32] (03CR) 10jerkins-bot: [V: 04-1] P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/678696 (owner: 10Jbond) [11:02:42] (03PS2) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 
10https://gerrit.wikimedia.org/r/678696 [11:03:33] (03PS4) 10Jbond: P:debmonitor::server: convert cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) [11:03:43] (03CR) 10Muehlenhoff: [C: 03+2] ircecho: Install python-prometheus-client [puppet] - 10https://gerrit.wikimedia.org/r/677834 (owner: 10Muehlenhoff) [11:04:45] (03PS3) 10Muehlenhoff: profile::conftool::client: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668019 [11:04:50] RECOVERY - Check systemd state on restbase1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:03] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) Probably known (sorry) but the other alert I saw recently was: "CRITICAL: the following (6) node(s) change every puppet run: dbmonitor1001.wikimedia.org,...". Probably related to this? 
[11:05:24] (03PS1) 10Marostegui: instances.yaml: Remove db1076 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/678818 (https://phabricator.wikimedia.org/T274752) [11:05:27] I shall self-serve [11:06:42] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is OK: HTTP OK: Status line output matched HTTP/1.1 400 - 405 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [11:06:44] RECOVERY - Check systemd state on es1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:51] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1076 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/678818 (https://phabricator.wikimedia.org/T274752) (owner: 10Marostegui) [11:08:48] RECOVERY - Check systemd state on cp3054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:56] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 518 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [11:08:58] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2008.codfw.wmnet', 'parse2009.codfw.wmnet', 'parse201... 
[11:10:07] (03PS2) 10Jbond: P:debmonitor::server: move the internal server function to apache [puppet] - 10https://gerrit.wikimedia.org/r/678695 [11:10:36] (03PS2) 10Ladsgroup: Disable legacy javascript variables in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678714 (https://phabricator.wikimedia.org/T72470) [11:10:42] (03CR) 10Ladsgroup: [C: 03+2] Disable legacy javascript variables in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678714 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:11:12] RECOVERY - Check systemd state on db2146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:39] (03PS3) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/678696 [11:12:23] (03Merged) 10jenkins-bot: Disable legacy javascript variables in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678714 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:13:32] RECOVERY - Check systemd state on mw1337 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:00] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:678714|Disable legacy javascript variables in zhwiki (T72470)]] (duration: 00m 57s) [11:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:08] (03PS4) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/678696 [11:14:09] T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470 [11:14:14] I'm done [11:14:24] (03PS5) 10Jbond: P:debmonitor::server: convert cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) [11:14:34] (03PS2) 10Jbond: P:debmonitor::Server: drop absented resource 
[puppet] - 10https://gerrit.wikimedia.org/r/677514 [11:15:36] RECOVERY - Check systemd state on db1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:38] RECOVERY - Check systemd state on db2108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/678695 (owner: 10Jbond) [11:17:45] ok second attempt to switch debmonitor [11:17:55] !log switch debmonitor internal service to apache [11:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:08] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: move the internal server function to apache [puppet] - 10https://gerrit.wikimedia.org/r/678695 (owner: 10Jbond) [11:18:16] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:24] PROBLEM - Check no envoy runtime configuration is left persistent on debmonitor1002 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:19:46] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:58] PROBLEM - Check that envoy is running on debmonitor1002 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [11:19:59] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10Lena_WMDE) thanks @KFrancis ! I just reviewed and signed. 
[11:21:20] RECOVERY - Check systemd state on cp5011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:24] RECOVERY - Check systemd state on db2124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:53] (03CR) 10Ladsgroup: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [11:26:56] PROBLEM - Check systemd state on mw1386 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:56] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2008.codfw.wmnet with reason: REIMAGE [11:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:14] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:29:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2009.codfw.wmnet with reason: REIMAGE [11:30:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2008.codfw.wmnet with reason: REIMAGE [11:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:33] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:31:59] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2010.codfw.wmnet with reason: 
REIMAGE [11:32:06] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2009.codfw.wmnet with reason: REIMAGE [11:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:05] PROBLEM - Check systemd state on db2077 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:10] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2010.codfw.wmnet with reason: REIMAGE [11:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:29] jbond42: ^ debmonitor client failing again [11:35:11] Majavah: ack thanks will check just testing a few things right now but should be fixed soon [11:39:27] (03PS1) 10Muehlenhoff: Enable ProxyPreserveHost for the debmonitor Apache ingestion site [puppet] - 10https://gerrit.wikimedia.org/r/678825 [11:40:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/678825 (owner: 10Muehlenhoff) [11:41:45] (03CR) 10Muehlenhoff: [C: 03+2] Enable ProxyPreserveHost for the debmonitor Apache ingestion site [puppet] - 10https://gerrit.wikimedia.org/r/678825 (owner: 10Muehlenhoff) [11:42:09] (03PS5) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/678696 [11:44:25] Majavah: should recover now [11:45:18] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1002 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Debmonitor [11:51:39] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2008.codfw.wmnet', 'parse2009.codfw.wmnet', 
'parse2010.codfw.wmnet'] ` and were **ALL** successful. [11:58:37] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Debmonitor [11:59:31] (03PS6) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/678696 [12:01:53] * jbond42 looking at the debmon alerts [12:04:53] (03PS7) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/678696 [12:05:39] (03CR) 10Muehlenhoff: [C: 03+2] Drop bastion role from bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/678807 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [12:07:31] (03PS8) 10Jbond: P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/678696 [12:08:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28996/console" [puppet] - 10https://gerrit.wikimedia.org/r/678696 (owner: 10Jbond) [12:12:11] PROBLEM - Check systemd state on kubernetes2013 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:36] !log deleting stale wikidata indices on cloudelastic (T231517) [12:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:53] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [12:14:18] (03CR) 10Gergő Tisza: "> LGTM! As far as I know, we don't need DROP (Gergo could you confirm please?) but that could be left for another patch." [puppet] - 10https://gerrit.wikimedia.org/r/678792 (https://phabricator.wikimedia.org/T279053) (owner: 10Marostegui) [12:19:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/678696 (owner: 10Jbond) [12:21:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1076 from dbctl T274752', diff saved to https://phabricator.wikimedia.org/P15291 and previous config saved to /var/cache/conftool/dbconfig/20210413-122119-marostegui.json [12:21:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Slowly pool db1180 for the first time in s6 T275633', diff saved to https://phabricator.wikimedia.org/P15292 and previous config saved to /var/cache/conftool/dbconfig/20210413-122126-root.json [12:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:29] T274752: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 [12:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:37] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [12:22:49] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: Drop nginx in favour of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/678696 (owner: 10Jbond) [12:22:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1184 with minimal weight on s1 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15293 and previous config saved to /var/cache/conftool/dbconfig/20210413-122248-marostegui.json [12:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:03] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:24:25] 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) >>! 
In T224589#6993748, @jcrespo wrote: > Probably known (sorry) but the other alert I saw recently was: "CRITICAL: the following (6) node(s) change every puppet run: dbmonitor1001.wik... [12:26:08] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Silvan Heintze - https://phabricator.wikimedia.org/T279764 (10fgiunchedi) [12:29:32] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Silvan Heintze - https://phabricator.wikimedia.org/T279764 (10fgiunchedi) [12:31:37] RECOVERY - snapshot of s7 in codfw on alert1001 is OK: Last snapshot for s7 at codfw (db2100.codfw.wmnet:3317) taken on 2021-04-13 10:48:17 (1073 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [12:34:08] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Silvan Heintze - https://phabricator.wikimedia.org/T279764 (10fgiunchedi) >>! In T279764#6987000, @WMDE-leszek wrote: > As a WMDE Engineering Manager a approve this request. How approves it on WMF's end these days? @thcipriani or @greg? Yes tha... 
[12:36:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 60%: Slowly pool db1180 for the first time in s6 T275633', diff saved to https://phabricator.wikimedia.org/P15294 and previous config saved to /var/cache/conftool/dbconfig/20210413-123629-root.json [12:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:38] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [12:38:59] RECOVERY - Check systemd state on kubernetes2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:47] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 7443: HTTP/1.1 200 OK https://wikitech.wikimedia.org/wiki/Debmonitor [12:40:59] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:49:01] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 7443: HTTP/1.1 200 OK https://wikitech.wikimedia.org/wiki/Debmonitor [12:49:17] jbond42: ^^^ [12:49:23] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 70.17 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [12:49:34] 10SRE, 10DBA: Rename be_x_oldwiki database to be_taraskwiki - https://phabricator.wikimedia.org/T127570 (10LSobanski) 05Stalled→03Resolved a:03LSobanski Since T83609 was declined, I don't think there is much value in keeping this task open. 
Please reopen and / or message me if you think otherwise. [12:49:34] volans: yes investigating will ack the check [12:50:05] ack [12:50:09] lmk if I can help [12:50:26] ACKNOWLEDGEMENT - debmonitor.discovery.wmnet:443 internal on debmonitor1002 is CRITICAL: HTTP CRITICAL - No data received from host John Bond Post migration to envoy service is nolonger responding on the correct vhost https://wikitech.wikimedia.org/wiki/Debmonitor [12:50:26] ACKNOWLEDGEMENT - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 7443: HTTP/1.1 200 OK John Bond Post migration to envoy service is nolonger responding on the correct vhost https://wikitech.wikimedia.org/wiki/Debmonitor [12:50:26] ACKNOWLEDGEMENT - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is CRITICAL: HTTP CRITICAL - No data received from host John Bond Post migration to envoy service is nolonger responding on the correct vhost https://wikitech.wikimedia.org/wiki/Debmonitor [12:50:26] ACKNOWLEDGEMENT - debmonitor.wikimedia.org:7443 CDN on debmonitor2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 7443: HTTP/1.1 200 OK John Bond Post migration to envoy service is nolonger responding on the correct vhost https://wikitech.wikimedia.org/wiki/Debmonitor [12:50:29] thanks [12:51:05] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Silvan Heintze - https://phabricator.wikimedia.org/T279764 (10fgiunchedi) p:05Triage→03Medium [12:51:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 70%: Slowly pool db1180 for the first time in s6 T275633', diff saved to https://phabricator.wikimedia.org/P15295 and previous config saved to /var/cache/conftool/dbconfig/20210413-125133-root.json [12:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:44] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 
[12:54:30] 10SRE, 10observability: Missing 'notify' for some Icinga configuration files - https://phabricator.wikimedia.org/T263027 (10fgiunchedi) p:05Triage→03Low [12:56:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1184 with minimal weight on s1 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15296 and previous config saved to /var/cache/conftool/dbconfig/20210413-125652-marostegui.json [12:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:01] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [13:01:02] 10SRE, 10Wikimedia-Mailing-lists: Figure out mailman3 search index config - https://phabricator.wikimedia.org/T279701 (10fgiunchedi) p:05Triage→03Medium [13:01:53] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Decide on details of progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10fgiunchedi) p:05Triage→03Medium [13:03:10] 10SRE, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10fgiunchedi) p:05Triage→03Medium [13:03:12] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2021-04-01 to 2021-06-30 (Q4)): Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (10MoritzMuehlenhoff) p:05Triage→03High [13:03:27] 10SRE, 10Python3-Porting: git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10MoritzMuehlenhoff) p:05Triage→03High [13:04:08] 10SRE, 10Analytics, 10Traffic: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10fgiunchedi) p:05Triage→03Medium [13:04:50] (03PS4) 10Andrew Bogott: Neutron: forward our dmz hacks to Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677647 (https://phabricator.wikimedia.org/T261137) [13:04:52] (03PS1) 10Andrew Bogott: eqiad1 
designate -> Victoria [puppet] - 10https://gerrit.wikimedia.org/r/678833 (https://phabricator.wikimedia.org/T261137) [13:05:06] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1003/28999/" [puppet] - 10https://gerrit.wikimedia.org/r/678607 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [13:05:11] (03PS5) 10Ottomata: refine_job - remove RefineFailuresChecker and use 0.1.4 in test/refine [puppet] - 10https://gerrit.wikimedia.org/r/678607 (https://phabricator.wikimedia.org/T273789) [13:06:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 80%: Slowly pool db1180 for the first time in s6 T275633', diff saved to https://phabricator.wikimedia.org/P15297 and previous config saved to /var/cache/conftool/dbconfig/20210413-130637-root.json [13:06:37] (03Abandoned) 10Ottomata: analytics/refine: bump refinery-source version to 0.1.2 [puppet] - 10https://gerrit.wikimedia.org/r/667930 (owner: 10Milimetric) [13:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:46] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [13:06:51] (03CR) 10Ottomata: [C: 03+2] refine_job - remove RefineFailuresChecker and use 0.1.4 in test/refine [puppet] - 10https://gerrit.wikimedia.org/r/678607 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [13:08:45] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is OK: (C)100 gt (W)80 gt 19.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37 [13:08:47] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1 designate -> Victoria [puppet] - 10https://gerrit.wikimedia.org/r/678833 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [13:14:12] 
10SRE, 10SRE-Access-Requests: Requesting access to deployment for Silvan Heintze - https://phabricator.wikimedia.org/T279764 (10thcipriani) >>! In T279764#6994055, @fgiunchedi wrote: >>>! In T279764#6987000, @WMDE-leszek wrote: >> As a WMDE Engineering Manager a approve this request. How approves it on WMF's e... [13:15:18] (03PS5) 10Ottomata: refine - use refinery 0.1.4 [puppet] - 10https://gerrit.wikimedia.org/r/678608 (https://phabricator.wikimedia.org/T273789) [13:15:29] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 0 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [13:15:57] (03PS1) 10Jbond: P:debmonitor::server: Set cas site to listen on the loopback [puppet] - 10https://gerrit.wikimedia.org/r/678834 [13:16:28] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/678834 (owner: 10Jbond) [13:17:18] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/29000/" [puppet] - 10https://gerrit.wikimedia.org/r/678608 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [13:17:21] (03CR) 10Ottomata: [C: 03+2] refine - use refinery 0.1.4 [puppet] - 10https://gerrit.wikimedia.org/r/678608 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [13:21:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29002/console" [puppet] - 10https://gerrit.wikimedia.org/r/678834 (owner: 10Jbond) [13:21:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 90%: Slowly pool db1180 for the first time in s6 T275633', diff saved to https://phabricator.wikimedia.org/P15298 and previous config saved to 
/var/cache/conftool/dbconfig/20210413-132140-root.json [13:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:52] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [13:23:25] (03PS1) 10Ottomata: Refine - fix typo in job_config [puppet] - 10https://gerrit.wikimedia.org/r/678836 (https://phabricator.wikimedia.org/T273789) [13:23:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/678834 (owner: 10Jbond) [13:23:59] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Refine - fix typo in job_config [puppet] - 10https://gerrit.wikimedia.org/r/678836 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [13:24:43] (03PS2) 10Jbond: P:debmonitor::server: Set cas site to listen on the loopback [puppet] - 10https://gerrit.wikimedia.org/r/678834 [13:24:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.1 [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678837 [13:24:53] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.1 [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678837 (owner: 10TrainBranchBot) [13:26:48] (03PS1) 10Muehlenhoff: Extend d-i config for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/678838 (https://phabricator.wikimedia.org/T275873) [13:27:05] (03CR) 10Jbond: [C: 03+2] "thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/678834 (owner: 10Jbond) [13:27:37] 10SRE, 10DBA, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar), 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Marostegui) a:05nnikkhoui→03Marostegui [13:30:58] (03CR) 10Muehlenhoff: [C: 03+2] Extend d-i config for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/678838 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [13:36:41] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade 
Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2011.codfw.wmnet', 'parse2012.codfw.wmnet', 'parse201... [13:36:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Slowly pool db1180 for the first time in s6 T275633', diff saved to https://phabricator.wikimedia.org/P15299 and previous config saved to /var/cache/conftool/dbconfig/20210413-133644-root.json [13:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:56] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [13:38:40] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1026.eqiad.wmnet', 'wtp1027.eqiad.wmnet'] ` The log can... 
[13:43:28] PROBLEM - puppet last run on otrs1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:47:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_mediawiki_job_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:06] (03PS1) 10Jbond: O:alerting_host: update check to search for ocation [puppet] - 10https://gerrit.wikimedia.org/r/678839 [13:48:31] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.1 [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678837 (owner: 10TrainBranchBot) [13:49:05] (03PS5) 10Andrew Bogott: Neutron: forward our dmz hacks to Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677647 (https://phabricator.wikimedia.org/T261137) [13:49:07] (03PS1) 10Andrew Bogott: cloud-vps codfw1dev -> OpenStack Victoria [puppet] - 10https://gerrit.wikimedia.org/r/678840 (https://phabricator.wikimedia.org/T261137) [13:49:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29003/console" [puppet] - 10https://gerrit.wikimedia.org/r/678839 (owner: 10Jbond) [13:51:05] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10JAdams) [13:52:02] (03CR) 10Effie Mouzeli: "This has been running on the canaries for some time" [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [13:53:16] 10SRE, 10serviceops: Jenkins fails onCI puppet with: EnvironmentError: 404 Client Error: Not Found for url: https://pypi.org/simple/pkg-resources/ - https://phabricator.wikimedia.org/T279307 (10fgiunchedi) p:05Triage→03Medium [13:53:34] 10SRE, 10Maps, 10Packaging, 10serviceops: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (10fgiunchedi) p:05Triage→03Medium [13:55:40] !log 
jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2011.codfw.wmnet with reason: REIMAGE [13:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:22] (03CR) 10Awight: [C: 03+1] "Thanks for noticing that we needed to enable the feature!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678798 (https://phabricator.wikimedia.org/T271825) (owner: 10WMDE-Fisch) [13:57:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2012.codfw.wmnet with reason: REIMAGE [13:57:41] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2011.codfw.wmnet with reason: REIMAGE [13:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:42] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2013.codfw.wmnet with reason: REIMAGE [13:59:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2012.codfw.wmnet with reason: REIMAGE [13:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:58] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mcrouter: add healthz script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/673845 (owner: 10Giuseppe Lavagetto) [14:01:42] (03PS5) 10Herron: kafka-logging: migrate broker logstash1010 to kafka-logging1001 [puppet] - 10https://gerrit.wikimedia.org/r/677009 (https://phabricator.wikimedia.org/T279342) [14:01:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2013.codfw.wmnet with reason: REIMAGE [14:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:44] (03CR) 10JMeybohm: [C: 03+2] New upstream version 0.13.1 [debs/chartmuseum] 
- 10https://gerrit.wikimedia.org/r/677996 (owner: 10JMeybohm) [14:03:42] <_joe_> !log uploading new versions of the mcrouter, php7.2-fpm and php7.3-fpm images to the registry [14:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 20%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15300 and previous config saved to /var/cache/conftool/dbconfig/20210413-140353-root.json [14:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:02] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [14:04:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1184 with minimal weight on s1 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15301 and previous config saved to /var/cache/conftool/dbconfig/20210413-140431-marostegui.json [14:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:02] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (10Pcoombe) Hi @JKatzWMF, although my own access is working fine, I can't seem to add any other users. Tried following the instructions at https://support.go... 
[14:06:38] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1026.eqiad.wmnet with reason: REIMAGE [14:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:57] (03PS1) 10Jbond: check_https_client_auth: add new icinga check [puppet] - 10https://gerrit.wikimedia.org/r/678844 [14:07:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29004/console" [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [14:08:03] (03Merged) 10jenkins-bot: New upstream version 0.13.1 [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/677996 (owner: 10JMeybohm) [14:08:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1026.eqiad.wmnet with reason: REIMAGE [14:08:40] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1027.eqiad.wmnet with reason: REIMAGE [14:08:45] !log updated bullseye d-i image to 2021-04-12 daily build T275873 [14:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:08] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [14:10:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1027.eqiad.wmnet with reason: REIMAGE [14:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:16] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (10fgiunchedi) [14:14:27] (03CR) 10Filippo Giunchedi: 
[C: 03+1] "Nit inline, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678839 (owner: 10Jbond) [14:17:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:47] (03CR) 10Andrew Bogott: [C: 03+2] Neutron: forward our dmz hacks to Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677647 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [14:19:24] (03PS1) 10Herron: kafka-logging1001: disable icinga notifications during setup [puppet] - 10https://gerrit.wikimedia.org/r/678845 [14:19:56] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2011.codfw.wmnet', 'parse2012.codfw.wmnet', 'parse2013.codfw.wmnet'] ` and were **ALL** successful. 
[14:20:12] (03PS2) 10Jbond: check_https_client_auth_puppet: add new icinga check [puppet] - 10https://gerrit.wikimedia.org/r/678844 [14:20:17] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps codfw1dev -> OpenStack Victoria [puppet] - 10https://gerrit.wikimedia.org/r/678840 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [14:21:19] (03CR) 10jerkins-bot: [V: 04-1] check_https_client_auth_puppet: add new icinga check [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [14:21:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29005/console" [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond) [14:21:59] (03PS2) 10Jbond: O:alerting_host: update check to search for ocation [puppet] - 10https://gerrit.wikimedia.org/r/678839 [14:22:12] (03CR) 10Jbond: [C: 03+2] "thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678839 (owner: 10Jbond) [14:22:41] (03CR) 10Jbond: [V: 03+2 C: 03+2] O:alerting_host: update check to search for ocation [puppet] - 10https://gerrit.wikimedia.org/r/678839 (owner: 10Jbond) [14:22:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:22:51] (03CR) 10Herron: [C: 03+2] kafka-logging1001: disable icinga notifications during setup [puppet] - 10https://gerrit.wikimedia.org/r/678845 (owner: 10Herron) [14:23:20] herron: happy for me to merge yours? [14:23:35] jbond42: yes please [14:23:49] herron: done [14:23:56] jbond42: kk thx! 
[14:24:09] np [14:26:28] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [14:28:19] (03PS8) 10Ottomata: Set up refine_sanitize jobs in analytics test cluster. [puppet] - 10https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789) [14:29:25] (03CR) 10jerkins-bot: [V: 04-1] Set up refine_sanitize jobs in analytics test cluster. [puppet] - 10https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:34:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1184 with minimal weight on s1 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15302 and previous config saved to /var/cache/conftool/dbconfig/20210413-143419-marostegui.json [14:34:24] (03CR) 10Thcipriani: [C: 03+1] gerrit: Remove GWTUI styles [puppet] - 10https://gerrit.wikimedia.org/r/678425 (https://phabricator.wikimedia.org/T277645) (owner: 10Paladox) [14:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:30] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [14:34:46] (03CR) 10Thcipriani: [C: 03+1] gerrit: Convert gerrit-theme to Polymer 3 [puppet] - 10https://gerrit.wikimedia.org/r/678646 (owner: 10Paladox) [14:35:19] (03PS9) 10Ottomata: Set up refine_sanitize jobs in analytics test cluster. 
[puppet] - 10https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789) [14:35:24] RECOVERY - Check systemd state on db2077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:34] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/29007/an-test-coord1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:38:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 20%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15303 and previous config saved to /var/cache/conftool/dbconfig/20210413-143821-root.json [14:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:25] (03PS1) 10Filippo Giunchedi: admin: move 'sihe' into 'deployment' [puppet] - 10https://gerrit.wikimedia.org/r/678856 (https://phabricator.wikimedia.org/T279764) [14:39:30] (03CR) 10Thcipriani: [C: 04-1] "Not ready for this change at the moment. This may become necessary in future, but for now the trade off of fixing all the things this chan" [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [14:40:00] (03CR) 10Cwhite: [C: 03+1] "Looks good! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/677834 (owner: 10Muehlenhoff) [14:42:36] (03PS1) 10Ottomata: test/refine_sanitize - use proper refinery path [puppet] - 10https://gerrit.wikimedia.org/r/678857 (https://phabricator.wikimedia.org/T273789) [14:45:10] (03PS1) 10Giuseppe Lavagetto: Fix error in the php script for liveness probes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/678858 [14:45:17] (03PS2) 10Ottomata: test/refine_sanitize - use proper refinery path [puppet] - 10https://gerrit.wikimedia.org/r/678857 (https://phabricator.wikimedia.org/T273789) [14:45:59] (03PS1) 10Elukey: Set hue.wikimedia.org for an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/678860 (https://phabricator.wikimedia.org/T264896) [14:46:01] (03PS1) 10Elukey: Move hue.wikimedia.org to the an-tool1009 backend [puppet] - 10https://gerrit.wikimedia.org/r/678861 (https://phabricator.wikimedia.org/T264896) [14:46:28] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/29009/an-test-coord1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/678857 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:47:10] 10SRE, 10OTRS, 10Security, 10User-notice: Migrate OTRS CE 6 to Znuny LTS fork - https://phabricator.wikimedia.org/T279303 (10Keegan) >>! In T279303#6993277, @akosiaris wrote: > I am gonna be bold and resolve this as it pertains to the actual migration and that part seems to have gone quite well. I am prett... [14:47:47] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Fix error in the php script for liveness probes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/678858 (owner: 10Giuseppe Lavagetto) [14:52:01] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1026.eqiad.wmnet', 'wtp1027.eqiad.wmnet'] ` and were **ALL** successful. 
[14:53:06] (03PS1) 10Ottomata: Ensure refinery/python on PYTHONPATH for refinery-eventlogging-saltrotate in test [puppet] - 10https://gerrit.wikimedia.org/r/678863 (https://phabricator.wikimedia.org/T273789) [14:53:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 30%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15304 and previous config saved to /var/cache/conftool/dbconfig/20210413-145325-root.json [14:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:34] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [14:53:37] PROBLEM - Check systemd state on debmonitor2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:01] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [14:54:16] (03CR) 10jerkins-bot: [V: 04-1] Ensure refinery/python on PYTHONPATH for refinery-eventlogging-saltrotate in test [puppet] - 10https://gerrit.wikimedia.org/r/678863 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:54:54] (03PS2) 10Ottomata: Ensure refinery/python on PYTHONPATH for refinery-eventlogging-saltrotate [puppet] - 10https://gerrit.wikimedia.org/r/678863 (https://phabricator.wikimedia.org/T273789) [14:57:39] (03CR) 10Ottomata: [C: 03+2] Ensure refinery/python on PYTHONPATH for refinery-eventlogging-saltrotate [puppet] - 10https://gerrit.wikimedia.org/r/678863 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:58:27] (03PS1) 10Bearloga: statistics::discovery: Set PYTHONPATH [puppet] - 10https://gerrit.wikimedia.org/r/678864 (https://phabricator.wikimedia.org/T279443) [15:08:27] (03PS1) 10Ottomata: 
test/refine_sanitize - ensure hdfs salts dir exists, and use -f when running saltrotate rm [puppet] - 10https://gerrit.wikimedia.org/r/678867 (https://phabricator.wikimedia.org/T273789) [15:08:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 40%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15305 and previous config saved to /var/cache/conftool/dbconfig/20210413-150829-root.json [15:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:38] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [15:09:35] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitize - ensure hdfs salts dir exists, and use -f when running saltrotate rm [puppet] - 10https://gerrit.wikimedia.org/r/678867 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:09:40] (03PS2) 10Ottomata: test/refine_sanitize - ensure hdfs salts dir exists [puppet] - 10https://gerrit.wikimedia.org/r/678867 (https://phabricator.wikimedia.org/T273789) [15:11:26] (03CR) 10Ottomata: [C: 03+2] test/refine_sanitize - ensure hdfs salts dir exists [puppet] - 10https://gerrit.wikimedia.org/r/678867 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:12:15] !log reindexing English wikis on elastic@eqiad, elastic@codfw, and cloudelastic complete (with some failures) (T274200) [15:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:23] T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200 [15:19:09] (03PS1) 10Ottomata: hadoop::directory - Remove dependency on namenode [puppet] - 10https://gerrit.wikimedia.org/r/678870 [15:19:39] (03PS3) 10Jbond: check_https_client_auth_puppet: add new icinga check [puppet] - 10https://gerrit.wikimedia.org/r/678844 [15:19:41] (03PS1) 10Jbond: monitoring: fix spec tests for monitoring class [puppet] - 
10https://gerrit.wikimedia.org/r/678871 [15:21:25] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:12] (03CR) 10Jbond: [C: 03+2] monitoring: fix spec tests for monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/678871 (owner: 10Jbond) [15:23:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15306 and previous config saved to /var/cache/conftool/dbconfig/20210413-152333-root.json [15:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:42] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [15:24:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:15] !log migrating kafka-logging broker logstash1010 to kafka-logging1001 T279342 [15:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:23] T279342: Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts - https://phabricator.wikimedia.org/T279342 [15:27:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/678861 (https://phabricator.wikimedia.org/T264896) (owner: 10Elukey) [15:28:23] 10SRE, 10observability, 10CAS-SSO, 10Patch-For-Review, and 2 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10Volans) From some quick tests it seems that the redirect works fine. Just as a note, possibly intended, if I'm logged... 
[15:28:41] (03CR) 10Jbond: ceph: add ceph repo and parameter to all client modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [15:30:43] (03PS2) 10Ottomata: Configure Notebook Terminal to not use login shell [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) [15:31:06] (03CR) 10Arturo Borrero Gonzalez: gridengine: set grid-configurator source files to use new domain name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm) [15:31:14] (03CR) 10jerkins-bot: [V: 04-1] Configure Notebook Terminal to not use login shell [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:32:21] (03CR) 10Bstorm: gridengine: set grid-configurator source files to use new domain name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm) [15:32:33] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:06] (03PS1) 10Ottomata: test/refine_santize - Use normalized lowercase table name [puppet] - 10https://gerrit.wikimedia.org/r/678876 [15:33:20] (03PS3) 10Ottomata: Configure Notebook Terminal to not use login shell [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) [15:33:54] (03CR) 10jerkins-bot: [V: 04-1] Configure Notebook Terminal to not use login shell [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:33:57] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:17] (03CR) 10jerkins-bot: [V: 
04-1] test/refine_santize - Use normalized lowercase table name [puppet] - 10https://gerrit.wikimedia.org/r/678876 (owner: 10Ottomata) [15:35:05] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1028.eqiad.wmnet', 'wtp1029.eqiad.wmnet'] ` The log can... [15:36:29] (03PS2) 10Ottomata: test/refine_santize - Use normalized lowercase table name [puppet] - 10https://gerrit.wikimedia.org/r/678876 (https://phabricator.wikimedia.org/T273789) [15:38:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 60%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15307 and previous config saved to /var/cache/conftool/dbconfig/20210413-153836-root.json [15:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:46] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [15:39:38] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2014.codfw.wmnet', 'parse2015.codfw.wmnet', 'parse201... 
[15:39:53] (03PS4) 10Ottomata: Configure Notebook Terminal to not use login shell [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) [15:39:55] (03CR) 10Bstorm: gridengine: set grid-configurator source files to use new domain name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm) [15:40:01] (03PS5) 10Ottomata: Configure Notebook Terminal to not use login shell [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) [15:40:09] (03CR) 10Ottomata: [C: 03+2] test/refine_santize - Use normalized lowercase table name [puppet] - 10https://gerrit.wikimedia.org/r/678876 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:40:41] (03CR) 10jerkins-bot: [V: 04-1] Configure Notebook Terminal to not use login shell [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:40:43] (03CR) 10Herron: [C: 03+2] kafka-logging: migrate broker logstash1010 to kafka-logging1001 [puppet] - 10https://gerrit.wikimedia.org/r/677009 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [15:41:19] (03CR) 10David Caro: apt::package_from_component: ad toggle to make installing packages optional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678811 (owner: 10Jbond) [15:41:44] (03PS2) 10MacFan4000: ExtensionDistributor: Add REL1_36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678146 (owner: 10Legoktm) [15:42:21] (03CR) 10MacFan4000: [C: 03+1] ExtensionDistributor: Add REL1_36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678146 (owner: 10Legoktm) [15:42:51] (03PS6) 10Ottomata: Configure Notebook Terminal to not use login shell [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) [15:43:18] (03CR) 10jerkins-bot: [V: 04-1] Configure Notebook Terminal to not use login shell [puppet] - 
10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:43:45] (03CR) 10Ottomata: [C: 03+2] "pep8 warning is irrelevant" [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:43:53] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Configure Notebook Terminal to not use login shell [puppet] - 10https://gerrit.wikimedia.org/r/676999 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:45:50] (03PS1) 10Jbond: apt::package_from_component: Only add package dependencies if installing [puppet] - 10https://gerrit.wikimedia.org/r/678879 [15:46:17] (03CR) 10Jbond: apt::package_from_component: ad toggle to make installing packages optional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678811 (owner: 10Jbond) [15:46:55] (03PS2) 10Jbond: apt::package_from_component: Only add package dependencies if installing [puppet] - 10https://gerrit.wikimedia.org/r/678879 [15:47:09] (03PS2) 10Legoktm: logspam: silence rare but annoying UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/677676 (owner: 10Brennen Bearnes) [15:47:47] (03PS2) 10Bstorm: gridengine: set grid-configurator source files to use new domain name [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) [15:49:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:33] (03CR) 10Legoktm: [C: 03+2] logspam: silence rare but annoying UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/677676 (owner: 10Brennen Bearnes) [15:52:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Cmjohnson) a:05Cmjohnson→03RobH @robh all the secondary ports are updated and 
added to the private vlan per the instructions... [15:53:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 70%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15308 and previous config saved to /var/cache/conftool/dbconfig/20210413-155340-root.json [15:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:50] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [15:55:26] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10jijiki) [15:56:36] (03PS3) 10Jbond: O:gitlab: add config for backup sets [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) [15:58:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2014.codfw.wmnet with reason: REIMAGE [15:58:43] (03CR) 10Jbond: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) (owner: 10Jbond) [15:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:04] jbond42 and cdanis: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for [[Puppet request window]]
deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210413T1600). [16:00:12] (03CR) 10Jbond: [C: 03+1] snapshot: Migrate cronjobs in pagetitles to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:00:36] nothing to merge [16:00:41] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2015.codfw.wmnet with reason: REIMAGE [16:00:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2014.codfw.wmnet with reason: REIMAGE [16:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:43] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2016.codfw.wmnet with reason: REIMAGE [16:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2015.codfw.wmnet with reason: REIMAGE [16:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:07] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1028.eqiad.wmnet with reason: REIMAGE [16:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:59] 10SRE, 10observability, 10CAS-SSO, 10Patch-For-Review, and 2 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10jbond) > possibly intended, if I'm logged in and open grafana.w.o it remains there "logged out". AFAIK thats expecte... 
[16:04:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:54] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2016.codfw.wmnet with reason: REIMAGE [16:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:08] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1029.eqiad.wmnet with reason: REIMAGE [16:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:19] (03CR) 10David Caro: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/678879 (owner: 10Jbond) [16:06:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 90): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29014/console" [puppet] - 10https://gerrit.wikimedia.org/r/678879 (owner: 10Jbond) [16:06:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1028.eqiad.wmnet with reason: REIMAGE [16:06:59] (03CR) 10Jbond: [V: 03+1 C: 03+2] apt::package_from_component: Only add package dependencies if installing [puppet] - 10https://gerrit.wikimedia.org/r/678879 (owner: 10Jbond) [16:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:15] 10SRE, 10serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (10jijiki) @RLazarus do you mind running the script one last time? I hope to get TLS working this quarter, but sadly I didn't manage to do it towards the end of Q3 as I originally planned. 
[16:07:49] dcaro: fyi the apt::package_from_component CR is merged now, thanks [16:08:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 80%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15309 and previous config saved to /var/cache/conftool/dbconfig/20210413-160844-root.json [16:08:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1029.eqiad.wmnet with reason: REIMAGE [16:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:53] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [16:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:23] PROBLEM - Check systemd state on kubernetes2010 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:16] (03PS1) 10Urbanecm: Revert "Use getGrowthWikiConfig in appropriate places" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678885 [16:12:31] (03PS2) 10Urbanecm: Revert "Use getGrowthWikiConfig in appropriate places" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678885 (https://phabricator.wikimedia.org/T274520) [16:12:48] (03CR) 10Urbanecm: [C: 03+2] Revert "Use getGrowthWikiConfig in appropriate places" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678885 (https://phabricator.wikimedia.org/T274520) (owner: 10Urbanecm) [16:12:49] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10Dzahn) a:03Dzahn [16:16:39] (03CR) 10Elukey: [C: 03+2] statistics::discovery: Set PYTHONPATH [puppet] - 10https://gerrit.wikimedia.org/r/678864 (https://phabricator.wikimedia.org/T279443) (owner: 10Bearloga) [16:19:23] RECOVERY - Check systemd state on stat1007 is OK: OK 
- running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:34] (03CR) 10Ottomata: statistics::discovery: Set PYTHONPATH (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678864 (https://phabricator.wikimedia.org/T279443) (owner: 10Bearloga) [16:22:55] (03CR) 10Elukey: statistics::discovery: Set PYTHONPATH (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678864 (https://phabricator.wikimedia.org/T279443) (owner: 10Bearloga) [16:23:24] (03Merged) 10jenkins-bot: Revert "Use getGrowthWikiConfig in appropriate places" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678885 (https://phabricator.wikimedia.org/T274520) (owner: 10Urbanecm) [16:23:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 90%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15310 and previous config saved to /var/cache/conftool/dbconfig/20210413-162347-root.json [16:23:52] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2014.codfw.wmnet', 'parse2015.codfw.wmnet', 'parse2016.codfw.wmnet'] ` and were **ALL** successful. 
[16:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:57] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [16:25:10] RECOVERY - Check systemd state on kubernetes2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:11] (03PS1) 10Elukey: statistics:discovery: fix the PYTHONPATH set for the timer [puppet] - 10https://gerrit.wikimedia.org/r/678891 [16:29:06] (03CR) 10Elukey: [C: 03+2] statistics:discovery: fix the PYTHONPATH set for the timer [puppet] - 10https://gerrit.wikimedia.org/r/678891 (owner: 10Elukey) [16:29:33] (03CR) 10Bearloga: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/678891 (owner: 10Elukey) [16:31:25] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10Dzahn) @JAdams Thanks for the ticket, all looks good and it's done. Mails to jason@wikipedia.org should now show up in the endowment@wikimedia.org inbox / group. Feel free to test and confirm it. technical info for other... [16:31:29] (03CR) 10Jcrespo: "(sorry, hopefully you understood what I meant)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) (owner: 10Jbond) [16:32:26] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10Dzahn) @JAdams As you can see above lisa@ and janeen@ still go to donate@ but jason@ goes to endowment@. Just making sure that is how it's supposed to be. 
[16:32:44] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10Dzahn) a:05Dzahn→03JAdams [16:33:17] (03PS1) 10Legoktm: mailman2: Purge attachments from lists with archiving disabled [puppet] - 10https://gerrit.wikimedia.org/r/678897 (https://phabricator.wikimedia.org/T279237) [16:33:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:38] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10JAdams) @Dzahn -- confirming that the redirect should go to endowment@. Thanks! [16:34:57] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10Dzahn) 05Open→03Resolved @JAdams Yep, basically just wanted to add that the previous addresses go to donate@ and were not changed. Thanks, resolving this ticket. [16:36:36] (03CR) 10jerkins-bot: [V: 04-1] mailman2: Purge attachments from lists with archiving disabled [puppet] - 10https://gerrit.wikimedia.org/r/678897 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [16:36:49] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29015/console" [puppet] - 10https://gerrit.wikimedia.org/r/678897 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [16:36:53] (03PS2) 10Legoktm: mailman2: Purge attachments from lists with archiving disabled [puppet] - 10https://gerrit.wikimedia.org/r/678897 (https://phabricator.wikimedia.org/T279237) [16:38:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: Slowly pool db1184 for the first time in s1 T275633', diff saved to https://phabricator.wikimedia.org/P15311 and previous config saved to /var/cache/conftool/dbconfig/20210413-163851-root.json [16:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:00] T275633: 
Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [16:39:34] (03CR) 10Legoktm: [C: 03+2] "Script was reviewed by rzl already" [puppet] - 10https://gerrit.wikimedia.org/r/678897 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [16:48:47] (03PS6) 10Jbond: P:debmonitor::server: convert cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) [16:48:59] (03CR) 10H.krishna123: "Hi there, apologies for the late follow-up -- just wondering -- you had mentioned we shouldn't forget to close the ticket -- I wonder if t" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [16:49:09] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1028.eqiad.wmnet', 'wtp1029.eqiad.wmnet'] ` and were **ALL** successful. [16:51:44] PROBLEM - Check systemd state on debmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:33] 10SRE, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10Dzahn) Speaking for phab1001: I am not aware of anything that needs to connect from cloud to phab in production but adding @20after4 as well Spe... 
[16:55:59] (03CR) 10RLazarus: [C: 03+1] mailman2: Purge attachments from lists with archiving disabled [puppet] - 10https://gerrit.wikimedia.org/r/678897 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [16:59:03] 10SRE, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10mmodell) @dzahn: would this block toolhub tools from connecting to phab? Things like #stashbot and #wikibugs come to mind. [16:59:56] (03CR) 10Dzahn: "> Patch Set 17: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [17:00:04] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) [[mw:Services|Services]] – [[mw:Extension:Graph|Graphoid]] / [[ORES]] deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210413T1700). [17:03:15] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: convert cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) (owner: 10Jbond) [17:04:14] (03PS3) 10Jbond: P:debmonitor::Server: drop absented resource [puppet] - 10https://gerrit.wikimedia.org/r/677514 [17:07:03] (03PS4) 10Jbond: P:debmonitor::Server: drop absented resource [puppet] - 10https://gerrit.wikimedia.org/r/677514 [17:07:06] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/677514 (owner: 10Jbond) [17:13:58] (03CR) 10Dzahn: "given the history of https://phabricator.wikimedia.org/T240266 and the latest status, I am now removing myself from this" [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [17:16:24] RECOVERY - Check systemd state on debmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:02] (03CR) 10Dzahn: [C: 03+2] 
gerrit: Remove GWTUI styles [puppet] - 10https://gerrit.wikimedia.org/r/678425 (https://phabricator.wikimedia.org/T277645) (owner: 10Paladox) [17:17:44] RECOVERY - Check systemd state on debmonitor2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:35] (03CR) 10Dzahn: [C: 03+2] gerrit: Convert gerrit-theme to Polymer 3 [puppet] - 10https://gerrit.wikimedia.org/r/678646 (owner: 10Paladox) [17:21:54] !log gerrit1001 - remove /var/lib/gerrit2/review_site/static/gerrit-theme.html after https://gerrit.wikimedia.org/r/c/operations/puppet/+/678646 [17:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:25] (03CR) 10Dzahn: "(re-)moved /var/lib/gerrit2/review_site/static/gerrit-theme.html after deploy" [puppet] - 10https://gerrit.wikimedia.org/r/678646 (owner: 10Paladox) [17:23:14] (03CR) 10Bstorm: [C: 03+2] "This tests out fine (except some strange issue with dockerfiles on the very latest versions of Docker that has nothing to do with this pat" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/674134 (https://phabricator.wikimedia.org/T278180) (owner: 10Addshore) [17:23:24] (03PS1) 10Paladox: Revert "gerrit: Convert gerrit-theme to Polymer 3" [puppet] - 10https://gerrit.wikimedia.org/r/678700 [17:24:20] (03CR) 10Dzahn: [C: 03+2] Revert "gerrit: Convert gerrit-theme to Polymer 3" [puppet] - 10https://gerrit.wikimedia.org/r/678700 (owner: 10Paladox) [17:25:07] (03Merged) 10jenkins-bot: node10-sssd: bump npm from 6.5 to 6.14.5 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/674134 (https://phabricator.wikimedia.org/T278180) (owner: 10Addshore) [17:26:21] (03CR) 10Dzahn: "adding @godog as clinic duty" [puppet] - 10https://gerrit.wikimedia.org/r/678380 (owner: 10Ori.livneh) [17:28:17] (03CR) 10Dzahn: "since there will be manual testing in beta and others have reviewed already, there doesn't seem to be much to do here for me." 
[puppet] - 10https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [17:28:58] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [17:28:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [17:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:20] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2017.codfw.wmnet', 'parse2018.codfw.wmnet', 'parse201... [17:29:25] (03CR) 10Dzahn: [C: 03+1] "I don't really know. needs releng. Maybe Jeena can help." [puppet] - 10https://gerrit.wikimedia.org/r/670784 (owner: 10Hnowlan) [17:29:56] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1030.eqiad.wmnet', 'wtp1031.eqiad.wmnet'] ` The log can... [17:32:03] (03CR) 10Dzahn: "for the record, this change is not really puppet code. it just adds 2 new VMs to scap dsh groups. not sure what the issue is" [puppet] - 10https://gerrit.wikimedia.org/r/650306 (owner: 10Dzahn) [17:32:45] (03PS4) 10Dzahn: scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) [17:34:18] (03CR) 10Dzahn: "Is there really a problem with adding these VMs to scap?" 
[puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [17:35:09] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T265864#6995392" [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [17:37:20] 10SRE, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10Dzahn) @mmodell Yes, given that `stashbot.toolforge.org has address 185.15.56.11` it seem it would break it. [17:37:45] (03CR) 10Dzahn: [C: 04-1] "seems it would break bots talking to Phabricator https://phabricator.wikimedia.org/T265864#6995415" [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [17:39:42] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.37.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678904 [17:39:44] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.37.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678904 (owner: 10Jeena Huneidi) [17:40:57] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678904 (owner: 10Jeena Huneidi) [17:41:31] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.1 [17:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:44] PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:19] 10SRE, 10DBA: Rename be_x_oldwiki database to be_taraskwiki - https://phabricator.wikimedia.org/T127570 (10Krenair) 05Resolved→03Declined [17:48:11] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2017.codfw.wmnet with reason: REIMAGE [17:48:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:00] (03PS1) 10Jbond: (WIP) systemd::resolved: start work on puppet module for systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/678907 (https://phabricator.wikimedia.org/T171498) [17:50:08] PROBLEM - Check systemd state on aqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:10] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2018.codfw.wmnet with reason: REIMAGE [17:50:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2017.codfw.wmnet with reason: REIMAGE [17:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:09] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2019.codfw.wmnet with reason: REIMAGE [17:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2018.codfw.wmnet with reason: REIMAGE [17:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:17] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [17:54:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [17:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2019.codfw.wmnet with reason: REIMAGE [17:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:01] 10SRE, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from 
network::external - https://phabricator.wikimedia.org/T265864 (10Legoktm) >>! In T265864#6995415, @mmodell wrote: > @dzahn: would this block toolhub tools from connecting to phab? Things like #stashbot and #wi... [17:56:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1030.eqiad.wmnet with reason: REIMAGE [17:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:46] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10KFrancis) @Dzahn I am confirming the signed NDA. Please proceed with this request. Thanks! [17:58:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1030.eqiad.wmnet with reason: REIMAGE [17:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1031.eqiad.wmnet with reason: REIMAGE [17:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210413T1800) [18:01:30] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1031.eqiad.wmnet with reason: REIMAGE [18:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:20] (03CR) 10Ori.livneh: "Filippo (hi!): I'm not sure you have my public GPG key (E70ABDAEA89D5A07) in your web of trust. 
So I went ahead and signed the new SSH key" [puppet] - 10https://gerrit.wikimedia.org/r/678380 (owner: 10Ori.livneh) [18:04:20] (03CR) 10Bstorm: gridengine: set grid-configurator source files to use new domain name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm) [18:04:28] I had no idea you could use ssh keys to sign stuff like that [18:09:08] I didn't know that either before today, just figured it ought to be possible and turns out it is [18:11:36] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.1 (duration: 30m 36s) [18:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:06] legoktm: signing commits? [18:14:15] or do you mean something else? [18:14:17] tabbycat: see https://gerrit.wikimedia.org/r/678380 [18:14:58] looks like a standard signed text [18:15:28] not the commit message, the last comment :) signing an ssh key with another ssh key [18:15:33] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2017.codfw.wmnet', 'parse2018.codfw.wmnet', 'parse2019.codfw.wmnet'] ` and were **ALL** successful. 
[18:15:39] (different from signing the commit message with a *pgp* key) [18:16:53] RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:53] RECOVERY - Check systemd state on aqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:54] oh I see now [18:31:48] (03PS1) 10Andrew Bogott: wmfkeystonehooks: catch a minor exception during cleanup [puppet] - 10https://gerrit.wikimedia.org/r/678916 [18:32:30] (03CR) 10Merlijn van Deen: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/677805 (owner: 10Alexandros Kosiaris) [18:32:40] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: catch a minor exception during cleanup [puppet] - 10https://gerrit.wikimedia.org/r/678916 (owner: 10Andrew Bogott) [18:44:54] (03PS3) 10Legoktm: lists: Add option to enable mailman3 on lists [puppet] - 10https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [18:45:15] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1030.eqiad.wmnet', 'wtp1031.eqiad.wmnet'] ` and were **ALL** successful. 
[18:45:58] !log jhuneidi@deploy1002 Pruned MediaWiki: 1.36.0-wmf.37 (duration: 03m 16s) [18:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:59] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Decide on details of progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10Krinkle) a:03Krinkle [18:56:52] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['parse2020.codfw.wmnet'] ` The log can be found in `/var/lo... [18:57:20] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1032.eqiad.wmnet', 'wtp1033.eqiad.wmnet'] ` The log can... [19:00:04] longma and marxarelli: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210413T1900). 
[19:00:18] (03PS2) 10Jbond: (WIP) systemd::resolved: start work on puppet module for systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/678907 (https://phabricator.wikimedia.org/T171498) [19:02:22] (03CR) 10jerkins-bot: [V: 04-1] (WIP) systemd::resolved: start work on puppet module for systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/678907 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [19:02:35] (03CR) 10Legoktm: [C: 04-1] lists: Add option to enable mailman3 on lists (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [19:08:58] (03PS1) 10Jeena Huneidi: group0 wikis to 1.37.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678920 [19:09:00] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.37.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678920 (owner: 10Jeena Huneidi) [19:09:41] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678920 (owner: 10Jeena Huneidi) [19:11:13] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.1 [19:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:52] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2020.codfw.wmnet with reason: REIMAGE [19:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2020.codfw.wmnet with reason: REIMAGE [19:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:09] (03PS1) 10Dzahn: site/conftool-data: designate mw2394,mw2395 as dedicated jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/678926 (https://phabricator.wikimedia.org/T279100) [19:25:22] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wtp1032.eqiad.wmnet with reason: REIMAGE [19:25:22] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/678928 [19:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:35] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/678928 (owner: 10Kosta Harlan) [19:25:42] (03CR) 10jerkins-bot: [V: 04-1] site/conftool-data: designate mw2394,mw2395 as dedicated jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/678926 (https://phabricator.wikimedia.org/T279100) (owner: 10Dzahn) [19:27:23] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1033.eqiad.wmnet with reason: REIMAGE [19:27:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1032.eqiad.wmnet with reason: REIMAGE [19:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:44] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/678928 (owner: 10Kosta Harlan) [19:28:56] (03PS2) 10Dzahn: site/conftool-data: designate mw2394,mw2395 as dedicated jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/678926 (https://phabricator.wikimedia.org/T279100) [19:28:58] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[19:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1033.eqiad.wmnet with reason: REIMAGE [19:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:05] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [19:32:05] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [19:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:17] (cross-posting from #wikimedia-serviceops) is anyone around who could stop & remove a container with the linkrecommendation service? [19:35:26] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [19:35:27] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [19:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:05] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2020.codfw.wmnet'] ` and were **ALL** successful. [19:50:01] 10SRE, 10OTRS, 10Security, 10User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10Keegan) The migration to Znuny LTS 6.0 is complete. The volunteer admins and I are meeting on Thursday to discuss and plan the next steps in re... 
[20:10:29] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1032.eqiad.wmnet', 'wtp1033.eqiad.wmnet'] ` and were **ALL** successful. [20:19:59] (03PS1) 10Ottomata: Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [20:21:02] (03CR) 10jerkins-bot: [V: 04-1] Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:22:28] (03PS2) 10Ottomata: Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [20:23:31] (03CR) 10jerkins-bot: [V: 04-1] Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:24:42] (03PS3) 10Ottomata: Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [20:25:57] (03CR) 10jerkins-bot: [V: 04-1] Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:26:42] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29019/console" [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:31:57] (03PS4) 10Ottomata: Refactor EventLoggingSanitization using RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [20:32:53] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29020/console" [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [20:46:00] (03CR) 10Legoktm: [C: 03+2] codesearch: Puppetize beta frontend [puppet] - 10https://gerrit.wikimedia.org/r/678216 (https://phabricator.wikimedia.org/T277459) (owner: 10Legoktm) [20:47:45] !log [kubemaster1001:~] $ sudo kubectl delete pod linkrecommendation-production-load-datasets-1618311600-hn6k8 -n linkrecommendation (T280076) [20:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:56] T280076: Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8 - https://phabricator.wikimedia.org/T280076 [20:49:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:51:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:54:29] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/29016/" [puppet] - 10https://gerrit.wikimedia.org/r/678926 (https://phabricator.wikimedia.org/T279100) (owner: 10Dzahn) [20:55:20] jouncebot: now [20:55:20] For the next 0 hour(s) and 4 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210413T1900) [20:56:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[2394-2395].codfw.wmnet with reason: reimage [20:56:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[2394-2395].codfw.wmnet with reason: reimage [20:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:21] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts... [20:58:56] !log mw2395, mw2395 - reimaging as jobrunners (T279100) [20:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:06] T279100: Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 [21:05:38] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts... 
[21:06:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:05] (03PS1) 10Razzi: Remove unused hieradata/role/common/analytics_test_cluster/superset.yaml [puppet] - 10https://gerrit.wikimedia.org/r/678963 [21:10:19] (03CR) 10Ladsgroup: lists: Add option to enable mailman3 on lists (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [21:10:56] (03PS4) 10Ladsgroup: lists: Add option to enable mailman3 on lists [puppet] - 10https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) [21:13:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2394.codfw.wmnet with reason: REIMAGE [21:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2394.codfw.wmnet with reason: REIMAGE [21:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2395.codfw.wmnet with reason: REIMAGE [21:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2395.codfw.wmnet with reason: REIMAGE [21:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:05] (03PS1) 10Razzi: superset: use different user headers for staging and production [puppet] - 10https://gerrit.wikimedia.org/r/678966 (https://phabricator.wikimedia.org/T277729) [21:25:07] MatmaRex! hey, around by any chance? 
[21:25:16] yeah [21:25:44] MatmaRex: https://www.mediawiki.org/wiki/Help:VisualEditor/User_guide/cs doesn't work for me, because `Table 'mediawikiwiki.discussiontools_subscription' doesn't exist (10.64.16.7)`. Is that known to your team please? :-) [21:26:30] not known [21:26:39] the table is not supposed to exist yet, but thing should not fail because of that [21:26:43] things* [21:26:43] actually...the whole mediawiki.org is down for me. [21:26:48] also, the page loads for me [21:26:55] works for me [21:26:56] I have discussion tools enabled with the magic cookie [21:27:27] i.e. mw.cookie.set( 'discussiontools-tempenable', 'yes' ); [21:27:52] I have that in my global.js too, and the beta feature enabled [21:28:01] and does mediawiki.org work for you Majavah ? [21:28:05] yes [21:28:08] PROBLEM - IPMI Sensor Status on htmldumper1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:28:22] hmm [21:28:25] this fails for me: https://www.mediawiki.org/wiki/Help:VisualEditor/User_guide/cs?dtenable=1 [21:28:32] and the 'dtenable' query param shares code with the cookie [21:28:36] this is what I see https://usercontent.irccloud-cdn.com/file/mqM35eqf/image.png [21:28:47] yeah, fails for me too [21:29:10] https://meta.wikimedia.org/wiki/User:Majavah/global.js#L-49 this is what I use [21:30:15] please file a bug for me, we'll fix this today/early tomorrow [21:30:21] this should be a train blocker [21:31:20] MatmaRex: filed as T280082. Thanks! 
[21:31:20] T280082: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'mediawikiwiki.discussiontools_subscription' doesn't exist (10.64.16.7)Function: MediaWiki\Extension\DiscussionTools\SubscriptionStore::fetchSubscriptionsQuery: SELECT sub_user,sub_item,sub_namespace,sub_title,sub_section,sub_state,sub_created,sub_notified FROM `discussiontools_subscription` WHERE sub_user = 1967330 AND sub_item = 'h-' - https://phabricator.wikimedia.or [21:32:20] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2394.codfw.wmnet'] ` and were **ALL** successful. [21:33:43] (03PS2) 10Razzi: superset: use different user headers for staging and production [puppet] - 10https://gerrit.wikimedia.org/r/678966 (https://phabricator.wikimedia.org/T277729) [21:34:55] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29021/console" [puppet] - 10https://gerrit.wikimedia.org/r/678966 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [21:34:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2394.codfw.wmnet [21:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:33] !log mw2394 - rebooting [21:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10Jclark-ctr) a:03Cmjohnson snapshot1011 A1 U7 PORT19 ID 1852 snapshot1012 A3 U36 PORT30 ID1932 snapshot1013 B3 U24 PORT11 ID2614 snapshot1014 C5 U19... 
[21:37:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10Jclark-ctr) [21:40:38] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2395.codfw.wmnet'] ` and were **ALL** successful. [21:40:50] (03CR) 10Razzi: [V: 03+1 C: 03+2] "I figured it out - the CAS server sets the user under a different header. It would be cool if this was easily configurable to have CAS use" [puppet] - 10https://gerrit.wikimedia.org/r/678966 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [21:42:14] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2395.codfw.wmnet [21:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:00] !log mw2394, mw2395 - scap pull [21:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:10] PROBLEM - mediawiki-installation DSH group on wtp1033 is CRITICAL: Host wtp1033 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:46:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:48:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2394.codfw.wmnet,service=jobrunner [21:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2394.codfw.wmnet,cluster=jobrunner [21:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:02] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2394.codfw.wmnet,cluster=jobrunner 
[21:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:52] PROBLEM - mediawiki-installation DSH group on wtp1032 is CRITICAL: Host wtp1032 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:04:07] !log [urbanecm@mwmaint1002 /srv/mediawiki]$ foreachwikiindblist growthexperiments sql.php php-1.37.0-wmf.1/extensions/GrowthExperiments/maintenance/schemas/mysql/growthexperiments_mentor_mentee.sql # T278573 [22:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:16] T278573: Create growthexperiments_mentor_mentee database table on extension1 for wikis in growthexperiments.dblist - https://phabricator.wikimedia.org/T278573 [22:06:04] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2395.codfw.wmnet,cluster=jobrunner [22:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:14] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2395.codfw.wmnet,cluster=jobrunner [22:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:26] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Dzahn) mw2394 and mw2395 have been reimaged as jobrunners/videoscalers and then I pooled them into the jobrunner cluster b... [22:08:54] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Dzahn) @Legoktm Would you say this is resolved (for) now? 
[22:14:05] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10Dzahn) a:05Lena_WMDE→03fgiunchedi [22:15:06] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10Dzahn) [22:23:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:26:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:40:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Silvan Heintze - https://phabricator.wikimedia.org/T279764 (10Dzahn) [22:40:16] (03CR) 10Dzahn: [C: 03+2] admin: move 'sihe' into 'deployment' [puppet] - 10https://gerrit.wikimedia.org/r/678856 (https://phabricator.wikimedia.org/T279764) (owner: 10Filippo Giunchedi) [22:41:52] !log welcome new deployer Silvan Heintze (sihe) (T279764) [22:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:49] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Silvan Heintze - https://phabricator.wikimedia.org/T279764 (10Dzahn) @Silvan_WMDE Your access has been granted. Puppet just created your user on `deploy1002.eqiad.wmnet` and will create it on other hosts needed within the n... 
[22:43:32] PROBLEM - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4005: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:44:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Silvan Heintze - https://phabricator.wikimedia.org/T279764 (10Dzahn) 05Open→03Resolved a:03Dzahn See https://wikitech.wikimedia.org/wiki/Production_access#Setting_up_your_access for how to setup your SSH and let us kn... [22:45:42] kostajh: there seems to be an issue with linkrecommendation ^ [22:48:06] RECOVERY - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 193 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:48:11] ah :) [22:59:14] (03PS1) 10Ahmon Dancy: pipeline: Fix how vendor patches are applied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679008 (https://phabricator.wikimedia.org/T271274) [23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for [[Backport windows|Evening backport window]]
deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210413T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:23] Uh, I put patches [23:02:06] Zabe: here? [23:02:14] yes [23:03:27] (03CR) 10Legoktm: [C: 03+2] Unset $wmgUseWikimediaShopLink for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678628 (https://phabricator.wikimedia.org/T279877) (owner: 10Zabe) [23:03:32] (03PS3) 10Legoktm: Unset $wmgUseWikimediaShopLink for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678628 (https://phabricator.wikimedia.org/T279877) (owner: 10Zabe) [23:03:51] (03CR) 10Legoktm: [C: 03+2] Unset $wmgUseWikimediaShopLink for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678628 (https://phabricator.wikimedia.org/T279877) (owner: 10Zabe) [23:04:36] (03Merged) 10jenkins-bot: Unset $wmgUseWikimediaShopLink for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678628 (https://phabricator.wikimedia.org/T279877) (owner: 10Zabe) [23:04:48] (03PS3) 10Legoktm: ExtensionDistributor: Add REL1_36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678146 [23:04:53] (03CR) 10Legoktm: [C: 03+2] ExtensionDistributor: Add REL1_36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678146 (owner: 10Legoktm) [23:05:50] (03Merged) 10jenkins-bot: ExtensionDistributor: Add REL1_36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678146 (owner: 10Legoktm) [23:06:02] Zabe: it's live on mwdebug1002 for testing [23:06:45] (03CR) 10BryanDavis: Helm chart to run MediaWiki (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [23:09:00] legoktm: it works the way it's supposed to [23:10:32] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:678146|ExtensionDistributor: Add REL1_36]] (duration: 02m 03s) [23:10:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:43] ok, syncing it now [23:11:17] er wait [23:11:54] 2021-04-13 23:10:29,211 [ERROR] Error depooling the servers: disabled/up/not pooled [23:11:54] 2021-04-13 23:10:29,211 [ERROR] Error running command with poolcounter: ('Failed executing ServiceRunner.run, return code %d', 127) [23:13:07] mutante: all the new jobrunners/videoscalers are pooled right? [23:13:36] https://phabricator.wikimedia.org/P15312 [23:14:16] (03CR) 10Ahmon Dancy: [C: 03+2] pipeline: Fix how vendor patches are applied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679008 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [23:14:47] dancy: er, I'm backporting stuff right now [23:15:01] (03Merged) 10jenkins-bot: pipeline: Fix how vendor patches are applied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679008 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [23:15:11] Sorry about that legoktm. [23:15:30] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:678628|Unset $wmgUseWikimediaShopLink for ptwiki (T279877)]] (duration: 01m 06s) [23:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:39] T279877: Unset $wmgUseWikimediaShopLink for ptwiki - https://phabricator.wikimedia.org/T279877 [23:15:40] That commit only affects the MW container image build process so it won't affect you. [23:15:56] ack [23:16:15] mutante: ignore, the errors seem to have gone away now, and I checked that all the hosts appear to be pooled correctly [23:16:21] Zabe: you should be set now! 
[23:16:49] (03PS3) 10Legoktm: Broadcast IRC events to irc1001 instead of kraz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677806 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff) [23:17:15] Amir1: https://deploy-commands.toolforge.org/bacc/677806 isn't working :( " Sorry the id is ambigous :( " [23:18:08] legoktm: gerrit is stupid https://gerrit.wikimedia.org/r/q/677806 [23:18:13] (03CR) 10Legoktm: [C: 03+2] Broadcast IRC events to irc1001 instead of kraz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677806 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff) [23:18:41] let me see if I can get a better way to make it work [23:18:50] heh wow [23:19:00] that sha1 is 6778060ab6820ffaca8a98d364481f103a4841b6 [23:19:09] (03Merged) 10jenkins-bot: Broadcast IRC events to irc1001 instead of kraz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677806 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff) [23:19:46] legoktm: can you wait five minutes? I might be able to get it working [23:19:50] this is happening too much [23:20:00] yes, it takes me a few min to test this [23:20:01] legoktm: thx for your help :) [23:20:02] btw, why didn't you use "report issues" link [23:20:19] https://gerrit.wikimedia.org/r/q/change:677806 ? 
[23:20:42] why file a bug when I can poke you instantly :p [23:21:00] lower carbon footprint [23:21:26] btw, love bacc and will probably steal your code [23:21:55] dancy: oh I was planning to get the hash and search that but this also works [23:22:07] I've only used it twice and now I can't imagine typing out that command by hand anymore [23:23:56] tested locally, works [23:24:38] legoktm: ^^ the SAL is really annoying :D [23:24:47] https://gerrit.wikimedia.org/r/c/labs/tools/deploy-commands/+/679011 [23:25:54] deployed: https://deploy-commands.toolforge.org/bacc/677806 [23:26:14] ty [23:27:26] !log legoktm@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:677806|Broadcast IRC events to irc1001 instead of kraz (T224579)]] (duration: 01m 06s) [23:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:35] T224579: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 [23:27:50] my favorite thing about gerrit API: it adds crap at the first line of every api response for "security" [23:27:51] http://gerrit.wikimedia.org/r/changes/677806 [23:28:10] it's a different way of protecting against some type of attack, but yeah [23:28:31] it looks really hacky [23:28:47] https://gerrit.wikimedia.org/r/plugins/gitiles/integration/utils/+/refs/heads/master/wikimediaci_utils/__init__.py#76 [23:29:39] hm [23:29:48] that change had two bug #s in it [23:29:57] but it only referenced one [23:30:54] :( probably the regex is weird [23:30:56] let me check [23:31:04] 10SRE, 10Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Legoktm) [23:31:10] 10SRE, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review: Set up spare irc1001.wikimedia.org in eqiad - https://phabricator.wikimedia.org/T278255 (10Legoktm) 05Open→03Resolved a:03MoritzMuehlenhoff `lang=irc <+logmsgbot> !log legoktm@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: 
[[gerr... [23:33:44] (03PS1) 10Ahmon Dancy: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [23:37:44] legoktm: it should be fixed now [23:37:58] pushing [23:45:25] 10SRE, 10Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Legoktm) Current status: irc2001 is irc.wm.o, and irc1001 is receiving events from MediaWiki and is a hot spare that can be failed over to by adjusting the irc.wm.o CNAME (on a 5min TTL). >>! I... [23:47:32] Amir1: actually, can you move the phab tasks out of the [[gerrit:...]] link? then on https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2021-04-13 the phab tasks will be linked [23:49:24] 10SRE, 10WMF-Annual-Report, 10serviceops: Update annual.wikimedia.org redirect to point to 2020 Annual Report - https://phabricator.wikimedia.org/T279571 (10Dzahn) 05Open→03Resolved @spatton I'll claim this is resolved. Cheers [23:49:33] 10SRE, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review, 10User-notice: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Legoktm) kraz is ready for decom now \o/ [23:53:37] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Legoktm) I think we're missing 2 codfw servers that are *only* videoscalers?
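Note on the Gerrit API remark at [23:27:50]: the "crap at the first line" is the `)]}'` prefix that Gerrit's REST API puts in front of JSON responses to guard against cross-site script inclusion. A client has to drop it before decoding. The following is only a minimal sketch of that step, not the code used by deploy-commands or wikimediaci_utils; the function name `parse_gerrit_json` and the sample body are made up for illustration.

```python
import json

# Prefix Gerrit prepends to JSON responses (anti-XSSI guard).
XSSI_PREFIX = ")]}'"


def parse_gerrit_json(body: str):
    """Strip the anti-XSSI prefix if present, then decode the JSON."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    # json.loads tolerates the newline left over after the prefix.
    return json.loads(body)


# Hypothetical response body, not real data fetched from Gerrit.
sample = ")]}'\n" + json.dumps({"_number": 677806, "status": "MERGED"})
change = parse_gerrit_json(sample)
print(change["_number"])
```

Checking for the prefix rather than blindly skipping the first line keeps the helper usable on bodies that do not carry it.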