[00:00:58] (PS1) Bstorm: nfs monitoring: fix the broken paths for the directory size monitor [puppet] - https://gerrit.wikimedia.org/r/605705
[00:03:15] (PS2) Bstorm: nfs monitoring: fix the broken paths for the directory size monitor [puppet] - https://gerrit.wikimedia.org/r/605705
[00:05:37] (CR) Bstorm: [C: +1] toolforge: relocate nginx-ingress config from kubeadm [puppet] - https://gerrit.wikimedia.org/r/604649 (https://phabricator.wikimedia.org/T195217) (owner: Arturo Borrero Gonzalez)
[00:14:27] (CR) Bstorm: [C: +1] "> Patch Set 1:" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) (owner: BryanDavis)
[00:15:04] Operations, MediaWiki-General, Patch-For-Review, Sustainability (Incident Prevention): Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378 (tstarling) p:High→Low Since deployment 15 minutes ago, we have ha...
[00:16:12] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@17212bb]: airflow: migrate leven-dist to edit-dist
[00:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:20] !log volker-e@deploy1001 Started deploy [design/style-guide@37c67dd]: Deploy design/style-guide:
[00:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:25] !log volker-e@deploy1001 Finished deploy [design/style-guide@37c67dd]: Deploy design/style-guide: (duration: 00m 04s)
[00:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:57] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@17212bb]: airflow: migrate leven-dist to edit-dist (duration: 00m 45s)
[00:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:27] (CR) Tim Starling: [C: +2] Set a maximum HTTP client timeout [mediawiki-config] - https://gerrit.wikimedia.org/r/605440 (https://phabricator.wikimedia.org/T245170) (owner: Tim Starling)
[00:19:17] (Merged) jenkins-bot: Set a maximum HTTP client timeout [mediawiki-config] - https://gerrit.wikimedia.org/r/605440 (https://phabricator.wikimedia.org/T245170) (owner: Tim Starling)
[00:20:23] (Abandoned) Mstyles: Auth for SeFC [puppet] - https://gerrit.wikimedia.org/r/599399 (https://phabricator.wikimedia.org/T251500) (owner: Mstyles)
[00:25:37] !log tstarling@deploy1001 Synchronized wmf-config/set-time-limit.php: expose excimer timeout as a global variable T245170 (duration: 00m 56s)
[00:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:42] T245170: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170
[00:28:35] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: limit HTTP client timeout T245170 (duration: 00m 56s)
[00:28:38] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:28] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 60424784 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:18] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3056 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:42] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[01:29:09] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (RobH)
[01:29:44] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (RobH)
[01:30:09] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (RobH)
[01:31:03] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (RobH) @elukey: This needs to have the hostname and racking info filled out by your team (if they should be in differing rows than one another, etc). Usually this...
[01:32:49] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (RobH)
[01:33:07] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (RobH)
[01:33:55] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (RobH) @elukey, We need to know the hostname and racking details for these 3 new hadoop testing nodes. Please provide this info and reassign this task from you to @jcl...
[01:39:09] (CR) Bmansurov: "@Alexandros, can you help me out here. I'm unable to get helm running locally. Here's what I get:" [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[02:05:37] (PS1) TrainBranchBot: Branch commit for wmf/1.35.0-wmf.37 [core] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/605718
[02:10:27] (PS2) DannyS712: Branch commit for wmf/1.35.0-wmf.37 [core] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/605718 (https://phabricator.wikimedia.org/T254174) (owner: TrainBranchBot)
[02:20:19] @mmodell would it be possible for TrainBranchBot to automatically tag the blocker task in its commit message?
[02:22:00] I would presume it is
[02:22:13] Probably just needs a mapping/documentation file of it in the repo
[02:22:33] Considering the tasks are created in batches, doesn't seem difficult to add that to the release tools repo at the same time
[02:25:56] Unless querying phab for a task with a specific title was accurate enough...
[02:32:28] https://gerrit.wikimedia.org/g/labs/tools/train-blockers/+/refs/heads/master might include some helpful code - https://train-blockers.toolforge.org/ automatically redirects to the current blocker
[02:37:29] It used to be human-operated and so could provide T-id as cli param
[02:37:43] but now that it is unattended, a bit harder to do reliably
[02:44:47] (PS33) Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - https://gerrit.wikimedia.org/r/605315
[02:51:11] (PS34) Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - https://gerrit.wikimedia.org/r/605315
[02:58:10] (PS1) Andrew Bogott: Dummy passwords for galera monitoring [labs/private] - https://gerrit.wikimedia.org/r/605723
[02:58:52] (PS35) Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - https://gerrit.wikimedia.org/r/605315
[02:58:59] (CR) Andrew Bogott: [V: +2 C: +2] Dummy passwords for galera monitoring [labs/private] - https://gerrit.wikimedia.org/r/605723 (owner: Andrew Bogott)
[03:05:34] (PS36) Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - https://gerrit.wikimedia.org/r/605315
[03:20:13] (PS37) Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - https://gerrit.wikimedia.org/r/605315
[03:28:32] Operations, Wikimedia-Mailing-lists: teampractices mailing list should have active admins - https://phabricator.wikimedia.org/T255525 (Aklapper)
[03:33:13] Operations, Wikimedia-Mailing-lists: teampractices mailing list should have active admins - https://phabricator.wikimedia.org/T255525 (Aklapper)
[03:42:54] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[03:50:10] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[04:14:37] Operations, Wikimedia-Mailing-lists: teampractices mailing list should have active admins - https://phabricator.wikimedia.org/T255525 (greg) Pinging @Awjrichards as the closest I can think of for someone who might care. Or @MBinder_WMF maybe?
[04:17:28] Operations: HTML Dumps 429 error on RESTBase endpoints - https://phabricator.wikimedia.org/T255524 (Aklapper)
[04:18:59] Operations, Traffic: HTML Dumps 429 error on RESTBase endpoints - https://phabricator.wikimedia.org/T255524 (CDanis)
[04:37:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1138', diff saved to https://phabricator.wikimedia.org/P11511 and previous config saved to /var/cache/conftool/dbconfig/20200616-043748-marostegui.json
[04:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:59] !log Deploy schema change on db1138
[04:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:34] (PS1) Marostegui: mariadb: Reimage dbstore1004 to Buster [puppet] - https://gerrit.wikimedia.org/r/605731 (https://phabricator.wikimedia.org/T254870)
[04:40:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1138', diff saved to https://phabricator.wikimedia.org/P11512 and previous config saved to /var/cache/conftool/dbconfig/20200616-044036-marostegui.json
[04:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P11513 and previous config saved to /var/cache/conftool/dbconfig/20200616-044126-marostegui.json
[04:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:10] (CR) Marostegui: [C: +2] mariadb: Reimage dbstore1004 to Buster [puppet] - https://gerrit.wikimedia.org/r/605731 (https://phabricator.wikimedia.org/T254870) (owner: Marostegui)
[04:43:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1143', diff saved to https://phabricator.wikimedia.org/P11514 and previous config saved to /var/cache/conftool/dbconfig/20200616-044326-marostegui.json
[04:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1149', diff saved to https://phabricator.wikimedia.org/P11515 and previous config saved to /var/cache/conftool/dbconfig/20200616-044409-marostegui.json
[04:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1149', diff saved to https://phabricator.wikimedia.org/P11516 and previous config saved to /var/cache/conftool/dbconfig/20200616-044612-marostegui.json
[04:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:49] (PS1) Marostegui: db2092: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/605732 (https://phabricator.wikimedia.org/T254462)
[04:53:50] (PS1) Marostegui: report_users: Fix typo on dbproxy1020 IP [software] - https://gerrit.wikimedia.org/r/605733
[04:54:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1147', diff saved to https://phabricator.wikimedia.org/P11517 and previous config saved to /var/cache/conftool/dbconfig/20200616-045451-marostegui.json
[04:54:54] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:26] (CR) Marostegui: [C: +2] report_users: Fix typo on dbproxy1020 IP [software] - https://gerrit.wikimedia.org/r/605733 (owner: Marostegui)
[04:55:33] !log Deploy schema change on db1147
[04:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1147', diff saved to https://phabricator.wikimedia.org/P11518 and previous config saved to /var/cache/conftool/dbconfig/20200616-045636-marostegui.json
[04:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:12] (CR) Marostegui: [C: +2] db2092: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/605732 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui)
[04:57:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3314', diff saved to https://phabricator.wikimedia.org/P11519 and previous config saved to /var/cache/conftool/dbconfig/20200616-045744-marostegui.json
[04:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P11520 and previous config saved to /var/cache/conftool/dbconfig/20200616-045958-marostegui.json
[05:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[05:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:48] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:37:12] (CR) WMDE-leszek: "What
was the issue? Did that config change make mediainfo/items not available?" [mediawiki-config] - https://gerrit.wikimedia.org/r/605643 (owner: Addshore)
[05:41:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:54:13] !log volker-e@deploy1001 Started deploy [design/style-guide@37c67dd]: Deploy design/style-guide:
[05:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:17] !log volker-e@deploy1001 Finished deploy [design/style-guide@37c67dd]: Deploy design/style-guide: (duration: 00m 05s)
[05:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:19] !log Restarted Zuul scheduler and merger on contint2001 a couple hotfixes # T252310 T255424
[06:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:24] T252310: pywikibot get merge rejections due to zuul-merger not being able to update tags - https://phabricator.wikimedia.org/T252310
[06:04:24] T255424: Zuul deployment fails due to unsupported wheel - https://phabricator.wikimedia.org/T255424
[06:11:50] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/23253/" [puppet] - https://gerrit.wikimedia.org/r/605617 (https://phabricator.wikimedia.org/T252391) (owner: Elukey)
[06:12:04] (CR) Elukey: [C: +2] role::mediawiki::memcached::gutter: change slab distribution [puppet] - https://gerrit.wikimedia.org/r/605617 (https://phabricator.wikimedia.org/T252391) (owner: Elukey)
[06:15:09] (PS1) Elukey: profile::idp::memcached: use default settings for memcached [puppet] - https://gerrit.wikimedia.org/r/605735
[06:25:10] !log roll restart memcached on mc-gp* (gutter pools) to pick up new slab size distribution setting - T252391
[06:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:14] T252391: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391
[06:34:40] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) The next steps for this task should be: 1) Remove the nutcracker shards in https://gerrit.wikimedia.org/r/595810 (the change should re-hash their (mc1...
[06:39:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:40:49] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[06:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:44:38] (CR) Ayounsi: [C: +1] "I'm far from expert in that but it looks logical to me."
[software/homer/deploy] - https://gerrit.wikimedia.org/r/605623 (https://phabricator.wikimedia.org/T245114) (owner: Volans)
[06:50:31] (PS1) Marostegui: dbstore1004: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/605809 (https://phabricator.wikimedia.org/T254870)
[06:51:13] (CR) Marostegui: [C: +2] dbstore1004: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/605809 (https://phabricator.wikimedia.org/T254870) (owner: Marostegui)
[06:54:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1093', diff saved to https://phabricator.wikimedia.org/P11521 and previous config saved to /var/cache/conftool/dbconfig/20200616-065412-marostegui.json
[06:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[06:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:01] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1134 for InnoDB compression T254462', diff saved to https://phabricator.wikimedia.org/P11522 and previous config saved to /var/cache/conftool/dbconfig/20200616-065600-marostegui.json
[06:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:05] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462
[06:57:00] !log Compress InnoDB on db1134 T254462
[06:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:23] (PS1) Marostegui: db1134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/605811 (https://phabricator.wikimedia.org/T254462)
[06:58:55] (CR) Marostegui: [C: +2] db1134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/605811 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui)
[07:00:30] (PS1) Marostegui: install_server: Do not reimage dbstore1004 [puppet] -
https://gerrit.wikimedia.org/r/605812
[07:02:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084', diff saved to https://phabricator.wikimedia.org/P11523 and previous config saved to /var/cache/conftool/dbconfig/20200616-070209-marostegui.json
[07:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:50] (CR) Lars Wirzenius: [C: +2] Branch commit for wmf/1.35.0-wmf.37 [core] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/605718 (https://phabricator.wikimedia.org/T254174) (owner: TrainBranchBot)
[07:04:17] (CR) Lars Wirzenius: [V: +2 C: +2] Branch commit for wmf/1.35.0-wmf.37 [core] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/605718 (https://phabricator.wikimedia.org/T254174) (owner: TrainBranchBot)
[07:04:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1084', diff saved to https://phabricator.wikimedia.org/P11524 and previous config saved to /var/cache/conftool/dbconfig/20200616-070429-marostegui.json
[07:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1148', diff saved to https://phabricator.wikimedia.org/P11525 and previous config saved to /var/cache/conftool/dbconfig/20200616-070450-marostegui.json
[07:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1148', diff saved to https://phabricator.wikimedia.org/P11526 and previous config saved to /var/cache/conftool/dbconfig/20200616-070651-marostegui.json
[07:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:51] !log 1.35.0-wmf.37 was branched at f856960f17b2a477640c5d848926c04f0d56196c for T254174
[07:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:54] T254174: 1.35.0-wmf.37 deployment blockers -
https://phabricator.wikimedia.org/T254174
[07:08:53] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[07:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:15] (CR) Marostegui: [C: +2] install_server: Do not reimage dbstore1004 [puppet] - https://gerrit.wikimedia.org/r/605812 (owner: Marostegui)
[07:14:19] (PS1) Marostegui: filtered_tables.txt: Add mvi_priority column [puppet] - https://gerrit.wikimedia.org/r/605814 (https://phabricator.wikimedia.org/T255003)
[07:14:52] (CR) DannyS712: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/605814 (https://phabricator.wikimedia.org/T255003) (owner: Marostegui)
[07:16:16] (CR) Marostegui: "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/605814 (https://phabricator.wikimedia.org/T255003) (owner: Marostegui)
[07:16:18] (CR) Marostegui: [C: +2] filtered_tables.txt: Add mvi_priority column [puppet] - https://gerrit.wikimedia.org/r/605814 (https://phabricator.wikimedia.org/T255003) (owner: Marostegui)
[07:16:30] (PS1) Muehlenhoff: Update MOU date for piccardi [puppet] - https://gerrit.wikimedia.org/r/605815
[07:18:58] (CR) Muehlenhoff: [C: +1] "Agreed, let's use the defaults, we can still fine-tune when we have used memcached 1.6 in production for a while." [puppet] - https://gerrit.wikimedia.org/r/605735 (owner: Elukey)
[07:23:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[07:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:27] !log liw@deploy1001 Pruned MediaWiki: 1.35.0-wmf.34 (duration: 11m 52s)
[07:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:01] (CR) Muehlenhoff: [C: +1] "LGTM. BTW, I was curious what is actually missing in Debian packaged Python deps and there's just two; ntc-templates and pynetbox.
Everyth" [software/homer/deploy] - https://gerrit.wikimedia.org/r/605623 (https://phabricator.wikimedia.org/T245114) (owner: Volans)
[07:37:07] !log liw@deploy1001 Pruned MediaWiki: 1.35.0-wmf.35 (duration: 01m 47s)
[07:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:37] (PS1) Lars Wirzenius: testwikis wikis to 1.35.0-wmf.37 [mediawiki-config] - https://gerrit.wikimedia.org/r/605821
[07:38:39] (CR) Lars Wirzenius: [C: +2] testwikis wikis to 1.35.0-wmf.37 [mediawiki-config] - https://gerrit.wikimedia.org/r/605821 (owner: Lars Wirzenius)
[07:39:27] (Merged) jenkins-bot: testwikis wikis to 1.35.0-wmf.37 [mediawiki-config] - https://gerrit.wikimedia.org/r/605821 (owner: Lars Wirzenius)
[07:40:18] !log liw@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.37
[07:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:20] Operations, Patch-For-Review: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (Marostegui) Would it be possible to save `/home` directories somewhere so they are available once the host is back? It is not a lot of data to save: ` root@cumin1001:/home# du -shc . 4.9G . 4.9G total...
[07:48:36] marostegui: we could always save /home in mysql somewhere ;)
[07:48:55] in fact, we could write a fuse fs that uses mysql as a backingstore!
[07:49:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[07:49:16] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:32] kormat: That'd be nice! Do you estimate that being as easy as setting the prometheus alert?
[07:49:51] probably even easier! :P
[07:49:57] hahaha
[07:54:28] didn't lennart already do that? systemd-homed and all ?
[07:54:39] oh wait, mysql
[07:54:49] hmmmm I sense a systemd feature coming :PO
[07:54:51] * akosiaris joking
[07:56:34] (CR) Volans: [V: +2 C: +2] "> Patch Set 2: Code-Review+1" [software/homer/deploy] - https://gerrit.wikimedia.org/r/605623 (https://phabricator.wikimedia.org/T245114) (owner: Volans)
[07:58:24] !log volans@deploy1001 Started deploy [homer/deploy@85e92b8]: Release v0.2.3 on cumin2001 now on buster (take 2)
[07:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:45] (PS1) Marostegui: dbproxy2001: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/605826 (https://phabricator.wikimedia.org/T255408)
[07:59:22] !log volans@deploy1001 Finished deploy [homer/deploy@85e92b8]: Release v0.2.3 on cumin2001 now on buster (take 2) (duration: 00m 57s)
[07:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:52] (CR) Marostegui: [C: +2] dbproxy2001: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/605826 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[08:05:19] (PS1) Volans: Name expansion doesn't work, make it explicit [software/homer/deploy] - https://gerrit.wikimedia.org/r/605827 (https://phabricator.wikimedia.org/T245114)
[08:06:09] (CR) Muehlenhoff: [C: +1] "LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/605827 (https://phabricator.wikimedia.org/T245114) (owner: Volans)
[08:07:25] (CR) Volans: [V: +2 C: +2] "thx!"
[software/homer/deploy] - https://gerrit.wikimedia.org/r/605827 (https://phabricator.wikimedia.org/T245114) (owner: Volans)
[08:07:54] !log volans@deploy1001 Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin2001 now on buster (take 3)
[08:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:01] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:08:01] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:05] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:08:06] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:10] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:08:10] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:31] !log volans@deploy1001 Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin2001 now on buster (take 3) (duration: 01m 37s)
[08:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:38] !log volans@deploy1001 Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin2001 now on buster (take 3bis)
[08:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:50] !log volans@deploy1001 Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin2001 now on buster (take 3bis) (duration: 00m 12s)
[08:09:52] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:24] (PS1) Volans: homer: adapt plugin path to new installation [puppet] - https://gerrit.wikimedia.org/r/605828 (https://phabricator.wikimedia.org/T245114)
[08:15:15] (CR) Volans: [C: +2] homer: adapt plugin path to new installation [puppet] - https://gerrit.wikimedia.org/r/605828 (https://phabricator.wikimedia.org/T245114) (owner: Volans)
[08:18:46] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:18:46] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:47] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:18:47] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:48] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:48] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:49] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:18:49] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:50] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:18:50] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:51] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime
[08:18:51] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:29] (PS1) Ayounsi: Fix bug due to oversight in new ipversion detection [homer/public] - https://gerrit.wikimedia.org/r/605834
[08:23:29] (CR) Volans: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/605834 (owner: Ayounsi)
[08:23:47] (CR) Ayounsi: [C: +2] Fix bug due to oversight in new ipversion detection [homer/public] - https://gerrit.wikimedia.org/r/605834 (owner: Ayounsi)
[08:33:50] (CR) Volans: "For context, I had a chat with John about this yesterday due to potentially looking at the wrong data. I've marked this as WIP as I know h" [puppet] - https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[08:37:59] (CR) Jbond: "comments inline" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) (owner: CRusnov)
[08:39:23] !log liw@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.37 (duration: 59m 05s)
[08:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:41] (CR) Jbond: [C: +2] rsync: move oneline script inline [puppet] - https://gerrit.wikimedia.org/r/605275 (https://phabricator.wikimedia.org/T254480) (owner: Jbond)
[08:41:10] (PS1) Ayounsi: Depool eqiad for routers upgrade [dns] - https://gerrit.wikimedia.org/r/605838 (https://phabricator.wikimedia.org/T243080)
[08:42:04] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[08:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:38] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:05] (03PS1) 10Alexandros Kosiaris: Add IPv6 address to all kubernetes nodes [dns] - 10https://gerrit.wikimedia.org/r/605841 (https://phabricator.wikimedia.org/T241850) [08:46:28] (03CR) 10jerkins-bot: [V: 04-1] Add IPv6 address to all kubernetes nodes [dns] - 10https://gerrit.wikimedia.org/r/605841 (https://phabricator.wikimedia.org/T241850) (owner: 10Alexandros Kosiaris) [08:46:45] (03CR) 10Jbond: [C: 03+1] profile::idp::memcached: use default settings for memcached [puppet] - 10https://gerrit.wikimedia.org/r/605735 (owner: 10Elukey) [08:47:12] (03CR) 10Jbond: [C: 03+1] Update MOU date for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/605815 (owner: 10Muehlenhoff) [08:47:20] (03PS1) 10Filippo Giunchedi: pontoon: include httpd and passenger [puppet] (sandbox/filippo/pontoon) - 10https://gerrit.wikimedia.org/r/605842 [08:47:22] (03PS1) 10Filippo Giunchedi: pontoon: sync / rebase git from gerrit [puppet] (sandbox/filippo/pontoon) - 10https://gerrit.wikimedia.org/r/605843 [08:48:13] !log Upgrade db2132 [08:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:16] (03CR) 10Jbond: [C: 03+2] Example: build script in line in puppet [puppet] - 10https://gerrit.wikimedia.org/r/602771 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [08:48:36] (03PS2) 10Alexandros Kosiaris: Add IPv6 address to all kubernetes nodes [dns] - 10https://gerrit.wikimedia.org/r/605841 (https://phabricator.wikimedia.org/T241850) [08:48:55] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10elukey) [08:49:54] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: include httpd and passenger [puppet] (sandbox/filippo/pontoon) - 
10https://gerrit.wikimedia.org/r/605842 (owner: 10Filippo Giunchedi) [08:50:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add IPv6 address to all kubernetes nodes [dns] - 10https://gerrit.wikimedia.org/r/605841 (https://phabricator.wikimedia.org/T241850) (owner: 10Alexandros Kosiaris) [08:50:16] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: sync / rebase git from gerrit [puppet] (sandbox/filippo/pontoon) - 10https://gerrit.wikimedia.org/r/605843 (owner: 10Filippo Giunchedi) [08:50:44] kormat: should be better now re: pontoon missing httpd [08:50:52] (03PS1) 10Muehlenhoff: reimage banner for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/605844 (https://phabricator.wikimedia.org/T245114) [08:51:22] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: optional read affinity proxy setting [puppet] - 10https://gerrit.wikimedia.org/r/605591 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:53:53] !log roll restart prometheus eqiad ops to enable thanos upload [08:53:54] (03CR) 10Elukey: [C: 03+2] profile::idp::memcached: use default settings for memcached [puppet] - 10https://gerrit.wikimedia.org/r/605735 (owner: 10Elukey) [08:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:02] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable thanos upload in ops eqiad [puppet] - 10https://gerrit.wikimedia.org/r/605178 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:54:09] (03PS1) 10Marostegui: dbproxy2001: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/605845 (https://phabricator.wikimedia.org/T255408) [08:54:23] wow 3 changes to merge :D [08:54:39] godog, jbond42 who goes ? :D [08:54:43] haha! happy to go [08:54:43] XD [08:54:48] please do! [08:55:06] (03CR) 10Volans: "Nice! Some minor nits inline and a couple of questions/ideas." 
(038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [08:55:15] {{done}} [08:55:50] ohs sorry [08:55:54] (03CR) 10Marostegui: [C: 03+2] dbproxy2001: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/605845 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [08:56:05] and thanks [08:56:07] np! [08:56:24] prometheus eqiad ops will be bouncing, no impact expected [08:57:20] (03CR) 10Volans: [C: 04-1] "Single char typo, looks good otherwise :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605844 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [08:58:14] 'no impact' as in, 'expected impact' with a small gap in metrics upon restart [09:00:10] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: enable read affinity for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/605592 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [09:01:04] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus [09:01:04] logging-eqiad&var-topic=All&var-consumer_group=All [09:02:15] (03PS1) 10Marostegui: install_server: Reimage dbproxy1014 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/605847 (https://phabricator.wikimedia.org/T255408) [09:02:52] mhh that's lag from last night's logspam from fpm, should be clearing up now [09:03:26] (03CR) 10Muehlenhoff: reimage banner for cumin1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605844 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [09:04:04] PROBLEM - Too many messages 
in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus [09:04:04] logging-eqiad&var-topic=All&var-consumer_group=All [09:05:03] (03PS2) 10Muehlenhoff: reimage banner for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/605844 (https://phabricator.wikimedia.org/T245114) [09:05:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={jmx_puppetdb,swagger_check_restbase_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:06:18] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605844 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [09:06:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:07:28] (03PS1) 10Alexandros Kosiaris: Add kubernetes[12]007-kubernetes[12]014 to BGP [homer/public] - 10https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) [09:07:48] (03CR) 10jerkins-bot: [V: 04-1] Add kubernetes[12]007-kubernetes[12]014 to BGP [homer/public] - 10https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) (owner: 10Alexandros Kosiaris) [09:09:07] XioNoX: this seems like a nice occasion to test homer on buster from cumin2001 ^^^ [09:09:59] probably pointing at a bot that I /ignore [09:10:03] :) [09:10:29] yes, a patch from akosiaris for k8s hosts in homer/public [09:10:36] but yeah looking at https://gerrit.wikimedia.org/r/c/operations/homer/public/+/605848 [09:11:03] * akosiaris a bot? [09:11:21] (03PS3) 10Jbond: cookbook sre.pdus: add reboot script [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) [09:11:34] (03PS3) 10Muehlenhoff: reimage banner for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/605844 (https://phabricator.wikimedia.org/T245114) [09:11:48] jerkins-bot? it's the most useful bot :) [09:12:14] let's find out now where I messed up in that commit [09:12:20] akosiaris: spaces [09:12:28] akosiaris: I /ignore the bot that sends task updates to that channel [09:12:32] apparently didn't like that you aligned them [09:12:43] * akosiaris this close to passing #noqa or something [09:12:44] yeah, CI doesn't like your nice indentation [09:12:55] let's see if we can fix it [09:12:57] other than that, it LGTM [09:14:02] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-swift,name=eqiad [09:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:55] # yamllint disable-line rule:commas [09:15:06] volans: should I just go this way instead ^ ?
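A note on the `# yamllint disable-line rule:commas` comment quoted just above: yamllint's documentation describes two scopes for such comments, a `disable-line` form that covers a single line and a `disable` / `enable` pair that covers a region. The sketch below is a toy model of that scoping written for illustration only. It is not yamllint's code, `exempt_lines` is a hypothetical helper, the sample hosts and addresses are made up (documentation ranges), and only the `rule:NAME` forms placed as described are handled.

```python
import re

# Directive comments of the form "# yamllint <kind> rule:<name>".
DIRECTIVE = re.compile(r"#\s*yamllint\s+(disable-line|disable|enable)\s+rule:([\w-]+)")


def exempt_lines(text, rule):
    """Return the 1-based numbers of content lines on which `rule` is skipped.

    Simplified model (an assumption, not yamllint internals): block
    directives are only honoured when they sit on a comment-only line.
    """
    exempt = set()
    in_block = False  # between "disable" and "enable"
    pending = False   # a stand-alone "disable-line" covers the next line
    for number, line in enumerate(text.splitlines(), start=1):
        match = DIRECTIVE.search(line)
        kind = match.group(1) if match and match.group(2) == rule else None
        comment_only = line.lstrip().startswith("#")
        if kind and comment_only:
            # A directive on its own line only changes state for what follows.
            if kind == "disable":
                in_block = True
            elif kind == "enable":
                in_block = False
            else:  # disable-line
                pending = True
            continue
        # A trailing "disable-line" comment covers the line it is written on.
        if in_block or pending or kind == "disable-line":
            exempt.add(number)
        pending = False
    return exempt


# Hypothetical aligned flow mappings; the extra spaces after the commas are
# the kind of alignment the "commas" rule objects to.
SAMPLE = """\
hosts:
  # yamllint disable rule:commas
  node-a: {ip4: 192.0.2.1,   ip6: '2001:db8::1'}
  node-b: {ip4: 192.0.2.22,  ip6: '2001:db8::22'}
  # yamllint enable rule:commas
  node-c: {ip4: 192.0.2.3, ip6: '2001:db8::3'}
"""

if __name__ == "__main__":
    print(sorted(exempt_lines(SAMPLE, "commas")))
```

With the region form, one pair of comments covers every aligned line in the mapping, so the directive does not need repeating per line.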
[09:15:15] the rule probably makes sense in most cases [09:15:22] +1 for me [09:15:32] * akosiaris trying [09:15:34] but do you have to add it to all lines? [09:15:50] hahaha [09:16:01] ah no [09:16:01] seems like I can add it at the top of the mapping [09:16:08] you can disable first and re-enable after [09:16:17] even better [09:16:17] # yamllint enable rule:commas [09:16:22] https://yamllint.readthedocs.io/en/stable/disable_with_comments.html?highlight=disable#disabling-checks-for-all-or-part-of-the-file [09:16:38] I'm on a train, almost at my destination, will monitor it on my phone and have laptop if needed [09:16:43] (03PS4) 10Muehlenhoff: reimage banner for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/605844 (https://phabricator.wikimedia.org/T245114) [09:17:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605844 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [09:18:02] (03PS2) 10Alexandros Kosiaris: Add kubernetes[12]007-kubernetes[12]014 to BGP [homer/public] - 10https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) [09:18:39] +2ed by jenkins-bot so \o/ [09:19:21] soon we'll get this data from netbox too [09:20:29] I'm reviewing it [09:20:47] (03PS4) 10Jbond: cookbook sre.pdus: add reboot script [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) [09:21:01] (03CR) 10Strainu: [C: 03+1] Add extended-confirmed group and restriction level for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605652 (https://phabricator.wikimedia.org/T254471) (owner: 10Ammarpad) [09:21:08] (03CR) 10Jbond: "updated thanks" (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:21:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_analytics_http_cluster_codfw site=codfw
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:22:20] (03CR) 10Volans: [C: 04-1] "Some issues with the IPv6, looks good otherwise." (036 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) (owner: 10Alexandros Kosiaris) [09:22:55] (03CR) 10Muehlenhoff: [C: 03+2] reimage banner for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/605844 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [09:23:18] 10Operations, 10Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10ema) I found out more. This morning the list of affected hosts is shorter than yesterday: ` cp2027.codfw.wmnet,cp5008.eqsin.wmnet,cp3050.esams.wmnet ` My theory is t... [09:24:16] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.pdus: add reboot script [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [09:24:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:24:46] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage dbproxy1014 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/605847 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [09:28:39] volans: yes we need to discuss that :) [09:29:44] XioNoX: what? 
[09:29:52] I might /ignore you :-P [09:30:07] ah getting the data from Netbox I guess [09:30:30] (03PS1) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [09:30:34] yep [09:31:03] (03CR) 10jerkins-bot: [V: 04-1] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [09:31:31] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605570 (owner: 10Muehlenhoff) [09:37:07] (03PS2) 10Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [09:43:06] (03PS1) 10Muehlenhoff: Fix MOTD [puppet] - 10https://gerrit.wikimedia.org/r/605852 [09:43:25] godog: would it be possible to rebase pontoon on production, too? [09:44:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: relocate nginx-ingress config from kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/604649 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [09:46:10] (03CR) 10Privacybatm: "How does it look? 
(It's in WIP)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [09:46:21] (03PS2) 10Muehlenhoff: Fix MOTD [puppet] - 10https://gerrit.wikimedia.org/r/605852 [09:47:22] (03CR) 10Volans: [C: 03+1] "LGMT" [puppet] - 10https://gerrit.wikimedia.org/r/605852 (owner: 10Muehlenhoff) [09:50:18] (03CR) 10Volans: [C: 03+2] scripts: complete interface automation generation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601877 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [09:50:40] kormat: for sure, the branch won't git pull cleanly anymore after that but if you are ok with it I'll rebase [09:51:03] +1 [09:51:27] (03CR) 10Muehlenhoff: [C: 03+2] Fix MOTD [puppet] - 10https://gerrit.wikimedia.org/r/605852 (owner: 10Muehlenhoff) [09:51:47] (03PS1) 10Volans: scripts: add esams to the mgmt migrated list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/605853 (https://phabricator.wikimedia.org/T233183) [09:51:49] godog: also, as if to emphaise it's a bad idea to tell someone to use what you've done, i'm running into this on my slave: https://phabricator.wikimedia.org/P11530 [09:52:00] er. s/slave/client/ [09:52:21] (03CR) 10Volans: [C: 03+2] scripts: add esams to the mgmt migrated list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/605853 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [09:53:22] kormat: mhh haven't run into that IIRC [09:54:00] !log restarting netbox to pickup modified customscripts [09:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:20] kormat: also yes as you know by now Pontoon is quite the experiment yet, but the standalone puppetmaster is supported/working https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster [09:54:41] the main difference being how hiera works / is used [09:54:43] ah. /var/lib/puppet/client/ssl/ doesn't exist. sigh [09:55:00] godog: ack. 
that's the main reason i'd like to use pontoon if i can. hiera is too magic [09:55:27] yeah I hear you, that's been my motivation as well [09:56:30] kormat: did you use the 'enroll' script ? that takes care of symlinking (!) that directory [09:57:02] godog: this is the first i've heard of 'enroll' :) [09:58:30] oh indeed, I've used the script in the demo but not in the instructions, fixing [09:58:33] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:33] (03PS1) 10Awight: [beta] Update survey with real questions and answers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605854 (https://phabricator.wikimedia.org/T253112) [09:59:58] (03CR) 10Awight: [C: 03+2] "Beta-only deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605854 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [10:00:48] (03Merged) 10jenkins-bot: [beta] Update survey with real questions and answers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605854 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [10:01:27] PROBLEM - Check the last execution of netbox_ganeti_eqsin_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:01:55] that was me with the reboot, bad timing, sorry for the spam [10:02:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:41] (03PS5) 10Jbond: cookbook sre.pdus: add reboot script [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) [10:03:50] s/reboot/uwsgi restart/ [10:04:12] kormat: {{done}} [10:12:03] RECOVERY - Check the last execution of netbox_ganeti_eqsin_sync on netbox1001 is OK: 
OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:14:54] (03PS1) 10Elukey: Set Bigtop for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/605858 (https://phabricator.wikimedia.org/T244499) [10:17:12] (03CR) 10Alexandros Kosiaris: Add kubernetes[12]007-kubernetes[12]014 to BGP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) (owner: 10Alexandros Kosiaris) [10:17:24] (03PS3) 10Alexandros Kosiaris: Add kubernetes[12]007-kubernetes[12]014 to BGP [homer/public] - 10https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) [10:17:44] (03CR) 10Elukey: [C: 03+2] Set Bigtop for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/605858 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [10:21:19] (03PS1) 10Filippo Giunchedi: hieradata: page on thanos swift/query failure [puppet] - 10https://gerrit.wikimedia.org/r/605859 (https://phabricator.wikimedia.org/T233956) [10:22:08] (03PS2) 10Jforrester: Stop setting wgCommentTableSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602105 [10:22:23] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: page on thanos swift/query failure [puppet] - 10https://gerrit.wikimedia.org/r/605859 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [10:22:27] (03CR) 10Jforrester: [C: 03+2] Stop setting wgCommentTableSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602105 (owner: 10Jforrester) [10:22:32] jouncebot: next [10:22:32] In 0 hour(s) and 37 minute(s): European mid-day backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200616T1100) [10:22:42] (03CR) 10Volans: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) (owner: 10Alexandros Kosiaris) [10:23:05] no patches? wow [10:23:16] (03Merged) 10jenkins-bot: Stop setting wgCommentTableSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602105 (owner: 10Jforrester) [10:23:18] It happens. [10:23:26] (I'm deploying config clean-up.) [10:23:56] 👍 [10:26:10] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgCommentTableSchemaMigrationStage, no longer read in core (duration: 01m 07s) [10:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:09] (03PS2) 10Jforrester: Stop setting wgChangeTagsSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602106 [10:30:18] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10akosiaris) >>! In T224041#6225405, @jeena wrote: > I created a helm test and got the integration a... 
[10:31:26] (03PS1) 10Jforrester: Stop setting wgTagStatisticsNewTable, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605863 [10:31:26] (03PS1) 10Jforrester: Stop setting wgActorTableSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605864 [10:31:28] (03CR) 10Jforrester: [C: 03+2] Stop setting wgChangeTagsSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602106 (owner: 10Jforrester) [10:32:07] (03Merged) 10jenkins-bot: Stop setting wgChangeTagsSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602106 (owner: 10Jforrester) [10:37:02] 10Operations: Reboot snapshot hosts - https://phabricator.wikimedia.org/T255550 (10MoritzMuehlenhoff) [10:37:47] (03PS3) 10Jbond: profile::sre::check_mail: new script for checking user emails [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) [10:42:25] 10Operations, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (10elukey) In T252913 Keith is working on moving ES and Kafka to profile::java, so the one missing is Cassandra p... 
[10:43:56] 10Operations, 10Dumps-Generation: Reboot snapshot hosts - https://phabricator.wikimedia.org/T255550 (10ArielGlenn) [10:44:43] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgChangeTagsSchemaMigrationStage, no longer read in core (duration: 01m 06s) [10:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:06] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:50:12] (03CR) 10Jforrester: [C: 03+2] Stop setting wgTagStatisticsNewTable, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605863 (owner: 10Jforrester) [10:50:39] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [10:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:01] (03Merged) 10jenkins-bot: Stop setting wgTagStatisticsNewTable, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605863 (owner: 10Jforrester) [10:51:04] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:51:18] !log roll-restarting restbase101[6-8].eqiad.wmnet for cert updates [10:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:58] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgTagStatisticsNewTable, no longer read in core (duration: 01m 04s) [10:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:36] hnowlan: use cumin2001 if possible as has been migrated to buster and would be great to surface any issue before we reimage cumin1001 too next week ;) [10:54:32] (03CR) 10Jforrester: [C: 03+2] Stop setting wgActorTableSchemaMigrationStage, no longer read in core 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/605864 (owner: 10Jforrester) [10:54:45] volans: ack, I have more hosts to do later and I'll use cumin2001 then [10:54:59] great, thanks [10:55:29] (03Merged) 10jenkins-bot: Stop setting wgActorTableSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605864 (owner: 10Jforrester) [10:59:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200616T1100). [11:00:25] Just doing a final sync. [11:01:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:01:17] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgActorTableSchemaMigrationStage, no longer read in core (duration: 01m 05s) [11:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:25] Now we're down to just wgMultiContentRevisionSchemaMigrationStage as an open migration stage, which is nice to see. [11:01:28] volans: does this happen because the cookbook doesn't implement --help or is it something in the generic interface for cookbooks in spicerack itself? https://phabricator.wikimedia.org/P11531 [11:01:41] hnowlan: looking [11:01:44] (Prod clear for whoever needs it.) 
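A note on the `--help` failure asked about above: the diagnosis given later in this log (11:20 and 11:23) is that stray commas turned the help text for 'cluster' into a tuple, which argparse accepted and which only failed once the formatter tried to append the default value. The sketch below reproduces that class of bug in isolation. It is an illustration under assumptions, not the cookbook's code: `DefaultsRawFormatter`, `render_help`, `help_fails`, the `--cluster` option, its default and the help wording are all hypothetical.

```python
import argparse


class DefaultsRawFormatter(argparse.ArgumentDefaultsHelpFormatter,
                           argparse.RawDescriptionHelpFormatter):
    """Hypothetical stand-in for a mixin of the two upstream formatters."""


def render_help(help_value):
    """Build a one-option parser and return its rendered help text."""
    parser = argparse.ArgumentParser(prog="demo",
                                     formatter_class=DefaultsRawFormatter)
    parser.add_argument("--cluster", default="a", help=help_value)
    return parser.format_help()


def help_fails(help_value):
    """Return True if defining the option or rendering its help raises."""
    try:
        render_help(help_value)
    except (TypeError, ValueError):
        # Depending on the Python version, the error surfaces either when the
        # help is rendered or already when the argument is added.
        return True
    return False


# Trailing commas inside the parentheses build a tuple of two strings.
broken_help = (
    "Cluster to act on,",
    "one per run.",
)

# Without the commas, adjacent string literals are joined into one string.
fixed_help = (
    "Cluster to act on, "
    "one per run."
)

if __name__ == "__main__":
    print(type(broken_help).__name__, help_fails(broken_help))
    print(type(fixed_help).__name__, help_fails(fixed_help))
```

The tuple only becomes a problem where the formatter treats the help as a string, which is why a plain run of the command can work while `-h` fails.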
[11:02:58] !log rebooting mw2350-mw2376 [11:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:54] hnowlan: will look in a min, sorry, phone call [11:04:54] volans: very low priority, I can dig into it :) [11:09:20] 10Operations, 10Traffic, 10affects-Kiwix-and-openZIM: HTML Dumps 429 error on RESTBase endpoints - https://phabricator.wikimedia.org/T255524 (10Kelson) [11:09:24] !log updating perf on buster [11:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:44] (03PS3) 10Jforrester: Remove TranslationNotifications user settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780) (owner: 10DannyS712) [11:10:30] (03CR) 10Jcrespo: "It ok, looks at the comments below. I would like to see proper integration tests of this both failing and succeeding." (033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [11:11:50] (03PS1) 10Arturo Borrero Gonzalez: wmcs: kubeadm: drop leftover nginx_ingress_yaml declaration [puppet] - 10https://gerrit.wikimedia.org/r/605873 [11:13:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: kubeadm: drop leftover nginx_ingress_yaml declaration [puppet] - 10https://gerrit.wikimedia.org/r/605873 (owner: 10Arturo Borrero Gonzalez) [11:13:43] (03PS2) 10Jforrester: Remove Mobile mainpage special casing from it and vec wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603524 (https://phabricator.wikimedia.org/T254731) (owner: 10Ammarpad) [11:13:56] (03CR) 10Jforrester: [C: 03+2] Remove Mobile mainpage special casing from it and vec wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603524 (https://phabricator.wikimedia.org/T254731) (owner: 10Ammarpad) [11:14:40] !log Deploy MCR schema change on db2087:3316 [11:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:54] 
(03Merged) 10jenkins-bot: Remove Mobile mainpage special casing from it and vec wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603524 (https://phabricator.wikimedia.org/T254731) (owner: 10Ammarpad) [11:14:59] hnowlan: I think it's a combination of factors, testing something locally, will get back to you in a few [11:15:14] volans: cool, thanks! [11:15:20] !log updating perf on stretch hosts [11:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:43] (03PS2) 10Jforrester: Drop simplewiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604973 (https://phabricator.wikimedia.org/T32405) (owner: 10BrandonXLF) [11:15:48] (03CR) 10Jforrester: [C: 03+2] Drop simplewiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604973 (https://phabricator.wikimedia.org/T32405) (owner: 10BrandonXLF) [11:16:39] (03Merged) 10jenkins-bot: Drop simplewiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604973 (https://phabricator.wikimedia.org/T32405) (owner: 10BrandonXLF) [11:18:43] (03PS3) 10Ammarpad: Add extended-confirmed group and restriction level for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605652 (https://phabricator.wikimedia.org/T254471) [11:18:43] !log jforrester@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: T32405 T254731 Drop mobile special casing of main page for simplewiki, itwikisource, vecwikisource (duration: 01m 05s) [11:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:48] T32405: [EPIC] MobileFrontend extension should stop special-casing main page - https://phabricator.wikimedia.org/T32405 [11:18:49] T254731: Turn off main page special casing on itwikisource and vecwikisource - https://phabricator.wikimedia.org/T254731 [11:20:14] hnowlan: commas in the help message for 'cluster' that makes the message a tuple instead of a multiline string [11:21:08] 
(03PS1) 10Volans: sre.cassandra.roll-restart: fix help message [cookbooks] - 10https://gerrit.wikimedia.org/r/605874 [11:21:11] ^^^ [11:21:26] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [11:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:27] (03CR) 10Hnowlan: [C: 03+1] sre.cassandra.roll-restart: fix help message [cookbooks] - 10https://gerrit.wikimedia.org/r/605874 (owner: 10Volans) [11:22:32] ahhh nice catch volans [11:22:40] (03PS1) 10Muehlenhoff: Fix filename [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/605875 [11:23:39] hnowlan: funny that argparse doesn't complain about it, assigns it to its variable and it fails only afterwards, when concatenating the default value, because we're using a custom formatter that is a mixin of the upstream defaults and raw description ones [11:23:54] (03PS2) 10Muehlenhoff: Fix filename [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/605875 [11:25:58] (03CR) 10Volans: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/605875 (owner: 10Muehlenhoff) [11:26:53] (03CR) 10Volans: [C: 03+2] sre.cassandra.roll-restart: fix help message [cookbooks] - 10https://gerrit.wikimedia.org/r/605874 (owner: 10Volans) [11:26:54] !log hnowlan@cumin2001 START - Cookbook sre.cassandra.roll-restart [11:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:26] (03CR) 10Privacybatm: "Shall we do the integration test like:" (033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [11:27:32] !log roll-restart restbase2009 for cert update [11:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:02] (03Merged) 10jenkins-bot: sre.cassandra.roll-restart: fix help message [cookbooks] - 10https://gerrit.wikimedia.org/r/605874 (owner: 10Volans) [11:29:56]
RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2022-06-15 10:34:42 +0000 (expires in 728 days) https://phabricator.wikimedia.org/T120662 [11:30:23] hnowlan: sudo cookbook sre.cassandra.roll-restart -h works fine on 1001, I did not run puppet on 1001 as you're running it, just to be on the safe side [11:30:33] but feel free to force a puppet run to get the patch on 2001 if needed [11:30:37] before the usual cron [11:30:51] will do [11:31:28] RECOVERY - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-b valid until 2022-06-15 10:34:44 +0000 (expires in 728 days) https://phabricator.wikimedia.org/T120662 [11:31:45] (03PS1) 10Arturo Borrero Gonzalez: wmcs: kubeadm: decouple haproxy profile for Toolforge/PAWS [puppet] - 10https://gerrit.wikimedia.org/r/605876 (https://phabricator.wikimedia.org/T195217) [11:33:34] (03CR) 10Arturo Borrero Gonzalez: "PCC https://puppet-compiler.wmflabs.org/compiler1001/23258/" [puppet] - 10https://gerrit.wikimedia.org/r/605876 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [11:34:10] (03PS1) 10Ema: ATS: consider TS_LUA_CACHE_LOOKUP_HIT_STALE as "hit" [puppet] - 10https://gerrit.wikimedia.org/r/605877 (https://phabricator.wikimedia.org/T255368) [11:34:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: kubeadm: decouple haproxy profile for Toolforge/PAWS [puppet] - 10https://gerrit.wikimedia.org/r/605876 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [11:34:29] (03PS1) 10Jbond: memcached: convert systemd service file to an override [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) [11:34:52] RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2022-06-15 10:34:47 +0000 (expires in 728 days) https://phabricator.wikimedia.org/T120662 [11:35:57] !log reboot 
an-druid100[1,2] for kernel upgrades [11:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:18] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 114.9 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [11:38:59] !log hnowlan@cumin2001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [11:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:38] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime [11:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:47] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:17] !log hnowlan@cumin2001 START - Cookbook sre.cassandra.roll-restart [11:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:39] !log roll-restarting restbase201[0-2] for cert updates [11:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:49] (03PS2) 10Jbond: memcached: convert systemd service file to an override [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) [11:41:09] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23259/" [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) (owner: 10Jbond) [11:41:29] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:30] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:30] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:31] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:31] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:32] !log 
akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:32] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:33] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:33] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:34] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:34] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:35] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:45] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:46] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:46] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:47] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:47] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:48] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:48] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [11:41:49] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
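Editor's note on the argparse help-message bug discussed between 11:20 and 11:23 above: the following is a minimal sketch of how trailing commas turn a multiline help string into a tuple. It is not the actual cookbook code. The class name `HelpFormatter`, the `--cluster` option, its default and the help wording are all hypothetical, and the exact exception and the point where it surfaces may differ between Python versions.

```python
import argparse


class HelpFormatter(argparse.ArgumentDefaultsHelpFormatter,
                    argparse.RawDescriptionHelpFormatter):
    """Hypothetical mixin: appends defaults and keeps the description raw."""


# Trailing commas turn the parenthesised literals into a tuple of strings.
BROKEN_HELP = (
    'Cluster to restart.',
    'One of: a, b.',
)

# Without commas, adjacent literals are concatenated into one string.
FIXED_HELP = (
    'Cluster to restart. '
    'One of: a, b.'
)


def render_help(help_text):
    """Return the rendered help text, or None if argparse chokes on it."""
    parser = argparse.ArgumentParser(prog='demo', formatter_class=HelpFormatter)
    try:
        parser.add_argument('--cluster', default='a', help=help_text)
        return parser.format_help()
    except Exception:
        # Assumption: the failure point varies by Python version (while
        # formatting the help, or already inside add_argument()).
        return None


print(type(BROKEN_HELP).__name__, render_help(BROKEN_HELP) is None)
print(type(FIXED_HELP).__name__, render_help(FIXED_HELP) is not None)
```

The point of the sketch is that the tuple is a legal value for `help=` at assignment time, so a mistake like this can go unnoticed until the help is actually rendered, for example with `-h`.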
[11:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:54] (03PS1) 10CDanis: depool eqiad for router upgrade [dns] - 10https://gerrit.wikimedia.org/r/605880 [11:46:15] (03CR) 10CDanis: [C: 03+2] Depool eqiad for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/605838 (https://phabricator.wikimedia.org/T243080) (owner: 10Ayounsi) [11:47:01] (03PS2) 10CDanis: Depool eqiad for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/605838 (https://phabricator.wikimedia.org/T243080) (owner: 10Ayounsi) [11:47:29] (03Abandoned) 10CDanis: depool eqiad for router upgrade [dns] - 10https://gerrit.wikimedia.org/r/605880 (owner: 10CDanis) [11:48:40] !log depooling eqiad for router upgrade T243080 [11:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:08] cdanis: "have fun" :D [11:49:47] (03PS1) 10Arturo Borrero Gonzalez: wmcs: paws: add support for HTTP->HTTPS redirection in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605883 (https://phabricator.wikimedia.org/T195217) [11:50:21] elukey: :D [11:51:30] (03CR) 10Arturo Borrero Gonzalez: [C: 
03+2] wmcs: paws: add support for HTTP->HTTPS redirection in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605883 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [11:52:23] (03CR) 10Muehlenhoff: memcached: convert systemd service file to an override (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) (owner: 10Jbond) [11:56:15] (03PS3) 10Jbond: memcached: convert systemd service file to an override [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200616T1200) [12:00:33] (03PS4) 10Jbond: memcached: convert systemd service file to an override [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) [12:00:35] (03CR) 10Muehlenhoff: [C: 03+2] Fix filename [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/605875 (owner: 10Muehlenhoff) [12:00:51] (03CR) 10Muehlenhoff: [C: 03+2] Update MOU date for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/605815 (owner: 10Muehlenhoff) [12:01:18] (03PS2) 10Muehlenhoff: Switch cumin1001 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/605570 [12:03:37] 10Operations, 10ops-eqiad, 10decommission-hardware: decommission ganeti100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T255553 (10akosiaris) [12:04:07] 10Operations, 10ops-eqiad, 10decommission-hardware: decommission ganeti100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T255553 (10akosiaris) [12:06:15] (03CR) 10Hnowlan: [C: 03+2] Switch changeprop and changeprop-jobqueue to v0.10.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/605621 (https://phabricator.wikimedia.org/T255278) (owner: 10Ppchelko) [12:06:48] (03Merged) 10jenkins-bot: Switch changeprop and changeprop-jobqueue to v0.10.0 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/605621 (https://phabricator.wikimedia.org/T255278) (owner: 10Ppchelko) [12:08:05] (03PS5) 10Jbond: memcached: convert systemd service file to an override [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) [12:09:16] !log hnowlan@cumin2001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [12:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:33] (03PS1) 10CDanis: bump cr2-eqiad OSPF metrics to shift traffic for maintenance [homer/public] - 10https://gerrit.wikimedia.org/r/605885 (https://phabricator.wikimedia.org/T243080) [12:14:35] !log disable transit/peering & increase frack MED on cr1-eqiad T243080 [12:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:42] (03PS1) 10Muehlenhoff: Switch CI to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) [12:15:42] (03PS1) 10Ayounsi: Fix bug due to oversight in new ipversion detection, part 2 [homer/public] - 10https://gerrit.wikimedia.org/r/605888 [12:15:43] !log cdanis@re0.cr1-eqiad# commit confirmed 2 comment "force VRRP failover T243080" [12:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:40] (03PS1) 10Alexandros Kosiaris: ganeti: Decomission ganeti1001-4, ganeti2001-6 [puppet] - 10https://gerrit.wikimedia.org/r/605890 (https://phabricator.wikimedia.org/T255553) [12:18:57] (03PS2) 10Alexandros Kosiaris: ganeti: Decomission ganeti1001-4, ganeti2001-6 [puppet] - 10https://gerrit.wikimedia.org/r/605890 (https://phabricator.wikimedia.org/T255553) [12:20:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605890 (https://phabricator.wikimedia.org/T255553) (owner: 10Alexandros Kosiaris) [12:22:41] (03CR) 10Ayounsi: [C: 03+2] Fix bug due to oversight in new ipversion detection, part 2 [homer/public] - 
10https://gerrit.wikimedia.org/r/605888 (owner: 10Ayounsi) [12:23:12] (03CR) 10Ayounsi: [C: 03+1] bump cr2-eqiad OSPF metrics to shift traffic for maintenance [homer/public] - 10https://gerrit.wikimedia.org/r/605885 (https://phabricator.wikimedia.org/T243080) (owner: 10CDanis) [12:24:39] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 (10fgiunchedi) Thank you @Papaul [12:25:57] !log cr1-eqiad: rebooting RE1 [backup] with new junos version T243080 [12:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:22] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 16860184 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:30:43] (03PS38) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [12:31:25] !log cr1-eqiad: request chassis routing-engine master switch [12:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:35] !log T243080 cr1-eqiad: request chassis routing-engine master switch [12:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:42] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1623280 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:08] PROBLEM - Host pfw3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:33:08] is that expected? [12:33:11] XioNoX: like last time? 
[12:33:13] * apergos peeks in [12:33:14] not sure [12:33:15] looking [12:33:19] (03PS3) 10Alexandros Kosiaris: ganeti: Decomission ganeti1001-4, ganeti2001-6 [puppet] - 10https://gerrit.wikimedia.org/r/605890 (https://phabricator.wikimedia.org/T255553) [12:33:53] * akosiaris around if needed [12:33:54] I'm groggy but I can get on if needed [12:34:02] I see traffic on pfw3's link to cr2-eqiad [12:35:09] I think only the IP of the router itself is unreachable but all frack is fine [12:35:10] PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:35:11] looking [12:35:20] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:35:27] I can ping the router from cr2-eqiad itself [12:35:42] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:35:45] (03PS6) 10Jbond: memcached: convert systemd service file to an override [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) [12:35:51] OSPF status alerts expected btw [12:36:09] RECOVERY - Host pfw3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [12:36:38] RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:36:48] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:36:51] did someone do something or did it come back on its own? 
[12:37:02] (03PS1) 10Andrew Bogott: move fake galera passwords [labs/private] - 10https://gerrit.wikimedia.org/r/605894 [12:37:05] jynus: cr1-eqiad finished its routing engine failover [12:37:10] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:37:21] the only reason there are no other alerts about cr1-eqiad here is that it is downtimed [12:37:48] (03PS2) 10Andrew Bogott: move fake galera passwords [labs/private] - 10https://gerrit.wikimedia.org/r/605894 [12:38:14] (03PS4) 10Alexandros Kosiaris: ganeti: Decomission ganeti1001-4, ganeti2001-6 [puppet] - 10https://gerrit.wikimedia.org/r/605890 (https://phabricator.wikimedia.org/T255553) [12:38:16] (03PS1) 10Alexandros Kosiaris: Remove role spare from old ganeti hosts [puppet] - 10https://gerrit.wikimedia.org/r/605895 (https://phabricator.wikimedia.org/T255553) [12:38:25] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [12:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:52] (03PS3) 10Andrew Bogott: move fake galera passwords [labs/private] - 10https://gerrit.wikimedia.org/r/605894 [12:39:47] (03PS2) 10Muehlenhoff: Switch CI to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) [12:39:54] XioNoX: it looks like there might have been some real impact: https://cas-icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=frban1001 [12:40:44] cdanis: yeah, I should have updated the MED on the pfw side as well [12:40:44] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] move fake galera passwords [labs/private] - 10https://gerrit.wikimedia.org/r/605894 (owner: 10Andrew Bogott) [12:40:51] ah [12:41:04] not sure why it didn't fail over though [12:41:26] do we use BFD on those BGP sessions? 
[12:41:39] no, because the interfaces are directly connected [12:41:49] oh so interface link should be sufficient? [12:41:49] so link down triggers a re-convergence [12:42:14] yeah, but maybe the way the FPC rebooted didn't trigger a link down event, or not fast enough [12:42:17] yeah [12:42:31] (03PS7) 10Jbond: memcached: convert systemd service file to an override [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) [12:42:54] !log pfw3-eqiad set MED to cr1 to 300 - T243080 [12:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:08] XioNoX: ok to continue with re0 upgrade & reboot? [12:44:56] cdanis: upgrade yes, and I'm checking pfw3 in the meantime to be sure cr1-pfw3 is drained [12:45:00] (03CR) 10Andrew Bogott: [C: 03+2] Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 (owner: 10Andrew Bogott) [12:45:15] ack [12:46:10] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23275/" [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [12:47:14] !log upload new memcache package with TLS to component/memcached16 in buster-wikimedia [12:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:26] cr1 was basically telling cr2 to discard traffic to pfw3, and cr2 accepted that because the MED was lower https://www.irccloud.com/pastebin/fuZjd4fh/ [12:47:46] now cr2 is preferred in all directions [12:48:52] (03CR) 10Jbond: "updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) (owner: 10Jbond) [12:49:07] XioNoX: should we just have done `deactivate protocols bgp group Fundraising` ? 
[12:51:02] cdanis: it's in theory less smooth than changing the MEDs, but it's a good idea to do it in addition to [12:51:33] ack, pretty sure I don't have access to change the MED on the pfw side [12:51:43] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [12:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:59] cdanis: I did it [12:52:02] I guess graceful shutdown would be helpful here, assuming the pfws are recent enough [12:52:08] PROBLEM - Memcached on idp-test1001 is CRITICAL: connect to address 208.80.154.87 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [12:52:31] cdanis: MEDs are fine as we control both sides [12:52:39] ack [12:52:54] XioNoX: good to proceed with cr1 re0 reboot? [12:53:01] cdanis: graceful shutdown is just a policy, there is no "recent enough" [12:53:42] cdanis: yep [12:53:52] (03PS1) 10Filippo Giunchedi: templates: add v6 for thanos-fe* [dns] - 10https://gerrit.wikimedia.org/r/605896 [12:53:56] RECOVERY - Memcached on idp-test1001 is OK: TCP OK - 0.000 second response time on 208.80.154.87 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [12:54:00] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
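Editor's note on the MED discussion above (12:40 to 12:53): a minimal sketch of the MED tie-breaker, assuming every earlier BGP criterion (local preference, AS path length, origin) is already equal between the candidate paths. The `Path` type and `med_step` function are hypothetical names. The MED of 300 towards cr1 comes from the log, while the MED of 0 towards cr2 is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str  # router the path goes through
    med: int       # multi-exit discriminator attached to the path


def med_step(paths):
    """Keep only the candidates with the lowest MED (lower is preferred)."""
    lowest = min(path.med for path in paths)
    return [path for path in paths if path.med == lowest]


# After "pfw3-eqiad set MED to cr1 to 300": only the cr2 path survives.
drained = med_step([Path("cr1-eqiad", 300), Path("cr2-eqiad", 0)])
print([path.next_hop for path in drained])

# After the rollback to 0 both paths tie, so later tie-breakers would decide.
restored = med_step([Path("cr1-eqiad", 0), Path("cr2-eqiad", 0)])
print([path.next_hop for path in restored])
```

The sketch also illustrates why raising the MED on only one side of the session is not enough: each router runs this comparison on the routes it receives, so both directions need the less attractive value before the link is really drained.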
[12:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:42] (03PS1) 10Andrew Bogott: icinga galera monitoring: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/605897 (https://phabricator.wikimedia.org/T242455) [12:58:39] (03CR) 10Andrew Bogott: [C: 03+2] icinga galera monitoring: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/605897 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [12:59:50] (03PS1) 10Arturo Borrero Gonzalez: keepalived: add support for multiple virtual addresses [puppet] - 10https://gerrit.wikimedia.org/r/605898 [13:00:04] liw and brennen: Dear deployers, time to do the Mediawiki train - European+American Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200616T1300). [13:00:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) (owner: 10Jbond) [13:01:02] (03PS1) 10Lars Wirzenius: group0 wikis to 1.35.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605900 [13:01:04] (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.35.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605900 (owner: 10Lars Wirzenius) [13:01:29] !log rebooting mw2291-mw2334 [13:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:56] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605900 (owner: 10Lars Wirzenius) [13:03:00] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:19] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:03:22] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:39] !log T243080 cdanis@re1.cr1-eqiad> request chassis routing-engine master switch [13:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: Decomission ganeti1001-4, ganeti2001-6 [puppet] - 10https://gerrit.wikimedia.org/r/605890 (https://phabricator.wikimedia.org/T255553) (owner: 10Alexandros Kosiaris) [13:05:05] no issues with pfw3 now [13:06:29] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission [13:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:16] PROBLEM - Memcached on idp-test2001 is CRITICAL: connect to address 208.80.153.25 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:07:54] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:07:56] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:17] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.37 [13:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:30] (03PS1) 10Marostegui: wikireplica_analytics: Increase query killer time [puppet] - 10https://gerrit.wikimedia.org/r/605902 [13:09:22] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:28] 10Operations, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission ganeti100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T255553 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `ganeti[1001-1004].eqiad.wmnet` - ganeti1001.eqiad.w... [13:09:44] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:09:46] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:10:17] (03PS1) 10Reedy: Remove PasswordCannotBePopular references from CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605903 [13:11:38] (03PS1) 10Reedy: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605904 [13:12:18] (03PS2) 10Reedy: Remove PasswordCannotBePopular references from CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605903 [13:12:43] !log add graceful-switchover to cr1-eqiad [13:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:51] (03PS3) 10Reedy: Remove PasswordCannotBePopular references from CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605903 [13:13:24] (03PS4) 10Reedy: Remove 
PasswordCannotBePopular references from CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605903 [13:14:26] RECOVERY - Memcached on idp-test2001 is OK: TCP OK - 0.036 second response time on 208.80.153.25 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [13:14:49] (03CR) 10Reedy: [C: 04-1] "Can't go till .37 is everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605904 (owner: 10Reedy) [13:15:10] (03CR) 10Reedy: [C: 03+2] Remove PasswordCannotBePopular references from CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605903 (owner: 10Reedy) [13:16:09] (03Merged) 10jenkins-bot: Remove PasswordCannotBePopular references from CS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605903 (owner: 10Reedy) [13:17:00] !log pfw3-eqiad rollback MED to cr1 to 0 - T243080 [13:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:28] (03PS2) 10Reedy: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605904 [13:31:37] (03CR) 10CDanis: [C: 03+2] bump cr2-eqiad OSPF metrics to shift traffic for maintenance [homer/public] - 10https://gerrit.wikimedia.org/r/605885 (https://phabricator.wikimedia.org/T243080) (owner: 10CDanis) [13:31:53] (03PS3) 10Muehlenhoff: Switch cumin1001 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/605570 [13:32:01] (03Merged) 10jenkins-bot: bump cr2-eqiad OSPF metrics to shift traffic for maintenance [homer/public] - 10https://gerrit.wikimedia.org/r/605885 (https://phabricator.wikimedia.org/T243080) (owner: 10CDanis) [13:32:42] !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool db2092 T254462', diff saved to https://phabricator.wikimedia.org/P11535 and previous config saved to /var/cache/conftool/dbconfig/20200616-133241-marostegui.json [13:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:49] T254462: Compress enwiki InnoDB tables - 
https://phabricator.wikimedia.org/T254462 [13:33:28] (03PS1) 10Andrew Bogott: Galera monitoring: move defines from icinga host to monitored host [puppet] - 10https://gerrit.wikimedia.org/r/605909 [13:34:38] (03CR) 10jerkins-bot: [V: 04-1] Galera monitoring: move defines from icinga host to monitored host [puppet] - 10https://gerrit.wikimedia.org/r/605909 (owner: 10Andrew Bogott) [13:38:24] (03PS2) 10Andrew Bogott: Galera monitoring: move defines from icinga host to monitored host [puppet] - 10https://gerrit.wikimedia.org/r/605909 [13:38:54] (03CR) 10Muehlenhoff: [C: 03+2] Switch cumin1001 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/605570 (owner: 10Muehlenhoff) [13:39:13] !log cr2-eqiad: disable transit/peering BGP & bump fr MED T243080 [13:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:32] (03CR) 10jerkins-bot: [V: 04-1] Galera monitoring: move defines from icinga host to monitored host [puppet] - 10https://gerrit.wikimedia.org/r/605909 (owner: 10Andrew Bogott) [13:39:46] PROBLEM - Check systemd state on mw2370 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:50] PROBLEM - Check systemd state on mw2366 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:56] PROBLEM - Check systemd state on mw2362 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:58] PROBLEM - Check systemd state on mw2352 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:58] PROBLEM - Check systemd state on mw2360 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:04] PROBLEM - Check systemd state on mw2354 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:34] PROBLEM - Check systemd state on mw2376 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:38] PROBLEM - Check systemd state on mw2356 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:38] PROBLEM - Check systemd state on mw2374 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:54] PROBLEM - Check systemd state on mw2364 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:56] PROBLEM - Check systemd state on mw2358 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:08] PROBLEM - Check systemd state on mw2372 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:18] PROBLEM - Check systemd state on mw2350 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:32] PROBLEM - Check systemd state on mw2368 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:35] looking [13:41:37] nginx.service [13:41:38] failed [13:41:53] related to network or? [13:42:11] these were rebooted [13:42:15] ok [13:42:21] downtime expired [13:42:36] Jun 16 11:56:06 mw2376 systemd[1010]: nginx.service: Failed at step EXEC spawning /usr/sbin/nginx: No such file or directory [13:43:05] (03PS3) 10Andrew Bogott: Galera monitoring: move defines from icinga host to monitored host [puppet] - 10https://gerrit.wikimedia.org/r/605909 [13:43:44] but the time doesn't match [13:44:04] but seems to match the reboot time [13:44:13] moritzm: ^^^ [13:44:27] (03CR) 10Andrew Bogott: [C: 03+2] Galera monitoring: move defines from icinga host to monitored host [puppet] - 10https://gerrit.wikimedia.org/r/605909 (owner: 10Andrew Bogott) [13:44:53] these have been switched to envoy, so I guess the old nginx setup wasn't cleaned up properly and this now shows up after the reboots [13:44:57] having a closer look [13:45:11] ack [13:46:02] these have nginx-common installed (which ships the service unit), but not any of the daemon packages, so it's not surprising that it fails to start :-) [13:46:20] yeah the daemon is in one of nginx-extras, nginx-full, nginx-light [13:46:25] and we have only common [13:47:00] I'm making a task to clean this up [13:48:00] RECOVERY - Check systemd state on mw2356 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:16] RECOVERY - Check systemd state on mw2364 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:18] RECOVERY - Check systemd state on mw2358 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:28] RECOVERY - Check systemd state on mw2372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:38] RECOVERY - Check systemd state on mw2350 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:52] RECOVERY - Check systemd state on mw2368 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:54] RECOVERY - Check systemd state on mw2370 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:58] RECOVERY - Check systemd state on mw2366 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:04] RECOVERY - Check systemd state on mw2362 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:06] RECOVERY - Check systemd state on mw2352 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:06] RECOVERY - Check systemd state on mw2360 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:12] RECOVERY - Check systemd state on mw2354 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:42] RECOVERY - Check systemd state on mw2376 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:46] RECOVERY - Check systemd state on mw2374 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:26] !log cr2-eqiad: rebooting RE1 
[backup] with new junos version T243080 [13:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:30] (03PS1) 10Filippo Giunchedi: swift: add explicit ordering for /var/log/swift [puppet] - 10https://gerrit.wikimedia.org/r/605915 (https://phabricator.wikimedia.org/T252186) [13:56:15] 10Operations, 10serviceops: Remaining nginx packages on some mw servers - https://phabricator.wikimedia.org/T255565 (10MoritzMuehlenhoff) [13:56:16] !log T243080 cdanis@re0.cr2-eqiad> request chassis routing-engine master switch [13:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:38] nice [13:56:41] and pfw3 is all fine [13:56:55] (03CR) 10Ema: [C: 03+2] ATS: use X-Cache-Status 'int' for responses without lookup [puppet] - 10https://gerrit.wikimedia.org/r/604710 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:57:05] (03PS1) 10Andrew Bogott: galera monitoring: more icinga fixes [puppet] - 10https://gerrit.wikimedia.org/r/605917 [13:57:15] (03CR) 10Ema: [C: 03+2] ATS: consider TS_LUA_CACHE_LOOKUP_HIT_STALE as "hit" [puppet] - 10https://gerrit.wikimedia.org/r/605877 (https://phabricator.wikimedia.org/T255368) (owner: 10Ema) [13:57:23] (03PS2) 10Ema: ATS: consider TS_LUA_CACHE_LOOKUP_HIT_STALE as "hit" [puppet] - 10https://gerrit.wikimedia.org/r/605877 (https://phabricator.wikimedia.org/T255368) [13:57:52] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/23278/" [puppet] - 10https://gerrit.wikimedia.org/r/605915 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:58:18] ema: ok to merge your change too ? [13:58:18] (03CR) 10Andrew Bogott: [C: 03+2] galera monitoring: more icinga fixes [puppet] - 10https://gerrit.wikimedia.org/r/605917 (owner: 10Andrew Bogott) [13:58:29] godog: yes! ty [13:58:41] np! 
running now [13:59:07] (03PS1) 10Jbond: confluent::kafka.sh: fix argument handeling [puppet] - 10https://gerrit.wikimedia.org/r/605918 [14:00:07] (03CR) 10Jbond: [C: 03+2] confluent::kafka.sh: fix argument handeling [puppet] - 10https://gerrit.wikimedia.org/r/605918 (owner: 10Jbond) [14:00:26] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:00:26] PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:02:11] (03PS1) 10Filippo Giunchedi: swift: stop requiring Package swift for log directory [puppet] - 10https://gerrit.wikimedia.org/r/605920 (https://phabricator.wikimedia.org/T252186) [14:02:12] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:02:12] RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:02:23] (03CR) 10jerkins-bot: [V: 04-1] swift: stop requiring Package swift for log directory [puppet] - 10https://gerrit.wikimedia.org/r/605920 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:03:20] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission [14:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:23] (03PS2) 10Filippo Giunchedi: swift: stop requiring Package swift for log directory [puppet] - 10https://gerrit.wikimedia.org/r/605920 (https://phabricator.wikimedia.org/T252186) [14:03:28] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [14:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:33] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission [14:03:33] !log akosiaris@cumin1001 END (FAIL) - Cookbook 
sre.hosts.decommission (exit_code=1) [14:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:46] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission [14:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:50] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: stop requiring Package swift for log directory [puppet] - 10https://gerrit.wikimedia.org/r/605920 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:05:02] 10Operations, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission ganeti100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T255553 (10akosiaris) [14:05:32] (03CR) 10Jbond: [C: 03+2] memcached: convert systemd service file to an override [puppet] - 10https://gerrit.wikimedia.org/r/605878 (https://phabricator.wikimedia.org/T255132) (owner: 10Jbond) [14:06:46] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:59] !log removing stray nginx packages from mw canaries (mw1261-mw1265 and mw1276-mw1283) T255565 [14:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:11] T255565: Remaining nginx packages on some mw servers - https://phabricator.wikimedia.org/T255565 [14:14:42] !log T243080 cdanis@re1.cr2-eqiad> request chassis routing-engine master switch [14:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P11541 and previous config saved to /var/cache/conftool/dbconfig/20200616-141540-marostegui.json [14:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:45] 10Operations, 10Traffic: 
noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10ema) >>! In T255368#6227452, @ema wrote: > we should try to reproduce `TE:chunked` being added to a stale object on 304 responses from the origin I've written a simp... [14:19:01] PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:19:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:20:09] (03CR) 10Filippo Giunchedi: [C: 03+2] templates: add v6 for thanos-fe* [dns] - 10https://gerrit.wikimedia.org/r/605896 (owner: 10Filippo Giunchedi) [14:20:49] RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:21:00] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:22:29] (03PS1) 10Andrew Bogott: rename check-galera to check_galera [puppet] - 10https://gerrit.wikimedia.org/r/605928 (https://phabricator.wikimedia.org/T242455) [14:22:38] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:43] (03CR) 10Andrew Bogott: [C: 03+2] rename check-galera to check_galera [puppet] - 10https://gerrit.wikimedia.org/r/605928 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [14:25:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:35] (03PS1) 10CDanis: Revert "bump cr2-eqiad OSPF metrics to shift traffic for maintenance" [homer/public] - 10https://gerrit.wikimedia.org/r/605929 
(https://phabricator.wikimedia.org/T243080) [14:27:59] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:28:05] (03CR) 10Ayounsi: [C: 03+1] Revert "bump cr2-eqiad OSPF metrics to shift traffic for maintenance" [homer/public] - 10https://gerrit.wikimedia.org/r/605929 (https://phabricator.wikimedia.org/T243080) (owner: 10CDanis) [14:28:16] (03CR) 10CDanis: [C: 03+2] Revert "bump cr2-eqiad OSPF metrics to shift traffic for maintenance" [homer/public] - 10https://gerrit.wikimedia.org/r/605929 (https://phabricator.wikimedia.org/T243080) (owner: 10CDanis) [14:28:20] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:28:39] (03Merged) 10jenkins-bot: Revert "bump cr2-eqiad OSPF metrics to shift traffic for maintenance" [homer/public] - 10https://gerrit.wikimedia.org/r/605929 (https://phabricator.wikimedia.org/T243080) (owner: 10CDanis) [14:28:50] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:51] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:30:25] looking at the thanos-fe alerts, might be due to adding ipv6 [14:31:23] !log reboot druid100[7,8] for kernel upgrades [14:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:59] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:33:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:33:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:57] !log eqiad router upgrades completed! 🎉 T243080 [14:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:11] (03CR) 10Bearloga: Add analytics-product system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [14:34:22] 🎉 [14:35:19] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:36:36] cdanis: \o/ nice! [14:36:36] !!! [14:38:23] 10Operations, 10Dumps-Generation: Reboot snapshot hosts - https://phabricator.wikimedia.org/T255550 (10MoritzMuehlenhoff) [14:38:53] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:39:20] Reedy, robh: around? [14:39:31] (Or an op) [14:39:45] (03PS1) 10CDanis: Revert "Depool eqiad for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/605933 (https://phabricator.wikimedia.org/T243080) [14:39:58] RhinosF1: ? 
[14:40:08] !log power off ms-be2018 for BBU replacement [14:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:21] (03CR) 10Ayounsi: [C: 03+1] Revert "Depool eqiad for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/605933 (https://phabricator.wikimedia.org/T243080) (owner: 10CDanis) [14:40:35] PROBLEM - Thanos swift https on thanos-fe1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.006 second response time https://wikitech.wikimedia.org/wiki/Thanos [14:40:53] PROBLEM - Thanos swift https on thanos-fe2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.151 second response time https://wikitech.wikimedia.org/wiki/Thanos [14:40:57] cdanis: I messaged Reedy [14:41:03] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2003.codfw.wmnet, thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:41:10] (03CR) 10CDanis: [C: 03+2] Revert "Depool eqiad for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/605933 (https://phabricator.wikimedia.org/T243080) (owner: 10CDanis) [14:41:20] (03PS2) 10CDanis: Revert "Depool eqiad for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/605933 (https://phabricator.wikimedia.org/T243080) [14:42:07] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:42:25] RECOVERY - Thanos swift https on thanos-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.010 second response time https://wikitech.wikimedia.org/wiki/Thanos [14:42:37] I'll silence the alerts while I'm debugging [14:42:43] RECOVERY - Thanos swift https on thanos-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.158 second response time 
https://wikitech.wikimedia.org/wiki/Thanos [14:43:18] !log repool eqiad T243080 [14:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:49] (03PS1) 10Hnowlan: changeprop: bump container image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/605934 [14:43:57] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:44:01] PROBLEM - Host ms-be2018 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:23] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:45:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:24] !log rebooting scandium for kernel security update [14:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:11] (03CR) 10Ppchelko: [C: 03+1] changeprop: bump container image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/605934 (owner: 10Hnowlan) [14:46:27] mmhh envoy is listening on v4 only, that explains [14:47:31] I'll revert the dns change for now [14:47:32] (03PS1) 10Filippo Giunchedi: Revert "templates: add v6 for thanos-fe*" [dns] - 10https://gerrit.wikimedia.org/r/605935 [14:47:53] (03CR) 10Hnowlan: [C: 03+2] changeprop: bump container image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/605934 (owner: 10Hnowlan) [14:48:10] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "templates: add v6 for thanos-fe*" [dns] - 10https://gerrit.wikimedia.org/r/605935 (owner: 10Filippo Giunchedi) [14:48:19] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:48:20] (03Merged) 10jenkins-bot: changeprop: bump container image version 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/605934 (owner: 10Hnowlan) [14:49:01] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:27] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [14:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:49] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:50:13] RECOVERY - Host ms-be2018 is UP: PING WARNING - Packet loss = 90%, RTA = 36.14 ms [14:50:29] RECOVERY - HP RAID on ms-be2018 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:50:30] 10Operations, 10Wikimedia-Mailing-lists: teampractices mailing list should have active admins - https://phabricator.wikimedia.org/T255525 (10MBinder_WMF) Thanks, both. What's the latest legit email to that list? I like the idea of the list (and once upon a time, it was very active), but in the absence of both... 
[14:53:41] 10Operations, 10ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T254392 (10Papaul) 05Open→03Resolved BBU replacement complete [14:53:47] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:55:17] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:37] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:57:25] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:00:39] (03PS2) 10Arturo Borrero Gonzalez: keepalived: add support for multiple virtual addresses [puppet] - 10https://gerrit.wikimedia.org/r/605898 [15:02:01] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe1001.eqiad.wmnet, thanos-fe1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:02:55] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:03:07] (03PS1) 10Jbond: memcached: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) [15:03:15] (03PS3) 10Arturo Borrero Gonzalez: keepalived: add support for multiple virtual addresses [puppet] - 10https://gerrit.wikimedia.org/r/605898 [15:03:49] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [15:04:19] PROBLEM - Check systemd state on mw2330 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:19] PROBLEM - Check systemd state on mw2332 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:22] the thanos alerts should recover as soon as dns ttl expires [15:04:28] apologies for the spam [15:04:45] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:05:23] PROBLEM - Ensure local MW versions match expected deployment on mw2293 is CRITICAL: CRITICAL: 127 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [15:05:45] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:06:09] RECOVERY - Check systemd state on mw2330 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:09] RECOVERY - Check systemd state on mw2332 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:39] 10Operations, 10netops: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 (10ayounsi) > This behavior is expected as the bundle is not running active with LACP enable. Instead is just a hardcoded aggregated bundle, in other words ; it is just Up with no participation of LACP for any so... 
[15:06:46] !log reboot an-coord1001 for kernel upgrades [15:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:12] (03PS4) 10Arturo Borrero Gonzalez: keepalived: add support for multiple virtual addresses [puppet] - 10https://gerrit.wikimedia.org/r/605898 [15:08:23] (03CR) 10Jbond: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/605937" [puppet] - 10https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [15:09:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] keepalived: add support for multiple virtual addresses [puppet] - 10https://gerrit.wikimedia.org/r/605898 (owner: 10Arturo Borrero Gonzalez) [15:09:31] (03PS2) 10Cwhite: set disable_fsnotify for all current mtail usage [puppet] - 10https://gerrit.wikimedia.org/r/605688 (https://phabricator.wikimedia.org/T251466) [15:10:13] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:10:41] (03CR) 10jerkins-bot: [V: 04-1] set disable_fsnotify for all current mtail usage [puppet] - 10https://gerrit.wikimedia.org/r/605688 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [15:11:41] (03PS1) 10JMeybohm: Initial commit of debian directory [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) [15:13:05] (03PS1) 10Andrew Bogott: icinga: include a couple of perl packages for check_icinga_nodes.pl [puppet] - 10https://gerrit.wikimedia.org/r/605941 [15:14:48] (03PS3) 10Cwhite: set disable_fsnotify for all current mtail usage [puppet] - 10https://gerrit.wikimedia.org/r/605688 (https://phabricator.wikimedia.org/T251466) [15:15:18] !log milimetric@deploy1001 Started deploy [analytics/refinery@c652f62]: Regular analytics weekly train [analytics/refinery@c652f62] [15:15:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:24] !log upgrading intel-microcode on jessie hosts [15:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:50] (03PS2) 10JMeybohm: Initial commit of debian directory [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) [15:18:01] PROBLEM - Memcached on idp-test2001 is CRITICAL: connect to address 208.80.153.25 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [15:20:09] !log reboot kafka-jumbo1007 for kernel upgrades [15:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:53] godog: the thanos healthcheck alert is expected? [15:21:07] oh I see, ok :) [15:21:19] vgutierrez: yeah :| [15:21:41] silencing [15:22:56] (03PS1) 10Arturo Borrero Gonzalez: wmcs: paws: haproxy: add keepalived support [puppet] - 10https://gerrit.wikimedia.org/r/605944 (https://phabricator.wikimedia.org/T195217) [15:23:14] !log milimetric@deploy1001 Finished deploy [analytics/refinery@c652f62]: Regular analytics weekly train [analytics/refinery@c652f62] (duration: 07m 56s) [15:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:08] (03CR) 10jerkins-bot: [V: 04-1] wmcs: paws: haproxy: add keepalived support [puppet] - 10https://gerrit.wikimedia.org/r/605944 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [15:24:09] 10Operations, 10Security-Team, 10Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10chasemp) Apologies @jcrespo and @herron for the lag here. I was away (thanks for updating the task @sbassett) and then this fell to the botto... 
[15:24:47] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:25:55] !log milimetric@deploy1001 Started deploy [analytics/refinery@c652f62] (thin): Regular analytics weekly THIN train [analytics/refinery@c652f62] [15:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:04] !log milimetric@deploy1001 Finished deploy [analytics/refinery@c652f62] (thin): Regular analytics weekly THIN train [analytics/refinery@c652f62] (duration: 00m 08s) [15:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:15] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:27:55] (03PS2) 10Arturo Borrero Gonzalez: wmcs: paws: haproxy: add keepalived support [puppet] - 10https://gerrit.wikimedia.org/r/605944 (https://phabricator.wikimedia.org/T195217) [15:29:17] (03CR) 10JMeybohm: "recheck" [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [15:29:32] (03PS1) 10Jbond: profile::idp::memcached: move SSL termination to memcached [puppet] - 10https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) [15:29:56] (03CR) 10jerkins-bot: [V: 04-1] profile::idp::memcached: move SSL termination to memcached [puppet] - 10https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [15:30:15] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for soworu - https://phabricator.wikimedia.org/T252705 (10soworu) >>! In T252705#6225588, @RobH wrote: > @soworu, > > This has been pending feedback since May 26th regarding: > > > > > >>>! In T252705#6146379, @RLa... 
[15:30:16] (03PS1) 10Filippo Giunchedi: thanos: pass min / max time to store [puppet] - 10https://gerrit.wikimedia.org/r/605949 (https://phabricator.wikimedia.org/T252186) [15:30:18] (03PS1) 10Filippo Giunchedi: thanos: use object storage for data older than 15d [puppet] - 10https://gerrit.wikimedia.org/r/605950 (https://phabricator.wikimedia.org/T252186) [15:31:22] (03CR) 10Hashar: "recheck CI change deployed ( https://gerrit.wikimedia.org/r/#/c/integration/config/+/605945/ )" [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [15:32:52] (03PS2) 10Jbond: profile::idp::memcached: move SSL termination to memcached [puppet] - 10https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) [15:33:30] (03PS3) 10Arturo Borrero Gonzalez: wmcs: paws: haproxy: add keepalived support [puppet] - 10https://gerrit.wikimedia.org/r/605944 (https://phabricator.wikimedia.org/T195217) [15:34:36] (03PS2) 10Jbond: memcached: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) [15:34:43] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for soworu - https://phabricator.wikimedia.org/T252705 (10Dzahn) The Kanuri (kr) Wikipedia has existed in the past but the wiki was closed: https://kr.wikipedia.org/wiki/Main_Page https://meta.wikimedia.org/wiki/Proposa... 
[15:34:57] (03PS3) 10Jbond: profile::idp::memcached: move SSL termination to memcached [puppet] - 10https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) [15:35:47] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [15:36:41] (03PS3) 10JMeybohm: Initial commit of debian directory [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) [15:36:43] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for soworu - https://phabricator.wikimedia.org/T252705 (10soworu) Thanks for the prompt feedback. If that's the case, then all others are okay without Kanuri. Thank you. [15:37:48] (03PS3) 10Jbond: memcached: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) [15:38:27] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [15:39:20] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) We went into this issue several times b... [15:39:58] (03PS4) 10Jbond: profile::idp::memcached: move SSL termination to memcached [puppet] - 10https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) [15:40:23] (03CR) 10CRusnov: "Thanks for looking at this. There was some IRC-side back and forth which led to the hack you see here, see inline." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) (owner: 10CRusnov) [15:40:46] (03PS1) 10Hnowlan: changeprop: new image version for dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/605954 [15:42:07] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for soworu - https://phabricator.wikimedia.org/T252705 (10RobH) 05Open→03Resolved It seems this is fully resolved, so I am closing this ticket. If there are any other issues or concerns, please reopen this ticket (i... [15:43:23] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [15:44:29] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@7d4458c]: Reduce glent maximum yarn resource usage to reasonable levels [15:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:02] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) [15:45:10] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@7d4458c]: Reduce glent maximum yarn resource usage to reasonable levels (duration: 00m 41s) [15:45:12] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) p:05Triage→03Medium [15:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:14] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) ` MD5 : 47ce46bc72ce069024a1694836961ed0 ` ` papaul@papauls-MacBook-Pro Downloads % md5 junos-srxsme-20.1R1.11.tgz MD5 (junos-srxsme-20.1R1.11.tgz) = 47ce46bc72ce069024a1694836961ed0 [15:46:15] (03CR) 10Andrew Bogott: [C: 03+2] icinga: include a couple of perl packages for check_icinga_nodes.pl [puppet] - 
10https://gerrit.wikimedia.org/r/605941 (owner: 10Andrew Bogott) [15:49:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1076', diff saved to https://phabricator.wikimedia.org/P11543 and previous config saved to /var/cache/conftool/dbconfig/20200616-154924-marostegui.json [15:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:34] (03PS1) 10Ottomata: refine.pp - bump refinery jar version and make eventlogging_analytics use event_transforms [puppet] - 10https://gerrit.wikimedia.org/r/605955 (https://phabricator.wikimedia.org/T238230) [15:50:42] (03CR) 10jerkins-bot: [V: 04-1] refine.pp - bump refinery jar version and make eventlogging_analytics use event_transforms [puppet] - 10https://gerrit.wikimedia.org/r/605955 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [15:51:05] (03PS2) 10Bstorm: cloud nfs: allow opt-in soft mounting wherever folks want to try it [puppet] - 10https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) [15:51:43] (03PS2) 10Ottomata: refine.pp - bump refinery version and make eventlogging_analytics use event_transforms [puppet] - 10https://gerrit.wikimedia.org/r/605955 (https://phabricator.wikimedia.org/T238230) [15:51:58] (03CR) 10Jbond: [C: 03+1] "LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) (owner: 10CRusnov) [15:52:53] (03CR) 10jerkins-bot: [V: 04-1] refine.pp - bump refinery version and make eventlogging_analytics use event_transforms [puppet] - 10https://gerrit.wikimedia.org/r/605955 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [15:53:13] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [15:53:35] (03PS3) 10Ottomata: refine.pp - bump version and make eventlogging_analytics use event_transforms [puppet] - 10https://gerrit.wikimedia.org/r/605955 
(https://phabricator.wikimedia.org/T238230) [15:56:06] (03CR) 10Ppchelko: [C: 03+1] changeprop: new image version for dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/605954 (owner: 10Hnowlan) [15:56:08] (03PS3) 10Bstorm: cloud nfs: allow opt-in soft mounting wherever folks want to try it [puppet] - 10https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) [15:56:42] (03PS4) 10Arturo Borrero Gonzalez: wmcs: paws: haproxy: add keepalived support [puppet] - 10https://gerrit.wikimedia.org/r/605944 (https://phabricator.wikimedia.org/T195217) [15:56:45] (03CR) 10Hnowlan: [C: 03+2] changeprop: new image version for dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/605954 (owner: 10Hnowlan) [15:57:11] (03Merged) 10jenkins-bot: changeprop: new image version for dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/605954 (owner: 10Hnowlan) [15:57:55] RECOVERY - Memcached on idp-test2001 is OK: TCP OK - 0.037 second response time on 208.80.153.25 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [15:58:07] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:38] (03CR) 10Bstorm: "I added some explanation and more warnings in the comments. The original idea of the task was "soft mount everything", but that seems like" [puppet] - 10https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) (owner: 10Bstorm) [15:58:47] (03PS6) 10CRusnov: netbox: Configure for netbox-dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) [16:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200616T1600). 
[16:00:17] (03CR) 10CRusnov: [C: 03+2] netbox: Configure for netbox-dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) (owner: 10CRusnov) [16:02:40] !log reboot kafka-jumbo1008 for kernel upgrades [16:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:58] (03PS1) 10Jbond: memcached: Clear ExecStart when used as an override [puppet] - 10https://gerrit.wikimedia.org/r/605957 [16:04:38] (03PS1) 10Esanders: Enable DiscussionTools on all labs wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605958 (https://phabricator.wikimedia.org/T255223) [16:04:38] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [16:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:31] (03CR) 10Esanders: "This is ready for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605958 (https://phabricator.wikimedia.org/T255223) (owner: 10Esanders) [16:05:57] PROBLEM - Check systemd state on thanos-be2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:41] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[16:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:05] (03CR) 10Jbond: "Ready for review (better ways to achieve the same result most welcome)" [puppet] - 10https://gerrit.wikimedia.org/r/605957 (owner: 10Jbond) [16:10:18] (03PS1) 10Elukey: Update refinery-job jar version for analytics' refinery profiles [puppet] - 10https://gerrit.wikimedia.org/r/605959 [16:10:24] ottomata: --^ [16:11:08] 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Retire and remove module labs_debrepo - https://phabricator.wikimedia.org/T153612 (10Bstorm) 05Open→03Declined I don't mind this remaining on NFS because it is small, and it doesn't seem like it is in use anymore so that solves this problem to m... [16:11:46] ah I missed one [16:12:28] (03PS2) 10Elukey: Update refinery-job jar version for analytics' refinery profiles [puppet] - 10https://gerrit.wikimedia.org/r/605959 [16:16:49] (03PS1) 10Filippo Giunchedi: thanos: fix Thanos sidecar Prometheus connection alert [puppet] - 10https://gerrit.wikimedia.org/r/605960 (https://phabricator.wikimedia.org/T252186) [16:17:56] (03CR) 10Dzahn: "Is it expected to see the openjdk-jre-headless-11 package in the new catalog here?" 
[puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [16:18:36] (03PS3) 10Dzahn: add IPs for releases1002/releases2002 [dns] - 10https://gerrit.wikimedia.org/r/605176 (https://phabricator.wikimedia.org/T247652) [16:18:45] RECOVERY - Check systemd state on thanos-be2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:45] (03CR) 10Dzahn: [C: 03+2] add IPs for releases1002/releases2002 [dns] - 10https://gerrit.wikimedia.org/r/605176 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [16:21:01] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 67 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:22:20] (03CR) 10Herron: [C: 03+2] java: manage elasticsearch and kafka java dependencies with ::profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [16:24:34] has wikibugs stopped reporting phab changes [16:25:55] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (10AntiCompositeNumber) [16:25:58] nope [16:26:02] !log Updating changeprop to new container version with updated dependencies [16:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:26:59] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for 
release 'production' . [16:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:37] 10Operations, 10Puppet, 10netbox: Netbox missing physical device in PuppetDB when Puppet disabled for too long - https://phabricator.wikimedia.org/T254986 (10bd808) [16:47:47] 10Operations, 10Traffic: Configure varnish to use "Unconfigured domain" page for 404 Not Served (instead of generic error) - https://phabricator.wikimedia.org/T112316 (10Krinkle) 05Open→03Declined They are both based on the same template now. It is just the message heading that is suboptimal. But, I don't... [16:57:50] 10Operations, 10Toolforge, 10Traffic, 10HTTPS, and 3 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) a:03bd808 Last step is to set HSTS to 1 year and then close this. [17:00:05] halfak and accraze: Time to snap out of that daydream and deploy Services – Graphoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200616T1700). [17:03:14] (03PS1) 10CRusnov: Add netbox-dev records and netbox-next public record [dns] - 10https://gerrit.wikimedia.org/r/605967 [17:03:18] !log performing rolling reboots of kafka-main hosts for security updates T254990 [17:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:32] (03CR) 10CRusnov: "This change is ready for review." 
[dns] - 10https://gerrit.wikimedia.org/r/605967 (owner: 10CRusnov) [17:12:03] (03CR) 10Elukey: [C: 03+2] Update refinery-job jar version for analytics' refinery profiles [puppet] - 10https://gerrit.wikimedia.org/r/605959 (owner: 10Elukey) [17:19:04] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/605967 (owner: 10CRusnov) [17:20:46] herron: o/ [17:21:02] hey elukey [17:21:03] there is cookbook to do it if you want to use it :) [17:21:22] (03CR) 10CRusnov: [C: 03+2] Add netbox-dev records and netbox-next public record [dns] - 10https://gerrit.wikimedia.org/r/605967 (owner: 10CRusnov) [17:21:22] (not sure if you were planning to do it manually or not) [17:21:30] (03PS2) 10CRusnov: Add netbox-dev records and netbox-next public record [dns] - 10https://gerrit.wikimedia.org/r/605967 [17:21:59] ah no probably reboots no, sigh misremembering [17:22:00] ah just rolling through them manually this time, already started [17:22:14] but I'll create one if not present [17:22:19] :) [17:22:37] (the co-location thing makes the logging case a bit special but we can workaround it) [17:22:44] okok nevermind, brainfart :) [17:22:52] hehe ok thanks though [17:23:07] (it is so nice to exec a cookbook and do other stuff :D) [17:23:32] :) [17:24:35] 10Operations, 10vm-requests: Site: eqiad/codfw 2 VM request for releases - https://phabricator.wikimedia.org/T255590 (10Dzahn) [17:32:04] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [17:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:41] (03PS1) 10Jayprakash12345: Set namespace aliases for guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605973 [17:33:48] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) >>! 
In T230245#6220755, @Platonides wrote: > I would try > * thr... [17:37:02] (03PS1) 10Jdlrobson: Restore Watchlist star [skins/Vector] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/605975 (https://phabricator.wikimedia.org/T255574) [17:38:22] (03CR) 10Jdlrobson: [C: 03+1] Restore Watchlist star [skins/Vector] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/605975 (https://phabricator.wikimedia.org/T255574) (owner: 10Jdlrobson) [17:41:25] (03PS2) 10Jayprakash12345: Set namespace aliases for guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605973 (https://phabricator.wikimedia.org/T255358) [17:41:59] (03CR) 10jerkins-bot: [V: 04-1] Set namespace aliases for guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605973 (https://phabricator.wikimedia.org/T255358) (owner: 10Jayprakash12345) [17:42:27] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@f4f5d7b]: airflow: adjust glent legal cutoff [17:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:38] (03CR) 10RhinosF1: [C: 04-1] "make sure to end lines with a , as appropiate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605973 (https://phabricator.wikimedia.org/T255358) (owner: 10Jayprakash12345) [17:44:02] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@f4f5d7b]: airflow: adjust glent legal cutoff (duration: 01m 35s) [17:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:20] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) >>! In T230245#6220755, @Platonides wrote: > I would try […] > *... 
[17:48:41] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10Nuria) > i.e. when we did x did y happen as we would have expected it to? So the work definitely requires acc... [17:52:20] !log crusnov@cumin2001 START - Cookbook sre.ganeti.makevm [17:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:36] (03CR) 10Muehlenhoff: "Yes, that's exactly the right fix, as we're trying to overwrite the ExecStart shipped in the systemd unit. Although, I'm slightly puzzled " [puppet] - 10https://gerrit.wikimedia.org/r/605957 (owner: 10Jbond) [17:54:30] (03PS3) 10Jayprakash12345: Set namespace aliases for guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605973 (https://phabricator.wikimedia.org/T255358) [17:54:33] (03CR) 10Lars Wirzenius: "I'm afraid I'm not qualified to review PHP or MW code, sorry." [skins/Vector] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/605975 (https://phabricator.wikimedia.org/T255574) (owner: 10Jdlrobson) [17:55:06] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [17:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:06] (03PS1) 10RhinosF1: close trwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 [17:56:56] (03CR) 10jerkins-bot: [V: 04-1] close trwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 (owner: 10RhinosF1) [17:57:12] (03PS2) 10RhinosF1: close trwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 (https://phabricator.wikimedia.org/T247330) [17:57:44] (03CR) 10DCausse: [C: 04-1] "after some discussion we might prefer to have a single service "wdqs-updater" that runs either the classic updater or the new streaming on" [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [17:58:09] (03CR) 10RhinosF1: "I'll fix 
groupOverrides later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 (https://phabricator.wikimedia.org/T247330) (owner: 10RhinosF1) [17:58:11] (03CR) 10jerkins-bot: [V: 04-1] close trwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 (https://phabricator.wikimedia.org/T247330) (owner: 10RhinosF1) [17:58:34] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10MoritzMuehlenhoff) Definitely, if you need a s... [17:58:41] (03PS1) 10Bstorm: cloud apt: set periodic autocleaning to once a week [puppet] - 10https://gerrit.wikimedia.org/r/605979 (https://phabricator.wikimedia.org/T127374) [17:59:28] (03CR) 10RhinosF1: [C: 03+1] Set namespace aliases for guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605973 (https://phabricator.wikimedia.org/T255358) (owner: 10Jayprakash12345) [18:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200616T1800). [18:01:12] !log crusnov@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [18:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:51] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) Looking into the cURL failure itself a bit. Those can always hap... 
[18:02:56] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) [18:05:23] (03CR) 10VolkerE: [C: 03+1] Restore Watchlist star [skins/Vector] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/605975 (https://phabricator.wikimedia.org/T255574) (owner: 10Jdlrobson) [18:07:13] 10Operations, 10Puppet, 10cloud-services-team (Kanban): Puppet class systemd needs to throw a more useful error - https://phabricator.wikimedia.org/T195553 (10Bstorm) The lack of upstart-based and sysV-based hosts in the environment makes this a non-issue for us. I'll take the cloud tag off this in case som... [18:08:31] 10Operations, 10Puppet, 10cloud-services-team (Kanban): Puppet class systemd needs to throw a more useful error - https://phabricator.wikimedia.org/T195553 (10Bstorm) Ok, Herald dislikes my action. I'll move it to the graveyard. [18:09:13] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) Version before ` root> show version Model: srx300 Junos: 15.1X49-D170.4 JUNOS Software Release [15.1X49-D170.4] ` After upgrade ` root> show version Model: srx300 Junos: 18.4R3-S2... 
[18:09:38] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) [18:09:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.6e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [18:11:10] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) The curlmulti/Swift requests all succeed if there are **968 iter... [18:12:37] (03PS1) 10CDanis: check_command configs should notify icinga for reload on change [puppet] - 10https://gerrit.wikimedia.org/r/605983 [18:16:56] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.145e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [18:17:31] (03CR) 10CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/23291/" [puppet] - 10https://gerrit.wikimedia.org/r/605983 (owner: 10CDanis) [18:18:09] !log mw2293 - scap pull (because Icinga reports mismatched MW versions) [18:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:24] RECOVERY - Ensure local MW versions match expected deployment on mw2293 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:21:26] (03PS2) 10CDanis: check_command configs should notify icinga for reload on change [puppet] - 
10https://gerrit.wikimedia.org/r/605983 [18:23:00] 10Operations, 10Analytics-Radar, 10Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843 (10Astonmalie) I thought ru stand for Russia, this can just be a Russia version of wikipedia This is my 2 cent though, i only assist students with [... [18:24:18] (03CR) 10RLazarus: [C: 03+1] check_command configs should notify icinga for reload on change [puppet] - 10https://gerrit.wikimedia.org/r/605983 (owner: 10CDanis) [18:24:51] (03CR) 10CDanis: [C: 03+2] check_command configs should notify icinga for reload on change [puppet] - 10https://gerrit.wikimedia.org/r/605983 (owner: 10CDanis) [18:26:13] (03CR) 10Jdlrobson: [C: 03+1] "Code has been reviewed it just needs to be merged into the branch for which I do not have merge rights." [skins/Vector] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/605975 (https://phabricator.wikimedia.org/T255574) (owner: 10Jdlrobson) [18:26:52] (03PS1) 10CRusnov: install_server: add netbox-dev server types, and dhcp config [puppet] - 10https://gerrit.wikimedia.org/r/605988 [18:27:35] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/605988 (owner: 10CRusnov) [18:28:16] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 4654 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [18:29:43] (03PS3) 10Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [18:30:42] PROBLEM - Host scs-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:32:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [18:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:45] (03PS1) 10ArielGlenn: restructure rsync of xml/sql dumps from primary source to other servers [puppet] - 10https://gerrit.wikimedia.org/r/605990 (https://phabricator.wikimedia.org/T254856) [18:37:27] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10thcipriani) [18:38:53] (03CR) 10Bstorm: cloud nfs: allow opt-in soft mounting wherever folks want to try it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) (owner: 10Bstorm) [18:42:56] Jdlrobson: will get that merged into branch, one sec. 
[18:44:48] (03CR) 10Brennen Bearnes: [C: 03+2] Restore Watchlist star [skins/Vector] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/605975 (https://phabricator.wikimedia.org/T255574) (owner: 10Jdlrobson) [18:48:34] (03PS3) 10Bstorm: nfs monitoring: fix the broken paths for the directory size monitor [puppet] - 10https://gerrit.wikimedia.org/r/605705 (https://phabricator.wikimedia.org/T160113) [18:52:36] !log Turning on puppet again on gerrit1002 to avoid having it lag too far behind. [18:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:59:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [18:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:44] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:59:48] PROBLEM - Check systemd state on dumpsdata1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:04] liw and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - European+American Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200616T1900). [19:01:02] !log holding 1.35.0-wmf.27 deploy to group1 for a few minutes while merging & testing fix for T255574 [19:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:07] T255574: Watchlist star gone on Vector - https://phabricator.wikimedia.org/T255574 [19:01:51] (of _course_ i checked 3x and still typoed the version number that logline.) 
[19:03:27] !log CORRECTION: holding _1.35.0-wmf.37_ deploy to group1 for a few minutes while merging & testing fix for T255574 [19:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:36] (03Merged) 10jenkins-bot: Restore Watchlist star [skins/Vector] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/605975 (https://phabricator.wikimedia.org/T255574) (owner: 10Jdlrobson) [19:06:59] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena) >>! In T224041#6227592, @akosiaris wrote: > However, up to now we did not have a need to pu... [19:12:40] RECOVERY - Host scs-a1-codfw is UP: PING WARNING - Packet loss = 77%, RTA = 220.31 ms [19:15:52] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.37/skins/Vector/resources/skins.vector.styles/: [[gerrit:605975|Restore Watchlist star]] (duration: 01m 05s) [19:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:01] 10Operations, 10Wikimedia-Mailing-lists: teampractices mailing list should have active admins - https://phabricator.wikimedia.org/T255525 (10Quiddity) 2 legit threads/posts since 2018 (in April 2020 and July 2019). Cf. archives list at https://lists.wikimedia.org/pipermail/teampractices/ +1 to sunsetting. 
[19:18:17] !log otto@deploy1001 Started deploy [analytics/refinery@8b8ce6e]: deploying refinery source 0.0.127 for eventlogging -> eventgate migration - T249261 [19:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:20] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [19:19:08] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605995 [19:19:10] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605995 (owner: 10Brennen Bearnes) [19:20:00] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605995 (owner: 10Brennen Bearnes) [19:23:10] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.37 [19:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:15] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.37 (duration: 01m 04s) [19:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:42] (03PS1) 10Dzahn: DHCP: add releases1002/releases2002 [puppet] - 10https://gerrit.wikimedia.org/r/605996 (https://phabricator.wikimedia.org/T255590) [19:24:54] (03CR) 10Cicalese: "> Patch Set 1: Code-Review-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599369 (https://phabricator.wikimedia.org/T247943) (owner: 10Jforrester) [19:25:13] (03CR) 10jerkins-bot: [V: 04-1] DHCP: add releases1002/releases2002 [puppet] - 10https://gerrit.wikimedia.org/r/605996 (https://phabricator.wikimedia.org/T255590) (owner: 10Dzahn) [19:28:34] (03PS1) 10Esanders: Set DiscussionToolsEnableVisual to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605997 (https://phabricator.wikimedia.org/T251654) [19:28:47] (03PS2) 10Dzahn: DHCP: add 
releases1002/releases2002 [puppet] - 10https://gerrit.wikimedia.org/r/605996 (https://phabricator.wikimedia.org/T255590) [19:29:13] (03CR) 10jerkins-bot: [V: 04-1] DHCP: add releases1002/releases2002 [puppet] - 10https://gerrit.wikimedia.org/r/605996 (https://phabricator.wikimedia.org/T255590) (owner: 10Dzahn) [19:29:47] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) I've tried to isolate it to "just" the Python shell out and the... [19:30:13] (03PS3) 10Dzahn: DHCP: add releases1002/releases2002 [puppet] - 10https://gerrit.wikimedia.org/r/605996 (https://phabricator.wikimedia.org/T255590) [19:37:06] (03PS4) 10Ottomata: refine.pp - bump version and make eventlogging_analytics use event_transforms [puppet] - 10https://gerrit.wikimedia.org/r/605955 (https://phabricator.wikimedia.org/T238230) [19:41:09] (03CR) 10Ottomata: [C: 03+2] refine.pp - bump version and make eventlogging_analytics use event_transforms [puppet] - 10https://gerrit.wikimedia.org/r/605955 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:47:48] 10Operations, 10Wikimedia-Mailing-lists: Close teampractices mailing list (as it has no active admins) - https://phabricator.wikimedia.org/T255525 (10Aklapper) [19:48:54] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Tgr) Presumably, something sets an ulimit on the number of file descripto... [19:48:57] Krinkle: how would I go for testing that generateCaptcha error? 
[19:49:07] (PS1) Jgreen: add check_kafkatee for fundraising banner loggers to nsca_frack.cfg.erb [puppet] - https://gerrit.wikimedia.org/r/605999
[19:49:25] it has been years since I last logged into the beta cluster...
[19:50:07] Platonides: OK, let me know when you've sshed on deploy01 :)
[19:50:25] (CR) Jgreen: [C: +2] add check_kafkatee for fundraising banner loggers to nsca_frack.cfg.erb [puppet] - https://gerrit.wikimedia.org/r/605999 (owner: Jgreen)
[19:50:46] where's the current bastion?
[19:51:43] I believe it is in Ashburn, Virginia :P
[19:51:53] duh
[19:51:57] https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#ProxyJump_(recommended)
[19:52:15] I have a deployment-bastion ssh config
[19:52:20] but it's obviously outdated
[19:52:27] $ ssh deployment-bastion
[19:52:27] channel 0: open failed: administratively prohibited: open failed
[19:52:27] stdio forwarding failed
[19:52:27] ssh_exchange_identification: Connection closed by remote host
[19:52:32] yeah, beta no longer has its own bastion
[19:52:43] make sure ssh config is set up for *.wmflabs
[19:52:54] using the general wmflabs bastion
[19:53:03] then `ssh deployment-deploy01.deployment-prep.eqiad.wmflabs`
[19:53:45] it was going via bastion.wmflabs.org
[19:54:02] is that the same as primary.bastion.wmflabs.org?
[19:54:12] Yes
[19:54:22] I use without primary.* currently
[19:54:24] they're aliases
[19:54:53] (CR) RLazarus: [C: +2] maintenance: Migrate wikidata prune jobs to periodic_job [puppet] - https://gerrit.wikimedia.org/r/599956 (https://phabricator.wikimedia.org/T211250) (owner: RLazarus)
[19:57:00] (CR) Dzahn: [C: +2] DHCP: add releases1002/releases2002 [puppet] - https://gerrit.wikimedia.org/r/605996 (https://phabricator.wikimedia.org/T255590) (owner: Dzahn)
[19:58:24] ok, I'm in
[19:59:54] (PS1) QChris: gerrit: Escape quotes for pipeline commentlinks [puppet] - https://gerrit.wikimedia.org/r/606001
[20:00:39] (CR) Volans: [C: +1] "LGTM (I didn't verify the MAC though)" [puppet] - https://gerrit.wikimedia.org/r/605988 (owner: CRusnov)
[20:01:24] (PS2) CRusnov: install_server: add netbox-dev server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/605988
[20:01:47] (CR) CRusnov: [C: +2] install_server: add netbox-dev server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/605988 (owner: CRusnov)
[20:02:05] I see
[20:02:15] that's... "fun"
[20:03:18] Krinkle: I'd be curious if "just" upgrading to PHP 7.3 would sort it
[20:04:30] (CR) Bartosz Dziewoński: [C: +1] "(This only affects wikis where wgDiscussionToolsEnable is true.)" [mediawiki-config] - https://gerrit.wikimedia.org/r/605997 (https://phabricator.wikimedia.org/T251654) (owner: Esanders)
[20:06:25] Platonides: you can repro then?
[20:06:37] let me know if you have any questions :)
[20:06:56] if you modify things, do !log in #wikimedia-releng in case people are curious about random errors
[20:07:21] yes
[20:07:50] the only modification I am doing is to run that script with --delete so far
[20:09:32] hah!
[20:09:33] 27188 open("/srv/mediawiki-staging/php-master/includes/json/FormatJson.php", O_RDONLY) = -1 EMFILE (Too many open files)
[20:10:23] something is probably leaking a fd per iteration
[20:10:31] doing ulimit -n 4096
[20:10:37] that 969 test now works
[20:10:43] (was 1024)
[20:14:24] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:16:56] Platonides: Right, the Swift code is passing file handles to MultiHttp instead of the file contents
[20:17:08] which means they stay open until lazily streamed by curl when it wants to
[20:17:35] So the file access is refused because there are already too many open file handles.
[20:17:37] interesting.
[20:17:53] That explains why it is fine if you want for all curl requests to complete
[20:18:00] wait*
[20:18:17] and maybe that also explains the curl conn failures, if that counts as a socket internally somewhere
[20:18:23] although I thought it was re-using the same curl handle
[20:18:42] I'm curious to learn how you found EMFILE :)
[20:19:52] (PS1) QChris: gerrit: Quote `javaOptions` in config [puppet] - https://gerrit.wikimedia.org/r/606003
[20:19:54] (PS1) QChris: gerrit: Clarify that `container.javaOptions` is currently unused [puppet] - https://gerrit.wikimedia.org/r/606004
[20:19:56] (PS1) QChris: gerrit: Split javaOptions settings onto separate lines [puppet] - https://gerrit.wikimedia.org/r/606005
[20:20:23] well, you can't pass 970 files with that limit
[20:20:42] Krinkle: have a look at /tmp/T230245.log :P
[20:21:10] (CR) Dzahn: install_server: add netbox-dev server types, and dhcp config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/605988 (owner: CRusnov)
[20:21:12] (PS2) QChris: gerrit: Quote `container.javaOptions` in config [puppet] - https://gerrit.wikimedia.org/r/606003
[20:21:14] (PS2) QChris: gerrit: Clarify that `container.javaOptions` is currently unused [puppet] - https://gerrit.wikimedia.org/r/606004
[20:21:16] (PS2) QChris: gerrit: Split `container.javaOptions` settings onto separate lines [puppet] - https://gerrit.wikimedia.org/r/606005
[20:21:18] (CR) Paladox: [C: +1] gerrit: Split `container.javaOptions` settings onto separate lines [puppet] - https://gerrit.wikimedia.org/r/606005 (owner: QChris)
[20:21:20] just a bit of strace
[20:21:38] (CR) Paladox: [C: +1] gerrit: Clarify that `container.javaOptions` is currently unused [puppet] - https://gerrit.wikimedia.org/r/606004 (owner: QChris)
[20:21:45] Platonides: cool, yeah, would be awesome if you could paste on the task as well what you did to get that.
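A minimal sketch (not from the log) of the failure mode discussed above: keeping one open handle per queued file eventually hits the per-process descriptor limit and fails with EMFILE, while reading and closing each file does not. It is written in Python rather than the PHP involved, uses `os.devnull` as a stand-in for the real files, and the helper name `open_many` and the lowered limit of 64 are assumptions for illustration only. Unix only, since it relies on the `resource` module.

```python
import errno
import os
import resource


def open_many(count, keep_open, limit=64):
    """Open os.devnull `count` times under a temporarily lowered fd limit.

    Returns (successful_opens, errno_or_None).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Lower only the soft limit (hypothetical value) so the effect shows quickly.
    new_soft = limit if soft == resource.RLIM_INFINITY else min(limit, soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    held = []
    done = 0
    try:
        for _ in range(count):
            fh = open(os.devnull, "rb")
            if keep_open:
                held.append(fh)  # handle stays open, like a queued upload
            else:
                fh.read()        # consume the contents right away
                fh.close()       # descriptor is released before the next open
            done += 1
        return done, None
    except OSError as exc:
        return done, exc.errno
    finally:
        for fh in held:
            fh.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    done, err = open_many(1000, keep_open=True)
    print("kept open:", done, errno.errorcode.get(err))
    done, err = open_many(1000, keep_open=False)
    print("read and closed:", done, err)
```

Raising the limit, as done with `ulimit -n 4096` in the chat, only moves the point of failure; the number of handles held at once is what matters.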
[20:21:56] I'm glad it's not a Zend or opcache bug then
[20:22:02] just PHP not telling the user why it failed
[20:22:06] (CR) Paladox: [C: +1] gerrit: Quote `container.javaOptions` in config [puppet] - https://gerrit.wikimedia.org/r/606003 (owner: QChris)
[20:22:25] The fix should probably be that Swift limits the size of its batches
[20:22:52] MW-swift code, I mean
[20:25:31] I was writing it
[20:26:43] <3
[20:27:48] there you go
[20:27:48] Operations, Commons, ConfirmEdit (CAPTCHA extension), Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (Platonides) I found it is a file descriptor problem. ulimit -n is set to...
[20:28:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:28:34] keeping 900+ fd open seems "too much"
[20:29:08] (PS1) Dzahn: site: add releases1002/releases2002 with insetup role [puppet] - https://gerrit.wikimedia.org/r/606006 (https://phabricator.wikimedia.org/T255590)
[20:29:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:29:39] !log signing puppet cert requests for releases1002 and releases2002 - T255590
[20:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:43] T255590: Site: eqiad/codfw 2 VM request for releases - https://phabricator.wikimedia.org/T255590
[20:30:07] (CR) Dzahn: [C: +2] site: add releases1002/releases2002 with insetup role [puppet] - https://gerrit.wikimedia.org/r/606006 (https://phabricator.wikimedia.org/T255590) (owner: Dzahn)
[20:31:35] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:31:54] just got a slew of EditPage errors; I'm going to go ahead and roll .37 back to group0.
[20:31:57] Operations, vm-requests, Patch-For-Review: Site: eqiad/codfw 2 VM request for releases - https://phabricator.wikimedia.org/T255590 (Dzahn) Open→Resolved a:Dzahn
[20:32:19] Platonides: It's curious it has worked fine for years
[20:32:30] Operations, Continuous-Integration-Infrastructure, serviceops: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (Dzahn)
[20:32:33] Operations, vm-requests, Patch-For-Review: Site: eqiad/codfw 2 VM request for releases - https://phabricator.wikimedia.org/T255590 (Dzahn)
[20:32:40] !log rolling 1.35.0-wmf.37 back to group0
[20:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:45] on this server?
[20:32:56] maybe something changed the default ulimit
[20:32:58] Operations, Continuous-Integration-Infrastructure, serviceops: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (Dzahn) VMs with buster, releases1002/releases2002 have been created in the subtask.
[20:34:01] It's worked at least ~2.5 years since I made some big changes to it
[20:34:21] where's the swift code that is passing descriptors?
[20:34:28] should be in mw core
[20:34:49] includes/libs/filebackend
[20:36:26] hmm
[20:36:37] why is it doing a fread loop?
[20:36:48] a file_get_contents should result in a mmap
[20:37:09] that is, assuming this is a local filesystem
[20:37:41] depends where in the code you're looking
[20:39:25] I guess async is set to true
[20:39:55] which makes it queue the operation
[20:41:05] !log reset email and pw for CactusJack
[20:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:43] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.35.0-wmf.37
[20:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:25] (PS1) Brennen Bearnes: Revert "group1 wikis to 1.35.0-wmf.37" [mediawiki-config] - https://gerrit.wikimedia.org/r/606009
[20:43:27] (CR) Brennen Bearnes: [C: +2] Revert "group1 wikis to 1.35.0-wmf.37" [mediawiki-config] - https://gerrit.wikimedia.org/r/606009 (owner: Brennen Bearnes)
[20:43:35] PROBLEM - Host mw2228.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:44:17] (Merged) jenkins-bot: Revert "group1 wikis to 1.35.0-wmf.37" [mediawiki-config] - https://gerrit.wikimedia.org/r/606009 (owner: Brennen Bearnes)
[20:44:37] (CR) Paladox: [C: +1] gerrit: Escape quotes for pipeline commentlinks [puppet] - https://gerrit.wikimedia.org/r/606001 (owner: QChris)
[20:48:37] RECOVERY - Host mw2228.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms
[20:52:46] Operations, Commons, ConfirmEdit (CAPTCHA extension), Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (Platonides) I can disable $async on FileBackendStore::doQuickOperationsI...
[20:53:33] (CR) Esanders: "Provisionally for deployment on Wednesday 17th June" [mediawiki-config] - https://gerrit.wikimedia.org/r/605997 (https://phabricator.wikimedia.org/T251654) (owner: Esanders)
[20:55:36] (PS1) Jdlrobson: Correct the name of the function [extensions/WikimediaEvents] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606012
[20:58:48] (CR) Hashar: "Ah thank you, I forgot about that profile::java enhancement." [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff)
[21:02:02] (CR) Krinkle: [C: +2] Correct the name of the function [extensions/WikimediaEvents] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606012 (owner: Jdlrobson)
[21:02:29] Jdlrobson: can roll out if you're around :)
[21:05:12] (Merged) jenkins-bot: Correct the name of the function [extensions/WikimediaEvents] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606012 (owner: Jdlrobson)
[21:07:30] (CR) Bstorm: [C: +2] nfs monitoring: fix the broken paths for the directory size monitor [puppet] - https://gerrit.wikimedia.org/r/605705 (https://phabricator.wikimedia.org/T160113) (owner: Bstorm)
[21:09:56] Operations, Commons, ConfirmEdit (CAPTCHA extension), Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (Platonides) $maxConcurrency was set to 50, but we had nearly one thousand...
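The batching idea raised earlier in the channel ("limit the size of its batches") can be sketched roughly as follows. This is not the MediaWiki patch and not PHP: it is an illustrative Python sketch in which `upload_in_batches`, `send_batch` and the default batch size of 50 (borrowed from the `$maxConcurrency` value quoted above) are all assumptions. The point is only that at most one batch worth of file handles is open at any time.

```python
import os
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def upload_in_batches(paths, send_batch, batch_size=50):
    """Open and send files in bounded batches; returns the number of files sent.

    `send_batch` is a hypothetical callable that receives the open handles.
    """
    sent = 0
    for chunk in batched(paths, batch_size):
        handles = []
        try:
            for path in chunk:
                handles.append(open(path, "rb"))
            send_batch(handles)
        finally:
            # Close every handle of this batch before opening the next one.
            for fh in handles:
                fh.close()
        sent += len(chunk)
    return sent


if __name__ == "__main__":
    sizes = []
    total = upload_in_batches([os.devnull] * 970, lambda hs: sizes.append(len(hs)))
    print(total, "files sent in", len(sizes), "batches")
```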
[21:12:15] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.37/extensions/WikimediaEvents/modules/: I67794c6c7192571 (duration: 01m 04s)
[21:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:55] last local commit, 9b94f58d :P
[21:19:27] (CR) BryanDavis: [C: +1] cloud nfs: allow opt-in soft mounting wherever folks want to try it [puppet] - https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) (owner: Bstorm)
[21:19:31] Krinkle: yes please
[21:19:40] and thank you if you did just that :)
[21:19:47] (CR) Bstorm: [C: +1] wikireplica_analytics: Increase query killer time [puppet] - https://gerrit.wikimedia.org/r/605902 (owner: Marostegui)
[21:19:47] sorry I'm in logstash capturing all the errors I'm finding
[21:20:14] (CR) Bstorm: [C: +2] cloud nfs: allow opt-in soft mounting wherever folks want to try it [puppet] - https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) (owner: Bstorm)
[21:20:16] Platonides: " last local commit, 9b94f58d" what is this in reply/ref to?
[21:21:11] nothing, really
[21:21:15] ah nvm, I understand now
[21:21:20] thought maybe it was deployment related
[21:21:43] just that apparently my local checkout was last updated more than 2 years ago
[21:22:45] I'm preparing a patch
[21:22:47] Krinkle: is there any way to prohibit certain user scripts via config? If not should there be? The reason I ask is as I go through logstash I'm discovering lots of outdated scripts that throw JS errors where the maintainer is no longer present. My thinking is ResourceLoader could be used to prohibit known broken scripts from loading.
[21:23:34] Jdlrobson: no, I don't think there should be. the errors cause no cascading impact, it's isolated at the module level already.
[21:23:52] and how would you let the next maintainer-to-be debug that, if you don't even let it fail?
[21:24:34] obvious syntax errors are also substituted with a mw.log.error() call that logs the syntax error without executing it.
[21:24:56] Krinkle: the issue is these errors cause a lot of noise in the Kibana dashboard
[21:25:03] with no way of filtering
[21:25:18] possibly user script errors should be tagged in a different channel?
[21:25:38] Uncaught ReferenceError: wgPageName is not defined is an extremely common one
[21:26:00] yes, there is also noise from browser extensions, and async exceptions from well-maintained gadgets that have no impact. E.g. it's not unusual for an ajax() call that is aborted to cause an exception, but it goes nowhere, and fail() is still invoked fine. It's just an artefact of how JS works.
[21:26:06] they can be filtered though
[21:26:34] the stack trace is indexed, so one could exclude e.g. /wiki/ and /w/index.php as file paths, as those would reflect all importScript() calls
[21:26:45] you'd still have gadgets and entry points to user scripts
[21:27:25] also virtual urls like chrome-extension:// could be filtered likewise, hiding any stack trace that contains them
[21:28:24] Jdlrobson: if there is a popular known gadget with a failure, and it is unmaintained you could also exclude one of its function names (if reasonably unique) from the dashboard in the same manner.
[21:28:40] ok, it was probably from November 2018 :P
[21:30:10] It looks like the stack trace formatter stopped working, it's a mess.
[21:30:27] probably worth filing a bug for and/or commenting on the launch task if it is still open
[21:30:45] Jdlrobson: " at https://en.wikipedia.org/w/index.php?title=User:Henrik/js/live-view-counter.js&action=raw&ctype=text/javascript:37:56"
[21:30:59] excluding "w/index.php" would exclude all such importScript() files that emit errors
[21:31:16] (from stack_trace field)
[21:32:50] Krinkle: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/606017/
[21:33:57] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:34:01] Operations, Data-Services, Tracking-Neverending, cloud-services-team (Kanban): overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083 (Bstorm)
[21:34:13] chrome extensions use a different protocol (chrome-extension://)
[21:34:27] (CR) Bstorm: [C: +2] "Since this is already cherry-picked in anyway." [puppet] - https://gerrit.wikimedia.org/r/605550 (https://phabricator.wikimedia.org/T255371) (owner: Hashar)
[21:34:29] so something like user-gadget:// would be helpful
[21:35:55] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:36:08] the only code we can load from index.php is user-generated code.
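The filtering approach described above (dropping client errors whose stack trace points at user scripts or browser extensions) could look roughly like this outside of the dashboard. This is only a sketch in Python: the `stack_trace` field name comes from the chat, while the record shape, the function names and the exact marker list are assumptions, and in practice the equivalent would be an exclusion filter in the dashboard query itself.

```python
# Substrings that, per the discussion, indicate user scripts loaded via
# importScript() or code injected by browser extensions.
EXCLUDED_MARKERS = ("/w/index.php", "/wiki/", "chrome-extension://")


def is_noise(record):
    """True if the record's stack trace mentions any excluded marker."""
    trace = record.get("stack_trace", "")
    return any(marker in trace for marker in EXCLUDED_MARKERS)


def filter_errors(records):
    """Keep only records whose stack trace has no excluded marker."""
    return [r for r in records if not is_noise(r)]


if __name__ == "__main__":
    sample = [
        # Stack frame quoted in the chat (a user script).
        {"stack_trace": " at https://en.wikipedia.org/w/index.php?title=User:Henrik/js/live-view-counter.js&action=raw&ctype=text/javascript:37:56"},
        # Hypothetical extension frame.
        {"stack_trace": " at chrome-extension://abcdef/content.js:1:1"},
        # Hypothetical frame from code served by load.php.
        {"stack_trace": " at https://en.wikipedia.org/w/load.php?modules=example:10:5"},
    ]
    for rec in filter_errors(sample):
        print(rec["stack_trace"])
```

As noted in the chat, this still leaves gadgets and the entry points to user scripts in the results.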
[21:36:36] it seems useful to have a url there that one could open if you wanted to :)
[21:38:55] (PS1) QChris: gerrit: Fix comment for enableReverseDnsLookup [puppet] - https://gerrit.wikimedia.org/r/606018
[22:04:36] (PS1) Reedy: Revert "Workaround for GenerateFancyCaptcha not running as expected in prod" [puppet] - https://gerrit.wikimedia.org/r/606021 (https://phabricator.wikimedia.org/T230245)
[22:05:46] (CR) jerkins-bot: [V: -1] Revert "Workaround for GenerateFancyCaptcha not running as expected in prod" [puppet] - https://gerrit.wikimedia.org/r/606021 (https://phabricator.wikimedia.org/T230245) (owner: Reedy)
[22:06:32] (PS2) Reedy: Revert "Workaround for GenerateFancyCaptcha not running as expected in prod" [puppet] - https://gerrit.wikimedia.org/r/606021 (https://phabricator.wikimedia.org/T230245)
[22:06:51] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[22:12:10] (PS1) Dzahn: site: add releases role to releases1002/2002 [puppet] - https://gerrit.wikimedia.org/r/606022 (https://phabricator.wikimedia.org/T247652)
[22:12:10] (CR) jerkins-bot: [V: -1] site: add releases role to releases1002/2002 [puppet] - https://gerrit.wikimedia.org/r/606022 (https://phabricator.wikimedia.org/T247652) (owner: Dzahn)
[22:12:17] (PS1) BryanDavis: Pass `--canonical` to webservice-runner inside k8s pod [software/tools-webservice] - https://gerrit.wikimedia.org/r/606023 (https://phabricator.wikimedia.org/T254640)
[22:23:30] (CR) BryanDavis: Pass `--canonical` to webservice-runner inside k8s pod (5 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/606023
(https://phabricator.wikimedia.org/T254640) (owner: BryanDavis)
[22:25:01] (PS1) Krinkle: noc: Consistently use "section" in reference to our database grouping [mediawiki-config] - https://gerrit.wikimedia.org/r/606024
[22:27:26] (CR) Krinkle: [C: +2] noc: Consistently use "section" in reference to our database grouping [mediawiki-config] - https://gerrit.wikimedia.org/r/606024 (owner: Krinkle)
[22:28:12] (Merged) jenkins-bot: noc: Consistently use "section" in reference to our database grouping [mediawiki-config] - https://gerrit.wikimedia.org/r/606024 (owner: Krinkle)
[22:31:07] (CR) Krinkle: [C: +2] "Tested on mwmain1002/mwmaint2001" [mediawiki-config] - https://gerrit.wikimedia.org/r/606024 (owner: Krinkle)
[22:31:57] !log krinkle@deploy1001 Synchronized docroot/noc: (no justification provided) (duration: 01m 05s)
[22:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:57] !log krinkle@deploy1001 Synchronized src/Noc/: (no justification provided) (duration: 01m 04s)
[22:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:06] (CR) Bstorm: [C: +1] "Looks good and it doesn't pop when doing some basic things in manual qa either :)" [software/tools-webservice] - https://gerrit.wikimedia.org/r/606023 (https://phabricator.wikimedia.org/T254640) (owner: BryanDavis)
[22:44:37] (CR) BryanDavis: [C: +2] Pass `--canonical` to webservice-runner inside k8s pod [software/tools-webservice] - https://gerrit.wikimedia.org/r/606023 (https://phabricator.wikimedia.org/T254640) (owner: BryanDavis)
[22:45:16] (Merged) jenkins-bot: Pass `--canonical` to webservice-runner inside k8s pod [software/tools-webservice] - https://gerrit.wikimedia.org/r/606023 (https://phabricator.wikimedia.org/T254640) (owner: BryanDavis)
[22:48:56] Operations, Traffic: HTML Dumps 429 error on RESTBase endpoints - https://phabricator.wikimedia.org/T255524 (CDanis)
@Kelson I don't believe this affects Kiwix, this is about a different kind of dumps. @RBrounley_WMF If you made your requests from a [[ https://wikitech.wikimedia.org/wiki/Portal:Cloud_VP...
[22:51:17] (PS1) BryanDavis: d/changelog: prepare for 0.72 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/606029
[22:51:37] (CR) BryanDavis: [C: +2] d/changelog: prepare for 0.72 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/606029 (owner: BryanDavis)
[22:52:26] (Merged) jenkins-bot: d/changelog: prepare for 0.72 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/606029 (owner: BryanDavis)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200616T2300).
[23:00:04] maryum: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:14:23] Operations, ops-codfw, serviceops, Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (Papaul) ` [edit interfaces interface-range vlan-private1-a-codfw] member ge-5/0/8 { ... } +...
[23:14:44] Operations, ops-codfw, serviceops, Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (Papaul)
[23:15:34] Operations, ops-codfw, serviceops, Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (Papaul) @akosiaris switch configuration done for all servers.
[23:17:56] (PS1) CRusnov: hiera: Fix value for netbox-dev2001 report alert list [puppet] - https://gerrit.wikimedia.org/r/606034
[23:22:19] is anyone around for the backport window?
[23:22:27] (CR) CRusnov: [C: +2] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/606034 (owner: CRusnov)
[23:23:06] I can ship it
[23:23:59] (PS3) EBernhardson: Update ML models [mediawiki-config] - https://gerrit.wikimedia.org/r/595019 (https://phabricator.wikimedia.org/T219534) (owner: Mstyles)
[23:24:17] (CR) EBernhardson: [C: +2] Update ML models [mediawiki-config] - https://gerrit.wikimedia.org/r/595019 (https://phabricator.wikimedia.org/T219534) (owner: Mstyles)
[23:25:11] (Merged) jenkins-bot: Update ML models [mediawiki-config] - https://gerrit.wikimedia.org/r/595019 (https://phabricator.wikimedia.org/T219534) (owner: Mstyles)
[23:25:59] maryum: pulled to mwdebug1002
[23:26:54] ebernhardson: great, testing now
[23:30:29] I'm looking at https://ko.wikipedia.org/w/index.php?search=hello&title=%ED%8A%B9%EC%88%98%3A%EA%B2%80%EC%83%89&fulltext=1&ns0=1&cirrusUserTesting=mlr-2020-test and https://ja.wikipedia.org/w/index.php?search=hello&title=%E7%89%B9%E5%88%A5%3A%E6%A4%9C%E7%B4%A2&fulltext=1&ns0=1&cirrusUserTesting=mlr-2020-test
[23:30:33] (PS1) CRusnov: hiera: add netbox-dev2001 as the first host for scap deploys [puppet] - https://gerrit.wikimedia.org/r/606036
[23:30:48] ebernhardson: but not sure how to see config
[23:31:03] (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/606036 (owner: CRusnov)
[23:31:14] maryum: from the patch, it should be zh and ko we are concerned with, ja was removed?
[23:31:20] oh right
[23:32:04] (CR) CRusnov: [C: +2] hiera: add netbox-dev2001 as the first host for scap deploys [puppet] - https://gerrit.wikimedia.org/r/606036 (owner: CRusnov)
[23:32:42] since those return results and don't error it should be fine.
I also verified by appending `&cirrusDebugQuery` that the sltr query we issue references the new models, essentially verifying it's invoking what we think it is.
[23:32:46] lgtm
[23:33:00] okay great
[23:33:06] nothing has exploded
[23:33:12] err, cirrusDumpQuery, not cirrusDebugQuery
[23:34:19] !log ebernhardson@deploy1001 sync-file aborted: cirrus: update ML models for ko and zh, drop ja (duration: 00m 04s)
[23:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:23] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cirrus: update ML models for ko and zh, drop ja (duration: 01m 00s)
[23:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:16] alright, BACON should be complete
[23:43:00] Operations, DC-Ops, decommission-hardware, Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821 (wiki_willy) Open→Resolved Remaining Cisco servers picked up by Cisco today. Thanks @Jclark-ctr !
[23:43:03] (PS1) Krinkle: logging: Fold Parsoid into type:mediawiki, add 'servergroup:' instead [mediawiki-config] - https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627)
[23:43:33] !log crusnov@deploy1001 Started deploy [netbox/deploy@5251cf1]: Deploying Netbox to netbox-dev T253140
[23:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:38] T253140: Create Scratch instance of Netbox - https://phabricator.wikimedia.org/T253140
[23:43:38] !log crusnov@deploy1001 Finished deploy [netbox/deploy@5251cf1]: Deploying Netbox to netbox-dev T253140 (duration: 00m 05s)
[23:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:57] (CR) Krinkle: [C: -1] "Marking as WIP until agreement on the task from Parsing team, just in case I missed something automated that might need updating first or " [mediawiki-config] - https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: Krinkle)