[00:00:05] twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T0000).
[00:33:12] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Ladsgroup)
[00:33:35] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (Ladsgroup) Open→Resolved a:Ladsgroup Let's call it resolved. Wikitech-l is one of our oldest and big...
[00:56:48] (CR) Ottomata: [C: +2] Release 2020.02~wmf5 [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/677669 (https://phabricator.wikimedia.org/T279480) (owner: Ottomata)
[00:56:50] (CR) Ottomata: [V: +2 C: +2] Release 2020.02~wmf5 [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/677669 (https://phabricator.wikimedia.org/T279480) (owner: Ottomata)
[01:30:40] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 137.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37
[01:32:36] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is CRITICAL: 153.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37
[01:50:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:52:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:52:56] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 109.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37
[02:07:06] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[02:50:28] !log Restarted importMissingLocalNames.php (mwmaint 1002, wiki=metawiki,batch-size=1000)
[02:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:52:02] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[03:27:44] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[03:43:44] (PS1) Krinkle: [Beta Cluster] Disable wgEnableWANCacheReaper experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/677731
[03:47:31] (PS1) Krinkle: [Beta Cluster] mc: Use new 'wanRoutingPrefix' option [mediawiki-config] - https://gerrit.wikimedia.org/r/677732
[03:47:33] (PS1) Krinkle: [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677733
[03:47:35] (PS1) Krinkle: mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677734
[03:47:44] (PS2) Krinkle: [Beta Cluster] mc: Use new 'wanRoutingPrefix' option [mediawiki-config] - https://gerrit.wikimedia.org/r/677732
[03:47:46] (PS2) Krinkle: mc: Add 'wanRoutingPrefix' (replaces 'mcrouterAware' and 'cluster') [mediawiki-config] - https://gerrit.wikimedia.org/r/677418
[03:47:48] (PS2) Krinkle: [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677733
[03:47:50] (PS2) Krinkle: mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677734
[04:15:42] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[04:19:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:21:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:36:24] SRE, Traffic, netops, Performance-Team (Radar): experiment with reenabling compression between applayer's TLS terminators and edge caches - https://phabricator.wikimedia.org/T263288 (Krinkle)
[05:00:02] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[05:12:19] SRE, DBA, Platform Engineering, Wikimedia-Incident: Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (Marostegui) I am not fully sure I am reading the disk space graph correctly as I don't see an increase there. There's surely an increase on th...
[05:15:11] (PS2) KartikMistry: Update cxserver to 2021-04-07-062518-production [deployment-charts] - https://gerrit.wikimedia.org/r/677557 (https://phabricator.wikimedia.org/T278141)
[05:31:28] (CR) KartikMistry: [C: +2] Update cxserver to 2021-04-07-062518-production [deployment-charts] - https://gerrit.wikimedia.org/r/677557 (https://phabricator.wikimedia.org/T278141) (owner: KartikMistry)
[05:39:37] (Merged) jenkins-bot: Update cxserver to 2021-04-07-062518-production [deployment-charts] - https://gerrit.wikimedia.org/r/677557 (https://phabricator.wikimedia.org/T278141) (owner: KartikMistry)
[05:42:52] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:43:17] !log kartik@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[05:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:50] SRE, DBA, Platform Engineering, Wikimedia-Incident: Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (Marostegui) I have done some testing with pc000 in a testing host. Deleted everything under 20 days so simulating that we only keep 20 days in...
[05:45:04] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:54:52] !log kartik@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[05:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:20] !log kartik@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[05:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:14] !log Updated cxserver to 2021-04-07-062518-production (T278141, T263139, T271711, T201491, T240525, T207662)
[06:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:28] T240525: cxserver: Update to core-js@3 - https://phabricator.wikimedia.org/T240525
[06:01:28] T278141: cxserver missing important metrics after service-runner 2.8.1 upgrade - https://phabricator.wikimedia.org/T278141
[06:01:29] T201491: Fix common typos in code - https://phabricator.wikimedia.org/T201491
[06:01:29] T207662: MT processing error: TypeError: key.trim is not a function - https://phabricator.wikimedia.org/T207662
[06:01:29] T271711: Update cxserver to service-runner 2.8.1 - https://phabricator.wikimedia.org/T271711
[06:01:29] T263139: Show section placeholder before "References" and similar sections - https://phabricator.wikimedia.org/T263139
[06:03:34] That's lots of fixes :)
[06:15:39] (CR) Elukey: hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH (1 comment) [puppet] - https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) (owner: Elukey)
[06:17:09] (PS2) Elukey: hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH [puppet] - https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906)
[06:17:21] (CR) Elukey: hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH (1 comment) [puppet] - https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) (owner: Elukey)
[06:25:06] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28946/console" [puppet] - https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) (owner: Elukey)
[06:28:27] (PS11) DharmrajRathod98: Improved: regex-validation in cli/recover-dump and added unit test file in test/unit [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754)
[06:32:20] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 101.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[06:33:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111 to clone db1177 T275633', diff saved to https://phabricator.wikimedia.org/P15229 and previous config saved to /var/cache/conftool/dbconfig/20210408-063331-marostegui.json
[06:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:40] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[06:33:48] !log Stop MySQL on db1111 to clone db1177 T275633
[06:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:46] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[06:39:02] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:41:06] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:41:45] !log elukey@deploy1002 Started deploy [analytics/refinery@1dbbd3d] (hadoop-test): (no justification provided)
[06:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:51] (CR) Elukey: [V: +1 C: +2] hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH [puppet] - https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) (owner: Elukey)
[06:44:05] !log elukey@deploy1002 Finished deploy [analytics/refinery@1dbbd3d] (hadoop-test): (no justification provided) (duration: 02m 20s)
[06:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:47] (PS1) Marostegui: mariadb: Add db1177 to s8. [puppet] - https://gerrit.wikimedia.org/r/677799 (https://phabricator.wikimedia.org/T275633)
[06:56:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1023 to upgrade kernel and mysql, remove weight from es1021, to leave it as it was yesterday T279281', diff saved to https://phabricator.wikimedia.org/P15231 and previous config saved to /var/cache/conftool/dbconfig/20210408-065627-marostegui.json
[06:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:35] T279281: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281
[06:57:10] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[06:59:28] (PS2) Abijeet Patro: Rename wgTranslateBlacklist to wgTranslateExclusionList [mediawiki-config] - https://gerrit.wikimedia.org/r/676909 (https://phabricator.wikimedia.org/T277965)
[07:06:08] (PS1) Elukey: hadoop: improve the HDFS Namenode audit log4j config [puppet] - https://gerrit.wikimedia.org/r/677803 (https://phabricator.wikimedia.org/T276906)
[07:06:52] (PS2) Elukey: hadoop: improve the HDFS Namenode audit log4j config [puppet] - https://gerrit.wikimedia.org/r/677803 (https://phabricator.wikimedia.org/T276906)
[07:09:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 25%: Repool es1023', diff saved to https://phabricator.wikimedia.org/P15232 and previous config saved to /var/cache/conftool/dbconfig/20210408-070946-root.json
[07:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:34] (PS2) Muehlenhoff: Assign mw_rc_irc role to irc1001 [puppet] - https://gerrit.wikimedia.org/r/677509 (https://phabricator.wikimedia.org/T278255)
[07:16:56] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:17:20] looking ^
[07:18:44] (PS1) Alexandros Kosiaris: deployment_server: Unify 2 "admin" if exlucsions via filter() [puppet] - https://gerrit.wikimedia.org/r/677805
[07:19:05] (CR) Alexandros Kosiaris: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/677228 (https://phabricator.wikimedia.org/T268434) (owner: JMeybohm)
[07:19:12] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_nodes: 6, active_shards: 1877, number_of_data_nodes: 6, unassigned_shards: 0, initializing_shards: 0, cluster_name: cloudelastic-chi-eqiad, relocating_shards: 0, timed_out: False, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0, status: green, delayed_unassigne
[07:19:12] er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, active_primary_shards: 937 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:20:04] (CR) Muehlenhoff: [C: +2] Assign mw_rc_irc role to irc1001 [puppet] - https://gerrit.wikimedia.org/r/677509 (https://phabricator.wikimedia.org/T278255) (owner: Muehlenhoff)
[07:20:30] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28947/console" [puppet] - https://gerrit.wikimedia.org/r/677805 (owner: Alexandros Kosiaris)
[07:21:53] (CR) Alexandros Kosiaris: "Thanks. I 've submitted a small followup is https://gerrit.wikimedia.org/r/c/operations/puppet/+/677805 that should make this a bit more r" [puppet] - https://gerrit.wikimedia.org/r/677667 (owner: Legoktm)
[07:23:33] (CR) Ema: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28948/console" [puppet] - https://gerrit.wikimedia.org/r/677580 (https://phabricator.wikimedia.org/T279533) (owner: Ema)
[07:24:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 50%: Repool es1023', diff saved to https://phabricator.wikimedia.org/P15233 and previous config saved to /var/cache/conftool/dbconfig/20210408-072450-root.json
[07:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:47] (PS1) Muehlenhoff: Broadcase IRC events to irc1001 instead of kraz [mediawiki-config] - https://gerrit.wikimedia.org/r/677806 (https://phabricator.wikimedia.org/T224579)
[07:26:25] (PS2) Muehlenhoff: Broadcast IRC events to irc1001 instead of kraz [mediawiki-config] - https://gerrit.wikimedia.org/r/677806 (https://phabricator.wikimedia.org/T224579)
[07:27:34] (CR) Ema: [V: +1 C: +2] vlc: get exp cache admission policy parameters from hiera [puppet] - https://gerrit.wikimedia.org/r/677580 (https://phabricator.wikimedia.org/T279533) (owner: Ema)
[07:34:57] (PS1) Muehlenhoff: Only install git-fat for distros up to Buster [puppet] - https://gerrit.wikimedia.org/r/677807 (https://phabricator.wikimedia.org/T275873)
[07:35:16] (CR) JMeybohm: [C: +1] "Thanks all for taking care of my mess 🙏" [puppet] - https://gerrit.wikimedia.org/r/677805 (owner: Alexandros Kosiaris)
[07:35:21] (CR) Elukey: [C: +2] hadoop: improve the HDFS Namenode audit log4j config [puppet] - https://gerrit.wikimedia.org/r/677803 (https://phabricator.wikimedia.org/T276906) (owner: Elukey)
[07:39:40] (PS1) Muehlenhoff: New component for PostGIS 3 backport [puppet] - https://gerrit.wikimedia.org/r/677808 (https://phabricator.wikimedia.org/T277064)
[07:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 75%: Repool es1023', diff saved to https://phabricator.wikimedia.org/P15234 and previous config saved to /var/cache/conftool/dbconfig/20210408-073953-root.json
[07:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:30] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1020.eqiad.wmnet
[07:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175 for schema change', diff saved to https://phabricator.wikimedia.org/P15235 and previous config saved to /var/cache/conftool/dbconfig/20210408-074524-marostegui.json
[07:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:44] (PS1) Ema: varnish: test setting exp policy parameters in labs [puppet] - https://gerrit.wikimedia.org/r/677810 (https://phabricator.wikimedia.org/T279533)
[07:49:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=udpmxircecho site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:49:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es5 master', diff saved to https://phabricator.wikimedia.org/P15236 and previous config saved to /var/cache/conftool/dbconfig/20210408-074911-marostegui.json
[07:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:40] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1020.eqiad.wmnet
[07:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:22] (CR) Marostegui: [C: +2] mariadb: Add db1177 to s8. [puppet] - https://gerrit.wikimedia.org/r/677799 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[07:51:28] (PS2) Marostegui: mariadb: Add db1177 to s8. [puppet] - https://gerrit.wikimedia.org/r/677799 (https://phabricator.wikimedia.org/T275633)
[07:53:51] (PS2) Ema: varnish: test setting exp policy parameters in labs [puppet] - https://gerrit.wikimedia.org/r/677810 (https://phabricator.wikimedia.org/T279533)
[07:54:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 100%: Repool es1023', diff saved to https://phabricator.wikimedia.org/P15237 and previous config saved to /var/cache/conftool/dbconfig/20210408-075457-root.json
[07:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:28] PROBLEM - ircecho bot process on irc1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho
[07:58:32] (CR) Ema: [C: +2] varnish: test setting exp policy parameters in labs [puppet] - https://gerrit.wikimedia.org/r/677810 (https://phabricator.wikimedia.org/T279533) (owner: Ema)
[08:03:12] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet
[08:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:22] (CR) Alexandros Kosiaris: [V: +1 C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/677805 (owner: Alexandros Kosiaris)
[08:03:53] SRE, Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (ema)
[08:04:08] SRE, Traffic, Patch-For-Review: Add exp cache admission policy parameters to hiera - https://phabricator.wikimedia.org/T279533 (ema) Open→Resolved After changing `exp_policy_rate` and `exp_policy_base` in hiera for traffic-cache-atstext-buster, the rendered VCL now looks like this: ` +// Incl...
[08:04:29] (PS1) Alexandros Kosiaris: kubernetes1017: Add kubelet node labels [puppet] - https://gerrit.wikimedia.org/r/677811
[08:04:57] (CR) Muehlenhoff: [C: +2] New component for PostGIS 3 backport [puppet] - https://gerrit.wikimedia.org/r/677808 (https://phabricator.wikimedia.org/T277064) (owner: Muehlenhoff)
[08:05:21] (CR) Alexandros Kosiaris: [C: +2] kubernetes1017: Add kubelet node labels [puppet] - https://gerrit.wikimedia.org/r/677811 (owner: Alexandros Kosiaris)
[08:05:45] akosiaris: shall I merge your patch along?
[08:05:54] moritzm: merge mine as well please :-)
[08:06:10] done
[08:06:16] danke!
[08:06:55] (CR) David Caro: [C: +1] "No problem" [puppet] - https://gerrit.wikimedia.org/r/677663 (https://phabricator.wikimedia.org/T276509) (owner: Andrew Bogott)
[08:08:50] (PS4) Alexandros Kosiaris: Segment values.yaml between teams [deployment-charts] - https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: Elukey)
[08:09:01] (CR) jerkins-bot: [V: -1] Segment values.yaml between teams [deployment-charts] - https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: Elukey)
[08:10:56] (CR) Alexandros Kosiaris: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: Elukey)
[08:11:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15238 and previous config saved to /var/cache/conftool/dbconfig/20210408-081059-root.json
[08:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:07] (CR) jerkins-bot: [V: -1] Segment values.yaml between teams [deployment-charts] - https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: Elukey)
[08:12:21] (CR) Alexandros Kosiaris: [C: +2] admin: Switch values/values.yaml to common.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/676935 (owner: Alexandros Kosiaris)
[08:13:39] (Merged) jenkins-bot: admin: Switch values/values.yaml to common.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/676935 (owner: Alexandros Kosiaris)
[08:14:14] (PS3) Alexandros Kosiaris: admin: Switch usages to internal kubernetes API, with exceptions [deployment-charts] - https://gerrit.wikimedia.org/r/676936
[08:14:16] (PS5) Alexandros Kosiaris: Segment values.yaml between teams [deployment-charts] - https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: Elukey)
[08:15:00] (PS8) Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059)
[08:15:12] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2028.codfw.wmnet
[08:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:51] !log imported postgis 3.1.1+dfsg-1~wmf1 to component/postgis for buster-wikimedia T277064
[08:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:01] T277064: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064
[08:18:36] (CR) Alexandros Kosiaris: [C: +1] hieradata: remove parsoidJS from production 3 [puppet] - https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: Effie Mouzeli)
[08:19:58] (CR) Alexandros Kosiaris: "Tested, noop as expected" [deployment-charts] - https://gerrit.wikimedia.org/r/676935 (owner: Alexandros Kosiaris)
[08:21:14] SRE, Maps, Packaging, serviceops: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (MoritzMuehlenhoff)
[08:21:20] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 66.21 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[08:22:10] PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:22:16] PROBLEM - Unmerged changes on repository puppet on puppetmaster1003 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:24:12] RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:24:18] RECOVERY - Unmerged changes on repository puppet on puppetmaster1003 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:24:50] !log Stop MySQL on all db1117 sections to upgrade kernel
[08:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:58] ^ this will cause haproxy irc alerts
[08:26:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15239 and previous config saved to /var/cache/conftool/dbconfig/20210408-082603-root.json
[08:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:38] (CR) Alexandros Kosiaris: [C: +2] admin: Switch usages to internal kubernetes API, with exceptions [deployment-charts] - https://gerrit.wikimedia.org/r/676936 (owner: Alexandros Kosiaris)
[08:27:34] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%): /tmp 0 MB (0% inode=87%): /var/tmp 0 MB (0% inode=87%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[08:28:05] (Merged) jenkins-bot: admin: Switch usages to internal kubernetes API, with exceptions [deployment-charts] - https://gerrit.wikimedia.org/r/676936 (owner: Alexandros Kosiaris)
[08:28:12] SRE, Scap, Python3-Porting: Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (MoritzMuehlenhoff)
[08:28:22] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:28:36] ^ expected
[08:28:38] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:28:54] ^ same
[08:29:18] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:29:20] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:31:45] (CR) Filippo Giunchedi: "LGTM, nice work! (despite the curator version) just a few nits inline" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) (owner: Cwhite)
[08:33:52] !log installing remaining curl security updates for buster
[08:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:56] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:34:58] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:35:12] (CR) JMeybohm: [C: -1] "Wow, that's quite some chart! 😊" (37 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: Giuseppe Lavagetto)
[08:35:16] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:35:36] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:36:07] (PS9) Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059)
[08:37:57] !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:38:03] !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:31] (PS1) Ema: varnish: add script to test exp policy offline [puppet] - https://gerrit.wikimedia.org/r/677814 (https://phabricator.wikimedia.org/T275809)
[08:38:43] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[08:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:43] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[08:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:18] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[08:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:06] (CR) Ema: [C: +2] varnish: add script to test exp policy offline [puppet] - https://gerrit.wikimedia.org/r/677814 (https://phabricator.wikimedia.org/T275809) (owner: Ema)
[08:41:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15240 and previous config saved to /var/cache/conftool/dbconfig/20210408-084107-root.json
[08:41:14] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[08:41:14] RECOVERY - ircecho bot process on irc1001 is OK: PROCS OK: 1 process with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho
[08:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:56] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[08:43:50] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[08:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:17] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[08:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:50] (CR) David Caro: [C: +2] ceph: run tests on debian 10 buster [puppet] - https://gerrit.wikimedia.org/r/677307 (owner: David Caro)
[08:46:52] PROBLEM - ircecho bot process on irc1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho
[08:47:22] (PS1) Elukey: jupyter: avoid logs to syslog/daemon.log for jupyterhub [puppet] - https://gerrit.wikimedia.org/r/677816
[08:47:34] (CR) Alexandros Kosiaris: [C: +2] "> Patch Set 3:" [deployment-charts] - https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: Elukey)
[08:47:49] akosiaris: \o/ thanks!
[08:48:05] elukey: prego
[08:48:23] thanks for tackling it
[08:48:32] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28949/console" [puppet] - https://gerrit.wikimedia.org/r/677816 (owner: Elukey)
[08:48:36] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[08:48:51] (Merged) jenkins-bot: Segment values.yaml between teams [deployment-charts] - https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: Elukey)
[08:49:21] (CR) Elukey: [V: +1 C: +2] jupyter: avoid logs to syslog/daemon.log for jupyterhub [puppet] - https://gerrit.wikimedia.org/r/677816 (owner: Elukey)
[08:49:53] dcaro: o/ ok to merge?
[08:50:09] elukey: yep, thanks [08:50:20] done :) [08:56:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15241 and previous config saved to /var/cache/conftool/dbconfig/20210408-085610-root.json [08:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 for schema change', diff saved to https://phabricator.wikimedia.org/P15242 and previous config saved to /var/cache/conftool/dbconfig/20210408-085630-marostegui.json [08:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:37] (03PS1) 10Ema: cache: enable exp caching policy on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/677820 (https://phabricator.wikimedia.org/T275809) [09:02:06] (03CR) 10JMeybohm: [C: 04-1] "Oh, wait: Switching the docker cgroup driver to systemd means we need to/should also switch the kubelet cgroup driver to systemd to have t" [puppet] - 10https://gerrit.wikimedia.org/r/524186 (https://phabricator.wikimedia.org/T277876) (owner: 10Alexandros Kosiaris) [09:02:53] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677820 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [09:04:22] (03PS1) 10Elukey: jupyter: simplify the cron script to clean up user Trash [puppet] - 10https://gerrit.wikimedia.org/r/677822 [09:05:25] (03CR) 10jerkins-bot: [V: 04-1] jupyter: simplify the cron script to clean up user Trash [puppet] - 10https://gerrit.wikimedia.org/r/677822 (owner: 10Elukey) [09:09:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud email alerts: remove f-strings in case of stretch vms [puppet] - 10https://gerrit.wikimedia.org/r/677599 (owner: 10Bstorm) [09:09:56] !log installing underscore security updates on stretch [09:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:25] 
(03PS2) 10Elukey: jupyter: simplify the cron script to clean up user Trash [puppet] - 10https://gerrit.wikimedia.org/r/677822 [09:12:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28950/console" [puppet] - 10https://gerrit.wikimedia.org/r/677822 (owner: 10Elukey) [09:14:07] (03CR) 10Ema: [C: 03+2] cache: enable exp caching policy on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/677820 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [09:14:33] !log akosiaris@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:56] !log akosiaris@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:28] (03PS10) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [09:17:06] (03PS3) 10Elukey: jupyter: simplify the cron script to clean up user Trash [puppet] - 10https://gerrit.wikimedia.org/r/677822 [09:17:44] (03CR) 10jerkins-bot: [V: 04-1] hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [09:18:24] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28952/console" [puppet] - 10https://gerrit.wikimedia.org/r/677822 (owner: 10Elukey) [09:20:30] !log cp5001: varnish-frontend-restart to test exp policy settings starting from a empty cache T275809 [09:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:38] T275809: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 [09:21:04] (03PS11) 10Effie Mouzeli: 
hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [09:22:09] (03CR) 10jerkins-bot: [V: 04-1] hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [09:24:08] !log installing libzstd security updates on buster [09:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:03] There are some issues happening (old GC, and unassigned shards check timeout) on the cloudelastic nodes; I created T279636 and tagged it with 'elasticsearch'. If it's not the right one, please let me know [09:25:03] T279636: cloudelastic* timeout while checking shards - https://phabricator.wikimedia.org/T279636 [09:25:20] (03PS12) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [09:25:48] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@d098717]: T273847 export queries to relforge dag deployment - sensor name fix [09:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:56] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847 [09:26:08] (03CR) 10David Caro: [C: 03+1] cloud email alerts: remove f-strings in case of stretch vms [puppet] - 10https://gerrit.wikimedia.org/r/677599 (owner: 10Bstorm) [09:26:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15243 and previous config saved to /var/cache/conftool/dbconfig/20210408-092608-root.json [09:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:31] (03PS1) 10Ema: Revert "cache: enable exp caching policy on cp5001"
[puppet] - 10https://gerrit.wikimedia.org/r/677712 [09:27:00] (03CR) 10Effie Mouzeli: hieradata: remove parsoidJS from production 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [09:27:29] (03CR) 10Ema: [C: 03+2] Revert "cache: enable exp caching policy on cp5001" [puppet] - 10https://gerrit.wikimedia.org/r/677712 (owner: 10Ema) [09:27:36] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@d098717]: T273847 export queries to relforge dag deployment - sensor name fix (duration: 01m 48s) [09:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:45] (03CR) 10Effie Mouzeli: "PCC works https://puppet-compiler.wmflabs.org/compiler1002/28954/" [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [09:29:07] !log akosiaris@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:40] !log akosiaris@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [09:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:49] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10ayounsi) Ok, because of this RTF RMA we're going to replace the switch with a spare. @Papaul Let's chat on IRC to figure out what time would works best for you, then we can notify services owne... 
[09:30:23] !log installing openssl updates for buster [09:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Repool db1111', diff saved to https://phabricator.wikimedia.org/P15244 and previous config saved to /var/cache/conftool/dbconfig/20210408-093151-root.json [09:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:44] (03PS1) 10JMeybohm: Migrate kubernetes infrastructure_users to new syntax [labs/private] - 10https://gerrit.wikimedia.org/r/677825 (https://phabricator.wikimedia.org/T269461) [09:36:29] !log Retry server-side upload for T279192 [09:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:38] T279192: Server side upload for Sturm - https://phabricator.wikimedia.org/T279192 [09:36:42] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37 [09:38:58] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Migrate kubernetes infrastructure_users to new syntax [labs/private] - 10https://gerrit.wikimedia.org/r/677825 (https://phabricator.wikimedia.org/T269461) (owner: 10JMeybohm) [09:39:51] (03PS1) 10Marostegui: instances.yaml: Add db1177 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/677826 (https://phabricator.wikimedia.org/T275633) [09:40:41] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1177 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/677826 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [09:41:05] (03PS1) 10Urbanecm: Enable Growth for newcomers on simplewiki, mswiki, tawiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/677828 (https://phabricator.wikimedia.org/T278369) [09:41:12] Urbanecm: neat, thank you for retrying [09:41:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15246 and previous config saved to /var/cache/conftool/dbconfig/20210408-094112-root.json [09:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:38] godog: np. I tried it twice the other day, and both attempts failed, so I called you, sorry for bothering :-). [09:42:04] !log disable puppet in mw* servers for 677114 [09:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1177 to dbctl T275633', diff saved to https://phabricator.wikimedia.org/P15247 and previous config saved to /var/cache/conftool/dbconfig/20210408-094218-marostegui.json [09:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:27] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [09:42:47] Urbanecm: no bother, you did the right thing! I think it was temporary indeed due to the swift rebalance in eqiad I started the other day, it does get noisy at the beginning and some PUTs are known to fail [09:44:22] !log [urbanecm@mwmaint1002 ~/uploads]$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --sleep=3600 --user=Lusccasdeutsch . # T278856 [09:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:30] T278856: Server side upload for Lusccasdeutsch (master task) - https://phabricator.wikimedia.org/T278856 [09:44:47] godog: good to know :). 
Anyway, thanks for the help :) [09:45:28] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37 [09:46:43] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) [09:46:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Repool db1111', diff saved to https://phabricator.wikimedia.org/P15248 and previous config saved to /var/cache/conftool/dbconfig/20210408-094655-root.json [09:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:11] (03PS13) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [09:50:04] RECOVERY - ircecho bot process on irc1001 is OK: PROCS OK: 1 process with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [09:51:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/677807 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [09:52:07] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [09:53:49] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10jbond) > We're making it a part of the Ansible playbook that manages Gitlab installation. 
I believe you should h... [09:56:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15249 and previous config saved to /var/cache/conftool/dbconfig/20210408-095615-root.json [09:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:48] PROBLEM - ircecho bot process on irc1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [09:56:55] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:08] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:02] RECOVERY - ircecho bot process on irc1001 is OK: PROCS OK: 1 process with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [09:59:26] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/parsoid on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/parsoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:00:05] mvolz: Dear deployers, time to do the [[mw:Services|Services]] – [[mw:Citoid|Citoid]] / Zotero deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1000). 
[10:00:24] PROBLEM - Confd template for /srv/config-master/pybal/codfw/parsoid on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/parsoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:00:34] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/parsoid on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/parsoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:00:42] PROBLEM - Confd template for /srv/config-master/pybal/codfw/parsoid on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/parsoid is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:01:04] effie: ^ (I guess) [10:01:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Repool db1111', diff saved to https://phabricator.wikimedia.org/P15250 and previous config saved to /var/cache/conftool/dbconfig/20210408-100159-root.json [10:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:29] yes that is me [10:02:32] all me [10:03:01] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/675848 (owner: 10PipelineBot) [10:03:14] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/674936 (owner: 10PipelineBot) [10:03:19] (03PS1) 10Marostegui: instances.yaml: Add db1180 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/677831 (https://phabricator.wikimedia.org/T275633) [10:03:35] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, 10User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Michael) 05Open→03Stalled This is blocked by {T264822} [10:03:48] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/677558 (owner: 10Jbond) [10:03:53] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, 10User-Addshore: 
Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Michael) [10:03:58] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1180 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/677831 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [10:04:32] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, 10User-Addshore: 🛑 Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Michael) [10:04:36] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/675848 (owner: 10PipelineBot) [10:07:45] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [10:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1180 to dbctl T275633', diff saved to https://phabricator.wikimedia.org/P15251 and previous config saved to /var/cache/conftool/dbconfig/20210408-100829-marostegui.json [10:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:38] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:09:44] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@ff0137d]: T273847 export queries to relforge dag deployment - start date update [10:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:55] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847 [10:10:22] PROBLEM - ircecho bot process on irc1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [10:10:50] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[10:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15252 and previous config saved to /var/cache/conftool/dbconfig/20210408-101119-root.json [10:11:22] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@ff0137d]: T273847 export queries to relforge dag deployment - start date update (duration: 01m 37s) [10:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:00] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [10:13:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P15253 and previous config saved to /var/cache/conftool/dbconfig/20210408-101303-marostegui.json [10:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:53] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet [10:16:56] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . 
[10:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Repool db1111', diff saved to https://phabricator.wikimedia.org/P15254 and previous config saved to /var/cache/conftool/dbconfig/20210408-101702-root.json [10:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:19] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (10fgiunchedi) 05Resolved→03Open a:05fgiunchedi→03Papaul @papaul I'm running into troubles with the disk I haven't seen before (xfs crashes after a while, log below). Can we try another spare disk just to exclude... [10:23:11] (03PS1) 10Muehlenhoff: ircecho: Install python-prometheus-client [puppet] - 10https://gerrit.wikimedia.org/r/677834 [10:25:05] 10SRE, 10Proton, 10Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10Jgiannelos) It looks like native prometheus metrics are now exposed in the service. That said we may still need to adapt the grafana dashboard because the metrics names might... [10:25:48] RECOVERY - ircecho bot process on irc1001 is OK: PROCS OK: 1 process with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [10:27:15] (03CR) 10Effie Mouzeli: "After merging this, some confd/etcd alerts will pop up. Puppet should be run first on the puppetmasters, then on the icinga servers, and o" [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [10:27:18] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[10:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:43] !log enable puppet on all mw* servers [10:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15255 and previous config saved to /var/cache/conftool/dbconfig/20210408-102855-marostegui.json [10:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! Let's sync up before merging, I'll run a few tests after it has been merged." [puppet] - 10https://gerrit.wikimedia.org/r/677292 (owner: 10Jbond) [10:30:01] (03PS3) 10Effie Mouzeli: hieradata: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) [10:30:21] (03CR) 10jerkins-bot: [V: 04-1] hieradata: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [10:30:32] !log Upgrade kernel on db1118 [10:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:49] (03PS4) 10Effie Mouzeli: hieradata: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) [10:31:09] (03CR) 10Muehlenhoff: [C: 03+2] Only install git-fat for distros up to Buster [puppet] - 10https://gerrit.wikimedia.org/r/677807 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [10:31:11] (03CR) 10jerkins-bot: [V: 04-1] hieradata: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [10:32:17] (03PS5) 10Effie Mouzeli: hieradata: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677118 
(https://phabricator.wikimedia.org/T279059) [10:32:37] !log enable sampling on cr1-codfw:fpc0 [10:32:37] (03CR) 10jerkins-bot: [V: 04-1] hieradata: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [10:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:33] (03Abandoned) 10Effie Mouzeli: hieradata: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [10:36:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:37:20] (03PS2) 10Ayounsi: Automatically enable sampling on all FPCs [homer/public] - 10https://gerrit.wikimedia.org/r/636392 (https://phabricator.wikimedia.org/T257392) [10:37:28] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[10:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:16] (03Abandoned) 10Effie Mouzeli: hieradata: remove parsoidJS from production 5 [puppet] - 10https://gerrit.wikimedia.org/r/677119 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [10:38:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: Repool db1118 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15256 and previous config saved to /var/cache/conftool/dbconfig/20210408-103821-root.json [10:38:24] (03CR) 10Ayounsi: [C: 03+2] Automatically enable sampling on all FPCs [homer/public] - 10https://gerrit.wikimedia.org/r/636392 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [10:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:03] (03Merged) 10jenkins-bot: Automatically enable sampling on all FPCs [homer/public] - 10https://gerrit.wikimedia.org/r/636392 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [10:40:21] (03PS1) 10Effie Mouzeli: profile::parsoid: remove parsoidJS module from parsoid profile [puppet] - 10https://gerrit.wikimedia.org/r/677837 [10:40:27] !log Upgrade db2085's kernel [10:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:25] (03CR) 10jerkins-bot: [V: 04-1] profile::parsoid: remove parsoidJS module from parsoid profile [puppet] - 10https://gerrit.wikimedia.org/r/677837 (owner: 10Effie Mouzeli) [10:41:50] !log enable sampling on all routers FPCs [10:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me! Likewise, let's sync when merging for some tests in parallel." 
[puppet] - 10https://gerrit.wikimedia.org/r/677506 (owner: 10Jbond) [10:44:26] (03PS1) 10JMeybohm: k8s_infrastructure_users: Remove special case for old schema [puppet] - 10https://gerrit.wikimedia.org/r/677839 (https://phabricator.wikimedia.org/T269461) [10:46:56] 10SRE, 10netops, 10Patch-For-Review: automatically sample from all FPCs on core routers - https://phabricator.wikimedia.org/T257392 (10ayounsi) 05Open→03Resolved a:03ayounsi One more thing automated from Netbox. [10:47:04] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28957/console" [puppet] - 10https://gerrit.wikimedia.org/r/677839 (https://phabricator.wikimedia.org/T269461) (owner: 10JMeybohm) [10:47:07] !log disable puppet on parsoid* servers [10:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:09] (03PS2) 10Effie Mouzeli: profile::parsoid: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677837 (https://phabricator.wikimedia.org/T677119) [10:52:29] jouncebot: next [10:52:29] In 0 hour(s) and 7 minute(s): [[Backport windows|EU Backport and Config training]]
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1100) [10:53:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/677510 (https://phabricator.wikimedia.org/T273673) (owner: 10Jbond) [10:53:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: Repool db1118 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15257 and previous config saved to /var/cache/conftool/dbconfig/20210408-105324-root.json [10:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:15] (03CR) 10Muehlenhoff: "Looks good, but we should also remove" [puppet] - 10https://gerrit.wikimedia.org/r/677514 (owner: 10Jbond) [10:57:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s_infrastructure_users: Remove special case for old schema [puppet] - 10https://gerrit.wikimedia.org/r/677839 (https://phabricator.wikimedia.org/T269461) (owner: 10JMeybohm) [10:57:49] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s_infrastructure_users: Remove special case for old schema [puppet] - 10https://gerrit.wikimedia.org/r/677839 (https://phabricator.wikimedia.org/T269461) (owner: 10JMeybohm) [10:58:19] (03PS3) 10Effie Mouzeli: profile::parsoid: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677837 (https://phabricator.wikimedia.org/T677119) [10:59:08] PROBLEM - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:59:10] (03PS1) 10Elukey: hadoop: fix log4j audit log max file size [puppet] - 10https://gerrit.wikimedia.org/r/677845 [10:59:10] ACKNOWLEDGEMENT - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7,
1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T279644 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:59:14] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279644 (10ops-monitoring-bot) [10:59:18] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/677846 [11:00:05] Amir1, apergos, and duesen: Dear deployers, time to do the [[Backport windows|EU Backport and Config training]]
'''''' deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1100). [11:01:13] there don't seem to be any changes listed in the window fwiw (maybe due to tomorrow's wmf holiday?) [11:01:20] (03CR) 10Elukey: [C: 03+2] hadoop: fix log4j audit log max file size [puppet] - 10https://gerrit.wikimedia.org/r/677845 (owner: 10Elukey) [11:02:58] 10SRE, 10Traffic, 10User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (10ayounsi) Even with the current rate limiting, some crawling are regularly causing issues, wasting precious SRE time. I'd like to revisit this task to be mor... [11:03:37] (03CR) 10Effie Mouzeli: "@Αλέξανδρος Κοσιάρης, deploy-service sudo block should be removed here after all" [puppet] - 10https://gerrit.wikimedia.org/r/677837 (https://phabricator.wikimedia.org/T677119) (owner: 10Effie Mouzeli) [11:03:51] (03CR) 10Effie Mouzeli: "pcc https://puppet-compiler.wmflabs.org/compiler1003/28956/parse2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/677837 (https://phabricator.wikimedia.org/T677119) (owner: 10Effie Mouzeli) [11:04:07] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279644 (10fgiunchedi) [11:04:09] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (10fgiunchedi) [11:04:10] apergos: possibly. But I'd like to get sth deployed anyway, so if the training is running, I can give you all a config patch. [11:04:28] https://gerrit.wikimedia.org/r/c/677828 [11:04:35] there doesn't seem to be a training either. I'm in the google meet and there's 0 attendees :-D [11:04:44] ok [11:04:47] so I'll just self-service then :) [11:04:53] I mean, it's on the calendar but meh. 
[11:05:05] yeah, I'd just go ahead and do it yourself [11:05:07] (03CR) 10Urbanecm: [C: 03+2] Enable Growth for newcomers on simplewiki, mswiki, tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677828 (https://phabricator.wikimedia.org/T278369) (owner: 10Urbanecm) [11:05:50] (03Merged) 10jenkins-bot: Enable Growth for newcomers on simplewiki, mswiki, tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677828 (https://phabricator.wikimedia.org/T278369) (owner: 10Urbanecm) [11:06:47] what's the story for "community consensus" on this change? it looks like that's not the process here [11:06:59] (just following along since I'm in here) [11:07:25] oh also don't forget please to add your patch to the calendar so it's in the record, it's nice to be able to search there and not just the logs [11:07:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::parsoid: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677837 (https://phabricator.wikimedia.org/T677119) (owner: 10Effie Mouzeli) [11:07:35] apergos: sure, will do.
[11:07:36] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: de1670cbd2c59a24f1e29a6d3731e3ac7f39d336: Enable Growth for newcomers on simplewiki, mswiki, tawiki (T278369; T277562; T277550) (duration: 01m 07s) [11:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:47] T278369: Deploy Growth features on Tamil Wikipedia - https://phabricator.wikimedia.org/T278369 [11:07:48] T277562: Deploy Growth features on Malay Wikipedia - https://phabricator.wikimedia.org/T277562 [11:07:48] T277550: Deploy Growth features on Simple English Wikipedia - https://phabricator.wikimedia.org/T277550 [11:07:53] (03PS1) 10David Caro: pcc: honor spaces in arguments [puppet] - 10https://gerrit.wikimedia.org/r/677847 [11:08:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: Repool db1118 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15258 and previous config saved to /var/cache/conftool/dbconfig/20210408-110828-root.json [11:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:24] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2028.codfw.wmnet [11:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:43] apergos: so, the features I just enabled for the wikis are an initiative of the Growth team, for which I work as a software engineer. The community relationship specialist told me it's ready to go, so I synced it :). [11:10:12] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/677847 (owner: 10David Caro) [11:10:20] hm it's probably good for that note to go on the task (assuming it's public info...
which it now is, since this channel is logged :-D) [11:10:35] just so people looking at recent deploys to learn anything get how it goes [11:10:43] good point [11:11:32] the wikis are obviously contacted by us in advance, and given the opportunity to try & comment on the features before they are live, but it doesn't need an explicit consensus, as it's a WMF-pursued change rather than community-pursued [11:12:11] yup [11:12:32] folks learning to do this will need to know which is which and be able to double-check (well anyone doing this, really, heh) [11:14:07] yeah. Well, in this case, the tasks are created by a WMF employee who's a Growth team member, they're tagged with a team's sprint board, and another Growth team member requests deployments/deploys it, so...I guess it's pretty obvious it's WMF-pursued change [11:14:22] (03PS1) 10Jbond: hiera: move key to correct location [labs/private] - 10https://gerrit.wikimedia.org/r/677848 [11:14:42] but since you asked, it's probably not _really_ obvious. Not sure how to make it more visible tbh [11:15:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera: move key to correct location [labs/private] - 10https://gerrit.wikimedia.org/r/677848 (owner: 10Jbond) [11:17:10] well if we all get in the habit of adding a comment on the task "does not need community consensus, wmf-pursued change" or "link to community consensus here" (well some community member will do that) [11:17:21] and if no such comment is on there, the person with the patch can be asked [11:17:35] prolly ok that way, just gets it into people's heads [11:17:57] yeah, i totally understand it [11:18:16] on the other hand not everything that has community support can be done [11:18:29] (one example: we won't install flow/LQT on more wikis) [11:18:35] no, but we can at least make the check part of the routine, or else [11:18:44] some things might go out that shouldn't :-D [11:18:51] I mean, more than without the check!
[11:19:11] I really wish we could de-install flow on more wikis but that's another topic :-P [11:19:28] iirc uninstalling flow is also on the list of banned changes [11:19:48] which is the worst of both worlds >_< [11:20:02] yeah [11:20:08] the issue is uninstalling flow...isn't exactly easy [11:20:18] must maintain it, won't improve it, won't give it to anyone else, can't get rid of it [11:20:34] and now we have these nice shiny new overlays for talk pages... >_< [11:20:47] yeah well what do you do with the old flow pages. nothing good [11:20:52] i love discussiontools, btw :) [11:20:57] :-) [11:21:21] T188812 says "Flow allegedly puts the wikis into an irreversible state whereupon it becomes impossible for the Wikimedia Foundation to handle its leftovers" [11:21:22] T188812: Uninstall Flow on all wikis where it has zero topics - https://phabricator.wikimedia.org/T188812 [11:21:32] i think it summarizes the flow case really well :/ [11:21:50] I will bookmark it and never read it until some day when I really want to feel 10x more miserable than normal [11:22:17] :D [11:22:19] I had to dig into the db structure at a point where I was rewriting the dumps for them [11:22:23] it was extremely painful [11:22:38] that concludes my "Flow, the extension you love to hate" Ted Talk. [11:23:13] i dealt with a couple of issues that were mostly about "extension defined a content model, the extension is gone, the pages created via that extension cannot be deleted, undeleted, moved, viewed nor edited" [11:23:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: Repool db1118 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15259 and previous config saved to /var/cache/conftool/dbconfig/20210408-112332-root.json [11:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:22] yeah how could they be.
no extension, no content model, and a bunch of the data lives in a separate off-wiki db [11:24:29] that's not a recipe for trouble, right? [11:24:47] how could it be :) [11:24:57] welp, 25 minutes in, I think no one is coming to be trained, legit because it is the day before a 4 day weekend for wmf folks [11:25:08] so, closing that google meet tab :-) [11:25:13] ok ok : [11:25:31] thanks for using the window :-D [11:25:40] :) [11:27:42] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: fix help invocation [puppet] - 10https://gerrit.wikimedia.org/r/677850 [11:28:04] Urbanecm: discussion tools is one of my favourite extensions [11:28:11] yeahj [11:28:27] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: fix help invocation [puppet] - 10https://gerrit.wikimedia.org/r/677850 (owner: 10Arturo Borrero Gonzalez) [11:28:38] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: fix help invocation [puppet] - 10https://gerrit.wikimedia.org/r/677850 [11:28:47] (03PS4) 10Effie Mouzeli: profile::parsoid: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677837 (https://phabricator.wikimedia.org/T677119) [11:32:06] (03CR) 10Effie Mouzeli: [C: 03+2] profile::parsoid: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677837 (https://phabricator.wikimedia.org/T677119) (owner: 10Effie Mouzeli) [11:33:47] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/28956/parse2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/677837 (https://phabricator.wikimedia.org/T677119) (owner: 10Effie Mouzeli) [11:34:16] PROBLEM - Device not healthy -SMART- on ms-be2028 is CRITICAL: cluster=swift device=None instance=ms-be2028 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops [11:40:30] (03PS2) 10Amire80: Add 
default import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139) [11:40:46] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15261 and previous config saved to /var/cache/conftool/dbconfig/20210408-114625-root.json [11:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:04] (03PS1) 10Elukey: hadoop: fix the HDFS Namenode audit log config [puppet] - 10https://gerrit.wikimedia.org/r/677853 (https://phabricator.wikimedia.org/T276906) [11:47:30] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@25dad72]: T273847 export queries to relforge dag deployment - elastic index name fix [11:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:39] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847 [11:49:10] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@25dad72]: T273847 export queries to relforge dag deployment - elastic index name fix (duration: 01m 39s) [11:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:28] (03PS1) 10Ema: vcl: fix ADM_PARAM definition [puppet] - 10https://gerrit.wikimedia.org/r/677858 (https://phabricator.wikimedia.org/T279533) [11:50:14] (03CR) 10Ema: [C: 03+2] vcl: fix ADM_PARAM definition [puppet] - 10https://gerrit.wikimedia.org/r/677858 (https://phabricator.wikimedia.org/T279533) (owner: 10Ema) [11:50:22] !log tighten cr3-ulsfo loopback firewall filter - T207799 [11:50:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:54] (03PS1) 10Ayounsi: Introduce production4/6 and tighten looback filter [homer/public] - 10https://gerrit.wikimedia.org/r/677859 (https://phabricator.wikimedia.org/T207799) [11:54:10] (03PS1) 10Ema: cache: enable exp caching policy on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/677718 (https://phabricator.wikimedia.org/T275809) [11:55:10] (03CR) 10Ayounsi: [C: 03+2] Introduce production4/6 and tighten looback filter [homer/public] - 10https://gerrit.wikimedia.org/r/677859 (https://phabricator.wikimedia.org/T207799) (owner: 10Ayounsi) [11:57:34] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@25dad72]: T273847 export queries to relforge dag deployment - elastic index name fix [11:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:42] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847 [11:57:43] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@25dad72]: T273847 export queries to relforge dag deployment - elastic index name fix (duration: 00m 09s) [11:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:04] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677718 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [11:58:40] !log tighten all routers loopback firewall filter - T207799 [11:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:21] (03Merged) 10jenkins-bot: Introduce production4/6 and tighten looback filter [homer/public] - 10https://gerrit.wikimedia.org/r/677859 (https://phabricator.wikimedia.org/T207799) (owner: 10Ayounsi) [12:00:23] (03CR) 10Elukey: [C: 03+2] hadoop: fix the HDFS Namenode audit log config [puppet] - 10https://gerrit.wikimedia.org/r/677853 (https://phabricator.wikimedia.org/T276906) 
(owner: 10Elukey) [12:00:25] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: fix help invocation [puppet] - 10https://gerrit.wikimedia.org/r/677850 [12:00:27] (03PS1) 10Arturo Borrero Gonzalez: sonofgridnegine: grid-configurator: run black autoformater [puppet] - 10https://gerrit.wikimedia.org/r/677860 [12:00:29] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: include defaults in help message [puppet] - 10https://gerrit.wikimedia.org/r/677861 [12:00:31] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: rework --domains option [puppet] - 10https://gerrit.wikimedia.org/r/677862 [12:01:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15262 and previous config saved to /var/cache/conftool/dbconfig/20210408-120128-root.json [12:01:32] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: rework --domains option [puppet] - 10https://gerrit.wikimedia.org/r/677862 (owner: 10Arturo Borrero Gonzalez) [12:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15263 and previous config saved to /var/cache/conftool/dbconfig/20210408-121633-root.json [12:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:19] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: error if running in toolsbeta if no --beta [puppet] - 10https://gerrit.wikimedia.org/r/677865 [12:20:11] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: error if running in toolsbeta if no --beta [puppet] - 10https://gerrit.wikimedia.org/r/677865 (owner: 10Arturo Borrero Gonzalez) [12:22:02] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: rework 
--domains option [puppet] - 10https://gerrit.wikimedia.org/r/677862 [12:31:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Repool after schema change', diff saved to https://phabricator.wikimedia.org/P15264 and previous config saved to /var/cache/conftool/dbconfig/20210408-123137-root.json [12:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:16] (03PS1) 10Ema: vcl: declare adm_param as a global variable [puppet] - 10https://gerrit.wikimedia.org/r/677870 (https://phabricator.wikimedia.org/T279533) [12:32:32] 10SRE, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10ayounsi) [12:39:14] !log installing xcftools security updates [12:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:23] (03PS1) 10Ayounsi: Remove 185.15.56.0/24 from network::external [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) [12:41:15] (03CR) 10Ayounsi: [C: 04-1] "-1 until we're sure it's safe to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [12:44:03] !log installing libbsd security updates for Buster [12:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:08] (03CR) 10Ema: [C: 03+2] vcl: declare adm_param as a global variable [puppet] - 10https://gerrit.wikimedia.org/r/677870 (https://phabricator.wikimedia.org/T279533) (owner: 10Ema) [12:46:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 98): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28960/console" [puppet] - 10https://gerrit.wikimedia.org/r/677567 (owner: 10David Caro) [12:46:31] (03CR) 10Ema: [C: 03+2] cache: enable exp caching policy on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/677718 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [12:48:23] (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: fix help invocation [puppet] - 10https://gerrit.wikimedia.org/r/677850 [12:48:25] (03PS2) 10Arturo Borrero Gonzalez: sonofgridnegine: grid-configurator: run black autoformater [puppet] - 10https://gerrit.wikimedia.org/r/677860 [12:48:27] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: include defaults in help message [puppet] - 10https://gerrit.wikimedia.org/r/677861 [12:48:29] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: rework --domains option [puppet] - 10https://gerrit.wikimedia.org/r/677862 [12:48:31] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: error if running in toolsbeta if no --beta [puppet] - 10https://gerrit.wikimedia.org/r/677865 [12:48:33] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) [12:49:38] !log cp5001: varnish-frontend-restart to test exp policy settings starting from a empty cache T275809 
[12:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:46] T275809: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 [12:50:34] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [12:56:20] (03CR) 10David Caro: [C: 03+2] pcc: honor spaces in arguments [puppet] - 10https://gerrit.wikimedia.org/r/677847 (owner: 10David Caro) [13:01:00] (03CR) 10Ottomata: hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677576 (https://phabricator.wikimedia.org/T276906) (owner: 10Elukey) [13:03:19] (03CR) 10Ottomata: "Huh, TIL :) TY!" [puppet] - 10https://gerrit.wikimedia.org/r/677816 (owner: 10Elukey) [13:03:57] (03PS1) 10Ema: admin: add mikeraish to ldap_only users [puppet] - 10https://gerrit.wikimedia.org/r/677885 (https://phabricator.wikimedia.org/T279147) [13:06:08] (03CR) 10Ottomata: [C: 03+1] jupyter: simplify the cron script to clean up user Trash [puppet] - 10https://gerrit.wikimedia.org/r/677822 (owner: 10Elukey) [13:13:37] (03CR) 10Ema: [C: 03+2] admin: add mikeraish to ldap_only users [puppet] - 10https://gerrit.wikimedia.org/r/677885 (https://phabricator.wikimedia.org/T279147) (owner: 10Ema) [13:13:53] (03CR) 10Jbond: [C: 03+1] admin: add mikeraish to ldap_only users [puppet] - 10https://gerrit.wikimedia.org/r/677885 (https://phabricator.wikimedia.org/T279147) (owner: 10Ema) [13:14:34] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2021-04-01 to 2021-06-30 (Q4)): Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (10thcipriani) [13:16:34] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant access to Superset for Mikeraish - 
https://phabricator.wikimedia.org/T279147 (10ema) @Mraishwmf: you should be all set! Let me know if you can now access Superset. [13:16:53] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [13:18:59] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: unassigned_shards: 0, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_primary_shards: 937, number_of_in_flight_fetch: 0, cluster_name: cloudelastic-chi-eqiad, active_shards: 1877, number_of_pending_tasks: 0, initializing_shards: 0, number_of_nodes: 6, task_max_waiting_in_queue_ [13:18:59] _of_data_nodes: 6, delayed_unassigned_shards: 0, status: green, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration [13:19:01] (03PS2) 10Andrew Bogott: Replace cloudcephmon2001-dev with cloudcephmon2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/677663 (https://phabricator.wikimedia.org/T276509) [13:20:40] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts... 
[13:20:46] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [13:22:18] (03CR) 10Andrew Bogott: [C: 03+2] Replace cloudcephmon2001-dev with cloudcephmon2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/677663 (https://phabricator.wikimedia.org/T276509) (owner: 10Andrew Bogott) [13:24:23] !log installing groff bugfix updates from Buster point release [13:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:47] (03PS1) 10Ottomata: dumps.wikimedia.org - add section in legal specifying CC0 license for analytics [puppet] - 10https://gerrit.wikimedia.org/r/677900 (https://phabricator.wikimedia.org/T278409) [13:26:09] (03CR) 10Andrew Bogott: [C: 03+2] Switch cloudcephmon2001-dev to a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/677664 (owner: 10Andrew Bogott) [13:26:16] (03PS2) 10Andrew Bogott: Switch cloudcephmon2001-dev to a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/677664 [13:29:12] 10SRE: DRY up .html files in puppet used for snapshot and dumps modules - https://phabricator.wikimedia.org/T279661 (10Ottomata) [13:29:42] (03CR) 10ArielGlenn: [C: 03+1] "Sure looks just like the other one. Go ahead on, sorry for the mixup!" [puppet] - 10https://gerrit.wikimedia.org/r/677900 (https://phabricator.wikimedia.org/T278409) (owner: 10Ottomata) [13:29:48] (03CR) 10Ottomata: [C: 03+2] dumps.wikimedia.org - add section in legal specifying CC0 license for analytics [puppet] - 10https://gerrit.wikimedia.org/r/677900 (https://phabricator.wikimedia.org/T278409) (owner: 10Ottomata) [13:33:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (10Andrew) thank you @papaul! this box is now in service. 
[13:34:05] Majavah: I've deployed that cpjobqueue change to beta [13:34:25] (03PS1) 10JMeybohm: calico: Add defauls for container resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/677906 (https://phabricator.wikimedia.org/T277877) [13:36:17] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (10JMeybohm) a:03JMeybohm Added some defaults based on the current maximum values (https://grafana-rw.wikimedia.org/d/2AfU0X_Mz/jayme-ca... [13:39:43] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2001.codfw.wmnet with reason: REIMAGE [13:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:47] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2001.codfw.wmnet with reason: REIMAGE [13:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:28] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephmon2001-dev.codfw.wmnet [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:58] (03PS1) 10Andrew Bogott: Remove references to cloudcephmon2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/677910 (https://phabricator.wikimedia.org/T279662) [13:48:13] (03PS3) 10David Caro: ceph: use ensure_packages instead of package directly [puppet] - 10https://gerrit.wikimedia.org/r/677595 (https://phabricator.wikimedia.org/T274566) [13:48:15] (03PS1) 10David Caro: ceph.common: add ceph repo parameter [puppet] - 10https://gerrit.wikimedia.org/r/677911 [13:48:55] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to cloudcephmon2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/677910 (https://phabricator.wikimedia.org/T279662) (owner: 10Andrew Bogott) [13:49:07] PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [13:49:32] (03CR) 10jerkins-bot: [V: 04-1] ceph.common: add ceph repo parameter [puppet] - 10https://gerrit.wikimedia.org/r/677911 (owner: 10David Caro) [13:49:35] PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [13:50:17] (03CR) 10David Caro: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28962/console" [puppet] - 10https://gerrit.wikimedia.org/r/677595 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [13:50:29] RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1594 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [13:55:13] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephmon2001-dev.codfw.wmnet [13:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:55] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudcephmon2001-dev - https://phabricator.wikimedia.org/T279662 (10Andrew) a:05Andrew→03Papaul [14:00:39] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2001.codfw.wmnet'] ` and were **ALL*... 
[14:09:51] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Cmjohnson) [14:11:31] (03PS3) 10Silvan Heintze: Remove idGeneratorLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677560 (https://phabricator.wikimedia.org/T274156) (owner: 10Noa wmde) [14:17:53] PROBLEM - Long running screen/tmux on puppetmaster1001 is CRITICAL: CRIT: Long running tmux process. (user: ryankemper PID: 2120, 2539394s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:18:55] (03PS1) 10Silvan Heintze: Remove all remains of idGeneratorLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677920 (https://phabricator.wikimedia.org/T274156) [14:19:23] ^ Killed my tmux session `cergen` on `puppetmaster1001` [14:20:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove idGeneratorLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677560 (https://phabricator.wikimedia.org/T274156) (owner: 10Noa wmde) [14:22:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove all remains of idGeneratorLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677920 (https://phabricator.wikimedia.org/T274156) (owner: 10Silvan Heintze) [14:22:46] (03CR) 10Gergő Tisza: [C: 03+1] Add growthexperiments_mentee_data to private tables [puppet] - 10https://gerrit.wikimedia.org/r/677653 (https://phabricator.wikimedia.org/T279587) (owner: 10Urbanecm) [14:23:18] RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [14:23:50] (03PS2) 10David Caro: ceph: add ceph repo parameter to all client modules [puppet] - 
10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) [14:25:39] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [14:29:54] (03PS3) 10David Caro: ceph: add ceph repo parameter to all client modules [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) [14:30:05] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [14:31:18] (03CR) 10Ahmon Dancy: [C: 03+1] logspam: silence rare but annoying UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/677676 (owner: 10Brennen Bearnes) [14:34:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] calico: Add defauls for container resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/677906 (https://phabricator.wikimedia.org/T277877) (owner: 10JMeybohm) [14:36:08] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, and 2 others: Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10dcaro) >>! In T166066#5039654, @hashar wrote: > We have a Jenkins job T97513 which has been made to recognizes `Hosts:` in commit mess... 
[14:38:51] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (10Papaul) a:05Papaul→03fgiunchedi Disk replaced [14:39:21] (03CR) 10Silvan Heintze: "split up into two separate changes, as Lucas suggested" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677920 (https://phabricator.wikimedia.org/T274156) (owner: 10Silvan Heintze) [14:45:44] RECOVERY - HP RAID on ms-be2028 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:46:11] (03PS1) 10JMeybohm: kube-apiserver: Use --enable-admission-plugins argument [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [14:46:13] (03PS1) 10JMeybohm: kube-apiserver: Update the list of enabled admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/677923 (https://phabricator.wikimedia.org/T270063) [14:47:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] kube-apiserver: Update the list of enabled admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/677923 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [14:48:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [14:50:10] (03PS1) 10JMeybohm: infrastructure_users: Remove comments with old schema [puppet] - 10https://gerrit.wikimedia.org/r/677926 (https://phabricator.wikimedia.org/T269461) [14:52:30] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [15:00:52] (03CR) 10Tonina Zhelyazkova: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/677928 (https://phabricator.wikimedia.org/T204031) (owner: 10Tonina Zhelyazkova) [15:01:32] (03CR) 10JMeybohm: [C: 03+2] infrastructure_users: Remove comments with old schema [puppet] - 10https://gerrit.wikimedia.org/r/677926 (https://phabricator.wikimedia.org/T269461) (owner: 10JMeybohm) [15:03:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] wikidata: post edit constraint jobs on 60% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677928 (https://phabricator.wikimedia.org/T204031) (owner: 10Tonina Zhelyazkova) [15:04:06] (03PS4) 10David Caro: ceph: add ceph repo and parameter to all client modules [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) [15:08:54] 10SRE, 10Community-Tech, 10MediaWiki-CrossWikiWatchlist, 10Crosswiki: Acquire new hardware for hosting cross-wiki watchlist database - https://phabricator.wikimedia.org/T142538 (10MusikAnimal) [15:10:37] (03PS1) 10Muehlenhoff: Switch to iptables legacy alternative provider on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/677931 (https://phabricator.wikimedia.org/T275873) [15:20:21] (03CR) 10Bstorm: [C: 03+2] cloud email alerts: remove f-strings in case of stretch vms [puppet] - 10https://gerrit.wikimedia.org/r/677599 (owner: 10Bstorm) [15:22:03] 10SRE, 10ops-eqiad, 10Analytics-Clusters: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey) ` elukey@an-worker1100:~$ sudo megacli -AdpBbuCmd -BbuLearn -aAll Adapter 0: BBU Learn Failed Exit Code: 0x01 ` This is also weird.. 
[15:29:22] (03PS1) 10David Caro: ceph.common: pin any package from ceph repo to prio 1003 [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) [15:29:49] RECOVERY - Check whether ferm is active by checking the default input chain on sretest1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:30:10] (03CR) 10jerkins-bot: [V: 04-1] ceph.common: pin any package from ceph repo to prio 1003 [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [15:31:07] (03CR) 10Arturo Borrero Gonzalez: ceph.common: pin any package from ceph repo to prio 1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [15:32:08] (03CR) 10Arturo Borrero Gonzalez: ceph.common: pin any package from ceph repo to prio 1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [15:36:04] !log reboot an-worker1100 to see if it helps with the strange BBU behavior [15:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:21] (03PS2) 10David Caro: ceph.common: pin any package from ceph repo to prio 1003 [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) [15:37:31] PROBLEM - Host an-worker1100 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:37] RECOVERY - Device not healthy -SMART- on ms-be2028 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops [15:40:39] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [15:42:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I would record somewhere why the legacy version is required, commit message or a comment in the puppet manifest. It may help in the future" [puppet] - 10https://gerrit.wikimedia.org/r/677931 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [15:42:17] (03PS3) 10David Caro: ceph.common: pin any package from ceph repo to prio 1003 [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) [15:42:20] (03CR) 10David Caro: ceph.common: pin any package from ceph repo to prio 1003 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [15:42:21] 10SRE, 10Security: Investigate iptables replacments - https://phabricator.wikimedia.org/T279683 (10jbond) p:05Triage→03Medium [15:43:07] (03CR) 10Jbond: "LGTM, also created https://phabricator.wikimedia.org/T279683 to explore long term options" [puppet] - 10https://gerrit.wikimedia.org/r/677931 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [15:44:07] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:21] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [15:45:47] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` 
['wtp1025.eqiad.wmnet'] ` The log can be found in `/var/log/... [15:47:15] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [15:48:14] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) [15:48:28] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) [15:49:07] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [15:51:39] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:13] 10SRE, 10Security: Investigate iptables replacements - https://phabricator.wikimedia.org/T279683 (10dcaro) [15:53:36] 10SRE, 10Security: Investigate iptables replacements - https://phabricator.wikimedia.org/T279683 (10aborrero) beware that in the next debian release iptables may not even be part of the base system install. [15:55:15] RECOVERY - Host an-worker1100 is UP: PING WARNING - Packet loss = 33%, RTA = 2.34 ms [15:56:11] (03CR) 10Jbond: "lgtm but see inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [15:57:59] PROBLEM - SSH on an-worker1100 is CRITICAL: connect to address 10.64.36.145 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:58:50] (03CR) 10Bstorm: [C: 03+1] "I'd noticed that recently and didn't take time to fix it. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/677850 (owner: 10Arturo Borrero Gonzalez) [15:59:29] (03CR) 10Bstorm: [C: 03+1] sonofgridnegine: grid-configurator: run black autoformater [puppet] - 10https://gerrit.wikimedia.org/r/677860 (owner: 10Arturo Borrero Gonzalez) [15:59:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/677595 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [16:00:05] jbond42 and cdanis: It is that lovely time of the day again! You are hereby commanded to deploy [[Puppet request window]]
''''''. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1600). [16:00:16] (03CR) 10Bstorm: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/677861 (owner: 10Arturo Borrero Gonzalez) [16:00:21] thanks jouncebot [16:00:35] (03CR) 10Jbond: ceph.common: pin any package from ceph repo to prio 1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677938 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [16:00:48] (03PS1) 10Bartosz Dziewoński: Revert incorrect changes to ve.ui.MWBackCommand that made it stop working [extensions/VisualEditor] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/677725 (https://phabricator.wikimedia.org/T279613) [16:02:31] (03CR) 10Jbond: ceph: add ceph repo and parameter to all client modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [16:04:17] (03CR) 10David Caro: ceph: add ceph repo and parameter to all client modules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [16:04:25] PROBLEM - Host an-worker1100 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:01] jouncebot: now [16:05:01] For the next 0 hour(s) and 54 minute(s): [[Puppet request window]]
'''''' (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1600) [16:05:08] jouncebot: next [16:05:08] In 0 hour(s) and 54 minute(s): [[mw:Services|Services]] – [[mw:Extension:Graph|Graphoid]] / [[ORES]] (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1700) [16:05:36] effie: puppet request window empty today if you want to use it [16:05:53] RECOVERY - Host an-worker1100 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [16:05:56] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:17] cdanis: I am reimaging a parsoid server, and I just remembered that it could cause a deployment failure [16:06:44] so I was checking who I need to nag :p [16:06:53] RECOVERY - SSH on an-worker1100 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:06:58] (03CR) 10David Caro: [C: 03+2] ceph: use ensure_packages instead of package directly [puppet] - 10https://gerrit.wikimedia.org/r/677595 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [16:10:25] RECOVERY - MegaRAID on an-worker1100 is OK: OK: optimal, 23 logical, 23 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:10:50] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:54] (03PS4) 10Cwhite: logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) [16:13:07] (03CR) 10Legoktm: "> Patch Set 1:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [16:13:50] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1025.eqiad.wmnet with reason: REIMAGE [16:13:57] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:34] 10SRE, 10ops-eqiad, 10Analytics-Clusters: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey) The alert recovered, but I discovered a bad disk that needs to be replaced (had to clear preserved cache to allow boot, and one partition didn't mount). Hopefully we'll get... [16:14:57] (03CR) 10Jbond: ceph: add ceph repo and parameter to all client modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [16:15:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1025.eqiad.wmnet with reason: REIMAGE [16:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:46] !log update bios cp1087, already deposed for h/w issues T278729 [16:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:54] T278729: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 [16:18:16] (03CR) 10Cwhite: logstash: refactor how curator jobs are defined and deployed (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [16:18:21] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:02] 10SRE, 10Traffic, 10Patch-For-Review: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10ema) Today I've added [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/files/exp_policy.py | exp_policy.py... 
[16:19:32] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/677942 [16:22:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Cmjohnson) These are all connected, the 2nd interfaces are not setup, it seems that we're all confused on how to do this so I di... [16:26:10] 10SRE, 10ops-eqiad, 10Analytics-Clusters: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey) One drive is in a Foreign state, no idea why (also unconfigured - good): ` Enclosure Device ID: 32 Slot Number: 10 Enclosure position: 1 Device Id: 10 WWN: 5000c500cf8ee990... [16:26:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Papaul) @Cmjohnson I will take a look at it once done with some onsite work [16:26:41] RECOVERY - Host cp1087 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:27:06] (03PS2) 10Herron: replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) [16:28:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson) @aborrero The 2nd interfaces are cloudgw1001 cloudsw1-c8 xe-0/0/19 cable id 5321 cloudgw1002 cloudsw1-d5 xe-0/0/35... [16:28:59] (03CR) 10Herron: "> Patch Set 1:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [16:33:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [16:33:17] (03CR) 10Cwhite: [C: 03+1] replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [16:33:45] !log reboot an-worker1100 again to check if all the disks come up correctly [16:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:22] 10SRE, 10ops-eqiad, 10Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10Cmjohnson) updated the BIOS and submitted Dell ticket You have successfully submitted request SR1056516502. [16:35:00] 10SRE, 10ops-eqiad, 10Analytics-Clusters: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey) I had to do: ` megacli -CfgForeign -Scan -a0 megacli -CfgForeign -Clear -a0 megacli -CfgLdAdd -r0 [32:10] -a0 ` And the disk came back to life and I was able to re-mount it... 
[16:35:05] (03PS1) 10Jbond: O:gitlab: add config for backup sets [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) [16:36:07] PROBLEM - Host an-worker1100 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:39] RECOVERY - Host an-worker1100 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [16:40:50] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/677974 [16:42:19] 10SRE, 10ops-eqiad, 10Analytics-Clusters: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (10elukey) 05Open→03Resolved a:03elukey All good, I'll re-open in case something weird comes up, but now all disks are good :) [16:51:06] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable Lables - https://phabricator.wikimedia.org/T279160 (10Cmjohnson) 05Open→03Resolved [16:51:16] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable Lables - https://phabricator.wikimedia.org/T279160 (10Cmjohnson) Fixed the report has zero errors [16:51:29] !log testing Scap 3.17.0 release on deployment-deploy01 [16:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:40] (03PS5) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: fix help invocation [puppet] - 10https://gerrit.wikimedia.org/r/677850 [16:51:42] (03PS3) 10Arturo Borrero Gonzalez: sonofgridnegine: grid-configurator: run black autoformater [puppet] - 10https://gerrit.wikimedia.org/r/677860 [16:51:44] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: include defaults in help message [puppet] - 10https://gerrit.wikimedia.org/r/677861 [16:51:46] (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: rework --domains option [puppet] - 10https://gerrit.wikimedia.org/r/677862 [16:51:48] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: error if running in toolsbeta if no --beta [puppet] - 10https://gerrit.wikimedia.org/r/677865 
[16:51:50] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) [16:54:03] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [16:58:42] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1025.eqiad.wmnet'] ` and were **ALL** successful. [17:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate [[mw:Services|Services]] – [[mw:Extension:Graph|Graphoid]] / [[ORES]] deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1700). [17:05:20] (03PS1) 10Razzi: clouddb: enable alerting for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/677977 (https://phabricator.wikimedia.org/T269211) [17:11:20] (03PS1) 10Jgiannelos: Bump chromium-render to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/677978 [17:12:51] Reedy: is now good? 
[17:13:40] (03PS2) 10JMeybohm: kube-apiserver: Update admission controller config [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [17:14:07] (03Abandoned) 10JMeybohm: kube-apiserver: Update the list of enabled admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/677923 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [17:14:53] (03CR) 10Jgiannelos: [C: 03+2] Bump chromium-render to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/677978 (owner: 10Jgiannelos) [17:16:26] !log Scap 3.17.0 deployed to beta cluster [17:16:29] (03Merged) 10jenkins-bot: Bump chromium-render to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/677978 (owner: 10Jgiannelos) [17:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:48] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [17:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:58] (03PS1) 10Gergő Tisza: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/677980 [17:21:40] Majavah, we've deployed 3.17.0 on beta and are having some trouble testing (the hosts we were expecting don't exist) - does everything look OK at your end? [17:22:44] liw: which hosts? [17:23:51] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:04] Majavah, deployment-mediawiki-07 and deployment-mediawiki11 [17:24:37] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28964/console" [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [17:25:42] liw: mediawiki-07 is gone, mediawiki11 works just fine for me, but for that you can't use .wmflabs names, new VMs only have ..eqiad1.wikimedia.cloud [17:26:04] (03CR) 10Gergő Tisza: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/677980 (owner: 10Gergő Tisza) [17:27:32] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/677980 (owner: 10Gergő Tisza) [17:28:12] (03PS4) 10Legoktm: mediawiki fonts: Remove ttf-ubuntu-font-family [puppet] - 10https://gerrit.wikimedia.org/r/675357 (owner: 10Majavah) [17:29:47] 10SRE, 10Wikimedia-Mailing-lists: Hausa Wikimedians mailing list - https://phabricator.wikimedia.org/T279654 (10Ladsgroup) Can this wait for a month until we get the new mailman out of the door? [17:29:52] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:03] (03CR) 10Legoktm: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28966/console" [puppet] - 10https://gerrit.wikimedia.org/r/675357 (owner: 10Majavah) [17:35:03] (03CR) 10Legoktm: [V: 03+1 C: 03+2] mediawiki fonts: Remove ttf-ubuntu-font-family [puppet] - 10https://gerrit.wikimedia.org/r/675357 (owner: 10Majavah) [17:35:11] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:35:23] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:06] (03PS3) 10JMeybohm: kube-apiserver: Update admission controller config [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [17:36:39] Majavah, right. we figured out the new name, but had permission problems. [17:36:39] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: relocating_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0, unassigned_shards: 0, timed_out: False, active_shards: 1877, number_of_data_nodes: 6, status: green, cluster_name: cloudelastic-chi-eqiad, number_of_in_flight_fetch: 0, number_of_nodes: 6, initializin [17:36:39] er_of_pending_tasks: 0, active_primary_shards: 937, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:38:23] liw: could you be more specific? 
I'm not aware of any issues with it [17:39:48] Majavah, I didn't take notes, dancy may have a note of the version, but my memory says scap tried to create a directory in /srv/deployments and didn't have permission [17:39:49] PROBLEM - mediawiki-installation DSH group on parse2001 is CRITICAL: Host parse2001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:40:06] dancy, I meant a note of the directory [17:41:24] `Permission denied: '/srv/deployment'` [17:41:57] That's during `scap deploy -v 'testing scap3.17.0'` in `/srv/deployment/integration/slave-scripts` on deployment-deploy01 [17:42:15] uhh, let me test [17:43:53] (03CR) 10JMeybohm: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28967/" [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [17:45:25] dancy: deployment-mediawiki11 does not have /srv/deployment or any subdirectories, nor do I see anything in Puppet that should create it for integration/slave-scripts [17:45:28] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) With the wikitech-l imported my last offer is now: 34GB. [17:46:08] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) [17:46:12] Zooming out, what we need in general is a test of commands that are suitable for validating the new scap release in beta. Do you have suggestions? [17:47:26] if that was for me, I'm not familiar enough with scap to have anything else than standard mediawiki sync-worlds [17:48:17] ok. We'll work something out. 
Thanks [17:51:22] let me know if I can be helpful somehow [17:52:26] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [17:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:01] RECOVERY - Check systemd state on wdqs2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:53] Amir1: would like any mailing list creations to be stalled? [17:54:59] (you) [17:55:46] I would estimate there is on average 2 to 4 per month [17:58:21] (03PS2) 10Dzahn: site/conftool-data: assign 4 x API, 4 x app, 2 x jobrunner, rack A5 [puppet] - 10https://gerrit.wikimedia.org/r/677674 (https://phabricator.wikimedia.org/T279599) [17:58:23] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:59:22] !log tgr@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [17:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:39] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: assign 4 x API, 4 x app, 2 x jobrunner, rack A5 [puppet] - 10https://gerrit.wikimedia.org/r/677674 (https://phabricator.wikimedia.org/T279599) (owner: 10Dzahn) [17:59:48] jouncebot: now [17:59:48] For the next 0 hour(s) and 0 minute(s): [[mw:Services|Services]] – [[mw:Extension:Graph|Graphoid]] / [[ORES]] (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1700) [18:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for [[Backport windows|Morning backport window]]
'''''' deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:08] liw: is my understanding correct that the integration/slave-scripts repo on beta is only used for testing scap? [18:00:24] jouncebot is lying, there are patches [18:00:27] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: unassigned_shards: 0, number_of_data_nodes: 6, active_shards_percent_as_number: 100.0, initializing_shards: 0, active_shards: 1877, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, cluster_name: cloudelastic-chi-eqiad, number_of_nodes: 6, status: green, task_max_waiting_i [18:00:27] , delayed_unassigned_shards: 0, active_primary_shards: 937, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:00:39] interesting MatmaRex [18:01:02] i can deploy then :) [18:02:07] !log mw2403 through mw2401 - new hardwere moving into production, not pooled yet, initial puppet run, being added to icinga etc, creating mcrouter certs for them (T279599) [18:02:07] it's weird, did we mess up the format? or does it not parse the page right after the recent changes? 
[18:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:15] T279599: bring 10 new mediawiki appserver in codfw into production, new rack A5 (mw2402 - mw2411) - https://phabricator.wikimedia.org/T279599 [18:02:35] and it is printing weird HTML stuff in its messages, which isn't reassuring [18:02:39] log mw2403 through mw2411 - new hardware moving into production, not pooled yet, initial puppet run, being added to icinga etc, creating mcrouter certs for them (T279599) [18:02:44] darn it [18:02:49] (03CR) 10Urbanecm: [C: 03+2] Revert incorrect changes to ve.ui.MWBackCommand that made it stop working [extensions/VisualEditor] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/677725 (https://phabricator.wikimedia.org/T279613) (owner: 10Bartosz Dziewoński) [18:02:59] phuedx also has a patch scheduled, but doesn't look like they're here unless they're on an alt nick that I'm not aware of [18:03:17] !log mw2403 through mw2411 - new hardware moving into production, not pooled yet, initial puppet run, being added to icinga etc, creating mcrouter certs for them (T279599) [18:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:38] o/ Sorry I'm late [18:06:16] hi. Urbanecm is deploying [18:06:31] phuedx: you have private patch in PS, right? [18:06:32] mutante: my personal opinion is that if it is in any way time sensitive, it should go ahead but if it can wait for a bit, I'd like to stay so we get it pushed, there's not much left tbh [18:07:26] Urbanecm: AIUI I have to generate the value on the deployment host, add it to private/PrivateSettings.php, and then it can be deployed? [18:07:59] Amir1: alright! but the options are between "wait a bit if you can" and "use the old server if you feel you have to" but not "hey, wanna be the first to test new server", right? [18:08:54] phuedx: affirmative [18:09:37] Urbanecm: Ah! Sorry. No. 
I haven't done the patch to PS.php yet [18:10:18] Amir1: they are asking for wikimania organizing, cant tell how urgent, in the past I would have just done it, dont want to step on your toes during import though [18:11:19] mutante: that one clearly is good to go [18:11:28] on my side [18:11:43] Amir1: on mailman2? then i'll do it, ack [18:11:49] yeah [18:11:52] ok, thanks [18:12:02] Urbanecm: Want me to? [18:12:23] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for Wikimania Core Organizing Team - https://phabricator.wikimedia.org/T279668 (10Dzahn) a:03Dzahn [18:12:40] phuedx: yup. You can also sync it. [18:12:43] but on the other hand, it's not a big deal to migrate them from the old one, so I don't want to block any list creation [18:13:16] Amir1: cool, ok. not expecting this to happen every day [18:14:13] could have been been that you were looking for candidates who first get created on new side and never import.. that was part of my thought [18:14:37] but if import is no big deal.. just going ahead as normal [18:15:11] Urbanecm: Mind if I hold off for 10 minutes? thcipriani wanted to sit in on the deployment [18:15:24] Not at all, waiting for CI [18:15:32] Thanks :) [18:17:48] (03PS1) 10Dzahn: add fake mcrouter certs for mw2403 through mw2411 [labs/private] - 10https://gerrit.wikimedia.org/r/677991 (https://phabricator.wikimedia.org/T279599) [18:18:36] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for mw2403 through mw2411 [labs/private] - 10https://gerrit.wikimedia.org/r/677991 (https://phabricator.wikimedia.org/T279599) (owner: 10Dzahn) [18:19:05] RECOVERY - Long running screen/tmux on puppetmaster1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:19:42] Urbanecm: Is there a certain form for the SAL message, e.g. PrivateSettings: Add value for $wg... (T123456)? 
[18:19:46] T123456: Special:CentralAuth reports account attachment, which - being standalone - is confusing, report accout creation as well - https://phabricator.wikimedia.org/T123456 [18:20:01] I ask because I usually defer to the output of the backport-summary script when I'm deploying ;) [18:22:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 7 hosts with reason: new_install [18:22:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 7 hosts with reason: new_install [18:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[2410-2411].codfw.wmnet with reason: new_install [18:23:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[2410-2411].codfw.wmnet with reason: new_install [18:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:17] phuedx: what you suggest should be fine [18:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:37] there's not really a standard way; as long as it explains what is happening, it should be good [18:24:07] (03Merged) 10jenkins-bot: Revert incorrect changes to ve.ui.MWBackCommand that made it stop working [extensions/VisualEditor] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/677725 (https://phabricator.wikimedia.org/T279613) (owner: 10Bartosz Dziewoński) [18:24:22] phuedx: please ping me once you're done, Matma.Rex's change just merged :) [18:24:50] Urbanecm: You go ahead of me [18:24:55] okay [18:24:58] MatmaRex: still around? [18:25:03] !log tgr@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[18:25:03] !log tgr@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [18:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:49] Urbanecm: yeah [18:26:00] MatmaRex: pulled to mwdebug1001, can you test? [18:26:03] is it just me, or is CI slower recently? [18:26:05] looking [18:27:28] Urbanecm: looks good [18:27:33] thx, syncing [18:29:42] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.38/extensions/VisualEditor/modules/ve-mw/ui/tools/ve.ui.MWBackTool.js: e0f3735f6a31d2914bae6c9daac1267707a2d108: Revert incorrect changes to ve.ui.MWBackCommand that made it stop working (T279613) (duration: 01m 07s) [18:29:49] MatmaRex: should be live [18:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:51] T279613: [wmf.38-regression] mobile VE - "oo-ui-icon-close" button does not work - https://phabricator.wikimedia.org/T279613 [18:30:02] thanks [18:30:07] np [18:30:14] phuedx: I'm done. [18:30:26] (03CR) 10Dzahn: "learned something from this change, ty" [puppet] - 10https://gerrit.wikimedia.org/r/677805 (owner: 10Alexandros Kosiaris) [18:30:31] * thcipriani waves [18:30:40] Urbanecm: Thanks [18:30:45] np [18:30:46] hi thcipriani :) [18:30:53] phuedx: do let me know if i can help in any way. [18:31:47] !log tgr@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [18:31:47] !log tgr@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[18:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:11] (03PS1) 10JMeybohm: New upstream version 0.13.1 [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/677996 [18:32:19] (03CR) 10Dzahn: "there is a bug, wikibugs uses "(owner: Alexandros Kosiaris)" on IRC but it should also be Αλέξανδρος Κοσιάρης now 😊" [puppet] - 10https://gerrit.wikimedia.org/r/677805 (owner: 10Alexandros Kosiaris) [18:33:42] (03PS2) 10JMeybohm: New upstream version 0.13.1 [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/677996 [18:37:28] !log mw2403 through mw2411 - serial rebooting [18:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Papaul) @Cmjohnson I did the second interface for cloudcephosd1016 see below for the instructions let me know in you have any qu... [18:42:22] Urbanecm: Made the commit to PrivateSettings.php. Going to sync-file now [18:43:36] Pulling to mwdebug1001 [18:44:24] Testing now [18:47:21] mutante: please uh, file a bug against wikibugs :p [18:47:37] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [18:47:49] legoktm: :) [18:48:14] yea, unicode works here [18:49:43] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, unassigned_shards: 0, status: green, active_primary_shards: 937, number_of_nodes: 6, initializing_shards: 0, timed_out: False, number_of_data_nodes: 6, number_of_pending_tasks: 0, relocating_shards: 0, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0, [18:49:43] _in_queue_millis: 0, active_shards: 1877, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:50:25] Syncing now [18:51:35] !log phuedx@deploy1002 Synchronized private/PrivateSettings.php: PrivateSettings: Add value for (T261842) (duration: 01m 06s) [18:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:44] T261842: Create schema to track users opting in/out of desktop improvements - https://phabricator.wikimedia.org/T261842 [18:51:59] *facepalm* double quotes [18:52:57] !log phuedx@deploy1002 Synchronized private/PrivateSettings.php: PrivateSettings: Add value for $wgWMEVectorPrefDiffSalt (T261842) [18:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] marxarelli and twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1900). 
[19:01:04] (03PS1) 10Andrew Bogott: policy.yaml files: update default behavior [puppet] - 10https://gerrit.wikimedia.org/r/678006 [19:02:11] (03CR) 10Andrew Bogott: [C: 03+2] policy.yaml files: update default behavior [puppet] - 10https://gerrit.wikimedia.org/r/678006 (owner: 10Andrew Bogott) [19:04:03] Urbanecm: All done :) [19:04:09] cool [19:09:23] (03PS1) 10Andrew Bogott: Cinder: prevent some actions in policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/678011 [19:10:24] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: prevent some actions in policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/678011 (owner: 10Andrew Bogott) [19:16:48] legoktm: T279710 scnr [19:16:48] T279710: wikibugs should display the same type of name that the Gerrit UI displays - https://phabricator.wikimedia.org/T279710 [19:17:28] :D [19:26:08] jouncebot: ^ that's my favorite bothumor. you're a pretty funny bot, jouncebot. [19:27:01] marxarelli: Is ther some way I can help out with the train? [19:27:09] * twentyafterfour hasn't been following as closely as I should [19:27:15] twentyafterfour: just getting a late start, sorry [19:28:05] np [19:28:34] i think we're good. the only thing of concern that i saw yesterday was https://phabricator.wikimedia.org/T279585 and that doesn't seem of concern to api folks or wikidata folks [19:28:49] so just the usual today. roll and watch [19:29:22] (03PS1) 10Andrew Bogott: OpenStack nova: allow anyone to read instance volume info [puppet] - 10https://gerrit.wikimedia.org/r/678022 (https://phabricator.wikimedia.org/T279697) [19:29:40] marxarelli: ok, I'll help watch logs if that's helpful [19:30:02] 10SRE, 10Wikimedia-Logstash, 10observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (10colewhite) p:05High→03Medium I found elasticsearch-curator 5.8.1 in the `thirdparty/elastic74` component and added it to the `thirdparty/elastic710` c... [19:30:22] thanks! that's always helpful. 
logspam-watch was acting funny for me yesterday (choking on some bad utf-8 i think?) but it seems to be ok today [19:30:47] (03PS2) 10Andrew Bogott: OpenStack nova: allow anyone to read instance volume info [puppet] - 10https://gerrit.wikimedia.org/r/678022 (https://phabricator.wikimedia.org/T279697) [19:31:33] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack nova: allow anyone to read instance volume info [puppet] - 10https://gerrit.wikimedia.org/r/678022 (https://phabricator.wikimedia.org/T279697) (owner: 10Andrew Bogott) [19:33:02] (03PS1) 10Dduvall: all wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678023 [19:33:04] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678023 (owner: 10Dduvall) [19:33:44] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678023 (owner: 10Dduvall) [19:34:08] marxarelli: re: logspam-watch, yeah, have a patch in for that: https://gerrit.wikimedia.org/r/c/operations/puppet/+/677676 [19:34:50] submitting "WikiPage constructed on a Title that cannot exist as a page" to prod-errors [19:35:01] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.38 [19:35:05] brennen: right on. 
even with the sporadic error it was still quite helpful [19:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:20] twentyafterfour: yeah, that doesn't look good [19:39:24] but happening for wmf.37 too it looks like [19:41:45] (03PS1) 10Andrew Bogott: Openstack Cinder: allow all users to getallsnapshots [puppet] - 10https://gerrit.wikimedia.org/r/678029 [19:42:59] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Cinder: allow all users to getallsnapshots [puppet] - 10https://gerrit.wikimedia.org/r/678029 (owner: 10Andrew Bogott) [19:44:31] a lot of lock wait timeouts, at least more than normal But I don't see any indicator of what the cause may be [19:46:12] (03PS1) 10Eric Gardner: Don't show "invalid search" message when request is aborted by user [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/677956 (https://phabricator.wikimedia.org/T277714) [19:47:09] Krinkle: just got to see your new phatality feature in action (the backlinks from phab to kibana) it works nicely [19:47:17] see T279711 [19:47:18] T279711: WikiPage constructed on a Title that cannot exist as a page: Special:Watchlist [Called from Article::newPage] - https://phabricator.wikimedia.org/T279711 [19:48:28] twentyafterfour: ooh nice, thanks for deploying that [19:48:42] seems to all work now as intended [19:49:19] yep it's pretty cool [19:49:36] thanks for building that feature, the backlinks are super helpful [19:49:55] +1 those are really nice [19:50:18] !log mw2403 through mw2411 - scap pull - new hardware [19:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:37] (03CR) 10Cwhite: "Tested in Pontoon and it appears to DTRT. Will triple-check the curator config in codfw before rolling out completely." 
[puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [19:50:44] Krinkle: should we get rid of the reqid field now or fix it so that it is populated? it's currently unpopulated and just used in the description template [19:53:03] (03PS5) 10Cwhite: pontoon: set jobs_host and define aggressive curator config [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) [19:54:10] twentyafterfour: yeah, I was going to follow-up maybe after some time has passed to hide trace/reqId from the form. [19:54:19] but I think that means it also hides it on existing tasks, I think? [19:54:34] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw240[3-9].codfw.wmnet [19:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:48] Krinkle: I think hiding it on the form does not hide it from the task detail view [19:55:11] Krinkle: I'll find out [19:55:12] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw240[3-9].codfw.wmnet [19:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:01] (03PS1) 10Bstorm: gridengine: set grid-configurator source files to use new domain name [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) [19:56:28] Krinkle: yeah the form doesn't affect the detail view [19:56:43] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw2379.codfw.wmnet [19:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:03] (03PS6) 10Cwhite: logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) [19:57:06] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw238[0-2].codfw.wmnet [19:57:07] twentyafterfour: hm.. 
okay, but it hides it from edit form though for those tasks [19:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:19] fields that are visible but not editable [19:57:26] I guess that's okay for those secondary fields [19:57:31] unlikely to want to change [19:57:46] (03CR) 10Bstorm: "Please note, I didn't bother messing with the "dedicated" exec nodes because they aren't used at all and that stuff should be removed." [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm) [19:58:09] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw241[0-1].codfw.wmnet [19:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:20] Krinkle: yeah I think it's ok for this essentially deprecated fields [19:58:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw241[0-1].codfw.wmnet [19:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:46] (03PS1) 10Razzi: superset: check http server following redirects with curl [puppet] - 10https://gerrit.wikimedia.org/r/678044 (https://phabricator.wikimedia.org/T277729) [19:59:35] jouncebot: now [19:59:35] For the next 1 hour(s) and 0 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1900) [20:00:08] just added new servers to scap groups but stands back now [20:00:22] (getting scap but not pooled) [20:02:21] !log imported parsoid_0.11.1all_all.deb to releases.wikimedia.org apt repo [20:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:25] (03PS2) 10Razzi: superset: check http server following redirects with curl [puppet] - 10https://gerrit.wikimedia.org/r/678044 (https://phabricator.wikimedia.org/T277729) [20:27:05] * razzi lunchtime! 
[20:27:07] !log legoktm@deploy1002:~$ cat deb-parsoid-urls.txt | mwscript purgeList.php --wiki=aawiki # to clear releases.wm.o/debian/ cache [20:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:54] jouncebot: now [20:27:55] For the next 0 hour(s) and 32 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T1900) [20:28:02] is a train ongoing? [20:28:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw241[0-1].codfw.wmnet [20:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:01] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw240[3-9].codfw.wmnet [20:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:57] !log mw2304 through mw2411 - pooled and set to active state in netbox (T279599) [20:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:06] T279599: bring 10 new mediawiki appserver in codfw into production, new rack A5 (mw2402 - mw2411) - https://phabricator.wikimedia.org/T279599 [20:33:20] typo in log line again, duh [20:33:40] !log mw2403 through mw2411 pooled and set to active state in netbox (T279599) [20:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) [20:34:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) mw2403 through mw2411 in production, set to Active in Netbox. 
[20:40:13] PROBLEM - mediawiki-installation DSH group on wtp1025 is CRITICAL: Host wtp1025 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:40:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) Checked netbox one more time. Now all mw servers in codfw are in one of 2 states. ACTIVE or OFFLINE and covered by decom tickets. [20:42:51] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10KFrancis) @Lena_WMDE The NDA was sent to the email listed for your electronic signature. Please review and sign when you have a minute. Thanks! [20:43:44] (03CR) 10Ottomata: "Huh, maybe I don't know anything!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678044 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [20:44:08] (03CR) 10Ottomata: "Otherwise lgtm, maybe investigate to see how monitoring::service works vs nrpe::monitoring_service." [puppet] - 10https://gerrit.wikimedia.org/r/678044 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [20:45:22] ignoring the wtp1025 alert after I saw it's not pooled. be back later [20:47:25] PROBLEM - Ensure local MW versions match expected deployment on parse2001 is CRITICAL: CRITICAL: 318 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:48:17] PROBLEM - Ensure local MW versions match expected deployment on wtp1025 is CRITICAL: CRITICAL: 318 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:48:54] 10SRE, 10Wikimedia-Logstash, 10observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (10herron) Initially the thinking was to store the appropriate elasticsearch-curator for each ES version in the component. 
But in practice yeah that's provi... [20:57:34] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for Wikimania Core Organizing Team - https://phabricator.wikimedia.org/T279668 (10Dzahn) Hey @Effeietsanders, you should have mail. See https://lists.wikimedia.org/mailman/listinfo/wikimania-cot and you should have a random password to login at: https:... [21:00:27] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for Wikimania Core Organizing Team - https://phabricator.wikimedia.org/T279668 (10Dzahn) 05Open→03Resolved [21:06:04] (03PS1) 10Gergő Tisza: Bump linkrecommendation version [deployment-charts] - 10https://gerrit.wikimedia.org/r/678078 [21:07:28] (03CR) 10Razzi: superset: check http server following redirects with curl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678044 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [21:15:40] 10SRE, 10WMF-Annual-Report: Update annual.wikimedia.org redirect to point to 2020 Annual Report - https://phabricator.wikimedia.org/T279571 (10Dzahn) a:03Dzahn [21:16:32] (03PS3) 10Razzi: superset: check http server following redirects with curl [puppet] - 10https://gerrit.wikimedia.org/r/678044 (https://phabricator.wikimedia.org/T277729) [21:17:15] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Dzahn) 05Open→03Stalled setting to stalled to reflect this. 
if you feel like it's getting more urgent feel free to change that [21:17:40] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Dzahn) p:05Triage→03Medium [21:17:56] (03CR) 10Gergő Tisza: [C: 03+2] Bump linkrecommendation version [deployment-charts] - 10https://gerrit.wikimedia.org/r/678078 (owner: 10Gergő Tisza) [21:18:03] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28969/console" [puppet] - 10https://gerrit.wikimedia.org/r/678044 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [21:18:25] 10SRE, 10Wikimedia-Mailing-lists: Hausa Wikimedians mailing list - https://phabricator.wikimedia.org/T279654 (10Dzahn) p:05Triage→03Medium [21:21:59] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (10Dzahn) Any news on access check for @holger.knust ? [21:22:33] (03Merged) 10jenkins-bot: Bump linkrecommendation version [deployment-charts] - 10https://gerrit.wikimedia.org/r/678078 (owner: 10Gergő Tisza) [21:22:50] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10Dzahn) Hi @MarkTraceur friendly ping. this is still blocked on your approval at the moment. 
[21:23:24] 10SRE, 10LDAP-Access-Requests: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10Dzahn) a:03Lena_WMDE [21:24:13] 10SRE, 10LDAP-Access-Requests: Grant access to Superset for Mikeraish - https://phabricator.wikimedia.org/T279147 (10Dzahn) a:03MRaishWMF [21:24:57] 10SRE, 10LDAP-Access-Requests, 10CAS-SSO: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (10Dzahn) a:03Reedy [21:29:00] 10SRE, 10DBA, 10Platform Engineering, 10Wikimedia-Incident: Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (10Krinkle) >>! In T278655#6982467, @Marostegui wrote: > I am not fully sure I am reading the disk space graph correctly as I don't see an increa... [21:29:13] 10SRE, 10DBA, 10Platform Engineering, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Appservers latency spike / parser cache growth 2021-03-28 - https://phabricator.wikimedia.org/T278655 (10Krinkle) [21:32:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1041.eqiad.wmnet',... [21:33:43] !log tgr@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[21:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:25] !log andrew@deploy1002 Started deploy [horizon/deploy@3abe9d0]: Fix for T279667 [21:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:33] T279667: Horizon: 'edit security groups' instance menu produces an error - https://phabricator.wikimedia.org/T279667 [21:35:45] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul) [21:38:18] !log andrew@deploy1002 Finished deploy [horizon/deploy@3abe9d0]: Fix for T279667 (duration: 03m 52s) [21:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:27] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10wiki_willy) [21:46:41] !log tgr@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [21:46:41] !log tgr@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[21:46:42] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: REIMAGE [21:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:17] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: REIMAGE [21:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:43] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: REIMAGE [21:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:37] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: REIMAGE [21:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:32] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: REIMAGE [21:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:42] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: REIMAGE [21:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:36] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: REIMAGE [21:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:13] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [21:54:43] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with 
reason: REIMAGE [21:54:43] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [21:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:45] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: REIMAGE [21:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:43] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: REIMAGE [21:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:55] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: REIMAGE [21:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:00] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: REIMAGE [22:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:02] (03CR) 10Razzi: [V: 03+1 C: 03+2] superset: check http server following redirects with curl [puppet] - 10https://gerrit.wikimedia.org/r/678044 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [22:07:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1041.eqiad.wmnet', 'cloudvirt1042.eqiad.wmnet', 'cloudvirt1043.eqi... [22:12:11] !log tgr@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [22:12:11] !log tgr@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[22:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:12] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for Wikimania Core Organizing Team - https://phabricator.wikimedia.org/T279668 (10Effeietsanders) @Dzahn many thanks for the speedy turnaround. I've set it up. [22:18:15] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:20:09] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 0, unassigned_shards: 0, active_shards: 1877, cluster_name: cloudelastic-chi-eqiad, number_of_data_nodes: 6, active_primary_shards: 937, relocating_shards: 0, timed_out: False, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 100.0, number_of_pending_ [22:20:09] green, number_of_nodes: 6, delayed_unassigned_shards: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:21:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10RobH) [22:21:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10RobH) I've emailed our Dell rep to determine where the NIC is for the seed server, cloudvirt1040. Once I have that info, I'll reass... [22:25:32] (03CR) 10Bstorm: [C: 03+1] "Interesting. I think I like this more. We'll want to update the docs." 
[puppet] - 10https://gerrit.wikimedia.org/r/677862 (owner: 10Arturo Borrero Gonzalez) [22:28:02] (03CR) 10Bstorm: [C: 03+1] "In a way, we could just make the --beta arg unnecessary (and key off of the project file), but maybe it's good to force you to check where" [puppet] - 10https://gerrit.wikimedia.org/r/677865 (owner: 10Arturo Borrero Gonzalez) [22:28:55] (03CR) 10Bstorm: "Suggested a patch that could make this approach work." [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [22:44:37] (03PS1) 10Dzahn: annualreport: update redirect to annual report for 2020 [puppet] - 10https://gerrit.wikimedia.org/r/678106 (https://phabricator.wikimedia.org/T279571) [22:46:07] (03PS2) 10Dzahn: annualreport: update redirect to annual report for 2020 [puppet] - 10https://gerrit.wikimedia.org/r/678106 (https://phabricator.wikimedia.org/T279571) [22:49:04] (03PS1) 10Razzi: superset: put puppet:// resource in files/ [puppet] - 10https://gerrit.wikimedia.org/r/678109 (https://phabricator.wikimedia.org/T277729) [22:51:34] (03CR) 10Razzi: [C: 03+2] superset: put puppet:// resource in files/ [puppet] - 10https://gerrit.wikimedia.org/r/678109 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [22:54:10] jouncebot next [22:54:11] In 0 hour(s) and 5 minute(s): [[Backport windows|US Backport and Config training]]
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T2300) [22:59:27] legoktm: if you dont mind .. a review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/678106 [23:00:00] * legoktm looks [23:00:05] brennen: Time to snap out of that daydream and deploy [[Backport windows|US Backport and Config training]]. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210408T2300). [23:00:17] here, will be doing training with EricGardner. [23:01:23] (03CR) 10Legoktm: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/678106 (https://phabricator.wikimedia.org/T279571) (owner: 10Dzahn) [23:01:34] thank you [23:02:03] (03CR) 10Dzahn: [C: 03+2] annualreport: update redirect to annual report for 2020 [puppet] - 10https://gerrit.wikimedia.org/r/678106 (https://phabricator.wikimedia.org/T279571) (owner: 10Dzahn) [23:02:10] (03PS3) 10Dzahn: annualreport: update redirect to annual report for 2020 [puppet] - 10https://gerrit.wikimedia.org/r/678106 (https://phabricator.wikimedia.org/T279571) [23:02:13] np :) [23:04:40] I am here too - will do quick verify on https://gerrit.wikimedia.org/r/678106 [23:04:55] 10SRE, 10WMF-Annual-Report, 10Patch-For-Review: Update annual.wikimedia.org redirect to point to 2020 Annual Report - https://phabricator.wikimedia.org/T279571 (10Dzahn) p:05Triage→03High [23:05:06] (03CR) 10Brennen Bearnes: [C: 03+2] Don't show "invalid search" message when request is aborted by user [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/677956 (https://phabricator.wikimedia.org/T277714) (owner: 10Eric Gardner) [23:06:45] 10SRE, 10WMF-Annual-Report, 10Patch-For-Review: Update annual.wikimedia.org redirect to point to 2020 Annual Report - https://phabricator.wikimedia.org/T279571 (10Dzahn) @spatton Thanks for the thoughtful way you handled the ticket. Code change has been reviewed and deployed just now on the backends (miscwe... 
[23:08:14] (03PS1) 10Razzi: superset: comment out check that isn't working as intended [puppet] - 10https://gerrit.wikimedia.org/r/678113 (https://phabricator.wikimedia.org/T277729) [23:08:37] 10SRE, 10WMF-Annual-Report, 10Patch-For-Review: Update annual.wikimedia.org redirect to point to 2020 Annual Report - https://phabricator.wikimedia.org/T279571 (10Dzahn) ` curl -S https://annual.wikimedia.org | grep moved ...

The document has moved
EricGardner has confirmed https://gerrit.wikimedia.org/r/677956 working, going ahead with sync. [23:47:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:47:34] Hm [23:47:36] i'll look at that. [23:48:21] !log brennen@deploy1002 Synchronized php-1.36.0-wmf.38/extensions/WikibaseMediaInfo/resources/mediasearch-vue/store/actions.js: Backport: [[gerrit:677956|Do not show "invalid search" message when request is aborted by user (TT277714)]] (duration: 00m 57s) [23:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:50:18] (03CR) 10Papaul: [C: 03+2] Add moss-be200[12] MAC adderess, partman recipe and role insetup [puppet] - 10https://gerrit.wikimedia.org/r/678117 (https://phabricator.wikimedia.org/T276642) (owner: 10Papaul) [23:54:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` moss-be2001.codfw.wmnet ` The log can be found in `/var/...