[00:07:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:09:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:09:47] RECOVERY - PHP opcache health on mw2197 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:12:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:14:23] PROBLEM - PHP opcache health on mw2239 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:15:53] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:21:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:30:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:30:49] !log restart elasticsearch on logstash1010
[00:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:55] PROBLEM - PHP opcache health on mw2191 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:34:19] RECOVERY - PHP opcache health on mw2239 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:35:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:37:55] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[00:38:57] (CR) BryanDavis: [C: +2] Drop fam from everywhere [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/603667 (owner: Legoktm)
[00:39:28] (Merged) jenkins-bot: Drop fam from everywhere [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/603667 (owner: Legoktm)
[00:41:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:43:24] (PS1) Ladsgroup: maps: Rename "slave" to "replica" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646)
[00:44:31] (CR) jerkins-bot: [V: -1] maps: Rename "slave" to "replica" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup)
[00:45:41] RECOVERY - PHP opcache health on mw2191 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:46:45] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:48:16] mediawiki is surfacing the global timeout more than normal ^^
[00:50:53] PROBLEM - PHP opcache health on mw2195 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:51:34] (PS2) Ladsgroup: maps: Rename "slave" to "replica" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646)
[00:52:41] (CR) jerkins-bot: [V: -1] maps: Rename "slave" to "replica" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup)
[00:58:57] (PS3) Ladsgroup: maps: Rename "slave" to "replica" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646)
[01:01:49] (CR) Ladsgroup: "Review would be really appreciated here" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup)
[01:07:57] PROBLEM - PHP opcache health on wtp2012 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:09:11] RECOVERY - PHP opcache health on mw2195 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:10:25] PROBLEM - PHP opcache health on mw2236 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:12:31] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[01:14:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:15:55] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:23:07] RECOVERY - PHP opcache health on mw2236 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:28:25] PROBLEM - PHP opcache health on mw2189 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:34:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:37:31] RECOVERY - PHP opcache health on mw2189 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:39:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:49:49] RECOVERY - PHP opcache health on wtp2012 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:50:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:01:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:19:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:20:23] PROBLEM - PHP opcache health on wtp2011 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:32:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:37:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:41:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:47:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:49:33] (PS1) Ryan Kemper: elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731
[02:51:31] (CR) jerkins-bot: [V: -1] elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731 (owner: Ryan Kemper)
[02:52:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:01:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:04:07] RECOVERY - PHP opcache health on wtp2011 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:13:04] (CR) Ryan Kemper: "I tried a dry run using a (hopefully) representative subset of the code:" [cookbooks] - https://gerrit.wikimedia.org/r/603731 (owner: Ryan Kemper)
[03:15:06] (PS2) Ryan Kemper: elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731
[03:16:49] (CR) jerkins-bot: [V: -1] elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731 (owner: Ryan Kemper)
[03:17:04] (CR) Ryan Kemper: [C: +2] maps: Rename "slave" to "replica" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup)
[03:17:58] (CR) Ryan Kemper: [C: +2] "I wanted to submit and puppet-merge this, but the button is greyed out for me. Does this need to be submitted by the author, or is it grey" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup)
[03:18:33] (CR) Ryan Kemper: [C: +2] "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup)
[03:20:13] PROBLEM - PHP opcache health on mw2272 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:21:10] (CR) Ryan Kemper: [C: +2] "Would the job of solving this merge conflict fall on (Search team) us as the service owners, or instead on the author/etc because this wor" [puppet] - https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: Ladsgroup)
[03:22:50] (PS3) Ryan Kemper: elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731
[03:24:44] (CR) jerkins-bot: [V: -1] elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731 (owner: Ryan Kemper)
[03:27:16] (PS4) Ryan Kemper: elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731
[03:28:49] (PS5) Ryan Kemper: elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731
[03:29:15] RECOVERY - PHP opcache health on mw2272 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:30:30] (CR) jerkins-bot: [V: -1] elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731 (owner: Ryan Kemper)
[03:31:31] PROBLEM - PHP opcache health on wtp2016 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:35:49] (PS6) Ryan Kemper: elasticsearch: Amend prom query to match new state [cookbooks] - https://gerrit.wikimedia.org/r/603731
[03:37:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:41:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:44:00] (CR) Ryan Kemper: "Okay, I finally made the linter happy. Ready for review." [cookbooks] - https://gerrit.wikimedia.org/r/603731 (owner: Ryan Kemper)
[03:58:33] PROBLEM - PHP opcache health on mw2310 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:02:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:05:49] RECOVERY - PHP opcache health on mw2310 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:06:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:08:25] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[04:10:07] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9000 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[04:35:33] Operations, Wikimedia-Mailing-lists, Hindi-Sites: Adminship of Hindi Wikipedia Mailing List - https://phabricator.wikimedia.org/T73388 (CptViraj)
[04:46:27] PROBLEM - PHP opcache health on wtp2013 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:54:42] (PS1) Marostegui: labsdb1011: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/603770 (https://phabricator.wikimedia.org/T249188)
[04:55:47] (CR) Marostegui: [C: +2] labsdb1011: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/603770 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[04:56:47] RECOVERY - PHP opcache health on wtp2016 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:02:02] (PS1) Marostegui: dbproxy1018: Pool labsdb1011 and add labsdb1010 with reduced weight [puppet] - https://gerrit.wikimedia.org/r/603774 (https://phabricator.wikimedia.org/T249188)
[05:02:35] (Abandoned) Marostegui: dbproxy1018: Add labsdb1010 with reduced weight [puppet] - https://gerrit.wikimedia.org/r/598691 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[05:03:38] (CR) Marostegui: [C: +2] dbproxy1018: Pool labsdb1011 and add labsdb1010 with reduced weight [puppet] - https://gerrit.wikimedia.org/r/603774 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[05:10:15] PROBLEM - haproxy alive on dbproxy1018 is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy
[05:10:23] ^ checking
[05:12:19] PROBLEM - PHP opcache health on mw2194 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:20:21] RECOVERY - PHP opcache health on mw2194 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:28:16] RECOVERY - PHP opcache health on wtp2013 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:32:54] !log Switch dbproxy1018 from "master" service to "replicas" - T249188
[05:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:59] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[05:34:49] (PS1) Marostegui: db1091: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/603784 (https://phabricator.wikimedia.org/T253217)
[05:37:26] PROBLEM - Thanos compact is halted on icinga1001 is CRITICAL: cluster=thanos instance=thanos-fe2001:12902 job=thanos-compact prometheus=ops site=codfw https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[05:37:32] (CR) Marostegui: [C: +2] db1091: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/603784 (https://phabricator.wikimedia.org/T253217) (owner: Marostegui)
[05:47:48] RECOVERY - haproxy alive on dbproxy1018 is OK: OK check_alive uptime 943s https://wikitech.wikimedia.org/wiki/HAProxy
[05:50:28] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:50:58] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:51:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1091 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11415 and previous config saved to /var/cache/conftool/dbconfig/20200609-055128-marostegui.json
[05:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:32] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[05:58:24] PROBLEM - PHP opcache health on wtp2006 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[06:12:26] (CR) Giuseppe Lavagetto: [C: +1] [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[06:12:43] (CR) Giuseppe Lavagetto: [C: +1] Enable kafka purges everywhere. [mediawiki-config] - https://gerrit.wikimedia.org/r/603654 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[06:14:22] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Billinghurst) >>! In T238285#5821312, @MusikAnimal wrote: > I ran into this when I was unable to block https://...
[06:19:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1091 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11416 and previous config saved to /var/cache/conftool/dbconfig/20200609-061916-marostegui.json
[06:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:21] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[06:26:23] PROBLEM - PHP opcache health on mw2268 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[06:33:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly pool db1091 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11417 and previous config saved to /var/cache/conftool/dbconfig/20200609-063344-marostegui.json
[06:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:48] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[06:36:01] Operations, Performance-Team, serviceops, Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Joe) >>! In T240684#6193541, @elukey wrote: > We don't, and doing it should require a big change in the mcrout...
[06:40:04] RECOVERY - PHP opcache health on wtp2006 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[06:40:22] !log Deploy schema change on s2 T206103
[06:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:25] T206103: recentchanges table indexes: tmp1, tmp2 and tmp3 - https://phabricator.wikimedia.org/T206103
[06:48:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully pool db1091 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11418 and previous config saved to /var/cache/conftool/dbconfig/20200609-064829-marostegui.json
[06:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:34] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[06:51:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2113 for on-site maintenance T251570', diff saved to https://phabricator.wikimedia.org/P11419 and previous config saved to /var/cache/conftool/dbconfig/20200609-065125-marostegui.json
[06:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:29] T251570: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570
[06:52:31] (PS1) Marostegui: db2113: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/603807 (https://phabricator.wikimedia.org/T251570)
[06:52:58] (CR) Marostegui: [C: +2] db2113: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/603807 (https://phabricator.wikimedia.org/T251570) (owner: Marostegui)
[06:53:13] !log Stop MySQL on db2113 for maintenance - T251570
[06:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:43] RECOVERY - PHP opcache health on mw2268 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[06:54:23] Operations, ops-codfw, procurement, Patch-For-Review: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (Marostegui) I have powered off db2113. Once you are done with the maintenance, please power it back on. Thank you and good luck!
[06:56:52] Operations, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[06:58:46] Operations, Core Platform Team, Traffic: Move wikitech purges to kafka - https://phabricator.wikimedia.org/T254828 (ema) > @ema please confirm if we need to keep using HTCP or we can switch to kafka for these caches. In case kafka needs to be used, we need to enable EventBus on wikitech, which would...
[06:58:54] Operations, Core Platform Team, Traffic: Move wikitech purges to kafka - https://phabricator.wikimedia.org/T254828 (ema) p:Triage→Medium
[07:03:50] (PS1) Elukey: Add piwik overrides for matomo1002 to ease testing [puppet] - https://gerrit.wikimedia.org/r/603809 (https://phabricator.wikimedia.org/T252767)
[07:04:58] PROBLEM - PHP opcache health on mw2276 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:05:09] (CR) Elukey: [C: +2] Add piwik overrides for matomo1002 to ease testing [puppet] - https://gerrit.wikimedia.org/r/603809 (https://phabricator.wikimedia.org/T252767) (owner: Elukey)
[07:07:59] (CR) Gilles: [V: +2 C: +2] Store Content-Disposition header in Swift [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/603386 (https://phabricator.wikimedia.org/T254557) (owner: Gilles)
[07:09:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3314, db1097:3315 T253217', diff saved to https://phabricator.wikimedia.org/P11420 and previous config saved to /var/cache/conftool/dbconfig/20200609-070917-marostegui.json
[07:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:21] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217
[07:09:22] PROBLEM - PHP opcache health on wtp2019 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:10:02] (PS1) Elukey: profile::statistics::gpu: upgrade stat1005 to rocm 3.3 [puppet] - https://gerrit.wikimedia.org/r/603814 (https://phabricator.wikimedia.org/T247082)
[07:10:54] Operations, Core Platform Team, Traffic: Configure purged in depoloyment-prep - https://phabricator.wikimedia.org/T254844 (ema)
[07:11:15] Operations, Core Platform Team, Traffic: Configure purged in depoloyment-prep - https://phabricator.wikimedia.org/T254844 (ema) p:Triage→Medium
[07:11:45] !log deployment-cache-text06: stop vhtcpd, start purged T254844
[07:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:49] T254844: Configure purged in depoloyment-prep - https://phabricator.wikimedia.org/T254844
[07:11:57] (PS1) Marostegui: install_server: Allow reimage db1097 [puppet] - https://gerrit.wikimedia.org/r/603816 (https://phabricator.wikimedia.org/T253217)
[07:12:14] (CR) Elukey: [C: +2] profile::statistics::gpu: upgrade stat1005 to rocm 3.3 [puppet] - https://gerrit.wikimedia.org/r/603814 (https://phabricator.wikimedia.org/T247082) (owner: Elukey)
[07:12:55] (CR) Marostegui: [C: +2] install_server: Allow reimage db1097 [puppet] - https://gerrit.wikimedia.org/r/603816 (https://phabricator.wikimedia.org/T253217) (owner: Marostegui)
[07:14:58] Operations, Core Platform Team, Traffic: Configure purged in depoloyment-prep - https://phabricator.wikimedia.org/T254844 (ema) purged is now running in deployment-prep instead of vhtcpd: ` ema@deployment-cache-text06:~$ systemctl status purged.service ● purged.service - Purger for ATS and Varnish...
[07:15:23] hashar: should this be merged or should we ask qchris first: https://gerrit.wikimedia.org/r/c/operations/puppet/+/598058
[07:16:37] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/23081/" [puppet] - https://gerrit.wikimedia.org/r/603490 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[07:17:43] (PS1) Marostegui: mariadb: Move db1097 to m1 [puppet] - https://gerrit.wikimedia.org/r/603871 (https://phabricator.wikimedia.org/T254556)
[07:18:11] (CR) jerkins-bot: [V: -1] mariadb: Move db1097 to m1 [puppet] - https://gerrit.wikimedia.org/r/603871 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[07:19:35] (CR) Marostegui: [V: +2 C: +2] mariadb: Move db1097 to m1 [puppet] - https://gerrit.wikimedia.org/r/603871 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[07:19:51] Operations, ops-codfw, decommission, serviceops, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Dzahn) >>! In T247018#6204125, @Papaul wrote: > switch ports removed for mw2154 through mw2186 @P...
[07:21:07] (Abandoned) Dzahn: site: add ganeti role to all new ganeti servers [puppet] - https://gerrit.wikimedia.org/r/576406 (https://phabricator.wikimedia.org/T228924) (owner: Dzahn)
[07:21:57] Operations: Build Debian package for thumbor-plugins 2.9 and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (Gilles)
[07:22:07] Operations: Build Debian package for thumbor-plugins 2.9 and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (Gilles) p:Triage→Medium
[07:22:12] (Abandoned) Dzahn: add TXT record to wikimedia.org for haveibeenpwned.com verification [dns] - https://gerrit.wikimedia.org/r/593187 (https://phabricator.wikimedia.org/T246357) (owner: Dzahn)
[07:23:40] (PS16) Dzahn: phabricator: add envoy TLS terminator for aphlict [puppet] - https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593)
[07:24:00] Operations: Build Debian package for thumbor-plugins 2.9 and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (Gilles)
[07:24:03] Operations, Commons, MediaWiki-File-management, Thumbor, Browser-Support-Firefox: Thumbor doesn't save Content-Disposition: inline headers to Swift for webp thumbnails - https://phabricator.wikimedia.org/T254557 (Gilles)
[07:24:15] Operations: Build Debian package for thumbor-plugins 2.9 and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (Gilles)
[07:24:31] Operations: Build Debian package for thumbor-plugins 2.9 and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (Gilles)
[07:24:46] Operations: Build Debian package for thumbor-plugins 2.9 and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (Gilles)
[07:25:27] (PS1) Gilles: Version bump [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/603875 (https://phabricator.wikimedia.org/T254845)
[07:25:46] (CR) Gilles: [V: +2 C: +2] Version bump [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/603875 (https://phabricator.wikimedia.org/T254845) (owner: Gilles)
[07:25:50] (CR) Dzahn: [C: +2] phabricator: add envoy TLS terminator for aphlict [puppet] - https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[07:25:55] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/23082/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[07:26:41] RECOVERY - PHP opcache health on mw2276 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:30:19] !log upgrading mw1390-mw1413 to PHP 7.2.31
[07:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:22] (PS1) Gilles: Upgrade to 2.9 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/603876 (https://phabricator.wikimedia.org/T254845)
[07:31:57] Operations, Patch-For-Review: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (Gilles)
[07:33:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[07:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:57] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:36:15] marostegui, kormat ^^ expected? :)
[07:36:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:25] vgutierrez: I will fix it
[07:36:28] <3
[07:36:32] (CR) JMeybohm: prometheus: enable Thanos upload for k8s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[07:36:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1097 from config', diff saved to https://phabricator.wikimedia.org/P11421 and previous config saved to /var/cache/conftool/dbconfig/20200609-073635-marostegui.json
[07:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:42] :)
[07:41:00] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:41:28] vgutierrez: for once, not me \o/
[07:42:08] :)
[07:42:29] Operations, Core Platform Team, Traffic: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (Aklapper)
[07:46:01] (CR) JMeybohm: lvs::configuration: add termbox-https (3 comments) [puppet] - https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581) (owner: JMeybohm)
[07:58:33] (PS2) Elukey: Remove mc1036/mc2036 from the Redis Nutcracker config [puppet] - https://gerrit.wikimedia.org/r/595810 (https://phabricator.wikimedia.org/T252391)
[07:59:33] (PS18) Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027)
[07:59:43] (CR) Kormat: install_server: Allow reuse of partitions during reimage.
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [08:00:00] (03CR) 10Elukey: "Updated the commit msg after a chat that I had with Timo some days ago (there were some unclear points that I hope are now better)." [puppet] - 10https://gerrit.wikimedia.org/r/595810 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [08:00:34] RECOVERY - PHP opcache health on wtp2019 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:00:40] PROBLEM - PHP opcache health on mw2190 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:01:13] (03PS1) 10Dzahn: phabricator: add non-global cert/key path for aphlict envoy terminator [puppet] - 10https://gerrit.wikimedia.org/r/603883 (https://phabricator.wikimedia.org/T238593) [08:01:43] !log stop m1 on db1117 to clone db1097 (this will trigger an haproxy irc alert) - T254556 [08:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:47] T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556 [08:04:00] ACKNOWLEDGEMENT - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [08:06:16] (03PS2) 10Vgutierrez: Revert "ATS: Re-enable parent proxies on ats-tls" [puppet] - 10https://gerrit.wikimedia.org/r/588954 (https://phabricator.wikimedia.org/T249335) [08:07:58] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:08:06] ACKNOWLEDGEMENT - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [08:09:10] (03CR) 10Dzahn: [C: 03+2] 
"https://puppet-compiler.wmflabs.org/compiler1002/23083/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/603883 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [08:09:58] RECOVERY - PHP opcache health on mw2190 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:11:02] (03PS1) 10Dzahn: Revert "phabricator: add non-global cert/key path for aphlict envoy terminator" [puppet] - 10https://gerrit.wikimedia.org/r/603887 [08:11:19] (03CR) 10Dzahn: [C: 03+2] Revert "phabricator: add non-global cert/key path for aphlict envoy terminator" [puppet] - 10https://gerrit.wikimedia.org/r/603887 (owner: 10Dzahn) [08:11:32] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "phabricator: add non-global cert/key path for aphlict envoy terminator" [puppet] - 10https://gerrit.wikimedia.org/r/603887 (owner: 10Dzahn) [08:12:31] I just got a wikimedia error on Phabricator [08:12:35] Phab is 502'ing on me. ("via cp3050.esams.wmnet, ATS/8.0.7") [08:12:35] mutante: ^ [08:12:39] yea, i broke it :/ [08:13:02] * RhinosF1 is having a rubbish day for tehh Cu [08:13:05] Tech [08:13:56] PROBLEM - Check that envoy is running on phab1001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:13:56] (03PS19) 10Kormat: install_server: Allow reuse of partitions during reimage. 
[puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [08:14:32] andre__: RhinosF1 : fixed [08:14:33] ugh [08:14:45] thanks [08:14:49] i had to rebuild the envoy config [08:14:53] ty [08:15:00] it broke when i tried to fix the aphlict situation [08:15:23] (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/23084/" [puppet] - https://gerrit.wikimedia.org/r/588954 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez) [08:15:36] RECOVERY - Check that envoy is running on phab1001 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [08:15:44] RECOVERY - Thanos compact is halted on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:16:27] human monitoring faster than icinga [08:16:45] that's surprising [08:16:54] (PS4) Kormat: Add native mysql spicerack module. [software/spicerack] - https://gerrit.wikimedia.org/r/603434 [08:17:11] but then everything that could have failed today already has [08:17:30] RhinosF1: oh? [08:18:28] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:43] mutante: I was in the middle of debugging a crashed tool and was delayed dealing with that because I was talking to a security researcher about a possible vulnerability [08:19:05] talking to security researchers is what causes vulnerabilities. ;) [08:20:10] RhinosF1: sometimes it's one of those days...
[08:20:39] I think it might be [08:21:05] (PS1) Dzahn: Revert "phabricator: add envoy TLS terminator for aphlict" [puppet] - https://gerrit.wikimedia.org/r/603891 [08:21:09] * RhinosF1 takes his crashed tool rant to -cloud [08:22:10] PROBLEM - PHP opcache health on wtp2007 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:23:01] (CR) Dzahn: [C: +2] Revert "phabricator: add envoy TLS terminator for aphlict" [puppet] - https://gerrit.wikimedia.org/r/603891 (owner: Dzahn) [08:23:34] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [08:26:20] (CR) Giuseppe Lavagetto: [C: +1] lvs::configuration: add termbox-https [puppet] - https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581) (owner: JMeybohm) [08:28:19] !log upgrading deployment servers to PHP 7.2.31 [08:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:30] (PS1) Dzahn: Revert "Revert "phabricator: add envoy TLS terminator for aphlict"" [puppet] - https://gerrit.wikimedia.org/r/603895 [08:33:04] RECOVERY - PHP opcache health on wtp2007 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:33:12] (PS2) Dzahn: phabricator: add envoy TLS terminator for aphlict (DO NOT MERGE) [puppet] - https://gerrit.wikimedia.org/r/603895 (https://phabricator.wikimedia.org/T238593) [08:39:43] !log upgrading snapshot servers to PHP 7.2.31 [08:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:35] (PS1) Marostegui: db1097: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/603900 (https://phabricator.wikimedia.org/T254556) [08:47:23] (CR) Marostegui: [C: +2] db1097: Enable notifications [puppet] -
10https://gerrit.wikimedia.org/r/603900 (https://phabricator.wikimedia.org/T254556) (owner: 10Marostegui) [08:50:17] (03CR) 10Ladsgroup: "I'm pretty sure I started to work on top of HEAD. it doesn't matter. Let me fix it. Maybe it's the rebase hell? https://phabricator.wikime" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [08:50:42] (03PS4) 10Ladsgroup: maps: Rename "slave" to "replica" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) [08:51:20] (03CR) 10Gehel: [C: 04-1] "The change itself looks good! A few questions / comments inline, mostly about the organization of the code." (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (owner: 10Ryan Kemper) [08:51:44] (03CR) 10Ladsgroup: "Yup, it didn't have any real merge conflict" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [08:54:38] (03PS7) 10Muehlenhoff: Drop maps from supported clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 [08:57:09] (03CR) 10Gehel: [C: 04-1] "PCC is failing on this change: https://puppet-compiler.wmflabs.org/compiler1001/23085/" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [08:57:12] PROBLEM - PHP opcache health on mw2237 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:57:15] (03CR) 10Dzahn: [C: 04-2] phabricator: add envoy TLS terminator for aphlict (DO NOT MERGE) [puppet] - 10https://gerrit.wikimedia.org/r/603895 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [08:59:21] (03Abandoned) 10Dzahn: phabricator: enable TLS for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587225 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [09:00:34] (03CR) 10Muehlenhoff: [C: 03+2] Drop maps 
from supported clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff) [09:01:30] !log rolling restart of cassandra on maps* to pick up Java security updates [09:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:56] (03PS2) 10Dzahn: puppetize meet-accountmanager (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) [09:03:42] (03PS1) 10Marostegui: mariadb: Move db1141 from labsdb role to s4 [puppet] - 10https://gerrit.wikimedia.org/r/603907 (https://phabricator.wikimedia.org/T252512) [09:04:02] (03CR) 10jerkins-bot: [V: 04-1] puppetize meet-accountmanager (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: 10Dzahn) [09:08:00] RECOVERY - PHP opcache health on mw2237 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [09:12:07] (03CR) 10Kormat: [C: 03+1] mariadb: Move db1141 from labsdb role to s4 [puppet] - 10https://gerrit.wikimedia.org/r/603907 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui) [09:12:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1141 from labsdb role to s4 [puppet] - 10https://gerrit.wikimedia.org/r/603907 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui) [09:15:25] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1003/23069/maps2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: 10Mholloway) [09:16:57] (03PS1) 10Giuseppe Lavagetto: check_opcache: simplify thresholds for the script [puppet] - 10https://gerrit.wikimedia.org/r/603921 [09:17:02] (03PS3) 10Dzahn: puppetize meet-accountmanager (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) [09:19:29] 10Operations: rack/setup/instal (4) CI ganeti nodes - 
https://phabricator.wikimedia.org/T228926 (10akosiaris) [09:20:25] 10Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10akosiaris) 05Open→03Resolved The hardware machines are now in full production mode, ready to receive VMS. Finally resolving. [09:21:32] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10akosiaris) [09:22:57] (03CR) 10Filippo Giunchedi: prometheus: enable Thanos upload for k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [09:22:57] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10akosiaris) 05Open→03Resolved The hardware machines are now in full production mode, ready to receive VMs. In fact, the row C machines already have VMs as the... [09:23:25] (03CR) 10Majavah: puppetize meet-accountmanager (WIP) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: 10Dzahn) [09:25:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [09:28:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [09:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1148 to clone db1141 - T252512', diff saved to https://phabricator.wikimedia.org/P11423 and previous config saved to /var/cache/conftool/dbconfig/20200609-092915-marostegui.json [09:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:19] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [09:30:42] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) [09:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:46] (03CR) 10Dzahn: puppetize meet-accountmanager (WIP) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: 10Dzahn) [09:33:03] (03CR) 10Marostegui: [C: 03+1] install_server: Allow reuse of partitions during reimage. [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [09:34:35] (03PS20) 10Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [09:34:55] !log Stop MySQL on db1148 to clone db1141 - T252512 [09:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:59] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [09:35:18] PROBLEM - PHP opcache health on wtp2014 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [09:36:47] (03CR) 10Ema: [C: 03+1] Revert "ATS: Re-enable parent proxies on ats-tls" [puppet] - 10https://gerrit.wikimedia.org/r/588954 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [09:37:08] (03PS1) 10Volans: bgp: sort transits for consistency on Python 3.5 [homer/public] - 10https://gerrit.wikimedia.org/r/603931 [09:37:14] (03CR) 10Kormat: [C: 03+2] install_server: Allow reuse of partitions during reimage. 
[puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [09:37:46] (03PS1) 10Ayounsi: Add new eqiad/codfw Ganeti rows [software/spicerack] - 10https://gerrit.wikimedia.org/r/603932 (https://phabricator.wikimedia.org/T228926) [09:38:54] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/603932 (https://phabricator.wikimedia.org/T228926) (owner: 10Ayounsi) [09:40:59] !log Compress InnoDB on db2072 T254462 [09:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:03] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 [09:43:35] (03CR) 10jerkins-bot: [V: 04-1] Add new eqiad/codfw Ganeti rows [software/spicerack] - 10https://gerrit.wikimedia.org/r/603932 (https://phabricator.wikimedia.org/T228926) (owner: 10Ayounsi) [09:43:54] (03CR) 10Ema: [C: 03+1] lvs::configuration: add termbox-https [puppet] - 10https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [09:44:35] (03CR) 10JMeybohm: [C: 03+2] lvs::configuration: add termbox-https [puppet] - 10https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [09:44:53] (03CR) 10Dzahn: "there seem to be more tests: 05:39:01 exc_message = "Invalid row 'invalid' for cluster ganeti01.svc.eqiad.wmnet, expected one of: \\('A', " [software/spicerack] - 10https://gerrit.wikimedia.org/r/603932 (https://phabricator.wikimedia.org/T228926) (owner: 10Ayounsi) [09:51:01] PROBLEM - PHP opcache health on mw2317 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [09:51:13] (03CR) 10Dzahn: "@Ladsgroup Trying to move forward with this I went and SSHed to the instance called "jitsi" in the meet project." 
[puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: 10Dzahn) [09:51:25] (03PS2) 10Ayounsi: Add new eqiad/codfw Ganeti rows [software/spicerack] - 10https://gerrit.wikimedia.org/r/603932 (https://phabricator.wikimedia.org/T228926) [09:51:27] (03PS1) 10Ayounsi: Tox: add python 3.8 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/603939 [09:51:34] 10Puppet, 10Wikimedia Meet, 10Patch-For-Review: Puppetize the account manager - https://phabricator.wikimedia.org/T251034 (10Dzahn) @Ladsgroup Trying to move forward with this I went and SSHed to the instance called "jitsi" in the meet project. I found that the checkout of meet-accountmanager repo seems to... [09:53:01] (03CR) 10Dzahn: puppetize meet-accountmanager (WIP) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: 10Dzahn) [09:53:19] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 57 connections established with conf2001.codfw.wmnet:2379 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [09:53:20] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603577 (owner: 10CDanis) [09:53:27] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.46:4004]) https://wikitech.wikimedia.org/wiki/PyBal [09:53:37] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 77 connections established with conf2001.codfw.wmnet:2379 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [09:53:49] (03PS2) 10Ayounsi: Tox: add python 3.8 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/603939 [09:54:21] the PyBal alerts above are due to work jayme is doing, please ignore ^ [09:54:46] (03CR) 10Vgutierrez: [C: 03+2] Revert "ATS: Re-enable parent proxies on ats-tls" [puppet] - 10https://gerrit.wikimedia.org/r/588954 
(https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [09:55:01] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 66 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [09:55:31] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 98 connections established with conf1004.eqiad.wmnet:4001 (min=99) https://wikitech.wikimedia.org/wiki/PyBal [09:55:45] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.46:4004]) https://wikitech.wikimedia.org/wiki/PyBal [09:55:49] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.46:4004]) https://wikitech.wikimedia.org/wiki/PyBal [09:55:49] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.46:4004]) https://wikitech.wikimedia.org/wiki/PyBal [09:55:53] !log akosiaris@cumin1001 conftool action : set/pooled=inactive; selector: name=thumbor2002.codfw.wmnet [09:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:58] !log akosiaris@cumin1001 conftool action : set/pooled=inactive; selector: name=thumbor2001.codfw.wmnet [09:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:05] !log disable parent proxies on ats-tls [09:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:55] (03PS1) 10Elukey: [WIP] memcached: allow more tunables to avoid implicit settings [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) [09:57:16] !log depool and set as inactive thumber200{1,2} for T251750 [09:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:20] T251750: Make deploy-promote more robust against failures - https://phabricator.wikimedia.org/T251750 [09:57:38] 
!log correction: depool and set as inactive thumbor200{1,2} for T251570 [09:57:41] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/603932 (https://phabricator.wikimedia.org/T228926) (owner: 10Ayounsi) [09:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:41] T251570: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 [09:57:49] !log restarting pybal on lvs1016 and lvs2010 for T254581 [09:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:52] T254581: Move termbox to use TLS only - https://phabricator.wikimedia.org/T254581 [09:58:59] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 78 connections established with conf2001.codfw.wmnet:2379 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [10:00:11] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:00:11] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:15] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/603939 (owner: 10Ayounsi) [10:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:17] (03CR) 10Ayounsi: [C: 03+2] Add new eqiad/codfw Ganeti rows [software/spicerack] - 10https://gerrit.wikimedia.org/r/603932 (https://phabricator.wikimedia.org/T228926) (owner: 10Ayounsi) [10:00:18] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:00:19] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:23] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10ops-monitoring-bot) Icinga 
downtime for 2 days, 0:00:00 set by akosiaris@cumin1001 on 2 host(s) and their services with reason: poweroff for T251750 ` thumbor[2001-2002].codfw.wmnet ` [10:00:29] (03CR) 10Ayounsi: [C: 03+2] Tox: add python 3.8 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/603939 (owner: 10Ayounsi) [10:00:55] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 99 connections established with conf1004.eqiad.wmnet:4001 (min=99) https://wikitech.wikimedia.org/wiki/PyBal [10:01:04] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.46:4004]) Ema Ongoing work: T254581 https://wikitech.wikimedia.org/wiki/PyBal [10:01:04] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 66 connections established with conf1004.eqiad.wmnet:4001 (min=67) Ema Ongoing work: T254581 https://wikitech.wikimedia.org/wiki/PyBal [10:01:04] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.46:4004]) Ema Ongoing work: T254581 https://wikitech.wikimedia.org/wiki/PyBal [10:01:04] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.46:4004]) Ema Ongoing work: T254581 https://wikitech.wikimedia.org/wiki/PyBal [10:01:04] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 57 connections established with conf2001.codfw.wmnet:2379 (min=58) Ema Ongoing work: T254581 https://wikitech.wikimedia.org/wiki/PyBal [10:01:04] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.46:4004]) Ema Ongoing work: T254581 https://wikitech.wikimedia.org/wiki/PyBal [10:01:09] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:01:15] RECOVERY - PyBal IPVS diff check on 
lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:01:22] (03PS4) 10Dzahn: puppetize meet-accountmanager (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) [10:02:29] (03CR) 10Ayounsi: "Note that this will need a new Spicerack release." [software/spicerack] - 10https://gerrit.wikimedia.org/r/603932 (https://phabricator.wikimedia.org/T228926) (owner: 10Ayounsi) [10:02:31] (03CR) 10jerkins-bot: [V: 04-1] puppetize meet-accountmanager (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: 10Dzahn) [10:03:16] (03CR) 10Majavah: puppetize meet-accountmanager (WIP) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: 10Dzahn) [10:04:06] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.38 [software/spicerack] - 10https://gerrit.wikimedia.org/r/603945 [10:05:23] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10akosiaris) thumbor2001 and thumbor2002 have been set as inactive, downtime for 2 days in icinga and powered off. Once you are done, please power them back on and I 'll take it from there. Thanks! 
[10:06:07] (CR) Dzahn: puppetize meet-accountmanager (WIP) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: Dzahn) [10:06:33] (PS5) Dzahn: puppetize meet-accountmanager (WIP) [puppet] - https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) [10:07:50] !log Deploy schema change on s7 T206103 [10:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:55] T206103: recentchanges table indexes: tmp1, tmp2 and tmp3 - https://phabricator.wikimedia.org/T206103 [10:09:25] RECOVERY - PHP opcache health on wtp2014 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:10:09] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.38 [software/spicerack] - https://gerrit.wikimedia.org/r/603945 (owner: Volans) [10:10:09] RECOVERY - PHP opcache health on mw2317 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:11:46] (CR) Majavah: puppetize meet-accountmanager (WIP) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: Dzahn) [10:11:53] (PS1) Volans: Upstream release v0.0.38 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/603948 [10:11:55] Operations, SRE-Access-Requests: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (jbond) [10:12:31] !log "Re-order some BGP transit neighbors terms" [10:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:14] (CR) Ayounsi: [C: +2] bgp: sort transits for consistency on Python 3.5 [homer/public] - https://gerrit.wikimedia.org/r/603931 (owner: Volans) [10:14:26] !log restarting pybal on lvs1015 and lvs2009 for T254581 [10:14:29] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:30] T254581: Move termbox to use TLS only - https://phabricator.wikimedia.org/T254581 [10:14:59] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:35] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:17:52] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:18:38] (PS1) Jbond: admin: add shell account for lmata and add to ops group [puppet] - https://gerrit.wikimedia.org/r/603950 (https://phabricator.wikimedia.org/T254818) [10:19:10] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:07] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (jbond) [10:20:20] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 58 connections established with conf2001.codfw.wmnet:2379 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [10:21:38] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (jbond) [10:22:41] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (jbond) @faidon can you approve either here or on the change @lmata It doesn't appear that you have registered an ssh key in the cloud environment, when you do ple...
[10:23:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10jbond) p:05Triage→03Medium [10:23:34] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:23:46] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.38 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/603948 (owner: 10Volans) [10:27:11] !log uploaded spicerack_0.0.38-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [10:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:29] (03PS6) 10Dzahn: puppetize meet-accountmanager (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) [10:29:31] (03CR) 10Dzahn: puppetize meet-accountmanager (WIP) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: 10Dzahn) [10:30:27] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10hnowlan) cpjobqueue is now running fully on Kubernetes. The instance running on scb has no rules enabled. All that remains to be don... 
[10:30:40] (03PS2) 10Elukey: [WIP] memcached: allow more tunables to avoid implicit settings [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) [10:30:40] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:32:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1141 depooled to s4 T252512', diff saved to https://phabricator.wikimedia.org/P11425 and previous config saved to /var/cache/conftool/dbconfig/20200609-103252-marostegui.json [10:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:57] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [10:34:48] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 67 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [10:35:14] !log installed spicerack 0.0.38 on cumin[12]001 [10:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:01] (03CR) 10Gehel: [C: 04-1] "minor comments inline about preserving dependencies." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [10:37:07] akosiaris, mutante: FYI this added support for all rows of ganeti clusters in eqiad/codfw, so now the cookbook should allow creating them too [10:37:14] LMK if you encounter any issues [10:38:06] volans: thank you, ok [10:38:51] (03CR) 10QChris: [C: 03+1] zuul: add a connection to gerrit-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: 10Hashar) [10:39:35] (03PS1) 10Volans: sre.ganeti.makevm: sort available clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/603957 [10:39:41] and this is a nit for the help message :) [10:42:11] (03CR) 10Volans: [C: 03+2] sre.ganeti.makevm: sort available clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/603957 (owner: 10Volans) [10:42:18] (03PS1) 10Hnowlan: check_mw_versions: fix module path for requests.exceptions [puppet] - 10https://gerrit.wikimedia.org/r/603960 [10:42:44] PROBLEM - PHP opcache health on wtp2010 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:43:32] (03PS3) 10Elukey: [WIP] memcached: allow more tunables to avoid implicit settings [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) [10:43:51] (03CR) 10Kormat: [C: 03+1] check_mw_versions: fix module path for requests.exceptions [puppet] - 10https://gerrit.wikimedia.org/r/603960 (owner: 10Hnowlan) [10:46:40] (03CR) 10Ema: [C: 03+1] Switch backend for piwik.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/603366 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [10:48:27] (03PS4) 10Elukey: [WIP] memcached: allow more tunables to avoid implicit settings [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) [10:48:37] !log imported tqdm
4.23.4-1+wmf1 to buster-wikimedia/component/spicerack [10:48:38] 10Puppet, 10Wikimedia Meet, 10Patch-For-Review: Puppetize the account manager - https://phabricator.wikimedia.org/T251034 (10Ladsgroup) >>! In T251034#6205190, @Dzahn wrote: > @Ladsgroup Trying to move forward with this I went and SSHed to the instance called "jitsi" in the meet project. > > I found that th... [10:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:12] (03CR) 10Hnowlan: [C: 03+2] check_mw_versions: fix module path for requests.exceptions [puppet] - 10https://gerrit.wikimedia.org/r/603960 (owner: 10Hnowlan) [10:49:44] !log update pcc facts [10:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:11] (03CR) 10Ladsgroup: "> Patch Set 4: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [10:58:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:59:55] (03PS1) 10Kormat: install_server: Test reuse-parts on metal. [puppet] - 10https://gerrit.wikimedia.org/r/603961 (https://phabricator.wikimedia.org/T252027) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day backport window(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200609T1100). [11:00:05] Ammarpad: A patch you scheduled for European Mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:16] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:00:20] I have a quick deployment [11:00:40] Ammarpad: Around? [11:01:37] (03CR) 10Gehel: [C: 04-1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [11:01:51] Ammarpad doesn't seem to be around Amir1 [11:02:09] Unless he uses a different nickname [11:02:34] okay then, I'll deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/602675 [11:02:41] Phab says no different nick [11:06:34] (03CR) 10Ladsgroup: [C: 03+2] Add be-tarask to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602675 (https://phabricator.wikimedia.org/T111853) (owner: 10Ladsgroup) [11:07:22] (03Merged) 10jenkins-bot: Add be-tarask to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602675 (https://phabricator.wikimedia.org/T111853) (owner: 10Ladsgroup) [11:07:35] (03CR) 10Marostegui: [C: 03+1] "please coordinate with me before working with db1077" [puppet] - 10https://gerrit.wikimedia.org/r/603961 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [11:07:44] (03PS2) 10Kormat: install_server: Test reuse-parts on metal.
[puppet] - 10https://gerrit.wikimedia.org/r/603961 (https://phabricator.wikimedia.org/T252027) [11:08:10] PROBLEM - PHP opcache health on mw2240 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:10:00] (03PS1) 10Muehlenhoff: Use the cumin profile in role::cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/603965 (https://phabricator.wikimedia.org/T245114) [11:10:12] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/602171 (owner: 10EBernhardson) [11:11:12] (03CR) 10Volans: [C: 03+1] "LGTM if pcc is happy" [puppet] - 10https://gerrit.wikimedia.org/r/603965 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [11:12:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/603961 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [11:14:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1148 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11426 and previous config saved to /var/cache/conftool/dbconfig/20200609-111443-marostegui.json [11:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:47] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [11:15:17] !log ladsgroup@deploy1001 Synchronized langlist: [[gerrit:602675|Add be-tarask to langlist (T111853)]] (duration: 00m 57s) [11:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:20] T111853: The href of be-tarask: interlanguage link points to the be-x-old domain - https://phabricator.wikimedia.org/T111853 [11:18:54] RECOVERY - PHP opcache health on mw2240 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:19:29] (03CR) 10Muehlenhoff: "https://puppet-compiler.wmflabs.org/compiler1001/23092/" [puppet] - 
10https://gerrit.wikimedia.org/r/603965 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [11:19:32] (03CR) 10Muehlenhoff: [C: 03+2] Use the cumin profile in role::cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/603965 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [11:25:44] RECOVERY - PHP opcache health on wtp2010 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:27:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1148 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11427 and previous config saved to /var/cache/conftool/dbconfig/20200609-112701-marostegui.json [11:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:06] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [11:29:30] (03PS1) 10Marostegui: db1141: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/603973 (https://phabricator.wikimedia.org/T252512) [11:30:22] (03CR) 10Marostegui: [C: 03+2] db1141: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/603973 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui) [11:30:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1141 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11428 and previous config saved to /var/cache/conftool/dbconfig/20200609-113056-marostegui.json [11:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:19] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10jbond) >>! In T233933#6200461, @elukey wrote: > I am all for testing new versions of memcached to get experience, so on this front you'll always have my +1 :) > Upstream is a... 
[11:36:54] (03PS1) 10JMeybohm: lvs::configuration: termbox-https monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/603974 (https://phabricator.wikimedia.org/T254581) [11:37:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1148 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11429 and previous config saved to /var/cache/conftool/dbconfig/20200609-113702-marostegui.json [11:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:07] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [11:38:13] (03PS1) 10Ladsgroup: Copy content of slave.yaml files to replica.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/603975 (https://phabricator.wikimedia.org/T254646) [11:38:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1141 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11430 and previous config saved to /var/cache/conftool/dbconfig/20200609-113818-marostegui.json [11:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] "/o/ \o\ \o/" [puppet] - 10https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [11:40:41] (03PS1) 10KartikMistry: Revert "Revert "Update cxserver to 2020-06-08-045500-production"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/603976 [11:41:55] (03CR) 10KartikMistry: [C: 03+2] "cxserver is OK to deploy now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/603976 (owner: 10KartikMistry) [11:42:04] (03PS1) 10Marostegui: install_server: Reimage db2132 with buster [puppet] - 10https://gerrit.wikimedia.org/r/603978 (https://phabricator.wikimedia.org/T254556) [11:42:23] (03Merged) 10jenkins-bot: Revert "Revert "Update cxserver to 2020-06-08-045500-production"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/603976 (owner: 10KartikMistry) [11:42:54] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2132 with buster [puppet] - 10https://gerrit.wikimedia.org/r/603978 (https://phabricator.wikimedia.org/T254556) (owner: 10Marostegui) [11:44:16] !log kartik@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [11:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:41] akosiaris: deploying cxserver now ^ [11:45:11] 👍 [11:46:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1148 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11431 and previous config saved to /var/cache/conftool/dbconfig/20200609-114615-marostegui.json [11:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:19] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [11:46:34] !log kartik@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . [11:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:02] PROBLEM - PHP opcache health on mw2192 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:47:07] (03CR) 10Ladsgroup: "I hope this is what you meant by the change. Thanks." 
[labs/private] - 10https://gerrit.wikimedia.org/r/603975 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [11:47:55] (03CR) 10Ladsgroup: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [11:50:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1141 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11432 and previous config saved to /var/cache/conftool/dbconfig/20200609-115016-marostegui.json [11:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:57] !log kartik@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' . [11:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:37] 10Puppet, 10Wikimedia Meet, 10Patch-For-Review: Puppetize the account manager - https://phabricator.wikimedia.org/T251034 (10Dzahn) >>! In T251034#6205322, @Ladsgroup wrote: > Yeah because jitsi is a client of the account manager. The one that really matters is "meet-auth". But on the "meet-auth" instance t...
[11:54:54] PROBLEM - Thanos compact has disappeared from Prometheus discovery on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [11:56:54] PROBLEM - PHP opcache health on wtp2015 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:59:40] RECOVERY - PHP opcache health on mw2192 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:00:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1141 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11433 and previous config saved to /var/cache/conftool/dbconfig/20200609-120009-marostegui.json [12:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:14] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [12:02:06] (03PS1) 10Ssingh: wikidough: update profile to use a single recursor [puppet] - 10https://gerrit.wikimedia.org/r/603982 (https://phabricator.wikimedia.org/T252132) [12:05:21] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/23095/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/603982 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:05:47] (03CR) 10Kormat: [C: 03+2] install_server: Test reuse-parts on metal. [puppet] - 10https://gerrit.wikimedia.org/r/603961 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [12:08:00] (03PS3) 10Kormat: install_server: Test reuse-parts on metal. [puppet] - 10https://gerrit.wikimedia.org/r/603961 (https://phabricator.wikimedia.org/T252027) [12:09:40] (03CR) 10Gehel: [C: 03+1] "@ladsgroup: thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [12:11:19] (03PS1) 10KartikMistry: Revert "Revert "Revert "Update cxserver to 2020-06-08-045500-production""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/603986 [12:11:52] 😮 [12:12:00] (03CR) 10Marostegui: [C: 03+1] install_server: Test reuse-parts on metal. [puppet] - 10https://gerrit.wikimedia.org/r/603961 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [12:12:05] (03PS5) 10Elukey: memcached: allow more tunables to avoid implicit settings [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) [12:12:41] (03CR) 10Kormat: [C: 03+2] install_server: Test reuse-parts on metal. [puppet] - 10https://gerrit.wikimedia.org/r/603961 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [12:12:55] (03CR) 10KartikMistry: [C: 03+2] "Breaks all MT services. Not sure if this is the root cause. But, first step." [deployment-charts] - 10https://gerrit.wikimedia.org/r/603986 (owner: 10KartikMistry) [12:13:23] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Update cxserver to 2020-06-08-045500-production""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/603986 (owner: 10KartikMistry) [12:14:07] (03PS7) 10Dzahn: puppetize meet-accountmanager, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) [12:14:31] !log kartik@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [12:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:36] (03PS4) 10Ayounsi: Add sre.network.prepare-upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 [12:14:55] (03CR) 10Ayounsi: "Thanks! comments addressed." 
(033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 (owner: 10Ayounsi) [12:16:10] !log kartik@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' . [12:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:15] (03PS6) 10Elukey: memcached: allow more tunables to avoid implicit settings [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) [12:18:17] !log kartik@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . [12:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:43] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/23096/" [puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [12:20:30] (03PS1) 10Volans: GC: fix garbage collection and refactor its query [software/debmonitor] - 10https://gerrit.wikimedia.org/r/603992 (https://phabricator.wikimedia.org/T254865) [12:20:48] (03CR) 10Hashar: [C: 03+1] "Tried on contint2001.wikimedia.org with:" [puppet] - 10https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: 10Hashar) [12:21:09] (03PS8) 10Dzahn: puppetize meet-accountmanager, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) [12:22:10] !log reimaging sretest1002 T252027 [12:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:13] T252027: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 [12:22:40] (03CR) 10Dzahn: [C: 03+2] puppetize meet-accountmanager, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/593233 (https://phabricator.wikimedia.org/T251034) (owner: 10Dzahn) [12:24:12] 10Operations, 10Patch-For-Review: debian-installer: partman doesn't allow lvm LVs to be reused when 
reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['sretest1002.eqiad.wmnet'] ` The log can be foun... [12:25:49] (03PS1) 10Volans: debmonitor GC: generate cron email on failure [puppet] - 10https://gerrit.wikimedia.org/r/603993 (https://phabricator.wikimedia.org/T254865) [12:29:12] (03CR) 10Dzahn: [C: 03+1] wikidough: update profile to use a single recursor [puppet] - 10https://gerrit.wikimedia.org/r/603982 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:29:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] php: $enable_request_profiling should affect CLI [puppet] - 10https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547) (owner: 10Dave Pifke) [12:29:40] (03CR) 10Volans: [C: 03+1] "LGTM as a first version. I'm sure there will be followup improvements after real life testing." [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 (owner: 10Ayounsi) [12:30:24] (03CR) 10Dzahn: [C: 03+2] zuul: add a connection to gerrit-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: 10Hashar) [12:32:32] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/551939 (owner: 10Dzahn) [12:32:47] (03CR) 10jerkins-bot: [V: 04-1] ATS/varnish: rename thorium director to analytics-web [puppet] - 10https://gerrit.wikimedia.org/r/551939 (owner: 10Dzahn) [12:33:05] RECOVERY - PHP opcache health on wtp2015 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:33:50] (03CR) 10Ssingh: [C: 03+2] wikidough: update profile to use a single recursor [puppet] - 10https://gerrit.wikimedia.org/r/603982 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:35:34] (03Abandoned) 10Dzahn: ATS/varnish: rename thorium director to analytics-web [puppet] - 10https://gerrit.wikimedia.org/r/551939 (owner: 
10Dzahn) [12:38:06] (03PS1) 10Marostegui: mariadb: Reimage db2131 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/603999 (https://phabricator.wikimedia.org/T250666) [12:38:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2131 for reimage', diff saved to https://phabricator.wikimedia.org/P11434 and previous config saved to /var/cache/conftool/dbconfig/20200609-123817-marostegui.json [12:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:29] PROBLEM - PHP opcache health on mw2230 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:42:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [12:43:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/603993 (https://phabricator.wikimedia.org/T254865) (owner: 10Volans) [12:44:14] (03CR) 10Elukey: elasticsearch: manage java dependencies with ::profile::java (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [12:45:51] (03PS3) 10Andrew Bogott: wmcs resolv.conf: reduce timeout to 1s [puppet] - 10https://gerrit.wikimedia.org/r/601711 (https://phabricator.wikimedia.org/T253780) [12:46:05] (03CR) 10Volans: [C: 03+2] debmonitor GC: generate cron email on failure [puppet] - 10https://gerrit.wikimedia.org/r/603993 (https://phabricator.wikimedia.org/T254865) (owner: 10Volans) [12:46:49] (03PS2) 10Marostegui: mariadb: Reimage db2131 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/603999 (https://phabricator.wikimedia.org/T250666) [12:47:25] (03CR) 10Andrew Bogott: [C: 03+2] wmcs resolv.conf: reduce timeout to 1s [puppet] - 10https://gerrit.wikimedia.org/r/601711 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [12:51:26] (03CR) 
10Muehlenhoff: [C: 03+1] "Looks good!" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/603992 (https://phabricator.wikimedia.org/T254865) (owner: 10Volans) [12:51:41] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall LGTM! See the small correction inline." (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [12:52:56] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Again LGTM overall, see my comment inline though." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [12:53:11] RECOVERY - PHP opcache health on mw2230 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:56:03] (03CR) 10Elukey: elasticsearch: manage java dependencies with ::profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [12:56:25] 10Operations, 10Patch-For-Review: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sretest1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['sretest1002.eqiad.wmnet'] ` [12:58:31] (03CR) 10Andrew Bogott: [C: 03+1] "I confirmed this is a no-op for the wmcs hosts it runs on. I can't sign off on the mediawiki bits though." 
[puppet] - 10https://gerrit.wikimedia.org/r/603942 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [12:59:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] prometheus: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/599298 (https://phabricator.wikimedia.org/T253840) (owner: 10Filippo Giunchedi) [13:00:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10lmata) @jbond will generate that new key today, thank you [13:01:00] (03CR) 10Andrew Bogott: "pcc results https://puppet-compiler.wmflabs.org/compiler1002/23099/tools-sgegrid-master.tools.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/601714 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [13:06:25] (03CR) 10Andrew Bogott: [C: 03+2] wmcs vms: stop using ns1 for resolving [puppet] - 10https://gerrit.wikimedia.org/r/601714 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [13:07:43] PROBLEM - PHP opcache health on mw2234 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:08:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] check_opcache: simplify thresholds for the script [puppet] - 10https://gerrit.wikimedia.org/r/603921 (owner: 10Giuseppe Lavagetto) [13:10:55] RECOVERY - PHP opcache health on mw2234 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:11:54] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/603992 (https://phabricator.wikimedia.org/T254865) (owner: 10Volans) [13:17:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:19:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:19:53] <_joe_> uh well that seems like a large spike of errors [13:20:55] <_joe_> again a failure at the eventgate-main layer [13:21:23] (03CR) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (https://phabricator.wikimedia.org/T252807) (owner: 10Muehlenhoff) [13:22:09] <_joe_> we're having a peak in requests [13:23:14] <_joe_> and it's mostly refreshlinks [13:24:21] _joe_: that suggests a template got edited? [13:28:00] <_joe_> cdanis: we could go and inspect kafka, but the situation seems to be back to normal (meaning: we're enqueueing jobs correctly) [13:31:21] (03PS1) 10Filippo Giunchedi: thanos: add SyslogIdentifier=%N to systemd services [puppet] - 10https://gerrit.wikimedia.org/r/604009 (https://phabricator.wikimedia.org/T233956) [13:32:23] (03PS3) 10Vgutierrez: ATS: Add support on tls.lua for http requests [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [13:33:48] (03PS5) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (https://phabricator.wikimedia.org/T252807) [13:34:59] (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for cloudservices rebuilds [puppet] - 10https://gerrit.wikimedia.org/r/604010 (https://phabricator.wikimedia.org/T253780) [13:41:52] (03PS4) 10Vgutierrez: ATS: Add support on tls.lua for http requests [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [13:42:53] (03PS4) 10JMeybohm: 
eventstreams: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) [13:43:09] (03PS4) 10JMeybohm: eventgate: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) [13:44:18] (03CR) 10JMeybohm: "Good catch! Thanks" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [13:46:37] (03CR) 10JMeybohm: eventstreams: Update to v0.2 helpers (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [13:49:32] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['sretest1002.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-aut... [13:49:35] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sretest1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['sretest1002.eqiad.wmnet'] ` [13:50:35] (03CR) 10Kormat: [C: 03+1] mariadb: Reimage db2131 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/603999 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [13:50:50] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db2131 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/603999 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [13:51:39] 10Operations, 10Core Platform Team, 10Traffic: Configure purged in deployment-prep - https://phabricator.wikimedia.org/T254844 (10Pchelolo) > @Pchelolo: let me know which kafka topics we should read from in deployment-prep! 
@ema - it would be `eqiad.resource-purge` [13:52:04] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['sretest1002.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-aut... [13:52:05] (03PS6) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (https://phabricator.wikimedia.org/T252807) [13:54:21] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:26] (03PS1) 10Muehlenhoff: wmf_auto_reimage: Use systemd unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/604017 [13:56:48] (03CR) 10Volans: "You read my mind! :) one nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/604017 (owner: 10Muehlenhoff) [13:56:52] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:15] (03CR) 10Volans: [C: 03+2] GC: fix garbage collection and refactor its query [software/debmonitor] - 10https://gerrit.wikimedia.org/r/603992 (https://phabricator.wikimedia.org/T254865) (owner: 10Volans) [13:58:31] (03PS5) 10Vgutierrez: ATS: Add support on tls.lua for http requests [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [13:59:22] (03Merged) 10jenkins-bot: GC: fix garbage collection and refactor its query [software/debmonitor] - 10https://gerrit.wikimedia.org/r/603992 (https://phabricator.wikimedia.org/T254865) (owner: 10Volans) [14:00:12] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [14:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:35] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) [14:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:38] (03CR) 10Volans: [C: 03+1] "LGTM for this first version, we can improved it with real life usage." [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (https://phabricator.wikimedia.org/T252807) (owner: 10Muehlenhoff) [14:00:47] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [14:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:51] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:00:52] !log update release repository's settings on Archiva - T254849 [14:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:56] T254849: Purge old files on Archiva to free some space - https://phabricator.wikimedia.org/T254849 [14:04:12] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode for cloudservices rebuilds [puppet] - 10https://gerrit.wikimedia.org/r/604010 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [14:04:53] (03PS6) 10Vgutierrez: ATS: Add support on tls.lua for http requests [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [14:05:04] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sretest1002.eqiad.wmnet'] ` and were **ALL** successful. 
[14:05:21] (03PS2) 10Muehlenhoff: wmf_auto_reimage: Use systemd unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/604017 [14:06:07] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10BBlack) Looking at that other ticket T250912 - would an in-band service ping or NOP event of some kind ad... [14:07:57] (03CR) 10Muehlenhoff: wmf_auto_reimage: Use systemd unconditionally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/604017 (owner: 10Muehlenhoff) [14:09:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [14:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:54] (03PS1) 10Andrew Bogott: Move cloudservices1003/1004 to Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/604019 (https://phabricator.wikimedia.org/T253780) [14:12:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:57] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudservices1003/1004 to Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/604019 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [14:13:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Move cloudservices1003/1004 to Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/604019 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [14:14:40] (03PS3) 10CDanis: run-puppet-agent: add new flag --unless-version SUBSTR [puppet] - 10https://gerrit.wikimedia.org/r/603577 [14:15:34] (03CR) 10CDanis: run-puppet-agent: add new flag --unless-version SUBSTR (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603577 (owner: 10CDanis) [14:15:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nice! I missed the lvs_setup patch (I see it is already merged), so \o/." 
[puppet] - 10https://gerrit.wikimedia.org/r/603974 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [14:15:43] (03PS4) 10CDanis: run-puppet-agent: add new flag --unless-version SUBSTR [puppet] - 10https://gerrit.wikimedia.org/r/603577 [14:18:15] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603577 (owner: 10CDanis) [14:18:19] (03PS1) 10Kormat: install_server: Fix issue with reuse-parts.cfg on multi-disk machines. [puppet] - 10https://gerrit.wikimedia.org/r/604020 (https://phabricator.wikimedia.org/T252027) [14:18:21] 10Operations, 10Readers-Web-Backlog, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Dzahn) a:03wkandek [14:18:35] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10Cmjohnson) @Jclark-ctr Please move these 2 servers thanos-be1002 from C2 to B2 thanos-be1004 from C4 to D7 [14:19:59] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Cmjohnson) @Jclark-ctr Please move thanos-fe1002 from A4 to B2 [14:20:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests: Create archiva1002 as replacement of archiva1001 - https://phabricator.wikimedia.org/T254890 (10elukey) [14:21:21] (03PS1) 10Elukey: Add archiva1002 AAAA/A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/604021 (https://phabricator.wikimedia.org/T254890) [14:21:27] (03CR) 10Jbond: [C: 03+1] "puppetdb1002 ~ % time ruby -e "require 'puppet'"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603577 (owner: 10CDanis) [14:21:45] (03CR) 10jerkins-bot: [V: 04-1] Add archiva1002 AAAA/A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/604021 (https://phabricator.wikimedia.org/T254890) (owner: 10Elukey) [14:22:12] 
(03PS5) 10CDanis: run-puppet-agent: add new flag --unless-version SUBSTR [puppet] - 10https://gerrit.wikimedia.org/r/603577 [14:22:33] jbond42: lol, yikes [14:23:00] (03CR) 10Marostegui: [C: 03+1] install_server: Fix issue with reuse-parts.cfg on multi-disk machines. [puppet] - 10https://gerrit.wikimedia.org/r/604020 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [14:23:33] (03CR) 10Kormat: [C: 03+2] install_server: Fix issue with reuse-parts.cfg on multi-disk machines. [puppet] - 10https://gerrit.wikimedia.org/r/604020 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [14:23:42] cdanis: yes i know thats the main reason for the grep hack [14:24:30] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/604017 (owner: 10Muehlenhoff) [14:25:03] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:25:10] (03PS1) 10Muehlenhoff: On buster install python3-tqdm from the spicerack component [puppet] - 10https://gerrit.wikimedia.org/r/604023 (https://phabricator.wikimedia.org/T245114) [14:25:33] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.99 ms [14:26:25] (03CR) 10jerkins-bot: [V: 04-1] On buster install python3-tqdm from the spicerack component [puppet] - 10https://gerrit.wikimedia.org/r/604023 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [14:26:53] (03PS1) 10QChris: Gerrit: Fix log retention period in comment [puppet] - 10https://gerrit.wikimedia.org/r/604024 [14:27:25] (03CR) 10Volans: [C: 03+1] "LGTM, indentation aside ;)" [puppet] - 10https://gerrit.wikimedia.org/r/604023 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [14:27:42] (03PS2) 10Muehlenhoff: On buster install python3-tqdm from the spicerack component [puppet] - 10https://gerrit.wikimedia.org/r/604023 (https://phabricator.wikimedia.org/T245114) [14:28:10] 10Operations, 10Patch-For-Review: debian-installer: partman doesn't allow lvm LVs to be reused when 
reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['sretest1002.eqiad.wmnet'] ` The log can be foun... [14:28:55] (03CR) 10jerkins-bot: [V: 04-1] On buster install python3-tqdm from the spicerack component [puppet] - 10https://gerrit.wikimedia.org/r/604023 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [14:29:45] (03PS2) 10Elukey: Add archiva1002 AAAA/A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/604021 (https://phabricator.wikimedia.org/T254890) [14:30:34] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:00] (03PS3) 10Muehlenhoff: On buster install python3-tqdm from the spicerack component [puppet] - 10https://gerrit.wikimedia.org/r/604023 (https://phabricator.wikimedia.org/T245114) [14:31:50] (03PS2) 10Hnowlan: beta: Allow using docker volumes [puppet] - 10https://gerrit.wikimedia.org/r/601717 (https://phabricator.wikimedia.org/T251176) (owner: 10Alexandros Kosiaris) [14:32:54] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:32:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:07] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:11] (03PS1) 10Volans: Release v0.2.5 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/604027 [14:34:16] !log rebooting auth1002 [14:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:18] (03PS6) 10CDanis: run-puppet-agent: add new flag --unless-version SUBSTR [puppet] - 10https://gerrit.wikimedia.org/r/603577 
[14:36:52] jbond42: mind giving ^ another quick look? [14:36:59] yes looking now [14:38:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/603577 (owner: 10CDanis) [14:39:41] (03PS1) 10QChris: Gerrit: Enable crons on Gerrit replicas [puppet] - 10https://gerrit.wikimedia.org/r/604033 [14:40:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [14:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:11] cdanis: i think that as the script uses `grep ^configuration_version` it's probably fine in most cases however i think the addition is good [14:40:14] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10RobH) [14:40:23] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10RobH) [14:40:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:33] i think your test missed '^' (or i missed something) [14:42:05] ahhhh [14:42:13] yes you're right, although I think this is more bombproof anyway [14:42:20] yes agree [14:42:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:44] (03PS7) 10CDanis: run-puppet-agent: add new flag --unless-version SUBSTR [puppet] - 10https://gerrit.wikimedia.org/r/603577 [14:43:47] (03CR) 10Dzahn: [C: 03+2] Gerrit: Fix log retention period in comment [puppet] - 10https://gerrit.wikimedia.org/r/604024 (owner: 10QChris) [14:45:09] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:27] (03CR) 10Hnowlan: [C: 03+2] beta: Allow using docker volumes [puppet] - 
10https://gerrit.wikimedia.org/r/601717 (https://phabricator.wikimedia.org/T251176) (owner: 10Alexandros Kosiaris) [14:45:59] (03CR) 10CDanis: [C: 03+2] "thanks for the reviews! manual testing lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/603577 (owner: 10CDanis) [14:46:39] (03CR) 10Volans: [V: 03+2 C: 03+2] "Merging, just the rebuild of the wheels" [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/604027 (owner: 10Volans) [14:46:58] (03PS5) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 [14:47:13] (03CR) 10Kormat: Add native mysql spicerack module. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [14:47:55] kormat: I'll get back to your CR asap, just full of open rabbit holes at the moment, sorry for the delay [14:48:04] volans: that's no problem [14:49:13] (03CR) 10Kormat: Add native mysql spicerack module. (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [14:49:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2131 after reimage', diff saved to https://phabricator.wikimedia.org/P11436 and previous config saved to /var/cache/conftool/dbconfig/20200609-144929-marostegui.json [14:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:30] (03PS1) 10Marostegui: mariadb: Enable notification on db2131 [puppet] - 10https://gerrit.wikimedia.org/r/604036 (https://phabricator.wikimedia.org/T254871) [14:51:35] (03CR) 10Dzahn: [C: 03+2] Gerrit: Enable crons on Gerrit replicas [puppet] - 10https://gerrit.wikimedia.org/r/604033 (owner: 10QChris) [14:52:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Enable notification on db2131 [puppet] - 10https://gerrit.wikimedia.org/r/604036 (https://phabricator.wikimedia.org/T254871) (owner: 10Marostegui) [14:52:44] (03CR) 10Alexandros Kosiaris: "> Hi Alex, even after waiting on the next Puppet run after this was merged, it 
doesn't appear that Puppet has created the .hfenv files and" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: 10Mholloway) [14:53:48] (03PS6) 10Filippo Giunchedi: prometheus: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/599298 (https://phabricator.wikimedia.org/T253840) [14:53:58] (03CR) 10Alexandros Kosiaris: "> Otherwise, just replacing "chromium-render" for "proton" in the file paths in this change should allow you to move forward just fine." [deployment-charts] - 10https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: 10Mholloway) [14:54:05] (03PS9) 10Cwhite: wmflib: add systemd.timer OnCalendar support to cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) [14:54:40] !log volans@deploy1001 Started deploy [debmonitor/deploy@44aa1ee]: Release v0.2.5 [14:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:42] 10Operations, 10Patch-For-Review: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sretest1002.eqiad.wmnet'] ` and were **ALL** successful. 
[14:55:12] (03CR) 10JMeybohm: [C: 03+2] lvs::configuration: termbox-https monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/603974 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [14:55:21] (03Abandoned) 10Jbond: role::mail::mx: enable jumpcloud test domain [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [14:55:23] !log volans@deploy1001 Finished deploy [debmonitor/deploy@44aa1ee]: Release v0.2.5 (duration: 00m 43s) [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:09] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC's happy https://puppet-compiler.wmflabs.org/compiler1001/23102/prometheus1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/599298 (https://phabricator.wikimedia.org/T253840) (owner: 10Filippo Giunchedi) [14:59:05] (03CR) 10Cwhite: [C: 03+2] wmflib: add systemd.timer OnCalendar support to cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) (owner: 10Cwhite) [14:59:56] !log gerrit2001 - delete gerrit logfiles older than 30 days, crons are now enabled to keep doing it in the future [14:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:00] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/604021 (https://phabricator.wikimedia.org/T254890) (owner: 10Elukey) [15:03:25] (03PS5) 10Alexandros Kosiaris: mathoid: added support egress rules mathoid: deleted _policy_helper.tpl mathoid: Restore tls_helpers template which was accidentally deleted [deployment-charts] - 10https://gerrit.wikimedia.org/r/597777 (owner: 10Apakhomov) [15:03:27] (03CR) 10Mholloway: [C: 03+2] "Aha, that would do it. I have never been happy with the ambiguous names of this service, tbh. :/ Changes coming shortly..." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: 10Mholloway) [15:04:09] (03PS1) 10Jbond: profile::mail::jumpmcloud: remove jumpcloud [puppet] - 10https://gerrit.wikimedia.org/r/604043 (https://phabricator.wikimedia.org/T244792) [15:04:11] (03PS1) 10Jbond: profile::mail::jumpcloud: [puppet] - 10https://gerrit.wikimedia.org/r/604044 (https://phabricator.wikimedia.org/T244792) [15:06:05] (03CR) 10Jbond: [C: 03+2] profile::mail::jumpmcloud: remove jumpcloud [puppet] - 10https://gerrit.wikimedia.org/r/604043 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:06:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: added support egress rules mathoid: deleted _policy_helper.tpl mathoid: Restore tls_helpers template which was accidentally deleted [deployment-charts] - 10https://gerrit.wikimedia.org/r/597777 (owner: 10Apakhomov) [15:06:32] (03CR) 10Jbond: [C: 03+2] profile::mail::jumpcloud: [puppet] - 10https://gerrit.wikimedia.org/r/604044 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:06:39] !log forcing a debmonitor GC to verify the fix of T254865 [15:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:43] T254865: Debmonitor GC not running - https://phabricator.wikimedia.org/T254865 [15:06:58] (03PS2) 10Alexandros Kosiaris: Bump chart versions for netpol bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/602706 [15:11:29] (03CR) 10Elukey: [C: 03+2] Add archiva1002 AAAA/A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/604021 (https://phabricator.wikimedia.org/T254890) (owner: 10Elukey) [15:12:30] (03PS1) 10Mholloway: Proton: Fix helmfile paths and namespace refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/604046 (https://phabricator.wikimedia.org/T225680) [15:14:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] Proton: Fix helmfile paths and namespace refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/604046 
(https://phabricator.wikimedia.org/T225680) (owner: 10Mholloway) [15:14:32] (03CR) 10Paladox: [C: 03+1] Gerrit: Enable crons on Gerrit replicas [puppet] - 10https://gerrit.wikimedia.org/r/604033 (owner: 10QChris) [15:14:58] (03Merged) 10jenkins-bot: Proton: Fix helmfile paths and namespace refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/604046 (https://phabricator.wikimedia.org/T225680) (owner: 10Mholloway) [15:16:04] (03CR) 10Subramanya Sastry: [C: 03+1] conftool: remove parsoid, keep parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/559705 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn) [15:16:11] (03PS1) 10Ema: cache: remove legacy req_handling directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [15:18:12] 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10akosiaris) Close to 1.5 years already and no movement either on the parent task or this one. Should we close for now as `Invalid`? [15:18:50] (03CR) 10Mholloway: Mobileapps: Add initial helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [15:19:14] (03PS2) 10Ema: cache: remove legacy req_handling directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [15:21:20] (03CR) 10Alexandros Kosiaris: "> Note that there have not been any VMs yet with public IPs here." [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [15:23:08] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Jclark-ctr) thanos-fe1002 b2 u32. 
switchport 38 [15:23:35] (03PS1) 10Andrew Bogott: cloud-vps resolve.conf: move traffic from recursor0 to recursor1 [puppet] - 10https://gerrit.wikimedia.org/r/604049 (https://phabricator.wikimedia.org/T253780) [15:24:03] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10Jclark-ctr) thanos-be1002 b2 u36 switchport 39 thanos-be1004 d7 u11 switchport 13 [15:24:20] (03CR) 10Bstorm: base/monitoring: allow setting different contactgroup for systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602052 (owner: 10Dzahn) [15:25:00] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [15:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:23] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:18] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [15:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:22] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:45] (03CR) 10Dzahn: "crons have been created on gerrit2001. 
then I also manually ran the find command to delete logs older than 30 days" [puppet] - 10https://gerrit.wikimedia.org/r/604033 (owner: 10QChris) [15:27:26] (03PS1) 10JMeybohm: lvs::configuration: termbox-https production [puppet] - 10https://gerrit.wikimedia.org/r/604050 (https://phabricator.wikimedia.org/T254581) [15:27:46] (03CR) 10jerkins-bot: [V: 04-1] lvs::configuration: termbox-https production [puppet] - 10https://gerrit.wikimedia.org/r/604050 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [15:28:34] (03CR) 10BearND: [C: 03+1] Mobileapps: Add initial helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [15:29:24] (03PS3) 10Dzahn: base/monitoring: allow setting different contactgroup for systemd [puppet] - 10https://gerrit.wikimedia.org/r/602052 [15:29:26] (03PS2) 10JMeybohm: lvs::configuration: termbox-https production [puppet] - 10https://gerrit.wikimedia.org/r/604050 (https://phabricator.wikimedia.org/T254581) [15:29:30] (03CR) 10Dzahn: base/monitoring: allow setting different contactgroup for systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602052 (owner: 10Dzahn) [15:29:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] Mobileapps: Add initial helmfile stanzas (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [15:34:55] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps resolve.conf: move traffic from recursor0 to recursor1 [puppet] - 10https://gerrit.wikimedia.org/r/604049 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [15:35:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud-vps resolve.conf: move traffic from recursor0 to recursor1 [puppet] - 10https://gerrit.wikimedia.org/r/604049 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [15:36:07] (03PS3) 10Ema: cache: remove director and other 
legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [15:39:17] (03CR) 10Ayounsi: [C: 03+2] Add sre.network.prepare-upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 (owner: 10Ayounsi) [15:39:23] (03PS5) 10Ayounsi: Add sre.network.prepare-upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 [15:40:20] 10Operations, 10netops, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10akosiaris) I just had a quick look into the 3 PoP ganeti clusters and it seems they aren't ready to serve public IPs VMs. /etc/network/interfa... [15:40:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs::configuration: termbox-https production [puppet] - 10https://gerrit.wikimedia.org/r/604050 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [15:41:05] (03PS1) 10Jbond: add dummy service file [labs/private] - 10https://gerrit.wikimedia.org/r/604052 [15:41:15] (03CR) 10Muehlenhoff: [C: 03+2] Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (https://phabricator.wikimedia.org/T252807) (owner: 10Muehlenhoff) [15:41:17] (03PS4) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [15:45:25] (03CR) 10JMeybohm: [C: 03+2] lvs::configuration: termbox-https production [puppet] - 10https://gerrit.wikimedia.org/r/604050 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [15:45:44] (03PS5) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [15:46:22] (03CR) 10Jbond: [V: 03+2 C: 03+2] add dummy service file [labs/private] - 10https://gerrit.wikimedia.org/r/604052 (owner: 10Jbond) [15:48:25] (03CR) 10Alexandros Kosiaris: Add recommendation-api helmfile 
stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:49:27] (03CR) 10Alexandros Kosiaris: "LGTM, we just await the benchmark results to have an idea of the # of replicas we want and move forward." [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:49:58] (03CR) 10Bstorm: [C: 03+1] base/monitoring: allow setting different contactgroup for systemd [puppet] - 10https://gerrit.wikimedia.org/r/602052 (owner: 10Dzahn) [15:50:53] 10Operations, 10netops, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10ayounsi) Sure I can do it, but do they need internet access? DHCP/TFTP shouldn't need internet access afaik? Are there other services running... [15:54:21] 10Operations, 10netops, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10BBlack) @Ayounsi - Yes, we're going to have some outbound recursive DNS needs from some ganeti-hosted services [15:55:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] "@ottomata, we probably need your +1 on this as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [15:56:09] (03PS1) 10Jbond: mail::mx: add scripts to pull data from gsuite [puppet] - 10https://gerrit.wikimedia.org/r/604059 [15:56:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] "@ottomata, we probably need your +1 on this as well." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [15:57:21] (03CR) 10jerkins-bot: [V: 04-1] mail::mx: add scripts to pull data from gsuite [puppet] - 10https://gerrit.wikimedia.org/r/604059 (owner: 10Jbond) [15:59:17] 10Operations, 10netops, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10ayounsi) a:03ayounsi [15:59:55] (03CR) 10Ottomata: [C: 03+1] eventstreams: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:00:04] godog and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200609T1600). [16:00:33] (03CR) 10Ottomata: [C: 03+1] "Ya! I'd love to do that, we just need to resolve the larger canary release problem to do it I think" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:01:56] 10Operations, 10netops, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) >>! In T254157#6206559, @ayounsi wrote: > Sure I can do it, but do they need internet access? DHCP/TFTP shouldn't need internet access... 
[16:02:20] (03CR) 10JMeybohm: "> Patch Set 4:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:02:54] (03PS2) 10Jbond: mail::mx: add scripts to pull data from gsuite [puppet] - 10https://gerrit.wikimedia.org/r/604059 (https://phabricator.wikimedia.org/T244792) [16:04:31] (03PS6) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:06:14] (03PS3) 10Jbond: mail::mx: add scripts to pull data from gsuite [puppet] - 10https://gerrit.wikimedia.org/r/604059 [16:06:38] !log cutting the branch for 1.35.0-wmf.36 T254173 [16:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:42] T254173: 1.35.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T254173 [16:07:25] (03CR) 10jerkins-bot: [V: 04-1] mail::mx: add scripts to pull data from gsuite [puppet] - 10https://gerrit.wikimedia.org/r/604059 (owner: 10Jbond) [16:09:06] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/604059 (owner: 10Jbond) [16:09:35] (03PS4) 10Jbond: mail::mx: add scripts to pull data from gsuite [puppet] - 10https://gerrit.wikimedia.org/r/604059 (https://phabricator.wikimedia.org/T244792) [16:10:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.35.0-wmf.36 [core] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604060 [16:13:22] (03CR) 10EBernhardson: Consolidate query_service profile duplication (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/599146 (owner: 10EBernhardson) [16:14:05] (03PS5) 10Jbond: mail::mx: add scripts to pull data from gsuite [puppet] - 10https://gerrit.wikimedia.org/r/604059 (https://phabricator.wikimedia.org/T244792) [16:15:18] (03PS7) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:15:34] 
(03PS1) 10Volans: CHANGELOG: add changelogs for release v4.0.0rc1 [software/cumin] - 10https://gerrit.wikimedia.org/r/604061 [16:15:39] (03CR) 10Jeena Huneidi: [C: 03+2] Branch commit for wmf/1.35.0-wmf.36 [core] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604060 (owner: 10TrainBranchBot) [16:17:03] 10Operations, 10Readers-Web-Backlog, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10ovasileva) a:05wkandek→03ovasileva [16:17:06] (03PS1) 10JMeybohm: services_proxy: switch termbox to TLS [puppet] - 10https://gerrit.wikimedia.org/r/604062 (https://phabricator.wikimedia.org/T254581) [16:18:22] (03CR) 10JMeybohm: "@joe: Is there a way to do this gradually?" [puppet] - 10https://gerrit.wikimedia.org/r/604062 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [16:18:54] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v4.0.0rc1 [software/cumin] - 10https://gerrit.wikimedia.org/r/604061 (owner: 10Volans) [16:20:13] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, 10SRE-swift-storage, and 3 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Krinkle) This is still reproducible also in Beta Clu... [16:20:40] (03PS8) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:20:59] (03CR) 10CDanis: "Is this still in need of review? 
I lost track the other day in the revert-of-revert-of-revert fun :)" [puppet] - 10https://gerrit.wikimedia.org/r/602329 (owner: 10Jbond) [16:21:16] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v4.0.0rc1 [software/cumin] - 10https://gerrit.wikimedia.org/r/604061 (owner: 10Volans) [16:21:49] (03CR) 10CDanis: prometheus: enable Thanos upload for ops in esams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:22:13] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/602329 (owner: 10Jbond) [16:22:26] (03Abandoned) 10Jbond: Revert "puppetmaster: update puppet-merge""" [puppet] - 10https://gerrit.wikimedia.org/r/602329 (owner: 10Jbond) [16:26:55] (03CR) 10CDanis: [C: 03+1] "looks good, thank you!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [16:28:05] (03PS9) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:29:21] (03PS10) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:29:52] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Yes there is one. We can disable puppet across the mw* fleet and then enable it and force run puppet on specific sets of hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/604062 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [16:30:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] Mobileapps: Add initial helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [16:30:52] (03CR) 10CDanis: [C: 03+1] thanos: add SyslogIdentifier=%N to systemd services [puppet] - 10https://gerrit.wikimedia.org/r/604009 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [16:31:23] (03CR) 10Jbond: "After speaking with Keith im going to mark this WIP and first try a "recipient verification callout" in exim" [puppet] - 10https://gerrit.wikimedia.org/r/604059 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [16:32:09] (03PS11) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:35:23] (03Merged) 10jenkins-bot: Branch commit for wmf/1.35.0-wmf.36 [core] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604060 (owner: 10TrainBranchBot) [16:35:25] (03CR) 10Alexandros Kosiaris: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [16:37:10] (03PS12) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:40:41] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:30] (03PS13) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:41:52] (03CR) 10Mholloway: [C: 03+2] "To be deployed during the upcoming service deploy window." 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [16:41:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Interestingly:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [16:42:22] (03Merged) 10jenkins-bot: Mobileapps: Add initial helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [16:43:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:40] (03PS14) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:44:04] (03PS1) 10Elukey: archiva: move archiva-gitfat-link to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/604066 (https://phabricator.wikimedia.org/T252767) [16:45:12] (03PS1) 10Volans: Upstream release v4.0.0rc1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/604067 [16:45:38] PROBLEM - Host thumbor2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:45:38] PROBLEM - Host thumbor2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:45:51] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604068 [16:45:53] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604068 (owner: 10Jeena Huneidi) [16:47:07] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604068 (owner: 10Jeena Huneidi) [16:47:30] (03PS15) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:50:48] RECOVERY - Host thumbor2001.mgmt is 
UP: PING OK - Packet loss = 0%, RTA = 36.88 ms [16:50:48] RECOVERY - Host thumbor2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.80 ms [16:50:56] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, 10SRE-swift-storage, and 3 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10sbassett) >>! In T230245#6206697, @Krinkle wrote: >... [16:53:18] (03PS16) 10Ema: cache: remove director and other legacy directives [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) [16:54:39] (03PS2) 10Elukey: archiva: move archiva-gitfat-link to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/604066 (https://phabricator.wikimedia.org/T252767) [16:55:13] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/604048 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [16:55:54] (03CR) 10jerkins-bot: [V: 04-1] archiva: move archiva-gitfat-link to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/604066 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [16:56:37] (03PS1) 10Jeena Huneidi: Revert "testwikis wikis to 1.35.0-wmf.36" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604070 [16:56:58] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "testwikis wikis to 1.35.0-wmf.36" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604070 (owner: 10Jeena Huneidi) [16:57:41] (03PS3) 10Elukey: archiva: move archiva-gitfat-link to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/604066 (https://phabricator.wikimedia.org/T252767) [16:57:51] (03Merged) 10jenkins-bot: Revert "testwikis wikis to 1.35.0-wmf.36" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604070 (owner: 10Jeena Huneidi) [17:00:04] halfak and accraze: That opportune time is upon us again. Time for a Services – Graphoid / Citoid / ORES deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200609T1700). [17:00:44] (03PS1) 10Ema: purged: kafka topic configuration for beta [puppet] - 10https://gerrit.wikimedia.org/r/604072 (https://phabricator.wikimedia.org/T254844) [17:01:58] 10Operations, 10ops-codfw, 10decommission, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Papaul) @Dzahn yes you can merge that. Thanks [17:02:13] (03CR) 10Elukey: [C: 03+2] archiva: move archiva-gitfat-link to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/604066 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [17:02:59] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) it took 3 hours 1/2 to rebuild the whole rack. Pictures coming soon [17:04:34] (03CR) 10Herron: [C: 03+1] profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [17:05:31] (03CR) 10Ppchelko: [C: 03+1] purged: kafka topic configuration for beta [puppet] - 10https://gerrit.wikimedia.org/r/604072 (https://phabricator.wikimedia.org/T254844) (owner: 10Ema) [17:05:43] (03CR) 10Ema: [C: 03+2] purged: kafka topic configuration for beta [puppet] - 10https://gerrit.wikimedia.org/r/604072 (https://phabricator.wikimedia.org/T254844) (owner: 10Ema) [17:06:05] (03PS1) 10Ssingh: dnsrecursor: make forward-zones and edns-subnet-whitelist optional [puppet] - 10https://gerrit.wikimedia.org/r/604075 (https://phabricator.wikimedia.org/T252132) [17:06:29] (03CR) 10Herron: [C: 03+1] thanos: add SyslogIdentifier=%N to systemd services [puppet] - 10https://gerrit.wikimedia.org/r/604009 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [17:09:47] (03PS1) 10Elukey: profile::archiva: fix timer's command [puppet] - 
10https://gerrit.wikimedia.org/r/604076 [17:10:07] (03PS12) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [17:10:09] (03PS7) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [17:11:44] (03CR) 10Elukey: [C: 03+2] profile::archiva: fix timer's command [puppet] - 10https://gerrit.wikimedia.org/r/604076 (owner: 10Elukey) [17:16:48] (03CR) 10Ssingh: "Daniel, John: I added Andrew as the reviewer with his permission as this seemed more relevant for him. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/604075 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:17:05] (03PS13) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [17:17:07] (03PS8) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [17:24:15] (03PS1) 10Cmjohnson: Updating dns for thanos-fe1002 to reflect rack relocation [dns] - 10https://gerrit.wikimedia.org/r/604080 (https://phabricator.wikimedia.org/T251620) [17:25:10] (03PS2) 10Cmjohnson: Updating dns for thanos-fe1002 to reflect rack relocation [dns] - 10https://gerrit.wikimedia.org/r/604080 (https://phabricator.wikimedia.org/T251620) [17:25:46] (03CR) 10Cmjohnson: [C: 03+2] Updating dns for thanos-fe1002 to reflect rack relocation [dns] - 10https://gerrit.wikimedia.org/r/604080 (https://phabricator.wikimedia.org/T251620) (owner: 10Cmjohnson) [17:30:31] (03CR) 10Muehlenhoff: Upstream release v4.0.0rc1 (031 comment) [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/604067 (owner: 10Volans) [17:32:33] (03PS1) 10Andrew Bogott: cloud-vps resolv.conf: restore use of both recursors [puppet] - 10https://gerrit.wikimedia.org/r/604084 (https://phabricator.wikimedia.org/T253780) [17:32:35] (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for cloudservices rebuilds" [puppet] - 
10https://gerrit.wikimedia.org/r/604085 [17:33:50] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps resolv.conf: restore use of both recursors [puppet] - 10https://gerrit.wikimedia.org/r/604084 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [17:34:10] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon: put into maintenance mode for cloudservices rebuilds" [puppet] - 10https://gerrit.wikimedia.org/r/604085 (owner: 10Andrew Bogott) [17:48:10] (03PS1) 10Cmjohnson: Updating dns for thanos-be1002/1004 to reflect rack change [dns] - 10https://gerrit.wikimedia.org/r/604091 [17:49:31] (03PS2) 10Volans: Upstream release v4.0.0rc1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/604067 [17:49:45] (03PS2) 10Cmjohnson: Updating dns for thanos-be1002/1004 to reflect rack change [dns] - 10https://gerrit.wikimedia.org/r/604091 [17:50:27] (03CR) 10Volans: "addressed comments" (031 comment) [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/604067 (owner: 10Volans) [17:50:57] (03PS14) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [17:50:59] (03PS9) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [17:51:06] (03PS1) 10Jforrester: Use UserGroupManagerFactory with correct domain to fetch groups [extensions/CheckUser] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604096 (https://phabricator.wikimedia.org/T234921) [17:51:13] (03CR) 10Muehlenhoff: [C: 03+1] Upstream release v4.0.0rc1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/604067 (owner: 10Volans) [17:51:23] (03CR) 10Jforrester: [C: 03+2] Use UserGroupManagerFactory with correct domain to fetch groups [extensions/CheckUser] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604096 (https://phabricator.wikimedia.org/T234921) (owner: 10Jforrester) [17:54:26] (03PS2) 10Volans: mgmt: use netbox-generated data for eqsin mgmt [dns] - 
10https://gerrit.wikimedia.org/r/601434 (https://phabricator.wikimedia.org/T233183) [17:54:30] (03CR) 10Herron: [C: 03+1] "Thanks for this! Big +1 for cleaning up lonely alerts" [puppet] - 10https://gerrit.wikimedia.org/r/602751 (owner: 10Cwhite) [17:54:46] (03CR) 10Volans: [C: 03+2] Upstream release v4.0.0rc1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/604067 (owner: 10Volans) [17:57:00] (03CR) 10Cmjohnson: [C: 03+2] Updating dns for thanos-be1002/1004 to reflect rack change [dns] - 10https://gerrit.wikimedia.org/r/604091 (owner: 10Cmjohnson) [17:57:11] (03Merged) 10jenkins-bot: Upstream release v4.0.0rc1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/604067 (owner: 10Volans) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200609T1800) [18:05:20] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) [18:08:04] (03Merged) 10jenkins-bot: Use UserGroupManagerFactory with correct domain to fetch groups [extensions/CheckUser] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604096 (https://phabricator.wikimedia.org/T234921) (owner: 10Jforrester) [18:12:49] !log uploaded cumin_4.0.0rc1-1_amd64.deb to apt.wikimedia.org buster-wikimedia [18:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:06] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/CheckUser/: T234921 T254912 Use UserGroupManagerFactory with correct domain to fetch groups (duration: 02m 26s) [18:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:11] T234921: Factor group membership management out of User class - https://phabricator.wikimedia.org/T234921 [18:13:11] T254912: CheckUser is using UserGroupMembership::getGroupMemberships for foreign wikis - 
https://phabricator.wikimedia.org/T254912 [18:21:13] (03PS1) 10Jforrester: Avoid undefined index error [extensions/TimedMediaHandler] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604100 (https://phabricator.wikimedia.org/T254824) [18:21:19] (03CR) 10Jforrester: [C: 03+2] Avoid undefined index error [extensions/TimedMediaHandler] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604100 (https://phabricator.wikimedia.org/T254824) (owner: 10Jforrester) [18:21:22] 10Operations, 10Patch-For-Review: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10Volans) @Krenair @MoritzMuehlenhoff I've //finally!// built and uploaded cumin `4.0.0~rc1` to our APT in `wikimedia-buster`. The package is installable and from some quick test on a local setup it sho... [18:21:44] longma, brennen: OK, CU fix deployed; TMH one now landing. Then it should be OK to proceed. [18:22:05] thanks James_F [18:24:36] ^ [18:31:11] (03CR) 10Volans: [C: 03+2] "Verified again one by one all the records with the files on the netbox exported repos on an authdns host." 
[dns] - 10https://gerrit.wikimedia.org/r/601434 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:31:56] (03PS3) 10Volans: mgmt: use netbox-generated data for eqsin mgmt [dns] - 10https://gerrit.wikimedia.org/r/601434 (https://phabricator.wikimedia.org/T233183) [18:32:42] (03CR) 10Volans: [C: 03+2] mgmt: use netbox-generated data for eqsin mgmt [dns] - 10https://gerrit.wikimedia.org/r/601434 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:36:44] !log migrated mgmt DNS records in eqsin to the Netbox-generated records - T233183 [18:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:48] T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 [18:37:15] (03Merged) 10jenkins-bot: Avoid undefined index error [extensions/TimedMediaHandler] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604100 (https://phabricator.wikimedia.org/T254824) (owner: 10Jforrester) [18:39:26] I am pulling the backports into the branch on the deploy server now [18:39:40] Oh, I've done it for everything. [18:39:46] Just needs TMH to be synced. [18:40:33] I haven't synced anything yet. Was waiting for the backports [18:41:07] longma: OK, all done, over to you. [18:41:13] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/TimedMediaHandler/includes/TimedMediaHandler.php: T254824 Avoid undefined index error (duration: 00m 57s) [18:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:17] T254824: Undefined indexes: start and end in includes/TimedMediaHandler.php +392 - https://phabricator.wikimedia.org/T254824 [18:41:17] (Tsk, deploy-and-dashing.) [18:41:40] * James_F goes for dinner. 
[18:43:54] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604108 [18:43:56] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604108 (owner: 10Jeena Huneidi) [18:44:47] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604108 (owner: 10Jeena Huneidi) [18:45:09] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.36 [18:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:11] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1001.eqiad.wmnet ` The log can be found in `/var... [18:53:02] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1002.eqiad.wmnet ` The log can be found in `/var... [18:54:00] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10Ottomata) Hm, I'm pretty sure the connection is terminated even when there are events being sent. ` ti... 
[18:54:42] (03PS1) 10Urbanecm: Grant cswiki accountcreators tboverride-account and override-antispoof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604112 (https://phabricator.wikimedia.org/T254927) [18:57:37] (03PS1) 10Mholloway: Echo: Enable push subscription management API (BETA) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604115 (https://phabricator.wikimedia.org/T252899) [18:57:39] (03PS1) 10Mholloway: Echo: Enable push notifier (BETA) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604116 (https://phabricator.wikimedia.org/T252899) [19:00:04] longma and liw: (Dis)respected human, time to deploy Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200609T1900). Please do the needful. [19:00:06] (03CR) 10Ottomata: [C: 03+1] "Right, but to reuse common_templates/ we'd have to integrate a canary solution into those?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [19:01:02] Still syncing testwikis [19:05:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:52] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1003.eqiad.wmnet ` The log can be found in `/var... 
[19:10:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:45] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-be1001.eqiad.wmnet'] ` and were **ALL** successful. [19:12:48] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1004.eqiad.wmnet ` The log can be found in `/var... [19:15:30] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-be1002.eqiad.wmnet'] ` and were **ALL** successful. 
[19:23:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:23] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:09] (03CR) 10JMeybohm: "> Patch Set 4:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [19:31:12] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-be1003.eqiad.wmnet'] ` and were **ALL** successful. [19:34:23] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-be1004.eqiad.wmnet'] ` and were **ALL** successful. 
[19:35:08] (03PS1) 10Nuria: Blacklisting deleted schema [puppet] - 10https://gerrit.wikimedia.org/r/604130 [19:37:14] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10Cmjohnson) [19:37:39] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10Cmjohnson) 05Open→03Resolved These are all finished [19:41:50] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) Old design and new design https://drive.google.com/drive/folders/14M22fzUhnhgtDOKyVMFXiNeCOEfbG9es [19:42:56] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.36 (duration: 57m 47s) [19:42:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Hardware): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) @wiki_willy what should we do about this server? At this point going back to Dell feels like throwing good mone... 
[19:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:40] Deploying to group0 [19:45:24] (03PS1) 10Jeena Huneidi: group0 wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604135 [19:45:26] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604135 (owner: 10Jeena Huneidi) [19:47:14] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604135 (owner: 10Jeena Huneidi) [19:51:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Hardware): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) Hi @Andrew - I'll sync up with @Jclark-ctr tomorrow to get a summary of the interactions that have taken pla... [19:51:22] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.36 [19:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:05] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) [19:57:44] Cleaning up old branches [20:00:47] (03PS1) 10Volans: mgmt: use netbox-generated data for esams mgmt [dns] - 10https://gerrit.wikimedia.org/r/604136 (https://phabricator.wikimedia.org/T233183) [20:00:55] !log jhuneidi@deploy1001 Pruned MediaWiki: 1.35.0-wmf.32 (duration: 05m 11s) [20:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:06] Train done for today [20:03:57] (03CR) 10Volans: "question inline" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/604136 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [20:04:04] 10Operations, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team): Move wikitech purges to kafka - https://phabricator.wikimedia.org/T254828 (10Pchelolo) [20:06:28] (03PS15) 10Herron: 
elasticsearch: manage java dependencies with ::profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) [20:07:00] (03CR) 10Reedy: [C: 04-1] Blacklisting deleted schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/604130 (owner: 10Nuria) [20:09:05] (03CR) 10Mholloway: [C: 03+2] Echo: Enable push subscription management API (BETA) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604115 (https://phabricator.wikimedia.org/T252899) (owner: 10Mholloway) [20:09:57] (03Merged) 10jenkins-bot: Echo: Enable push subscription management API (BETA) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604115 (https://phabricator.wikimedia.org/T252899) (owner: 10Mholloway) [20:13:33] (03CR) 10Mholloway: [C: 03+2] Echo: Enable push notifier (BETA) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604116 (https://phabricator.wikimedia.org/T252899) (owner: 10Mholloway) [20:14:20] (03Merged) 10jenkins-bot: Echo: Enable push notifier (BETA) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604116 (https://phabricator.wikimedia.org/T252899) (owner: 10Mholloway) [20:15:00] (03PS1) 10Bartosz Dziewoński: Make VisualEditorDisableForAnons only hide the tabs, not disable the editor [extensions/VisualEditor] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/604138 (https://phabricator.wikimedia.org/T253941) [20:15:33] (03PS1) 10Bartosz Dziewoński: Make VisualEditorDisableForAnons only hide the tabs, not disable the editor [extensions/VisualEditor] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/604139 (https://phabricator.wikimedia.org/T253941) [20:16:26] (03CR) 10Bartosz Dziewoński: "Note that I squashed https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/603561 into this commit when cherry-picking, to " [extensions/VisualEditor] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/604138 (https://phabricator.wikimedia.org/T253941) (owner: 10Bartosz Dziewoński) [20:26:07] (03PS2) 
10Ottomata: Exclude deleted eventlogging schema [puppet] - 10https://gerrit.wikimedia.org/r/604130 (owner: 10Nuria) [20:28:02] (03CR) 10Ottomata: [C: 03+2] Exclude deleted eventlogging schema [puppet] - 10https://gerrit.wikimedia.org/r/604130 (owner: 10Nuria) [20:30:19] (03PS1) 10Ottomata: Exclude deleted eventlogging schema from proper refine job [puppet] - 10https://gerrit.wikimedia.org/r/604142 [20:34:35] (03CR) 10Volans: "Much cleaner, we're going in the right direction :) Some comment/question inline." (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [20:36:36] (03CR) 10Ottomata: [C: 03+2] Exclude deleted eventlogging schema from proper refine job [puppet] - 10https://gerrit.wikimedia.org/r/604142 (owner: 10Ottomata) [20:39:29] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10AndrewKuznetsov) [20:41:30] (03CR) 10Volans: [C: 03+1] "LGTM, one comment was not answered, not sure if was left on purpose or by accident." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [20:52:18] (03CR) 10Herron: [C: 04-2] "updated PCC https://puppet-compiler.wmflabs.org/compiler1003/23132/" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [20:55:47] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) [21:03:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Hardware): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) Thanks @wiki_willy! I wouldn't love to decom that host, but if thinking about this is stealing DC-Ops's time aw... 
[21:33:59] (PS1) CDanis: ats: per-instance named healthcheck URL [puppet] - https://gerrit.wikimedia.org/r/604148 [21:34:19] (CR) jerkins-bot: [V: -1] ats: per-instance named healthcheck URL [puppet] - https://gerrit.wikimedia.org/r/604148 (owner: CDanis) [21:36:29] Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (bd808) >>! In T254491#6207860, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-cloud), href=https://sal.toolforge.org/l... [21:37:51] Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (bd808) [21:39:49] (PS2) CDanis: ats: per-instance named healthcheck URL [puppet] - https://gerrit.wikimedia.org/r/604148 [21:40:09] (CR) jerkins-bot: [V: -1] ats: per-instance named healthcheck URL [puppet] - https://gerrit.wikimedia.org/r/604148 (owner: CDanis) [21:41:33] (PS3) CDanis: ats: per-instance named healthcheck URL [puppet] - https://gerrit.wikimedia.org/r/604148 [21:41:36] (CR) Ayounsi: mgmt: use netbox-generated data for esams mgmt (1 comment) [dns] - https://gerrit.wikimedia.org/r/604136 (https://phabricator.wikimedia.org/T233183) (owner: Volans) [21:43:51] (PS4) CDanis: ats: per-instance named healthcheck URL [puppet] - https://gerrit.wikimedia.org/r/604148 [21:46:05] (CR) CDanis: "PCC lgtm: https://puppet-compiler.wmflabs.org/compiler1003/23136/cp1079.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/604148 (owner: CDanis) [21:47:51] (PS1) CDanis: remove obsoleted/unused wmflib::service::lvs_icinga() [puppet] - https://gerrit.wikimedia.org/r/604149 [21:52:01] (CR) CDanis: "pcc confirms noop https://puppet-compiler.wmflabs.org/compiler1003/23137/" [puppet] - https://gerrit.wikimedia.org/r/604149 (owner: CDanis) 
[22:02:33] (PS3) RhinosF1: Enable WikiLove on slwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/603453 (https://phabricator.wikimedia.org/T254706) [22:02:57] * RhinosF1 doing backport window [22:04:00] (PS3) RhinosF1: Enable VE on bnwikibook's wikijunior namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/599810 (https://phabricator.wikimedia.org/T241893) [22:04:09] (PS6) RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012) [22:05:05] * RhinosF1 done rebasing [22:15:00] (CR) Cwhite: [C: +1] thanos: add SyslogIdentifier=%N to systemd services [puppet] - https://gerrit.wikimedia.org/r/604009 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi) [22:18:32] (PS1) Cwhite: profile: add ecs 1.5.0 template [puppet] - https://gerrit.wikimedia.org/r/604155 (https://phabricator.wikimedia.org/T234565) [22:42:11] Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (bd808) [22:50:18] (PS1) Andrew Bogott: profile::wmcs::proxy::static: allow hiera to specify an acme-chief cert [puppet] - https://gerrit.wikimedia.org/r/604165 (https://phabricator.wikimedia.org/T252721) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200609T2300). [23:00:04] RhinosF1: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:13] o/ [23:02:22] * Reedy looks [23:03:18] !log created wikilove_log on slwiki T254706 [23:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:22] T254706: Enable wikilove on slwiki - https://phabricator.wikimedia.org/T254706 [23:03:48] (CR) Reedy: [C: +2] Enable WikiLove on slwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/603453 (https://phabricator.wikimedia.org/T254706) (owner: RhinosF1) [23:04:36] (Merged) jenkins-bot: Enable WikiLove on slwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/603453 (https://phabricator.wikimedia.org/T254706) (owner: RhinosF1) [23:06:23] (PS7) Reedy: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012) (owner: RhinosF1) [23:06:29] (CR) Reedy: [C: +2] Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012) (owner: RhinosF1) [23:06:47] * RhinosF1 is ready when you say what debug its on [23:07:11] These are pretty easy to verify without testing [23:07:15] (Merged) jenkins-bot: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012) (owner: RhinosF1) [23:07:23] (PS4) Reedy: Enable VE on bnwikibook's wikijunior namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/599810 (https://phabricator.wikimedia.org/T241893) (owner: RhinosF1) [23:07:26] Reedy: extremely easy [23:07:34] (CR) Reedy: [C: +2] Enable VE on bnwikibook's wikijunior namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/599810 (https://phabricator.wikimedia.org/T241893) (owner: RhinosF1) [23:07:55] the test for these is looking at max a few pages each [23:08:16] I can do most looking at 3 pages across the lot [23:08:21] 
(Merged) jenkins-bot: Enable VE on bnwikibook's wikijunior namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/599810 (https://phabricator.wikimedia.org/T241893) (owner: RhinosF1) [23:10:18] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T254706 T254012 T241893 (duration: 01m 06s) [23:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:24] T241893: Enable VisualEditor on bn.wikibooks at উইকিশৈশব (Wikijunior) Namespace - https://phabricator.wikimedia.org/T241893 [23:10:24] T254706: Enable wikilove on slwiki - https://phabricator.wikimedia.org/T254706 [23:10:25] T254012: Change in project namespace and add namespace shortcuts on hiwikibooks. - https://phabricator.wikimedia.org/T254012 [23:12:51] Reedy: nothing comes up when I click the heart on slwiki [23:13:01] (CR) Andrew Bogott: [C: +2] profile::wmcs::proxy::static: allow hiera to specify an acme-chief cert [puppet] - https://gerrit.wikimedia.org/r/604165 (https://phabricator.wikimedia.org/T252721) (owner: Andrew Bogott) [23:13:06] JS caching? [23:13:10] It's on Special:Version, so it's enabled [23:13:24] Reedy: it's enabled but likely [23:13:29] * RhinosF1 goes to open incog [23:13:51] https://sl.wikipedia.org/wiki/Uporabni%C5%A1ki_pogovor:RhinosF1#A_goat_for_you! [23:13:52] works fine [23:15:04] Reedy: good, I assume you've checked the rest [23:15:33] * RhinosF1 remembers namespaceDupes for hiwikibooks [23:15:42] Reedy: ^ shouldn't you run [23:16:05] In most cases it's not needed [23:16:40] * RhinosF1 knows it's unlikely but better to remind you and be safe [23:18:08] looks like they've added numerous redirects [23:18:29] Reedy: I'll resolve everything on phabricator while you look at that. Everything seems sane. 
[23:18:39] !log run namespaceDupes.php --fix for hiwikibooks T254012 [23:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:45] T254012: Change in project namespace and add namespace shortcuts on hiwikibooks. - https://phabricator.wikimedia.org/T254012 [23:20:45] * RhinosF1 done [23:21:08] Reedy: all tasks closed, Thanks for deploying! [23:21:14] np [23:25:15] Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (Bstorm)