[00:03:25] funny how fixing a bug makes a bug Jdlrobson :)
[00:39:15] Operations, Performance-Team, vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (ThesenatorO5-2) p:Triage→Medium
[00:50:19] Operations, Performance-Team, vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (Krinkle) p:Medium→High
[00:51:22] (CR) Dzahn: [C: +1] doc: move Apache config to flat file [puppet] - https://gerrit.wikimedia.org/r/607525 (owner: Hashar)
[00:51:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:43] Operations, Performance-Team, vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (Krinkle) @ThesenatorO5-2 Please explain why you triaged the priority of this task.
[00:51:46] (CR) Dzahn: [C: +1] contint: move Apache config to flat file [puppet] - https://gerrit.wikimedia.org/r/607524 (owner: Hashar)
[00:53:45] (CR) Krinkle: [C: +1] contint: move Apache config to flat file [puppet] - https://gerrit.wikimedia.org/r/607524 (owner: Hashar)
[00:55:02] Operations, Performance-Team, vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (ThesenatorO5-2) p:High→Unbreak! I do not know the importance of this task, anyway. fixed it.
[00:55:53] Operations, Performance-Team, vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (ThesenatorO5-2) I did not found the two mentioned file(s)
[01:02:35] Operations, Performance-Team, vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (Krinkle) p:Unbreak!→High
[01:02:51] Operations, Performance-Team, vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (Krinkle) Please stop changing the priority of random tasks. I will ask for your account to be blocked if you continue.
[01:03:19] (PS1) Andrew Bogott: nova flavor aggregate monitoring: increase timeout [puppet] - https://gerrit.wikimedia.org/r/619599
[01:03:26] PROBLEM - Host mc2028 is DOWN: PING CRITICAL - Packet loss = 100%
[01:03:42] (CR) jerkins-bot: [V: -1] nova flavor aggregate monitoring: increase timeout [puppet] - https://gerrit.wikimedia.org/r/619599 (owner: Andrew Bogott)
[01:04:40] (PS2) Andrew Bogott: nova flavor aggregate monitoring: increase timeout [puppet] - https://gerrit.wikimedia.org/r/619599 (https://phabricator.wikimedia.org/T259542)
[01:06:04] (CR) Andrew Bogott: [C: +2] nova flavor aggregate monitoring: increase timeout [puppet] - https://gerrit.wikimedia.org/r/619599 (https://phabricator.wikimedia.org/T259542) (owner: Andrew Bogott)
[01:06:46] PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:11:18] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=mc1028 site=eqiad tunnel=mc2028_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[01:19:27] (CR) Dzahn: [C: +1] ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: Hashar)
[01:22:01] mutante: do you have a preference for disabling job first or switching docroot first?
[01:22:10] I can disable the job now
[01:24:16] Krinkle: thank you, but not right now. Hashar wanted to schedule it for next week and I just wanted to review and assign as a reminder
[01:24:27] ok
[01:36:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:40:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:44:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:46:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:36:18] (CR) Cwhite: [C: +1] "Looks like a great start! Thanks for this!" [puppet] - https://gerrit.wikimedia.org/r/619295 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[02:36:52] (PS2) Krinkle: Remove bogus $wgWMEPhp7SamplingRate setting [mediawiki-config] - https://gerrit.wikimedia.org/r/609494 (https://phabricator.wikimedia.org/T219127)
[02:37:50] (PS4) Krinkle: scap: Remove commit and sync steps from 'update-interwiki-cache' [mediawiki-config] - https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107)
[02:38:44] (CR) Krinkle: "@James That seems fine. This isn't adding code though, it's removing an armed footgun. Given it's been another month and the rm-rf commit " [mediawiki-config] - https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107) (owner: Krinkle)
[02:39:14] (CR) Cwhite: "I was able to build this on my machine and the output looked good. However, I cannot build this on deneb with the same `pdebuild` invocat" [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143) (owner: Filippo Giunchedi)
[02:40:37] (CR) Cwhite: "LGTM once the grafana-plugins deb is available" [puppet] - https://gerrit.wikimedia.org/r/619451 (https://phabricator.wikimedia.org/T259143) (owner: Filippo Giunchedi)
[02:41:10] (CR) Cwhite: [C: +1] add alertmanager to alerting_host [puppet] - https://gerrit.wikimedia.org/r/619296 (owner: Filippo Giunchedi)
[02:56:42] Operations, Wikimedia-Logstash, observability, Patch-For-Review, Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (colewhite) >>! In T259794#6378106, @Krinkle wrote: > Can we set a hard CSP on this domain at the web server level so that in gener...
[03:48:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:58:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:00:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:04:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:21:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:33:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:39:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:43:04] (CR) Marostegui: [C: +2] control-mariadb-10.4*: Update package version [software] - https://gerrit.wikimedia.org/r/619462 (owner: Marostegui)
[04:51:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 for MCR change', diff saved to https://phabricator.wikimedia.org/P12214 and previous config saved to /var/cache/conftool/dbconfig/20200812-045157-marostegui.json
[04:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:32] (PS1) Marostegui: labsdb.zone: Update s5 wikis [puppet] - https://gerrit.wikimedia.org/r/619627 (https://phabricator.wikimedia.org/T259438)
[05:05:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:09:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:41:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:45:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:19:16] (CR) Volans: [C: +1] "LGTM, tests can also be added in a separate CR" (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/617418 (owner: Ayounsi)
[06:21:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:26:49] (CR) Jcrespo: [C: +1] "+1 for check_private_data.py" [puppet] - https://gerrit.wikimedia.org/r/619572 (owner: Cwhite)
[06:27:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:31:59] Operations, DBA, Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (jcrespo) Maybe this can be scheduled before or after the maintenance for T259589?
[06:34:53] (Abandoned) Volans: mysql: adapt Cumin queries to select DBs [software/spicerack] - https://gerrit.wikimedia.org/r/570161 (https://phabricator.wikimedia.org/T243935) (owner: Volans)
[06:38:10] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 112.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[06:44:31] Operations, ops-codfw: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (elukey) p:Triage→High
[06:48:11] (CR) Volans: [C: +1] "LGTM given the related patch for homer" [homer/public] - https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277) (owner: Ayounsi)
[06:57:02] Operations, DBA, Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (Marostegui) I don't have the bandwidth to prepare this change before Tuesday - @jcrespo if you happen to have some room to prepare this before Tuesday (or after), please t...
[07:22:43] Operations, DBA, Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (mmodell) I made a note about how we go about setting a separate password for PHD daemons in a comment at T146055#6378825. Essentially we just need to define a new passwor...
[07:31:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 for reimage', diff saved to https://phabricator.wikimedia.org/P12215 and previous config saved to /var/cache/conftool/dbconfig/20200812-073130-marostegui.json
[07:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:47] (PS1) Marostegui: mariadb: Reimage db1110 to Buster [puppet] - https://gerrit.wikimedia.org/r/619698 (https://phabricator.wikimedia.org/T250666)
[07:33:45] (CR) Marostegui: [C: +2] mariadb: Reimage db1110 to Buster [puppet] - https://gerrit.wikimedia.org/r/619698 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[07:46:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[07:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:04] (PS2) JMeybohm: helm: Add wmf-stable helm repo [puppet] - https://gerrit.wikimedia.org/r/619493
[08:27:51] Operations, ops-codfw, DC-Ops, SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (fgiunchedi)
[08:28:47] Operations, ops-codfw, DC-Ops, SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (fgiunchedi) a:fgiunchedi→Papaul >>! In T260188#6377449, @RobH wrote: > @fgiunchedi: What racking restrictions...
[08:31:57] Operations, Wikimedia-Logstash, observability, Patch-For-Review, Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (fgiunchedi) >>! In T259794#6378450, @colewhite wrote: >>>! In T259794#6378106, @Krinkle wrote: >> Can we set a hard CSP on this do...
[08:35:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:35:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1110', diff saved to https://phabricator.wikimedia.org/P12217 and previous config saved to /var/cache/conftool/dbconfig/20200812-083548-marostegui.json
[08:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:54] Operations, observability, User-fgiunchedi: VictorOps behavior on long-ack'd incidents - https://phabricator.wikimedia.org/T259465 (fgiunchedi) >>! In T259465#6376612, @Bstorm wrote: >>>! In T259465#6376519, @fgiunchedi wrote: >> >> Good question re: SRE rotation only, I forgot to specify that the s...
[08:36:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:36:32] (PS1) Marostegui: db1110: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/619701
[08:37:15] (CR) Filippo Giunchedi: [C: +2] prometheus: Introduce Alertmanager [puppet] - https://gerrit.wikimedia.org/r/619295 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[08:37:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:37:17] (CR) Marostegui: [C: +2] db1110: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/619701 (owner: Marostegui)
[08:37:27] (CR) Filippo Giunchedi: [C: +2] add alertmanager to alerting_host [puppet] - https://gerrit.wikimedia.org/r/619296 (owner: Filippo Giunchedi)
[08:41:07] (CR) Giuseppe Lavagetto: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1003/24448/deploy1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/619493 (owner: JMeybohm)
[08:44:17] (CR) JMeybohm: [C: +2] helm: Add wmf-stable helm repo [puppet] - https://gerrit.wikimedia.org/r/619493 (owner: JMeybohm)
[08:45:07] hey godog p/ your prometheus/alertmanager changes okay to puppet-merge?
[08:45:30] jayme: woops! yes please, thank you
[08:45:54] ok, merged :)
[08:48:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:50:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1110', diff saved to https://phabricator.wikimedia.org/P12218 and previous config saved to /var/cache/conftool/dbconfig/20200812-085021-marostegui.json
[08:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:29] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:03:50] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:03:56] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:04:28] PROBLEM - puppet last run on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:04:30] PROBLEM - puppet last run on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:04:52] PROBLEM - Check that envoy is running on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:04:54] PROBLEM - mcrouter process on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Mcrouter
[09:05:10] PROBLEM - nutcracker process on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[09:05:12] PROBLEM - nutcracker socket on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[09:05:14] PROBLEM - nutcracker process on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[09:05:30] PROBLEM - PHP opcache health on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:05:34] PROBLEM - Check systemd state on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:39] PROBLEM - php7.2-fpm service on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:06:02] <_joe_> uhm lemme see what's up with those two servers
[09:06:02] PROBLEM - mcrouter process on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Mcrouter
[09:06:12] <_joe_> jayme: can you take a look at mw1357?
[09:06:12] PROBLEM - php7.2-fpm service on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:06:14] PROBLEM - nutcracker socket on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[09:06:20] PROBLEM - Check size of conntrack table on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:06:20] PROBLEM - PHP opcache health on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:06:24] PROBLEM - DPKG on mw1357 is CRITICAL: connect to address 10.64.48.199 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:06:45] _joe_: ack
[09:07:21] <_joe_> it looks like they finished memory, or the process table
[09:08:06] PROBLEM - MD RAID on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:08:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1110', diff saved to https://phabricator.wikimedia.org/P12219 and previous config saved to /var/cache/conftool/dbconfig/20200812-090831-marostegui.json
[09:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:57] yeah. I see "faild to fork" as well
[09:10:29] RECOVERY - Check systemd state on mw1357 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:42] PROBLEM - configured eth on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:10:52] RECOVERY - mcrouter process on mw1357 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter
[09:11:16] RECOVERY - php7.2-fpm service on mw1357 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:11:16] Operations, SRE-Access-Requests, Patch-For-Review: Add new SSH key for Neil Shah-Quinn - https://phabricator.wikimedia.org/T260160 (elukey) I had a chat with Neil on meet and he confirmed his identity and the validity of this task, we can proceed :)
[09:11:24] RECOVERY - PHP opcache health on mw1357 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:11:28] RECOVERY - Check that envoy is running on mw1357 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:11:50] RECOVERY - nutcracker socket on mw1357 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_eqiad.sock https://wikitech.wikimedia.org/wiki/Nutcracker
[09:11:52] RECOVERY - nutcracker process on mw1357 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[09:12:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1110', diff saved to https://phabricator.wikimedia.org/P12220 and previous config saved to /var/cache/conftool/dbconfig/20200812-091211-marostegui.json
[09:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:31] (CR) Neil P. Quinn-WMF: [C: +1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/619505 (https://phabricator.wikimedia.org/T260160) (owner: Elukey)
[09:14:04] PROBLEM - Disk space on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw1377&var-datasource=eqiad+prometheus/ops
[09:14:06] <_joe_> !log depooling mw1377 for inspection
[09:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:40] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw1357 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:16:52] PROBLEM - Ensure local MW versions match expected deployment on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[09:19:12] PROBLEM - Check systemd state on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:19:34] PROBLEM - Check that envoy is running on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:19:49] (PS3) Ayounsi: Configure transport links OSPF based on Netbox data [homer/public] - https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277)
[09:19:55] (PS1) Ayounsi: Workaround a Jinja regression [homer/public] - https://gerrit.wikimedia.org/r/619710
[09:21:20] (CR) Vgutierrez: [C: +1] admin: add new ssh key for neilpquinn-wmf [puppet] - https://gerrit.wikimedia.org/r/619505 (https://phabricator.wikimedia.org/T260160) (owner: Elukey)
[09:22:40] (CR) Elukey: [C: +2] admin: add new ssh key for neilpquinn-wmf [puppet] - https://gerrit.wikimedia.org/r/619505 (https://phabricator.wikimedia.org/T260160) (owner: Elukey)
[09:22:46] <_joe_> !log depool mw1357 tool
[09:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:58] (PS4) Gilles: Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108)
[09:27:09] (CR) jerkins-bot: [V: -1] Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[09:27:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:27:49] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[09:28:31] (PS1) Filippo Giunchedi: alertmanager: add advertise-address cluster option [puppet] - https://gerrit.wikimedia.org/r/619712 (https://phabricator.wikimedia.org/T258948)
[09:28:48] PROBLEM - IPMI Sensor Status on mw1377 is CRITICAL: connect to address 10.64.48.219 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:32:06] RECOVERY - Check size of conntrack table on mw1377 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:32:14] RECOVERY - mcrouter process on mw1377 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter
[09:32:38] RECOVERY - nutcracker process on mw1377 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[09:32:42] RECOVERY - Check systemd state on mw1377 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:02] RECOVERY - Check that envoy is running on mw1377 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:33:04] RECOVERY - PHP opcache health on mw1377 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:33:16] RECOVERY - php7.2-fpm service on mw1377 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:33:58] RECOVERY - nutcracker socket on mw1377 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_eqiad.sock https://wikitech.wikimedia.org/wiki/Nutcracker
[09:34:02] RECOVERY - puppet last run on mw1377 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:34:42] RECOVERY - Ensure local MW versions match expected deployment on mw1377 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[09:35:00] RECOVERY - Disk space on mw1377 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw1377&var-datasource=eqiad+prometheus/ops
[09:36:28] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw1377 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:36:43] (PS1) Kormat: pontoon: Disable rsyslog remote logging [puppet] - https://gerrit.wikimedia.org/r/619715
[09:36:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:37:20] RECOVERY - DPKG on mw1357 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:38:09] (PS2) Kormat: pontoon: Disable rsyslog remote logging [puppet] - https://gerrit.wikimedia.org/r/619715
[09:38:12] Operations, Performance-Team, vm-requests: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (Aklapper) @ThesenatorO5-2: Please read https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette, https://www.mediawiki.org/wiki/How_to_report_a_bug, and h...
[09:40:08] <_joe_> !log rebooting mw1377 [09:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:41:28] PROBLEM - Apache HTTP on mw1357 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:41:38] PROBLEM - PHP7 rendering on mw1357 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:41:46] PROBLEM - php7.2-fpm service on mw1357 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:42:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Neil Shah-Quinn - https://phabricator.wikimedia.org/T260160 (10elukey) kerberos account re-created too. 
[09:43:58] PROBLEM - Check that envoy is running on mw1357 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:44:43] (03PS5) 10Gilles: Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) [09:45:14] PROBLEM - mcrouter process on mw1357 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [09:45:36] <_joe_> !log repooling mw1377 [09:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:45] (03CR) 10Jforrester: "Eh. Fine. But please liaise with Lars; scap is his thing now, and was never mine. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107) (owner: 10Krinkle) [09:48:24] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/24449/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619712 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:51:46] RECOVERY - MD RAID on mw1377 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:58:03] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add new SSH key for Neil Shah-Quinn - https://phabricator.wikimedia.org/T260160 (10nshahquinn-wmf) 05Open→03Resolved a:03nshahquinn-wmf It works! Thanks, @elukey! 
[09:58:11] <_joe_> jouncebot: next [09:58:12] In 1 hour(s) and 1 minute(s): European mid-day backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1100) [09:58:21] <_joe_> sigh, we will need to cancel that window I think [09:58:36] it has no patches scheduled atm [09:58:46] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1377 is OK: OK: synced at Wed 2020-08-12 09:58:45 UTC. https://wikitech.wikimedia.org/wiki/NTP [09:59:44] RECOVERY - IPMI Sensor Status on mw1377 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [10:01:24] PROBLEM - PHP opcache health on mw1357 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:01:44] PROBLEM - Check no envoy runtime configuration is left persistent on mw1357 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:06:40] RECOVERY - mcrouter process on mw1357 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [10:06:46] RECOVERY - Apache HTTP on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:06:56] RECOVERY - PHP7 rendering on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:07:00] <_joe_> uhm jayme you also need to stop puppet :P [10:07:06] RECOVERY - php7.2-fpm service on mw1357 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:07:14] RECOVERY - PHP opcache health on mw1357 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:07:22] RECOVERY - Check that envoy is running on mw1357 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:07:27] <_joe_> sudo puppet agent --disable "reason --jayme" [10:08:45] yeah..I'm fine with them coming up again as well I guess. Basically just wanted to see if reaping the fpm zombies would give us back the memory [10:09:46] RECOVERY - puppet last run on mw1357 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:10:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:12:36] RECOVERY - configured eth on mw1377 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [10:14:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you!"
[puppet] - 10https://gerrit.wikimedia.org/r/619715 (owner: 10Kormat) [10:15:32] (03CR) 10Kormat: [C: 03+2] pontoon: Disable rsyslog remote logging [puppet] - 10https://gerrit.wikimedia.org/r/619715 (owner: 10Kormat) [10:20:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline for optional comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [10:26:52] (03PS5) 10Filippo Giunchedi: Debian packaging for Grafana plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143) [10:27:12] (03CR) 10Filippo Giunchedi: "> Patch Set 4:" [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [10:31:16] mutante, thcipriani: I think the SSH key fingerprints for gerrit1001 (https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/gerrit1001.wikimedia.org) are out of date. Could one of you update them?
:) [10:32:40] RECOVERY - Check no envoy runtime configuration is left persistent on mw1357 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [10:45:41] (03PS3) 10Hnowlan: wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908) [10:48:15] (03PS1) 10Giuseppe Lavagetto: Add ability to depool/repool a server to sre.hosts.reboot-single [cookbooks] - 10https://gerrit.wikimedia.org/r/619723 [10:48:16] PROBLEM - LVS api eqiad port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1137 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:48:29] <_joe_> oh sigh [10:48:38] <_joe_> ok jayme we need to start my hand [10:48:41] * volans here [10:48:46] looks like [10:48:58] <_joe_> mw1356 I guess [10:49:13] <_joe_> and mw1378 [10:49:19] <_joe_> I'll take the latter [10:49:24] <_joe_> !log rebooting mw1378 [10:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:28] shouldn't they just be depooled? [10:49:41] (re: alert) [10:49:42] <_joe_> no. [10:49:59] <_joe_> failures are intermittent and random [10:50:03] I see [10:50:13] RECOVERY - LVS api eqiad port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24113 bytes in 0.642 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:50:15] like that last time if I remember [10:50:17] (03CR) 10Volans: [C: 03+1] "LGTM, just a nit on mixed single/double quotes used." [cookbooks] - 10https://gerrit.wikimedia.org/r/619723 (owner: 10Giuseppe Lavagetto) [10:50:42] <_joe_> volans: can you fix those while I turn off the fire here? [10:50:50] sure [10:51:00] <_joe_> jayme: can you reboot mw1356? 
[10:51:01] !log rebooting mw1356 [10:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:06] <_joe_> ack :) [10:51:40] (03PS2) 10Volans: Add ability to depool/repool a server to sre.hosts.reboot-single [cookbooks] - 10https://gerrit.wikimedia.org/r/619723 (owner: 10Giuseppe Lavagetto) [10:51:44] _joe_: is there a "safe" time to wait after depool? [10:52:28] _joe_: want me to merge and deploy so you can test it on one host [10:52:29] ? [10:52:49] <_joe_> volans: yes [10:52:51] <_joe_> thanks [10:53:01] (03CR) 10Volans: [C: 03+2] Add ability to depool/repool a server to sre.hosts.reboot-single [cookbooks] - 10https://gerrit.wikimedia.org/r/619723 (owner: 10Giuseppe Lavagetto) [10:53:03] <_joe_> jayme: 30 seconds is ok [10:54:13] (03Merged) 10jenkins-bot: Add ability to depool/repool a server to sre.hosts.reboot-single [cookbooks] - 10https://gerrit.wikimedia.org/r/619723 (owner: 10Giuseppe Lavagetto) [10:55:14] <_joe_> !log rebooting mw1361 [10:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:32] PROBLEM - Host mw1361 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:32] (03PS1) 10Volans: sre.hosts.reboot-single: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/619726 [10:56:36] almost there, found a typo [10:56:44] give me 3 min [10:56:56] (03CR) 10Volans: [C: 03+2] "Just fixing the typo" [cookbooks] - 10https://gerrit.wikimedia.org/r/619726 (owner: 10Volans) [10:57:58] RECOVERY - Host mw1361 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [10:58:42] (03Merged) 10jenkins-bot: sre.hosts.reboot-single: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/619726 (owner: 10Volans) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1100).
[11:00:16] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [11:00:17] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [11:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:30] Lucas and I are in a meeting. Sorry :( [11:00:38] it looks like there’s nothing to deploy anyways [11:01:15] +1, if anyone comes up with a last-minute patch, feel free to ping me for deployment. [11:02:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:04:05] (03PS1) 10Giuseppe Lavagetto: Fix api call: RemoteHosts uses run_async, not run [cookbooks] - 10https://gerrit.wikimedia.org/r/619727 [11:04:56] <_joe_> awight: no actually let's cancel completely, we're having some production trouble [11:05:14] <_joe_> volans: ^^ [11:05:52] (03CR) 10Volans: [C: 03+1] "LGTM, sorry for missing that earlier" [cookbooks] - 10https://gerrit.wikimedia.org/r/619727 (owner: 10Giuseppe Lavagetto) [11:06:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix api call: RemoteHosts uses run_async, not run [cookbooks] - 10https://gerrit.wikimedia.org/r/619727 (owner: 10Giuseppe Lavagetto) [11:06:10] that reminds me that I should add mypy to the cookbooks too, if not only for checking those things [11:06:13] <_joe_> volans: outages are not the best time to write software :) [11:06:13] (03CR) 10JMeybohm: [C: 03+1] Fix api call: RemoteHosts uses run_async, not run [cookbooks] - 10https://gerrit.wikimedia.org/r/619727 (owner: 10Giuseppe Lavagetto) [11:06:41] but without adding work for who writes cookbook if possible [11:06:56] (03Merged) 10jenkins-bot: Fix api call: RemoteHosts 
uses run_async, not run [cookbooks] - 10https://gerrit.wikimedia.org/r/619727 (owner: 10Giuseppe Lavagetto) [11:07:32] * volans running puppet on the cumin hosts [11:07:44] <_joe_> oh it needs *puppet*? [11:07:55] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [11:07:55] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [11:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:08] it's magic :D [11:08:12] {done} [11:08:14] go ahead [11:08:15] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [11:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:30] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [11:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:44] <_joe_> volans: these damn logs should include the args [11:11:51] <_joe_> :P [11:11:54] eheh [11:12:51] I failed to find another good (bad?) candidate host to keep for investigation. So picking one randomly [11:13:15] <_joe_> jayme: mw1357 is enough [11:13:19] (03PS1) 10Jcrespo: mariadb-backups: Remove replication lag avg check from backup sources [puppet] - 10https://gerrit.wikimedia.org/r/619729 (https://phabricator.wikimedia.org/T253120) [11:13:20] _joe_: indeed, part of my Q OKRs, hopefully will get to it [11:13:34] _joe_: ah, okay. Understood you wanted a second one [11:13:48] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:31] <_joe_> so jayme be careful, if icinga is not recovered [11:14:39] <_joe_> it will return without pooling [11:14:52] yeah, I've looked at the code. [11:15:06] was it recovered for this first test?
[11:15:37] the existing code is not waiting explicitly for it [11:15:43] _joe_: should we keep at around 4-5 hosts depooled at the same time? [11:15:44] just check it after the first puppet run has completed [11:15:54] <_joe_> jayme: at most [11:16:12] (03PS2) 10Jcrespo: mariadb-backups: Remove replication lag avg check from backup sources [puppet] - 10https://gerrit.wikimedia.org/r/619729 (https://phabricator.wikimedia.org/T253120) [11:16:13] <_joe_> volans: it waits for puppet [11:16:33] meaning slow enough? :D [11:17:17] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:45] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [11:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:04] (03CR) 10Jcrespo: "How does this interact with profile::mariadb::mysql_role ?" [puppet] - 10https://gerrit.wikimedia.org/r/619729 (https://phabricator.wikimedia.org/T253120) (owner: 10Jcrespo) [11:18:10] should we change the cookbook to fail if the host has not recovered?
[11:19:04] <_joe_> jayme: meh, dunno [11:20:02] (03CR) 10Marostegui: [C: 03+1] mariadb-backups: Remove replication lag avg check from backup sources [puppet] - 10https://gerrit.wikimedia.org/r/619729 (https://phabricator.wikimedia.org/T253120) (owner: 10Jcrespo) [11:20:48] (03PS3) 10Jcrespo: mariadb-backups: Remove replication lag avg check from backup sources [puppet] - 10https://gerrit.wikimedia.org/r/619729 (https://phabricator.wikimedia.org/T253120) [11:21:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:21:55] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:46] This week's train in a nutshell: [11:22:47] https://media.giphy.com/media/xT9Igk6pl01yVK0FHO/giphy.gif [11:24:14] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [11:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:15] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [11:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:45] (03CR) 10Jcrespo: [C: 04-1] "-1 because I am not sure if I will break "mysql_role" (should that be dependent on the prometheus check)?" 
[puppet] - 10https://gerrit.wikimedia.org/r/619729 (https://phabricator.wikimedia.org/T253120) (owner: 10Jcrespo) [11:27:14] (03PS1) 10Addshore: golang: new build of 1.13 to update CA file from base [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619731 [11:28:41] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:30:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:40] (03PS2) 10Addshore: golang: new build of 1.13 to update CA file from base [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619731 [11:34:42] (03CR) 10Addshore: [C: 04-1] "Looks like this might actually be an issue with the buster base image? 
and that might need fixing first" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619731 (owner: 10Addshore) [11:37:06] !log ema@cumin1001 START - Cookbook sre.hosts.reboot-single [11:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:16] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:58] !log creating artificial low replication lag on db2130 to test icinga alerts T253120 [11:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:01] T253120: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 [11:51:36] !log pool mw1363 after reboot [11:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1200) [12:04:09] (03CR) 10Kormat: [C: 03+1] "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/619729 (https://phabricator.wikimedia.org/T253120) (owner: 10Jcrespo) [12:06:09] (03CR) 10Ayounsi: [C: 03+2] Netbox: add circuits support [software/homer] - 10https://gerrit.wikimedia.org/r/617418 (owner: 10Ayounsi) [12:06:14] (03CR) 10Jcrespo: [C: 03+2] "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/619729 (https://phabricator.wikimedia.org/T253120) (owner: 10Jcrespo) [12:07:23] (03Merged) 10jenkins-bot: Netbox: add circuits support [software/homer] - 10https://gerrit.wikimedia.org/r/617418 (owner: 10Ayounsi) [12:09:31] (03PS1) 10Giuseppe Lavagetto: reboot-single: wait for icinga to turn green, then error out [cookbooks] - 10https://gerrit.wikimedia.org/r/619734 [12:10:24] (03CR) 10jerkins-bot: [V: 04-1] reboot-single: wait for icinga to turn green, then error out [cookbooks] - 10https://gerrit.wikimedia.org/r/619734 (owner: 
10Giuseppe Lavagetto) [12:10:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:13:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:14:03] (03PS4) 10Michael Große: Create dispatch lag alerts for test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) [12:14:03] <_joe_> sigh, lemme start the whack-a-mole again [12:14:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:14:36] (03CR) 10jerkins-bot: [V: 04-1] Create dispatch lag alerts for test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [12:15:16] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:15:16] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [12:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:22] (03PS5) 10Michael Große: Create dispatch lag alerts for test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) [12:15:27] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:15:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:07] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:30] (03PS2) 10Giuseppe Lavagetto: reboot-single: wait for icinga to turn green, then error out [cookbooks] - 10https://gerrit.wikimedia.org/r/619734 [12:17:37] (03CR) 10JMeybohm: [C: 03+1] "Looks good to me besides tox" [cookbooks] - 10https://gerrit.wikimedia.org/r/619734 (owner: 10Giuseppe Lavagetto) [12:19:15] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10ayounsi) 05Resolved→03Open Thanks! Please update the cable in Netbox with the cable ID (label). https://netbox.wikimedia.org/dcim/cables/1799/ [12:20:01] (03CR) 10JMeybohm: [C: 03+1] reboot-single: wait for icinga to turn green, then error out [cookbooks] - 10https://gerrit.wikimedia.org/r/619734 (owner: 10Giuseppe Lavagetto) [12:20:18] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:28] (03CR) 10Michael Große: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [12:20:42] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:34] (03CR) 10Michael Große: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [12:23:18] (03PS1) 10Filippo Giunchedi: alertmanager: allow access from all Prometheis [puppet] - 10https://gerrit.wikimedia.org/r/619737 (https://phabricator.wikimedia.org/T258948) [12:23:20] (03PS1) 10Filippo Giunchedi: prometheus: add alertmanager jobs [puppet] - 
10https://gerrit.wikimedia.org/r/619738 (https://phabricator.wikimedia.org/T258948) [12:23:22] (03PS1) 10Filippo Giunchedi: prometheus: add alertmanagers configuration [puppet] - 10https://gerrit.wikimedia.org/r/619739 (https://phabricator.wikimedia.org/T258948) [12:24:40] <_joe_> ok jayme I'll merge, volans can faint later [12:24:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] reboot-single: wait for icinga to turn green, then error out [cookbooks] - 10https://gerrit.wikimedia.org/r/619734 (owner: 10Giuseppe Lavagetto) [12:24:50] :-) [12:26:27] (03Merged) 10jenkins-bot: reboot-single: wait for icinga to turn green, then error out [cookbooks] - 10https://gerrit.wikimedia.org/r/619734 (owner: 10Giuseppe Lavagetto) [12:27:18] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:50] <_joe_> jayme: we'll soon know how much I screwed up [12:31:34] _joe_: we're sharing the burden then :-) Or do you mean your experiment on mw1357?
[12:32:01] <_joe_> no with reboots [12:33:48] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [12:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:01] <_joe_> jayme: turns out I was a bit optimistic with timing :P [12:35:53] uh...I thought it was quite a bit of time [12:36:18] <_joe_> it's 60 seconds [12:36:34] <_joe_> which is probably half the correct time :P [12:36:37] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/24450/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619737 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [12:37:18] yeah sure it's 60, but thats even after the puppet run [12:38:41] <_joe_> sigh the md raid check still returns unknown apparently [12:38:45] <_joe_> which I don't believe [12:39:13] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24450/" [puppet] - 10https://gerrit.wikimedia.org/r/619739 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [12:39:23] <_joe_> indeed, it's just lag [12:39:33] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24450/" [puppet] - 10https://gerrit.wikimedia.org/r/619738 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [12:40:16] <_joe_> jayme: I'm not sure I can improve this further [12:40:49] <_joe_> I mean we can go with an exp backoff [12:41:49] (03PS1) 10Giuseppe Lavagetto: sre.hosts.reboot-single: wait a bit more for icinga [cookbooks] - 10https://gerrit.wikimedia.org/r/619741 [12:42:24] _joe_: Hmm...yeah. 
Might be good...I mean I've waited more than 5min for one host just for icinga to check again :-/ [12:42:53] <_joe_> ok so this goes from waiting 3s up to waiting 1 minute between checks [12:42:56] <_joe_> it should be enough [12:43:31] <_joe_> merging [12:43:35] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre.hosts.reboot-single: wait a bit more for icinga [cookbooks] - 10https://gerrit.wikimedia.org/r/619741 (owner: 10Giuseppe Lavagetto) [12:43:44] Maybe also add a --no-wait switch to get the old behaviour (and fail-fast) mode back [12:44:48] (03Merged) 10jenkins-bot: sre.hosts.reboot-single: wait a bit more for icinga [cookbooks] - 10https://gerrit.wikimedia.org/r/619741 (owner: 10Giuseppe Lavagetto) [12:47:01] <_joe_> jayme: later! [12:50:11] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:00] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:41] (03PS8) 10Ema: cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) [12:53:59] created an incident status doc at https://docs.google.com/document/d/1x1sXbklz98fBTKMBsv4fnbeAedG0RMjGEFJ8mrvD454/edit# - will fill in a bit! [12:54:24] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [12:58:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:59:58] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] hashar and twentyafterfour: How many deployers does it take to do Mediawiki train - European+American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1300). [13:00:20] <_joe_> jayme: it worked! [13:00:39] o/ [13:00:41] <_joe_> also thanks for the Incident doc, although this is more matter of restarting a lot of servers [13:01:01] <_joe_> hashar: can you please hold while we assess the situation? [13:01:10] <_joe_> I'm not sure we can proceed with the train right now [13:01:35] (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619747 [13:01:37] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619747 (owner: 10Hashar) [13:01:56] <_joe_> hashar: ahem :P [13:02:22] _joe_: what? ;) [13:02:22] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619747 (owner: 10Hashar) [13:02:28] did I miss an ongoing outage? [13:02:31] <_joe_> @_joe_> hashar: can you please hold while we assess the situation? [13:02:36] <_joe_> yes [13:03:08] that is for the couple servers that had some nasty out of memory issue earlier today isn't it ?
[13:04:27] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [13:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:45] <_joe_> hashar: turns out it wasn't just a couple [13:05:14] <_joe_> hashar: I think I'll need you to revert, else we risk the alert for wikiversions to fire off [13:05:24] <_joe_> hnowlan: ping :) [13:06:24] yeah doing so [13:06:35] <_joe_> thanks, and sorry [13:06:45] <_joe_> I hope we'll be ok-ish in ~ 20 minutes [13:07:24] note that we are on a tight train schedule with group1 going this afternoon and the rest of the wikis later tonight [13:07:38] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [13:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:46] cause there is no deploy tomorrow (due to friday being off for US) [13:08:09] (03PS1) 10Hashar: Revert "group1 wikis to 1.36.0-wmf.4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619613 (https://phabricator.wikimedia.org/T257972) [13:08:13] (03CR) 10Hashar: [C: 03+2] Revert "group1 wikis to 1.36.0-wmf.4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619613 (https://phabricator.wikimedia.org/T257972) (owner: 10Hashar) [13:08:41] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.36.0-wmf.4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619613 (https://phabricator.wikimedia.org/T257972) (owner: 10Hashar) [13:09:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:09:57] <_joe_> hashar: well I can't tell the servers to get better magically :) [13:10:24] if only ! [13:10:44] have you tried: sudo get better? [13:10:47] * volans hides [13:11:22] _joe_: yo! 
might you be looking for https://gerrit.wikimedia.org/r/c/operations/puppet/+/619482 :D [13:11:44] <_joe_> hnowlan: ehehe [13:11:54] volans: --force may be needed this time [13:11:59] <_joe_> not only that, we have an issue on the appservers, but I'll get into details elsewhere [13:12:42] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [13:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:54] _joe_: right (regarding the incident doc). But we're actually doing something about an issue (whatever that might be) so it's better to have one I guess [13:13:13] <_joe_> ack [13:14:12] jouncebot: next [13:14:13] In 4 hour(s) and 45 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1800) [13:14:13] In 4 hour(s) and 45 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1800) [13:15:06] !log ✔️ cdanis@mw1357.eqiad.wmnet ~ 🕘☕ sudo sysctl -w vm/compact_memory=1 [13:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:19:47] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [13:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:07] <_joe_> sigh [13:20:12] <_joe_> damn icinga lag [13:20:50] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:02] Check Latency: 0.01 sec 62.32 sec 41.716 sec (min/max/avg) [13:23:59] (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619749 [13:24:23] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619749 (owner: 10Hashar) [13:24:28] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619749 (owner: 10Hashar) [13:24:34] (03PS1) 10Kormat: switchover: Handle semisync for mariadb >= 10.3.0 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619750 (https://phabricator.wikimedia.org/T260127) [13:25:56] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.4 [13:25:58] (03PS1) 10Elukey: Update spark upstream links in README.debian [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/619751 [13:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:13] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.4 (duration: 01m 16s) [13:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:38] (03PS1) 10Filippo Giunchedi: templates: add alerts.w.o [dns] - 10https://gerrit.wikimedia.org/r/619752 (https://phabricator.wikimedia.org/T258948) [13:31:00] looks good [13:33:47] RECOVERY - Rate of 
JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [13:37:38] (03CR) 10Jcrespo: [C: 03+1] "Looks ok, I checked the get_version() function with common patterns even outside of the versions we use." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619750 (https://phabricator.wikimedia.org/T260127) (owner: 10Kormat) [13:37:55] (03CR) 10Ladsgroup: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [13:38:49] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=mc1028 site=eqiad tunnel=mc2028_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:41:10] (03CR) 10Marostegui: [C: 03+1] "I have tested the script on the testing environment for the normal switchover workflow and options and it worked fine." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619750 (https://phabricator.wikimedia.org/T260127) (owner: 10Kormat) [13:41:13] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [13:41:13] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:35] (03CR) 10Ottomata: [C: 03+1] Update spark upstream links in README.debian [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/619751 (owner: 10Elukey) [13:41:49] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [13:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:16] !log restart mw1383 & mw1386 [13:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:32] (03CR) 10Kormat: [C: 03+2] switchover: Handle semisync for mariadb >= 10.3.0 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619750 (https://phabricator.wikimedia.org/T260127) (owner: 10Kormat) [13:43:01] (03Merged) 10jenkins-bot: switchover: Handle semisync for mariadb >= 10.3.0 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619750 (https://phabricator.wikimedia.org/T260127) (owner: 10Kormat) [13:43:06] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [13:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:33] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [13:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:51] (03CR) 10Hnowlan: [C: 03+2] wmnet: add api-gateway records [dns] - 10https://gerrit.wikimedia.org/r/619499 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [13:48:05] 10Operations, 10Cloud-VPS, 10observability: UNIX group 'bird' missing on bird package installation - https://phabricator.wikimedia.org/T260240 (10fgiunchedi) [13:48:48] !log otto@deploy1001 
helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [13:48:48] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [13:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:57] PROBLEM - Thanos query has high latency for instant queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [13:50:17] 10Operations, 10Cloud-VPS, 10observability: UNIX group 'bird' missing on bird package installation - https://phabricator.wikimedia.org/T260240 (10Vgutierrez) p:05Triage→03Medium [13:50:54] si, eso seguro (Spanish: "yes, that's for sure") [13:51:20] 10Operations, 10observability: Grafana/Thanos serves 503s for long-time-window requests - https://phabricator.wikimedia.org/T260241 (10CDanis) [13:51:55] PROBLEM - Thanos query has many failed HTTP range queries requests on icinga1001 is CRITICAL: 7.306 ge 5 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [13:52:50] 10Operations, 10observability: Grafana/Thanos serves 503s for long-time-window requests - https://phabricator.wikimedia.org/T260241 (10Marostegui) I have also seen that myself quite often. Getting data older than 30 days is very useful, especially for capacity planning. 
[13:53:47] RECOVERY - Thanos query has many failed HTTP range queries requests on icinga1001 is OK: (C)5 ge (W)3 ge 1.136 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [13:53:47] (03PS1) 10Kormat: Release 0.3 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619756 [13:54:11] (03PS2) 10Kormat: Release 0.3 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619756 [13:54:15] PROBLEM - Thanos query has high latency for range queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [13:54:53] kormat: marostegui talks with bots in Spanish (earlier on in this chan), bad sign [13:55:12] (03CR) 10Kormat: [C: 03+2] Release 0.3 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619756 (owner: 10Kormat) [13:55:22] hahaha elukey I didn't even realise I did! [13:55:36] elukey: indeeeed [13:55:41] (03Merged) 10jenkins-bot: Release 0.3 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619756 (owner: 10Kormat) [13:56:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:53] RECOVERY - Thanos query has high latency for instant queries on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [14:00:03] PROBLEM - Thanos query has high latency for range queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [14:01:08] ooof, that's high latency alright [14:01:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:02:13] !log uploaded wmfmariadbpy 0.3 to apt [14:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:03:55] RECOVERY - Thanos query has high latency for range queries on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [14:04:44] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Kormat) 05Open→03Resolved Fix is merged, and a fresh debian package has been released, and installed on both cumin hosts. 
[14:05:09] (03PS1) 10Kormat: Update version in setup.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619757 [14:05:36] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Marostegui) Thanks for addressing this so fast! [14:06:01] (03CR) 10Kormat: [C: 03+2] Update version in setup.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619757 (owner: 10Kormat) [14:06:28] (03Merged) 10jenkins-bot: Update version in setup.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619757 (owner: 10Kormat) [14:09:23] (03CR) 10Elukey: [C: 03+2] Update spark upstream links in README.debian [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/619751 (owner: 10Elukey) [14:17:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:17:23] 10Operations, 10Research, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10Vgutierrez) p:05Triage→03Low [14:17:52] (03CR) 10Hnowlan: [C: 03+2] Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [14:18:38] (03PS2) 10Kormat: Update remote execution libraries from transferpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) [14:19:21] (03PS3) 10Kormat: Update remote execution libraries from transferpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) [14:22:24] 10Operations, 
10observability: Grafana/Thanos serves 503s for long-time-window requests - https://phabricator.wikimedia.org/T260241 (10fgiunchedi) From a first look at this I believe it is a combination of factors: namely Prometheus at the moment struggling with long time range queries due to having compaction... [14:23:00] (03CR) 10Hashar: "Thank you Giuseppe ;)" (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [14:24:23] (03PS3) 10Hashar: python-build: reuse previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) [14:24:25] (03PS1) 10Hashar: .gitignore docker-pkg-build.log [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619759 [14:24:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:24:54] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:24:54] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[14:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:36] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) [14:31:23] !log temporarily kludging deneb.codfw.wmnet:/var/cache/pbuilder/hooks/stretch/D02backports, original in my homedir [14:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:49] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) Reading the parent task, I realized that we will have the switchover and also the switchback. When the switchback would happen... [14:32:02] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[14:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:34:21] 10Operations, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10fgiunchedi) [14:35:47] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 11.03 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [14:35:49] !log un-kludging deneb.codfw.wmnet:/var/cache/pbuilder/hooks/stretch/D02backports [14:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:37:13] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [14:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:43] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [14:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:15] (03CR) 10Ppchelko: [C: 04-1] "A little nit inlined" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [14:42:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics 
within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:44:38] !log temporarily re-kludging deneb.codfw.wmnet:/var/cache/pbuilder/hooks/stretch/D02backports, original in my homedir [14:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:12] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [14:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:50] mmhh I'm looking into the logging indexing failures [14:53:34] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [14:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:05] (03PS1) 10Ottomata: eventgate-* - bump to version 2020-08-12-144657-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619767 (https://phabricator.wikimedia.org/T251935) [14:54:15] !log again un-kludging deneb.codfw.wmnet:/var/cache/pbuilder/hooks/stretch/D02backports [14:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:26] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Ladsgroup) There are some incidents ongoing with appservers. I think w... 
[14:57:51] (03CR) 10Ottomata: [C: 03+2] eventgate-* - bump to version 2020-08-12-144657-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/619767 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [14:58:15] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) a:05RobH→03Cmjohnson >>! In T259923#6379226, @ayounsi wrote: > Thanks! Please update the cable in Netbox with the cable ID (label). > > https://netbox.wikimedia.org/dci... [14:59:34] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [14:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:50] ottomata: looks like the logstash indexing errors might be due to eventgate (or one of its clients), I'm looking at https://logstash-next.wikimedia.org/app/kibana#/discover/doc/2d891220-161a-11ea-a364-c747e6d6cfc2/logstash-syslog-2020.08.12?id=_Bgr43MBpw3s5BexsykZ [15:00:54] ottomata: does that ring a bell ? [15:02:35] (03PS1) 10Tarrow: Remove wgExtraLanguageNames from beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619768 (https://phabricator.wikimedia.org/T260118) [15:02:45] (03PS2) 10Hnowlan: api-gateway: serve public traffic over TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908) [15:03:12] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:16] hmm godog i have trace routing on a canary pod, maybe that's why [15:03:20] lemme turn it off, i figure out my problem [15:03:42] oh ok, sounds good to me thanks ottomata ! [15:03:45] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:03:46] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[15:03:46] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [15:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:47] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) No, it'll be roughly a month -- there's a variety of maintenance we'd like to do in eqiad while we're serving from codfw. Likel... [15:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:09] (03PS2) 10Tarrow: Remove wgExtraLanguageNames from beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619768 (https://phabricator.wikimedia.org/T260118) [15:06:16] (03CR) 10Tarrow: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619768 (https://phabricator.wikimedia.org/T260118) (owner: 10Tarrow) [15:06:17] ottomata: yeah I think that was it, waiting a little more to confirm [15:07:32] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) >>! In T244808#6379641, @RLazarus wrote: > Likely candidate dates are September 29 or October 6, but I've been figuring we wou... 
[15:08:14] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:59] !log ✔️ cdanis@mw1359.eqiad.wmnet ~ 🕚☕ sudo dpkg -i bpfcc-tools_0.12.0-2_all.deb libbpfcc_0.12.0-2_amd64.deb python3-bpfcc_0.12.0-2_all.deb [15:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:12] !log ✔️ cdanis@mw1359.eqiad.wmnet ~ 🕚☕ sudo apt install python3-netaddr ieee-data [15:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:18] ottomata: yeah looking good, thanks for the quick action! not 100% sure though why a trace would generate unparseable logs/json [15:11:34] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.2625 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:12:28] !log ✔️ cdanis@mw1359.eqiad.wmnet ~ 🕚☕ sudo apt install linux-headers-4.9.0-12-amd64 [15:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:57] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:37] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:04] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [15:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:13] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [15:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] <_joe_> jouncebot: next [15:16:30] In 2 hour(s) and 43 minute(s): Train log triage with CPT 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1800) [15:16:31] In 2 hour(s) and 43 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1800) [15:19:03] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:36] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:01] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:18] (03PS1) 10Cwhite: hiera: update mtail run as group [puppet] - 10https://gerrit.wikimedia.org/r/619773 (https://phabricator.wikimedia.org/T224586) [15:22:01] (03Abandoned) 10Cwhite: hiera: update mtail run as group [puppet] - 10https://gerrit.wikimedia.org/r/619773 (https://phabricator.wikimedia.org/T224586) (owner: 10Cwhite) [15:22:23] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:56] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:13] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:05] (03PS1) 10Cwhite: profile: add mtail user to list group [puppet] - 10https://gerrit.wikimedia.org/r/619777 (https://phabricator.wikimedia.org/T224586) [15:31:20] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:12] !log jayme@cumin1001 END (FAIL) - Cookbook 
sre.hosts.reboot-single (exit_code=1) [15:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:18] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:36] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:57] (03PS2) 10Cwhite: profile: add mtail user to list group [puppet] - 10https://gerrit.wikimedia.org/r/619777 (https://phabricator.wikimedia.org/T224586) [15:35:54] (03PS1) 10Giuseppe Lavagetto: sre.hosts.reboot-single: refine the icinga check/repool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/619778 [15:36:10] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:57] (03CR) 10Herron: [C: 03+1] "May race with the list group creation on the first run without a dependency on mailman, but not a major issue IMHO. 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/619777 (https://phabricator.wikimedia.org/T224586) (owner: 10Cwhite) [15:37:06] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:05] (03PS4) 10Hashar: python-build: reuse previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) [15:41:30] (03PS2) 10Giuseppe Lavagetto: sre.hosts.reboot-single: refine the icinga check/repool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/619778 [15:41:39] (03PS3) 10Cwhite: profile: add mtail user to list group [puppet] - 10https://gerrit.wikimedia.org/r/619777 (https://phabricator.wikimedia.org/T224586) [15:42:00] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:06] (03CR) 10Hashar: python-build: reuse previously built wheels (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [15:42:53] (03CR) 10JMeybohm: [C: 03+1] sre.hosts.reboot-single: refine the icinga check/repool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/619778 (owner: 10Giuseppe Lavagetto) [15:43:52] (03CR) 10Cwhite: "PCC checks out https://puppet-compiler.wmflabs.org/compiler1002/24453/" [puppet] - 10https://gerrit.wikimedia.org/r/619777 (https://phabricator.wikimedia.org/T224586) (owner: 10Cwhite) [15:44:36] (03CR) 10Hnowlan: api-gateway: serve public traffic over TLS (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [15:44:55] (03CR) 10Herron: [C: 03+1] profile: add mtail user to list group [puppet] - 10https://gerrit.wikimedia.org/r/619777 (https://phabricator.wikimedia.org/T224586) 
(owner: 10Cwhite) [15:45:17] (03CR) 10Cwhite: [C: 03+2] profile: add mtail user to list group [puppet] - 10https://gerrit.wikimedia.org/r/619777 (https://phabricator.wikimedia.org/T224586) (owner: 10Cwhite) [15:45:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre.hosts.reboot-single: refine the icinga check/repool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/619778 (owner: 10Giuseppe Lavagetto) [15:45:46] (03PS1) 10Hashar: python-build: do not archive previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619779 [15:46:14] (03CR) 10Hashar: python-build: reuse previously built wheels (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [15:46:16] (03Merged) 10jenkins-bot: sre.hosts.reboot-single: refine the icinga check/repool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/619778 (owner: 10Giuseppe Lavagetto) [15:47:13] (03CR) 10Cwhite: [C: 03+1] "Got it to build on deneb with USENETWORK=yes in ~/.pbuilderrc and the https_proxy. LGTM!" [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [15:48:02] o/ I'd like to merge this (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/619768) config change for beta. Am I supposed to wait until a "Backports and Config window" since I guess it still needs to be scap'ed out to not leave things in a mess? 
[15:48:09] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:37] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:34] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [15:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:16] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:54] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:53:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:55:11] <_joe_> uh? XioNoX ^^ [15:55:12] XioNoX: ^^^ FYI, I don't see maintenance in the calendar [15:55:57] someone wants to email Zayo? 
[15:56:50] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:38] (03CR) 10Krinkle: [C: 03+2] Remove bogus $wgWMEPhp7SamplingRate setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609494 (https://phabricator.wikimedia.org/T219127) (owner: 10Krinkle) [15:58:40] (03PS1) 10Ryan Kemper: WIP: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [15:58:45] (done) [15:59:18] tarrow: I can roll it out now for you [15:59:24] (03CR) 10Krinkle: [C: 03+2] Remove wgExtraLanguageNames from beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619768 (https://phabricator.wikimedia.org/T260118) (owner: 10Tarrow) [15:59:27] (03Merged) 10jenkins-bot: Remove bogus $wgWMEPhp7SamplingRate setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609494 (https://phabricator.wikimedia.org/T219127) (owner: 10Krinkle) [15:59:33] Thanks! [15:59:34] XioNoX: sorry missed that message [15:59:35] thx [15:59:46] thx for the ping [15:59:46] <_joe_> Krinkle: please check with us, we're doing rolling reboots [16:00:08] <_joe_> jayme, effie any reboot in progress? [16:00:09] (03Merged) 10jenkins-bot: Remove wgExtraLanguageNames from beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619768 (https://phabricator.wikimedia.org/T260118) (owner: 10Tarrow) [16:00:13] (03CR) 10jerkins-bot: [V: 04-1] WIP: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [16:00:27] _joe_: ack [16:00:44] _joe_: no, everything currently in waiting for icinga state for me [16:01:01] <_joe_> Krinkle: go on then [16:01:08] tarrow: if the file is entirely beta-only we usually just pull it down to deploy host to avoid unexpected diffs [16:01:11] but no need to sync them. 
[16:01:20] if it's a no-op but affects prod files, then a sync as well [16:02:08] tarrow: I've pulled it down to deploy1001, so consider it done [16:04:43] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I3726a6364d, T257079 (duration: 01m 02s) [16:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:47] T257079: Audit all mismatched/unused wmf-config settings - https://phabricator.wikimedia.org/T257079 [16:04:50] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [16:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:08] <_joe_> jayme: I upgraded the script on cumin1001 [16:05:25] effie: ^^ [16:07:00] Krinkle: Thanks awfully :) [16:07:34] 10Operations, 10SRE-tools, 10Patch-For-Review: spicerack/cookbook: add additional arguments IRC/SAL logging - https://phabricator.wikimedia.org/T221212 (10Krinkle) Ping :) These messages are currently quite noisy in my opinion with little to no useful signal when investigating past events as it doesn't say... [16:07:46] Krinkle: should I just ping people in here that I'm going to pull to deploy1001 or wait for a window normally? [16:08:02] (assuming it touches no prod files) [16:08:18] tarrow: if there's not currently a window, and it's a simple no-prod-file commit, I'd say it's okay to just mention here that you're doing it and then merge/pull down [16:08:30] great! Thanks :) [16:09:20] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10DStrine) Thanks everyone for all the help! This will make a huge, posi... 
[16:18:11] (03CR) 10Ppchelko: [C: 04-1] "A few more nits and one issue inlined" (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:23:49] <_joe_> jouncebot: next [16:23:50] In 1 hour(s) and 36 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1800) [16:23:50] In 1 hour(s) and 36 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1800) [16:24:32] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [16:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:04] (03PS3) 10Hnowlan: api-gateway: serve public traffic over TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908) [16:27:47] (03CR) 10Ppchelko: [C: 03+1] "Makes sense" [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:27:53] (03CR) 10Bstorm: [C: 03+1] "+1 for grid_configurator.py" [puppet] - 10https://gerrit.wikimedia.org/r/619572 (owner: 10Cwhite) [16:29:47] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [16:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:32] (03PS1) 10Giuseppe Lavagetto: Brown paper bag fix [cookbooks] - 10https://gerrit.wikimedia.org/r/619787 [16:30:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Brown paper bag fix [cookbooks] - 10https://gerrit.wikimedia.org/r/619787 (owner: 10Giuseppe Lavagetto) [16:31:09] !log reboot mw1277 mw1278 && mw1261 mw1262 [16:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:17] <_joe_> effie: wait please [16:31:30] <_joe_> the cookbook has an error, see above [16:31:49] (03Merged) 10jenkins-bot: Brown paper bag fix [cookbooks] - 
10https://gerrit.wikimedia.org/r/619787 (owner: 10Giuseppe Lavagetto) [16:32:05] (03CR) 10Hnowlan: [C: 03+2] api-gateway: serve public traffic over TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:32:28] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [16:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:09] (03Merged) 10jenkins-bot: api-gateway: serve public traffic over TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/619490 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:35:14] (03PS1) 10Elukey: Add yarn.nodemanager.vmem-pmem-ratio setting to Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/619788 (https://phabricator.wikimedia.org/T244499) [16:35:26] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [16:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:43] oh _joe_ [16:35:50] I started one [16:35:54] damn [16:35:56] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:58] <_joe_> should be fixed now [16:36:04] (03CR) 10Elukey: [C: 03+2] Add yarn.nodemanager.vmem-pmem-ratio setting to Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/619788 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [16:36:05] we will know [16:36:13] <_joe_> nooo [16:36:19] <_joe_> I said it should be fixed [16:36:27] <_joe_> :) [16:36:54] <_joe_> let's see if it works this time :D [16:38:16] ok should I start the second or not? 
[16:38:29] worst case scenario I re-reboot mw1277 [16:38:36] <_joe_> just go [16:38:38] <_joe_> it now works [16:38:48] <_joe_> sadly it seems icinga doesn't care for our requests to recheck [16:38:59] <_joe_> so you will still possibly have to repool by hand [16:41:15] yeah that is fine really [16:41:22] we have seen worse [16:41:40] (03PS1) 10Elukey: Fix yarn.nodemanager.vmem-pmem-ratio for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/619789 [16:42:10] (03CR) 10Elukey: [C: 03+2] Fix yarn.nodemanager.vmem-pmem-ratio for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/619789 (owner: 10Elukey) [16:44:37] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [16:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:14] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [16:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:30] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [16:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:35] (03PS1) 10Hnowlan: api-gateway: healthcheck using HTTPS [deployment-charts] - 10https://gerrit.wikimedia.org/r/619790 (https://phabricator.wikimedia.org/T254908) [16:45:57] 10Operations, 10SRE-tools, 10Patch-For-Review: spicerack/cookbook: add additional arguments IRC/SAL logging - https://phabricator.wikimedia.org/T221212 (10Volans) @Krinkle I hear you, and I totally agree. This is in some way part of my OKRs for this Q (add the class API to cookbooks so that we can surface th... 
[16:48:25] (03CR) 10Ppchelko: [C: 03+2] api-gateway: healthcheck using HTTPS [deployment-charts] - 10https://gerrit.wikimedia.org/r/619790 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:48:36] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [16:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:25] (03Merged) 10jenkins-bot: api-gateway: healthcheck using HTTPS [deployment-charts] - 10https://gerrit.wikimedia.org/r/619790 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:49:41] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [16:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:32] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:00] Hey all - would like to deploy a sec mitigation in /private. Let me know if you're doing anything. [16:52:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [16:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:57] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:53] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[16:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:39] !log sbassett@deploy1001 Synchronized private/PrivateSettings.php: Additional mitigations for T257687 (duration: 01m 03s) [16:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:22] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:00:50] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [17:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:16] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [17:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:23] (03PS1) 10Hnowlan: api-gateway: create discovery records [dns] - 10https://gerrit.wikimedia.org/r/619798 (https://phabricator.wikimedia.org/T254908) [17:13:13] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [17:13:13] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [17:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:03] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [17:15:03] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[17:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:03] (03PS1) 10Hnowlan: api-gateway: change service state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/619800 (https://phabricator.wikimedia.org/T254908) [17:16:11] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [17:16:11] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [17:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:20] !log for posterity: mw1359 has a bunch of special packages installed (previously recorded in SAL) and also has `sudo memleak-bpfcc -o 60000 -z 31 -Z 33 30` running in a tmux in an attempt to understand what's causing the page fragmentation in the appserver fleet [17:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:18] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [17:17:18] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [17:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:12] nshahquinn: do you mean the fingerprints for gerrit.wikimedia.org:29418? 
they are at https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/gerrit.wikimedia.org:29418 [17:19:30] !log reboot mw1263 mw1264 mw1279 and mw1281 [17:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:02] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [17:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:37] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [17:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:19] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [17:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:29] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [17:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [17:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10RobH) [17:29:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10RobH) [17:30:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [17:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:52] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10RobH) [17:31:29] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10RobH) [17:31:32] (03PS1) 10Ladsgroup: Set caching of CachingEntityRevisionLookup to CACHE_NONE in repo 
[extensions/Wikibase] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619615 (https://phabricator.wikimedia.org/T255305) [17:31:36] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10RobH) [17:32:06] (03CR) 10Ladsgroup: [C: 03+2] Set caching of CachingEntityRevisionLookup to CACHE_NONE in repo [extensions/Wikibase] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619615 (https://phabricator.wikimedia.org/T255305) (owner: 10Ladsgroup) [17:35:09] I'm deploying some fixes of memcached [17:35:16] slowly, going to take hours [17:36:50] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [17:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:35] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [17:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:08] hashar: hey, I'm deploying a fix for memcached, is it okay if I block the deploy for a bit? everything seems fine so far [17:42:17] I want to use --canary-wait-time [17:43:28] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) I plan to reuse the message we used in 2018. This would allow translators to benefit from already existing translations, or co... [17:45:27] (03PS1) 10Ppchelko: Configure ratelimiter to support authenticated/anon limits for api [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) [17:45:35] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[17:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:39] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [17:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:55] (03PS5) 10Dzahn: contint: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar) [17:47:38] (03CR) 10Ppchelko: "There's a TODO inlined in the config.yaml that I want your opinion on before merging this masterpiece." [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) (owner: 10Ppchelko) [17:48:52] (03CR) 10Dzahn: [C: 03+2] contint: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar) [17:49:57] !log reboot mw1265 mw1282 mw1283 [17:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:30] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [17:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:19] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [17:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:33] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [17:51:34] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [17:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:07] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [17:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:53] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[17:52:53] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [17:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:05] mutante: yeah, I realized that later :) It was just a bit confusing because the docs said Gerrit runs on gerrit1001, so I was looking at those fingerprints. [17:54:05] One weird thing: when I looked at the actual Gerrit fingerprints (https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/gerrit.wikimedia.org:29418), the one that matched was one of the ones labeled as "_Before_ July 14th 2020" 🤷‍♂️ [17:54:44] (03Merged) 10jenkins-bot: Set caching of CachingEntityRevisionLookup to CACHE_NONE in repo [extensions/Wikibase] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619615 (https://phabricator.wikimedia.org/T255305) (owner: 10Ladsgroup) [17:56:20] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [17:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [17:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:02] nshahquinn: ack, 2 different SSH daemons on the same machine, one on port 22 for the server and one on 29418 which is gerrit itself. [17:58:44] nshahquinn: re: the second part.. yea.. uh.. sorry about that. that is indeed confusing for users. 
the reason is we announced replacing the host keys on that day but then had to postpone it [17:59:04] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [17:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:17] nshahquinn: it will be back to the new ones some time next week [17:59:21] mutante: makes sense, thanks for the explanations! :) [18:00:04] hashar and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1800). [18:00:04] RoanKattouw: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:13] I'll do the deployment [18:00:44] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [18:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:32] (03PS5) 10Catrope: Enable and configure GrowthExperiments on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020) [18:01:48] (03CR) 10Catrope: [C: 03+2] Enable and configure GrowthExperiments on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020) (owner: 10Catrope) [18:02:31] (03Merged) 10jenkins-bot: Enable and configure GrowthExperiments on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616959 (https://phabricator.wikimedia.org/T255020) (owner: 10Catrope) [18:02:40] (03CR) 10Dzahn: [C: 03+2] doc: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607525 (owner: 10Hashar) [18:02:44] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24457/doc1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607525 (owner: 10Hashar) [18:02:53] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [18:02:53] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[18:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:09] looks good on mwdebug1002, syncing [18:04:13] Amir1: Let me know when you're done deploying [18:04:18] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.4/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: [[phab:T255305|Set caching of CachingEntityRevisionLookup to CACHE_NONE in repo]] (duration: 01m 06s) [18:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:21] T255305: [Investigation] - Consider if the Wikibase cache for CachingEntityRevisionLookup is needed any more - https://phabricator.wikimedia.org/T255305 [18:04:36] RoanKattouw: I'm done for now, I monitor logs for a bit before moving to client wikis [18:04:49] OK I'll deploy my patch in the meantime then [18:06:45] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [18:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:07:43] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [18:07:43] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[18:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:06] 10Operations, 10MediaWiki-Parser, 10serviceops, 10Platform Engineering (Icebox): purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 (10Aklapper) @Amooney: Assuming that "Set projects" was accidentally used instead of "Add projects", hence restoring some... [18:08:16] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [18:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:34] the logstash alert seems weird, RED and logstash itself seems fine [18:10:45] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments on hewiki (T255020) (duration: 01m 03s) [18:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:48] T255020: Deploy Growth features on Hebrew Wikipedia - https://phabricator.wikimedia.org/T255020 [18:11:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:12:06] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) I bet we can do 14:00 UTC. I'm finishing up the timeline with my SRE colleagues this week, I'll confirm and get back to you. Aft...
[18:12:35] (03CR) 10Clarakosi: Configure ratelimiter to support authenticated/anon limits for api (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) (owner: 10Ppchelko) [18:13:05] (03PS1) 10Ottomata: eventgate-main - Use MW EventStreamConfig API [deployment-charts] - 10https://gerrit.wikimedia.org/r/619813 (https://phabricator.wikimedia.org/T251935) [18:15:09] (03PS1) 10Ladsgroup: Set caching of CachingEntityRevisionLookup to CACHE_NONE in client [extensions/Wikibase] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619616 (https://phabricator.wikimedia.org/T255305) [18:15:20] (03CR) 10Ladsgroup: [C: 03+2] Set caching of CachingEntityRevisionLookup to CACHE_NONE in client [extensions/Wikibase] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619616 (https://phabricator.wikimedia.org/T255305) (owner: 10Ladsgroup) [18:15:41] (03PS1) 10Ladsgroup: Set caching of CachingEntityRevisionLookup to CACHE_NONE in client [extensions/Wikibase] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619617 (https://phabricator.wikimedia.org/T255305) [18:16:56] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [18:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:45] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [18:17:46] !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) [18:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:12] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [18:20:02] (03CR) 10Ladsgroup: "This change is ready for review." 
[extensions/Wikibase] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619617 (https://phabricator.wikimedia.org/T255305) (owner: 10Ladsgroup) [18:20:14] (03PS2) 10Ladsgroup: Set caching of CachingEntityRevisionLookup to CACHE_NONE in client [extensions/Wikibase] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619617 (https://phabricator.wikimedia.org/T255305) [18:20:36] (03CR) 10Ottomata: [C: 03+2] eventgate-main - Use MW EventStreamConfig API [deployment-charts] - 10https://gerrit.wikimedia.org/r/619813 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:22:18] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [18:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:36] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [18:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:40] (03CR) 10Ppchelko: Configure ratelimiter to support authenticated/anon limits for api (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) (owner: 10Ppchelko) [18:22:46] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [18:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:49] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [18:22:49] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[18:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:06] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single [18:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:05] (03CR) 10jerkins-bot: [V: 04-1] Set caching of CachingEntityRevisionLookup to CACHE_NONE in client [extensions/Wikibase] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619617 (https://phabricator.wikimedia.org/T255305) (owner: 10Ladsgroup) [18:25:47] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [18:25:47] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [18:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:56] !log reboot mw1268 [18:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [18:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:33] (03PS1) 10Ottomata: camus - replace mediawiki_events job with eventgate-main_events job [puppet] - 10https://gerrit.wikimedia.org/r/619816 (https://phabricator.wikimedia.org/T251935) [18:28:40] (03PS3) 10Ladsgroup: Set caching of CachingEntityRevisionLookup to CACHE_NONE in client [extensions/Wikibase] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619617 (https://phabricator.wikimedia.org/T255305) [18:29:45] (03CR) 10jerkins-bot: [V: 04-1] camus - replace mediawiki_events job with eventgate-main_events job [puppet] - 10https://gerrit.wikimedia.org/r/619816 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:30:55] (03PS2) 10Ottomata: camus - replace 
mediawiki_events job with eventgate-main_events job [puppet] - 10https://gerrit.wikimedia.org/r/619816 (https://phabricator.wikimedia.org/T251935) [18:33:40] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/24458/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619816 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:33:43] (03CR) 10Ottomata: [C: 03+2] camus - replace mediawiki_events job with eventgate-main_events job [puppet] - 10https://gerrit.wikimedia.org/r/619816 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:38:50] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [18:38:50] (03Merged) 10jenkins-bot: Set caching of CachingEntityRevisionLookup to CACHE_NONE in client [extensions/Wikibase] (wmf/1.36.0-wmf.4) - 10https://gerrit.wikimedia.org/r/619616 (https://phabricator.wikimedia.org/T255305) (owner: 10Ladsgroup) [18:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:01] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [18:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:20] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [18:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:47] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [18:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:26] (03PS1) 10Ottomata: camus - Exclude mediawiki.job streams from eventgate-main_events job [puppet] - 10https://gerrit.wikimedia.org/r/619820 (https://phabricator.wikimedia.org/T251935) [18:44:06] (03CR) 10Ottomata: [C: 03+2] camus - Exclude mediawiki.job streams from eventgate-main_events job [puppet] - 10https://gerrit.wikimedia.org/r/619820 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [18:45:08] (03CR) 10Bstorm: 
[C: 03+1] "I see no reason not to also do +1 for toolviews.py from my perspective. I didn't write that bit, but it looks fine, and I'll fix it if not" [puppet] - 10https://gerrit.wikimedia.org/r/619572 (owner: 10Cwhite) [18:45:28] !log reboot mw1269 [18:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:26] (03PS1) 10Dzahn: releases: allow rsyncing jenkins data between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/619822 (https://phabricator.wikimedia.org/T247652) [18:46:37] Testing mwdebug1002 [18:46:47] the number of queries to s8 will definitely increase [18:46:57] but the compression is not an issue [18:47:03] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [18:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:08] Amir1: reboots are still happening [18:47:12] keep it in mimnd [18:47:14] mind* [18:47:24] !log reboot mw1270 [18:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:31] noted, I'm deploying wmf.4 client one which basically enables it on commons, hewiki and cawiki that heavily use wikidata [18:49:04] what i am afraid of is scap failing [18:49:17] on a server currently being rebooted [18:49:52] aah, yeah I think it'll fail :( [18:50:03] I'll do it again [18:50:11] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.4/extensions/Wikibase/client/includes/Store/Sql/DirectSqlStore.php: [[phab:T255305|Set caching of CachingEntityRevisionLookup to CACHE_NONE in client]] (duration: 02m 13s) [18:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:15] T255305: [Investigation] - Consider if the Wikibase cache for CachingEntityRevisionLookup is needed any more - https://phabricator.wikimedia.org/T255305 [18:50:18] actually it didn't [18:50:25] luck [18:50:55] <3 [18:55:59] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [18:56:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:29] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [18:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:49] (03PS1) 10Dzahn: ATS: temp. set backend for releases-jenkins to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/619826 (https://phabricator.wikimedia.org/T247652) [18:58:12] (03PS1) 10Dzahn: Revert "Revert "switch releases.wikimedia.org to buster backends"" [dns] - 10https://gerrit.wikimedia.org/r/619618 [18:59:54] (03PS1) 10Ottomata: camus - Remove now unused mediawiki_events job [puppet] - 10https://gerrit.wikimedia.org/r/619830 (https://phabricator.wikimedia.org/T251935) [19:00:04] hashar and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T1900). [19:02:00] holding a bit [19:03:39] (03CR) 10Ottomata: [C: 03+2] camus - Remove now unused mediawiki_events job [puppet] - 10https://gerrit.wikimedia.org/r/619830 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [19:04:04] (03PS2) 10Ppchelko: Configure ratelimiter to support authenticated/anon limits for api [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) [19:04:53] (03CR) 10Ppchelko: "Done. Apparently Claras patch for default values hasn't made it to 1.15, so anon limits will not work until 1.16 upgrade." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) (owner: 10Ppchelko) [19:05:29] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [19:06:09] !log repool mw1395 mw1397 mw1399 [19:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:43] one thing I noticed so far is that since the cache is cold, the reads on es got doubled, hasn't recovered yet but hopefully it will soon. Monitoring [19:07:18] which of all the caches [19:07:29] we have sucj a collection [19:08:15] the memcached ones that hold the items content [19:08:24] SqlBlobStore (the core) [19:08:42] https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1&var-kClass=SqlBlobStore_blob&from=now-1h&to=now [19:08:49] The misses on them got doubled [19:08:54] !log pool mw1396 [19:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:55] yup, the misses is back to normal now [19:11:05] same as es load [19:12:28] :d [19:12:29] :D [19:13:01] !log uploade new jenkins version to APT repo; upgrading jenkins on releases1002/2002 [19:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:18] (03PS1) 10Hashar: all wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619831 [19:13:21] (03CR) 10Hashar: [C: 03+2] all wikis to 1.36.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619831 (owner: 10Hashar) [19:13:36] rzl: effie: cdanis: train is going on :] [19:13:57] ack, toot toot [19:14:06] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.4 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/619831 (owner: 10Hashar) [19:14:16] toooot [19:15:47] 🚂 [19:15:51] whiiim shuhushsush whimmm shishusushus whimmm shishushusshus whimmmm shushushisi [19:15:57] PAOPAOMOOOOOOOOOOOOO [19:16:16] clearly my train noise transcription is terrible [19:16:32] (03Abandoned) 10Ladsgroup: Set caching of CachingEntityRevisionLookup to CACHE_NONE in client [extensions/Wikibase] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619617 (https://phabricator.wikimedia.org/T255305) (owner: 10Ladsgroup) [19:16:59] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.4 [19:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:03] JO-oooooooooooo AAE-O-A-A-U-U-A- [19:19:12] http://gph.is/1sGDkaD [19:19:42] !log Upgrading Jenkins on contint1001 (spare) [19:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:05] grmblblb [19:20:07] it is not masked :\ [19:20:52] !log upgrading jenkins on releases*001 [19:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:44] ah puppet ... [19:22:02] !log releases2001 - stopped and masked jenkins service [19:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:57] !log releases1002 - stopped and masked jenkins service [19:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:42] so I have got systemd::service { 'jenkins': service_params => { enable => false, ensure => ensure_service(false) } [19:23:57] why again. 
i thought we fixed this [19:24:10] this seems so familiar like we did it before [19:24:22] remembers fixing the masking thing in puppet [19:24:30] and ensure_service() well does not service masked bah [19:25:06] or maybe that was for zuul ;] [19:25:19] so either spend 3 hours figuring it out, or I just manually mask it on the spare server [19:25:25] !log all releases* servers except 1001 - disable puppet; stop jenkins, mask jenkins [19:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:50] !log contint1001: sudo systemctl mask jenkins # spare server [19:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:07] hashar: manually masked for releases, do the same for contint1001, disable puppet, stop , mask [19:26:16] then i will look at the puppet fix [19:26:21] while you can continue with train [19:26:26] it is not urgent [19:26:37] PROBLEM - jenkins_service_running on releases2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [19:26:42] probably we want to overhaul how we manage the state of the services [19:26:54] yea, could have been zuul last time [19:27:41] PROBLEM - HTTP releases-jenkins.wikimedia.org on releases1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 553 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org%23Jenkins [19:28:07] PROBLEM - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [19:28:24] hashar: jenkins was running on these all the time.. 
without causing issues apparently [19:28:35] or we would have had the alerts [19:28:35] I guess cause they don't have any config [19:28:40] yeah [19:29:05] PROBLEM - HTTP releases-jenkins.wikimedia.org on releases2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 553 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org%23Jenkins [19:29:11] ACKNOWLEDGEMENT - HTTP releases-jenkins.wikimedia.org on releases1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 553 bytes in 0.002 second response time daniel_zahn jenkins is supposed to run only on the active server https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org%23Jenkins [19:29:11] ACKNOWLEDGEMENT - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn jenkins is supposed to run only on the active server https://wikitech.wikimedia.org/wiki/Jenkins [19:29:11] ACKNOWLEDGEMENT - HTTP releases-jenkins.wikimedia.org on releases2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 553 bytes in 0.061 second response time daniel_zahn jenkins is supposed to run only on the active server https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org%23Jenkins [19:29:11] ACKNOWLEDGEMENT - jenkins_service_running on releases2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn jenkins is supposed to run only on the active server https://wikitech.wikimedia.org/wiki/Jenkins [19:29:11] ACKNOWLEDGEMENT - HTTP releases-jenkins.wikimedia.org on releases2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 553 bytes in 0.062 second response time daniel_zahn jenkins is supposed to run only on the active server https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org%23Jenkins [19:29:12] ACKNOWLEDGEMENT - jenkins_service_running on releases2002 is CRITICAL: 
PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn jenkins is supposed to run only on the active server https://wikitech.wikimedia.org/wiki/Jenkins [19:29:46] giving a chance for a few patches to be merged by CI and I will upgrade the CI Jenkins [19:29:50] ETA ~ 7 minutes [19:29:50] (03PS1) 10Jeena Huneidi: [WIP] Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 [19:30:02] hashar: ack. so i was going to fix it anyways (auto-start of service on new instance) [19:30:23] let me find the "mask" fix from last time now [19:32:15] yes, it was zuul-merger: https://gerrit.wikimedia.org/r/q/owner:dzahn+message:mask [19:33:27] profile::zuul::merger::ensure_service: 'masked' [19:33:59] we don't need to rethink the whole way we manage services, we just need to copy this for jenkins, will do [19:34:44] (03CR) 10Jeena Huneidi: "This needs some changes to work with the new directory structure. I wanted your opinions on that since I'm not sure on the timeline there." [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (owner: 10Jeena Huneidi) [19:38:41] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add DVrandecic to group nda - https://phabricator.wikimedia.org/T260279 (10Urbanecm) [19:40:17] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add DVrandecic to group nda - https://phabricator.wikimedia.org/T260279 (10DVrandecic) I already am. The onboading section says I need to be added to nda as well. Is nda automatically set when wmf? I am listed in the wmf group but not in the nda group. [19:41:07] mutante: maybe that is sufficient. Depends on how the variable ends up being used though. 
Maybe that is passed to systemd::service system_parameters [19:41:08] no idea [19:41:25] !log Upgrading Jenkins on contint2001 (primary) [19:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:51] yea, it is [19:42:36] then maybe that would work [19:42:52] yea, it does [19:43:21] confirmed releases-jenkins is on 2.525 and I see the web UI (recently i got the privs for that) [19:43:23] and we now have the improved Jenkins design ( https://integration.wikimedia.org/ci/ ) \o/ [19:43:39] Hey all, I need to do some debugging on an issue that I'm unable to reproduce locally and it hints that it is either a replication lag issue or DeferredUpdates taking too long before a redirect(?). [19:43:40] I was told that I can use mwdebug1001 and do some basic debugging in there. Is that ok? [19:43:55] hashar: nice! [19:44:24] dmaza: yeah that is what it is meant for. Though the backends are the production ones! [19:44:50] dmaza: so for example if you edit a page through mwdebug1001, it will really be edited :] [19:45:21] hashar: cool thanks. I'll do my testing on testwiki [19:45:26] I'm basically adding a few wfDebug lines around and possibly moving a DeferredUpdate callback outside of it to prove my theory [19:47:09] hashar: once I'm done I only need to run scap pull to clean up my changes right? [19:48:08] dmaza: sorry I should not even started that conversation I am over busy. Can you check with folks in #wikimedia-releng please ? [19:48:16] that is my time, one surely has bandwith [19:48:44] hashar: sure no problem. Thank you [19:49:10] dmaza: but yeah essentially scap pull should work [19:49:32] thanks. That's all I need [19:49:33] by syncing with the deployment server and thus eradicating any local changes made to /srv/mediawiki/ [19:49:34] :] [19:49:51] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "Looks good. 
Shipping it" [puppet] - 10https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543) (owner: 10ZPapierski) [19:51:40] mutante: thank you :] [20:00:04] halfak and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T2000). [20:09:51] (03PS7) 10Ryan Kemper: elasticsearch: Amend prom query to match new state [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 [20:10:49] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Amend prom query to match new state [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (owner: 10Ryan Kemper) [20:11:06] !log reboot mw1271 [20:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:16] !log reboot mw1272 [20:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:20] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:31] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:06] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:14] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:28] !log reboot mw1286 [20:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:38] !log reboot mw1287 [20:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:41] (03CR) 10Cwhite: [C: 03+2] "Thanks, all!" 
[puppet] - 10https://gerrit.wikimedia.org/r/619572 (owner: 10Cwhite) [20:16:47] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:49] 10Operations, 10serviceops: mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10jijiki) [20:19:18] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Jenkins: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10hashar) [20:19:57] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Jenkins: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10hashar) p:05Triage→03Medium No hurry. I filed it cause we ended up importing the non LTS version... [20:20:03] (03PS4) 10Cwhite: prometheus: add config tests [puppet] - 10https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) [20:20:11] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:28] !log reboot mw1273 [20:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:59] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:16] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:39] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:51] !log reboot mw1274 [20:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:05] !log 
root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:27] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:41] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:52] !log reboot mw1288 [20:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:01] !log reboot mw1289 [20:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:27] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:12] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:24] !log reboot mw1275 [20:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:25] (03PS2) 10Ryan Kemper: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [20:31:58] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:20] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:03] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:12] !log reboot mw1319 [20:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:26] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add 
active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [20:33:45] (03CR) 10Ryan Kemper: "This is related to https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/603731/, where it was mentioned that the Cookbooks method `wait" [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [20:34:09] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:21] !log reboot mw1290 [20:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:12] (03CR) 10Ryan Kemper: "Patch 7 is roughly what this patch would look like without moving anything to spicerack." [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (owner: 10Ryan Kemper) [20:37:44] (03PS8) 10Ryan Kemper: elasticsearch: Let spicerack handle wait for queue [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 [20:38:02] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:44] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Let spicerack handle wait for queue [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (owner: 10Ryan Kemper) [20:39:21] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:39:21] (03PS3) 10Ryan Kemper: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [20:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:34] !log reboot mw1320 [20:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:57] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:42] !log root@cumin1001 START - Cookbook 
sre.hosts.reboot-single [20:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:51] !log reboot mw1297 [20:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:21] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [20:41:40] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:07] (03PS4) 10Ryan Kemper: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [20:42:39] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:00] !log reboot mw1321 [20:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:03] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:01] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:44:04] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [20:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:13] !log reboot mw1312 [20:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:18] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) I have [scheduled](https://wikitech.wikimedia.org/w/index.ph... 
[20:46:19] any ops around to review two straightforward patches? I requested them for this week's puppet window but it didn't get merged :( https://gerrit.wikimedia.org/r/c/operations/puppet/+/619446 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/619137 [20:46:51] one adds wmcloud.org to wikimedia networks for exim4 and the other is needed for creating a new wiki [20:46:59] mutante: if you have time ^ [20:48:05] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Ladsgroup) >>! In T259002#6380928, @Urbanecm wrote: > I have [schedule... [20:48:59] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:17] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:29] Amir1: varnish VCL changes, text-frontend specifically is one of the most critical things to touch in the entire infra, i would prefer to let traffic do that and not swat it [20:50:03] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:15] !log reboot mw1313 [20:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:21] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:39] the other change is something that should have reviews from wmcs please [20:51:01] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:51:03] Thanks. 
I go ask traffic for the first one, for the latter I try to ping someone [20:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:12] !log reboot mw1314 [20:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:33] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:40] !log reboot mw1322 [20:52:42] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [20:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:52] (03PS5) 10Ryan Kemper: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [20:54:54] (03CR) 10Bstorm: [C: 03+1] exim: Add wmcloud.org to list of wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/619137 (https://phabricator.wikimedia.org/T259981) (owner: 10Ladsgroup) [20:54:56] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:48] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [20:56:33] (03CR) 10Andrew Bogott: [C: 03+2] exim: Add wmcloud.org to list of wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/619137 (https://phabricator.wikimedia.org/T259981) (owner: 10Ladsgroup) [20:57:18] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [20:57:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:27] !log reboot mw1323 [20:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:07] (03PS6) 10Ryan Kemper: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [21:00:23] (03PS7) 10Ryan Kemper: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [21:00:55] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:18] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [21:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:26] !log reboot mw1315 [21:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:46] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:54] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:22] (03PS1) 10Andrew Bogott: exim: add toolforge.org domain [puppet] - 10https://gerrit.wikimedia.org/r/619851 [21:02:28] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [21:02:31] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [21:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:33] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:42] !log reboot mw1324 [21:02:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:31] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [21:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:34] (03PS1) 10Urbanecm: Initial configuration for thankyouwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619852 (https://phabricator.wikimedia.org/T259002) [21:03:51] !log reboot mw1325 [21:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:27] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) >>! In T259002#6380931, @Ladsgroup wrote: >>>! In T259002#63... [21:04:51] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [21:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:02] !log reboot mw1316 [21:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:33] 10Operations, 10serviceops, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10Krinkle) [21:06:54] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) >>! In T259002#6380998, @gerritbot wrote: > Change 619852 ha... 
[21:10:09] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:11] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [21:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:27] !log reboot mw1317 [21:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:33] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:10] I asked on #wikimedia-releng, but in case someone here is available: following up on what I was saying earlier about debugging on mwdebug1001, I can access the server but I'm unable to make any file changes due to a file permissions error. Is this something I should be able to do and I'm missing a step, or do I need special permissions? [21:13:21] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [21:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:33] !log reboot mw1326 [21:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:10] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:47] (03PS8) 10Ryan Kemper: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [21:15:29] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [21:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:38] !log reboot mw1327 [21:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:38] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add active dc for each clustergroup [software/spicerack] - 
10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [21:17:57] (03PS9) 10Ryan Kemper: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [21:19:34] 10Operations, 10serviceops, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10CDanis) Here's a summary of my findings so far, although keep in mind that I'm well out of my depth here and this is an amalgamation of guesswork. mw1357 is a se... [21:19:43] (03PS1) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [21:20:03] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:27] (03CR) 10jerkins-bot: [V: 04-1] jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 (owner: 10Dzahn) [21:20:29] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [21:20:53] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [21:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:01] !log reboot mw1339 [21:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:59] (03PS10) 10Ryan Kemper: elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 [21:22:22] wkandek: out of curiosity, are you using double sudo by any chance? 
I see the runs !logged as root@ instead of the real user [21:22:35] like sudo from root [21:22:43] (03PS3) 10Urbanecm: Initial configuration for lijwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617861 (https://phabricator.wikimedia.org/T259432) [21:23:38] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:57] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add active dc for each clustergroup [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [21:24:36] volans: running as root after sudo su -; will switch to sudo from my account. [21:25:26] (03PS2) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [21:25:46] ah ok, yeah I usually use sudo cookbook... or sudo -i and then anything else [21:25:48] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [21:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:54] thx! 
:) [21:25:59] !log reboot mw1340 [21:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:05] (03CR) 10jerkins-bot: [V: 04-1] jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 (owner: 10Dzahn) [21:27:42] (03PS2) 10Jeena Huneidi: [WIP] Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) [21:28:36] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:23] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:30] 10Operations, 10serviceops, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10CDanis) There's some preliminary results from tracing 4096-byte allocations at {P12238} although some of the stack traces make me think I need to make sure the sc... 
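Note on the sudo exchange above (21:22 to 21:25): the runs were logged as root@ while the operator worked from a shell opened with "sudo su -", and as wkandek@ once the cookbook was started with sudo from the personal account. One plausible explanation is that sudo exports the invoking account in the SUDO_USER environment variable, while a login shell started by "su -" normally begins with a cleaned environment, so that variable is gone. The sketch below only illustrates that mechanism. The helper names operator_name and log_prefix are hypothetical, and this is not claimed to be how the cookbook tooling actually derives the name.

```python
import os


def operator_name(environ=None):
    """Hypothetical helper: pick the account to credit in a log line.

    Assumption: the tool trusts SUDO_USER when present and otherwise
    falls back to USER. The real tooling may do something different.
    """
    env = os.environ if environ is None else environ
    # sudo sets SUDO_USER to the login name of the account that invoked it.
    sudo_user = env.get("SUDO_USER")
    if sudo_user:
        return sudo_user
    # Without SUDO_USER (for example after "sudo su -"), only the
    # current account is visible, which is root in that case.
    return env.get("USER", "unknown")


def log_prefix(host, environ=None):
    """Build the 'user@host' prefix seen in the !log lines."""
    return f"{operator_name(environ)}@{host}"


if __name__ == "__main__":
    # Environment as seen when running "sudo <command>" from a personal account.
    print(log_prefix("cumin1001", {"SUDO_USER": "wkandek", "USER": "root"}))
    # Environment as seen inside a root login shell opened with "sudo su -".
    print(log_prefix("cumin1001", {"USER": "root"}))
```

Under these assumptions the first call yields a wkandek@cumin1001 prefix and the second a root@cumin1001 prefix, matching the change visible in the log after 21:25.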
[21:32:13] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [21:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:50] !log root@cumin1001 START - Cookbook sre.hosts.reboot-single [21:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:19] !log reboot mw1328 [21:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:28] !log reboot mw1329 [21:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:44] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) @Ladsgroup @Pcoombe This wiki is scheduled to be created on... [21:36:02] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:25] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [21:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:41] (03PS2) 10Urbanecm: Initial configuration for thankyouwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619852 (https://phabricator.wikimedia.org/T259002) [21:37:42] !log wkandek@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) [21:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:21] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [21:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:27] !log reboot mw1341 [21:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:47] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:39:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:38] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [21:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:53] !log wkandek@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [21:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:38] (03PS1) 10Jeena Huneidi: Fix deploy script [deployment-charts] - 10https://gerrit.wikimedia.org/r/619858 (https://phabricator.wikimedia.org/T259684) [21:44:17] (03PS3) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [21:44:51] (03CR) 10jerkins-bot: [V: 04-1] jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 (owner: 10Dzahn) [21:46:40] !log wkandek@cumin1001 conftool action : set/pooled=yes; selector: name=mw1340.eqiad.wmnet [21:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:42] (03PS3) 10Ppchelko: Configure ratelimiter to support authenticated/anon limits for api [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) [21:46:56] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:38] (03PS4) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [21:47:51] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [21:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:03] !log reboot mw1342 [21:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:18] (03CR) 10jerkins-bot: [V: 04-1] jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 
(owner: 10Dzahn) [21:50:16] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [21:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:28] !log reboot mw1331 [21:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:23] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:43] (03PS5) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [21:55:17] (03CR) 10jerkins-bot: [V: 04-1] jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 (owner: 10Dzahn) [21:55:25] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:39] !log wkandek@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [21:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:40] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:52] !log reboot mw1332 [22:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:02] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:11] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:34] !log reboot mw1343 [22:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:29] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:03:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:44] !log reboot mw1344 [22:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:08] (03PS6) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [22:04:48] (03CR) 10jerkins-bot: [V: 04-1] jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 (owner: 10Dzahn) [22:05:10] (03PS4) 10Ppchelko: Configure ratelimiter to support authenticated/anon limits for api [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) [22:06:03] (03CR) 10Dzahn: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [22:06:07] (03PS5) 10Ppchelko: Configure ratelimiter to support authenticated/anon limits for api [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) [22:07:18] !log wkandek@cumin1001 conftool action : set/pooled=yes; selector: name=mw1330.eqiad.wmnet [22:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:53] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:03] !log reboot mw1333 [22:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:47] (03CR) 10Ppchelko: "So, I tried a" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) (owner: 10Ppchelko) [22:09:51] (03PS7) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [22:11:18] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:11:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:17] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:33] !log reboot mw1349 [22:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:54] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:38] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:26] !log reboot mw1345 [22:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:37] (03PS8) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [22:18:29] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:21] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:19:21] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24460/icinga1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [22:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:39] !log reboot mw1346 [22:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:19] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:43] (03PS1) 10AntiCompositeNumber: engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/619864 
(https://phabricator.wikimedia.org/T260285) [22:21:12] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: [EPIC] Deploy push-notifications service to production - https://phabricator.wikimedia.org/T256237 (10Mholloway) [22:21:36] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:14] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:22] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/24462/ close, now just needs to not affect "service_monitor"" [puppet] - 10https://gerrit.wikimedia.org/r/619855 (owner: 10Dzahn) [22:22:23] !log reboot mw1350 [22:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:15] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:57] (03CR) 10Dzahn: "the new icinga check has been added at https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=test.wikidata.org&service=test." 
[puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [22:25:09] 10Operations, 10Cloud-Services, 10Mail, 10User-Ladsgroup: Wikimedia exim config drops mails it's relaying from *.wmcloud.org - https://phabricator.wikimedia.org/T259981 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup [22:25:13] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:23] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:33] !log reboot mw1347 [22:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:07] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:01] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:16] !log reboot mw1348 [22:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:45] (03PS9) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [22:30:46] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:22] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:27] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:32] !log reboot mw1352 [22:31:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:29] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:40] !log reboot mw1353 [22:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:25] (03PS10) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [22:35:44] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:16] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:30] !log reboot mw1396 [22:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:35] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:08] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:20] !log reboot mw1354 [22:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:15] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:56] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:06] !log reboot mw1355 [22:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:24] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:41:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:04] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:12] !log reboot mw1401 [22:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:11] (03PS11) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [22:44:29] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:54] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:31] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:40] !log reboot mw1364 [22:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:21] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:34] !log reboot mw1399 [22:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:32] (03CR) 10BryanDavis: [C: 03+1] "When I created this static zone file in Ib0351f0b8f12ef0476c8baf69153433722b8a182, I chose to point to slice names rather than the matchin" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619627 (https://phabricator.wikimedia.org/T259438) (owner: 10Marostegui) [22:51:02] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:44] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:51:44] !log wkandek@cumin1001 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) [22:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:00] !log reboot mw1365 [22:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:07] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:37] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:48] !log reboot mw1366 [22:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:40] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [22:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:51] !log reboot mw1397 [22:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200812T2300). 
[23:00:18] dear jouncebot, there is nothing to deploy :-( [23:00:43] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:00:44] (03PS12) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [23:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:05] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [23:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:16] !log reboot mw1395 [23:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:55] (03CR) 10jerkins-bot: [V: 04-1] jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 (owner: 10Dzahn) [23:02:02] sigh [23:02:55] (03CR) 10Cwhite: [C: 03+1] templates: add alerts.w.o [dns] - 10https://gerrit.wikimedia.org/r/619752 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [23:03:57] Urbanecm: at some point we decided that jouncebot should announce the window even when it has no patches. Folks thought the bot was broken when it skipped empty windows. [23:04:22] I'm not complaining :-) [23:04:23] (03PS13) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [23:04:40] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:43] IIRC it used to say "nothing to deploy" some time back [23:04:51] bd808: do you know why that's no longer a thing? 
[23:04:55] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [23:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:14] !log reboot mw1393 [23:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:43] (03CR) 10jerkins-bot: [V: 04-1] jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 (owner: 10Dzahn) [23:06:10] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:28] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:44] Urbanecm: because it was special cased and then the wording changed on the wiki! -- https://gerrit.wikimedia.org/r/c/wikimedia/bots/jouncebot/+/425881/1/jouncebot.py [23:06:50] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:04] probably a simple patch to fix if you are interested ;) [23:07:15] Ohoho, sure ;) [23:07:47] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [23:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:58] !log reboot mw1391 [23:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:06] (03CR) 10Cwhite: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619738 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [23:08:37] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [23:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:46] !log reboot mw1367 [23:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
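Note on the 23:06:44 explanation ("it was special cased and then the wording changed on the wiki"): a special case keyed on the text of a window title stops firing when the title is renamed. The sketch below is a minimal illustration of that failure mode, not jouncebot's actual code. The pattern strings, the function name announce and the message wording are all assumptions made for the example. Only the window title is taken from the 23:00:04 announcement above.

```python
import re

# Hypothetical patterns: the bot's real matching rule is not shown in this log.
LEGACY_PATTERN = re.compile(r"SWAT")
UPDATED_PATTERN = re.compile(r"SWAT|backport window", re.IGNORECASE)


def announce(title, deployers, patches, pattern):
    """Return the announcement text for one deployment window.

    The empty-window special case only applies when the title matches
    the given pattern, so a renamed window silently bypasses it.
    """
    if pattern.search(title) and not patches:
        return f"No patches scheduled for {title}. Nothing to deploy."
    return f"{', '.join(deployers)}: time to do the {title} deploy."


if __name__ == "__main__":
    title = "Evening backport window(Max 6 patches)"
    # Old rule: the renamed window no longer matches, so the generic text is used.
    print(announce(title, ["Urbanecm"], [], LEGACY_PATTERN))
    # Updated rule: the empty window is recognised again.
    print(announce(title, ["Urbanecm"], [], UPDATED_PATTERN))
```

Matching on a structured field of the calendar entry instead of the display title would avoid this class of breakage, if such a field is available.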
[23:09:08] (03CR) 10Cwhite: [C: 03+1] prometheus: add alertmanagers configuration [puppet] - 10https://gerrit.wikimedia.org/r/619739 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [23:09:24] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [23:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:43] !log reboot mw1368 [23:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:53] (03CR) 10Dzahn: [C: 03+2] mailman: replace fermium with lists1001 in rsync scripts [puppet] - 10https://gerrit.wikimedia.org/r/619585 (https://phabricator.wikimedia.org/T224586) (owner: 10Dzahn) [23:10:39] (03PS1) 10Urbanecm: SWAT was renamed to backport windows [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/619868 [23:10:52] (03CR) 10Dzahn: "these scripts are not automatically running, they were made for manual runs for migrations" [puppet] - 10https://gerrit.wikimedia.org/r/619585 (https://phabricator.wikimedia.org/T224586) (owner: 10Dzahn) [23:11:36] (03PS14) 10Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - 10https://gerrit.wikimedia.org/r/619855 [23:11:58] https://gerrit.wikimedia.org/r/admin/repos/wikimedia/bots/jouncebot,access super small list of +2ers [23:12:42] (03CR) 10Dzahn: [C: 03+1] remove fermium from DHCP,partman and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/619586 (https://phabricator.wikimedia.org/T224586) (owner: 10Dzahn) [23:14:12] (03PS1) 10BryanDavis: SWAT is now known as "backport" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/619869 [23:14:30] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:47] bd808: you don't like my patch? :-( [23:15:36] Urbanecm: lol. I didn't even wait for you. 
I'll merge yours and abandon mine :) [23:15:57] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [23:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:05] !log reboot mw1389 [23:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:08] always better to have two patches fixing a bug than zero :-) [23:16:19] (03Abandoned) 10BryanDavis: SWAT is now known as "backport" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/619869 (owner: 10BryanDavis) [23:16:35] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:58] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [23:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:08] !log reboot mw1387 [23:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:11] (03CR) 10BryanDavis: [C: 03+2] SWAT was renamed to backport windows [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/619868 (owner: 10Urbanecm) [23:17:28] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:43] (03Merged) 10jenkins-bot: SWAT was renamed to backport windows [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/619868 (owner: 10Urbanecm) [23:18:33] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single [23:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:44] !log reboot mw1369 [23:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:09] thx bd808 [23:19:29] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[23:21:03] jouncebot_: next
[23:21:03] In 0 hour(s) and 38 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200813T0000)
[23:21:03] (PS15) Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - https://gerrit.wikimedia.org/r/619855
[23:22:03] (PS1) Urbanecm: Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870
[23:22:08] bd808: may I trouble you with Yet Another Patch (tm)?
[23:22:28] (CR) jerkins-bot: [V: -1] Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870 (owner: Urbanecm)
[23:22:33] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single
[23:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:46] !log reboot mw1370
[23:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:56] (PS2) Urbanecm: Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870
[23:23:21] (CR) jerkins-bot: [V: -1] Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870 (owner: Urbanecm)
[23:24:56] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[23:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:19] (PS3) Urbanecm: Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870
[23:25:26] (CR) BryanDavis: Remove "Max X patches" from window's name (1 comment) [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870 (owner: Urbanecm)
[23:25:45] (CR) jerkins-bot: [V: -1] Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870 (owner: Urbanecm)
[23:25:45] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[23:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:37] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single
[23:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:00] !log reboot mw1385
[23:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:49] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single
[23:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:01] !log reboot mw1384
[23:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:54] (PS4) Urbanecm: Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870
[23:29:19] (CR) jerkins-bot: [V: -1] Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870 (owner: Urbanecm)
[23:30:51] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[23:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:27] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single
[23:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:37] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[23:31:38] !log reboot mw1371
[23:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:13] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single
[23:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:20] (PS5) Urbanecm: Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870
[23:32:24] !log reboot mw1373
[23:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:35] bd808: finally :D
[23:34:05] (CR) Urbanecm: Remove "Max X patches" from window's name (1 comment) [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870 (owner: Urbanecm)
[23:36:13] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[23:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:51] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-single
[23:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:00] !log reboot mw1372
[23:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:24] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[23:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:47] (PS16) Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - https://gerrit.wikimedia.org/r/619855
[23:40:37] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[23:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:36] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[23:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:31] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[23:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:24] (PS17) Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - https://gerrit.wikimedia.org/r/619855
[23:53:43] (PS18) Dzahn: jenkins: redesign the way we manage service ensure/enable [puppet] - https://gerrit.wikimedia.org/r/619855