[00:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T0000).
[00:00:17] !log LDAP - removed demon from nda group
[00:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:06:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:11:36] (CR) Dzahn: [C: +1] "required per https://wiki.mozilla.org/Security/DOH-resolver-policy" [puppet] - https://gerrit.wikimedia.org/r/618376 (owner: Ssingh)
[00:12:46] Operations, Fundraising-Backlog, User-Urbanecm, User-dancy, Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (Ejegg) @Ladsgroup The amount of content hosted here will be very small - just a handful of f...
[00:14:08] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[00:14:36] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[00:17:32] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[00:19:58] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[00:34:32] Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` wtp2019.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[00:35:19] !log wtp2019 - reimaging - parsoid service does not work, unlike on all other wtp*, making sure it's clean
[00:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:10] Operations, ops-codfw, netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (Papaul)
[00:48:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 62 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:58:39] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:59:05] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[01:06:15] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[01:14:45] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[01:17:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[01:17:30] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[01:19:23] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:50] (PS2) Ottomata: eventgate - use /v1/_test/events route for readinessProbe [deployment-charts] - https://gerrit.wikimedia.org/r/618624 (https://phabricator.wikimedia.org/T251935)
[01:22:51] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[01:31:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:39:17] RECOVERY - MediaWiki exceptions and fatals per minute on
icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:40:21] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[01:40:53] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[01:51:14] PROBLEM - DPKG on wtp2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.34: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[01:51:14] PROBLEM - configured eth on wtp2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.34: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:52:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[01:52:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:09] ACK - wtp2019
[01:58:40] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[02:03:20] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[02:23:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:26:18] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[02:29:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:29:14] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[02:57:15] Operations, serviceops: All wtp and parse servers have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp2019.codfw.wmnet'] ` and were **ALL** successful.
[02:57:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:57:59] Operations, Fundraising-Backlog, User-Urbanecm, User-dancy, Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (DStrine) @Ladsgroup one other note. We only need a few pages here but they will take the ful...
[03:02:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:04:02] RECOVERY - DPKG on wtp2019 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[03:04:02] RECOVERY - configured eth on wtp2019 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[03:04:40] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2019.codfw.wmnet
[03:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:07:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:11:26] (PS1) Tim Starling: Enable fastStale mode on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/618646 (https://phabricator.wikimedia.org/T250248)
[03:12:42] (CR) Tim Starling: "Please review" [mediawiki-config] - https://gerrit.wikimedia.org/r/618646 (https://phabricator.wikimedia.org/T250248) (owner: Tim Starling)
[03:35:52] PROBLEM - Host cloudcephosd1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:37:56] RECOVERY - Host cloudcephosd1011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms
[04:02:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:08:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:11:05] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[04:13:51] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[04:27:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:37:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079', diff saved to https://phabricator.wikimedia.org/P12179 and previous config saved to /var/cache/conftool/dbconfig/20200806-043758-marostegui.json
[04:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079', diff saved to https://phabricator.wikimedia.org/P12180 and previous config saved to /var/cache/conftool/dbconfig/20200806-044608-marostegui.json
[04:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079', diff saved to https://phabricator.wikimedia.org/P12181 and previous config saved to /var/cache/conftool/dbconfig/20200806-045107-marostegui.json
[04:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:34] (CR) Marostegui: "Brooke, I am surprised that after so many hours...it looks like there are no connections going thru the "new" proxy, so I am not sure if t" [puppet] - https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520) (owner: Marostegui)
[04:56:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1079', diff saved to https://phabricator.wikimedia.org/P12182 and previous config saved to /var/cache/conftool/dbconfig/20200806-045622-marostegui.json
[04:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:43] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127 for MCR', diff saved to https://phabricator.wikimedia.org/P12184 and previous config saved to /var/cache/conftool/dbconfig/20200806-050743-marostegui.json
[05:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:58] (PS1) Marostegui: db1132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/618655 (https://phabricator.wikimedia.org/T259589)
[05:11:42] (CR) Marostegui: [C: +2] db1132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/618655 (https://phabricator.wikimedia.org/T259589) (owner: Marostegui)
[05:15:36] (CR) Marostegui: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520) (owner: Marostegui)
[05:16:11] (CR) Marostegui: "Brooke, I am surprised that after so many hours...it looks like there are no connections going thru the "new" proxy, so I am not sure if t" [puppet] - https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[05:18:14] Operations, Fundraising-Backlog, User-Urbanecm, User-dancy, Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (Dzahn) >>! In T259002#6364836, @DStrine wrote: > We're talking hundreds of thousands of use...
[05:21:51] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:30:35] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[05:31:03] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[05:33:31] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[05:33:59] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[05:39:07] (CR) Bstorm: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520) (owner: Marostegui)
[05:42:43] (PS2) KartikMistry: Update cxserver to 2020-08-05-070016-production [deployment-charts] - https://gerrit.wikimedia.org/r/618525 (https://phabricator.wikimedia.org/T258919)
[05:43:33] * kart_ updating cxserver..
[05:46:37] (CR) Bstorm: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[05:47:03] (PS1) QChris: Add .gitreview [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/618656
[05:47:05] (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/618656 (owner: QChris)
[05:48:36] (CR) KartikMistry: [C: +2] Update cxserver to 2020-08-05-070016-production [deployment-charts] - https://gerrit.wikimedia.org/r/618525 (https://phabricator.wikimedia.org/T258919) (owner: KartikMistry)
[05:48:38] (CR) Marostegui: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[05:49:37] (Merged) jenkins-bot: Update cxserver to 2020-08-05-070016-production [deployment-charts] - https://gerrit.wikimedia.org/r/618525 (https://phabricator.wikimedia.org/T258919) (owner: KartikMistry)
[05:53:22] (CR) Bstorm: "It all matches this patch in cloud DNS from everything I'm able to check:" [puppet] - https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[05:55:00] (CR) Marostegui: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[05:57:40] (CR) Bstorm: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[05:59:26] (CR) Marostegui: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[06:02:50] (PS1) Marostegui: install_server: Reimage dbproxy1018 to Buster [puppet] - https://gerrit.wikimedia.org/r/618659 (https://phabricator.wikimedia.org/T255408)
[06:03:31] (CR) Marostegui: [C: +2] install_server: Reimage dbproxy1018 to Buster
[puppet] - https://gerrit.wikimedia.org/r/618659 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[06:05:05] (PS1) Marostegui: install_server: Actually set dbproxy1018 to Buster [puppet] - https://gerrit.wikimedia.org/r/618660
[06:05:42] (CR) Marostegui: [C: +2] install_server: Actually set dbproxy1018 to Buster [puppet] - https://gerrit.wikimedia.org/r/618660 (owner: Marostegui)
[06:06:45] I'm getting unusual error while running helmfile command. Did anything change with it?
[06:08:12] https://www.irccloud.com/pastebin/lrj5Zab5/
[06:10:17] akosiaris: When you're around ^
[06:11:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:19] (CR) Elukey: "Can this be deployed asap? We have recurrent alerts in icinga for netbox1001's root partition filling up :)" [software/netbox-extras] - https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) (owner: CRusnov)
[06:20:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[06:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:20] Operations, ops-eqiad: relforge1001's mgmt IP not reachable - https://phabricator.wikimedia.org/T259777 (elukey)
[06:36:33] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper
[06:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:30] !log roll restart of druid clusters' zookeeper and an-conf* zookeeper for openjdk-11 upgrades
[06:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:41] (PS1) Marostegui: Revert "wikireplica_dns.yaml: Depool dbproxy1018" [puppet] -
https://gerrit.wikimedia.org/r/618570
[06:42:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:42:49] (CR) Marostegui: [C: +2] Revert "wikireplica_dns.yaml: Depool dbproxy1018" [puppet] - https://gerrit.wikimedia.org/r/618570 (owner: Marostegui)
[06:43:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
[06:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:47:08] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper
[06:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
[06:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:31] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[06:54:31] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[06:57:04] !log Truncate tables on zerowiki T227717
[06:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:07] T227717: Drop DB tables for now-deleted zerowiki
from production - https://phabricator.wikimedia.org/T227717
[06:57:11] (CR) Elukey: [C: +2] mjolnir: Increase msearch daemon parallelism to 25 [puppet] - https://gerrit.wikimedia.org/r/618538 (owner: Ebernhardson)
[06:57:24] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper
[06:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:10] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[06:59:10] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[07:00:07] (CR) Elukey: [C: +1] kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: Herron)
[07:03:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
[07:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:14:13] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:19:51] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.435e+05 ge 24
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[07:25:08] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (jcrespo) Thank you Papaul very much for you work, really appreciated how fast this was completed!
[07:27:14] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80% #page
[07:27:26] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% #page
[07:28:14] ̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page
[07:28:18] * volans here
[07:28:26] ̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80% #page
[07:29:34] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:29:37] cc XioNoX ^^^
[07:32:36] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[07:33:37] there was a jump for cr1-codfw too
[07:40:20] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[07:44:16] RECOVERY - Thanos compact has not run on icinga1001 is OK: (C)24 ge (W)12 ge 0.01612 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[07:46:18] (PS1) Elukey: druid: fix monitoring configuration [puppet] - https://gerrit.wikimedia.org/r/618705 (https://phabricator.wikimedia.org/T244482)
[07:46:23] volans: on my phone, looks like it recovered
[07:47:09] XioNoX: yeah was a spike in traffic actually, we're looking into it
[07:51:33] (CR) Elukey: [C: +2] druid: fix monitoring configuration [puppet] - https://gerrit.wikimedia.org/r/618705 (https://phabricator.wikimedia.org/T244482) (owner: Elukey)
[07:55:13] !roll restart druid brokers on druid-analytics to pick up new settings
[08:02:04] (PS1) Jcrespo: mariadb-backup: Initial setup of dbprov2003 [puppet] - https://gerrit.wikimedia.org/r/618706 (https://phabricator.wikimedia.org/T257551)
[08:03:44] (PS6) Ema: ATS: add function profile::trafficserver_caching_rules [puppet] - https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692)
[08:04:38] !roll restart druid brokers on druid-public to pick up new settings
[08:05:41] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) (owner: Ema)
[08:05:51] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[08:06:38] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514}
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[08:07:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:08:44] (PS2) Thiemo Kreuz (WMDE): Remove deprecated setting [mediawiki-config] - https://gerrit.wikimedia.org/r/618482 (https://phabricator.wikimedia.org/T232542) (owner: Awight)
[08:11:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:14:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P12185 and previous config saved to /var/cache/conftool/dbconfig/20200806-081416-marostegui.json
[08:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:36] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[08:16:41] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[08:21:36] Operations, Traffic, Patch-For-Review: Generate ATS cache.config from software-agnostic data structures - https://phabricator.wikimedia.org/T259692 (ema) On a text node, [[https://puppet-compiler.wmflabs.org/compiler1001/527/|applying the change]] results in the following diff: ` --- /etc/trafficser...
[08:25:13] (CR) Filippo Giunchedi: [C: +1] kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: Herron)
[08:30:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P12186 and previous config saved to /var/cache/conftool/dbconfig/20200806-083033-marostegui.json
[08:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P12187 and previous config saved to /var/cache/conftool/dbconfig/20200806-083743-marostegui.json
[08:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:36] (CR) Vgutierrez: ATS: add function profile::trafficserver_caching_rules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) (owner: Ema)
[08:44:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1127', diff saved to https://phabricator.wikimedia.org/P12188 and previous config saved to /var/cache/conftool/dbconfig/20200806-084406-marostegui.json
[08:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:10] (CR) Ema: ATS: add function profile::trafficserver_caching_rules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/618537
(https://phabricator.wikimedia.org/T259692) (owner: Ema) [08:45:32] (CR) Jcrespo: [C: +2] mariadb-backup: Initial setup of dbprov2003 [puppet] - https://gerrit.wikimedia.org/r/618706 (https://phabricator.wikimedia.org/T257551) (owner: Jcrespo) [08:46:03] (PS7) Ema: ATS: add function profile::trafficserver_caching_rules [puppet] - https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) [08:57:00] (PS1) Mvolz: Update citoid to 37e45898 [deployment-charts] - https://gerrit.wikimedia.org/r/618713 (https://phabricator.wikimedia.org/T259469) [08:57:34] What can be this error in deployment-charts? https://pastebin.com/xr5iyBzN :/ [08:57:49] (PS2) Mvolz: Update citoid to 37e45898 [deployment-charts] - https://gerrit.wikimedia.org/r/618713 (https://phabricator.wikimedia.org/T259469) [08:58:40] (PS1) Jcrespo: mariadb: Move x1 snapshots for the temporary backup2002 to dbprov2003 [puppet] - https://gerrit.wikimedia.org/r/618714 (https://phabricator.wikimedia.org/T257551) [09:00:13] (CR) Vgutierrez: [C: +1] ATS: add function profile::trafficserver_caching_rules [puppet] - https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) (owner: Ema) [09:02:39] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [09:03:33] (PS1) Mvolz: Update zotero to use buster10 images [deployment-charts] - https://gerrit.wikimedia.org/r/618716 (https://phabricator.wikimedia.org/T258158) [09:05:47] (PS1) Ema: varnish: add Go-http-client to cache_upload naughty list [puppet] - https://gerrit.wikimedia.org/r/618717 [09:06:39] (CR) Vgutierrez: [C: +1] "looking good from ats-tls and TLS point of view" [puppet] - https://gerrit.wikimedia.org/r/615797 
(https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [09:07:17] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [09:08:00] FYI I'm debugging these rsyslog failures ^ see also T259780 [09:08:01] T259780: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 [09:10:46] (PS1) Jcrespo: install: Prevent full wipe of dbprov2003 data by changing its recipe [puppet] - https://gerrit.wikimedia.org/r/618718 (https://phabricator.wikimedia.org/T257551) [09:11:26] (PS2) Jcrespo: mariadb: Move x1 snapshots from the temporary backup2002 to dbprov2003 [puppet] - https://gerrit.wikimedia.org/r/618714 (https://phabricator.wikimedia.org/T257551) [09:11:53] (PS2) Jcrespo: install: Prevent full wipe of dbprov2003 data by changing its recipe [puppet] - https://gerrit.wikimedia.org/r/618718 (https://phabricator.wikimedia.org/T257551) [09:12:56] (CR) Jcrespo: [C: +2] mariadb: Move x1 snapshots from the temporary backup2002 to dbprov2003 [puppet] - https://gerrit.wikimedia.org/r/618714 (https://phabricator.wikimedia.org/T257551) (owner: Jcrespo) [09:15:35] (CR) Jcrespo: [C: +2] install: Prevent full wipe of dbprov2003 data by changing its recipe [puppet] - https://gerrit.wikimedia.org/r/618718 (https://phabricator.wikimedia.org/T257551) (owner: Jcrespo) [09:16:39] (PS1) Filippo Giunchedi: hieradata: add alert[12]001 to monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/618719 (https://phabricator.wikimedia.org/T247966) [09:16:42] (PS5) ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [09:20:01] (PS1) Jcrespo: mariadb-backups: Reenable notifications for dbprov2003 after maintenance [puppet] - 
https://gerrit.wikimedia.org/r/618720 (https://phabricator.wikimedia.org/T138562) [09:20:29] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [09:23:17] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [09:26:11] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [09:31:56] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/607524 (owner: Hashar) [09:32:09] (CR) Hashar: [C: +1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/607525 (owner: Hashar) [09:32:21] (CR) Hashar: [C: +1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: Hashar) [09:34:06] (PS2) Ema: varnish: add Go-http-client to cache_upload naughty list [puppet] - https://gerrit.wikimedia.org/r/618717 (https://phabricator.wikimedia.org/T192688) [09:38:01] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [09:40:52] (PS1) Jcrespo: mariadb-backups: Move x1 and misc logical dumps to dbprov1003 [puppet] - https://gerrit.wikimedia.org/r/618722 (https://phabricator.wikimedia.org/T138562) [09:47:53] (PS1) Jcrespo: BackupStatistics: Do not raise an exception if metadata cannot be sent [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/618723 (https://phabricator.wikimedia.org/T138562) [09:49:33] (PS2) Jcrespo: BackupStatistics: Do not raise an exception if metadata cannot be sent [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/618723 (https://phabricator.wikimedia.org/T138562) [09:50:17] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:51] elukey: jupyter-mgerlach-singleuser.service failed ^^^ [09:51:54] /bin/bash: line 0: exec: jupyterhub-singleuser: not found [09:52:02] (CR) Jcrespo: [C: -2] "Actually, the code is good; it just logs the exception, it doesn't rise it further." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/618723 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo) [09:53:21] volans: thanks! 
[09:53:30] (Abandoned) Jcrespo: BackupStatistics: Do not raise an exception if metadata cannot be sent [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/618723 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo) [09:58:17] (PS1) Hashar: Fix changelog filename [software/puppet-compiler] - https://gerrit.wikimedia.org/r/618724 [09:58:34] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [09:59:24] (CR) Hashar: "puppet catalog diff: https://puppet-compiler.wmflabs.org/compiler1001/528/contint2001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/607524 (owner: Hashar) [10:00:04] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T1000). 
[10:00:14] (CR) Hashar: [C: +1] "Puppet catalog diff looks fine https://puppet-compiler.wmflabs.org/compiler1003/529/doc1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/607525 (owner: Hashar) [10:01:11] (PS3) Mvolz: Update citoid to 37e45898 [deployment-charts] - https://gerrit.wikimedia.org/r/618713 (https://phabricator.wikimedia.org/T259469) [10:03:01] (PS3) Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) [10:03:14] (PS4) Hnowlan: api-gateway: open parts of the admin interface internally [deployment-charts] - https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) [10:03:47] (PS4) Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) [10:03:53] (CR) Mvolz: [C: +2] Update citoid to 37e45898 [deployment-charts] - https://gerrit.wikimedia.org/r/618713 (https://phabricator.wikimedia.org/T259469) (owner: Mvolz) [10:04:29] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [10:04:43] (CR) Hashar: [C: +1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: Hashar) [10:04:58] (Merged) jenkins-bot: Update citoid to 37e45898 [deployment-charts] - https://gerrit.wikimedia.org/r/618713 (https://phabricator.wikimedia.org/T259469) (owner: Mvolz) [10:05:34] ls [10:07:21] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [10:07:22] shame-elukey-for-mixing-up-windows.txt [10:07:53] ahahahh well deserved [10:11:29] !log mvolz@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[10:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:56] (CR) Hashar: [C: +1] "PS3 gives more context in the commit message" [puppet] - https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: Hashar) [10:12:02] (PS5) Hnowlan: Add discovery and disabled LVS components for API gateway [puppet] - https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) [10:12:29] !log jynus@cumin2001 START - Cookbook sre.hosts.downtime [10:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:42] !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:56] (PS1) Elukey: Add basic Debian packaging [debs/hue] - https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) [10:16:50] !log mvolz@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:44] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [10:23:43] (CR) Filippo Giunchedi: [C: +2] hieradata: add alert[12]001 to monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/618719 (https://phabricator.wikimedia.org/T247966) (owner: Filippo Giunchedi) [10:23:48] !log mvolz@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[10:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:51] (CR) Ayounsi: [C: +1] "+1 on the basis that this briefly saturated one of our outbound link" [puppet] - https://gerrit.wikimedia.org/r/618717 (https://phabricator.wikimedia.org/T192688) (owner: Ema) [10:30:38] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:14] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:35:54] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [10:36:04] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [10:36:28] (PS3) Hnowlan: Add api.wikimedia.org and api.m.wikimedia.org DNS entries [dns] - https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945) (owner: Ladsgroup) [10:37:48] (PS2) Mvolz: Update zotero to use buster10 images [deployment-charts] - https://gerrit.wikimedia.org/r/618716 (https://phabricator.wikimedia.org/T258158) [10:38:30] Operations, observability: Making centrallog syslog easier and faster to work with - https://phabricator.wikimedia.org/T254605 (fgiunchedi) Something else that occurred to me today while debugging {T259780}: sometimes it is useful to be able to access the "syslog firehose" for fleet-wide real time monito... 
[10:39:11] (PS2) Alexandros Kosiaris: admin: add Edward Tadros to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/609158 (https://phabricator.wikimedia.org/T256435) (owner: Ssingh) [10:41:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:41:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [10:41:58] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org [10:42:20] (CR) Mvolz: [C: +2] Update zotero to use buster10 images [deployment-charts] - https://gerrit.wikimedia.org/r/618716 (https://phabricator.wikimedia.org/T258158) (owner: Mvolz) [10:43:12] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/618719 (https://phabricator.wikimedia.org/T247966) (owner: Filippo Giunchedi) [10:43:21] (Merged) jenkins-bot: Update zotero to use buster10 images [deployment-charts] - https://gerrit.wikimedia.org/r/618716 (https://phabricator.wikimedia.org/T258158) (owner: Mvolz) [10:43:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:43:40] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [10:45:12] RECOVERY - Prometheus jobs reduced 
availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:45:50] !log mvolz@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [10:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:13] (CR) Jcrespo: [C: +2] mariadb-backups: Reenable notifications for dbprov2003 after maintenance [puppet] - https://gerrit.wikimedia.org/r/618720 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo) [10:48:27] !log mvolz@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [10:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:06] (CR) Hashar: [C: -1] "Moritz wrote:" [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn) [10:52:45] !log mvolz@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[10:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:45] (PS1) Ladsgroup: Fix CachingFallbackLabelDescriptionLookup failing in edge-cases [extensions/Wikibase] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618579 (https://phabricator.wikimedia.org/T259744) [10:54:00] (CR) Ladsgroup: [C: +2] Fix CachingFallbackLabelDescriptionLookup failing in edge-cases [extensions/Wikibase] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618579 (https://phabricator.wikimedia.org/T259744) (owner: Ladsgroup) [10:56:32] (CR) jerkins-bot: [V: -1] Fix CachingFallbackLabelDescriptionLookup failing in edge-cases [extensions/Wikibase] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618579 (https://phabricator.wikimedia.org/T259744) (owner: Ladsgroup) [10:57:10] jouncebot: refresh [10:57:11] I refreshed my knowledge about deployments. [10:57:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:57:16] thx :) [10:58:51] (PS1) Lucas Werkmeister (WMDE): Pass jQuery objects into jqueryMsg [extensions/Flow] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618580 [10:59:00] (CR) Hashar: [C: +1] Scap: git_fat -> git_binary_manager [software/cassandra-twcs] - https://gerrit.wikimedia.org/r/404228 (https://phabricator.wikimedia.org/T184882) (owner: Thcipriani) [10:59:21] jouncebot: refresh [10:59:22] I refreshed my knowledge about deployments. [10:59:43] (CR) Alexandros Kosiaris: [C: +2] admin: add Edward Tadros to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/609158 (https://phabricator.wikimedia.org/T256435) (owner: Ssingh) [10:59:47] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T1100). [11:00:04] Lucas_WMDE: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:16] o/ [11:00:22] o/ [11:00:41] (CR) Hashar: [C: +1] Scap: git_fat -> git_binary_manager [software/logstash-logback-encoder] - https://gerrit.wikimedia.org/r/404227 (https://phabricator.wikimedia.org/T184882) (owner: Thcipriani) [11:00:59] (CR) Hashar: [C: +1] Scap: git_fat -> git_binary_manager [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/404226 (https://phabricator.wikimedia.org/T184882) (owner: Thcipriani) [11:01:00] (CR) jerkins-bot: [V: -1] Pass jQuery objects into jqueryMsg [extensions/Flow] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618580 (owner: Lucas Werkmeister (WMDE)) [11:01:14] Lucas_WMDE: are you deploying? [11:01:19] or should I? [11:01:20] yup, just a second [11:01:24] coolio [11:01:26] Thanks! [11:01:27] I can do it, I have another backport too [11:02:03] (CR) Hashar: [C: +1] Scap: git_fat -> git_binary_manager [software/prometheus_jmx_exporter] - https://gerrit.wikimedia.org/r/404224 (https://phabricator.wikimedia.org/T184882) (owner: Thcipriani) [11:02:30] (CR) Lucas Werkmeister (WMDE): [V: +2 C: +2] "CI fails due to an npmjs.com outage (https://status.npmjs.org/incidents/cksjqc1w11v5). Force-merging and deploying with extra caution." 
[extensions/Wikibase] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618579 (https://phabricator.wikimedia.org/T259744) (owner: Ladsgroup) [11:02:53] (CR) Hashar: [C: +1] Scap: git_fat -> git_binary_manager [software/logstash/plugins] - https://gerrit.wikimedia.org/r/404222 (https://phabricator.wikimedia.org/T184882) (owner: Thcipriani) [11:03:28] * Amir1 sits here to be mad at npm [11:03:46] first seeing if I can reproduce the bug *without* deploying the fix [11:05:13] someone at my door, brb [11:05:52] back [11:07:16] Amir1: do you know how to reproduce this? [11:07:38] no tbh but also isn't it rolledbacked? [11:07:46] I tried =mw.wikibase.getLabel('Q11') and =mw.wikibase.getLabel('Q11', 'de-formal') in a Lua console on test wikidata, but that doesn’t do anything [11:07:49] all groups? [11:08:00] no, group0 should still have the bug, right? [11:08:04] hmm, I thought you're checking it on wikidata [11:08:18] no, test wikidata [11:08:24] test is good it seems [11:08:31] Operations, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (akosiaris) a:DVrandecic [11:08:37] does logstash have anything for test wikidata? [11:09:06] I’m looking at the mwdebug1002 board because I’m testing with X-Wikimedia-Debug, nothing there [11:09:10] I can check the mediawiki-errors board too [11:09:57] nothing there either afaict [11:11:00] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (akosiaris) Open→Resolved a:akosiaris Change merged, user added to the NDA group as requested. @Edtadros you should be good to go. I 'll re... 
[11:11:12] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (akosiaris) [11:11:59] Operations, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (akosiaris) p:Triage→Medium [11:13:07] !log drain traffic away cr2-eqdfw - T259621 [11:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:43] (CR) Ema: ATS: add new backend for phabricator aphlict (1 comment) [puppet] - https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [11:16:09] Amir1: so what should we do about that change? [11:16:16] I still haven’t been able to reproduce the bug it fixes [11:16:42] hmm, I'm not 100% sure, I'd better safe than sorry [11:16:46] *I'd say [11:18:08] (PS6) ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [11:20:24] and that means deploy or not deploy? ^^ [11:21:54] aha! 
managed to reproduce it [11:21:57] finally [11:21:59] https://test.wikidata.org/wiki/Module_talk:T259744?uselang=%E2%A7%BCLang%E2%A7%BD [11:22:00] T259744: Argument 3 passed to CachingFallbackLabelDescriptionLookup::buildCacheKey() must be of the type string, null given - https://phabricator.wikimedia.org/T259744 [11:22:24] ok, let’s scap pull and see if that fixes it [11:22:32] (CR) Alexandros Kosiaris: [C: +1] Detect kubeconfig as known argument in plugin invocations [debs/helm] - https://gerrit.wikimedia.org/r/618556 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm) [11:22:52] !log reboot cr2-eqdfw - T259621 [11:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:22] change is on mwdebug1001 [11:23:32] yup, that fixes it [11:23:46] testing a bit of other stuff just to see if anything else breaks [11:24:57] looks fine as far as I can tell [11:25:02] Amir1: ok if I sync? [11:25:23] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:25:28] I think so [11:25:30] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:25:31] Thanks! [11:25:49] XioNoX: ^ [11:25:54] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:26:00] yep, expected [11:26:16] syncing [11:26:29] Cool. Thanks! 
[11:26:29] and +2ing my second backport to get CI going [11:26:37] Operations: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (Peachey88) [11:26:38] (CR) Lucas Werkmeister (WMDE): [C: +2] "backporting this" [extensions/Flow] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618580 (owner: Lucas Werkmeister (WMDE)) [11:26:51] let’s see if npmjs recovered [11:26:53] (PS7) ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [11:26:54] Is npm back? [11:26:59] Operations, observability: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (Peachey88) [11:27:17] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.3/extensions/Wikibase/lib/: Backport: [[gerrit:618579|Fix CachingFallbackLabelDescriptionLookup failing in edge-cases (T259744)]] (duration: 01m 10s) [11:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:20] T259744: Argument 3 passed to CachingFallbackLabelDescriptionLookup::buildCacheKey() must be of the type string, null given - https://phabricator.wikimedia.org/T259744 [11:28:12] (CR) jerkins-bot: [V: -1] Additional prefixes for sdoc for wcqs [puppet] - https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) (owner: ZPapierski) [11:28:38] gate-and-submit hasn’t failed yet, at least [11:29:42] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:30:30] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:30:40] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:34:03] cr2-eqdfw is all back 
to normal [11:34:06] cr2-eqord soon [11:34:07] Amir1: might not be reliably back yet: https://status.npmjs.org/ [11:37:56] !log drain traffic away cr2-eqord - T259621 [11:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:24] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:41:15] Operations, Traffic, Patch-For-Review: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (ema) Open→Stalled p:Medium→Lowest [11:41:46] Operations, Traffic, Patch-For-Review: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (ema) Workaround deployed, now stalling while waiting for a proper solution to be implemented in mtail. [11:47:44] Operations, Epic, Performance-Team (Radar), Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206 (Gilles) Open→Resolved a:Gilles [11:47:53] Operations, Epic, Performance-Team (Radar), Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213 (Gilles) Open→Resolved a:Gilles [11:47:56] Operations, Epic, Performance-Team (Radar), Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206 (Gilles) [11:50:43] (Merged) jenkins-bot: Pass jQuery objects into jqueryMsg [extensions/Flow] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618580 (owner: Lucas Werkmeister (WMDE)) [11:50:58] alright, doing that backport [11:51:29] (CR) Ema: [C: +2] varnish: add Go-http-client to cache_upload naughty list [puppet] - https://gerrit.wikimedia.org/r/618717 (https://phabricator.wikimedia.org/T192688) (owner: Ema) [11:51:50] Operations: Change of 
nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (Aklapper) [11:52:16] yup, works like a charm (tested on mwdebug1001) [11:52:18] syncing [11:52:40] Lucas_WMDE: ping me when Backport window is done, need to update cxserver. [11:53:02] yup [11:53:35] !log reboot cr2-eqord - T259621 [11:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:08] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.3/extensions/Flow/: Backport: [[gerrit:618580|Pass jQuery objects into jqueryMsg]] (duration: 01m 09s) [11:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:27] !log EU backport window done [11:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:32] kart_: the floor is yours [11:55:07] Operations, ops-codfw, RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (hnowlan) I'm looking into this today - I see that restbase2009 is up 9 days, has been configured by puppet and added to the Cassandra cluster but I don't see anything in SAL about who did it. Still investi... [11:56:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:56:34] Lucas_WMDE: Thanks! [11:57:09] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . 
[11:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:57:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:58:50] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:59:06] Operations, Domains, Traffic: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (Peachey88) [11:59:22] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:59:28] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[11:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:01:18] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:01:40] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:03:15] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [12:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:00] cr2-eqord back to normal [12:06:34] !log Updated cxserver to 2020-08-05-070016-production (T258919, T199523, T257943, T256194) [12:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:40] T258919: Enable MT based on closely-related languages based on community input - https://phabricator.wikimedia.org/T258919 [12:06:40] T257943: Create Wikipedia Kotava - https://phabricator.wikimedia.org/T257943 [12:06:40] T199523: Expose Machine Translation services supporting Chinese to closer languages/variants - https://phabricator.wikimedia.org/T199523 [12:06:40] T256194: Provide section order information in the section suggestions API - https://phabricator.wikimedia.org/T256194 [12:16:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:21:40] 10Operations, 10Wikimedia-Logstash: Kibana ng sending telemetry to elastic.io - https://phabricator.wikimedia.org/T259794 (10Rxy) [12:22:05] 10Operations, 10Wikimedia-Logstash: Kibana next sending telemetry to elastic.io - https://phabricator.wikimedia.org/T259794 (10Rxy) [12:22:10] (03PS1) 10Ema: varnish: lower cache_upload rate limit for Facebook [puppet] - 10https://gerrit.wikimedia.org/r/618736 (https://phabricator.wikimedia.org/T192688) [12:22:34] 10Operations, 10Wikimedia-Logstash: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (10Rxy) [12:22:48] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:23:12] 10Operations, 10Wikimedia-Logstash, 10Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (10Majavah) [12:24:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:24:38] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:25:28] 10Operations, 10observability: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (10fgiunchedi) I've captured core dumps on centrallog2001 for this issue, unclear yet what the root cause is. The trigger was a big influx of firewall drop logs for NRPE (port 5666) from a... 
[12:29:58] (03PS8) 10ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [12:30:36] 10Operations, 10Wikimedia-Logstash, 10Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (10Rxy) [12:36:20] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:40:04] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:40:38] (03PS1) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 [12:41:42] (03CR) 10jerkins-bot: [V: 04-1] Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [12:42:03] (03CR) 10Elukey: [C: 03+1] Scap: git_fat -> git_binary_manager [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/404224 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [12:45:14] (03PS1) 10Kormat: admin: Update kormat configs [puppet] - 10https://gerrit.wikimedia.org/r/618739 [12:46:26] (03CR) 10Kormat: [C: 03+2] admin: Update kormat configs [puppet] - 10https://gerrit.wikimedia.org/r/618739 (owner: 10Kormat) [12:47:19] (03PS2) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 [12:48:13] (03CR) 10jerkins-bot: [V: 04-1] Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [12:51:25] (03PS3) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 [12:51:48] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is 
CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:52:26] (03CR) 10jerkins-bot: [V: 04-1] Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [12:53:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:57:01] (03PS9) 10ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [12:58:36] (03PS4) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 [13:00:57] (03PS5) 10Kormat: Split utilities into separate packages [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 (https://phabricator.wikimedia.org/T259516) [13:01:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:01:42] (03PS5) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 [13:02:09] (03PS10) 10ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [13:02:38] 10Operations, 10Wikimedia-Logstash, 10observability, 10Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (10jcrespo) [13:03:34] (03CR) 10jerkins-bot: [V: 04-1] Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) (owner: 10ZPapierski) [13:07:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:07:32] (03PS11) 10ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [13:08:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:09:31] (03CR) 10JMeybohm: [C: 03+2] Detect kubeconfig as known argument in plugin invocations [debs/helm] - 10https://gerrit.wikimedia.org/r/618556 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [13:12:06] (03CR) 10ZPapierski: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24353/ - changes are related to prefixes file configuration." [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) (owner: 10ZPapierski) [13:12:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:15:27] (03Merged) 10jenkins-bot: Detect kubeconfig as known argument in plugin invocations [debs/helm] - 10https://gerrit.wikimedia.org/r/618556 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [13:15:29] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/618542 (owner: 10Alexandros Kosiaris) [13:18:27] (03CR) 10JMeybohm: [C: 03+2] helm: Replace repo update cronjob by systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/618350 (owner: 10JMeybohm) [13:18:29] (03CR) 10Gilles: [C: 03+1] arclamp: require python-swiftclient [puppet] - 10https://gerrit.wikimedia.org/r/618626 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [13:18:36] (03CR) 10JMeybohm: [C: 03+2] helm: Replace repo update cronjob by systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/618350 (owner: 10JMeybohm) [13:20:05] (03CR) 10Volans: "Thanks for adventuring into your first cookbook! Looks mostly ok, few nits inline. Ping me offline if something is not clear or you have a" (0313 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [13:20:19] (03PS1) 10Kormat: Use 'native' debian format, and exclude irrlevant dirs. [software/transferpy] - 10https://gerrit.wikimedia.org/r/618743 [13:22:17] (03CR) 10Gilles: "This has been stalled for a couple of months. @Cdanis who else do you think should review this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [13:23:40] 10Operations, 10Wikimedia-Logstash, 10observability, 10Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (10Rxy) perhaps rOPUP[/modules/kibana/manifests/init.pp$39-47](https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/kibana/... [13:24:19] (03CR) 10CDanis: [C: 03+1] "This is LGTM from me. I thought dpifke had +2 on the repo and could merge? If not I'm happy to do that." [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [13:24:33] !log imported helm_2.16.9-2 and tiller_2.16.9-2 to buster-wikimedia, jessie-wikimedia and stretch-wikimedia [13:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:47] (03PS1) 10Volans: dbmonitor: use default 1H TTL [dns] - 10https://gerrit.wikimedia.org/r/618744 [13:24:50] (03CR) 10Ayounsi: Configure transport links OSPF based on Netbox data (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277) (owner: 10Ayounsi) [13:26:16] (03PS2) 10Kormat: Use 'native' debian format, and exclude irrelevant dirs. [software/transferpy] - 10https://gerrit.wikimedia.org/r/618743 [13:26:28] (03PS1) 10Ema: cache: add type Profile::Cache::Sites [puppet] - 10https://gerrit.wikimedia.org/r/618745 [13:27:31] (03PS3) 10Kormat: Use 'native' debian format, and exclude irrelevant dirs. [software/transferpy] - 10https://gerrit.wikimedia.org/r/618743 [13:27:53] (03CR) 10jerkins-bot: [V: 04-1] cache: add type Profile::Cache::Sites [puppet] - 10https://gerrit.wikimedia.org/r/618745 (owner: 10Ema) [13:28:14] (03CR) 10Gilles: "He doesn't, unfortunately. Is there a process for him to request getting +2 in operations/puppet?" 
[puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [13:30:24] (03PS2) 10Ema: cache: add type Profile::Cache::Sites [puppet] - 10https://gerrit.wikimedia.org/r/618745 [13:32:28] !log updated helm to 2.16.9-2 on contint*, deploy* and chartmuseum* [13:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:34] (03PS3) 10Ema: cache: add type Profile::Cache::Sites [puppet] - 10https://gerrit.wikimedia.org/r/618745 [13:40:45] (03PS4) 10Ema: cache: add type Profile::Cache::Sites [puppet] - 10https://gerrit.wikimedia.org/r/618745 [13:42:12] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/618745 (owner: 10Ema) [13:48:56] (03CR) 10Ppchelko: [C: 04-1] "For health checks, why not use https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/filter/http/health_check/v2/health_check.proto wi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [13:49:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:49:53] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Move x1 and misc logical dumps to dbprov1003 [puppet] - 10https://gerrit.wikimedia.org/r/618722 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:50:24] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:51:01] (03CR) 10Ppchelko: [C: 04-1] "As for stats endpoint, is the purpose to use /stats/prometheus to enable native prometheus exporting?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [13:51:39] (03PS1) 10Kormat: Ignore debuild-generated files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618749 [13:53:23] (03CR) 10Jcrespo: [C: 03+1] "Why not just make cumin a build depends?" [software/transferpy] - 10https://gerrit.wikimedia.org/r/618743 (owner: 10Kormat) [13:54:14] (03CR) 10Jcrespo: [C: 03+1] Ignore debuild-generated files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618749 (owner: 10Kormat) [13:54:54] (03CR) 10Jcrespo: [C: 03+1] dbmonitor: use default 1H TTL [dns] - 10https://gerrit.wikimedia.org/r/618744 (owner: 10Volans) [13:55:48] (03CR) 10Kormat: "> Patch Set 3: Code-Review+1" [software/transferpy] - 10https://gerrit.wikimedia.org/r/618743 (owner: 10Kormat) [13:56:10] (03PS3) 10Ottomata: Add eventgate service specific test.event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618550 (https://phabricator.wikimedia.org/T251935) [13:56:15] (03CR) 10Kormat: [C: 03+2] Use 'native' debian format, and exclude irrelevant dirs. [software/transferpy] - 10https://gerrit.wikimedia.org/r/618743 (owner: 10Kormat) [13:56:41] (03CR) 10Kormat: [C: 03+2] Ignore debuild-generated files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618749 (owner: 10Kormat) [13:56:46] (03Merged) 10jenkins-bot: Use 'native' debian format, and exclude irrelevant dirs. [software/transferpy] - 10https://gerrit.wikimedia.org/r/618743 (owner: 10Kormat) [13:57:30] (03CR) 10Ottomata: [C: 03+2] Add eventgate service specific test.event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618550 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [13:58:08] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Eevans) >>! 
In T256863#6365459, @hnowlan wrote: > I'm looking into this today - I see that restbase2009 is up 9 days, has been configured by puppet and added to the Cassandra cluster but I don't see anythi... [13:59:14] (03PS1) 10Andrew Bogott: openstack haproxy: increase server timeout to 120s [puppet] - 10https://gerrit.wikimedia.org/r/618753 [14:00:54] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Add eventgate-* test.event streams - T251935 (duration: 01m 08s) [14:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:58] T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 [14:01:36] 10Operations, 10netops, 10Patch-For-Review: OSPF metrics - https://phabricator.wikimedia.org/T200277 (10ayounsi) Discussed it with Faidon and created/populated the custom fields in Netbox. [14:02:51] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Eevans) >>! In T256863#6365705, @Eevans wrote: >>>! In T256863#6365459, @hnowlan wrote: >> I'm looking into this today - I see that restbase2009 is up 9 days, has been configured by puppet and added to the... 
[14:05:36] kormat: CI can build the transferpy debian package ;) [14:06:21] there is some documentation for it at https://wikitech.wikimedia.org/wiki/Debian_Glue ;) [14:08:23] (03CR) 10Andrew Bogott: [C: 03+2] openstack haproxy: increase server timeout to 120s [puppet] - 10https://gerrit.wikimedia.org/r/618753 (owner: 10Andrew Bogott) [14:09:29] (03CR) 10Marostegui: [C: 03+1] dbmonitor: use default 1H TTL [dns] - 10https://gerrit.wikimedia.org/r/618744 (owner: 10Volans) [14:09:31] hashar: huh, TIL [14:10:03] (03PS3) 10Ottomata: eventgate - use /v1/_test/events route for readinessProbe [deployment-charts] - 10https://gerrit.wikimedia.org/r/618624 (https://phabricator.wikimedia.org/T251935) [14:10:08] (03PS4) 10Ottomata: eventgate - use /v1/_test/events route for readinessProbe [deployment-charts] - 10https://gerrit.wikimedia.org/r/618624 (https://phabricator.wikimedia.org/T251935) [14:10:28] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 58 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:10:34] (03CR) 10Volans: [C: 03+2] dbmonitor: use default 1H TTL [dns] - 10https://gerrit.wikimedia.org/r/618744 (owner: 10Volans) [14:16:18] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 46 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:16:27] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Papaul) @Eevans @hnowlan this is a different machine same disks. 
The disks were taken out of the old machine and placed into the new machine [14:17:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:29] (03PS5) 10Hnowlan: api-gateway: open parts of the admin interface internally [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) [14:21:10] (03PS5) 10CDanis: Add check_prometheus rules for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [14:21:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:23:00] (03CR) 10CDanis: [C: 03+2] Add check_prometheus rules for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [14:23:07] (03CR) 10Multichill: [C: 04-1] "Bad solution, see phabricator. Wrong approach." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: 10Tobias Andersson) [14:29:13] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Switch service-checker-image to python3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/618542 (owner: 10Alexandros Kosiaris) [14:37:00] (03PS5) 10Ottomata: eventgate - use /v1/_test/events route for readinessProbe [deployment-charts] - 10https://gerrit.wikimedia.org/r/618624 (https://phabricator.wikimedia.org/T251935) [14:38:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:38:47] (03CR) 10Ottomata: [C: 03+2] eventgate - use /v1/_test/events route for readinessProbe [deployment-charts] - 10https://gerrit.wikimedia.org/r/618624 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [14:40:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:42:44] (03PS1) 10Ottomata: eventgate-logging-external - use api-ro.discovery.wmnet for remote stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/618762 [14:44:57] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): Assess whether we should still disable seccomp in Docker for CI - https://phabricator.wikimedia.org/T249729 (10hashar) 05Open→03Resolved [14:46:10] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - use api-ro.discovery.wmnet for remote stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/618762 (owner: 10Ottomata) [14:46:46] (03PS8) 10Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) [14:50:53] !log fdans@deploy1001 Started deploy [analytics/refinery@97a02a3]: Regular analytics weekly train [analytics/refinery@97a02a3 [14:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:21] 10Operations, 10DNS, 10Traffic: Verify diff.wikimedia.org ownership for Facebook - https://phabricator.wikimedia.org/T259807 (10CKoerner_WMF) [14:55:39] (03PS6) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 [14:56:47] (03CR) 10jerkins-bot: [V: 04-1] Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [14:57:01] (03CR) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool (0313 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [14:58:09] (03PS1) 10Filippo Giunchedi: Add Debian packaging [debs/karma] - 10https://gerrit.wikimedia.org/r/618764 
(https://phabricator.wikimedia.org/T258948) [14:58:37] 10Operations, 10Keyholder: After arming a new key in keyholder, the identity file path does not show up - https://phabricator.wikimedia.org/T257329 (10hashar) - 4096 SHA256:qoe6/ybxTT1xw+RXdA1ecioQFh1AYjzGjluYt1uT25s /etc/keyholder.d/deploy_ci_docroot (RSA) Thank you ;) [14:58:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:02:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:02:49] (03PS1) 10Volans: stdlib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 [15:02:51] (03PS1) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [15:02:53] (03PS1) 10Volans: cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 [15:03:03] 10Operations, 10LDAP-Access-Requests: LDAP access to the 'wmf' group for Monte Hurd - https://phabricator.wikimedia.org/T259382 (10akosiaris) 05Open→03Invalid I am gonna close this as invalid. Monte has been around for a long time and is definitely in the wmf group. @Mhurd if there is some kind of access y... 
[15:05:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:07:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:10:55] !log fdans@deploy1001 Finished deploy [analytics/refinery@97a02a3]: Regular analytics weekly train [analytics/refinery@97a02a3 (duration: 20m 01s) [15:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:10] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:13:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:13:48] (03CR) 10Cwhite: "Overall LGTM (haven't tried to build it though)." (031 comment) [debs/karma] - 10https://gerrit.wikimedia.org/r/618764 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [15:14:04] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:14:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:18:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:19:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:21:48] (03CR) 10Volans: "Compiler results on few random hosts that use the define:" [puppet] - 10https://gerrit.wikimedia.org/r/618766 (owner: 10Volans) [15:23:23] (03CR) 10Volans: "Some compiler results available here:" [puppet] - 10https://gerrit.wikimedia.org/r/618767 (owner: 10Volans) [15:24:22] (03PS1) 10Alexandros Kosiaris: aptrepo: Update jenkins gpg release key [puppet] - 10https://gerrit.wikimedia.org/r/618771 (https://phabricator.wikimedia.org/T259116) [15:24:40] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Patch-For-Review: Update Jenkins gpg release key in reprepro - https://phabricator.wikimedia.org/T259116 (10akosiaris) > I could not find where we store that key in puppet :-\ That's cause we don't store it. We just use the fingerprint. 
[15:29:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/618767 (owner: 10Volans) [15:31:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:32:17] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on 8 more wikis ("phase 1") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618773 (https://phabricator.wikimedia.org/T259574) [15:32:24] (03CR) 10Ayounsi: "I don't know Ruby enough to review that." [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [15:32:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:34:59] (03PS1) 10Ottomata: eventgate - test_events cannot be templated; it is needed in values to be used in deployment.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/618775 (https://phabricator.wikimedia.org/T251609) [15:36:28] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:37:20] (03CR) 10Ottomata: [C: 03+2] eventgate - test_events cannot be templated; it is needed in values to be used in deployment.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/618775 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:37:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:37:45] (03CR) 10Ayounsi: "2 comments, otherwise LGTM." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/618766 (owner: 10Volans) [15:37:50] (03CR) 10Ayounsi: [C: 03+1] interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 (owner: 10Volans) [15:40:22] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [15:40:23] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[15:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM to my ruby-untrained eye, see inline for (quite optional) additional tests" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [15:42:28] (03CR) 10Ayounsi: [C: 03+1] interface::alias: add optional is_service_ip param (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/618766 (owner: 10Volans) [15:46:36] (03CR) 10Filippo Giunchedi: [C: 03+1] interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 (owner: 10Volans) [15:46:54] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:46:58] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:47:58] (03CR) 10Ayounsi: [C: 03+1] cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 (owner: 10Volans) [15:49:08] (03PS1) 10Alexandros Kosiaris: Revert "Remove access for nathante" [puppet] - 10https://gerrit.wikimedia.org/r/618779 (https://phabricator.wikimedia.org/T256356) [15:55:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "Remove access for nathante" [puppet] - 10https://gerrit.wikimedia.org/r/618779 (https://phabricator.wikimedia.org/T256356) (owner: 10Alexandros Kosiaris) [15:57:24] (03PS7) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet request window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T1600). Please do the needful. 
[16:04:38] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:05:03] 10Puppet, 10Beta-Cluster-Infrastructure, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, and deployment-docker-proton01 due to Docker version pinning - https://phabricator.wikimedia.org/T259812 (10bd808) [16:06:21] (03PS2) 10Filippo Giunchedi: Add Debian packaging [debs/karma] - 10https://gerrit.wikimedia.org/r/618764 (https://phabricator.wikimedia.org/T258948) [16:06:47] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" (031 comment) [debs/karma] - 10https://gerrit.wikimedia.org/r/618764 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:09:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:11:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:11:23] (03PS2) 10Volans: stdlib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 [16:11:25] (03PS2) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [16:11:27] (03PS2) 10Volans: cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 [16:13:57] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [16:15:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:15:34] (03CR) 10Filippo Giunchedi: [C: 03+1] stdlib: add netmask_to_cidr parser function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [16:18:24] !log chrisalbon@deploy1001 Started deploy [ores/deploy@f3c44be]: T258435 [16:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:27] T258435: ORES deployment Late July 2020 - https://phabricator.wikimedia.org/T258435 [16:18:48] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@7838c88]: Deploying fixes for T259167 [16:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:51] T259167: Truncated ArcLamp output files - https://phabricator.wikimedia.org/T259167 [16:18:54] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@7838c88]: Deploying fixes for T259167 (duration: 00m 05s) [16:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:19:17] (03CR) 10Cwhite: [C: 03+1] "Thanks!" [debs/karma] - 10https://gerrit.wikimedia.org/r/618764 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:19:41] (03CR) 10Volans: [C: 03+1] "Nice! LGTM, last couple of replies inline." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [16:20:13] (03CR) 10Cwhite: [C: 03+2] prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [16:20:55] (03CR) 10Cwhite: [C: 03+2] prometheus: puppetized install of prometheus-es-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [16:25:43] (03CR) 10Volans: "Updated compiler results:" [puppet] - 10https://gerrit.wikimedia.org/r/618767 (owner: 10Volans) [16:27:29] (03PS1) 10Brennen Bearnes: Fix array unpacking as argument list [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618582 (https://phabricator.wikimedia.org/T259745) [16:32:36] !log chrisalbon@deploy1001 Finished deploy [ores/deploy@f3c44be]: T258435 (duration: 14m 12s) [16:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:39] T258435: ORES deployment Late July 2020 - https://phabricator.wikimedia.org/T258435 [16:38:35] (03PS1) 10Gergő Tisza: Fix "Ask mentor" help panel button styling [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618583 (https://phabricator.wikimedia.org/T250235) [16:39:41] (03CR) 10Dzahn: [C: 03+1] "per https://wiki.mozilla.org/Security/DOH-resolver-policy" [puppet] - 10https://gerrit.wikimedia.org/r/618591 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:41:54] 10Operations, 
10Fundraising-Backlog, 10User-Urbanecm, 10User-dancy, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Ladsgroup) Seconding Daniel here. In cases of peak, we had experiences of outages (after dea... [16:44:53] (03PS1) 10Gergő Tisza: Direct GrowthExperiments help panel questions to mentors on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618786 (https://phabricator.wikimedia.org/T250235) [16:45:37] (03CR) 10Gergő Tisza: "To be deployed on Monday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618786 (https://phabricator.wikimedia.org/T250235) (owner: 10Gergő Tisza) [16:45:39] (03PS1) 10JMeybohm: eventgate: Fix repository URL in requirements, bump to 0.2.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618787 [16:46:08] (03CR) 10Ottomata: [C: 03+1] eventgate: Fix repository URL in requirements, bump to 0.2.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618787 (owner: 10JMeybohm) [16:46:26] (03PS1) 10Volans: wmcs: remove unused leftover records [dns] - 10https://gerrit.wikimedia.org/r/618788 [16:46:42] (03CR) 10JMeybohm: [C: 03+2] eventgate: Fix repository URL in requirements, bump to 0.2.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618787 (owner: 10JMeybohm) [16:47:44] (03Merged) 10jenkins-bot: eventgate: Fix repository URL in requirements, bump to 0.2.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618787 (owner: 10JMeybohm) [16:49:51] (03CR) 10Bstorm: [C: 03+1] wmcs: remove unused leftover records [dns] - 10https://gerrit.wikimedia.org/r/618788 (owner: 10Volans) [16:51:15] (03PS1) 10JMeybohm: changeprop: Fix repository URL in requirements, bump to 0.9.52 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618790 [16:52:38] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10User-dancy, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - 
https://phabricator.wikimedia.org/T259002 (10DStrine) Fishbowl please and thanks! [16:53:53] (03CR) 10Hnowlan: [C: 03+1] changeprop: Fix repository URL in requirements, bump to 0.9.52 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618790 (owner: 10JMeybohm) [16:54:50] 10Operations, 10Traffic: Enable DNSSEC validation in Wikidough - https://phabricator.wikimedia.org/T259816 (10ssingh) [16:56:08] 10Operations, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [16:56:10] 10Operations, 10Traffic: Enable DNSSEC validation in Wikidough - https://phabricator.wikimedia.org/T259816 (10ssingh) [16:59:54] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10WDoranWMF) @Papaul @wkandek @akosiaris @wiki_willy Hope everyone is relatively well. I've also sent this as an email. There is an issue as a result of 2009 coming back online. The rough chronology I hav... [17:00:04] halfak and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T1700). [17:03:28] i'm going to get https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/618582 out and roll the train to group1 shortly - cc: dancy. 
[17:04:46] (03CR) 10Brennen Bearnes: [C: 03+2] Fix array unpacking as argument list [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618582 (https://phabricator.wikimedia.org/T259745) (owner: 10Brennen Bearnes) [17:08:09] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10Bstorm) [17:14:31] (03CR) 10Volans: [C: 03+2] wmcs: remove unused leftover records [dns] - 10https://gerrit.wikimedia.org/r/618788 (owner: 10Volans) [17:15:54] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:16:03] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10Bstorm) [17:23:41] (03Merged) 10jenkins-bot: Fix array unpacking as argument list [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618582 (https://phabricator.wikimedia.org/T259745) (owner: 10Brennen Bearnes) [17:25:46] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:28:54] 10Puppet, 10Beta-Cluster-Infrastructure, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployment-docker-proton01 due to Docker version pinning - https://phabricator.wikimedia.org/T259812 (... 
[17:31:06] (03PS1) 10Dzahn: set TTL for webperf* static entries to default 1H [dns] - 10https://gerrit.wikimedia.org/r/618792 [17:31:15] volans: ^ fixing that [17:31:28] (03PS1) 10Cwhite: prometheus: add first draft query to es_exporter [puppet] - 10https://gerrit.wikimedia.org/r/618793 (https://phabricator.wikimedia.org/T256418) [17:33:49] (03CR) 10Volans: "Thanks! see one comment inline" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/618792 (owner: 10Dzahn) [17:33:50] thx mutante [17:34:21] you got to it before me, still stuck in other rabbit holes [17:35:05] (03PS2) 10Dzahn: set TTL for webperf* static entries to default 1H [dns] - 10https://gerrit.wikimedia.org/r/618792 [17:35:13] (03CR) 10Cwhite: [C: 03+2] prometheus: add first draft query to es_exporter [puppet] - 10https://gerrit.wikimedia.org/r/618793 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [17:35:43] (03PS1) 10BryanDavis: ci: Bump Buster docker.io version to match apt repos [puppet] - 10https://gerrit.wikimedia.org/r/618795 (https://phabricator.wikimedia.org/T259812) [17:35:49] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [dns] - 10https://gerrit.wikimedia.org/r/618792 (owner: 10Dzahn) [17:36:14] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.3/extensions/WikibaseMediaInfo/src/View/MediaInfoEntityTermsView.php: Backport: [[gerrit:618582|Fix array unpacking as argument list]] (T259745) (duration: 01m 07s) [17:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:18] T259745: Uncaught ArgumentCountError: Too few arguments to function OOUI\Tag::appendContent(), 0 passed - https://phabricator.wikimedia.org/T259745 [17:37:52] !log train 1.36.0-wmf.3: proceeding to group1 [17:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:35] (03PS1) 10Brennen Bearnes: group1 wikis to 1.36.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618796 [17:38:37] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 
1.36.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618796 (owner: 10Brennen Bearnes) [17:39:11] (03CR) 10BryanDavis: "Currently pinned version has been replaced in apt repo" [puppet] - 10https://gerrit.wikimedia.org/r/618795 (https://phabricator.wikimedia.org/T259812) (owner: 10BryanDavis) [17:39:24] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618796 (owner: 10Brennen Bearnes) [17:39:38] (03CR) 10Dzahn: "e" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/618792 (owner: 10Dzahn) [17:40:32] 10Puppet, 10Beta-Cluster-Infrastructure, 10VPS-Projects, 10Patch-For-Review, and 2 others: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployment-docker-prot... - https://phabricator.wikimedia.org/T259812 [17:41:32] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.3 [17:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:39] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.3 (duration: 01m 06s) [17:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:27] (03CR) 10Dzahn: [C: 03+2] set TTL for webperf* static entries to default 1H [dns] - 10https://gerrit.wikimedia.org/r/618792 (owner: 10Dzahn) [17:48:39] (03PS1) 10Cwhite: profile,prometheus: create define for prometheus-es-exporter configs [puppet] - 10https://gerrit.wikimedia.org/r/618797 (https://phabricator.wikimedia.org/T256418) [17:50:58] (03CR) 10Andrew Bogott: [C: 03+2] ci: Bump Buster docker.io version to match apt repos [puppet] - 10https://gerrit.wikimedia.org/r/618795 (https://phabricator.wikimedia.org/T259812) (owner: 10BryanDavis) [17:51:07] (03CR) 10Cwhite: [C: 03+2] "pcc checks out https://puppet-compiler.wmflabs.org/compiler1002/24360/" [puppet] - 10https://gerrit.wikimedia.org/r/618797 
(https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [17:54:27] (03PS1) 10Cwhite: profile:prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/618798 (https://phabricator.wikimedia.org/T256418) [17:54:46] (03CR) 10Cwhite: [C: 03+2] profile:prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/618798 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [17:56:23] (03PS2) 10Cwhite: profile:prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/618798 (https://phabricator.wikimedia.org/T256418) [17:56:27] (03CR) 10Cwhite: [V: 03+2 C: 03+2] profile:prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/618798 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T1800). [18:00:04] Pchelolo, tgr, and MatmaRex: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:34] hello [18:01:00] hello MatmaRex [18:01:03] happy to deploy today :) [18:01:21] (03CR) 10Urbanecm: [C: 03+2] "B&C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618773 (https://phabricator.wikimedia.org/T259574) (owner: 10Bartosz Dziewoński) [18:02:00] Pchelolo: do you want me to ping you at the end to self-service? 
[18:02:07] (03Merged) 10jenkins-bot: Enable DiscussionTools as a beta feature on 8 more wikis ("phase 1") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618773 (https://phabricator.wikimedia.org/T259574) (owner: 10Bartosz Dziewoński) [18:02:16] (03PS1) 10Cwhite: prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/618799 (https://phabricator.wikimedia.org/T256418) [18:02:27] Urbanecm: if you are going to be deploying, would you mind doing mine too? [18:02:33] it's trivial [18:02:36] o/ [18:02:46] (03CR) 10jerkins-bot: [V: 04-1] prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/618799 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [18:03:00] (03PS2) 10Cwhite: prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/618799 (https://phabricator.wikimedia.org/T256418) [18:03:03] not at all, it makes complete sense, so happy to do that too :) [18:03:13] thank you! [18:03:35] (03CR) 10Urbanecm: [C: 03+2] Fix "Ask mentor" help panel button styling [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618583 (https://phabricator.wikimedia.org/T250235) (owner: 10Gergő Tisza) [18:03:37] no problem :) [18:04:01] MatmaRex: your patch is at mwdebug1001 [18:04:41] Urbanecm: seems good [18:04:49] thanks, syncing [18:07:31] (03PS9) 10Urbanecm: Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T259742) (owner: 10Cicalese) [18:07:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 9695811a30de30471a81b6ad05aa5e625f52caf1: : Enable DiscussionTools as a beta feature on 8 more wikis ("phase 1") (T259574) (duration: 01m 06s) [18:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:38] T259574: Make config change to enable Reply Tool as Beta Feature at Phase 1 wikis - https://phabricator.wikimedia.org/T259574 [18:07:38] (03CR) 10Urbanecm: [C: 03+2] 
Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T259742) (owner: 10Cicalese) [18:07:45] MatmaRex: here you go [18:07:57] thanks! [18:08:28] (03Merged) 10jenkins-bot: Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T259742) (owner: 10Cicalese) [18:08:42] my pleasure MatmaRex :) [18:10:53] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 9db96595695b5ec1144c078e8961b3c04e8983cf: Remove temporary logging for mediamoderation (T259742) (duration: 01m 07s) [18:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:57] T259742: Turn off MediaModeration debug logging in Production - https://phabricator.wikimedia.org/T259742 [18:11:27] Pchelolo: done, labs should be done automatically [18:11:41] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [18:11:41] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [18:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:44] thank you! [18:12:44] tgr: as soon as CI allows us, your patch will be ready. Wanna self-service, or should I? 
[18:13:04] Urbanecm: please do if it's no trouble [18:13:10] not at all [18:13:18] (03Merged) 10jenkins-bot: Fix "Ask mentor" help panel button styling [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618583 (https://phabricator.wikimedia.org/T250235) (owner: 10Gergő Tisza) [18:15:36] tgr: available at mwdebug1001 [18:17:20] Urbanecm: thanks, tested [18:17:27] thanks, syncing [18:20:19] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.3/extensions/GrowthExperiments/modules/: fb4a80830d7d915479e097cc82c681c5fb03d51b: Fix "Ask mentor" help panel button styling (T250235) (duration: 01m 07s) [18:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:22] T250235: Scale: pilot help panel with mentorship - https://phabricator.wikimedia.org/T250235 [18:20:25] tgr: done [18:20:38] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: analytics1041.eqiad.wmnet, puppetmaster2001.codfw.wmnet, deneb.codfw.wmnet, wdqs1009.eqiad.wmnet, testreduce1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [18:20:39] thx! [18:20:49] no problem [18:21:21] !log Morning B&C window was completed [18:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:03] 10Puppet, 10Beta-Cluster-Infrastructure, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployment-docker-proton01 due to Docker version pinning - https://phabricator.wikimedia.org/T259812 (... [18:29:51] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [18:29:52] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[18:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:17] (03CR) 10Ebernhardson: [C: 03+2] Scap: git_fat -> git_binary_manager [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/404222 (https://phabricator.wikimedia.org/T184882) (owner: 10Thcipriani) [18:33:19] (03CR) 10Andrew Bogott: [C: 03+2] Add Backy2 module and profile [puppet] - 10https://gerrit.wikimedia.org/r/617841 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [18:34:42] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki2001 - https://phabricator.wikimedia.org/T259825 (10RobH) [18:35:03] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10RobH) [18:35:35] 10Operations, 10ops-codfw, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki2001 - https://phabricator.wikimedia.org/T259825 (10RobH) [18:35:53] 10Operations, 10User-jbond: OKR: Install and configure new CFSSL PKI server - https://phabricator.wikimedia.org/T259117 (10RobH) [18:36:07] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10RobH) [18:36:22] 10Operations, 10User-jbond: OKR: Install and configure new CFSSL PKI server - https://phabricator.wikimedia.org/T259117 (10RobH) [18:36:51] (03CR) 10Ssingh: [C: 03+2] wikidough: enable QNAME minimisation for the dnsrecursor module [puppet] - 10https://gerrit.wikimedia.org/r/618591 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [18:40:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:42:42] RECOVERY - Prometheus jobs reduced 
availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:55:47] (03PS1) 10Bstorm: share storage: remove nfs-manage-binds [puppet] - 10https://gerrit.wikimedia.org/r/618804 (https://phabricator.wikimedia.org/T169570) [18:57:21] (03PS2) 10Bstorm: shared-storage: remove nfs-manage-binds [puppet] - 10https://gerrit.wikimedia.org/r/618804 (https://phabricator.wikimedia.org/T169570) [18:57:59] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [18:57:59] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [18:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] brennen and dancy: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T1900). [19:00:25] * dancy salutes. [19:00:56] dancy: logs still looking good, running deploy-promote. 
[19:01:27] (03PS1) 10Brennen Bearnes: all wikis to 1.36.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618805 [19:01:29] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.36.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618805 (owner: 10Brennen Bearnes) [19:02:14] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618805 (owner: 10Brennen Bearnes) [19:04:23] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.3 [19:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:45] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1003/24361/" [puppet] - 10https://gerrit.wikimedia.org/r/618804 (https://phabricator.wikimedia.org/T169570) (owner: 10Bstorm) [19:05:50] (03CR) 10Bstorm: [C: 03+2] shared-storage: remove nfs-manage-binds [puppet] - 10https://gerrit.wikimedia.org/r/618804 (https://phabricator.wikimedia.org/T169570) (owner: 10Bstorm) [19:06:07] (03PS1) 10Catrope: WelcomeSurvey: Use autonyms for language question [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618806 (https://phabricator.wikimedia.org/T232410) [19:06:35] (03PS1) 10Catrope: WelcomeSurvey: Reuse server-rendered language question field [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618807 (https://phabricator.wikimedia.org/T232410) [19:08:04] (03PS1) 10Ayounsi: Netbox driven interfaces for cr1/2-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/618827 [19:09:15] brennen: Lookin' good [19:09:25] (03PS2) 10Catrope: WelcomeSurvey: Reuse server-rendered language question field [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618807 (https://phabricator.wikimedia.org/T232410) [19:10:02] (03PS1) 10Ssingh: dnsrecursor: update the location of socket-dir for 4.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/618830 [19:10:13] 
(03CR) 10Catrope: "This change is ready for review." [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618807 (https://phabricator.wikimedia.org/T232410) (owner: 10Catrope) [19:10:28] dancy: yep. i'll keep an eye on it, but smooth sailing so far. [19:13:53] (03CR) 10Ssingh: "Confirming no change to dns2001 and cloudservices:" [puppet] - 10https://gerrit.wikimedia.org/r/618830 (owner: 10Ssingh) [19:18:47] (03CR) 10Andrew Bogott: [C: 03+1] dnsrecursor: update the location of socket-dir for 4.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/618830 (owner: 10Ssingh) [19:19:36] (03CR) 10Ssingh: [C: 03+2] dnsrecursor: update the location of socket-dir for 4.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/618830 (owner: 10Ssingh) [19:22:14] 10Operations, 10Analytics-Radar, 10Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843 (10Nathan708) is there any legal caution mated for such an act; because it looks as if these guys mirror wikipedia. But they can only mirror wikipedi... [19:26:04] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:26:04] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:43] 10Operations, 10Toolhub, 10Wikimedia-Mailing-lists, 10User-bd808: Create toolhub-dev@lists.wikimedia.org - https://phabricator.wikimedia.org/T259830 (10bd808) [19:35:12] (03PS1) 10Ottomata: wgEventStreams - fix typo in eventgate stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618833 (https://phabricator.wikimedia.org/T251935) [19:35:46] brennen: train ok? 
lemme know if it is ok to sync ^ [19:36:23] or dancy ^ [19:36:39] Train looks ok. [19:36:56] No alarming log entries. [19:37:45] ok proceeding, shouldn't affect any mw stuff [19:38:40] (03CR) 10Ottomata: [C: 03+2] wgEventStreams - fix typo in eventgate stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618833 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [19:40:13] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - wgEventStreams - fix typo in eventgate stream config - T251935 (duration: 00m 59s) [19:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:17] T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 [19:42:32] (03PS1) 10Ottomata: wgEventStreams - fix another typo in eventgate stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618837 (https://phabricator.wikimedia.org/T251935) [19:42:41] brennen do you know why https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/618038 ( which had cherrypicked 0.13.0-a4 onto wmf-3 branch ) didn't work? It looks like it didn't ride the train .. https://www.mediawiki.org/wiki/Special:Version says that Parsoid is at 0.13.0-a3 /cc cscott [19:43:21] subbu: was just looking at that [19:43:23] unclear [19:43:27] ok. [19:43:49] ah scott was asking in -releng. 
:) [19:44:02] (03CR) 10Ottomata: [C: 03+2] wgEventStreams - fix another typo in eventgate stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618837 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [19:45:27] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - wgEventStreams - fix another typo in eventgate stream config - T251935 (duration: 00m 58s) [19:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:31] T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 [19:47:35] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:47:35] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:11] 10Operations, 10Toolhub, 10Wikimedia-Mailing-lists, 10User-bd808: Create toolhub-dev@lists.wikimedia.org - https://phabricator.wikimedia.org/T259830 (10bd808) [19:51:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:54:09] (03PS1) 10Ottomata: eventgate-* - precache /test/event/1.0.0 schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/618838 (https://phabricator.wikimedia.org/T251935) [19:56:00] (03CR) 10Ottomata: [C: 03+2] eventgate-* - precache /test/event/1.0.0 schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/618838 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [19:57:12] RECOVERY - 
MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:58:13] (03PS1) 10Tchanders: Enable Special:Investigate on French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618839 (https://phabricator.wikimedia.org/T257891) [20:01:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:01:08] (03PS1) 10Ottomata: eventgate - fix httpGet port [deployment-charts] - 10https://gerrit.wikimedia.org/r/618840 [20:02:13] (03CR) 10DannyS712: [C: 03+1] Enable Special:Investigate on French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618839 (https://phabricator.wikimedia.org/T257891) (owner: 10Tchanders) [20:02:20] (03CR) 10Ottomata: [C: 03+2] eventgate - fix httpGet port [deployment-charts] - 10https://gerrit.wikimedia.org/r/618840 (owner: 10Ottomata) [20:02:58] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:06:01] (03PS1) 10Andrew Bogott: backy2: fix up some dependency issues in install [puppet] - 10https://gerrit.wikimedia.org/r/618842 (https://phabricator.wikimedia.org/T259192) [20:06:28] (03CR) 10jerkins-bot: [V: 04-1] backy2: fix up some dependency issues in install [puppet] - 10https://gerrit.wikimedia.org/r/618842 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [20:07:28] (03PS2) 10Andrew Bogott: backy2: fix up some dependency issues in install [puppet] - 10https://gerrit.wikimedia.org/r/618842 (https://phabricator.wikimedia.org/T259192) [20:08:32] (03CR) 10Andrew Bogott: [C: 03+2] backy2: fix up some dependency issues in install [puppet] - 10https://gerrit.wikimedia.org/r/618842 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [20:11:57] (03PS1) 10Ottomata: eventgate - bump chart version to 0.2.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618843 [20:13:11] (03CR) 10Ottomata: [C: 03+2] eventgate - bump chart version to 0.2.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618843 (owner: 10Ottomata) [20:15:53] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [20:15:53] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[20:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:16] jouncebot now [20:19:16] For the next 0 hour(s) and 40 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T1900) [20:19:55] !log manually updating the vendor submodule on 1.36.0 for T259832 [20:19:56] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [20:19:56] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [20:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:58] T259832: Deployed 1.36.0-wmf.3 does not have the 1.36.0-wmf.3 branch of mediawiki-vendor - https://phabricator.wikimedia.org/T259832 [20:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:20] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.8342 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [20:20:28] ottomata: am i going to step on your toes in any way with that? [20:25:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:26:32] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.7518 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [20:27:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:30:34] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.788 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [20:35:51] (03PS1) 10Ottomata: eventgate-* - bump image to 2020-08-06-202915-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618847 [20:37:24] (03CR) 10Ottomata: [C: 03+2] eventgate-* - bump image to 2020-08-06-202915-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618847 (owner: 10Ottomata) [20:38:09] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [20:38:09] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [20:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:15] (03PS4) 10Dzahn: profile::gerrit::migrations: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617676 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [20:39:27] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24364/" [puppet] - 10https://gerrit.wikimedia.org/r/617676 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [20:40:21] (03PS1) 10Andrew Bogott: Introduce role::wmcs::ceph::backup [puppet] - 10https://gerrit.wikimedia.org/r/618849 (https://phabricator.wikimedia.org/T259192) [20:43:13] andrewbogott: can i merge backy2 change? [20:43:29] yes plesae [20:43:32] please [20:43:46] done! 
[20:44:28] thx [20:46:18] (03PS1) 10Brennen Bearnes: Update git submodules [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618850 [20:47:16] (03PS1) 10Ottomata: eventgate-logging-external: use remote stream config, eventgate-analytics-external: use constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/618851 (https://phabricator.wikimedia.org/T251935) [20:47:43] !log restart logstash -- pipeline appears stuck [20:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:31] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external: use remote stream config, eventgate-analytics-external: use constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/618851 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [20:48:53] (03CR) 10Dzahn: "noop" [puppet] - 10https://gerrit.wikimedia.org/r/617676 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [20:49:01] (03PS4) 10Dzahn: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617683 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [20:49:07] (03CR) 10Brennen Bearnes: [C: 03+2] Update git submodules [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618850 (owner: 10Brennen Bearnes) [20:49:39] (03Merged) 10jenkins-bot: eventgate-logging-external: use remote stream config, eventgate-analytics-external: use constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/618851 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [20:49:59] jouncebot next [20:50:00] In 2 hour(s) and 10 minute(s): Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T2300) [20:50:32] (03PS1) 10C. Scott Ananian: Update git submodules [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618852 (https://phabricator.wikimedia.org/T259832) [20:50:57] (03Abandoned) 10C. 
Scott Ananian: Update git submodules [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618852 (https://phabricator.wikimedia.org/T259832) (owner: 10C. Scott Ananian) [20:51:18] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.01629 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [20:51:37] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [20:51:37] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [20:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:53:38] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [20:53:48] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.007107 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [20:54:06] (03PS1) 10Andrew Bogott: Introduce role::wmcs::ceph::backup [puppet] - 10https://gerrit.wikimedia.org/r/618853 (https://phabricator.wikimedia.org/T259192) [20:54:08] (03PS1) 10Andrew Bogott: Retool cloudvirt1004 and cloudvirt1006 as ceph/backy2 test hosts [puppet] - 10https://gerrit.wikimedia.org/r/618854 (https://phabricator.wikimedia.org/T259192) [20:54:19] (03Abandoned) 10Andrew Bogott: Introduce role::wmcs::ceph::backup [puppet] - 
10https://gerrit.wikimedia.org/r/618849 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [20:55:20] (03CR) 10Andrew Bogott: [C: 03+2] Introduce role::wmcs::ceph::backup [puppet] - 10https://gerrit.wikimedia.org/r/618853 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [20:55:36] (03CR) 10Andrew Bogott: [C: 03+2] Retool cloudvirt1004 and cloudvirt1006 as ceph/backy2 test hosts [puppet] - 10https://gerrit.wikimedia.org/r/618854 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [20:59:06] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:59:40] (03PS5) 10Dzahn: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617683 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [21:00:11] (03CR) 10Dzahn: [C: 03+2] "compiler says noop in prod and cloud" [puppet] - 10https://gerrit.wikimedia.org/r/617683 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [21:01:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:10:07] (03Merged) 10jenkins-bot: Update git submodules [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618850 (owner: 10Brennen Bearnes) [21:10:30] (03PS5) 10Dzahn: profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617690 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [21:16:51] (03CR) 10Dzahn: [C: 03+2] profile::gerrit::server: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617690 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [21:18:46] (03PS1) 10Andrew Bogott: cloudvirt100[4-9]: use hp raid recipe [puppet] - 10https://gerrit.wikimedia.org/r/618860 [21:19:22] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt100[4-9]: use hp raid recipe [puppet] - 10https://gerrit.wikimedia.org/r/618860 (owner: 10Andrew Bogott) [21:24:37] (03CR) 10Dzahn: "still all noop in prod" [puppet] - 10https://gerrit.wikimedia.org/r/617690 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [21:26:52] (03PS1) 10Mholloway: Update wikifeeds to 2020-08-06-212118-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618861 [21:29:02] (03CR) 10Dzahn: "the reason for it being "gerrit::server" and not just "gerrit" is historic. 
there used to be a time when we did not have the role/profile " [puppet] - 10https://gerrit.wikimedia.org/r/617691 (owner: 10Jbond) [21:29:41] (03PS2) 10Dzahn: profile::gerrit::server: rename profile [puppet] - 10https://gerrit.wikimedia.org/r/617691 (owner: 10Jbond) [21:29:43] (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2020-08-06-212118-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618861 (owner: 10Mholloway) [21:30:50] (03Merged) 10jenkins-bot: Update wikifeeds to 2020-08-06-212118-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618861 (owner: 10Mholloway) [21:32:41] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [21:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:20] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.3/vendor: [[gerrit:618850|Update git submodules (vendor)]] (T259832) (duration: 01m 08s) [21:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:23] T259832: Deployed 1.36.0-wmf.3 does not have the 1.36.0-wmf.3 branch of mediawiki-vendor - https://phabricator.wikimedia.org/T259832 [21:34:13] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:34:19] beta on master is broken: https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [21:34:57] Amir1: it's fine mobile site if that helps [21:35:53] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[21:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:09] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:39] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [21:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:03] (03CR) 10Dzahn: [C: 03+2] profile::gerrit::server: rename profile [puppet] - 10https://gerrit.wikimedia.org/r/617691 (owner: 10Jbond) [21:40:11] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:13] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:43:11] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:46] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10MMiller_WMF) @kostajh -- are you asking whether we should deactivate EditorJourney in all wikis, so as to stop it from recording data anywhere? If so, I am fi... [22:30:55] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [22:41:58] (03PS1) 10Ladsgroup: Use a new page for mentor list in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618866 (https://phabricator.wikimedia.org/T253291) [23:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200806T2300). [23:00:05] kaldari, RoanKattouw, and Amir1: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:20] o/ [23:01:09] I'll do the deployment [23:01:32] (03CR) 10Catrope: [C: 03+2] Use a new page for mentor list in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618866 (https://phabricator.wikimedia.org/T253291) (owner: 10Ladsgroup) [23:01:44] (03CR) 10Catrope: [C: 03+2] WelcomeSurvey: Use autonyms for language question [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618806 (https://phabricator.wikimedia.org/T232410) (owner: 10Catrope) [23:01:50] (03CR) 10Catrope: [C: 03+2] WelcomeSurvey: Reuse server-rendered language question field [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618807 (https://phabricator.wikimedia.org/T232410) (owner: 10Catrope) [23:02:15] (03Merged) 10jenkins-bot: Use a new page for mentor list in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618866 (https://phabricator.wikimedia.org/T253291) (owner: 10Ladsgroup) [23:04:37] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Change GrowthExperiments mentor list on fawiki 
(T253291) (duration: 00m 59s) [23:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:42] T253291: Deploy Growth features on Persian Wikipedia - https://phabricator.wikimedia.org/T253291 [23:06:18] Amir1: Yours is deployed. It's tricky to test so I didn't put it on the debug server first [23:06:33] thanks. I was thinking of the same [23:06:38] To test, you'd create a new account or enable the homepage on an account that's never had it enabled before, then verify that it gets a mentor assignment [23:09:01] Done and it works just fine [23:09:03] https://usercontent.irccloud-cdn.com/file/KQZVRAGU/image.png [23:09:40] (03Merged) 10jenkins-bot: WelcomeSurvey: Use autonyms for language question [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618806 (https://phabricator.wikimedia.org/T232410) (owner: 10Catrope) [23:10:53] (03Merged) 10jenkins-bot: WelcomeSurvey: Reuse server-rendered language question field [extensions/GrowthExperiments] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618807 (https://phabricator.wikimedia.org/T232410) (owner: 10Catrope) [23:16:27] PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:17:33] (03PS1) 10Cwhite: prometheus: add default count all query [puppet] - 10https://gerrit.wikimedia.org/r/618869 (https://phabricator.wikimedia.org/T256418) [23:17:57] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add default count all query [puppet] - 10https://gerrit.wikimedia.org/r/618869 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [23:19:54] (03PS1) 10Cwhite: profile: update mediawiki errors query to count beyond the 10k limit [puppet] - 10https://gerrit.wikimedia.org/r/618870 (https://phabricator.wikimedia.org/T256418) [23:20:11] (03PS2) 10Cwhite: prometheus: add default count all query [puppet] - 10https://gerrit.wikimedia.org/r/618869 (https://phabricator.wikimedia.org/T256418) [23:20:35] (03PS2) 10Cwhite: profile: update mediawiki errors query to count beyond the 10k limit [puppet] - 10https://gerrit.wikimedia.org/r/618870 (https://phabricator.wikimedia.org/T256418) [23:20:37] (03CR) 10jerkins-bot: [V: 04-1] profile: update mediawiki errors query to count beyond the 10k limit [puppet] - 10https://gerrit.wikimedia.org/r/618870 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [23:21:26] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.3/extensions/GrowthExperiments/: Fixes for WelcomeSurvey language question (T232410) (duration: 00m 59s) [23:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:29] T232410: Newcomer tasks: add language question to welcome survey - https://phabricator.wikimedia.org/T232410 [23:21:54] (03PS3) 10Cwhite: prometheus: add default count all query [puppet] - 10https://gerrit.wikimedia.org/r/618869 (https://phabricator.wikimedia.org/T256418) [23:29:05] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [23:30:41] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [23:42:33] PROBLEM - ensure kvm processes are running on cloudvirt1006 is CRITICAL: connect to address 10.64.20.24 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:42:41] PROBLEM - nova-compute proc minimum on cloudvirt1006 is CRITICAL: connect to address 10.64.20.24 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:42:55] PROBLEM - nova-compute proc maximum on cloudvirt1006 is CRITICAL: connect to address 10.64.20.24 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:43:51] PROBLEM - ensure kvm processes are running on cloudvirt1004 is CRITICAL: connect to address 10.64.20.22 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:43:59] PROBLEM - nova-compute proc minimum on cloudvirt1004 is CRITICAL: connect to address 10.64.20.22 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:44:55] PROBLEM - nova-compute proc maximum on cloudvirt1004 is CRITICAL: connect to address 10.64.20.22 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:46:23] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt1006 is CRITICAL: connect to address 10.64.20.24 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [23:50:37] ACKNOWLEDGEMENT - Check the NTP synchronisation status of 
timesyncd on cloudvirt1004 is CRITICAL: connect to address 10.64.20.22 port 5666: Connection refused andrew bogott this is a phantom from reimaging https://wikitech.wikimedia.org/wiki/NTP [23:50:37] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on cloudvirt1004 is CRITICAL: connect to address 10.64.20.22 port 5666: Connection refused andrew bogott this is a phantom from reimaging https://wikitech.wikimedia.org/wiki/Microcode [23:50:40] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1004 is CRITICAL: connect to address 10.64.20.22 port 5666: Connection refused andrew bogott this is a phantom from reimaging https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:50:41] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1004 is CRITICAL: connect to address 10.64.20.22 port 5666: Connection refused andrew bogott this is a phantom from reimaging https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:50:42] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1004 is CRITICAL: connect to address 10.64.20.22 port 5666: Connection refused andrew bogott this is a phantom from reimaging https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:50:43] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on cloudvirt1006 is CRITICAL: connect to address 10.64.20.24 port 5666: Connection refused andrew bogott this is a phantom from reimaging https://wikitech.wikimedia.org/wiki/NTP [23:50:44] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on cloudvirt1006 is CRITICAL: connect to address 10.64.20.24 port 5666: Connection refused andrew bogott this is a phantom from reimaging https://wikitech.wikimedia.org/wiki/Microcode [23:50:48] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1006 is CRITICAL: connect to address 10.64.20.24 port 5666: Connection refused andrew bogott this 
is a phantom from reimaging https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:50:49] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1006 is CRITICAL: connect to address 10.64.20.24 port 5666: Connection refused andrew bogott this is a phantom from reimaging https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:50:50] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1006 is CRITICAL: connect to address 10.64.20.24 port 5666: Connection refused andrew bogott this is a phantom from reimaging https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:52:02] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [23:52:22] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops