[00:00:11] (PS1) Dzahn: releases: use --delete when rsyncing files between servers [puppet] - https://gerrit.wikimedia.org/r/618411
[00:03:41] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1001/24314/releases1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/618411 (owner: Dzahn)
[00:03:53] (PS1) Dzahn: ATS: switch releases.wm to new buster backend servers [dns] - https://gerrit.wikimedia.org/r/618412
[00:04:01] (CR) jerkins-bot: [V: -1] ATS: switch releases.wm to new buster backend servers [dns] - https://gerrit.wikimedia.org/r/618412 (owner: Dzahn)
[00:04:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:05:51] (PS2) Dzahn: ATS: switch releases.wm to new buster backend servers [dns] - https://gerrit.wikimedia.org/r/618412
[00:20:03] Operations, VPS-Projects, Wikimedia-Mailing-lists, User-Ladsgroup, and 2 others: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (Ladsgroup) Open→Resolved This is done, thanks!
[00:20:07] Operations, Wikimedia-Mailing-lists, User-Ladsgroup: Setup Mailman3 in Cloud VPS - https://phabricator.wikimedia.org/T258365 (Ladsgroup)
[00:21:00] (PS1) Dzahn: httpbb: add test file for releases.wm.org [puppet] - https://gerrit.wikimedia.org/r/618415
[00:42:00] (PS1) Bstorm: haproxy-galera: Make a meaningful healthcheck [puppet] - https://gerrit.wikimedia.org/r/618418
[00:42:41] (PS2) Bstorm: haproxy-galera: Make a meaningful healthcheck [puppet] - https://gerrit.wikimedia.org/r/618418
[00:43:58] (CR) jerkins-bot: [V: -1] haproxy-galera: Make a meaningful healthcheck [puppet] - https://gerrit.wikimedia.org/r/618418 (owner: Bstorm)
[00:44:02] (CR) Bstorm: "This is all assuming we don't use the tcp haproxy listen blocks for anything except mysql. If that's not true, this needs a bit more work" [puppet] - https://gerrit.wikimedia.org/r/618418 (owner: Bstorm)
[00:45:44] (PS3) Bstorm: haproxy-galera: Make a meaningful healthcheck [puppet] - https://gerrit.wikimedia.org/r/618418
[00:52:04] (PS4) Bstorm: haproxy-galera: Make a meaningful healthcheck [puppet] - https://gerrit.wikimedia.org/r/618418
[00:55:24] (CR) Bstorm: "On-server testing show that it is correct for the current state of cloudcontrol1004:" [puppet] - https://gerrit.wikimedia.org/r/618418 (owner: Bstorm)
[00:58:30] (PS6) CRusnov: rotatedump: Enhance to retain period copies [software/netbox-extras] - https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512)
[01:00:12] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[01:01:22] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL -
failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:02:41] (CR) Bstorm: "I do question if I want those shell options at the start of the script. I want it to generally terminate with an http response. eqiad PCC:" [puppet] - https://gerrit.wikimedia.org/r/618418 (owner: Bstorm)
[01:13:12] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:25:22] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[01:48:36] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[01:48:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:53:12] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[02:13:48] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[02:14:28] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[03:02:42] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:27:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:27:57] Operations, Parsoid, serviceops, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry) >>! In T257906#6330994, @Dzahn wrote: > Merging the change above was a noop on scandium. I did not manuall...
[03:31:42] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:45:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:50:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:16:09] Operations, Platform Engineering, Release Pipeline, Release-Engineering-Team-TODO, and 6 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (jeena) Hmm, I tried to deploy again but still couldn't. I would be happy to help with up...
[04:16:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:23:58] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:33:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1136', diff saved to https://phabricator.wikimedia.org/P12163 and previous config saved to /var/cache/conftool/dbconfig/20200805-043346-marostegui.json
[04:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:02] (PS1) Marostegui: mariadb: Reimage db1132 to Buster [puppet] - https://gerrit.wikimedia.org/r/618427 (https://phabricator.wikimedia.org/T259589)
[04:53:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1136', diff saved to https://phabricator.wikimedia.org/P12164 and previous config saved to /var/cache/conftool/dbconfig/20200805-045334-marostegui.json
[04:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1136', diff saved to https://phabricator.wikimedia.org/P12165 and previous config saved to /var/cache/conftool/dbconfig/20200805-050308-marostegui.json
[05:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1136', diff saved to https://phabricator.wikimedia.org/P12166 and previous config saved to /var/cache/conftool/dbconfig/20200805-050808-marostegui.json
[05:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1094 for MCR', diff saved to https://phabricator.wikimedia.org/P12167 and previous config saved to /var/cache/conftool/dbconfig/20200805-050907-marostegui.json
[05:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:48] (CR) Hashar: "Can we get stretch-backports removed from the stretch
base image or is this change pending something else? I am a few changes for CI imag" [puppet] - https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[05:25:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:29:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:38:54] (PS3) Hashar: Stop including backports in Stretch images [puppet] - https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[05:39:21] (CR) Hashar: "Rebased to fix a trivial merge conflict with Ic2b5bfb122ad9d0fc7f4e404f639d9b71114691f" [puppet] - https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[05:44:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:44:22] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:45:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:51:08]
RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:53:16] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[05:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:17:38] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:21:22] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:21:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:06] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:40:39] (CR) Marostegui: [C: +2] wikireplica_dns.yaml: Depool dbproxy1018 [puppet] - https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[06:41:04] PROBLEM - Check
the last execution of generate-mysqld-exporter-config on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:42:20] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:28] PROBLEM - Check size of conntrack table on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:43:38] (CR) Giuseppe Lavagetto: [C: +2] Enable the service proxy on termbox in staging [deployment-charts] - https://gerrit.wikimedia.org/r/618317 (owner: Giuseppe Lavagetto)
[06:45:08] PROBLEM - puppet last run on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:46:11] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[06:46:11] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[06:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:10] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[06:48:12] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:18] RECOVERY - Check size of conntrack table on prometheus1004 is OK: OK: nf_conntrack is 3 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:48:56] Operations, netops: Make eqord its own AS - https://phabricator.wikimedia.org/T259593 (ayounsi)
[06:50:51] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[06:50:51] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[06:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:04] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:51:58] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus1004 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:52:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[06:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:12] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[06:59:12] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[06:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:40] (PS1) Giuseppe Lavagetto: termbox/staging: rollback the configuration, it clearly doesn't work. [deployment-charts] - https://gerrit.wikimedia.org/r/618479
[07:01:42] (CR) Giuseppe Lavagetto: [V: +2 C: +2] termbox/staging: rollback the configuration, it clearly doesn't work. [deployment-charts] - https://gerrit.wikimedia.org/r/618479 (owner: Giuseppe Lavagetto)
[07:04:51] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[07:04:51] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[07:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:36] (PS1) Ayounsi: Depool ulsfo for routers upgrade [dns] - https://gerrit.wikimedia.org/r/618480 (https://phabricator.wikimedia.org/T259621)
[07:08:26] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[07:12:47] (CR) Marostegui: [C: +2] mariadb: Reimage db1132 to Buster [puppet] - https://gerrit.wikimedia.org/r/618427 (https://phabricator.wikimedia.org/T259589) (owner: Marostegui)
[07:13:42] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[07:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:10] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[07:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:12] (PS1) Awight: FileImporter: full default deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/618481 (https://phabricator.wikimedia.org/T232542)
[07:18:22] (CR) Elukey: "Keith one question - shouldn't we add the hiera config for monitoring_buster in this patch, to see if the new instances comes up fine etc."
[puppet] - https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: Herron)
[07:20:15] !log installing libexif security updates on buster
[07:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:57] (CR) DCausse: [C: +1] Additional prefixes for sdoc for wcqs [puppet] - https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) (owner: ZPapierski)
[07:21:04] (PS1) Awight: Remove deprecated setting [mediawiki-config] - https://gerrit.wikimedia.org/r/618482 (https://phabricator.wikimedia.org/T232542)
[07:26:30] !log installing perl security updates on buster
[07:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[07:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:46] (CR) Filippo Giunchedi: [C: +2] prometheus: lowercase alerts annotations [puppet] - https://gerrit.wikimedia.org/r/618284 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[07:36:51] (PS2) Filippo Giunchedi: prometheus: lowercase alerts annotations [puppet] - https://gerrit.wikimedia.org/r/618284 (https://phabricator.wikimedia.org/T258948)
[07:38:30] (CR) Filippo Giunchedi: [C: +1] "Nice!
LGTM" [puppet] - https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite)
[07:39:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[07:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:26] (PS4) JMeybohm: helm-diff: New upstream version 3.1.2 [debs/helm-diff] - https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T258572)
[07:41:40] (CR) Filippo Giunchedi: [C: +1] "LGTM, CC'ing Alex as a FYI" [puppet] - https://gerrit.wikimedia.org/r/618388 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[07:42:27] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/618345 (https://phabricator.wikimedia.org/T247966) (owner: Herron)
[07:43:54] (CR) JMeybohm: [C: +2] blubberoid: remove out-dated repositories definition [deployment-charts] - https://gerrit.wikimedia.org/r/618347 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[07:45:04] (Merged) jenkins-bot: blubberoid: remove out-dated repositories definition [deployment-charts] - https://gerrit.wikimedia.org/r/618347 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[07:45:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[07:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:15] !log Stop mysql on db1117:3323 (this will generate haproxy irc alerts) T259589
[07:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:17] T259589: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589
[07:49:34] (CR) Giuseppe Lavagetto: [C: +1] "Packaging is fine. A couple minor comments on the control file, but it's good as-is otherwise."
(2 comments) [debs/helm-diff] - https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[07:49:48] Operations, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm)
[07:54:53] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:55:26] ebernhardson: looks like prometheus is failing to scrape metrics from mjolnir, I'm assuming due to the deploy yesterday
[07:55:54] haproxy is me, as announced
[07:56:52] ebernhardson: or rather, prometheus is expecting to find mjolnir on all elastic instances but atm only ~2% of elastic hosts can have mjolnir metrics scraped, https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1
[07:57:42] (CR) Ema: [C: +2] Revert "ATS: force cache revalidation on a few wikis" [puppet] - https://gerrit.wikimedia.org/r/618294 (https://phabricator.wikimedia.org/T256750) (owner: Ema)
[07:58:45] (CR) Giuseppe Lavagetto: [C: -1] "Fix the copyright, otherwise LGTM" (2 comments) [debs/helmfile] - https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:02:19] (CR) Giuseppe Lavagetto: [C: -1] helm: Replace repo update cronjob by systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/618350 (owner: JMeybohm)
[08:02:31] (PS5) JMeybohm: helmfile: New upstream version 0.125.2 [debs/helmfile] - https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572)
[08:02:39] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:01] (PS5) JMeybohm: helm-diff: New upstream version 3.1.2 [debs/helm-diff] - https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T258572)
[08:03:08] haproxy expected
[08:03:46]
(CR) Giuseppe Lavagetto: [C: -1] helm: Replace repo update cronjob by systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/618350 (owner: JMeybohm)
[08:07:01] (CR) JMeybohm: helm: Replace repo update cronjob by systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/618350 (owner: JMeybohm)
[08:08:18] (PS1) Alexandros Kosiaris: mobileapps: Switch conftool to kubernetes/kubesvc [puppet] - https://gerrit.wikimedia.org/r/618485 (https://phabricator.wikimedia.org/T218733)
[08:08:19] (PS1) Alexandros Kosiaris: mobileapps: Remove from scb conftool config [puppet] - https://gerrit.wikimedia.org/r/618486 (https://phabricator.wikimedia.org/T218733)
[08:08:21] (PS1) Alexandros Kosiaris: mobileapps: Remove mobileapps from scb [puppet] - https://gerrit.wikimedia.org/r/618487 (https://phabricator.wikimedia.org/T218733)
[08:08:25] (PS1) Alexandros Kosiaris: mobileapps: Remove the profile and the role [puppet] - https://gerrit.wikimedia.org/r/618488 (https://phabricator.wikimedia.org/T218733)
[08:08:28] (CR) JMeybohm: [C: +2] helm-diff: New upstream version 3.1.2 (2 comments) [debs/helm-diff] - https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:09:17] (CR) JMeybohm: [C: +2] helmfile: New upstream version 0.125.2 (2 comments) [debs/helmfile] - https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:11:15] (Merged) jenkins-bot: helm-diff: New upstream version 3.1.2 [debs/helm-diff] - https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:11:16] (CR) Alexandros Kosiaris: [C: +2] mobileapps: Switch conftool to kubernetes/kubesvc [puppet] - https://gerrit.wikimedia.org/r/618485 (https://phabricator.wikimedia.org/T218733) (owner: Alexandros Kosiaris)
[08:12:20] (Merged) jenkins-bot: helmfile: New upstream version 0.125.2
[debs/helmfile] - https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:12:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P12169 and previous config saved to /var/cache/conftool/dbconfig/20200805-081237-marostegui.json
[08:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:53] (CR) Vgutierrez: [C: +1] varnishmtail: check if varnishncsa is still running [puppet] - https://gerrit.wikimedia.org/r/618308 (https://phabricator.wikimedia.org/T259020) (owner: Ema)
[08:15:10] (CR) Ema: [C: +2] varnishmtail: check if varnishncsa is still running [puppet] - https://gerrit.wikimedia.org/r/618308 (https://phabricator.wikimedia.org/T259020) (owner: Ema)
[08:21:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P12170 and previous config saved to /var/cache/conftool/dbconfig/20200805-082138-marostegui.json
[08:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:29:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P12171 and previous config saved to /var/cache/conftool/dbconfig/20200805-082908-marostegui.json
[08:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:15] (PS1) Elukey: profile::prometheus::ops: change mjolnir's target classes [puppet] - https://gerrit.wikimedia.org/r/618493 (https://phabricator.wikimedia.org/T258245)
[08:31:49]
RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:32:49] (PS8) Jdrewniak: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) (owner: Jdlrobson)
[08:32:54] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/618493 (https://phabricator.wikimedia.org/T258245) (owner: Elukey)
[08:34:38] (CR) DCausse: [C: +1] profile::prometheus::ops: change mjolnir's target classes [puppet] - https://gerrit.wikimedia.org/r/618493 (https://phabricator.wikimedia.org/T258245) (owner: Elukey)
[08:34:40] (PS1) Marostegui: db1132: Set binlog format to ROW [puppet] - https://gerrit.wikimedia.org/r/618494 (https://phabricator.wikimedia.org/T259589)
[08:34:53] (CR) Elukey: [C: +2] profile::prometheus::ops: change mjolnir's target classes [puppet] - https://gerrit.wikimedia.org/r/618493 (https://phabricator.wikimedia.org/T258245) (owner: Elukey)
[08:35:22] (CR) Marostegui: [C: +2] db1132: Set binlog format to ROW [puppet] - https://gerrit.wikimedia.org/r/618494 (https://phabricator.wikimedia.org/T259589) (owner: Marostegui)
[08:37:09] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:37:09] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:37:40] Operations, Parsoid, serviceops, Parsoid-Tests, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (MoritzMuehlenhoff) >>!
In T257906#6361874, @ssastry wrote: > testreduce codebase is used for regular roundtrip testi...
[08:38:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1094', diff saved to https://phabricator.wikimedia.org/P12172 and previous config saved to /var/cache/conftool/dbconfig/20200805-083833-marostegui.json
[08:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079 for MCR', diff saved to https://phabricator.wikimedia.org/P12173 and previous config saved to /var/cache/conftool/dbconfig/20200805-083916-marostegui.json
[08:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:47] !log Stop replication on db1079 for MCR, this will generate lag on s7 on labsdb
[08:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:57] !log Remove revision triggers on db1125:3317
[08:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:21] (PS1) Muehlenhoff: Add Chad to ldap_only_users now that shell access has been removed [puppet] - https://gerrit.wikimedia.org/r/618496
[08:45:24] (PS1) Marostegui: install_server: Do not reimage db1132 [puppet] - https://gerrit.wikimedia.org/r/618497
[08:46:15] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/618488 (https://phabricator.wikimedia.org/T218733) (owner: Alexandros Kosiaris)
[08:46:24] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (akosiaris) @Jrbranaa Ping?
[08:46:37] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1132 [puppet] - 10https://gerrit.wikimedia.org/r/618497 (owner: 10Marostegui) [08:49:02] (03CR) 10Muehlenhoff: [C: 03+2] Add Chad to ldap_only_users now that shell access has been removed [puppet] - 10https://gerrit.wikimedia.org/r/618496 (owner: 10Muehlenhoff) [08:51:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:01:45] (03CR) 10Marostegui: "After merging I realised that Puppet is disabled on all the eqiad cloudcontrol hosts, how should I proceed?" [puppet] - 10https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [09:05:50] !log imported helmfile_0.125.2-0 to buster-wikimedia [09:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:20] !log imported helmfile_0.125.2-0 to stretch-wikimedia [09:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:27] !log imported helmfile_0.125.2-0 to jessie-wikimedia [09:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:04] (03PS1) 10Alexandros Kosiaris: admin: Add cdunn to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/618500 (https://phabricator.wikimedia.org/T259615) [09:15:37] (03PS1) 10JMeybohm: Add postinst to clean up after old package version [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618501 (https://phabricator.wikimedia.org/T258572) [09:16:24] (03CR) 10JMeybohm: "Package has not been build by now, so I did not increment the debian version." 
[debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618501 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [09:17:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Add cdunn to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/618500 (https://phabricator.wikimedia.org/T259615) (owner: 10Alexandros Kosiaris) [09:18:24] (03PS1) 10Elukey: kerberos: set ticket renew lifetime to 7d [puppet] - 10https://gerrit.wikimedia.org/r/618502 [09:19:12] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add postinst to clean up after old package version [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618501 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [09:19:55] (03CR) 10JMeybohm: [C: 03+2] Add postinst to clean up after old package version [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618501 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [09:22:23] (03Merged) 10jenkins-bot: Add postinst to clean up after old package version [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618501 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [09:22:27] (03PS1) 10Filippo Giunchedi: Init retry_count at each collection [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/618504 (https://phabricator.wikimedia.org/T258948) [09:22:29] (03PS1) 10Filippo Giunchedi: Add support for exposing Icinga problems as metrics [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/618505 (https://phabricator.wikimedia.org/T258948) [09:23:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/618502 (owner: 10Elukey) [09:24:05] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Carol Dunn to the wmf LDAP group - https://phabricator.wikimedia.org/T259615 (10akosiaris) 05Open→03Resolved p:05Triage→03Medium a:03akosiaris [09:24:20] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Carol Dunn to the wmf LDAP group - https://phabricator.wikimedia.org/T259615 (10akosiaris) Done. Resolving, feel free to reopen [09:25:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Remove from scb conftool config [puppet] - 10https://gerrit.wikimedia.org/r/618486 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [09:25:47] (03PS2) 10Filippo Giunchedi: Add support for exposing Icinga problems as metrics [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/618505 (https://phabricator.wikimedia.org/T258948) [09:26:46] (03PS3) 10Jbond: profile::gerrit::migrations: correct hiera name space [puppet] - 10https://gerrit.wikimedia.org/r/617676 (https://phabricator.wikimedia.org/T247956) [09:27:04] (03CR) 10Jbond: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/617676 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [09:27:40] https://mysqlserverteam.com/mysql-shell-dump-load-part-2-benchmarks/ [09:27:48] sorry, wrong channel [09:28:18] (03CR) 10Elukey: [C: 03+2] kerberos: set ticket renew lifetime to 7d [puppet] - 10https://gerrit.wikimedia.org/r/618502 (owner: 10Elukey) [09:31:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Remove mobileapps from scb [puppet] - 10https://gerrit.wikimedia.org/r/618487 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [09:32:39] !log set ticket max renewable lifetime to 7d on all kerberos clients (was zero, the default) [09:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:35] (03CR) 10Awight: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/618390 (https://phabricator.wikimedia.org/T259254) (owner: 10BryanDavis) [09:34:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/618352 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [09:36:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Patch Set 1: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/618379 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [09:36:38] (03CR) 10Filippo Giunchedi: "We'll need to do the same for host status as well" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/618505 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:42:14] 10Operations, 10Traffic: Generate ATS cache.config from software-agnostic data structures - https://phabricator.wikimedia.org/T259692 (10ema) [09:42:20] 10Operations, 10Traffic: Generate ATS cache.config from software-agnostic data structures - https://phabricator.wikimedia.org/T259692 (10ema) p:05Triage→03Medium [09:42:41] (03PS1) 10Elukey: druid: puppet cleanup after upgrading all clusters to 0.19 [puppet] - 10https://gerrit.wikimedia.org/r/618506 (https://phabricator.wikimedia.org/T244482) [09:50:41] (03CR) 10Elukey: [C: 03+2] druid: puppet cleanup after upgrading all clusters to 0.19 [puppet] - 10https://gerrit.wikimedia.org/r/618506 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey) [09:52:24] (03CR) 10Ayounsi: [C: 03+2] Depool ulsfo for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/618480 (https://phabricator.wikimedia.org/T259621) (owner: 10Ayounsi) [09:53:11] !log depool ulsfo - T259621 [09:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:34] 10Operations, 10observability: db1082 failed on Jul 18th and 25th, however on the 25th pages didn't go out to VO/phones - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) >>! 
In T259465#6355942, @fgiunchedi wrote: > A reminder might work! We'll be inquiring VO about that possibility e.g. via email when... [09:58:15] !log drain traffic away cr4-ulsfo [09:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:41] (03Abandoned) 10Elukey: Remove AAAA/PTR records for db1108 [dns] - 10https://gerrit.wikimedia.org/r/617064 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [10:07:07] (03PS6) 10Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168 (https://phabricator.wikimedia.org/T204957) [10:15:20] (03PS3) 10Filippo Giunchedi: Add support for exposing Icinga problems as metrics [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/618505 (https://phabricator.wikimedia.org/T258948) [10:18:41] !log reboot cr4-ulsfo - T259621 [10:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:21:14] !log installing libssh security updates [10:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:27] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:21:41] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:22:25] router looks back up on the console [10:25:28] (03PS1) 10Hnowlan: api-gateway: change deployment to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618512 (https://phabricator.wikimedia.org/T254906) [10:25:37] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, 
dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:25:59] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:26:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:27:38] alright cr4 is all good [10:27:41] cr3 now [10:28:28] !log drain traffic away cr3-ulsfo - T259621 [10:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:26] (03PS1) 10Kormat: Split utilities into separate packages [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 [10:29:52] (03CR) 10jerkins-bot: [V: 04-1] Split utilities into separate packages [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 (owner: 10Kormat) [10:34:31] (03PS2) 10Kormat: Split utilities into separate packages [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 [10:36:55] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 44 probes of 655 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:39:27] !log reboot cr3-ulsfo - T259621 [10:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:33] (03PS3) 10Kormat: Split utilities into separate packages [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 [10:42:37] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 11 probes of 655 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:43:05] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, 
interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:43:37] (03CR) 10Kormat: "Jcrespo: this is what we talked about yesterday" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 (owner: 10Kormat) [10:45:04] (03CR) 10Kormat: "Note: i'm not planning to do a release just yet, hence UNRELEASED in the changelog." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 (owner: 10Kormat) [10:46:37] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:53:30] all good, removing downtimes [10:59:38] jouncebot: refresh [10:59:39] I refreshed my knowledge about deployments. [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I ♥ Unicode. All rise for European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200805T1100). [11:00:04] Lucas_WMDE, DannyS712, awight, and jan_drewniak: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:12] o/ [11:00:23] o/ [11:00:50] I’ll start with my config change [11:01:01] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:01:05] (03PS6) 10Lucas Werkmeister (WMDE): Enable Data Bridge on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595542 (https://phabricator.wikimedia.org/T232584) [11:01:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Data Bridge on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595542 (https://phabricator.wikimedia.org/T232584) (owner: 10Lucas Werkmeister (WMDE)) [11:02:04] (03Merged) 10jenkins-bot: Enable Data Bridge on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595542 (https://phabricator.wikimedia.org/T232584) (owner: 10Lucas Werkmeister (WMDE)) [11:02:36] testing on mwdebug1001 [11:04:43] hm, action=wbformatreference isn’t available on testwikidatawiki [11:05:53] ah, because it’s a client module ^^ [11:05:58] it is available on testwiki, so that’s fine [11:08:34] data bridge works \o/ [11:08:37] syncing [11:09:10] Lucas_WMDE: I'm happy to take over for this next one-liner, or for my patch later... [11:09:22] sure, if you want [11:09:32] :-) Enjoy the 5 minutes off [11:09:43] (03PS3) 10Awight: Add import sources for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618303 (https://phabricator.wikimedia.org/T259633) (owner: 10DannyS712) [11:10:13] good luck with the FileImporter deploy! 
big change :) [11:10:26] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:595542|Enable Data Bridge on Test Wikidata clients (T232584)]] (duration: 01m 20s) [11:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:29] T232584: Step 1: Production deployment checklist - https://phabricator.wikimedia.org/T232584 [11:10:32] ^ awight: go ahead [11:10:43] Lucas_WMDE: ack [11:11:19] (03CR) 10Awight: [C: 03+2] "Bacon deploying. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618303 (https://phabricator.wikimedia.org/T259633) (owner: 10DannyS712) [11:12:05] (03Merged) 10jenkins-bot: Add import sources for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618303 (https://phabricator.wikimedia.org/T259633) (owner: 10DannyS712) [11:12:41] (03PS9) 10Jdrewniak: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) (owner: 10Jdlrobson) [11:13:58] !log awight@deploy1001 sync-file aborted: Config: [[gerrit:618303|Add import sources for lijwikisource (T259633)]] (duration: 00m 13s) [11:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:02] T259633: Add import sources for lijwikisource - https://phabricator.wikimedia.org/T259633 [11:14:25] DannyS712: Sorry, I decided to test this at the last moment. [11:15:56] DannyS712: Okay, your patch is live on mwdebug1001. [11:17:23] is DannyS712 here? I haven’t seen an o/ yet [11:17:46] (03CR) 10Ladsgroup: Turn muswiki and mhwiktionary to read-only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [11:18:13] I'm rusty at this, anyway. Import requires special permissions, I'll just be satisfied that the site doesn't explode. 
[11:19:46] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:618303|Add import sources for lijwikisource (T259633)]] (duration: 01m 07s) [11:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:49] T259633: Add import sources for lijwikisource - https://phabricator.wikimedia.org/T259633 [11:20:32] (03PS2) 10Awight: FileImporter: full default deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618481 (https://phabricator.wikimedia.org/T232542) [11:20:49] (03CR) 10Awight: [C: 03+2] "Bacon deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618481 (https://phabricator.wikimedia.org/T232542) (owner: 10Awight) [11:21:39] (03Merged) 10jenkins-bot: FileImporter: full default deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618481 (https://phabricator.wikimedia.org/T232542) (owner: 10Awight) [11:22:09] !log imported helm-diff_3.1.2-0 to buster-wikimedia [11:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:14] !log imported helm-diff_3.1.2-0 to jessie-wikimedia and stretch-wikimedia [11:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:47] !log awight@deploy1001 Synchronized wmf-config: Config: [[gerrit:618481|FileImporter: full default deployment (T232542)]] (duration: 01m 04s) [11:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:50] T232542: [Deployment] FileImporter / FileExporter full default - https://phabricator.wikimedia.org/T232542 [11:28:36] !log EU Bacon complete [11:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:57] hey, there's still my patch! [11:29:17] !log EU Bacon reopened [11:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:25] jan_drewniak: So sorry, I just saw it. [11:29:36] late addition :P [11:29:43] jan_drewniak: Helpful if I deploy? 
[11:29:59] awight: sure! [11:30:03] ack [11:30:28] (03PS10) 10Awight: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) (owner: 10Jdlrobson) [11:30:51] (03CR) 10Awight: [C: 03+2] "Bacon deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) (owner: 10Jdlrobson) [11:31:22] (03Merged) 10jenkins-bot: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) (owner: 10Jdlrobson) [11:32:44] jan_drewniak: The config is live on mwdebug1001, if you'd like to test [11:32:59] Ok I'll test it now [11:33:59] awight: alrighty, look good! [11:34:05] jan_drewniak: Thanks :-) [11:36:02] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:614891|Switch test wikis to new version of vector by default (3/3) (T254227)]] (duration: 01m 07s) [11:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:04] T254227: Switch test wikis to new version of vector by default - https://phabricator.wikimedia.org/T254227 [11:36:21] !log EU Bacon reclosed [11:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:50] awight: thanks! [11:37:17] My pleasure! 
[12:09:29] (03CR) 10Ladsgroup: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [12:15:58] (03PS1) 10Kormat: Add wikimedia.cloud domain [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/618522 [12:25:42] (03PS1) 10JMeybohm: Fix if in postinst [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618524 (https://phabricator.wikimedia.org/T258572) [12:25:47] (03PS1) 10KartikMistry: Update cxserver to 2020-08-05-070016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618525 (https://phabricator.wikimedia.org/T258919) [12:26:21] RECOVERY - puppet last run on otrs1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:28:15] 10Operations, 10DBA, 10User-Kormat: DBA python layout - https://phabricator.wikimedia.org/T259516 (10Kormat) [12:28:38] (03PS4) 10Kormat: Split utilities into separate packages [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 (https://phabricator.wikimedia.org/T259516) [12:33:13] (03PS2) 10Alexandros Kosiaris: mobileapps: Remove the profile and the role [puppet] - 10https://gerrit.wikimedia.org/r/618488 (https://phabricator.wikimedia.org/T218733) [12:33:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/618488 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [12:33:42] !log installing net-snmp security updates on icinga hosts [12:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:56] (03CR) 10Jcrespo: "Looks ok, but have you tried building all packages locally? 
I wonder if dependencies on python3-wmfmariadbpy will get duplicated as they c" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [12:35:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618524 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [12:35:56] (03CR) 10Volans: [C: 03+1] "disclaimer: reviewing it as a new patch because too much time has passed since the last PS." (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [12:37:41] (03CR) 10JMeybohm: [C: 03+2] Fix if in postinst [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618524 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [12:40:12] (03Merged) 10jenkins-bot: Fix if in postinst [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618524 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [12:40:25] (03PS1) 10DCausse: [yarn] set yarn.scheduler.minimum-allocation-mb to > 0 [puppet] - 10https://gerrit.wikimedia.org/r/618526 [12:40:59] (03CR) 10DCausse: "ref https://github.com/apache/flink/pull/12444" [puppet] - 10https://gerrit.wikimedia.org/r/618526 (owner: 10DCausse) [12:46:05] (03PS2) 10JMeybohm: helm: Replace repo update cronjob by systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/618350 [12:46:46] !log installing imagemagick security updates on buster [12:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:20] (03CR) 10JMeybohm: helm: Replace repo update cronjob by systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/618350 (owner: 10JMeybohm) [12:49:17] !log imported helm-diff_3.1.2-1 to buster-wikimedia, jessie-wikimedia and stretch-wikimedia [12:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:48] !log 
netmon1002:/srv/deployment/librenms/librenms$ sudo -u librenms ./lnms migrate [12:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:50] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) >>! In T187984#6351881, @eyazi wrote: > Not sure if you did, but you should also reset the Ticket::SearchIndexModule setting. C... [13:00:20] (03CR) 10Kormat: "> Patch Set 4:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [13:00:50] !log installing libjpeg-turbo security updates on stretch [13:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:40] (03CR) 10Jcrespo: [C: 03+1] "Cool. Let's maybe merge fast and give a final production test/review when fully ready for deployment." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [13:01:58] (03PS1) 10Elukey: hadoop: set yarn_scheduler_minimum_allocation_mb to 1 [puppet] - 10https://gerrit.wikimedia.org/r/618529 [13:02:34] (03CR) 10Elukey: [C: 03+2] hadoop: set yarn_scheduler_minimum_allocation_mb to 1 [puppet] - 10https://gerrit.wikimedia.org/r/618529 (owner: 10Elukey) [13:02:47] (03Abandoned) 10DCausse: [yarn] set yarn.scheduler.minimum-allocation-mb to > 0 [puppet] - 10https://gerrit.wikimedia.org/r/618526 (owner: 10DCausse) [13:04:10] !log restart yarn resource managers on an-master100[12] to pick up new Yarn settings - https://gerrit.wikimedia.org/r/c/operations/puppet/+/618529 [13:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:07] (03CR) 10Volans: "I don't see either a clear way to simplify it given the current data, see also inline." 
(031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277) (owner: 10Ayounsi) [13:14:47] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:16:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:20:16] (03PS7) 10Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168 (https://phabricator.wikimedia.org/T204957) [13:22:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:24:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:24:27] (03CR) 10Elukey: [C: 04-2] "https://puppet-compiler.wmflabs.org/compiler1001/24318/kafka-jumbo1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/611168 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [13:24:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] helm: Replace repo update cronjob by systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/618350 (owner: 10JMeybohm) [13:24:43] (03PS3) 10Volans: mgmt: netbox-generated data for mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) [13:24:47] (03CR) 10Ppchelko: [C: 03+2] api-gateway: change deployment to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618512 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [13:24:58] !log volans@cumin1001 START - Cookbook sre.dns.netbox [13:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:17] (03Merged) 10jenkins-bot: api-gateway: change deployment to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618512 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [13:28:54] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:41] oh elukey [13:32:43] mirror maker? 
[13:32:48] hmmm [13:32:52] ok no this is just for jumbo [13:32:58] mirror maker pulls from elsewhere there [13:32:58] !log updated helmfile to 0.125.2-0 and helm-diff to 3.1.2-1 on contint* and deploy* [13:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:07] oops wrong chat room back to analytics :) [13:35:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The idea is sound, but I'd implement it the same way as it's done for the envoy sidecars for TLS termination." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [13:36:34] (03PS1) 10Muehlenhoff: update libjpeg-turbo library hint to also cover libturbojpeg [puppet] - 10https://gerrit.wikimedia.org/r/618532 [13:40:22] (03CR) 10Muehlenhoff: [C: 03+2] update libjpeg-turbo library hint to also cover libturbojpeg [puppet] - 10https://gerrit.wikimedia.org/r/618532 (owner: 10Muehlenhoff) [13:49:02] 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10akosiaris) [13:49:08] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [13:49:48] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for David Rochford (Drochford) - https://phabricator.wikimedia.org/T259713 (10drochford) [13:51:31] !log installing Linux update to 4.9.132 from buster point update (no reboots, just the package updates) [13:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:57] !log installing node-minimist security updates [14:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:15] 10Operations: Integrate Buster 10.5 point release - 
https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [14:06:55] 10Operations, 10User-jbond: update profile::waf::apache2::administrative to use the new abuse_networks hiera key - https://phabricator.wikimedia.org/T253632 (10Aklapper) @JBond: Both patches in Gerrit have been merged. Can this task be resolved (via {nav name=Add Action... > Change Status} in the dropdown menu... [14:08:30] 10Operations, 10Phabricator, 10Traffic: Access Forbidden to Phabricator at WikiArabia 2019 (Morocco) via Indian IP 185.174.156.75 - https://phabricator.wikimedia.org/T234598 (10Aklapper) [14:09:38] 10Operations, 10Phabricator, 10Traffic: Access Forbidden to Phabricator at WikiArabia 2019 (Morocco) via Indian IP 185.174.156.75 - https://phabricator.wikimedia.org/T234598 (10Aklapper) See also T257507, T229575, T258059, T246923, etc. Reason might be vandalism (non-public T218589). (The error itself is a... [14:14:17] !log installing pillow security updates [14:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:45] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:14] XioNoX: ^^^ fyi [14:15:39] (hoping that you don't filter icinga-wm too :-P ) [14:15:52] who's pinging me? [14:16:18] someone is pinging you? [14:16:39] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:16:48] did mr1-eqsin die? 
[14:17:55] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 4267.51 ms [14:17:59] seems like it [14:18:06] I haven't tried scs yet [14:18:13] cdanis: scs is via mr1 :) [14:18:17] ahah ofc [14:18:37] (03PS1) 10Ottomata: eventgate-main - bump image version to get schema mediawiki/revision/create/1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618535 (https://phabricator.wikimedia.org/T216297) [14:18:50] `ping mr1-eqsin.oob.wikimedia.org -4` still replies [14:19:05] but not ssh [14:19:19] PROBLEM - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:20:04] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime [14:20:04] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:28] mmmh something is going on, I can login on a random cp host though [14:20:52] guessing some weird internal crash, the data plane is still working but not other parts [14:21:40] v6 to bast5001 is terribly slow for me [14:21:43] v4 is fine [14:22:02] and by ssh I meant ofc into a mgmt console [14:22:07] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:22:37] (03CR) 10Ottomata: [C: 03+2] eventgate-main - bump image version to get schema mediawiki/revision/create/1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/618535 (https://phabricator.wikimedia.org/T216297) (owner: 10Ottomata) [14:22:38] I'm on v4 indeed [14:22:57] I'm in via mr1-eqsin.oob.wikimedia.org -4 [14:22:59] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 212.51 ms [14:24:13] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[14:24:14] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:29] RECOVERY - Host cp5006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 228.99 ms [14:25:07] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 230.73 ms [14:25:10] !log installing nmap bugfix updates from buster point release [14:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:30] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [14:27:46] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:27:46] (03CR) 10Andrew Bogott: [C: 03+1] "This is ambitious, but I like it. We could modify the patch to apply only to codfw1dev for testing, although since puppet is disabled on " [puppet] - 10https://gerrit.wikimedia.org/r/618418 (owner: 10Bstorm) [14:27:46] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:32:58] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[14:32:58] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:48] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [14:35:38] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Papaul) @Eevans @hnowlan any update on this? [14:36:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks." [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/618522 (owner: 10Kormat) [14:36:10] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add wikimedia.cloud domain [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/618522 (owner: 10Kormat) [14:38:09] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:40:07] volans: thanks for the ping :) [14:41:16] anytime :) [14:41:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:43:09] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:45:04] (03CR) 10Ottomata: Add eventgate-logging-external streams, and add destination_event_service to all stream configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [14:46:36] (03PS1) 10Ema: ATS: add function profile::trafficserver_caching_rules [puppet] - 10https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) [14:47:10] (03PS1) 10Ebernhardson: mjolnir: Increase msearch daemon parallelism to 25 [puppet] - 10https://gerrit.wikimedia.org/r/618538 [14:47:44] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add a TestCase field for POST form data. [software/httpbb] - 10https://gerrit.wikimedia.org/r/615570 (owner: 10RLazarus) [14:47:55] (03CR) 10jerkins-bot: [V: 04-1] ATS: add function profile::trafficserver_caching_rules [puppet] - 10https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) (owner: 10Ema) [14:47:58] (03CR) 10RLazarus: [C: 03+2] Add a TestCase field for POST form data. [software/httpbb] - 10https://gerrit.wikimedia.org/r/615570 (owner: 10RLazarus) [14:48:37] !log reboot stat1008 for unexpected maintenance (GPU stuck) [14:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:15] (03Merged) 10jenkins-bot: Add a TestCase field for POST form data. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/615570 (owner: 10RLazarus) [14:49:45] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) [14:51:42] (03PS1) 10Elukey: hadoop set yarn_scheduler_minimum_allocation_vcores to 1 [puppet] - 10https://gerrit.wikimedia.org/r/618539 [14:52:18] (03CR) 10Elukey: [C: 03+2] hadoop set yarn_scheduler_minimum_allocation_vcores to 1 [puppet] - 10https://gerrit.wikimedia.org/r/618539 (owner: 10Elukey) [14:52:20] (03PS2) 10Ema: ATS: add function profile::trafficserver_caching_rules [puppet] - 10https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) [14:53:07] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:53:38] (03CR) 10jerkins-bot: [V: 04-1] ATS: add function profile::trafficserver_caching_rules [puppet] - 10https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) (owner: 10Ema) [14:53:46] stat1008 is me [14:53:51] probably stuck in booting sigh [14:57:22] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10wiki_willy) a:03Cmjohnson [14:58:01] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.82 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [14:59:51] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [15:00:47] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.7685 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [15:01:53] (03PS3) 10Ema: ATS: add function profile::trafficserver_caching_rules [puppet] - 10https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) [15:03:08] (03CR) 10jerkins-bot: [V: 04-1] ATS: add function profile::trafficserver_caching_rules [puppet] - 
10https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) (owner: 10Ema) [15:03:09] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.7494 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [15:04:35] (03CR) 10Ppchelko: [C: 03+1] Add eventgate-logging-external streams, and add destination_event_service to all stream configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:05:05] uh oh, logstash is unhappy, I'm taking a look [15:05:11] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:44] (03PS4) 10Ottomata: Add eventgate-logging-external streams, and add destination_event_service to all stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) [15:07:44] (03PS4) 10Ema: ATS: add function profile::trafficserver_caching_rules [puppet] - 10https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) [15:07:57] not sure exactly what's up, I'll bounce logstash [15:08:29] !log bounce logstash on logstash100[789] - udp loss reported [15:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:47] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10cscott) @ssastry one minor wrinkle to keep in mind is that to start an rt test run you need to update files on both... 
[15:11:54] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:39] (03PS5) 10Ema: ATS: add function profile::trafficserver_caching_rules [puppet] - 10https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) [15:12:45] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [15:13:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:14:11] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.007086 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [15:15:19] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [15:16:37] the last exceptions alert might be a false positive while logstash catches up btw [15:16:37] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) [15:16:43] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) (owner: 10Ema) [15:22:53] (03CR) 10Mholloway: "In case it's useful, Mateus from Product Infrastructure has a project (https://github.com/thesocialdev/mediawiki-services-profiler) based " [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:25:47] (03PS1) 
10Kormat: mariadb: Use correct binlog format for db_inventory [puppet] - 10https://gerrit.wikimedia.org/r/618540 [15:28:46] (03CR) 10Kormat: "PCC run looks good: https://puppet-compiler.wmflabs.org/compiler1002/24323/" [puppet] - 10https://gerrit.wikimedia.org/r/618540 (owner: 10Kormat) [15:29:15] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-dataso [15:29:15] luster=logging-eqiad&var-topic=All&var-consumer_group=All [15:29:24] (03CR) 10Marostegui: [C: 03+1] mariadb: Use correct binlog format for db_inventory [puppet] - 10https://gerrit.wikimedia.org/r/618540 (owner: 10Kormat) [15:29:26] (03CR) 10Jcrespo: "I am 100% ok with the change, but it is not a universal rule for all hosts. I believe most misc servers use ROW on the master, and some ot" [puppet] - 10https://gerrit.wikimedia.org/r/618540 (owner: 10Kormat) [15:29:35] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) ` Interface Admin Link Description xe-4/0/20 up up dbprov2003 Logical Vlan TAG MAC STP L... 
[15:30:03] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) [15:30:06] (03CR) 10Marostegui: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/618540 (owner: 10Kormat) [15:30:07] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.7679 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [15:34:31] (03CR) 10Herron: [C: 03+2] alerting_host: assign alert[12]001 role::alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/618345 (https://phabricator.wikimedia.org/T247966) (owner: 10Herron) [15:34:39] (03PS3) 10Herron: alerting_host: assign alert[12]001 role::alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/618345 (https://phabricator.wikimedia.org/T247966) [15:35:02] hmm looking at udp loss [15:36:09] thanks, I'm looking too but not sure yet what/why is happening (bounced logstash a little while ago for the same reason) [15:39:37] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.0007004 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [15:39:48] (03PS2) 10Kormat: mariadb: Use correct binlog format for db_inventory [puppet] - 10https://gerrit.wikimedia.org/r/618540 [15:40:05] (03CR) 10Kormat: "Updated the commit message slightly to be clearer." [puppet] - 10https://gerrit.wikimedia.org/r/618540 (owner: 10Kormat) [15:40:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:41:02] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Init retry_count at each collection [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/618504 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [15:41:40] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Add support for exposing Icinga problems as metrics [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/618505 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [15:42:18] (03CR) 10Kormat: [C: 03+2] mariadb: Use correct binlog format for db_inventory [puppet] - 10https://gerrit.wikimedia.org/r/618540 (owner: 10Kormat) [15:43:49] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:40] (03PS1) 10Alexandros Kosiaris: Switch service-checker-image to python3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/618542 [15:48:41] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10jcrespo) [15:48:44] (03CR) 10Hashar: [C: 04-1] "There is a similar change for the Jenkins user and I expressed concerns about it on https://gerrit.wikimedia.org/r/c/operations/puppet/+/6" [puppet] - 10https://gerrit.wikimedia.org/r/607853 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [15:50:13] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[15:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:05] (03CR) 10Ottomata: [C: 03+2] Add eventgate-logging-external streams, and add destination_event_service to all stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [15:53:46] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (2020-09-14) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10jcrespo) [15:55:07] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10jcrespo) [15:56:22] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Add eventgate-logging-external streams and destination_event_service settings - T251935 (duration: 01m 05s) [15:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:24] T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 [15:56:32] (03CR) 10Muehlenhoff: [C: 03+1] "What Daniel wrote: Having the central GIDs recorded in data.yaml is the only sane way to prevent duplicated use." [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [15:59:14] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) Status Name State Layout Size Media Type Read Policy Write Policy Stripe Size Secured Remaining Redundancy Virtual Disk 0 Onlin... [15:59:25] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) >>! In T257906#6362985, @cscott wrote: > @ssastry one minor wrinkle to keep in mind is that to start an rt... 
[16:00:55] (03CR) 10Ottomata: [C: 03+1] Move oozie server to an-scheduler1001 [puppet] - 10https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [16:01:45] (03CR) 10Bstorm: [C: 03+2] haproxy-galera: Make a meaningful healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/618418 (owner: 10Bstorm) [16:02:27] (03PS2) 10Ottomata: eventgate-logging-external - Use MW EventStreamConfig API to get static stream configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/618395 (https://phabricator.wikimedia.org/T251935) [16:04:03] (03PS1) 10Herron: acme_cheif: permit alert[12]001 to fetch icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/618545 (https://phabricator.wikimedia.org/T247966) [16:04:18] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - Use MW EventStreamConfig API to get static stream configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/618395 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [16:04:39] (03PS1) 10Papaul: DNS: Add production DNS for dbprov2003 [dns] - 10https://gerrit.wikimedia.org/r/618546 [16:05:59] (03CR) 10Papaul: [C: 03+2] DNS: Add production DNS for dbprov2003 [dns] - 10https://gerrit.wikimedia.org/r/618546 (owner: 10Papaul) [16:06:41] (03CR) 10Hashar: [C: 04-1] "My review comment was unclear, I apologize. 
My concerns are:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [16:07:00] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) [16:12:12] (03PS1) 10Papaul: DHCP: Add MAC address for dbprov2003 [puppet] - 10https://gerrit.wikimedia.org/r/618548 (https://phabricator.wikimedia.org/T258749) [16:13:29] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for dbprov2003 [puppet] - 10https://gerrit.wikimedia.org/r/618548 (https://phabricator.wikimedia.org/T258749) (owner: 10Papaul) [16:15:47] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [16:16:50] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [16:16:54] (03PS4) 10Herron: kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) [16:17:24] (03CR) 10Herron: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [16:18:05] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10jcrespo) They use the `custom/db.cfg` recipe, but only on first install, after that they are moved to `custom/reuse-dbprov.cfg`. 
[16:21:50] (03PS1) 10Papaul: Add dbprov2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/618549 (https://phabricator.wikimedia.org/T258749) [16:22:29] (03CR) 10Papaul: [C: 03+2] Add dbprov2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/618549 (https://phabricator.wikimedia.org/T258749) (owner: 10Papaul) [16:26:20] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:28:33] (03PS1) 10Ottomata: Add eventgate service specific test.event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618550 (https://phabricator.wikimedia.org/T251935) [16:32:33] (03CR) 10Ppchelko: [C: 04-1] Add eventgate service specific test.event streams (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618550 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [16:33:39] (03PS2) 10Ottomata: Add eventgate service specific test.event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618550 (https://phabricator.wikimedia.org/T251935) [16:33:41] (03PS1) 10Andrew Bogott: mwopenstackclients: fix ensure_recordset [puppet] - 10https://gerrit.wikimedia.org/r/618553 [16:34:40] (03PS1) 10RLazarus: httpbb: Move test files into subdirectories by host type. [puppet] - 10https://gerrit.wikimedia.org/r/618554 (https://phabricator.wikimedia.org/T259665) [16:35:12] (03CR) 10Dzahn: [C: 03+1] "but do the new hosts also need new SNIs to be added?" [puppet] - 10https://gerrit.wikimedia.org/r/618545 (https://phabricator.wikimedia.org/T247966) (owner: 10Herron) [16:35:18] (03CR) 10Ppchelko: [C: 03+1] Add eventgate service specific test.event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618550 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [16:35:59] (03CR) 10jerkins-bot: [V: 04-1] httpbb: Move test files into subdirectories by host type. 
[puppet] - 10https://gerrit.wikimedia.org/r/618554 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [16:37:09] (03PS2) 10RLazarus: httpbb: Move test files into subdirectories by host type. [puppet] - 10https://gerrit.wikimedia.org/r/618554 (https://phabricator.wikimedia.org/T259665) [16:38:40] (03PS1) 10Jcrespo: mariadb-backups: Add dbprov2003 to the db.cfg partman recipe list [puppet] - 10https://gerrit.wikimedia.org/r/618555 (https://phabricator.wikimedia.org/T258749) [16:39:09] (03CR) 10RLazarus: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24325/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/618554 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [16:39:12] (03CR) 10Bstorm: [C: 03+1] mwopenstackclients: fix ensure_recordset [puppet] - 10https://gerrit.wikimedia.org/r/618553 (owner: 10Andrew Bogott) [16:39:44] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: fix ensure_recordset [puppet] - 10https://gerrit.wikimedia.org/r/618553 (owner: 10Andrew Bogott) [16:40:19] (03PS2) 10Herron: acme_cheif: add alert[12]001 SNI and permit to fetch icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/618545 (https://phabricator.wikimedia.org/T247966) [16:40:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpbb: Move test files into subdirectories by host type. [puppet] - 10https://gerrit.wikimedia.org/r/618554 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [16:40:57] (03CR) 10Herron: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/618545 (https://phabricator.wikimedia.org/T247966) (owner: 10Herron) [16:41:21] (03PS2) 10Jcrespo: mariadb-backups: Add dbprov[12]003 to the db.cfg partman recipe list [puppet] - 10https://gerrit.wikimedia.org/r/618555 (https://phabricator.wikimedia.org/T258749) [16:41:53] (03CR) 10RLazarus: [C: 03+2] httpbb: Move test files into subdirectories by host type. 
[puppet] - 10https://gerrit.wikimedia.org/r/618554 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [16:42:24] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Add dbprov[12]003 to the db.cfg partman recipe list [puppet] - 10https://gerrit.wikimedia.org/r/618555 (https://phabricator.wikimedia.org/T258749) (owner: 10Jcrespo) [16:42:50] rzl: deploy? [16:42:55] yes please! [16:45:11] (03PS1) 10JMeybohm: Detect kubeconfig as known argument in plugin invocations [debs/helm] - 10https://gerrit.wikimedia.org/r/618556 (https://phabricator.wikimedia.org/T258572) [16:46:08] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10DStrine) [16:48:41] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10DStrine) @Ladsgroup thanks for the info. I have rewritten this task in the format you have recommended. Let... 
[16:50:46] !log powercycle stat1005 after GPU issue [16:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:28] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:53] (03PS1) 10Ahmon Dancy: zuul_error_log.mtail: Settle on initial counters [puppet] - 10https://gerrit.wikimedia.org/r/618557 (https://phabricator.wikimedia.org/T258821) [16:55:04] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:56:19] (03CR) 10jerkins-bot: [V: 04-1] zuul_error_log.mtail: Settle on initial counters [puppet] - 10https://gerrit.wikimedia.org/r/618557 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [16:57:51] (03PS2) 10Ahmon Dancy: zuul_error_log.mtail: Settle on initial counters [puppet] - 10https://gerrit.wikimedia.org/r/618557 (https://phabricator.wikimedia.org/T258821) [17:03:55] (03PS3) 10Dzahn: ATS: switch releases.wm to new buster backend servers [dns] - 10https://gerrit.wikimedia.org/r/618412 (https://phabricator.wikimedia.org/T247652) [17:03:59] I broke puppet on cumin* and deployment*, working on it [17:04:02] (03PS2) 10Dzahn: releases: use --delete when rsyncing files between servers [puppet] - 10https://gerrit.wikimedia.org/r/618411 (https://phabricator.wikimedia.org/T247652) [17:04:10] (03PS2) 10Dzahn: httpbb: add test file for releases.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) [17:04:34] rzl: once fixed https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed is your friend :D [17:04:40] ack, thanks [17:05:17] (03PS1) 10Dzahn: releases: open firewall hole for http from deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/618559 (https://phabricator.wikimedia.org/T247652) [17:06:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Jrbranaa) Sorry, being 
out of the office and changing my IRC usage (missing channels :-/) I didn't see this. The contract termi... [17:06:33] (03PS2) 10Dzahn: releases: open firewall hole for http from deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/618559 (https://phabricator.wikimedia.org/T247652) [17:07:44] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Dzahn) 05Stalled→03Open a:05Jrbranaa→03None [17:08:49] (03CR) 10Dzahn: "a contract end date has been provided at https://phabricator.wikimedia.org/T256435#6363455 this can now be amended with that and the tick" [puppet] - 10https://gerrit.wikimedia.org/r/609158 (https://phabricator.wikimedia.org/T256435) (owner: 10Ssingh) [17:09:56] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [17:10:07] (03PS1) 10RLazarus: httpbb: Fix breakage caused by 618554, create dir before files [puppet] - 10https://gerrit.wikimedia.org/r/618562 (https://phabricator.wikimedia.org/T259665) [17:10:10] (03CR) 10Dzahn: [C: 03+2] releases: open firewall hole for http from deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/618559 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [17:10:45] mutante: fyi puppet's broken on the deployment hosts right now ^ [17:10:54] I'll send you the fix for review as soon as pcc finishes [17:10:54] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [17:11:08] rzl: thanks! 
it's alright, i just need it on releases* right now [17:11:13] ahh okay [17:11:34] well, I'll send you the fix for review anyway if you don't mind :D [17:12:08] (03CR) 10Dzahn: [C: 03+1] "warning, i was about to upload another test file that is neither for miscweb nor for appservers :p" [puppet] - 10https://gerrit.wikimedia.org/r/618562 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [17:12:37] rzl: i made another test file for releases* , heh [17:12:51] i am opening the firewall there to be able to use it :) [17:13:02] cool! go ahead and add a third subdir in that case [17:13:10] ok [17:13:11] at some point I might clean up the way those are defined, but [17:13:21] (03CR) 10RLazarus: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24326/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/618562 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [17:15:47] rzl: and to be clear, if unsure where it's broken, it's totally ok to run that cumin command across the whole fleet, granted you use a sane batch size like in the example [17:15:56] nod [17:16:17] in this case it's only four hosts and I know what they are, so I just did a plain old sudo cumin run-puppet-agent [17:16:25] but glad to be reminded that recipe exists for the more interesting cases [17:17:07] fixed! [17:17:30] :) [17:17:43] so the other day i kept missing the $http_proxy setup on some hosts.. so i think it's a good idea to add it to my global .bash_profile. One day later... i am using httpbb and wonder why it times out.. 
now gotta think about removing the http_proxy :p [17:17:50] I *really* wish pcc could catch "parent directory doesn't exist" [17:18:38] it could exist outside of puppet though [17:19:01] yeah it's true, I don't think it's actually possible [17:19:08] but life would be nice if it were [17:20:51] (03PS1) 10RLazarus: httpbb: Remove temporary ensure-absents for moved files [puppet] - 10https://gerrit.wikimedia.org/r/618563 (https://phabricator.wikimedia.org/T259665) [17:21:45] (03PS1) 10Lucas Werkmeister (WMDE): Pass jQuery objects into jqueryMsg [extensions/ContentTranslation] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618566 [17:22:07] (03CR) 10Dzahn: [C: 03+1] httpbb: Remove temporary ensure-absents for moved files [puppet] - 10https://gerrit.wikimedia.org/r/618563 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [17:25:19] (03CR) 10RLazarus: [C: 03+2] httpbb: Remove temporary ensure-absents for moved files [puppet] - 10https://gerrit.wikimedia.org/r/618563 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [17:30:17] !log test prometheus-icinga-exporter upgrade on icinga2001 [17:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:12] (03CR) 10Bstorm: [C: 03+2] wmcs: alphabetize labstore NFS mounts [puppet] - 10https://gerrit.wikimedia.org/r/618389 (owner: 10BryanDavis) [17:36:12] (03PS4) 10Bstorm: wmcs: Add project NFS for wmde-templates-alpha [puppet] - 10https://gerrit.wikimedia.org/r/618390 (https://phabricator.wikimedia.org/T259254) (owner: 10BryanDavis) [17:36:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:37:33] (03CR) 10Bstorm: [C: 03+2] wmcs: Add project NFS for wmde-templates-alpha [puppet] - 
10https://gerrit.wikimedia.org/r/618390 (https://phabricator.wikimedia.org/T259254) (owner: 10BryanDavis) [17:37:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:39:00] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [17:39:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:39:58] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [17:40:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:40:27] (03PS3) 10Ahmon Dancy: zuul_error_log.mtail: Settle on initial counters [puppet] - 10https://gerrit.wikimedia.org/r/618557 (https://phabricator.wikimedia.org/T258821) [17:40:58] (03CR) 10Bstorm: [C: 03+1] "Looks great to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/617995 (owner: 10Muehlenhoff) [17:42:15] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10User-dancy, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10dancy) [17:43:18] rzl: currently confused. i add more assertions to my local test file but httpbb says "4 requests sent" before and after. I am like "that should be more than 4 now". [17:46:47] well, i have 6 URLs in the file but only 4 are unique, the other 2 differ in path [17:46:51] (03CR) 10Bstorm: "Won't this need to auth to the ceph cluster as well? Maybe that would be on the backup server profile and not this anyway, though. I'm jus" [puppet] - 10https://gerrit.wikimedia.org/r/617841 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [17:48:44] ok, got it. i did it wrong [17:50:09] (03CR) 10Ahmon Dancy: "Followup to https://gerrit.wikimedia.org/r/c/operations/puppet/+/617271" [puppet] - 10https://gerrit.wikimedia.org/r/618557 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [17:50:26] mutante: cool, happy to look if it comes up again [17:51:04] (03CR) 10BryanDavis: [C: 03+1] toolforge: Remove jessie conditionals [puppet] - 10https://gerrit.wikimedia.org/r/617995 (owner: 10Muehlenhoff) [17:52:58] (03PS3) 10Dzahn: releases: use --delete when rsyncing files between servers [puppet] - 10https://gerrit.wikimedia.org/r/618411 [17:53:00] (03PS3) 10Dzahn: httpbb: add test file for releases.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) [17:55:13] (03PS3) 10Nray: Re-enable growth study quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618343 (https://phabricator.wikimedia.org/T257015) [17:56:24] (03PS4) 10Dzahn: httpbb: add test file for releases.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) [17:57:58] (03PS5) 
10Dzahn: httpbb: add directory and test file for releases.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) [17:58:11] jouncebot: refresh [17:58:12] I refreshed my knowledge about deployments. [17:59:10] thx [17:59:58] rzl: ^ like that? i kept adding both http and https even though i don't get the redirects when testing internally [18:00:04] brennen and dancy: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200805T1800). [18:00:05] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200805T1800). [18:00:05] nray and Lucas_WMDE: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:13] o/ [18:00:16] o/ here [18:00:27] and just 2 random files i check from the actual releases .. good enough to me [18:01:11] nray: do you want to deploy your change yourself? 
[18:01:40] I don't have deploy rights [18:01:59] ok, then I can do it [18:02:00] (03CR) 10BryanDavis: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [18:02:13] (03CR) 10Herron: [C: 03+2] acme_cheif: add alert[12]001 SNI and permit to fetch icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/618545 (https://phabricator.wikimedia.org/T247966) (owner: 10Herron) [18:02:26] I’ll also +2 my backport so the CI starts already [18:02:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Pass jQuery objects into jqueryMsg [extensions/ContentTranslation] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618566 (owner: 10Lucas Werkmeister (WMDE)) [18:03:00] Lucas_WMDE: thank you! [18:03:33] (03CR) 10Lucas Werkmeister (WMDE): Re-enable growth study quick survey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618343 (https://phabricator.wikimedia.org/T257015) (owner: 10Nray) [18:04:03] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "(lgtm otherwise)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618343 (https://phabricator.wikimedia.org/T257015) (owner: 10Nray) [18:05:14] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [18:05:20] (03PS4) 10Nray: Re-enable growth study quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618343 (https://phabricator.wikimedia.org/T257015) [18:05:23] (03PS4) 10Dzahn: releases: use --delete when rsyncing files between servers [puppet] - 10https://gerrit.wikimedia.org/r/618411 [18:05:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Re-enable growth study quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618343 (https://phabricator.wikimedia.org/T257015) (owner: 
10Nray) [18:06:14] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [18:06:43] (03Merged) 10jenkins-bot: Re-enable growth study quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618343 (https://phabricator.wikimedia.org/T257015) (owner: 10Nray) [18:07:06] nray: this sounds like a change that can’t really be tested on mwdebug, right? [18:07:15] other than quickly checking that the wiki doesn’t explode [18:07:33] I think it can be tested. There is a query param that makes the survey show [18:07:39] ah, ok [18:07:46] in that case it’s on mwdebug1001 now, please test :) [18:07:51] cool thanks [18:09:00] Lucas_WMDE: Tested and lgtm! [18:09:03] ok! [18:09:38] syncing [18:10:36] (03CR) 10Dzahn: [C: 03+2] releases: use --delete when rsyncing files between servers [puppet] - 10https://gerrit.wikimedia.org/r/618411 (owner: 10Dzahn) [18:10:46] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:618343|Re-enable growth study quick survey (T257015)]] (duration: 01m 12s) [18:11:22] stashbot: u there? [18:11:40] !log test !log [18:11:44] well, it’s in the sal tool at least [18:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:53] ok there we go [18:12:27] (but the test !log is still waiting) [18:12:53] next on the deployment calendar for this window is a backport, which is currently going through CI, btw [18:12:55] * Lucas_WMDE waits [18:12:57] surprisingly slow... [18:14:44] that missing test !log is concerning [18:14:46] T257015: Redeploy quicksurvey on enwiki (for a Growth study) - https://phabricator.wikimedia.org/T257015 [18:14:46] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[18:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:03] ok now it went through [18:15:12] ACK, seeing it [18:15:19] sweet, thank you for your help @Lucas_WMDE [18:15:20] perhaps the phabricator API was slow and it blocked on getting the task label…? [18:15:22] random guess [18:15:26] np nray :) [18:15:30] good luck with the survey [18:15:56] i don't think phab, my test did not include a ticket number [18:16:33] maybe toollabs is very busy [18:16:40] but maybe processing that !log was blocked on processing the task number from the scap !log [18:16:54] (and apparently “see … for help” is its response to my “u there”) [18:17:32] Lucas_WMDE: heh, yea, you might be right there [18:18:15] (to clarify – on my end, the messages are T257015, then see X for help, then logged the message. looks like the wm-bot log has a different order so maybe I’m completely wrong after all) [18:20:02] (03CR) 10Andrew Bogott: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/617841 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [18:21:05] (03Merged) 10jenkins-bot: Pass jQuery objects into jqueryMsg [extensions/ContentTranslation] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618566 (owner: 10Lucas Werkmeister (WMDE)) [18:22:04] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` dbprov2003.codfw.wmnet ` The log can be found i... 
[18:22:34] testing the backport on mwdebug1001 [18:23:46] seems to work fine, syncing [18:25:50] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.3/extensions/ContentTranslation/: Backport: [[gerrit:618566|Pass jQuery objects into jqueryMsg]] (duration: 01m 11s) [18:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:56] !log Morning backport window done [18:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:24] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [18:35:18] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10User-dancy, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Ejegg) [18:35:20] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [18:37:21] (03PS1) 10Kaldari: Switching to updated license definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618586 [18:39:15] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov2003.codfw.wmnet'] ` Of which those **FAILED**: ` ['dbprov2003.codfw.wmnet'] ` [18:39:46] (03CR) 10Bstorm: [C: 03+1] "/me looks. Oh yeah, that'd totally take care of it." 
[puppet] - 10https://gerrit.wikimedia.org/r/617841 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [18:42:31] (03CR) 10Bstorm: [C: 03+1] jessie-ssd: Fetch base image from docker-registry.tools.wmflabs.org [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/617288 (owner: 10BryanDavis) [18:52:38] mutante: sorry, back from lunch and looking now [18:54:00] (03CR) 10Andrew Bogott: [C: 03+1] dnsrecursor: allow installation of pdns-recursor from component [puppet] - 10https://gerrit.wikimedia.org/r/618376 (owner: 10Ssingh) [18:54:51] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10jcrespo) This was the only issue we had the last time with the same hw and recipe: T218336#5068836 [18:57:37] (03PS6) 10Cicalese: DO NOT MERGE Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T245595) [18:58:30] (03CR) 10Ssingh: [C: 03+2] dnsrecursor: allow installation of pdns-recursor from component [puppet] - 10https://gerrit.wikimedia.org/r/618376 (owner: 10Ssingh) [18:58:39] (03PS7) 10Cicalese: DO NOT MERGE Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T259742) [18:59:38] (03PS8) 10Cicalese: Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T259742) [19:00:04] brennen and dancy: Your horoscope predicts another unfortunate Mediawiki train - American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200805T1900). [19:00:40] horrorscope. [19:00:43] I've been enjoying the automated nags so far. [19:01:53] Did anything interesting come out of the triage meeting? 
[19:02:06] dancy (or any random persons who want to watch me swear at a train): https://meet.google.com/qxk-kkjc-meo [19:03:20] pretty quiet triage meeting; we should have a clean dashboard for this one. [19:04:28] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:05:50] (03PS1) 10Ssingh: wikidough: enable QNAME minimisation for the dnsrecursor module [puppet] - 10https://gerrit.wikimedia.org/r/618591 (https://phabricator.wikimedia.org/T252132) [19:06:44] (03PS1) 10Brennen Bearnes: group1 wikis to 1.36.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618592 [19:06:46] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.36.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618592 (owner: 10Brennen Bearnes) [19:07:25] (03CR) 10Ppchelko: [C: 03+1] Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T259742) (owner: 10Cicalese) [19:07:30] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618592 (owner: 10Brennen Bearnes) [19:10:16] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:11:55] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.3 [19:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:07] (03CR) 10RLazarus: httpbb: add directory and test file for releases.wm.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [19:12:39] (03PS1) 10Papaul: DHCP: Fix MAC address for dbprov2003 [puppet] - 10https://gerrit.wikimedia.org/r/618593 (https://phabricator.wikimedia.org/T258749) [19:13:15] (03CR) 10Papaul: [C: 03+2] DHCP: Fix MAC address for dbprov2003 [puppet] - 10https://gerrit.wikimedia.org/r/618593 (https://phabricator.wikimedia.org/T258749) (owner: 10Papaul) [19:13:39] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.3 (duration: 01m 44s) [19:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:43] (03PS1) 10Ssingh: dnsrecursor: use the correct option name in commit e250327 [puppet] - 10https://gerrit.wikimedia.org/r/618594 [19:21:10] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` dbprov2003.codfw.wmnet `... [19:22:28] (03CR) 10Ssingh: "Merging this based on the review from I2173f99. 
No Puppet code change; only the template was updated which does not affect the other DNS r" [puppet] - 10https://gerrit.wikimedia.org/r/618594 (owner: 10Ssingh) [19:23:16] (03CR) 10Ssingh: [C: 03+2] dnsrecursor: use the correct option name in commit e250327 [puppet] - 10https://gerrit.wikimedia.org/r/618594 (owner: 10Ssingh) [19:26:31] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/24330/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/618591 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [19:29:17] (03CR) 10Dzahn: "all releases servers (and deploy1001 when it comes to /srv/patches) are now _actual_ mirrors of each other and are not accumulating old fi" [puppet] - 10https://gerrit.wikimedia.org/r/618411 (owner: 10Dzahn) [19:37:11] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:12] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:55] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.36.0-wmf.2 [19:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:44] (03PS1) 10Brennen Bearnes: Revert "group1 wikis to 1.36.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618595 [19:42:46] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group1 wikis to 1.36.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618595 (owner: 10Brennen Bearnes) [19:43:29] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.36.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618595 (owner: 10Brennen Bearnes) [19:46:03] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) 
[19:50:10] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:59:40] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [19:59:47] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov2003.codfw.wmnet'] ` and were **ALL** successful. [20:00:04] halfak and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200805T2000). [20:00:42] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [20:04:31] (03PS1) 10RLazarus: web_testing: Remove the apache-fast-test placeholder [puppet] - 10https://gerrit.wikimedia.org/r/618602 [20:04:33] (03PS1) 10RLazarus: web_testing: Clean up the old class used for apache-fast-test. [puppet] - 10https://gerrit.wikimedia.org/r/618603 [20:08:06] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [20:09:51] (03CR) 10Dzahn: [C: 03+2] zuul_error_log.mtail: Settle on initial counters [puppet] - 10https://gerrit.wikimedia.org/r/618557 (https://phabricator.wikimedia.org/T258821) (owner: 10Ahmon Dancy) [20:12:43] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) [20:12:58] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) 05Open→03Resolved This is done [20:20:31] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [20:20:48] (03CR) 10Herron: [C: 03+1] prometheus: puppetized install of prometheus-es-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:20:52] 10Operations, 10serviceops: httpbb: Mapping between tests and hosts - https://phabricator.wikimedia.org/T259665 (10RLazarus) 05Open→03Resolved The simple version of this is done. We might eventually want to do something more elaborate -- the advantage would be that httpbb could be run without explicitly pa... [20:26:24] (03PS5) 10Herron: kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) [20:36:22] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [20:38:36] (03PS1) 10RLazarus: cumin: Update wmf_auto_reimage_lib for the new httpbb test layout [puppet] - 10https://gerrit.wikimedia.org/r/618618 (https://phabricator.wikimedia.org/T259665) [20:39:40] (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24332/" [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [20:41:00] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/618618 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [20:42:36] (03CR) 10RLazarus: [C: 03+2] cumin: Update wmf_auto_reimage_lib for the new httpbb test layout [puppet] - 10https://gerrit.wikimedia.org/r/618618 (https://phabricator.wikimedia.org/T259665) (owner: 10RLazarus) [20:42:39] (03CR) 10Dzahn: httpbb: add directory and test file for releases.wm.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [20:42:42] (03PS6) 10Dzahn: httpbb: add directory and test file for releases.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) [20:44:46] (03CR) 10RLazarus: [C: 03+1] "Looks good modulo the nit inline, thanks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [20:46:13] (03PS1) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [20:46:40] (03PS1) 10Dzahn: hiera: switch releases server to releases1001, remove 1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/618621 (https://phabricator.wikimedia.org/T247652) [20:46:41] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [20:47:01] (03PS2) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [20:48:48] (03PS3) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [20:52:54] (03PS4) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [20:56:57] (03PS5) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [21:00:27] (03PS6) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [21:01:15] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [21:01:45] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [21:01:46] (03CR) 10jerkins-bot: [V: 04-1] wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 (owner: 10Andrew Bogott) [21:03:07] (03PS7) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [21:04:24] (03CR) 10jerkins-bot: [V: 04-1] wmcs galera: change our HA approach to primary/backups for db access [puppet] - 
10https://gerrit.wikimedia.org/r/618619 (owner: 10Andrew Bogott) [21:05:27] (03PS8) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [21:06:18] (03PS1) 10Ottomata: eventgate - use /v1/_test/events route for readinessProbe [deployment-charts] - 10https://gerrit.wikimedia.org/r/618624 (https://phabricator.wikimedia.org/T251935) [21:08:06] (03PS9) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [21:09:46] (03PS2) 10Dzahn: hiera: switch releases server to releases1001, remove 1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/618621 (https://phabricator.wikimedia.org/T247652) [21:12:55] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [21:14:19] (03PS10) 10Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - 10https://gerrit.wikimedia.org/r/618619 [21:15:57] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10User-dancy, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Ladsgroup) >>! In T259002#6363413, @DStrine wrote: > @Ladsgroup thanks for the info. I have... 
[21:17:20] (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1001/24340/" [puppet] - https://gerrit.wikimedia.org/r/618619 (owner: Andrew Bogott)
[21:17:32] (PS1) Dave Pifke: arclamp: require python-swiftclient [puppet] - https://gerrit.wikimedia.org/r/618626 (https://phabricator.wikimedia.org/T244776)
[21:23:06] Operations, ops-eqiad, DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (wiki_willy) a:Cmjohnson
[21:24:32] (PS11) Andrew Bogott: wmcs galera: change our HA approach to primary/backups for db access [puppet] - https://gerrit.wikimedia.org/r/618619
[21:27:32] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[21:28:20] (CR) Andrew Bogott: [C: +2] wmcs galera: change our HA approach to primary/backups for db access [puppet] - https://gerrit.wikimedia.org/r/618619 (owner: Andrew Bogott)
[21:30:55] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:31:15] PROBLEM - nova-compute proc minimum on cloudvirt1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:31:59] PROBLEM - nova-compute proc maximum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:32:02] PROBLEM - nova-compute proc maximum on cloudvirt1006 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:32:49] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:33:53] RECOVERY - nova-compute proc maximum on cloudvirt1030 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:36:19] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[21:36:43] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[21:44:01] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:52:22] RECOVERY - nova-compute proc minimum on cloudvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:53:12] RECOVERY - nova-compute proc maximum on cloudvirt1006 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:53:13] Operations, ops-eqiad, DC-Ops: (Need By: DC Failover) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (wiki_willy)
[21:53:49] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[21:53:56] Operations, ops-eqiad, DC-Ops: (Need By: DC Failover) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (wiki_willy)
[21:55:15] Operations, ops-eqiad, DC-Ops: (Need By: DC Failover) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (wiki_willy)
[22:00:53] Operations, ops-eqiad, DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (wiki_willy)
[22:02:32] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[22:02:59] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[22:03:41] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (bd808) Netbox is showing this host as "staged" rather than "active": https://netbox.wikimedia.org/dcim/devices/2613/
[22:05:16] (PS1) Bstorm: galera: Ease up on replication restrictions since there is one primary [puppet] - https://gerrit.wikimedia.org/r/618633
[22:06:43] (CR) Andrew Bogott: [C: +1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/618633 (owner: Bstorm)
[22:08:09] (CR) Bstorm: [C: +2] galera: Ease up on replication restrictions since there is one primary [puppet] - https://gerrit.wikimedia.org/r/618633 (owner: Bstorm)
[22:17:05] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[22:24:57] PROBLEM - Host relforge1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[22:34:39] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[22:35:09] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[22:39:00] (PS3) Dzahn: hiera: switch releases server to releases1001, remove 1001/2001 [puppet] - https://gerrit.wikimedia.org/r/618621 (https://phabricator.wikimedia.org/T247652)
[22:55:38] (PS7) Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418)
[22:58:01] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[22:58:03] (CR) Cwhite: "fixed hiera lookup issue" [puppet] - https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite)
[22:58:23] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_centrallog2001.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[23:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening backport window(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200805T2300).
[23:02:13] !log logstash in codfw looks stuck -- restarting
[23:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:34] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (tstarling) We still can't announce anything since we're waiting for vendor security releases. Third p...
[23:13:52] (CR) Ppchelko: [C: +1] eventgate - use /v1/_test/events route for readinessProbe [deployment-charts] - https://gerrit.wikimedia.org/r/618624 (https://phabricator.wikimedia.org/T251935) (owner: Ottomata)
[23:20:54] (PS1) Bstorm: Disable the mdadm check cron for cloudcontrol servers [puppet] - https://gerrit.wikimedia.org/r/618638
[23:21:56] (PS7) Dzahn: httpbb: add directory and test file for releases.wm.org [puppet] - https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652)
[23:22:00] (CR) Dzahn: httpbb: add directory and test file for releases.wm.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) (owner: Dzahn)
[23:24:19] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[23:24:25] (CR) Dzahn: [C: +2] httpbb: add directory and test file for releases.wm.org [puppet] - https://gerrit.wikimedia.org/r/618415 (https://phabricator.wikimedia.org/T247652) (owner: Dzahn)
[23:26:40] (CR) Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24345/" [puppet] - https://gerrit.wikimedia.org/r/618638 (owner: Bstorm)
[23:29:02] (CR) Bstorm: [C: +2] Disable the mdadm check cron for cloudcontrol servers [puppet] - https://gerrit.wikimedia.org/r/618638 (owner: Bstorm)
[23:32:28] (CR) Andrew Bogott: "retrospective +1" [puppet] - https://gerrit.wikimedia.org/r/618638 (owner: Bstorm)
[23:36:27] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[23:52:31] (CR) Dzahn: "after this we are getting warnings "demon present in privileged LDAP group (nda),but not present in data.yaml". Are there any web UIs tha" [puppet] - https://gerrit.wikimedia.org/r/617749 (owner: Chad)
[23:52:53] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:58:47] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas