[00:01:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:03] watching icinga because I touched submit_check_result slightly.. but nothing should happen
[00:08:23] (CR) Dzahn: "check_wikitech-static: confirmed still working, submit_check_result: only affects passive checks and keeping an eye on icinga: mailman r" [puppet] - https://gerrit.wikimedia.org/r/631890 (https://phabricator.wikimedia.org/T95064) (owner: Dzahn)
[00:16:58] (CR) Dzahn: "double checked submit_check_result.sh works as before - icinga-downtime.sh though is actually affected by this even though bashisms did no" [puppet] - https://gerrit.wikimedia.org/r/631890 (https://phabricator.wikimedia.org/T95064) (owner: Dzahn)
[00:30:16] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 23588 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[00:52:30] (PS1) Dzahn: icinga-downtime.sh: drop printf directive for long unsigned int [puppet] - https://gerrit.wikimedia.org/r/632353 (https://phabricator.wikimedia.org/T95064)
[00:53:44] (PS2) Dzahn: icinga-downtime.sh: drop printf directive for long unsigned int [puppet] - https://gerrit.wikimedia.org/r/632353 (https://phabricator.wikimedia.org/T95064)
[01:08:58] (PS3) Dzahn: icinga-downtime.sh: use %s directive with printf, not %lu [puppet] - https://gerrit.wikimedia.org/r/632353 (https://phabricator.wikimedia.org/T95064)
[01:09:54] (PS4) Dzahn: icinga-downtime.sh: use %s directive with printf, not %lu [puppet] - https://gerrit.wikimedia.org/r/632353 (https://phabricator.wikimedia.org/T95064)
[01:10:44] (CR) Dzahn: [C: +2] icinga-downtime.sh: use %s directive with printf, not %lu [puppet] - https://gerrit.wikimedia.org/r/632353 (https://phabricator.wikimedia.org/T95064) (owner: Dzahn)
[01:12:36] PROBLEM - ping-offload grafana alert on alert1001 is CRITICAL: CRITICAL: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is alerting: target IP missing on hosts loopback. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/
[01:14:20] RECOVERY - ping-offload grafana alert on alert1001 is OK: OK: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is not alerting. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/
[01:14:58] PROBLEM - Thanos query has high gRPC client errors on alert1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[01:15:12] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[01:16:42] RECOVERY - Thanos query has high gRPC client errors on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[01:17:48] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/632354
[01:19:00] (PS3) Dzahn: thumbor: role->profile, hiera->lookup, data types [puppet] - https://gerrit.wikimedia.org/r/630694
[01:19:37] (CR) Dzahn: "> Im not sure if this comment relates to this patch as i don't see hiera_array anywhere. The only place i do see `hiera_array` is in `pro" [puppet] - https://gerrit.wikimedia.org/r/630694 (owner: Dzahn)
[01:22:19] (PS4) Dzahn: thumbor: role->profile, hiera->lookup, data types [puppet] - https://gerrit.wikimedia.org/r/630694
[01:22:47] (CR) Dzahn: "uhm.. so if this is confusing then see how PS1 looked. that was my original intention. something changed meanwhile." [puppet] - https://gerrit.wikimedia.org/r/630694 (owner: Dzahn)
[01:27:23] (CR) Nuria: [C: -1] [WIP] Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days (2 comments) [puppet] - https://gerrit.wikimedia.org/r/628895 (https://phabricator.wikimedia.org/T231339) (owner: Ottomata)
[01:27:29] (PS5) Dzahn: thumbor: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630694
[01:28:56] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=misc file=smartmon.prom instance=relforge1004 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[01:38:34] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[02:07:09] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.12 [core] (wmf/1.36.0-wmf.12) - https://gerrit.wikimedia.org/r/632357
[02:07:42] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.12 [core] (wmf/1.36.0-wmf.12) - https://gerrit.wikimedia.org/r/632357 (https://phabricator.wikimedia.org/T263178) (owner: TrainBranchBot)
[02:11:01] (CR) Jforrester: "Note that we're potentially not going to merge this at all. :-(" [core] (wmf/1.36.0-wmf.12) - https://gerrit.wikimedia.org/r/632357 (https://phabricator.wikimedia.org/T263178) (owner: TrainBranchBot)
[02:11:47] (CR) DannyS712: "> Patch Set 2:" [core] (wmf/1.36.0-wmf.12) - https://gerrit.wikimedia.org/r/632357 (https://phabricator.wikimedia.org/T263178) (owner: TrainBranchBot)
[02:15:28] (CR) Jforrester: "> Patch Set 2:" [core] (wmf/1.36.0-wmf.12) - https://gerrit.wikimedia.org/r/632357 (https://phabricator.wikimedia.org/T263178) (owner: TrainBranchBot)
[02:18:46] RECOVERY - dump of analytics_meta in eqiad on alert1001 is OK: Last dump for analytics_meta at eqiad (db1108.eqiad.wmnet:3352) taken on 2020-10-06 02:08:27 (2 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:19:08] (PS1) Andrew Bogott: wmcs backy2: fix quoting in config [puppet] - https://gerrit.wikimedia.org/r/632358
[02:20:38] (CR) Andrew Bogott: [C: +2] wmcs backy2: fix quoting in config [puppet] - https://gerrit.wikimedia.org/r/632358 (owner: Andrew Bogott)
[02:24:07] Operations, LDAP-Access-Requests: Add lilients_WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (KFrancis) a:KFrancis Working on the NDA now. Will confirm when it's complete.
[02:35:47] Jenkins seems to be ignoring https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/632247
[02:36:11] and also hasn't run tests for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/632252 yet
[02:36:21] did something change in the wikibase config?
[04:13:11] Operations, Traffic: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working - https://phabricator.wikimedia.org/T264378 (Tgr) That endpoint are from the [[https://www.mediawiki.org/wiki/API:REST_API/Reference#History|history API]], not Parsoid. Also, I think it is only used by...
[04:26:26] PROBLEM - snapshot of s3 in eqiad on alert1001 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2020-10-03 04:17:29 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[04:34:25] Operations, SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (leila) >>! In T264472#6517631, @Kormat wrote: > Hi Leila, > > I'm the SRE clinic duty person this week :) lovely. thanks for helping me. :) > >...
[04:34:48] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 55.95 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:34:48] I'll do a bit of testing on mwdebug2001
[04:39:54] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 73.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:19:16] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:27:47] (PS1) Marostegui: instances.yaml: Remove es2017 from dbctl [puppet] - https://gerrit.wikimedia.org/r/632366 (https://phabricator.wikimedia.org/T264386)
[05:28:22] (CR) Marostegui: [C: +2] instances.yaml: Remove es2017 from dbctl [puppet] - https://gerrit.wikimedia.org/r/632366 (https://phabricator.wikimedia.org/T264386) (owner: Marostegui)
[05:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2017 from dbctl T264386', diff saved to https://phabricator.wikimedia.org/P12925 and previous config saved to /var/cache/conftool/dbconfig/20201006-052849-marostegui.json
[05:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:57] T264386: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386
[05:32:23] (PS1) Marostegui: dbproxy1018: Repool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/632367
[05:35:51] (PS2) Marostegui: dbproxy1018: Repool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/632367
[05:37:48] (PS3) Marostegui: dbproxy1018: Repool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/632367
[05:38:46] (CR) Marostegui: [C: +2] dbproxy1018: Repool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/632367 (owner: Marostegui)
[06:32:22] (CR) Volans: [C: +1] "LGTM for the homer's one" [puppet] - https://gerrit.wikimedia.org/r/631892 (https://phabricator.wikimedia.org/T95064) (owner: Dzahn)
[06:37:52] PROBLEM - MD RAID on mw2279 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:37:53] ACKNOWLEDGEMENT - MD RAID on mw2279 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T264698 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:37:57] Operations, ops-codfw: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (ops-monitoring-bot)
[06:55:41] (CR) Volans: "replies inline" (2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[06:58:10] (PS1) Marostegui: mariadb: Remove es2017 puppet entries [puppet] - https://gerrit.wikimedia.org/r/632428 (https://phabricator.wikimedia.org/T264386)
[06:59:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:29] (CR) Muehlenhoff: [C: +2] Add component/icu63 for stretch-wikimedia [puppet] - https://gerrit.wikimedia.org/r/632264 (owner: Muehlenhoff)
[07:02:38] (PS1) Marostegui: dns: Remove es2017 DNS entries [dns] - https://gerrit.wikimedia.org/r/632429 (https://phabricator.wikimedia.org/T264386)
[07:05:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[07:05:54] (CR) Marostegui: [C: +2] mariadb: Remove es2017 puppet entries [puppet] - https://gerrit.wikimedia.org/r/632428 (https://phabricator.wikimedia.org/T264386) (owner: Marostegui)
[07:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:33] (CR) Marostegui: [C: +2] dns: Remove es2017 DNS entries [dns] - https://gerrit.wikimedia.org/r/632429 (https://phabricator.wikimedia.org/T264386) (owner: Marostegui)
[07:07:51] Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386 (Marostegui)
[07:09:07] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1114.eqiad.wmnet'] ` The log can be...
[07:11:03] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey)
[07:12:22] PROBLEM - Device not healthy -SMART- on mw2279 is CRITICAL: cluster=jobrunner device=sdb instance=mw2279 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2279&var-datasource=codfw+prometheus/ops
[07:12:24] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:13:36] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2015 T264700 ', diff saved to https://phabricator.wikimedia.org/P12926 and previous config saved to /var/cache/conftool/dbconfig/20201006-071451-marostegui.json
[07:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:58] T264700: decommission es2015.codfw.wmnet - https://phabricator.wikimedia.org/T264700
[07:17:42] !log Remove es2015 and es2017 from tendril and zarcillo T264700 T264386
[07:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:49] T264386: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386
[07:19:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:13] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1114.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1114.eqiad.wmnet'] `
[07:20:15] (PS1) Filippo Giunchedi: alertmanager: default to one alert per line in karma UI [puppet] - https://gerrit.wikimedia.org/r/632430
[07:23:06] (CR) Volans: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/632430 (owner: Filippo Giunchedi)
[07:24:06] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1114.eqiad.wmnet'] ` The log can be...
[07:30:46] (PS1) Elukey: Set an-worker111[13] as Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/632431 (https://phabricator.wikimedia.org/T259071)
[07:30:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:31:45] (CR) Elukey: [C: +2] Set an-worker111[13] as Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/632431 (https://phabricator.wikimedia.org/T259071) (owner: Elukey)
[07:31:58] !log filippo@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[07:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:54] (PS2) Muehlenhoff: Switch gerrit to profile::java [puppet] - https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182)
[07:34:36] Operations, DBA, Sustainability (Incident Followup), Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (Marostegui) IR is at: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200925-s5-replication-lag
[07:34:47] Operations, DBA, Sustainability (Incident Followup), Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (Marostegui) Open→Resolved I am going to close this as resolved as the incident is over and the table w...
[07:34:54] (CR) jerkins-bot: [V: -1] Switch gerrit to profile::java [puppet] - https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: Muehlenhoff)
[07:35:06] !log filippo@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[07:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:45] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1114.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1...
[07:37:26] FYI the wikifeeds deploy above is to apply https://gerrit.wikimedia.org/r/632243
[07:38:49] (CR) Ema: [C: +1] delete role::beta::availability_collector, diamond varnishstatus.py [puppet] - https://gerrit.wikimedia.org/r/632351 (https://phabricator.wikimedia.org/T210993) (owner: Dzahn)
[07:39:50] !log filippo@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[07:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:26] Operations, Wikifeeds, Wikimedia-Logstash, observability: Move wikifeeds to the logging pipeline - https://phabricator.wikimedia.org/T245604 (fgiunchedi) Open→Resolved a:fgiunchedi @MSantos I went ahead and deployed wikifeed chart 0.0.19, resolving
[07:40:28] Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (fgiunchedi)
[07:41:01] (CR) Ema: [C: +1] trafficserver: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/631291 (owner: Dzahn)
[07:41:33] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:42:09] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:43:48] (CR) JMeybohm: [C: +2] profile::envoy::builder: fix repos in chroot [puppet] - https://gerrit.wikimedia.org/r/631802 (owner: JMeybohm)
[07:47:01] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) an-worker1114's reimage fails for: ` 07:35:31 | an-worker1114.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Unable...
[07:47:15] (PS1) Filippo Giunchedi: termbox: use k8s stdout/stderr logging only [deployment-charts] - https://gerrit.wikimedia.org/r/632432 (https://phabricator.wikimedia.org/T245603)
[07:50:52] (CR) Muehlenhoff: [C: +2] delete role::beta::availability_collector, diamond varnishstatus.py [puppet] - https://gerrit.wikimedia.org/r/632351 (https://phabricator.wikimedia.org/T210993) (owner: Dzahn)
[07:52:37] Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (fgiunchedi) As far as gelf is concerned we're left with: * elasticsearch: (T225125) pending ES7 upgrade to get native json lo...
[07:53:12] !log Change innodb_change_buffering = inserts on db2087:3316 db2089:3316 db2076 db2097:3316 db2114 T263443
[07:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:19] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443
[07:53:23] (CR) DCausse: [C: +1] "https://github.com/elastic/elasticsearch/issues/27748 sounds interesting as well as attempts to solve similar issues at the elastic level." [puppet] - https://gerrit.wikimedia.org/r/632319 (https://phabricator.wikimedia.org/T264053) (owner: Ebernhardson)
[07:55:03] (PS1) Elukey: Add new G1GC heap and gc timing settings for Hadoop Namenodes [puppet] - https://gerrit.wikimedia.org/r/632433
[07:55:23] (CR) jerkins-bot: [V: -1] Add new G1GC heap and gc timing settings for Hadoop Namenodes [puppet] - https://gerrit.wikimedia.org/r/632433 (owner: Elukey)
[07:56:49] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (Volans) @elukey if I try to ssh with the install console key I get a BusyBox... I guess that's the reason. Basically the reimage...
[07:57:23] (PS2) Elukey: Add new G1GC heap and gc timing settings for Hadoop Namenodes [puppet] - https://gerrit.wikimedia.org/r/632433
[07:57:43] (PS3) Volans: swift: remove old unused service records [dns] - https://gerrit.wikimedia.org/r/628086 (https://phabricator.wikimedia.org/T244153)
[08:01:13] (CR) Volans: [C: +2] swift: remove old unused service records [dns] - https://gerrit.wikimedia.org/r/628086 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[08:02:47] !log removing unused ms-fe and ms-fe-thumbs svc records from DNS (gerrit/628086)
[08:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:58] (PS3) Muehlenhoff: Switch gerrit to profile::java [puppet] - https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182)
[08:05:41] Operations, ops-codfw, serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (Volans)
[08:05:57] (CR) Muehlenhoff: Switch gerrit to profile::java (2 comments) [puppet] - https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: Muehlenhoff)
[08:06:55] (CR) Elukey: "Little nit on the admin group and then we are done :) We also have to remember to add the admin_no_ssh hiera config for the coordinators, " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[08:07:57] (CR) Kormat: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: Dzahn)
[08:10:00] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) The boot sequence was NIC then HD (as it happened for 1117), just fixed it thanks for the suggestion :)
[08:13:39] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1114.eqiad.wmnet'] ` The log can be...
[08:16:53] Operations, Traffic: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working - https://phabricator.wikimedia.org/T264378 (ema) >>! In T264378#6520026, @Tgr wrote: > That endpoint are from the [[https://www.mediawiki.org/wiki/API:REST_API/Reference#History|history API]], not Par...
[08:17:13] (PS1) Ayounsi: diffscan scan all the ports instead of top 2000 [puppet] - https://gerrit.wikimedia.org/r/632436 (https://phabricator.wikimedia.org/T264694)
[08:19:00] (CR) Ayounsi: [C: +2] diffscan scan all the ports instead of top 2000 [puppet] - https://gerrit.wikimedia.org/r/632436 (https://phabricator.wikimedia.org/T264694) (owner: Ayounsi)
[08:19:57] Operations, observability, Patch-For-Review, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (MoritzMuehlenhoff)
[08:21:49] Operations, Data-Persistence-Backup, Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (LSobanski)
[08:24:55] Operations, Analytics-Clusters, Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (elukey) I found https://github.com/varnishcache/varnish-cache/issues/2788 that might be what's happening. The fix is https://github.com/varnishcache/varnish-cache/commit/ed1696e...
[08:26:03] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[08:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:09] (CR) Filippo Giunchedi: [C: +1] prometheus: convert mysqld_exporter from role to profile [puppet] - https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: Dzahn)
[08:33:24] !log imported envoyproxy_1.15.1-1+deb9u1 to stretch-wikimedia
[08:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:21] Operations, Analytics-Clusters, Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (ema) >>! In T264074#6520380, @elukey wrote: > I found https://github.com/varnishcache/varnish-cache/issues/2788 that might be what's happening. The fix is https://github.com/var...
[08:35:40] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1114.eqiad.wmnet'] ` and were **ALL** successful.
[08:44:53] Operations, Data-Persistence-Backup, Goal, Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (LSobanski)
[08:54:53] Operations, DBA, Patch-For-Review, User-Kormat, User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (Kormat)
[08:55:36] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[08:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[08:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:37] (PS1) Elukey: Set an-worker1114 as Hadoop worker node [puppet] - https://gerrit.wikimedia.org/r/632442 (https://phabricator.wikimedia.org/T259071)
[09:08:28] (CR) Elukey: [C: +2] Set an-worker1114 as Hadoop worker node [puppet] - https://gerrit.wikimedia.org/r/632442 (https://phabricator.wikimedia.org/T259071) (owner: Elukey)
[09:08:32] (CR) Filippo Giunchedi: [C: +2] alertmanager: default to one alert per line in karma UI [puppet] - https://gerrit.wikimedia.org/r/632430 (owner: Filippo Giunchedi)
[09:09:01] elukey: merging your patch too
[09:09:17] <3
[09:09:38] (PS1) Effie Mouzeli: memcached: refactor rules [puppet] - https://gerrit.wikimedia.org/r/632443 (https://phabricator.wikimedia.org/T149804)
[09:09:59] (CR) jerkins-bot: [V: -1] memcached: refactor rules [puppet] - https://gerrit.wikimedia.org/r/632443 (https://phabricator.wikimedia.org/T149804) (owner: Effie Mouzeli)
[09:13:06] (CR) Filippo Giunchedi: [C: +2] profile: add alertmanager::alerts [puppet] - https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[09:16:13] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey)
[09:16:30] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) Open→Resolved All nodes are in hadoop now, looks good!
[09:18:32] (PS2) Filippo Giunchedi: hieradata: expand swift object server statsd mappings [puppet] - https://gerrit.wikimedia.org/r/632205 (https://phabricator.wikimedia.org/T264588)
[09:18:34] (CR) Alexandros Kosiaris: [C: -1] "Needs an update to Chart.yaml as well" [deployment-charts] - https://gerrit.wikimedia.org/r/632432 (https://phabricator.wikimedia.org/T245603) (owner: Filippo Giunchedi)
[09:20:24] (PS2) Filippo Giunchedi: termbox: use k8s stdout/stderr logging only [deployment-charts] - https://gerrit.wikimedia.org/r/632432 (https://phabricator.wikimedia.org/T245603)
[09:20:34] (CR) Hnowlan: [C: +1] envoy: New upstream version 1.15.1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/631383 (https://phabricator.wikimedia.org/T264157) (owner: JMeybohm)
[09:21:00] (PS2) Effie Mouzeli: memcached: refactor rules [puppet] - https://gerrit.wikimedia.org/r/632443 (https://phabricator.wikimedia.org/T149804)
[09:22:05] Operations, Data-Persistence-Backup: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (LSobanski)
[09:22:49] Operations, Puppet, Data-Persistence-Backup: Missing dependency on bacula-fd Puppet setup - https://phabricator.wikimedia.org/T256454 (LSobanski)
[09:25:25] Operations, Data-Persistence-Backup, observability, Goal, and 2 others: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (LSobanski)
[09:25:56] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/632443 (https://phabricator.wikimedia.org/T149804) (owner: Effie Mouzeli)
[09:26:00] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime
[09:26:01] (PS1) Ayounsi: diffscan perf tunning [puppet] - https://gerrit.wikimedia.org/r/632444
[09:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:22] Operations, Data-Persistence-Backup, Goal: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (LSobanski)
[09:26:33] (CR) Filippo Giunchedi: hieradata: expand swift object server statsd mappings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/632205 (https://phabricator.wikimedia.org/T264588) (owner: Filippo Giunchedi)
[09:26:58] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/632444 (owner: Ayounsi)
[09:27:10] PROBLEM - Alertmanager config is not valid on alert1001 is CRITICAL: cluster=alerting instance={alert1001,alert2001} job=alertmanager prometheus=ops site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager
[09:27:22] that's me ^ new alerts
[09:27:32] (CR) Ayounsi: [C: +2] diffscan perf tunning [puppet] - https://gerrit.wikimedia.org/r/632444 (owner: Ayounsi)
[09:27:45] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[09:27:58] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:36] (CR) Ayounsi: "Might be worth sending a PR upstream as well: https://github.com/ameihm0912/diffscan2" [puppet] - https://gerrit.wikimedia.org/r/630703 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[09:35:46] (CR) Hnowlan: [C: +1] "Old pcc has been cleaned up, new pcc https://puppet-compiler.wmflabs.org/compiler1002/25690/" [puppet] - https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) (owner: Jbond)
[09:36:12] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:38:31] !log disable puppet on mc*
[09:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:59] (CR) Effie Mouzeli: [C: +2] memcached: refactor rules [puppet] - https://gerrit.wikimedia.org/r/632443 (https://phabricator.wikimedia.org/T149804) (owner: Effie Mouzeli)
[09:41:44] !log enable puppet on mc10*
[09:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:02] PROBLEM - Prometheus is failing to connect to Alertmanager on alert1001 is CRITICAL: instance=127.0.0.1 job=prometheus prometheus=ext site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager
[09:48:47] also new alert ^ investigating
[09:49:38] (CR) Jbond: "> Patch Set 1: Code-Review+2" [puppet] - https://gerrit.wikimedia.org/r/631895 (https://phabricator.wikimedia.org/T95064) (owner: Dzahn)
[09:49:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:51:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:52:02] PROBLEM - Disk space on an-worker1092 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1092&var-datasource=eqiad+prometheus/ops
[09:52:44] (PS1) Filippo Giunchedi: prometheus: connect prometheus 'ext' to alertmanager [puppet] - https://gerrit.wikimedia.org/r/632449 (https://phabricator.wikimedia.org/T258948)
[09:53:24] PROBLEM - Disk space on an-worker1116 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1116&var-datasource=eqiad+prometheus/ops
[09:53:24] PROBLEM - Disk space on analytics1059 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1059&var-datasource=eqiad+prometheus/ops
[09:54:30] PROBLEM - Disk space on an-worker1112 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1112&var-datasource=eqiad+prometheus/ops
[09:54:48] PROBLEM - Disk space on analytics1071 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1071&var-datasource=eqiad+prometheus/ops
[09:54:48] PROBLEM - Disk space on analytics1044 is CRITICAL: DISK CRITICAL -
/sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1044&var-datasource=eqiad+prometheus/ops [09:55:14] this is a little weird [09:55:16] PROBLEM - Disk space on an-worker1080 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1080&var-datasource=eqiad+prometheus/ops [09:55:32] PROBLEM - Disk space on an-worker1107 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1107&var-datasource=eqiad+prometheus/ops [09:56:00] PROBLEM - Disk space on an-worker1081 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1081&var-datasource=eqiad+prometheus/ops [09:56:15] (03PS1) 10Marostegui: dbstore1004: Decrease buffer pool a bit [puppet] - 10https://gerrit.wikimedia.org/r/632451 [09:56:22] PROBLEM - Disk space on analytics1055 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1055&var-datasource=eqiad+prometheus/ops [09:56:32] PROBLEM - Disk space on an-worker1086 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1086&var-datasource=eqiad+prometheus/ops [09:56:49] 
elukey: I found https://bugs.launchpad.net/ubuntu/+source/nagios-plugins/+bug/1516451 :D [09:57:19] elukey: better: http://www.dailyithelp.com/nagios-disk-critical-syskerneldebugtracing-is-not-accessible-permission/ [09:57:38] reading [09:57:54] same bug in debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910267 [09:58:00] PROBLEM - Disk space on analytics1049 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1049&var-datasource=eqiad+prometheus/ops [09:58:04] but why now [09:58:13] that's a good question! [09:58:16] I don't see anything happening on the nodes [09:58:31] (03PS1) 10Filippo Giunchedi: alertmanager: fix config invalid alert [puppet] - 10https://gerrit.wikimedia.org/r/632452 (https://phabricator.wikimedia.org/T258948) [09:59:04] !log enable puppet on mc20* [09:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:38] (03CR) 10Elukey: [C: 03+1] dbstore1004: Decrease buffer pool a bit [puppet] - 10https://gerrit.wikimedia.org/r/632451 (owner: 10Marostegui) [09:59:49] (03CR) 10Jbond: [C: 04-1] "looks good but missing a default" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [10:00:13] (03CR) 10Marostegui: [C: 03+2] dbstore1004: Decrease buffer pool a bit [puppet] - 10https://gerrit.wikimedia.org/r/632451 (owner: 10Marostegui) [10:00:40] PROBLEM - Disk space on an-worker1093 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [10:00:42] PROBLEM - Disk space on analytics1075 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1075&var-datasource=eqiad+prometheus/ops [10:00:52] PROBLEM - Disk space on analytics1043 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1043&var-datasource=eqiad+prometheus/ops [10:01:22] PROBLEM - Disk space on analytics1048 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1048&var-datasource=eqiad+prometheus/ops [10:01:26] PROBLEM - Disk space on an-worker1102 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1102&var-datasource=eqiad+prometheus/ops [10:01:38] PROBLEM - Disk space on an-worker1078 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1078&var-datasource=eqiad+prometheus/ops [10:01:38] PROBLEM - Disk space on analytics1069 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1069&var-datasource=eqiad+prometheus/ops [10:01:38] PROBLEM - Disk space on an-worker1097 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1097&var-datasource=eqiad+prometheus/ops [10:01:49] !log Restart mysql on dbstore1004 to pick up new buffer pool sizes [10:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:06] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [10:02:10] PROBLEM - Disk space on analytics1058 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1058&var-datasource=eqiad+prometheus/ops [10:02:16] klausman: --^ is interesting if you have time [10:02:24] PROBLEM - Disk space on an-worker1085 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1085&var-datasource=eqiad+prometheus/ops [10:02:36] PROBLEM - Disk space on analytics1061 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1061&var-datasource=eqiad+prometheus/ops [10:03:44] PROBLEM - Disk space on an-worker1095 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1095&var-datasource=eqiad+prometheus/ops [10:03:48] PROBLEM - Disk space on analytics1062 is CRITICAL: DISK CRITICAL 
- /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1062&var-datasource=eqiad+prometheus/ops [10:04:05] moritzm: --^ any idea? [10:04:08] PROBLEM - Disk space on an-worker1084 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1084&var-datasource=eqiad+prometheus/ops [10:04:08] PROBLEM - Disk space on an-worker1101 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1101&var-datasource=eqiad+prometheus/ops [10:04:08] PROBLEM - Disk space on an-worker1088 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1088&var-datasource=eqiad+prometheus/ops [10:04:30] PROBLEM - Disk space on an-worker1087 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1087&var-datasource=eqiad+prometheus/ops [10:04:32] elukey: I can confirm that adding -X tracefs fixes it [10:04:32] PROBLEM - Disk space on analytics1064 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1064&var-datasource=eqiad+prometheus/ops [10:05:48] PROBLEM - Disk space on analytics1070 is CRITICAL: DISK 
CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops [10:05:48] PROBLEM - Disk space on an-worker1083 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1083&var-datasource=eqiad+prometheus/ops [10:05:50] PROBLEM - Disk space on an-worker1117 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1117&var-datasource=eqiad+prometheus/ops [10:05:52] PROBLEM - Disk space on an-worker1089 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1089&var-datasource=eqiad+prometheus/ops [10:05:52] PROBLEM - Disk space on an-worker1099 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1099&var-datasource=eqiad+prometheus/ops [10:05:52] PROBLEM - Disk space on analytics1067 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1067&var-datasource=eqiad+prometheus/ops [10:05:52] PROBLEM - Disk space on analytics1066 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1066&var-datasource=eqiad+prometheus/ops [10:06:24] PROBLEM - Disk space on analytics1060 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1060&var-datasource=eqiad+prometheus/ops [10:06:26] volans: if you know how to add it on the nagios check disk config I'd be grateful, otherwise I'll check it [10:06:29] sorry for the spam [10:06:44] elukey: sure, let me find the right place, I'm wondering why only on analytics hosts [10:06:54] PROBLEM - Disk space on an-worker1104 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1104&var-datasource=eqiad+prometheus/ops [10:06:54] PROBLEM - Disk space on analytics1077 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1077&var-datasource=eqiad+prometheus/ops [10:07:06] elukey: ah, we have it in modules/profile/manifests/base.pp [10:07:11] might be overridden [10:07:22] PROBLEM - Disk space on an-worker1094 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1094&var-datasource=eqiad+prometheus/ops [10:07:24] PROBLEM - Disk space on an-worker1106 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1106&var-datasource=eqiad+prometheus/ops [10:07:26] PROBLEM - Disk space on analytics1063 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1063&var-datasource=eqiad+prometheus/ops [10:07:26] PROBLEM - Disk space on analytics1074 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1074&var-datasource=eqiad+prometheus/ops [10:07:30] I was checking base::monitoring::host [10:07:56] PROBLEM - Disk space on analytics1042 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1042&var-datasource=eqiad+prometheus/ops [10:08:06] PROBLEM - Disk space on an-worker1115 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1115&var-datasource=eqiad+prometheus/ops [10:08:10] PROBLEM - Disk space on analytics1076 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1076&var-datasource=eqiad+prometheus/ops [10:08:35] volans: it is weird, I see --exclude-type=tracefs [10:09:00] PROBLEM - Disk space on analytics1051 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1051&var-datasource=eqiad+prometheus/ops [10:09:20] PROBLEM - Disk space on an-worker1105 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1105&var-datasource=eqiad+prometheus/ops [10:09:23] elukey: you override profile::base::check_disk_options [10:09:40] PROBLEM - Disk space on an-worker1100 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1100&var-datasource=eqiad+prometheus/ops [10:09:40] PROBLEM - Disk space on an-worker1111 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1111&var-datasource=eqiad+prometheus/ops [10:09:40] PROBLEM - Disk space on analytics1045 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1045&var-datasource=eqiad+prometheus/ops [10:09:52] elukey: I'll send a patch [10:09:52] * elukey cries in a corner [10:09:56] PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [10:10:05] volans: thanks a lot [10:10:12] PROBLEM - Disk space on an-worker1098 is CRITICAL: 
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1098&var-datasource=eqiad+prometheus/ops [10:10:36] PROBLEM - Disk space on an-worker1090 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1090&var-datasource=eqiad+prometheus/ops [10:10:36] PROBLEM - Disk space on an-worker1091 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1091&var-datasource=eqiad+prometheus/ops [10:10:54] PROBLEM - Disk space on an-worker1103 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1103&var-datasource=eqiad+prometheus/ops [10:11:05] (03PS1) 10Volans: analytics: exclude tracefs FS from check_disk [puppet] - 10https://gerrit.wikimedia.org/r/632453 [10:11:05] elukey: ^^^ should do it IMHO [10:11:07] in case you didn't know we have a big hadoop cluster :D [10:11:12] PROBLEM - Disk space on analytics1046 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1046&var-datasource=eqiad+prometheus/ops [10:11:52] as for why it was triggered now, I have no idea [10:12:06] (03CR) 10Elukey: [C: 03+2] analytics: exclude tracefs FS from check_disk [puppet] - 10https://gerrit.wikimedia.org/r/632453 (owner: 10Volans) [10:12:08] PROBLEM - Disk space on 
an-worker1108 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1108&var-datasource=eqiad+prometheus/ops [10:12:08] PROBLEM - Disk space on analytics1073 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1073&var-datasource=eqiad+prometheus/ops [10:12:20] PROBLEM - Disk space on an-worker1096 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1096&var-datasource=eqiad+prometheus/ops [10:12:33] unrelated but I'm please to report that alertmanager deduplication is working as expected in cases like this [10:12:47] nice!, looking [10:12:49] s/please/pleased/ [10:12:56] elukey: looking [10:13:06] PROBLEM - Disk space on an-worker1113 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1113&var-datasource=eqiad+prometheus/ops [10:13:06] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops [10:13:08] moritzm: already done [10:13:10] PROBLEM - Disk space on an-worker1082 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1082&var-datasource=eqiad+prometheus/ops [10:13:12] ack [10:13:19] running puppet now [10:14:17] use batch :D [10:14:28] nono all in one go it is better [10:14:38] two times also [10:14:39] :D [10:14:40] volans: the deduplication on irc too in case you are interested: https://phabricator.wikimedia.org/P12929 [10:15:21] lol, didn't know about that chan :D [10:15:28] really nice godog [10:16:00] RECOVERY - Disk space on an-worker1080 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1080&var-datasource=eqiad+prometheus/ops [10:16:16] RECOVERY - Disk space on an-worker1107 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1107&var-datasource=eqiad+prometheus/ops [10:16:20] hehe thanks [10:16:28] PROBLEM - Disk space on analytics1068 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1068&var-datasource=eqiad+prometheus/ops [10:16:28] PROBLEM - Disk space on analytics1065 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1065&var-datasource=eqiad+prometheus/ops [10:16:42] RECOVERY - Disk space on an-worker1081 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1081&var-datasource=eqiad+prometheus/ops [10:17:14] RECOVERY - Disk space on an-worker1086 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1086&var-datasource=eqiad+prometheus/ops [10:17:34] godog: would be nice to have the link to the group in alerts, like: https://alerts.wikimedia.org/?q=alertname%3DIcinga%2FDisk%20space&q=%40receiver%3Dirc [10:17:57] and I wish there was a way to expand the list all at once [10:18:06] and not 12 at a time [10:18:42] RECOVERY - Disk space on analytics1049 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1049&var-datasource=eqiad+prometheus/ops [10:21:00] RECOVERY - Disk space on an-worker1093 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [10:21:00] RECOVERY - Disk space on analytics1075 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1075&var-datasource=eqiad+prometheus/ops [10:21:12] RECOVERY - Disk space on analytics1043 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1043&var-datasource=eqiad+prometheus/ops [10:21:18] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:21:30] RECOVERY - Disk space on analytics1048 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1048&var-datasource=eqiad+prometheus/ops [10:21:30] RECOVERY - Disk space on an-worker1102 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1102&var-datasource=eqiad+prometheus/ops [10:21:38] RECOVERY - Disk space on an-worker1078 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1078&var-datasource=eqiad+prometheus/ops [10:21:38] volans: thanks <3 [10:21:38] RECOVERY - Disk space on an-worker1097 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1097&var-datasource=eqiad+prometheus/ops [10:21:38] RECOVERY - Disk space on analytics1069 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1069&var-datasource=eqiad+prometheus/ops [10:21:39] * volans looking at the dns one [10:21:52] elukey: anytime :) [10:22:06] RECOVERY - Disk space on analytics1072 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [10:22:21] (03PS1) 10Elukey: spark2: set spark.shuffle.io.maxRetries to 10 [puppet] - 10https://gerrit.wikimedia.org/r/632454 [10:22:24] RECOVERY - Disk space on an-worker1085 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1085&var-datasource=eqiad+prometheus/ops [10:23:44] RECOVERY - Disk space on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1095&var-datasource=eqiad+prometheus/ops [10:23:48] RECOVERY - Disk space on analytics1062 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1062&var-datasource=eqiad+prometheus/ops [10:24:08] RECOVERY - Disk space on an-worker1084 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1084&var-datasource=eqiad+prometheus/ops [10:24:08] RECOVERY - Disk space on an-worker1088 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1088&var-datasource=eqiad+prometheus/ops [10:24:08] RECOVERY - Disk space on an-worker1101 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1101&var-datasource=eqiad+prometheus/ops [10:24:32] RECOVERY - Disk space on an-worker1087 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1087&var-datasource=eqiad+prometheus/ops [10:25:48] RECOVERY - Disk space on analytics1070 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops [10:25:48] RECOVERY - Disk space on an-worker1083 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1083&var-datasource=eqiad+prometheus/ops [10:25:50] RECOVERY - Disk space on an-worker1117 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1117&var-datasource=eqiad+prometheus/ops [10:25:54] RECOVERY - Disk space on an-worker1089 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1089&var-datasource=eqiad+prometheus/ops [10:25:54] RECOVERY - Disk space on an-worker1099 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1099&var-datasource=eqiad+prometheus/ops [10:26:26] RECOVERY - Disk space on analytics1060 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1060&var-datasource=eqiad+prometheus/ops [10:26:58] RECOVERY - Disk space on an-worker1104 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1104&var-datasource=eqiad+prometheus/ops [10:27:28] RECOVERY - Disk space on an-worker1094 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1094&var-datasource=eqiad+prometheus/ops [10:27:30] RECOVERY - Disk space on an-worker1106 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1106&var-datasource=eqiad+prometheus/ops [10:27:32] RECOVERY - Disk space on analytics1063 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1063&var-datasource=eqiad+prometheus/ops [10:27:42] (03CR) 10Hnowlan: [C: 03+2] restbase: return restbase2009 to the host list [puppet] - 10https://gerrit.wikimedia.org/r/626119 (owner: 10Hnowlan) [10:28:00] RECOVERY - Disk space on analytics1042 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1042&var-datasource=eqiad+prometheus/ops [10:28:10] RECOVERY - Disk space on 
an-worker1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1115&var-datasource=eqiad+prometheus/ops [10:29:04] RECOVERY - Disk space on analytics1051 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1051&var-datasource=eqiad+prometheus/ops [10:29:26] RECOVERY - Disk space on an-worker1105 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1105&var-datasource=eqiad+prometheus/ops [10:29:46] RECOVERY - Disk space on an-worker1100 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1100&var-datasource=eqiad+prometheus/ops [10:29:46] RECOVERY - Disk space on an-worker1111 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1111&var-datasource=eqiad+prometheus/ops [10:29:46] RECOVERY - Disk space on analytics1045 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1045&var-datasource=eqiad+prometheus/ops [10:30:04] RECOVERY - Disk space on an-worker1110 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [10:30:20] RECOVERY - Disk space on an-worker1098 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1098&var-datasource=eqiad+prometheus/ops [10:30:41] !log hnowlan@deploy1001 Started deploy [restbase/deploy@4ad65b0]: (no justification provided) 
[10:30:42] RECOVERY - Disk space on an-worker1091 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1091&var-datasource=eqiad+prometheus/ops [10:30:42] RECOVERY - Disk space on an-worker1090 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1090&var-datasource=eqiad+prometheus/ops [10:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:00] RECOVERY - Disk space on an-worker1103 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1103&var-datasource=eqiad+prometheus/ops [10:31:18] RECOVERY - Disk space on analytics1046 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1046&var-datasource=eqiad+prometheus/ops [10:31:21] !log volans@cumin1001 START - Cookbook sre.dns.netbox [10:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:22] RECOVERY - Disk space on an-worker1108 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1108&var-datasource=eqiad+prometheus/ops [10:32:22] RECOVERY - Disk space on analytics1073 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1073&var-datasource=eqiad+prometheus/ops [10:32:32] RECOVERY - Disk space on an-worker1096 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1096&var-datasource=eqiad+prometheus/ops [10:32:58] RECOVERY - Disk space on an-worker1092 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1092&var-datasource=eqiad+prometheus/ops [10:33:16] RECOVERY - Disk space on an-worker1109 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops [10:33:16] RECOVERY - Disk space on an-worker1113 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1113&var-datasource=eqiad+prometheus/ops [10:33:20] RECOVERY - Disk space on an-worker1082 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1082&var-datasource=eqiad+prometheus/ops [10:33:42] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@4ad65b0]: (no justification provided) (duration: 03m 01s) [10:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:24] RECOVERY - Disk space on an-worker1116 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1116&var-datasource=eqiad+prometheus/ops [10:35:28] RECOVERY - Disk space on an-worker1112 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1112&var-datasource=eqiad+prometheus/ops [10:35:50] RECOVERY - Disk space on analytics1071 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1071&var-datasource=eqiad+prometheus/ops [10:35:50] RECOVERY - Disk space on analytics1044 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1044&var-datasource=eqiad+prometheus/ops [10:36:46] RECOVERY - Disk space on analytics1065 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1065&var-datasource=eqiad+prometheus/ops [10:36:46] RECOVERY - Disk space on analytics1068 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1068&var-datasource=eqiad+prometheus/ops [10:36:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:36:48] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:00] (03CR) 10Jbond: [C: 03+1] prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [10:37:09] !log hnowlan@deploy1001 Started deploy [restbase/deploy@4ad65b0]: Redeploying to depooled restbase2009 [10:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:22] RECOVERY - Disk space on analytics1055 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1055&var-datasource=eqiad+prometheus/ops [10:37:24] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@4ad65b0]: Redeploying to depooled restbase2009 (duration: 00m 15s) [10:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:34] (03PS8) 10Jbond: casandra: drop legacy _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) 
[10:39:53] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [10:39:55] (03CR) 10jerkins-bot: [V: 04-1] casandra: drop legacy _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [10:39:57] 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10jijiki) ` [Tue Oct 6 06:28:23 2020] ata2.00: failed command: READ FPDMA QUEUED [Tue Oct 6 06:28:23 2020] ata2.00: cmd 60/80:00:00:a9:f7/00:00:03:00:00/40 tag 0 ncq dma 65536 in... [10:40:01] volans: indeed the link would be nice to have! I'm making a note about it, not sure if the threshold for expanding alert list is configurable tho [10:40:19] (03PS9) 10Jbond: casandra: drop legacy _validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) [10:41:05] godog: ack, would be nice the hostname too when it's only one fwiw [10:41:45] !log hnowlan@deploy1001 Started deploy [restbase/deploy@4ad65b0]: Deploying restbase to new hosts [10:41:47] (03PS3) 10Mforns: Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/628895 (https://phabricator.wikimedia.org/T231339) (owner: 10Ottomata) [10:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:01] 10Operations, 10Technical-blog-posts, 10Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (10ema) [10:42:23] mmhh yeah that's a little trickier [10:42:38] RECOVERY - Disk space on analytics1058 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1058&var-datasource=eqiad+prometheus/ops [10:43:04] RECOVERY - Disk space on analytics1061 is 
OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1061&var-datasource=eqiad+prometheus/ops [10:43:05] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@4ad65b0]: Deploying restbase to new hosts (duration: 01m 19s) [10:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:09] (03CR) 10Mforns: Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628895 (https://phabricator.wikimedia.org/T231339) (owner: 10Ottomata) [10:44:11] !log hnowlan@deploy1001 Started deploy [restbase/deploy@4ad65b0]: Deploying restbase to new hosts [10:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:04] RECOVERY - Disk space on analytics1064 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1064&var-datasource=eqiad+prometheus/ops [10:45:29] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@4ad65b0]: Deploying restbase to new hosts (duration: 01m 19s) [10:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:24] RECOVERY - Disk space on analytics1066 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1066&var-datasource=eqiad+prometheus/ops [10:46:24] RECOVERY - Disk space on analytics1067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1067&var-datasource=eqiad+prometheus/ops [10:47:28] !log jiji@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2279.codfw.wmnet [10:47:30] RECOVERY - Disk space on analytics1077 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1077&var-datasource=eqiad+prometheus/ops [10:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:52] (03PS1) 10Filippo Giunchedi: alertmanager: drop the plus sign for group notifications [puppet] - 10https://gerrit.wikimedia.org/r/632457 (https://phabricator.wikimedia.org/T258948) [10:48:04] RECOVERY - Disk space on analytics1074 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1074&var-datasource=eqiad+prometheus/ops [10:48:07] !log set mw2279.codfw.wmnet as inactive T264698 [10:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:12] T264698: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 [10:48:26] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [10:48:48] RECOVERY - Disk space on analytics1076 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1076&var-datasource=eqiad+prometheus/ops [10:50:50] (03PS2) 10Filippo Giunchedi: alertmanager: drop the plus sign for group notifications [puppet] - 10https://gerrit.wikimedia.org/r/632457 (https://phabricator.wikimedia.org/T258948) [10:51:25] 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10jijiki) According to netbox, this server is still under warranty. @Papaul the server is set as inactive, let me know if you need anything from, thank you! 
[10:52:42] (PS2) Elukey: spark2: set spark.shuffle.io.maxRetries to 10 [puppet] - https://gerrit.wikimedia.org/r/632454 [10:54:30] (CR) Jbond: [C: +2] casandra: drop legacy _validate_ functions [puppet] - https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [10:55:08] RECOVERY - Disk space on analytics1059 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1059&var-datasource=eqiad+prometheus/ops [10:55:42] (CR) Giuseppe Lavagetto: "The code is correct, but I suggest an alternative approach in my comments. If you still prefer the current approach, consider this a +1 😊" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) (owner: JMeybohm) [10:57:50] (CR) Jbond: "deployed no-op" [puppet] - https://gerrit.wikimedia.org/r/616733 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201006T1100). [11:00:04] revi: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] I can deploy today [11:00:24] hi [11:00:30] is it just me? 
[11:00:37] revi: it seems so :) [11:00:39] if yes and you have some time — give me 2 minutes [11:00:44] * revi needs to finish his dinner [11:00:46] revi: sure, tell me when ready [11:00:48] kk [11:01:21] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [11:01:22] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:20] !log hnowlan@deploy1001 Started deploy [restbase/deploy@4ad65b0]: Redeploying restbase2009 [11:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:30] (PS2) Revi: GrowthExperiments: Change Help Page URL for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/631854 (https://phabricator.wikimedia.org/T254364) [11:02:32] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@4ad65b0]: Redeploying restbase2009 (duration: 00m 12s) [11:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:38] ready, rebased [11:02:43] Urbanecm: ^ [11:02:48] thanks [11:03:04] (CR) Urbanecm: [C: +2] GrowthExperiments: Change Help Page URL for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/631854 (https://phabricator.wikimedia.org/T254364) (owner: Revi) [11:03:29] (PS1) Jbond: casandra -rspec: update rspec jobs [puppet] - https://gerrit.wikimedia.org/r/632458 [11:04:03] (Merged) jenkins-bot: GrowthExperiments: Change Help Page URL for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/631854 (https://phabricator.wikimedia.org/T254364) (owner: Revi) [11:04:15] !log hnowlan@deploy1001 Started deploy [restbase/deploy@4ad65b0]: Redeploying restbase2009 [11:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:30] (CR) Jbond: [C: +2] casandra -rspec: update rspec jobs [puppet] - 
https://gerrit.wikimedia.org/r/632458 (owner: Jbond) [11:04:45] revi: can you test it at mwdebug2001, please? [11:04:50] (CR) Joal: [C: +1] "Yes! Let's test" [puppet] - https://gerrit.wikimedia.org/r/632454 (owner: Elukey) [11:04:54] 2001 [11:05:18] (CR) Joal: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/632433 (owner: Elukey) [11:05:26] revi: yes, 2001 [11:05:42] https://ko.wikipedia.org/w/index.php?title=%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EC%A7%88%EB%AC%B8%EB%B0%A9/2020%EB%85%84_%EC%A0%9C41%EC%A3%BC&oldid=27775620 as expected [11:05:44] +2 [11:05:52] (CR) Filippo Giunchedi: [C: +2] prometheus: add 50 percentile for ats-tls TTFB [puppet] - https://gerrit.wikimedia.org/r/632190 (https://phabricator.wikimedia.org/T263536) (owner: Filippo Giunchedi) [11:05:54] cool [11:06:03] (CR) Jbond: "I forgot but i had an old change[1] which i think covers most of this, sorry for not rembering later. i have also added the additional rs" [puppet] - https://gerrit.wikimedia.org/r/630312 (owner: Dzahn) [11:06:18] revi: scap'ing it to the whole universe then :) [11:06:29] great [11:07:05] is this translation accurate revi? 
https://usercontent.irccloud-cdn.com/file/JcgBela9/image.png [11:07:10] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5f9721b3300c8e733d331bcbc754d31d9493f8ba: GrowthExperiments: Change Help Page URL for kowiki (T254364) (duration: 01m 00s) [11:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:16] T254364: Change Korean Wikipedia archive model to weekly - https://phabricator.wikimedia.org/T254364 [11:07:16] mostly tes [11:07:22] used for test pages [11:07:25] yes* [11:07:29] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@4ad65b0]: Redeploying restbase2009 (duration: 03m 14s) [11:07:29] i see [11:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:43] revi: well, unless you've anything else, we're done :) [11:08:06] I guess so :-) [11:08:13] Cool :) [11:09:24] (03PS2) 10Urbanecm: Allow bureaucrats to remove sysop permissions on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623119 (https://phabricator.wikimedia.org/T261481) (owner: 10Mdaniels5757) [11:09:51] (03CR) 10Urbanecm: [C: 03+2] "per T261481#6519905 et seq" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623119 (https://phabricator.wikimedia.org/T261481) (owner: 10Mdaniels5757) [11:10:53] (03PS9) 10Jbond: role::logstash: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617085 [11:11:35] (03Merged) 10jenkins-bot: Allow bureaucrats to remove sysop permissions on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623119 (https://phabricator.wikimedia.org/T261481) (owner: 10Mdaniels5757) [11:12:19] (03CR) 10jerkins-bot: [V: 04-1] role::logstash: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617085 (owner: 10Jbond) [11:13:34] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 103.7 gt 100 
https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [11:15:45] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5cc7027ba8d0ddee5c9898b80afe850603bf870e: Allow bureaucrats to remove sysop permissions on Commons (T261481) (duration: 00m 58s) [11:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:51] T261481: Allow bureaucrats to remove sysop permissions on Commons - https://phabricator.wikimedia.org/T261481 [11:16:07] (03PS1) 10Urbanecm: ruewiki: Add rollbacker, grantable and revokable by sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632459 (https://phabricator.wikimedia.org/T264147) [11:17:10] (03PS2) 10Urbanecm: ruewiki: Add rollbacker, grantable and revokable by sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632459 (https://phabricator.wikimedia.org/T264147) [11:17:13] (03CR) 10Urbanecm: [C: 03+2] ruewiki: Add rollbacker, grantable and revokable by sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632459 (https://phabricator.wikimedia.org/T264147) (owner: 10Urbanecm) [11:19:11] (03Merged) 10jenkins-bot: ruewiki: Add rollbacker, grantable and revokable by sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632459 (https://phabricator.wikimedia.org/T264147) (owner: 10Urbanecm) [11:19:56] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7b1a4fad0f55c626e42961489062115d5f97ed6c: ruewiki: Add rollbacker, grantable and revokable by sysops (T264147) (duration: 00m 58s) [11:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:03] T264147: Add rollbacker role on Rusyn (rue) Wikipedia - https://phabricator.wikimedia.org/T264147 [11:20:31] !log push L3 prep work to cloudsw1-c8-eqiad [11:20:34] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:25] (PS10) Jbond: role::logstash: refactor role to conform to coding guide [puppet] - https://gerrit.wikimedia.org/r/617085 [11:24:59] (PS1) Urbanecm: arbcom_ruwiki: Change favicon to File:Arbcom-ru_favicon.svg from commons [mediawiki-config] - https://gerrit.wikimedia.org/r/632460 (https://phabricator.wikimedia.org/T264430) [11:25:01] (PS1) Urbanecm: arbcom_ruwiki: Set AK as alias for NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/632461 (https://phabricator.wikimedia.org/T264430) [11:25:12] (PS2) Urbanecm: arbcom_ruwiki: Change favicon to File:Arbcom-ru_favicon.svg from commons [mediawiki-config] - https://gerrit.wikimedia.org/r/632460 (https://phabricator.wikimedia.org/T264430) [11:25:16] (CR) Urbanecm: [C: +2] arbcom_ruwiki: Change favicon to File:Arbcom-ru_favicon.svg from commons [mediawiki-config] - https://gerrit.wikimedia.org/r/632460 (https://phabricator.wikimedia.org/T264430) (owner: Urbanecm) [11:25:58] (Merged) jenkins-bot: arbcom_ruwiki: Change favicon to File:Arbcom-ru_favicon.svg from commons [mediawiki-config] - https://gerrit.wikimedia.org/r/632460 (https://phabricator.wikimedia.org/T264430) (owner: Urbanecm) [11:27:38] (PS11) Jbond: role::logstash: refactor role to conform to coding guide [puppet] - https://gerrit.wikimedia.org/r/617085 [11:29:04] (PS2) Urbanecm: arbcom_ruwiki: Set AK as alias for NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/632461 (https://phabricator.wikimedia.org/T264430) [11:29:11] (CR) Urbanecm: [C: +2] arbcom_ruwiki: Set AK as alias for NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/632461 (https://phabricator.wikimedia.org/T264430) (owner: Urbanecm) [11:29:54] (Merged) jenkins-bot: arbcom_ruwiki: Set AK as alias for NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/632461 (https://phabricator.wikimedia.org/T264430) (owner: 
10Urbanecm) [11:30:20] !log urbanecm@deploy1001 Synchronized static/favicon/arbcom_ruwiki.ico: 7e4e81129b8697c394ec329dd2b3c784e607a4d1: arbcom_ruwiki: Change favicon to File:Arbcom-ru_favicon.svg from commons (T264430) (duration: 00m 58s) [11:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:26] T264430: arbcom-ru.wikipedia.org: Change favicon; customized interwiki/namespace linking or adding an alias - https://phabricator.wikimedia.org/T264430 [11:31:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7e4e81129b8697c394ec329dd2b3c784e607a4d1: arbcom_ruwiki: Change favicon to File:Arbcom-ru_favicon.svg from commons (T264430) (duration: 00m 58s) [11:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:59] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 07c19f97c79ec20d6b1657e589acfc242dd53b09: arbcom_ruwiki: Set AK as alias for NS_PROJECT (T264430) (duration: 00m 58s) [11:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:28] !log urbanecm@mwmaint2001:~$ mwscript namespaceDupes.php --wiki=arbcom_ruwiki --fix # T264430 # P12930 [11:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:37] !log EU B&C window done [11:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:16] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [11:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:32] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [11:39:14] !log 
ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:06] (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/617085 (owner: Jbond) [11:40:20] (PS6) Jbond: role::logstash::puppetboard: refactor role [puppet] - https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) [11:40:43] (CR) Alexandros Kosiaris: Add pytest and a simple test for decommission (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/631448 (https://phabricator.wikimedia.org/T257297) (owner: Alexandros Kosiaris) [11:41:12] (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [11:41:20] (PS4) Jbond: role::logstash::collector: remove unused role [puppet] - https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) [11:41:24] (CR) jerkins-bot: [V: -1] role::logstash::puppetboard: refactor role [puppet] - https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [11:41:28] (PS5) Jbond: role::logstash::collector: remove unused role [puppet] - https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) [11:41:39] (PS7) Jbond: role::logstash::puppetboard: refactor role [puppet] - https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) [11:41:48] (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [11:41:51] (CR) jerkins-bot: [V: -1] role::logstash::collector: remove unused role [puppet] - https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [11:41:54] (PS6) Jbond: role::logstash::collector: remove unused role [puppet] - 
https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) [11:42:01] (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [11:42:14] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [11:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:27] (CR) Volans: "reply inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/631448 (https://phabricator.wikimedia.org/T257297) (owner: Alexandros Kosiaris) [11:50:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:08] (PS7) Jbond: service: drop legacy validate functions [puppet] - https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) [11:51:11] (CR) jerkins-bot: [V: -1] service: drop legacy validate functions [puppet] - https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [11:51:19] Operations, SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (Kormat) [11:53:52] PROBLEM - mediawiki-installation DSH group on mw2279 is CRITICAL: Host mw2279 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:54:12] (PS8) Jbond: service: drop legacy validate functions [puppet] - https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) [11:58:45] (CR) Jbond: "Ready for review" [puppet] - https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201006T1200) [12:01:04] PROBLEM - Rate of JVM GC Old generation-s 
runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 110.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [12:03:57] (03CR) 10Elukey: [C: 03+2] spark2: set spark.shuffle.io.maxRetries to 10 [puppet] - 10https://gerrit.wikimedia.org/r/632454 (owner: 10Elukey) [12:04:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> LGTM the 3 failures from PCC can be safley ignored" [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn) [12:05:25] (03PS8) 10Alexandros Kosiaris: Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (https://phabricator.wikimedia.org/T257297) [12:05:28] (03PS6) 10Alexandros Kosiaris: decommission: Avoid matching some IPs in regexp [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 [12:08:56] !log deploy puppetlabs-stdlib 5.2 [12:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:00] (03CR) 10Jbond: [C: 03+2] stdlib: update to version 5.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn) [12:09:08] (03PS8) 10Jbond: stdlib: update to version 5.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn) [12:09:47] (03PS1) 10Muehlenhoff: Remove obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/632466 (https://phabricator.wikimedia.org/T210993) [12:09:55] (03CR) 10Elukey: [C: 03+2] Add new G1GC heap and gc timing settings for Hadoop Namenodes [puppet] - 10https://gerrit.wikimedia.org/r/632433 (owner: 10Elukey) [12:10:15] elukey: you good for me to merge [12:10:29] yep! 
[12:10:48] merged [12:13:14] !log imported envoyproxy_1.15.1-2 to buster-wikimedia and stretch-wikimedia [12:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:52] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10elukey) I know that it seems strange but the `researchers` group is not needed for you @leila, it is an old one that we'll deprecate :) [12:20:23] !log update HDFS Namenode GC/Heap settings on an-master100[1,2] [12:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:28] (03PS3) 10JMeybohm: envoy: New upstream version 1.15.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/631383 (https://phabricator.wikimedia.org/T264157) [12:23:51] (03PS2) 10JMeybohm: citoid: Update to envoy 1.15.1-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/631384 (https://phabricator.wikimedia.org/T264157) [12:25:16] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10Kormat) Hi Leila, Thanks for your reply, things are a lot clearer now :) The existing shell account is associated with your volunteer wikitech acc... [12:29:16] (03Abandoned) 10Alexandros Kosiaris: Revert "Revert "user homes: Allow git to control +x for $HOME files"" [puppet] - 10https://gerrit.wikimedia.org/r/394586 (owner: 10Alexandros Kosiaris) [12:30:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks, merging!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (https://phabricator.wikimedia.org/T257297) (owner: 10Alexandros Kosiaris) [12:30:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:31:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] decommission: Avoid matching some IPs in regexp [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 (owner: 10Alexandros Kosiaris) [12:31:09] (03PS1) 10Muehlenhoff: Stop using Diamond on Cloud VPS/Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/632471 (https://phabricator.wikimedia.org/T210993) [12:31:47] (03PS1) 10Ayounsi: Add cloud customer to eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/632472 [12:31:50] (03Merged) 10jenkins-bot: Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (https://phabricator.wikimedia.org/T257297) (owner: 10Alexandros Kosiaris) [12:32:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:33:42] (03CR) 10Ayounsi: [C: 03+2] Add cloud customer to eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/632472 (owner: 10Ayounsi) [12:33:54] 10Operations, 10SRE-tools, 10Continuous-Integration-Config, 10Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10jbond) > This task T148494 is to add generic support for ShellCheck beyond just operations/puppet. Which has two caveats: > > how to find files to... 
[12:34:09] (03Merged) 10jenkins-bot: decommission: Avoid matching some IPs in regexp [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 (owner: 10Alexandros Kosiaris) [12:36:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/630562 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:38:34] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10elukey) [12:41:44] (03PS1) 10Aklapper: Delete puppet role and module for Phragile [puppet] - 10https://gerrit.wikimedia.org/r/632475 (https://phabricator.wikimedia.org/T240308) [12:42:10] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy: New upstream version 1.15.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/631383 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:44:15] (03PS2) 10Aklapper: Delete puppet role and module for Phragile [puppet] - 10https://gerrit.wikimedia.org/r/632475 (https://phabricator.wikimedia.org/T240308) [12:44:24] 10Operations, 10Traffic: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working - https://phabricator.wikimedia.org/T264378 (10CDanis) >>! In T264378#6520360, @ema wrote: > @cdanis only noticed the logstash entries mentioned here while looking for clues, I don't think he meant to su... 
[12:45:58] (03PS1) 10Alexandros Kosiaris: redis: Check for existence of pass [puppet] - 10https://gerrit.wikimedia.org/r/632476 (https://phabricator.wikimedia.org/T228266) [12:48:10] (03PS1) 10Muehlenhoff: Don't add diamond to new images [puppet] - 10https://gerrit.wikimedia.org/r/632477 (https://phabricator.wikimedia.org/T210993) [12:51:45] go _ale [12:51:50] lol, nope [12:52:10] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [12:53:26] (03PS1) 10JMeybohm: Use ubuntu 16.04 as buildsystem to be compatible with stretch [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/632479 [12:53:48] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 66.1 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [12:54:34] (03PS2) 10JMeybohm: Use ubuntu 16.04 as buildsystem to be compatible with stretch [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/632479 [12:55:13] !log swift codfw-prod: bump weight for ms-be2057 - T261633 [12:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:20] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 [12:55:41] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/632466 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [12:56:35] (03CR) 10Muehlenhoff: [C: 03+2] Remove 
obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/632466 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [13:00:04] hashar and marxarelli: Your horoscope predicts another unfortunate Mediawiki train - European+American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201006T1300). [13:04:36] !log Change innodb_change_buffering = inserts on db2075 db2089 db2099 db2111 db2128 T263443 [13:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:43] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 [13:08:56] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1002/25700/" [puppet] - 10https://gerrit.wikimedia.org/r/632476 (https://phabricator.wikimedia.org/T228266) (owner: 10Alexandros Kosiaris) [13:09:12] 10Operations, 10Traffic, 10Performance-Team (Radar): Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10Gilles) Seems like the cache is warm now and the host is faster than its peer: {F32375693, size=full} [13:14:23] !log pushed docker-registry.discovery.wmnet/envoy:1.15.1-2 - T264157 [13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:56] (03CR) 10Jbond: [C: 03+1] "LGTM <3" [puppet] - 10https://gerrit.wikimedia.org/r/632476 (https://phabricator.wikimedia.org/T228266) (owner: 10Alexandros Kosiaris) [13:19:07] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T264630 (10Kormat) Hi @CGlenn, In order to complete this task, we just need sign-off from your manager (@ahemmer, i presume :) [13:20:22] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:01] (03PS1) 10JMeybohm: api-gateway: use default envoy 1.15.1 image [deployment-charts] - 10https://gerrit.wikimedia.org/r/632483 (https://phabricator.wikimedia.org/T264157) [13:23:32] stat1007 is me [13:24:11] (03CR) 10JMeybohm: [C: 03+2] citoid: Update to envoy 1.15.1-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/631384 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [13:25:53] (03PS1) 10Ppchelko: Allow read of CentralAuth special pages on api portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632484 (https://phabricator.wikimedia.org/T264637) [13:26:39] (03Merged) 10jenkins-bot: citoid: Update to envoy 1.15.1-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/631384 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [13:26:50] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:54] (03PS3) 10Ppchelko: Force local short descriptions for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631775 (https://phabricator.wikimedia.org/T263493) [13:37:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks sensible, ldd on /usr/bin/envoyproxy only lists libc-related libraries." [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/632479 (owner: 10JMeybohm) [13:40:20] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[13:40:20] !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db2119 from dump/vslow, add to all other contributions/logpager/recentchanges*/watchlist temporarily T259831', diff saved to https://phabricator.wikimedia.org/P12931 and previous config saved to /var/cache/conftool/dbconfig/20201006-134020-kormat.json [13:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:31] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:41:27] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:41:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:49] !log kormat@cumin1001 dbctl commit (dc=all): 'db2137:3314 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12932 and previous config saved to /var/cache/conftool/dbconfig/20201006-134149-kormat.json [13:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:47] (03PS1) 10Ayounsi: Diffscan: T5 is too much [puppet] - 10https://gerrit.wikimedia.org/r/632488 [13:49:49] (03CR) 10Ema: [C: 03+1] "A couple of minor comments, the patch LGTM though." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis) [13:53:31] (03CR) 10Ayounsi: [C: 03+2] Diffscan: T5 is too much [puppet] - 10https://gerrit.wikimedia.org/r/632488 (owner: 10Ayounsi) [13:54:44] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:57:22] PROBLEM - Host db1076 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:31] ^ woot? [13:57:35] checking [13:57:48] maybe another case of BBU failure? 
[13:58:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076 ', diff saved to https://phabricator.wikimedia.org/P12934 and previous config saved to /var/cache/conftool/dbconfig/20201006-135810-marostegui.json [13:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:51] description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support [13:59:05] sobanski: told you! we jinxed the BBU thing by talking about it a few days ago ^ [13:59:10] I am creating a task for that host [13:59:31] :( [13:59:39] at least that host is going away soon [14:00:20] (03PS1) 10Vgutierrez: vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 5% [puppet] - 10https://gerrit.wikimedia.org/r/632490 (https://phabricator.wikimedia.org/T258405) [14:03:35] 10Operations, 10DBA, 10Data-Persistence: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 (10Marostegui) [14:03:54] !log Power cycle db1076 T264755 [14:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:00] T264755: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 [14:04:49] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[14:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:12] RECOVERY - Host db1076 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [14:05:47] (03PS2) 10Alexandros Kosiaris: redis: Check for existence of pass [puppet] - 10https://gerrit.wikimedia.org/r/632476 (https://phabricator.wikimedia.org/T228266) [14:07:09] 10Operations, 10DBA, 10Data-Persistence: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 (10Marostegui) The battery is gone: ` root@db1076:~# hpssacli controller all show detail | grep Battery No-Battery Write Cache: Disabled Battery/Capacitor Count: 0 ` [14:07:42] 10Operations, 10DBA, 10Data-Persistence: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 (10Marostegui) p:05Triage→03Medium [14:08:00] !log Reboot db1076 for kernel upgrade T264755 [14:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:12] PROBLEM - Host db1076 is DOWN: PING CRITICAL - Packet loss = 100% [14:09:42] ^ me rebooting [14:10:39] (03PS1) 10Marostegui: db1076: Add clarification comment on BBU status [puppet] - 10https://gerrit.wikimedia.org/r/632494 (https://phabricator.wikimedia.org/T264755) [14:11:15] (03CR) 10Marostegui: [C: 03+2] db1076: Add clarification comment on BBU status [puppet] - 10https://gerrit.wikimedia.org/r/632494 (https://phabricator.wikimedia.org/T264755) (owner: 10Marostegui) [14:11:46] RECOVERY - Host db1076 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:11:57] 10Operations, 10DBA, 10Data-Persistence, 10Patch-For-Review: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 (10Marostegui) I have rebooted the host to make sure it boots up cleanly and to get it on the newest kernel. Let's leave the controller on `write through` policy and if we s... 
[14:12:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merging to at least get PCC working again for those hosts, but the real solution is probably rearchitecting those profile classes to share" [puppet] - 10https://gerrit.wikimedia.org/r/632476 (https://phabricator.wikimedia.org/T228266) (owner: 10Alexandros Kosiaris) [14:12:36] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10CDanis) [14:12:41] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): wikimedia.pl returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T261506 (10CDanis) [14:12:45] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10CDanis) [14:13:03] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10CDanis) [14:13:06] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10CDanis) [14:13:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] "It's important to point out that PCC for a slave host might be missing redis configuration stuff after this patch. 
I 've added comments to" [puppet] - 10https://gerrit.wikimedia.org/r/632476 (https://phabricator.wikimedia.org/T228266) (owner: 10Alexandros Kosiaris) [14:15:03] !log installed envoyproxy 1.15.1-2 on mwdebug1001 [14:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:24] 10Operations, 10DBA, 10Data-Persistence, 10Patch-For-Review: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 (10Marostegui) MySQL has started and recovered from the crash, I am going to let replication catch up and then start a table comparison to make sure we are good on the data... [14:19:15] !log volker-e@deploy1001 Started deploy [design/style-guide@e3fda83]: Deploy design/style-guide: [14:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:20] !log volker-e@deploy1001 Finished deploy [design/style-guide@e3fda83]: Deploy design/style-guide: (duration: 00m 05s) [14:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:59] (03CR) 10Cwhite: [C: 03+2] mtail: upgrade mtail across the fleet to 3.0.0~rc35-3+wmf3 [puppet] - 10https://gerrit.wikimedia.org/r/631501 (https://phabricator.wikimedia.org/T263728) (owner: 10Cwhite) [14:28:20] (03CR) 10Ema: [C: 03+1] vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 5% [puppet] - 10https://gerrit.wikimedia.org/r/632490 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [14:29:41] ACKNOWLEDGEMENT - HP RAID on db1076 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T264757 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:29:47] 10Operations, 10ops-eqiad: Degraded RAID on db1076 - https://phabricator.wikimedia.org/T264757 (10ops-monitoring-bot) [14:31:58] !log 
kormat@cumin1001 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12935 and previous config saved to /var/cache/conftool/dbconfig/20201006-143157-kormat.json [14:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:04] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [14:32:15] (03PS1) 10Hnowlan: conftool-data: add new restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/632497 (https://phabricator.wikimedia.org/T261512) [14:33:05] 10Operations, 10DBA, 10Data-Persistence, 10Patch-For-Review: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 (10Marostegui) [14:33:08] 10Operations, 10ops-eqiad: Degraded RAID on db1076 - https://phabricator.wikimedia.org/T264757 (10Marostegui) [14:34:00] 10Operations, 10DBA, 10Data-Persistence, 10Patch-For-Review: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 (10Marostegui) Comparison between the master and this host started for the following tables: ` actor actor_id archive ar_id change_tag ct_id comment comment_id logging log_i... 
[14:35:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:36:18] !log repooling restbase2009 [14:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:32] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:37:51] 10Operations, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10colewhite) [14:38:18] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=restbase,service=restbase,name=restbase2009.codfw.wmnet [14:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:38] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=restbase,service=restbase-ssl,name=restbase2009.codfw.wmnet [14:38:39] 10Operations, 10observability, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (10colewhite) 05Open→03Resolved Patched mtail rolling out to the fleet this mor... [14:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:43] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=restbase,service=restbase-backend,name=restbase2009.codfw.wmnet [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:34] (03CR) 10WMDE-leszek: [C: 03+1] "thanks. I thought this has been removed months ago." 
[puppet] - 10https://gerrit.wikimedia.org/r/632475 (https://phabricator.wikimedia.org/T240308) (owner: 10Aklapper) [14:40:10] !log updated envoyproxy to 1.15.1-2 on mw2295.codfw.wmnet,restbase2017.codfw.wmnet [14:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:44:14] (03CR) 10Vgutierrez: [C: 03+2] vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 5% [puppet] - 10https://gerrit.wikimedia.org/r/632490 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [14:44:41] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [14:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:11] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:45:13] !log Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 5% - T262946 [14:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:21] T262946: Bump Firefox version in basic support to 3.6 or newer - https://phabricator.wikimedia.org/T262946 [14:45:35] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [14:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:40] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:55] (03PS1) 10Ema: vcl: include beresp.was_304 in cacheable cookie logging [puppet] - 10https://gerrit.wikimedia.org/r/632499 (https://phabricator.wikimedia.org/T264378) [14:47:01] !log kormat@cumin1001 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 100%: schema change T259831', diff saved to 
https://phabricator.wikimedia.org/P12936 and previous config saved to /var/cache/conftool/dbconfig/20201006-144701-kormat.json [14:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:07] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [14:47:56] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:48:34] (03CR) 10CDanis: [C: 03+1] vcl: include beresp.was_304 in cacheable cookie logging [puppet] - 10https://gerrit.wikimedia.org/r/632499 (https://phabricator.wikimedia.org/T264378) (owner: 10Ema) [14:49:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Stop using Diamond on Cloud VPS/Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/632471 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [14:52:11] (03PS1) 10Reedy: Add trailing full stop to mirrors index.html [puppet] - 10https://gerrit.wikimedia.org/r/632501 [14:53:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:53:46] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:56:50] (03CR) 10Bstorm: "This is effectively a "local" rsync, since this is from the hdfs, right? If so, that's totally fine. 
If remote rsync, I need to check some" [puppet] - 10https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) (owner: 10Fdans) [14:57:16] !log upload dnsdist_1.5.0-1wm1 to apt.wm.o (buster) - T263789 [14:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:22] T263789: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 [14:57:53] (03PS2) 10Ema: vcl: include beresp.was_304 in cacheable cookie logging [puppet] - 10https://gerrit.wikimedia.org/r/632499 (https://phabricator.wikimedia.org/T264378) [14:58:30] (03CR) 10Ema: [C: 03+2] vcl: include beresp.was_304 in cacheable cookie logging [puppet] - 10https://gerrit.wikimedia.org/r/632499 (https://phabricator.wikimedia.org/T264378) (owner: 10Ema) [15:04:07] (03CR) 10Fdans: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) (owner: 10Fdans) [15:04:10] (03CR) 10Elukey: Import the config module from Spicerack (033 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [15:05:13] fdans: do you want me to merge? [15:07:59] 10Operations, 10Traffic, 10Patch-For-Review: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (10ssingh) [15:09:30] 10Operations, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [15:09:51] 10Operations, 10Traffic, 10Patch-For-Review: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (10ssingh) 05Open→03Resolved ` sukhe@malmok:~$ /usr/bin/dnsdist --version dnsdist 1.5.0 (Lua 5.2.4) ` Completed upgrade to dnsdist 1.5.0, marking as closed. 
[15:14:43] (03CR) 10Ebernhardson: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/632319 (https://phabricator.wikimedia.org/T264053) (owner: 10Ebernhardson) [15:16:15] (03PS1) 10Ssingh: Revert "dnsdist: temporarily disable validate_cmd for dnsdist.conf" [puppet] - 10https://gerrit.wikimedia.org/r/632504 [15:19:03] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/25709/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/632504 (owner: 10Ssingh) [15:19:56] (03CR) 10Ssingh: [C: 03+2] Revert "dnsdist: temporarily disable validate_cmd for dnsdist.conf" [puppet] - 10https://gerrit.wikimedia.org/r/632504 (owner: 10Ssingh) [15:23:20] !log updated envoyproxy to 1.15.1-2 on mw-canary and restbase-canary [15:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:41] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [15:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:22] PROBLEM - SSH on ms-be2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:26:54] RECOVERY - SSH on ms-be2023 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:27:02] (03CR) 10Herron: [C: 03+1] "LGTM, thanks for updating this! PCC for all logstash hosts is at https://puppet-compiler.wmflabs.org/compiler1002/25708/" [puppet] - 10https://gerrit.wikimedia.org/r/617085 (owner: 10Jbond) [15:28:50] 10Operations, 10Traffic: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (10ayounsi) a:03BBlack Might also be fine to remove it now that the testing have been done. 
[15:35:38] 10Operations, 10Technical-blog-posts, 10Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (10srodlund) @ema Great! The doc just came through. Looking forward to reading and editing this! [15:36:10] (03CR) 10Jbond: [C: 03+2] role::logstash: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617085 (owner: 10Jbond) [15:37:37] (03PS4) 10Jbond: role::logstash7: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617095 (https://phabricator.wikimedia.org/T259013) [15:37:52] (03PS5) 10Jbond: role::logstash7: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617095 (https://phabricator.wikimedia.org/T259013) [15:40:20] (03CR) 10Jbond: [C: 03+2] role::logstash7: refactor role to conform to coding guide [puppet] - 10https://gerrit.wikimedia.org/r/617095 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:40:52] (03PS8) 10Jbond: role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) [15:41:20] !log centrallog* delete archived logs from old, single file, organization [15:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:00] (03CR) 10Jbond: [C: 03+2] role::logstash::puppetboard: refactor role [puppet] - 10https://gerrit.wikimedia.org/r/617087 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:44:06] (03PS7) 10Jbond: role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) [15:46:21] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10leila) @elukey ok. :) @Kormat let's go with `leizi`. Thanks! 
:) [15:46:22] (03CR) 10Herron: [C: 03+1] role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:48:23] (03PS8) 10Jbond: role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) [15:49:00] (03CR) 10Cwhite: [C: 03+1] alertmanager: drop the plus sign for group notifications [puppet] - 10https://gerrit.wikimedia.org/r/632457 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [15:49:17] herron: can you do anoether pass of //gerrit.wikimedia.org/r/617090, i just updated the cumin alias as well [15:49:39] (03CR) 10Cwhite: [C: 03+1] alertmanager: fix config invalid alert [puppet] - 10https://gerrit.wikimedia.org/r/632452 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [15:49:50] jbond42: sure looking again now [15:49:54] thx [15:50:01] (03CR) 10Cwhite: [C: 03+1] prometheus: connect prometheus 'ext' to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/632449 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [15:52:04] (03Abandoned) 10Hnowlan: restbase: temporarily remove new nodes to allow for bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/631184 (https://phabricator.wikimedia.org/T264092) (owner: 10Hnowlan) [15:52:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I tried to trace back why I did say something in the comment and the code did something else, but failed to find any rationale." 
[puppet] - 10https://gerrit.wikimedia.org/r/631503 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [15:52:13] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/632205 (https://phabricator.wikimedia.org/T264588) (owner: 10Filippo Giunchedi) [15:54:10] (03PS1) 10Cmjohnson: Add maps1005-1010 to site.pp and add mac addresses to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/632511 (https://phabricator.wikimedia.org/T260269) [15:55:42] 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) >>! In T264074#6520414, @ema wrote: > Definitely, please feel free to go ahead if you have the time. Implicit but it's probably better to state it clearly: if you do have... [15:56:56] (03CR) 10Herron: role::logstash::collector: remove unused role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:57:30] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Urbanecm) Hello, Wikimedia Czech Republic uses maps.wikimedia.org in our new website, which you can preview at https://... [15:58:42] PROBLEM - Unmerged changes on repository puppet on puppetmaster1003 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:58:48] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:10] unmerged is me, going now [16:00:04] jbond42 and cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201006T1600). [16:00:22] RECOVERY - Unmerged changes on repository puppet on puppetmaster1003 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:00:22] (03CR) 10Cmjohnson: [C: 03+2] Add maps1005-1010 to site.pp and add mac addresses to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/632511 (https://phabricator.wikimedia.org/T260269) (owner: 10Cmjohnson) [16:00:27] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: drop the plus sign for group notifications [puppet] - 10https://gerrit.wikimedia.org/r/632457 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:00:38] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: fix config invalid alert [puppet] - 10https://gerrit.wikimedia.org/r/632452 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:00:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: connect prometheus 'ext' to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/632449 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:01:29] (03PS9) 10Jbond: role::logstash::collector: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) [16:01:46] (03PS2) 10Filippo Giunchedi: alertmanager: fix config invalid alert [puppet] - 10https://gerrit.wikimedia.org/r/632452 (https://phabricator.wikimedia.org/T258948) [16:01:56] (03CR) 10Jbond: role::logstash::collector: remove unused role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [16:02:15] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] alertmanager: fix config invalid alert [puppet] - 10https://gerrit.wikimedia.org/r/632452 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:02:40] (03CR) 10Jbond: [C: 03+2] role::logstash::collector: remove unused role [puppet] - 
10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [16:02:50] (03CR) 10Herron: [C: 03+1] "thx again for this!" [puppet] - 10https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [16:05:15] (03Abandoned) 10Herron: admin: change sbailey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/631259 (https://phabricator.wikimedia.org/T264127) (owner: 10Herron) [16:05:22] (03PS2) 10Ebernhardson: Lower elasticsearch readahead from 128kB to 16kB [puppet] - 10https://gerrit.wikimedia.org/r/632319 (https://phabricator.wikimedia.org/T264053) [16:09:08] (03PS1) 10Filippo Giunchedi: hieradata: enable rsyslog queues for kafka in esams/eqsin/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/632512 (https://phabricator.wikimedia.org/T226703) [16:13:02] 10Operations, 10Traffic, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (10fgiunchedi) With 50 percentile added I'm considering this closed! As a demo/playground I've started https://grafana.... 
[16:13:07] 10Operations, 10Traffic, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (10fgiunchedi) 05Open→03Resolved [16:19:04] (03PS1) 10Andrew Bogott: openldap: increase query size limit [puppet] - 10https://gerrit.wikimedia.org/r/632514 (https://phabricator.wikimedia.org/T264770) [16:21:18] (03CR) 10BryanDavis: [C: 03+1] openldap: increase query size limit [puppet] - 10https://gerrit.wikimedia.org/r/632514 (https://phabricator.wikimedia.org/T264770) (owner: 10Andrew Bogott) [16:22:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10Cmjohnson) [16:26:47] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:36] (03PS1) 10JMeybohm: Add more dummy default secret overrides [labs/private] - 10https://gerrit.wikimedia.org/r/632515 (https://phabricator.wikimedia.org/T260917) [16:30:51] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add more dummy default secret overrides [labs/private] - 10https://gerrit.wikimedia.org/r/632515 (https://phabricator.wikimedia.org/T260917) (owner: 10JMeybohm) [16:32:25] RECOVERY - Alertmanager config is not valid on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [16:32:32] (03PS4) 10JMeybohm: deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) [16:33:13] (03CR) 10JMeybohm: deployment_server::helmfile: Allow default secrets per environment (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) (owner: 10JMeybohm) [16:34:01] (03PS1) 10Hnowlan: apiportalwiki: disable ElectronPdfService [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632516 (https://phabricator.wikimedia.org/T264043) [16:36:31] (03CR) 10Andrew Bogott: [C: 03+2] openldap: increase query size limit [puppet] - 10https://gerrit.wikimedia.org/r/632514 (https://phabricator.wikimedia.org/T264770) (owner: 10Andrew Bogott) [16:36:45] RECOVERY - Prometheus is failing to connect to Alertmanager on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [16:43:40] (03CR) 10Herron: [C: 03+1] hieradata: enable rsyslog queues for kafka in esams/eqsin/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/632512 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [16:44:02] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10bd808) As a #toolforge and #cloud-vps administrator I would like to request that `*.tooforge.org`, `*.wmcloud.org`, and... [16:45:58] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10CDanis) >>! 
In T261694#6522265, @bd808 wrote: > As a #toolforge and #cloud-vps administrator I would like to request th... [17:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201006T1700). [17:07:52] (03PS1) 10Jdlrobson: Hot fix: Use display for hiding/showing sidebar on OS 14_0 [skins/MinervaNeue] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632520 (https://phabricator.wikimedia.org/T264376) [17:13:11] (03PS1) 10Andrew Bogott: Revert "openldap: increase query size limit" [puppet] - 10https://gerrit.wikimedia.org/r/632523 (https://phabricator.wikimedia.org/T264770) [17:13:13] (03PS1) 10Andrew Bogott: Ldap: Raise query limit again [puppet] - 10https://gerrit.wikimedia.org/r/632524 (https://phabricator.wikimedia.org/T264770) [17:14:45] (03CR) 10Andrew Bogott: [C: 03+2] Revert "openldap: increase query size limit" [puppet] - 10https://gerrit.wikimedia.org/r/632523 (https://phabricator.wikimedia.org/T264770) (owner: 10Andrew Bogott) [17:15:36] (03CR) 10BryanDavis: [C: 03+1] Revert "openldap: increase query size limit" [puppet] - 10https://gerrit.wikimedia.org/r/632523 (https://phabricator.wikimedia.org/T264770) (owner: 10Andrew Bogott) [17:15:52] (03CR) 10BryanDavis: [C: 03+1] Ldap: Raise query limit again [puppet] - 10https://gerrit.wikimedia.org/r/632524 (https://phabricator.wikimedia.org/T264770) (owner: 10Andrew Bogott) [17:16:01] (03CR) 10Andrew Bogott: [C: 03+2] Ldap: Raise query limit again [puppet] - 10https://gerrit.wikimedia.org/r/632524 (https://phabricator.wikimedia.org/T264770) (owner: 10Andrew Bogott) [17:18:54] (03PS1) 10Jbond: kartotherian: drop validate_array function [puppet] - 10https://gerrit.wikimedia.org/r/632526 (https://phabricator.wikimedia.org/T259013) [17:19:24] (03CR) 10Jbond: [C: 03+2] kartotherian: drop validate_array function [puppet] - 10https://gerrit.wikimedia.org/r/632526 
(https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [17:22:26] (03PS1) 10Jbond: logstash::input::tcp: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/632527 (https://phabricator.wikimedia.org/T259013) [17:22:59] (03CR) 10Jbond: [C: 03+2] logstash::input::tcp: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/632527 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [17:23:06] (03PS1) 10Jdlrobson: Hot fix: Use display for hiding/showing sidebar on OS 14_0 [skins/MinervaNeue] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632528 (https://phabricator.wikimedia.org/T264376) [17:28:29] (03CR) 10Herron: [C: 03+2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/631952 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [17:30:49] (03PS1) 10Jbond: wmflib: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/632529 (https://phabricator.wikimedia.org/T259013) [17:31:47] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Redirect 2030.wikimedia.org to the new movement strategy portal - https://phabricator.wikimedia.org/T202498 (10sguebo_WMF) Hello @Dzahn, can the 2030.wikimedia.org subdomain redirect to the new url: https://meta.wikimedia.org/wiki/Wikimedia_... 
[17:32:31] (03CR) 10Jbond: [C: 03+2] wmflib: drop validate functions [puppet] - 10https://gerrit.wikimedia.org/r/632529 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [17:34:06] (03PS3) 10Jbond: wmflib: drop Wmflib::Sourceurl and replace with Stdlib::Filesource [puppet] - 10https://gerrit.wikimedia.org/r/629423 [17:35:39] (03CR) 10Jbond: [C: 03+2] wmflib: drop Wmflib::Sourceurl and replace with Stdlib::Filesource [puppet] - 10https://gerrit.wikimedia.org/r/629423 (owner: 10Jbond) [17:38:07] PROBLEM - Host tools.wmflabs.org is DOWN: CRITICAL - Host Unreachable (tools.wmflabs.org) [17:39:17] andrewbogott, bd808: ^ [17:39:25] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms [17:39:34] Spookreeeno: ^ :) [17:39:49] bd808: yep, weird but it was dead for a few minutes [17:40:05] Or 1.5 minutes between alerts [17:40:06] yeah, seems flakey right now. [17:40:37] :( [17:41:19] Spookreeeno: I poked some folks to take a look (/me is in meeting) [17:42:17] bd808: hope they sort it. 
I would say enjoy your meeting but I don't know who you're with or what it's about :) [17:45:57] (03CR) 10Ppchelko: apiportalwiki: disable ElectronPdfService (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632516 (https://phabricator.wikimedia.org/T264043) (owner: 10Hnowlan) [17:47:37] (03CR) 10Nray: [C: 03+1] Hot fix: Use display for hiding/showing sidebar on OS 14_0 [skins/MinervaNeue] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632520 (https://phabricator.wikimedia.org/T264376) (owner: 10Jdlrobson) [17:50:12] (03CR) 10Ppchelko: apiportalwiki: disable ElectronPdfService (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632516 (https://phabricator.wikimedia.org/T264043) (owner: 10Hnowlan) [17:52:54] (03CR) 10Nray: [C: 03+1] Hot fix: Use display for hiding/showing sidebar on OS 14_0 [skins/MinervaNeue] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632528 (https://phabricator.wikimedia.org/T264376) (owner: 10Jdlrobson) [18:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201006T1800). [18:00:04] Pchelolo and nray: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:14] o/ here and ready [18:00:27] nray: your's seem to be hot fixes, wanna go first? mine's less urgent [18:00:38] works for me! [18:00:52] pls ping me when done, I'll deploy my own [18:01:03] (03PS1) 10Jbond: wmflib: drop apply_format function [puppet] - 10https://gerrit.wikimedia.org/r/632532 [18:01:34] nray I don't know much about the back end of the deployment process, but since wmf.11 isn't currently deployed anywhere does the patch need to be included? [18:02:29] DannyS712: depends if .12 was cut yesterday right? 
[18:03:12] yes [18:03:17] nray: do you have deployment rights? :-) [18:03:22] (03CR) 10Elukey: [C: 03+2] Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/628895 (https://phabricator.wikimedia.org/T231339) (owner: 10Ottomata) [18:03:29] no I don't have deployment rights [18:03:39] I mean after the .11 patch is merged, does it need to be deployed if there isn't anything running the code to deploy it to? [18:03:49] 10Operations, 10SRE-OnFire, 10Sustainability (Incident Followup): Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (10RLazarus) 05Open→03Resolved This should be done! Thanks one last time for your patience. @Addshore @Ladsgroup @Lu... [18:04:02] DannyS712: in that case no [18:04:48] I think it just needs to be merged. The .10 patch needs to definitely be deployed though [18:05:01] nray: okay, thanks [18:05:11] nray: are you able to test it? [18:05:41] (03CR) 10Urbanecm: [C: 03+2] Hot fix: Use display for hiding/showing sidebar on OS 14_0 [skins/MinervaNeue] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632520 (https://phabricator.wikimedia.org/T264376) (owner: 10Jdlrobson) [18:05:57] I'd like to test the popular browsers. 
The hotfix is only for a particular browser version so will be hard to fully test but I can test for regressions [18:06:16] nray: sure, I'll ping you once it's ready [18:06:21] sweet, thank you [18:06:35] Pchelolo: you can go forward with your config patches, I'm waiting on CI [18:06:44] thank you [18:06:52] please ping you once finished [18:07:24] (03CR) 10Ppchelko: [C: 03+2] Allow read of CentralAuth special pages on api portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632484 (https://phabricator.wikimedia.org/T264637) (owner: 10Ppchelko) [18:07:50] sure [18:08:14] (03Merged) 10jenkins-bot: Allow read of CentralAuth special pages on api portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632484 (https://phabricator.wikimedia.org/T264637) (owner: 10Ppchelko) [18:09:53] (03CR) 10Urbanecm: [C: 03+2] Hot fix: Use display for hiding/showing sidebar on OS 14_0 [skins/MinervaNeue] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632528 (https://phabricator.wikimedia.org/T264376) (owner: 10Jdlrobson) [18:10:04] Urbanecm: what's the urgency of T261694 ? 
I'd like to do a minor refactor of that VCL code while I'm there [18:10:05] T261694: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 [18:10:41] cdanis: I think the web will be deployed to visitors in ~1 month [18:12:03] (03PS2) 10Ppchelko: Add API Portal to $wgCentralAuthAutoLoginWikis - prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632323 (https://phabricator.wikimedia.org/T264637) (owner: 10Cicalese) [18:12:12] (03CR) 10Ppchelko: [C: 03+2] Add API Portal to $wgCentralAuthAutoLoginWikis - prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632323 (https://phabricator.wikimedia.org/T264637) (owner: 10Cicalese) [18:12:28] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:632484 T264637 (duration: 00m 58s) [18:12:30] Urbanecm: ah great, thanks very much for the advance notice :) [18:12:31] (maybe three weeks if I won't introduce new bugs, or if management won't want new features) [18:12:36] yeah that's no problem [18:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:38] T264637: Add API Portal to $wgCentralAuthAutoLoginWikis - https://phabricator.wikimedia.org/T264637 [18:12:42] cdanis: cool, thanks :) [18:13:01] (03Merged) 10jenkins-bot: Add API Portal to $wgCentralAuthAutoLoginWikis - prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632323 (https://phabricator.wikimedia.org/T264637) (owner: 10Cicalese) [18:15:39] (03PS2) 10Ppchelko: apiportalwiki: disable ElectronPdfService [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632516 (https://phabricator.wikimedia.org/T264043) (owner: 10Hnowlan) [18:15:44] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:632323 T264637 (duration: 00m 58s) [18:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:58] (03CR) 10Ppchelko: [C: 03+2] apiportalwiki: disable ElectronPdfService [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/632516 (https://phabricator.wikimedia.org/T264043) (owner: 10Hnowlan) [18:16:49] (03Merged) 10jenkins-bot: apiportalwiki: disable ElectronPdfService [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632516 (https://phabricator.wikimedia.org/T264043) (owner: 10Hnowlan) [18:19:06] (03PS4) 10Ppchelko: Force local short descriptions for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631775 (https://phabricator.wikimedia.org/T263493) [18:19:10] (03CR) 10Ppchelko: [C: 03+2] Force local short descriptions for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631775 (https://phabricator.wikimedia.org/T263493) (owner: 10Ppchelko) [18:19:11] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:632516 T264043 (duration: 00m 59s) [18:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:17] T264043: Remove Print/export from navigation - https://phabricator.wikimedia.org/T264043 [18:20:07] (03Merged) 10jenkins-bot: Force local short descriptions for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631775 (https://phabricator.wikimedia.org/T263493) (owner: 10Ppchelko) [18:21:41] (03CR) 10Jbond: "Ready for review pcc's all no-op" [puppet] - 10https://gerrit.wikimedia.org/r/632532 (owner: 10Jbond) [18:22:31] (03PS4) 10Razzi: oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) [18:23:38] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: IS.php gerrit:631775 T263493 T259622 (duration: 00m 59s) [18:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:45] T259622: System Administrator disables Wikidata fallback for article descriptions - https://phabricator.wikimedia.org/T259622 [18:23:46] T263493: Reader gets appropriate article short description in search results - 
https://phabricator.wikimedia.org/T263493 [18:24:59] !log ppchelko@deploy1001 Synchronized wmf-config/Wikibase.php: Wikibase.php gerrit:631775 T263493 T259622 (duration: 00m 58s) [18:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:31] (03PS1) 10BryanDavis: dynamicproxy: guard against missing Host header [puppet] - 10https://gerrit.wikimedia.org/r/632534 [18:25:55] Urbanecm: I'm done. [18:26:02] thanks [18:26:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` maps1005.eqiad.wmnet ` The log can be found in... [18:26:42] (03CR) 10BryanDavis: "Needs testing" [puppet] - 10https://gerrit.wikimedia.org/r/632534 (owner: 10BryanDavis) [18:27:17] * Urbanecm is still waiting on CI [18:28:51] (03Merged) 10jenkins-bot: Hot fix: Use display for hiding/showing sidebar on OS 14_0 [skins/MinervaNeue] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632520 (https://phabricator.wikimedia.org/T264376) (owner: 10Jdlrobson) [18:29:49] finally [18:30:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` ['maps1006.eqiad.wmnet', 'maps1007.eqiad.wmnet'... [18:31:45] nray: pulled onto mwdebug2001, can you test, please? [18:31:58] Urbanecm: thank you, testing now, will ping when done [18:32:04] thanks [18:35:03] Urbanecm: things look good. You can proceed! 
[18:35:07] syncing [18:36:20] (03Merged) 10jenkins-bot: Hot fix: Use display for hiding/showing sidebar on OS 14_0 [skins/MinervaNeue] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632528 (https://phabricator.wikimedia.org/T264376) (owner: 10Jdlrobson) [18:37:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps1005.eqiad.wmnet'] ` Of which those **FAILED**: ` ['maps1005.eqiad.wmnet'] ` [18:37:41] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.10/skins/MinervaNeue/: d428ccbdf3be9a45139f8b8c0874c113f1732198: Hot fix: Use display for hiding/showing sidebar on OS 14_0 (T264376) (duration: 01m 03s) [18:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:48] nray: should be live :) [18:38:13] (03CR) 10Bstorm: [C: 03+1] "Looks legit." [puppet] - 10https://gerrit.wikimedia.org/r/632534 (owner: 10BryanDavis) [18:38:23] Urbanecm: thank you so much for your help! [18:38:33] no problem nray [18:39:10] (03CR) 10Bstorm: [C: 03+1] "I'll merge a little later if nobody else sees anything wrong." 
[puppet] - 10https://gerrit.wikimedia.org/r/632534 (owner: 10BryanDavis) [18:39:49] (03CR) 10Bstorm: [C: 03+1] dumps::web::fetches::stat_dumps: add rsync job for pageview complete [puppet] - 10https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) (owner: 10Fdans) [18:40:45] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.11/skins/MinervaNeue/: 2118d265c0f5b6c914efeba86ba7eacd30c5ee0f: Hot fix: Use display for hiding/showing sidebar on OS 14_0 (T264376) (duration: 01m 00s) [18:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:56] !log Morning B&C done [18:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:19] (03CR) 10Bstorm: [C: 03+1] Stop using Diamond on Cloud VPS/Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/632471 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [18:42:55] (03PS3) 10Fdans: dumps::web::fetches::stat_dumps: add rsync job for pageview complete [puppet] - 10https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) [18:45:06] (03PS1) 10Clarakosi: Add demo tier to OAuthRateLimiter tier configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632535 [18:45:11] 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10wiki_willy) a:03Papaul [18:46:08] (03CR) 10jerkins-bot: [V: 04-1] Add demo tier to OAuthRateLimiter tier configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632535 (owner: 10Clarakosi) [18:53:07] (03PS2) 10Clarakosi: Add demo tier to OAuthRateLimiter tier configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632535 [18:53:56] (03CR) 10jerkins-bot: [V: 04-1] Add demo tier to OAuthRateLimiter tier configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632535 (owner: 10Clarakosi) [18:55:42] (03PS3) 10Clarakosi: Add demo tier to OAuthRateLimiter tier configurations 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/632535 [18:58:44] (03CR) 10Ppchelko: [C: 03+1] Add demo tier to OAuthRateLimiter tier configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632535 (owner: 10Clarakosi) [18:59:34] (03PS1) 10CDanis: VCL: Maps Referer block: update comments & redo regex w/ comments [puppet] - 10https://gerrit.wikimedia.org/r/632539 [19:00:04] hashar and marxarelli: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201006T1900). [19:00:26] one day I will shut those silly messages [19:00:36] status of train is still blocked [19:01:41] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Multichill) @CDanis based on the webserver logs we should know what domains give the most hits. Can you share a list of... [19:11:57] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10CDanis) >>! In T261694#6522739, @Multichill wrote: > @CDanis based on the webserver logs we should know what domains gi... [19:17:33] (03PS2) 10CDanis: VCL: Maps Referer block: update comments & redo regex w/ comments [puppet] - 10https://gerrit.wikimedia.org/r/632539 [19:17:56] (03CR) 10Dzahn: "ah, no problem. I see those are merged so let me just abandon this one." 
[puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [19:18:05] (03Abandoned) 10Dzahn: cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [19:22:23] (03CR) 10Dzahn: [V: 03+1 C: 03+2] prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [19:24:07] (03CR) 10Dzahn: "noop confirmed on db1134, labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [19:32:28] (03PS3) 10CDanis: VCL: Maps Referer block: update comments & redo regex w/ comments [puppet] - 10https://gerrit.wikimedia.org/r/632539 [19:34:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps1006.eqiad.wmnet', 'maps1007.eqiad.wmnet', 'maps1008.eqiad.wmnet', 'maps1009.eqiad.wmnet', 'ma... 
[19:35:01] (03PS4) 10CDanis: VCL: Maps Referer block: no-op!: comments & redo regex w/ comments [puppet] - 10https://gerrit.wikimedia.org/r/632539 [19:35:50] (03PS5) 10CDanis: VCL: Maps Referer block: no-op!: comments & redo regex w/ comments [puppet] - 10https://gerrit.wikimedia.org/r/632539 (https://phabricator.wikimedia.org/T261694) [19:38:38] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "compiled once for backend and once for tls class:" [puppet] - 10https://gerrit.wikimedia.org/r/631291 (owner: 10Dzahn) [19:40:53] (03CR) 10CDanis: "0 tests failed, 0 tests skipped, 16 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/632539 (https://phabricator.wikimedia.org/T261694) (owner: 10CDanis) [19:44:32] (03CR) 10Dzahn: "confirmed noop on cp1082, cp1079, cp4032" [puppet] - 10https://gerrit.wikimedia.org/r/631291 (owner: 10Dzahn) [19:46:32] (03PS6) 10CDanis: VCL: Maps Referer block: no-op!: comments & redo regex w/ comments [puppet] - 10https://gerrit.wikimedia.org/r/632539 (https://phabricator.wikimedia.org/T261694) [19:56:32] (03CR) 10BryanDavis: "Just a note for posterity that this (disabling the diamond collectors) will break https://nagf.toolforge.org/ which reads data from the gr" [puppet] - 10https://gerrit.wikimedia.org/r/632471 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [19:58:07] (03CR) 10RLazarus: [C: 03+1] "Looks like a no-op to me. Consider adding a test for affiliate subdomains if we care about keeping that fixed (which we probably do)." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632539 (https://phabricator.wikimedia.org/T261694) (owner: 10CDanis) [19:59:39] rzl: thanks! 
[19:59:46] 👍 [19:59:49] (03CR) 10CDanis: [C: 03+2] "> Patch Set 6: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632539 (https://phabricator.wikimedia.org/T261694) (owner: 10CDanis) [19:59:58] (03PS7) 10CDanis: VCL: Maps Referer block: no-op!: comments & redo regex w/ comments [puppet] - 10https://gerrit.wikimedia.org/r/632539 (https://phabricator.wikimedia.org/T261694) [20:00:04] (03CR) 10CDanis: [C: 03+2] VCL: Maps Referer block: no-op!: comments & redo regex w/ comments [puppet] - 10https://gerrit.wikimedia.org/r/632539 (https://phabricator.wikimedia.org/T261694) (owner: 10CDanis) [20:04:36] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://openstack-browser.toolforge.org/puppetclass/role::wmcs::services::ntp and compiler confirms noop https://puppet-compiler.wmflabs.o" [puppet] - 10https://gerrit.wikimedia.org/r/631304 (owner: 10Dzahn) [20:04:42] (03PS1) 10Jbond: (WIP) firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 [20:05:17] (03CR) 10Jbond: [C: 04-2] "dont merge just a conversation starter" [puppet] - 10https://gerrit.wikimedia.org/r/632543 (owner: 10Jbond) [20:06:38] (03CR) 10Dzahn: "confirmed noop on ntp-01.cloudinfra.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/631304 (owner: 10Dzahn) [20:07:48] (03PS1) 10CDanis: VCL: Maps Referer block: allow wikimedia.cz & subdomains thereof [puppet] - 10https://gerrit.wikimedia.org/r/632544 (https://phabricator.wikimedia.org/T261694) [20:11:14] (03CR) 10Reedy: [C: 03+1] Update redirection of 2030.wikimedia.org with new URI [puppet] - 10https://gerrit.wikimedia.org/r/632552 (https://phabricator.wikimedia.org/T202498) (owner: 10Samuel (WMF)) [20:13:53] (03PS1) 10Andrew Bogott: cloudvirt1023: to Ceph and Buster [puppet] - 10https://gerrit.wikimedia.org/r/632545 (https://phabricator.wikimedia.org/T259399) [20:14:54] 10Operations: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 
(10sguebo_WMF) [20:15:08] (03PS2) 10Jbond: (WIP) firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 [20:15:26] (03CR) 10Jbond: [C: 04-2] "will add a phab task for this tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/632543 (owner: 10Jbond) [20:16:19] 10Operations, 10Wikimedia-Apache-configuration: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (10Reedy) [20:16:21] (03PS5) 10Samuel (WMF): Update redirection of 2030.wikimedia.org with new URI [puppet] - 10https://gerrit.wikimedia.org/r/632552 (https://phabricator.wikimedia.org/T202498) [20:18:06] (03PS6) 10Samuel (WMF): Update redirection of 2030.wikimedia.org with new URI [puppet] - 10https://gerrit.wikimedia.org/r/632552 (https://phabricator.wikimedia.org/T264797) [20:18:41] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10User-RhinosF1: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (10RhinosF1) a:05RhinosF1→03None Was gonna do a patch but I'll review what exists [20:18:46] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10User-RhinosF1: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (10RhinosF1) a:03RhinosF1 [20:19:30] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (10RhinosF1) [20:19:39] (03CR) 10RhinosF1: [C: 03+1] Update redirection of 2030.wikimedia.org with new URI [puppet] - 10https://gerrit.wikimedia.org/r/632552 (https://phabricator.wikimedia.org/T264797) (owner: 10Samuel (WMF)) [20:23:17] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:16] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:25:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:34] (03CR) 10Dzahn: [C: 04-1] "Evaluation Error: Operator '[]' is not applicable to an Undef Value. (file: /srv/jenkins-workspace/puppet-compiler/25736/change/src/module" [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [20:28:46] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Special:HideBanners is not really cacheable - https://phabricator.wikimedia.org/T256447 (10Tgr) [20:29:26] cdanis: oh, that took way less time than i expected :D [20:29:27] thanks [20:29:44] Urbanecm: I wasn't sure the refactor I just did was going to be easy, but it turns out it was :) [20:30:07] (03CR) 10Urbanecm: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/632544 (https://phabricator.wikimedia.org/T261694) (owner: 10CDanis) [20:30:22] cdanis: cool! Always better to find out sth is easier than the other way around :) [20:33:06] (03CR) 10Dzahn: [C: 04-1] "hmm...? https://puppet-compiler.wmflabs.org/compiler1003/25736/" [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [20:33:35] (03PS3) 10Dzahn: bird/piwik/elasticsearch: replace hiera with lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/631899 [20:36:49] (03PS4) 10Dzahn: bird/piwik: replace hiera with lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/631899 [20:38:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) Hi all! I believe we can use a Refine transform function to add the requested fields (except for BGP communities IIUC) at refine time. Pl... 
[20:40:47] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Nuria) @mforns: it could also be a second job run after the refined one (similar to how we do virtual-pageviews) as we probably do not want to cre... [20:41:08] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "bird: https://puppet-compiler.wmflabs.org/compiler1002/25738/ piwik: https://puppet-compiler.wmflabs.org/compiler1001/25739/" [puppet] - 10https://gerrit.wikimedia.org/r/631899 (owner: 10Dzahn) [20:41:16] (03CR) 10Dzahn: [V: 03+1 C: 03+2] bird/piwik: replace hiera with lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/631899 (owner: 10Dzahn) [20:43:40] (03CR) 10Dzahn: "confirm noop on matomo1002 (piwik), dns3001, centrallog1001 (bird)" [puppet] - 10https://gerrit.wikimedia.org/r/631899 (owner: 10Dzahn) [20:45:44] (03PS1) 10Dzahn: elasticsearch::cirrus: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/632567 [20:49:55] (03CR) 10CDanis: [C: 03+2] "0 tests failed, 0 tests skipped, 16 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/632544 (https://phabricator.wikimedia.org/T261694) (owner: 10CDanis) [20:52:06] (03CR) 10Dzahn: [V: 03+1] "hadoop::master: https://puppet-compiler.wmflabs.org/compiler1001/25741/an-master1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631302 (owner: 10Dzahn) [20:54:33] (03PS6) 10Dzahn: thumbor: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630694 [20:54:38] (03CR) 10Dzahn: thumbor: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [20:57:48] (03CR) 10Dzahn: [V: 04-1] "parameter 'memcached_servers_nutcracker' index 0 expects a Stdlib::Host...got String" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [21:00:10] (03PS7) 10Dzahn: thumbor: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630694 
[21:01:40] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/25746/thumbor2004.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [21:05:00] (03CR) 10Effie Mouzeli: [C: 03+1] "As long as we roll it out carefully:)" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [21:09:20] (03PS1) 10Dzahn: labs_bootstrapvz: remove diamond from lists of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/632569 (https://phabricator.wikimedia.org/T210993) [21:09:40] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:11:27] * Pchelolo will deploy a little mw config change [21:11:41] (03CR) 10Ppchelko: [C: 03+2] Add demo tier to OAuthRateLimiter tier configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632535 (owner: 10Clarakosi) [21:12:22] (03Merged) 10jenkins-bot: Add demo tier to OAuthRateLimiter tier configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632535 (owner: 10Clarakosi) [21:13:35] (03PS1) 10Dzahn: wmcs::instance: remove diamond removal remnants [puppet] - 10https://gerrit.wikimedia.org/r/632570 (https://phabricator.wikimedia.org/T210993) [21:14:41] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: gerrit:632535 (duration: 01m 00s) [21:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:09] (03PS1) 10Dzahn: toolforge/dynamicproxy: remove diamond monitoring proxy [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) [21:17:11] (03CR) 10Razzi: "How does this look for a start? 
Still not sure how to go about turning off the timer on one host and turning it on in another; perhaps @El" [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [21:17:13] (03PS1) 10Dzahn: delete the diamond module [puppet] - 10https://gerrit.wikimedia.org/r/632572 (https://phabricator.wikimedia.org/T210993) [21:23:43] (03CR) 10Bstorm: wmcs::postgres: hiera->lookup and add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [21:27:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T264806 (10Andrew) [21:27:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T264806 (10Andrew) a:05Andrew→03Cmjohnson [21:28:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T264806 (10Andrew) a:05Cmjohnson→03Andrew this host still needs draining, I'll re-assign when it's empty. [21:29:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T264806 (10Andrew) 05Open→03Invalid oops, duplicate [21:29:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [21:29:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Andrew) [21:29:48] (03CR) 10Urbanecm: "it works. 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/632544 (https://phabricator.wikimedia.org/T261694) (owner: 10CDanis) [21:31:15] (03PS1) 10Volans: dns: consolidate reverse zone files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) [21:32:28] 10Operations, 10SRE-OnFire, 10Sustainability (Incident Followup): Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (10Ladsgroup) Thanks! [21:32:41] (03PS1) 10Dzahn: geoip: allow enabling archive timer via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/632575 [21:33:11] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [21:33:40] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:33:41] (03CR) 10jerkins-bot: [V: 04-1] geoip: allow enabling archive timer via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/632575 (owner: 10Dzahn) [21:35:25] (03CR) 10Bstorm: "Oh! I see what PCC is complaining about. It's in modules/postgresql/manifests/master.pp:" [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [21:35:28] (03PS2) 10Dzahn: geoip: allow enabling archive timer via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/632575 [21:37:48] (03CR) 10Dzahn: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [21:38:08] bstorm: thanks for pointing that out. 
yes, the error exists before and after the change [21:38:13] in pcc output [21:38:32] so seems like it would indeed be broken unrelatedly [21:38:36] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:41:04] (03CR) 10Volans: "Including the temporary code for double generation this is the diff:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [21:41:56] About to roll a wdqs deploy, everything looks good (lag, tests) before the deploy [21:42:24] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@e56a20e]: 0.3.51 [21:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:31] !log All tests passing on canary `wdqs1003`, proceeding to rest of fleet [21:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:48] (03PS3) 10Dzahn: geoip: allow enabling archive timer via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/632575 [21:47:14] (03PS4) 10Dzahn: geoip: allow enabling archive timer via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/632575 [21:48:53] (03CR) 10Dzahn: [V: 03+1] "PS4 works. 
It randomly enables the timer on stat1004 while the default is to keep it disabled on other hosts, just to demonstrate it and s" [puppet] - 10https://gerrit.wikimedia.org/r/632575 (owner: 10Dzahn) [21:54:08] (03CR) 10Bstorm: [C: 03+2] dynamicproxy: guard against missing Host header [puppet] - 10https://gerrit.wikimedia.org/r/632534 (owner: 10BryanDavis) [21:55:17] (03PS1) 10Dzahn: geoip: add data types [puppet] - 10https://gerrit.wikimedia.org/r/632579 [21:55:33] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@e56a20e]: 0.3.51 (duration: 13m 09s) [21:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:43] !log Restarting `wdqs-updater` across the fleet: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [21:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:28] !log Restarting `wdqs-categories` across all test instances (not public facing): `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [21:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:55] !log Restarting `wdqs-categories` across production instances one-at-a-time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 60 && systemctl restart wdqs-categories && sleep 30 && pool'` [21:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:06] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:11] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at codfw #page on alert1001 is CRITICAL: 0.2271 lt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [22:03:44] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:03:44] (03CR) 10Dzahn: [C: 03+1] "thanks, wanted the same thing but not rush it without more eyes" [puppet] - 10https://gerrit.wikimedia.org/r/632443 (https://phabricator.wikimedia.org/T149804) (owner: 10Effie Mouzeli) [22:03:48] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_wikifeeds_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:04:16] my wifi just went down, I'll be looking soon but it'll be a moment [22:04:22] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:36] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:44] PROBLEM - wikifeeds codfw on 
wikifeeds.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 20 [22:04:44] ore a response was received: /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [22:04:51] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at codfw #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.8011 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [22:05:02] got paged, looks like the recovery just came in tho [22:05:26] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:05:35] same here, page came with recovery [22:05:44] checking the dashboards [22:07:05] but yeah looks like api codfw was overloaded briefly [22:07:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:07:27] i see a spike in response time and active workers but it recovered quickly [22:07:43] okay, on now [22:07:45] looks like a spike in latency. can't tell yet if there was a correlated spike in traffic though [22:08:07] would that lead to the low fpm worker alert? [22:08:08] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:08:52] both API and appservers [22:08:58] a lot of 400 responses though [22:08:58] also big increase in rows read from s8, https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=1602018537530&orgId=1&to=1602022137530&var-cluster=api_appserver&var-datasource=codfw%20prometheus%2Fops&var-method=GET&var-code=200&viewPanel=39 [22:09:01] chaomodus: yea, it would [22:09:20] (03CR) 10Razzi: "> Patch Set 4: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/632575 (owner: 10Dzahn) [22:09:22] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:09:36] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:03] there's a tiny bump in POST traffic that might have been it, if they were particularly heavy requests [22:10:27] kay cool good to know [22:10:30] matches the timing of shdubsh's spike in 400s -- which wasn't actually all that big, peaked at ~20 rps, we just don't normally serve even that many [22:10:50] if we're lucky we might have it in mediawiki-apache2 on logstash, digging around [22:10:54] wdqs-updater restart shortly before that [22:11:21] yeah I saw that in -ops scrollback but I'm not sure I can figure out if it could be causal [22:12:01] actually s8 scraping time jumped 
too, so that's kind of suggestive [22:12:09] also restbase-dev machine alerts weirdly in the same timeframe [22:13:44] api-appserver CPU heatmap looks normal, so I think we can rule out any on-host load increase, it must have been a slow backend [22:14:11] oh, s8 rows read also spiked, I didn't see it at first [22:14:59] wikidata .. so the wdqs restarts could be related after all [22:15:01] looks like there's folks around now, I'll go to bed [22:15:15] okay, I'm more and more convinced that there was a smallish bump in POST requests, that tried to do a bunch of wikidata-related work, timed out at 5s, and returned 400s [22:15:19] godog: please do, thanks <3 [22:15:41] not sure if the requests were *caused* by the wdqs restart or if they *failed* because of the restart, or both [22:15:43] ryankemper: still around? [22:16:09] rzl: yeah, reading the above [22:16:48] the timing is also weird, our apiserver weirdness looks like it started at 22:00 sharp, which is a few minutes late for those restarts [22:17:07] first it was on test instances.. and then restbase-dev alerted.. then it was on prod instances and restbase alerted? [22:17:27] so...spike in 400s for wikifeeds that might be attributable to wdqs work? 
[22:17:41] ryankemper: just for clarity, nothing is still broken :) no immediate urgency here, it's just troubling [22:18:02] so the actual issue we got paged for wasn't the 400s, it's that api servers briefly ran out of workers [22:18:15] ack [22:18:26] or, correction, were in danger of running out of workers [22:18:38] ryankemper: it just looks like shortly after those restarts there were the restbase alerts and then that spike in appserver used workers [22:18:57] I'm pretty sure that wasn't because the api servers were themselves doing too much work, but because we had a lot of requests hanging open to some backend, which took (I think) five seconds to timeout and return a 400 [22:19:22] each of those tied up a worker, so in principle when that happens, the api servers can get starved out and be unable to handle new requests, which is the worry here [22:20:39] and the row reads on s8 like doubled for a short time and that is the wikidata shard [22:22:07] what are you looking at to get those per-shard metrics? [22:22:33] ryankemper: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard [22:22:38] down at the bottom, under "Traffic: rows read" [22:22:42] ty [22:27:25] (03CR) 10Razzi: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/632579 (owner: 10Dzahn) [22:28:38] not sure what it's indicative of but the latency worst scraping time jumped from 50ms to ~1.5s for `s8` in that time period [22:29:35] (03PS1) 10Bstorm: locales: switch to using locales-all package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632582 (https://phabricator.wikimedia.org/T263339) [22:31:06] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:59] !log Restart of `wdqs-categories` done. 
WDQS deploy is complete [22:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:54] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:40] okay, I still want to know what happened here but the harm is long over, I'm not finding much new data right now, and I'm not sure how long it's worth staring at into the evening :) [22:37:26] I wish I knew my way around our logs better but I haven't been able to come up with anything in the mediawiki-apache2 sample [22:40:50] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:28] (03PS2) 10Bstorm: locales: switch to using locales-all package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632582 (https://phabricator.wikimedia.org/T263339) [22:53:28] (03CR) 10Jeena Huneidi: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/632354 (owner: 10PipelineBot) [22:55:44] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/632354 (owner: 10PipelineBot) [22:59:03] (03Abandoned) 10Dzahn: geoip: allow enabling archive timer via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/632575 (owner: 10Dzahn) [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201006T2300). Please do the needful. [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:04:05] (03PS2) 10Dzahn: geoip: add data types [puppet] - 10https://gerrit.wikimedia.org/r/632579 [23:07:01] !log jhuneidi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
[23:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:36] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25750/stat1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/632579 (owner: 10Dzahn) [23:11:29] !log jhuneidi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [23:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:35] (03CR) 10Razzi: [C: 03+2] geoip: add data types [puppet] - 10https://gerrit.wikimedia.org/r/632579 (owner: 10Dzahn) [23:20:06] rzl: are you still around? [23:20:11] rzl: (no worries if not) [23:20:16] (03CR) 10Dzahn: service: drop legacy validate functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [23:21:13] (03CR) 10Dzahn: [C: 03+1] "compiler output looks good" [puppet] - 10https://gerrit.wikimedia.org/r/616758 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [23:27:55] (03PS2) 10Dzahn: switch DHCP servers in POPs to new local install hosts [homer/public] - 10https://gerrit.wikimedia.org/r/631261 (https://phabricator.wikimedia.org/T252526) [23:28:31] (03CR) 10Dzahn: [C: 03+2] switch DHCP servers in POPs to new local install hosts [homer/public] - 10https://gerrit.wikimedia.org/r/631261 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [23:30:44] (03CR) 10Razzi: "Confirmed noop on stat1008." 
[puppet] - 10https://gerrit.wikimedia.org/r/632579 (owner: 10Dzahn) [23:37:21] 10Operations, 10DBA, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10CDanis) [23:41:50] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Alejabo) p:05High→03Medium [23:43:29] !log jhuneidi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [23:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:03] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ladsgroup) p:05Medium→03High Do not change priorities without any context. 
[23:48:52] (03PS1) 10CDanis: depool esams for router upgrade [dns] - 10https://gerrit.wikimedia.org/r/632586 [23:49:47] (03CR) 10CDanis: [C: 03+2] depool esams for router upgrade [dns] - 10https://gerrit.wikimedia.org/r/632586 (owner: 10CDanis) [23:52:25] !log 🖧 switched DHCP server for esams from install1003 to install3001 - homer deployed to cr*esams* (T252526) 🖧 [23:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:31] T252526: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 [23:53:34] !log 🖧 switched DHCP server for ulsfo from install2003 to install4001 - homer deployed to cr*ulsfo* (T252526) 🖧 [23:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:06] (03CR) 10Ryan Kemper: [C: 03+2] Lower elasticsearch readahead from 128kB to 16kB [puppet] - 10https://gerrit.wikimedia.org/r/632319 (https://phabricator.wikimedia.org/T264053) (owner: 10Ebernhardson) [23:55:24] !log 🖧 switched DHCP server for eqsin from install2003 to install5001 - homer deployed to cr*eqsin* (T252526) 🖧 [23:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:54] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 48.75 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:58:59] ^ expected ofc [23:59:08] ack