[00:01:18] (PS1) Bstorm: cloudstore: set syncserver to only be run with puppet disabled [puppet] - https://gerrit.wikimedia.org/r/690783 (https://phabricator.wikimedia.org/T224747)
[00:10:16] (PS1) Dzahn: microsites::peopleweb: add more comments [puppet] - https://gerrit.wikimedia.org/r/690786
[00:12:02] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (RKemper) @wiki_willy I heard Papaul is out for a couple weeks so see the above comment https://phabricator.wikimedia.org/T281437#7086866
[00:20:10] (PS1) Dzahn: peopleweb: put a public_html into /etc/skel to ensure all users get one [puppet] - https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989)
[00:39:41] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs2003.codfw.wmnet` on `ryankemper@cumin2001` tmux session `wdqs_reimage`
[00:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:45] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[00:43:44] (PS1) Nray: Fix 'final_state: vector' bug in VectorPrefDiffInstrumentation [extensions/WikimediaEvents] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/690789 (https://phabricator.wikimedia.org/T261842)
[00:50:50] (PS1) Jforrester: Using RevisionListBase::getPage instead of calling $title directly [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690084 (https://phabricator.wikimedia.org/T282825)
[01:21:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:22:01] SRE, SRE-Access-Requests: Requesting access to releases1002/2002 for jhuneidi - https://phabricator.wikimedia.org/T282610 (Dzahn)
[01:22:38] (PS1) Dzahn: admin: add jhuneidi to contint-roots [puppet] - https://gerrit.wikimedia.org/r/690793 (https://phabricator.wikimedia.org/T282610)
[01:22:41] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releases1002/2002 for jhuneidi - https://phabricator.wikimedia.org/T282610 (Dzahn) confirmed L3, confirmed all the other checkboxes uploaded patch to gerrit
[01:22:53] (PS2) Dzahn: admin: add jhuneidi to contint-roots [puppet] - https://gerrit.wikimedia.org/r/690793 (https://phabricator.wikimedia.org/T282610)
[01:23:30] (PS3) Dzahn: admin: add jhuneidi to contint-roots [puppet] - https://gerrit.wikimedia.org/r/690793 (https://phabricator.wikimedia.org/T282610)
[01:23:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:33:33] (CR) Jdlrobson: "This change is ready for review." [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/690085 (https://phabricator.wikimedia.org/T280292) (owner: Jdlrobson)
[01:35:41] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (wiki_willy) Hi @RKemper - Papaul will be back on the 24th. Would you be able to hold off until then? If not, we can submit the RMA r...
[01:36:36] (CR) BPirkle: Initial image-suggestion-api helm chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: Nikki Nikkhoui)
[02:07:06] (PS1) Bstorm: cloudstore: fix some more settings on the syncserver mess [puppet] - https://gerrit.wikimedia.org/r/690795 (https://phabricator.wikimedia.org/T224747)
[02:10:04] (PS1) Jforrester: LogEventsList: always define $pageName [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834)
[02:10:30] (CR) Ppchelko: [C: +1] LogEventsList: always define $pageName [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:13:14] (CR) Jforrester: "recheck" [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[02:13:36] (CR) jerkins-bot: [V: -1] add initial Blubberfile and placeholders for prod and staging HTML [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[02:13:49] (CR) Tim Starling: "Should I deploy this?" [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:17:45] (CR) Ppchelko: [C: +1] "I donno, maybe James was going to?"
[core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:20:10] (CR) Bstorm: [C: +2] cloudstore: fix some more settings on the syncserver mess [puppet] - https://gerrit.wikimedia.org/r/690795 (https://phabricator.wikimedia.org/T224747) (owner: Bstorm)
[02:21:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:23:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:27:23] (CR) Tim Starling: [C: +2] "I'll give it a +2 anyway so the gate checks can start. I can deploy it in about 30 mins if James hasn't gotten to it by then." [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:34:28] (PS1) Tim Starling: Revert "Add assertions about page IDs during undeletion." [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690796 (https://phabricator.wikimedia.org/T282844)
[02:35:26] (CR) Jforrester: "> Patch Set 1:" [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:38:51] (CR) Tim Starling: [C: +2] "I had to upload the cherry-pick manually due to a merge conflict in the "use" list of PageArchive.php.
I've checked the diff 4 or 5 times," [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690796 (https://phabricator.wikimedia.org/T282844) (owner: Tim Starling)
[02:43:19] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 54382368 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:45:47] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 60768 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:45:52] (PS1) Jforrester: Check array boundaries before accessing array [extensions/MapSources] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690809 (https://phabricator.wikimedia.org/T282833)
[02:46:25] TimStarling: If you're deploying anyway, there are another two: https://gerrit.wikimedia.org/r/q/branch:wmf/1.37.0-wmf.5+status:open
[02:48:46] (CR) Jforrester: "I think you meant to cherry-pick this to REL1_36 instead (it doesn't cleanly cherry-pick there either, FWIW)."
[core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/690085 (https://phabricator.wikimedia.org/T280292) (owner: Jdlrobson)
[02:48:49] (Abandoned) Jforrester: Legacy feature should not load thumbnail style rules (only layout) [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/690085 (https://phabricator.wikimedia.org/T280292) (owner: Jdlrobson)
[02:49:33] (CR) Ppchelko: [C: +1] Check array boundaries before accessing array [extensions/MapSources] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690809 (https://phabricator.wikimedia.org/T282833) (owner: Jforrester)
[02:50:46] (Merged) jenkins-bot: LogEventsList: always define $pageName [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[03:03:47] (Merged) jenkins-bot: Revert "Add assertions about page IDs during undeletion." [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690796 (https://phabricator.wikimedia.org/T282844) (owner: Tim Starling)
[03:04:44] (CR) Tim Starling: [C: +2] Check array boundaries before accessing array [extensions/MapSources] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690809 (https://phabricator.wikimedia.org/T282833) (owner: Jforrester)
[03:09:46] (Merged) jenkins-bot: Check array boundaries before accessing array [extensions/MapSources] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690809 (https://phabricator.wikimedia.org/T282833) (owner: Jforrester)
[03:12:39] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:13:36] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/logging/LogEventsList.php: fix PHP notice T282834 (duration: 01m 08s)
[03:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:40] T282834: LogEventsList.php: PHP Notice: Undefined variable: pageName - https://phabricator.wikimedia.org/T282834
[03:15:05] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:16:30] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/Revision/RevisionArchiveRecord.php: fix DeletedContributions breakage T282844 (duration: 01m 07s)
[03:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:34] T282844: Special:DeletedContributions shows no or almost no edits.
- https://phabricator.wikimedia.org/T282844
[03:18:34] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/page/PageArchive.php: T282844 (duration: 01m 07s)
[03:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:08] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/page/WikiPage.php: T282844 (duration: 01m 06s)
[03:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:02] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/extensions/MapSources/includes/specials/MapSourcesPage.php: fix PHP notice T282833 (duration: 01m 07s)
[03:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:06] T282833: MapSourcesPage.php: PHP Notice: Undefined offset: 13 - https://phabricator.wikimedia.org/T282833
[03:34:51] (CR) Tim Starling: [C: +2] Using RevisionListBase::getPage instead of calling $title directly [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690084 (https://phabricator.wikimedia.org/T282825) (owner: Jforrester)
[04:03:05] (Merged) jenkins-bot: Using RevisionListBase::getPage instead of calling $title directly [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690084 (https://phabricator.wikimedia.org/T282825) (owner: Jforrester)
[04:09:51] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:10:59] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:15:43] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:16:57] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:18:04] !log ariel@deploy1002 Started deploy [dumps/dumps@b97a2a9]: eliminate double slash in construction of api path
[04:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:07] !log ariel@deploy1002 Finished deploy [dumps/dumps@b97a2a9]: eliminate double slash in construction of api path (duration: 00m 03s)
[04:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:09] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/revisiondelete/RevDelRevisionItem.php: fix deprecation warning T282825 (duration: 01m 07s)
[04:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:12] T282825: PHP Deprecated: Use of RevisionListBase::$title was deprecated in MediaWiki 1.37. [Called from RevDelRevisionItem::getHTML] - https://phabricator.wikimedia.org/T282825
[04:20:35] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/revisionlist/RevisionItem.php: fix deprecation warning T282825 (duration: 01m 07s)
[04:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:21] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: " %(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s" in recent Wikitech-l posts - https://phabricator.wikimedia.org/T282762 (Ladsgroup) Resolved→Open It seems running the fix templates wasn't enough. I check soon.
[05:21:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:23:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:31:09] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[05:40:53] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[05:51:14] (CR) Majavah: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/690055 (https://phabricator.wikimedia.org/T282264) (owner: Bstorm)
[06:13:51] (PS3) Jcrespo: bacula: Reenable read-write ES database backups, disable read-only [puppet] - https://gerrit.wikimedia.org/r/690338 (https://phabricator.wikimedia.org/T282249)
[06:14:59] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:29:35] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:51:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:53:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210514T0700)
[07:03:03] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:11:46] (CR) Ayounsi: [C: +2] admin: add jhuneidi to contint-roots [puppet] - https://gerrit.wikimedia.org/r/690793 (https://phabricator.wikimedia.org/T282610) (owner: Dzahn)
[07:12:39] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:13:15] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releases1002/2002 for jhuneidi - https://phabricator.wikimedia.org/T282610 (ayounsi) Open→Resolved a:ayounsi Thanks for sending the CR. It's now merged.
[07:19:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:24:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:37:42] (CR) Jcrespo: "Does this need revert, according to: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-04-29_db_and_memc_load#29_April_2021 " [puppet] - https://gerrit.wikimedia.org/r/683682 (owner: Effie Mouzeli)
[07:46:15] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:50:54] (PS16) Elukey: Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192)
[07:51:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:51:44] (CR) Elukey: "All images are now building fine with docker-pkg locally, ready to get the first comments :)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:53:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:53:33] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:59:37] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:00:15] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1008.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:00:35] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:01:55] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1008.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:03:01] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:04:33] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:06:15] PROBLEM - SSH on logstash2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:06:41] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[08:08:21] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[08:28:24] SRE, SRE-Access-Requests: Allow JStephenson to access Superset - https://phabricator.wikimedia.org/T282515 (JStephenson) Hi! I am still not able to login to my account as a developer in order to be able to enter Superset. My manager, Kassia, has already authorised this above. Can you please help me?...
[08:29:15] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:36:37] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:52:08] (CR) Filippo Giunchedi: [C: +1] "LGTM, matches https://github.com/elastic/ecs/issues/232" [software/ecs] - https://gerrit.wikimedia.org/r/636515 (owner: Cwhite)
[08:52:39] (PS1) Jbond: resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080
[08:52:55] the varnish http requests alerts are due to sinusoidal traffic patterns in codfw (both text and upload), nothing worrisome it seems
[08:52:58] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&from=1620938261011&to=1620981554316&var-site=codfw&var-cache_type=varnish-upload&var-cache_type=varnish-text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&var-method=GET&var-method=HEAD&var-method=POST
[08:55:12] (CR) jerkins-bot: [V: -1] resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (owner: Jbond)
[08:59:17] (CR) Filippo Giunchedi: [C: +1] "LGTM, very nice!"
[puppet] - https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[09:00:08] (CR) Filippo Giunchedi: [C: +1] rsyslog: remove host parameter from syslog_cee template [puppet] - https://gerrit.wikimedia.org/r/690760 (owner: Cwhite)
[09:00:38] SRE, SRE-Access-Requests: Allow JStephenson to access Superset - https://phabricator.wikimedia.org/T282515 (Aklapper) This ticket is open (see status). The ticket that needs to be fixed first (see Task Graph) above is also still open. So this ticket is not (yet) actionable currently.
[09:01:18] (CR) Filippo Giunchedi: [C: +1] logstash: add nodejs ecs migration config and tests [puppet] - https://gerrit.wikimedia.org/r/690759 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[09:06:34] SRE, ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (fgiunchedi) `sdd` is indeed busted and host is under warranty, please replace @Cmjohnson / @Jclark-ctr , thank you!
[09:10:37] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:12:24] (PS1) Elukey: admin: add new wmf ldap accounts [puppet] - https://gerrit.wikimedia.org/r/691089 (https://phabricator.wikimedia.org/T282589)
[09:13:08] jbond42: ---^
[09:16:08] (PS1) Ayounsi: Add Cathal to AM netops group [puppet] - https://gerrit.wikimedia.org/r/691091
[09:20:15] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:37:03] (CR) Effie Mouzeli: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/683682 (owner: Effie Mouzeli)
[09:47:41] (CR) Filippo Giunchedi: [C: +1] Add Cathal to AM netops group [puppet] - https://gerrit.wikimedia.org/r/691091 (owner: Ayounsi)
[09:50:25] (CR) Ayounsi: [C: +2] Add Cathal to AM netops group [puppet] - https://gerrit.wikimedia.org/r/691091 (owner: Ayounsi)
[09:51:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:53:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:53:41] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:59:37] (PS1) Alexandros Kosiaris: docker-registry: Clean up old http endpoint [puppet] - https://gerrit.wikimedia.org/r/691106 (https://phabricator.wikimedia.org/T256762)
[09:59:39] (PS1) Alexandros Kosiaris: docker-registry: Remove Docker-Distribution-API-version header [puppet] - https://gerrit.wikimedia.org/r/691107 (https://phabricator.wikimedia.org/T256762)
[09:59:41] (PS1) Alexandros Kosiaris: docker-registry: Re-apply Cache-Control rules [puppet] - https://gerrit.wikimedia.org/r/691108 (https://phabricator.wikimedia.org/T256762)
[10:03:09] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[10:04:05] (PS1) Alexandros Kosiaris: docker-registry: Remove absented nginx-site resource [puppet] - https://gerrit.wikimedia.org/r/691110 (https://phabricator.wikimedia.org/T256762)
[10:06:57] (CR) Jcrespo: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/683682 (owner: Effie Mouzeli)
[10:08:59] RECOVERY - SSH on logstash2020.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:36:51] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[10:46:37] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[10:49:32] there is some weird traffic patterns on codfw
[10:50:10] started at 22:50 yesterday
[11:13:07] SRE, Epic, cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (aborrero)
[11:15:00] (PS1) Jbond: O:admin: add ability to manage home [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:15:33] (CR) jerkins-bot: [V: -1] O:admin: add ability to manage home [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: Jbond)
[11:17:19] (PS2) Jbond: O:admin: add ability to manage home [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:17:56] (PS1) Zabe: Enable NewUserMessage on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/691132 (https://phabricator.wikimedia.org/T282845)
[11:18:34] (CR) Elukey: [C: +2] "Going to merge but lemme know if I have missed anything!" [puppet] - https://gerrit.wikimedia.org/r/691089 (https://phabricator.wikimedia.org/T282589) (owner: Elukey)
[11:19:25] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29564/console" [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: Jbond)
[11:20:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:20:59] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert.
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[11:21:02] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (elukey) >>! In T282600#7084732, @Sannita wrote: >>>! In T282600#7084689, @elukey wrote: >> @KFrancis hi! I can't find Sannita's NDA in the spreadshe...
[11:21:18] (PS3) Jbond: O:admin: add ability to manage home [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:21:58] SRE, LDAP-Access-Requests, CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (elukey) Nevermind I see `sannita-ctr@wikimedia.org` in LDAP, all good!
[11:22:09] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29565/console" [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: Jbond)
[11:23:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:23:29] (03PS4) 10Jbond: C:admin: add ability to manage home [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) [11:24:16] (03PS1) 10Elukey: admin: add user sannita to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/691139 (https://phabricator.wikimedia.org/T282600) [11:25:01] (03CR) 10Elukey: [C: 03+2] admin: add user sannita to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/691139 (https://phabricator.wikimedia.org/T282600) (owner: 10Elukey) [11:25:45] (03PS1) 10Arturo Borrero Gonzalez: cr/firewall.conf: allow openstack Trove port TCP/8779 [homer/public] - 10https://gerrit.wikimedia.org/r/691140 (https://phabricator.wikimedia.org/T282809) [11:26:42] (03PS5) 10Jbond: C:admin: add ability to manage home [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) [11:27:28] 10SRE, 10Analytics, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (10elukey) [11:27:30] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10elukey) 05Open→03Resolved a:03elukey Done! [11:28:17] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [11:29:02] 10SRE, 10Analytics, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (10elukey) >>! In T282589#7085443, @mpopov wrote: > Thanks so much @elukey you're... [11:30:26] (03PS6) 10Jbond: C:admin: add ability to manage home [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) [11:30:41] 10SRE, 10Analytics, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (10elukey) 05Open→03Resolved @Elitre everything should be done, please ping me... [11:31:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29569/console" [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: 10Jbond) [11:42:27] (03CR) 10Jbond: [C: 03+1] "LGTM will need the following patch before you can enable managehome, will wait until monday to deploy" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [11:44:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/691140 (https://phabricator.wikimedia.org/T282809) (owner: 10Arturo Borrero Gonzalez) [11:52:25] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10Aklapper) **[off-topic]** @Sannita: I'd second T282600#7084746 that a separation between staff/contractor and volunteer activi... 
[11:52:34] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add NFS ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/691154 [11:53:06] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "DON'T MERGE." [puppet] - 10https://gerrit.wikimedia.org/r/691154 (owner: 10Arturo Borrero Gonzalez) [12:01:27] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [12:03:14] I've created https://phabricator.wikimedia.org/T282861 to track the requests alert above (under NDA due to IPs) [12:06:15] ema: do you think we should add a block for those prefixes? [12:08:35] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [12:19:32] !log run puppet on CP servers [12:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:48] (03CR) 10Matthias Mullie: [C: 03+1] Properly enable media change tags on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690691 (https://phabricator.wikimedia.org/T266067) (owner: 10Urbanecm) [12:20:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:23:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:29:13] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01303 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:29:29] * jbond42 looking [12:43:13] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [12:49:41] (03PS1) 10BBlack: Add missing cache::nodes for cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/691170 (https://phabricator.wikimedia.org/T275046) [12:50:55] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [12:52:45] (03CR) 10BBlack: [C: 03+2] Add missing cache::nodes for cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/691170 (https://phabricator.wikimedia.org/T275046) (owner: 10BBlack) [12:54:55] !log re-running puppet agent on cp5* [12:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:05] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005924 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:05:57] I'm going to live hack mwdebug1001 for checking some RL stuff [13:20:23] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional fonts for SVG thumbnails and generated PDF files on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper) Trying to collect some bits and pieces from T280718... [13:20:35] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional fonts for SVG thumbnails and generated PDF files on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper) [13:20:51] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional fonts for SVG thumbnails and generated PDF files on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper) [13:27:23] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [13:29:53] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [13:32:07] (03CR) 10Jbond: [V: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/29570/" [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: 10Jbond) [13:36:27] (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/691140 (https://phabricator.wikimedia.org/T282809) (owner: 10Arturo Borrero Gonzalez) [13:50:36] (03CR) 10Jbond: [C: 03+2] C:gitlab::ssh: add new gilab::ssh class [puppet] - 10https://gerrit.wikimedia.org/r/684437 (owner: 10Jbond) [13:50:51] (03CR) 10Jbond: [C: 03+2] P:gitlab: add basic gitlab class [puppet] - 10https://gerrit.wikimedia.org/r/684486 (owner: 10Jbond) [14:04:54] !log andrew@deploy1002 Started deploy [horizon/deploy@5d0a683]: removing 'locality' from trove dashboard [14:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:56] (03CR) 10Hnowlan: Initial image-suggestion-api helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [14:09:09] !log andrew@deploy1002 Finished deploy [horizon/deploy@5d0a683]: removing 'locality' from trove dashboard (duration: 04m 15s) [14:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:32] (03PS1) 10CDanis: Revert "fix NIC saturation exporter to be jessie-compatible 😖" [puppet] - 10https://gerrit.wikimedia.org/r/691216 (https://phabricator.wikimedia.org/T224454) [14:27:05] (03PS1) 10Jbond: gitlab: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/691228 [14:27:49] 
(03CR) 10Jbond: [C: 03+2] gitlab: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/691228 (owner: 10Jbond) [14:31:25] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:47] (03PS1) 10Jbond: O:gitlab: add external url [puppet] - 10https://gerrit.wikimedia.org/r/691229 [14:33:20] (03CR) 10jerkins-bot: [V: 04-1] O:gitlab: add external url [puppet] - 10https://gerrit.wikimedia.org/r/691229 (owner: 10Jbond) [14:33:41] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 69547120 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:35:41] (03PS2) 10Jbond: O:gitlab: add external url [puppet] - 10https://gerrit.wikimedia.org/r/691229 [14:36:11] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 559032 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:37:07] (03CR) 10Jbond: [C: 03+2] O:gitlab: add external url [puppet] - 10https://gerrit.wikimedia.org/r/691229 (owner: 10Jbond) [14:38:07] (03PS1) 10Filippo Giunchedi: pontoon: add bootstrap and provision scripts [puppet] - 10https://gerrit.wikimedia.org/r/691231 [14:44:33] (03CR) 10Ssingh: "Needs to be updated for the right domain and the updated Wikidough IP but PCC looks OK: https://puppet-compiler.wmflabs.org/compiler1002/2" [puppet] - 10https://gerrit.wikimedia.org/r/690698 (owner: 10Ssingh) [14:49:11] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [14:51:39] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [14:55:08] (03Abandoned) 10Ssingh: WIP: wikidough: update role to work towards anycast support [puppet] - 10https://gerrit.wikimedia.org/r/690698 (owner: 10Ssingh) [14:58:21] (03PS2) 10Seddon: Change HTTP to HTTPS for concept URIs on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679327 (https://phabricator.wikimedia.org/T258590) [15:00:21] (03PS17) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [15:01:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:55] (03CR) 10Multichill: [C: 03+1] "Thanks for the update. 
Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679327 (https://phabricator.wikimedia.org/T258590) (owner: 10Seddon) [15:05:30] !log Start server-side upload for 1 video file (T282874) [15:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:40] T282874: Server side upload for Raymond - https://phabricator.wikimedia.org/T282874 [15:08:41] (03CR) 10RLazarus: [C: 03+1] "🎊" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/691216 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) [15:21:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:22:27] !log cdanis@cumin2002 START - Cookbook sre.network.cf [15:22:28] !log cdanis@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [15:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:53:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:56:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:17] (03CR) 10Bstorm: "Cool stuff!" 
[puppet] - 10https://gerrit.wikimedia.org/r/691154 (owner: 10Arturo Borrero Gonzalez) [16:02:31] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 4.629e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:08:39] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [16:13:29] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [16:15:45] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10Elitre) >>! In T282600#7087879, @Aklapper wrote: > **[off-topic]** @Sannita: I'd second T282600#7084746 that a separation between staff/contractor a... [16:17:43] PROBLEM - SSH on logstash2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:25:21] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:51] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) This is not very urgent, but I am generating backups from eqiad to codfw at 173Mbps, which takes... 
[17:03:30] !log cdanis@re0.cr2-eqiad# set interfaces gr-4/3/0.2 disable # T282881 [17:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:00] 10SRE, 10Analytics, 10Discovery, 10Platform Engineering, 10Product-Data-Infrastructure: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10Ottomata) [17:10:16] 10SRE, 10Analytics, 10Discovery, 10Event-Platform, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10Ottomata) [17:10:52] !log cdanis@re0.cr1-eqiad# set interfaces gr-3/3/0.1 disable # T282881 [17:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:50] 10SRE, 10Data-Persistence, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10LSobanski) [17:18:05] 10SRE, 10observability, 10Patch-For-Review: google safe browsing icinga checks sporadic UNKNOWN due to 404 - https://phabricator.wikimedia.org/T216985 (1040y20garcia) p:05Medium→03High [17:18:19] (03PS1) 10Zabe: Update bnwiki project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691245 (https://phabricator.wikimedia.org/T282886) [17:25:29] !log rolled back cr1-eqiad/cr2-eqiad interface disables T282881 [17:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:51] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [17:35:09] (03PS4) 10Jcrespo: bacula: Reenable read-write ES database backups, disable read-only [puppet] - 10https://gerrit.wikimedia.org/r/690338 (https://phabricator.wikimedia.org/T282249) [17:35:39] !log install1003 - puppet disabled and /etc/resolv.conf manually patched over to deal with a current issue [17:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:37] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for ODimitrijevic - https://phabricator.wikimedia.org/T282836 (10odimitrijevic) I would like to also request ldap access to nda [17:38:05] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [17:41:40] !log install1003 - restart squid [17:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:25] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/677626 (https://phabricator.wikimedia.org/T275904) (owner: 10Jforrester) [17:43:53] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): labstore1004/5: buy a DAC 10Gb cable or adjust the current fiber cable for DAC/crossover - https://phabricator.wikimedia.org/T282799 (10Jclark-ctr) @Bstorm Swapped out both optics. [17:49:37] !log install1003 - restored normal resolv.conf + re-enabled+ran puppet [17:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:20] 10SRE, 10observability, 10Patch-For-Review: google safe browsing icinga checks sporadic UNKNOWN due to 404 - https://phabricator.wikimedia.org/T216985 (10Dzahn) @40y20garcia Could you let us know some details about your recent edits here? 
Not sure I understand what the linked code tells us in relation to thi... [17:55:49] PROBLEM - Host cloudvirt1040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:57:38] (03CR) 10Dzahn: [C: 03+2] microsites::peopleweb: add more comments [puppet] - 10https://gerrit.wikimedia.org/r/690786 (owner: 10Dzahn) [17:57:58] (03PS2) 10Dzahn: microsites::peopleweb: add more comments [puppet] - 10https://gerrit.wikimedia.org/r/690786 [17:58:23] PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [17:58:23] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): labstore1004/5: buy a DAC 10Gb cable or adjust the current fiber cable for DAC/crossover - https://phabricator.wikimedia.org/T282799 (10Bstorm) I will poke it and see if that works. [17:59:30] (03PS2) 10Dzahn: site: remove people1002 and people2001, update comments [puppet] - 10https://gerrit.wikimedia.org/r/690666 (https://phabricator.wikimedia.org/T280989) [18:00:00] (03CR) 10Dzahn: [C: 03+2] site: remove people1002 and people2001, update comments [puppet] - 10https://gerrit.wikimedia.org/r/690666 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [18:01:05] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Jclark-ctr) @Andrew Reseated cables and network card [18:02:09] RECOVERY - Host cloudvirt1040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms [18:04:01] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:05:17] (03PS1) 10Bstorm: labstore: Switch DRBD devices to using the 10Gb addresses [puppet] - 10https://gerrit.wikimedia.org/r/691254 (https://phabricator.wikimedia.org/T282754) [18:06:15] 
10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) [18:07:49] RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 96162 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [18:08:49] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:28] (03CR) 10Bstorm: [C: 03+2] labstore: Switch DRBD devices to using the 10Gb addresses [puppet] - 10https://gerrit.wikimedia.org/r/691254 (https://phabricator.wikimedia.org/T282754) (owner: 10Bstorm) [18:14:06] !log people1003/people2002: awk -F: '$6 ~ "^\/home" {print $1,$6}' /etc/passwd | while read line ; do user=${line% *}; dir=${line#* }; sudo mkdir -p ${dir}/public_html; sudo chown $user ${dir}/public_html; done (courtesy of Jbond) [18:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:17] (03CR) 10Dzahn: [C: 03+2] peopleweb: put a public_html into /etc/skel to ensure all users get one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [18:17:24] (03PS2) 10Dzahn: peopleweb: put a public_html into /etc/skel to ensure all users get one [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989) [18:22:21] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Jclark-ctr) a:03Jclark-ctr [18:23:11] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Jclark-ctr) phab1002 Rack B1 U26 cable id #3948 Port22 [18:23:13] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 
(10Jclark-ctr) [18:23:32] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [18:24:01] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Majavah) phab1002 and phab1003 names were already used (T195623, T221389), shouldn't this be phab1004? [18:25:24] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Dzahn) hmm.. @Majavah is right, thank you for catching that. Yea, it should be phab1004 [18:25:48] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Dzahn) [18:26:05] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Dzahn) [18:27:13] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Dzahn) @Jclark-ctr @Cmjohnson Renamed the ticket based on the comments above. If you already entered the hostname phab1002 for this in places, can it be changed to phab1004... [18:28:41] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Jclark-ctr) yea can change easily Thanks! [18:36:13] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [18:39:51] !log ✔️ cdanis@install1003.wikimedia.org ~ 🕝☕ sudo systemctl restart squid.service [18:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:30] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Dzahn) Glad it's easy, cool, thank you [18:41:35] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer) >>! In T36947#7076752, @Arthur2e5 wrote: > At least the 2.51.1 result makes more sense and works in non-extreme scales.... [18:42:00] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10Jclark-ctr) @wiki_willy we are short on 2u spaced in 10g racks while being diverse [18:42:31] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [18:43:06] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer) [18:43:13] (03PS1) 10Bstorm: cloud nfs: fix the netmask for swapping cables [puppet] - 10https://gerrit.wikimedia.org/r/691262 (https://phabricator.wikimedia.org/T282754) [18:44:34] (03CR) 10Krinkle: [C: 04-1] "Landed there as https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/611465." 
[puppet] - 10https://gerrit.wikimedia.org/r/598292 (https://phabricator.wikimedia.org/T253679) (owner: 10Aaron Schulz) [18:45:27] (03CR) 10Bstorm: [C: 03+2] cloud nfs: fix the netmask for swapping cables [puppet] - 10https://gerrit.wikimedia.org/r/691262 (https://phabricator.wikimedia.org/T282754) (owner: 10Bstorm) [18:50:13] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [18:50:47] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): labstore1004/5: buy a DAC 10Gb cable or adjust the current fiber cable for DAC/crossover - https://phabricator.wikimedia.org/T282799 (10Bstorm) 05Open→03Invalid It works! It turns out it worked before you swapped optics I think. You... [18:52:47] (03PS12) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [18:54:21] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [18:54:28] (03PS8) 10Krinkle: mediawiki: Remove references to obsolete rpc/RunJobs.php endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz) [18:58:34] !log cdanis@cumin1001 START - Cookbook sre.network.cf [18:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:38] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [18:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:02] 10SRE, 10ops-codfw: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T282758 (10Dzahn) [19:01:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10Dzahn) [19:02:43] (03CR) 10Dzahn: "recheck" [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [19:10:50] (03PS1) 10Bstorm: cloud nfs: Change primary cluster rate limits dramatically [puppet] - 10https://gerrit.wikimedia.org/r/691267 (https://phabricator.wikimedia.org/T218338) [19:14:45] (03CR) 10Bstorm: "Please note, tc ratelimits are written like they are in bits per second, but they are actually bytes per second. I have no idea why." [puppet] - 10https://gerrit.wikimedia.org/r/691267 (https://phabricator.wikimedia.org/T218338) (owner: 10Bstorm) [19:19:32] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10Aklapper) @Elitre: I'm not after policies; I recommended that people separate roles like they already do on-wiki. 
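A note on the tc rate-limit remark in the review comment on change 691267 above: as far as the tc(8) man page describes its rate units, the `bit` family (`kbit`, `mbit`, ...) means bits per second, while the `bps` family (`kbps`, `mbps`, ...) means bytes per second, which would explain why a value ending in `bps` behaves as bytes. The following is only a minimal sketch of that unit rule, not code from the patch. The function name `tc_rate_to_bits_per_second` is hypothetical, and it deliberately handles only the SI prefixes k/m/g/t, not the IEC (1024-based) ones.

```python
import re

# SI prefixes are 1000-based; IEC prefixes such as "ki"/"mi" are
# intentionally left out of this sketch.
_PREFIX = {"": 1, "k": 1000, "m": 1000**2, "g": 1000**3, "t": 1000**4}

# A number, optionally followed by an optional prefix plus "bit" or "bps".
_RATE_RE = re.compile(r"^(\d+(?:\.\d+)?)(?:([kmgt]?)(bit|bps))?$", re.IGNORECASE)


def tc_rate_to_bits_per_second(rate: str) -> float:
    """Convert a tc-style rate string to bits per second.

    Assumes the tc(8) convention: "bit" units are bits per second,
    "bps" units are BYTES per second, and a bare number is bits per second.
    """
    match = _RATE_RE.match(rate.strip())
    if match is None:
        raise ValueError(f"unrecognised rate: {rate!r}")
    number, prefix, unit = match.groups()
    value = float(number) * _PREFIX[(prefix or "").lower()]
    if unit is not None and unit.lower() == "bps":
        value *= 8  # bytes per second -> bits per second
    return value


if __name__ == "__main__":
    for example in ("1mbit", "1mbps", "100"):
        print(example, "->", tc_rate_to_bits_per_second(example), "bit/s")
```

Under that reading, `1mbps` is eight times the rate of `1mbit`, so checking which suffix a limit uses before changing it dramatically seems worthwhile.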
[19:24:52] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10Dzahn) The reason to have "WMF" accounts was originally only for "office actions" on wiki, which is a super rare thing in the grand scheme of things. [19:26:45] (03PS13) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [19:32:41] 10SRE, 10observability: google safe browsing icinga checks sporadic UNKNOWN due to 404 - https://phabricator.wikimedia.org/T216985 (10Aklapper) p:05High→03Medium [19:38:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:40:08] mutante: can I help from the toolforge side with T218828 somehow? [19:40:09] T218828: Commons SVG Checker has differences between Wikimedia rendering and Toolforge rendering - https://phabricator.wikimedia.org/T218828 [19:41:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:46:23] (03PS14) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [19:51:45] Majavah: yes, you can paste on those tickets (about 3 or more, heh) which versions of librsvg are installed! and thank you [19:52:27] or if you have info how apt-browser.toolforge.org relates to apt.wikimedia.org [19:54:36] mutante: replied https://phabricator.wikimedia.org/T218828#7088964 [19:56:01] thank you! 
[19:56:14] mutante: apt-browser is a flask app by l.egoktm that is basically just displays what package versions are available on different versions and components on apt.wikimedia.org [19:56:27] (according to https://toolsadmin.wikimedia.org/tools/id/apt-browser and https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/apt-browser/) [19:56:56] Majavah: it's like all of those tickets have the core issue that there is an expectation toolforge things are prod things [19:57:30] Majavah: that makes sense, but leaves the question why the ticket (rightfully) says the versions differ [19:58:00] or they are comparing the wrong component or something [19:58:16] (03CR) 10Jbond: peopleweb: put a public_html into /etc/skel to ensure all users get one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [19:58:56] yeah, unfortunately most tickets like "wmcs and prod are different" should be reported to the maintainer of said tool running in wmcs and not to sre/wmcs [19:59:04] (03CR) 10Dzahn: "No worries, I added some echo" around it at first and did not blindly run it" [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [20:00:26] Majavah: would be solved if volunteer could use debmonitor.wm.org but that's another can of worms [20:00:34] mutante: keep in mind that apt-browser only has packages specifically uploaded to apt.wm.o, not everything in debian repositories [20:00:51] im glad mutante :) [20:01:05] Majavah: yea, that's why I also linked to mirrors.wikimedia.org though [20:01:22] oh, apt-browser, ACK [20:01:28] jbond42: hehe:) thanks [20:01:58] or just hire the people who need access to that :P [20:02:17] although I think cn=nda can also access it [20:02:23] they could make a cloud VPS and use the production APT sources there [20:02:41] and look apt apt-cache policy [20:06:14] (03PS12) 10Legoktm: Add shellbox chart [deployment-charts] - 
10https://gerrit.wikimedia.org/r/667047 [20:10:23] apt-browser just scrapes the package listings on apt.wm.o [20:10:29] (03PS1) 10CDanis: sre.network.cf: Provide some advice in the event of errors [cookbooks] - 10https://gerrit.wikimedia.org/r/691275 [20:10:55] it's the equivalent of packages.debian.org basically [20:16:31] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [20:18:39] (03PS15) 10Ahmon Dancy: Email notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 (https://phabricator.wikimedia.org/T271274) [20:19:52] 10SRE, 10Okapi [Wikimedia Enterprise], 10Platform Engineering: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10Ottomata) Here's how this could possibly work: - Someone (SRE? Platform Eng? Cloud Services?) provisions and maintains a n... [20:21:03] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [20:22:49] RECOVERY - SSH on logstash2020.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:25:04] (03CR) 10Legoktm: Add shellbox chart (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [20:25:42] (03PS13) 10Legoktm: Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 [20:31:13] (03CR) 10Legoktm: "PS12: Fixed the SetEnvIf syntax, it now passes the secret key properly and all works" [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [20:31:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts people1002.eqiad.wmnet [20:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:00] !log people1002 - decom'ing - please use people1003 and see list mail [20:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:08] (03PS2) 10Legoktm: docker-registry: Clean up old nginx http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/691106 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris) [20:33:23] (03CR) 10RLazarus: [C: 03+1] sre.network.cf: Provide some advice in the event of errors (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/691275 (owner: 10CDanis) [20:34:35] (03CR) 10Legoktm: [C: 03+1] "I clarified in the commit message that this just removes the nginx HTTP endpoint. 
The build-homepage script talks directly to the registry" [puppet] - 10https://gerrit.wikimedia.org/r/691106 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris) [20:34:47] (03CR) 10Legoktm: [C: 03+1] docker-registry: Remove Docker-Distribution-API-version header [puppet] - 10https://gerrit.wikimedia.org/r/691107 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris) [20:36:13] (03CR) 10Jeena Huneidi: [C: 03+2] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [20:37:24] (03Merged) 10jenkins-bot: Email notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [20:38:43] (03PS1) 10Dwisehaupt: Monitor civiproxy nginx port [puppet] - 10https://gerrit.wikimedia.org/r/691277 (https://phabricator.wikimedia.org/T281321) [20:39:17] (03PS1) 10Krinkle: [Beta Cluster] Enable onhost memc tier for ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691278 (https://phabricator.wikimedia.org/T264604) [20:40:11] (03CR) 10Krinkle: [C: 03+2] [Beta Cluster] Enable onhost memc tier for ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691278 (https://phabricator.wikimedia.org/T264604) (owner: 10Krinkle) [20:40:53] (03CR) 10Legoktm: docker-registry: Re-apply Cache-Control rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/691108 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris) [20:41:07] (03CR) 10Legoktm: [C: 03+2] docker-registry: Remove absented nginx-site resource [puppet] - 10https://gerrit.wikimedia.org/r/691110 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris) [20:41:36] (03Merged) 10jenkins-bot: [Beta Cluster] Enable onhost memc tier for ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691278 (https://phabricator.wikimedia.org/T264604) 
(owner: 10Krinkle) [20:41:39] (03CR) 10Legoktm: [C: 03+1] "Er, meant to +1." [puppet] - 10https://gerrit.wikimedia.org/r/691110 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris) [20:42:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people1002.eqiad.wmnet [20:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:39] 10SRE, 10Patch-For-Review: try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye - https://phabricator.wikimedia.org/T280989 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `people1002.eqiad.wmnet` - people1002.eqiad.wmnet (**PASS**)... [20:52:43] (03PS1) 10Dzahn: add a test variant to match the test pipeline [container/miscweb] - 10https://gerrit.wikimedia.org/r/691283 [20:53:42] (03CR) 10jerkins-bot: [V: 04-1] add a test variant to match the test pipeline [container/miscweb] - 10https://gerrit.wikimedia.org/r/691283 (owner: 10Dzahn) [20:55:24] (03PS1) 10Krinkle: [Beta Cluster] Fix undefined 'mcrouter-with-onhost-tier' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691285 (https://phabricator.wikimedia.org/T264604) [20:55:53] (03CR) 10Krinkle: [C: 03+2] [Beta Cluster] Fix undefined 'mcrouter-with-onhost-tier' (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691285 (https://phabricator.wikimedia.org/T264604) (owner: 10Krinkle) [20:56:50] (03Merged) 10jenkins-bot: [Beta Cluster] Fix undefined 'mcrouter-with-onhost-tier' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691285 (https://phabricator.wikimedia.org/T264604) (owner: 10Krinkle) [20:59:10] (03CR) 10Cwhite: [V: 03+1 C: 03+1] "This works well!" 
[puppet] - 10https://gerrit.wikimedia.org/r/691231 (owner: 10Filippo Giunchedi) [21:03:00] (03PS1) 10Legoktm: httpd: Add directory for applications to add config [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691287 [21:13:54] (03PS1) 10Dzahn: httpd: add a resursive chmod to ensure log files are group writable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691293 [21:15:17] (03PS2) 10Dzahn: httpd: add a resursive chmod to ensure log files are group writable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691293 [21:17:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:19:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:22:19] (03CR) 10Volans: "Couple of nits inline, none is a blocker." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/691275 (owner: 10CDanis) [21:37:17] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (10wiki_willy) a:03Cmjohnson [21:38:35] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [21:38:57] 10SRE, 10Machine-Learning-Team, 10ORES, 10Release Pipeline (Blubber): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10thcipriani) [21:39:00] 10SRE, 10Machine-Learning-Team, 10ORES, 10Release Pipeline, 10Release-Engineering-Team (Seen): Execution of the deployment pipeline should be configurable via .pipeline/config.yaml - https://phabricator.wikimedia.org/T210267 (10thcipriani) 05Open→03Resolved a:03dduvall There are now many services t... [21:40:01] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [21:44:49] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [22:22:45] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [22:24:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:27:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:29:21] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [23:07:43] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [23:11:16] (03CR) 10Dwisehaupt: "For when we are ready to monitor this service fully." [puppet] - 10https://gerrit.wikimedia.org/r/691277 (https://phabricator.wikimedia.org/T281321) (owner: 10Dwisehaupt) [23:14:47] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [23:19:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:21:44] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Platonides) Probably more a Feature Request for upstream, but I think mailman3 should parse that rejection message, find out the error is actually due to the specific message it was tr... [23:24:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:47:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:50:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:52:33] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [23:59:33] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/