[00:01:18] <wikibugs>	 (PS1) Bstorm: cloudstore: set syncserver to only be run with puppet disabled [puppet] - https://gerrit.wikimedia.org/r/690783 (https://phabricator.wikimedia.org/T224747)
[00:10:16] <wikibugs>	 (PS1) Dzahn: microsites::peopleweb: add more comments [puppet] - https://gerrit.wikimedia.org/r/690786
[00:12:02] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (RKemper) @wiki_willy I heard Papaul is out for a couple weeks so see the above comment https://phabricator.wikimedia.org/T281437#7086866
[00:20:10] <wikibugs>	 (PS1) Dzahn: peopleweb: put a public_html into /etc/skel to ensure all users get one [puppet] - https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989)
[00:39:41] <ryankemper>	 !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs2003.codfw.wmnet` on `ryankemper@cumin2001` tmux session `wdqs_reimage`
[00:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:45] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[00:43:44] <wikibugs>	 (PS1) Nray: Fix 'final_state: vector' bug in VectorPrefDiffInstrumentation [extensions/WikimediaEvents] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/690789 (https://phabricator.wikimedia.org/T261842)
[00:50:50] <wikibugs>	 (PS1) Jforrester: Using RevisionListBase::getPage instead of calling $title directly [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690084 (https://phabricator.wikimedia.org/T282825)
[01:21:07] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:22:01] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releases1002/2002 for jhuneidi - https://phabricator.wikimedia.org/T282610 (Dzahn)
[01:22:38] <wikibugs>	 (PS1) Dzahn: admin: add jhuneidi to contint-roots [puppet] - https://gerrit.wikimedia.org/r/690793 (https://phabricator.wikimedia.org/T282610)
[01:22:41] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releases1002/2002 for jhuneidi - https://phabricator.wikimedia.org/T282610 (Dzahn) confirmed L3, confirmed all the other checkboxes  uploaded patch to gerrit
[01:22:53] <wikibugs>	 (PS2) Dzahn: admin: add jhuneidi to contint-roots [puppet] - https://gerrit.wikimedia.org/r/690793 (https://phabricator.wikimedia.org/T282610)
[01:23:30] <wikibugs>	 (PS3) Dzahn: admin: add jhuneidi to contint-roots [puppet] - https://gerrit.wikimedia.org/r/690793 (https://phabricator.wikimedia.org/T282610)
[01:23:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:33:33] <wikibugs>	 (CR) Jdlrobson: "This change is ready for review." [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/690085 (https://phabricator.wikimedia.org/T280292) (owner: Jdlrobson)
[01:35:41] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (wiki_willy) Hi @RKemper - Papaul will be back on the 24th.  Would you be able to hold off until then?  If not, we can submit the RMA r...
[01:36:36] <wikibugs>	 (CR) BPirkle: Initial image-suggestion-api helm chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: Nikki Nikkhoui)
[02:07:06] <wikibugs>	 (PS1) Bstorm: cloudstore: fix some more settings on the syncserver mess [puppet] - https://gerrit.wikimedia.org/r/690795 (https://phabricator.wikimedia.org/T224747)
[02:10:04] <wikibugs>	 (PS1) Jforrester: LogEventsList: always define $pageName [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834)
[02:10:30] <wikibugs>	 (CR) Ppchelko: [C: +1] LogEventsList: always define $pageName [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:13:14] <wikibugs>	 (CR) Jforrester: "recheck" [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[02:13:36] <wikibugs>	 (CR) jerkins-bot: [V: -1] add initial Blubberfile and placeholders for prod and staging HTML [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[02:13:49] <wikibugs>	 (CR) Tim Starling: "Should I deploy this?" [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:17:45] <wikibugs>	 (CR) Ppchelko: [C: +1] "I donno, maybe James was going to?" [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:20:10] <wikibugs>	 (CR) Bstorm: [C: +2] cloudstore: fix some more settings on the syncserver mess [puppet] - https://gerrit.wikimedia.org/r/690795 (https://phabricator.wikimedia.org/T224747) (owner: Bstorm)
[02:21:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:23:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:27:23] <wikibugs>	 (CR) Tim Starling: [C: +2] "I'll give it a +2 anyway so the gate checks can start. I can deploy it in about 30 mins if James hasn't gotten to it by then." [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:34:28] <wikibugs>	 (PS1) Tim Starling: Revert "Add assertions about page IDs during undeletion." [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690796 (https://phabricator.wikimedia.org/T282844)
[02:35:26] <wikibugs>	 (CR) Jforrester: "> Patch Set 1:" [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[02:38:51] <wikibugs>	 (CR) Tim Starling: [C: +2] "I had to upload the cherry-pick manually due to a merge conflict in the "use" list of PageArchive.php. I've checked the diff 4 or 5 times," [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690796 (https://phabricator.wikimedia.org/T282844) (owner: Tim Starling)
[02:43:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 54382368 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:45:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 60768 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:45:52] <wikibugs>	 (PS1) Jforrester: Check array boundaries before accessing array [extensions/MapSources] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690809 (https://phabricator.wikimedia.org/T282833)
[02:46:25] <James_F>	 TimStarling: If you're deploying anyway, there are another two: https://gerrit.wikimedia.org/r/q/branch:wmf/1.37.0-wmf.5+status:open
[02:48:46] <wikibugs>	 (CR) Jforrester: "I think you meant to cherry-pick this to REL1_36 instead (it doesn't cleanly cherry-pick there either, FWIW)." [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/690085 (https://phabricator.wikimedia.org/T280292) (owner: Jdlrobson)
[02:48:49] <wikibugs>	 (Abandoned) Jforrester: Legacy feature should not load thumbnail style rules (only layout) [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/690085 (https://phabricator.wikimedia.org/T280292) (owner: Jdlrobson)
[02:49:33] <wikibugs>	 (CR) Ppchelko: [C: +1] Check array boundaries before accessing array [extensions/MapSources] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690809 (https://phabricator.wikimedia.org/T282833) (owner: Jforrester)
[02:50:46] <wikibugs>	 (Merged) jenkins-bot: LogEventsList: always define $pageName [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690808 (https://phabricator.wikimedia.org/T282834) (owner: Jforrester)
[03:03:47] <wikibugs>	 (Merged) jenkins-bot: Revert "Add assertions about page IDs during undeletion." [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690796 (https://phabricator.wikimedia.org/T282844) (owner: Tim Starling)
[03:04:44] <wikibugs>	 (CR) Tim Starling: [C: +2] Check array boundaries before accessing array [extensions/MapSources] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690809 (https://phabricator.wikimedia.org/T282833) (owner: Jforrester)
[03:09:46] <wikibugs>	 (Merged) jenkins-bot: Check array boundaries before accessing array [extensions/MapSources] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690809 (https://phabricator.wikimedia.org/T282833) (owner: Jforrester)
[03:12:39] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:13:36] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/logging/LogEventsList.php: fix PHP notice T282834 (duration: 01m 08s)
[03:13:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:40] <stashbot>	 T282834: LogEventsList.php: PHP Notice: Undefined variable: pageName - https://phabricator.wikimedia.org/T282834
[03:15:05] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[03:16:30] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/Revision/RevisionArchiveRecord.php: fix DeletedContributions breakage T282844 (duration: 01m 07s)
[03:16:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:34] <stashbot>	 T282844: Special:DeletedContributions shows no or almost no edits.  - https://phabricator.wikimedia.org/T282844
[03:18:34] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/page/PageArchive.php: T282844 (duration: 01m 07s)
[03:18:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:08] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/page/WikiPage.php: T282844 (duration: 01m 06s)
[03:20:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:02] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/extensions/MapSources/includes/specials/MapSourcesPage.php: fix PHP notice T282833 (duration: 01m 07s)
[03:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:06] <stashbot>	 T282833: MapSourcesPage.php: PHP Notice: Undefined offset: 13 - https://phabricator.wikimedia.org/T282833
[03:34:51] <wikibugs>	 (CR) Tim Starling: [C: +2] Using RevisionListBase::getPage instead of calling $title directly [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690084 (https://phabricator.wikimedia.org/T282825) (owner: Jforrester)
[04:03:05] <wikibugs>	 (Merged) jenkins-bot: Using RevisionListBase::getPage instead of calling $title directly [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/690084 (https://phabricator.wikimedia.org/T282825) (owner: Jforrester)
[04:09:51] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:10:59] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:15:43] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:16:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:18:04] <logmsgbot>	 !log ariel@deploy1002 Started deploy [dumps/dumps@b97a2a9]: eliminate double slash in construction of api path
[04:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:07] <logmsgbot>	 !log ariel@deploy1002 Finished deploy [dumps/dumps@b97a2a9]: eliminate double slash in construction of api path (duration: 00m 03s)
[04:18:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:09] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/revisiondelete/RevDelRevisionItem.php: fix deprecation warning T282825 (duration: 01m 07s)
[04:19:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:12] <stashbot>	 T282825: PHP Deprecated: Use of RevisionListBase::$title was deprecated in MediaWiki 1.37. [Called from RevDelRevisionItem::getHTML] - https://phabricator.wikimedia.org/T282825
[04:20:35] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.5/includes/revisionlist/RevisionItem.php: fix deprecation warning T282825 (duration: 01m 07s)
[04:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:21] <wikibugs>	 SRE, Wikimedia-Mailing-lists, User-Ladsgroup: " %(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s" in recent Wikitech-l posts - https://phabricator.wikimedia.org/T282762 (Ladsgroup) Resolved→Open It seems running the fix templates wasn't enough. I check soon.
[05:21:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:23:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:31:09] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[05:40:53] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[05:51:14] <wikibugs>	 (CR) Majavah: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/690055 (https://phabricator.wikimedia.org/T282264) (owner: Bstorm)
[06:13:51] <wikibugs>	 (PS3) Jcrespo: bacula: Reenable read-write ES database backups, disable read-only [puppet] - https://gerrit.wikimedia.org/r/690338 (https://phabricator.wikimedia.org/T282249)
[06:14:59] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:29:35] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:51:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:53:35] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210514T0700)
[07:03:03] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:11:46] <wikibugs>	 (CR) Ayounsi: [C: +2] admin: add jhuneidi to contint-roots [puppet] - https://gerrit.wikimedia.org/r/690793 (https://phabricator.wikimedia.org/T282610) (owner: Dzahn)
[07:12:39] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:13:15] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releases1002/2002 for jhuneidi - https://phabricator.wikimedia.org/T282610 (ayounsi) Open→Resolved a:ayounsi Thanks for sending the CR. It's now merged.
[07:19:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:24:47] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:37:42] <wikibugs>	 (CR) Jcrespo: "Does this need revert, according to: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-04-29_db_and_memc_load#29_April_2021 " [puppet] - https://gerrit.wikimedia.org/r/683682 (owner: Effie Mouzeli)
[07:46:15] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:50:54] <wikibugs>	 (PS16) Elukey: Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192)
[07:51:03] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:51:44] <wikibugs>	 (CR) Elukey: "All images are now building fine with docker-pkg locally, ready to get the first comments :)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:53:33] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:53:33] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:59:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:00:15] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1008.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:00:35] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:01:55] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1008.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:03:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:04:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:06:15] <icinga-wm>	 PROBLEM - SSH on logstash2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:06:41] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[08:08:21] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[08:28:24] <wikibugs>	 SRE, SRE-Access-Requests: Allow JStephenson to access Superset - https://phabricator.wikimedia.org/T282515 (JStephenson) Hi!   I am still not able to login to my account as a developer in order to be able to enter Superset. My manager, Kassia, has already authorised this above.   Can you please help me?...
[08:29:15] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:36:37] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:52:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, matches https://github.com/elastic/ecs/issues/232" [software/ecs] - https://gerrit.wikimedia.org/r/636515 (owner: Cwhite)
[08:52:39] <wikibugs>	 (PS1) Jbond: resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080
[08:52:55] <ema>	 the varnish http requests alerts are due to sinusoidal traffic patterns in codfw (both text and upload), nothing worrisome it seems 
[08:52:58] <ema>	 https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&from=1620938261011&to=1620981554316&var-site=codfw&var-cache_type=varnish-upload&var-cache_type=varnish-text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&var-method=GET&var-method=HEAD&var-method=POST 
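Annotation: the flapping that ema describes can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions. The real Grafana rule behind the "70% GET drop in 30min" alert is not shown in this log, so the comparison used here (current rate versus the rate 30 minutes earlier) and all traffic numbers (`base`, `amplitude`, `period_min`) are hypothetical.

```python
import math

# Assumed reading of the alert text: fire when the request rate has fallen
# by at least 70% compared with 30 minutes earlier. The actual rule may differ.
DROP_THRESHOLD = 0.70
WINDOW_MIN = 30


def requests_per_min(t_min, base=100.0, amplitude=90.0, period_min=60.0):
    """Hypothetical sinusoidal traffic: oscillates between base-amplitude and base+amplitude."""
    return base + amplitude * math.sin(2 * math.pi * t_min / period_min)


def is_alerting(now_min, rate=requests_per_min):
    """True when the current rate is at most 30% of the rate WINDOW_MIN minutes ago."""
    earlier = rate(now_min - WINDOW_MIN)
    current = rate(now_min)
    return earlier > 0 and current <= (1 - DROP_THRESHOLD) * earlier


def alerting_minutes(start, end, rate=requests_per_min):
    """Minutes in [start, end) during which the toy alert would fire."""
    return [t for t in range(start, end) if is_alerting(t, rate)]
```

With these made-up numbers, `is_alerting(45)` is true (a trough of 10 req/min against a peak of 190 req/min thirty minutes earlier), and the alert clears again as the wave rises, while flat traffic never alerts. That matches the repeated PROBLEM/RECOVERY pairs in the log without implying an outage.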
[08:55:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (owner: Jbond)
[08:59:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, very nice!" [puppet] - https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[09:00:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] rsyslog: remove host parameter from syslog_cee template [puppet] - https://gerrit.wikimedia.org/r/690760 (owner: Cwhite)
[09:00:38] <wikibugs>	 SRE, SRE-Access-Requests: Allow JStephenson to access Superset - https://phabricator.wikimedia.org/T282515 (Aklapper) This ticket is open (see status). The ticket that needs to be fixed first (see Task Graph) above is also still open. So this ticket is not (yet) actionable currently.
[09:01:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: add nodejs ecs migration config and tests [puppet] - https://gerrit.wikimedia.org/r/690759 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[09:06:34] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (fgiunchedi) `sdd` is indeed busted and host is under warranty, please replace @Cmjohnson / @Jclark-ctr , thank you!
[09:10:37] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:12:24] <wikibugs>	 (PS1) Elukey: admin: add new wmf ldap accounts [puppet] - https://gerrit.wikimedia.org/r/691089 (https://phabricator.wikimedia.org/T282589)
[09:13:08] <elukey>	 jbond42: ---^
[09:16:08] <wikibugs>	 (PS1) Ayounsi: Add Cathal to AM netops group [puppet] - https://gerrit.wikimedia.org/r/691091
[09:20:15] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:37:03] <wikibugs>	 (CR) Effie Mouzeli: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/683682 (owner: Effie Mouzeli)
[09:47:41] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Add Cathal to AM netops group [puppet] - https://gerrit.wikimedia.org/r/691091 (owner: Ayounsi)
[09:50:25] <wikibugs>	 (CR) Ayounsi: [C: +2] Add Cathal to AM netops group [puppet] - https://gerrit.wikimedia.org/r/691091 (owner: Ayounsi)
[09:51:15] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:53:39] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:53:41] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:59:37] <wikibugs>	 (PS1) Alexandros Kosiaris: docker-registry: Clean up old http endpoint [puppet] - https://gerrit.wikimedia.org/r/691106 (https://phabricator.wikimedia.org/T256762)
[09:59:39] <wikibugs>	 (PS1) Alexandros Kosiaris: docker-registry: Remove Docker-Distribution-API-version header [puppet] - https://gerrit.wikimedia.org/r/691107 (https://phabricator.wikimedia.org/T256762)
[09:59:41] <wikibugs>	 (PS1) Alexandros Kosiaris: docker-registry: Re-apply Cache-Control rules [puppet] - https://gerrit.wikimedia.org/r/691108 (https://phabricator.wikimedia.org/T256762)
[10:03:09] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[10:04:05] <wikibugs>	 (PS1) Alexandros Kosiaris: docker-registry: Remove absented nginx-site resource [puppet] - https://gerrit.wikimedia.org/r/691110 (https://phabricator.wikimedia.org/T256762)
[10:06:57] <wikibugs>	 (CR) Jcrespo: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/683682 (owner: Effie Mouzeli)
[10:08:59] <icinga-wm>	 RECOVERY - SSH on logstash2020.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:36:51] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[10:46:37] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[10:49:32] <jynus>	 there are some weird traffic patterns on codfw
[10:50:10] <jynus>	 started at 22:50 yesterday
[11:13:07] <wikibugs>	 SRE, Epic, cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (aborrero)
[11:15:00] <wikibugs>	 (PS1) Jbond: O:admin: add ability to manage home [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:15:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] O:admin: add ability to manage home [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: Jbond)
[11:17:19] <wikibugs>	 (PS2) Jbond: O:admin: add ability to manage home [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:17:56] <wikibugs>	 (PS1) Zabe: Enable NewUserMessage on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/691132 (https://phabricator.wikimedia.org/T282845)
[11:18:34] <wikibugs>	 (CR) Elukey: [C: +2] "Going to merge but lemme know if I have missed anything!" [puppet] - https://gerrit.wikimedia.org/r/691089 (https://phabricator.wikimedia.org/T282589) (owner: Elukey)
[11:19:25] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29564/console" [puppet] - https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: Jbond)
[11:20:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:20:59] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[11:21:02] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10elukey) >>! In T282600#7084732, @Sannita wrote: >>>! In T282600#7084689, @elukey wrote: >> @KFrancis hi! I can't find Sannita's NDA in the spreadshe...
[11:21:18] <wikibugs>	 (03PS3) 10Jbond: O:admin: add ability to manage home [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:21:58] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10elukey) Nevermind I see `sannita-ctr@wikimedia.org` in LDAP, all good!
[11:22:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29565/console" [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: 10Jbond)
[11:23:27] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:23:29] <wikibugs>	 (03PS4) 10Jbond: C:admin: add ability to manage home [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:24:16] <wikibugs>	 (03PS1) 10Elukey: admin: add user sannita to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/691139 (https://phabricator.wikimedia.org/T282600)
[11:25:01] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin: add user sannita to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/691139 (https://phabricator.wikimedia.org/T282600) (owner: 10Elukey)
[11:25:45] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cr/firewall.conf: allow openstack Trove port TCP/8779 [homer/public] - 10https://gerrit.wikimedia.org/r/691140 (https://phabricator.wikimedia.org/T282809)
[11:26:42] <wikibugs>	 (03PS5) 10Jbond: C:admin: add ability to manage home [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:27:28] <wikibugs>	 10SRE, 10Analytics, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (10elukey)
[11:27:30] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10elukey) 05Open→03Resolved a:03elukey Done!
[11:28:17] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[11:29:02] <wikibugs>	 10SRE, 10Analytics, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (10elukey) >>! In T282589#7085443, @mpopov wrote: > Thanks so much @elukey you're...
[11:30:26] <wikibugs>	 (03PS6) 10Jbond: C:admin: add ability to manage home [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989)
[11:30:41] <wikibugs>	 10SRE, 10Analytics, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (10elukey) 05Open→03Resolved @Elitre everything should be done, please ping me...
[11:31:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29569/console" [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: 10Jbond)
[11:42:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM will need the following patch before you can enable managehome, will wait until monday to deploy" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn)
[11:44:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/691140 (https://phabricator.wikimedia.org/T282809) (owner: 10Arturo Borrero Gonzalez)
[11:52:25] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10Aklapper) **[off-topic]** @Sannita: I'd second T282600#7084746 that a separation between staff/contractor and volunteer activi...
[11:52:34] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add NFS ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/691154
[11:53:06] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "DON'T MERGE." [puppet] - 10https://gerrit.wikimedia.org/r/691154 (owner: 10Arturo Borrero Gonzalez)
[12:01:27] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[12:03:14] <ema>	 I've created https://phabricator.wikimedia.org/T282861 to track the requests alert above (under NDA due to IPs)
[12:06:15] <jbond42>	 ema: do you think we should add a block for those prefixes?
[12:08:35] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[12:19:32] <jbond42>	 !log run puppet on CP servers
[12:19:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:48] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 03+1] Properly enable media change tags on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690691 (https://phabricator.wikimedia.org/T266067) (owner: 10Urbanecm)
[12:20:59] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:23:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:29:13] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01303 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[12:29:29] * jbond42 looking
[12:43:13] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[12:49:41] <wikibugs>	 (03PS1) 10BBlack: Add missing cache::nodes for cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/691170 (https://phabricator.wikimedia.org/T275046)
[12:50:55] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[12:52:45] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Add missing cache::nodes for cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/691170 (https://phabricator.wikimedia.org/T275046) (owner: 10BBlack)
[12:54:55] <bblack>	 !log re-running puppet agent on cp5*
[12:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:05] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005924 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:05:57] <Amir1>	 I'm going to live hack mwdebug1001 for checking some RL stuff
[13:20:23] <wikibugs>	 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional fonts for SVG thumbnails and generated PDF files on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper) Trying to collect some bits and pieces from T280718...
[13:20:35] <wikibugs>	 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional fonts for SVG thumbnails and generated PDF files on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper)
[13:20:51] <wikibugs>	 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional fonts for SVG thumbnails and generated PDF files on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper)
[13:27:23] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[13:29:53] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[13:32:07] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/29570/" [puppet] - 10https://gerrit.wikimedia.org/r/691131 (https://phabricator.wikimedia.org/T280989) (owner: 10Jbond)
[13:36:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/691140 (https://phabricator.wikimedia.org/T282809) (owner: 10Arturo Borrero Gonzalez)
[13:50:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:gitlab::ssh: add new gilab::ssh class [puppet] - 10https://gerrit.wikimedia.org/r/684437 (owner: 10Jbond)
[13:50:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:gitlab: add basic gitlab class [puppet] - 10https://gerrit.wikimedia.org/r/684486 (owner: 10Jbond)
[14:04:54] <logmsgbot>	 !log andrew@deploy1002 Started deploy [horizon/deploy@5d0a683]: removing 'locality' from trove dashboard
[14:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:56] <wikibugs>	 (03CR) 10Hnowlan: Initial image-suggestion-api helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui)
[14:09:09] <logmsgbot>	 !log andrew@deploy1002 Finished deploy [horizon/deploy@5d0a683]: removing 'locality' from trove dashboard (duration: 04m 15s)
[14:09:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:32] <wikibugs>	 (03PS1) 10CDanis: Revert "fix NIC saturation exporter to be jessie-compatible 😖" [puppet] - 10https://gerrit.wikimedia.org/r/691216 (https://phabricator.wikimedia.org/T224454)
[14:27:05] <wikibugs>	 (03PS1) 10Jbond: gitlab: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/691228
[14:27:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] gitlab: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/691228 (owner: 10Jbond)
[14:31:25] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:47] <wikibugs>	 (03PS1) 10Jbond: O:gitlab: add external url [puppet] - 10https://gerrit.wikimedia.org/r/691229
[14:33:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] O:gitlab: add external url [puppet] - 10https://gerrit.wikimedia.org/r/691229 (owner: 10Jbond)
[14:33:41] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 69547120 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:35:41] <wikibugs>	 (03PS2) 10Jbond: O:gitlab: add external url [puppet] - 10https://gerrit.wikimedia.org/r/691229
[14:36:11] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 559032 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:37:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:gitlab: add external url [puppet] - 10https://gerrit.wikimedia.org/r/691229 (owner: 10Jbond)
[14:38:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: add bootstrap and provision scripts [puppet] - 10https://gerrit.wikimedia.org/r/691231
[14:44:33] <wikibugs>	 (03CR) 10Ssingh: "Needs to be updated for the right domain and the updated Wikidough IP but PCC looks OK: https://puppet-compiler.wmflabs.org/compiler1002/2" [puppet] - 10https://gerrit.wikimedia.org/r/690698 (owner: 10Ssingh)
[14:49:11] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[14:51:39] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[14:55:08] <wikibugs>	 (03Abandoned) 10Ssingh: WIP: wikidough: update role to work towards anycast support [puppet] - 10https://gerrit.wikimedia.org/r/690698 (owner: 10Ssingh)
[14:58:21] <wikibugs>	 (03PS2) 10Seddon: Change HTTP to HTTPS for concept URIs on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679327 (https://phabricator.wikimedia.org/T258590)
[15:00:21] <wikibugs>	 (03PS17) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192)
[15:01:17] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:02:55] <wikibugs>	 (03CR) 10Multichill: [C: 03+1] "Thanks for the update. Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679327 (https://phabricator.wikimedia.org/T258590) (owner: 10Seddon)
[15:05:30] <Urbanecm>	 !log Start server-side upload for 1 video file (T282874)
[15:05:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:40] <stashbot>	 T282874: Server side upload for Raymond - https://phabricator.wikimedia.org/T282874
[15:08:41] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "🎊" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/691216 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis)
[15:21:23] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:22:27] <logmsgbot>	 !log cdanis@cumin2002 START - Cookbook sre.network.cf
[15:22:28] <logmsgbot>	 !log cdanis@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[15:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:51] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:53:37] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:56:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:58:17] <wikibugs>	 (03CR) 10Bstorm: "Cool stuff!" [puppet] - 10https://gerrit.wikimedia.org/r/691154 (owner: 10Arturo Borrero Gonzalez)
[16:02:31] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 4.629e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:08:39] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[16:13:29] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[16:15:45] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10Elitre) >>! In T282600#7087879, @Aklapper wrote: > **[off-topic]** @Sannita: I'd second T282600#7084746 that a separation between staff/contractor a...
[16:17:43] <icinga-wm>	 PROBLEM - SSH on logstash2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:25:21] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:50:51] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) This is not very urgent, but I am generating backups from eqiad to codfw at 173Mbps, which takes...
[17:03:30] <cdanis>	 !log cdanis@re0.cr2-eqiad# set interfaces gr-4/3/0.2 disable   # T282881
[17:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:00] <wikibugs>	 10SRE, 10Analytics, 10Discovery, 10Platform Engineering, 10Product-Data-Infrastructure: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10Ottomata)
[17:10:16] <wikibugs>	 10SRE, 10Analytics, 10Discovery, 10Event-Platform, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10Ottomata)
[17:10:52] <cdanis>	 !log cdanis@re0.cr1-eqiad# set interfaces gr-3/3/0.1 disable   # T282881
[17:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:50] <wikibugs>	 10SRE, 10Data-Persistence, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10LSobanski)
[17:18:05] <wikibugs>	 10SRE, 10observability, 10Patch-For-Review: google safe browsing icinga checks sporadic UNKNOWN due to 404 - https://phabricator.wikimedia.org/T216985 (1040y20garcia) p:05Medium→03High
[17:18:19] <wikibugs>	 (03PS1) 10Zabe: Update bnwiki project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691245 (https://phabricator.wikimedia.org/T282886)
[17:25:29] <cdanis>	 !log rolled back cr1-eqiad/cr2-eqiad interface disables T282881
[17:25:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:51] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[17:35:09] <wikibugs>	 (03PS4) 10Jcrespo: bacula: Reenable read-write ES database backups, disable read-only [puppet] - 10https://gerrit.wikimedia.org/r/690338 (https://phabricator.wikimedia.org/T282249)
[17:35:39] <bblack>	 !log install1003 - puppet disabled and /etc/resolv.conf manually patched over to deal with a current issue
[17:35:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:37] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for ODimitrijevic - https://phabricator.wikimedia.org/T282836 (10odimitrijevic) I would like to also request  ldap access to nda
[17:38:05] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[17:41:40] <bblack>	 !log install1003 - restart squid
[17:41:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:25] <wikibugs>	 (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/677626 (https://phabricator.wikimedia.org/T275904) (owner: 10Jforrester)
[17:43:53] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): labstore1004/5: buy a DAC 10Gb cable or adjust the current fiber cable for DAC/crossover - https://phabricator.wikimedia.org/T282799 (10Jclark-ctr) @Bstorm  Swapped out both optics.
[17:49:37] <bblack>	 !log install1003 - restored normal resolv.conf + re-enabled+ran puppet
[17:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:20] <wikibugs>	 10SRE, 10observability, 10Patch-For-Review: google safe browsing icinga checks sporadic UNKNOWN due to 404 - https://phabricator.wikimedia.org/T216985 (10Dzahn) @40y20garcia Could you let us know some details about your recent edits here? Not sure I understand what the linked code tells us in relation to thi...
[17:55:49] <icinga-wm>	 PROBLEM - Host cloudvirt1040.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:57:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] microsites::peopleweb: add more comments [puppet] - 10https://gerrit.wikimedia.org/r/690786 (owner: 10Dzahn)
[17:57:58] <wikibugs>	 (03PS2) 10Dzahn: microsites::peopleweb: add more comments [puppet] - 10https://gerrit.wikimedia.org/r/690786
[17:58:23] <icinga-wm>	 PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[17:58:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): labstore1004/5: buy a DAC 10Gb cable or adjust the current fiber cable for DAC/crossover - https://phabricator.wikimedia.org/T282799 (10Bstorm) I will poke it and see if that works.
[17:59:30] <wikibugs>	 (03PS2) 10Dzahn: site: remove people1002 and people2001, update comments [puppet] - 10https://gerrit.wikimedia.org/r/690666 (https://phabricator.wikimedia.org/T280989)
[18:00:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site: remove people1002 and people2001, update comments [puppet] - 10https://gerrit.wikimedia.org/r/690666 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn)
[18:01:05] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Jclark-ctr) @Andrew Reseated cables and network card
[18:02:09] <icinga-wm>	 RECOVERY - Host cloudvirt1040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms
[18:04:01] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:05:17] <wikibugs>	 (03PS1) 10Bstorm: labstore: Switch DRBD devices to using the 10Gb addresses [puppet] - 10https://gerrit.wikimedia.org/r/691254 (https://phabricator.wikimedia.org/T282754)
[18:06:15] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm)
[18:07:49] <icinga-wm>	 RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 96162 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[18:08:49] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:12:28] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] labstore: Switch DRBD devices to using the 10Gb addresses [puppet] - 10https://gerrit.wikimedia.org/r/691254 (https://phabricator.wikimedia.org/T282754) (owner: 10Bstorm)
[18:14:06] <mutante>	 !log people1003/people2002: awk -F: '$6 ~ "^\/home" {print $1,$6}' /etc/passwd  | while read line ; do user=${line% *}; dir=${line#* }; sudo mkdir -p ${dir}/public_html; sudo chown $user ${dir}/public_html; done (courtesy of Jbond)
[18:14:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
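Note on the logged one-liner above: a minimal Python sketch of what the awk/while pipeline selects, assuming the standard colon-separated /etc/passwd layout where field 6 is the home directory. It only computes which `public_html` paths would be created and for which user; it does not run `mkdir` or `chown`. The function names and the sample accounts are hypothetical, chosen for illustration only.

```python
from typing import List, Tuple


def home_users(passwd_text: str) -> List[Tuple[str, str]]:
    """Return (user, home) pairs whose home directory starts with /home.

    Mirrors the awk filter `$6 ~ "^\\/home"`: field 1 is the login name,
    field 6 is the home directory.
    """
    pairs = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) < 6:
            # Malformed or empty line: awk would see an empty $6 and skip it.
            continue
        user, home = fields[0], fields[5]
        if home.startswith("/home"):
            pairs.append((user, home))
    return pairs


def public_html_plan(passwd_text: str) -> List[Tuple[str, str]]:
    """Return (owner, path) pairs for the public_html directories to create."""
    return [(user, f"{home}/public_html") for user, home in home_users(passwd_text)]


# Hypothetical sample data, not taken from any real host.
SAMPLE = """root:x:0:0:root:/root:/bin/bash
alice:x:1001:1001:Alice:/home/alice:/bin/bash
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
bob:x:1002:1002:Bob:/home/bob:/bin/zsh"""

if __name__ == "__main__":
    for owner, path in public_html_plan(SAMPLE):
        print(owner, path)
```

As in the shell version, the match is a plain prefix test, so a home directory such as /homes/x would also be selected.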
[18:17:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] peopleweb: put a public_html into /etc/skel to ensure all users get one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn)
[18:17:24] <wikibugs>	 (03PS2) 10Dzahn: peopleweb: put a public_html into /etc/skel to ensure all users get one [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989)
[18:22:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Jclark-ctr) a:03Jclark-ctr
[18:23:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Jclark-ctr) phab1002 Rack B1 U26 cable id #3948   Port22
[18:23:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Jclark-ctr)
[18:23:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[18:24:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Majavah) phab1002 and phab1003 names were already used (T195623, T221389), shouldn't this be phab1004?
[18:25:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10Dzahn) hmm.. @Majavah is right, thank you for catching that. Yea, it should be phab1004
[18:25:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Dzahn)
[18:26:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Dzahn)
[18:27:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Dzahn) @Jclark-ctr @Cmjohnson Renamed the ticket based on the comments above. If you already entered the hostname phab1002 for this in places, can it be changed to phab1004...
[18:28:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Jclark-ctr) yea can change easily Thanks!
[18:36:13] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[18:39:51] <cdanis>	 !log ✔️ cdanis@install1003.wikimedia.org ~ 🕝☕ sudo systemctl restart squid.service
[18:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10Dzahn) Glad it's easy, cool, thank you
[18:41:35] <wikibugs>	 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer) >>! In T36947#7076752, @Arthur2e5 wrote: > At least the 2.51.1 result makes more sense and works in non-extreme scales....
[18:42:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10Jclark-ctr) @wiki_willy   we are short on 2u spaced in 10g racks while being diverse
[18:42:31] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[18:43:06] <wikibugs>	 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer)
[18:43:13] <wikibugs>	 (03PS1) 10Bstorm: cloud nfs: fix the netmask for swapping cables [puppet] - 10https://gerrit.wikimedia.org/r/691262 (https://phabricator.wikimedia.org/T282754)
[18:44:34] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] "Landed there as https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/611465." [puppet] - 10https://gerrit.wikimedia.org/r/598292 (https://phabricator.wikimedia.org/T253679) (owner: 10Aaron Schulz)
[18:45:27] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloud nfs: fix the netmask for swapping cables [puppet] - 10https://gerrit.wikimedia.org/r/691262 (https://phabricator.wikimedia.org/T282754) (owner: 10Bstorm)
[18:50:13] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[18:50:47] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): labstore1004/5: buy a DAC 10Gb cable or adjust the current fiber cable for DAC/crossover - https://phabricator.wikimedia.org/T282799 (10Bstorm) 05Open→03Invalid It works! It turns out it worked before you swapped optics I think. You...
[18:52:47] <wikibugs>	 (03PS12) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015
[18:54:21] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[18:54:28] <wikibugs>	 (03PS8) 10Krinkle: mediawiki: Remove references to obsolete rpc/RunJobs.php endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz)
[18:58:34] <logmsgbot>	 !log cdanis@cumin1001 START - Cookbook sre.network.cf
[18:58:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:38] <logmsgbot>	 !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[18:58:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:02] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T282758 (10Dzahn)
[19:01:24] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10Dzahn)
[19:02:43] <wikibugs>	 (03CR) 10Dzahn: "recheck" [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[19:10:50] <wikibugs>	 (03PS1) 10Bstorm: cloud nfs: Change primary cluster rate limits dramatically [puppet] - 10https://gerrit.wikimedia.org/r/691267 (https://phabricator.wikimedia.org/T218338)
[19:14:45] <wikibugs>	 (03CR) 10Bstorm: "Please note, tc ratelimits are written like they are in bits per second, but they are actually bytes per second. I have no idea why." [puppet] - 10https://gerrit.wikimedia.org/r/691267 (https://phabricator.wikimedia.org/T218338) (owner: 10Bstorm)
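Note on the tc units mentioned above: per the tc(8) man page, a rate suffix of `bit` (or a bare number) means bits per second, while `bps` means bytes per second, which is likely the source of the confusion. A minimal Python sketch of that convention follows; the function name is hypothetical, only SI prefixes are handled, and the values in the actual patch are not shown in this log.

```python
import re

# SI prefix multipliers as documented in tc(8). IEC prefixes (ki, mi, ...) are
# deliberately left out of this sketch.
_PREFIX = {"": 1, "k": 1_000, "m": 1_000_000, "g": 1_000_000_000, "t": 1_000_000_000_000}


def tc_rate_to_bits_per_second(rate: str) -> int:
    """Convert a tc-style rate string such as '100kbps' to bits per second.

    Hypothetical helper for illustration only; it is not part of tc or puppet.
    """
    match = re.fullmatch(r"(\d+)(?:([kmgt]?)(bit|bps))?", rate.strip().lower())
    if match is None:
        raise ValueError(f"unsupported rate string: {rate!r}")
    number, prefix, unit = match.groups()
    value = int(number) * _PREFIX[prefix or ""]
    # In tc, 'bps' is BYTES per second; 'bit' or a bare number is bits per second.
    if unit == "bps":
        value *= 8
    return value


if __name__ == "__main__":
    for example in ("8mbit", "1mbps", "100kbps", "1000"):
        print(example, "->", tc_rate_to_bits_per_second(example), "bit/s")
```

So `1mbps` in tc notation is eight times the rate of `1mbit`, despite reading like "megabits per second".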
[19:19:32] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10Aklapper) @Elitre: I'm not after policies; I recommended that people separate roles like they already do on-wiki.
[19:24:52] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP nda for Sannita - https://phabricator.wikimedia.org/T282600 (10Dzahn) The reason to have "WMF" accounts was originally only for "office actions" on wiki, which is a super rare thing in the grand scheme of things.
[19:26:45] <wikibugs>	 (03PS13) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015
[19:32:41] <wikibugs>	 10SRE, 10observability: google safe browsing icinga checks sporadic UNKNOWN due to 404 - https://phabricator.wikimedia.org/T216985 (10Aklapper) p:05High→03Medium
[19:38:53] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:40:08] <Majavah>	 mutante: can I help from the toolforge side with T218828 somehow?
[19:40:09] <stashbot>	 T218828: Commons SVG Checker has differences between Wikimedia rendering and Toolforge rendering - https://phabricator.wikimedia.org/T218828
[19:41:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:46:23] <wikibugs>	 (03PS14) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015
[19:51:45] <mutante>	 Majavah: yes, you can paste on those tickets (about 3 or more, heh) which versions of librsvg are installed! and thank you
[19:52:27] <mutante>	 or if you have info how apt-browser.toolforge.org relates to apt.wikimedia.org
[19:54:36] <Majavah>	 mutante: replied https://phabricator.wikimedia.org/T218828#7088964
[19:56:01] <mutante>	 thank you!
[19:56:14] <Majavah>	 mutante: apt-browser is a flask app by l.egoktm that basically just displays what package versions are available on different versions and components on apt.wikimedia.org
[19:56:27] <Majavah>	 (according to https://toolsadmin.wikimedia.org/tools/id/apt-browser and https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/apt-browser/)
[19:56:56] <mutante>	 Majavah: it's like all of those tickets have the core issue that there is an expectation toolforge things are prod things
[19:57:30] <mutante>	 Majavah: that makes sense, but leaves the question why the ticket (rightfully) says the versions differ
[19:58:00] <mutante>	 or they are comparing the wrong component or something
[19:58:16] <wikibugs>	 (03CR) 10Jbond: peopleweb: put a public_html into /etc/skel to ensure all users get one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn)
[19:58:56] <Majavah>	 yeah, unfortunately most tickets like "wmcs and prod are different" should be reported to the maintainer of said tool running in wmcs and not to sre/wmcs
[19:59:04] <wikibugs>	 (03CR) 10Dzahn: "No worries, I added some echo" around it at first and did not blindly run it" [puppet] - 10https://gerrit.wikimedia.org/r/690787 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn)
[20:00:26] <mutante>	 Majavah: would be solved if volunteers could use debmonitor.wm.org but that's another can of worms
[20:00:34] <Majavah>	 mutante: keep in mind that apt-browser only has packages specifically uploaded to apt.wm.o, not everything in debian repositories
[20:00:51] <jbond42>	 im glad mutante :)
[20:01:05] <mutante>	 Majavah: yea, that's why I also linked to mirrors.wikimedia.org though
[20:01:22] <mutante>	 oh, apt-browser, ACK
[20:01:28] <mutante>	 jbond42: hehe:) thanks
[20:01:58] <Majavah>	 or just hire the people who need access to that :P
[20:02:17] <Majavah>	 although I think cn=nda can also access it
[20:02:23] <mutante>	 they could make a cloud VPS and use the production APT sources there
[20:02:41] <mutante>	 and look at apt-cache policy
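Note on comparing package versions with `apt-cache policy`: the command prints an `Installed:` and a `Candidate:` line per package, so the version comparison discussed above could be scripted roughly as below. This is a sketch only; the package name, versions, and repository URL in the sample are made up for illustration, and on a real host the text would come from running `apt-cache policy <package>`.

```python
def parse_policy(text: str) -> dict:
    """Extract installed and candidate versions from `apt-cache policy` output.

    Hypothetical helper for illustration; only the two summary lines are parsed.
    """
    result = {}
    current = None
    for line in text.splitlines():
        # A package header is an unindented line ending in ':'.
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            current = line.rstrip()[:-1]
            result[current] = {}
        elif current is not None:
            stripped = line.strip()
            for key in ("Installed", "Candidate"):
                if stripped.startswith(key + ":"):
                    # Split only on the first colon so epochs like '1:2.3-1' survive.
                    result[current][key.lower()] = stripped.split(":", 1)[1].strip()
    return result


# Made-up sample in the shape of `apt-cache policy` output.
SAMPLE = """\
example-pkg:
  Installed: 1.0-1
  Candidate: 1.1-1
  Version table:
     1.1-1 500
        500 http://apt.example.org/debian stable/main amd64 Packages
 *** 1.0-1 100
        100 /var/lib/dpkg/status
"""

if __name__ == "__main__":
    for name, versions in parse_policy(SAMPLE).items():
        print(name, versions)
```

Running the same parser on a production-sourced VM and on the Toolforge host would give two dictionaries that can be compared directly.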
[20:06:14] <wikibugs>	 (03PS12) 10Legoktm: Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047
[20:10:23] <legoktm>	 apt-browser just scrapes the package listings on apt.wm.o
[20:10:29] <wikibugs>	 (03PS1) 10CDanis: sre.network.cf: Provide some advice in the event of errors [cookbooks] - 10https://gerrit.wikimedia.org/r/691275
[20:10:55] <legoktm>	 it's the equivalent of packages.debian.org basically
[20:16:31] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[20:18:39] <wikibugs>	 (03PS15) 10Ahmon Dancy: Email notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 (https://phabricator.wikimedia.org/T271274)
[20:19:52] <wikibugs>	 10SRE, 10Okapi [Wikimedia Enterprise], 10Platform Engineering: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10Ottomata) Here's how this could possibly work:  - Someone (SRE? Platform Eng? Cloud Services?) provisions and maintains a n...
[20:21:03] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[20:22:49] <icinga-wm>	 RECOVERY - SSH on logstash2020.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:25:04] <wikibugs>	 (03CR) 10Legoktm: Add shellbox chart (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm)
[20:25:42] <wikibugs>	 (03PS13) 10Legoktm: Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047
[20:31:13] <wikibugs>	 (03CR) 10Legoktm: "PS12: Fixed the SetEnvIf syntax, it now passes the secret key properly and all works" [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm)
[20:31:47] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts people1002.eqiad.wmnet
[20:31:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:00] <mutante>	 !log people1002 - decom'ing - please use people1003 and see list mail
[20:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:08] <wikibugs>	 (03PS2) 10Legoktm: docker-registry: Clean up old nginx http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/691106 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris)
[20:33:23] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] sre.network.cf: Provide some advice in the event of errors (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/691275 (owner: 10CDanis)
[20:34:35] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "I clarified in the commit message that this just removes the nginx HTTP endpoint. The build-homepage script talks directly to the registry" [puppet] - 10https://gerrit.wikimedia.org/r/691106 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris)
[20:34:47] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] docker-registry: Remove Docker-Distribution-API-version header [puppet] - 10https://gerrit.wikimedia.org/r/691107 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris)
[20:36:13] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy)
[20:37:24] <wikibugs>	 (03Merged) 10jenkins-bot: Email notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy)
[20:38:43] <wikibugs>	 (03PS1) 10Dwisehaupt: Monitor civiproxy nginx port [puppet] - 10https://gerrit.wikimedia.org/r/691277 (https://phabricator.wikimedia.org/T281321)
[20:39:17] <wikibugs>	 (03PS1) 10Krinkle: [Beta Cluster] Enable onhost memc tier for ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691278 (https://phabricator.wikimedia.org/T264604)
[20:40:11] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] [Beta Cluster] Enable onhost memc tier for ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691278 (https://phabricator.wikimedia.org/T264604) (owner: 10Krinkle)
[20:40:53] <wikibugs>	 (03CR) 10Legoktm: docker-registry: Re-apply Cache-Control rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/691108 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris)
[20:41:07] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] docker-registry: Remove absented nginx-site resource [puppet] - 10https://gerrit.wikimedia.org/r/691110 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris)
[20:41:36] <wikibugs>	 (03Merged) 10jenkins-bot: [Beta Cluster] Enable onhost memc tier for ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691278 (https://phabricator.wikimedia.org/T264604) (owner: 10Krinkle)
[20:41:39] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "Er, meant to +1." [puppet] - 10https://gerrit.wikimedia.org/r/691110 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris)
[20:42:31] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people1002.eqiad.wmnet
[20:42:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:39] <wikibugs>	 10SRE, 10Patch-For-Review: try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye - https://phabricator.wikimedia.org/T280989 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `people1002.eqiad.wmnet` - people1002.eqiad.wmnet (**PASS**)...
[20:52:43] <wikibugs>	 (03PS1) 10Dzahn: add a test variant to match the test pipeline [container/miscweb] - 10https://gerrit.wikimedia.org/r/691283
[20:53:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add a test variant to match the test pipeline [container/miscweb] - 10https://gerrit.wikimedia.org/r/691283 (owner: 10Dzahn)
[20:55:24] <wikibugs>	 (03PS1) 10Krinkle: [Beta Cluster] Fix undefined 'mcrouter-with-onhost-tier' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691285 (https://phabricator.wikimedia.org/T264604)
[20:55:53] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] [Beta Cluster] Fix undefined 'mcrouter-with-onhost-tier' (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691285 (https://phabricator.wikimedia.org/T264604) (owner: 10Krinkle)
[20:56:50] <wikibugs>	 (03Merged) 10jenkins-bot: [Beta Cluster] Fix undefined 'mcrouter-with-onhost-tier' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/691285 (https://phabricator.wikimedia.org/T264604) (owner: 10Krinkle)
[20:59:10] <wikibugs>	 (03CR) 10Cwhite: [V: 03+1 C: 03+1] "This works well!" [puppet] - 10https://gerrit.wikimedia.org/r/691231 (owner: 10Filippo Giunchedi)
[21:03:00] <wikibugs>	 (03PS1) 10Legoktm: httpd: Add directory for applications to add config [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691287
[21:13:54] <wikibugs>	 (03PS1) 10Dzahn: httpd: add a resursive chmod to ensure log files are group writable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691293
[21:15:17] <wikibugs>	 (03PS2) 10Dzahn: httpd: add a resursive chmod to ensure log files are group writable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691293
[21:17:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:19:27] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:22:19] <wikibugs>	 (03CR) 10Volans: "Couple of nits inline, none is a blocker." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/691275 (owner: 10CDanis)
[21:37:17] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (10wiki_willy) a:03Cmjohnson
[21:38:35] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[21:38:57] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10ORES, 10Release Pipeline (Blubber): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10thcipriani)
[21:39:00] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10ORES, 10Release Pipeline, 10Release-Engineering-Team (Seen): Execution of the deployment pipeline should be configurable via .pipeline/config.yaml - https://phabricator.wikimedia.org/T210267 (10thcipriani) 05Open→03Resolved a:03dduvall There are now many services t...
[21:40:01] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[21:44:49] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[22:22:45] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[22:24:57] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:27:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:29:21] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[23:07:43] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[23:11:16] <wikibugs>	 (03CR) 10Dwisehaupt: "For when we are ready to monitor this service fully." [puppet] - 10https://gerrit.wikimedia.org/r/691277 (https://phabricator.wikimedia.org/T281321) (owner: 10Dwisehaupt)
[23:14:47] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[23:19:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:21:44] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Platonides) Probably more a Feature Request for upstream, but I think mailman3 should parse that rejection message, find out the error is actually due to the specific message it was tr...
[23:24:27] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:47:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:50:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:52:33] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[23:59:33] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/