[00:01:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:06:15] ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (RobH)
[00:06:46] ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (RobH)
[00:07:32] ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (RobH)
[00:08:13] ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (RobH) a:Jclark-ctr
[00:15:47] ops-codfw, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (RobH)
[00:15:59] ops-codfw, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (RobH)
[00:16:21] ops-codfw, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (RobH) a:Papaul
[00:19:51] ops-codfw, DC-Ops, Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (RobH)
[00:20:00] ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (RobH)
[00:55:11] (PS1) Legoktm: sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516)
[01:06:38] apcu release:
[01:06:40] - Fix handling of references in PHP 8 if "default" serializer is used (which is not the default).
[01:06:52] default not the default, got it.
[01:14:45] * Zppix wonders what the default is
[01:37:24] I just got a db locked warning due to rep lag, dunno if it is known or worrisome
[01:43:59] dont|panic: probably good idea to provide which wiki
[01:44:17] hrwiki, sorry
[03:10:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:13:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:48:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:50:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:33:10] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 215828736 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:35:32] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 766816 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:39:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:42:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:08:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:10:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:19:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:22:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:36:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:36:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:16:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:18:38] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:35:06] Puppet, SRE, Beta-Cluster-Infrastructure, Technical-Debt, Tracking-Neverending: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (Majavah)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210306T0800)
[08:01:48] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.486e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[08:06:30] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.02948 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[08:17:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,swagger_check_wikifeeds_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:19:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:09:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:11:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:45:27] (PS1) Majavah: Add apache2 mod_rewrite to beta prometheus [puppet] - https://gerrit.wikimedia.org/r/668995 (https://phabricator.wikimedia.org/T276654)
[10:53:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=redis_maps site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:58:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:54:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:56:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:30:20] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:56] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:02] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: jiji - jiji, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:23:22] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:22:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:25:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:47:56] SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (wkandek) I created https://phabricator.wikimedia.org/T276673 to track the decison on how to...
[16:57:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:00:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:19:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:21:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:56:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:59:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:01:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:03:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:17:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:19:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:28:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:30:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:47:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:51:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:50:45] (PS1) Ladsgroup: mailman3: Add exim4 configuration [puppet] - https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536)
[21:51:58] (CR) jerkins-bot: [V: -1] mailman3: Add exim4 configuration [puppet] - https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup)
[22:18:30] On my userpage (https://simple.wikipedia.org/wiki/User:Operator873 ), there is a link to my website. If you click the link on my userpage, it's being hijacked and redirected elsewhere. I've verified my DNS records are intact and linking from anywhere else works as expected. Somehow, the link is hijacked on mediawiki
[22:18:36] This behavior is only happening on wikimedia sites. I've since removed the link from my meta page and enwiki page. I left the simple wiki page for investigation. Since this is confined to wikimedia, I think there may be something bad afoot.
[22:22:22] Operator873: can reproduce here as well
[22:22:29] nod
[22:22:42] I had it confirmed a couple places before I began sounding alarms
[22:23:13] can reproduce with safe (no-JS) mode as well
[22:23:34] I'm trying to confirm a report that 'safe-mode' on Firefox works though
[22:23:57] Operator873: copying url from source mode into incognito/separate tab works too
[22:24:38] on mobile, using safe-mode still kicked me over to an ad site
[22:24:39] generated html is "873gear.com"
[22:25:45] weird
[22:26:32] ?action=purge does nothing
[22:26:59] Operator873: when curling 873gear.com with Referer header set to simple.wikimedia.org I get a redirect header to a shady-sounding site
[22:27:25] the first redirect I see is blameworth.buzz or something like that
[22:27:36] but not if I don't set that referer header
[22:28:05] when redirecting in the header for the request for redirect it says http://xn--i1avf9a.xn--p1ai/
[22:28:06] oops
[22:28:17] Referer: https://simple.wikipedia.org/*
[22:30:16] https://www.irccloud.com/pastebin/TSIpIeto/
[22:30:20] Operator873: ^
[22:30:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:30:25] Operator873: I get always redirected when I set a Referer header, you sure it's not on your side?
[22:30:53] you might wanna check your site's config again
[22:31:00] Majavah I'm checking
[22:31:01] Zppix: same here
[22:32:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:38:24] <[1997kB]> even on actual site when click on links in navbar, it still takes me to that shady site.
[22:39:13] (PS2) Ladsgroup: mailman3: Add exim4 configuration [puppet] - https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536)
[22:39:32] indeed Operator873, see [1997kB] message
[22:39:45] nods
[22:39:49] I think I found the culprit
[22:45:16] Alright... [1997kB] Majavah Zppix RhinosF1 can you please confirm site now loads appropriately?
[22:45:47] Operator873: yeah
[22:45:50] LGTM
[22:48:18] Operator873: works for me, what was it?
[22:48:49] indeed it was. You catching that header issue allowed me to quickly isolate the cause.
[22:48:53] Was 100% my end
[22:49:25] looks like a wordpress issue. (Don't be too harsh, I don't write html)
[22:49:38] Operator873: like bad config
[22:50:44] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) I'm asking to have a test VM in production to test it. Can I have a test db in production for this? two databases (`mailman3` and `mailman3web` with users exactly the same name) pr...
[23:01:24] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 191375288 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[23:05:15] SRE, Wikimedia-Mailing-lists: Reuqesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Ladsgroup)
[23:06:10] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 733888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
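
Editor's note on the 22:26 to 22:28 exchange: the check that isolated the problem was requesting 873gear.com once with a `Referer` header and once without, then comparing the redirect each request got. A minimal Python sketch of that comparison follows. It is not what anyone in the channel ran (they used curl), and the helper names `build_request`, `redirect_target` and `depends_on_referer` are hypothetical, as is the `referer-check-sketch` User-Agent string.

```python
import urllib.error
import urllib.request
from typing import Optional


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Keep urllib from following redirects so the 3xx reply can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response.
        return None


def build_request(url: str, referer: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, optionally carrying a Referer header (hypothetical helper)."""
    headers = {"User-Agent": "referer-check-sketch"}  # assumed UA string
    if referer is not None:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def redirect_target(url: str, referer: Optional[str] = None,
                    timeout: float = 10.0) -> Optional[str]:
    """Return the Location header of a 3xx reply, or None if there is no redirect."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(build_request(url, referer), timeout=timeout):
            return None  # plain 2xx response
    except urllib.error.HTTPError as err:
        if 300 <= err.code < 400:
            return err.headers.get("Location")
        raise


def depends_on_referer(with_referer: Optional[str],
                       without_referer: Optional[str]) -> bool:
    """True when a redirect appears only, or differs, once the Referer is sent."""
    return with_referer is not None and with_referer != without_referer
```

To repeat the check, one would call `redirect_target` twice on the site URL (the scheme is an assumption, the log only gives the bare hostname), once with `referer="https://simple.wikipedia.org/"` and once without, and pass both results to `depends_on_referer`. A `True` result points at the destination server rather than at MediaWiki, which matches what Operator873 later confirmed.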