[00:00:29] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:00:31] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:02:17] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:02:49] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:02:57] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:02:57] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:03:13] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:04:05] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: 
Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:04:31] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [00:06:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:09:57] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:33:37] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.012e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [00:49:45] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [01:33:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:37:19] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:58:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:02:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:14:47] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:18:21] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:20:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:27:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:29:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:32:39] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:35:07] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.292e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [02:51:13] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [02:54:46] (CR) Krinkle: [C: +1] xhgui: use ensure=>present instead of ensure=>latest [puppet] - https://gerrit.wikimedia.org/r/560364 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn) [02:55:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:59:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:04:47] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:08:21] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:19:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:22:37] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:05:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:09:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:37:43] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:43:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:46:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:50:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:52:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:55:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:37:04] (PS1) Minhducsun2002: Upload HD logos for fa, te wikiquote & fr wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/560555 [05:44:18] (CR) Minhducsun2002: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560555 (owner: Minhducsun2002) [05:44:37] (PS1) Minhducsun2002: Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560556 [05:48:56] (PS2) Minhducsun2002: Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560556 [05:52:49] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:53:54] (PS3) Minhducsun2002: Add wgLogoHD entry for fa, te wikiquote & fr wikisource in IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560556 [05:56:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:58:11] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:05:46] (CR) Ammarpad: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560555 (owner: Minhducsun2002) [06:07:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:07:35] (CR) Ammarpad: "> recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560555 (owner: Minhducsun2002) [06:09:54] (CR) Ammarpad: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560556 (owner: Minhducsun2002) [06:14:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:17:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:21:29] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:25:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:28:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:32:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:39:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:42:45] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:44:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:46:55] (CR) ArielGlenn: "> I would personally be a lot happier with line length of 79." [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [06:47:01] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (Alicia_Fagerving_WMSE) >>! In T240455#5762892, @Reedy wrote: > It works!... 
[06:48:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:49:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:02:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:04:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:13:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:18:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:21:15] (Abandoned) Andrew Bogott: nova firstboot: add a few setup steps to firstboot.sh [puppet] - https://gerrit.wikimedia.org/r/560206 (https://phabricator.wikimedia.org/T181375) (owner: Andrew Bogott) [07:27:19] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:29:05] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:29:59] Operations, Gerrit, serviceops, Patch-For-Review: Convert Gerrit to use H2 as the database - https://phabricator.wikimedia.org/T211139 (Peachey88) >>! In T211139#5762926, @Paladox wrote: >>>! In T211139#4798560, @Dzahn wrote: >> On one hand i would love this because it would make the gerrit codfw... [07:32:39] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:39:47] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:46:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:54:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:02:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:03:01] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:04:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:18:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:22:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:24:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:25:11] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (TheDJ) Thank u @Reedy ! [08:31:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:33:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:36:47] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:38:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:42:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:45:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:49:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:19:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:20:59] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3.471e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:23:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:24:31] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 2 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:28:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:31:43] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.193e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:33:31] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 2 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:33:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={atlas_exporter,icinga} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:43:52] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [09:46:15] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 571 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:46:21] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 3 probes of 567 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:46:29] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 571 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:46:31] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 571 (alerts on 35) - 
https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:46:41] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 36 probes of 509 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:47:43] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 571 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:48:03] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 27 probes of 509 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:48:05] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 509 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:48:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:43] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 27 probes of 509 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:49:43] ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project andrew bogott papering over this for another day T239168 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [09:49:43] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 29 probes of 505 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [09:51:01] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [10:08:49] (PS1) Arturo Borrero Gonzalez: openstack: reduce again number of DB connections [puppet] - https://gerrit.wikimedia.org/r/560575 [10:11:54] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [10:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:12] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:20] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [10:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:25] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:14] (CR) Arturo Borrero 
Gonzalez: [C: +2] openstack: reduce again number of DB connections [puppet] - https://gerrit.wikimedia.org/r/560575 (owner: Arturo Borrero Gonzalez) [10:14:22] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Krd) Could anybody please explain why such an easy task does takes so long to get resolved? What can be done to expedite? [10:17:43] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Reedy) >>! In T241132#5763084, @Krd wrote: > Could anybody please explain why such an easy task does takes so long to get resolved? What can be done to expedi... [10:21:30] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Krd) Re-adding an MX record is not rocket science, this should be possible also at this time of the year, and not noticing missing e-mails if a different thin... 
[10:44:45] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:45:33] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:46:33] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:47:21] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:47:43] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:51:17] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:51:23] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) 
is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:51:55] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:52:43] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:53:05] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:53:11] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:53:26] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Peachey88) >>! In T241132#5763086, @Krd wrote: > Re-adding an MX record is not rocket science, this should be possible also at this time of the year, and not... [10:54:04] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Aklapper) >>! In T241132#5763084, @Krd wrote: > Could anybody please explain why such an easy task does takes so long to get resolved? What can be done to exp...
[10:54:53] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:55:29] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Krd) A formal deployment freeze sounds reasonable. Can you please advise for which day the change can be scheduled, so we can decide if an expensive interim s... [10:55:33] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:56:19] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:58:31] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:00:51] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:02:05] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:02:39] RECOVERY - recommendation_api
endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:03:33] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:03:51] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:04:14] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Peachey88) >>! In T241132#5753952, @Reedy wrote: > Did this ever work? Or did someone just start using the email and expect it to work? It was purchased for... [11:05:23] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:08:05] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:10:49] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:11:05] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL:
/{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:12:37] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:12:53] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:12:59] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:13:29] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:13:29] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:15:19] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:16:35] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:16:35] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:17:13] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:19:01] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:24:21] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:25:33] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:27:21] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:28:51] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:30:39] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:31:05] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:31:33] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:32:55] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:36:03] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:36:17] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 255.6 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [11:36:26] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:37:51] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test 
article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:37:51] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:38:09] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:38:13] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:38:13] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:38:19] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:39:41] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:39:59] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:40:03] RECOVERY - 
recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:40:07] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:57:22] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Aklapper) >>! In T241132#5763090, @Krd wrote: > A formal deployment freeze sounds reasonable. See https://www.mediawiki.org/wiki/MediaWiki_1.35/Roadmap [12:24:30] Operations, MediaWiki-Authentication-and-authorization, Traffic, Security: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604 (MarcoAurelio) Chrome seems that it'll require both `SameSite` and `Secure` attributes. Example from my console:... [13:37:05] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 89.54 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [13:47:17] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 51664160 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:59:47] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 50952 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:23:37] (PS1) Subscriptshoe9: Upload HD Logo for: simplewikibooks afwikibooks akwikibooks angwikibooks astwikibooks aswikibooks aywikibooks mgwikibooks miwikibooks iawikibooks iewikibooks zuwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) [14:41:11] (CR) Ammarpad: [C: -1] Upload
HD Logo for: simplewikibooks afwikibooks akwikibooks angwikibooks astwikibooks aswikibooks aywikibooks mgwikibooks miwikibooks iawiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [15:00:59] (PS2) Subscriptshoe9: Upload HD Logo for 12 Wikibooks Projects and 1 Wikipeida Project: [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) [15:03:33] (CR) Subscriptshoe9: Upload HD Logo for 12 Wikibooks Projects and 1 Wikipeida Project: (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [15:09:15] (PS1) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) [15:11:28] (PS2) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) [15:12:19] (PS3) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) [15:15:51] (PS3) Subscriptshoe9: Upload HD Logo for 11 Wikibooks Projects and 1 Wikipeida Project: [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) [15:16:30] (PS4) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) [15:21:41] (CR) Urbanecm: [C: -1] "Thanks for optimalizing unoptimalized logos.
However, if you do so, I would advise that being done in a separate commit, so the difference" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [15:21:42] (CR) Urbanecm: [C: -1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [15:22:08] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [15:23:16] (CR) jerkins-bot: [V: -1] Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [15:25:05] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [15:39:26] (PS4) Subscriptshoe9: Upload HD Logo for 9 Wikibooks Projects and 1 Wikipeida Project: [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) [15:39:49] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/560555 (owner: Minhducsun2002) [15:40:09] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/560556 (owner: Minhducsun2002) [15:45:01] (PS1) Subscriptshoe9: Upload HD Logo for 9 Wikibooks Projects and 1 Wikipeida Project: [mediawiki-config] - https://gerrit.wikimedia.org/r/560582 (https://phabricator.wikimedia.org/T150618) [15:45:30] (Abandoned) Subscriptshoe9: Upload HD Logo for 9 Wikibooks Projects and 1 Wikipeida Project: [mediawiki-config] - https://gerrit.wikimedia.org/r/560582 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [15:55:00] (PS5) Subscriptshoe9: Upload HD Logo for 9 Wikibooks Projects and 1 Wikipeida Project:
[mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) [15:55:38] (PS5) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) [17:21:15] (CR) Ammarpad: [C: +1] Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [18:09:09] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:52:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:53:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:02:23] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:03:01] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:20:35] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:22:25] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:55:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:58:11] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 56729448 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:58:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:59:57] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 86968 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:42:43] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [22:14:49] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [22:36:15] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [23:38:52] Operations, MediaWiki-Authentication-and-authorization, Traffic, Security: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604 (Bawolff) How does SameSite=lax work with credentialed CORS requests? That's the only issue i could possibly see... [23:58:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets