[00:00:29] (03CR) 10Razzi: [C: 03+2] superset: comment out check that isn't working as intended [puppet] - 10https://gerrit.wikimedia.org/r/678113 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi)
[00:02:33] (03PS1) 10Papaul: Fix partman recipe for new moss nodes [puppet] - 10https://gerrit.wikimedia.org/r/678125 (https://phabricator.wikimedia.org/T276642)
[00:03:55] (03CR) 10Papaul: [C: 03+2] Fix partman recipe for new moss nodes [puppet] - 10https://gerrit.wikimedia.org/r/678125 (https://phabricator.wikimedia.org/T276642) (owner: 10Papaul)
[00:11:14] (03PS1) 10Legoktm: aptrepo: Add "mailman3" component [puppet] - 10https://gerrit.wikimedia.org/r/678128 (https://phabricator.wikimedia.org/T278905)
[00:13:31] (03CR) 10Legoktm: [C: 03+2] aptrepo: Add "mailman3" component [puppet] - 10https://gerrit.wikimedia.org/r/678128 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm)
[00:22:29] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-be2001.codfw.wmnet'] ` Of which those **FAILED**: ` ['moss-be2001.codfw.wmnet'] `
[00:34:09] (03PS1) 10Razzi: superset: ensure http check absent until it is working [puppet] - 10https://gerrit.wikimedia.org/r/678130 (https://phabricator.wikimedia.org/T277729)
[00:34:16] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` moss-be2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2...
[00:35:44] (03CR) 10jerkins-bot: [V: 04-1] superset: ensure http check absent until it is working [puppet] - 10https://gerrit.wikimedia.org/r/678130 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi)
[00:36:28] (03PS2) 10Razzi: superset: ensure http check absent until it is working [puppet] - 10https://gerrit.wikimedia.org/r/678130 (https://phabricator.wikimedia.org/T277729)
[00:40:03] (03CR) 10Razzi: [C: 03+2] superset: ensure http check absent until it is working [puppet] - 10https://gerrit.wikimedia.org/r/678130 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi)
[00:49:37] !log imported mailman3 backports on apt.wm.o (T278905)
[00:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:47] T278905: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905
[00:52:16] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Legoktm) a:03Legoktm https://apt-browser.toolforge.org/buster-wikimedia/component/mailman3/
[00:54:11] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2001.codfw.wmnet with reason: REIMAGE
[00:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:05] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on moss-be2001.codfw.wmnet with reason: REIMAGE
[00:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:30] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-be2001.codfw.wmnet'] ` and were **ALL** successful.
[01:07:30] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` moss-be2002.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2...
[01:09:20] (03PS1) 10Legoktm: apt: Copyedit package_from_component documentation [puppet] - 10https://gerrit.wikimedia.org/r/678133
[01:09:22] (03PS1) 10Legoktm: mailman3: Use backported packages from component/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905)
[01:10:53] (03CR) 10Legoktm: [C: 03+2] apt: Copyedit package_from_component documentation [puppet] - 10https://gerrit.wikimedia.org/r/678133 (owner: 10Legoktm)
[01:12:05] (03PS2) 10Legoktm: mailman3: Use backported packages from component/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905)
[01:12:53] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28971/console" [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm)
[01:17:31] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Legoktm) I didn't actually check what the schema changes, if any, look like. We should probably be aware of that before testing the upgrade in Cloud (and probab...
[01:19:21] (03CR) 10Ladsgroup: [C: 03+1] mailman3: Use backported packages from component/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm)
[01:23:36] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2002.codfw.wmnet with reason: REIMAGE
[01:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:32] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on moss-be2002.codfw.wmnet with reason: REIMAGE
[01:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:54] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-be2002.codfw.wmnet'] ` and were **ALL** successful.
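The patches above switch the lists host to the backported mailman3 packages imported into the new component/mailman3 on apt.wikimedia.org (T278905). As a purely illustrative aside, one way to confirm which version apt would actually install once the component is enabled on a host is to parse `apt-cache policy`; a minimal sketch, assuming standard apt tooling and the Debian package names `mailman3`/`mailman3-web` (not taken from the log):

```python
#!/usr/bin/env python3
# Minimal sketch: ask apt which mailman3 version it would install.
# Assumes standard apt tooling; package names are illustrative assumptions.
import subprocess


def candidate_version(package):
    """Return the candidate version apt-cache reports for `package`."""
    out = subprocess.run(
        ["apt-cache", "policy", package],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("Candidate:"):
            return line.split(":", 1)[1].strip()
    return "unknown"


if __name__ == "__main__":
    for pkg in ("mailman3", "mailman3-web"):
        print(pkg, "->", candidate_version(pkg))
```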
[01:35:25] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul)
[01:36:08] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul) 05Open→03Resolved @fgiunchedi This is complete
[01:37:54] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul)
[01:40:13] (03PS1) 10Ladsgroup: lists: Drop check for buster [puppet] - 10https://gerrit.wikimedia.org/r/678137
[01:40:46] (03PS2) 10Ladsgroup: lists: Drop check for buster [puppet] - 10https://gerrit.wikimedia.org/r/678137
[01:43:17] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678137 (owner: 10Ladsgroup)
[01:45:48] (03CR) 10Ladsgroup: [C: 03+1] "PCC happy https://puppet-compiler.wmflabs.org/compiler1001/706/" [puppet] - 10https://gerrit.wikimedia.org/r/678137 (owner: 10Ladsgroup)
[02:19:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:21:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:58:40] (03PS1) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[02:59:52] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[02:59:56] (03CR) 10jerkins-bot: [V: 04-1] lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[03:00:49] (03PS2) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[03:04:13] (03PS3) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[03:04:54] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[03:07:31] (03PS4) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[03:08:07] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[03:12:28] (03PS5) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[03:12:43] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[03:18:59] (03CR) 10Ladsgroup: "The change is noop" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[04:02:03] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL -
elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:04:17] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, active_primary_shards: 937, active_shards: 1877, timed_out: False, unassigned_shards: 0, initializing_shards: 0, number_of_nodes: 6, cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number: 100.0, number_of_in_flight_fet
[04:04:17] _shards: 0, number_of_data_nodes: 6, delayed_unassigned_shards: 0, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:07:32] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Ladsgroup) I constantly get > AH00534: apache2: Configuration error: No MPM loaded. trying to install mailamn2 on the cloud. I haven't figured out...
[04:46:49] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Ladsgroup) ` root@mailman03:/etc/apache2# /usr/sbin/apache2 -X -k start AH00534: apache2: Configuration error: No MPM loaded. root@mailman03:/etc/ap...
[04:58:06] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:00:34] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_primary_shards: 937, unassigned_shards: 0, timed_out: False, number_of_in_flight_fetch: 0, status: green, relocating_shards: 0, number_of_pending_tasks: 0, number_of_nodes: 6, active_shards_percent_as_number: 100.0, cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, number_o
[05:00:34] task_max_waiting_in_queue_millis: 0, initializing_shards: 0, active_shards: 1877 https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:14:13] (03PS6) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[05:17:51] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Ladsgroup) \o/ https://polymorphic.lists.wmcloud.org/mailman/listinfo
[05:20:09] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[05:22:05] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28972/console" [puppet] - 10https://gerrit.wikimedia.org/r/678137 (owner: 10Ladsgroup)
[05:22:12] (03CR) 10Legoktm: [V: 03+1 C: 03+2] lists: Drop check for buster [puppet] - 10https://gerrit.wikimedia.org/r/678137 (owner: 10Ladsgroup)
[05:24:53] (03CR) 10Legoktm: [C: 04-1] lists: Make mailman2 easier to run on the cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
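The flapping cloudelastic checks above poll the local Elasticsearch _cluster/health endpoint and alert when the fetch times out (the alert text mentions a read timeout of 4 seconds) or the cluster is not green. A minimal sketch of that kind of probe, assuming the `requests` library and the endpoint/timeout values quoted in the alert text; the production Icinga check lives in operations/puppet and differs in detail:

```python
#!/usr/bin/env python3
# Minimal sketch of a cluster-health probe along the lines the alert text
# implies: fetch _cluster/health with a short timeout and treat a timeout,
# an HTTP error, or a non-green status as a failure. Illustrative only.
import sys

import requests

URL = "http://localhost:9200/_cluster/health"  # endpoint named in the alert
TIMEOUT = 4  # seconds, mirroring the "read timeout=4" in the alert text


def main():
    try:
        resp = requests.get(URL, timeout=TIMEOUT)
        resp.raise_for_status()
        health = resp.json()
    except requests.RequestException as exc:
        print(f"CRITICAL - error while fetching {URL}: {exc}")
        return 2
    status = health.get("status")
    if status != "green":
        print(f"CRITICAL - elasticsearch status is {status}")
        return 2
    print(f"OK - elasticsearch status green, "
          f"active_shards: {health.get('active_shards')}, "
          f"unassigned_shards: {health.get('unassigned_shards')}")
    return 0


if __name__ == "__main__":
    sys.exit(main())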
[05:29:10] (03PS7) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[05:30:01] (03CR) 10Ladsgroup: "check experimental" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[05:52:29] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (10ArielGlenn) >>! In T277629#6985797, @Dzahn wrote: > Any news on access check for @holger.knust ? It might be some da...
[05:58:23] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:00:29] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: status: green, initializing_shards: 0, active_shards: 1877, number_of_in_flight_fetch: 0, number_of_nodes: 6, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, number_of_data_nodes: 6, relocating_shards: 0, active_shards_percent_as_number: 100.0, timed_o
[06:00:29] _primary_shards: 937, cluster_name: cloudelastic-chi-eqiad, unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:00:41] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28973/console" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[06:01:34] (03CR) 10Legoktm: [V: 03+1 C: 03+2] lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[06:31:39] (03PS1) 10Legoktm: ExtensionDistributor: Add REL1_36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678146
[07:00:05] Deploy window No deploys all day! See [[Deployments/Emergencies]] if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210409T0700)
[07:36:38] (03CR) 10Aklapper: "Please abandon per https://phabricator.wikimedia.org/T279226" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: 10Zabe)
[07:44:08] (03Abandoned) 10Zabe: Add 'apihighlimits' to accountcreator on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: 10Zabe)
[08:28:50] so very very quiet :-D
[08:31:07] it's a holiday!
[09:00:25] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[09:24:54] (03PS1) 10Legoktm: codesearch: Puppetize beta frontend [puppet] - 10https://gerrit.wikimedia.org/r/678216 (https://phabricator.wikimedia.org/T277459)
[09:25:51] (03CR) 10jerkins-bot: [V: 04-1] codesearch: Puppetize beta frontend [puppet] - 10https://gerrit.wikimedia.org/r/678216 (https://phabricator.wikimedia.org/T277459) (owner: 10Legoktm)
[09:26:00] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28974/console" [puppet] - 10https://gerrit.wikimedia.org/r/678216 (https://phabricator.wikimedia.org/T277459) (owner: 10Legoktm)
[09:26:34] (03PS2) 10Legoktm: codesearch: Puppetize beta frontend [puppet] - 10https://gerrit.wikimedia.org/r/678216 (https://phabricator.wikimedia.org/T277459)
[10:40:57] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:43:15] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_nodes: 6, unassigned_shards: 0, timed_out: False, number_of_data_nodes: 6, active_shards_percent_as_number: 100.0, number_of_pending_tasks: 0, active_shards: 1877, initializing_shards: 0, delayed_unassigned_shards: 0, active_primary_shards: 937, task_max_waiting_in_queue_millis: 0, cluste
[10:43:15] tic-chi-eqiad, relocating_shards: 0, status: green, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:49:23] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:53:57] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[11:40:51] (03PS1) 10Zabe: Enable on bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678237 (https://phabricator.wikimedia.org/T279635)
[11:41:21] (03CR) 10Zabe: [C: 04-1] "on hold" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678237 (https://phabricator.wikimedia.org/T279635) (owner: 10Zabe)
[12:33:33] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[12:38:07] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: relocating_shards: 0, active_shards: 1877, task_max_waiting_in_queue_millis: 0, active_primary_shards: 937, timed_out: False, delayed_unassigned_shards: 0, active_shards_percent_as_number: 100.0, unassigned_shards: 0, number_of_data_nodes: 6, number_of_in_flight_fetch: 0, cluster_name: cloudelastic
[12:38:07] s: green, number_of_pending_tasks: 0, number_of_nodes: 6, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:24:04] 10SRE, 10LDAP-Access-Requests: Grant access to Superset for Mikeraish - https://phabricator.wikimedia.org/T279147 (10MRaishWMF) @ema it looks like we're up and running! Thanks a lot
[13:34:51] (03CR) 10Jcrespo: "See my comments below." (033 comments) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[13:35:00] (03CR) 10Jcrespo: "recheck" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[14:07:51] !log retry es4 backup dump on eqiad (backup1002)
[14:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:05] ACKNOWLEDGEMENT - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2021-03-30 00:00:01 Jcrespo retrying backup now- the host was restarted while backups ran. - The acknowledgement expires at: 2021-04-13 08:14:53. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[14:36:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:40:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:58:57] (03PS1) 10Urbanecm: toolforge: Install pandoc [puppet] - 10https://gerrit.wikimedia.org/r/678259 (https://phabricator.wikimedia.org/T279787)
[15:06:01] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:08:11] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: unassigned_shards: 0, timed_out: False, relocating_shards: 0, number_of_data_nodes: 6, active_primary_shards: 937, active_shards_percent_as_number: 100.0, cluster_name: cloudelastic-chi-eqiad, number_of_pending_tasks: 0, active_shards: 1877, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_m
[15:08:11] green, initializing_shards: 0, number_of_nodes: 6, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:19:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:22:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:24:21] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:26:31] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards_percent_as_number: 100.0, status: green, number_of_nodes: 6, cluster_name: cloudelastic-chi-eqiad, task_max_waiting_in_queue_millis: 0, timed_out: False, relocating_shards: 0, number_of_data_nodes: 6, number_of_pending_tasks: 0, unassigned_shards: 0, active_shards: 1877, number_of_in_
[15:26:31] active_primary_shards: 937, delayed_unassigned_shards: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:06:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) the Dell tech came out and replaced the motherboard, that did not fix the issue, it turns out that there is bad cable to the backplane. A new part has been o...
[16:10:51] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:15:31] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, relocating_shards: 0, number_of_nodes: 6, initializing_shards: 0, task_max_waiting_in_queue_millis: 0, active_primary_shards: 937, active_shards: 1877, unassigned_shards: 0, status: green, number_of_in_flight_fetch: 0, timed_out: F
[16:15:31] ds_percent_as_number: 100.0, number_of_data_nodes: 6, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:57:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:58:41] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:06:19] (03PS3) 10Krinkle: [Beta Cluster] mc: Set new 'broadcastRoutingPrefix' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677732
[18:06:21] (03PS3) 10Krinkle: mc: Set 'broadcastRoutingPrefix' option in $wgWANObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677418
[18:06:33] AaronSchulz: ok to land the one for beta now, I think?
[18:14:52] Krinkle: lgtm
[18:16:36] (03CR) 10Krinkle: [C: 03+2] [Beta Cluster] mc: Set new 'broadcastRoutingPrefix' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677732 (owner: 10Krinkle)
[18:19:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:21:33] (03Merged) 10jenkins-bot: [Beta Cluster] mc: Set new 'broadcastRoutingPrefix' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677732 (owner: 10Krinkle)
[18:22:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:26:08] (03PS3) 10Krinkle: [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677733
[18:26:10] (03PS3) 10Krinkle: mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677734
[19:54:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:57:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:37:31] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:39:51] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: unassigned_shards: 0, active_shards_percent_as_number: 100.0, active_primary_shards: 937, initializing_shards: 0, number_of_data_nodes: 6, number_of_in_flight_fetch: 0, number_of_nodes: 6, status: green, relocating_shards: 0, cluster_name: cloudelastic-chi-eqiad, active_shards: 1877, task_max_waiti
[21:39:51] s: 0, number_of_pending_tasks: 0, delayed_unassigned_shards: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:33] PROBLEM - Host lvs3005 is DOWN: PING CRITICAL - Packet loss = 100%
[22:12:01] RECOVERY - Host lvs3005 is UP: PING WARNING - Packet loss = 77%, RTA = 183.64 ms
[22:12:15] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:13:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:16:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:52:40] 10SRE, 10Traffic: Wikimedia was temporarily unreachable - https://phabricator.wikimedia.org/T279809 (10Legoktm) Around what time did you get errors?
[23:32:16] 10SRE, 10Traffic: Wikimedia was temporarily unreachable - https://phabricator.wikimedia.org/T279809 (10Krinkle) Looking at [Grafana: frontend traffic](https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1617441420000&to=1618010820000): |{F34316356 height=300} |{F34316343 height=200} Looks...
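The "Prometheus jobs reduced availability" alerts that recur through the day (atlas_exporter, netbox_device_statistics, routinator) fire when some of a job's scrape targets stop reporting as up. A minimal sketch of how one could list the down targets for a given job via the Prometheus HTTP query API, assuming a reachable Prometheus server (the base URL below is a placeholder) and the `requests` library; the real alert is defined in the monitoring configuration in operations/puppet:

```python
#!/usr/bin/env python3
# Minimal sketch: query the Prometheus HTTP API for targets of a job whose
# `up` metric is 0, i.e. targets the server failed to scrape.
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # placeholder, not a real endpoint


def down_instances(job):
    """Return the `instance` labels of targets in `job` that are currently down."""
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query",
        params={"query": f'up{{job="{job}"}} == 0'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [sample["metric"].get("instance", "?") for sample in result]


if __name__ == "__main__":
    for job in ("atlas_exporter", "netbox_device_statistics", "routinator"):
        print(job, down_instances(job))
```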