[00:00:29] (03CR) 10Razzi: [C: 03+2] superset: comment out check that isn't working as intended [puppet] - 10https://gerrit.wikimedia.org/r/678113 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi)
[00:02:33] (03PS1) 10Papaul: Fix partman recipe for new moss nodes [puppet] - 10https://gerrit.wikimedia.org/r/678125 (https://phabricator.wikimedia.org/T276642)
[00:03:55] (03CR) 10Papaul: [C: 03+2] Fix partman recipe for new moss nodes [puppet] - 10https://gerrit.wikimedia.org/r/678125 (https://phabricator.wikimedia.org/T276642) (owner: 10Papaul)
[00:11:14] (03PS1) 10Legoktm: aptrepo: Add "mailman3" component [puppet] - 10https://gerrit.wikimedia.org/r/678128 (https://phabricator.wikimedia.org/T278905)
[00:13:31] (03CR) 10Legoktm: [C: 03+2] aptrepo: Add "mailman3" component [puppet] - 10https://gerrit.wikimedia.org/r/678128 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm)
[00:22:29] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-be2001.codfw.wmnet'] ` Of which those **FAILED**: ` ['moss-be2001.codfw.wmnet'] `
[00:34:09] (03PS1) 10Razzi: superset: ensure http check absent until it is working [puppet] - 10https://gerrit.wikimedia.org/r/678130 (https://phabricator.wikimedia.org/T277729)
[00:34:16] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` moss-be2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2...
[00:35:44] (03CR) 10jerkins-bot: [V: 04-1] superset: ensure http check absent until it is working [puppet] - 10https://gerrit.wikimedia.org/r/678130 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi)
[00:36:28] (03PS2) 10Razzi: superset: ensure http check absent until it is working [puppet] - 10https://gerrit.wikimedia.org/r/678130 (https://phabricator.wikimedia.org/T277729)
[00:40:03] (03CR) 10Razzi: [C: 03+2] superset: ensure http check absent until it is working [puppet] - 10https://gerrit.wikimedia.org/r/678130 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi)
[00:49:37] !log imported mailman3 backports on apt.wm.o (T278905)
[00:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:47] T278905: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905
[00:52:16] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Legoktm) a:03Legoktm https://apt-browser.toolforge.org/buster-wikimedia/component/mailman3/
[00:54:11] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2001.codfw.wmnet with reason: REIMAGE
[00:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:05] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on moss-be2001.codfw.wmnet with reason: REIMAGE
[00:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:30] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-be2001.codfw.wmnet'] ` and were **ALL** successful.
[01:07:30] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` moss-be2002.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2...
[01:09:20] (03PS1) 10Legoktm: apt: Copyedit package_from_component documentation [puppet] - 10https://gerrit.wikimedia.org/r/678133
[01:09:22] (03PS1) 10Legoktm: mailman3: Use backported packages from component/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905)
[01:10:53] (03CR) 10Legoktm: [C: 03+2] apt: Copyedit package_from_component documentation [puppet] - 10https://gerrit.wikimedia.org/r/678133 (owner: 10Legoktm)
[01:12:05] (03PS2) 10Legoktm: mailman3: Use backported packages from component/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905)
[01:12:53] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28971/console" [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm)
[01:17:31] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Legoktm) I didn't actually check what the schema changes, if any, look like. We should probably be aware of that before testing the upgrade in Cloud (and probab...
[01:19:21] (03CR) 10Ladsgroup: [C: 03+1] mailman3: Use backported packages from component/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm)
[01:23:36] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2002.codfw.wmnet with reason: REIMAGE
[01:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:32] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on moss-be2002.codfw.wmnet with reason: REIMAGE
[01:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:54] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-be2002.codfw.wmnet'] ` and were **ALL** successful.
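The patches above switch the lists host to the backported mailman3 packages imported into the new component/mailman3 on apt.wikimedia.org (T278905). As a purely illustrative aside, one way to confirm which version apt would actually install once the component is enabled on a host is to parse `apt-cache policy`; a minimal sketch, assuming standard apt tooling and the Debian package names `mailman3`/`mailman3-web` (not taken from the log):

```python
#!/usr/bin/env python3
# Minimal sketch: ask apt which mailman3 version it would install.
# Assumes standard apt tooling; package names are illustrative assumptions.
import subprocess


def candidate_version(package):
    """Return the candidate version apt-cache reports for `package`."""
    out = subprocess.run(
        ["apt-cache", "policy", package],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("Candidate:"):
            return line.split(":", 1)[1].strip()
    return "unknown"


if __name__ == "__main__":
    for pkg in ("mailman3", "mailman3-web"):
        print(pkg, "->", candidate_version(pkg))
```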
[01:35:25] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul)
[01:36:08] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul) 05Open→03Resolved @fgiunchedi This is complete
[01:37:54] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul)
[01:40:13] (03PS1) 10Ladsgroup: lists: Drop check for buster [puppet] - 10https://gerrit.wikimedia.org/r/678137
[01:40:46] (03PS2) 10Ladsgroup: lists: Drop check for buster [puppet] - 10https://gerrit.wikimedia.org/r/678137
[01:43:17] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678137 (owner: 10Ladsgroup)
[01:45:48] (03CR) 10Ladsgroup: [C: 03+1] "PCC happy https://puppet-compiler.wmflabs.org/compiler1001/706/" [puppet] - 10https://gerrit.wikimedia.org/r/678137 (owner: 10Ladsgroup)
[02:19:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:21:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:58:40] (03PS1) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[02:59:52] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[02:59:56] (03CR) 10jerkins-bot: [V: 04-1] lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[03:00:49] (03PS2) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[03:04:13] (03PS3) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[03:04:54] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[03:07:31] (03PS4) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[03:08:07] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[03:12:28] (03PS5) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[03:12:43] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[03:18:59] (03CR) 10Ladsgroup: "The change is noop" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[04:02:03] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL -
elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:04:17] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, active_primary_shards: 937, active_shards: 1877, timed_out: False, unassigned_shards: 0, initializing_shards: 0, number_of_nodes: 6, cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number: 100.0, number_of_in_flight_fet
[04:04:17] _shards: 0, number_of_data_nodes: 6, delayed_unassigned_shards: 0, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:07:32] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Ladsgroup) I constantly get > AH00534: apache2: Configuration error: No MPM loaded. trying to install mailamn2 on the cloud. I haven't figured out...
[04:46:49] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Ladsgroup) ` root@mailman03:/etc/apache2# /usr/sbin/apache2 -X -k start AH00534: apache2: Configuration error: No MPM loaded. root@mailman03:/etc/ap...
[04:58:06] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:00:34] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_primary_shards: 937, unassigned_shards: 0, timed_out: False, number_of_in_flight_fetch: 0, status: green, relocating_shards: 0, number_of_pending_tasks: 0, number_of_nodes: 6, active_shards_percent_as_number: 100.0, cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, number_o
[05:00:34] task_max_waiting_in_queue_millis: 0, initializing_shards: 0, active_shards: 1877 https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:14:13] (03PS6) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[05:17:51] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Ladsgroup) \o/ https://polymorphic.lists.wmcloud.org/mailman/listinfo
[05:20:09] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[05:22:05] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28972/console" [puppet] - 10https://gerrit.wikimedia.org/r/678137 (owner: 10Ladsgroup)
[05:22:12] (03CR) 10Legoktm: [V: 03+1 C: 03+2] lists: Drop check for buster [puppet] - 10https://gerrit.wikimedia.org/r/678137 (owner: 10Ladsgroup)
[05:24:53] (03CR) 10Legoktm: [C: 04-1] lists: Make mailman2 easier to run on the cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
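The flapping cloudelastic checks above poll the local Elasticsearch _cluster/health endpoint and alert when the fetch times out (the alert text mentions a read timeout of 4 seconds) or the cluster is not green. A minimal sketch of that kind of probe, assuming the `requests` library and the endpoint/timeout values quoted in the alert text; the production Icinga check lives in operations/puppet and differs in detail:

```python
#!/usr/bin/env python3
# Minimal sketch of a cluster-health probe along the lines the alert text
# implies: fetch _cluster/health with a short timeout and treat a timeout,
# an HTTP error, or a non-green status as a failure. Illustrative only.
import sys

import requests

URL = "http://localhost:9200/_cluster/health"  # endpoint named in the alert
TIMEOUT = 4  # seconds, mirroring the "read timeout=4" in the alert text


def main():
    try:
        resp = requests.get(URL, timeout=TIMEOUT)
        resp.raise_for_status()
        health = resp.json()
    except requests.RequestException as exc:
        print(f"CRITICAL - error while fetching {URL}: {exc}")
        return 2
    status = health.get("status")
    if status != "green":
        print(f"CRITICAL - elasticsearch status is {status}")
        return 2
    print(f"OK - elasticsearch status green, "
          f"active_shards: {health.get('active_shards')}, "
          f"unassigned_shards: {health.get('unassigned_shards')}")
    return 0


if __name__ == "__main__":
    sys.exit(main())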
[05:29:10] (03PS7) 10Ladsgroup: lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612)
[05:30:01] (03CR) 10Ladsgroup: "check experimental" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[05:52:29] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (10ArielGlenn) >>! In T277629#6985797, @Dzahn wrote: > Any news on access check for @holger.knust ? It might be some da...
[05:58:23] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:00:29] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: status: green, initializing_shards: 0, active_shards: 1877, number_of_in_flight_fetch: 0, number_of_nodes: 6, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, number_of_data_nodes: 6, relocating_shards: 0, active_shards_percent_as_number: 100.0, timed_o
[06:00:29] _primary_shards: 937, cluster_name: cloudelastic-chi-eqiad, unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:00:41] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28973/console" [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[06:01:34] (03CR) 10Legoktm: [V: 03+1 C: 03+2] lists: Make mailman2 easier to run on the cloud [puppet] - 10https://gerrit.wikimedia.org/r/678140 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[06:31:39] (03PS1) 10Legoktm: ExtensionDistributor: Add REL1_36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678146
[07:00:05] Deploy window No deploys all day! See [[Deployments/Emergencies]] if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210409T0700)
[07:36:38] (03CR) 10Aklapper: "Please abandon per https://phabricator.wikimedia.org/T279226" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: 10Zabe)
[07:44:08] (03Abandoned) 10Zabe: Add 'apihighlimits' to accountcreator on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: 10Zabe)
[08:28:50] so very very quiet :-D
[08:31:07] it's a holiday!
[09:00:25] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[09:24:54] (03PS1) 10Legoktm: codesearch: Puppetize beta frontend [puppet] - 10https://gerrit.wikimedia.org/r/678216 (https://phabricator.wikimedia.org/T277459)
[09:25:51] (03CR) 10jerkins-bot: [V: 04-1] codesearch: Puppetize beta frontend [puppet] - 10https://gerrit.wikimedia.org/r/678216 (https://phabricator.wikimedia.org/T277459) (owner: 10Legoktm)
[09:26:00] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28974/console" [puppet] - 10https://gerrit.wikimedia.org/r/678216 (https://phabricator.wikimedia.org/T277459) (owner: 10Legoktm)
[09:26:34] (03PS2) 10Legoktm: codesearch: Puppetize beta frontend [puppet] - 10https://gerrit.wikimedia.org/r/678216 (https://phabricator.wikimedia.org/T277459)
[10:40:57] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:43:15] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_nodes: 6, unassigned_shards: 0, timed_out: False, number_of_data_nodes: 6, active_shards_percent_as_number: 100.0, number_of_pending_tasks: 0, active_shards: 1877, initializing_shards: 0, delayed_unassigned_shards: 0, active_primary_shards: 937, task_max_waiting_in_queue_millis: 0, cluste
[10:43:15] tic-chi-eqiad, relocating_shards: 0, status: green, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:49:23] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:53:57] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[11:40:51] (03PS1) 10Zabe: Enable on bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678237 (https://phabricator.wikimedia.org/T279635)
[11:41:21] (03CR) 10Zabe: [C: 04-1] "on hold" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/678237 (https://phabricator.wikimedia.org/T279635) (owner: 10Zabe)
[12:33:33] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[12:38:07] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: relocating_shards: 0, active_shards: 1877, task_max_waiting_in_queue_millis: 0, active_primary_shards: 937, timed_out: False, delayed_unassigned_shards: 0, active_shards_percent_as_number: 100.0, unassigned_shards: 0, number_of_data_nodes: 6, number_of_in_flight_fetch: 0, cluster_name: cloudelastic
[12:38:07] s: green, number_of_pending_tasks: 0, number_of_nodes: 6, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:24:04] 10SRE, 10LDAP-Access-Requests: Grant access to Superset for Mikeraish - https://phabricator.wikimedia.org/T279147 (10MRaishWMF) @ema it looks like we're up and running! Thanks a lot
[13:34:51] (03CR) 10Jcrespo: "See my comments below." (033 comments) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[13:35:00] (03CR) 10Jcrespo: "recheck" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[14:07:51] !log retry es4 backup dump on eqiad (backup1002)
[14:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:05] ACKNOWLEDGEMENT - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2021-03-30 00:00:01 Jcrespo retrying backup now- the host was restarted while backups ran. - The acknowledgement expires at: 2021-04-13 08:14:53. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[14:36:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:40:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:58:57] (03PS1) 10Urbanecm: toolforge: Install pandoc [puppet] - 10https://gerrit.wikimedia.org/r/678259 (https://phabricator.wikimedia.org/T279787)
[15:06:01] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:08:11] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: unassigned_shards: 0, timed_out: False, relocating_shards: 0, number_of_data_nodes: 6, active_primary_shards: 937, active_shards_percent_as_number: 100.0, cluster_name: cloudelastic-chi-eqiad, number_of_pending_tasks: 0, active_shards: 1877, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_m
[15:08:11] green, initializing_shards: 0, number_of_nodes: 6, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:19:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:22:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:24:21] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:26:31] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards_percent_as_number: 100.0, status: green, number_of_nodes: 6, cluster_name: cloudelastic-chi-eqiad, task_max_waiting_in_queue_millis: 0, timed_out: False, relocating_shards: 0, number_of_data_nodes: 6, number_of_pending_tasks: 0, unassigned_shards: 0, active_shards: 1877, number_of_in_
[15:26:31] active_primary_shards: 937, delayed_unassigned_shards: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:06:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) the Dell tech came out and replaced the motherboard, that did not fix the issue, it turns out that there is bad cable to the backplane. A new part has been o...
[16:10:51] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:15:31] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, relocating_shards: 0, number_of_nodes: 6, initializing_shards: 0, task_max_waiting_in_queue_millis: 0, active_primary_shards: 937, active_shards: 1877, unassigned_shards: 0, status: green, number_of_in_flight_fetch: 0, timed_out: F
[16:15:31] ds_percent_as_number: 100.0, number_of_data_nodes: 6, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:57:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:58:41] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:06:19] (03PS3) 10Krinkle: [Beta Cluster] mc: Set new 'broadcastRoutingPrefix' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677732
[18:06:21] (03PS3) 10Krinkle: mc: Set 'broadcastRoutingPrefix' option in $wgWANObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677418
[18:06:33] AaronSchulz: ok to land the one for beta now, I think?
[18:14:52] Krinkle: lgtm
[18:16:36] (03CR) 10Krinkle: [C: 03+2] [Beta Cluster] mc: Set new 'broadcastRoutingPrefix' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677732 (owner: 10Krinkle)
[18:19:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:21:33] (03Merged) 10jenkins-bot: [Beta Cluster] mc: Set new 'broadcastRoutingPrefix' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677732 (owner: 10Krinkle)
[18:22:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:26:08] (03PS3) 10Krinkle: [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677733
[18:26:10] (03PS3) 10Krinkle: mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677734
[19:54:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:57:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:37:31] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:39:51] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: unassigned_shards: 0, active_shards_percent_as_number: 100.0, active_primary_shards: 937, initializing_shards: 0, number_of_data_nodes: 6, number_of_in_flight_fetch: 0, number_of_nodes: 6, status: green, relocating_shards: 0, cluster_name: cloudelastic-chi-eqiad, active_shards: 1877, task_max_waiti
[21:39:51] s: 0, number_of_pending_tasks: 0, delayed_unassigned_shards: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:33] PROBLEM - Host lvs3005 is DOWN: PING CRITICAL - Packet loss = 100%
[22:12:01] RECOVERY - Host lvs3005 is UP: PING WARNING - Packet loss = 77%, RTA = 183.64 ms
[22:12:15] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:13:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:16:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:52:40] 10SRE, 10Traffic: Wikimedia was temporarily unreachable - https://phabricator.wikimedia.org/T279809 (10Legoktm) Around what time did you get errors?
[23:32:16] 10SRE, 10Traffic: Wikimedia was temporarily unreachable - https://phabricator.wikimedia.org/T279809 (10Krinkle) Looking at [Grafana: frontend traffic](https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1617441420000&to=1618010820000): |{F34316356 height=300} |{F34316343 height=200} Looks...
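The "Prometheus jobs reduced availability" alerts that recur through the day (atlas_exporter, netbox_device_statistics, routinator) fire when some of a job's scrape targets stop reporting as up. A minimal sketch of how one could list the down targets for a given job via the Prometheus HTTP query API, assuming a reachable Prometheus server (the base URL below is a placeholder) and the `requests` library; the real alert is defined in the monitoring configuration in operations/puppet:

```python
#!/usr/bin/env python3
# Minimal sketch: query the Prometheus HTTP API for targets of a job whose
# `up` metric is 0, i.e. targets the server failed to scrape.
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # placeholder, not a real endpoint


def down_instances(job):
    """Return the `instance` labels of targets in `job` that are currently down."""
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query",
        params={"query": f'up{{job="{job}"}} == 0'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [sample["metric"].get("instance", "?") for sample in result]


if __name__ == "__main__":
    for job in ("atlas_exporter", "netbox_device_statistics", "routinator"):
        print(job, down_instances(job))
```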