[00:00:17] RECOVERY - Check systemd state on cp1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:29] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:05:31] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: number_of_pending_tasks: 3, active_shards: 3624, unassigned_shards: 96, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, cluster_name: cloudelastic-omega-eqiad, timed_out: False, number_of_data_nodes: 6, relocating_shards: 0, number_of_nodes: 6, task_max_waiting_in_queue_millis: 223, initializing_shards: 2, status: yellow, active_primary_shards: 1524, active_shards_percent_as_number: 97.36700698549167 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:12:15] (03PS1) 10Dzahn: site: turn 12 new codfw servers into mw appservers [puppet] - 10https://gerrit.wikimedia.org/r/676484 (https://phabricator.wikimedia.org/T278396)
[00:14:17] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is CRITICAL: 711.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37
[00:32:42] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) I checked the db size and it seems it's indeed smaller: - The mbox file for discovery alerts is: 1...
[00:58:59] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 0 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37
[01:13:40] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10wiki_willy) Thanks for the feedback @fgiunchedi. We plan on setting up a follow-up meeting with the vendor next week to provide them some feedback, so we'll be sure to pass along yo...
[01:26:35] PROBLEM - DNS on mw2247.mgmt is CRITICAL: Domain mw2247.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:18:17] PROBLEM - Host mw2247 is DOWN: PING CRITICAL - Packet loss = 100%
[02:51:26] (03CR) 10DannyS712: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/676508 (owner: 10DannyS712)
[03:25:30] (03CR) 10Reedy: Update rewrite rule for https://www.mediawiki.org/FAQ (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/676508 (owner: 10DannyS712)
[03:34:54] (03PS3) 10DannyS712: Update rewrite rule for https://www.mediawiki.org/FAQ [puppet] - 10https://gerrit.wikimedia.org/r/676508
[03:34:55] (03CR) 10DannyS712: Update rewrite rule for https://www.mediawiki.org/FAQ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676508 (owner: 10DannyS712)
[04:03:48] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[04:06:12] Looking
[04:08:03] PROBLEM - LibreNMS has a critical alert #page on alert1001 is CRITICAL: Primary outbound port utilisation over 80% #page (cr2-eqiad.wikimedia.org) https://bit.ly/wmf-librenms
[04:08:47] (Primary outbound port utilisation over 80% #page) resolved: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[04:10:52] hi
[04:33:47] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[04:38:48] (Primary outbound port utilisation over 80% #page) resolved: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[04:39:41] (03PS1) 10Ayounsi: rate limit eqiad upload [puppet] - 10https://gerrit.wikimedia.org/r/676490
[04:40:09] (03PS1) 10Legoktm: Set attack_mode: true on upload caches in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/676491
[04:44:09] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/28869/cp1076.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/676490 (owner: 10Ayounsi)
[04:44:44] (03Abandoned) 10Legoktm: Set attack_mode: true on upload caches in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/676491 (owner: 10Legoktm)
[05:03:47] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[05:08:47] (Primary outbound port utilisation over 80% #page) resolved: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[05:09:24] (03PS1) 10Legoktm: vcl: Temporarily block UA from upload [puppet] - 10https://gerrit.wikimedia.org/r/676492
[05:10:09] (03PS1) 10CDanis: upload-lb: block python-requests UA on AWS [puppet] - 10https://gerrit.wikimedia.org/r/676493
[05:11:43] (03CR) 10CDanis: [C: 03+2] upload-lb: block python-requests UA on AWS [puppet] - 10https://gerrit.wikimedia.org/r/676493 (owner: 10CDanis)
[05:14:07] (03Abandoned) 10Legoktm: vcl: Temporarily block UA from upload [puppet] - 10https://gerrit.wikimedia.org/r/676492 (owner: 10Legoktm)
[05:19:13] RECOVERY - LibreNMS has a critical alert #page on alert1001 is OK: OK: zero critical LibreNMS alerts https://bit.ly/wmf-librenms
[05:21:45] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 144.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37
[05:21:51] (03PS1) 10Ayounsi: Revert "rate limit eqiad upload" [puppet] - 10https://gerrit.wikimedia.org/r/676511
[05:22:47] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is CRITICAL: 135.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37
[05:22:49] (03CR) 10Ayounsi: [C: 03+2] Revert "rate limit eqiad upload" [puppet] - 10https://gerrit.wikimedia.org/r/676511 (owner: 10Ayounsi)
[05:22:51] (03CR) 10CDanis: [C: 03+1] Revert "rate limit eqiad upload" [puppet] - 10https://gerrit.wikimedia.org/r/676511 (owner: 10Ayounsi)
[05:31:50] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Legoktm) >>! In T277780#6966775, @ops-monitoring-bot wrote: > cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2247.codfw.wmnet` >...
[06:39:59] PROBLEM - MariaDB Replica Lag: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:40:10] I am trying to fix --^
[06:53:52] yesterday we upgraded superset and probably the sqlalchemy upgrade led to some replication issues
[06:54:04] I am trying to fix them manually
[06:56:45] RECOVERY - MariaDB Replica SQL: analytics_meta on db1108 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:56:49] \o/
[06:57:03] PROBLEM - MariaDB Replica Lag: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1867.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:57:28] * elukey cries in a corner
[06:59:00] ah this is lag!
[06:59:10] yes yes makes sense, it was down since yesterday
[06:59:10] okok
[06:59:21] RECOVERY - MariaDB Replica Lag: analytics_meta on db1108 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210402T0700)
[07:00:26] mutante: o/ - thanks for handling the alert, next time may I ask to open a task if you do it? I saw it by chance in my irssi notifications, and I don't see activity in the #analytics chan about it so they probably have missed it too (we were upgrading superset)
[07:02:22] sorry my bad (not enough coffee) - I see that you alerted my team, but they didn't follow up! Lovely
[07:27:57] RECOVERY - Host an-worker1080 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[07:28:14] !log manual fix for an-worker1080's interface in netbox (xe-4/0/11), moved by mistake to public-1b
[07:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:29] !log powercycle an-worker1080
[07:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:15] PROBLEM - Host an-worker1080 is DOWN: PING CRITICAL - Packet loss = 100%
[07:35:28] effie: hello :) why was the host rebooted? :)
[07:35:56] elukey: it is down according to icing
[07:35:57] a
[07:36:06] effie: see the previous sal entry
[07:36:14] I was waiting for the recovery
[07:36:20] I wrote on SRE
[07:36:25] sigh too many channels
[07:37:11] RECOVERY - Host an-worker1080 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[07:37:15] yeah but also the host was perfectly reachable :)
[07:37:35] it is not a problem for the hadoop workers, all good
[07:38:01] I was reading the log entries
[07:38:09] anyway, my bad
[07:39:05] super fine no problem! The host is already up
[07:39:27] I thought I was the only one looking into it
[07:40:03] so basically I used the last good info I had on the subject
[07:41:49] you were caring about analytics nodes, so this deserves a big thank you anyway :)
[07:44:55] PROBLEM - WDQS SPARQL on wdqs1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:54:15] RECOVERY - WDQS SPARQL on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:06:25] PROBLEM - WDQS SPARQL on wdqs1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:13:17] RECOVERY - WDQS SPARQL on wdqs1008 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:14:29] (03PS5) 10Effie Mouzeli: modules: remove parsoidJS puppet module [puppet] - 10https://gerrit.wikimedia.org/r/676071 (https://phabricator.wikimedia.org/T279059)
[08:15:26] (03Abandoned) 10Effie Mouzeli: profile::parsoid: remove parsoid class from parsoid profile [puppet] - 10https://gerrit.wikimedia.org/r/676068 (https://phabricator.wikimedia.org/T268524) (owner: 10Effie Mouzeli)
[08:31:57] (03PS1) 10Elukey: Move k8s_infrastructure_users to role hiera namespaces [labs/private] - 10https://gerrit.wikimedia.org/r/676546 (https://phabricator.wikimedia.org/T278224)
[08:32:42] (03CR) 10Elukey: [V: 03+2 C: 03+2] Move k8s_infrastructure_users to role hiera namespaces [labs/private] - 10https://gerrit.wikimedia.org/r/676546 (https://phabricator.wikimedia.org/T278224) (owner: 10Elukey)
[08:37:27] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28873/console" [puppet] - 10https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224) (owner: 10Elukey)
[08:49:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:01:31] (03PS6) 10Elukey: kubernetes: move infrastructure_users to the k8s master role [puppet] - 10https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224)
[09:04:12] (03CR) 10Elukey: [C: 03+2] kubernetes: move infrastructure_users to the k8s master role [puppet] - 10https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224) (owner: 10Elukey)
[09:06:31] !log remove dumps from wdqs1009 to free disk space
[09:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:47] (03PS1) 10Hashar: Merge tag 'v3.2.8' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/676551 (https://phabricator.wikimedia.org/T273223)
[09:09:55] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10Aklapper) There is an open 7-line patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/469339 which needs rebasing if still wanted
[09:15:46] (03Abandoned) 10Alexandros Kosiaris: calico: Support version 2.4.1 [puppet] - 10https://gerrit.wikimedia.org/r/469339 (https://phabricator.wikimedia.org/T207804) (owner: 10Alexandros Kosiaris)
[09:16:29] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10akosiaris) 05Open→03Resolved All of our clusters are now on calico 3.16, we can close this as resolved!
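(Editor's aside on the [00:05:31] Elasticsearch recovery message earlier in this log: the reported active_shards_percent_as_number can be re-derived from the shard counts in that same message — active 3624, initializing 2, unassigned 96. A minimal sketch, assuming the standard cluster-health formula of active shards over all tracked shards, with relocating shards already counted as active:)

```python
# Re-derive active_shards_percent_as_number from the cloudelastic-omega-eqiad
# health output quoted above. Assumption: the percentage is
# active / (active + initializing + unassigned) * 100.
def active_shards_percent(active: int, initializing: int, unassigned: int) -> float:
    total = active + initializing + unassigned
    return active * 100.0 / total

pct = active_shards_percent(active=3624, initializing=2, unassigned=96)
print(pct)  # ~97.367..., matching the 97.36700698549167 reported by the check
```

(The match confirms the truncated field in the check output was initializing_shards: 2, and that the 96 unassigned shards are what kept the cluster yellow rather than green.)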
[09:21:14] hashar: https://phabricator.wikimedia.org/T279132
[09:21:33] Not sure what you do when a likely train bug comes out on a Friday
[09:25:51] Anyway I have to go, seeing family for first time since last summer
[09:26:19] (03PS4) 10Aklapper: Adjust CSP header for pdfs & videos & set enforce on testwiki [puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff)
[09:29:41] RhinosF1: checking
[09:29:43] thank you!
[09:31:10] hashar: seems similar to https://phabricator.wikimedia.org/T278579, which at the time was assumed to be something wrong with the beta cluster's config
[09:33:04] oh my
[09:34:03] Majavah: nice flag, will mark the most recent one as a dupe
[09:36:29] and I am going to rollback
[09:38:49] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 17): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28876/console" [puppet] - 10https://gerrit.wikimedia.org/r/676326 (owner: 10Alexandros Kosiaris)
[09:42:07] (03PS1) 10Hashar: Rollback group1 and group2 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676557 (https://phabricator.wikimedia.org/T279127)
[09:42:11] (03CR) 10Hashar: [C: 03+2] Rollback group1 and group2 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676557 (https://phabricator.wikimedia.org/T279127) (owner: 10Hashar)
[09:43:00] (03Merged) 10jenkins-bot: Rollback group1 and group2 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676557 (https://phabricator.wikimedia.org/T279127) (owner: 10Hashar)
[09:44:09] !log hashar@deploy1002 sync-wikiversions aborted: Revert group1 and group2 wikis to 1.36.0-wmf.36 (duration: 00m 01s)
[09:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:06] hashar: that's a known bug that's actually really old
[09:45:50] T275322
[09:45:51] T275322: Some edits made by extended confirmed users are no longer automatically accepted - https://phabricator.wikimedia.org/T275322
[09:45:56] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert group1 and group2 wikis to 1.36.0-wmf.36 - T278343
[09:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:05] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343
[09:46:14] 10SRE, 10Traffic, 10Goal, 10HTTPS: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Aklapper) @Vgutierrez: Hi, all related patches in Gerrit have been merged or abandoned. Is there more to do in this task? Asking as you are set...
[09:48:14] (03PS3) 10Aklapper: WIP: icinga: add check_sysctl.sh script [puppet] - 10https://gerrit.wikimedia.org/r/376566 (https://phabricator.wikimedia.org/T160060) (owner: 10Herron)
[09:49:35] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 17): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28878/console" [puppet] - 10https://gerrit.wikimedia.org/r/676328 (owner: 10Alexandros Kosiaris)
[09:50:02] 10SRE, 10Mail, 10Patch-Needs-Improvement: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230 (10Aklapper)
[09:52:52] hashar: can we revert the change?
[09:53:01] 10SRE, 10Sustainability (Incident Followup): Update Runboook wikis for the application and LVS servers - https://phabricator.wikimedia.org/T278948 (10akosiaris) 05Open→03Resolved a:03akosiaris >>! In T278948#6962391, @Legoktm wrote: > I'm not sure if we should put jobrunner stuff on the LVS page, LVS was...
[09:54:15] Amir1: which change?
[09:54:51] Amir1: Special:Export is broken anyway
[09:54:58] Majavah: https://phabricator.wikimedia.org/T279127#6967552
[09:55:16] ahh
[09:55:29] so feel free to drop the FlaggedRevs task from the list of blockers :]
[09:55:42] the main reason for the revert is Special:Export being empty
[09:56:13] okay then!
[09:56:26] Thanks
[09:57:18] 10SRE, 10Sustainability (Incident Followup): Update Runbook wikis for the application and LVS servers - https://phabricator.wikimedia.org/T278948 (10Aklapper)
[09:57:30] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "PCC is effectively a noop across a select set of hosts having the profile, I am gonna merge this and dependent changes" [puppet] - 10https://gerrit.wikimedia.org/r/676328 (owner: 10Alexandros Kosiaris)
[09:58:53] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] services_proxy: Reorder on port number ascending [puppet] - 10https://gerrit.wikimedia.org/r/676327 (owner: 10Alexandros Kosiaris)
[09:58:56] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] conftool: Reorder services alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/676326 (owner: 10Alexandros Kosiaris)
[09:58:58] first, coffee
[09:59:05] damn I should rollback group 0 as well
[09:59:11] cause well https://phabricator.wikimedia.org/T278579
[09:59:13] err
[09:59:31] https://www.mediawiki.org/wiki/Special:Export/Main_Page
[10:00:10] I can give it a try in debugging
[10:00:35] (as I'm free today, it's a public holiday here, super extended weekend)
[10:02:59] I can't reproduce locally with the latest git master :(
[10:04:05] (03PS1) 10Alexandros Kosiaris: cxserver: Switch apertium port [deployment-charts] - 10https://gerrit.wikimedia.org/r/676558
[10:04:21] Majavah: oh it's clearly an extension. which one. Stay tuned
[10:05:04] (03PS1) 10Hashar: Revert "group0 wikis to 1.36.0-wmf.37" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676559 (https://phabricator.wikimedia.org/T278579)
[10:05:07] (03CR) 10Hashar: [C: 03+2] Revert "group0 wikis to 1.36.0-wmf.37" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676559 (https://phabricator.wikimedia.org/T278579) (owner: 10Hashar)
[10:05:47] Amir1: no ideas :(
[10:05:56] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.36.0-wmf.37" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676559 (https://phabricator.wikimedia.org/T278579) (owner: 10Hashar)
[10:05:57] anything in logstash related to export?
[10:06:16] haven't checked
[10:06:22] but at least it is reproducible on beta!
[10:07:04] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Rollback group0 wikis to 1.36.0-wmf.36 - T278343
[10:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:12] (03PS2) 10Hashar: Merge tag 'v3.2.8' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/676551 (https://phabricator.wikimedia.org/T278990)
[10:07:13] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343
[10:07:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] cxserver: Switch apertium port [deployment-charts] - 10https://gerrit.wikimedia.org/r/676558 (owner: 10Alexandros Kosiaris)
[10:08:15] there were no changes to special:export itself since early march
[10:08:51] (03Merged) 10jenkins-bot: cxserver: Switch apertium port [deployment-charts] - 10https://gerrit.wikimedia.org/r/676558 (owner: 10Alexandros Kosiaris)
[10:08:55] betacluster logstash is still broken so nothing helpful there, and grepping on mwlog01 is kind of annoying when you're not sure what you're looking for
[10:09:28] that could be anything really
[10:09:34] like the revision storage
[10:10:30] Amir1: if the FlaggedRevs task is not related to wmf.37, feel free to remove it from the list of blockers ;)
[10:11:08] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[10:11:08] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[10:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:47] (03PS2) 10Alexandros Kosiaris: services_proxy: Add thanos-{query,swift} [puppet] - 10https://gerrit.wikimedia.org/r/676329 (https://phabricator.wikimedia.org/T278385)
[10:12:25] Sure, just grabbing something to eat
[10:12:25] fyi just running dumpBackup on a random wiki in beta works fine, i.e. with --current or --full
[10:12:52] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[10:12:52] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[10:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:09] this calls WikiExporter and eventually XmlDumpWriter so all those bits work for the main code path
[10:14:05] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[10:14:05] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[10:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:19] who knows what has been broken
[10:14:32] guess we will add some test to cover that Special:Export works
[10:15:55] it looks like none of the code in includes/exports has been touched for this branch either, so.
[10:16:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:18:48] I am off
[10:18:53] (03PS1) 10Elukey: install_server: add custom recipes for hadoop test masters/coord [puppet] - 10https://gerrit.wikimedia.org/r/676560 (https://phabricator.wikimedia.org/T278422)
[10:19:02] lunch and stuff. Be back around 14:00 UTC or four hours from now
[10:19:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:19:34] (03CR) 10Elukey: [C: 03+2] install_server: add custom recipes for hadoop test masters/coord [puppet] - 10https://gerrit.wikimedia.org/r/676560 (https://phabricator.wikimedia.org/T278422) (owner: 10Elukey)
[10:19:46] (03CR) 10DharmrajRathod98: "> Patch Set 9:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[10:21:53] (03PS1) 10Elukey: install_server: set buster for an-test-(master|coord) nodes [puppet] - 10https://gerrit.wikimedia.org/r/676561 (https://phabricator.wikimedia.org/T278422)
[10:23:15] (03CR) 10Elukey: [C: 03+2] install_server: set buster for an-test-(master|coord) nodes [puppet] - 10https://gerrit.wikimedia.org/r/676561 (https://phabricator.wikimedia.org/T278422) (owner: 10Elukey)
[10:27:14] I'm going to mess with deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud to debug the export issue
[10:27:50] Amir1: you know scap is going to override you every ten minutes, right?
[10:27:58] yup
[10:28:01] I'm fast
[10:34:44] (03PS1) 10Elukey: install_server: fix partman recipe for an-test-master/coord [puppet] - 10https://gerrit.wikimedia.org/r/676562 (https://phabricator.wikimedia.org/T278422)
[10:35:25] (03CR) 10Elukey: [C: 03+2] install_server: fix partman recipe for an-test-master/coord [puppet] - 10https://gerrit.wikimedia.org/r/676562 (https://phabricator.wikimedia.org/T278422) (owner: 10Elukey)
[10:38:20] (03PS1) 10Majavah: changeprop: Update beta jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/676563
[10:52:37] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE
[10:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE
[10:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:58] (03PS3) 10Alexandros Kosiaris: services_proxy: Add thanos-{query,swift} and schema [puppet] - 10https://gerrit.wikimedia.org/r/676329 (https://phabricator.wikimedia.org/T278385)
[10:59:31] I'm tinkering on snapshot03 in deployment prep so if you see whines in logstash from there, please ignore
[10:59:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] services_proxy: Add thanos-{query,swift} and schema [puppet] - 10https://gerrit.wikimedia.org/r/676329 (https://phabricator.wikimedia.org/T278385) (owner: 10Alexandros Kosiaris)
[11:03:09] export via WikiExporter::dumpFrom of the condition "page_namespace=0 AND page_title='Main_Page'" works fine, when I hack dumpBackup.php and BackupDumper.php to take a title string and pass the condition directly. Next up: see if it's title conversion somehow
[11:08:55] export using --pagelist, passing the name of a file with just the page MainPage in it, works fine; this uses WikiExporter::pagesByName which repeatedly calls pageByTitle. pageByTitle is what SpecialExport:doExport calls. so I guess the problem must be before that in the special page
[11:10:24] no, I am sorry. pagesByName repeatedly calls pageByName which calls pageByTitle. but this amounts to the same thing I think, with the same conclusion.
[11:11:10] (03PS6) 10Effie Mouzeli: (WIP) modules: remove parsoidJS from puppet [puppet] - 10https://gerrit.wikimedia.org/r/676071 (https://phabricator.wikimedia.org/T279059)
[11:11:24] (03CR) 10jerkins-bot: [V: 04-1] (WIP) modules: remove parsoidJS from puppet [puppet] - 10https://gerrit.wikimedia.org/r/676071 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli)
[11:11:50] Amir1: as you are testing can you add a debugging entry to SpecialExport:doExport at the bottom just before $exporter->pageByTitle and see if that ever gets called?
[11:11:59] if you are not already closing in on the issue, I mean
[11:12:24] apergos: It does the work, I can see that
[11:12:33] https://en.wikipedia.beta.wmflabs.org/wiki/Special:Export
[11:12:45] if it's not overridden yet, you can see
[11:13:23] oh it's not empty. good
[11:13:37] um
[11:13:45] what did you change so that it's not empty?
[11:13:46] currently I think it's either: 1- the export structure has changed and now doesn't have a root element 2- the buffering, etc. is broken
[11:14:14] commented out most of stuff in if ( $this->doExport ) {
[11:14:51] and the root element should be what? as I look at the exported xml it looks fine
[11:15:03] no clue
[11:15:11] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jijiki) I agree, we should start with 2 servers, with a higher weight than the others, and adjust in case we have some sim...
[11:15:21] maybe compare wmf.36 and wmf.37 to see if there's a change?
[11:15:22] all the usual tags
[11:15:31] I looked at changes to core: did not see anything
[11:15:48] I mean the xml output
[11:15:50] I looked to see in particular though if anything has been touched in includes/export or the maintenance scripts, no
[11:18:36] so removing wfResetOutputBuffers() fixes the issue
[11:18:46] now let's see what has broken this function
[11:21:38]
[11:21:38] I see this at the bottom of the xml
[11:21:45] no close mediawiki tag
[11:21:48] from beta.
[11:22:21] oh there is a close mediawiki tag
[11:22:26] but then there is a bunch of cruft after it
[11:22:53]
[11:22:53]
[11:22:53]
[11:22:53]
[11:22:53] Export pages - Wikipedia, the free encyclopedia
[11:22:56] and a bunch of other junk
[11:23:26] I guess that is the special export form being tacked on to the end
[11:23:32] Amir1: ^^
[11:24:05] oh interesting
[11:24:55] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) We probably should have reserved capacity for both clusters. Something like 4 for jobrunners, 2 for videoscalers.
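(Editor's aside: the failure mode just described — a well-formed dump with the Special:Export HTML form "tacked on to the end" after the closing tag — is easy to screen for mechanically. A minimal sketch, not MediaWiki code; `trailing_junk` is a hypothetical helper, though `mediawiki` is the real root element of an export:)

```python
# Detect the symptom discussed above: valid export XML followed by page-HTML
# junk appended after the root element closes.
def trailing_junk(xml_text: str, root: str = "mediawiki") -> str:
    """Return whatever non-whitespace content follows the closing root tag."""
    close = f"</{root}>"
    pos = xml_text.find(close)
    if pos == -1:
        raise ValueError("no closing root tag found")
    return xml_text[pos + len(close):].strip()

good = "<mediawiki><page/></mediawiki>\n"
bad = "<mediawiki><page/></mediawiki>\n<html><title>Export pages</title></html>"
print(trailing_junk(good))  # empty string: clean export
print(trailing_junk(bad))   # the appended HTML, i.e. the bug's signature
```

(A check like this could back the "add some test to cover that Special:Export works" idea mentioned earlier, since an XML parser alone would also reject the junk but with a less descriptive error.)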
[11:25:21] I don't know what exactly was changed in beta when I ran that special export though, you'll want to look at one yourself where you can check
[11:26:24] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jijiki) @akosiaris would start like this then 2 (jr) + 2 (vs) + 2 (both)
[11:27:04] (03PS1) 10Alexandros Kosiaris: Revert "Allow RunAsAny in the restricted PSP as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/676512
[11:40:34] Thanks people for looking, I'm back but you seem to have it well under control
[11:45:40] !log Start server-side upload for 3 images (T279079, T279080, T279104)
[11:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:58] T279079: Server side upload for Sturm - https://phabricator.wikimedia.org/T279079
[11:45:58] T279104: Server side upload for Sturm - https://phabricator.wikimedia.org/T279104
[11:45:58] T279080: Server side upload for Sturm - https://phabricator.wikimedia.org/T279080
[11:46:02] !log correction: Start server-side upload for 3 video files (T279079, T279080, T279104)
[11:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:48] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Bernard Wang - https://phabricator.wikimedia.org/T279014 (10jijiki) p:05Triage→03Medium
[11:46:54] 10SRE, 10Wikimedia-Mailing-lists: Expose mailman3 internal REST API inside Wikimedia production network - https://phabricator.wikimedia.org/T279023 (10jijiki) p:05Triage→03Medium
[11:47:02] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10jijiki) p:05Triage→03Medium
[11:47:19] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jijiki) p:05Triage→03Medium
[12:33:05] I see Reedy is doing the revert [12:33:30] (for the buffer/gzip/content length patch) [12:35:59] (03PS1) 10Reedy: Revert "Move logDataPageOutputOnly() call to outputResponsePayload()" [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676571 (https://phabricator.wikimedia.org/T278579) [12:36:01] (03PS1) 10Reedy: Revert "Avoid HTTP protocol errors when fastcgi_finish_request() is unavailable" [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676572 (https://phabricator.wikimedia.org/T278579) [12:36:34] I thought it was more complex than it was, as I thought https://gerrit.wikimedia.org/r/c/mediawiki/core/+/661452 had landed [12:36:51] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) >>! In T279100#6967664, @jijiki wrote: > @akosiaris would start like this then 2 (jr) + 2 (vs) + 2 (both), and... [12:37:09] when in fact it was "only" https://gerrit.wikimedia.org/r/c/mediawiki/core/+/675218 on top [12:38:25] I'm happy to merge into .37 at least [12:38:26] ah ha [12:38:50] awesome. 
if a fix comes in later, so much the better, but then at least there's not a rush [12:40:16] (03CR) 10Reedy: [C: 03+2] Revert "Move logDataPageOutputOnly() call to outputResponsePayload()" [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676571 (https://phabricator.wikimedia.org/T278579) (owner: 10Reedy) [12:40:21] (03CR) 10Reedy: [C: 03+2] Revert "Avoid HTTP protocol errors when fastcgi_finish_request() is unavailable" [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676572 (https://phabricator.wikimedia.org/T278579) (owner: 10Reedy) [12:40:35] Can at least potentially roll forward again to group 0 [12:40:54] Though, if I merge to master, we can test it on beta too ;p [12:42:55] +1 [12:46:48] Hashar won't be back for another hour or so even if he can roll forward again [12:47:08] There's others of us more than capable to do it ;) [12:47:14] Is Monday still normal because it'll be a bank holiday for UK people at least? [12:47:35] Yeah, it's not a US Federal holiday [12:47:43] And I don't think it's even a WMF holiday [12:49:23] I know I saw an email about no train one week due to Earth Day [12:49:31] Which I only realised existed due to that email [12:50:35] That's a few weeks off though [12:52:13] are deployments via Helm allowed on Friday? the service in question does not get any production traffic yet, so I'm not worried about it breaking; are there other kinds of risks in using the deployment pipeline? 
[13:05:35] (03PS1) 10Alexandros Kosiaris: kubectl: Fetch it from_future for a set of hosts [puppet] - 10https://gerrit.wikimedia.org/r/676574 (https://phabricator.wikimedia.org/T278356) [13:05:57] (03Merged) 10jenkins-bot: Revert "Move logDataPageOutputOnly() call to outputResponsePayload()" [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676571 (https://phabricator.wikimedia.org/T278579) (owner: 10Reedy) [13:06:59] (03Merged) 10jenkins-bot: Revert "Avoid HTTP protocol errors when fastcgi_finish_request() is unavailable" [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676572 (https://phabricator.wikimedia.org/T278579) (owner: 10Reedy) [13:08:54] !log reedy@deploy1002 Synchronized php-1.36.0-wmf.37/includes/MediaWiki.php: T278579 (duration: 00m 58s) [13:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:05] T278579: Special:Export broken: always generates an empty file - https://phabricator.wikimedia.org/T278579 [13:10:07] !log reedy@deploy1002 Synchronized php-1.36.0-wmf.37/includes/OutputHandler.php: T278579 (duration: 00m 57s) [13:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:17] !log reedy@deploy1002 Synchronized php-1.36.0-wmf.37/load.php: T278579 (duration: 00m 58s) [13:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:46] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [13:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:57] should be live on beta in a few mins [13:14:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [13:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:47] (03PS2) 10Alexandros Kosiaris: kubectl: Fetch it from_future for a set of hosts [puppet] - 
10https://gerrit.wikimedia.org/r/676574 (https://phabricator.wikimedia.org/T278356) [13:19:03] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28882/console" [puppet] - 10https://gerrit.wikimedia.org/r/676574 (https://phabricator.wikimedia.org/T278356) (owner: 10Alexandros Kosiaris) [13:19:28] (03PS1) 10Effie Mouzeli: admin: add amir to mailman-admins group [puppet] - 10https://gerrit.wikimedia.org/r/676576 (https://phabricator.wikimedia.org/T278616) [13:19:29] Reedy: https://en.wikipedia.beta.wmflabs.org/wiki/Special:Export/Main_Page works now [13:19:39] [14:19:02] MediaWiki-Export-or-Import, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, MW-1.35-notes, and 2 others: Special:Export broken: always generates an empty file - https://phabricator.wikimedia.org/T278579 (Reedy) Confirmed fixed on beta [13:19:40] ;) [13:20:50] are we going to roll the train forward today or wait until next week? 
[13:22:06] (03CR) 10Effie Mouzeli: [C: 03+2] admin: add amir to mailman-admins group [puppet] - 10https://gerrit.wikimedia.org/r/676576 (https://phabricator.wikimedia.org/T278616) (owner: 10Effie Mouzeli) [13:22:33] I'm not sure [13:22:56] It doesn't feel like rolling forward is a necessary deployment (it's certainly not applicable under "emergency deployment") [13:24:21] Wait for Antoine to reappear I guess [13:29:19] (03PS3) 10Alexandros Kosiaris: kubectl: Fetch it from_future for a set of hosts [puppet] - 10https://gerrit.wikimedia.org/r/676574 (https://phabricator.wikimedia.org/T278356) [13:30:23] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.2.8' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/676551 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [13:30:46] Oh he's back ;P [13:32:06] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28883/console" [puppet] - 10https://gerrit.wikimedia.org/r/676574 (https://phabricator.wikimedia.org/T278356) (owner: 10Alexandros Kosiaris) [13:32:43] hashar: bonjour! 
[13:32:54] good morning [13:33:04] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37 [13:35:03] (03Merged) 10jenkins-bot: Merge tag 'v3.2.8' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/676551 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [13:35:25] (03PS4) 10Alexandros Kosiaris: kubectl: Fetch it from_future for a set of hosts [puppet] - 10https://gerrit.wikimedia.org/r/676574 (https://phabricator.wikimedia.org/T278356) [13:35:56] hashar: Not sure what you want to do about the train. I backported and deployed the revert, and merged into master and tested working on beta [13:36:19] Reedy: revert of what? [13:36:27] the Special:Export thing? :D [13:36:34] the patches causing the export breakage [13:36:35] Oui [13:36:41] lets roll [13:37:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC is pretty happy, merging and shepherding." [puppet] - 10https://gerrit.wikimedia.org/r/676574 (https://phabricator.wikimedia.org/T278356) (owner: 10Alexandros Kosiaris) [13:37:35] I have no idea how folks figured out the root cause [13:38:36] promoting all wikis again [13:39:03] Promote all from 1.36.0-wmf.36 to 1.36.0-wmf.36 [y/N] [13:39:04] ... 
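The confirmation prompt just above ("Promote all from 1.36.0-wmf.36 to 1.36.0-wmf.36") shows the script printing the current version where the target should appear, even though the actual promotion went to wmf.37. The intended target is just the wmf.N suffix bumped by one; a hypothetical helper sketching that (this is not the real deploy-promote code, whose internals aren't shown here):

```python
import re

def next_train_version(version: str) -> str:
    """Bump the trailing wmf.N component of a MediaWiki train version,
    e.g. 1.36.0-wmf.36 -> 1.36.0-wmf.37.

    Hypothetical illustration only; the production deploy-promote
    script determines its target differently.
    """
    m = re.fullmatch(r"(.*-wmf\.)(\d+)", version)
    if m is None:
        raise ValueError(f"not a train version string: {version}")
    return f"{m.group(1)}{int(m.group(2)) + 1}"
```

With this, the prompt would have read "from 1.36.0-wmf.36 to 1.36.0-wmf.37", matching the commit that followed.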
[13:39:07] stupid script [13:39:18] doesn't sound like a promotion :) [13:39:30] yeah management is obviously completely broken [13:39:54] (03PS1) 10Hashar: all wikis to 1.36.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676578 [13:39:56] (03CR) 10Hashar: [C: 03+2] all wikis to 1.36.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676578 (owner: 10Hashar) [13:39:58] fingers crossed :) [13:40:14] my hope is that we "soon" move to a rolling daily deploy [13:40:32] how's that supposed to work? [13:40:36] oh [13:40:37] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676578 (owner: 10Hashar) [13:40:44] releng / tooling wise that is straightforward [13:40:44] What happens when an unstoppable train hits an unfixable bug? [13:41:06] just deploy frm master using the deploy-promote all script in a cron job ;] [13:41:20] Reedy: sudo systemctl stop unstoppable-train? [13:41:34] s/frm/from/ [13:42:10] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.37 [13:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:26] Amir1: Reedy: Special:Export works indeed. 
I guess one of you can claim https://phabricator.wikimedia.org/T278579 now ;) [13:44:33] (03PS1) 10Effie Mouzeli: hieradata: enable onhost memcached socket on all mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) [13:46:04] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 7.119 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [13:49:14] 10SRE, 10serviceops, 10User-jijiki: Remove mediawiki api loop requests from production - https://phabricator.wikimedia.org/T279146 (10jijiki) [14:09:07] (03PS1) 10Andrew Bogott: codfw1dev OpenStack to version Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676584 (https://phabricator.wikimedia.org/T261136) [14:09:23] !log Start server-side upload for 3 video files (T279138, T279137, T279136) [14:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:35] T279137: Server side upload for Sturm - https://phabricator.wikimedia.org/T279137 [14:09:35] T279138: Server side upload for Sturm - https://phabricator.wikimedia.org/T279138 [14:09:35] T279136: Server side upload for Sturm - https://phabricator.wikimedia.org/T279136 [14:10:55] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] OpenStack: add manifests, files and templates for version Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676453 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [14:11:03] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Glance: update config for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676468 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [14:11:14] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev OpenStack to version Ussuri [puppet] - 
10https://gerrit.wikimedia.org/r/676584 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [14:11:25] (03CR) 10Andrew Bogott: [C: 03+2] Neutron: Remove ip_lib.py hack for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676467 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [14:11:31] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Neutron: update our l3 agent hacks for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676466 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [14:11:41] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack nova: refresh our servers.py hack from Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676465 (owner: 10Andrew Bogott) [14:11:47] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: update config for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676464 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [14:13:18] 10SRE, 10Platform Engineering, 10serviceops, 10User-jijiki: Remove mediawiki Request loops from production - https://phabricator.wikimedia.org/T279146 (10jijiki) [14:14:50] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jgreen) [14:16:48] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jgreen) [14:20:09] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jgreen) @Cmjohnson @Jclark-ctr Can you take a look at the serial settings for payments1006? Console redirection isn't working at all. I tried to fix it using racadm but although it s... 
[14:20:34] !log Start server-side upload for 3 video files (T279060, T279061, T279062) [14:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:45] T279062: Server side upload for Sturm - https://phabricator.wikimedia.org/T279062 [14:20:45] T279060: Server side upload for Sturm - https://phabricator.wikimedia.org/T279060 [14:20:45] T279061: Server side upload for Sturm - https://phabricator.wikimedia.org/T279061 [14:28:26] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE [14:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:00] !log jiji@cumin1001 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1111.eqiad.wmnet [14:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:52] lol [14:30:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE [14:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:57] !log jiji@cumin1001 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw133[7-8].eqiad.wmnet [14:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:40] !log jiji@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw133[5-6].eqiad.wmnet [14:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:41] !log jiji@cumin1001 conftool action : set/weight=20; selector: cluster=videoscaler,name=mw133[5-6].eqiad.wmnet [14:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:10] !log jiji@cumin1001 conftool action : set/weight=20; selector: cluster=jobrunner,name=mw133[7-8].eqiad.wmnet [14:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:32] PROBLEM - Prometheus jobs 
reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:45:26] RECOVERY - parsoid on parse2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [14:46:13] (03PS4) 10DannyS712: Update rewrite rule for https://www.mediawiki.org/FAQ [puppet] - 10https://gerrit.wikimedia.org/r/676508 [14:46:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:01:51] (03PS1) 10Elukey: profile::oozie::server: use default-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/676592 (https://phabricator.wikimedia.org/T278422) [15:03:00] (03CR) 10Elukey: [C: 03+2] profile::oozie::server: use default-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/676592 (https://phabricator.wikimedia.org/T278422) (owner: 10Elukey) [15:17:31] (03PS1) 10Elukey: Use mariadb instead of mysql for Hive and Oozie settings [puppet] - 10https://gerrit.wikimedia.org/r/676598 (https://phabricator.wikimedia.org/T278422) [15:18:37] (03CR) 10jerkins-bot: [V: 04-1] Use mariadb instead of mysql for Hive and Oozie settings [puppet] - 10https://gerrit.wikimedia.org/r/676598 (https://phabricator.wikimedia.org/T278422) (owner: 10Elukey) [15:19:38] (03PS2) 10Elukey: Use mariadb instead of mysql for Hive and Oozie settings [puppet] - 10https://gerrit.wikimedia.org/r/676598 (https://phabricator.wikimedia.org/T278422) [15:22:13] (03PS1) 10Elukey: Add fake alluxio kerberos keytab [labs/private] - 10https://gerrit.wikimedia.org/r/676599 [15:22:23] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake alluxio kerberos keytab [labs/private] - 10https://gerrit.wikimedia.org/r/676599 (owner: 10Elukey) 
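The conftool actions above depool some jobrunner/videoscaler hosts and set weight=20 on others. Under proportional weighting, a pooled host's expected share of requests is its weight divided by the sum of pooled weights, and a depooled host gets none. A sketch of that model only, with hypothetical hosts and weights loosely echoing the !log lines (not how conftool or the load balancer are implemented internally):

```python
def traffic_shares(hosts):
    """Expected per-host share of requests, proportional to weight.

    `hosts` maps hostname -> (weight, pooled). Depooled hosts receive
    nothing; pooled hosts split traffic by weight. This is only the
    weighting model, not the real pybal/conftool machinery.
    """
    pooled = {h: w for h, (w, up) in hosts.items() if up}
    total = sum(pooled.values())
    return {h: (pooled.get(h, 0) / total if total else 0.0) for h in hosts}

# Hypothetical pool: two hosts reweighted to 20, one depooled.
pool = {
    "mw1335.eqiad.wmnet": (20, True),
    "mw1336.eqiad.wmnet": (20, True),
    "mw1337.eqiad.wmnet": (10, False),
}
```

Raising one host's weight relative to its peers, as discussed for the dedicated jobrunners in T279100, shifts share toward it without depooling anything.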
[15:25:31] (03PS1) 10Elukey: Fix location of the fake alluxio keytab for an-test-coord1001 [labs/private] - 10https://gerrit.wikimedia.org/r/676601 [15:25:44] (03CR) 10Elukey: [V: 03+2 C: 03+2] Fix location of the fake alluxio keytab for an-test-coord1001 [labs/private] - 10https://gerrit.wikimedia.org/r/676601 (owner: 10Elukey) [15:27:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28888/console" [puppet] - 10https://gerrit.wikimedia.org/r/676598 (https://phabricator.wikimedia.org/T278422) (owner: 10Elukey) [15:29:45] (03PS1) 10Andrew Bogott: Openstack Glance: rip out code for the glance-registry service [puppet] - 10https://gerrit.wikimedia.org/r/676604 [15:30:59] (03CR) 10Elukey: [V: 03+1 C: 03+2] Use mariadb instead of mysql for Hive and Oozie settings [puppet] - 10https://gerrit.wikimedia.org/r/676598 (https://phabricator.wikimedia.org/T278422) (owner: 10Elukey) [15:37:29] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/676606 [15:37:37] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/676606 (owner: 10Kosta Harlan) [15:38:57] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/676606 (owner: 10Kosta Harlan) [15:41:38] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [15:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:59] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [15:44:59] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[15:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:46] (03PS2) 10ArielGlenn: clean up check_fragments_file argument handling in page content batches test [dumps] - 10https://gerrit.wikimedia.org/r/676032 (https://phabricator.wikimedia.org/T252396) [15:47:48] (03PS1) 10ArielGlenn: make sure configured number of retries is honored for page content batches [dumps] - 10https://gerrit.wikimedia.org/r/676608 (https://phabricator.wikimedia.org/T252396) [15:48:30] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [15:48:30] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [15:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:35] 10SRE, 10SRE-Access-Requests: Request for adding Ladsgroup to mailman-admins group - https://phabricator.wikimedia.org/T278616 (10jijiki) 05Open→03Resolved a:03jijiki Wish granted. [16:00:19] (03PS1) 10Elukey: Fix the mariadb jdbc driver name for Hive and Oozie on Buster [puppet] - 10https://gerrit.wikimedia.org/r/676609 (https://phabricator.wikimedia.org/T278422) [16:02:12] (03CR) 10Elukey: [C: 03+2] Fix the mariadb jdbc driver name for Hive and Oozie on Buster [puppet] - 10https://gerrit.wikimedia.org/r/676609 (https://phabricator.wikimedia.org/T278422) (owner: 10Elukey) [16:02:30] 10SRE, 10LDAP-Access-Requests: Superset access - https://phabricator.wikimedia.org/T279147 (10jijiki) @MRaishWMF please copy the description as it is instructed in [[ https://phabricator.wikimedia.org/project/view/1564/ | LDAP-Access-Requests ]]. We will additionally need an approval from your manager. Thank... 
[16:02:33] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jgreen) a:05Jgreen→03Cmjohnson [16:22:19] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Glance: rip out code for the glance-registry service [puppet] - 10https://gerrit.wikimedia.org/r/676604 (owner: 10Andrew Bogott) [16:23:20] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10jijiki) **videscalers** ` { 'host': 'mw1335.eqiad.wmnet', 'weight':20, 'enabled': True } { 'host': 'mw1336.eqiad.wmnet', '... [16:43:28] (03PS2) 10ArielGlenn: make sure configured number of retries is honored for page content batches [dumps] - 10https://gerrit.wikimedia.org/r/676608 (https://phabricator.wikimedia.org/T252396) [17:05:11] (03PS1) 10Andrew Bogott: Openstack Glance: fix init-script to be ipv4 only [puppet] - 10https://gerrit.wikimedia.org/r/676621 (https://phabricator.wikimedia.org/T261136) [17:06:39] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Glance: fix init-script to be ipv4 only [puppet] - 10https://gerrit.wikimedia.org/r/676621 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [17:07:06] 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable Lables - https://phabricator.wikimedia.org/T279160 (10wiki_willy) [17:14:23] 10SRE, 10SRE-Access-Requests: Request for adding Ladsgroup to mailman-admins group - https://phabricator.wikimedia.org/T278616 (10Ladsgroup) Thanks. But I can't login to lists1001.wikimedia.org, probably something is missing [17:19:23] 10SRE, 10SRE-Access-Requests: Request for adding Ladsgroup to mailman-admins group - https://phabricator.wikimedia.org/T278616 (10Dzahn) The mailman-admins group is NOT applied to the role(lists). 
[17:19:37] (03PS1) 10Legoktm: role::lists: Actually use mailman-admins group [puppet] - 10https://gerrit.wikimedia.org/r/676626 [17:19:44] Amir1: ^ [17:19:55] heh, mutante and I figured it out at the same time :) [17:20:11] \o/ [17:20:14] Thanks [17:20:16] (03PS2) 10Dzahn: role::lists: Actually use mailman-admins group [puppet] - 10https://gerrit.wikimedia.org/r/676626 (https://phabricator.wikimedia.org/T278616) (owner: 10Legoktm) [17:20:25] (03CR) 10Dzahn: [C: 03+1] role::lists: Actually use mailman-admins group [puppet] - 10https://gerrit.wikimedia.org/r/676626 (https://phabricator.wikimedia.org/T278616) (owner: 10Legoktm) [17:20:26] That's really fast [17:20:50] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28890/console" [puppet] - 10https://gerrit.wikimedia.org/r/676626 (https://phabricator.wikimedia.org/T278616) (owner: 10Legoktm) [17:21:36] https://puppet-compiler.wmflabs.org/compiler1002/28890/lists1001.wikimedia.org/index.html looks good [17:22:14] (03CR) 10Legoktm: [C: 03+2] role::lists: Actually use mailman-admins group [puppet] - 10https://gerrit.wikimedia.org/r/676626 (https://phabricator.wikimedia.org/T278616) (owner: 10Legoktm) [17:23:23] Amir1: try now [17:23:49] :D [17:24:06] Apr 2 17:23:31 lists1001 systemd: pam_unix(systemd-user:session): session opened for user ladsgroup by (uid=0) [17:24:18] \o/ [17:24:20] Thanks [17:24:32] /dev/vda1 287G 175G 97G 65% / [17:24:40] that's not scary at all [17:25:09] mutante: thanks ^^ [17:25:23] hehe, i just like to use wall [17:25:41] (03PS1) 10Andrew Bogott: Openstack Keystone: update our hacked projects.py for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676628 (https://phabricator.wikimedia.org/T261136) [17:26:18] Amir1: scary because of the size? 
[17:26:31] yeah, it'll need to be migrated to mailman3 [17:26:35] rsync between hosts should still be reasonably fast [17:26:44] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Keystone: update our hacked projects.py for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676628 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [17:27:14] yeah but indexing for search, adding them to the database, etc. That'll be a massive endeavor [17:27:19] I think the easiest is to just copy it all and delete what you dont need on the target [17:27:42] ah, the database part.. yea. that's new [17:28:39] I hope we can leave lots of old mailing lists in migration (maybe closed ones/inactive ones?) [17:28:46] but that's a discussion for later [17:28:56] it seems hard to not break all the links into archives [17:29:48] I think the files will stay as the same, the apache rules just handles them [17:29:54] we will have a copy basically [17:30:07] "closed", "private", "inactive" all are not strictly defined when we use them for mailing lists, sometimes they mean a little bit different things than other times [17:30:34] oh, that sounds good Amir1 [17:30:41] did not expect that to stay the same [17:35:47] (03PS2) 10Dzahn: site: turn 12 new codfw servers into mw appservers [puppet] - 10https://gerrit.wikimedia.org/r/676484 (https://phabricator.wikimedia.org/T278396) [17:38:20] Amir1: beware of "hidden" lists that are not only private in the sense that they require approval to subscribe and/or have private archives but also the mere existence of them is hidden, they don't appear on the public list info page but they exist. 
they should not exist but there are probably some because admins can do as they like [17:38:56] yeah saw the option in mailman3 as well [17:38:59] it's against some old policy but people are not aware [17:39:03] when I was making test mailing lists [17:39:42] yea, they are kind of breaking the rules [17:41:03] people also do it thinking it helps against spam.. i guess [17:41:31] but also probably it doesn't actually help [17:43:36] (03PS1) 10Cwhite: logstash: use curator cluster config when possible [puppet] - 10https://gerrit.wikimedia.org/r/676631 (https://phabricator.wikimedia.org/T274394) [17:43:38] (03CR) 10Dzahn: [C: 03+2] site: turn 12 new codfw servers into mw appservers [puppet] - 10https://gerrit.wikimedia.org/r/676484 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn) [17:43:42] I mean I understand the reasoning and I do wish our spam fighting were better (hopefully mailman3 addresses the problem a bit) but yeah, still [17:43:42] (03PS3) 10Dzahn: site: turn 12 new codfw servers into mw appservers [puppet] - 10https://gerrit.wikimedia.org/r/676484 (https://phabricator.wikimedia.org/T278396) [17:44:20] Amir1: SpamAssassin is scoring all the mails, mostly the issue is missing or bad rules to act based on the score [17:44:34] that is on the list admin side, per list [17:44:42] central infra does what it can [17:45:08] not sure that specific part is a big difference between version 2 and 3 [17:45:28] it already had the capabilities but people need regexes/filter rules for their individual needs [17:45:53] there isn't one that fits all, i think [17:46:10] with so many different languages and countries as well [17:47:12] we can't delete mails for users, but we do give them the score headers [17:48:33] but maybe v3 fixes that part about spam to the -owner special address [17:52:26] ooh... 
unexpected issue when trying to generate new mcrouter certs that did not exist yesterday [17:52:42] as a maintainer of +10 mailing lists, the old unusable interface contributes a lot to the problem, each mailing list has its own password (shared with other admins) which means every time I want to do something I need to look it up in my password manager [17:52:51] and SubjectAltNameWarning: Certificate for puppetdb1002.eqiad.wmnet has no `subjectAltName` [17:55:00] (03PS1) 10ArielGlenn: preclaim job fragments before claiming them [dumps] - 10https://gerrit.wikimedia.org/r/676632 (https://phabricator.wikimedia.org/T252396) [17:57:13] !log upgraded mailman3 python3-django-postorius on lists1002 [17:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:39] 10SRE, 10Wikimedia-Mailing-lists: lists-next: “confirm” and “welcome” emails lack List-Id header - https://phabricator.wikimedia.org/T278431 (10Legoktm) 05Open→03Resolved Upgraded lists-next. [18:15:28] if there is an alert about "widespread" puppet failures then it's me because there are 12 new hosts that dont have the mcrouter certs yet, but there is an issue creating them. but now I want to see if it actually alerts [18:20:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2247.codfw.wmnet [18:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:27] mw2247 failed to decom properly and is now in zombie state [18:22:45] where it exists in some places but not others and it broke adding mcrouter cert for new hosts [18:22:55] now repeating decom cookbook [18:23:18] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Legoktm) ` legoktm@lists1002:~$ du -hs discovery-alerts.mbox 14M discovery-alerts.mbox mysql:testmailman3web@m... 
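The SpamAssassin discussion above boils down to: central infrastructure scores every mail and exposes the score in headers, and it is up to each list's filter rules to act on that score. A minimal sketch of such a per-list rule; the header name and format here are assumptions (they depend on how SpamAssassin and the list are configured), and the thresholds are purely illustrative:

```python
def spam_action(headers: dict, hold_at: float = 5.0, discard_at: float = 10.0) -> str:
    """Decide what a list's filter rule might do with a scored mail.

    Assumes an 'X-Spam-Score' header carrying a plain float, which is
    a configuration-dependent assumption, not a guaranteed format.
    Thresholds are per-list knobs, matching the point above that
    there isn't one rule that fits all lists.
    """
    try:
        score = float(headers.get("X-Spam-Score", "0"))
    except ValueError:
        score = 0.0  # unparseable header: fail open, let a human look
    if score >= discard_at:
        return "discard"
    if score >= hold_at:
        return "hold"
    return "accept"
```

The "we can't delete mails for users, but we do give them the score headers" split is exactly this: scoring is central, the accept/hold/discard decision is per list.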
[18:23:34] o.O [18:23:41] yeah, that was the one I manually downtimed yesterday [18:23:53] i found your comment [18:24:02] but after I ran into the mcrouter cert issue [18:24:08] and asked Reuven to help [18:24:12] but it's all the same thing :p [18:24:19] :)) [18:24:31] and root cause was .. the host did not want to shutdown when it was told to die over IPMI [18:24:41] and now let's see what happens on second run [18:24:46] it was still powered on? [18:24:47] it already whined that mgmt is gone but continues [18:25:12] Failed to power off, manual intervention required: Remote IPMI for mw2247.mgmt.codfw.wmnet failed (exit=1): b'' [18:25:17] Removed from Puppet master and PuppetDB [18:25:21] it's just sitting there singing "Daisy, Daisy" but nobody can hear it because data centers are loud [18:26:17] it has killed the mgmt interface DNS already last time [18:26:59] we will know more in a few minutes, sitting at generate DNS from Netbox step [18:28:04] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2247.codfw.wmnet [18:28:06] rzl: :) that's a nice visual [18:28:12] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2247.codfw.wmnet` - mw2247.codfw.wmnet (**FAIL**) - Downtime... [18:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:51] (03PS1) 10Andrew Bogott: Openstack Nova policy.yaml: remove all keypair-related policies [puppet] - 10https://gerrit.wikimedia.org/r/676644 (https://phabricator.wikimedia.org/T261136) [18:29:10] Nothing to commit! ... 
[18:30:10] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova policy.yaml: remove all keypair-related policies [puppet] - 10https://gerrit.wikimedia.org/r/676644 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [18:31:03] still in Icinga. running puppet on alert1001 to see if it disappears now. [18:33:11] ACKNOWLEDGEMENT - Host mw2247 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn shutdown -h now [18:40:37] the log of the decom cookbook says how it ran puppet node clean and puppet node deactivate and then removed it from Puppet master and PuppetDB... yet.. it is still in the PuppetDB [18:44:05] !log [puppetmaster1001:~] $ sudo puppet node deactivate mw2247.codfw.wmnet [18:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:32] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [18:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:53] legoktm: running 'puppet node deactivate' manually for the zombie node on the master, then running puppet on alert1001.. 
removes icinga config snippets :) [18:46:22] all the alerts gone [18:47:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:47:19] rzl: the same thing also fixed mcrouter_generate_certs :) thanks for your help [18:47:30] PROBLEM - LVS thumbor eqiad port 8800/tcp - Thumbor image scaling IPv4 #page on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 10.002 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:47:47] here [18:48:05] Hi [18:48:26] RECOVERY - LVS thumbor eqiad port 8800/tcp - Thumbor image scaling IPv4 #page on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 368 bytes in 3.495 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:48:44] hi [18:48:50] o/ [18:49:05] thumbor latency in both clusters has been elevated since a little before 17:30 [18:49:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:50:11] same with 503 responses in eqiad [18:50:12] cdanis: I think the thing right before 17:30 was a traffic shift in codfw, there was a sharp drop in 404s [18:51:47] it's still hard to explain tail latency increasing from a bunch of cheap requests going away, though [18:51:53] median or mean latency sure [18:52:04] mm, that's true [18:52:10] they are very well-correlated though [18:52:50] it does not look like cpu saturation on the thumbor hosts [18:53:02] maybe the same clients kept sending traffic, but changed from getting fast 404s to slower 200s? 
now I'm just speculating though [18:53:12] it's certainly possible [18:53:15] looking at haproxy sessions, the DC traffic shift appears there too [18:53:55] and I'll tell you as soon as I manage to sign into grafana :( [18:55:35] there we go -- I had to switch from log to linear to get an idea but it looks like the answer to my question is "no" [18:55:37] still digging [18:55:46] observed latency p99 looks very high since 1100 [18:56:45] however, that looks like the normal daily pattern. disregard ^^ [18:57:29] https://logstash.wikimedia.org/goto/dddf1585ad463e1756f4e3fa9bf09761 [19:01:36] what is the summary? maybe I can help [19:01:38] on thumbor1001, in /srv/thumbor/tmp there are the actual thumbor processes apparently and then there are comments where gilles did stuff like this: sudo -u thumbor socat - unix-connect:manhole-8831 [19:02:47] effie: LVS thumbor alerted and then recovered again right after [19:03:22] see _security [19:03:22] (03PS1) 10CDanis: upload: ratelimit two new bogus UAs: Python-urllib & Java/ [puppet] - 10https://gerrit.wikimedia.org/r/676650 [19:03:53] ack [19:04:20] (03CR) 10Cwhite: [C: 03+1] upload: ratelimit two new bogus UAs: Python-urllib & Java/ [puppet] - 10https://gerrit.wikimedia.org/r/676650 (owner: 10CDanis) [19:04:26] (03CR) 10RLazarus: [C: 03+1] upload: ratelimit two new bogus UAs: Python-urllib & Java/ [puppet] - 10https://gerrit.wikimedia.org/r/676650 (owner: 10CDanis) [19:04:44] (03CR) 10CDanis: [C: 03+2] upload: ratelimit two new bogus UAs: Python-urllib & Java/ [puppet] - 10https://gerrit.wikimedia.org/r/676650 (owner: 10CDanis) [19:05:11] (03CR) 10Legoktm: [C: 03+1] upload: ratelimit two new bogus UAs: Python-urllib & Java/ [puppet] - 10https://gerrit.wikimedia.org/r/676650 (owner: 10CDanis) [19:06:31] I assume the slowness on instant commons is gonna be this [19:07:06] !log bstorm@cumin1001 Added views for new wiki: mnwwiktionary T276126 [19:07:06] !log bstorm@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki 
(exit_code=0) [19:07:07] RhinosF1: very likely, stand by [19:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:15] T276126: Prepare and check storage layer for mnwwiktionary - https://phabricator.wikimedia.org/T276126 [19:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:37] rzl: it's a good test for my fixes on our timeout config! [19:13:33] 10SRE, 10OTRS, 10Security, 10User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10Keegan) I would prefer to wait a week until this is published in Tech/News, I'm meeting with the admins and we're discussing and planning a larg... [19:15:07] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) >>! In T277780#6967287, @Legoktm wrote: > Not sure why, but the icinga downtime on this actually failed. I just set it manually. Thank you! The orig... 
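A hedged sketch of the Thumbor "manhole" trick quoted above: per the log, each worker exposes a debug socket under /srv/thumbor/tmp named after its port (8831 is one worker's port from the comments gilles left on thumbor1001). The command is printed rather than executed, since the socket only exists on a live thumbor host.

```shell
# Dry-run helper: print the socat invocation for attaching to a
# Thumbor worker's manhole socket (paths and socket naming assumed
# from the log, not from Thumbor documentation).
manhole_cmd() {
  echo "cd /srv/thumbor/tmp && sudo -u thumbor socat - unix-connect:manhole-${1}"
}
manhole_cmd 8831
```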
[19:27:15] PROBLEM - mediawiki-installation DSH group on mw2383 is CRITICAL: Host mw2383 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:29:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2383.codfw.wmnet with reason: new_install [19:29:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2383.codfw.wmnet with reason: new_install [19:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:11] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2383 is CRITICAL: Host mw2383 is not in mediawiki-installation dsh group daniel_zahn new hardware https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:43:28] (03PS1) 10Gergő Tisza: Fix growthexperiments.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676654 (https://phabricator.wikimedia.org/T275171) [19:43:41] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01122 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:44:13] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2550.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:45:23] (03PS1) 10Bstorm: wikireplicas: add a skip-dns option [cookbooks] - 10https://gerrit.wikimedia.org/r/676655 (https://phabricator.wikimedia.org/T279185) [19:45:43] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003565 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:47:29] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [19:47:36] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:30] PROBLEM - Check systemd state on mw2384 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:17] (03CR) 10H.krishna123: "> Patch Set 9:" (034 comments) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [19:55:34] PROBLEM - Apache HTTP on mw2384 is CRITICAL: connect to address 10.192.0.47 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [19:55:36] PROBLEM - PHP7 rendering on mw2391 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 458 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:55:48] PROBLEM - mediawiki-installation DSH group on mw2389 is CRITICAL: Host mw2389 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:55:48] PROBLEM - mediawiki-installation DSH group on mw2394 is CRITICAL: Host mw2394 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:58:04] PROBLEM - Apache HTTP on mw2385 is CRITICAL: connect to address 10.192.0.48 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [19:58:04] PROBLEM - PHP7 rendering on mw2387 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 458 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:00:30] PROBLEM - Apache HTTP on mw2386 is CRITICAL: connect to address 10.192.0.49 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:00:50] PROBLEM - mediawiki-installation DSH group on mw2386 is CRITICAL: Host mw2386 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:00:50] PROBLEM - mediawiki-installation DSH group on mw2390 is CRITICAL: Host mw2390 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:00:50] PROBLEM - Memcached on mw2388 is CRITICAL: connect to address 10.192.0.51 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:00:50] PROBLEM - Memcached on mw2392 is CRITICAL: connect to address 10.192.0.55 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:01:38] mutante: fyi ^ [20:02:52] PROBLEM - Check systemd state on mw2388 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:03] 10SRE, 10Traffic, 10HTTPS, 10Security: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298 (10Krinkle) 05Stalled→03Open >>! In T92298#2795655, @BBlack wrote: > > My current thinking on this is that it's best to wait on TLSv1.3's padding mech... 
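A hedged triage sketch for the "Check systemd state ... degraded: apache2.service" alerts firing on the freshly installed mw24xx hosts (apache was later started by hand on mw2392 and the checks recovered). The commands are printed, not run, since they need ssh access to the target; mw2384 is one of the alerting hosts from the log.

```shell
# Dry-run helper: print the commands to inspect a degraded systemd
# state and restart the failed apache2 unit on a given host.
triage_degraded_cmds() {
  h="$1"
  echo "ssh ${h} systemctl is-system-running"              # 'degraded' if any unit failed
  echo "ssh ${h} systemctl --failed"                       # list the failed units
  echo "ssh ${h} sudo journalctl -u apache2.service -n 50" # why it failed
  echo "ssh ${h} sudo systemctl restart apache2.service"
}
triage_degraded_cmds mw2384.codfw.wmnet
```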
[20:03:20] PROBLEM - Check systemd state on mw2385 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:32] PROBLEM - Check systemd state on mw2393 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:04] PROBLEM - PHP7 rendering on mw2388 is CRITICAL: connect to address 10.192.0.51 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:05:04] PROBLEM - PHP7 rendering on mw2392 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 458 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:05:46] PROBLEM - mediawiki-installation DSH group on mw2391 is CRITICAL: Host mw2391 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:07:30] PROBLEM - Apache HTTP on mw2390 is CRITICAL: connect to address 10.192.0.53 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:08:16] PROBLEM - mediawiki-installation DSH group on mw2387 is CRITICAL: Host mw2387 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:08:16] PROBLEM - Memcached on mw2384 is CRITICAL: connect to address 10.192.0.47 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:08:18] PROBLEM - Check systemd state on mw2386 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:18] PROBLEM - Memcached on mw2393 is CRITICAL: connect to address 10.192.0.57 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached 
[20:09:10] PROBLEM - Apache HTTP on mw2394 is CRITICAL: connect to address 10.192.0.58 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:09:48] !log bstorm@cumin1001 Added views for new wiki: taywiki T275836 [20:09:48] !log bstorm@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [20:09:48] PROBLEM - Apache HTTP on mw2387 is CRITICAL: connect to address 10.192.0.50 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:09:48] PROBLEM - Apache HTTP on mw2391 is CRITICAL: connect to address 10.192.0.54 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:57] T275836: Prepare and check storage layer for taywiki - https://phabricator.wikimedia.org/T275836 [20:10:00] PROBLEM - Check systemd state on mw2394 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:36] PROBLEM - Memcached on mw2389 is CRITICAL: connect to address 10.192.0.52 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:12:02] PROBLEM - PHP7 rendering on mw2384 is CRITICAL: connect to address 10.192.0.47 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:12:02] PROBLEM - PHP7 rendering on mw2393 is CRITICAL: connect to address 10.192.0.57 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:12:52] PROBLEM - Memcached on mw2385 is CRITICAL: connect to address 10.192.0.48 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:12:52] PROBLEM - Memcached on mw2394 is CRITICAL: connect to address 10.192.0.58 and 
port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:12:52] PROBLEM - Check systemd state on mw2390 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:14:14] PROBLEM - Apache HTTP on mw2392 is CRITICAL: connect to address 10.192.0.55 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:15:08] PROBLEM - mediawiki-installation DSH group on mw2392 is CRITICAL: Host mw2392 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:15:08] PROBLEM - Check systemd state on mw2391 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:26] PROBLEM - PHP7 rendering on mw2385 is CRITICAL: connect to address 10.192.0.48 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:16:26] PROBLEM - PHP7 rendering on mw2394 is CRITICAL: connect to address 10.192.0.58 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:16:26] PROBLEM - PHP7 rendering on mw2389 is CRITICAL: connect to address 10.192.0.52 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:17:30] PROBLEM - mediawiki-installation DSH group on mw2388 is CRITICAL: Host mw2388 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:17:30] PROBLEM - Memcached on mw2386 is CRITICAL: connect to address 10.192.0.49 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:17:30] PROBLEM - Check systemd state on mw2387 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:36] these are all the new hosts right? [20:18:10] RECOVERY - PHP7 rendering on mw2384 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.571 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:18:14] PROBLEM - Check systemd state on mw2389 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:42] RECOVERY - Check systemd state on mw2384 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:54] PROBLEM - Apache HTTP on mw2388 is CRITICAL: connect to address 10.192.0.51 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:19:26] not sure why apache is going down, I started it on mw2392 and it seems fine [20:19:42] RECOVERY - Apache HTTP on mw2385 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:20:00] RECOVERY - Apache HTTP on mw2384 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:20:42] RECOVERY - Apache HTTP on mw2392 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:20:48] RECOVERY - PHP7 rendering on mw2392 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:20:50] RECOVERY - Check systemd state on mw2385 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:54] RECOVERY - PHP7 rendering on mw2385 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.126 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:21:12] PROBLEM - PHP7 rendering on mw2386 is CRITICAL: connect to address 10.192.0.49 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:21:20] RECOVERY - Check systemd state on mw2390 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:40] RECOVERY - Apache HTTP on mw2390 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:22:36] PROBLEM - mediawiki-installation DSH group on mw2384 is CRITICAL: Host mw2384 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:22:36] PROBLEM - mediawiki-installation DSH group on mw2393 is CRITICAL: Host mw2393 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:22:36] PROBLEM - Memcached on mw2390 is CRITICAL: connect to address 10.192.0.53 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:23:36] PROBLEM - Apache HTTP on mw2393 is CRITICAL: connect to address 10.192.0.57 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:24:56] PROBLEM - mediawiki-installation DSH group on mw2385 is CRITICAL: Host mw2385 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:24:58] PROBLEM - Memcached on mw2391 is CRITICAL: connect to address 10.192.0.54 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:25:48] PROBLEM - Apache HTTP on mw2389 is CRITICAL: connect to address 10.192.0.52 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:27:14] PROBLEM - Memcached on mw2387 is CRITICAL: connect to address 10.192.0.50 and port 
11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:27:38] RECOVERY - Check systemd state on mw2386 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:18] RECOVERY - PHP7 rendering on mw2386 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:28:20] RECOVERY - Apache HTTP on mw2386 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:29:50] RECOVERY - Check systemd state on mw2388 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:18] RECOVERY - PHP7 rendering on mw2388 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:30:38] RECOVERY - Apache HTTP on mw2388 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:07] (03PS1) 10Andrew Bogott: Add usuri and usurri to typos [puppet] - 10https://gerrit.wikimedia.org/r/676664 [20:32:12] RECOVERY - Check systemd state on mw2393 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:20] RECOVERY - PHP7 rendering on mw2393 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:33:06] RECOVERY - Apache HTTP on mw2393 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:33:15] (03CR) 10Andrew Bogott: [C: 03+2] Add usuri and usurri to typos [puppet] - 10https://gerrit.wikimedia.org/r/676664 (owner: 10Andrew 
Bogott) [20:34:16] (03PS1) 10QChris: Allow “Gerrit Managers” to import history [debs/postorius] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/676665 [20:34:18] (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [debs/postorius] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/676665 (owner: 10QChris) [20:34:30] (03PS1) 10QChris: Import done. Revoke import grants [debs/postorius] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/676666 [20:34:33] (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. Revoke import grants [debs/postorius] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/676666 (owner: 10QChris) [20:35:56] RECOVERY - Check systemd state on mw2391 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:18] RECOVERY - PHP7 rendering on mw2387 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.613 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:36:18] RECOVERY - Check systemd state on mw2387 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:36] (03PS1) 10QChris: Allow “Gerrit Managers” to import history [debs/mailman3] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/676667 [20:36:38] RECOVERY - Apache HTTP on mw2387 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.168 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:36:38] (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [debs/mailman3] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/676667 (owner: 10QChris) [20:36:40] RECOVERY - Apache HTTP on mw2391 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.607 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:36:40] RECOVERY - PHP7 rendering on mw2391 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.126 
second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:36:43] (03PS1) 10QChris: Import done. Revoke import grants [debs/mailman3] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/676668 [20:36:46] (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. Revoke import grants [debs/mailman3] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/676668 (owner: 10QChris) [20:37:04] RECOVERY - Check systemd state on mw2394 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:36] RECOVERY - PHP7 rendering on mw2394 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.602 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:37:58] RECOVERY - Apache HTTP on mw2394 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:39:09] !log andrew@deploy1002 Started deploy [horizon/deploy@86c7cdc]: update horizon for codfw1dev [20:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:39] (03PS2) 10Legoktm: mailman3: Avoid duplication in rsync definitions [puppet] - 10https://gerrit.wikimedia.org/r/676479 (https://phabricator.wikimedia.org/T278609) [20:40:33] rzl: back! handling it.. and SIGH.. 
all of them failed [20:40:50] had them running screen [20:40:56] !log andrew@deploy1002 Finished deploy [horizon/deploy@86c7cdc]: update horizon for codfw1dev (duration: 01m 47s) [20:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:30] RECOVERY - Apache HTTP on mw2389 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.590 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:42:53] (03CR) 10Legoktm: [C: 03+2] mailman3: Avoid duplication in rsync definitions [puppet] - 10https://gerrit.wikimedia.org/r/676479 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [20:43:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 10 hosts with reason: new_install [20:43:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 10 hosts with reason: new_install [20:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[2390-2394].codfw.wmnet with reason: new_install [20:43:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[2390-2394].codfw.wmnet with reason: new_install [20:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:30] !log mw2384 reboot [20:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:22] RECOVERY - Check systemd state on mw2389 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:39] !log mw2385 through mw2394 - serial rebooting [20:44:44] RECOVERY - PHP7 rendering on mw2389 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.135 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:31] !log andrew@deploy1002 Started deploy [horizon/deploy@86c7cdc]: tweak to affinity group options [20:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:46] RECOVERY - Memcached on mw2384 is OK: TCP OK - 0.032 second response time on 10.192.0.47 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [20:47:04] 10SRE, 10ops-eqiad, 10serviceops: decommission scb100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T275759 (10wiki_willy) Hi @akosiaris - just checking on the status on this, to see if we could an ETA on when we could pull these servers from the racks? We're starting to hit our max power threshold a... [20:47:36] RECOVERY - Memcached on mw2385 is OK: TCP OK - 0.032 second response time on 10.192.0.48 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [20:50:10] !log andrew@deploy1002 Finished deploy [horizon/deploy@86c7cdc]: tweak to affinity group options (duration: 03m 39s) [20:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:46] RECOVERY - Memcached on mw2386 is OK: TCP OK - 0.035 second response time on 10.192.0.49 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [20:51:16] RECOVERY - Memcached on mw2387 is OK: TCP OK - 0.032 second response time on 10.192.0.50 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [20:55:20] RECOVERY - Memcached on mw2388 is OK: TCP OK - 0.032 second response time on 10.192.0.51 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [20:56:14] RECOVERY - Memcached on mw2390 is OK: TCP OK - 0.032 second response time on 10.192.0.53 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [20:56:52] RECOVERY - Memcached on mw2389 is OK: TCP OK - 0.032 second response time on 10.192.0.52 port 11210 https://wikitech.wikimedia.org/wiki/Memcached 
[20:58:05] !log mw238* - scap pull via cumin not possible because it doesn't work as root [20:58:06] (03PS1) 10RLazarus: cergen: When mcrouter_generator hits a DNS error, print the hostname [puppet] - 10https://gerrit.wikimedia.org/r/676671 [20:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:38] RECOVERY - Memcached on mw2391 is OK: TCP OK - 0.036 second response time on 10.192.0.54 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [20:59:22] (03CR) 10Dzahn: [C: 03+1] "thanks! If this is like the live hack earlier I can already confirm it works. And was very useful." [puppet] - 10https://gerrit.wikimedia.org/r/676671 (owner: 10RLazarus) [21:00:18] RECOVERY - Memcached on mw2392 is OK: TCP OK - 0.032 second response time on 10.192.0.55 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [21:01:12] (03CR) 10RLazarus: [C: 03+2] "Yep, same thing! Output in today's case would have been: "Temporary failure in name resolution: mw2247.codfw.wmnet" followed by the stack " [puppet] - 10https://gerrit.wikimedia.org/r/676671 (owner: 10RLazarus) [21:03:56] RECOVERY - Memcached on mw2393 is OK: TCP OK - 0.032 second response time on 10.192.0.57 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [21:06:19] (03CR) 10Dzahn: "yes, that's what it was. 
and mw2247 is now removed from puppetDB and the issue is gone" [puppet] - 10https://gerrit.wikimedia.org/r/676671 (owner: 10RLazarus)
[21:07:48] !log mw2383 through mw2394 - 'uptime && scap pull' via ssh -C (not cumin because it needs to run as non-root)
[21:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:01] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10RobH)
[21:14:34] RECOVERY - Memcached on mw2394 is OK: TCP OK - 0.032 second response time on 10.192.0.58 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[21:14:35] (03PS1) 10RobH: moss-fe100[12] setup info [puppet] - 10https://gerrit.wikimedia.org/r/676672 (https://phabricator.wikimedia.org/T275511)
[21:15:38] 10SRE, 10Analytics-Radar, 10Traffic, 10Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (10Krinkle)
[21:15:42] 10SRE, 10Analytics-Radar, 10Domains, 10Traffic, 10Wikimedia-General-or-Unknown: WMF third-party cookies rejected - https://phabricator.wikimedia.org/T262882 (10Krinkle)
[21:16:08] 10SRE, 10serviceops, 10Patch-For-Review: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) a:03Dzahn
[21:16:17] (03CR) 10RobH: [C: 03+2] moss-fe100[12] setup info [puppet] - 10https://gerrit.wikimedia.org/r/676672 (https://phabricator.wikimedia.org/T275511) (owner: 10RobH)
[21:16:52] 10SRE, 10Analytics-Radar, 10Traffic, 10Wikimedia-General-or-Unknown: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (10Krinkle)
[21:17:01] (03PS1) 10Dzahn: site/conftool-data: turn 8 new codfw servers into API appservers [puppet] - 10https://gerrit.wikimedia.org/r/676673 (https://phabricator.wikimedia.org/T278396)
[21:18:18] 10SRE, 10Analytics-Radar, 10Traffic, 10Wikimedia-General-or-Unknown: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (10Krinkle) This happens because our traffic layer shares the caches for `/static` across a...
[21:18:39] 10SRE, 10Analytics-Radar, 10Traffic, 10Wikimedia-General-or-Unknown: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (10Krinkle) ` $ curl -I 'https://commons.wikimedia.org/static/favicon/commons.ico' HTTP/2 2...
[21:19:05] !log generating mcrouter certs for mw2395 through mw2404 (T278396)
[21:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:14] T278396: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396
[21:20:56] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10RobH)
[21:21:01] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['moss-fe1001.eqiad.wmnet', 'moss-fe1002.eqiad.wmnet'] ` The log can...
[21:21:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2383.codfw.wmnet
[21:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:21] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2384.codfw.wmnet
[21:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2385.codfw.wmnet
[21:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:33] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2386.codfw.wmnet
[21:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:40] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2387.codfw.wmnet
[21:21:44] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2388.codfw.wmnet
[21:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:15] (03PS2) 10Dzahn: site/conftool-data: turn 8 new codfw servers into API appservers [puppet] - 10https://gerrit.wikimedia.org/r/676673 (https://phabricator.wikimedia.org/T278396)
[21:22:35] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2389.codfw.wmnet
[21:22:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2390.codfw.wmnet
[21:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:44] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2391.codfw.wmnet
[21:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:07] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: turn 8 new codfw servers into API appservers [puppet] - 10https://gerrit.wikimedia.org/r/676673 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn)
[21:24:13] 10SRE, 10Traffic: TATA SKY users blocked from upload.wikimedia.org in any browser except Opera - https://phabricator.wikimedia.org/T275211 (10Krinkle)
[21:24:36] 10SRE, 10Traffic: TATA SKY users unable to connect with upload.wikimedia.org in browsers except Opera - https://phabricator.wikimedia.org/T275211 (10Krinkle)
[21:24:38] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2392.codfw.wmnet
[21:24:43] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2393.codfw.wmnet
[21:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2394.codfw.wmnet
[21:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:14] RECOVERY - mediawiki-installation DSH group on mw2385 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[21:27:50] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:16] !log imported python-xapian-haystack 2.1.0-6~wmf1 on apt1001 (T278717)
[21:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:23] T278717: Use xapian search backend for mailman3 - https://phabricator.wikimedia.org/T278717
[21:33:27] (03PS1) 10Legoktm: mailman3: Use xapian for fulltext search [puppet] - 10https://gerrit.wikimedia.org/r/676675 (https://phabricator.wikimedia.org/T278717)
[21:33:39] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: REIMAGE
[21:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:39] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw238[3-9].codfw.wmnet
[21:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:54] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:09] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw239[0-4].codfw.wmnet
[21:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:35] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: REIMAGE
[21:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:47] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: REIMAGE
[21:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:04] PROBLEM - Check systemd state on mw2395 is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:53] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: REIMAGE
[21:37:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[2395-2396].codfw.wmnet with reason: new_install
[21:37:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[2395-2396].codfw.wmnet with reason: new_install
[21:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:58] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28891/console" [puppet] - 10https://gerrit.wikimedia.org/r/676675 (https://phabricator.wikimedia.org/T278717) (owner: 10Legoktm)
[21:40:03] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2383.codfw.wmnet
[21:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2384.codfw.wmnet
[21:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:14] (03PS1) 10Dzahn: site/conftool-data: mw2397 through mw2402 back to insetup, not ready yet [puppet] - 10https://gerrit.wikimedia.org/r/676677 (https://phabricator.wikimedia.org/T278396)
[21:41:35] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw238[5-9].codfw.wmnet
[21:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:31] RECOVERY - Check systemd state on mw2395 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:37] !log pooled 12 brand-new codfw appservers running on new hardware generation
[21:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw239[0-4].codfw.wmnet
[21:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:47] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-fe1001.eqiad.wmnet', 'moss-fe1002.eqiad.wmnet'] ` and were **ALL** successful.
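(Editor's note: several entries above use the bracketed numeric range notation shared by cumin and conftool selectors, e.g. `mw238[3-9].codfw.wmnet` or `mw[2395-2396].codfw.wmnet`. A minimal sketch of expanding the simple single-bracket form into one hostname per line; the `expand_range` helper is hypothetical and does not cover nested or comma-separated ranges:)

```shell
# Hypothetical helper: expand a single bracketed numeric range like
# "mw[2395-2396].codfw.wmnet" into one hostname per line.
expand_range() {
  local spec=$1
  local prefix=${spec%%\[*}   # text before '['
  local rest=${spec#*\[}      # "2395-2396].codfw.wmnet"
  local range=${rest%%\]*}    # "2395-2396"
  local suffix=${rest#*\]}    # ".codfw.wmnet"
  local first=${range%-*} last=${range#*-}
  for n in $(seq "$first" "$last"); do
    echo "${prefix}${n}${suffix}"
  done
}

# prints mw2395.codfw.wmnet and mw2396.codfw.wmnet
expand_range 'mw[2395-2396].codfw.wmnet'
```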
[21:47:49] 10SRE, 10serviceops, 10Patch-For-Review: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn)
[21:48:05] 10SRE, 10serviceops, 10Patch-For-Review: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn)
[21:48:43] !log mw2395, mw2396 - reboot - becoming API servers
[21:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:38] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: mw2397 through mw2402 back to insetup, not ready yet [puppet] - 10https://gerrit.wikimedia.org/r/676677 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn)
[21:53:41] (03CR) 10Legoktm: [V: 03+1 C: 03+2] "The xapian index is at a different file path, so it should be straightforward to revert this without needing to rebuild the index (though " [puppet] - 10https://gerrit.wikimedia.org/r/676675 (https://phabricator.wikimedia.org/T278717) (owner: 10Legoktm)
[21:54:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) mw2384 through mw2394 moved to production and marked as such in netbox mw2395, mw2396 to follow
[21:55:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw239[5-6].codfw.wmnet
[21:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:20] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw239[5-6].codfw.wmnet
[21:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:29] RECOVERY - mediawiki-installation DSH group on mw2389 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[21:57:29] RECOVERY - mediawiki-installation DSH group on mw2394 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[21:58:00] !log legoktm@lists1002:~$ time sudo mailman-web rebuild_index
[21:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:59] RECOVERY - mediawiki-installation DSH group on mw2386 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:02:59] RECOVERY - mediawiki-installation DSH group on mw2390 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:04:18] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Legoktm) >>! In T278609#6967023, @Ladsgroup wrote: > - The search index though is 17 MB (you can see it in `/var/lib/mailman3/web/f...
[22:05:39] RECOVERY - mediawiki-installation DSH group on mw2391 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:05:39] RECOVERY - mediawiki-installation DSH group on mw2393 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:05:39] RECOVERY - mediawiki-installation DSH group on mw2392 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:06:28] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Use xapian search backend for mailman3 - https://phabricator.wikimedia.org/T278717 (10Legoktm) 05Open→03Resolved a:03Legoktm More discussion ongoing at {T278609}, but it's enabled for now. I'm also going to see if we can get python-xapian-haystack f...
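(Editor's note: the repeated `conftool action : set/pooled=...` entries in this log each target a single `name=` selector, one host at a time. A minimal sketch of generating that per-host command sequence; the `confctl_set_pooled` loop helper is hypothetical, and the `confctl select ... set/...` syntax is assumed from the conftool CLI as used on the cumin hosts:)

```shell
# Hypothetical helper: print one confctl command per host, mirroring
# the per-server pool/depool actions recorded in the log. It only
# emits the commands; it does not run them.
confctl_set_pooled() {
  local state=$1 first=$2 last=$3
  for n in $(seq "$first" "$last"); do
    echo "confctl select name=mw${n}.codfw.wmnet set/pooled=${state}"
  done
}

# Example: the mw2395-mw2396 depool sequence logged at 21:55
confctl_set_pooled no 2395 2396
```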
[22:08:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw239[5-6].codfw.wmnet
[22:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:27] !log pooled mw2395,mw2396 as API appservers running on new hardware
[22:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:53] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki
[22:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:49] RECOVERY - mediawiki-installation DSH group on mw2387 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:14:15] RECOVERY - mediawiki-installation DSH group on mw2383 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:20:05] RECOVERY - mediawiki-installation DSH group on mw2388 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:20:52] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10RobH)
[22:21:00] 10SRE, 10serviceops, 10Patch-For-Review: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn)
[22:21:07] 10SRE, 10serviceops, 10Patch-For-Review: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) 20:44 < mutante> !log mw2385 through mw2394 - serial rebooting 20:58 < mutante> !log mw238* - scap pull via cumin not possible be...
[22:21:08] (03PS1) 10Krinkle: trafficserver: Remove X-Request-Id from response headers unless debug [puppet] - 10https://gerrit.wikimedia.org/r/676682 (https://phabricator.wikimedia.org/T210484)
[22:21:12] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Legoktm) {P15109} qa-alerts and wikidata-bugs seem like good candidates to try next, though I would want #DBA supervision if we're...
[22:21:21] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10Dzahn) >>! In T268524#6966291, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/WkrxjngB1jz_IcW...
[22:25:03] RECOVERY - mediawiki-installation DSH group on mw2384 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:26:39] (03PS1) 10RLazarus: cergen: Inline definition_for, now that we're not using a dict comprehension [puppet] - 10https://gerrit.wikimedia.org/r/676684
[22:28:45] (03PS1) 10Cwhite: logstash: set cluster name for elasticsearch outputs [puppet] - 10https://gerrit.wikimedia.org/r/676685 (https://phabricator.wikimedia.org/T274394)
[22:31:33] !log bstorm@cumin1001 Added views for new wiki: trvwiki T276246
[22:31:33] !log bstorm@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0)
[22:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:42] T276246: Create Wikipedia Kari Seediq - https://phabricator.wikimedia.org/T276246
[22:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:55] (03PS1) 10Cwhite: logstash: use logstash output to manage ecs-test indexes [puppet] - 10https://gerrit.wikimedia.org/r/676687 (https://phabricator.wikimedia.org/T274394)
[22:32:32] (03PS2) 10Cwhite: logstash: use logstash output to manage ecs-test indexes [puppet] - 10https://gerrit.wikimedia.org/r/676687 (https://phabricator.wikimedia.org/T274394)
[22:39:25] (03CR) 10Legoktm: [C: 03+1] cergen: Inline definition_for, now that we're not using a dict comprehension [puppet] - 10https://gerrit.wikimedia.org/r/676684 (owner: 10RLazarus)
[22:41:57] (03CR) 10RLazarus: [C: 03+2] cergen: Inline definition_for, now that we're not using a dict comprehension [puppet] - 10https://gerrit.wikimedia.org/r/676684 (owner: 10RLazarus)
[22:42:50] (03PS1) 10Cwhite: logstash: add curator config to manage w3creportingapi revision 1 indexes [puppet] - 10https://gerrit.wikimedia.org/r/676690 (https://phabricator.wikimedia.org/T274394)
[22:44:06] (03PS2) 10Cwhite: logstash: remove logstash output on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/676477 (https://phabricator.wikimedia.org/T234854)
[22:44:21] 10ops-codfw, 10DC-Ops: Netbox/Accounting Discrepancies - https://phabricator.wikimedia.org/T279214 (10wiki_willy)
[22:50:43] (03CR) 10Cwhite: "These indexes are from the backfill into w3creportingapi_1.0.0-1. The last one to keep is January 2021 and with this should expire at the" [puppet] - 10https://gerrit.wikimedia.org/r/676690 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[22:55:44] (03CR) 10Bstorm: [C: 03+2] static-binaries: first pass at a stripped-down image for binaries [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/673378 (https://phabricator.wikimedia.org/T277749) (owner: 10Bstorm)
[22:56:23] (03Merged) 10jenkins-bot: static-binaries: first pass at a stripped-down image for binaries [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/673378 (https://phabricator.wikimedia.org/T277749) (owner: 10Bstorm)
[23:37:34] 10SRE, 10Commons, 10SRE-swift-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10AlexisJazz) https://en.wikipedia.org/wiki/File:Wilderness_Society_in_front_of_Kindness_House.jpg Only one revision and it'...