[00:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210107T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:19:10] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [00:25:12] (CR) CRusnov: [C: +2] hieradata/role/common/ganeti.yaml: Allow netbox-dev2001 RAPI access [puppet] - https://gerrit.wikimedia.org/r/654515 (owner: CRusnov) [00:44:35] !log restart elasticsearch on logstash1011 - oom [00:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:00] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7757643984 and 476 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:53:32] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 749554224 and 149 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:53:42] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2670058752 and 250 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:53:42] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 854616176 and 163 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:54:00] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: 
POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3292005528 and 298 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:54:32] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35437056 and 175 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:02] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 197048 and 205 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:14] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 95264 and 215 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:02] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 58448 and 265 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:42] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 99312 and 305 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:57:00] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 78064 and 323 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:57:32] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 305192 and 355 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:00:04] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210107T0100). 
[01:19:27] (PS1) Mstyles: update flink logging [deployment-charts] - https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) [01:20:50] (CR) Mstyles: "Updated the logging fields per our ECS talk. Do we still want to have all of the nested JSON that I see in the ECS examples?" [deployment-charts] - https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles) [01:41:30] (PS1) Legoktm: [WIP] docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) [01:41:55] (CR) jerkins-bot: [V: -1] [WIP] docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm) [01:42:45] (PS2) Legoktm: [WIP] docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) [01:43:11] (CR) jerkins-bot: [V: -1] [WIP] docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm) [01:44:07] (PS3) Legoktm: [WIP] docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) [01:45:28] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [01:46:29] (CR) Legoktm: "Still pending is the nginx config, but it should be functionally 
identical to the current listing on Toolforge." [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm) [01:48:24] (CR) Legoktm: [C: -1] "See Change-Id: Idd878538973db8efeae8e0ad9d2eb11b55ef6780 as an alternative." [puppet] - https://gerrit.wikimedia.org/r/650215 (https://phabricator.wikimedia.org/T179696) (owner: Ahmon Dancy) [01:49:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:32] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/654463 (https://phabricator.wikimedia.org/T271389) (owner: DannyS712) [02:48:00] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: update codfw1dev deploy from train-buster branch [02:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:48] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: update codfw1dev deploy from train-buster branch (duration: 01m 48s) [02:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:38] (PS1) Andrew Bogott: Horizon: switch from python3.5 to python3.7 [puppet] - https://gerrit.wikimedia.org/r/654728 (https://phabricator.wikimedia.org/T269004) [02:52:44] (CR) Andrew Bogott: [C: +2] Horizon: switch from python3.5 to python3.7 [puppet] - https://gerrit.wikimedia.org/r/654728 (https://phabricator.wikimedia.org/T269004) (owner: Andrew Bogott) [02:57:50] !log andrew@deploy1001 Started deploy [striker/deploy@e120c6c]: update codfw1dev striker [02:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:57:55] !log andrew@deploy1001 Finished deploy [striker/deploy@e120c6c]: update codfw1dev striker (duration: 00m 05s) [02:57:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:18:40] PROBLEM - Host cloudweb2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [03:20:36] RECOVERY - Host cloudweb2001-dev is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [04:11:38] !log andrew@deploy1001 Started deploy [striker/deploy@e4db843]: update codfw1dev striker [04:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:26] !log andrew@deploy1001 Finished deploy [striker/deploy@e4db843]: update codfw1dev striker (duration: 00m 48s) [04:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:00] (PS1) KartikMistry: Enable ContentTranslation in Tsonga Wikipedia as a default tool [mediawiki-config] - https://gerrit.wikimedia.org/r/654734 (https://phabricator.wikimedia.org/T271204) [05:21:48] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:36:48] ops-codfw, DBA, SRE: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (Marostegui) Thanks Moritz. 
@Papaul let me know if you need something else apart from the idrac logs to provide to Dell in order to get a replacement [05:37:35] (PS1) Marostegui: db2140: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/654735 (https://phabricator.wikimedia.org/T271084) [05:43:46] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:43:51] (CR) Marostegui: [C: +2] db2140: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/654735 (https://phabricator.wikimedia.org/T271084) (owner: Marostegui) [05:45:37] (CR) Marostegui: [C: +2] dbctl: Add x2 as a valid section [puppet] - https://gerrit.wikimedia.org/r/654045 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui) [06:29:00] (CR) Muehlenhoff: [C: +2] Enable base::service_auto_restart for Apache on Thanos frontends [puppet] - https://gerrit.wikimedia.org/r/654422 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [06:30:44] (PS1) Muehlenhoff: Extend access for jmads [puppet] - https://gerrit.wikimedia.org/r/654740 [06:32:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [06:34:14] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [06:34:43] (CR) Muehlenhoff: [C: +2] Extend access for jmads [puppet] - https://gerrit.wikimedia.org/r/654740 (owner: Muehlenhoff) [06:38:20] RECOVERY - exim queue on mx1001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [06:44:12] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:44:55] (PS1) Ayounsi: Depool eqsin for router replacement [dns] - https://gerrit.wikimedia.org/r/654742 (https://phabricator.wikimedia.org/T267544) [06:56:38] (PS1) Muehlenhoff: Remove access for dz1 [puppet] - https://gerrit.wikimedia.org/r/654743 [07:02:17] (CR) Muehlenhoff: [C: +2] Remove access for dz1 [puppet] - https://gerrit.wikimedia.org/r/654743 (owner: Muehlenhoff) [07:16:21] (CR) ArielGlenn: [C: +1] "Thanks for this!" 
[puppet] - https://gerrit.wikimedia.org/r/654675 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn) [07:18:34] (CR) Ayounsi: [C: +2] Depool eqsin for router replacement [dns] - https://gerrit.wikimedia.org/r/654742 (https://phabricator.wikimedia.org/T267544) (owner: Ayounsi) [07:19:38] !log depool eqsin for router replacement - T267544 [07:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:42] T267544: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 [07:30:12] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 45.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:32:38] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [07:41:59] (CR) Joal: "One question on inner-parallelism - spark settings looks good" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/654650 (owner: Ottomata) [07:43:19] !log installing libxml2 security updates on buster [07:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:08] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:03:11] (CR) Elukey: [C: +2] eventschemas: Migrate hiera() to lookup() and setting datatype [puppet] - https://gerrit.wikimedia.org/r/654635 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup) [08:12:53] (CR) Elukey: [C: +2] Adjust refine_event memory and parallelism [puppet] - https://gerrit.wikimedia.org/r/654650 (owner: Ottomata) [08:14:07] !log shutdown cr2-eqsin - T267544 [08:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:12] T267544: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 [08:19:09] PROBLEM - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:19:14] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:19:20] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 76, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:19:26] er, please ignore [08:19:50] ack [08:19:56] Ok [08:19:58] site is depooled [08:20:03] I acked the page [08:20:33] RECOVERY - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 1036 bytes in 0.907 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:20:44] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:20:57] it's the healthchecks from icinga were going through cr2-eqsin and ospf didn't failover fast enough [08:25:58] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:22] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f8fa097e4e0: Failed to establish a new connection: [Errno 111] Connection [08:26:22] ://wikitech.wikimedia.org/wiki/Search%23Administration [08:29:15] (CR) Elukey: "To keep archives happy - I had a chat with Joal and we decided to test the new settings since refine jobs were already failing, and re-tun" [puppet] - https://gerrit.wikimedia.org/r/654650 (owner: Ottomata) [08:29:55] [Thu Jan 7 08:22:07 2021] Out of memory: Kill process 25877 (java) score 529 or sacrifice child [08:30:01] this is on logstash1009 --^ :( [08:30:20] running puppet [08:30:58] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: active_shards: 916, delayed_unassigned_shards: 0, unassigned_shards: 0, cluster_name: production-logstash-eqiad, initializing_shards: 0, status: green, number_of_in_flight_fetch: 0, number_of_nodes: 6, number_of_pending_tasks: 0, active_primary_shards: 483, task_max_waiting_in_queue_millis: 0, timed [08:30:58] ve_shards_percent_as_number: 100.0, relocating_shards: 0, number_of_data_nodes: 3 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:32:06] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:16] (PS1) Muehlenhoff: Enable base::service_auto_restart for Apache on xhgui [puppet] - https://gerrit.wikimedia.org/r/654798 (https://phabricator.wikimedia.org/T135991) [08:45:14] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:45:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:45:26] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:48:22] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [08:56:54] new cr2-eqsin is up and running [08:57:13] wiping the old one, then will repool the site a bit later on [08:57:14] (CR) JMeybohm: "Did not check in detail, but we have a library you could probably reuse for interaction with the registry at https://gerrit.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm) [08:57:51] !log re-enable BGP on cr2-eqsin - T267544 [08:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:56] T267544: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 [09:05:54] (PS1) Muehlenhoff: Enable base::service_auto_restart for Apache/Yarn [puppet] - https://gerrit.wikimedia.org/r/654800 (https://phabricator.wikimedia.org/T135991) [09:08:36] (PS1) Muehlenhoff: Enable base::service_auto_restart for Apache/Hue [puppet] - https://gerrit.wikimedia.org/r/654801 (https://phabricator.wikimedia.org/T135991) [09:12:06] (CR) Gehel: [C: -1] "See comments inline." 
(3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles) [09:13:49] (PS1) Ayounsi: Revert "Depool eqsin for router replacement" [dns] - https://gerrit.wikimedia.org/r/654465 [09:14:10] (PS1) Muehlenhoff: Enable base::service_auto_restart for Apache/Hue on testcluster [puppet] - https://gerrit.wikimedia.org/r/654802 (https://phabricator.wikimedia.org/T135991) [09:14:37] (CR) Ayounsi: [C: +2] Revert "Depool eqsin for router replacement" [dns] - https://gerrit.wikimedia.org/r/654465 (owner: Ayounsi) [09:17:21] !log re-pool eqsin - T267544 [09:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:25] T267544: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 [09:17:39] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/654677 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn) [09:18:56] ops-eqsin, DC-Ops, SRE: cr2-eqsin: fan failure - https://phabricator.wikimedia.org/T267544 (ayounsi) Open→Resolved All done. 
[09:21:42] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/654676 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn) [09:23:15] (CR) Elukey: [C: +1] Enable base::service_auto_restart for Apache/Yarn [puppet] - https://gerrit.wikimedia.org/r/654800 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [09:23:38] (CR) Elukey: [C: +1] Enable base::service_auto_restart for Apache/Hue [puppet] - https://gerrit.wikimedia.org/r/654801 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [09:23:50] (CR) Elukey: [C: +1] Enable base::service_auto_restart for Apache/Hue on testcluster [puppet] - https://gerrit.wikimedia.org/r/654802 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [09:24:46] (CR) Ladsgroup: [C: -1] mail::smarthost: hiera -> lookup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/654697 (owner: Dzahn) [09:27:01] !log push new pfw policies - T271384 [09:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:51] (CR) Awight: [C: +1] "A bit surprising to see that CodeMirror is enabled everywhere, with no $wmgUseCodeMirror variable. But the patch looks correct!" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/654633 (https://phabricator.wikimedia.org/T271293) (owner: WMDE-Fisch) [09:28:08] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 46.91 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:28:20] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [09:28:46] (PS2) Volans: Testing CI [software/spicerack] - https://gerrit.wikimedia.org/r/647657 [09:29:10] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 108.52, 102.33, 74.97 https://wikitech.wikimedia.org/wiki/Swift [09:32:02] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:33:36] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 42 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:34:10] (CR) jerkins-bot: [V: -1] Testing CI [software/spicerack] - https://gerrit.wikimedia.org/r/647657 (owner: Volans) [09:34:12] (PS2) WMDE-Fisch: Enable bracket matching on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/654633 (https://phabricator.wikimedia.org/T271293) [09:34:49] (CR) WMDE-Fisch: Enable bracket matching on beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/654633 (https://phabricator.wikimedia.org/T271293) (owner: WMDE-Fisch) [09:35:44] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable bracket matching on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/654633 (https://phabricator.wikimedia.org/T271293) (owner: WMDE-Fisch) [09:36:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:10] !log installing nodejs security updates on buster [09:37:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:34] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:53:00] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:53:04] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:03:50] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:04:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:53] !log bounce apache on prometheus codfw [10:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:03] (CR) Arturo Borrero Gonzalez: "take a look at modules/profile/manifests/wmcs/kubeadm/etcd.pp to see how we usually handle puppet certs and the relationship with the serv" [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: David Caro) [10:16:00] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.backup: Add backup_image command (1 comment) [puppet] - https://gerrit.wikimedia.org/r/654266 (https://phabricator.wikimedia.org/T270478) (owner: David Caro) [10:16:42] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.backup: Add a command to create the next backup [puppet] - https://gerrit.wikimedia.org/r/654220 (https://phabricator.wikimedia.org/T267195) (owner: David Caro) [10:17:15] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.backup: Add host to the rbd snapshot name [puppet] - https://gerrit.wikimedia.org/r/654221 (https://phabricator.wikimedia.org/T267195) (owner: David Caro) [10:41:39] (PS1) Aklapper: Replace Operations with SRE; link to Phabricator user account help [software/klaxon] - https://gerrit.wikimedia.org/r/654810 (https://phabricator.wikimedia.org/T258305) [10:42:46] (CR) Filippo Giunchedi: "PCC is failing for me with the latest PS: https://puppet-compiler.wmflabs.org/compiler1002/27364/" [puppet] - https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) (owner: Cwhite) [10:44:55] (CR) Filippo Giunchedi: "Looks like the change worked as expected" [puppet] - https://gerrit.wikimedia.org/r/654446 (owner: Jbond) [10:51:12] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435 [10:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:17] T268435: Add ms-be106[0-3] to swift - 
https://phabricator.wikimedia.org/T268435 [10:53:56] (CR) Jbond: [C: +2] apt_install_audit: add the apt-audit-installed script by default [puppet] - https://gerrit.wikimedia.org/r/654445 (owner: Jbond) [11:00:05] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210107T1100). [11:06:15] (CR) Jbond: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/654446 (owner: Jbond) [11:06:24] (Abandoned) Jbond: Revert "pupetlabs-lvm: update lvm module with latest upstream" [puppet] - https://gerrit.wikimedia.org/r/654446 (owner: Jbond) [11:09:00] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 128.12, 97.63, 100.35 https://wikitech.wikimedia.org/wiki/Swift [11:15:51] (CR) David Caro: wmcs.backup: Add backup_image command (1 comment) [puppet] - https://gerrit.wikimedia.org/r/654266 (https://phabricator.wikimedia.org/T270478) (owner: David Caro) [11:21:22] (PS1) Jbond: idp: add grafana.wikimedia.org as a validservice_id [puppet] - https://gerrit.wikimedia.org/r/654811 (https://phabricator.wikimedia.org/T269272) [11:21:58] (CR) Jbond: [C: +2] idp: add grafana.wikimedia.org as a validservice_id [puppet] - https://gerrit.wikimedia.org/r/654811 (https://phabricator.wikimedia.org/T269272) (owner: Jbond) [11:22:52] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 82.86, 136.10, 124.19 https://wikitech.wikimedia.org/wiki/Swift [11:23:12] <_joe_> lemme take a look [11:23:43] <_joe_> heh already gone [11:24:37] <_joe_> I see a filesystem was unmounted on ms-be2055 [11:24:38] _joe_: expected, I was poking at T271055 but no joy so far [11:24:39] T271055: Degraded RAID on ms-be2055 - https://phabricator.wikimedia.org/T271055 [11:24:49] <_joe_> yeah I just saw it [11:25:03] will have to wait next week for 
pa.paul to be onsite [11:27:25] (CR) Arturo Borrero Gonzalez: [DONT MERGE] cloud: expand dmz_cidr list for public endpoints [puppet] - https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez) [11:30:30] (PS2) Arturo Borrero Gonzalez: cloud: expand dmz_cidr list for public endpoints [puppet] - https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) [11:30:52] (CR) Ayounsi: [C: +1] "Checked the syntax is coherent and rules for critical IPs (NFS, text, etc) are present." [puppet] - https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez) [11:32:04] RECOVERY - very high load average likely xfs on ms-be2055 is OK: OK - load average: 9.61, 29.96, 73.13 https://wikitech.wikimedia.org/wiki/Swift [11:33:07] (PS1) Volans: raid handler: fix renamed Phabricator's tag [puppet] - https://gerrit.wikimedia.org/r/654812 (https://phabricator.wikimedia.org/T258305) [11:34:39] (CR) David Caro: cloud: expand dmz_cidr list for public endpoints (1 comment) [puppet] - https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez) [11:34:57] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: expand dmz_cidr list for public endpoints [puppet] - https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez) [11:38:55] (PS5) David Caro: wmcs.backup: Add backup_image command [puppet] - https://gerrit.wikimedia.org/r/654266 (https://phabricator.wikimedia.org/T270478) [11:40:52] (PS1) Jbond: P:grafana: Update CAS config to authenticate users on the correct vhost [puppet] - https://gerrit.wikimedia.org/r/654813 (https://phabricator.wikimedia.org/T269272) [13:28:25] (PS1) Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - 
https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082) [13:31:12] (CR) Ayounsi: [C: -1] "1 comment, otherwise lgtm" (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/654823 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez) [13:39:07] (PS1) Klausman: install_server: Try fixing ml-serve recipe by including mdX devices [puppet] - https://gerrit.wikimedia.org/r/654828 [13:43:12] (CR) Klausman: [C: +2] install_server: Try fixing ml-serve recipe by including mdX devices [puppet] - https://gerrit.wikimedia.org/r/654828 (owner: Klausman) [13:46:08] (PS1) Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: fix port for contint2001 This only requires tcp/9418 per puppet configuration. Bug: T209082 Signed-off-by: Arturo Borrero Gonzalez Change-Id: Ic26e387c9a8ca54bed4317932d380ffb12188f6d [homer/public] - https://gerrit.wikimedia.org/r/654830 (https://phabricator.wikimedia.org/T209082) [13:47:04] (PS2) Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: seperate dumps service ACL to a different term [homer/public] - https://gerrit.wikimedia.org/r/654823 (https://phabricator.wikimedia.org/T209082) [13:49:00] (PS2) Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082) [13:49:02] (PS2) Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: fix port for contint2001 [homer/public] - https://gerrit.wikimedia.org/r/654830 (https://phabricator.wikimedia.org/T209082) [13:50:43] (PS3) Arturo Borrero Gonzalez: cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082) [13:54:12] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=DELETE 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:00:05] longma and hashar: May I have your attention please! Mediawiki train - American+European Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210107T1400) [14:00:22] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:08:12] !log installing sqlite3 security updates on buster [14:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:30] (03PS1) 10JMeybohm: Revert "admin_ng: Set global calico version to 3.17.1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/654846 [14:11:44] (03PS1) 10JMeybohm: Revert "Update to v3.17.1" [debs/calico] - 10https://gerrit.wikimedia.org/r/654847 [14:21:58] (03CR) 10JMeybohm: [C: 03+2] Revert "admin_ng: Set global calico version to 3.17.1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/654846 (owner: 10JMeybohm) [14:22:06] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Revert "Update to v3.17.1" [debs/calico] - 10https://gerrit.wikimedia.org/r/654847 (owner: 10JMeybohm) [14:23:24] (03Merged) 10jenkins-bot: Revert "admin_ng: Set global calico version to 3.17.1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/654846 (owner: 10JMeybohm) [14:32:12] !log jayme@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:32] !log imported calico 3.17.0-2 to component/calico-future stretch-wikimedia [14:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:07] !log jayme@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:47] !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm [14:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:29] 10ops-codfw, 10DC-Ops, 10SRE: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10Jgreen) a:05Jgreen→03Papaul Something seems to have gone awry again with the DRAC. I thought I was able to get in yesterday after you reset it, but now it's... [14:42:43] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: refresh labweb1002 with buster-ready wheels [14:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:46] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: refresh labweb1002 with buster-ready wheels (duration: 00m 04s) [14:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:20] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: refresh labweb1002 with buster-ready wheels [14:43:51] andrew@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[14:46:59] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: refresh labweb1002 with buster-ready wheels (duration: 03m 39s) [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:16] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: refresh cloudweb2001-dev [14:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:38] !log kormat@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [14:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:02] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:51:09] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: refresh cloudweb2001-dev (duration: 01m 53s) [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:46] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: refresh labweb1002 with buster-ready wheels [14:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:50] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: refresh labweb1002 with buster-ready wheels (duration: 00m 04s) [14:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:02] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: refresh labweb1002 with buster-ready wheels [14:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:50] (03CR) 10CDanis: [C: 03+2] "thanks!" 
[software/klaxon] - 10https://gerrit.wikimedia.org/r/654810 (https://phabricator.wikimedia.org/T258305) (owner: 10Aklapper) [14:54:06] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: refresh labweb1002 with buster-ready wheels (duration: 02m 05s) [14:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:48] (03PS1) 10Elukey: admin: move user kharlan from 'researchers' to 'analytics-privatedata-users' [puppet] - 10https://gerrit.wikimedia.org/r/654871 (https://phabricator.wikimedia.org/T268801) [14:54:50] !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm [14:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:53] !log kormat@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [14:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:59] (03Merged) 10jenkins-bot: Replace Operations with SRE; link to Phabricator user account help [software/klaxon] - 10https://gerrit.wikimedia.org/r/654810 (https://phabricator.wikimedia.org/T258305) (owner: 10Aklapper) [14:56:10] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [14:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:33] !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm [14:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:14] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host (duration: 02m 04s) [14:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:21] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [14:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:25] !log 
andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host (duration: 00m 03s) [14:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:42] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:59:04] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [14:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:11] (03CR) 10Ottomata: Adjust refine_event memory and parallelism (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654650 (owner: 10Ottomata) [15:01:30] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host (duration: 02m 25s) [15:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:15] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [15:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:33] (03PS1) 10Elukey: admin: set user mattflaschen in ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654873 (https://phabricator.wikimedia.org/T268801) [15:04:05] -1 coming since I am a n00b [15:04:05] (03CR) 10jerkins-bot: [V: 04-1] admin: set user mattflaschen in ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654873 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:04:16] (03PS2) 10Elukey: admin: set user mattflaschen in ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654873 (https://phabricator.wikimedia.org/T268801) [15:06:49] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to 
debug a compression error that doesn't happen on the test host (duration: 03m 34s) [15:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:30] (03PS1) 10Elukey: admin: set user 'daisy' in ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654874 (https://phabricator.wikimedia.org/T268801) [15:09:05] !log installing libmaxminddb security updates on stretch [15:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:10:31] !log kormat@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:11:20] (03PS1) 10Muehlenhoff: Add library hint for libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/654876 [15:11:26] (03PS1) 10Elukey: admin: set user 'etonkovidova' to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654877 (https://phabricator.wikimedia.org/T268801) [15:11:56] !log kormat@cumin1001 START - Cookbook sre.dns.netbox [15:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654877 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:14:50] (03PS1) 10MSantos: push-notifications: pass x-fowarded-proto https in header [deployment-charts] - 10https://gerrit.wikimedia.org/r/654878 [15:14:58] !log kormat@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654874 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:16:38] (03PS1) 10Elukey: admin: set user 'risler' to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654880 (https://phabricator.wikimedia.org/T268801) [15:17:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, if there's no reply at all over the next weeks, we can also consider to retire the NDA entirey (and also ditch cn=nda)." 
[puppet] - 10https://gerrit.wikimedia.org/r/654873 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:17:29] (03PS1) 10Kormat: install_server: Update mac address for d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/654881 [15:17:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654871 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:18:27] (03CR) 10Klausman: [C: 03+2] install_server: Update mac address for d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/654881 (owner: 10Kormat) [15:19:04] (03PS2) 10MSantos: push-notifications: pass x-fowarded-proto https in header [deployment-charts] - 10https://gerrit.wikimedia.org/r/654878 [15:19:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654880 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:19:38] (03PS1) 10Elukey: admin: absent user 'jdittrich' [puppet] - 10https://gerrit.wikimedia.org/r/654882 (https://phabricator.wikimedia.org/T268801) [15:20:14] moritzm: another one and I am done I promise, thanks :D [15:23:11] (03PS1) 10Elukey: admin: set user 'debt' to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654884 (https://phabricator.wikimedia.org/T268801) [15:23:20] aaaand done [15:23:56] keep them coming :-) [15:30:50] (03CR) 10Muehlenhoff: [C: 03+1] "This is a bit of an oddball: The user still has NDA-sensitive LDAP access, but under a different username (wmde-jand), I suppose jdittrich" [puppet] - 10https://gerrit.wikimedia.org/r/654882 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:32:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654884 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:32:43] all done, happy merging :-) [15:33:27] \o/ [15:33:36] thanks for the patience [15:36:28] (03CR) 10Elukey: [C: 03+2] admin: move user kharlan from 'researchers' to 'analytics-privatedata-users' [puppet] - 
10https://gerrit.wikimedia.org/r/654871 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:36:38] (03CR) 10Elukey: [C: 03+2] admin: set user mattflaschen in ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654873 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:36:44] (03PS3) 10Elukey: admin: set user mattflaschen in ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654873 (https://phabricator.wikimedia.org/T268801) [15:37:53] (03PS2) 10Elukey: admin: set user 'daisy' in ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654874 (https://phabricator.wikimedia.org/T268801) [15:37:54] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad total VRPs alert, total VRPs alert, valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [15:38:46] (03CR) 10Elukey: [C: 03+2] admin: set user 'daisy' in ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654874 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:38:58] (03PS2) 10Elukey: admin: set user 'etonkovidova' to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654877 (https://phabricator.wikimedia.org/T268801) [15:40:26] (03CR) 10Elukey: [C: 03+2] admin: set user 'etonkovidova' to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654877 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:40:39] (03PS2) 10Elukey: admin: set user 'risler' to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654880 (https://phabricator.wikimedia.org/T268801) [15:41:15] (03CR) 10Elukey: [C: 03+2] admin: set user 'risler' to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654880 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:42:10] (03PS2) 10Elukey: admin: absent user 'jdittrich' [puppet] - 10https://gerrit.wikimedia.org/r/654882 
(https://phabricator.wikimedia.org/T268801) [15:42:35] a little more spam and I am done I promise [15:42:46] (03CR) 10Elukey: [C: 03+2] admin: absent user 'jdittrich' [puppet] - 10https://gerrit.wikimedia.org/r/654882 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:42:59] (03CR) 10Lars Wirzenius: "I'm afraid I don't understand Puppet or our operations/puppet repository to review this. The configuration change to turn on syslog loggin" [puppet] - 10https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) (owner: 10Filippo Giunchedi) [15:43:02] (03PS2) 10Elukey: admin: set user 'debt' to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654884 (https://phabricator.wikimedia.org/T268801) [15:44:01] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on labweb1002.wikimedia.org with reason: REIMAGE [15:44:03] (03CR) 10Elukey: [C: 03+2] admin: set user 'debt' to ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/654884 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [15:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:57] (03PS1) 10Volans: type hints: mark the package as type hinted [software/homer] - 10https://gerrit.wikimedia.org/r/654889 [15:45:11] (03PS2) 10Elukey: Remove analytics-users from various analytics and sre configs [puppet] - 10https://gerrit.wikimedia.org/r/654271 (https://phabricator.wikimedia.org/T269150) [15:45:50] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb2001-dev.wikimedia.org with reason: REIMAGE [15:45:50] (03CR) 10Elukey: [C: 03+2] Remove analytics-users from various analytics and sre configs [puppet] - 10https://gerrit.wikimedia.org/r/654271 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey) [15:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
labweb1002.wikimedia.org with reason: REIMAGE [15:46:02] (03CR) 10Ayounsi: [C: 03+1] type hints: mark the package as type hinted [software/homer] - 10https://gerrit.wikimedia.org/r/654889 (owner: 10Volans) [15:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:59] (03CR) 10Ayounsi: [C: 03+1] cr/firewall.cf: cloud-in4: seperate gerrit service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654826 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [15:48:04] (03CR) 10Ayounsi: [C: 03+1] cr/firewall.cf: cloud-in4: seperate dumps service ACL to a different term [homer/public] - 10https://gerrit.wikimedia.org/r/654823 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [15:48:11] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb2001-dev.wikimedia.org with reason: REIMAGE [15:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:01] (03CR) 10Volans: [C: 03+2] type hints: mark the package as type hinted [software/homer] - 10https://gerrit.wikimedia.org/r/654889 (owner: 10Volans) [15:51:27] (03Merged) 10jenkins-bot: type hints: mark the package as type hinted [software/homer] - 10https://gerrit.wikimedia.org/r/654889 (owner: 10Volans) [15:53:07] !log installing xorg-server security updates on stretch [15:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:18] (03PS1) 10Bstorm: wikireplicas: fix up wmf-pt-kill service on multiinstance replicas [puppet] - 10https://gerrit.wikimedia.org/r/654890 (https://phabricator.wikimedia.org/T260511) [15:57:24] (03CR) 10Ayounsi: [C: 04-1] "I think the rule with port 22 is still needed. 
I created it as there was flows from contint TO cloud VMs on port 22, this is to allow the " [homer/public] - 10https://gerrit.wikimedia.org/r/654830 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [15:59:49] (03CR) 10Bstorm: "PCC is encouraging https://puppet-compiler.wmflabs.org/compiler1003/27367/clouddb1014.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/654890 (https://phabricator.wikimedia.org/T260511) (owner: 10Bstorm) [15:59:53] (03PS1) 10Elukey: Remove 'reseachers' and 'gpu-testers' posix group from Analytics cfgs [puppet] - 10https://gerrit.wikimedia.org/r/654892 (https://phabricator.wikimedia.org/T268801) [16:01:57] (03CR) 10Elukey: "Adding also Moritz since I changed some sre-related scripts :)" [puppet] - 10https://gerrit.wikimedia.org/r/654892 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [16:03:47] (03CR) 10Bstorm: "From that PCC of the wmf-pt-kill@s2 service:" [puppet] - 10https://gerrit.wikimedia.org/r/654890 (https://phabricator.wikimedia.org/T260511) (owner: 10Bstorm) [16:16:04] !log installing xerces-c security updates on Buster [16:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/654892 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [16:19:40] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.2.6 [software/homer] - 10https://gerrit.wikimedia.org/r/654896 [16:22:43] (03CR) 10Elukey: [C: 03+2] Remove 'reseachers' and 'gpu-testers' posix group from Analytics cfgs [puppet] - 10https://gerrit.wikimedia.org/r/654892 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [16:23:47] (03CR) 10Ayounsi: [C: 03+1] "Can't wait to use those new features!" 
[software/homer] - 10https://gerrit.wikimedia.org/r/654896 (owner: 10Volans) [16:24:37] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.2.6 [software/homer] - 10https://gerrit.wikimedia.org/r/654896 (owner: 10Volans) [16:29:53] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.2.6 [software/homer] - 10https://gerrit.wikimedia.org/r/654896 (owner: 10Volans) [16:31:19] (03PS1) 10David Caro: wmcs.backups: replaces the image script with new one [puppet] - 10https://gerrit.wikimedia.org/r/654898 (https://phabricator.wikimedia.org/T270478) [16:32:58] (03CR) 10jerkins-bot: [V: 04-1] wmcs.backups: replaces the image script with new one [puppet] - 10https://gerrit.wikimedia.org/r/654898 (https://phabricator.wikimedia.org/T270478) (owner: 10David Caro) [16:33:52] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:43:47] (03PS7) 10Cwhite: profile: add priority to logstash filter filenames [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) [16:44:04] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [16:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:12] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host (duration: 00m 08s) [16:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:28] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [16:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:17] (03PS1) 10Volans: Upstream release v0.2.6 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/654900 [16:46:33] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host (duration: 02m 05s) [16:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:03] (03PS8) 10Cwhite: profile: add priority to logstash filter filenames [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) [16:47:06] (03PS2) 10David Caro: wmcs.backups: replaces the image script with new one [puppet] - 10https://gerrit.wikimedia.org/r/654898 (https://phabricator.wikimedia.org/T270478) [16:47:22] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [16:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:29] (03CR) 10Ayounsi: [C: 03+1] "good numbers." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/654900 (owner: 10Volans) [16:48:47] (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v0.2.6 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/654900 (owner: 10Volans) [16:50:13] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host (duration: 02m 50s) [16:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:52] (03PS9) 10Cwhite: profile: add priority to logstash filter filenames [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) [16:54:03] (03PS10) 10Cwhite: profile: add priority to logstash filter filenames [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) [16:54:32] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 54.3 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:55:52] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [16:56:09] (03CR) 10Cwhite: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) (owner: 10Cwhite) [16:58:42] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 85.12 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:59:00] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [16:59:16] 10ops-codfw, 10DBA, 10SRE: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (10Papaul) Create Dispatch: Success You have successfully submitted request SR1048216249. [17:00:05] jbond42 and cdanis: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210107T1700). 
[17:07:52] (03CR) 10Mstyles: update flink logging (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [17:39:59] 10SRE, 10Wikidata, 10Wikidata Query UI, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) ` [deploy1001:~] $ curl -H "Host: query.wikidata.org" https://webserver-misc-apps.discovery.wmnet/custom-config.json { "api": { "sparql": { "uri": "/sparql"... [17:41:23] 10SRE, 10Wikidata, 10Wikidata Query UI, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) >>! In T266702#6727727, @Addshore wrote: > @dzahn is there any way for us to be able to force an update rather than wait for puppet? Not without shell access to the mis... [17:48:35] (03CR) 10Herron: profile: add priority to logstash filter filenames (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) (owner: 10Cwhite) [17:49:58] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [17:50:52] (03PS1) 10Effie Mouzeli: hiera: upgrade mc1027, mc2027 to buster [puppet] - 10https://gerrit.wikimedia.org/r/654908 (https://phabricator.wikimedia.org/T213089) [17:53:50] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27372/netbox-dev2001.wikimedia.org/index.html (the removal of the resources like that is" [puppet] - 10https://gerrit.wikimedia.org/r/654676 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:00:04] chrisalbon and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210107T1800). 
[18:04:55] 10SRE, 10Project-Admins, 10Patch-For-Review: Rename #Operations Phab project to #SRE - https://phabricator.wikimedia.org/T258305 (10Legoktm) >>! In T258305#6727823, @Aklapper wrote: > Thanks! I'm a bit surprised wikibugs relies on primary tags (which can change at any time) instead of project PHIDs (which ar... [18:05:50] 10SRE, 10Project-Admins, 10Patch-For-Review: Rename #Operations Phab project to #SRE - https://phabricator.wikimedia.org/T258305 (10Legoktm) [18:06:35] 10SRE, 10Patch-For-Review: Adapt IRC notifications for #wikimedia-operations to reflect Operations->SRE rename in Phabricator - https://phabricator.wikimedia.org/T271405 (10Legoktm) 05Open→03Resolved a:03Aklapper [18:29:58] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: upgrade mc1027, mc2027 to buster [puppet] - 10https://gerrit.wikimedia.org/r/654908 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [18:30:32] !log volans@deploy1001 Started deploy [homer/deploy@fe7acbc]: Release v0.2.6 [18:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:54] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "thank you. compiler output is as expected: https://puppet-compiler.wmflabs.org/compiler1002/27373/cumin2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/654677 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:31:29] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2027.codfw.wmnet ` The log can be found i... 
[18:31:42] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1027.eqiad.wmnet ` The log can be found i... [18:34:57] !log volans@deploy1001 Finished deploy [homer/deploy@fe7acbc]: Release v0.2.6 (duration: 04m 25s) [18:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:47] (03CR) 10Dzahn: "noop on cumin hosts" [puppet] - 10https://gerrit.wikimedia.org/r/654677 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:36:35] (03CR) 10Dzahn: [V: 03+1 C: 03+2] netbox: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/654676 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:38:21] (03CR) 10Dzahn: "noop on netbox1001" [puppet] - 10https://gerrit.wikimedia.org/r/654676 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:38:46] (03PS2) 10Dzahn: snapshot: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/654675 (https://phabricator.wikimedia.org/T266479) [18:39:24] (03CR) 10jerkins-bot: [V: 04-1] snapshot: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/654675 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:41:06] (03PS3) 10Dzahn: snapshot: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/654675 (https://phabricator.wikimedia.org/T266479) [18:45:28] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1027.eqiad.wmnet with reason: REIMAGE [18:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:47] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27374/" [puppet] - 10https://gerrit.wikimedia.org/r/654675 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) 
[18:45:52] (03PS1) 10Dzahn: dumps: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/654911 (https://phabricator.wikimedia.org/T266479) [18:46:55] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2027.codfw.wmnet with reason: REIMAGE [18:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:30] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1027.eqiad.wmnet with reason: REIMAGE [18:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:33] (03CR) 10Dzahn: "noop on snapshot1008" [puppet] - 10https://gerrit.wikimedia.org/r/654675 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:47:41] (03CR) 10Ryan Kemper: [C: 03+2] [wdqs] re-enable polling kafka for updates on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/646631 (https://phabricator.wikimedia.org/T267175) (owner: 10DCausse) [18:49:01] (03CR) 10Bstorm: [C: 03+2] wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [18:49:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2027.codfw.wmnet with reason: REIMAGE [18:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:44] ryankemper: ok to merge your patch? [18:51:03] bstorm: yes, go ahead [18:51:11] done! 
[18:51:15] (sorry, was rebasing a followup patch and hadn't fetched in awhile :P) [18:51:22] thanks [18:52:39] (03PS2) 10Ryan Kemper: Revert "wdqs: use RecentChanges API for updates on all WDQS servers" [puppet] - 10https://gerrit.wikimedia.org/r/646632 (https://phabricator.wikimedia.org/T267175) (owner: 10DCausse) [18:54:31] (03PS1) 10Dzahn: sentry: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) [18:54:41] (03CR) 10Ryan Kemper: [C: 03+2] Revert "wdqs: use RecentChanges API for updates on all WDQS servers" [puppet] - 10https://gerrit.wikimedia.org/r/646632 (https://phabricator.wikimedia.org/T267175) (owner: 10DCausse) [18:55:14] (03CR) 10jerkins-bot: [V: 04-1] sentry: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:56:22] (03CR) 10Dzahn: [C: 03+1] Enable base::service_auto_restart for Apache on xhgui [puppet] - 10https://gerrit.wikimedia.org/r/654798 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:56:34] (03CR) 10Ryan Kemper: "Just realized I rebased incorrectly (to resolve merge I set the default for use_kafka_for_updates to false instead of true. Opening up a n" [puppet] - 10https://gerrit.wikimedia.org/r/646632 (https://phabricator.wikimedia.org/T267175) (owner: 10DCausse) [18:57:35] (03CR) 10Dave Pifke: [C: 03+1] "LGTM. Should I submit a similar patch for our other Apache instances?" 
[puppet] - 10https://gerrit.wikimedia.org/r/654798 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:57:57] (03PS1) 10Ryan Kemper: wdqs: default to using kafka for updates [puppet] - 10https://gerrit.wikimedia.org/r/654914 (https://phabricator.wikimedia.org/T267175) [18:57:59] (03PS1) 10Herron: mailman: set Czech language to iso-8859-2 [puppet] - 10https://gerrit.wikimedia.org/r/654915 (https://phabricator.wikimedia.org/T271123) [18:58:16] (03PS2) 10Dzahn: sentry: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) [18:58:20] (03CR) 10Ryan Kemper: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/654914 is the new patch" [puppet] - 10https://gerrit.wikimedia.org/r/646632 (https://phabricator.wikimedia.org/T267175) (owner: 10DCausse) [18:58:51] (03CR) 10Dzahn: [C: 03+2] Enable base::service_auto_restart for Apache on xhgui [puppet] - 10https://gerrit.wikimedia.org/r/654798 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:59:37] 10SRE, 10Project-Admins, 10Patch-For-Review: Rename #Operations Phab project to #SRE - https://phabricator.wikimedia.org/T258305 (10Aklapper) In theory, it's writing `["that project tag"]` into the `names` field on https://phabricator.wikimedia.org/conduit/method/project.query/ and then taking the `phid` val... [18:59:41] 10SRE, 10Wikimedia-Mailing-lists, 10I18n, 10Patch-For-Review: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (10herron) >>! In T271123#6729318, @gerritbot wrote: > https://gerrit.wikimedia.org/r/654915 In my testing this corrected... 
[18:59:59] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: default to using kafka for updates [puppet] - 10https://gerrit.wikimedia.org/r/654914 (https://phabricator.wikimedia.org/T267175) (owner: 10Ryan Kemper) [19:00:02] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) a:05Cmjohnson→03RobH @robh can you complete the off-site work for an-worker1118-1138. Still needs dhcpd file updated and maybe netboot.cfg.... [19:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210107T1900). [19:00:05] DannyS712 and nray: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:25] I can deploy today [19:00:48] nray: hi, around? [19:00:53] (03CR) 10Urbanecm: [C: 03+2] Use DisabledSpecialPage to disable ItemDisambiguation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654463 (https://phabricator.wikimedia.org/T271389) (owner: 10DannyS712) [19:00:55] (03CR) 10Dzahn: "applied on xhgui1001 - puppet created the systemd timer and wmf_auto_restart_apache2.service" [puppet] - 10https://gerrit.wikimedia.org/r/654798 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [19:00:58] o/ yes I'm here! [19:01:06] great! [19:01:12] thank you Urbanecm [19:01:32] (03CR) 10Urbanecm: [C: 03+2] Revert "Provide native support to dismiss sitenotice in core." 
[core] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654461 (https://phabricator.wikimedia.org/T271365) (owner: 10Krinkle) [19:01:43] (03Merged) 10jenkins-bot: Use DisabledSpecialPage to disable ItemDisambiguation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654463 (https://phabricator.wikimedia.org/T271389) (owner: 10DannyS712) [19:01:46] nray: are you able to test the patch please? [19:01:53] yes [19:02:12] great, I'll ping you once ready [19:02:19] cool [19:03:13] 10SRE, 10ops-eqiad, 10DC-Ops: frdev1001 ILO inaccessible - https://phabricator.wikimedia.org/T267969 (10Cmjohnson) @jgreen I swapped the mgmt cable, if that does not fix the issue then I will need to power this server off for a few minutes to do a hard reset. [19:03:15] (03CR) 10Ryan Kemper: [C: 03+2] [wdqs] add jmx_wdqs_streaming_updater prometheus job [puppet] - 10https://gerrit.wikimedia.org/r/649827 (https://phabricator.wikimedia.org/T266986) (owner: 10DCausse) [19:05:57] !log urbanecm@deploy1001 Synchronized wmf-config/Wikibase.php: 90f98c6a049c69b70ab9cb78eb986f1ecf4ffc9b: Use DisabledSpecialPage to disable ItemDisambiguation (T271389) (duration: 01m 08s) [19:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:02] T271389: Use DisabledSpecialPage for Special:ItemDisambiguation on wikidata - https://phabricator.wikimedia.org/T271389 [19:06:33] (03PS2) 10Dzahn: mail::smarthost: hiera -> lookup [puppet] - 10https://gerrit.wikimedia.org/r/654697 [19:06:36] (03CR) 10Dzahn: mail::smarthost: hiera -> lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn) [19:06:54] (03PS1) 10Urbanecm: throttle: Cleanup outdated rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654916 [19:07:08] (03CR) 10Urbanecm: [C: 03+2] throttle: Cleanup outdated rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654916 (owner: 10Urbanecm) [19:08:01] (03Merged) 10jenkins-bot: throttle: Cleanup outdated rules [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/654916 (owner: 10Urbanecm) [19:08:33] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27376/console" [puppet] - 10https://gerrit.wikimedia.org/r/649827 (https://phabricator.wikimedia.org/T266986) (owner: 10DCausse) [19:09:28] (03CR) 10Cwhite: profile: add priority to logstash filter filenames (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) (owner: 10Cwhite) [19:10:03] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: 8a849d90277b1e13154e87d812d64efc3a99c00a: throttle: Cleanup outdated rules (duration: 01m 06s) [19:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:46] RECOVERY - HP RAID on ms-be1019 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:14:02] (03PS1) 10Ryan Kemper: cirrus: bump es shard size alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/654917 (https://phabricator.wikimedia.org/T265908) [19:17:42] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2027.codfw.wmnet'] ` and were **ALL** successful. [19:17:45] (03CR) 10Dzahn: ""No hosts found matching `C:sentry::packages` unable to do anything" and jessie reference? Is this still used?" 
[puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:17:48] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27378/console" [puppet] - 10https://gerrit.wikimedia.org/r/654917 (https://phabricator.wikimedia.org/T265908) (owner: 10Ryan Kemper) [19:18:28] (03PS4) 10Dzahn: mediawiki: remove mongodb PHP extension from appservers [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) [19:18:33] (03CR) 10Ryan Kemper: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/654307 (https://phabricator.wikimedia.org/T271161) (owner: 10Ryan Kemper) [19:18:53] (03Abandoned) 10Ryan Kemper: remove chelsyx' admin access [puppet] - 10https://gerrit.wikimedia.org/r/654307 (https://phabricator.wikimedia.org/T271161) (owner: 10Ryan Kemper) [19:20:41] (03CR) 10Gergő Tisza: "I forgot completely this existed. We used it for some WM Cloud project - sentry? deployment-prep? not sure but neither of those use it any" [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:23:15] (03CR) 10Dzahn: "Ah, yes, see this:" [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:23:44] (03CR) 10Dzahn: "thank you, I will "recycle" this patch to delete the module" [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:24:51] (03Merged) 10jenkins-bot: Revert "Provide native support to dismiss sitenotice in core." 
[core] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654461 (https://phabricator.wikimedia.org/T271365) (owner: 10Krinkle) [19:25:13] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1027.eqiad.wmnet'] ` and were **ALL** successful. [19:26:16] 10SRE, 10ops-eqiad, 10Analytics, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10Cmjohnson) The server is out of warranty but there should be some disks on-site I can use. In the past, anytime /dev/sda goes bad a re-image needs to happen. Let me know if you... [19:27:40] 10SRE, 10ops-eqiad, 10Analytics, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10elukey) >>! In T270768#6729455, @Cmjohnson wrote: > The server is out of warranty but there should be some disks on-site I can use. In the past, anytime /dev/sda goes bad a re-... [19:31:54] ah, finally got merged nray [19:32:01] \o/ [19:32:14] pulling to mwdebug1001 [19:32:44] nray: pulled to mwdebug1001, please test [19:32:52] awesome, testing now [19:33:02] 10SRE, 10fundraising-tech-ops: hw troubleshooting: Illegal opcode error on boot for frdb1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T271284 (10Cmjohnson) a:05Cmjohnson→03Dwisehaupt This server is well out of warranty but a quick google check because I have never seen an Illegal opcode error... [19:33:16] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654917 (https://phabricator.wikimedia.org/T265908) (owner: 10Ryan Kemper) [19:34:13] (03CR) 10Legoktm: [C: 04-1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [19:34:25] Urbanecm: Looks great, you may proceed! 
[19:34:31] 10SRE, 10ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (10Cmjohnson) @Jgreen We need to schedule this, How does Monday at 10am local work for you? [19:34:33] excellent, thanks nray [19:35:42] (03PS3) 10Dzahn: sentry: delete module and hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) [19:35:54] 10SRE, 10ops-eqiad, 10SRE-swift-storage: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T270806 (10Cmjohnson) 05Open→03Resolved @fgiunchedi We do not, loads of 3TB. I will close the task. [19:36:59] (03CR) 10Dzahn: "Alright, can I get a +1 for this? Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:37:28] 10SRE, 10ops-eqiad, 10Traffic: Interface errors on asw2-a-eqiad:xe-4/0/7 (lvs1016) - https://phabricator.wikimedia.org/T271087 (10Cmjohnson) @bblack can we schedule this on Monday? 1530/1600UTC [19:39:03] (03CR) 10Dzahn: "the part that sentry does not appear on https://openstack-browser.toolforge.org/puppetclass/ should proof that it's not used in cloud" [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:39:11] nray: seems I can't sync it all at once. Can you recommend a good order to sync the files in? [19:40:54] actually... this shows when i try to sync it https://www.irccloud.com/pastebin/DEutTyle/ [19:41:02] (03CR) 10Dzahn: "yep, also in Horizon in deploment-prep can't see an instance called deployment-sentry01" [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:41:31] 10SRE, 10ops-eqiad, 10DC-Ops: frdev1001 ILO inaccessible - https://phabricator.wikimedia.org/T267969 (10Jgreen) >>! 
In T267969#6729346, @Cmjohnson wrote: > @jgreen I swapped the mgmt cable, if that does not fix the issue then I will need to power this server off for a few minutes to do a hard reset. Still... [19:41:39] and when i run that find command...: https://www.irccloud.com/pastebin/kDR0cbMW/ [19:41:50] what the hell does `Fatal error: Cannot redeclare wikidiff2_do_diff() in /srv/mediawiki-staging/php-1.36.0-wmf.25/.phan/internal_stubs/wikidiff.php on line 29` mean? [19:42:38] RoanKattouw or James_F if some of you is around, can you help? [19:42:49] hmmm, I'm not sure and am not super familiar with this patch so maybe we need to hold off [19:43:15] ack nray [19:43:39] Urbanecm: Sounds like it's either double-loading the stub or it's got the actual code and so the stub is unnecessary? Daimona is the phan ultra-expert. [19:44:23] James_F: I can't find wikidiff2 in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/654461, which is the (UBN) patch I'm trying to deploy [19:44:35] WTH? [19:44:44] It's very odd. [19:44:52] Why is it loading the stub in the first place? [19:45:21] Stubs are only meant to be used by phan, it's not even valid PHP code (at runtime), so they really shouldn't be loaded [19:45:23] Yeah, scap really shouldn't ever touch this. [19:45:34] James_F: at least i'm not the only one looking at it confusingly :) [19:46:11] (03CR) 10Dzahn: "interesting.. compiler says it's only a change on debug host" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [19:46:25] so, if i understand the find magic correctly, scap tries to verify syntax is correct in all *.php files, but stub files aren't valid PHP, so it fails [19:46:28] Ah now that I read that command, it should probably exclude the .phan folder [19:47:10] Urbanecm: not exactly because they're invalid (php -l will pass, but you'd get runtime errors, e.g. 
a return typehint on a function that doesn't return anything), but because the class is effectively being declared twice [19:47:21] got it [19:47:26] Does somebody know what that find command used to look like? [19:47:35] (I can't help much right now) [19:47:41] https://gerrit.wikimedia.org/g/mediawiki/tools/scap/+/master/scap/lint.py#41 [19:48:01] Code hasn't changed since 2017. [19:48:19] Is this just a race condition inside scap? [19:48:39] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [19:48:53] James_F: I tried to run the scap multiple times, failed all the times [19:49:02] Very odd. [19:49:07] maybe it's because I'm trying to scap sync-file whole php-1.36.0-wmf.25 ? [19:49:14] Oh, wait. [19:49:17] You can't do that. [19:49:20] aha [19:49:33] That's totally not a supported scap target for live code. [19:49:44] And this explains why we didn't see this before. [19:50:03] okay, thanks then [19:50:43] Urbanecm: I'd say manual scap order is includes/skins, then resources, then DS. [19:50:55] But not tested, never looked at this code before. [19:51:19] James_F: sounds about right, thanks [19:52:14] syncing in that order [19:53:42] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.25/includes/skins/: 59866730ea7534db9e47ea308ba2a3c1807d5f11: Revert "Provide native support to dismiss sitenotice in core."
(T271365; T259903; 1/3) (duration: 01m 04s) [19:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:56] T271365: [Regression] "[dismiss]" button shows up on all pages for no apparent reason - https://phabricator.wikimedia.org/T271365 [19:53:56] T259903: Merge DismissableSiteNotice extension into core - https://phabricator.wikimedia.org/T259903 [19:53:59] syncing resources now [19:54:17] (03CR) 10Dzahn: [C: 03+2] "checked with " sudo cumin 'A:mw' 'file /etc/php/7.2/mods-available/mongodb.ini' " on all mw hosts and confirmed it's already gone everywhe" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [19:55:01] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.25/resources/: 59866730ea7534db9e47ea308ba2a3c1807d5f11: Revert "Provide native support to dismiss sitenotice in core." (T271365; T259903; 2/3) (duration: 01m 05s) [19:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:08] and now DS [19:56:27] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.25/includes/DefaultSettings.php: 59866730ea7534db9e47ea308ba2a3c1807d5f11: Revert "Provide native support to dismiss sitenotice in core." (T271365; T259903; 3/3) (duration: 01m 03s) [19:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:42] !log removing mongodb PHP extension, config, package from mwdebug* hosts - T180761 [19:56:44] nray: should be live now. Can you please check outside of mwdebug if it really works? Just to make sure I synced everything. 
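For readers following the lint discussion from 19:46 to 19:50: the suggestion that the syntax check "should probably exclude the .phan folder" could look roughly like the sketch below. This is only an illustration and is not taken from scap's lint.py; the function name php_files_to_lint and the excluded_dirs parameter are hypothetical, and the sketch only selects files, it does not run php -l on them.

```python
import os
import tempfile


def php_files_to_lint(root, excluded_dirs=(".phan",)):
    """Yield *.php paths under root, skipping any directory named in excluded_dirs.

    Hypothetical helper for illustration; not scap's actual implementation.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in sorted(filenames):
            if name.endswith(".php"):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Build a tiny throwaway tree that mimics the layout mentioned in the log.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, ".phan", "internal_stubs"))
        os.makedirs(os.path.join(root, "includes"))
        for rel in (".phan/internal_stubs/wikidiff.php", "includes/Setup.php"):
            open(os.path.join(root, rel), "w").close()
        for path in php_files_to_lint(root):
            print(os.path.relpath(path, root))
```

As the channel concluded, the immediate cause was syncing the whole branch directory, which is not a supported target, so a filter like this would be a safeguard and not the fix that was actually applied.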
[19:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:46] T180761: Move XHGui from tungsten to xhgui-001 - https://phabricator.wikimedia.org/T180761 [19:56:56] Urbanecm: checking now [19:57:01] thanks [19:57:11] and thanks James_F for your help ;) [19:57:32] (03CR) 10Dzahn: "Notice: /Stage[main]/Profile::Mediawiki::Php/Php::Extension[mongodb]/File[/etc/php/7.2/mods-available/mongodb.ini]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [19:58:07] Always. [19:58:28] Urbanecm: that looks great to me. Thank you for your help Urbanecm, James_F, Daimona! [19:58:37] thanks nray! [19:58:39] we're done then :) [19:58:43] Hurrah. [19:58:46] Aye [20:00:04] longma and hashar: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210107T2000). [20:00:21] (03PS1) 10Ryan Kemper: cloudelastic: Update storage device for new partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/654918 [20:01:00] !log restarting haproxy on dbproxy1018 to pick up new config file [20:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:05] (03CR) 10Dzahn: "https://debmonitor.wikimedia.org/packages/php-mongodb this is empty now - ran puppet on all mwdebug* hosts, confirmed gone :) thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [20:03:32] (03CR) 10Gehel: [C: 04-1] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1001/27380/" [puppet] - 10https://gerrit.wikimedia.org/r/654918 (owner: 10Ryan Kemper) [20:03:58] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 2 down 16 Bstorm New wikireplica servers need some perms https://wikitech.wikimedia.org/wiki/HAProxy [20:04:46] (03PS2) 10Ryan Kemper: cloudelastic: Update storage device for new partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/654918 (https://phabricator.wikimedia.org/T265699) [20:05:26] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/654918 (https://phabricator.wikimedia.org/T265699) (owner: 10Ryan Kemper) [20:06:09] (03PS1) 10Jeena Huneidi: all wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654919 [20:06:11] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654919 (owner: 10Jeena Huneidi) [20:06:40] (03CR) 10Bstorm: "Oops. The new replicas are missing the haproxy healthcheck user. I acked the alert and hope nobody was paged." 
[puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [20:07:02] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654919 (owner: 10Jeena Huneidi) [20:08:35] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.25 refs T267418 [20:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:40] T267418: 1.36.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T267418 [20:12:54] longma: I am around ;) [20:13:21] :) thanks [20:13:30] I did the deploy, so watching logs now [20:14:19] looks quiet :] [20:14:29] ya [20:14:33] : Error 1062: Duplicate entry '430392149' for key 'PRIMARY' (10.64.16.101) [20:14:33] Function: WatchedItemStore::updateExpiriesAfterMove [20:14:36] is surely a fun one [20:15:42] that's for wmf.22 at least [20:17:11] and surely if that is already inserted, it is not much of an issue [20:17:35] I guess something in the code needs to deduplicate the items before trying to insert them. Or it is called twice somehow [20:17:54] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10Jgreen) [20:18:36] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, and 2 others: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) @akosiaris picking up the thread on this from before the holiday break; IIRC there was some netw... 
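The guess at 20:17:35, that the caller "needs to deduplicate the items before trying to insert them", can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite table as a stand-in; the table name expiry_demo, its columns, and both helper functions are assumptions made for the example and do not reflect the MediaWiki schema or the WatchedItemStore code.

```python
import sqlite3


def dedupe_by_key(rows):
    """Keep one row per primary-key value (the last one seen wins)."""
    unique = {}
    for key, expiry in rows:
        unique[key] = expiry
    return list(unique.items())


def insert_rows(conn, rows):
    """Batch-insert rows into the hypothetical expiry_demo table."""
    conn.executemany(
        "INSERT INTO expiry_demo (item_id, expiry) VALUES (?, ?)", rows
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expiry_demo (item_id INTEGER PRIMARY KEY, expiry TEXT)")

# The same key appears twice in one batch, as the error in the log suggests.
rows = [(430392149, "2021-02-01"), (7, "2021-03-01"), (430392149, "2021-02-01")]

try:
    insert_rows(conn, rows)
except sqlite3.IntegrityError as exc:
    # SQLite's counterpart of MySQL error 1062 (duplicate primary key).
    print("without dedup:", exc)
    conn.rollback()

insert_rows(conn, dedupe_by_key(rows))
conn.commit()
print("rows stored:", conn.execute("SELECT COUNT(*) FROM expiry_demo").fetchone()[0])
```

This covers only the first hypothesis from the channel. If the function is instead being called twice, deduplicating within a single batch would not help, and an idempotent write would be needed.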
[20:19:54] grafana shows some database error surging https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=10&orgId=1 [20:20:00] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10Jgreen) [20:20:19] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10Jgreen) a:05Papaul→03Jgreen [20:21:05] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10Jgreen) [20:22:25] there are a few DB errors on logstash [20:23:24] yeah [20:23:27] spotted a new one: Error 1146: Table 'commonswiki.wbt_type' doesn't exist (10.64.16.175:3314) [20:23:47] though that is for commonswiki so we should have seen it yesterday [20:24:43] !log razzi@cumin1001 START - Cookbook sre.hosts.decommission [20:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:22] 1 year old known issue https://phabricator.wikimedia.org/T242959 [20:30:31] RECOVERY - Device not healthy -SMART- on ms-be1019 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1019&var-datasource=eqiad+prometheus/ops [20:31:40] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/27382/" [puppet] - 10https://gerrit.wikimedia.org/r/651834 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:33:30] (03CR) 10Dzahn: "I'm not opposed to this but I feel others have more insight and discusses this already in way more detail." [puppet] - 10https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [20:34:14] (03CR) 10Dzahn: [C: 03+1] "Still seems ok to me nowadays." 
[puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [20:34:30] IIRC there's a DBError dashboard that might answer what those errors are [20:34:41] in logstash [20:35:15] (03Abandoned) 10Dzahn: ATS: switch ORES to TLS to backends [puppet] - 10https://gerrit.wikimedia.org/r/618379 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [20:37:02] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10Jgreen) [20:42:42] (03PS1) 10Dzahn: mediawiki::php: remove code to absent mongodb module [puppet] - 10https://gerrit.wikimedia.org/r/654922 (https://phabricator.wikimedia.org/T180761) [20:43:41] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [20:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:47] 10SRE, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `kafka-test1005.eqiad.wmnet` - kafka-test1005.eqiad.wmnet (**WARN**) - **Failed... [20:44:48] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [20:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:23] (03PS1) 10Dzahn: delete the mongodb module [puppet] - 10https://gerrit.wikimedia.org/r/654923 [20:50:45] (03CR) 10Dzahn: "Originally Andrew imported the mongodb module from puppetlabs, then Ori rewrote it because the one from puppetlabs sucked. But this was in" [puppet] - 10https://gerrit.wikimedia.org/r/654923 (owner: 10Dzahn) [20:51:03] longma: the mw-client-errors dashboard has bunch of javascript errors though :/ [20:51:09] https://logstash.wikimedia.org/app/kibana#/dashboard/AXDBY8Qhh3Uj6x1zCF56 [20:51:46] looking [20:51:53] (03CR) 10Dzahn: "Hi Ori, hi other reviewers. So.. 
originally Andrew imported the mongodb module from p" [puppet] - 10https://gerrit.wikimedia.org/r/654923 (owner: 10Dzahn) [20:52:40] I've never used this dashboard so I don't know what is expected [20:52:40] longma: maybe due to local gadgets [20:54:30] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:47] there are a bunch of "Uncaught TypeError: Cannot read property 'modules' of undefined" [20:56:33] at addCodeMirrorToWikiEditor [20:56:57] yeah [20:57:04] so maybe an issue with the WikiEditor [20:57:41] though the extension hasn't been touched in a while ;D [20:57:57] oh no I am wrong, forgot to pull [21:01:57] longma: I am filing a task for it [21:02:11] okay. I'm looking at them with thcipriani [21:02:26] over Google Meet? [21:04:40] yeah [21:05:02] yeah, might be worth rolling back since that's such a spike. Also the DB thing is concerning. [21:07:39] Rolling back [21:09:03] filed as https://phabricator.wikimedia.org/T271468 [21:10:00] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: Revert "group[2] wikis to 1.36.0-wmf.22" [21:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:15] hashar: Sorry for duping so quickly. ;-) [21:10:34] * James_F waits for CI for the patch to merge in master.
[21:11:01] oh Jdlrobson was looking at the dashboard as well ;D [21:12:33] (03PS1) 10Jeena Huneidi: Revert "all wikis to 1.36.0-wmf.25 refs T267418" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654928 [21:13:53] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "all wikis to 1.36.0-wmf.25 refs T267418" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654928 (owner: 10Jeena Huneidi) [21:13:56] (03Merged) 10jenkins-bot: Revert "all wikis to 1.36.0-wmf.25 refs T267418" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654928 (owner: 10Jeena Huneidi) [21:14:49] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on labweb1002.wikimedia.org with reason: REIMAGE [21:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:27] PROBLEM - Kafka Broker Server on kafka-test1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [21:15:55] PROBLEM - Check systemd state on kafka-test1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:25] So glad we're waiting for a one-character doc fix in Wikibase to deploy a UBN fix. :-P [21:16:36] ^^ is razzi and I we will downtime [21:16:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on labweb1002.wikimedia.org with reason: REIMAGE [21:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:25] James_F: so what you're saying is we need a NO-REALLY-priority-pipeline? [21:21:37] thcipriani: Yes. ;-) [21:21:54] thcipriani: Or "yes this patch is C+2'ed, but it's unimportant so batch it for later" pipeline. ;-) [21:22:15] A kind of C+1.5 button. 
[21:23:14] RECOVERY - Check systemd state on kafka-test1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:29:07] James_F: well we can cherry pick it to wmf branch and CR+2 it [21:29:35] hashar: Yes, but given you rolled back the branch there's no rush. [21:29:45] And this way we maintain git hash pointers. [21:30:04] (PS1) Jforrester: Guard against WikiEditor being removed by the time the hook runs [extensions/CodeMirror] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654857 (https://phabricator.wikimedia.org/T271457) [21:30:07] There we go. [21:30:08] RECOVERY - Kafka Broker Server on kafka-test1009 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [21:30:17] hashar, longma: Want me to deploy? [21:30:32] okay [21:30:47] Kk. [21:36:07] (CR) Jforrester: [C: +2] Guard against WikiEditor being removed by the time the hook runs [extensions/CodeMirror] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654857 (https://phabricator.wikimedia.org/T271457) (owner: Jforrester) [21:38:44] (CR) Gergő Tisza: [C: +1] sentry: delete module and hiera keys [puppet] - https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn) [21:39:45] (CR) Ottomata: [C: +1] delete the mongodb module [puppet] - https://gerrit.wikimedia.org/r/654923 (owner: Dzahn) [21:41:28] (Merged) jenkins-bot: Guard against WikiEditor being removed by the time the hook runs [extensions/CodeMirror] (wmf/1.36.0-wmf.25) - https://gerrit.wikimedia.org/r/654857 (https://phabricator.wikimedia.org/T271457) (owner: Jforrester) [21:42:52] (CR) Ori.livneh: [C: +1] "Nah, no sense in keeping it around. +1 for deletion."
[puppet] - https://gerrit.wikimedia.org/r/654923 (owner: Dzahn) [21:43:40] !log jforrester@deploy1001 Synchronized php-1.36.0-wmf.25/extensions/CodeMirror/resources/ext.CodeMirror.js: T271457 Guard against WikiEditor being removed by the time the hook runs (duration: 01m 05s) [21:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:46] T271457: Uncaught TypeError: Cannot read property 'modules' of undefined - https://phabricator.wikimedia.org/T271457 [21:46:32] (PS1) Jeena Huneidi: all wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - https://gerrit.wikimedia.org/r/654932 [21:46:34] (CR) Jeena Huneidi: [C: +2] all wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - https://gerrit.wikimedia.org/r/654932 (owner: Jeena Huneidi) [21:47:26] (Merged) jenkins-bot: all wikis to 1.36.0-wmf.25 refs T267418 [mediawiki-config] - https://gerrit.wikimedia.org/r/654932 (owner: Jeena Huneidi) [21:50:17] razzi: what failed in the decom of kafka-test1005? [21:54:15] volans: I'm pretty sure the dns update failed, which I resolved by manually running the sre.dns.netbox cookbook [21:54:35] yes, I saw that in the task, but what failed there?
[21:55:18] the decom does additional things, running just the dns cookbook doesn't do all the additional steps that the decom does [21:55:33] I don't recall if the dns cookbook is the last step or not, I fear not [21:58:10] As far as I can tell, dns is the last step; as for what failed, I don't recall any specific info output by the cookbook when I ran it, and I closed the tmux session [21:58:21] it should have been there [21:59:07] right, I checked the code and yes, it is now the last, we moved things around a few weeks ago [22:04:59] so somehow scap is blocked [22:05:13] `scap deploy-promote all` got stalled [22:07:07] labweb1002: Finished rsync common (duration: 17m 42s) [22:07:13] seems to be the cause [22:08:42] hashar: What I know is that these have just been upgraded to buster. [22:09:05] we don't even test MediaWiki on Buster [22:09:17] https://phabricator.wikimedia.org/T269004#6726522 [22:09:36] volans: let me know if there's any info I can provide; I'll be more careful to save the output of a cookbook error run next time. Sorry for the confusion [22:09:42] well but we should because we already run production systems on buster [22:09:47] https://phabricator.wikimedia.org/T269004#6725264 [22:09:50] yeah that is it [22:10:03] so that was reimaged [22:10:11] END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on labweb1002.wikimedia.org with reason: REIMAGE [22:10:13] but why exactly does it fail? never ran scap pull? [22:10:57] andrewbogott: ^ an issue with labweb1002 and scap [22:11:28] it's rebuilding [22:11:56] razzi: yeah without knowing the exact error it's a bit more difficult to fix it ;) if it happens again please let me know what the issue is :) [22:11:58] but it needs to be moved out of the scap/mediawiki group [22:12:07] we should kill the one process for now to continue sync [22:12:24] also the first server supposed to be running MediaWiki with Buster is mw1265 and that hasn't happened yet afaik [22:12:36] andrewbogott: ooh..
it's in the middle of it .. that part I did not see , gotcha [22:12:44] since it shouldn't be in the dsh group but is because...something is broken [22:12:48] yea, so reimaging and deployment window don't mix well [22:12:57] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.25 refs T267418 [22:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:01] T267418: 1.36.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T267418 [22:13:09] unless they are fully removed from the "dsh" group [22:13:26] hieradata/common/scap/dsh.yaml:43: - labweb1002.wikimedia.org [22:13:34] dsh group means conftool status is not inactive [22:13:36] aren't we removing mwapp servers from the dsh group when reimaging? [22:13:54] I think the issue is it is pooled "no" but not pooled "inactive". [22:14:23] but if the cookbook was used it would happen automatically ? [22:15:57] * andrewbogott taps T237773 and returns to his workout [22:16:07] andrewbogott: could you please set that to inactive ? [22:16:15] to avoid the issues for deployment [22:16:19] set what to inactive? [22:16:25] I depooled it and used the reimage script [22:16:26] labweb1002 [22:16:49] I'm sorry, I literally don't know what you mean by 'set to inactive' [22:17:20] andrewbogott: there are 3 pool states, yes, no and inactive [22:17:34] no means it is not getting traffic [22:17:39] but still gets scap sync [22:17:41] ok, but I would run that on the host, right? The host that isn't up? [22:17:50] so if deployment happens and the host is down [22:17:53] they can't sync to it [22:18:00] no, on cumin1001 [22:18:05] with confctl [22:18:06] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:18:25] shouldn't the reimage cookbook do that?
[22:18:46] yes, i think so [22:18:57] only if you pass the confctl params [22:18:58] what is the confctl command? [22:19:05] not all host are behind LVS [22:19:28] [cumin1001:~] $ sudo -i confctl select name=labweb1002.wikimedia.org set/pooled=inactive [22:19:59] !log andrew@cumin1001 conftool action : set/pooled=inactive; selector: name=labweb1002.wikimedia.org [22:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:41] thanks. [22:21:48] so this made it disappear from https://config-master.wikimedia.org/pybal/eqiad/labweb [22:21:55] hashar: I think it should be gone now [22:22:01] longma: ^^^ :) [22:22:09] labweb1002 should be out of the scap targets now [22:22:11] hopefully ;] [22:22:15] thanks! [22:23:09] once labweb1002 is back up.. someone should run scap pull on it or deploy to just that from deploy1001 [22:23:37] well, depends on the timing of things [22:23:42] but it won't hurt [22:23:46] please just let me do the work that I need to do on that host. I'm debugging issues unrelated to mediawiki [22:23:47] Puppet will do it [22:23:59] also https://phabricator.wikimedia.org/T237773 [22:24:30] !log robh@cumin1001 START - Cookbook sre.dns.netbox [22:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:37] the issue happens though because it is part of the cluster [22:24:44] in this sense [22:25:45] nevertheless if that issue were handled today's issues would not have appeared [22:26:04] And issues /like/ that will continue to disrupt mediawiki processes as long as it remains a unicorn deployment [22:27:09] Not sure I understand that part. 
Any appserver part of the cluster would have that issue if it is down but not removed from pybal [22:27:31] Right but if labweb1002 isn't running mediawiki [22:27:39] which is the point of that bug [22:27:57] wikitech is just a wiki, it should be managed just like our 1000 other wikis [22:28:05] rather than run on two special servers managed by a different team [22:28:35] unless I linked the wrong task :) [22:28:52] nope, I think that's the one I mean [22:29:25] guess make that "move wikitech to prod cluster" a shared goal with sre? [22:29:36] pretty sure everyone will be pleased with that [22:30:15] but I guess there are some architecture limitations so it is a whole can of worms [22:30:17] :\ [22:30:26] It's not a can of worms, it's pretty much trivial. [22:30:32] A few hours of dba work first: https://phabricator.wikimedia.org/T167973 [22:30:38] and then a config patch [22:30:44] as far as I know [22:31:12] (CR) Ryan Kemper: [C: +2] cloudelastic: Update storage device for new partitioning scheme [puppet] - https://gerrit.wikimedia.org/r/654918 (https://phabricator.wikimedia.org/T265699) (owner: Ryan Kemper) [22:31:13] so maybe it is just a quick sprint :] [22:31:19] Sorry - I'm grumpy because I am ALREADY losing two days of work to maintaining labweb1001/1002 which would be unnecessary if they weren't running wikitech [22:31:28] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:33] anyway if labweb1002 is out of the scap target, that closes the issue we had earlier and it is all fine to me :] [22:31:36] (because of a truly evil bug which will probably take me the rest of the week) [22:32:26] mutante: andrewbogott: thx for the fix! [22:32:50] sorry about breaking scap.
But I don't promise not to do it again :) [22:33:36] no worries, if you ever need to it's just that extra option to the cookbook [22:33:51] why can't the cookbook just do it as necessary? puppetdb etc. etc. [22:34:56] it can't know if the server is behind LVS / in conftool-data [22:35:11] if /I/ can know then software can know :) [22:39:48] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [22:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:54] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host (duration: 00m 06s) [22:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:43] !log andrew@deploy1001 Started deploy [striker/deploy@e4db843]: striker -> labweb1002 [22:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:47] !log andrew@deploy1001 Finished deploy [striker/deploy@e4db843]: striker -> labweb1002 (duration: 00m 04s) [22:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:29] mutante: typically I use pool/depool on the affected host. What is the equivalent to disable? [22:44:19] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (RobH) Ran the dns cookbook for these hosts fine, but the homer script has issues: ` robh@cumin1001:~$ sudo homer asw*eqiad* diff INFO:homer.devices:Initialized 35 devices INFO:hom...
[22:44:30] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [22:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:44] andrewbogott: I never used it before but apparently it's "decommission" [22:44:56] found on https://wikitech.wikimedia.org/wiki/Load_Balanced_Services_And_Conftool#Helper_scripts [22:44:58] hm, ok [22:45:28] pool = yes, depool = no, decommission = inactive , drain = weight 0 [22:46:45] whether you use the local commands or confctl should not matter [22:47:08] https://wikitech.wikimedia.org/wiki/Load_Balanced_Services_And_Conftool#States_in_conftool_and_their_meaning has the part about the 3 states that stays the same [22:48:34] so one can say the local decom command is for "should be offline for extended period" where extended mean "longer than until the next deploy" but for an actual decom forever one would use the decom cookbook on cumin [22:49:58] maybe it should be called "inactivate" to make a difference to "actual decom" [22:51:51] jouncebot: next [22:51:52] In 1 hour(s) and 8 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210108T0000) [22:52:13] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host (duration: 07m 44s) [22:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:03] !log andrew@deploy1001 Started deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host [22:54:07] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce4c515]: trying to debug a compression error that doesn't happen on the test host (duration: 00m 04s) [22:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[22:56:24] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:58:52] SRE, LDAP-Access-Requests, Abstract Wikipedia (Phase β): Grant Access to ldap/wmf for Cory Massaro - https://phabricator.wikimedia.org/T271245 (Dzahn) No hiring announcement but can confirm this way: ` [ldap-corp1001:~] $ /usr/bin/ldapsearch -x "mail=cmassaro*" | grep -E 'employee|mail|manager' # fi... [23:01:01] (PS1) Andrew Bogott: Disable offline compression in train/buster [puppet] - https://gerrit.wikimedia.org/r/654945 (https://phabricator.wikimedia.org/T269004) [23:01:02] (CR) ArielGlenn: [C: +1] "Looks fine to me." [puppet] - https://gerrit.wikimedia.org/r/654911 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn) [23:01:06] (CR) jerkins-bot: [V: -1] Disable offline compression in train/buster [puppet] - https://gerrit.wikimedia.org/r/654945 (https://phabricator.wikimedia.org/T269004) (owner: Andrew Bogott) [23:01:29] (PS2) Andrew Bogott: Horizon: Disable offline compression in Train [puppet] - https://gerrit.wikimedia.org/r/654945 (https://phabricator.wikimedia.org/T269004) [23:02:01] (CR) jerkins-bot: [V: -1] Horizon: Disable offline compression in Train [puppet] - https://gerrit.wikimedia.org/r/654945 (https://phabricator.wikimedia.org/T269004) (owner: Andrew Bogott) [23:03:20] (PS3) Andrew Bogott: Horizon: Disable offline compression in Train [puppet] - https://gerrit.wikimedia.org/r/654945 (https://phabricator.wikimedia.org/T269004) [23:05:14] (CR) Andrew Bogott: [C: +2] Horizon: Disable offline compression in Train [puppet] - https://gerrit.wikimedia.org/r/654945 (https://phabricator.wikimedia.org/T269004) (owner: Andrew Bogott) [23:06:48] (CR) Bstorm: [C: +1] dumps: require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/654911
(https://phabricator.wikimedia.org/T266479) (owner: Dzahn) [23:08:04] (PS1) Andrew Bogott: Horizon config: remove an errant # [puppet] - https://gerrit.wikimedia.org/r/654946 [23:08:55] (CR) Andrew Bogott: [C: +2] Horizon config: remove an errant # [puppet] - https://gerrit.wikimedia.org/r/654946 (owner: Andrew Bogott) [23:12:15] !log andrew@deploy1001 Started deploy [horizon/deploy@25ffdee]: trying to debug a compression error that doesn't happen on the test host [23:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:15] !log andrew@deploy1001 Finished deploy [horizon/deploy@25ffdee]: trying to debug a compression error that doesn't happen on the test host (duration: 02m 00s) [23:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:05] SRE, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1266.eqiad... [23:28:09] !log reimaging mw1266 [23:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:20] (PS1) Dzahn: DHCP: switch mw1266,mw1267,mw1276,mw1277 to use buster installer [puppet] - https://gerrit.wikimedia.org/r/654947 (https://phabricator.wikimedia.org/T245757) [23:37:32] (CR) Dzahn: [C: +2] DHCP: switch mw1266,mw1267,mw1276,mw1277 to use buster installer [puppet] - https://gerrit.wikimedia.org/r/654947 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn) [23:39:15] SRE, MediaWiki-Debug-Logger, Traffic, Developer Productivity, Performance-Team (Radar): noc.wikimedia.org with X-Wikimedia-Debug routes to mwdebug but host is not served there - https://phabricator.wikimedia.org/T245552 (Krinkle) As it stands, NOC is broken with XWD. We need to choose one of...
[23:40:27] SRE, Graphoid, Platform Engineering, serviceops: Final undeploy for graphoid - en.wiki - https://phabricator.wikimedia.org/T271495 (Jseddon) [23:43:29] SRE, MediaWiki-Debug-Logger, Wikimedia-Rdbms, observability: Logstash no longer captures DB queries in debug mode - https://phabricator.wikimedia.org/T190455 (Krinkle) [23:44:26] (PS1) Seddon: Undeploy graphoid on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/654949 (https://phabricator.wikimedia.org/T271495) [23:44:57] SRE, MediaWiki-Debug-Logger, Wikimedia-Rdbms, observability, Developer Productivity: Logstash no longer captures DB queries in debug mode - https://phabricator.wikimedia.org/T190455 (Krinkle) [23:49:47] SRE, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1266.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw12... [23:50:22] SRE, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1266.eqiad... [23:50:25] SRE, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1266.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw12...
[23:51:11] SRE, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1266.eqiad... [23:53:09] SRE, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1276.eqiad... [23:54:51] SRE, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1267.eqiad... [23:55:26] SRE, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1277.eqiad... [23:55:40] !log reimaging mw1267,mw1276,mw1277 [23:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log