[00:03:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:05:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:11:36] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634361 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [00:56:41] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:27] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7fd5853104e0: Failed to establish a new connection: [Errno 111] Connection [00:57:27] ://wikitech.wikimedia.org/wiki/Search%23Administration [01:22:45] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:23:31] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: active_shards: 916, status: green, number_of_pending_tasks: 0, cluster_name: production-logstash-eqiad, timed_out: False, number_of_data_nodes: 3, active_shards_percent_as_number: 100.0, active_primary_shards: 483, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 
0, unassigned_shards: 0, num [01:23:31] relocating_shards: 0, task_max_waiting_in_queue_millis: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [02:26:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:28:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:34:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:35:11] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:41:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:42:11] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:17:03] PROBLEM - PHP7 jobrunner on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:18:47] RECOVERY - PHP7 jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 7.614 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:18:57] PROBLEM - PHP7 rendering on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:20:33] 
RECOVERY - PHP7 rendering on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:39:25] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:44] 10Operations, 10Analytics, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Marostegui) I can confirm https://phabricator.wikimedia.org/L2 has been signed by @Nuria We'd still need a C-Level approval for this. I will seek Grant's approval for this [05:45:05] PROBLEM - PHP7 jobrunner on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:45:47] 10Operations, 10Analytics, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Marostegui) >>! In T266086#6575066, @Dzahn wrote: > The offboarding script has an option for "stay volunteer". This is very useful! 
Thanks [05:50:07] RECOVERY - PHP7 jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:54:53] (03CR) 10ArielGlenn: [C: 03+1] "According to https://gerrit.wikimedia.org/r/c/operations/puppet/+/421489 we should be able to remove it, and indeed the /srv/dumps/xmldata" [puppet] - 10https://gerrit.wikimedia.org/r/636087 (owner: 10Dzahn) [05:55:29] PROBLEM - PHP7 jobrunner on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:57:11] RECOVERY - PHP7 jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 7.582 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:00:23] PROBLEM - PHP7 rendering on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:00:45] PROBLEM - PHP7 jobrunner on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:04:50] <_joe_> I can't seem to ssh to mw2249 [06:05:25] <_joe_> oh it seems the videoscalers are exploding [06:10:48] !log Warm up tables T261914 [06:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:56] T261914: Enable replication eqiad -> codfw and other checks - https://phabricator.wikimedia.org/T261914 [06:15:49] !log oblivian@cumin2001 conftool action : set/pooled=no; selector: cluster=videoscaler,dc=codfw,name=mw228.* [06:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:53] !log oblivian@cumin2001 conftool action : set/pooled=no; selector: cluster=jobrunner,dc=codfw,name=mw224.* [06:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:02] (03PS1) 10Marostegui: orchestrator.conf: Add a bunch of options [puppet] - 10https://gerrit.wikimedia.org/r/636233 
(https://phabricator.wikimedia.org/T265990) [06:24:57] RECOVERY - PHP7 rendering on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.691 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:25:17] RECOVERY - PHP7 jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:27:23] PROBLEM - PHP7 rendering on mw2250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:28:33] PROBLEM - PHP7 jobrunner on mw2250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:30:11] RECOVERY - PHP7 jobrunner on mw2250 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:32:18] (03PS1) 10Marostegui: dborch1001.yaml: Clarify what this host is [puppet] - 10https://gerrit.wikimedia.org/r/636234 (https://phabricator.wikimedia.org/T265990) [06:32:29] PROBLEM - PHP7 jobrunner on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:32:37] RECOVERY - PHP7 rendering on mw2250 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 8.618 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:33:09] PROBLEM - PHP7 rendering on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:34:31] (03CR) 10Marostegui: [C: 03+2] dborch1001.yaml: Clarify what this host is [puppet] - 10https://gerrit.wikimedia.org/r/636234 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [06:36:00] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: orchestrator: Add service monitoring - https://phabricator.wikimedia.org/T266338 (10Marostegui) p:05Triage→03Low We 
don't have it in production, so putting this to low as we aren't in a hurry for this as of today [06:36:08] 10Operations, 10IPv6: update bacula-sd config so that it listens on IPv6 - https://phabricator.wikimedia.org/T253986 (10Marostegui) p:05Triage→03Medium [06:36:27] RECOVERY - PHP7 rendering on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:39:19] RECOVERY - PHP7 jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:41:07] PROBLEM - PHP7 rendering on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:42:33] PROBLEM - PHP7 jobrunner on mw2250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:42:59] PROBLEM - PHP7 jobrunner on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:44:13] RECOVERY - PHP7 jobrunner on mw2250 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 6.756 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:46:09] PROBLEM - PHP7 jobrunner on mw2278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:46:56] <_joe_> effie: I am sending a patch to reduce the concurrency of the video scaling [06:47:04] <_joe_> I have no idea why it was this elevated [06:48:37] (03PS1) 10Giuseppe Lavagetto: cpjobqueue: reduce concurrency of video transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/636239 [06:48:57] <_joe_> can you please take a look? 
^^ [06:51:40] (03CR) 10Marostegui: [C: 03+1] cpjobqueue: reduce concurrency of video transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/636239 (owner: 10Giuseppe Lavagetto) [06:54:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cpjobqueue: reduce concurrency of video transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/636239 (owner: 10Giuseppe Lavagetto) [06:55:29] PROBLEM - PHP7 rendering on mw2278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:57:05] RECOVERY - PHP7 rendering on mw2278 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:57:08] (03Merged) 10jenkins-bot: cpjobqueue: reduce concurrency of video transcodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/636239 (owner: 10Giuseppe Lavagetto) [06:58:15] PROBLEM - PHP7 jobrunner on mw2250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:59:15] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[06:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:50] <_joe_> !log rolling restart of php7.2-fpm on the codfw jobrunners, to reduce the number of dangling transcodes after restarting cp-jobqueue for a deploy [06:59:51] RECOVERY - PHP7 jobrunner on mw2250 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:57] PROBLEM - PHP7 rendering on mw2246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:00:01] PROBLEM - PHP7 jobrunner on mw2246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:00:38] _joe_: I am hopping on my computer [07:01:14] <_joe_> effie: let's see if my changes fix the situation [07:01:41] PROBLEM - PHP7 rendering on mw2278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:01:55] _joe_: can you tell me what you see? 
[07:02:09] I have not checked graphs yet [07:03:15] RECOVERY - PHP7 rendering on mw2246 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:03:19] RECOVERY - PHP7 jobrunner on mw2246 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:03:39] RECOVERY - PHP7 rendering on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:03:45] RECOVERY - PHP7 jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:06:45] RECOVERY - PHP7 rendering on mw2278 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:06:59] RECOVERY - PHP7 jobrunner on mw2278 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:37:45] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [07:38:48] <_joe_> ^^ what is this? [07:39:25] <_joe_> oh ulsfo apparently [07:39:29] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [07:46:47] 10Operations, 10Analytics-Clusters: Rename an-scheduler1001 to an-coord1002 - https://phabricator.wikimedia.org/T265620 (10elukey) 05Open→03Resolved [08:02:17] 10Operations, 10Analytics, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10MoritzMuehlenhoff) >>! In T266086#6575708, @Dzahn wrote: >>>! In T266086#6575705, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.t... [08:11:27] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team: Increase on database writes and deletes activity on Commonswiki lead to some replication lag - https://phabricator.wikimedia.org/T266432 (10Marostegui) [08:11:37] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team: Increase on database writes and deletes activity on Commonswiki lead to some replication lag - https://phabricator.wikimedia.org/T266432 (10Marostegui) p:05Triage→03Medium [08:12:06] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team: Increase on database writes and deletes activity on Commonswiki lead to some replication lag - https://phabricator.wikimedia.org/T266432 (10Marostegui) p:05Medium→03High Setting to high as this might be causing cross dc lag [08:12:27] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10Marostegui) [08:18:18] ema: ^ congratulations [08:19:47] ahahhaha [08:20:12] *shake hand emoji* [08:21:20] marostegui: thank you! 
[08:21:26] !log remove down sessions to AS16509 [08:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:14] (03PS1) 10JMeybohm: Rename the fields in output json [software/heptiolabs/eventrouter] (v0.3-wmf) - 10https://gerrit.wikimedia.org/r/636354 (https://phabricator.wikimedia.org/T262675) [08:25:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Sounds sane to me" [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/631686 (https://phabricator.wikimedia.org/T264362) (owner: 10Giuseppe Lavagetto) [08:25:44] !log remove down sessions to AS24429 [08:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:31] (03CR) 10Ema: [C: 03+1] Add debian-debug repository [puppet] - 10https://gerrit.wikimedia.org/r/636040 (https://phabricator.wikimedia.org/T164819) (owner: 10Muehlenhoff) [08:27:12] !log remove down sessions to AS8674 [08:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:15] !log remove down sessions to AS6327 [08:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:51] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 254, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:33:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635666 (owner: 10Dzahn) [08:34:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM (the commit message needs updating, though)" [puppet] - 10https://gerrit.wikimedia.org/r/635665 (owner: 10Dzahn) [08:41:13] !log remove down sessions to AS31334 [08:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:51] !log remove down sessions to AS8560 [08:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:37] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 425, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:47:10] 
(03CR) 10Muehlenhoff: "Given that the Icinga servers are now running Buster this seems safe to re-enable" [puppet] - 10https://gerrit.wikimedia.org/r/475453 (https://phabricator.wikimedia.org/T204993) (owner: 10Alex Monk) [08:47:56] (03CR) 10Filippo Giunchedi: [C: 03+1] Add debian-debug repository [puppet] - 10https://gerrit.wikimedia.org/r/636040 (https://phabricator.wikimedia.org/T164819) (owner: 10Muehlenhoff) [08:48:13] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 97, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:51:33] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [08:51:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:01] !log remove down sessions to AS38758 [08:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:43] RECOVERY - HP RAID on ms-be2017 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:58:11] !log installing freetype security updates for stretch [08:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:00] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T266214 (10fgiunchedi) 05Open→03Resolved Handler reenabled, disk reenabled and rebuilding! Thanks @papaul @Marostegui ! 
[09:04:01] (03CR) 10Marostegui: [C: 03+1] dbtools: Add master-pos script [software] - 10https://gerrit.wikimedia.org/r/635834 (owner: 10Kormat) [09:04:20] (03CR) 10Kormat: [C: 03+2] dbtools: Add master-pos script [software] - 10https://gerrit.wikimedia.org/r/635834 (owner: 10Kormat) [09:04:46] !log swift codfw-prod: bump object weight for ms-be2057 - T261633 [09:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:51] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 [09:06:12] (03CR) 10Filippo Giunchedi: [C: 03+1] librenms: remove absented and obsoleted cron [puppet] - 10https://gerrit.wikimedia.org/r/636091 (owner: 10Dzahn) [09:06:20] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Rename the fields in output json [software/heptiolabs/eventrouter] (v0.3-wmf) - 10https://gerrit.wikimedia.org/r/636354 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [09:06:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, +Eric and Hugh for heads up" [puppet] - 10https://gerrit.wikimedia.org/r/636092 (owner: 10Dzahn) [09:07:42] (03PS1) 10JMeybohm: eventrouter: Use less generic field names in output json [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/636358 (https://phabricator.wikimedia.org/T262675) [09:11:04] (03PS6) 10Filippo Giunchedi: WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 [09:11:17] (03CR) 10jerkins-bot: [V: 04-1] WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [09:11:56] (03PS7) 10Filippo Giunchedi: WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 [09:13:41] (03CR) 10Filippo Giunchedi: "> Patch Set 5: Code-Review-1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [09:16:12] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: proxy pushgateway through apache [puppet] - 
10https://gerrit.wikimedia.org/r/635788 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [09:21:24] (03CR) 10Kormat: [C: 03+1] "This looks great, thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338) (owner: 10Dzahn) [09:25:25] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) [09:31:09] !log restarting PHP FPM on mw canaries to pick up freetype update [09:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:07] (03CR) 10Filippo Giunchedi: [C: 03+1] kibana: add kibana_ecs role [puppet] - 10https://gerrit.wikimedia.org/r/635864 (owner: 10Herron) [09:46:32] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] eventrouter: Use less generic field names in output json [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/636358 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [09:47:32] (03CR) 10Kormat: orchestrator.conf: Add a bunch of options (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636233 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [09:50:45] (03CR) 10Marostegui: "I had the same doubt, whether to include them or not. 
My reasoning is basically to understand what orchestrator is doing without having to" [puppet] - 10https://gerrit.wikimedia.org/r/636233 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [09:50:59] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:51:00] (03PS1) 10Filippo Giunchedi: prometheus: re-enable compaction by default [puppet] - 10https://gerrit.wikimedia.org/r/636362 (https://phabricator.wikimedia.org/T261281) [09:51:03] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:51:54] !log published docker-registry.discovery.wmnet/eventrouter:0.3.0-3 [09:51:54] mmhh I'll take a look [09:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:41] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:54:08] 10Operations, 10fundraising-tech-ops, 10netops, 10observability: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (10ayounsi) @herron anything left to do? [09:54:47] PROBLEM - Check systemd state on ms-be2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:12] mmhh ok my latest patch to prometheus apache introduced a new vhost that comes first in the configuration, thus pybal is hitting that now and failing because the backend isn't supposed to be on all the time [10:00:00] 10Operations, 10Traffic, 10netops: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) {F32414504} I made a diagram. 
[10:00:49] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:01:12] that's me ^ fixing [10:03:20] (03PS1) 10JMeybohm: eventrouter: Various chart improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/636363 (https://phabricator.wikimedia.org/T262675) [10:03:22] (03PS1) 10JMeybohm: eventrouter: Update image version and set kubernetesApi [deployment-charts] - 10https://gerrit.wikimedia.org/r/636364 (https://phabricator.wikimedia.org/T262675) [10:03:43] (03CR) 10jerkins-bot: [V: 04-1] eventrouter: Various chart improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/636363 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [10:03:45] (03CR) 10jerkins-bot: [V: 04-1] eventrouter: Update image version and set kubernetesApi [deployment-charts] - 10https://gerrit.wikimedia.org/r/636364 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [10:03:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [10:03:59] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:04:05] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:06:33] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([prometheus2004.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [10:08:51] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:08:59] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:09:26] (03PS1) 10Filippo Giunchedi: hieradata: use 'prometheus' vhost for ProxyFetch [puppet] - 10https://gerrit.wikimedia.org/r/636365 (https://phabricator.wikimedia.org/T249311) [10:10:13] fix ^ [10:10:55] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:12:20] (03CR) 10Kormat: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636233 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [10:12:35] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:12:52] (03CR) 10Jbond: [C: 03+2] rpkivalidator: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/635664 (owner: 10Dzahn) [10:13:43] (03CR) 10Marostegui: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636233 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [10:13:53] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:14:11] (03CR) 10Ema: [C: 03+1] hieradata: use 'prometheus' vhost for ProxyFetch [puppet] - 10https://gerrit.wikimedia.org/r/636365 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [10:14:40] (03PS2) 10Filippo Giunchedi: hieradata: use 'prometheus' vhost for ProxyFetch [puppet] - 10https://gerrit.wikimedia.org/r/636365 (https://phabricator.wikimedia.org/T249311) [10:15:25] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:15:28] (03PS3) 10Filippo Giunchedi: hieradata: use 'prometheus' vhost for ProxyFetch [puppet] - 10https://gerrit.wikimedia.org/r/636365 (https://phabricator.wikimedia.org/T249311) [10:15:45] (03CR) 10Jbond: [C: 03+1] 
"LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635905 (owner: 10Dzahn) [10:16:18] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: use 'prometheus' vhost for ProxyFetch [puppet] - 10https://gerrit.wikimedia.org/r/636365 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [10:16:49] (03PS2) 10Marostegui: orchestrator.conf: Add a bunch of options [puppet] - 10https://gerrit.wikimedia.org/r/636233 (https://phabricator.wikimedia.org/T265990) [10:18:08] !log roll restart pybal to apply latest configuration [10:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:56] (03CR) 10Jbond: base/labs: add systemd timer to clean puppet client bucket (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [10:22:01] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/636233 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [10:22:19] (03CR) 10Marostegui: [C: 03+2] orchestrator.conf: Add a bunch of options [puppet] - 10https://gerrit.wikimedia.org/r/636233 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [10:22:38] (03CR) 10Jbond: [C: 03+1] ntp: hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/635666 (owner: 10Dzahn) [10:23:27] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:23:43] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 158 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:23:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/635665 (owner: 10Dzahn) [10:23:47] (03CR) 10Jbond: [C: 03+2] debmonitor::client: hiera->lookup, add data types [puppet] - 
10https://gerrit.wikimedia.org/r/635665 (owner: 10Dzahn) [10:24:07] (03PS2) 10JMeybohm: eventrouter: Various chart improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/636363 (https://phabricator.wikimedia.org/T262675) [10:24:09] (03PS2) 10JMeybohm: eventrouter: Update image version and set kubernetesApi [deployment-charts] - 10https://gerrit.wikimedia.org/r/636364 (https://phabricator.wikimedia.org/T262675) [10:24:27] (03CR) 10Jbond: [C: 03+1] Add debian-debug repository [puppet] - 10https://gerrit.wikimedia.org/r/636040 (https://phabricator.wikimedia.org/T164819) (owner: 10Muehlenhoff) [10:25:21] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:27:42] 10Operations, 10DBA, 10User-Kormat: Integrate orchestrator with !log - https://phabricator.wikimedia.org/T266452 (10Marostegui) [10:27:54] 10Operations, 10DBA, 10User-Kormat: Integrate orchestrator with !log - https://phabricator.wikimedia.org/T266452 (10Marostegui) p:05Triage→03Medium [10:29:15] RECOVERY - Check systemd state on ms-be2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:46] !log upload trafficserver 8.0.8-1wm3 to apt.wm.org (buster) - T265911 [10:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:53] T265911: ATS trying to set socket options SO_MARK / IP_TOS - https://phabricator.wikimedia.org/T265911 [10:30:04] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201026T1030). 
[10:32:27] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:33:22] (03CR) 10Vgutierrez: [C: 03+1] Reduce reconnectTimeout for etcd to 0.1 seconds [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/631686 (https://phabricator.wikimedia.org/T264362) (owner: 10Giuseppe Lavagetto) [10:33:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:37:09] (03PS1) 10Marostegui: orchestrator.conf.json: Add some failover options [puppet] - 10https://gerrit.wikimedia.org/r/636387 (https://phabricator.wikimedia.org/T265990) [10:41:16] (03PS3) 10Marostegui: orchestrator.sql: Track orchestrator grants for topology discovery [puppet] - 10https://gerrit.wikimedia.org/r/636052 (https://phabricator.wikimedia.org/T265990) [10:42:18] (03PS1) 10Hashar: ci: drop reference to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/636388 [10:47:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nice idea! It will make updates a bit more difficult, but it will work for now" [software/heptiolabs/eventrouter] (v0.3-wmf) - 10https://gerrit.wikimedia.org/r/636354 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [10:50:00] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team, 10Wikimedia-production-error: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10jcrespo) Adding #Wikimedia-production-error as it seems to coincide with a... [10:51:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/636096 (owner: 10Dzahn) [10:51:26] !log manually reloading nginx on cloudelastic[1005-1006] [10:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/636093 (owner: 10Dzahn) [10:51:42] (03PS4) 10Elukey: sre.hadoop.init-hadoop-workers: add option to wipe partition tables [cookbooks] - 10https://gerrit.wikimedia.org/r/634911 [10:52:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/636088 (owner: 10Dzahn) [10:52:15] (03CR) 10Elukey: [C: 03+2] "I had some chats with Riccardo about how to structure this cookbook, I am going to merge this code to see how it behaves and then I'll ref" [cookbooks] - 10https://gerrit.wikimedia.org/r/634911 (owner: 10Elukey) [10:54:35] (03PS1) 10Filippo Giunchedi: prometheus: add ServerAlias for fqdn [puppet] - 10https://gerrit.wikimedia.org/r/636389 (https://phabricator.wikimedia.org/T249311) [10:55:58] (03PS1) 10Elukey: sre.hadoop.init-hadoop-workers: fix typo in argparse parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/636390 [10:57:23] (03PS1) 10Ayounsi: Add device inventory support [software/homer] - 10https://gerrit.wikimedia.org/r/636391 (https://phabricator.wikimedia.org/T257392) [10:57:27] (03PS5) 10ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) [10:57:44] (03CR) 10Elukey: [C: 03+2] sre.hadoop.init-hadoop-workers: fix typo in argparse parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/636390 (owner: 10Elukey) [10:57:56] (03CR) 10jerkins-bot: [V: 04-1] per job batches file with locking and methods for claiming jobs etc [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [10:58:19] (03PS1) 
10Ayounsi: Automatically enable sampling on all FPCs [homer/public] - 10https://gerrit.wikimedia.org/r/636392 (https://phabricator.wikimedia.org/T257392) [10:58:39] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/26111/" [puppet] - 10https://gerrit.wikimedia.org/r/636389 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [10:59:11] looking for +1s on a simple patch/fix ^ [10:59:12] (03CR) 10jerkins-bot: [V: 04-1] Add device inventory support [software/homer] - 10https://gerrit.wikimedia.org/r/636391 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [10:59:34] (03CR) 10Ayounsi: "No tests :(" [software/homer] - 10https://gerrit.wikimedia.org/r/636391 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [10:59:40] (03CR) 10Kormat: [C: 04-1] orchestrator.conf.json: Add some failover options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636387 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [10:59:42] (03CR) 10Hashar: "Straight forward, CI no more uses Jessie :)" [puppet] - 10https://gerrit.wikimedia.org/r/636388 (owner: 10Hashar) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201026T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:01:20] (03PS2) 10Ayounsi: Add device inventory support [software/homer] - 10https://gerrit.wikimedia.org/r/636391 (https://phabricator.wikimedia.org/T257392) [11:02:18] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [11:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:23] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) [11:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:30] I might deploy a config change in 15-30 minutes or so [11:02:36] if there’s nothing else in the window [11:02:41] (03CR) 10jerkins-bot: [V: 04-1] Add device inventory support [software/homer] - 10https://gerrit.wikimedia.org/r/636391 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [11:03:22] (03CR) 10Ayounsi: "Changes for 2 devices: ['cr1-codfw.wikimedia.org', 'cr2-codfw.wikimedia.org']" [homer/public] - 10https://gerrit.wikimedia.org/r/636392 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [11:03:59] (03CR) 10JMeybohm: "> Patch Set 1: Code-Review+1" [software/heptiolabs/eventrouter] (v0.3-wmf) - 10https://gerrit.wikimedia.org/r/636354 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [11:04:01] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, will merge" [puppet] - 10https://gerrit.wikimedia.org/r/636388 (owner: 10Hashar) [11:04:44] (03CR) 10Muehlenhoff: [C: 03+2] ci: drop reference to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/636388 (owner: 10Hashar) [11:04:52] moritzm: danke :) [11:05:53] (03CR) 10Jbond: [C: 03+2] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [11:06:40] (03PS2) 10Marostegui: orchestrator.conf.json: Add some failover options [puppet] - 10https://gerrit.wikimedia.org/r/636387 (https://phabricator.wikimedia.org/T265990) [11:07:43] (03PS3) 10Ayounsi: Add device inventory support [software/homer] - 10https://gerrit.wikimedia.org/r/636391 (https://phabricator.wikimedia.org/T257392) [11:08:57] PROBLEM - SSH on ms-be2055 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:10:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/636389 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [11:10:29] RECOVERY - SSH on ms-be2055 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:10:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] "1 minor comment, LGTM" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/636363 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [11:11:10] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add ServerAlias for fqdn [puppet] - 10https://gerrit.wikimedia.org/r/636389 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [11:11:10] !log upgrade trafficserver to 8.0.8-1wm3 on cp4032 - T265911 [11:11:16] (03CR) 10JMeybohm: [C: 03+1] envoy-future: upgrade to Envoy 1.16.0 [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/636062 (owner: 10Hnowlan) [11:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:17] T265911: ATS trying to set socket options SO_MARK / IP_TOS - https://phabricator.wikimedia.org/T265911 [11:14:19] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10ovasileva) [11:16:47] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01211 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:22:21] (03PS1) 10Elukey: sre.hadoop.init-hadoop-workers: avoid to install wipefs [cookbooks] - 10https://gerrit.wikimedia.org/r/636395 [11:22:30] (03PS3) 10JMeybohm: eventrouter: Various chart improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/636363 (https://phabricator.wikimedia.org/T262675) [11:22:32] (03PS3) 10JMeybohm: eventrouter: Update image version and set kubernetesApi [deployment-charts] - 10https://gerrit.wikimedia.org/r/636364 (https://phabricator.wikimedia.org/T262675) [11:23:05] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:14] (03CR) 10JMeybohm: eventrouter: Various chart improvements (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/636363 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [11:23:25] (03PS1) 10Jbond: cfssl::cert: fix re-sign command [puppet] - 10https://gerrit.wikimedia.org/r/636396 [11:24:42] (03CR) 10Elukey: [C: 03+2] sre.hadoop.init-hadoop-workers: avoid to install wipefs [cookbooks] - 10https://gerrit.wikimedia.org/r/636395 (owner: 10Elukey) [11:24:44] (03CR) 10Jbond: [C: 03+2] cfssl::cert: fix re-sign command [puppet] - 10https://gerrit.wikimedia.org/r/636396 (owner: 
10Jbond) [11:26:09] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [11:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:17] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) [11:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:48] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [11:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:55] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) [11:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:12] sorry for the spam :) [11:29:19] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:16] (03CR) 10Volans: [C: 03+1] "LGTM but missing tests" (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/636391 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [11:35:01] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:41:02] fyi the noise in icinga related to wmf_auto_restart_cron_* is related to a recent change i made (https://gerrit.wikimedia.org/r/c/operations/puppet/+/635516) currently investigating [11:42:29] heheh suspiciously only jessie hosts ? 
[11:42:48] yes it seems systemd on jessie doesn't recognise the time spec [11:48:04] (03PS1) 10Jbond: service_auto_restart: fix time spec [puppet] - 10https://gerrit.wikimedia.org/r/636399 [11:48:56] (03CR) 10Muehlenhoff: [C: 03+1] "shakes fist at remaining jessie servers" [puppet] - 10https://gerrit.wikimedia.org/r/636399 (owner: 10Jbond) [11:48:58] jouncebot: now [11:48:58] For the next 0 hour(s) and 11 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201026T1100) [11:49:02] * Urbanecm goes to deploy [11:49:08] (03PS1) 10Urbanecm: Add foto.digitalarkivet.no to wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636400 (https://phabricator.wikimedia.org/T266390) [11:49:19] (03CR) 10Urbanecm: [C: 03+2] Add foto.digitalarkivet.no to wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636400 (https://phabricator.wikimedia.org/T266390) (owner: 10Urbanecm) [11:49:45] (03CR) 10Kormat: service_auto_restart: fix time spec (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636399 (owner: 10Jbond) [11:50:19] (03Merged) 10jenkins-bot: Add foto.digitalarkivet.no to wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636400 (https://phabricator.wikimedia.org/T266390) (owner: 10Urbanecm) [11:50:37] (03PS2) 10Jbond: service_auto_restart: fix time spec [puppet] - 10https://gerrit.wikimedia.org/r/636399 [11:50:47] (03CR) 10Jbond: service_auto_restart: fix time spec (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636399 (owner: 10Jbond) [11:51:13] (03CR) 10Kormat: [C: 03+1] service_auto_restart: fix time spec [puppet] - 10https://gerrit.wikimedia.org/r/636399 (owner: 10Jbond) [11:52:20] (03CR) 10Jbond: [C: 03+2] service_auto_restart: fix time spec [puppet] - 10https://gerrit.wikimedia.org/r/636399 (owner: 10Jbond) [11:52:33] !log 
urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bff6b37a55fe8f260fe00cbb942c53101167fb07: Add foto.digitalarkivet.no to wgCopyUploadsDomains whitelist of Wikimedia Commons (T266390) (duration: 01m 14s) [11:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:41] T266390: Add foto.digitalarkivet.no to wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T266390 [11:53:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:57:15] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001908 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:58:59] (03CR) 10Muehlenhoff: [C: 03+2] Add debian-debug repository [puppet] - 10https://gerrit.wikimedia.org/r/636040 (https://phabricator.wikimedia.org/T164819) (owner: 10Muehlenhoff) [12:03:11] 10Operations, 10Patch-For-Review: reprepro: Support for buildinfo files / dbgsym packages - https://phabricator.wikimedia.org/T164819 (10MoritzMuehlenhoff) 05Open→03Resolved dbgsym files are supported in reprepro for quite a while now and as of today, we can also install dbgsym packages from the Debian arc... 
[12:06:14] (03PS1) 10Jbond: smart: switch cron job to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636401 (https://phabricator.wikimedia.org/T265138) [12:08:29] (03PS1) 10Jbond: smart: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636402 (https://phabricator.wikimedia.org/T265138) [12:09:45] (03PS1) 10Elukey: sre.hadoop.init-hadoop-workers: add more defensive code [cookbooks] - 10https://gerrit.wikimedia.org/r/636403 (https://phabricator.wikimedia.org/T260411) [12:10:06] (03PS2) 10Jbond: smart: switch cron job to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636401 (https://phabricator.wikimedia.org/T265138) [12:10:31] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:21] 10Operations, 10Puppet: Upgrade Puppet to 5.5.21 - https://phabricator.wikimedia.org/T248168 (10MoritzMuehlenhoff) Debian unstable was updated to 5.5.22: https://packages.qa.debian.org/p/puppet/news/20201025T173952Z.html [12:20:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/636401 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:20:52] (03PS1) 10Jbond: cumin: switch check-cumin-aliases to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636404 (https://phabricator.wikimedia.org/T265138) [12:20:54] (03PS1) 10Jbond: cumin: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636405 (https://phabricator.wikimedia.org/T265138) [12:20:59] (03CR) 10Jbond: [C: 03+2] smart: switch cron job to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636401 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:21:25] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/636404 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:24:00] (03PS1) 10Volans: dns: add retry logic to all Netbox API calls 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636406 [12:24:23] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team, 10Wikimedia-production-error: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10Marostegui) >>! In T266432#6577770, @jcrespo wrote: > Adding #Wikimedia-pr... [12:24:28] (03CR) 10jerkins-bot: [V: 04-1] dns: add retry logic to all Netbox API calls [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636406 (owner: 10Volans) [12:25:18] (03PS2) 10Jbond: cumin: switch check-cumin-aliases to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636404 (https://phabricator.wikimedia.org/T265138) [12:25:27] (03CR) 10Volans: "LOL there was a patch the other day on the same topic, see I997bca1659539a048e61346ee405125da8b915c6 (I've commented there)" [puppet] - 10https://gerrit.wikimedia.org/r/636404 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:26:18] (03PS2) 10Jbond: cumin: replace check-aliases-cron with a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [12:29:49] (03CR) 10Jbond: "See inline, could you also please tag CR's related to this with `Bug: T265138` thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [12:30:07] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636404 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:30:15] (03Abandoned) 10Jbond: cumin: switch check-cumin-aliases to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636404 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:32:29] 10Operations, 10Traffic: ATS trying to set socket options SO_MARK / IP_TOS - https://phabricator.wikimedia.org/T265911 (10ema) >>! 
In T265911#6577824, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/BzaaZHUBhxWNv8gI1jEo} [2020-10-26T11:11:10Z... [12:37:53] (03PS1) 10Muehlenhoff: Remove access for rodolfovalentim [puppet] - 10https://gerrit.wikimedia.org/r/636407 [12:39:55] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for rodolfovalentim [puppet] - 10https://gerrit.wikimedia.org/r/636407 (owner: 10Muehlenhoff) [12:40:35] (03CR) 10JMeybohm: [C: 03+2] eventrouter: Various chart improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/636363 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [12:41:05] (03CR) 10JMeybohm: [C: 03+2] eventrouter: Update image version and set kubernetesApi [deployment-charts] - 10https://gerrit.wikimedia.org/r/636364 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [12:42:56] (03PS1) 10Jbond: prometheus_intel_microcode: update cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636408 (https://phabricator.wikimedia.org/T265138) [12:42:58] (03PS1) 10Jbond: prometheus_intel_microcode: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636409 (https://phabricator.wikimedia.org/T265138) [12:44:13] (03Merged) 10jenkins-bot: eventrouter: Various chart improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/636363 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [12:44:22] (03Merged) 10jenkins-bot: eventrouter: Update image version and set kubernetesApi [deployment-charts] - 10https://gerrit.wikimedia.org/r/636364 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [12:45:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/636408 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:46:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/636409 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) 
[12:48:31] (03CR) 10Effie Mouzeli: [C: 04-1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/636047 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli) [12:52:10] 10Operations, 10netops, 10Patch-For-Review: fastnetmon misreports attack type and protocol - https://phabricator.wikimedia.org/T241374 (10CDanis) 05Stalled→03Resolved a:03CDanis [12:54:58] 10Operations, 10netops: fastnetmon misreports attack type and protocol - https://phabricator.wikimedia.org/T241374 (10Nintendofan885) [12:56:59] (03PS1) 10Jbond: remote-backup-mariadb: update cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) [12:57:01] (03PS1) 10Jbond: remote-backup-mariadb: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636411 (https://phabricator.wikimedia.org/T265138) [12:57:27] (03PS2) 10Jbond: remote-backup-mariadb: update cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) [12:58:00] (03CR) 10Jbond: [C: 03+2] prometheus_intel_microcode: update cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636408 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:58:33] (03CR) 10jerkins-bot: [V: 04-1] remote-backup-mariadb: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636411 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:58:44] (03CR) 10jerkins-bot: [V: 04-1] remote-backup-mariadb: update cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [12:59:07] (03PS2) 10Jbond: prometheus_intel_microcode: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636409 (https://phabricator.wikimedia.org/T265138) [13:00:10] (03PS3) 10Jbond: remote-backup-mariadb: update cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636410 
(https://phabricator.wikimedia.org/T265138) [13:03:31] (03PS2) 10Jbond: remote-backup-mariadb: remove cron type [puppet] - 10https://gerrit.wikimedia.org/r/636411 (https://phabricator.wikimedia.org/T265138) [13:04:39] (03CR) 10CDanis: [C: 03+1] "nice, thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/636392 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [13:05:49] (03PS3) 10Ayounsi: Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) [13:07:15] (03CR) 10Kormat: "I think you want @jcrespo instead of me :)" [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [13:07:43] (03PS1) 10JMeybohm: eventrouter: Fix values for all environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/636412 (https://phabricator.wikimedia.org/T262675) [13:11:08] (03CR) 10Jbond: "> Patch Set 3:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [13:14:05] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:25] (03PS2) 10JMeybohm: eventrouter: Fix values for all environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/636412 (https://phabricator.wikimedia.org/T262675) [13:19:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [13:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:43] rzl: ^ just tested it [13:19:54] \o/ [13:20:28] (03PS1) 10Muehlenhoff: Create component/php72 for buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/636413 [13:20:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] Switch Thumbor haproxy load balancing to IP hash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [13:21:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [13:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:39] (03CR) 10JMeybohm: [C: 03+2] eventrouter: Fix values for all environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/636412 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [13:21:44] (03PS1) 10Jbond: idp: use correct server name on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/636415 [13:21:57] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:53] (03CR) 10Jbond: [C: 03+2] idp: use correct server name on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/636415 (owner: 10Jbond) [13:25:02] (03PS1) 
10Jgiannelos: Add changeprop rules for DelayeEchoNotificationJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/636416 [13:25:22] (03CR) 10Muehlenhoff: [C: 03+2] Create component/php72 for buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/636413 (owner: 10Muehlenhoff) [13:25:55] (03CR) 10Jgiannelos: [C: 04-1] "Hold until this gets merged:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/636416 (owner: 10Jgiannelos) [13:26:35] (03CR) 10Ayounsi: [C: 03+2] Add device inventory support [software/homer] - 10https://gerrit.wikimedia.org/r/636391 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [13:27:19] (03CR) 10jerkins-bot: [V: 04-1] eventrouter: Fix values for all environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/636412 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [13:28:01] (03Merged) 10jenkins-bot: Add device inventory support [software/homer] - 10https://gerrit.wikimedia.org/r/636391 (https://phabricator.wikimedia.org/T257392) (owner: 10Ayounsi) [13:28:38] (03CR) 10Jbond: [V: 03+2 C: 03+2] "> Patch Set 3: Code-Review+1" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/635848 (owner: 10Jbond) [13:28:45] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] eventrouter: Fix values for all environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/636412 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [13:29:24] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] "gate-and-submit failed because of helm race" [deployment-charts] - 10https://gerrit.wikimedia.org/r/636412 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [13:33:12] 10Operations, 10fundraising-tech-ops, 10netops, 10observability: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (10herron) 05Open→03Resolved a:03herron Nope! 
I think we're good here [13:33:19] (03PS3) 10Effie Mouzeli: Switch Thumbor haproxy load balancing to IP hash [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [13:34:20] (03PS2) 10Gehel: [wdqs] fix StreamingUpdate package name after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/636034 (https://phabricator.wikimedia.org/T255399) (owner: 10DCausse) [13:34:24] (03PS1) 10JMeybohm: eventrouter: Fix link to eventrouter helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/636418 (https://phabricator.wikimedia.org/T262675) [13:35:44] (03CR) 10Gehel: [C: 03+2] [wdqs] fix StreamingUpdate package name after refactoring [puppet] - 10https://gerrit.wikimedia.org/r/636034 (https://phabricator.wikimedia.org/T255399) (owner: 10DCausse) [13:36:43] (03PS4) 10Effie Mouzeli: Switch Thumbor haproxy load balancing to IP hash [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [13:37:07] (03CR) 10jerkins-bot: [V: 04-1] eventrouter: Fix link to eventrouter helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/636418 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [13:37:29] (03PS4) 10Kormat: orchestrator: Add topology querying config. [puppet] - 10https://gerrit.wikimedia.org/r/636052 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [13:37:35] (03CR) 10Herron: [C: 03+2] kibana: add kibana_ecs role [puppet] - 10https://gerrit.wikimedia.org/r/635864 (owner: 10Herron) [13:39:33] (03CR) 10Effie Mouzeli: "As per Alex's suggestion, we can check the X-Client-IP header. 
I will test this, but due to the upcoming switchover, it will be merged tow" [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [13:40:37] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/636418 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [13:41:05] (03CR) 10Marostegui: [C: 03+1] "Thanks! Let me know if you want to apply the grants on pc1 or you want me to" [puppet] - 10https://gerrit.wikimedia.org/r/636052 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [13:46:11] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:55] !log imported cas 6.2.4-1 to apt.wikimedia.org T265857 [13:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:01] T265857: Update CAS to 6.2 - https://phabricator.wikimedia.org/T265857 [13:49:49] PROBLEM - Check the last execution of wmf_auto_restart_librenms-ircbot on netmon2001 is CRITICAL: CRITICAL: Status of the systemd unit wmf_auto_restart_librenms-ircbot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:50:21] _joe_: getting ready to start services repool -- can you take a quick look at this before I get rolling? doesn't need to be a close read https://phabricator.wikimedia.org/P13068 [13:50:40] <_joe_> rzl: sure [13:51:03] and for anyone else following along, shortly we'll start repooling active-active services in eqiad as part of the DC switchback, this should be unimpactful but we'll be keeping an eye on things anyway [13:52:07] <_joe_> oh damn f-strings [13:52:09] <_joe_> -2 [13:53:52] <_joe_> rzl: looks ok but... 
you could've used conftool as a library :) [13:54:03] I tried that and couldn't make heads or tails of its api [13:54:04] <_joe_> jokes aside, +1 looks ok [13:54:21] literally stared at it for an hour trying to figure it out before I gave up and used subprocess [13:54:30] would love to learn some other time :D [13:54:42] <_joe_> yeah it dearly needs some documentation *cough cough* [13:55:13] ah, the classic "the code is fine, the user just can't understand it" ;) [13:55:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:55:54] <_joe_> kormat: it's not very clear what the api is from a ton of code with no documentation, but it's also not super clear :) [13:56:20] "it's not very clear.. but it's also not super clear". he says, clearly. [13:56:25] <_joe_> rzl: you have my +1 [13:56:33] <_joe_> the api / the code [13:56:52] thanks! rolling [13:57:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:00:07] haha great I forgot confctl was going to prompt me to manually confirm each write [14:00:21] let me just look up what its "skip confirmation" flag is... 
haha great [14:00:42] (03PS1) 10DCausse: [wdqs] switch wikibase:isSomeValue to skolem for wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/636420 (https://phabricator.wikimedia.org/T266470) [14:01:37] (03PS1) 10Jbond: librenms: use cron_ensure variable to install auto_restart_service [puppet] - 10https://gerrit.wikimedia.org/r/636421 [14:03:22] (03PS2) 10Jbond: librenms: use cron_ensure variable to install auto_restart_service [puppet] - 10https://gerrit.wikimedia.org/r/636421 [14:03:53] 10Operations, 10Traffic, 10Patch-For-Review: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10BBlack) Notes on the large increase in large_objects_cutoff from late last week: * Graph link: https://grafana.wikimedia.org/d/000000500/varnish-caching?viewPanel=1... [14:05:14] (03CR) 10Jcrespo: "More than comments, I have questions:" [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [14:05:25] PROBLEM - Check the last execution of wmf_auto_restart_mcelog on kubestage1001 is CRITICAL: CRITICAL: Status of the systemd unit wmf_auto_restart_mcelog https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:06:46] !log rzl@cumin1001 conftool action : set/ttl=10; selector: dnsdisc=apertium|api-gateway|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventgate-main|eventstreams|graphoid|kartotherian|mathoid|mobileapps|ores|parsoid|proton|push-notifications|recommendation-api|restbase|restbase-async|schema|search|sessionstore|termbox|wdqs|wdqs-internal|wikifeeds|zotero,name=eqiad [14:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:01] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:48] (03PS1) 10Kormat: hiera: Add fake db topology password for orchestrator [labs/private] - 10https://gerrit.wikimedia.org/r/636422 (https://phabricator.wikimedia.org/T265990) [14:09:30] (03CR) 10Jbond: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/26116/netmon2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/636421 (owner: 10Jbond) [14:10:55] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson) [14:11:46] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=apertium,name=eqiad [14:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:01] (03CR) 10Jcrespo: remote-backup-mariadb: update cron to systemd::timer::job (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [14:14:47] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=api-gateway,name=eqiad [14:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:47] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=citoid,name=eqiad [14:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:39] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team, 10Wikimedia-production-error: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10jcrespo) I am getting strange, inconsistent results every time I check, no... 
[14:20:47] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=cxserver,name=eqiad [14:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:30] (03CR) 10Kormat: [V: 03+2 C: 03+2] hiera: Add fake db topology password for orchestrator [labs/private] - 10https://gerrit.wikimedia.org/r/636422 (https://phabricator.wikimedia.org/T265990) (owner: 10Kormat) [14:23:48] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=echostore,name=eqiad [14:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:48] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics,name=eqiad [14:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:43] PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:49] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics-external,name=eqiad [14:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:14] !log ppchelko@deploy1001 Started deploy [restbase/deploy@a1a1bd7]: Add api-portal and snmwiki [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:49] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-logging-external,name=eqiad [14:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:35] (03PS5) 10Kormat: orchestrator: Add topology querying config. 
[puppet] - 10https://gerrit.wikimedia.org/r/636052 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [14:35:50] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-main,name=eqiad [14:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:04] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team, 10Wikimedia-production-error: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10thcipriani) > Was something released that 22nd Oct? Commonswiki was updat... [14:36:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:36:30] (03CR) 10Kormat: "PCC run looks good: https://puppet-compiler.wmflabs.org/compiler1001/26117/" [puppet] - 10https://gerrit.wikimedia.org/r/636052 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [14:38:50] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams,name=eqiad [14:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:01] PROBLEM - Disk space on maps2002 is CRITICAL: DISK CRITICAL - free space: /srv 62437 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2002&var-datasource=codfw+prometheus/ops [14:39:14] (03CR) 10Marostegui: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/636052 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [14:39:33] (03CR) 10Kormat: [C: 03+2] orchestrator: Add topology querying config. 
[puppet] - 10https://gerrit.wikimedia.org/r/636052 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [14:41:29] (03CR) 10Kormat: [C: 03+1] orchestrator.conf.json: Add some failover options [puppet] - 10https://gerrit.wikimedia.org/r/636387 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [14:41:51] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=graphoid,name=eqiad [14:41:51] (03PS3) 10Marostegui: orchestrator.conf.json: Add some failover options [puppet] - 10https://gerrit.wikimedia.org/r/636387 (https://phabricator.wikimedia.org/T265990) [14:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:42:16] (03CR) 10Marostegui: [C: 03+2] orchestrator.conf.json: Add some failover options [puppet] - 10https://gerrit.wikimedia.org/r/636387 (https://phabricator.wikimedia.org/T265990) (owner: 10Marostegui) [14:44:51] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [14:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:30] (03PS2) 10Elukey: sre.hadoop.init-hadoop-workers: add more defensive code [cookbooks] - 10https://gerrit.wikimedia.org/r/636403 (https://phabricator.wikimedia.org/T260411) [14:46:57] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@a1a1bd7]: Add api-portal and snmwiki (duration: 16m 43s) [14:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10Cmjohnson) [14:47:52] !log rzl@cumin1001 conftool action : set/pooled=true; selector: 
dnsdisc=mathoid,name=eqiad [14:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:52] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=mobileapps,name=eqiad [14:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:52] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=ores,name=eqiad [14:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:19] (03PS1) 10Ppchelko: Expose RESTBase for api-portal wiki. [deployment-charts] - 10https://gerrit.wikimedia.org/r/636431 (https://phabricator.wikimedia.org/T246945) [14:56:53] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=parsoid,name=eqiad [14:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:20] (03PS1) 10DCausse: [wdqs] add support for streaming updater lag metric [puppet] - 10https://gerrit.wikimedia.org/r/636432 (https://phabricator.wikimedia.org/T255399) [14:59:26] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team, 10Wikimedia-production-error: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10thcipriani) Adding @LarsWirzenius in-case he remembers anything deploying... 
[14:59:53] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=proton,name=eqiad [14:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:21] PROBLEM - IPMI Sensor Status on wtp1033 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:02:54] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=push-notifications,name=eqiad [15:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:44] (03PS1) 10Urbanecm: Add growthexperiments to allowed logtypes [puppet] - 10https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) [15:05:54] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=recommendation-api,name=eqiad [15:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:38] 10Operations, 10Wikimedia-Mailing-lists: Mailing list request for the new User Group Wiki World Heritage - https://phabricator.wikimedia.org/T266478 (10Yamen) [15:08:55] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=restbase,name=eqiad [15:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:06] 10Operations, 10Puppet: Puppet Proposal to remove require_packages - https://phabricator.wikimedia.org/T266479 (10jbond) [15:09:17] 10Operations, 10Wikimedia-Mailing-lists: Mailing list request for the new User Group Wiki World Heritage - https://phabricator.wikimedia.org/T266478 (10Dyolf77) [15:09:22] 10Operations, 10Wikimedia-Mailing-lists: Mailing list request for the new User Group Wiki World Heritage - https://phabricator.wikimedia.org/T266478 (10Yamen) [15:09:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:10:05] PROBLEM - SSH on ms-be2035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:11:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:11:51] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:11:55] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async,name=eqiad [15:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:51] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:14:56] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=schema,name=eqiad [15:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:56] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad [15:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:31] (03CR) 10Jbond: remote-backup-mariadb: update cron to systemd::timer::job (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:20:57] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=eqiad [15:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:19] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [15:23:57] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=termbox,name=eqiad [15:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:26:57] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=eqiad [15:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:29:58] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal,name=eqiad [15:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:20] (03CR) 10Jcrespo: "> Patch Set 3:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:30:23] 10Operations, 10Puppet: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10MoritzMuehlenhoff) [15:32:58] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wikifeeds,name=eqiad [15:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:59] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=zotero,name=eqiad [15:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:28] that's all services repooled in eqiad, waiting a moment to check things out before restoring TTLs 
[15:36:56] (03CR) 10Jbond: "updated thanks" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:37:17] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10RobH) [15:37:42] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10RobH) [15:38:00] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10RobH) [15:39:22] (03CR) 10Jcrespo: "To summarize: let's update the description, and either keep the dependency, or create the directory on puppet and change the package depend" [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:40:11] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [15:40:42] (03CR) 10Ppchelko: [C: 04-1] "No, this is incorrect, Varnish doesn't do this rewrite for us.." [deployment-charts] - 10https://gerrit.wikimedia.org/r/636431 (https://phabricator.wikimedia.org/T246945) (owner: 10Ppchelko) [15:40:47] (03CR) 10Jcrespo: "ah, and whatever is best between ensure_packages and the other, whatever you prefer." [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:41:00] (03PS1) 10Muehlenhoff: Don't limit PHP72 pbuilder hook to stretch [puppet] - 10https://gerrit.wikimedia.org/r/636451 [15:45:58] 10Operations, 10SRE-swift-storage: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10lmata) I'm going to untag Observability for now as this is more swift related and less o11y related. 
:-) if this changes please retag [15:46:32] (03CR) 10Muehlenhoff: [C: 03+2] Don't limit PHP72 pbuilder hook to stretch [puppet] - 10https://gerrit.wikimedia.org/r/636451 (owner: 10Muehlenhoff) [15:51:06] !log rzl@cumin1001 conftool action : set/ttl=300; selector: dnsdisc=apertium|api-gateway|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventgate-main|eventstreams|graphoid|kartotherian|mathoid|mobileapps|ores|parsoid|proton|push-notifications|recommendation-api|restbase|restbase-async|schema|search|sessionstore|termbox|wdqs|wdqs-internal|wikifeeds|zotero,name=eqiad [15:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:22] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10fdans) [15:52:50] (03CR) 10Jcrespo: "I think this is ready for +1 on my side when you upload the latest patch." [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:55:30] (03PS2) 10Ppchelko: Expose RESTBase for api-portal wiki. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/636431 (https://phabricator.wikimedia.org/T246945) [15:55:33] (03CR) 10JMeybohm: [C: 03+2] eventrouter: Fix link to eventrouter helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/636418 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [15:55:47] (03PS4) 10Jbond: remote-backup-mariadb: update cron to systemd::timer::job [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) [15:56:44] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:58:20] (03Merged) 10jenkins-bot: eventrouter: Fix link to eventrouter helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/636418 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [15:59:12] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10Cmjohnson) [16:00:09] (03PS1) 10Kormat: mariadb: Enable report_host [puppet] - 10https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) [16:00:15] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:04] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Enable report_host [puppet] - 10https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) (owner: 10Kormat) [16:01:07] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:53] (03CR) 10Dzahn: [C: 03+2] orchestrator: add monitoring for process and TCP port [puppet] - 10https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338) (owner: 10Dzahn) [16:02:22] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10ssastry) 05Resolved→03Open I ran into this repeatedly a few mins back. [16:02:32] (03PS1) 10Lucas Werkmeister (WMDE): Enable propagatePageDeletion on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636453 [16:03:53] (03CR) 10Hnowlan: [C: 03+1] Expose RESTBase for api-portal wiki. [deployment-charts] - 10https://gerrit.wikimedia.org/r/636431 (https://phabricator.wikimedia.org/T246945) (owner: 10Ppchelko) [16:04:56] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10LSobanski) [16:05:30] (03PS2) 10Kormat: mariadb: Enable report_host [puppet] - 10https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) [16:06:59] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team, 10Wikimedia-production-error: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10Marostegui) We just had another huge spike of DELETEs {F32414787} [16:07:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:19] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:43] (03CR) 10Dzahn: "on dborch1001 - new NRPE checks created" [puppet] - 10https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338) (owner: 10Dzahn) [16:09:49] 
10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [16:10:05] (03CR) 10Bstorm: "Thanks! Luckily, our first cookbook uses the wikireplicas-all alias, so the server wasn't missed in the initial run." [puppet] - 10https://gerrit.wikimedia.org/r/636016 (owner: 10Jcrespo) [16:10:10] (03CR) 10Jcrespo: [C: 03+1] "Lets deploy tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [16:12:07] (03PS3) 10Kormat: mariadb: Enable report_host [puppet] - 10https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) [16:12:52] ACKNOWLEDGEMENT - Long running screen/tmux on elastic2049 is CRITICAL: CRIT: Long running tmux process. (user: ebernhardson PID: 64788, 1790609s 1728000s). Gehel ongoing investigation into GC issues on elastic2049 https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [16:14:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:59] (03CR) 10Kormat: "PCC run looks good: https://puppet-compiler.wmflabs.org/compiler1001/26119/" [puppet] - 10https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) (owner: 10Kormat) [16:18:59] (03PS1) 10Dzahn: etherpad: reduce rate limiting window from 10s to 1s [puppet] - 10https://gerrit.wikimedia.org/r/636458 (https://phabricator.wikimedia.org/T265490) [16:22:23] (03CR) 10Marostegui: "Nice! Even though this shouldn't trigger anything, let's deploy on Thursday once the maintenance can come back to eqiad?" [puppet] - 10https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) (owner: 10Kormat) [16:23:35] (03CR) 10Kormat: [C: 04-2] "Don't merge before we're out of maintenance window." 
[puppet] - 10https://gerrit.wikimedia.org/r/636452 (https://phabricator.wikimedia.org/T266483) (owner: 10Kormat) [16:24:00] (03CR) 10Bstorm: [C: 03+1] "The general sense here was to remove it on the secondary because it won't work there. However, I'm not sure there is puppet code anywhere t" [puppet] - 10https://gerrit.wikimedia.org/r/636089 (owner: 10Dzahn) [16:27:43] 10Operations, 10Scap, 10serviceops, 10Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Make a way to build Scap .deb in Docker - https://phabricator.wikimedia.org/T265501 (10jijiki) p:05High→03Low [16:28:59] PROBLEM - Check systemd state on idp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:31] 10Operations, 10Commons, 10DBA, 10Release-Engineering-Team, 10Wikimedia-production-error: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10LarsWirzenius) @thcipriani Sorry, I have no recollection that anything tha... 
[16:29:46] !log set security-log traceoptions on pfw3-eqiad - T263833 [16:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:13] (03PS1) 10Dzahn: service_auto_restart: disable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/636459 [16:32:03] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/636459 (owner: 10Dzahn) [16:32:25] (03CR) 10Bstorm: [C: 03+1] toolforge: Install pack and buildpacks repo on image builder [puppet] - 10https://gerrit.wikimedia.org/r/636103 (https://phabricator.wikimedia.org/T266270) (owner: 10Legoktm) [16:32:45] PROBLEM - Check the last execution of wmf_auto_restart_cas on idp2001 is CRITICAL: CRITICAL: Status of the systemd unit wmf_auto_restart_cas https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:32:49] (03CR) 10Dzahn: [C: 03+2] service_auto_restart: disable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/636459 (owner: 10Dzahn) [16:41:45] !log bounce security log on pfw3-eqiad - T263833 [16:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:21] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:50:45] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:09] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:00:04] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201026T1700). 
[17:00:27] 10Operations, 10ops-eqiad, 10netops, 10Sustainability (Incident Followup): eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10wiki_willy) [17:02:10] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) (owner: 10Ayounsi) [17:06:50] (03CR) 10Bstorm: [C: 03+2] toolforge: Install pack and buildpacks repo on image builder [puppet] - 10https://gerrit.wikimedia.org/r/636103 (https://phabricator.wikimedia.org/T266270) (owner: 10Legoktm) [17:08:27] (03CR) 10Krinkle: [C: 03+2] mediawiki.util: Use mw.util rather than 'this' [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635981 (https://phabricator.wikimedia.org/T265809) (owner: 10Krinkle) [17:09:00] PROBLEM - Check the last execution of wmf_auto_restart_mcelog on an-worker1096 is CRITICAL: CRITICAL: Status of the systemd unit wmf_auto_restart_mcelog https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:10:11] * Krinkle staging on mwdebug2001 [17:10:44] PROBLEM - Check systemd state on an-worker1098 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:46] 10Operations, 10ops-eqiad, 10netops, 10User-Kormat, 10User-jijiki: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10wiki_willy) [17:12:34] errors related to wmf_auto_restart are probably race conditions right now [17:13:54] running puppet on icinga servers to get them removed [17:23:41] (03CR) 10Dzahn: [C: 03+2] etherpad: reduce rate limiting window from 10s to 1s [puppet] - 10https://gerrit.wikimedia.org/r/636458 (https://phabricator.wikimedia.org/T265490) (owner: 10Dzahn) [17:23:43] 10Operations, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [17:24:06] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [17:24:21] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10ovasileva) [17:24:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10Cmjohnson) @robh the circuit at 17/18 with ID 21557287 is connected to cr1 xe-3/2/1 with fiber number 2648 [17:24:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10Cmjohnson) forgot to add I do not have a link light [17:25:15] (03CR) 10Dzahn: [C: 03+2] openstack::wikitech::web: remove absented cron TODO [puppet] - 10https://gerrit.wikimedia.org/r/636096 (owner: 10Dzahn) [17:25:52] (03CR) 10Dzahn: [C: 03+2] openstack::designate::dns_floating_ip_updater: remove absented cron TODO [puppet] - 10https://gerrit.wikimedia.org/r/636093 (owner: 10Dzahn) [17:26:47] (03CR) 10Dzahn: [C: 03+2] 
toolforge::clush::master: remove absented cron TODO [puppet] - 10https://gerrit.wikimedia.org/r/636088 (owner: 10Dzahn) [17:27:28] (03CR) 10Dzahn: [C: 03+2] librenms: remove absented and obsoleted cron [puppet] - 10https://gerrit.wikimedia.org/r/636091 (owner: 10Dzahn) [17:27:36] (03PS2) 10Dzahn: librenms: remove absented and obsoleted cron [puppet] - 10https://gerrit.wikimedia.org/r/636091 [17:29:36] (03CR) 10Dzahn: [C: 03+2] wmcs::nfs::primary: remove absented cron TODO [puppet] - 10https://gerrit.wikimedia.org/r/636089 (owner: 10Dzahn) [17:30:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [17:31:28] 10Operations, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) >>! In T265916#6579218, @Cmjohnson wrote: > forgot to add I do not have a link light I show good RX light for the connection. Laser receiver power :... [17:31:38] 10Operations, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) a:05Cmjohnson→03ayounsi [17:32:38] checking in for backport window [17:32:47] 10Operations, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) I've updated the circuit (with its circuit id) and updated the cable (with its cable id and set to status connected) [17:33:08] (03CR) 10Ema: [C: 03+1] stats: switch analytics sites to use Envoy on port 8443 [puppet] - 10https://gerrit.wikimedia.org/r/634669 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [17:33:13] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:33:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_mobileapps_cluster_codfw,swagger_check_restbase_esams} site={codfw,esams} 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:34:02] 10Operations, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [17:34:12] 10Operations, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [17:35:03] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10Dzahn) >>! In T265490#6578828, @ssastry wrote: > I ran into this repeatedly a few mins back. Apparently it still happens on Monday mornings but not other times because these are the m... [17:35:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:36:19] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:36:29] (03Merged) 10jenkins-bot: mediawiki.util: Use mw.util rather than 'this' [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635981 (https://phabricator.wikimedia.org/T265809) (owner: 10Krinkle) [17:37:25] (03PS1) 10Cmjohnson: Add new es servers to site.pp setup role and mac addresses to dhcp [puppet] - 10https://gerrit.wikimedia.org/r/636467 (https://phabricator.wikimedia.org/T260370) [17:37:39] PROBLEM - Check systemd state on ms-be2050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:18] (03CR) 10Dzahn: "new checks are in Icinga https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=orchestrator" [puppet] - 10https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338) (owner: 10Dzahn) [17:39:31] !log krinkle@deploy1001 Synchronized php-1.36.0-wmf.13/resources/src/mediawiki.util/: T265809, I1011f63ae61f5a6 (duration: 01m 00s) [17:39:32] (03CR) 10Krinkle: "Confirmed on enwiki that addPortletLink('p-tb', '') throws 'Uncaught TypeError: this is undefined', and verified on mwdebug2001 with this" [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635981 (https://phabricator.wikimedia.org/T265809) (owner: 10Krinkle) [17:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:38] T265809: addPortletLink function throws uncaught error when called by reference - https://phabricator.wikimedia.org/T265809 [17:39:43] (03CR) 10Cmjohnson: [C: 03+2] Add new es servers to site.pp setup role and mac addresses to dhcp [puppet] - 10https://gerrit.wikimedia.org/r/636467 (https://phabricator.wikimedia.org/T260370) (owner: 10Cmjohnson) [17:40:02] 10Operations, 10DBA, 10User-Kormat: orchestrator: Add service monitoring - https://phabricator.wikimedia.org/T266338 (10Dzahn) New checks have been added to Icinga: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=orchestrator But notifications for everything on this new host are disabl... 
[17:44:05] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:42] !log releases2002,netmon2001, various other hosts - systemctl reset-failed to clear Icinga alerts related to wmf_auto_restart changes [17:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:27] PROBLEM - Check systemd state on an-worker1099 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:45] (03PS6) 10ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) [17:47:07] (03CR) 10jerkins-bot: [V: 04-1] per job batches file with locking and methods for claiming jobs etc [dumps] - 10https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [17:47:51] RECOVERY - Check systemd state on an-worker1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:06] !log an-worker109* - systemctl reset-failed to clear Icinga alerts related to wmf_auto_restart changes [17:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:11] RECOVERY - Check systemd state on an-worker1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:17] RECOVERY - Check systemd state on idp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:47] (03PS1) 10Jdlrobson: Fix logic in collapsibleTabs code [skins/Vector] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/636377 (https://phabricator.wikimedia.org/T71729) [17:48:58] (03CR) 10CRusnov: "This 
change is ready for review." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/636464 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [17:49:05] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:17] RECOVERY - Check systemd state on an-worker1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:23] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:13] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:20] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Kanban): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) [17:50:57] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:31] RECOVERY - Check systemd state on ms-be2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:58] could we remove the logstash2* alerts? They have been CRIT for over 17 days now [17:54:14] by remove i mean "get out of unhandled list" [17:54:59] (03CR) 10Ppchelko: [C: 03+2] Expose RESTBase for api-portal wiki. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/636431 (https://phabricator.wikimedia.org/T246945) (owner: 10Ppchelko) [17:55:49] (03CR) 10Dzahn: [C: 03+2] cassandra: remove absented metrics collector cron [puppet] - 10https://gerrit.wikimedia.org/r/636092 (owner: 10Dzahn) [17:55:55] (03PS2) 10Dzahn: cassandra: remove absented metrics collector cron [puppet] - 10https://gerrit.wikimedia.org/r/636092 [17:57:45] (03PS1) 10Bstorm: toolsdb: Fail over toolsdb to its replica [puppet] - 10https://gerrit.wikimedia.org/r/636468 (https://phabricator.wikimedia.org/T263679) [17:57:48] (03Merged) 10jenkins-bot: Expose RESTBase for api-portal wiki. [deployment-charts] - 10https://gerrit.wikimedia.org/r/636431 (https://phabricator.wikimedia.org/T246945) (owner: 10Ppchelko) [17:58:07] (03CR) 10jerkins-bot: [V: 04-1] toolsdb: Fail over toolsdb to its replica [puppet] - 10https://gerrit.wikimedia.org/r/636468 (https://phabricator.wikimedia.org/T263679) (owner: 10Bstorm) [17:58:24] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . [17:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:31] (03CR) 10Bstorm: [C: 04-1] "Setting my own review to -1 until failover time" [puppet] - 10https://gerrit.wikimedia.org/r/636468 (https://phabricator.wikimedia.org/T263679) (owner: 10Bstorm) [18:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201026T1800). [18:00:05] RoanKattouw, cscott, and jdlrobson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:05] (03PS2) 10Bstorm: toolsdb: Fail over toolsdb to its replica [puppet] - 10https://gerrit.wikimedia.org/r/636468 (https://phabricator.wikimedia.org/T263679) [18:00:20] It should take only one deployer, and that's me! [18:00:35] hi RoanKattouw [18:00:38] ! [18:01:16] (03CR) 10Catrope: [C: 03+2] Bump wikimedia/parsoid to v0.13.0-a13 [vendor] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635782 (https://phabricator.wikimedia.org/T266285) (owner: 10C. Scott Ananian) [18:01:18] i put parsoid on yr queue, don't know if you noticed [18:01:27] That patch I just +2ed? [18:01:31] i guess you did! [18:01:56] Thanks for pointing out that I need to do a manual submodule update [18:03:52] thanks for doing my patches first :) [18:04:47] (im here when needed) [18:04:59] (03PS2) 10Catrope: GrowthExperiments: Make variant D the default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635669 (https://phabricator.wikimedia.org/T265556) [18:05:27] (03CR) 10Catrope: [C: 03+2] Fix logic in collapsibleTabs code [skins/Vector] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/636377 (https://phabricator.wikimedia.org/T71729) (owner: 10Jdlrobson) [18:05:46] cscott: Well, I'm doing my config change first, but I +2 wmf patches ahead of time because CI is slow [18:05:55] (03CR) 10Catrope: [C: 03+2] Revert "Revert "Make variant D the default, and remove variant A"" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635595 (https://phabricator.wikimedia.org/T265372) (owner: 10Catrope) [18:06:01] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Make variant D the default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635669 (https://phabricator.wikimedia.org/T265556) (owner: 10Catrope) [18:06:20] makes sense [18:07:05] (03Merged) 10jenkins-bot: GrowthExperiments: Make variant D the default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635669 
(https://phabricator.wikimedia.org/T265556) (owner: 10Catrope) [18:07:12] Since yours needs a separate submodule update commit that then also has to go through CI, you'll probably end up going last [18:07:27] But I try to parallelize as much as I can [18:09:49] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/26120/" [puppet] - 10https://gerrit.wikimedia.org/r/635666 (owner: 10Dzahn) [18:10:51] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Make variant D the default on all wikis (T265556) (duration: 00m 58s) [18:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:57] T265556: Variant tests: roll out variant C/D - https://phabricator.wikimedia.org/T265556 [18:11:26] jouncebot: now [18:11:26] For the next 0 hour(s) and 48 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201026T1800) [18:11:51] RoanKattouw: is the window full? [18:12:05] mutante: Feel free to add something [18:12:27] There are a few patches but not a huge number [18:12:46] cool, thanks. doing [18:12:50] Bonus points if you have a config patch, those are basically free since right now I'm just twiddling my thumbs waiting for CI [18:13:32] InitialiseSettings.php and harmless. just removes decom'ed hosts from a whitelist [18:14:05] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/634361, right? 
[18:15:40] yes, that one [18:15:55] my problem right now I cant find my phone to get past 2fa login :p [18:16:00] :D [18:16:18] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (10RKemper) 05Open→03Declined [18:16:22] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (10RKemper) Closing this ticket because we ended up changing the alert thresholds which removes the need to re-shard the index [18:16:32] OK I'll deploy it [18:16:34] thanks [18:16:45] (03PS1) 10Bstorm: toolsdb: remove temporarily replication filters [puppet] - 10https://gerrit.wikimedia.org/r/636469 (https://phabricator.wikimedia.org/T257274) [18:17:18] (03PS2) 10Bstorm: toolsdb: remove temporary replication filters [puppet] - 10https://gerrit.wikimedia.org/r/636469 (https://phabricator.wikimedia.org/T257274) [18:17:29] (03PS1) 10RLazarus: switchdc: Run Puppet on DB masters after setting read-write [cookbooks] - 10https://gerrit.wikimedia.org/r/636471 (https://phabricator.wikimedia.org/T261767) [18:17:52] (03PS2) 10RLazarus: switchdc: Run Puppet on DB masters after setting read-write [cookbooks] - 10https://gerrit.wikimedia.org/r/636471 (https://phabricator.wikimedia.org/T261767) [18:18:10] (03CR) 10Catrope: [C: 03+2] remove wtp2001-wtp2020 from LinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634361 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [18:18:23] 10Operations, 10Traffic, 10Patch-For-Review: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (10AntiCompositeNumber) We're getting a few OTRS tickets about this, a note in Tech News or on wikitech-l would have been appreciated. 
[18:18:25] (03PS3) 10Catrope: remove wtp2001-wtp2020 from LinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634361 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [18:18:30] (03CR) 10Catrope: [C: 03+2] remove wtp2001-wtp2020 from LinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634361 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [18:19:14] 10Operations, 10Commons, 10DBA, 10Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10thcipriani) Here is the changelog of all patchsets that went out last week: https://www.mediaw... [18:19:35] (03Merged) 10jenkins-bot: remove wtp2001-wtp2020 from LinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634361 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [18:20:57] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (10Urbanecm) #user-notice is definitely warranted [18:21:38] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove wtp2001-wtp2020 from LinterSubmitterWhitelist (T265558) (duration: 00m 59s) [18:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:44] T265558: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 [18:22:32] 10Operations, 10ops-eqiad, 10DC-Ops: fix/replace cable ID 2648 on FB peering patch - cable report error - https://phabricator.wikimedia.org/T266497 (10RobH) [18:22:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cloud_dev_pdns_rec site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:22:49] (03CR) 10Volans: [C: 03+1] "LGTM, it can also be tested for 
real as it's harmless" [cookbooks] - 10https://gerrit.wikimedia.org/r/636471 (https://phabricator.wikimedia.org/T261767) (owner: 10RLazarus) [18:23:07] RoanKattouw: thank you, just added it to wikitech now [18:23:19] there is no test because those hosts are already down [18:23:24] mutante, parse2001 / parse2002 are now the canaries right? [18:23:29] subbu: correct [18:23:34] k [18:26:13] 10Operations, 10ops-eqiad, 10DC-Ops: fix/replace cable ID 2648 on FB peering patch - cable report error - https://phabricator.wikimedia.org/T266497 (10RobH) So to find an available cable ID at a given site, I do the following: https://netbox.wikimedia.org/dcim/cables/ > input just a single site > https://n... [18:26:37] 10Operations, 10SRE-Access-Requests: Requesting access to Prod ssh access for calbon - https://phabricator.wikimedia.org/T266498 (10calbon) [18:27:06] (03CR) 10RLazarus: [C: 03+2] switchdc: Run Puppet on DB masters after setting read-write [cookbooks] - 10https://gerrit.wikimedia.org/r/636471 (https://phabricator.wikimedia.org/T261767) (owner: 10RLazarus) [18:27:34] bbiab [18:28:13] 10Operations, 10ops-eqiad, 10DC-Ops: fix/replace cable ID 2648 on FB peering patch - cable report error - https://phabricator.wikimedia.org/T266497 (10RobH) [18:28:31] (03Merged) 10jenkins-bot: switchdc: Run Puppet on DB masters after setting read-write [cookbooks] - 10https://gerrit.wikimedia.org/r/636471 (https://phabricator.wikimedia.org/T261767) (owner: 10RLazarus) [18:29:23] (03PS2) 10Volans: dns: add retry logic to all Netbox API calls [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636406 [18:30:39] (03CR) 10Bstorm: toolforge: script to make long-running processes on bastions less good (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [18:31:25] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to v0.13.0-a13 [vendor] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635782 
(https://phabricator.wikimedia.org/T266285) (owner: 10C. Scott Ananian) [18:32:01] (03CR) 10Ottomata: [C: 03+1] stats: switch analytics sites to use Envoy on port 8443 [puppet] - 10https://gerrit.wikimedia.org/r/634669 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [18:32:55] (03CR) 10Razzi: [C: 03+2] stats: switch analytics sites to use Envoy on port 8443 [puppet] - 10https://gerrit.wikimedia.org/r/634669 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [18:33:38] RoanKattouw: still free for a quick config change? [18:34:03] Sure, give me a second though [18:34:06] sure [18:34:10] I need to upload it first [18:34:10] Add it to the wiki page in the meantime [18:35:29] (03PS1) 10Urbanecm: Add www.legislation.gov.uk to $wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636474 (https://phabricator.wikimedia.org/T265690) [18:36:44] RoanKattouw: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1886035&oldid=1886034 [18:37:29] RoanKattouw: is my change synced? I have to drop off in about 10 mins and would like to test before I do. 
[18:38:30] (03Merged) 10jenkins-bot: Fix logic in collapsibleTabs code [skins/Vector] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/636377 (https://phabricator.wikimedia.org/T71729) (owner: 10Jdlrobson) [18:38:33] (03Merged) 10jenkins-bot: Revert "Revert "Make variant D the default, and remove variant A"" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635595 (https://phabricator.wikimedia.org/T265372) (owner: 10Catrope) [18:38:47] oh it hadn't merged :) [18:38:58] (03PS1) 10Andrew Bogott: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) [18:39:14] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10gsingers) Having never dealt with this before here, can someone clarify for me what exactly keeping access entails and how it compares to our policies? In other words, what am I be... [18:39:23] Ugh finally they're all merged [18:40:17] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [18:40:24] Jdlrobson: Your patch is now on mwdebug2001, please test [18:40:31] (03PS1) 10Urbanecm: Configure $wgBabelCategoryNames for ndswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636477 (https://phabricator.wikimedia.org/T264990) [18:40:41] cscott: Yours is too. 
Apparently the manual submodule update thing wasn't needed [18:41:07] (03CR) 10Catrope: [C: 03+2] Add www.legislation.gov.uk to $wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636474 (https://phabricator.wikimedia.org/T265690) (owner: 10Urbanecm) [18:41:27] RoanKattouw, so, we cannot test it via mwdebug* since parsoid uses the servers parse20** [18:41:32] RoanKattouw: on it [18:41:49] subbu: Oh right. So should I just deploy it then? [18:41:56] RoanKattouw: it works! [18:42:02] so i suppose you can as well push that through since we also cannot tell rb to issue requests to specific parsoid servers. [18:42:04] ya [18:42:20] we'll test and if it looks bad, will need a rollback. [18:42:21] 10Operations, 10Performance-Team, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10dpifke) The Swift object-expirer is running in beta if we want to start testing this there. There are some loose ends bef... [18:43:01] OK great, I'm now syncing Jon's change but I'll do yours next [18:43:05] k [18:43:25] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.14/skins/Vector/: Fix logic in collapsibleTabs code (T71729) (duration: 00m 58s) [18:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:31] T71729: [collapsibleTabs] If a tab's width changes after initial page load, endless animation loop can happen - https://phabricator.wikimedia.org/T71729 [18:43:42] cscott is away for another 15 mins .. so i am standing in for him. [18:44:29] (03PS2) 10Andrew Bogott: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) [18:45:15] Yeah sorry it took so long [18:45:20] I blame Jenkins [18:45:34] RoanKattouw, oh, no problem. i was mostly clarifying why i was responding instead of scott. 
:) [18:45:48] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [18:46:03] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.14/vendor/wikimedia/parsoid/: Bump wikimedia/parsoid to v0.13.0-a13, enabling 6-element DSRs (T266285) (duration: 00m 58s) [18:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:09] T266285: Deploy 6-element DSR to prod - https://phabricator.wikimedia.org/T266285 [18:46:14] (03PS3) 10Andrew Bogott: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) [18:46:44] (03Merged) 10jenkins-bot: Add www.legislation.gov.uk to $wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636474 (https://phabricator.wikimedia.org/T265690) (owner: 10Urbanecm) [18:47:01] (thanks subbu and RoanKattouw !) [18:47:32] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [18:47:53] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.14/extensions/GrowthExperiments/: Make variant D the default, remove variant A (T265372, T265556) (duration: 00m 58s) [18:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:59] T265556: Variant tests: roll out variant C/D - https://phabricator.wikimedia.org/T265556 [18:48:00] T265372: Variant C/D: configuration control - https://phabricator.wikimedia.org/T265372 [18:48:35] Urbanecm: Your copy upload domains patch is now on mwdebug2001 in case you want to test it, but it's probably safe to sync directly. 
Let me know what you want to do [18:48:47] (03PS4) 10Andrew Bogott: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) [18:48:53] should be possible to see whether it displays at Special:Upload [18:49:00] thanks RoanKattouw [18:49:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:49:09] or Special:GWToolset [18:49:16] RoanKattouw, so, is the code deployed then? [18:49:26] subbu: Yes, sorry for clarifying [18:49:51] k [18:49:55] logmsgbot's message at 13:46:04 your time indicated that it had been deployed [18:50:04] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [18:50:12] Urbanecm: OK, please do that and let me know [18:50:12] RoanKattouw: works fine [18:50:16] Great, deploying [18:50:20] thanks [18:50:35] sorry it took so long, was testing at mwdebug2002 :/ [18:50:48] (03PS5) 10Andrew Bogott: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) [18:51:10] (03PS2) 10Catrope: Configure $wgBabelCategoryNames for ndswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636477 (https://phabricator.wikimedia.org/T264990) (owner: 10Urbanecm) [18:51:20] (03CR) 10Catrope: [C: 03+2] Configure $wgBabelCategoryNames for ndswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636477 (https://phabricator.wikimedia.org/T264990) (owner: 10Urbanecm) [18:51:38] Ha, understandable mistake [18:51:44] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add 
www.legislation.gov.uk to $wgCopyUploadsDomains on commonswiki (T265690) (duration: 00m 58s) [18:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:50] T265690: Please add www.legislation.gov.uk to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T265690 [18:51:53] Today is also the last day we'll be testing on 2001, we'll be back to 1001 on Wednesday [18:52:03] great! [18:52:08] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [18:52:11] * Urbanecm sets a mental reminder to refresh muscle memory [18:52:14] (And there are no deploys tomorrow, because of the switch) [18:52:14] (03Merged) 10jenkins-bot: Configure $wgBabelCategoryNames for ndswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636477 (https://phabricator.wikimedia.org/T264990) (owner: 10Urbanecm) [18:52:43] Urbanecm: Next up, your ndswiki patch is ready for testing [18:52:47] thanks [18:52:53] (03PS3) 10Catrope: GrowthExperiments: Remove variant setting override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635371 (https://phabricator.wikimedia.org/T265556) [18:53:04] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Remove variant setting override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635371 (https://phabricator.wikimedia.org/T265556) (owner: 10Catrope) [18:53:28] RoanKattouw: works fine [18:53:47] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:53:56] (03Merged) 10jenkins-bot: GrowthExperiments: Remove variant setting override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635371 (https://phabricator.wikimedia.org/T265556) (owner: 10Catrope) [18:55:46] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Configure $wgBabelCategoryNames on 
ndswiki (T264990) (duration: 00m 58s) [18:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:52] T264990: Configure $wgBabelCategoryNames for nds.wp - https://phabricator.wikimedia.org/T264990 [18:55:54] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs, 10iOS-app-v6.8-Manta-Ray-On-A-Riding-Mower: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10JMinor) [18:56:02] thanks - that's all from me [18:57:19] (03CR) 10Dzahn: [C: 03+2] "even though this touches production.pp it is actually still cloud-only because that's "labspuppetbackend"" [puppet] - 10https://gerrit.wikimedia.org/r/634387 (https://phabricator.wikimedia.org/T256972) (owner: 10Dzahn) [18:59:01] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:59:37] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Remove variant setting override (no-op) (T265556) (duration: 00m 57s) [18:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:43] T265556: Variant tests: roll out variant C/D - https://phabricator.wikimedia.org/T265556 [18:59:47] And that's the last patch! All done [19:00:28] (03PS2) 10Dzahn: ldap::client::labs: fix 'Unknown variable: '::restricted..' 
[puppet] - 10https://gerrit.wikimedia.org/r/633838 (https://phabricator.wikimedia.org/T101447) [19:00:57] \o/ [19:04:22] (03PS7) 10Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - 10https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) [19:04:49] (03CR) 10Bstorm: [C: 04-1] "nobody merge until failover time" [puppet] - 10https://gerrit.wikimedia.org/r/636469 (https://phabricator.wikimedia.org/T257274) (owner: 10Bstorm) [19:05:42] (03CR) 10Bstorm: toolforge: script to make long-running processes on bastions less good (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [19:18:51] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:20:29] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:26:28] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [19:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:25] (03PS8) 10Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - 10https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) [19:28:11] (03PS1) 10Ebernhardson: Increase cirrus morelike pool counter by 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636480 [19:28:42] FYI: starting a dry run of the sre.switchdc.mediawiki cookbook, no effect on production expected [19:29:18] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[19:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:11] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [19:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:30] (03CR) 10Andrew Bogott: [C: 03+1] "The only place I know of this functionality being used is in the bastion project:" [puppet] - 10https://gerrit.wikimedia.org/r/633838 (https://phabricator.wikimedia.org/T101447) (owner: 10Dzahn) [19:39:13] PROBLEM - Check systemd state on mw1381 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:18] (03PS6) 10Andrew Bogott: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) [19:42:20] (03PS1) 10Andrew Bogott: does this help? [puppet] - 10https://gerrit.wikimedia.org/r/636486 [19:42:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:43:09] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [19:44:11] (03PS1) 10RLazarus: 04-switch-mediawiki: Fix a backwards minus sign. [cookbooks] - 10https://gerrit.wikimedia.org/r/636487 [19:44:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:48:39] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:26] (03PS7) 10Andrew Bogott: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) [19:56:13] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [19:56:18] (03Abandoned) 10Andrew Bogott: does this help? [puppet] - 10https://gerrit.wikimedia.org/r/636486 (owner: 10Andrew Bogott) [19:59:24] (03PS1) 10RLazarus: 08-run-puppet-on-db-masters: Correct docstring [cookbooks] - 10https://gerrit.wikimedia.org/r/636490 [19:59:47] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:00:04] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201026T2000). 
[20:00:05] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:01:50] (03CR) 10CRusnov: "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636406 (owner: 10Volans) [20:01:52] (03CR) 10CRusnov: [C: 03+1] dns: add retry logic to all Netbox API calls [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636406 (owner: 10Volans) [20:01:54] (03PS8) 10Andrew Bogott: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) [20:04:57] deploying ores now [20:05:07] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/636490 (owner: 10RLazarus) [20:05:39] (03CR) 10Dzahn: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [20:06:43] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/636487 (owner: 10RLazarus) [20:07:07] commit id to revert: 8540eec99d5506bbd2ce6c876ea4f1bd343c0524 [20:08:37] !log ladsgroup@deploy1001 Started deploy [ores/deploy@6912889]: Deploy new version of articlequality for wikidata (T261326) [20:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:44] T261326: Deploy the retrained model - https://phabricator.wikimedia.org/T261326 [20:09:50] (03CR) 10Dzahn: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [20:11:21] (03CR) 10Dzahn: "This seems to work: 
https://puppet-compiler.wmflabs.org/compiler1001/26137/wikistats-dancing-goat.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [20:12:17] (03CR) 10RLazarus: [C: 03+2] 08-run-puppet-on-db-masters: Correct docstring [cookbooks] - 10https://gerrit.wikimedia.org/r/636490 (owner: 10RLazarus) [20:12:50] (03PS1) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) [20:12:52] (03PS1) 10Subramanya Sastry: Get rid of update_parsoid.sh script that is no longer needed [puppet] - 10https://gerrit.wikimedia.org/r/636494 [20:13:39] (03Merged) 10jenkins-bot: 08-run-puppet-on-db-masters: Correct docstring [cookbooks] - 10https://gerrit.wikimedia.org/r/636490 (owner: 10RLazarus) [20:14:13] (03CR) 10jerkins-bot: [V: 04-1] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [20:14:53] (03CR) 10RLazarus: [C: 03+2] 04-switch-mediawiki: Fix a backwards minus sign. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/636487 (owner: 10RLazarus) [20:15:25] this part is going to be fun [20:15:30] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@6912889]: Deploy new version of articlequality for wikidata (T261326) (duration: 06m 53s) [20:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:35] T261326: Deploy the retrained model - https://phabricator.wikimedia.org/T261326 [20:16:10] (03PS2) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) [20:17:27] (03CR) 10jerkins-bot: [V: 04-1] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [20:17:57] (03PS2) 10RLazarus: 04-switch-mediawiki: Fix a backwards minus sign. [cookbooks] - 10https://gerrit.wikimedia.org/r/636487 [20:18:00] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/26140/cp1075.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [20:18:19] (03PS3) 10Dzahn: ldap::client::labs: fix 'Unknown variable: '::restricted..' [puppet] - 10https://gerrit.wikimedia.org/r/633838 (https://phabricator.wikimedia.org/T101447) [20:19:15] (03CR) 10Andrew Bogott: ldap::client::labs: fix 'Unknown variable: '::restricted..' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633838 (https://phabricator.wikimedia.org/T101447) (owner: 10Dzahn) [20:21:57] (03PS3) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) [20:23:08] (03PS4) 10Dzahn: ldap::client::labs: fix 'Unknown variable: '::restricted..' 
[puppet] - 10https://gerrit.wikimedia.org/r/633838 (https://phabricator.wikimedia.org/T101447) [20:23:14] (03CR) 10jerkins-bot: [V: 04-1] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [20:23:48] (03PS4) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) [20:25:04] (03CR) 10jerkins-bot: [V: 04-1] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [20:27:03] (03CR) 10Arlolra: [C: 03+1] Get rid of update_parsoid.sh script that is no longer needed [puppet] - 10https://gerrit.wikimedia.org/r/636494 (owner: 10Subramanya Sastry) [20:28:23] (03CR) 10RLazarus: [C: 03+1] 04-switch-mediawiki: Fix a backwards minus sign. [cookbooks] - 10https://gerrit.wikimedia.org/r/636487 (owner: 10RLazarus) [20:28:28] (03CR) 10RLazarus: [C: 03+2] 04-switch-mediawiki: Fix a backwards minus sign. [cookbooks] - 10https://gerrit.wikimedia.org/r/636487 (owner: 10RLazarus) [20:28:30] (03CR) 10Ottomata: "I don't know why puppet is mad at me" [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [20:29:37] (03Merged) 10jenkins-bot: 04-switch-mediawiki: Fix a backwards minus sign. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/636487 (owner: 10RLazarus) [20:29:55] (03PS9) 10Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - 10https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) [20:30:30] 10Operations, 10SRE-Access-Requests: New prod ssh key for calbon - https://phabricator.wikimedia.org/T266498 (10Reedy) [20:33:19] 10Operations, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10ayounsi) 05Open→03Resolved Interface up. Thanks! [20:34:49] (03CR) 10Ayounsi: [C: 03+1] "+1 for netbox-next." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/636464 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [20:34:51] (03CR) 10ArielGlenn: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [20:36:10] (03PS5) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) [20:36:36] (03CR) 10Ottomata: "AH thank you! was confused because the column info is wrong. 
I changed this line to use double quotes, so i need to escape the backslash" [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [20:38:06] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/26141/wikistats-dancing-goat.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/633838 (https://phabricator.wikimedia.org/T101447) (owner: 10Dzahn) [20:40:55] (03PS1) 10Ppchelko: Update mobileapps to 2020-10-26-150740-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/636496 (https://phabricator.wikimedia.org/T264024) [20:53:10] (03CR) 10Dzahn: [C: 03+2] Get rid of update_parsoid.sh script that is no longer needed [puppet] - 10https://gerrit.wikimedia.org/r/636494 (owner: 10Subramanya Sastry) [20:54:20] !log scandium rm /usr/local/bin/update_parsoid.sh (gerrit:636494) [20:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:34] (03CR) 10Dzahn: "deleted the file manually on scandium" [puppet] - 10https://gerrit.wikimedia.org/r/636494 (owner: 10Subramanya Sastry) [21:00:04] Reedy and sbassett: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201026T2100). [21:01:55] PROBLEM - Check systemd state on kubestage1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:04] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10BPirkle) [21:07:06] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10ssastry) [21:16:37] (03CR) 10Gehel: [C: 04-1] "From PPC:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345) (owner: 10Ryan Kemper) [21:29:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:31:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:31:29] !log starting a live test of sre.switchdc.mediawiki, which will create some logging noise but no actual production impact [21:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:52] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [21:31:55] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [21:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:33] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [21:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:48] 10Operations, 10ops-eqiad, 10DC-Ops: fix/replace cable ID 2648 on FB peering patch - cable report error - 
https://phabricator.wikimedia.org/T266497 (10RobH) p:05Triage→03High [21:32:58] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [21:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:05] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [21:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:30] 10Operations, 10ops-eqiad, 10DC-Ops: fix/replace cable ID 2648 on FB peering patch - cable report error - https://phabricator.wikimedia.org/T266497 (10RobH) I set to high priority, since its causing a report error. Once a netbox report is in error, it won't repeat/append to its error state via IRC echo. Ba... [21:34:33] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [21:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:46] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [21:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:59] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [21:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:20] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [21:35:21] !log rzl@cumin1001 [DRY-RUN] MediaWiki read-only period starts at: 2020-10-26 21:35:20.837214 [21:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:31] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [21:35:35] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [21:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:03] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [21:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:13] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:36:27] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [21:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:36] ^ safe to ignore that alert, it's eqiad (passive DC) [21:36:37] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [21:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:42] and an expected result of the warmup script running there [21:36:56] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions [21:36:58] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) [21:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:07] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [21:37:10] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [21:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:16] !log rzl@cumin1001 START - Cookbook 
sre.switchdc.mediawiki.07-set-readwrite [21:37:17] !log rzl@cumin1001 [DRY-RUN] MediaWiki read-only period ends at: 2020-10-26 21:37:17.809596 [21:37:18] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [21:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:29] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters [21:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:35] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:37:59] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:38:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:40:56] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters (exit_code=0) [21:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:06] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [21:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:34] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) [21:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:40] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [21:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:22] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [21:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:32] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-update-tendril [21:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:42] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) [21:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:09] !log live test of sre.switchdc.mediawiki complete, the foregoing logging noise had no actual production impact [21:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:35] (03PS1) 10Catrope: PostEditPanel: Account for topics being null [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/636383 (https://phabricator.wikimedia.org/T266501) [22:10:44] 
(03PS2) 10Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 [22:11:26] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (owner: 10Dzahn) [22:13:31] (03PS3) 10Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 [22:14:13] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (owner: 10Dzahn) [22:15:45] (03CR) 10Dzahn: "John, the error " os_version(): LSB facts are not set; is lsb-release installed?" sounds like related to things you were working on ?" [puppet] - 10https://gerrit.wikimedia.org/r/636104 (owner: 10Dzahn) [22:21:49] PROBLEM - MariaDB Replica Lag: s4 on db1144 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:03] PROBLEM - MariaDB Replica Lag: s4 on db1149 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:05] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:07] PROBLEM - MariaDB Replica Lag: s4 on db1081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:17] PROBLEM - MariaDB Replica Lag: s4 on db1146 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:21] PROBLEM - MariaDB Replica Lag: s4 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication 
lag: 344.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:27] PROBLEM - MariaDB Replica Lag: s4 on db1147 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 343.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:31] PROBLEM - MariaDB Replica Lag: s4 on db1148 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 342.72 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:37] PROBLEM - MariaDB Replica Lag: s4 on db1143 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:45] PROBLEM - MariaDB Replica Lag: s4 on db1142 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 345.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:49] PROBLEM - MariaDB Replica Lag: s4 on dbstore1004 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 346.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:59] PROBLEM - MariaDB Replica Lag: s4 on db1141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 345.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:23:17] PROBLEM - MariaDB Replica Lag: s4 on db1121 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 345.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:23:19] PROBLEM - MariaDB Replica Lag: s4 on db1138 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:24:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1028 with 10G interfaces - https://phabricator.wikimedia.org/T266514 (10Andrew) [22:24:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 
10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [22:28:43] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:55] (03PS9) 10Andrew Bogott: cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) [22:29:57] (03PS1) 10Andrew Bogott: Cloudvirt1027/1028 to ceph and backy2 [puppet] - 10https://gerrit.wikimedia.org/r/636507 (https://phabricator.wikimedia.org/T259399) [22:30:31] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:37] !log netflow5001 - systemctl reset-failed [22:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:40] (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt1027/1028 to ceph and backy2 [puppet] - 10https://gerrit.wikimedia.org/r/636507 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [22:36:12] (03PS4) 10Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 [22:36:37] (03CR) 10jerkins-bot: [V: 04-1] mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (owner: 10Dzahn) [22:41:39] (03PS5) 10Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 [22:42:04] (03CR) 10jerkins-bot: [V: 04-1] mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (owner: 10Dzahn) [22:42:37] (03PS1) 10Cwhite: Initial release based on ECS 1.6.0. 
[software/ecs] - 10https://gerrit.wikimedia.org/r/636513 [22:45:55] (03PS6) 10Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 [22:46:02] (03CR) 10Cwhite: [C: 03+1] prometheus: re-enable compaction by default [puppet] - 10https://gerrit.wikimedia.org/r/636362 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [22:46:21] (03CR) 10jerkins-bot: [V: 04-1] mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (owner: 10Dzahn) [22:48:45] (03PS1) 10Razzi: stats: Remove nginx from thorium [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) [22:49:05] yea.. spec tests are soooo useful when they are in one out of 100 modules [22:49:15] but causing 10 more edits [22:51:28] (03CR) 10Cwhite: [C: 03+1] "Overall, this LGTM! A good step towards eliminating usage of the legacy function api." [puppet] - 10https://gerrit.wikimedia.org/r/635356 (owner: 10Jbond) [22:52:41] (03PS7) 10Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 [22:53:06] (03CR) 10jerkins-bot: [V: 04-1] mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (owner: 10Dzahn) [22:54:43] (03CR) 10Catrope: [C: 03+2] PostEditPanel: Account for topics being null [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/636383 (https://phabricator.wikimedia.org/T266501) (owner: 10Catrope) [22:56:56] (03PS8) 10Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 [22:57:21] (03CR) 10jerkins-bot: [V: 04-1] mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (owner: 10Dzahn) [22:59:04] (03PS9) 10Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 [23:00:04] RoanKattouw, Niharika, and 
Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201026T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:30] I'll deploy it myself [23:00:35] Waiting for CI [23:01:41] (03PS1) 10Cwhite: Add HTTP request and response headers fields as object[field:keyword] [software/ecs] - 10https://gerrit.wikimedia.org/r/636515 [23:02:29] (03CR) 10Dzahn: "After about 10 attempts to fix the spec tests and not being successful I decided to delete them because they exist in almost no other modu" [puppet] - 10https://gerrit.wikimedia.org/r/636082 (owner: 10Dzahn) [23:03:29] (03PS1) 10Cwhite: Add CSP Report fields. [software/ecs] - 10https://gerrit.wikimedia.org/r/636516 [23:05:15] (03Merged) 10jenkins-bot: PostEditPanel: Account for topics being null [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/636383 (https://phabricator.wikimedia.org/T266501) (owner: 10Catrope) [23:05:38] (03PS5) 10Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 [23:06:27] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [23:06:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:08:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:12:06] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.14/extensions/GrowthExperiments/: Fix JS error when no topics set (T266501) (duration: 01m 00s) [23:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:12] T266501: [testwiki wmf.14] Cannot read property 'length' of null at PostEditPanel.logImpression - https://phabricator.wikimedia.org/T266501 [23:14:06] (PS6) Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - https://gerrit.wikimedia.org/r/635656 [23:15:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:17:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:24:35] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/26143/" [puppet] - https://gerrit.wikimedia.org/r/635656 (owner: Dzahn) [23:29:33] (PS1) Razzi: geoip: cleanup having moved archiving to launcher [puppet] - https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) [23:31:06] (CR) jerkins-bot: [V: -1] geoip: cleanup having moved archiving to launcher [puppet] - https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: Razzi) [23:37:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:38:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets