[00:19:13] FIRING: [2x] MediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag: ...
[00:19:13] High Kafka consumer lag for mw_content_history_reconcile_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s-dse&var-namespace=mw-content-history-reconcile-enrich&var-helm_release=production&var-operator_name=All&var-flink_job_name=mw_content_history_reconcile_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag
[04:19:13] FIRING: [2x] MediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag: ...
[04:19:13] High Kafka consumer lag for mw_content_history_reconcile_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s-dse&var-namespace=mw-content-history-reconcile-enrich&var-helm_release=production&var-operator_name=All&var-flink_job_name=mw_content_history_reconcile_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag
[08:19:13] FIRING: [2x] MediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag: ...
[08:19:13] High Kafka consumer lag for mw_content_history_reconcile_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s-dse&var-namespace=mw-content-history-reconcile-enrich&var-helm_release=production&var-operator_name=All&var-flink_job_name=mw_content_history_reconcile_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag
[08:38:05] Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE (2026-02-13 - 2026-03-06): Requesting Kerberos access for SCardenas (WMF) - https://phabricator.wikimedia.org/T418664#11676605 (Gehel)
[08:38:17] Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE (2026-02-13 - 2026-03-06): Requesting Kerberos access for SCardenas (WMF) - https://phabricator.wikimedia.org/T418664#11676615 (Gehel) a:Gehel
[09:05:34] Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE (2026-02-13 - 2026-03-06), Patch-For-Review: Requesting Kerberos access for SCardenas (WMF) - https://phabricator.wikimedia.org/T418664#11676689 (Gehel) @Scardenasmolinar : you need to first request production/shell access as documented...
[12:11:56] Data-Engineering (Q3 FY25/26 January 1st - March 31th): [OpsWeek] Testing on airflow-devenvs can generate false alerts such as SLO misses - https://phabricator.wikimedia.org/T416596#11677211 (AndrewTavis_WMDE) WMDE is using the `EmailOperator` in our DAGs a lot for notifying stakeholders that their data is a...
[12:17:28] Data-Engineering: druid_load_webrequest_sampled_live_hourly - https://phabricator.wikimedia.org/T419121 (dr0ptp4kt) NEW
[12:17:52] Data-Engineering: druid_load_webrequest_sampled_live_hourly SerDe error in singular DAG run - https://phabricator.wikimedia.org/T419121#11677231 (dr0ptp4kt)
[12:19:13] FIRING: [2x] MediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag: ...
[12:19:14] High Kafka consumer lag for mw_content_history_reconcile_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s-dse&var-namespace=mw-content-history-reconcile-enrich&var-helm_release=production&var-operator_name=All&var-flink_job_name=mw_content_history_reconcile_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag
[12:39:48] Data-Engineering: druid_load_webrequest_sampled_live_hourly SerDe error in singular DAG run - https://phabricator.wikimedia.org/T419121#11677329 (dr0ptp4kt)
[13:04:53] Data-Engineering, ChangeProp, EventStreams, MediaWiki-Engineering, and 15 others: Migrate node-based services in production to node22 - https://phabricator.wikimedia.org/T393434#11677382 (Krinkle)
[13:44:11] Data-Engineering: Optimize enqueueing of refine_webrequest_hourly pipeline - https://phabricator.wikimedia.org/T419050#11677499 (dr0ptp4kt) We've seen some issues with getting at log data that would help in troubleshooting this sort of thing. @amastilovic noted that https://github.com/apache/airflow/issues...
[13:56:10] Data-Engineering (Q3 FY25/26 January 1st - March 31th): druid_load_webrequest_sampled_live_hourly SerDe error in singular DAG run - https://phabricator.wikimedia.org/T419121#11677540 (Antoine_Quhen) Open→Resolved a:Antoine_Quhen Data cleaned with: `python import json sc = spark.sparkContext...
[14:02:04] Data-Engineering (Q3 FY25/26 January 1st - March 31th): druid_load_webrequest_sampled_live_hourly SerDe error in singular DAG run - https://phabricator.wikimedia.org/T419121#11677563 (dr0ptp4kt) @Antoine_Quhen, 🍪 for you. Well done!
[14:18:23] Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review: Adapt Sqoop for imagelinks schema changes - https://phabricator.wikimedia.org/T416481#11677623 (Snwachukwu) Thank you @Zabe for the explanation. Indeed I used stale data from last sqoop run.
[15:06:01] Data-Engineering, Data-Engineering-Radar, Content-Transform-Team, MW-Interfaces-Team, Event-Platform: Expose MediaWiki Parser render_id as a response header in relevant MW REST API endpoints - https://phabricator.wikimedia.org/T418792#11677830 (cscott) Yeah, sounds good.
[15:30:22] Data-Engineering (Q3 FY25/26 January 1st - March 31th): druid_load_webrequest_sampled_live_hourly SerDe error in singular DAG run - https://phabricator.wikimedia.org/T419121#11677953 (Antoine_Quhen) One step higher in the problem is here: https://gerrit.wikimedia.org/g/operations/puppet/+/f0d57f3f75c39d9...
[15:40:27] PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1002 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration
[15:40:38] Data-Engineering: Optimize enqueueing of refine_webrequest_hourly pipeline - https://phabricator.wikimedia.org/T419050#11678029 (Gehel) Tagging #data-platform-sre for visibility
[15:40:53] Data-Engineering, Data-Platform-SRE (2026-02-13 - 2026-03-06): Optimize enqueueing of refine_webrequest_hourly pipeline - https://phabricator.wikimedia.org/T419050#11678030 (Gehel)
[16:19:14] FIRING: [2x] MediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag: ...
[16:19:19] High Kafka consumer lag for mw_content_history_reconcile_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s-dse&var-namespace=mw-content-history-reconcile-enrich&var-helm_release=production&var-operator_name=All&var-flink_job_name=mw_content_history_reconcile_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag
[16:40:27] RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1002 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration
[16:58:02] Data-Engineering, Data-Platform-SRE (2026-02-13 - 2026-03-06): Optimize enqueueing of refine_webrequest_hourly pipeline - https://phabricator.wikimedia.org/T419050#11678366 (JAllemandou) I see how the change defined above has an impact on SLAs: for an SLA defined of 5h, if we're waiting one hour more tha...
[17:40:33] Data-Engineering (Q3 FY25/26 January 1st - March 31th), Datasets-General-or-Unknown: Get dump mirrors to use new dumps-rsync service name - https://phabricator.wikimedia.org/T415193#11678469 (xcollazo)
[18:43:01] !log Deploying change 1240253 for refinery ( T414478 ), already hotfixed, should be no-op
[18:43:02] Data-Engineering (Q3 FY25/26 January 1st - March 31th): druid_load_webrequest_sampled_live_hourly SerDe error in singular DAG run - https://phabricator.wikimedia.org/T419121#11678695 (amastilovic) Is this cleanup process something we should implement as part of the pipeline?
[18:43:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:43:05] T414478: Add 'first campaign' and 'first campaign status code' to CentralNotice banner_activity_minutely Turnilo cube and Druid source table - https://phabricator.wikimedia.org/T414478
[18:46:22] Deploying change 1239200 for refinery ( T416481 )
[18:46:23] T416481: Adapt Sqoop for imagelinks schema changes - https://phabricator.wikimedia.org/T416481
[18:47:34] !log Deploying change 1239200 for refinery ( T416481 )
[18:47:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:48:10] ! Deploying change 1240253 for refinery ( T414478 ), already hotfixed, should be no-op
[18:48:11] T414478: Add 'first campaign' and 'first campaign status code' to CentralNotice banner_activity_minutely Turnilo cube and Druid source table - https://phabricator.wikimedia.org/T414478
[18:49:35] Data-Engineering (Q3 FY25/26 January 1st - March 31th), MW-Interfaces-Team, Traffic, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11678731 (daniel) With T417780 deployed, you should be seeing dat...
[19:04:00] !log Deployed refinery change 1240253 ( T414478 ), 1240253 (no-op) for refinery ( T414478 ) using scap, then deployed onto hdfs
[19:04:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:04:08] T414478: Add 'first campaign' and 'first campaign status code' to CentralNotice banner_activity_minutely Turnilo cube and Druid source table - https://phabricator.wikimedia.org/T414478
[19:04:32] !log Deploying change 1239200 for refinery ( T416481 ) using scap, then deployed onto hdfs
[19:04:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:04:35] T416481: Adapt Sqoop for imagelinks schema changes - https://phabricator.wikimedia.org/T416481
[19:08:09] Data-Engineering: table_maintenance_iceberg_monthly permission issue fails task due to permission on Ivy cache artifact - https://phabricator.wikimedia.org/T418804#11678791 (xcollazo) This had happened before when we run an airflow devenv with a personal user that creates, say, `/tmp/table_maintenance_iceber...
[19:29:21] Data-Engineering (Q3 FY25/26 January 1st - March 31th), MW-Interfaces-Team, Traffic, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11678858 (Fabfur) Hi @daniel I can confirm I see the `x_is_browse...
[20:19:08] Data-Engineering (Q3 FY25/26 January 1st - March 31th), MW-Interfaces-Team, Traffic, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11679050 (daniel) Open→Resolved >>! In T417864#116788...
[20:23:57] FIRING: [2x] MediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag: ...
[20:24:03] High Kafka consumer lag for mw_content_history_reconcile_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s-dse&var-namespace=mw-content-history-reconcile-enrich&var-helm_release=production&var-operator_name=All&var-flink_job_name=mw_content_history_reconcile_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag
[20:28:07] FIRING: EventgateProduceRateAnomaly: Significant produce rate deviation (+-25%) on eventgate-analytics-external. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-dc=000000026&var-service=eventgate-analytics-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateProduceRateAnomaly
[20:45:58] Data-Engineering (Q3 FY25/26 January 1st - March 31th), Content-Transform-Team, MW-Interfaces-Team, Event-Platform: Common event data model for data derived from parsed page revision html (and more!) - https://phabricator.wikimedia.org/T415158#11679138 (AKhatun_WMF) Took a look at [MR#33](https:/...
[20:53:06] RESOLVED: EventgateProduceRateAnomaly: Significant produce rate deviation (+-25%) on eventgate-analytics-external. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-dc=000000026&var-service=eventgate-analytics-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateProduceRateAnomaly
[21:12:19] Data-Engineering: Task Tries and Logs for Airflow DAGs sometimes unavailable - https://phabricator.wikimedia.org/T419162 (dr0ptp4kt) NEW
[21:16:34] Data-Engineering: Task Tries and Logs for Airflow DAGs sometimes unavailable - https://phabricator.wikimedia.org/T419162#11679232 (dr0ptp4kt)
[21:19:21] Data-Engineering, Data-Platform-SRE (2026-02-13 - 2026-03-06): Task Tries and Logs for Airflow DAGs sometimes unavailable - https://phabricator.wikimedia.org/T419162#11679243 (dr0ptp4kt) Tagging #data-platform-sre for visibility, mirroring what @Gehel did in T419050#11678029 .
[23:27:07] FIRING: EventgateProduceRateAnomaly: Significant produce rate deviation (+-25%) on eventgate-logging-external. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-dc=000000026&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateProduceRateAnomaly
[23:32:07] RESOLVED: EventgateProduceRateAnomaly: Significant produce rate deviation (+-25%) on eventgate-logging-external. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-dc=000000026&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateProduceRateAnomaly