[00:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200108T0000).
[00:00:04] NicholasG04: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:23] :)
[00:02:57] (CR) Dzahn: [V: +2 C: +2] delete unused fake SSL keys [labs/private] - https://gerrit.wikimedia.org/r/561909 (owner: Dzahn)
[00:12:54] Urbanecm should I be doing something?
[00:13:18] It kinda looks like no one has shown up to do the deploy
[00:14:24] (PS2) Dzahn: gerrit: adjust bacula backup behaviour to deal with multiple hosts [puppet] - https://gerrit.wikimedia.org/r/562639 (https://phabricator.wikimedia.org/T239151)
[00:16:24] (PS6) Reedy: Added throttle rule for University of Derby mini-editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) (owner: NicholasG04)
[00:16:56] (CR) Reedy: [C: +2] Added throttle rule for University of Derby mini-editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) (owner: NicholasG04)
[00:17:27] Reedy literally ._.
[00:17:35] ?
[00:18:11] (Merged) jenkins-bot: Added throttle rule for University of Derby mini-editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) (owner: NicholasG04)
[00:18:17] The three scheduled aren't here lol
[00:18:38] It's mostly a best effort thing
[00:19:02] Thank you for sorting it!
[00:21:39] !log reedy@deploy1001 Synchronized wmf-config/throttle.php: T240845 (duration: 01m 04s)
[00:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:43] T240845: Temporary lift of IP cap on en.wikipedia for 21 Jan 2020 - https://phabricator.wikimedia.org/T240845
[00:23:49] (PS1) Dzahn: admins: add clarakosi to deploy-service for RESTBase deployment [puppet] - https://gerrit.wikimedia.org/r/562661 (https://phabricator.wikimedia.org/T242152)
[00:25:02] (CR) jerkins-bot: [V: -1] admins: add clarakosi to deploy-service for RESTBase deployment [puppet] - https://gerrit.wikimedia.org/r/562661 (https://phabricator.wikimedia.org/T242152) (owner: Dzahn)
[00:29:04] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:21] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@024488f]: airflow: set mjolnir dag start date to today (20200108)
[00:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:03] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@024488f]: airflow: set mjolnir dag start date to today (20200108) (duration: 00m 42s)
[00:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:52] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to RESTBase for clarakosi - https://phabricator.wikimedia.org/T242152 (Dzahn) a:Dzahn
[00:45:10] Operations, SRE-Access-Requests: Requesting access to EventLogging data for knissen - https://phabricator.wikimedia.org/T241838 (Dzahn)
[00:45:42] Operations, SRE-Access-Requests: Requesting access to EventLogging data for knissen - https://phabricator.wikimedia.org/T241838 (Dzahn)
[00:48:31] Operations, Analytics, Product-Analytics, SRE-Access-Requests: Access to analytics infrastructure for SNowick_WMF - https://phabricator.wikimedia.org/T242026 (Dzahn)
[00:53:21] Operations, Analytics, Product-Analytics, SRE-Access-Requests: Access to analytics infrastructure for SNowick_WMF - https://phabricator.wikimedia.org/T242026 (Dzahn) I see on https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue " If you already have cluster access, but can't log into Hue, it...
[00:54:23] Operations, LDAP-Access-Requests, WMF-Legal: Request to add Silvan Heintze to the ldap/wmde group - https://phabricator.wikimedia.org/T242080 (Dzahn)
[00:56:40] Operations, LDAP-Access-Requests, WMF-Legal: Request to add Silvan Heintze to the ldap/wmde group - https://phabricator.wikimedia.org/T242080 (Dzahn) a:Silvan_WMDE Hi Silvan, assigning this ticket to you for signing the NDA. Once that is done please assign it back to "nobody' or me or just leave...
[01:09:21] (PS2) Dzahn: admins: add clarakosi to deploy-service for RESTBase deployment [puppet] - https://gerrit.wikimedia.org/r/562661 (https://phabricator.wikimedia.org/T242152)
[01:11:53] (PS2) Dzahn: phabricator: Remove comment about bans being superseded by now non existent WP0 bans [puppet] - https://gerrit.wikimedia.org/r/543138 (owner: Reedy)
[01:12:08] (CR) Dzahn: [C: +2] "comment-only" [puppet] - https://gerrit.wikimedia.org/r/543138 (owner: Reedy)
[01:14:05] (CR) jerkins-bot: [V: -1] phabricator: Remove comment about bans being superseded by now non existent WP0 bans [puppet] - https://gerrit.wikimedia.org/r/543138 (owner: Reedy)
[01:16:59] jerkins ...
[01:17:47] (PS3) Dzahn: phabricator: Remove comment about bans being superseded by WP0 bans [puppet] - https://gerrit.wikimedia.org/r/543138 (owner: Reedy)
[01:19:20] mutante: rebase it?
[01:21:40] Reedy: it was "commit message too long"
[01:21:46] submits
[01:23:00] (PS1) EBernhardson: airflow: Provide wrapper script to invoke airflow [puppet] - https://gerrit.wikimedia.org/r/562666
[01:25:39] (PS2) Dzahn: Adapt auto restart for Buster [puppet] - https://gerrit.wikimedia.org/r/562473 (owner: Muehlenhoff)
[01:25:59] (PS3) Dzahn: url_downloader: Adapt auto restart for Buster [puppet] - https://gerrit.wikimedia.org/r/562473 (owner: Muehlenhoff)
[01:26:39] (CR) Dzahn: [C: +2] url_downloader: Adapt auto restart for Buster [puppet] - https://gerrit.wikimedia.org/r/562473 (owner: Muehlenhoff)
[01:33:13] Operations, vm-requests: eqiad/codfw: 2 VM request for URL downloaders - https://phabricator.wikimedia.org/T241979 (Dzahn) Open→Resolved VMs have been created. looks like from here on it will continue on T224551
[01:33:15] Operations: Migrate URL downloaders to Buster - https://phabricator.wikimedia.org/T224551 (Dzahn) merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/562473 commented on https://gerrit.wikimedia.org/r/c/operations/puppet/+/562472
[01:34:36] Operations, Analytics, Product-Analytics, SRE-Access-Requests: Access to analytics infrastructure for SNowick_WMF - https://phabricator.wikimedia.org/T242026 (SNowick_WMF) Thanks, yes it is a manual sync process: The ticket attached to this one says "Currently, Hue users are manually synced from...
[01:39:19] Operations, Analytics, Product-Analytics, SRE-Access-Requests: Access to analytics infrastructure for SNowick_WMF - https://phabricator.wikimedia.org/T242026 (Dzahn) a:elukey
[02:12:02] (CR) CDanis: fastnetmon: add UDP/ICMP bw limits, greatly increase pps limits (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562387 (owner: CDanis)
[03:02:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:04:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:54:49] !log volker-e@deploy1001 Started deploy [design/style-guide@ad595d5]: Deploy design/style-guide:
[03:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:57] !log volker-e@deploy1001 Finished deploy [design/style-guide@ad595d5]: Deploy design/style-guide: (duration: 00m 08s)
[03:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:14:04] (CR) Ayounsi: [C: +2] Set port 443 (was 8190) for term schema in analytics-in4 [homer/public] - https://gerrit.wikimedia.org/r/562543 (owner: Elukey)
[05:27:07] (PS3) Ammarpad: Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030)
[05:35:28] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:32] (PS1) Ayounsi: Enable netflow in eqsin [homer/public] - https://gerrit.wikimedia.org/r/562692
[05:40:52] (CR) Ayounsi: [V: +2 C: +2] Enable netflow in eqsin [homer/public] - https://gerrit.wikimedia.org/r/562692 (owner: Ayounsi)
[05:41:14] (CR) Ayounsi: [V: +2 C: +2] "Diff for 2 devices: ['cr1-eqsin.wikimedia.org', 'cr2-eqsin.wikimedia.org']" [homer/public] - https://gerrit.wikimedia.org/r/562692 (owner: Ayounsi)
[05:41:43] !log enable netflow in eqsin
[05:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:08] (PS6) Ammarpad: Add initial configuration for ng.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/559218 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx)
[05:58:16] (CR) Ammarpad: [C: +1] Add initial configuration for ng.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/559218 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx)
[06:06:11] (PS1) Ayounsi: Remove sampling: true for eqsin as it's true by default [homer/public] - https://gerrit.wikimedia.org/r/562697
[06:07:06] (CR) Ayounsi: [V: +2 C: +2] Remove sampling: true for eqsin as it's true by default [homer/public] - https://gerrit.wikimedia.org/r/562697 (owner: Ayounsi)
[06:11:19] (PS1) Ayounsi: Enable netflow sampling in knams [homer/public] - https://gerrit.wikimedia.org/r/562698
[06:15:57] (CR) Ayounsi: "Faidon for the administrative stamp, Chris for the technical one." [homer/public] - https://gerrit.wikimedia.org/r/562698 (owner: Ayounsi)
[06:19:38] Operations, SRE-Access-Requests: Requesting access to production servers in perf-team group for dpifke - https://phabricator.wikimedia.org/T242189 (dpifke)
[06:24:12] Please see https://phabricator.wikimedia.org/T242188 - PHP fatal error on beta cluster
[06:25:03] Operations, DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (Marostegui) Thanks for clarifying Papaul. Jaime is off and will be back online the 9th of January
[06:26:38] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui) Thanks for the clarification. My thoughts were that we upgraded also BIOS. Let's start with that indeed.
[06:33:46] Operations, Beta-Cluster-Infrastructure, Lexicographical data, Wikidata, User-DannyS712: PHP fatal error on beta cluster - https://phabricator.wikimedia.org/T242188 (DannyS712)
[06:35:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3316 - T239453', diff saved to https://phabricator.wikimedia.org/P10077 and previous config saved to /var/cache/conftool/dbconfig/20200108-063550-marostegui.json
[06:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:55] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[06:41:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 - T239453', diff saved to https://phabricator.wikimedia.org/P10078 and previous config saved to /var/cache/conftool/dbconfig/20200108-064144-marostegui.json
[06:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:49] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[06:42:27] !log Remove partitions from revision table on s6 for db1096:3316 - T239453
[06:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079', diff saved to https://phabricator.wikimedia.org/P10079 and previous config saved to /var/cache/conftool/dbconfig/20200108-064404-marostegui.json
[06:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:14] Operations, Beta-Cluster-Infrastructure, Lexicographical data, Wikidata, User-DannyS712: PHP fatal error on beta cluster - https://phabricator.wikimedia.org/T242188 (DannyS712) @Reedy can I ask why https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseLexeme/+/562646/ was abandoned?...
[06:49:41] (PS1) Marostegui: db1114: Change package to reflect the current one [puppet] - https://gerrit.wikimedia.org/r/562702
[06:55:28] (CR) Marostegui: [C: +2] db1114: Change package to reflect the current one [puppet] - https://gerrit.wikimedia.org/r/562702 (owner: Marostegui)
[06:56:43] !log Upgrade db1079
[06:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:23] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:00:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079', diff saved to https://phabricator.wikimedia.org/P10080 and previous config saved to /var/cache/conftool/dbconfig/20200108-070009-marostegui.json
[07:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079', diff saved to https://phabricator.wikimedia.org/P10081 and previous config saved to /var/cache/conftool/dbconfig/20200108-070614-marostegui.json
[07:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3315', diff saved to https://phabricator.wikimedia.org/P10082 and previous config saved to /var/cache/conftool/dbconfig/20200108-070712-marostegui.json
[07:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:34] !log Remove partitions from dewiki.revision on db1097:3315 T239453
[07:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:37] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[07:13:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1079', diff saved to https://phabricator.wikimedia.org/P10083 and previous config saved to /var/cache/conftool/dbconfig/20200108-071312-marostegui.json
[07:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1079', diff saved to https://phabricator.wikimedia.org/P10084 and previous config saved to /var/cache/conftool/dbconfig/20200108-072017-marostegui.json
[07:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:02] RECOVERY - snapshot of s7 in eqiad on db1115 is OK: snapshot for s7 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2020-01-08 05:14:57 from db1116.eqiad.wmnet:3317 (899 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[07:32:19] XioNoX: ^ :)
[07:36:25] nice!
[07:41:52] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema) >>! In T238305#5784511, @Papaul wrote: > sometimes when the IDRAC version is not up to date we might not see and log at system crash Interesting! > so i think let us start by getting all tho...
[07:50:25] (CR) Urbanecm: [C: +1] Add initial configuration for ng.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/559218 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx)
[07:57:59] !log Deploy schema change on clouddb2001-dev.labtestwiki - T234052
[07:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:02] T234052: Add abuse_filter_log.afl_filter_id and afl_global columns - https://phabricator.wikimedia.org/T234052
[08:02:15] (CR) Elukey: [C: +2] Enable hive kerberos connections from search/airflow instances [puppet] - https://gerrit.wikimedia.org/r/562589 (owner: EBernhardson)
[08:05:34] Operations, LDAP-Access-Requests: NDA for Superset Request from WMDE Employee - Kris Litson - https://phabricator.wikimedia.org/T241722 (Kris_Litson_WMDE) Got it! Thanks everyone!
[08:07:11] !log Deploy schema change on s1 codfw, there will be lag on s1 codfw - T234052
[08:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:14] T234052: Add abuse_filter_log.afl_filter_id and afl_global columns - https://phabricator.wikimedia.org/T234052
[08:08:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085', diff saved to https://phabricator.wikimedia.org/P10085 and previous config saved to /var/cache/conftool/dbconfig/20200108-080853-marostegui.json
[08:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:01] (CR) Filippo Giunchedi: [C: +2] caching-proxy: squid vs squid3 paths [puppet] - https://gerrit.wikimedia.org/r/562560 (owner: Filippo Giunchedi)
[08:09:15] !log Upgrade db1085
[08:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:40] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:19:48] Operations, LDAP-Access-Requests, WMF-Legal: Request to add Silvan Heintze to the ldap/wmde group - https://phabricator.wikimedia.org/T242080 (Silvan_WMDE) a:Silvan_WMDE→Dzahn Thanks everyone, I just signed the NDA.
[08:20:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P10086 and previous config saved to /var/cache/conftool/dbconfig/20200108-082050-marostegui.json
[08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P10087 and previous config saved to /var/cache/conftool/dbconfig/20200108-082930-marostegui.json
[08:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:26] Operations, Beta-Cluster-Infrastructure, Lexicographical data, Wikidata, User-DannyS712: PHP fatal error on beta cluster - https://phabricator.wikimedia.org/T242188 (Reedy) >>! In T242188#5784973, @DannyS712 wrote: > @Reedy can I ask why https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions...
[08:41:41] Operations, Beta-Cluster-Infrastructure, Lexicographical data, Wikidata, User-DannyS712: PHP fatal error on beta cluster - https://phabricator.wikimedia.org/T242188 (Reedy) Open→Resolved a:Reedy
[08:49:42] (PS1) Alexandros Kosiaris: DNM: Append -http to eventgate-analytics [puppet] - https://gerrit.wikimedia.org/r/562767
[08:50:12] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/562767 (owner: Alexandros Kosiaris)
[08:54:34] Operations, Maps, Discovery-Search (Current work): Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (Pikne) I'm not sure if this should be considered part of or related to this task, but no new tiles have been generat...
[08:58:36] Operations, netops: Routinator RSYNC errors - https://phabricator.wikimedia.org/T240817 (ayounsi) Open→Stalled p:Normal→Low
[08:58:40] (CR) Alexandros Kosiaris: [C: +1] admins: add clarakosi to deploy-service for RESTBase deployment [puppet] - https://gerrit.wikimedia.org/r/562661 (https://phabricator.wikimedia.org/T242152) (owner: Dzahn)
[09:00:26] !log installing urldownloader1001 T241979
[09:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:29] T241979: eqiad/codfw: 2 VM request for URL downloaders - https://phabricator.wikimedia.org/T241979
[09:01:56] (CR) Alexandros Kosiaris: [C: -1] profile::url_downloader: Add types and switch to lookup() (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562472 (owner: Muehlenhoff)
[09:11:22] Operations, netops: Upgrade routinator to 0.6.4 - https://phabricator.wikimedia.org/T242197 (ayounsi) p:Triage→Low
[09:11:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1085', diff saved to https://phabricator.wikimedia.org/P10088 and previous config saved to /var/cache/conftool/dbconfig/20200108-091124-marostegui.json
[09:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:28] Operations, Traffic: Docker registry needs cache to vary on Accept header value - https://phabricator.wikimedia.org/T242200 (Joe)
[09:27:09] !log installing urldownloader1002 T241979
[09:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:12] T241979: eqiad/codfw: 2 VM request for URL downloaders - https://phabricator.wikimedia.org/T241979
[09:33:52] (PS2) Alexandros Kosiaris: DNM: Append -http to eventgate-analytics [puppet] - https://gerrit.wikimedia.org/r/562767
[09:37:49] (PS3) Alexandros Kosiaris: DNM: Append -http to eventgate-analytics [puppet] - https://gerrit.wikimedia.org/r/562767
[09:40:59] (PS1) Tarrow: Enable tainted references on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/562777 (https://phabricator.wikimedia.org/T239621)
[09:41:46] (PS1) Muehlenhoff: Switch rdb* to standardised Partman layout [puppet] - https://gerrit.wikimedia.org/r/562778 (https://phabricator.wikimedia.org/T156955)
[09:41:49] (CR) jerkins-bot: [V: -1] Enable tainted references on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/562777 (https://phabricator.wikimedia.org/T239621) (owner: Tarrow)
[09:44:23] (CR) Alexandros Kosiaris: "Did a full PCC for the whole fleet, identified the hosts that failed and fixed those in the followup patches. Now at https://puppet-compil" [puppet] - https://gerrit.wikimedia.org/r/562767 (owner: Alexandros Kosiaris)
[09:45:11] (PS2) Tarrow: Enable tainted references on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/562777 (https://phabricator.wikimedia.org/T239621)
[09:45:11] (CR) Alexandros Kosiaris: "I think I 'll tentatively merge this and shepherd it through production to make sure this won't bite us." [puppet] - https://gerrit.wikimedia.org/r/562767 (owner: Alexandros Kosiaris)
[09:54:11] (CR) Tarrow: "Added some extra camp reviewers. I wanted to know what you think about adding the 'default' line. You'll see this results in many more ent" [mediawiki-config] - https://gerrit.wikimedia.org/r/562777 (https://phabricator.wikimedia.org/T239621) (owner: Tarrow)
[09:57:02] (PS1) Vgutierrez: ATS: Disable TLSv1.0/1.1 support on the caching layer [puppet] - https://gerrit.wikimedia.org/r/562779 (https://phabricator.wikimedia.org/T238038)
[09:57:31] (CR) Addshore: [C: +1] Enable tainted references on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/562777 (https://phabricator.wikimedia.org/T239621) (owner: Tarrow)
[09:59:04] (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/20264/" [puppet] - https://gerrit.wikimedia.org/r/562779 (https://phabricator.wikimedia.org/T238038) (owner: Vgutierrez)
[10:01:35] (PS1) Muehlenhoff: Extend Netbox Ganeti sync for eqsin [puppet] - https://gerrit.wikimedia.org/r/562780 (https://phabricator.wikimedia.org/T228099)
[10:05:58] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/562780 (https://phabricator.wikimedia.org/T228099) (owner: Muehlenhoff)
[10:08:03] !log enabling spec-ctr, ssbd. md-clear passthrough for new eqsin cluster T228099
[10:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:08] T228099: rack/setup/install ganeti500[123].eqsin.wmnet - https://phabricator.wikimedia.org/T228099
[10:18:43] (CR) Filippo Giunchedi: [C: +1] Switch rdb* to standardised Partman layout [puppet] - https://gerrit.wikimedia.org/r/562778 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[10:18:50] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 85468512 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:19:11] wut?
[10:20:36] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 22216 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:21:11] 22216?
[10:21:17] (PS4) Alexandros Kosiaris: Append -http to eventgate-analytics [puppet] - https://gerrit.wikimedia.org/r/562767
[10:22:52] (CR) Alexandros Kosiaris: [C: +2] Append -http to eventgate-analytics [puppet] - https://gerrit.wikimedia.org/r/562767 (owner: Alexandros Kosiaris)
[10:31:52] (PS1) Muehlenhoff: Don't install the Postgres contrib package on Buster [puppet] - https://gerrit.wikimedia.org/r/562787
[10:32:27] (CR) jerkins-bot: [V: -1] Don't install the Postgres contrib package on Buster [puppet] - https://gerrit.wikimedia.org/r/562787 (owner: Muehlenhoff)
[10:38:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:39:38] (PS6) Alexandros Kosiaris: Set up new LVS service eventgate-analytics-https [puppet] - https://gerrit.wikimedia.org/r/559167 (https://phabricator.wikimedia.org/T241073) (owner: Ottomata)
[10:39:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:40:48] (CR) Alexandros Kosiaris: "Check experimental" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/559167 (https://phabricator.wikimedia.org/T241073) (owner: Ottomata)
[10:41:31] !log rebooting netflow5001 to pick up microcode
[10:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:35] (PS2) Ema: ATS: add webrequest logging for atskafka [puppet] - https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993)
[10:44:18] (PS1) Ayounsi: Routinator: add proxy for RRDP protocol [puppet] - https://gerrit.wikimedia.org/r/562788
[10:46:12] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:19] (CR) Ema: [C: +1] 5.1.3-1wm12: Bump version and target buster [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/562493 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[10:52:20] (PS1) Jbond: apt:::pin: allow callers to override the notify resource [puppet] - https://gerrit.wikimedia.org/r/562789
[10:53:05] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:15] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:25] (PS1) Muehlenhoff: Initially assing spare role to gerrit-test.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/562790 (https://phabricator.wikimedia.org/T239151)
[10:54:27] (CR) jerkins-bot: [V: -1] apt:::pin: allow callers to override the notify resource [puppet] - https://gerrit.wikimedia.org/r/562789 (owner: Jbond)
[10:54:55] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:55] (PS2) Jbond: apt::pin: allow callers to override the notify resource [puppet] - https://gerrit.wikimedia.org/r/562789
[10:59:08] (CR) jerkins-bot: [V: -1] apt::pin: allow callers to override the notify resource [puppet] - https://gerrit.wikimedia.org/r/562789 (owner: Jbond)
[11:00:16] !log drain ganeti5003 to test new Ganeti setup in eqsin T228099
[11:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:18] T228099: rack/setup/install ganeti500[123].eqsin.wmnet - https://phabricator.wikimedia.org/T228099
[11:06:27] (CR) Vgutierrez: [V: +2 C: +2] 5.1.3-1wm12: Bump version and target buster [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/562493 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[11:07:08] !log test failover of Ganeti master in eqsin T228099
[11:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:11] T228099: rack/setup/install ganeti500[123].eqsin.wmnet - https://phabricator.wikimedia.org/T228099
[11:11:35] (PS7) Alexandros Kosiaris: Set up new LVS service eventgate-analytics-https [puppet] - https://gerrit.wikimedia.org/r/559167 (https://phabricator.wikimedia.org/T241073) (owner: Ottomata)
[11:11:35] (PS1) Alexandros Kosiaris: lvs: Remove unused eventgate-analytics-http service [puppet] - https://gerrit.wikimedia.org/r/562792 (https://phabricator.wikimedia.org/T241073)
[11:14:18] (PS3) Ema: ATS: add webrequest logging for atskafka [puppet] - https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993)
[11:16:40] (PS1) Muehlenhoff: Re-enable notifications for ganeti5*, setup is done [puppet] - https://gerrit.wikimedia.org/r/562793 (https://phabricator.wikimedia.org/T228099)
[11:19:24] (CR) Muehlenhoff: [C: +2] Re-enable notifications for ganeti5*, setup is done [puppet] - https://gerrit.wikimedia.org/r/562793 (https://phabricator.wikimedia.org/T228099) (owner: Muehlenhoff)
[11:25:02] Operations, Patch-For-Review: rack/setup/install ganeti500[123].eqsin.wmnet - https://phabricator.wikimedia.org/T228099 (MoritzMuehlenhoff) Open→Resolved I tested a failover and an instance migration successfully. I also changed the cluster setting so that CPU vulnerability flags are passed throu...
[11:26:49] (CR) Muehlenhoff: [C: +2] Initially assing spare role to gerrit-test.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/562790 (https://phabricator.wikimedia.org/T239151) (owner: Muehlenhoff)
[11:36:13] moritzm: assing?
[11:36:19] (PS8) Alexandros Kosiaris: Set up new LVS service eventgate-analytics-https [puppet] - https://gerrit.wikimedia.org/r/559167 (https://phabricator.wikimedia.org/T241073) (owner: Ottomata)
[11:36:20] probably adding, right?
[11:36:21] (PS2) Alexandros Kosiaris: lvs: Remove unused eventgate-analytics-http service [puppet] - https://gerrit.wikimedia.org/r/562792 (https://phabricator.wikimedia.org/T241073)
[11:36:31] although it was a nice typo
[11:36:40] oh, I actually meant assign :-)
[11:36:51] ahaha
[11:36:53] even better
[11:37:38] (CR) Alexandros Kosiaris: [C: +2] Set up new LVS service eventgate-analytics-https [puppet] - https://gerrit.wikimedia.org/r/559167 (https://phabricator.wikimedia.org/T241073) (owner: Ottomata)
[11:38:31] (CR) Alexandros Kosiaris: [C: +2] "This is a bit of a weird one as it's a migration of an existing service to a different port. So, we reuse a lot of the things (e.g. IP, di" [puppet] - https://gerrit.wikimedia.org/r/559167 (https://phabricator.wikimedia.org/T241073) (owner: Ottomata)
[11:44:24] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1001.*
[11:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:38] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: name=kubernetes1001.*
[11:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:39] !log uploaded varnish 5.1.3-1wm12 to apt.wikimedia.org (buster) - T242093
[11:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:42] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093
[11:45:15] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.42:4192]) https://wikitech.wikimedia.org/wiki/PyBal
[11:45:46] I guess that's expected akosiaris
[11:45:50] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: service=echostore
[11:45:51] yup
[11:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:55] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.42:4192]) https://wikitech.wikimedia.org/wiki/PyBal
[11:46:06] pybal has already been restarted on the mains to avoid pages
[11:46:09] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 95 connections established with conf1004.eqiad.wmnet:4001 (min=96) https://wikitech.wikimedia.org/wiki/PyBal
[11:46:24] those are the backups I 'll wait a couple of more mins just to avoid BGP not having converged yet
[11:46:47] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 52 connections established with conf2001.codfw.wmnet:2379 (min=53) https://wikitech.wikimedia.org/wiki/PyBal
[11:47:08] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes2001.*
[11:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:33] (CR) Vgutierrez: "recheck" [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/562515 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[11:49:57] at least it has worked pretty well up to now. It does look like we 'll get 0 pages
[11:50:25] 0 pages <3
[11:50:56] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:51:14] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 96 connections established with conf1004.eqiad.wmnet:4001 (min=96) https://wikitech.wikimedia.org/wiki/PyBal
[11:51:50] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 53 connections established with conf2001.codfw.wmnet:2379 (min=53) https://wikitech.wikimedia.org/wiki/PyBal
[11:52:43] (PS1) KartikMistry: Update cxserver to 2020-01-06-070550-production [deployment-charts] - https://gerrit.wikimedia.org/r/562799 (https://phabricator.wikimedia.org/T233405)
[11:53:00] cool, everything went according to plan.
[11:53:32] icinga is happy for both services (TLS and temporary now nonTLS one), we got 0 pages, so cool
[11:54:34] (PS3) Jbond: apt:::pin: allow callers to override the notify resource [puppet] - https://gerrit.wikimedia.org/r/562789
[11:55:24] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:55:33] (CR) jerkins-bot: [V: -1] apt:::pin: allow callers to override the notify resource [puppet] - https://gerrit.wikimedia.org/r/562789 (owner: Jbond)
[11:55:46] akosiaris: updating cxserver soon.
[11:56:10] kart_: ok, thanks for the heads up [11:56:14] (03PS1) 10Arturo Borrero Gonzalez: toolforge: new k8s: cleanup metrics manifests and files [puppet] - 10https://gerrit.wikimedia.org/r/562800 (https://phabricator.wikimedia.org/T237643) [11:57:21] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-01-06-070550-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/562799 (https://phabricator.wikimedia.org/T233405) (owner: 10KartikMistry) [11:57:42] (03Merged) 10jenkins-bot: Update cxserver to 2020-01-06-070550-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/562799 (https://phabricator.wikimedia.org/T233405) (owner: 10KartikMistry) [11:57:43] jouncebot: refresh [11:57:43] I refreshed my knowledge about deployments. [11:58:36] <_joe_> I will finish my patches for moving to a better abstraction of lvs configurations [11:58:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: new k8s: cleanup metrics manifests and files [puppet] - 10https://gerrit.wikimedia.org/r/562800 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [11:58:49] <_joe_> so that no pages will be the norm, not an exception [11:59:51] (03PS1) 10Vgutierrez: 1.3.1-3 Rebuild for buster [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/562801 (https://phabricator.wikimedia.org/T242093) [12:00:01] (03CR) 10jerkins-bot: [V: 04-1] 1.3.1-3 Rebuild for buster [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/562801 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200108T1200). [12:00:04] tarrow, CFisch_WMDE, and Lucas_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:09] o/ [12:00:22] I can SWAT [12:00:23] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [12:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:29] tarrow: do you want to start by deploying your change? [12:00:57] I would be happy to! Unless CFisch_WMDE wants to go first? [12:01:54] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [12:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:59] I’ll already +2 the backport, since it’ll take a while to go through CI anyways [12:02:04] but I think you can go first [12:02:16] Thanks! I'm doing it now :) [12:03:37] (03CR) 10Tarrow: [C: 03+2] Enable tainted references on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562777 (https://phabricator.wikimedia.org/T239621) (owner: 10Tarrow) [12:04:16] (03PS1) 10Arturo Borrero Gonzalez: toolforge: new k8s: fix metrics directory [puppet] - 10https://gerrit.wikimedia.org/r/562802 (https://phabricator.wikimedia.org/T237643) [12:04:43] (03Merged) 10jenkins-bot: Enable tainted references on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562777 (https://phabricator.wikimedia.org/T239621) (owner: 10Tarrow) [12:04:48] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[12:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: new k8s: fix metrics directory [puppet] - 10https://gerrit.wikimedia.org/r/562802 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [12:06:16] tarrow: I'm in a meeting, would be happy if someone does the deployment of my backport [12:06:21] I can check [12:06:39] I can deploy it [12:08:17] Lucas_WMDE: silly question time: Did mwdebug1002 change ssh key in the last month or so? [12:08:46] !log Updated cxserver to 2020-01-06-070550-production (T233405) [12:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:49] T233405: Reference shown duplicated in the source document - https://phabricator.wikimedia.org/T233405 [12:08:56] it might have [12:09:02] I think at the moment you’re supposed to use mwdebug1001 anyways [12:09:23] ah! yes that's where I've confused myself [12:10:06] tarrow: yeah, it was reimaged some time in December IIRC [12:10:52] or rather November: https://phabricator.wikimedia.org/T236806 [12:11:06] not sure if mwdebug1002 is still verboten actually [12:11:12] the motd warning about it was reverted, apparently: https://gerrit.wikimedia.org/r/c/operations/puppet/+/559088 [12:14:26] Lucas_WMDE: it's still broken, but also the other mwdebug1001 started to behave the same way. 
It doesn't matter which host you use [12:14:50] great [12:14:58] (03PS1) 10Alexandros Kosiaris: lvs: Append -http to eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/562805 (https://phabricator.wikimedia.org/T241073) [12:17:24] syncing now :) [12:17:48] !log tarrow@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:562777|Enable tainted references on test.wikidata.org (T239621)]] (duration: 01m 19s) [12:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:53] T239621: Enable Tainted Refs on test.wikidata.org - https://phabricator.wikimedia.org/T239621 [12:17:54] (03PS4) 10Jbond: apt:::pin: allow callers to override the notify resource [puppet] - 10https://gerrit.wikimedia.org/r/562789 [12:18:36] \o/ [12:18:38] Lucas_WMDE: I'm all done :) [12:18:49] great, thanks! [12:19:03] Thanks [12:19:57] hm, /srv/mediawiki-staging/php-1.35.0-wmf.11 is one commit ahead of upstream… [12:19:59] (03PS5) 10Jbond: apt:::pin: allow callers to override the notify resource [puppet] - 10https://gerrit.wikimedia.org/r/562789 [12:20:46] anomie: in case you’re online – do you remember if that’s a security commit? [12:22:57] (03CR) 10Jbond: "I think it would be better to update the apt::pin resource to be a bit more flexible. see https://gerrit.wikimedia.org/r/c/operations/pup" [puppet] - 10https://gerrit.wikimedia.org/r/562544 (owner: 10Muehlenhoff) [12:23:28] strange thing, the commit isn’t on Gerrit but the Phabricator task has been public for a while [12:23:31] :shrug: [12:23:34] I’ll just rebase it, I guess [12:24:19] (03PS5) 10Alexandros Kosiaris: Switch eventgate-main LVS to use TLS port 4292 [puppet] - 10https://gerrit.wikimedia.org/r/559168 (https://phabricator.wikimedia.org/T241073) (owner: 10Ottomata) [12:24:22] CFisch_WMDE: your change is on mwdebug1001, can you test it? [12:25:14] jepp [12:25:56] (03CR) 10Muehlenhoff: "Sure thing! I'll look into your patch in a bit." 
[puppet] - 10https://gerrit.wikimedia.org/r/562544 (owner: 10Muehlenhoff) [12:28:32] Lucas_WMDE: Seems to work thanks. [12:28:37] great! [12:28:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "PCC at https://puppet-compiler.wmflabs.org/compiler1002/20270, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/562792 (https://phabricator.wikimedia.org/T241073) (owner: 10Alexandros Kosiaris) [12:29:24] syncing [12:29:56] (03PS3) 10Lucas Werkmeister (WMDE): Update Skolt Sami language name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510875 (https://phabricator.wikimedia.org/T223544) [12:30:27] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.11/extensions/Cite: SWAT: [[gerrit:561169|Fix handling of `` (T241303)]] (duration: 01m 06s) [12:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:30] T241303: does not work anymore - https://phabricator.wikimedia.org/T241303 [12:31:04] (03PS1) 10Elukey: admin: add kerberos flag for gbirke [puppet] - 10https://gerrit.wikimedia.org/r/562809 (https://phabricator.wikimedia.org/T242215) [12:31:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510875 (https://phabricator.wikimedia.org/T223544) (owner: 10Lucas Werkmeister (WMDE)) [12:31:17] deploying my own config change now [12:32:14] (03Merged) 10jenkins-bot: Update Skolt Sami language name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510875 (https://phabricator.wikimedia.org/T223544) (owner: 10Lucas Werkmeister (WMDE)) [12:32:54] testing on mwdebug1001 [12:33:08] looks great, syncing [12:34:36] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:510875|Update Skolt Sami language name (T223544)]] (duration: 01m 06s) [12:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:39] T223544: WMHack19: The native language name for [sms] Skolt Sami should be changed from "sää´mǩiõll" and 
"sääʹmǩiõll" to "nuõrttsääʹmǩiõll" - https://phabricator.wikimedia.org/T223544 [12:35:11] anything else to SWAT? [12:36:15] (03PS1) 10Alexandros Kosiaris: lvs: Remove unused eventgate-main-http service [puppet] - 10https://gerrit.wikimedia.org/r/562810 (https://phabricator.wikimedia.org/T241073) [12:36:20] !log EU SWAT done [12:36:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Append -http to eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/562805 (https://phabricator.wikimedia.org/T241073) (owner: 10Alexandros Kosiaris) [12:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Same thing as for eventgate-analytics-http. Merging, it's a NOOP effectively" [puppet] - 10https://gerrit.wikimedia.org/r/562805 (https://phabricator.wikimedia.org/T241073) (owner: 10Alexandros Kosiaris) [12:37:06] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag for gbirke [puppet] - 10https://gerrit.wikimedia.org/r/562809 (https://phabricator.wikimedia.org/T242215) (owner: 10Elukey) [12:42:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/562666 (owner: 10EBernhardson) [12:46:24] <_joe_> !log deleting releng/composer-php55:0.1.0 from the docker registry [12:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:32] (03PS1) 10Ema: ATS: add X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/562811 (https://phabricator.wikimedia.org/T237993) [12:57:08] (03PS2) 10Ema: ATS: add X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/562811 (https://phabricator.wikimedia.org/T237993) [12:57:09] (03PS4) 10Ema: ATS: add webrequest logging for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993) [13:15:58] (03PS1) 10Gehel: wdqs: enable async_imports by default [puppet] - 10https://gerrit.wikimedia.org/r/562817 [13:16:00] 10Operations, 10Traffic: Docker registry needs cache to 
vary on Accept header value - https://phabricator.wikimedia.org/T242200 (10BBlack) So long as the registry's responses do all the standards-based things correctly (they contain `Vary: Accept`, and the matching `Accept` values also match the `Content-Type`... [13:20:56] (03CR) 10Gehel: "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1002/20274/" [puppet] - 10https://gerrit.wikimedia.org/r/562817 (owner: 10Gehel) [13:36:54] (03PS1) 10Elukey: admin: add kerberos flag to user cohi [puppet] - 10https://gerrit.wikimedia.org/r/562824 (https://phabricator.wikimedia.org/T242217) [13:36:57] (03PS5) 10Ema: ATS: add webrequest logging for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993) [13:41:23] (03CR) 10Vgutierrez: [C: 03+1] "It looks good, but ideally X-Analytics-TLS shouldn't reach varnish-fe" [puppet] - 10https://gerrit.wikimedia.org/r/562811 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [13:44:17] (03CR) 10DCausse: [C: 03+1] wdqs: enable async_imports by default [puppet] - 10https://gerrit.wikimedia.org/r/562817 (owner: 10Gehel) [13:45:35] (03CR) 10Gehel: [C: 03+2] wdqs: enable async_imports by default [puppet] - 10https://gerrit.wikimedia.org/r/562817 (owner: 10Gehel) [13:47:33] (03PS1) 10ArielGlenn: Generate only missing 7z files when doing recompression job [dumps] - 10https://gerrit.wikimedia.org/r/562828 (https://phabricator.wikimedia.org/T242221) [13:50:41] (03PS3) 10Ema: ATS: add X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/562811 (https://phabricator.wikimedia.org/T237993) [13:51:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch eventgate-main LVS to use TLS port 4292 [puppet] - 10https://gerrit.wikimedia.org/r/559168 (https://phabricator.wikimedia.org/T241073) (owner: 10Ottomata) [13:52:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Same as with eventgate-analytics, merging and shepherding to production" [puppet] - 10https://gerrit.wikimedia.org/r/559168 
(https://phabricator.wikimedia.org/T241073) (owner: 10Ottomata) [13:53:54] (03PS7) 10Ema: ATS: add webrequest logging for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993) [13:55:53] akosiaris: phew for a second there i thought you just merged the main one...i hadn't submitted a patch for that [13:55:55] thanks for doing that! [13:56:25] ottomata: yeah I worked on that. It should be good to use in a few [13:56:49] up to now everything has gone perfect [13:56:57] awesome [13:57:05] thanks for renaming those too [13:58:10] it turned out to be less complicated than I feared [13:59:22] (03CR) 10ArielGlenn: [C: 03+2] Generate only missing 7z files when doing recompression job [dumps] - 10https://gerrit.wikimedia.org/r/562828 (https://phabricator.wikimedia.org/T242221) (owner: 10ArielGlenn) [13:59:24] (03PS4) 10Ema: ATS: add X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/562811 (https://phabricator.wikimedia.org/T237993) [13:59:26] (03PS8) 10Ema: ATS: add webrequest logging for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993) [13:59:51] akosiaris: am confused tho [13:59:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/559167/8/hieradata/common/discovery.yaml [14:00:04] longma and liw: Your horoscope predicts another unfortunate Mediawiki train - American+European Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200108T1400). 
[14:00:12] !log ariel@deploy1001 Started deploy [dumps/dumps@dbd0ecd]: don't regenerate existing 7z files on rerun of the 7z recompression job [14:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:17] !log ariel@deploy1001 Finished deploy [dumps/dumps@dbd0ecd]: don't regenerate existing 7z files on rerun of the 7z recompression job (duration: 00m 05s) [14:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:44] oh the discovery doesn't matter, since it is just dns? [14:00:48] yes [14:01:00] I had the epiphany as well today [14:01:23] ok, so now, all we have to do is change the ports on the client services [14:01:24] ? [14:01:36] which is just mediawiki, right? [14:01:43] no there are a few [14:01:48] ah, please do tell [14:01:53] change prop, job queue, analytics stuff, wdqs [14:01:53] or RTFM me [14:02:05] i don't think we have a good collection of all users [14:02:06] we should though eh [14:02:16] doc with collection of all users* [14:02:31] this is a good time to make one! [14:02:35] :) [14:03:36] pybal restarted everywhere, monitoring has been updated, everything looks peachy. Ball's in your court now [14:05:02] awesome thank you so much! [14:06:53] yw, thanks as well! [14:07:06] !log add routinator 0.6.4 to reprepro stretch-wikimedia - T242197 [14:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:09] T242197: Upgrade routinator to 0.6.4 - https://phabricator.wikimedia.org/T242197 [14:07:48] (03PS9) 10Ema: ATS: add webrequest logging for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993) [14:08:15] Lucas_WMDE: https://phabricator.wikimedia.org/T234450#5698503 seems most relevant to that. It's being applied as a security patch, yes, even though the task and patch are public. [14:09:51] ema: ! am very interested to understand how ^^ works :) [14:10:30] ok, thanks [14:12:24] ottomata: hey! 
It doesn't :) [14:13:02] hah wow sounds easy [14:13:10] ottomata: jk, that's the first step: we're configuring a named pipe to which ATS logs all requests with the format above [14:13:35] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [14:14:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:14:20] ottomata: then we're gonna write atskafka, which reads from there and sends to kafka [14:14:54] (plus metrics and all the nice things that elukey wrote here: T237993) [14:14:54] T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 [14:15:01] ah k [14:15:16] nice keeping them separate. so that is just regular ats logging stuff [14:15:28] no json formatting or anything, that will be done by atskafka? [14:15:37] that's the plan, yes [14:15:39] aye cool [14:16:42] I like the idea of keeping logging and kafkaing separate too :) [14:20:21] (03CR) 10Ema: [C: 03+2] ATS: add X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/562811 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [14:20:22] (03CR) 10Vgutierrez: [C: 03+1] ATS: add webrequest logging for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [14:22:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:23:28] !log depool cp4028 to test X-Analytics-TLS patch T237993 [14:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:32] T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 [14:30:00] (03PS1) 10Arturo Borrero Gonzalez: toolforge: prometheus: fix regex for cadvisor in the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/562837 (https://phabricator.wikimedia.org/T237643) [14:30:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:30:56] (03PS1) 10Arturo Borrero Gonzalez: toolforge: new k8s: prometheus: scrape metrics from each individual ingress pod [puppet] - 10https://gerrit.wikimedia.org/r/562838 (https://phabricator.wikimedia.org/T237643) [14:30:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks! One comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/562789 (owner: 10Jbond) [14:33:34] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:35:34] !log repool cp4028 after successful X-Analytics-TLS patch test T237993 [14:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:37] T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 [14:36:24] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Access to analytics infrastructure for SNowick_WMF - https://phabricator.wikimedia.org/T242026 (10Ottomata) Manual syncing is still needed for Hue (users are in MySQL, not SQLite, syncing is still needed). [14:37:21] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Access to analytics infrastructure for SNowick_WMF - https://phabricator.wikimedia.org/T242026 (10Ottomata) Done. Use your shell username and ldap password to login. [14:40:34] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Access to analytics infrastructure for SNowick_WMF - https://phabricator.wikimedia.org/T242026 (10Ottomata) Also hi and welcome! 
:D [14:42:29] (03Abandoned) 10Muehlenhoff: Inline a variant of apt::pin to package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/562544 (owner: 10Muehlenhoff) [14:42:34] (03PS2) 10Muehlenhoff: Deprecate raid1.cfg [puppet] - 10https://gerrit.wikimedia.org/r/562483 (https://phabricator.wikimedia.org/T156955) [14:42:36] (03PS2) 10Gehel: airflow: Provide wrapper script to invoke airflow [puppet] - 10https://gerrit.wikimedia.org/r/562666 (owner: 10EBernhardson) [14:45:03] (03CR) 10Gehel: [C: 03+2] airflow: Provide wrapper script to invoke airflow [puppet] - 10https://gerrit.wikimedia.org/r/562666 (owner: 10EBernhardson) [14:50:10] (03PS1) 10Ottomata: Use new TLS port for eventgate-analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562840 (https://phabricator.wikimedia.org/T242224) [14:53:02] (03PS1) 10Ema: ATS: escape hyphens in X-Analytics-TLS patterns [puppet] - 10https://gerrit.wikimedia.org/r/562841 (https://phabricator.wikimedia.org/T237993) [14:53:41] (03CR) 10Elukey: [V: 03+2] "Is it ok to submit right?" [homer/public] - 10https://gerrit.wikimedia.org/r/562543 (owner: 10Elukey) [14:53:54] XioNoX: shall I puppet-merge your routinator patch along? 
[14:54:05] moritzm: was about to ping you, yep [14:54:15] ok :-) [14:54:44] done [14:56:33] (03CR) 10Vgutierrez: [C: 03+1] ATS: escape hyphens in X-Analytics-TLS patterns [puppet] - 10https://gerrit.wikimedia.org/r/562841 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [14:58:45] (03PS3) 10Muehlenhoff: Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 [14:59:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:59:47] (03CR) 10jerkins-bot: [V: 04-1] Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 (owner: 10Muehlenhoff) [14:59:50] (03PS3) 10Vgutierrez: 1.3.1-3 Rebuild for buster [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/562801 (https://phabricator.wikimedia.org/T242093) [15:00:35] (03CR) 10Ottomata: [C: 03+2] Use new TLS port for eventgate-analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562840 (https://phabricator.wikimedia.org/T242224) (owner: 10Ottomata) [15:00:39] !log deploying change to use new TLS port for eventgate-analytics - T242224 [15:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:42] T242224: Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 [15:01:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:02:51] !log Routinator 0.6.4 looking good on rpki2001, upgrading rpki1001 - T242197 [15:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:54] T242197: Upgrade routinator to 0.6.4 - https://phabricator.wikimedia.org/T242197 [15:03:56] (03CR) 10jerkins-bot: [V: 04-1] 1.3.1-3 Rebuild for buster [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/562801 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [15:03:56] (03CR) 10Ema: [C: 03+2] ATS: escape hyphens in X-Analytics-TLS patterns [puppet] - 10https://gerrit.wikimedia.org/r/562841 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:04:01] (03PS4) 10Muehlenhoff: Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 [15:04:54] (03CR) 10jerkins-bot: [V: 04-1] Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 (owner: 10Muehlenhoff) [15:05:46] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Refactor, preparatory to testing multiple hosts in parallel. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/555515 (owner: 10RLazarus) [15:08:00] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:02] PROBLEM - PHP7 rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:02] PROBLEM - PHP7 rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:02] PROBLEM - PHP7 rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:02] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:02] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:02] PROBLEM - PHP7 rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:03] PROBLEM - PHP7 rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:04] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:12] PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:12] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:12] PROBLEM - PHP7 
rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:14] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:14] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:14] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:15] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:16] PROBLEM - PHP7 rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Test multiple hosts in parallel. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/559952 (owner: 10RLazarus) [15:08:18] PROBLEM - Apache HTTP on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:18] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:18] PROBLEM - PHP7 rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:20] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:20] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:20] PROBLEM - Nginx local proxy to apache on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:08:20] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:22] PROBLEM - Nginx local proxy to apache on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:22] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:22] PROBLEM - Nginx local proxy to apache on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers 
[15:08:22] PROBLEM - PHP7 rendering on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:26] PROBLEM - Apache HTTP on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:26] PROBLEM - PHP7 rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:28] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:30] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:34] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:34] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:34] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:34] PROBLEM - Nginx local proxy to apache on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:36] PROBLEM - PHP7 rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:36] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:36] PROBLEM - PHP7 rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:38] PROBLEM - PHP7 rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:38] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:38] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:38] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:40] PROBLEM - PHP7 rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:40] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:42] PROBLEM - PHP7 rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:42] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:42] PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:42] PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:42] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:44] PROBLEM - PHP7 rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:44] PROBLEM - PHP7 rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:44] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:44] PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:45] PROBLEM - Nginx local proxy to apache on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:45] PROBLEM - Nginx local proxy to apache on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:46] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:46] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [15:08:47] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [15:08:47] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:08:48] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:08:48] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:49] PROBLEM - Apache HTTP on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:49] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:50] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:50] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [15:08:51] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:08:51] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{ [15:08:52] obile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: 
/{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [15:08:52] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:10:08] uh oh [15:10:09] !log otto@deploy1001 Synchronized wmf-config/ProductionServices.php: Make EventBus use TLS for eventgate-analytics - T242224 (duration: 06m 10s) [15:10:10] wow [15:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:13] T242224: Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 [15:10:32] (03PS5) 10Muehlenhoff: Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 [15:10:59] <_joe_> what [15:11:02] hmm [15:11:13] <_joe_> ottomata: revert please [15:11:19] <_joe_> like now [15:11:21] am [15:11:26] o/ [15:11:29] (03CR) 10jerkins-bot: [V: 04-1] Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 (owner: 10Muehlenhoff) [15:11:29] * apergos looks in [15:11:30] (03PS1) 10Ottomata: Revert "Use new TLS port for eventgate-analytics" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562844 [15:11:30] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 23.76 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:11:31] <_joe_> I could've told you it would not work if I noticed the change [15:11:32] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Revert "Use new TLS port for eventgate-analytics" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562844 (owner: 10Ottomata) [15:11:43] !log otto@deploy1001 sync-file 
aborted: Make EventBus use TLS for eventgate-analytics - T242224 (duration: 00m 00s) [15:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:50] <_joe_> php-fpm and TLS don't like each other [15:11:58] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:12:03] uh huhh... [15:12:12] <_joe_> lemme see the impact [15:12:14] _joe_: it worked on mwdebug1001 [15:12:20] PROBLEM - ATS TLS has reduced HTTP availability #page on icinga1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [15:12:23] <_joe_> ottomata: without traffic, sure [15:12:24] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1284.eqiad.wmnet, mw1346.eqiad.wmnet, mw1348.eqiad.wmnet, mw1232.eqiad.wmnet, mw1344.eqiad.wmnet, mw1227.eqiad.wmnet, mw1229.eqiad.wmnet, mw1314.eqiad.wmnet, mw1279.eqiad.wmnet, mw1226.eqiad.wmnet, mw1317.eqiad.wmnet, mw1233.eqiad.wmnet, mw1222.eqiad.wmnet, mw1283.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225 [15:12:24] 281.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1223.eqiad.wmnet, mw1286.eqiad.wmnet, mw1282.eqiad.wmnet, mw1276.eqiad.wmnet, mw1221.eqiad.wmnet, mw1230.eqiad.wmnet, mw1235.eqiad.wmnet, mw1234.eqiad.wmnet, mw1278.eqiad.wmnet, mw1224.eqiad.wmnet, mw1316.eqiad.wmnet, mw1231.eqiad.wmnet, mw1312.eqiad.wmnet, mw1228.eqiad.wmnet, mw1297.eqiad.wmnet, mw1342.eqiad.wmnet, mw1289.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmn [15:12:24] wmnet, mw1277.eqiad.wmnet, mw1313.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [15:12:26] ah [15:12:31] I'm around if needed [15:12:37] !log otto@deploy1001 Scap failed!: 4/11 canaries failed their endpoint checks(http://en.wikipedia.org) [15:12:38] im also here [15:12:44] 
PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 57.15 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:12:50] having trouble syncing [15:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:53] timeouts [15:12:54] <_joe_> ottomata: --force? [15:13:01] * volans ofc around but I see enough people already [15:13:14] scap sync-file --force [15:13:15] ? [15:13:22] <_joe_> ottomata: IIRC, yes [15:13:31] <_joe_> I can't get data from grafana btw [15:13:36] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1284.eqiad.wmnet, mw1346.eqiad.wmnet, mw1280.eqiad.wmnet, mw1348.eqiad.wmnet, mw1232.eqiad.wmnet, mw1344.eqiad.wmnet, mw1287.eqiad.wmnet, mw1227.eqiad.wmnet, mw1288.eqiad.wmnet, mw1229.eqiad.wmnet, mw1314.eqiad.wmnet, mw1279.eqiad.wmnet, mw1226.eqiad.wmnet, mw1317.eqiad.wmnet, mw1233.eqiad.wmnet, mw1222.eqiad.wmnet, mw1283 [15:13:36] 340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1225.eqiad.wmnet, mw1281.eqiad.wmnet, mw1228.eqiad.wmnet, mw1345.eqiad.wmnet, mw1339.eqiad.wmnet, mw1286.eqiad.wmnet, mw1282.eqiad.wmnet, mw1276.eqiad.wmnet, mw1221.eqiad.wmnet, mw1230.eqiad.wmnet, mw1347.eqiad.wmnet, mw1235.eqiad.wmnet, mw1234.eqiad.wmnet, mw1278.eqiad.wmnet, mw1224.eqiad.wmnet, mw1290.eqiad.wmnet, mw1316.eqiad.wmnet, mw1231.eqiad.wmnet, mw1312.eqiad.wmnet, mw1223.eqiad.wmn [15:13:36] wmnet, mw1342.eqiad.wmnet, mw1289.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal [15:13:49] _joe_: likely if you use another text-lb it will work [15:13:50] PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/71/ [15:13:58] RECOVERY - Prometheus jobs reduced 
availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:14:12] wfm on esams btw [15:14:23] (03CR) 10Vgutierrez: "builds as expected on boron" [software/varnish/libvmod-re2] (debian) - 10https://gerrit.wikimedia.org/r/562801 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [15:14:28] <_joe_> now it does godog [15:14:32] revert sync in progress [15:14:41] <_joe_> ottomata: aye [15:14:59] <_joe_> apis are completely down [15:15:26] oof [15:16:29] <_joe_> ottomata: lmk when the sync is done [15:16:41] (03PS1) 10Clarakosi: Update parsoid_uri to use Parsoid-PHP [puppet] - 10https://gerrit.wikimedia.org/r/562845 (https://phabricator.wikimedia.org/T241756) [15:16:49] 15:13:58 Check php-fpm cache... [15:17:00] it does say [15:17:02] sync-apaches: 100% (ok: 325; fail: 0; left: 0) [15:17:02] 15:13:58 Finished sync-apaches (duration: 00m 06s) [15:17:05] <_joe_> which will never work [15:17:15] do we need to rolling restart apiservers by hand? [15:17:16] it is hanbging on the check php-fpm cache [15:17:18] are all the threads wedged? [15:17:19] <_joe_> ok things are definitely NOT back to normal [15:17:24] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:19:08] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15276 bytes in 6.335 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:19:19] yeah i still see some https configs on an app server [15:19:34] i could try and fix manually via cumin? [15:19:48] ottomata: sync-file again [15:19:49] <_joe_> ottomata: can you do another sync? 
[15:19:54] ko [15:19:56] ok [15:19:57] maybe this is the thing where old configurations get wedged in the cache [15:19:59] !log otto@deploy1001 sync-file aborted: REVERT Make EventBus use TLS for eventgate-analytics - T242224 (duration: 06m 33s) [15:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:02] T242224: Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 [15:20:02] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 59.62 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:20:14] can I help in anyway? [15:20:15] syncing again [15:20:28] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:20:28] volans: i made a config change that made eventbus use https [15:20:30] RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 3.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:20:31] in ProductionServices.php [15:20:32] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:20:32] RECOVERY - Nginx local proxy to apache on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:20:33] we need to revert it [15:20:35] RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.896 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:20:36] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.988 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:20:38] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:20:38] i'm scap syncing again [15:20:40] RECOVERY - Nginx local proxy to apache on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.694 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:20:40] RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.131 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:20:42] but i'm not sure it is working [15:20:42] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 1.701 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:20:43] ottomata: yes I'm aware [15:20:44] RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:20:50] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:20:50] RECOVERY - PHP7 rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:20:52] RECOVERY - Nginx local proxy to apache on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.772 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:20:52] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.095 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [15:20:54] RECOVERY - PHP7 rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.538 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:20:54] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.410 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:20:56] if we can manually revert the config line [15:21:00] RECOVERY - PHP7 rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 9.304 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:21:00] RECOVERY - PHP7 rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 9.987 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:21:01] ...unless it is working? [15:21:08] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.549 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:21:12] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:21:14] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.465 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:21:26] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:21:31] <_joe_> ok this time it worked [15:21:52] still hanging on checking php-fpm cache [15:21:58] <_joe_> I still see a the https url on some servers though [15:22:01] and i still see bad config ...yeah [15:22:07] on the one i'm looking at [15:22:10] RECOVERY - PHP7 
rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:22:30] <_joe_> ottomata: ok the problem is we don't timeout on the php cache check [15:22:42] <_joe_> it should be possible to disable it though, lemme see [15:22:57] would it be faster to cumin a full scap pull on each apiserver? [15:23:17] <_joe_> cdanis: that would kill the network, but maybe [15:23:27] for scap? --no-php-restart? [15:23:39] <_joe_> yes, that [15:23:41] <_joe_> thanks thcipriani [15:23:42] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 84.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:23:44] probably ought to make --force do that, too [15:23:46] <_joe_> I was looking at the code [15:23:48] ok i should run sync-file with that? 
[15:23:52] <_joe_> thcipriani: yep [15:23:53] doing [15:23:55] <_joe_> ottomata: and --force [15:23:57] yes [15:23:58] !log otto@deploy1001 sync-file aborted: REVERT Make EventBus use TLS for eventgate-analytics - T242224 (duration: 03m 56s) [15:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:22] scap: error: extra arguments found: --no-php-restart [15:24:34] scap sync-file --no-php-restart --force wmf-config/ProductionServices.php 'REVERT Make EventBus use TLS for eventgate-analytics - T242224' [15:24:54] thcipriani: ^ [15:25:12] PROBLEM - PHP7 rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:25:16] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:20] PROBLEM - Nginx local proxy to apache on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:20] PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:22] PROBLEM - Apache HTTP on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:22] PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:25:24] PROBLEM - Nginx local proxy to apache on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:26] PROBLEM - Nginx local proxy to apache on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:28] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:25:30] PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:25:32] <_joe_> ottomata: try now please [15:25:34] PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:34] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:35] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:38] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:38] PROBLEM - PHP7 rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:25:38] PROBLEM - PHP7 rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:25:38] PROBLEM - PHP7 rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:25:40] PROBLEM - PHP7 rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:25:44] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:25:47] <_joe_> without the --no-php-restart [15:25:48] 
PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:54] PROBLEM - Apache HTTP on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:25:58] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:26:07] <_joe_> ottomata: try a --force please [15:26:10] PROBLEM - Varnish has reduced HTTP availability #page on icinga1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [15:26:10] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expectin [15:26:10] ikitech.wikimedia.org/wiki/Wikifeeds [15:26:25] !log otto@deploy1001 Synchronized wmf-config/ProductionServices.php: REVERT Make EventBus use TLS for eventgate-analytics - T242224 (duration: 00m 34s) [15:26:28] worked [15:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:30] T242224: Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 [15:26:30] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.716 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:34] RECOVERY 
- Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.954 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:38] RECOVERY - PHP7 rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 1.557 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:38] RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.650 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:38] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.791 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:38] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.958 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:38] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.261 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:40] RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 3.419 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:42] RECOVERY - PHP7 rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 8.950 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:44] RECOVERY - PHP7 rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 5.904 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:44] RECOVERY - Nginx local proxy to apache on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:44] RECOVERY - 
Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.967 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:44] RECOVERY - Nginx local proxy to apache on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 4.161 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:44] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.376 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:45] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.162 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:45] RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.422 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:46] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.802 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:46] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:26:47] RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:47] RECOVERY - PHP7 rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:48] RECOVERY - PHP7 rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 2.929 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:48] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:26:50] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:26:50] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:50] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:50] (CR) Jdlrobson: [C: +1] Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: Ammarpad) [15:26:52] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:52] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:26:54] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:54] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:54] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:54] RECOVERY - Nginx local proxy to apache on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.053 second response time
https://wikitech.wikimedia.org/wiki/Application_servers [15:26:55] RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.833 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:58] RECOVERY - Nginx local proxy to apache on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:58] RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:58] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:58] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:58] RECOVERY - PHP7 rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 76981 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:59] RECOVERY - PHP7 rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:26:59] RECOVERY - PHP7 rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:00] RECOVERY - PHP7 rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:00] RECOVERY - LVS HTTP IPv4 #page on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 23764 bytes in 0.318 second response 
time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:27:02] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:02] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:27:04] RECOVERY - PHP7 rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:04] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:27:04] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:27:05] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:05] RECOVERY - Nginx local proxy to apache on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:05] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:05] RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:06] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:06] RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 627 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:07] RECOVERY - Nginx local proxy to apache on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:07] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:08] RECOVERY - PHP7 rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:08] RECOVERY - PHP7 rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:09] RECOVERY - PHP7 rendering on mw1342 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:09] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:27:09] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.528 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:10] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:27:11] RECOVERY - Nginx local proxy to apache on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 2.162 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:11] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.050 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [15:27:12] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:12] RECOVERY - Nginx local proxy to apache on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:12] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:13] RECOVERY - Nginx local proxy to apache on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:14] RECOVERY - PHP7 rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:14] RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:15] RECOVERY - Apache HTTP on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:15] RECOVERY - Nginx local proxy to apache on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:16] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:16] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.053 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [15:27:16] RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:17] RECOVERY - PHP7 rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:18] RECOVERY - PHP7 rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:18] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.479 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:18] RECOVERY - Nginx local proxy to apache on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:19] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:27:19] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [15:27:20] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 76982 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:27:28] (PS1) Ema: Revert "ATS: assign 8G instead of 2G to RAM caches on ats-be" [puppet] - https://gerrit.wikimedia.org/r/562849 (https://phabricator.wikimedia.org/T241593) [15:27:31] (PS2) Ema: Revert "ATS: assign 8G instead of 2G to RAM caches on ats-be" [puppet] - https://gerrit.wikimedia.org/r/562849
(https://phabricator.wikimedia.org/T241593) [15:27:35] um. and. um. it didn't work before because I didn't merge in the revert... :( sorry. I often forget that because of the fetch && diff steps, but forget to merge after it looks good. [15:27:43] yar i'm sorry all. [15:28:16] ottomata: it's also possible to revert locally at deploy1001 and care 'bout Gerrit later [15:28:40] gerrit revert was fine [15:28:52] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:29:05] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:29:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:29:12] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:29:14] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [15:29:18] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 47.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:29:20] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:29:20] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:29:22] RECOVERY - restbase endpoints health on restbase2014 is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:29:22] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:29:26] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:29:28] ottomata: sorry you have to know so much about deploying. In an emergency it should be easier than that. [15:29:30] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:29:32] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:29:45] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [15:29:52] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:29:52] i probably should have let someone else do the revert emergency sync [15:29:54] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:29:56] RECOVERY - Varnish has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [15:30:00] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:30:04] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3060.esams.wmnet, cp3054.esams.wmnet, cp3064.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb_443: Servers cp3060.esams.wmnet, cp3054.esams.wmnet, cp3052.esams.wmnet, cp3064.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: testlb6_443: Servers cp3060.esams.wmnet, cp3054.esams.wmnet, cp3064.e [15:30:05] 6.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3060.esams.wmnet, cp3054.esams.wmnet, cp3052.esams.wmnet, cp3064.esams.wmnet, cp3056.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:30:16] RECOVERY - PHP7 rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 76983 bytes in 7.418 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:30:18] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [15:30:18] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:30:23] <_joe_> something's not right in esams right now [15:30:24] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:30:28] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [15:30:35] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [15:30:48] RECOVERY - ATS TLS has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [15:31:08] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:31:10] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:31:16] RECOVERY - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1978 bytes in 5.775 second response time https://phabricator.wikimedia.org/project/view/71/ [15:31:30] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:31:32] RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [15:31:34] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:31:36] RECOVERY - Nginx local proxy to apache on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:31:46] Hey folks, I am in Europe and nothing opens at all but VPN via US works just fine [15:31:48] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.8917 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:32:18] thanks for the report. 
folks are looking into it. [15:32:22] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:32:30] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [15:32:48] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:32:48] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [15:32:48] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:33:02] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [15:33:10] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:33:16] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:33:16] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:33:26] RECOVERY - 
graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [15:33:32] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:34:01] (CR) Ema: [C: +2] Revert "ATS: assign 8G instead of 2G to RAM caches on ats-be" [puppet] - https://gerrit.wikimedia.org/r/562849 (https://phabricator.wikimedia.org/T241593) (owner: Ema) [15:34:02] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:34:22] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.
returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [15:34:28] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:34:42] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:34:46] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:34:56] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:35:02] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:35:02] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:35:02] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:35:44] PROBLEM - Apache HTTP on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:35:52] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:36:14] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:36:34] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, 
connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.esams.wikimedia.org, port=443): Read timed out. (read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [15:37:09] (PS1) BBlack: Depool esams temporarily [dns] - https://gerrit.wikimedia.org/r/562850 [15:37:09] (PS1) CDanis: depool esams text [dns] - https://gerrit.wikimedia.org/r/562851 [15:37:09] VE on mw.org seems to have trouble? [15:37:12] (CR) BBlack: [C: +2] depool esams text [dns] - https://gerrit.wikimedia.org/r/562851 (owner: CDanis) [15:37:22] RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:37:30] dcausse: let's assume it's related to the ongoing outage [15:37:40] if it's not cleared up when this does, we can look then [15:37:40] PROBLEM - HTTPS Unified ECDSA on cp3050 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:37:40] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp3050 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:37:40] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp3050 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:37:40] PROBLEM - HTTPS Unified RSA on cp3050 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:37:41] !log authdns-update to depool esams [15:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:51] !log cumin -s10 -b1 'A:cp-text_esams' 'run-puppet-agent -q ; ats-backend-restart' [15:37:52] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:56] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:38:08] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 20453 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:38:34] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:38:42] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.1 200 Ok - 31770 bytes in 0.663 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:38:44] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp3050 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 547769 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 271 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:38:44] RECOVERY - HTTPS Unified RSA on cp3050 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 547768 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 271 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:38:54] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp3050 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 559279 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 271 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:39:02] RECOVERY - HTTPS Unified ECDSA on cp3050 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 559270 seconds left:Certificate *.wikipedia.org contains 
all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 271 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:40:02] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:40:06] !log restarting ats-tls on esams text nodes [15:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:18] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:42:00] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 2.198 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:42:26] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.esams.wikimedia.org, port=443): Read timed out. 
(read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [15:42:34] (Abandoned) BBlack: Depool esams temporarily [dns] - https://gerrit.wikimedia.org/r/562850 (owner: BBlack) [15:43:04] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15275 bytes in 7.527 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:43:38] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:43:56] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15263 bytes in 0.544 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:44:38] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:45:00] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:45:34] !log cumin -s10 -b1 'A:cp-text_eqiad' 'run-puppet-agent -q ; ats-backend-restart' [15:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:16] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:46:42] PROBLEM - HTTPS Unified ECDSA on cp3058 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:46:42] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp3058 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:47:28] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 81.4 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:47:34] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp3058 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 547238 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 271 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:47:34] RECOVERY - HTTPS Unified ECDSA on cp3058 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 558758 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 271 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:47:38] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.1 200 Ok - 31683 bytes in 0.430 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:47:52] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 20462 bytes in 0.268 second 
response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:47:58] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp3062 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:47:58] PROBLEM - HTTPS Unified RSA on cp3062 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:47:58] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp3062 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:48:32] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.1 200 Ok - 31603 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:48:48] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp3062 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 558685 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 271 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:48:48] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp3062 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 547165 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 271 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:49:18] RECOVERY - HTTPS Unified RSA on cp3062 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 547135 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 271 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:49:30] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: 
HTTP OK: HTTP/1.0 200 OK - 20457 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:49:39] (03PS1) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [15:50:58] (03PS2) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [15:51:00] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 37.49 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:52:30] (03CR) 10Ppchelko: [C: 04-1] "LGTM. -1 until I095ed9b4cf2afd2e933738246d49fa416d151d6e is fully deployed." [puppet] - 10https://gerrit.wikimedia.org/r/562845 (https://phabricator.wikimedia.org/T241756) (owner: 10Clarakosi) [15:52:45] (03PS3) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [15:52:46] (03PS4) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [15:54:44] (03CR) 10jerkins-bot: [V: 04-1] ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [15:56:21] (03PS5) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [15:56:59] (03PS1) 10Herron: apply profile::base::firewall to default nodes [puppet] - 10https://gerrit.wikimedia.org/r/562856 [15:57:42] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [15:58:10] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed 
out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:58:24] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia [15:58:24] e [16:00:29] (03PS1) 10BBlack: Revert "depool esams text" [dns] - 10https://gerrit.wikimedia.org/r/562858 [16:00:30] (03CR) 10BBlack: [C: 03+2] Revert "depool esams text" [dns] - 10https://gerrit.wikimedia.org/r/562858 (owner: 10BBlack) [16:00:41] !log re-pooling esams text traffic in DNS [16:01:10] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [16:01:48] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:01:56] <_joe_> what's all this ^^ [16:01:59] bblack: Failed to log message to wiki. Somebody should check the error logs. 
[16:02:02] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:02:30] _joe: we suspect text-lb overload with all the esams traffic [16:02:30] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:02:32] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp1077.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:03:00] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:03:16] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [16:03:23] (03PS6) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [16:03:48] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:04:12] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:04:20] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:04:42] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:04:50] PROBLEM - WDQS high update lag on wdqs1010 is CRITICAL: 3653 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:05:00] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:05:20] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:05:26] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:09:12] 10Operations, 10Traffic, 10Performance Issue: Current performance issues - https://phabricator.wikimedia.org/T242228 (10Gestumblindi) [16:12:02] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10herron) The row_A ganeti group is running low on memory capacity (please see T239151#5707691) . Should we allocate a few of these new hosts to expand the existing row... 
[16:16:20] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 48.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:19:08] RECOVERY - WDQS high update lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 904.7 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:19:19] <_joe_> the eqiad alert is expected [16:20:50] !log rolling ats-be restart on !text@eqiad, !text@esams to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/562849/ [16:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:14] I guess it clears itself in another 30 mins? [16:25:14] <_joe_> !log running puppet on deploy1001 to remove my hot-patch to scap.cfg [16:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:57] 10Operations, 10Traffic, 10Performance Issue: Current performance issues - https://phabricator.wikimedia.org/T242228 (10Joe) 05Open→03Resolved a:03Joe Hi, thanks for your report! We were already aware of the issues, and were at work to solve them. Everything should be fine now though. 
[16:29:53] 10Operations, 10Traffic, 10Performance Issue: Current performance issues - https://phabricator.wikimedia.org/T242228 (10Joe) An incident report will be published later on wikitech at https://wikitech.wikimedia.org/wiki/Incident_documentation [16:43:12] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 70.48 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:44:08] yep 30 mins like clockwork [16:53:39] (03PS6) 10Muehlenhoff: Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 [16:54:25] (03CR) 10jerkins-bot: [V: 04-1] Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 (owner: 10Muehlenhoff) [16:58:38] (03CR) 10CDanis: [C: 03+1] "\o/" [homer/public] - 10https://gerrit.wikimedia.org/r/562692 (owner: 10Ayounsi) [17:11:14] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 52275664 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:13:02] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 42328 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:15:32] since when we started to have postgres in production [17:16:07] For maps and shizz [17:16:09] (a while) [17:17:01] also puppetdb [17:17:45] I thought we didn't have postgres at all [17:22:21] OpenStreetMaps doesn't work with any other database as far as I know [17:25:56] (03PS7) 10Muehlenhoff: Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 [17:27:54] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: hw troubleshooting: hardware RAID predictive failure for bellatrix.frack.codfw.wmnet 
- https://phabricator.wikimedia.org/T240876 (10Papaul) @Jgreen this server is out of warranty since 2017 and we have a replacement server already on site that w... [17:28:49] (03PS1) 10Elukey: admin: add kerberos flag for user dsharpe [puppet] - 10https://gerrit.wikimedia.org/r/562882 (https://phabricator.wikimedia.org/T242244) [17:29:20] RECOVERY - HP RAID on ms-be2035 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:30:06] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10Papaul) @Marostegui thanks will wait tomorrow the 9th so he can take the server down for the FW upgrade. [17:32:05] (03Abandoned) 10Zoranzoki21: Rearrange of wmgEnableGeoData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561658 (owner: 10Zoranzoki21) [17:33:09] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag for user dsharpe [puppet] - 10https://gerrit.wikimedia.org/r/562882 (https://phabricator.wikimedia.org/T242244) (owner: 10Elukey) [17:36:26] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Access to analytics infrastructure for SNowick_WMF - https://phabricator.wikimedia.org/T242026 (10Dzahn) 05Open→03Resolved Cool, thanks, Ottomata. Closing ticket. 
[17:51:41] (03PS1) 10Joal: Bump aqs druid snapshot to 2019-12 [puppet] - 10https://gerrit.wikimedia.org/r/562887 [17:51:46] sigh, I was updating [[m:System administrators]] [17:52:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:52:36] (03CR) 10Elukey: [C: 03+2] Add tables to analytics regular sqoop list [puppet] - 10https://gerrit.wikimedia.org/r/562322 (https://phabricator.wikimedia.org/T242015) (owner: 10Joal) [17:52:41] (03CR) 10Elukey: [C: 03+2] Bump aqs druid snapshot to 2019-12 [puppet] - 10https://gerrit.wikimedia.org/r/562887 (owner: 10Joal) [17:53:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:57:50] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Request to add Silvan Heintze to the ldap/wmde group - https://phabricator.wikimedia.org/T242080 (10Dzahn) @RStallman-legalteam good to go? [17:59:38] RECOVERY - Device not healthy -SMART- on ms-be2035 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2035&var-datasource=codfw+prometheus/ops [18:03:37] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [18:03:37] !log elukey@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) [18:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:00] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [18:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:10] (was not using tmux) [18:04:41] (03CR) 10Volans: "Do you have a compiler result by any chance?" [puppet] - 10https://gerrit.wikimedia.org/r/562787 (owner: 10Muehlenhoff) [18:07:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [18:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:42] where is wikibugs gone? 
killed for excess flood apparently [18:14:30] naughty old bots [18:17:31] (03PS1) 10RobH: setting new eqsin PDUs dns entries [dns] - 10https://gerrit.wikimedia.org/r/562894 (https://phabricator.wikimedia.org/T242250) [18:18:36] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ebb1849] (dev-cluster): Clean up Parsoid-PHP transition code & config T241756 [18:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:39] T241756: Clean-up Parsoid-PHP transition code from RESTBase - https://phabricator.wikimedia.org/T241756 [18:19:35] (03CR) 10RobH: [C: 03+2] setting new eqsin PDUs dns entries [dns] - 10https://gerrit.wikimedia.org/r/562894 (https://phabricator.wikimedia.org/T242250) (owner: 10RobH) [18:21:16] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ebb1849] (dev-cluster): Clean up Parsoid-PHP transition code & config T241756 (duration: 02m 41s) [18:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:03] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ebb1849]: Clean up Parsoid-PHP transition code & config T241756 [18:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:08] moritzm: hi, can I have a quick word? [18:30:35] (03PS1) 10Arturo Borrero Gonzalez: nagios_common: contacgroups: arturo in email paging for prod servers [puppet] - 10https://gerrit.wikimedia.org/r/562902 [18:33:08] !log restarted wikibugs [18:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:31] (03CR) 10Ppchelko: [C: 03+1] "Currently this variable is temporary not used in RESTBase. 
Once this is merged and deployed, we will switch RESTBase to using the variable" [puppet] - 10https://gerrit.wikimedia.org/r/562845 (https://phabricator.wikimedia.org/T241756) (owner: 10Clarakosi) [18:34:26] (03PS3) 10Jforrester: logspam.pl: Shorten paths and include fatals [puppet] - 10https://gerrit.wikimedia.org/r/559246 (https://phabricator.wikimedia.org/T242252) (owner: 10Brennen Bearnes) [18:34:50] (03CR) 10Jforrester: [C: 03+1] logspam.pl: Shorten paths and include fatals [puppet] - 10https://gerrit.wikimedia.org/r/559246 (https://phabricator.wikimedia.org/T242252) (owner: 10Brennen Bearnes) [18:36:30] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ebb1849]: Clean up Parsoid-PHP transition code & config T241756 (duration: 14m 27s) [18:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:35] T241756: Clean-up Parsoid-PHP transition code from RESTBase - https://phabricator.wikimedia.org/T241756 [18:39:05] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[18:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:31] (03PS1) 10Ottomata: staging/eventgate-logging-external - fix name of client error schema to precache [deployment-charts] - 10https://gerrit.wikimedia.org/r/562906 (https://phabricator.wikimedia.org/T240985) [18:43:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1097:3315', diff saved to https://phabricator.wikimedia.org/P10089 and previous config saved to /var/cache/conftool/dbconfig/20200108-184350-marostegui.json [18:43:52] (03CR) 10Ottomata: [V: 03+2 C: 03+2] staging/eventgate-logging-external - fix name of client error schema to precache [deployment-charts] - 10https://gerrit.wikimedia.org/r/562906 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [18:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315', diff saved to https://phabricator.wikimedia.org/P10090 and previous config saved to /var/cache/conftool/dbconfig/20200108-184510-marostegui.json [18:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:08] !log Remove partitions from dewiki.revision on db1096:3315 T239453 [18:46:09] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10CDanis) 05Resolved→03Open boldly re-opening this, now that the POPs have Ganeti clusters available. Today I learned that text-lb.esams receives something like 60k+ PP... [18:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:10] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [18:46:32] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[18:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:43] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Request to add Silvan Heintze to the ldap/wmde group - https://phabricator.wikimedia.org/T242080 (10RStallman-legalteam) Yes, the NDA is signed and filed. Thanks all! [18:50:16] (03PS27) 10Cwhite: lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [18:52:16] (03CR) 10jerkins-bot: [V: 04-1] lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [18:53:25] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Request to add Silvan Heintze to the ldap/wmde group - https://phabricator.wikimedia.org/T242080 (10Dzahn) [18:56:00] (03PS28) 10Cwhite: lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [18:58:00] (03CR) 10jerkins-bot: [V: 04-1] lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [19:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200108T1900). [19:00:04] Ammarpad: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:36] (03PS1) 10Ottomata: eventgate - use new primary schema repository by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/562908 (https://phabricator.wikimedia.org/T240985) [19:01:33] (03CR) 10Ottomata: [C: 03+2] eventgate - use new primary schema repository by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/562908 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [19:02:38] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [19:02:50] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [19:03:25] (03PS1) 10Ottomata: Add missing eventgate-0.0.17.tgz [deployment-charts] - 10https://gerrit.wikimedia.org/r/562909 (https://phabricator.wikimedia.org/T240985) [19:03:48] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [19:04:25] !log joal@deploy1001 Started deploy [analytics/refinery@c205576]: Regular analytics weekly deploy train [19:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:51] (03CR) 10Ottomata: [C: 03+2] Add missing eventgate-0.0.17.tgz [deployment-charts] - 10https://gerrit.wikimedia.org/r/562909 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [19:06:45] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen) [19:07:00] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen) [19:07:50] (03PS29) 10Cwhite: lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - 10https://gerrit.wikimedia.org/r/542472 
(https://phabricator.wikimedia.org/T205870) [19:09:06] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen) [19:09:41] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen) [19:11:43] (03PS1) 10Dzahn: admins: add Silvan Heintze to ldap_only_admins (WMDE) [puppet] - 10https://gerrit.wikimedia.org/r/562910 (https://phabricator.wikimedia.org/T242080) [19:13:00] !log joal@deploy1001 Finished deploy [analytics/refinery@c205576]: Regular analytics weekly deploy train (duration: 08m 36s) [19:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:20] (03CR) 10Dzahn: [C: 03+2] admins: add Silvan Heintze to ldap_only_admins (WMDE) [puppet] - 10https://gerrit.wikimedia.org/r/562910 (https://phabricator.wikimedia.org/T242080) (owner: 10Dzahn) [19:13:25] !log joal@deploy1001 Started deploy [analytics/refinery@c205576] (thin): Regular analytics weekly deploy train [thin] [19:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:32] !log joal@deploy1001 Finished deploy [analytics/refinery@c205576] (thin): Regular analytics weekly deploy train [thin] (duration: 00m 07s) [19:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:29] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[19:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:09] !log LDAP - added 'sihe' to 'wmde' and 'nda' (T242080) [19:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:11] T242080: Request to add Silvan Heintze to the ldap/wmde group - https://phabricator.wikimedia.org/T242080 [19:16:33] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10Patch-For-Review: Request to add Silvan Heintze to the ldap/wmde group - https://phabricator.wikimedia.org/T242080 (10Dzahn) [19:17:55] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10Patch-For-Review: Request to add Silvan Heintze to the ldap/wmde group - https://phabricator.wikimedia.org/T242080 (10Dzahn) 05Open→03Resolved @Silvan_WMDE Done! Things should work as expected now. You are in the LDAP group(s). [19:22:12] (03CR) 10RLazarus: [C: 03+2] Refactor, preparatory to testing multiple hosts in parallel. [software/httpbb] - 10https://gerrit.wikimedia.org/r/555515 (owner: 10RLazarus) [19:22:19] (03CR) 10RLazarus: [C: 03+2] Test multiple hosts in parallel. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/559952 (owner: 10RLazarus) [19:23:14] (03PS2) 10Ammarpad: Add ipblock-exempt and extendedconfirmed to bot group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562040 (https://phabricator.wikimedia.org/T241904) [19:25:10] (03PS3) 10Ammarpad: Set $wgArticleCountMethod to 'any' for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561572 (https://phabricator.wikimedia.org/T241694) [19:26:37] (03PS11) 10Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) [19:28:36] (03PS1) 10Ottomata: eventgate-logging-external - use proper schema_title name for mediawiki.client.error stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/562930 (https://phabricator.wikimedia.org/T240985) [19:28:52] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Jgreen) [19:29:17] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) [19:29:22] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) [19:29:41] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - use proper schema_title name for mediawiki.client.error stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/562930 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [19:29:58] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) [19:30:10] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) [19:30:57] 
10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Jgreen) [19:31:26] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [19:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:30] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Jgreen) [19:38:40] (03PS4) 10Brennen Bearnes: logspam.pl: Shorten paths and include fatals [puppet] - 10https://gerrit.wikimedia.org/r/559246 (https://phabricator.wikimedia.org/T242252) [19:40:43] (03CR) 10jerkins-bot: [V: 04-1] logspam.pl: Shorten paths and include fatals [puppet] - 10https://gerrit.wikimedia.org/r/559246 (https://phabricator.wikimedia.org/T242252) (owner: 10Brennen Bearnes) [19:43:01] (03PS5) 10Brennen Bearnes: logspam.pl: Shorten paths and include fatals [puppet] - 10https://gerrit.wikimedia.org/r/559246 (https://phabricator.wikimedia.org/T242252) [19:49:08] (03PS30) 10Cwhite: lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [19:54:03] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2035 is CRITICAL: connect to address 10.192.32.165 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:54:09] (03PS31) 10Cwhite: lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [19:54:19] PROBLEM - configured eth on ms-be2035 is CRITICAL: connect to address 10.192.32.165 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:54:41] PROBLEM - dhclient process on ms-be2035 is CRITICAL: 
connect to address 10.192.32.165 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:54:43] PROBLEM - DPKG on ms-be2035 is CRITICAL: connect to address 10.192.32.165 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:55:11] PROBLEM - Check size of conntrack table on ms-be2035 is CRITICAL: connect to address 10.192.32.165 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:55:21] PROBLEM - very high load average likely xfs on ms-be2035 is CRITICAL: connect to address 10.192.32.165 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Swift [19:55:37] PROBLEM - Disk space on ms-be2035 is CRITICAL: connect to address 10.192.32.165 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2035&var-datasource=codfw+prometheus/ops [19:55:43] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: connect to address 10.192.32.165 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:53] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be2035 is CRITICAL: connect to address 10.192.32.165 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [19:57:40] downtimed ^ known, I'll look tomorrow [19:58:11] (03PS1) 10Dzahn: admins: add Kai Nissen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/562940 (https://phabricator.wikimedia.org/T241838) [20:00:04] longma and liw: Your horoscope predicts another unfortunate Mediawiki train - American+European Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200108T2000). 
[20:00:20] (03CR) 10Cwhite: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/20280/" [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [20:00:53] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [20:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:16] (03PS1) 10Jeena Huneidi: group1 wikis to 1.35.0-wmf.14 refs T233862 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562946 [20:06:19] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.35.0-wmf.14 refs T233862 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562946 (owner: 10Jeena Huneidi) [20:09:08] (03PS1) 10Dzahn: admins: add Dave Pifke to perf-team admins [puppet] - 10https://gerrit.wikimedia.org/r/562947 (https://phabricator.wikimedia.org/T242189) [20:09:28] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers in perf-team group for dpifke - https://phabricator.wikimedia.org/T242189 (10Dzahn) a:03Dzahn [20:11:50] (03CR) 10Bartosz Dziewoński: "Peter said "great"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561649 (owner: 10Bartosz Dziewoński) [20:13:17] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 48124368 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:13:23] (03CR) 10Bartosz Dziewoński: "https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T0000" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561649 (owner: 10Bartosz Dziewoński) [20:13:29] (03PS2) 10Bartosz Dziewoński: Remove 2017 wikitext editor as default on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561649 [20:13:40] (03CR) 10Dzahn: [C: 03+2] "it's just about the allowed data types. 
to allow changing it to 7.3 in cloud" [puppet] - 10https://gerrit.wikimedia.org/r/561931 (owner: 10Dzahn) [20:15:05] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 43016 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:17:29] (03PS3) 10Dzahn: gerrit: adjust bacula backup behaviour to deal with multiple hosts [puppet] - 10https://gerrit.wikimedia.org/r/562639 (https://phabricator.wikimedia.org/T239151) [20:17:44] (03CR) 10Dzahn: [C: 03+2] gerrit: adjust bacula backup behaviour to deal with multiple hosts [puppet] - 10https://gerrit.wikimedia.org/r/562639 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [20:31:29] (03CR) 10Dzahn: [V: 03+2 C: 03+2] gerrit: adjust bacula backup behaviour to deal with multiple hosts [puppet] - 10https://gerrit.wikimedia.org/r/562639 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [20:34:32] (03CR) 10Jeena Huneidi: [C: 03+2] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562946 (owner: 10Jeena Huneidi) [20:36:09] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562946 (owner: 10Jeena Huneidi) [20:40:15] !log contint1001 - restarting zuul service [20:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:34] (03CR) 10Dzahn: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562946 (owner: 10Jeena Huneidi) [20:45:02] (03CR) 10Dzahn: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562946 (owner: 10Jeena Huneidi) [20:45:17] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562946 (owner: 10Jeena Huneidi) [20:46:12] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.14 refs T233862 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562946 (owner: 10Jeena Huneidi) [20:49:11] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: 
There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [20:50:20] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: LDF server has 404 errors for JS and CSS resources - https://phabricator.wikimedia.org/T237165 (10Mstyles) [20:50:23] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.14 refs T233862 [20:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:26] T233862: 1.35.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T233862 [20:50:35] (03PS1) 10Joal: Update turnilo configuration [puppet] - 10https://gerrit.wikimedia.org/r/562958 (https://phabricator.wikimedia.org/T240681) [20:51:28] !log jhuneidi@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.14 refs T233862 (duration: 01m 04s) [20:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:55] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [20:54:48] mutante: everything ok? [20:56:28] (03PS1) 10Ottomata: eventgate - Bump staging services image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/562962 (https://phabricator.wikimedia.org/T240985) [20:56:50] (03CR) 10jerkins-bot: [V: 04-1] eventgate - Bump staging services image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/562962 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [20:56:57] longma: So far LGTM. 
[20:57:32] 👍 [20:57:40] (03CR) 10Ottomata: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/562962 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [20:58:19] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [20:58:28] cdanis: oh.. yes, all ok. merged now [20:58:39] (03CR) 10Ottomata: [C: 03+2] eventgate - Bump staging services image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/562962 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [21:00:01] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [21:00:04] cscott, arlolra, subbu, halfak, and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200108T2100). [21:00:25] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [21:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:40] deploying ORES [21:02:28] (03CR) 10Umherirrender: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561649 (owner: 10Bartosz Dziewoński) [21:03:08] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [21:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:47] !log halfak@deploy1001 Started deploy [ores/deploy@039251f]: T242035 [21:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:50] T242035: First deployment of the new decade! 
- https://phabricator.wikimedia.org/T242035 [21:07:20] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . [21:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:50] Canary looks good. Continuing [21:12:27] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Jgreen) [21:20:18] (03PS1) 10Dzahn: ferm_misc/db: allow connections from gerrit-test in ferm [puppet] - 10https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151) [21:21:17] !log halfak@deploy1001 Finished deploy [ores/deploy@039251f]: T242035 (duration: 16m 32s) [21:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:20] T242035: First deployment of the new decade! - https://phabricator.wikimedia.org/T242035 [21:23:01] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1100 - https://phabricator.wikimedia.org/T241506 (10Jclark-ctr) Drive was ordered should arrive shortly will update when it arrives [21:23:14] Everything looks good. 
[21:23:58] (03CR) 10Dzahn: "Hi Manuel, so we would like to let gerrit-test connect to the Gerrit DB (m2-master / dbproxy1007) but ideally we don't want it to have UPD" [puppet] - 10https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [21:26:48] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Dzahn) p:05Triage→03Normal [21:28:30] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: Revert "commonswiki to 1.35.0-wmf.11" [21:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [21:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:39] longma: Thanks! [21:29:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1016 crash - https://phabricator.wikimedia.org/T241882 (10Jclark-ctr) Confirmed: Service Request 1009577756 was successfully submitted. [21:30:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [21:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:55] !log phab1003 - running decom cookbook - shutdown host, removed from puppetmaster, debmonitor etc (T238957) [21:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:58] T238957: decommission phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T238957 [21:31:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1016 crash - https://phabricator.wikimedia.org/T241882 (10Jclark-ctr) Confirmed: Service Request 1009577756 was successfully submitted. [21:35:03] halfak: all done? 
[21:35:11] (03PS2) 10Dzahn: remove service IPs and IPv6 for phab1003 [dns] - 10https://gerrit.wikimedia.org/r/552599 (https://phabricator.wikimedia.org/T238957) [21:35:12] yes :) [21:35:22] Sorry I wasn't clear [21:35:46] all good [21:36:03] just being cautious [21:36:04] (03CR) 10Dzahn: [C: 03+2] "host has been shut down" [dns] - 10https://gerrit.wikimedia.org/r/552599 (https://phabricator.wikimedia.org/T238957) (owner: 10Dzahn) [21:37:11] 10Operations, 10ops-codfw: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) [21:38:03] (03CR) 10Dzahn: [C: 03+1] "This host has been shut down today (by the decom script)" [puppet] - 10https://gerrit.wikimedia.org/r/552607 (https://phabricator.wikimedia.org/T238957) (owner: 10Dzahn) [21:38:18] !log arlolra@deploy1001 Started deploy [parsoid/deploy@45a4245]: Updating Parsoid to f963e51 [21:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:05] 10Operations, 10ops-codfw, 10DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Papaul) [21:46:17] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@45a4245]: Updating Parsoid to f963e51 (duration: 08m 00s) [21:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:01] (03CR) 10EBernhardson: [C: 04-1] "not needed anymore, we ended up getting it working to talk to an-coord1001 sql" [puppet] - 10https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [21:49:15] (03PS2) 10EBernhardson: airflow: remove config settings for Celery Executor and Flower [puppet] - 10https://gerrit.wikimedia.org/r/553413 (owner: 10Dzahn) [21:49:33] (03CR) 10EBernhardson: [C: 03+1] "seems reasonable to reduce confusion" [puppet] - 10https://gerrit.wikimedia.org/r/553413 (owner: 10Dzahn) [21:50:15] (03Abandoned) 10Dzahn: airflow: add a local mariadb 
server [puppet] - 10https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [21:51:09] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Jgreen) [21:51:34] (03CR) 10Dzahn: [C: 03+2] airflow: remove config settings for Celery Executor and Flower [puppet] - 10https://gerrit.wikimedia.org/r/553413 (owner: 10Dzahn) [21:55:58] !log Updated Parsoid to f963e51 (T238934, T237318, T238022, T228217) [21:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:14] T238934: Call to a member function getContent() on null - https://phabricator.wikimedia.org/T238934 [21:56:15] T237318: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T237318 [21:56:15] T228217: Ensure all the features of parse.js are covered by parse.php - https://phabricator.wikimedia.org/T228217 [21:56:15] T238022: Parsoid/JS use of \w \s \b etc is inconsistent with PHP's behavior when the 'u' regexp modifier is used, which leads to selective serializer output differences between Parsoid/PHP & Parsoid/JS in some scenarios - https://phabricator.wikimedia.org/T238022 [21:58:54] (03PS6) 10Dzahn: gerrit: make scap user configurable in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/536704 [22:00:11] (03PS2) 10CDanis: fastnetmon: remove UDP and ICMP limits [puppet] - 10https://gerrit.wikimedia.org/r/562387 (https://phabricator.wikimedia.org/T241374) [22:00:34] (03PS7) 10Dzahn: gerrit: make scap user configurable in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/536704 [22:01:25] PROBLEM - PHP opcache health on wtp1027 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:01:52] (03PS1) 10Jforrester: Revert commonswiki to 1.35.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562973 
[22:02:05] (03CR) 10Dzahn: "amended to keep the "user and key name can be changed in Hiera" while removing the "user/group creation"-part of it. That probably needs t" [puppet] - 10https://gerrit.wikimedia.org/r/536704 (owner: 10Dzahn) [22:03:36] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/20282/gerrit1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/536704 (owner: 10Dzahn) [22:03:56] (03CR) 10Jeena Huneidi: [C: 03+2] Revert commonswiki to 1.35.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562973 (owner: 10Jforrester) [22:04:13] RECOVERY - PHP opcache health on wtp1027 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:04:38] (03CR) 10CDanis: fastnetmon: remove UDP and ICMP limits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/562387 (https://phabricator.wikimedia.org/T241374) (owner: 10CDanis) [22:05:09] (03Merged) 10jenkins-bot: Revert commonswiki to 1.35.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562973 (owner: 10Jforrester) [22:08:17] (03PS8) 10Dzahn: gerrit: make scap user configurable in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/536704 [22:10:38] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/20283/gerrit1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/536704 (owner: 10Dzahn) [22:11:40] (03CR) 10Paladox: gerrit: make scap user configurable in Hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536704 (owner: 10Dzahn) [22:18:13] (03CR) 10Dzahn: [C: 03+1] gerrit: make scap user configurable in Hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536704 (owner: 10Dzahn) [22:18:22] (03PS9) 10Dzahn: gerrit: make scap user configurable in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/536704 [22:19:44] (03CR) 10Paladox: [C: 03+1] gerrit: make scap user configurable in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/536704 
(owner: 10Dzahn) [22:22:43] 10Operations, 10ops-codfw, 10DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Papaul) [22:23:01] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:25:01] looking what interface that is [22:25:17] BFD neighbor fe80::7a4f:9b00:174e:8004 down [22:25:47] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: LDF server has 404 errors for JS and CSS resources - https://phabricator.wikimedia.org/T237165 (10Mstyles) from inside any of the WDQS machines ( 'wdqs1004.eqiad.wmnet','wdqs1005.eqiad.wmnet', 'wdqs1006.eqiad.wmnet','wdqs1007.eqi... [22:26:15] well, i can't follow the docs after that. dont have access [22:26:38] mutante: Zuul seems to have frozen again, BTW. [22:29:23] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10wiki_willy) a:03Papaul [22:33:35] James_F: logs say it's doing something (now) [22:34:11] Yeah, but new gerrit events aren't coming in? Zuul dashboard is very quiet. 
[22:34:29] (03PS1) 10Paladox: Gerrit: Remove nocanon from apache template [puppet] - 10https://gerrit.wikimedia.org/r/562977 [22:34:31] (03PS2) 10Paladox: Gerrit: Remove nocanon from apache template [puppet] - 10https://gerrit.wikimedia.org/r/562977 [22:37:42] (03PS1) 10Jhedden: ceph: add prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) [22:38:18] (03Abandoned) 10Jhedden: lvs: update cloudceph proxy check url [puppet] - 10https://gerrit.wikimedia.org/r/562637 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [22:38:20] (03Abandoned) 10Paladox: Gerrit: Remove nocanon from apache template [puppet] - 10https://gerrit.wikimedia.org/r/562977 (owner: 10Paladox) [22:39:39] https://grafana.wikimedia.org/d/000000322/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:39:48] !log restarted zuul on contint1001 [22:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:34] (03PS2) 10Jhedden: ceph: add prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) [22:42:19] (03PS3) 10Jhedden: ceph: add prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) [22:46:32] (03PS4) 10Jhedden: ceph: add prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) [22:47:40] (03PS5) 10Jhedden: ceph: add prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) [22:50:20] (03PS3) 10Jdlrobson: Drop beta setting. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562607 (https://phabricator.wikimedia.org/T237290) [22:51:10] (03CR) 10Jforrester: "Please re-fix the commit message before merging." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/562607 (https://phabricator.wikimedia.org/T237290) (owner: 10Jdlrobson) [22:51:21] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/536704 (owner: 10Dzahn) [22:54:07] (03CR) 10Dzahn: [C: 03+2] gerrit: make scap user configurable in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/536704 (owner: 10Dzahn) [22:55:52] (03PS6) 10Jhedden: ceph: add prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) [22:56:20] (03PS7) 10Jhedden: ceph: add prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) [23:00:10] (03PS1) 10Dzahn: admins: add Moushira Elamrawy to ldap_only_admins (WMF-ctr) [puppet] - 10https://gerrit.wikimedia.org/r/562981 (https://phabricator.wikimedia.org/T242000) [23:08:59] !log LDAP - added moushirael to 'wmf' (T242000) [23:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:02] T242000: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 [23:10:14] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10Dzahn) 05Open→03Resolved Hi @Moushira you have been added to the "wmf" group with the Moushirael user. I confirm it has the wikimedia.org... [23:12:08] (03CR) 10Dzahn: "We are not using the old "moushira" user which is absented and was the wmf employee user, we are using the separate user with the contract" [puppet] - 10https://gerrit.wikimedia.org/r/562981 (https://phabricator.wikimedia.org/T242000) (owner: 10Dzahn) [23:14:56] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10Dzahn) @Moushira @MeganHernandez_WMF Just one more question. 
Contractor access usually has an associated "expiry date". Is there a date whe... [23:15:28] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10Dzahn) 05Resolved→03Open [23:15:55] (03CR) 10Dzahn: [C: 03+2] admins: add Moushira Elamrawy to ldap_only_admins (WMF-ctr) [puppet] - 10https://gerrit.wikimedia.org/r/562981 (https://phabricator.wikimedia.org/T242000) (owner: 10Dzahn) [23:17:53] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Clean up SSL configuration - https://phabricator.wikimedia.org/T240941 (10Dzahn) [23:18:20] clever nickname halafk :) [23:18:33] ^_^ [23:19:16] (03PS2) 10Dzahn: remove production IPs for phab1003 [dns] - 10https://gerrit.wikimedia.org/r/552601 (https://phabricator.wikimedia.org/T238957) [23:19:56] (03CR) 10Dzahn: [C: 04-1] "hmm..i'll wait with this until we removed the IP from mysql GRANTS.. before something else recycles them" [dns] - 10https://gerrit.wikimedia.org/r/552601 (https://phabricator.wikimedia.org/T238957) (owner: 10Dzahn) [23:25:58] (03CR) 10Jhedden: "PCC results https://puppet-compiler.wmflabs.org/compiler1003/20284/" [puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [23:30:32] (03PS7) 10Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - 10https://gerrit.wikimedia.org/r/556265 [23:30:34] (03PS8) 10Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/556270 [23:31:52] (03PS9) 10Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/556270 [23:31:59] (03CR) 10Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556270 (owner: 10Paladox) [23:32:40] Grabbing the prod conch. 
[23:34:19] (03Abandoned) 10Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [labs/private] - 10https://gerrit.wikimedia.org/r/556268 (owner: 10Paladox) [23:34:38] (03PS8) 10Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - 10https://gerrit.wikimedia.org/r/556265 [23:34:57] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.14/extensions/WikibaseMediaInfo/resources/statements/StatementWidget.js: T242286 Update StatementWidget initialization logic (duration: 01m 05s) [23:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:00] T242286: Unable to to add Structured Data to files that don't have any; "mainSnak.getValue is not a function" thrown in console - https://phabricator.wikimedia.org/T242286 [23:35:08] (03PS10) 10Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/556270 [23:35:15] longma: wmf.14 now fixed for Commons, if you want to roll the train forwards there? [23:35:26] okay [23:36:58] (03PS1) 10Catrope: GrowthExperiments: Set newcomer tasks config title ahead of deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562987 (https://phabricator.wikimedia.org/T233465) [23:42:38] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10Moushira) Thanks @Dzahn, yes it works now. I am in the process of contract extension, not sure about the new expiry dateyet, and yes Megan i... 
[23:44:13] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: Roll commonswiki forward to 1.35.0-wmf.14 [23:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:23] (03PS1) 10Jeena Huneidi: Roll commonswiki forward to 1.35.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562989 [23:46:44] (03CR) 10Jeena Huneidi: [C: 03+2] Roll commonswiki forward to 1.35.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562989 (owner: 10Jeena Huneidi) [23:47:34] (03Merged) 10jenkins-bot: Roll commonswiki forward to 1.35.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562989 (owner: 10Jeena Huneidi) [23:57:29] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops