[00:00:05] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T0000).
[00:05:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:07:40] 3~/win 31
[00:14:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:38:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:40:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:43:54] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (etcd1002), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:07:10] (PS1) Marostegui: db1131: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/613023 (https://phabricator.wikimedia.org/T257253)
[05:08:46] (CR) Marostegui: [C: +2] db1131: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/613023 (https://phabricator.wikimedia.org/T257253) (owner: Marostegui)
[05:11:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1131', diff saved to https://phabricator.wikimedia.org/P11919 and previous config saved to /var/cache/conftool/dbconfig/20200716-051109-marostegui.json
[05:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1131', diff saved to https://phabricator.wikimedia.org/P11920 and previous config saved to /var/cache/conftool/dbconfig/20200716-051342-marostegui.json
[05:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:36] (PS2) Privacybatm: Firewall.py: Add unit tests [software/transferpy] - https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600)
[05:20:29] (PS1) Privacybatm: Transferer.py: Add unit tests [software/transferpy] - https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600)
[05:20:53] (CR) jerkins-bot: [V: -1] Transferer.py: Add unit tests [software/transferpy] - https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: Privacybatm)
[06:32:30] (PS2) Privacybatm: Transferer.py: Add unit tests [software/transferpy] - https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600)
[06:33:33] Operations, Traffic, netops: Remove multicast - https://phabricator.wikimedia.org/T257573 (ayounsi)
[06:40:20] !log remove peering with AS8403 in eqsin (peer left the IX)
[06:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:44] (PS1) Jcrespo: mariadb-backups: Modularize common stats config into a subtemplate [puppet] - https://gerrit.wikimedia.org/r/613080 (https://phabricator.wikimedia.org/T258045)
[06:44:29] (PS3) Privacybatm: Transferer.py: Add unit tests [software/transferpy] - https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600)
[06:44:46] Operations, Puppet, Diffusion, Phabricator: Diffusion (Phabricator) operations-puppet repo synchronization error - https://phabricator.wikimedia.org/T257895 (jcrespo) Thanks!
[06:44:53] (CR) jerkins-bot: [V: -1] Transferer.py: Add unit tests [software/transferpy] - https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: Privacybatm)
[06:46:34] (CR) Jcrespo: [C: +2] mariadb-backups: Modularize common stats config into a subtemplate [puppet] - https://gerrit.wikimedia.org/r/613080 (https://phabricator.wikimedia.org/T258045) (owner: Jcrespo)
[06:46:36] (CR) Privacybatm: [C: -1] "The problem here is with the python version, If I change the python version to python3.7 in tox.ini, everything would work fine. What shou" [software/transferpy] - https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: Privacybatm)
[06:59:03] (PS1) Ayounsi: Remove IGMP snooping from all access switches [homer/public] - https://gerrit.wikimedia.org/r/613081 (https://phabricator.wikimedia.org/T257573)
[07:01:24] (PS1) Ayounsi: Remove IGMP snooping stanza [homer/public] - https://gerrit.wikimedia.org/r/613082 (https://phabricator.wikimedia.org/T257573)
[07:03:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1131', diff saved to https://phabricator.wikimedia.org/P11921 and previous config saved to /var/cache/conftool/dbconfig/20200716-070331-marostegui.json
[07:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:57] (PS1) Elukey: role::druid::test_analytics::worker: use CDH hadoop client [puppet] - https://gerrit.wikimedia.org/r/613084 (https://phabricator.wikimedia.org/T244482)
[07:17:28] (CR) Elukey: [C: +2] role::druid::test_analytics::worker: use CDH hadoop client [puppet] - https://gerrit.wikimedia.org/r/613084 (https://phabricator.wikimedia.org/T244482) (owner: Elukey)
[07:23:37] Operations, DBA, OTRS, serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (jcrespo) Open→Resolved Credentials have been setup and shared on client root dir, feel free to productionize as you see adequate: `...
[07:23:43] Operations, OTRS, serviceops, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (jcrespo)
[07:25:44] !log Drop database reviewdb-test T255715
[07:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:49] T255715: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715
[07:28:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1131', diff saved to https://phabricator.wikimedia.org/P11922 and previous config saved to /var/cache/conftool/dbconfig/20200716-072838-marostegui.json
[07:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:46] Operations, ops-eqiad, DBA, Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui) Open→Resolved This is all done, host fully back in production Pending: what was exactly replaced on this host on-site, so we can track that just in case this host...
[07:31:11] !log imported envoyproxy_1.14.4-1 to buster-wikimedia
[07:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:27] (CR) Jcrespo: [C: +2] transfer.py: Refactor split_target function [software/transferpy] - https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) (owner: Privacybatm)
[07:38:34] (CR) Muehlenhoff: "This looks good! A few comments inline" (4 comments) [debs/grafana-loki] (debian/sid) - https://gerrit.wikimedia.org/r/610864 (owner: Cwhite)
[07:41:39] !log imported envoyproxy_1.14.4-1 to stretch-wikimedia
[07:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:49] (CR) Jcrespo: "When testing on production I got the following error:" (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[07:47:22] (PS4) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843)
[07:48:21] (CR) Jcrespo: "Workaround?" (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[08:03:28] (PS1) Marostegui: mariadb: Reimage db2081 [puppet] - https://gerrit.wikimedia.org/r/613087
[08:03:36] !log remove PIM config from ulsfo routers - T257573
[08:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:41] T257573: Remove multicast - https://phabricator.wikimedia.org/T257573
[08:04:54] (CR) Marostegui: [C: +2] mariadb: Reimage db2081 [puppet] - https://gerrit.wikimedia.org/r/613087 (owner: Marostegui)
[08:06:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2081', diff saved to https://phabricator.wikimedia.org/P11923 and previous config saved to /var/cache/conftool/dbconfig/20200716-080613-marostegui.json
[08:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:31] !log remove PIM config from codfw routers - T257573
[08:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:23] !log update mail delivery for phabricator to use phabricator.discovery.wmnet cname
[08:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:31] (CR) Jbond: [C: +2] exim: update phabricator redirects to use CNAME [puppet] - https://gerrit.wikimedia.org/r/612860 (owner: Jbond)
[08:08:52] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:09:50] ^ unrelated with the PIM work
[08:09:58] !log remove PIM config from eqsin routers - T257573
[08:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:03] T257573: Remove multicast - https://phabricator.wikimedia.org/T257573
[08:11:58] !log remove PIM config from esams routers - T257573
[08:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:18] (CR) Jbond: [C: +2] profile::waf::apache::administrative: remove waf config [puppet] - https://gerrit.wikimedia.org/r/608806 (https://phabricator.wikimedia.org/T253632) (owner: Jbond)
[08:13:54] Operations, ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (akosiaris) >>! In T243112#6309704, @Papaul wrote: > @akosiaris no need to reopen the task since this needs to be done by the service owner on another task and not on the...
[08:14:15] !log remove PIM config from eqiad routers - T257573
[08:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:03] !log remove PIM config from eqord/eqdfw/knams routers - T257573
[08:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:09] T257573: Remove multicast - https://phabricator.wikimedia.org/T257573
[08:15:27] !log installing python-urllib3 security updates
[08:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:57] Operations, Traffic, netops, Patch-For-Review: Remove multicast - https://phabricator.wikimedia.org/T257573 (ayounsi) Confirmed with `run clear multicast statistics` then `run show multicast statistics` that counters stay at 0.
[08:18:02] Operations, netops, Patch-For-Review: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842 (ayounsi) Open→Resolved a:ayounsi No more PIM in the infra.
[08:18:19] Operations, netops, Patch-For-Review: Find a new PIM RP IP - https://phabricator.wikimedia.org/T167842 (ayounsi) Resolved→Declined
[08:19:04] (PS3) Jbond: block_abuse_nets: enable block abuse nets on misc sites [puppet] - https://gerrit.wikimedia.org/r/608807 (https://phabricator.wikimedia.org/T253632)
[08:20:11] (PS1) Ayounsi: Remove PIM related CR stanza [homer/public] - https://gerrit.wikimedia.org/r/613091 (https://phabricator.wikimedia.org/T257573)
[08:20:57] (CR) Jbond: [C: +2] block_abuse_nets: enable block abuse nets on misc sites [puppet] - https://gerrit.wikimedia.org/r/608807 (https://phabricator.wikimedia.org/T253632) (owner: Jbond)
[08:21:45] (CR) Ayounsi: [C: +2] Remove IGMP snooping from all access switches [homer/public] - https://gerrit.wikimedia.org/r/613081 (https://phabricator.wikimedia.org/T257573) (owner: Ayounsi)
[08:22:09] (Merged) jenkins-bot: Remove IGMP snooping from all access switches [homer/public] - https://gerrit.wikimedia.org/r/613081 (https://phabricator.wikimedia.org/T257573) (owner: Ayounsi)
[08:23:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[08:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:16] !log remove igmp-snooping from access switches - T257573
[08:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:21] T257573: Remove multicast - https://phabricator.wikimedia.org/T257573
[08:25:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:43] !log installing dbus security updates
[08:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:01] (CR) Privacybatm: "> Patch Set 18:" (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[08:28:20] Operations, Phabricator, Security-Team: HTTP 500 error trying to access phabricator from my server - https://phabricator.wikimedia.org/T257507 (jbond) @Isarra I have not removed the blocking put in place from previous vandalism incidents are you able to confirm if your access issues have been resolved?
[08:30:27] Operations, Phabricator, Traffic: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (jbond) @Fae I have no removed the blocked referred to in T253632 are you able to confirm if this is still an issue?
[08:30:41] (CR) Ayounsi: [C: +2] Remove IGMP snooping stanza [homer/public] - https://gerrit.wikimedia.org/r/613082 (https://phabricator.wikimedia.org/T257573) (owner: Ayounsi)
[08:30:43] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:31:05] PROBLEM - Host ganeti1008 is DOWN: PING CRITICAL - Packet loss = 100%
[08:31:05] (Merged) jenkins-bot: Remove IGMP snooping stanza [homer/public] - https://gerrit.wikimedia.org/r/613082 (https://phabricator.wikimedia.org/T257573) (owner: Ayounsi)
[08:31:10] moritzm: related to some reboot? ^^^
[08:31:40] no, maintnance for ganeti server
[08:31:41] PROBLEM - Host kubetcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:31:47] k
[08:31:47] I'll extend downtimes
[08:32:26] thx
[08:32:48] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:32:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:55] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 52, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:32:57] Operations, netops: IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46 - https://phabricator.wikimedia.org/T201039 (ayounsi) Open→Declined IGMP snooping removed from all switches with T257573
[08:33:04] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:33:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:28] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:33:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:40] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/612929 (owner: Dzahn)
[08:34:06] I've extended these by another day, I guess Jason will be able to add the RAM some time today
[08:34:19] John, ofc
[08:34:37] :)
[08:34:59] (CR) Ayounsi: [C: +2] Remove PIM related CR stanza [homer/public] - https://gerrit.wikimedia.org/r/613091 (https://phabricator.wikimedia.org/T257573) (owner: Ayounsi)
[08:35:25] (Merged) jenkins-bot: Remove PIM related CR stanza [homer/public] - https://gerrit.wikimedia.org/r/613091 (https://phabricator.wikimedia.org/T257573) (owner: Ayounsi)
[08:35:25] !log Remove PIM/IGMP related CR stanza (acls) - T257573
[08:35:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:30] T257573: Remove multicast - https://phabricator.wikimedia.org/T257573
[08:36:36] Operations, Traffic: planet.wm.org missing from planet.discovery.wmnet Subject Alternative Name - https://phabricator.wikimedia.org/T257840 (ema) >>! In T257840#6309655, @Dzahn wrote: > With existing Apache config https://planet.wikimedia.org redirects to https://meta.wikimedia.org/wiki/Planet_Wikimedia...
[08:36:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:37:51] (CR) Jcrespo: "> Yeah, this is the ordering issue, and when you do --parallel-checksum, the parallel-checksum did not actually invoked! By default the ch" [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[08:39:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2081', diff saved to https://phabricator.wikimedia.org/P11924 and previous config saved to /var/cache/conftool/dbconfig/20200716-083954-marostegui.json
[08:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:29] (PS1) Marostegui: db2081: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/613094
[08:41:04] (CR) Marostegui: [C: +2] db2081: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/613094 (owner: Marostegui)
[08:41:09] !log installing sqlite3 security updates
[08:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:28] (CR) Jcrespo: "I propose you something:" [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[08:51:54] (CR) Privacybatm: "> Patch Set 18:" (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[08:52:25] !log updated envoyproxy to 1.14.4-1 on mwdebug1001.eqiad.wmnet
[08:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:13] (CR) Muehlenhoff: [C: +1] "Looks good (and for the record, the current RT setup doesn't even use the "heavy" variant of Exim anymore)" [puppet] - https://gerrit.wikimedia.org/r/612929 (owner: Dzahn)
[08:54:09] Operations, Phabricator, Traffic: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (Fae) >>! In T254568#6311349, @jbond wrote: > @Fae I have no removed the blocked referred to in T253632 are you able to confirm if this is still an issue?...
[08:55:16] Operations, Traffic, netops, Patch-For-Review: Remove multicast - https://phabricator.wikimedia.org/T257573 (ayounsi)
[08:55:36] Operations, Traffic, netops, Patch-For-Review: Remove multicast - https://phabricator.wikimedia.org/T257573 (ayounsi) Open→Resolved a:ayounsi No more multicast/PIM in the infra.
[08:56:16] (CR) Privacybatm: "> Patch Set 18:" [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:02:11] (PS1) Effie Mouzeli: Create namespaces/calico rules for push-notifications [deployment-charts] - https://gerrit.wikimedia.org/r/613097 (https://phabricator.wikimedia.org/T256973)
[09:03:12] (PS19) Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979)
[09:03:19] (CR) jerkins-bot: [V: -1] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:03:55] Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, serviceops, Patch-For-Review: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (jijiki) @MSantos Is there an internal service/database push-notifications will be communicatin...
[09:04:34] (CR) Privacybatm: "Thank you for the review, I will rebase this and push it right away." [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:04:40] (CR) Jcrespo: "Will need manual rebase." [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:07:16] (PS20) Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979)
[09:08:28] (CR) Privacybatm: [C: +1] "> Patch Set 19:" [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:13:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:13:37] (CR) Jcrespo: [C: +2] Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:14:04] (Merged) jenkins-bot: Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:15:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:15:26] !log rebooting flowspec1001
[09:15:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[09:15:28] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
[09:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[09:15:30] (CR) Jcrespo: [C: +2] transfer.py: Comment on setup_logger function [software/transferpy] - https://gerrit.wikimedia.org/r/612354 (https://phabricator.wikimedia.org/T257600) (owner: Privacybatm)
[09:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:42] (Merged) jenkins-bot: transfer.py: Comment on setup_logger function [software/transferpy] - https://gerrit.wikimedia.org/r/612354 (https://phabricator.wikimedia.org/T257600) (owner: Privacybatm)
[09:16:47] Operations, Phabricator, Traffic: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (jbond) > Yes, still an issue. See example attempting to edit my talk page on meta using TOR a few minutes ago. > {F31936961} @Fae, Thanks for the update...
[09:17:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:11] (PS1) Effie Mouzeli: Add k8s dummy tokens for push-notifications [labs/private] - https://gerrit.wikimedia.org/r/613101 (https://phabricator.wikimedia.org/T256973)
[09:21:09] (PS21) Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979)
[09:21:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[09:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:33] (CR) jerkins-bot: [V: -1] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:22:32] Operations, Phabricator, Traffic: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (Fae) Thanks for the explanation. I'll raise separate tasks for TOR issues on other projects if they become an issue. A short soak test by trying 10 diffe...
[09:22:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:31] Operations, Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (akosiaris) An exhaustive list of docker registry images that have bpo9 packages is at P11925 (alongside the script that generated them)
[09:26:31] (CR) Jcrespo: "See this. However, does the package really have to reserve /var/lib/transferpy? As I think this is used by servers used by transfer py (cl" (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: Privacybatm)
[09:26:33] (CR) Alexandros Kosiaris: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[09:28:40] (PS22) Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979)
[09:29:03] jouncebot: next
[09:29:03] In 1 hour(s) and 30 minute(s): European mid-day backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T1100)
[09:30:22] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[09:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:03] (PS1) Effie Mouzeli: Kubernetes: Create token stanzas for push-notifications [puppet] - https://gerrit.wikimedia.org/r/613104 (https://phabricator.wikimedia.org/T256973)
[09:34:25] (CR) Muehlenhoff: [C: +2] Switch yarn.wikimedia.org to CAS [puppet] - https://gerrit.wikimedia.org/r/612143 (https://phabricator.wikimedia.org/T159584) (owner: Muehlenhoff)
[09:35:34] Operations, serviceops: Update deprecated extension names in envoy config - https://phabricator.wikimedia.org/T258140 (JMeybohm)
[09:37:09] (CR) Jcrespo: "I tried this by running 2 concurrent calls of transfer.py on port 3000 and the second run gave me an exception:" [software/transferpy] - https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: Privacybatm)
[09:37:10] Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service, serviceops, Patch-For-Review: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (jijiki)
[09:37:42] !log updated envoyproxy to 1.14.4-1 on mw1325.eqiad.wmnet and restbase1026.eqiad.wmnet
[09:37:44] Operations, Phabricator, Traffic: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (jbond) Great thanks @fae
[09:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:19] (CR) Privacybatm: "> Patch Set 1:" (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: Privacybatm)
[09:42:38] Operations, MediaWiki-extensions-Babel: Two user pages on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (Legoktm)
[09:44:11] (CR) Privacybatm: "> Patch Set 3:" [software/transferpy] - https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: Privacybatm)
[09:44:52] (PS11) Effie Mouzeli: charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos)
[09:47:00] (PS2) Jcrespo: Firewall.py: Provide an absolute path to commands and refactor a function [software/transferpy] - https://gerrit.wikimedia.org/r/612705 (https://phabricator.wikimedia.org/T257600) (owner: Privacybatm)
[09:47:06] (CR) Jcrespo: [C: +1] Firewall.py: Provide an absolute path to commands and refactor a function [software/transferpy] - https://gerrit.wikimedia.org/r/612705 (https://phabricator.wikimedia.org/T257600) (owner: Privacybatm)
[09:48:11] (CR) Jcrespo: [C: +2] Firewall.py: Provide an absolute path to commands and refactor a function [software/transferpy] - https://gerrit.wikimedia.org/r/612705 (https://phabricator.wikimedia.org/T257600) (owner: Privacybatm)
[09:48:43] (Merged) jenkins-bot: Firewall.py: Provide an absolute path to commands and refactor a function [software/transferpy] - https://gerrit.wikimedia.org/r/612705 (https://phabricator.wikimedia.org/T257600) (owner: Privacybatm)
[09:49:40] (PS23) Jcrespo: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:53:42] (PS1) Muehlenhoff: Revert "Switch yarn.wikimedia.org to CAS" [puppet] - https://gerrit.wikimedia.org/r/613108
[09:56:19] (CR) Muehlenhoff: [C: +2] Revert "Switch yarn.wikimedia.org to CAS" [puppet] - https://gerrit.wikimedia.org/r/613108 (owner: Muehlenhoff)
[09:56:56] (CR) Jcrespo: [C: +2] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[09:57:23] (Merged) jenkins-bot: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[10:04:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:07:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:08:32] (PS1) Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - https://gerrit.wikimedia.org/r/613112
[10:08:52] !log disable puppet for hiera5 deployment
[10:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:15] (CR) jerkins-bot: [V: -1] (WIP) labs - hiera: migrate to hiera version5 [puppet] - https://gerrit.wikimedia.org/r/613112 (owner: Jbond)
[10:11:21] (CR) Jbond: [C: +2] hiera5: upgrade to hiera5 [puppet] - https://gerrit.wikimedia.org/r/566559 (https://phabricator.wikimedia.org/T254248) (owner: Jbond)
[10:11:57] Operations, CAS-SSO, Patch-For-Review, User-jbond: mod_auth_cas segfaulting on Stretch Apache setups using OpenSSL 1.0.2 and 1.1 (netmon/yarn) - https://phabricator.wikimedia.org/T257587 (MoritzMuehlenhoff)
[10:13:33] Operations, CAS-SSO, Patch-For-Review, User-jbond: mod_auth_cas segfaulting on Stretch Apache setups using OpenSSL 1.0.2 and 1.1 (netmon/yarn) - https://phabricator.wikimedia.org/T257587 (MoritzMuehlenhoff) Also happened for analytics-tool1001 running Yarn. As there's no technical fix on the Apac...
[10:14:24] PROBLEM - yarn.wikimedia.org requires authentication on analytics-tool1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 401 Unauthorized https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[10:16:06] this is an old error --^
[10:16:20] Operations, Cloud-Services, Mail, User-herron, cloud-services-team (Kanban): Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (aborrero) Heads up, I'm reverting the changes introduced in this ticket, see {T257534} for reference. I'm pr...
[10:17:35] !log upgrade to hiera5 [10:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:41] 10Operations, 10vm-requests: eqiad: New Ganeti instance for Yarn (an-tool1008) - https://phabricator.wikimedia.org/T258145 (10MoritzMuehlenhoff) [10:35:02] (03CR) 10Jcrespo: "> Patch Set 3:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:39:44] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, 10Patch-For-Review: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10MSantos) >>! In T256973#6311423, @jijiki wrote: > @MSantos Is there an internal service/databa... [10:40:50] (03CR) 10Privacybatm: "> Patch Set 3:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [10:43:52] (03PS1) 10Arturo Borrero Gonzalez: cloud: eqiad1: drop dmz_cidr exclussion 172.16.0.0/21 : 172.16.0.0/21 [puppet] - 10https://gerrit.wikimedia.org/r/613123 (https://phabricator.wikimedia.org/T257534) [10:47:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: eqiad1: drop dmz_cidr exclussion 172.16.0.0/21 : 172.16.0.0/21 [puppet] - 10https://gerrit.wikimedia.org/r/613123 (https://phabricator.wikimedia.org/T257534) (owner: 10Arturo Borrero Gonzalez) [10:48:45] (03PS4) 10Privacybatm: Firewall.py: Solve auto port detection concurrency issue [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) [10:50:17] (03CR) 10Privacybatm: "> Patch Set 3:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window(Max 6 patches) deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T1100). [11:00:04] Addshore: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] o/ [11:00:49] I’m here, but I think addshore can handle this himself :) [11:04:06] (03PS12) 10Addshore: Wikidata client wikis: Define entity sources configuration (take 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609988 (https://phabricator.wikimedia.org/T254315) [11:05:16] (03PS8) 10Addshore: Wikibase: remove wmgWikibaseLocalEntitySourceName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612670 (https://phabricator.wikimedia.org/T254315) [11:09:02] (03CR) 10ZPapierski: [C: 03+1] [wdqs] overrides default blazegraph ns [puppet] - 10https://gerrit.wikimedia.org/r/611373 (owner: 10DCausse) [11:10:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add k8s dummy tokens for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/613101 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:11:08] (03CR) 10Addshore: [C: 03+2] Wikidata client wikis: Define entity sources configuration (take 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609988 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:11:11] (03PS1) 10Privacybatm: Make transferpy configurable using a configuration file [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) [11:11:58] (03Merged) 10jenkins-bot: Wikidata client wikis: Define entity sources configuration (take 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609988 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:13:03] (03PS1) 10Muehlenhoff: Add DNS entry for an-tool1008 [dns] - 10https://gerrit.wikimedia.org/r/613129 (https://phabricator.wikimedia.org/T258145) [11:13:11] (03PS1) 
10Jforrester: Fix config variables regex concatenation [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612917 (https://phabricator.wikimedia.org/T258134) [11:13:16] 10Operations, 10Analytics: Move yarn.wikimedia.org to a separate Buster VM - https://phabricator.wikimedia.org/T258152 (10MoritzMuehlenhoff) [11:13:28] (03CR) 10jerkins-bot: [V: 04-1] Add DNS entry for an-tool1008 [dns] - 10https://gerrit.wikimedia.org/r/613129 (https://phabricator.wikimedia.org/T258145) (owner: 10Muehlenhoff) [11:13:31] (03CR) 10Jforrester: [C: 03+2] Fix config variables regex concatenation [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612917 (https://phabricator.wikimedia.org/T258134) (owner: 10Jforrester) [11:14:42] (03PS2) 10Privacybatm: Make transferpy configurable using a configuration file [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) [11:15:32] (03CR) 10Jcrespo: "> > An error/unsuccesful transmission is indeed expected, but is there a reason why a stacktrace is shown to the user and the generated ex" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:15:58] (03CR) 10Privacybatm: "This is an interesting patch actually:-)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) (owner: 10Privacybatm) [11:16:42] (03Merged) 10jenkins-bot: Fix config variables regex concatenation [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612917 (https://phabricator.wikimedia.org/T258134) (owner: 10Jforrester) [11:17:02] addshore: Ping when you're done deploying? [11:17:11] James_F: will do [11:17:13] T. [11:17:16] Err. [11:17:17] Ta. 
[11:18:11] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T254315 T257266 [[gerrit:609988]] Wikidata client wikis: Define entity sources configuration (take 3) (duration: 01m 08s) [11:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:17] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [11:18:17] T257266: Missing Wikidata sitelinks on Commons categories - https://phabricator.wikimedia.org/T257266 [11:19:30] (03CR) 10Addshore: [C: 03+2] Wikibase: remove wmgWikibaseLocalEntitySourceName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612670 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:19:55] (03CR) 10Jcrespo: "It is a bit peculiar:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:20:29] (03Merged) 10jenkins-bot: Wikibase: remove wmgWikibaseLocalEntitySourceName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612670 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:23:38] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: T254315 [[gerrit:612670]] Wikibase: remove wmgWikibaseLocalEntitySourceName (duration: 01m 05s) [11:23:40] James_F: all done! [11:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:43] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [11:24:08] Cool. 
[11:24:57] (03PS1) 10Matěj Suchánek: Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612918 [11:26:24] (03CR) 10Privacybatm: "> Patch Set 4:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:26:29] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.41/extensions/UrlShortener/includes/UrlShortenerUtils.php: T258134 Fix config variables regex concatenation (duration: 01m 05s) [11:26:32] 10Operations, 10MediaWiki-extensions-Babel: Two user pages on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (10Ammarpad) Babel::Render() is a variadic function, so literally one can pass as many languages as they like even in unreasonable number. I am... [11:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:25] (03CR) 10Jbond: Allow installing additional libapache2-mod* packages in the httpd class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/612865 (owner: 10Muehlenhoff) [11:28:59] (03PS2) 10Muehlenhoff: Add DNS entry for an-tool1008 [dns] - 10https://gerrit.wikimedia.org/r/613129 (https://phabricator.wikimedia.org/T258145) [11:29:23] (03CR) 10jerkins-bot: [V: 04-1] Add DNS entry for an-tool1008 [dns] - 10https://gerrit.wikimedia.org/r/613129 (https://phabricator.wikimedia.org/T258145) (owner: 10Muehlenhoff) [11:33:35] (03CR) 10Jcrespo: [C: 03+2] "This now works as intended. There is some followup that we can add to the TODO list: better cleanup after failure. 
Now, if a process crash" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:34:08] (03Merged) 10jenkins-bot: Firewall.py: Solve auto port detection concurrency issue [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:34:11] (03PS3) 10Muehlenhoff: Add DNS entry for an-tool1008 [dns] - 10https://gerrit.wikimedia.org/r/613129 (https://phabricator.wikimedia.org/T258145) [11:34:26] (03CR) 10Jcrespo: "> Firewall failed to open/reserve the port on . That port must be using by some other process. Please try with another por" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608274 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm) [11:34:38] (03CR) 10jerkins-bot: [V: 04-1] Add DNS entry for an-tool1008 [dns] - 10https://gerrit.wikimedia.org/r/613129 (https://phabricator.wikimedia.org/T258145) (owner: 10Muehlenhoff) [11:36:27] (03PS1) 10Jbond: configmaster: sort ssh keys to avoid changes on every run [puppet] - 10https://gerrit.wikimedia.org/r/613131 [11:37:18] (03CR) 10Jcrespo: "> Patch Set 1:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: 10Privacybatm) [11:37:32] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10akosiaris) [11:37:59] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10akosiaris) [11:38:14] (03PS4) 10Muehlenhoff: Add DNS entry for an-tool1008 [dns] - 10https://gerrit.wikimedia.org/r/613129 (https://phabricator.wikimedia.org/T258145) [11:38:44] (03PS2) 10Matěj Suchánek: Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612918 [11:39:13] (03CR) 10Jbond: [C: 03+2] configmaster: sort ssh keys to avoid changes on every 
run [puppet] - 10https://gerrit.wikimedia.org/r/613131 (owner: 10Jbond) [11:39:19] (03CR) 10Matěj Suchánek: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612918 (owner: 10Matěj Suchánek) [11:44:18] !log remove BGP to AS396253 in eqdfw (peer left the IX) [11:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:32] (03CR) 10Elukey: [C: 03+1] Add DNS entry for an-tool1008 [dns] - 10https://gerrit.wikimedia.org/r/613129 (https://phabricator.wikimedia.org/T258145) (owner: 10Muehlenhoff) [11:45:09] 10Operations, 10MediaWiki-extensions-Babel: Two user pages on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (10Jdforrester-WMF) Limiting it with a cleaner user experience when trying to use too many seems reasonable, yes. Not sure what the limit should... [11:47:58] (03CR) 10Jcrespo: "I am currently testing this, will soon have data to verify if it is worth pursuing this or not." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/610750 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [11:54:05] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entry for an-tool1008 [dns] - 10https://gerrit.wikimedia.org/r/613129 (https://phabricator.wikimedia.org/T258145) (owner: 10Muehlenhoff) [11:56:02] 10Operations, 10MediaWiki-extensions-Babel: Two user pages on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (10Ammarpad) a:03Ammarpad [11:58:02] (03PS2) 10Muehlenhoff: Allow installing additional libapache2-mod* packages in the httpd class [puppet] - 10https://gerrit.wikimedia.org/r/612865 [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T1200) [12:00:25] 10Operations, 10vm-requests, 10Patch-For-Review: eqiad: New Ganeti instance for Yarn (an-tool1008) - https://phabricator.wikimedia.org/T258145 (10elukey) The specs were discussed between me and Moritz, all good! 
[12:05:06] (03PS1) 10Kormat: mariadb: Enable binlogs for zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/613136 (https://phabricator.wikimedia.org/T257816) [12:08:51] !log updated envoyproxy to 1.14.4-1 on mw-canary and restbase-canary [12:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:31] (03CR) 10Marostegui: mariadb: Enable binlogs for zarcillo (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613136 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [12:10:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:11:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/612865 (owner: 10Muehlenhoff) [12:11:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:11:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10observability: eqiad: PDU Upgrade in C8 (July 14, 2pm-4pm UTC)) - https://phabricator.wikimedia.org/T257871 (10ayounsi) 05Resolved→03Open Netbox LibreNMS report is still alerting, I think it's just because the status needs to be changed from `Planned` to `Active`. 
[12:11:59] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10ayounsi) [12:12:00] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [12:12:00] !log jmm@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [12:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:14] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [12:12:14] !log jmm@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [12:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:38] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [12:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:42] (03CR) 10Marostegui: profile::analytics::database::meta: enable TLS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:13:54] (03PS2) 10Kormat: mariadb: Enable binlogs for zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/613136 (https://phabricator.wikimedia.org/T257816) [12:14:29] (03CR) 10Kormat: mariadb: Enable binlogs for zarcillo (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613136 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [12:15:15] (03CR) 10Elukey: profile::analytics::database::meta: enable TLS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:18:04] (03CR) 10Kormat: "PCC run: https://puppet-compiler.wmflabs.org/compiler1002/23912/" [puppet] - 10https://gerrit.wikimedia.org/r/613136 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [12:19:30] (03CR) 10Marostegui: [C: 03+1] "We can merge this, but there 
is no need to restart mysql on tendril yet. We can restart mariadb on db2093 though" [puppet] - 10https://gerrit.wikimedia.org/r/613136 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [12:21:27] kormat: see, you are a trusted source of changesets for Manuel, he is always skeptic with me when I send code reviews [12:21:56] (03CR) 10Marostegui: "The options look good" [puppet] - 10https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:22:24] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [12:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:21] elukey: he just knows that it's better to get it over quick than to drag out the pain ;) [12:23:54] (03CR) 10Kormat: [C: 03+2] mariadb: Enable binlogs for zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/613136 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [12:24:21] kormat: I think that I put him in pain every time I say something about databases, so I am not sure :D [12:24:34] hahah [12:24:43] haha [12:30:38] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/23911/" [puppet] - 10https://gerrit.wikimedia.org/r/612865 (owner: 10Muehlenhoff) [12:34:02] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [12:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:02] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Don't use virtual sub-interfaces for basic interfaces (.0) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/607011 (owner: 10Ayounsi) [12:35:51] !log increase codfw mobileapps kubernetes traffic to 5% [12:35:53] (03PS1) 10JMeybohm: New envoy upstream version 1.14.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/613139 (https://phabricator.wikimedia.org/T256843) [12:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:58] !log increase codfw mobileapps 
kubernetes traffic to 5% T218733 [12:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:02] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [12:36:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:36:05] !log akosiaris@cumin1001 conftool action : set/weight=50; selector: dc=codfw,service=mobileapps,name=scb.* [12:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:38:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:38] !log ayounsi@deploy1001 Started deploy [homer/deploy@fcf4332]: CR607011 [12:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:45] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for jlinehan - https://phabricator.wikimedia.org/T258119 (10CDanis) p:05Triage→03Medium a:05jlinehan→03CDanis [12:39:49] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10Cmjohnson) [12:42:20] !log ayounsi@deploy1001 Finished deploy [homer/deploy@fcf4332]: CR607011 (duration: 03m 42s) [12:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - 
https://phabricator.wikimedia.org/T256435 (10CDanis) @Jrbranaa ping? this access request is waiting on your reply re: contract end date [12:43:25] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10CDanis) @Nuria ping :) [12:44:28] 10Operations, 10Analytics-Radar, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10CDanis) 05Stalled→03Resolved @spatton I'm going to optimistically close this assuming that Turnilo access has been sufficient for you, please do reopen... [12:45:53] (03PS1) 10Muehlenhoff: Add an-tool1008 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/613141 [12:51:00] (03CR) 10Muehlenhoff: [C: 03+2] Allow installing additional libapache2-mod* packages in the httpd class [puppet] - 10https://gerrit.wikimedia.org/r/612865 (owner: 10Muehlenhoff) [12:51:09] (03CR) 10Elukey: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:52:04] !log ayounsi@deploy1001 Started deploy [homer/deploy@fcf4332]: CR607011 [12:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:33] RECOVERY - Host ripe-atlas-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [12:55:41] 👀 [12:55:49] paravoid: RIPE finish their upgrade? [12:55:51] (03CR) 10Elukey: "Now I see /srv/private/modules/secret/secrets/mysql on puppetmaster1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [12:56:37] !log ayounsi@deploy1001 Finished deploy [homer/deploy@fcf4332]: CR607011 (duration: 04m 32s) [12:56:38] (03PS1) 10Ayounsi: Depool eqsin for router replacement [dns] - 10https://gerrit.wikimedia.org/r/613143 (https://phabricator.wikimedia.org/T257154) [12:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1092', diff saved to https://phabricator.wikimedia.org/P11926 and previous config saved to /var/cache/conftool/dbconfig/20200716-125643-marostegui.json [12:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:39] (03PS1) 10Jgiannelos: Change hostname of `mobileapps` to the dockerized instance. [puppet] - 10https://gerrit.wikimedia.org/r/613144 [12:59:26] (03PS2) 10Jgiannelos: Change hostname of `mobileapps` to the dockerized instance. [puppet] - 10https://gerrit.wikimedia.org/r/613144 (https://phabricator.wikimedia.org/T256794) [12:59:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Please run a PCC on deploy1001 and contint hosts, before merging, but otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613104 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:59:52] (03CR) 10jerkins-bot: [V: 04-1] Change hostname of `mobileapps` to the dockerized instance. [puppet] - 10https://gerrit.wikimedia.org/r/613144 (https://phabricator.wikimedia.org/T256794) (owner: 10Jgiannelos) [13:00:04] James_F and longma: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T1300). 
[13:00:24] (03PS1) 10Jforrester: all wikis to 1.35.0-wmf.41 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613145 [13:00:25] (03CR) 10Jforrester: [C: 03+2] all wikis to 1.35.0-wmf.41 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613145 (owner: 10Jforrester) [13:01:09] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.41 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613145 (owner: 10Jforrester) [13:01:20] Here we go. [13:01:32] (03PS3) 10Jgiannelos: Change hostname of `mobileapps` to the dockerized instance. [puppet] - 10https://gerrit.wikimedia.org/r/613144 (https://phabricator.wikimedia.org/T256794) [13:01:35] !log akosiaris@cumin1001 conftool action : set/weight=24; selector: dc=codfw,service=mobileapps,name=scb.* [13:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:13] !log increase codfw mobileapps kubernetes traffic to 10% T218733 [13:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:20] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [13:02:48] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.41 [13:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:40] (03PS1) 10Muehlenhoff: Install the fcgid package on Netmon [puppet] - 10https://gerrit.wikimedia.org/r/613146 (https://phabricator.wikimedia.org/T247967) [13:03:42] (03CR) 10Jgiannelos: [C: 04-1] "Blocking this one until all the action items from T256794 are done." [puppet] - 10https://gerrit.wikimedia.org/r/613144 (https://phabricator.wikimedia.org/T256794) (owner: 10Jgiannelos) [13:04:37] !log restarting tendril to pick up new mariadb config T257816 [13:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:44] T257816: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 [13:05:44] All looks fine. 
Declaring the wmf.41 train done. [13:08:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Nice and concise, I like it!" [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [13:11:33] cdanis: maybe, I haven't received anything new -- they said it may take 2-3 weeks [13:11:46] they're done something more experimental, using us as testbed (with my consent) [13:12:02] ack [13:12:24] !log volans@deploy1001 Started deploy [homer/deploy@fcf4332]: Force deploy of the homer plugin [13:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:57] (03CR) 10Muehlenhoff: [C: 03+2] Add an-tool1008 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/613141 (owner: 10Muehlenhoff) [13:13:52] !log volans@deploy1001 Finished deploy [homer/deploy@fcf4332]: Force deploy of the homer plugin (duration: 01m 27s) [13:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:43] (03PS11) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [13:15:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] "oh, forgot to address those in subsequent patches. 
Done in PS11 (along with a mandatory rebase)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [13:16:19] (03CR) 10jerkins-bot: [V: 04-1] Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [13:21:44] (03PS2) 10Ayounsi: Depool eqsin for router replacement [dns] - 10https://gerrit.wikimedia.org/r/613143 (https://phabricator.wikimedia.org/T257154) [13:22:43] (03CR) 10Ayounsi: [C: 03+2] Depool eqsin for router replacement [dns] - 10https://gerrit.wikimedia.org/r/613143 (https://phabricator.wikimedia.org/T257154) (owner: 10Ayounsi) [13:22:51] (03PS1) 10Marostegui: section,report_users: Change active zarcillo host [software] - 10https://gerrit.wikimedia.org/r/613149 (https://phabricator.wikimedia.org/T257816) [13:22:55] (03CR) 10jerkins-bot: [V: 04-1] section,report_users: Change active zarcillo host [software] - 10https://gerrit.wikimedia.org/r/613149 (https://phabricator.wikimedia.org/T257816) (owner: 10Marostegui) [13:23:27] !log depool eqsin for cr3 replacement - T257154 [13:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:33] T257154: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 [13:24:38] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] New envoy upstream version 1.14.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/613139 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [13:24:56] (03Abandoned) 10Marostegui: section,report_users: Change active zarcillo host [software] - 10https://gerrit.wikimedia.org/r/613149 (https://phabricator.wikimedia.org/T257816) (owner: 10Marostegui) [13:25:26] (03PS12) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [13:26:45] (03PS1) 10Marostegui: section,report_users: Change active zarcillo host [software] - 10https://gerrit.wikimedia.org/r/613150 
(https://phabricator.wikimedia.org/T257816) [13:27:52] (03PS13) 10Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609403 [13:27:56] !log installing an-tool1008 [13:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:08] !log deactivate BGP groups IX/Transit/PyBal on cr3-eqsin - T257154 [13:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:13] T257154: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 [13:30:42] (03PS1) 10JMeybohm: Fix last envoy version number [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/613151 (https://phabricator.wikimedia.org/T256843) [13:31:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Fix last envoy version number [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/613151 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [13:31:43] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:32:56] (03PS1) 10Kormat: mariadb: Update zarcillo location to db1115 [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) [13:33:05] PROBLEM - Host ripe-atlas-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:24] that's the RIPE trying to upgrade the anchor ^ [13:33:26] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix last envoy version number [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/613151 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [13:33:30] (03CR) 10RLazarus: [C: 03+2] "> Patch Set 10:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [13:33:37] (03PS11) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) [13:33:44] (03CR) 
10Marostegui: [C: 04-1] mariadb: Update zarcillo location to db1115 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [13:34:13] (03CR) 10Kormat: [C: 03+1] section,report_users: Change active zarcillo host [software] - 10https://gerrit.wikimedia.org/r/613150 (https://phabricator.wikimedia.org/T257816) (owner: 10Marostegui) [13:34:15] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Update zarcillo location to db1115 [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [13:34:24] (03CR) 10Marostegui: [C: 03+2] section,report_users: Change active zarcillo host [software] - 10https://gerrit.wikimedia.org/r/613150 (https://phabricator.wikimedia.org/T257816) (owner: 10Marostegui) [13:34:35] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 44.91 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:35:49] (03PS2) 10Kormat: mariadb: Update zarcillo location to db1115 [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) [13:35:57] !log akosiaris@cumin1001 conftool action : set/weight=8; selector: dc=codfw,service=mobileapps,name=scb.* [13:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:06] !log increase codfw mobileapps kubernetes traffic to 25% T218733 [13:36:07] !log power-off cr3-eqsin - T257154 [13:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:11] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [13:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:15] (03CR) 10Jcrespo: [C: 04-1] "Needs to add the check for eqiad to hieradata/role/common/mariadb/misc/zarcillo.yaml" [puppet] - 
10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [13:36:15] T257154: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 [13:36:28] (03CR) 10Kormat: mariadb: Update zarcillo location to db1115 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [13:37:40] (03PS1) 10Marostegui: report_users: Let's also read from db1115 [software] - 10https://gerrit.wikimedia.org/r/613154 [13:37:43] jin is replacing cr3-eqsin now [13:37:46] (03PS3) 10Kormat: mariadb: Update zarcillo location to db1115 [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) [13:37:50] (03CR) 10Kormat: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [13:39:04] (03CR) 10Jcrespo: [C: 03+1] mariadb: Update zarcillo location to db1115 [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [13:41:05] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:41:15] (03CR) 10Marostegui: [C: 03+2] report_users: Let's also read from db1115 [software] - 10https://gerrit.wikimedia.org/r/613154 (owner: 10Marostegui) [13:41:53] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 73, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:42:21] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:42:25] (03CR) 10Marostegui: [C: 03+1] mariadb: Update zarcillo location to db1115 [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) 
(owner: 10Kormat) [13:42:42] (03CR) 10Kormat: [C: 03+2] mariadb: Update zarcillo location to db1115 [puppet] - 10https://gerrit.wikimedia.org/r/613152 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [13:47:16] !log remove nonstop-bridging from asw1-eqsin [13:47:17] 10Operations, 10vm-requests: eqiad: New Ganeti instance for Yarn (an-tool1008) - https://phabricator.wikimedia.org/T258145 (10MoritzMuehlenhoff) 05Open→03Resolved The instance has been created and installed [13:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:04] (03PS1) 10Muehlenhoff: Add separate role for Yarn [puppet] - 10https://gerrit.wikimedia.org/r/613156 (https://phabricator.wikimedia.org/T258152) [13:48:55] (03CR) 10jerkins-bot: [V: 04-1] Add separate role for Yarn [puppet] - 10https://gerrit.wikimedia.org/r/613156 (https://phabricator.wikimedia.org/T258152) (owner: 10Muehlenhoff) [13:49:26] (03CR) 10Privacybatm: "> Patch Set 1:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: 10Privacybatm) [13:49:42] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) @akosiaris I do agree with the change. I will pass the suggestion to the other members of the team. Thanks. 
[13:50:07] (03PS2) 10Muehlenhoff: Add separate role for Yarn [puppet] - 10https://gerrit.wikimedia.org/r/613156 (https://phabricator.wikimedia.org/T258152) [13:51:26] ACKNOWLEDGEMENT - Host ripe-atlas-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T258018 [13:51:26] ACKNOWLEDGEMENT - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T258018 [13:51:34] RECOVERY - DPKG on stat1008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:51:45] (03CR) 10Jcrespo: "These are the number I am getting on our production environment:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/610750 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [13:53:01] (03CR) 10Jcrespo: "Sounds like a good plan to me!" [software/transferpy] - 10https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: 10Privacybatm) [13:55:12] (03PS1) 10Kormat: wmnet: Add zarcillo-master CNAME. [dns] - 10https://gerrit.wikimedia.org/r/613157 (https://phabricator.wikimedia.org/T257816) [13:55:40] (03PS2) 10Kormat: wmnet: Update es4-master alias [dns] - 10https://gerrit.wikimedia.org/r/612560 (https://phabricator.wikimedia.org/T257847) [13:56:14] (03CR) 10Marostegui: [C: 03+1] wmnet: Add zarcillo-master CNAME. [dns] - 10https://gerrit.wikimedia.org/r/613157 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [13:56:57] (03CR) 10Kormat: [C: 03+2] wmnet: Add zarcillo-master CNAME. [dns] - 10https://gerrit.wikimedia.org/r/613157 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [13:57:15] (03PS2) 10Privacybatm: transferpy: Create required config directory at the time of deb installation [software/transferpy] - 10https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) [13:58:33] does operations/dns auto-submit? 
[13:58:44] no, needs deploy [13:59:01] https://wikitech.wikimedia.org/wiki/DNS#authdns-update [13:59:06] cheers [14:00:00] (03CR) 10Privacybatm: "> Patch Set 1:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) (owner: 10Privacybatm) [14:00:46] (03PS3) 10Privacybatm: transferpy: Create required config directory at the time of deb installation [software/transferpy] - 10https://gerrit.wikimedia.org/r/612162 (https://phabricator.wikimedia.org/T257599) [14:02:27] (03PS3) 10Muehlenhoff: Add separate role for Yarn [puppet] - 10https://gerrit.wikimedia.org/r/613156 (https://phabricator.wikimedia.org/T258152) [14:03:04] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:03:28] !log published image docker-registry.discovery.wmnet/envoy:1.14.4-1 [14:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:33] i'm enjoying how the wiki page doesn't say what hosts that script is on [14:05:16] ssh ns0.wikimedia.org would work [14:06:36] (03CR) 10Privacybatm: [C: 04-1] "> Patch Set 1:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/610750 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [14:07:11] (03PS2) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [14:08:52] updated the wiki page [14:09:30] !log upgrade junos on cr3-eqsin - T257154 [14:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:36] T257154: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 [14:12:20] !log rebooting webperf hosts in eqiad for kernel update [14:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:52] 
!log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:19] (03PS2) 10Privacybatm: transfer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612384 (https://phabricator.wikimedia.org/T257600) [14:13:51] (03PS1) 10Kormat: switchover: update zarcillo db location [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/613158 (https://phabricator.wikimedia.org/T257816) [14:15:06] (03PS3) 10Privacybatm: Firewall.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600) [14:15:33] (03CR) 10jerkins-bot: [V: 04-1] Firewall.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [14:15:38] (03CR) 10Marostegui: [C: 03+1] switchover: update zarcillo db location [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/613158 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [14:15:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:43] (03CR) 10Kormat: [C: 03+2] switchover: update zarcillo db location [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/613158 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [14:16:51] (03PS3) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [14:17:30] (03PS1) 10JMeybohm: Update envoy to 1.14.4-1 for mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/613159 (https://phabricator.wikimedia.org/T256843) [14:17:34] (03Merged) 10jenkins-bot: switchover: update zarcillo db location [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/613158 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [14:19:17] (03PS4) 10Privacybatm: 
Firewall.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600) [14:19:50] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Update envoy to 1.14.4-1 for mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/613159 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [14:20:33] (03CR) 10JMeybohm: [C: 03+2] Update envoy to 1.14.4-1 for mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/613159 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [14:21:07] (03PS1) 10Hnowlan: api-gateway: set CORS headers to allow all domains for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/613160 (https://phabricator.wikimedia.org/T256771) [14:21:22] (03PS1) 10Lucas Werkmeister (WMDE): Avoid trying to register wikibase.Site twice [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613167 (https://phabricator.wikimedia.org/T258065) [14:21:39] (03Merged) 10jenkins-bot: Update envoy to 1.14.4-1 for mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/613159 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [14:22:03] James_F, longma: what’s the status of the train window? I’d like to deploy a backport that should silence a logstash warning [14:22:20] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Kormat) Replication is in place, but monitoring is not. [14:22:40] (SAL looks like everything went well?) [14:23:31] Lucas_WMDE: Go for it. Train's all done. [14:23:38] great, thanks! 
[14:24:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "backporting now" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613167 (https://phabricator.wikimedia.org/T258065) (owner: 10Lucas Werkmeister (WMDE)) [14:24:52] (03PS4) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [14:31:16] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [14:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:38] (03PS16) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [14:33:15] (03PS2) 10Lucas Werkmeister (WMDE): extension-list: Load WikibaseClient via JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 [14:33:38] (03CR) 10Lucas Werkmeister (WMDE): "This should be ready for deployment now, and I may do it later today." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 (owner: 10Lucas Werkmeister (WMDE)) [14:34:07] (03PS1) 10Herron: dns: add forward/reverse records for prometheus[345]001 [dns] - 10https://gerrit.wikimedia.org/r/613163 (https://phabricator.wikimedia.org/T243057) [14:34:37] PROBLEM - dump of zarcillo in eqiad on db2093 is CRITICAL: We could not find any completed dump for zarcillo at eqiad https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:35:32] (03CR) 10Lucas Werkmeister (WMDE): extension-list: Load WikibaseClient via JSON (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 (owner: 10Lucas Werkmeister (WMDE)) [14:37:37] (03PS2) 10Andrew Bogott: Remove Cloud VPS global root key for valhallasw [labs/private] - 10https://gerrit.wikimedia.org/r/612975 (https://phabricator.wikimedia.org/T255697) (owner: 10BryanDavis) [14:37:51] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Remove Cloud VPS global root key for valhallasw [labs/private] - 10https://gerrit.wikimedia.org/r/612975 (https://phabricator.wikimedia.org/T255697) (owner: 10BryanDavis) [14:38:31] (03PS2) 10Andrew Bogott: Add Cloud VPS global root key for Sam Reed [labs/private] - 10https://gerrit.wikimedia.org/r/612974 (https://phabricator.wikimedia.org/T249774) (owner: 10BryanDavis) [14:39:29] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:41:02] 10Operations, 10ops-eqiad: Interface errors on asw2-d-eqiad:xe-7/0/0 (ms-be1037) - https://phabricator.wikimedia.org/T257541 (10Cmjohnson) a:05Cmjohnson→03fgiunchedi @fgiunchedi I need to replace the production/network cable for this server. I would like to do this tomorrow (Friday 16 July at 1500UTC) [14:41:35] akosiaris: ^? 
[14:42:41] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:43:25] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' . [14:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:13] (03Merged) 10jenkins-bot: Avoid trying to register wikibase.Site twice [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613167 (https://phabricator.wikimedia.org/T258065) (owner: 10Lucas Werkmeister (WMDE)) [14:48:26] testing that backport on mwdebug1001 [14:48:53] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [14:49:10] without `scap pull`, the warning is there, now pulling… [14:50:24] hm, after `scap pull` the warning still seems to be there [14:50:42] aha! 
let’s try `git submodule update` instead of `git submodule sync` 🤦 [14:52:17] ok warning is gone now \o/ syncing [14:53:14] jayme: hmmm, will have a look in a bit [14:54:02] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.41/extensions/Wikibase/: Backport: [[gerrit:613167|Avoid trying to register wikibase.Site twice (T258065)]] (duration: 01m 03s) [14:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:07] T258065: wikibase.Site RL module can not be registered twice - https://phabricator.wikimedia.org/T258065 [14:54:19] backport done [14:54:23] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [14:54:48] !log load config on cr3-eqsin - T257154 [14:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:54:54] T257154: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 [14:55:45] akosiaris: might be unrelated. 
Just wanted to point it out [14:55:56] (03PS1) 10Addshore: Wikibase: stop setting wgWBRepoSettings['tmpSerializeEmptyListsAsObjects'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613164 (https://phabricator.wikimedia.org/T138104) [14:55:58] (03PS1) 10Addshore: Wikibase: Stop setting wmgWikibaseTmpSerializeEmptyListsAsObjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613165 (https://phabricator.wikimedia.org/T138104) [14:56:14] (03PS2) 10Addshore: Wikibase: stop setting wmgWikibaseTmpSerializeEmptyListsAsObjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613165 (https://phabricator.wikimedia.org/T138104) [14:56:21] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:56:23] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Nuria) sorry I missed this, is @Nahid a full time WMF employee? [14:56:43] (03PS2) 10Addshore: Wikibase: stop setting wgWBRepoSettings['tmpSerializeEmptyListsAsObjects'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613164 (https://phabricator.wikimedia.org/T138104) [14:56:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:56:46] (03PS3) 10Addshore: Wikibase: stop setting wmgWikibaseTmpSerializeEmptyListsAsObjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613165 (https://phabricator.wikimedia.org/T138104) [14:57:05] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:57:16] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10JanWMF) Yep :) [14:57:17] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:58:14] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Nuria) Approved on my end, let's add @nahid to ldap wmf group as well [14:59:20] (03PS1) 10ZPapierski: Add logout location [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) [14:59:43] (03PS5) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [14:59:50] jayme: that specific endpoint has a pretty bad latency. https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=20&fullscreen&orgId=1&refresh=5m&from=now-1h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=mobileapps&var-container_name=All [14:59:56] avg ~10s [15:00:18] uuhf [15:00:20] and p99 and p90 is 30 and 28 seconds respectively [15:00:28] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' . 
[15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:58] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10CDanis) a:05Nuria→03CDanis thanks! will do so today [15:02:07] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10ayounsi) Router is back online. One weird thing is that once I rebooted the router after the upgrade, it came back up with the same issue as here (no more primary disk). I asked Jin to do a hard power... [15:02:48] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:07] (03CR) 10Elukey: Initial debian commit (032 comments) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [15:03:12] (03PS6) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [15:04:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:57] jayme: dammit, old version doesn't have that https://grafana.wikimedia.org/d/000000183/mobileapps?panelId=9&fullscreen&orgId=1&var-percentile=p99 [15:05:03] * akosiaris fears something weird in node now [15:05:15] * akosiaris opens a task [15:05:23] akosiaris: ah, you're already up to 25% by now [15:05:46] (03PS1) 10Ayounsi: Revert "Depool eqsin for router replacement" [dns] - 10https://gerrit.wikimedia.org/r/613169 (https://phabricator.wikimedia.org/T257154) [15:06:07] <_joe_> akosiaris: node or some egress rule? [15:06:13] yes. Which is not much btw. 
Some 15rps [15:06:27] (03PS2) 10Ayounsi: Revert "Depool eqsin for router replacement" [dns] - 10https://gerrit.wikimedia.org/r/613169 (https://phabricator.wikimedia.org/T257154) [15:06:33] _joe_: meaning it tries to fetch something from mediawiki? [15:06:42] interesting [15:06:51] <_joe_> akosiaris: I'm pretty sure it does [15:06:53] (03PS1) 10Ssingh: dnsdist: reload the certificates instead of restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) [15:07:01] <_joe_> akosiaris: do you remember why hfenv sets K8S_CLUSTER to all caps? [15:07:01] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool eqsin for router replacement" [dns] - 10https://gerrit.wikimedia.org/r/613169 (https://phabricator.wikimedia.org/T257154) (owner: 10Ayounsi) [15:07:08] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Removes OtherProjectsSidebar hook" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613170 (https://phabricator.wikimedia.org/T258184) [15:07:26] !log repool eqsin - T257154 [15:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:31] _joe_: cause it's an env var I created and I like my env vars all caps [15:07:31] T257154: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 [15:07:38] that's as good as it gets :P [15:07:45] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:07:51] <_joe_> akosiaris: ok, then I can just make it nod bleed my eyes? [15:07:54] <_joe_> *not [15:07:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Backporting." 
[extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613170 (https://phabricator.wikimedia.org/T258184) (owner: 10Lucas Werkmeister (WMDE)) [15:08:30] _joe_: as long it also does not bleed my eyes as well, yes [15:08:39] doing another wmf.41 backport, okay for you akosiaris/jayme? [15:08:47] Lucas_WMDE: yeah, go ahead [15:08:50] ok thx [15:09:35] (03PS7) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [15:09:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:09:40] <_joe_> akosiaris: sorry, I meant why the value is all caps [15:09:46] <_joe_> like EQIAD or CODFW [15:11:03] _joe_: ah... no reason. feel free to make it lowercase [15:11:08] (03PS8) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [15:12:03] (03PS2) 10Lucas Werkmeister (WMDE): Revert "Removes OtherProjectsSidebar hook" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613170 (https://phabricator.wikimedia.org/T258184) [15:12:44] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10ayounsi) 05Open→03Resolved Everything here is done! [15:15:07] !log akosiaris@cumin1001 conftool action : set/weight=24; selector: dc=codfw,service=mobileapps,name=scb.* [15:15:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:30] !log lower codfw mobileapps kubernetes traffic to 10% T218733. 
Will open up task for it [15:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:35] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [15:15:39] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "PS1 had PHPCS issues" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613170 (https://phabricator.wikimedia.org/T258184) (owner: 10Lucas Werkmeister (WMDE)) [15:17:01] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 52.38 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:17:36] (03PS9) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [15:17:38] yeah, that mobileapps data/css/mobile/site request should not be taking 10s. there's a bug in mobileapps which was surfaced by the move to k8s, specifically that this line is making a request to MediaWiki's public load.php endpoint: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/mobileapps/+/refs/heads/master/lib/css.js#15 [15:17:51] i can write up a ticket and push a fix as soon as i'm out of meeting [15:17:52] (03PS3) 10Jdrewniak: Disable affinity quicksurveys for the following wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612870 (https://phabricator.wikimedia.org/T246977) [15:18:04] (03Abandoned) 10Jcrespo: transfer.py: It is a test code for multiprocess transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/610750 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [15:19:25] (03PS2) 10Ssingh: dnsdist: reload the certificates instead of restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) [15:20:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics 
within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:23:26] (03PS3) 10Ssingh: dnsdist: reload the certificates instead of restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) [15:25:52] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [15:25:54] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23930/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613187 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:27:08] mdholloway: I 'll be afk for the next couple of hours, feel free to deploy the mobileapps fix. [15:27:36] akosiaris: got it, will do, thanks! 
[15:27:47] (03PS4) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) [15:27:49] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [15:28:05] (03PS10) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [15:29:12] (03PS1) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: add kube-env script [puppet] - 10https://gerrit.wikimedia.org/r/613190 [15:29:17] (03PS1) 10Giuseppe Lavagetto: kubernetes::deployment_server::helmfile: use kube_env in .hfenv [puppet] - 10https://gerrit.wikimedia.org/r/613191 [15:29:35] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [15:30:31] (03PS5) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) [15:31:03] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [15:31:15] (03CR) 10Ottomata: Initial debian commit (032 comments) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [15:31:23] (03PS1) 10Nray: Enable limited-width layout for Vector on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613192 (https://phabricator.wikimedia.org/T246420) [15:32:27] (03PS2) 10Nray: Enable limited-width layout for Vector on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613192 (https://phabricator.wikimedia.org/T246420) [15:33:37] (03PS2) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: add kube-env script 
[puppet] - 10https://gerrit.wikimedia.org/r/613190 [15:33:39] (03PS2) 10Giuseppe Lavagetto: kubernetes::deployment_server::helmfile: use kube_env in .hfenv [puppet] - 10https://gerrit.wikimedia.org/r/613191 [15:34:10] (03CR) 10Elukey: Initial debian commit (032 comments) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [15:34:12] (03PS3) 10Nray: Enable limited-width layout for Vector on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613192 (https://phabricator.wikimedia.org/T246420) [15:36:47] (03Merged) 10jenkins-bot: Revert "Removes OtherProjectsSidebar hook" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613170 (https://phabricator.wikimedia.org/T258184) (owner: 10Lucas Werkmeister (WMDE)) [15:37:23] testing backport on mwdebug1001 [15:37:26] (03PS11) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [15:38:27] yup, issue seems fixed [15:39:06] syncing [15:40:09] (03PS6) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) [15:40:28] !log lucaswerkmeister-wmde@deploy1001 scap failed: average error rate on 7/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details) [15:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:43] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:42:20] (03CR) 10Privacybatm: "I have rebased all the patches!" 
(031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm) [15:42:40] (03PS12) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [15:43:17] (03CR) 10jerkins-bot: [V: 04-1] (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 (owner: 10Jbond) [15:44:03] Lucas_WMDE: Anything else seem broken? [15:44:23] that backport doesn’t seem to work, we’ll find something else [15:44:31] I hope the bad sync didn’t cause lasting problems [15:44:39] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:44:54] wait, scap keeps the broken code on the canaries right? [15:45:04] Right. [15:45:06] so I should git revert locally and scap sync that immediately, to fix the canaries? [15:45:09] ok [15:45:12] Yes, it doesn't roll back for you. [15:45:13] Yes. [15:45:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:45:50] 10Operations, 10netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (10CDanis) a:03CDanis [15:46:29] uhm [15:46:46] I reverted the change in the php-1.35.0-wmf.41 repo, I think [15:46:55] and then git submodule update extensions/Wikibase/ [15:47:06] but git -C extensions/Wikibase/ show still shows the commit [15:47:10] Including the submodule pointer? [15:47:13] (03CR) 10JMeybohm: [C: 04-1] "For me "CLUSTER NS" feels more natural then "NS CLUSTER" notation for k8s_env and the prompt. 
But that's highly biased" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/613190 (owner: 10Giuseppe Lavagetto) [15:47:42] (03PS13) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [15:47:46] the php-1.35.0-wmf.41 revert changed the submodule pointer, yeah… [15:47:48] I think… [15:47:50] 10Operations, 10netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (10faidon) To give a little more context: in response to us requesting an extension for the v2 anchors, the RIPE NCC team reached out to ask if they can run a test upgrade on our of anchors (which I of course... [15:47:52] but something’s not right [15:47:58] maybe I should revert in extensions/Wikibase [15:48:07] and then commit that submodule update instead [15:49:13] (03PS14) 10Jbond: (WIP) labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [15:49:59] ok I *think* at least extensions/Wikibase/ is correct now, so I’ll try another sync [15:50:13] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10elukey) [15:50:25] (03CR) 10JMeybohm: [C: 03+1] "LGTM (if we don't change the order of parameters)" [puppet] - 10https://gerrit.wikimedia.org/r/613191 (owner: 10Giuseppe Lavagetto) [15:51:10] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.41/extensions/Wikibase/: Backport: [[gerrit:613170|Revert "Revert "Removes OtherProjectsSidebar hook"" (T258184)]] (duration: 01m 02s) [15:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:16] T258184: Call to undefined method Wikibase\Client\Hooks\OtherProjectsSidebarGenerator::buildProjectLinkSidebarFromItemId() - https://phabricator.wikimedia.org/T258184 [15:51:42] “Canary error check failed for 2 canaries, less than threshold to halt deployment” [15:52:09] looking 
at the canary logstash [15:52:15] dcausse: /win 30 [15:52:17] ahahahh [15:52:26] sorry, I was trying to join the discovery chan [15:52:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:54:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:54:57] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Revert "Removes OtherProjectsSidebar hook"" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613194 (https://phabricator.wikimedia.org/T258184) [15:55:11] (03PS4) 10Nray: Enable limited-width layout for Vector on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613192 (https://phabricator.wikimedia.org/T246420) [15:55:37] (03CR) 10Lucas Werkmeister (WMDE): [V: 03+2 C: 03+2] "This commit (sans Change-Id) is already deployed, so insta-merging." [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613194 (https://phabricator.wikimedia.org/T258184) (owner: 10Lucas Werkmeister (WMDE)) [15:55:59] (03CR) 10Addshore: [C: 03+2] "*agrees*" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613194 (https://phabricator.wikimedia.org/T258184) (owner: 10Lucas Werkmeister (WMDE)) [15:56:58] ok so now I need to get rid of the top commit in php-1.35.0-wmf.41 [15:57:00] which is the failed revert [15:57:06] and then pull the proper revert from gerrit [15:57:33] can I do a `git reset --hard @^` or will that also the Extension(s) Who Shall Not Be Named that has/have security patches? 
[15:57:41] will that also *affect them [15:57:48] Yup, need to restore the state without getting rid of the sec patches [15:57:52] * addshore lookks for his docs [15:58:39] * addshore has no good docs :P [15:58:54] I could do a `git rebase -i` and delete the commit that needs to go [15:59:00] `git rebase` seems to do the right thing with submodules [15:59:19] yeah I’ll try that [15:59:44] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Reverting [16:00:04] godog and _joe_: Dear deployers, time to do the Puppet request window(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T1600). [16:00:05] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10EBernhardson) This isn't only ranking models, but also general updates to the search indices that flow from analytics. This inclu... [16:00:23] (03PS5) 10Nray: Enable limited-width layout for Modern Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613192 (https://phabricator.wikimedia.org/T246420) [16:00:26] that doesnt detail how to cleanup though [16:00:54] okay, I *think* it’s all cleaned up now [16:01:14] * addshore looks [16:01:14] security patch(es) and other submodules still look good to me [16:01:32] Lucas_WMDE: looks good to me :) [16:01:52] ok, phew [16:02:06] we would still want to find a solution for the ArticlePlaceholder issue [16:02:31] but godog and _joe_’s window started, is it okay if we continue? (nothing queued in the deployment calendar) [16:02:46] I briefly tested my patch locally and it looked good, but I'm just about to step out! 
[16:03:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:04:02] Nothing is scheduled and everything is quiet so probably all good! [16:04:53] ok then let’s try addshore’s patch [16:04:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:06:36] !log Reboot rdb1010 - T254990 [16:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:23] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10elukey) [16:10:16] (03PS1) 10Lucas Werkmeister (WMDE): Re add OtherProjectsSidebarGenerator::buildProjectLinkSidebarFromItemId [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613173 (https://phabricator.wikimedia.org/T258184) [16:11:09] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Re add OtherProjectsSidebarGenerator::buildProjectLinkSidebarFromItemId [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613173 (https://phabricator.wikimedia.org/T258184) (owner: 10Lucas Werkmeister (WMDE)) [16:11:14] !log reboot rdb1009 - T254990 [16:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:56] (03CR) 10Tarrow: [C: 03+1] "looks good to me" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613173 (https://phabricator.wikimedia.org/T258184) (owner: 10Lucas Werkmeister (WMDE)) [16:27:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR 
site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:30:03] (03PS2) 10Herron: dns: add forward/reverse records for prometheus[345]001 [dns] - 10https://gerrit.wikimedia.org/r/613163 (https://phabricator.wikimedia.org/T243057) [16:30:49] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:32:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:33:38] (03CR) 10Bstorm: [C: 03+2] clouddb: Uncap the network for the clouddb-services project [puppet] - 10https://gerrit.wikimedia.org/r/612958 (https://phabricator.wikimedia.org/T257884) (owner: 10Bstorm) [16:36:48] (03CR) 10Cwhite: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/613163 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [16:37:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:38:38] (03CR) 10Herron: [C: 03+2] dns: add forward/reverse records for prometheus[345]001 [dns] - 10https://gerrit.wikimedia.org/r/613163 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [16:40:32] (03Merged) 10jenkins-bot: Re add OtherProjectsSidebarGenerator::buildProjectLinkSidebarFromItemId [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/613173 
(https://phabricator.wikimedia.org/T258184) (owner: 10Lucas Werkmeister (WMDE)) [16:40:40] ok let’s try this backport [16:40:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:41:15] testing on mwdebug1001 [16:41:39] looks good so far [16:41:45] looking around a bit more [16:43:20] no errors observed, I’ll try the sync [16:44:01] (03PS1) 10Bartosz Dziewoński: Move VisualEditor from beta to default on nlwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613198 (https://phabricator.wikimedia.org/T256142) [16:44:03] (03PS1) 10Bartosz Dziewoński: Move VisualEditor from beta to default on incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613199 (https://phabricator.wikimedia.org/T256957) [16:45:06] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.41/extensions/Wikibase/: Backport: [[gerrit:613173|Re add OtherProjectsSidebarGenerator::buildProjectLinkSidebarFromItemId (T258184)]] (duration: 01m 02s) [16:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:12] T258184: Call to undefined method Wikibase\Client\Hooks\OtherProjectsSidebarGenerator::buildProjectLinkSidebarFromItemId() - https://phabricator.wikimedia.org/T258184 [16:45:41] (03CR) 10Bartosz Dziewoński: "Is this even how you remove a wiki from a dblist? 
The system got way more complicated since I last looked at it…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613198 (https://phabricator.wikimedia.org/T256142) (owner: 10Bartosz Dziewoński) [16:45:58] checking logstash [16:49:44] looks good, I think I’m done [16:54:05] (03PS1) 10Dzahn: site: add aphlict1001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/613200 (https://phabricator.wikimedia.org/T257617) [17:00:04] halfak and accraze: Time to snap out of that daydream and deploy Services – Graphoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T1700). [17:03:50] (03PS1) 10Dzahn: site: add testreduce1001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/613201 (https://phabricator.wikimedia.org/T257940) [17:04:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:05:45] !log msw1-codfw - replace member-range with list of individual interfaces [17:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:12:32] (03PS1) 10Dzahn: phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 [17:14:13] (03CR) 10jerkins-bot: [V: 04-1] phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (owner: 10Dzahn) [17:17:49] !log msw1-eqiad delete unused VC-ports [17:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:50] 10Operations, 10MediaWiki-extensions-Babel: Two user pages on meta can't be rendered: "request has exceeded memory limit" - https://phabricator.wikimedia.org/T231522 (10Urbanecm) @ammarpad Does that mean it should work in theory? [17:38:54] 10Operations, 10Analytics-Clusters, 10Discovery-Search, 10vm-requests: VM request for Analytics -> Elastic Search ML models update - https://phabricator.wikimedia.org/T258189 (10EBernhardson) I also just remembered while considering this, we need to have an instance per datacenter. The current applications... 
[17:46:43] (03CR) 10Dzahn: [C: 03+2] site: add aphlict1001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/613200 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn) [17:46:52] (03PS2) 10Dzahn: site: add aphlict1001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/613200 (https://phabricator.wikimedia.org/T257617) [17:48:19] (03CR) 10BryanDavis: [C: 03+1] cloud-nfs: Allow changing the nfs mount version [puppet] - 10https://gerrit.wikimedia.org/r/612647 (https://phabricator.wikimedia.org/T257945) (owner: 10Bstorm) [17:48:23] (03CR) 10Andrew Bogott: [C: 03+1] cloud-nfs: Allow changing the nfs mount version [puppet] - 10https://gerrit.wikimedia.org/r/612647 (https://phabricator.wikimedia.org/T257945) (owner: 10Bstorm) [17:48:31] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10ayounsi) I moved all the cables from the old switch to the new one (using csv exports and mass copy/paste). Please double check them, and fill the few missing one... [17:48:39] PROBLEM - Host ganeti1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:49] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) [17:49:14] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) @akosiaris Finished with upgrade on ganeti1008 [17:49:53] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Dzahn) @Jclark-ctr The mgmt interface of ganeti1008 just went down. Could you please check the cable? 
[17:49:56] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [17:49:56] !log herron@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [17:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:12] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [17:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:18] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10CDanis) a:05CDanis→03Nahid Hi Nahid, Sorry to trouble you, but can you create another Wikitech accoun... [17:50:49] RECOVERY - Host ganeti1008 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:53:25] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [17:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:22] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [17:54:33] RECOVERY - Host ganeti1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [17:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:09] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (10Jclark-ctr) @wiki_willy checked storage we do not have any 3tb drives [17:56:24] (03PS2) 10Dzahn: site: add testreduce1001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/613201 (https://phabricator.wikimedia.org/T257940) [17:57:25] (03CR) 10Bstorm: [C: 03+2] cloud-nfs: Allow changing the nfs mount version [puppet] - 10https://gerrit.wikimedia.org/r/612647 (https://phabricator.wikimedia.org/T257945) (owner: 10Bstorm) [17:57:30] (03CR) 10Dzahn: [C: 03+2] site: add testreduce1001 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/613201 
(https://phabricator.wikimedia.org/T257940) (owner: 10Dzahn) [17:58:56] jouncebot: next [17:58:56] In 0 hour(s) and 1 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T1800) [18:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning backport window(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T1800). [18:00:04] Addshore, jan_drewniak, and MatmaRex: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:23] hi [18:01:09] o/ [18:01:45] o/ [18:01:52] I can probably deploy [18:02:13] *opens some windows* [18:02:28] jan_drewniak: can I lead with yours? [18:03:10] OR MatmaRex ? :P [18:03:16] addshore: yup, I think there was a second config patch a team member wanted swat'd too. one sec [18:03:22] ack! [18:03:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:03:44] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:54] (03CR) 10Addshore: [C: 03+2] Disable affinity quicksurveys for the following wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612870 (https://phabricator.wikimedia.org/T246977) (owner: 10Jdrewniak) [18:04:11] jan_drewniak: testable on mwdebug? [18:04:25] addshore: yup [18:04:30] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Nahid) >>! In T256971#6313255, @CDanis wrote: > Hi Nahid, > > Sorry to trouble you, but can you create an... 
[18:04:30] amazing [18:04:34] i'll let you know when it is there [18:04:42] (03Merged) 10jenkins-bot: Disable affinity quicksurveys for the following wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612870 (https://phabricator.wikimedia.org/T246977) (owner: 10Jdrewniak) [18:05:04] jan_drewniak: it is on mwdebug1002! [18:07:08] addshore: ok, survey looks gone, good to deploy [18:07:16] syncing [18:07:35] (03PS6) 10Addshore: Enable limited-width layout for Modern Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613192 (https://phabricator.wikimedia.org/T246420) (owner: 10Nray) [18:07:44] btw, could we do a second config patch? I can add it to the calendar retroactively https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/613192 [18:08:06] jan_drewniak: up! [18:08:10] *yup [18:08:11] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:612870]] T246977 Disable affinity quicksurveys for the following wikis (duration: 00m 57s) [18:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:17] T246977: Run baseline quicksurvey on test wikis - https://phabricator.wikimedia.org/T246977 [18:08:19] (03CR) 10Addshore: [C: 03+2] Enable limited-width layout for Modern Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613192 (https://phabricator.wikimedia.org/T246420) (owner: 10Nray) [18:08:55] jan_drewniak: is "Enable limited-width layout for Modern Vector" testable on mwdebug1002? [18:09:02] yes! [18:09:04] woo! 
[18:09:06] (03Merged) 10jenkins-bot: Enable limited-width layout for Modern Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613192 (https://phabricator.wikimedia.org/T246420) (owner: 10Nray) [18:09:25] jan_drewniak: it is on mwdebu1002 [18:10:03] (03PS3) 10Addshore: Wikibase: stop setting wgWBRepoSettings['tmpSerializeEmptyListsAsObjects'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613164 (https://phabricator.wikimedia.org/T138104) [18:10:13] (03CR) 10Addshore: [C: 03+2] Wikibase: stop setting wgWBRepoSettings['tmpSerializeEmptyListsAsObjects'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613164 (https://phabricator.wikimedia.org/T138104) (owner: 10Addshore) [18:10:20] 😱 it is! https://en.wikipedia.org/wiki/Main_Page?useskinversion=2 looks good to me! [18:10:32] oooh, pretty [18:10:48] lgtm [18:10:49] let me sync it :) [18:11:11] syncing [18:11:13] (03Merged) 10jenkins-bot: Wikibase: stop setting wgWBRepoSettings['tmpSerializeEmptyListsAsObjects'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613164 (https://phabricator.wikimedia.org/T138104) (owner: 10Addshore) [18:11:33] whtas your third patch then? =] [18:12:05] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:613192]] T246420 Enable limited-width layout for Modern Vector (duration: 00m 56s) [18:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:11] T246420: Limit content width, and refine alignment & styling of relevant elements - https://phabricator.wikimedia.org/T246420 [18:12:26] I just had 2, thanks addshore! [18:13:15] jan_drewniak: ack! :) [18:13:19] MatmaRex: your next! 
[18:13:33] (03PS4) 10Cwhite: debianization [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 [18:13:41] thanks [18:13:49] (03PS2) 10Addshore: Move VisualEditor from beta to default on nlwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613198 (https://phabricator.wikimedia.org/T256142) (owner: 10Bartosz Dziewoński) [18:14:01] (03CR) 10Addshore: [C: 03+2] Move VisualEditor from beta to default on nlwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613198 (https://phabricator.wikimedia.org/T256142) (owner: 10Bartosz Dziewoński) [18:14:06] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:613164]] T138104 Wikibase: stop setting wgWBRepoSettings tmpSerializeEmptyListsAsObjects (duration: 00m 57s) [18:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:12] T138104: Do not serialize empty containers (descriptions/aliases/sitelinks) as empty array [] - https://phabricator.wikimedia.org/T138104 [18:14:44] Hello tgr, so https://samesite-sandbox.glitch.me/ [18:14:48] 403 Forbidden [18:14:54] MatmaRex: looks testable on mwdebug1002 to me? :) [18:15:01] (03Merged) 10jenkins-bot: Move VisualEditor from beta to default on nlwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613198 (https://phabricator.wikimedia.org/T256142) (owner: 10Bartosz Dziewoński) [18:15:02] in relation to https://phabricator.wikimedia.org/T258121 [18:15:20] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for jlinehan - https://phabricator.wikimedia.org/T258119 (10CDanis) [18:15:23] (03PS1) 10CDanis: add jdl to deployment [puppet] - 10https://gerrit.wikimedia.org/r/613217 (https://phabricator.wikimedia.org/T258119) [18:15:28] addshore: probably. is it live there? [18:15:33] MatmaRex: its live there now! 
[18:15:50] (03PS2) 10Addshore: Move VisualEditor from beta to default on incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613199 (https://phabricator.wikimedia.org/T256957) (owner: 10Bartosz Dziewoński) [18:16:01] addshore: also, can you confirm that this is the proper way to remove a wiki from a dblist? i'm not really sure what i'm doing here [18:16:24] MatmaRex: it looked good to me. AFAIk that is removing it from the yaml, and then running the composer commant to rebuild the db lists [18:16:33] yeah, that's what i did. thanks [18:16:44] (03PS2) 10CDanis: admin: add jdl to deployment [puppet] - 10https://gerrit.wikimedia.org/r/613217 (https://phabricator.wikimedia.org/T258119) [18:16:48] I only did it myself for the first time the other day, but nothing exploded :) [18:17:08] addshore: and it looks good on mwdebug1002 [18:18:05] syncing [18:18:15] (03CR) 10CDanis: [C: 03+2] admin: add jdl to deployment [puppet] - 10https://gerrit.wikimedia.org/r/613217 (https://phabricator.wikimedia.org/T258119) (owner: 10CDanis) [18:18:20] (03CR) 10Addshore: [C: 03+2] Move VisualEditor from beta to default on incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613199 (https://phabricator.wikimedia.org/T256957) (owner: 10Bartosz Dziewoński) [18:18:47] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10MGerlach) [18:18:50] !log addshore@deploy1001 Synchronized dblists/visualeditor-nondefault.dblist: [[gerrit:613198]] T256142 Move VisualEditor from beta to default on nlwikimedia PT1/2 (duration: 00m 56s) [18:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:55] T256142: VE on for Dutch chapters wiki - https://phabricator.wikimedia.org/T256142 [18:19:04] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for jlinehan - https://phabricator.wikimedia.org/T258119 
(10CDanis) 05Open→03Resolved Done; should be live across the fleet within 30 minutes. [18:19:06] (03Merged) 10jenkins-bot: Move VisualEditor from beta to default on incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613199 (https://phabricator.wikimedia.org/T256957) (owner: 10Bartosz Dziewoński) [18:20:10] James_F: and I correct in thinking the yml files are not actually used in prod? only the actual db lists? [18:20:18] !log addshore@deploy1001 Synchronized wmf-config/config/nlwikimedia.yaml: [[gerrit:613198]] T256142 Move VisualEditor from beta to default on nlwikimedia PT2/2 (duration: 00m 57s) [18:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:48] MatmaRex: "Move VisualEditor from beta to default on incubatorwiki" is in mwdebug1002 [18:21:07] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10MGerlach) we have a new formal collaborator onboard: Alberto Garcia Duran. Alberto needs access to HDFS and stat machines for a new research project.... [18:21:32] addshore: also looks good [18:21:47] amazing, syncing :) [18:21:54] addshore: Currently, yes. [18:22:33] !log addshore@deploy1001 Synchronized dblists/visualeditor-nondefault.dblist: [[gerrit:613199]] T256957 Move VisualEditor from beta to default on incubatorwiki PT1/2 (duration: 00m 56s) [18:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:38] T256957: Enable Visual Editor by default in the Wikimedia Incubator - https://phabricator.wikimedia.org/T256957 [18:22:55] MatmaRex: all done! 
[18:23:02] thanks addshore [18:23:06] (03PS4) 10Addshore: Wikibase: stop setting wmgWikibaseTmpSerializeEmptyListsAsObjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613165 (https://phabricator.wikimedia.org/T138104) [18:23:15] (03CR) 10Addshore: [C: 03+2] Wikibase: stop setting wmgWikibaseTmpSerializeEmptyListsAsObjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613165 (https://phabricator.wikimedia.org/T138104) (owner: 10Addshore) [18:23:47] !log addshore@deploy1001 Synchronized wmf-config/config/incubatorwiki.yaml: [[gerrit:613199]] T256957 Move VisualEditor from beta to default on incubatorwiki PT2/2 (duration: 00m 57s) [18:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:02] (03Merged) 10jenkins-bot: Wikibase: stop setting wmgWikibaseTmpSerializeEmptyListsAsObjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613165 (https://phabricator.wikimedia.org/T138104) (owner: 10Addshore) [18:25:42] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:613165]] T138104 Wikibase: stop setting wmgWikibaseTmpSerializeEmptyListsAsObjects (duration: 00m 57s) [18:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:48] T138104: Do not serialize empty containers (descriptions/aliases/sitelinks) as empty array [] - https://phabricator.wikimedia.org/T138104 [18:25:54] * addshore looks around for other config cleanup [18:27:53] (03CR) 10Addshore: [C: 04-1] "This change is ready for review." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605609 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [18:28:08] (03CR) 10Addshore: [C: 04-2] "Things have changed under this now, and this will need to be reworked" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605608 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [18:30:16] (03PS15) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [18:31:44] (03PS1) 10Addshore: Wikibase: Always set wgWBRepoSettings idGeneratorSeparateDbConnection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613226 [18:31:53] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (10wiki_willy) Thanks @Jclark-ctr. Hi @Marostegui or @fgiunchedi - since it looks like we'll be testing out some new hardware this quarter and eventually refreshing this via T252216, let me know if you'd like us... [18:32:11] (03CR) 10Addshore: [C: 03+2] Wikibase: Always set wgWBRepoSettings idGeneratorSeparateDbConnection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613226 (owner: 10Addshore) [18:32:49] if there is free time, I want to do something, let me know [18:32:59] (03Merged) 10jenkins-bot: Wikibase: Always set wgWBRepoSettings idGeneratorSeparateDbConnection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613226 (owner: 10Addshore) [18:33:08] Amir1: Yeach, let me do this last one and you can have the last 25 mins [18:33:13] *yeah [18:33:27] cooool [18:34:22] Amir1: what is it?
:O [18:34:49] "do something" [18:34:51] just say no [18:35:03] migrating client to use extension.json [18:35:04] :D [18:35:12] Amir1: ooooooh, sounds tasty [18:35:12] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:613226]] Wikibase: Always set wgWBRepoSettings idGeneratorSeparateDbConnection PT 1/2 (duration: 00m 56s) [18:35:13] for i18n now [18:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:20] It is :D [18:35:24] if Reedy let me :P [18:35:40] Amir1: where is the patch? I can push the buttons if you want :) [18:35:41] (03PS1) 10CDanis: admin: shell access + analytics access for nsultan@ [puppet] - 10https://gerrit.wikimedia.org/r/613228 (https://phabricator.wikimedia.org/T256971) [18:36:05] (03CR) 10CDanis: [C: 03+2] admin: shell access + analytics access for nsultan@ [puppet] - 10https://gerrit.wikimedia.org/r/613228 (https://phabricator.wikimedia.org/T256971) (owner: 10CDanis) [18:36:11] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611393 [18:36:24] (03PS3) 10Addshore: extension-list: Load WikibaseClient via JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 (owner: 10Lucas Werkmeister (WMDE)) [18:36:24] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:613226]] Wikibase: Always set wgWBRepoSettings idGeneratorSeparateDbConnection PT 2/2 (duration: 00m 56s) [18:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:45] Test plan: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/611393/2#message-3e88e99ddd205a0d1ae7887aba9d8ebe275accea [18:37:10] (03CR) 10Addshore: [C: 03+2] extension-list: Load WikibaseClient via JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 (owner: 10Lucas Werkmeister (WMDE)) [18:37:19] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for 
Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10CDanis) 05Open→03Resolved Thanks Nahid! I've granted access for you, although I... [18:37:31] * addshore reads about "Build a temporary ExtensionMessages.php" [18:37:54] (03Merged) 10jenkins-bot: extension-list: Load WikibaseClient via JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 (owner: 10Lucas Werkmeister (WMDE)) [18:38:22] it's on the deploy server [18:39:14] Amir1: where is the real ExtensionMessages.php ? [18:39:33] let me find it [18:40:39] /srv/mediawiki/php-1.35.0-wmf.41/cache/l10n ? [18:41:27] not that I see! [18:42:11] aaah wait, ExtensionMessages-1.NNwmfMM.php [18:42:31] https://www.irccloud.com/pastebin/yBtmPDew/ [18:43:16] (03CR) 10Addshore: "diff per the test plan https://phabricator.wikimedia.org/P11929" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/611393 (owner: 10Lucas Werkmeister (WMDE)) [18:43:26] it's just the order, right? [18:43:54] looks like it yes [18:44:23] yup [18:44:46] amazing, anything else, I'll sync it too anyway [18:45:10] It's not read locally. [18:45:27] Only by scap on the deployment server as part of the full scap sync (in the i18n build step, specifically). [18:46:09] nah, We will migrate the loading from client next week [18:46:11] at some point I was told to just always sync files in case of odd things. I guess when this is all automated that will happen, so I may as well start now :) [18:46:18] "nah" was to Adam not James [18:46:24] !log addshore@deploy1001 Synchronized wmf-config/extension-list: [[gerrit:611393]] extension-list: Load WikibaseClient via JSON (duration: 00m 56s) [18:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:46] This week has had some fun anti-protocol syncs. [18:47:06] "anti-protocol syncs"? =o [18:47:22] Like the time I left wmf.41 pushed onto the canaries and couldn't roll forwards or back (yay for the HTTPS change breaking the scap sync check).
[18:47:32] ooof [18:47:40] Or the time I pushed out a config patch before pushing it to gerrit. [18:47:55] Or the time I pushed out an intermediate config patch that never made it to gerrit, so I could save time. [18:47:58] * James_F coughs. [18:48:04] Do as I say, not as I do. :-) [18:48:04] * addshore hands James_F a wet fish to slap himself with [18:48:43] In my defence, we didn't have any downtime due to me (quite the reverse). [18:49:06] !log deployment windows finished with [18:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:40] (03PS1) 10Andrew Bogott: wikilabels::session: don't quote the port number [puppet] - 10https://gerrit.wikimedia.org/r/613234 [18:50:55] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [18:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:44] (03CR) 10Andrew Bogott: [C: 03+2] wikilabels::session: don't quote the port number [puppet] - 10https://gerrit.wikimedia.org/r/613234 (owner: 10Andrew Bogott) [18:53:28] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [18:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:23] (03PS1) 10Ladsgroup: Load WikibaseClient from extension.json file instead of php one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613235 (https://phabricator.wikimedia.org/T256228) [18:54:23] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [18:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] James_F and longma: How many deployers does it take to do Mediawiki train - European+American Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T1900).
[19:03:35] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (10Marostegui) Hey @wiki_willy I'm not a service owner for this kind of host :-) [19:04:33] (03PS1) 10Mholloway: mobileapps: add request template for load.php requests to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/613237 (https://phabricator.wikimedia.org/T258186) [19:04:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:04:41] Nothing to deploy; train is done. [19:06:17] choo choo [19:06:29] Hmm. [19:06:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:06:36] I could deploy the next videojs thing. [19:06:39] Hmmmmmm. [19:06:44] OK, sure, let's do it. [19:06:49] What Could Possibly Go Wrong? [19:07:17] (03PS2) 10Jforrester: TimedMediaHandler: Make videojs the only player on all group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612347 (https://phabricator.wikimedia.org/T248418) [19:07:25] (03CR) 10Jforrester: [C: 03+2] TimedMediaHandler: Make videojs the only player on all group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612347 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [19:07:30] It's only group0, after all. [19:07:30] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10CDanis) Thanks, I've verified that NDA is on file. This task also needs @Nuria's approval for the Analytics access. 
[19:08:12] (03Merged) 10jenkins-bot: TimedMediaHandler: Make videojs the only player on all group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612347 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [19:10:24] James_F: do it! [19:10:29] * James_F grins. [19:10:48] FFS, now is not the time to crash, Chrome. [19:10:51] * James_F sighs. [19:11:18] just use Firefox so you don't have any Chrome crashes! [19:11:25] Firefox crashes even more. [19:11:29] (03PS1) 10CDanis: admin: shell & analytics access for agaduran [puppet] - 10https://gerrit.wikimedia.org/r/613241 (https://phabricator.wikimedia.org/T258214) [19:11:39] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10CDanis) [19:11:54] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:12:21] (03CR) 10jerkins-bot: [V: 04-1] admin: shell & analytics access for agaduran [puppet] - 10https://gerrit.wikimedia.org/r/613241 (https://phabricator.wikimedia.org/T258214) (owner: 10CDanis) [19:13:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10CDanis) Two things pending: [ ] @Nuria's approval for analytics access [ ] three business day waiting period [19:13:12] (03PS2) 10CDanis: admin: shell & analytics access for agaduran [puppet] - 10https://gerrit.wikimedia.org/r/613241 (https://phabricator.wikimedia.org/T258214) [19:13:40] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10Nuria) We need an expiration date for this 
collaboration. [19:14:00] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T248418 TimedMediaHandler: Make videojs the only player on all group0 (duration: 00m 57s) [19:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:06] T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [19:15:32] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:15:35] (03CR) 10CDanis: [C: 04-2] "blocked on approvals" [puppet] - 10https://gerrit.wikimedia.org/r/613241 (https://phabricator.wikimedia.org/T258214) (owner: 10CDanis) [19:15:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10CDanis) >>! In T258214#6313618, @Nuria wrote: > We need an expiration date for this collaboration. Expiration date is listed... [19:19:24] 10Operations, 10Phabricator, 10Security-Team: HTTP 500 error trying to access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10CDanis) a:03Isarra [19:20:56] Hey all - looks like train stuff has wrapped up, so I'd like to security deploy the block for T257687 if there's no objections. [19:32:45] !log Deployed mitigations for T257687 [19:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:32] 10Operations, 10Release-Engineering-Team, 10Patch-For-Review, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10CDanis) a:03calbon Hi @calbon, Are you still having trouble accessing the stat machines ? If so can you please paste ou... 
[19:34:52] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10CDanis) a:03Nuria [19:34:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10Nuria) Approved on my end [19:37:27] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10CDanis) a:05Nuria→03None [19:37:28] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10CDanis) Thanks! Will go through early next week, per the waiting period [19:50:39] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10CDanis) [19:53:56] 10Operations, 10netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (10CDanis) [19:53:58] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10CDanis) [19:54:03] 10Operations, 10netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (10CDanis) 05Open→03Stalled [19:54:59] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10CDanis) [20:03:12] (03CR) 10Cwhite: debianization (034 comments) [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [20:09:19] (03PS1) 10Dzahn: parsoid: create new role to install just testreduce vd-server [puppet] - 10https://gerrit.wikimedia.org/r/613278 (https://phabricator.wikimedia.org/T257940) [20:10:07] (03PS2) 
10Dzahn: parsoid: create new role to install just testreduce vd-server [puppet] - 10https://gerrit.wikimedia.org/r/613278 (https://phabricator.wikimedia.org/T257940) [20:12:45] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10ayounsi) (and add it to Netbox as well) [20:13:26] (03PS3) 10Dzahn: parsoid: create new role to install just testreduce vd-server [puppet] - 10https://gerrit.wikimedia.org/r/613278 (https://phabricator.wikimedia.org/T257940) [20:16:56] (03PS1) 10CDanis: admin: add kerberos for nsultan [puppet] - 10https://gerrit.wikimedia.org/r/613280 (https://phabricator.wikimedia.org/T256971) [20:18:12] (03CR) 10CDanis: [C: 03+2] admin: add kerberos for nsultan [puppet] - 10https://gerrit.wikimedia.org/r/613280 (https://phabricator.wikimedia.org/T256971) (owner: 10CDanis) [20:18:54] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10CDanis) Ah -- one last thing -- you should have an email in your inbox with a tempor... 
[20:19:15] (03PS1) 10Herron: install_server: add dhcp/netboot entries for prometheus[345]001 [puppet] - 10https://gerrit.wikimedia.org/r/613281 (https://phabricator.wikimedia.org/T243057) [20:19:45] (03CR) 10jerkins-bot: [V: 04-1] install_server: add dhcp/netboot entries for prometheus[345]001 [puppet] - 10https://gerrit.wikimedia.org/r/613281 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [20:21:09] (03PS4) 10Dzahn: parsoid: create new role to install just testreduce [puppet] - 10https://gerrit.wikimedia.org/r/613278 (https://phabricator.wikimedia.org/T257940) [20:21:36] (03PS2) 10Herron: install_server: add dhcp/netboot entries for prometheus[345]001 [puppet] - 10https://gerrit.wikimedia.org/r/613281 (https://phabricator.wikimedia.org/T243057) [20:22:21] (03PS5) 10Dzahn: parsoid: create new role to install just testreduce [puppet] - 10https://gerrit.wikimedia.org/r/613278 (https://phabricator.wikimedia.org/T257940) [20:23:54] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) [20:23:57] 10Operations, 10Parsoid, 10vm-requests, 10Parsoid-Tests, 10Patch-For-Review: eqiad: 1 VM request for testreduce - https://phabricator.wikimedia.org/T257940 (10Dzahn) 05Open→03Resolved VM has been created and runs with insetup role. Next will be applying a new role. 
[20:24:31] (03PS2) 10Dzahn: phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) [20:24:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:25:12] (03PS3) 10Mholloway: Enable client error logging on ca.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612893 (https://phabricator.wikimedia.org/T258073) (owner: 10Jason Linehan) [20:25:18] 10Operations, 10Phabricator, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM request for aphlict - https://phabricator.wikimedia.org/T257617 (10Dzahn) 05Open→03Resolved VM has been created and runs with insetup role. Next will be applying a new puppet role that installs just aphlict by itself rather tha... [20:26:05] (03CR) 10jerkins-bot: [V: 04-1] phabricator: create separate role/profile for aphlict (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/613205 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn) [20:26:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:28:25] (03CR) 10Mholloway: [C: 03+2] Enable client error logging on ca.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612893 (https://phabricator.wikimedia.org/T258073) (owner: 10Jason Linehan) [20:29:24] (03Merged) 10jenkins-bot: Enable client error logging on ca.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612893 (https://phabricator.wikimedia.org/T258073) (owner: 10Jason Linehan) [20:31:47] (03CR) 10Herron: [C: 03+2] install_server: add dhcp/netboot entries for prometheus[345]001 [puppet] - 10https://gerrit.wikimedia.org/r/613281 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [20:32:46] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable client error logging on Catalan Wikipedia (T258073) (duration: 00m 57s) [20:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:52] T258073: Enable client error logging on ca.wikipedia.org (Catalan) - https://phabricator.wikimedia.org/T258073 [20:37:20] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:43:10] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:00:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:01:15] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Upgrade Netbox to v2.8.7-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/612624 (owner: 10CRusnov) [21:06:30] (03PS1) 10Herron: assign role::insetup to prometheus[345]001 [puppet] - 10https://gerrit.wikimedia.org/r/613293 (https://phabricator.wikimedia.org/T243057) [21:10:23] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10wiki_willy) Hi @Eevans - it looks like this was originally scheduled to be refreshed this fiscal year during the annual CapEx planning, but then someone decided to push out the refresh until FY21-22. Can... [21:11:18] (03CR) 10Herron: [C: 03+2] assign role::insetup to prometheus[345]001 [puppet] - 10https://gerrit.wikimedia.org/r/613293 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [21:17:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:17:52] (03PS6) 10Dzahn: parsoid: create new role to install just testreduce [puppet] - 10https://gerrit.wikimedia.org/r/613278 (https://phabricator.wikimedia.org/T257940) [21:18:23] (03PS7) 10Dzahn: parsoid: create new role to install just testreduce [puppet] - 10https://gerrit.wikimedia.org/r/613278 (https://phabricator.wikimedia.org/T257940) [21:24:31] (03CR) 10Dzahn: [C: 03+2] parsoid: create new role to install just testreduce [puppet] - 10https://gerrit.wikimedia.org/r/613278 (https://phabricator.wikimedia.org/T257940) (owner: 10Dzahn) [21:24:37] (03CR) 10Mholloway: [C: 03+1] Change hostname of `mobileapps` to the dockerized instance. 
[puppet] - 10https://gerrit.wikimedia.org/r/613144 (https://phabricator.wikimedia.org/T256794) (owner: 10Jgiannelos) [21:25:17] (03CR) 10Dzahn: "I'm sure this will need follow-ups. Just starting somewhere to give access and install npm and node on buster." [puppet] - 10https://gerrit.wikimedia.org/r/613278 (https://phabricator.wikimedia.org/T257940) (owner: 10Dzahn) [21:26:41] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) Change 613278 merged by Dzahn: [operations/puppet@production] parsoid: create new role to install just testreduce https://gerrit.w... [21:34:55] (03CR) 10Aaron Schulz: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [21:38:44] (03PS1) 10Dzahn: visualdiff: ensure /srv/visualdiff/testreduce exists [puppet] - 10https://gerrit.wikimedia.org/r/613306 (https://phabricator.wikimedia.org/T257906) [21:39:53] !log crusnov@deploy1001 Started deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 [21:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:59] (03CR) 10jerkins-bot: [V: 04-1] visualdiff: ensure /srv/visualdiff/testreduce exists [puppet] - 10https://gerrit.wikimedia.org/r/613306 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [21:40:55] !log crusnov@deploy1001 Finished deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 (duration: 01m 01s) [21:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:01] !log crusnov@deploy1001 Started deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 part 2 [21:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:14] (03PS1) 10Effie Mouzeli: role::parsoid: Add missing exporters for parsoid [puppet] - 
10https://gerrit.wikimedia.org/r/613307 [21:42:34] !log crusnov@deploy1001 Finished deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 part 2 (duration: 01m 33s) [21:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:39] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23938/scandium.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613306 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [21:49:47] (03PS2) 10Dzahn: visualdiff: ensure /srv/visualdiff/testreduce exists [puppet] - 10https://gerrit.wikimedia.org/r/613306 (https://phabricator.wikimedia.org/T257906) [21:52:51] !log crusnov@deploy1001 Started deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 part 3 [21:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:40] !log crusnov@deploy1001 Finished deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 part 3 (duration: 01m 49s) [21:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:27] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1001/23939/" [puppet] - 10https://gerrit.wikimedia.org/r/613307 (owner: 10Effie Mouzeli) [21:58:20] (03CR) 10Dzahn: "noop on scandium - fixed puppet on testreduce1001" [puppet] - 10https://gerrit.wikimedia.org/r/613306 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [22:12:24] (03PS1) 10Dzahn: visualdiff: update git branch from ruthenium to scandium [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) [22:15:58] !log testreduce1001 manually git clone 'scandium' branch of integration/visualdiff into /srv/visualdiff (T257906) [22:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:03] T257906: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 [22:17:36] (03CR) 
10Dzahn: "i manually cloned the scandium branch into /srv/visualdiff on testreduce1001 but will delete it again" [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [22:18:59] (03CR) 10Subramanya Sastry: "On parsing-qa-01 where npm is available, I can run npm install locally." [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [22:21:43] PROBLEM - Check systemd state on testreduce1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:39] ACKNOWLEDGEMENT - Check systemd state on testreduce1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T257906 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:45] (03CR) 10Dzahn: "It's just that changing this will affect scandium as well. scandium is currently on the ruthenium branch but it would not be possible to c" [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [22:33:47] (03CR) 10Subramanya Sastry: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [22:36:16] (03CR) 10Dzahn: [C: 03+1] "lgtm. nitpick: would move the includes to the bottom after the main classes" [puppet] - 10https://gerrit.wikimedia.org/r/613307 (owner: 10Effie Mouzeli) [22:38:05] (03CR) 10Dzahn: [C: 03+1] "or..actually. 
We could consider moving it to mediawiki::common because it seems to be something that should be common to all MW(-like) ser" [puppet] - 10https://gerrit.wikimedia.org/r/613307 (owner: 10Effie Mouzeli) [22:39:30] (03PS1) 10Dave Pifke: [WIP] webperf: add APT repository [puppet] - 10https://gerrit.wikimedia.org/r/613320 [22:40:41] (03CR) 10jerkins-bot: [V: 04-1] [WIP] webperf: add APT repository [puppet] - 10https://gerrit.wikimedia.org/r/613320 (owner: 10Dave Pifke) [22:42:42] (03PS2) 10Dave Pifke: [WIP] webperf: add APT repository [puppet] - 10https://gerrit.wikimedia.org/r/613320 [22:43:54] (03CR) 10jerkins-bot: [V: 04-1] [WIP] webperf: add APT repository [puppet] - 10https://gerrit.wikimedia.org/r/613320 (owner: 10Dave Pifke) [22:46:38] (03PS3) 10Dave Pifke: [WIP] webperf: add APT repository [puppet] - 10https://gerrit.wikimedia.org/r/613320 [23:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200716T2300) [23:09:27] (03PS1) 10Cmjohnson: Adding production dns for cloudcephosd1004-1015 [dns] - 10https://gerrit.wikimedia.org/r/613333 (https://phabricator.wikimedia.org/T251619) [23:09:52] (03CR) 10jerkins-bot: [V: 04-1] Adding production dns for cloudcephosd1004-1015 [dns] - 10https://gerrit.wikimedia.org/r/613333 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [23:13:22] (03PS2) 10Cmjohnson: Adding production dns for cloudcephosd1004-1015 [dns] - 10https://gerrit.wikimedia.org/r/613333 (https://phabricator.wikimedia.org/T251619) [23:14:27] PROBLEM - SSH on webperf2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:16:09] RECOVERY - SSH on webperf2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:16:16] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns for cloudcephosd1004-1015 
[dns] - 10https://gerrit.wikimedia.org/r/613333 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [23:18:15] 10Operations, 10SRE-OnFire, 10Sustainability (Incident Prevention): Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (10RLazarus) Sorry for the delay, but this is still in progress -- I've checked in with Legal and they're still working... [23:40:41] 10Operations, 10Performance-Team, 10Traffic: Review socket balancing in ATS/Varnish traffic layers - https://phabricator.wikimedia.org/T248522 (10Krinkle) p:05Triage→03Low [23:49:30] (03PS4) 10Dzahn: exim: remove RT redirects [puppet] - 10https://gerrit.wikimedia.org/r/612929 [23:53:15] (03CR) 10Dzahn: [C: 03+2] exim: remove RT redirects [puppet] - 10https://gerrit.wikimedia.org/r/612929 (owner: 10Dzahn)