[00:53:20] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:54:00] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:14] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [00:54:44] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [00:54:46] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:54:50] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:54:54] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:55:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:55:24] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [00:55:36] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to 
address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:55:42] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [00:55:44] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [01:03:58] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [01:17:16] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:32] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [01:18:00] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [01:18:04] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:18:10] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [01:18:42] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [01:19:00] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [01:19:02] RECOVERY - Check size of conntrack table on 
stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [01:20:04] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:26:42] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:35:06] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1007 is OK: OK: synced at Mon 2020-02-24 01:35:05 UTC. https://wikitech.wikimedia.org/wiki/NTP [02:03:10] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:05:20] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:07:30] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:09:32] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:09:40] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:11:32] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5009 is OK: HTTP OK: HTTP/1.0 200 OK - 21992 bytes in 0.748 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:13:14] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[02:15:40] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:16:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:17:38] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:17:52] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:18:28] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:25:37] foks, i just set a gban on unreg users. (Derp always starts hitting other chans when banned from one) [02:26:01] I'm on mobile, could you set web gateway exempt in -help? 
[02:33:09] Wrong chan [03:58:42] (PS1) KartikMistry: Adjust MT threshold for Telugu to 70% [mediawiki-config] - https://gerrit.wikimedia.org/r/574265 (https://phabricator.wikimedia.org/T244769) [05:54:36] (PS1) Marostegui: Revert "wikireplicas: depool labsdb1011 and set weights on other cluster" [puppet] - https://gerrit.wikimedia.org/r/574267 [05:55:51] (CR) Marostegui: [C: +2] Revert "wikireplicas: depool labsdb1011 and set weights on other cluster" [puppet] - https://gerrit.wikimedia.org/r/574267 (owner: Marostegui) [05:57:15] !log Repool labsdb1011 - T245797 [05:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318 after removing partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10483 and previous config saved to /var/cache/conftool/dbconfig/20200224-060118-marostegui.json [06:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:23] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:15:24] (PS1) CDanis: dbctl: schema: allow es4 and es5 sections [puppet] - https://gerrit.wikimedia.org/r/574268 (https://phabricator.wikimedia.org/T245806) [06:15:26] (PS1) CDanis: dbctl: create sections s4/s5 in codfw [puppet] - https://gerrit.wikimedia.org/r/574269 (https://phabricator.wikimedia.org/T245806) [06:15:28] (PS1) CDanis: dbctl: create sections s4/s5 in eqiad [puppet] - https://gerrit.wikimedia.org/r/574270 (https://phabricator.wikimedia.org/T245806) [06:19:24] (PS1) Marostegui: sX-pager.sql: Remove partitions from revision tables. [software] - https://gerrit.wikimedia.org/r/574271 (https://phabricator.wikimedia.org/T239453) [06:21:56] (CR) Marostegui: [C: +2] sX-pager.sql: Remove partitions from revision tables. 
[software] - https://gerrit.wikimedia.org/r/574271 (https://phabricator.wikimedia.org/T239453) (owner: Marostegui) [06:22:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318 after removing partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10484 and previous config saved to /var/cache/conftool/dbconfig/20200224-062226-marostegui.json [06:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:31] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:26:23] (PS1) CDanis: dbctl: schema: allow es4 and es5 sections [software/conftool] - https://gerrit.wikimedia.org/r/574272 (https://phabricator.wikimedia.org/T245806) [06:28:31] (CR) jerkins-bot: [V: -1] dbctl: schema: allow es4 and es5 sections [software/conftool] - https://gerrit.wikimedia.org/r/574272 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [06:28:38] (PS1) Marostegui: Revert "db1101: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/574273 [06:29:53] (CR) Marostegui: [C: +2] Revert "db1101: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/574273 (owner: Marostegui) [06:32:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3318 after removing partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10485 and previous config saved to /var/cache/conftool/dbconfig/20200224-063258-marostegui.json [06:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:03] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:35:32] (PS1) CDanis: drop support for Python 3.4 [software/conftool] - https://gerrit.wikimedia.org/r/574275 [06:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1101:3318 after removing partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10486 and 
previous config saved to /var/cache/conftool/dbconfig/20200224-064044-marostegui.json [06:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:49] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:44:55] (CR) CDanis: "Failed because PyYAML dropped support for 3.4. I propose Ic404709c1e42" [software/conftool] - https://gerrit.wikimedia.org/r/574272 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [06:51:07] (CR) Giuseppe Lavagetto: [C: +1] dbctl: schema: allow es4 and es5 sections [puppet] - https://gerrit.wikimedia.org/r/574268 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [06:51:22] (CR) Giuseppe Lavagetto: [C: +1] dbctl: create sections s4/s5 in codfw [puppet] - https://gerrit.wikimedia.org/r/574269 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [06:51:33] (CR) Giuseppe Lavagetto: [C: +1] dbctl: create sections s4/s5 in eqiad [puppet] - https://gerrit.wikimedia.org/r/574270 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [07:00:33] (CR) CDanis: [C: +2] dbctl: schema: allow es4 and es5 sections [puppet] - https://gerrit.wikimedia.org/r/574268 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [07:03:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1107 for 10.4 testing in main and API - T242702', diff saved to https://phabricator.wikimedia.org/P10487 and previous config saved to /var/cache/conftool/dbconfig/20200224-070337-marostegui.json [07:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:42] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [07:04:56] (CR) CDanis: [C: +2] dbctl: create sections s4/s5 in codfw [puppet] - https://gerrit.wikimedia.org/r/574269 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [07:05:15] (PS2) CDanis: dbctl: create sections es4/es5 in 
codfw [puppet] - https://gerrit.wikimedia.org/r/574269 (https://phabricator.wikimedia.org/T245806) [07:08:42] (PS2) CDanis: dbctl: create sections es4/es5 in eqiad [puppet] - https://gerrit.wikimedia.org/r/574270 (https://phabricator.wikimedia.org/T245806) [07:09:04] (PS3) CDanis: dbctl: create sections es4/es5 in eqiad [puppet] - https://gerrit.wikimedia.org/r/574270 (https://phabricator.wikimedia.org/T245806) [07:12:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1107 for 10.4 testing in special slaves group with weight 10 - T242702', diff saved to https://phabricator.wikimedia.org/P10488 and previous config saved to /var/cache/conftool/dbconfig/20200224-071201-marostegui.json [07:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:06] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [07:12:07] (CR) CDanis: [C: +2] dbctl: create sections es4/es5 in eqiad [puppet] - https://gerrit.wikimedia.org/r/574270 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [07:13:16] marostegui: let me know if you see any unexpected dbctl diffs (but i expect all my changes to be no-ops) [07:25:34] (PS2) Elukey: role::analytics_cluster::launcher: set hive-site.xml in HDFS [puppet] - https://gerrit.wikimedia.org/r/574044 (https://phabricator.wikimedia.org/T240880) [07:28:09] (CR) Elukey: "hieradata/role/common/analytics_cluster/launcher.yaml" [puppet] - https://gerrit.wikimedia.org/r/574044 (https://phabricator.wikimedia.org/T240880) (owner: Elukey) [07:28:36] (CR) Elukey: "copy/paste fail: https://puppet-compiler.wmflabs.org/compiler1001/20982/" [puppet] - https://gerrit.wikimedia.org/r/574044 (https://phabricator.wikimedia.org/T240880) (owner: Elukey) [07:29:14] !log dbctl: edit es4/es5 sections in codfw (flavor & master fields) T245806 [07:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:18] T245806: 
Create es4 and es5 sections in dbctl - https://phabricator.wikimedia.org/T245806 [07:29:59] (CR) Elukey: [C: +2] role::analytics_cluster::launcher: set hive-site.xml in HDFS [puppet] - https://gerrit.wikimedia.org/r/574044 (https://phabricator.wikimedia.org/T240880) (owner: Elukey) [07:30:13] !log dbctl: (and min_replicas field) T245806 [07:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:03] !log dbctl: edit es4/es5 sections in eqiad (flavor & master & min_replicas fields) T245806 [07:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:37] (PS1) Elukey: role::analytics_cluster::launcher: add Analytics Refinery scap repo [puppet] - https://gerrit.wikimedia.org/r/574289 (https://phabricator.wikimedia.org/T243934) [07:46:01] (CR) Elukey: [C: +2] role::analytics_cluster::launcher: add Analytics Refinery scap repo [puppet] - https://gerrit.wikimedia.org/r/574289 (https://phabricator.wikimedia.org/T243934) (owner: Elukey) [07:54:46] (CR) Effie Mouzeli: [C: -1] "I am posting a pcc from 574200, where the changes have effect on mwdebug1001 and mwdebig1002 https://puppet-compiler.wmflabs.org/compiler1" [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli) [08:01:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add instances to es4 codfw - T245806', diff saved to https://phabricator.wikimedia.org/P10489 and previous config saved to /var/cache/conftool/dbconfig/20200224-080128-marostegui.json [08:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:33] T245806: Create es4 and es5 sections in dbctl - https://phabricator.wikimedia.org/T245806 [08:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add instances to es4 eqiad - T245806', diff saved to https://phabricator.wikimedia.org/P10490 and previous config saved to 
/var/cache/conftool/dbconfig/20200224-080708-marostegui.json [08:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:13] T245806: Create es4 and es5 sections in dbctl - https://phabricator.wikimedia.org/T245806 [08:11:50] (PS3) Muehlenhoff: Add CAS authentication to tendril [puppet] - https://gerrit.wikimedia.org/r/573527 [08:12:24] Operations, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (Marostegui) p:Triage→Medium Let us know when the database is created so we can sanitize it. Thanks! [08:13:37] (CR) Muehlenhoff: [C: +1] "All great minds think alike :-)" [puppet] - https://gerrit.wikimedia.org/r/574104 (owner: Dzahn) [08:23:24] (PS1) CDanis: etcd.php: support es4/es5 sections [mediawiki-config] - https://gerrit.wikimedia.org/r/574377 (https://phabricator.wikimedia.org/T245806) [08:23:41] (PS1) Marostegui: etcd.php: Add es4 and es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/574378 (https://phabricator.wikimedia.org/T245806) [08:24:24] (PS1) Elukey: role::analytics_cluster::launcher: add git proxy config for Analytics vlan [puppet] - https://gerrit.wikimedia.org/r/574379 (https://phabricator.wikimedia.org/T243934) [08:25:08] (CR) Muehlenhoff: [C: +1] "Looks good! The reason this hasn't come up so far is because gobgpd has been imported from Debian unstable (it's not in buster). 
In Buster" [puppet] - https://gerrit.wikimedia.org/r/574039 (https://phabricator.wikimedia.org/T245847) (owner: Jbond) [08:26:04] (CR) jerkins-bot: [V: -1] etcd.php: support es4/es5 sections [mediawiki-config] - https://gerrit.wikimedia.org/r/574377 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [08:27:22] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={rsyslog-notice,rsyslog-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now [08:27:22] 1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [08:28:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add instances to es5 codfw - T245806', diff saved to https://phabricator.wikimedia.org/P10491 and previous config saved to /var/cache/conftool/dbconfig/20200224-082848-marostegui.json [08:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:53] T245806: Create es4 and es5 sections in dbctl - https://phabricator.wikimedia.org/T245806 [08:29:22] !log Temporary put es1020 (es4) and es1023 (es5) on RO on a mysql level - T245806 [08:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:07] (CR) CDanis: [C: +1] "tested on mwdebug1002" [mediawiki-config] - https://gerrit.wikimedia.org/r/574378 (https://phabricator.wikimedia.org/T245806) (owner: Marostegui) [08:30:29] (CR) Marostegui: [C: +2] etcd.php: Add es4 and es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/574378 (https://phabricator.wikimedia.org/T245806) (owner: Marostegui) [08:30:38] (Abandoned) CDanis: etcd.php: support es4/es5 sections [mediawiki-config] - 
https://gerrit.wikimedia.org/r/574377 (https://phabricator.wikimedia.org/T245806) (owner: CDanis) [08:31:15] (Merged) jenkins-bot: etcd.php: Add es4 and es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/574378 (https://phabricator.wikimedia.org/T245806) (owner: Marostegui) [08:31:21] (CR) Elukey: [C: +2] role::analytics_cluster::launcher: add git proxy config for Analytics vlan [puppet] - https://gerrit.wikimedia.org/r/574379 (https://phabricator.wikimedia.org/T243934) (owner: Elukey) [08:32:31] (CR) Giuseppe Lavagetto: [C: +2] drop support for Python 3.4 [software/conftool] - https://gerrit.wikimedia.org/r/574275 (owner: CDanis) [08:33:58] !log marostegui@deploy1001 Synchronized wmf-config/etcd.php: Add es4 and es5 (unused new external store sections to etcd - T245806 (duration: 00m 58s) [08:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:04] T245806: Create es4 and es5 sections in dbctl - https://phabricator.wikimedia.org/T245806 [08:34:45] (Merged) jenkins-bot: drop support for Python 3.4 [software/conftool] - https://gerrit.wikimedia.org/r/574275 (owner: CDanis) [08:35:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:37:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:40:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add instances to es5 eqiad - T245806', diff saved to https://phabricator.wikimedia.org/P10492 and previous config saved to /var/cache/conftool/dbconfig/20200224-084027-marostegui.json [08:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:32] T245806: Create es4 and es5 sections in dbctl - https://phabricator.wikimedia.org/T245806 [08:40:51] (CR) Filippo Giunchedi: [C: +2] service: fix uwsgi logstash_port_logback [puppet] - https://gerrit.wikimedia.org/r/573936 (https://phabricator.wikimedia.org/T245512) (owner: Filippo Giunchedi) [08:41:14] (PS2) Muehlenhoff: Remove support for < Buster from Phabricator class [puppet] - https://gerrit.wikimedia.org/r/563469 [08:43:47] (CR) Filippo Giunchedi: [C: +2] prometheus: use class logstash for jmx_logstash job [puppet] - https://gerrit.wikimedia.org/r/573968 (owner: Filippo Giunchedi) [08:47:31] (PS4) KartikMistry: ContentTranslation: Set cookieDomain for Production [mediawiki-config] - https://gerrit.wikimedia.org/r/416973 [08:56:15] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/20986/" [puppet] - https://gerrit.wikimedia.org/r/563469 (owner: Muehlenhoff) [08:56:28] (CR) Muehlenhoff: [C: +2] Remove support for < Buster from Phabricator class [puppet] - https://gerrit.wikimedia.org/r/563469 (owner: Muehlenhoff) [09:03:27] (PS1) Elukey: role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) [09:06:24] (CR) jerkins-bot: [V: -1] role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) (owner: Elukey) [09:08:10] !log update puppet 
compiler's facts [09:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:45] Operations, Patch-For-Review, User-fgiunchedi: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (fgiunchedi) [09:13:06] (CR) Muehlenhoff: "This patch should re-add the hostnames which were dropped in 0160c56cd1917, shouldn't they?" [puppet] - https://gerrit.wikimedia.org/r/572312 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [09:21:14] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:15] !log bounce ferm on ms-be2023, it had failed (no entries in journald) [09:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:18] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2023 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:22:19] (PS9) Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [09:22:26] (CR) Muehlenhoff: "Looks good, few comments inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) (owner: Jbond) [09:28:04] (PS2) Elukey: role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) [09:31:15] (CR) jerkins-bot: [V: -1] role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) (owner: Elukey) [09:33:03] I'm quickly going to deploy a fix for metrics now: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/574386 [09:33:48] PROBLEM - Citoid LVS eqiad on 
citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [09:34:24] (PS4) Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) [09:34:24] (CR) Muehlenhoff: admin: add rerprepo system user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/573991 (owner: Jbond) [09:37:02] (PS3) Elukey: role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) [09:37:58] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:40:31] (PS4) Elukey: role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) [09:42:31] (CR) Muehlenhoff: [C: +2] Add CAS authentication to tendril [puppet] - https://gerrit.wikimedia.org/r/573527 (owner: Muehlenhoff) [09:44:38] Operations, Wikimedia-Logstash, observability: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline - https://phabricator.wikimedia.org/T225122 (fgiunchedi) Open→Invalid Resolving in favor of service-specific tasks (subtasks of T225122) [09:44:40] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (fgiunchedi) [09:47:22] (CR) Ladsgroup: [C: +1] Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/573969 (owner: Matěj Suchánek) [09:51:42] (CR) Volans: "FWIW we have 10 hosts that match this:" [software/conftool] - https://gerrit.wikimedia.org/r/574275 (owner: CDanis) [09:56:50] (PS5) Elukey: 
role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) [10:01:02] (PS1) Arturo Borrero Gonzalez: hieradata: fix typo in cloud recursor FQDNs [puppet] - https://gerrit.wikimedia.org/r/574395 (https://phabricator.wikimedia.org/T243766) [10:01:52] (CR) Volans: "> Patch Set 1:" [software/conftool] - https://gerrit.wikimedia.org/r/574275 (owner: CDanis) [10:04:25] (CR) Filippo Giunchedi: "See inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/574395 (https://phabricator.wikimedia.org/T243766) (owner: Arturo Borrero Gonzalez) [10:05:57] (PS2) Volans: ganeti: use canonical cluster names [software/spicerack] - https://gerrit.wikimedia.org/r/571780 (https://phabricator.wikimedia.org/T231068) [10:05:59] (PS3) Volans: ganeti: add logging for GntInstance actions [software/spicerack] - https://gerrit.wikimedia.org/r/571997 (https://phabricator.wikimedia.org/T231068) [10:06:01] (PS4) Volans: ganeti: add VM creation capability [software/spicerack] - https://gerrit.wikimedia.org/r/571998 (https://phabricator.wikimedia.org/T231068) [10:06:03] (PS4) Volans: spicerack: add support for HTTP proxy [software/spicerack] - https://gerrit.wikimedia.org/r/574152 [10:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 for compression and place db1101:3318 into vslow,dump - T232446', diff saved to https://phabricator.wikimedia.org/P10493 and previous config saved to /var/cache/conftool/dbconfig/20200224-101030-marostegui.json [10:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:35] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [10:10:55] (PS2) Arturo Borrero Gonzalez: hieradata: fix typo in cloud recursor FQDNs [puppet] - https://gerrit.wikimedia.org/r/574395 (https://phabricator.wikimedia.org/T243766) [10:11:49] (PS1) Arturo Borrero 
Gonzalez: cloud: codfw1dev: fix instance flat range [puppet] - https://gerrit.wikimedia.org/r/574400 [10:12:34] !log Stop db1087 and db2079 in sync - T232446 [10:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:49] (CR) jerkins-bot: [V: -1] cloud: codfw1dev: fix instance flat range [puppet] - https://gerrit.wikimedia.org/r/574400 (owner: Arturo Borrero Gonzalez) [10:15:17] (PS2) Arturo Borrero Gonzalez: cloud: codfw1dev: fix instance flat range [puppet] - https://gerrit.wikimedia.org/r/574400 [10:19:16] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 321.6 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [10:20:03] (PS10) Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [10:20:20] (CR) Jbond: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/574039 (https://phabricator.wikimedia.org/T245847) (owner: Jbond) [10:20:36] (CR) Jbond: [C: +2] enforce-user-groups: update script to ignore Dynamic users [puppet] - https://gerrit.wikimedia.org/r/574039 (https://phabricator.wikimedia.org/T245847) (owner: Jbond) [10:21:39] (PS5) Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) [10:23:19] Operations, Patch-For-Review: enforce-users-groups tries to remove package created gobgp user - https://phabricator.wikimedia.org/T245847 (jbond) Open→Resolved I have now excluded DynamicUsers from the enforce script. I tested this on flospec1001 and all seems well but please reopen if you still... 
[10:25:48] (CR) Jbond: "updated thanks" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) (owner: Jbond) [10:26:24] (PS6) Elukey: role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) [10:26:44] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: codfw1dev: fix instance flat range [puppet] - https://gerrit.wikimedia.org/r/574400 (owner: Arturo Borrero Gonzalez) [10:27:06] (PS4) Jbond: admin: add support for system users and groups [puppet] - https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) [10:27:07] (CR) Filippo Giunchedi: [C: +1] hieradata: fix typo in cloud recursor FQDNs [puppet] - https://gerrit.wikimedia.org/r/574395 (https://phabricator.wikimedia.org/T243766) (owner: Arturo Borrero Gonzalez) [10:28:44] (CR) Arturo Borrero Gonzalez: [C: +2] hieradata: fix typo in cloud recursor FQDNs [puppet] - https://gerrit.wikimedia.org/r/574395 (https://phabricator.wikimedia.org/T243766) (owner: Arturo Borrero Gonzalez) [10:29:42] (CR) Jbond: admin: add rerprepo system user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/573991 (owner: Jbond) [10:30:13] (PS2) Jbond: admin: add rerprepo system user [puppet] - https://gerrit.wikimedia.org/r/573991 [10:30:42] (PS3) Jbond: admin: add rerprepo system user [puppet] - https://gerrit.wikimedia.org/r/573991 [10:32:34] (CR) Filippo Giunchedi: [C: +2] netbox: log to logging pipeline [puppet] - https://gerrit.wikimedia.org/r/573976 (https://phabricator.wikimedia.org/T245511) (owner: Filippo Giunchedi) [10:33:21] (CR) jerkins-bot: [V: -1] admin: add rerprepo system user [puppet] - https://gerrit.wikimedia.org/r/573991 (owner: Jbond) [10:34:47] !log onboard netbox to logging pipeline [10:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:34:59] (PS7) Elukey: role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) [10:35:18] Operations, netops, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (aborrero) >>! In T245606#5910457, @Krenair wrote: > I've been reading the linked proposal and noticed this: > "the internal flat network CIDR. This is 1... [10:35:23] (PS11) Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [10:35:25] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/20996/" [puppet] - https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) (owner: Elukey) [10:36:52] (PS6) Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) [10:40:40] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [10:40:42] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:41:13] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/Wikibase:
[[gerrit:574386|Add metric for recording cache hits in StatsdRecordingSimpleCache]] (T244260) (duration: 01m 04s) [10:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:18] T244260: Investigate seemingly broken https://grafana.wikimedia.org/d/u5wAugyik/wikibase-formattercachedashboard - https://phabricator.wikimedia.org/T244260 [10:42:16] the errors during deployment is because scap sends the files in random order and there's no way to avoid it in this patch [10:42:48] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:44:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:45:00] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:45:00] let's see if it recovers [10:45:04] (PS1) Muehlenhoff: Enable CAS authentication for tendril/dbmonitor [puppet] - https://gerrit.wikimedia.org/r/574404 [10:45:21] fatal monitor has recovered [10:46:09] (CR) jerkins-bot: [V: -1] Enable CAS authentication for tendril/dbmonitor [puppet] - https://gerrit.wikimedia.org/r/574404 (owner: Muehlenhoff) [10:46:45] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:46:55] oh okay, it's done now [10:49:11] (PS1) Brian Wolff: Make Beta labs CSP settings be same as prod but with beta urls [mediawiki-config] - https://gerrit.wikimedia.org/r/574405 (https://phabricator.wikimedia.org/T245983) [10:49:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:50:28] (PS2) Muehlenhoff: Enable CAS authentication for tendril/dbmonitor [puppet] - https://gerrit.wikimedia.org/r/574404 [10:50:36] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:48] jouncebot next [10:51:48] In 0 hour(s) and 38 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200224T1130) [10:52:42] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:04] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2016 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:53:34] (PS4) Giuseppe Lavagetto: role::elasticsearch::cloudelastic: use profile::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/573515 [10:53:36] (PS4) Giuseppe Lavagetto: lvs::configuration: drop the lvs service hashes [puppet] - https://gerrit.wikimedia.org/r/573516 [10:53:38] (PS1) Giuseppe Lavagetto: profile::dns::auth::discovery: only add services in production state [puppet] - https://gerrit.wikimedia.org/r/574406 [10:53:38] uh? DNS query for 'ms-be2056.codfw.wmnet' failed: query timed out [10:53:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:54:14] forcing a puppet run [10:54:45] Operations, Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez) [10:55:03] Operations, Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez) p:Triage→Medium [10:55:13] Operations, Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez) [10:55:16] Operations, Pybal, Traffic: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (Vgutierrez) [10:55:30] <_joe_> vgutierrez: do we still need pybal-test?
[10:55:43] !log restarted ferm on ms-be2016, had failed with DNS query for 'ms-be2056.codfw.wmnet' failed: query timed out [10:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:57] _joe_: I'll use it to test the buster packages [10:56:09] <_joe_> fair enough [10:56:12] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [10:56:17] afterwards we can burn them in hell or whatever suits us better [10:56:30] summoning Cthulhu can be useful in some scenarios [10:56:56] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:00] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/20999/cloudelastic1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/573515 (owner: Giuseppe Lavagetto) [10:57:20] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2016 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:59:39] !log upgrading scap in eqiad and codfw - T245530 [10:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:43] T245530: Deploy scap 3.13.0-1 - https://phabricator.wikimedia.org/T245530 [10:59:58] (PS1) Filippo Giunchedi: Revert "netbox: log to logging pipeline" [puppet] - https://gerrit.wikimedia.org/r/574407 [11:00:16] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2028 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:00:20] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:24] (CR) Filippo Giunchedi: [C: +1] Revert "netbox: log to logging pipeline" [puppet] - https://gerrit.wikimedia.org/r/574407 (owner: Filippo Giunchedi) [11:01:52] (CR) Filippo Giunchedi: [V: +2 C: +2] Revert "netbox: log to logging pipeline" [puppet] - https://gerrit.wikimedia.org/r/574407 (owner: Filippo Giunchedi) [11:02:45] !log Move labsdb1009, labsdb1011 and labsdb1012 (labsdb1010 is currently delayed, will be done later) to replicate under codfw for a few days while we alter wb_terms on db1087 - T232446 [11:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:49] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [11:03:07] jouncebot next [11:03:08] In 0 hour(s) and 26 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200224T1130) [11:03:41] Amir1 keep in mind that scap was just updated [11:04:06] PROBLEM - Check systemd state on ms-be2046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:11] effie: ooh nice, the static array for i18n cache is coming \o/ [11:04:48] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2046 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:06:42] !log restarted ferm on ms-be2046 [11:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:52] (CR) Muehlenhoff: "One comment inline, otherwise looks good to merge!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) (owner: Jbond) [11:06:56] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2046 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:08:00] Operations, ops-eqiad, serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (jijiki) [11:08:05] (CR) Muehlenhoff: [C: +1] "Looks good" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/573991 (owner: Jbond) [11:08:22] RECOVERY - Check systemd state on ms-be2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:26] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1003/21001/ is a fleet-wide pcc run." [puppet] - https://gerrit.wikimedia.org/r/573516 (owner: Giuseppe Lavagetto) [11:10:36] (PS1) Jbond: admin: add CI checks to ensure users and group shave the correct gid/uid [puppet] - https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [11:13:15] (CR) jerkins-bot: [V: -1] admin: add CI checks to ensure users and group shave the correct gid/uid [puppet] - https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) (owner: Jbond) [11:13:36] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/574172 (owner: Andrew Bogott) [11:13:47] (PS1) Vgutierrez: Release 1.15.8 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/574414 (https://phabricator.wikimedia.org/T245984) [11:15:25] (PS12) Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [11:16:32] (PS2) Jbond: admin: add CI checks to ensure users and group shave the
correct gid/uid [puppet] - https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [11:19:56] (CR) jerkins-bot: [V: -1] admin: add CI checks to ensure users and group shave the correct gid/uid [puppet] - https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) (owner: Jbond) [11:21:51] (CR) Vgutierrez: [C: +2] Release 1.15.8 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/574414 (https://phabricator.wikimedia.org/T245984) (owner: Vgutierrez) [11:27:09] !log upload pybal 1.15.8 to apt.wm.o (buster) - T245984 [11:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:13] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [11:30:04] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200224T1130). [11:31:57] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/574420 (https://phabricator.wikimedia.org/T128546) [11:34:54] (PS1) Vgutierrez: install_server: Reimage pybal-test2001 as buster [puppet] - https://gerrit.wikimedia.org/r/574421 (https://phabricator.wikimedia.org/T224570) [11:36:34] (CR) Vgutierrez: [C: +2] install_server: Reimage pybal-test2001 as buster [puppet] - https://gerrit.wikimedia.org/r/574421 (https://phabricator.wikimedia.org/T224570) (owner: Vgutierrez) [11:39:48] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/574420 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [11:41:25] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/574420 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [11:45:38] !log jdrewniak@deploy1001 Synchronized
portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:574398| Bumping portals to master (563985)]] (duration: 00m 57s) [11:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:40] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:574398| Bumping portals to master (563985)]] (duration: 00m 55s) [11:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:16] Well nuts, I'm getting some scap errors, is this a known thing? https://www.irccloud.com/pastebin/cn42vPU1/ [11:53:04] huh [11:53:10] <_joe_> jan_drewniak: no. [11:53:15] SAL doesn’t have any mwdebug2001-specific messages since mid-December [11:53:37] <_joe_> oh I know what's going on [11:53:41] <_joe_> effie: ^^ [11:53:49] <_joe_> please reenable puppet on that host [11:54:07] urm but in mid-December [11:54:12] that host had puppet rinning [11:54:15] rtunning [11:54:16] <_joe_> you can't leave puppet disabled for a week, or the ssh host key will be removed [11:54:25] <_joe_> ok, the problem is now, not in mid december [11:54:36] I wil reenable it sure [11:55:32] I did it last monday as well for the same reasons, forgot it this morning though [11:56:03] (PS1) Ayounsi: Juniper report: only log warning if S/N missing from Netbox [software/netbox-extras] - https://gerrit.wikimedia.org/r/574425 (https://phabricator.wikimedia.org/T213843) [11:56:42] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/21000/" [puppet] - https://gerrit.wikimedia.org/r/574404 (owner: Muehlenhoff) [11:59:02] <_joe_> jan_drewniak: we'll run scap pull on that host, don't worry and sorry for the scare [11:59:47] (PS3) Jbond: admin: add CI checks to ensure users and group shave the correct gid/uid [puppet] - https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [11:59:48] _joe_: ok cool! no problem, and thanks for the quick response btw!
[11:59:59] I will run it [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200224T1200). [12:00:04] kart_ and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:12] o/ [12:00:16] (CR) Volans: "I've mainly reviewed the python bits. Logic seems sane, few nits inline." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: Vgutierrez) [12:00:17] o/ [12:00:46] * kart_ is here [12:01:01] jan_drewniak: done [12:01:19] effie: we can go ahead with SWAT, right? [12:01:26] yeah yeah [12:01:30] cool. [12:01:36] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/21001/" [puppet] - https://gerrit.wikimedia.org/r/573516 (owner: Giuseppe Lavagetto) [12:02:35] (PS1) Volans: swift: optimize ferm rules [puppet] - https://gerrit.wikimedia.org/r/574426 [12:02:38] !log reimage pybal-test2001 as buster - T224570 T245984 [12:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:44] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [12:02:44] T224570: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 [12:03:12] (PS2) KartikMistry: Adjust MT threshold for Telugu to 70% [mediawiki-config] - https://gerrit.wikimedia.org/r/574265 (https://phabricator.wikimedia.org/T244769) [12:04:03] (PS2) Volans: swift: optimize ferm rules [puppet] - https://gerrit.wikimedia.org/r/574426 [12:04:53] (CR) KartikMistry: [C: +2] "SWAT."
[mediawiki-config] - https://gerrit.wikimedia.org/r/574265 (https://phabricator.wikimedia.org/T244769) (owner: KartikMistry) [12:05:28] !log re-enable deactivated BGP sessions from ulsfo to office - T239893 [12:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:32] T239893: BGP peering sessions with corp partially down in ulsfo - https://phabricator.wikimedia.org/T239893 [12:06:38] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/21004/ the change is a noop right now." [puppet] - https://gerrit.wikimedia.org/r/574406 (owner: Giuseppe Lavagetto) [12:07:02] Operations, netops: BGP peering sessions with corp partially down in ulsfo - https://phabricator.wikimedia.org/T239893 (ayounsi) Open→Resolved a:ayounsi All BGP are now re-enabled. [12:07:04] (CR) Volans: "Compiler results available here:" [puppet] - https://gerrit.wikimedia.org/r/574426 (owner: Volans) [12:07:07] (PS13) Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [12:07:16] (Merged) jenkins-bot: Adjust MT threshold for Telugu to 70% [mediawiki-config] - https://gerrit.wikimedia.org/r/574265 (https://phabricator.wikimedia.org/T244769) (owner: KartikMistry) [12:08:25] (PS7) Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) [12:10:47] * Urbanecm backported https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/574431, ping me once I can update interwiki cache [12:11:53] * tassu would like to SWAT https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/573965 in 10 mins if possible [12:12:22] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|574265|CX: Adjust MT threshold for Telugu WP to
70% (T244769)]] (duration: 00m 56s) [12:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:26] T244769: Adjust the threshold for Telugu to prevent publishing when overall unmodified content is higher than 70% - https://phabricator.wikimedia.org/T244769 [12:12:45] effie: I had similar failures as jan_drewniak had.. [12:13:10] 12:12:15 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mwdebug2001.codfw.wmnet returned [255]: Host key verification failed. [12:13:30] <_joe_> kart_: no one ran puppet on deploy1001, doing it [12:13:34] shoot, puppet hadn't run yet on deploy1001 [12:13:41] _joe_: I'll do it [12:13:45] <_joe_> effie: already done [12:13:56] _joe_: thanks. Anything needed from my side? [12:14:26] <_joe_> kart_: just confirm to effie things are now ok when you scap next [12:14:50] OK. It is now Amir1's turn to run SWAT.. [12:14:55] Amir1: go ahead.. [12:15:05] Thanks! [12:15:08] <_joe_> you may proceed [12:15:26] _joe_ effie this is the memcached thing, do you want me to deploy something simpler first? [12:16:02] do so, but puppet run is done, it should be ok [12:16:17] ah, also, can I redeploy the portals after SWAT? they have a custom script that updates a series of urls instead of just scap sync. [12:17:11] okay then, let's go with memcached [12:19:20] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 106.8 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [12:19:39] Amir1: Can I still fit https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/573965 into swat? 
[12:19:44] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/WikimediaMaintenance/dumpInterwiki.php: dumpInterwiki: Respect comments in dblists (T244906) (duration: 00m 56s) [12:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:48] T244906: Updating interwiki cache adds invalid entries to the cache - https://phabricator.wikimedia.org/T244906 [12:19:50] (PS2) Majavah: Disallow crats to (un)assign flow-bot group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/573965 (https://phabricator.wikimedia.org/T245716) [12:20:20] tassu: let's see how complicated it become [12:20:21] Operations, SRE-Access-Requests, User-RhinosF1: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (jbond) [12:20:24] *becomes [12:20:50] (CR) Jbond: "recheck" [puppet] - https://gerrit.wikimedia.org/r/574240 (https://phabricator.wikimedia.org/T244785) (owner: RhinosF1) [12:21:22] mwdebug1001 looks happy with the memcached change and no errors in logs [12:21:24] proceeding [12:23:19] (CR) jerkins-bot: [V: -1] Admin: Add jmorgan to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/574240 (https://phabricator.wikimedia.org/T244785) (owner: RhinosF1) [12:23:53] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/Wikibase/client/includes: SWAT: [[gerrit:574391|Use formatter cache in client LUA label lookups (T245740)]] (duration: 00m 56s) [12:23:55] (PS4) Jbond: Admin: Add jmorgan to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/574240 (https://phabricator.wikimedia.org/T244785) (owner: RhinosF1) [12:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:58] T245740: LUA getLabelWithLang calls result in many db queries to terms storage with spikey and unpredictable patterns - https://phabricator.wikimedia.org/T245740 [12:24:59] Operations, Pybal, Traffic:
Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (Vgutierrez) Open→Resolved a:Vgutierrez [12:25:22] Operations, Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez) [12:25:24] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Vgutierrez) [12:25:30] (PS3) Ladsgroup: Add definitions for redirect badges [mediawiki-config] - https://gerrit.wikimedia.org/r/571738 (https://phabricator.wikimedia.org/T235420) (owner: Itamar Givon) [12:25:35] (CR) Ladsgroup: [C: +2] Add definitions for redirect badges [mediawiki-config] - https://gerrit.wikimedia.org/r/571738 (https://phabricator.wikimedia.org/T235420) (owner: Itamar Givon) [12:25:55] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff) [12:27:02] the requests to memcached got 4x bigger, now at 200k per minute [12:27:13] (Merged) jenkins-bot: Add definitions for redirect badges [mediawiki-config] - https://gerrit.wikimedia.org/r/571738 (https://phabricator.wikimedia.org/T235420) (owner: Itamar Givon) [12:27:46] (CR) Jbond: [C: +2] Admin: Add jmorgan to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/574240 (https://phabricator.wikimedia.org/T244785) (owner: RhinosF1) [12:28:05] (PS3) Majavah: Disallow crats to (un)assign flow-bot group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/573965 (https://phabricator.wikimedia.org/T245716) [12:28:10] the hit ratio dropped from 95% to 90% but it's recovering [12:28:52] Operations, netops, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (aborrero) NOTE: apparently the neutron BGP implementation doesn't support ingesting routes using BGP, only advertising. In our neutron setup, the defaul...
[12:30:24] Operations, SRE-Access-Requests, Patch-For-Review, User-RhinosF1: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (jbond) Open→Resolved a:jbond @Capt_Swing i have now added you to the analytics-privatedata-users group. please allow up-to 30 minut... [12:35:19] it's moving around 92% right now _joe_ [12:35:55] * tassu afk 10min [12:36:50] <_joe_> Amir1: it was expected it would go down for a bit, right? [12:36:55] yup [12:37:10] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:571738|Add definitions for redirect badges (T235420)]] (duration: 00m 56s) [12:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:14] I don't see any changes for traffic though [12:37:14] T235420: Create wikidata badges to indicate when sitelinks point to Wikipedia redirect pages - https://phabricator.wikimedia.org/T235420 [12:37:26] <_joe_> indeed [12:38:06] Strangely I can't find the cache key in https://grafana.wikimedia.org/d/2Zx07tGZz/wanobjectcache?orgId=1&from=now-1h&to=now or the WAN group to monitor [12:38:23] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:571738|Add definitions for redirect badges (T235420)]], take II, the cache issue (duration: 00m 56s) [12:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:42] (CR) Muehlenhoff: [C: +1] "Looks good to me, +1 to run this in deployment-prep first; to rule out odd ferm syntax surprises." [puppet] - https://gerrit.wikimedia.org/r/574426 (owner: Volans) [12:40:55] tassu: let me know when you're back [12:43:10] Amir1: let me know when you're done swat, I want to redeploy the portals after (because of earlier errors).
[12:43:29] jan_drewniak: sure, I'm waiting for someone, you can go ahead for now [12:46:44] ok cool [12:47:05] Operations, netops, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (ayounsi) For the routers to cloudnet hosts traffic, we should only establish the BGP sessions over the transport network, doing it over the hosts vlan w... [12:47:49] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:574398| Bumping portals to master (563985)]] (duration: 00m 56s) [12:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:46] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:574398| Bumping portals to master (563985)]] (duration: 00m 56s) [12:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:39] Amir1: back, sorry for the delay [12:49:45] no worries [12:49:54] can you test it? [12:50:10] sure [12:50:27] (CR) Ladsgroup: "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/573965 (https://phabricator.wikimedia.org/T245716) (owner: Majavah) [12:53:00] Amir1: did you mean giving +2 instead of leaving a comment?
[12:53:16] (CR) Ladsgroup: [C: +2] Disallow crats to (un)assign flow-bot group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/573965 (https://phabricator.wikimedia.org/T245716) (owner: Majavah) [12:53:22] tassu: yup, sorry [12:54:12] (CR) Volans: [C: +1] "LGTM for now" [software/netbox-extras] - https://gerrit.wikimedia.org/r/574425 (https://phabricator.wikimedia.org/T213843) (owner: Ayounsi) [12:54:15] (Merged) jenkins-bot: Disallow crats to (un)assign flow-bot group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/573965 (https://phabricator.wikimedia.org/T245716) (owner: Majavah) [12:55:23] Operations, Security-Team, User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (MoritzMuehlenhoff) >>! In T244792#5901231, @HMarcus wrote: > As far as next steps go, you mentioned that this will be a judgment call at this... [12:57:23] tassu: it's live in mwdebug1001 [12:57:32] let me know when it works and I proceed [12:57:54] (CR) Jbond: [C: +1] "lgtm, optional comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/574404 (owner: Muehlenhoff) [12:58:08] https://en.wikipedia.org/wiki/Special:ListGroupRights looks correct so I believe it's working [12:58:15] so sure, go ahead [12:58:37] (PS4) Jbond: admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [12:58:52] (PS1) Urbanecm: Update interwiki-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/574447 [12:59:31] (CR) Muehlenhoff: Enable CAS authentication for tendril/dbmonitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/574404 (owner: Muehlenhoff) [12:59:46] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:573965|Disallow crats to (un)assign flow-bot group on enwiki
(T245716)]] (duration: 00m 56s) [12:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:50] T245716: InitialiseSettings: don't let enwiki crats add/remove flow-bot - https://phabricator.wikimedia.org/T245716 [13:00:12] Urbanecm: it looks correct to me [13:00:45] Amir1: yup, seems the fix works as intended :-). If you're done, may I update prod cache now? [13:01:08] Urbanecm: almost done, running the second sync [13:01:15] okay :-) [13:01:21] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:573965|Disallow crats to (un)assign flow-bot group on enwiki (T245716)]] (duration: 00m 56s) [13:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:41] (CR) Jcrespo: Enable CAS authentication for tendril/dbmonitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/574404 (owner: Muehlenhoff) [13:01:56] (CR) jerkins-bot: [V: -1] admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) (owner: Jbond) [13:01:57] Urbanecm: done now [13:02:02] thanks! [13:02:05] (CR) Jcrespo: [C: -1] Enable CAS authentication for tendril/dbmonitor [puppet] - https://gerrit.wikimedia.org/r/574404 (owner: Muehlenhoff) [13:02:07] Amir1: thanks a lot!
[13:02:15] (03PS14) 10Effie Mouzeli: mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [13:03:31] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574449 [13:03:33] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574449 (owner: 10Urbanecm) [13:04:34] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574449 (owner: 10Urbanecm) [13:05:19] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [13:05:39] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 18s) [13:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:17] (03PS5) 10Jbond: admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [13:06:18] (03PS1) 10Krinkle: [BETA] Remove $wgDebugTimestamps override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574450 [13:07:17] (03PS3) 10Muehlenhoff: Enable CAS authentication for tendril/dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/574404 [13:07:57] (03CR) 10Muehlenhoff: Enable CAS authentication for tendril/dbmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574404 (owner: 10Muehlenhoff) [13:08:03] (03CR) 10Jcrespo: [C: 03+1] Enable CAS authentication for tendril/dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/574404 (owner: 10Muehlenhoff) [13:08:27] (03CR) 10Urbanecm: [C: 03+2] "noop, beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574447 (owner: 10Urbanecm) [13:08:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/574404 (owner: 10Muehlenhoff) [13:09:00] (03PS1) 10Arturo Borrero Gonzalez: codfw: cloudnet: allocate addresses in the cloud transport network [dns] - 10https://gerrit.wikimedia.org/r/574452 (https://phabricator.wikimedia.org/T245606) [13:09:12] (03PS15) 10Effie Mouzeli: mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [13:09:35] (03Merged) 10jenkins-bot: Update interwiki-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574447 (owner: 10Urbanecm) [13:09:57] Amir1: seems both caches are good [13:10:01] nice [13:10:17] then I start slowly deploying reads on the new term store [13:10:26] :) [13:10:42] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10jbond) p:05Triage→03Medium [13:11:22] 10Operations, 10Release-Engineering-Team, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10jbond) p:05Triage→03Medium [13:11:32] (03PS8) 10Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) [13:11:55] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (10jbond) p:05Triage→03Medium [13:12:12] 10Operations, 10Core Platform Team, 10DC-Ops, 10serviceops: Rename wtp* servers to parsoid* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (10jbond) p:05Triage→03Medium [13:12:44] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: Degraded RAID on analytics1044 - https://phabricator.wikimedia.org/T245910 (10jbond) p:05Triage→03Medium [13:12:56] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to 
Q30K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574454 (https://phabricator.wikimedia.org/T219123) [13:14:07] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q30K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574454 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [13:15:12] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q30K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574454 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [13:17:54] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q30K (T219123)]] (duration: 00m 56s) [13:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:58] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [13:18:55] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q30K (T219123)]], take II (duration: 00m 56s) [13:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:55] (03CR) 10Effie Mouzeli: "Thanks to Elukey and Jbond, this patch finally works! https://puppet-compiler.wmflabs.org/compiler1001/21007/. 
You can compare the resulti" [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [13:24:18] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to Q60K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574455 (https://phabricator.wikimedia.org/T219123) [13:26:06] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q60K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574455 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [13:27:16] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q60K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574455 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [13:28:05] (03CR) 10Jbond: "lgtm, just some nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [13:28:44] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q60K (T219123)]] (duration: 00m 56s) [13:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:48] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [13:30:24] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q60K (T219123)]], take II (duration: 00m 56s) [13:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:15] (03CR) 10Ayounsi: [C: 03+1] codfw: cloudnet: allocate addresses in the cloud transport network [dns] - 10https://gerrit.wikimedia.org/r/574452 (https://phabricator.wikimedia.org/T245606) (owner: 10Arturo Borrero Gonzalez) [13:42:49] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to 
Q120K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574457 (https://phabricator.wikimedia.org/T219123) [13:44:30] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q120K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574457 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [13:45:27] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q120K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574457 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [13:48:00] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks for looking into this!" [puppet] - 10https://gerrit.wikimedia.org/r/574426 (owner: 10Volans) [13:48:29] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q120K (T219123)]] (duration: 00m 56s) [13:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:34] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [13:51:12] (03CR) 10Ayounsi: [C: 03+1] ripe atlas alerts: allow more ipv6 failures [puppet] - 10https://gerrit.wikimedia.org/r/574145 (owner: 10CDanis) [13:51:38] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q120K (T219123)]], take II (duration: 00m 55s) [13:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:07] _joe_: we just had a spike of 1M/min lua requests, can you take a look if that affected memcached? it was around 13:30 UTC [13:53:22] phab is down for me? :( [13:53:53] I'm about to revert a patch that removes the APCu cache [13:54:07] I can't ping dyna.wikimedia.org at all [13:54:09] addshore: same here with grafana [13:54:19] network issues? 
[13:54:23] yup, i cant ping dyna.wikimedia.org [13:55:02] traceroute for me: https://paste.ubuntu.com/p/ymqhBjPPWB/ [13:55:06] Works for me (EU). [13:55:10] can’t ping text-lb.esams.wikimedia.org either (IPv6) [13:55:26] though pinging dyna.wikimedia.org fails [13:55:45] PROBLEM - Host cr2-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:55:47] gah, and of course I can’t open the “reporting network issues” pages on mw.org >.< [13:56:31] oh, seems to be back [13:56:45] same [13:56:51] not for me [13:57:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:57:42] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:58:23] ^^ [13:58:36] PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:58:47] we're on it [13:58:49] the SRE team is aware, people already working on it [13:58:58] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 34.03 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:59:56] PROBLEM - Host cr2-esams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::3) [14:01:32] I actually get always timeouts when trying to access dewiki or arbcom-dewiki. confirmed from a second user, but works for another one [14:01:39] anything known yet? 
[14:01:54] (03PS1) 10Giuseppe Lavagetto: Depool esams (hardware troubles) [dns] - 10https://gerrit.wikimedia.org/r/574462 [14:01:58] Sagan: SRE's on it [14:02:18] <_joe_> anyone ^^ +1s welcome [14:02:18] ok, thx [14:02:28] (03CR) 10BBlack: [C: 03+1] Depool esams (hardware troubles) [dns] - 10https://gerrit.wikimedia.org/r/574462 (owner: 10Giuseppe Lavagetto) [14:02:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Depool esams (hardware troubles) [dns] - 10https://gerrit.wikimedia.org/r/574462 (owner: 10Giuseppe Lavagetto) [14:03:26] <_joe_> !log depooling esams (authdns-update) [14:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:40] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 448 probes of 604 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:04:43] sagan, tassu and others could you confirm issues gone after retrying? 
[14:04:59] looking [14:05:18] <_joe_> jynus: it will need some time before the dns change propagates [14:05:21] I know [14:05:32] jynus: I still can't load anything [14:05:32] that is why I am asking for independent confirmation [14:05:37] the ttl is ten minutes [14:05:40] no, not yet [14:05:41] Urbanecm: ok, it may take some time [14:05:53] ttl of dns cache [14:06:03] nope [14:06:49] I did dnsflush on my machine, now it works [14:07:21] <_joe_> Sagan: good :) [14:07:44] same here [14:07:58] okay [14:08:00] traffic throughput going up, but still not at 100 [14:08:02] % [14:09:13] working [14:24:37] (03CR) 10Andrew Bogott: [C: 03+2] designatemakedomain: make region aware [puppet] - 10https://gerrit.wikimedia.org/r/574172 (owner: 10Andrew Bogott) [14:24:56] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: make region-aware [puppet] - 10https://gerrit.wikimedia.org/r/574176 (owner: 10Andrew Bogott) [14:26:09] (03PS1) 10Filippo Giunchedi: logstash: write kafka inputs to es by default [puppet] - 10https://gerrit.wikimedia.org/r/574467 (https://phabricator.wikimedia.org/T227080) [14:27:58] (03PS1) 10KartikMistry: Enable CX out of beta in eu, sw and ta WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574469 (https://phabricator.wikimedia.org/T245446) [14:29:32] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to Q256K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574470 (https://phabricator.wikimedia.org/T219123) [14:30:05] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (No Need By Date) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10herron) Hi @wiki_willy, do you know what the ETA is for these hosts? These are needed for okr work this quarter, and are related to the hosts in T... [14:31:30] (03CR) 10Herron: [C: 03+1] "Good call! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/574467 (https://phabricator.wikimedia.org/T227080) (owner: 10Filippo Giunchedi) [14:34:19] (03CR) 10Jhedden: [C: 03+1] cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [14:36:20] (03CR) 10Jhedden: [C: 03+2] haproxy: update systemd service for buster [puppet] - 10https://gerrit.wikimedia.org/r/574063 (https://phabricator.wikimedia.org/T236606) (owner: 10Jhedden) [14:38:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:40:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:42:33] !log Compress innodb on wb_terms on db1087 - T232446 [14:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:39] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [14:43:06] !log andrew@deploy1001 Started deploy [horizon/deploy@a8f2ea9]: modest css change for the hiera editing dialog [14:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:18] !log andrew@deploy1001 Finished deploy [horizon/deploy@a8f2ea9]: modest css change for the hiera editing dialog (duration: 00m 12s) [14:43:18] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:11] !log andrew@deploy1001 Started deploy [horizon/deploy@dab0ca0]: modest css change for the hiera editing dialog (take two -- I consistently forget to rebase before doing this) [14:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:44] !log andrew@deploy1001 Finished deploy [horizon/deploy@dab0ca0]: modest css change for the hiera editing dialog (take two -- I consistently forget to rebase before doing this) (duration: 03m 33s) [14:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:32] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:51:01] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q256K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574470 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [14:52:07] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q256K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574470 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [14:52:36] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:54:01] Krinkle: the new scap is in production, do you want me to run a full scap? 
[14:54:47] (03PS1) 10Filippo Giunchedi: scap: enable logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/574485 [14:55:25] 10Operations, 10netops: Prefering AS13030 instead of AS13335 - https://phabricator.wikimedia.org/T245998 (10jbond) p:05Triage→03Medium [14:55:45] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q256K (T219123)]] (duration: 00m 57s) [14:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:54] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [14:56:03] 10Operations, 10netops: Prefering AS13030 instead of AS13335 - https://phabricator.wikimedia.org/T245998 (10jbond) [14:56:51] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q256K (T219123)]], take II (duration: 00m 55s) [14:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:06] !log read_only=0 on es1020 (es4) and es1023 (es5) - unused new external store masters - T245806 [14:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:13] T245806: Create es4 and es5 sections in dbctl - https://phabricator.wikimedia.org/T245806 [15:02:21] marostegui: btw look at this beautiful deployment: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1582544773679&to=1582549178783&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All [15:02:58] Amir1: wow! [15:03:08] Amir1: how come the rows written go that low and then recover? [15:03:15] writes should be the same? 
[15:03:20] that's coincidence [15:03:25] Ah ok :) [15:03:42] some bot stopping at the same time but rows read are not recovering [15:03:54] yeah yeah, that's beautiful [15:03:58] (03CR) 10CDanis: [C: 03+2] "PCC lgtm https://puppet-compiler.wmflabs.org/compiler1003/21010/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/574145 (owner: 10CDanis) [15:04:00] I was curious (and scared) for the writes [15:04:13] The part I like is that it absorbs spikes https://phabricator.wikimedia.org/T245740#5912036 [15:04:35] Amir1: I just asked the same questions to addshore [15:04:41] haha [15:04:46] similar spikes would have 5x the rows read [15:04:48] not as much as a coincidence but "unrelated ongoing issue" [15:04:54] aka lag control [15:05:24] although maybe the reduction on overhead was so big that made the other issue more visible [15:05:41] let me find the ticket for marostegui [15:05:51] let me see [15:05:58] !log Deploy schema change on db1086 (s7 master) with replication - T245925 [15:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:04] T245925: CentralAuth schema changes Feb 2020 - https://phabricator.wikimedia.org/T245925 [15:06:44] (03PS6) 10Giuseppe Lavagetto: profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) [15:07:18] marostegui: there isn't a single task, but T244722#5889241 and others on that comment [15:07:19] T244722: increase factor for query service that is taken into account for maxlag - https://phabricator.wikimedia.org/T244722 [15:07:58] jynus: thanks, I was aware of that task [15:07:59] probably T240442 is the master design problem [15:08:00] T240442: Design a continuous throttling policy for Wikidata bots - https://phabricator.wikimedia.org/T240442 [15:08:13] I confimed with addshore that was the cause [15:08:39] and I agree with the proposed solution there [15:08:44] * addshore reads up [15:09:11] 
addshore: nothing interesting, marostegui asked the same questions I did some time ago on -databases [15:09:12] so, i believe that pattern of write increases is something else [15:09:19] and I havn't pinned down that [15:09:27] addshore: not the maxlag? [15:09:49] the maxlag basically makes osscilate [15:10:03] *makes it oscillate [15:10:23] yeah, not saying it is related to the deployment [15:10:26] :-D [15:10:34] jynus: also not the maxlag [15:10:38] oh [15:10:58] https://usercontent.irccloud-cdn.com/file/Y03j3POH/image.png [15:11:06] addshore: oh it's not maxlag dropping? [15:11:12] ^^ that jump, im not sure what caused it, i have seen a similarish pattern in the past week [15:11:20] Amir1: no, ediut rate stayed constant during that time [15:11:25] addshore: parameters? [15:11:35] addshore: that can be an alter table i just started [15:11:39] the aggregated graphs are not dynamic [15:11:40] * addshore row wriutten jump >> https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1582554290240&to=1582556552740&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All [15:11:47] nice, another mystery [15:11:48] so it may include depooled hosts under maintenance [15:11:56] It is probably my alter, let me confirm [15:11:59] edits in that time https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&from=1582554290240&to=1582556552740 [15:12:10] you should check the master for write level [15:12:18] (03PS1) 10Muehlenhoff: Add profile::base::no_firewall [puppet] - 10https://gerrit.wikimedia.org/r/574491 (https://phabricator.wikimedia.org/T104939) [15:12:23] otherwise you may have false positives [15:12:35] we could introduce dynamic pooling an depooling for prometheus [15:12:47] IMO that would be nice [15:12:53] but that is a level of complexity we cannot handle at the moment [15:13:06] so it would be on the wishlist for now [15:13:07] addshore: it is the alter, confirmed [15:13:14] -rw-rw---- 1 mysql mysql 4.7K Feb 24 
14:42 #sql-3bb_217c34e.frm [15:13:15] ack! :) [15:13:16] times match [15:13:19] nice [15:13:24] note the graph is not wrong [15:13:33] it is a mysql-level graph, not mw service [15:14:58] but "log alerts but not inserts" is a level of granularity we cannot provide for now [15:15:02] *alters [15:16:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] swift: optimize ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/574426 (owner: 10Volans) [15:16:15] the workaround is to check writes on the pooled master [15:16:46] marostegui: is this because of your alter? https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&from=now-3h&to=now [15:16:52] or should I scream [15:17:06] definitely not [15:17:09] my alter is on db1087 [15:17:47] something is going on there [15:18:15] Lots of SELECT /* Wikibase\Lib\Store\Sql\Terms\PrefetchingItemTermLookup::prefetchTerms */ wbit_item_id,wb [15:18:50] could we reduce weight from db1126 [15:19:10] db1126 is the loadest (500) [15:19:14] yes [15:19:20] that is why I say to reduce it [15:19:24] most of them should get handled by the cache [15:19:28] it has close to 30K QPS [15:19:38] and db1087 is depooled [15:19:48] db1104 is also having an increase: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-12h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1104&var-port=9104 [15:20:08] it is peaking at 1200 ongoing connections [15:20:25] maybe we can decrease db1126 to 400 and increase db1101 from 50 to 100 on main [15:20:34] let me do that [15:20:38] just reduce the load on db1126 [15:20:45] shift it to the other servers [15:21:14] it will give us some time [15:21:29] (03PS2) 10Filippo Giunchedi: scap: enable logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) [15:21:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce weight for db1126, increase it a bit for db1101:3318', diff saved to 
https://phabricator.wikimedia.org/P10498 and previous config saved to /var/cache/conftool/dbconfig/20200224-152132-marostegui.json [15:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:49] that works too [15:22:01] we are only at Q256K items so far [15:22:10] should I stop going further? I think so [15:22:14] Amir1: yep [15:22:14] load went ok [15:22:19] for now [15:22:27] Amir1: Remember our conversation about how hard was to measure how much a host can handle? hehe [15:22:36] Amir1: we got the first warning! [15:22:38] 400 connections, much less [15:23:36] inserts and updates spiked [15:23:38] :( Can I look at something to see what I can improve? [15:23:41] tendril? [15:23:52] latency increased 5x [15:24:08] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&fullscreen&panelId=40&from=1582547044507&to=1582557844508 [15:24:13] Amir1: I was checking the queries, so far I haven't found anything terribly wrong [15:25:16] (03PS1) 10CDanis: esams-offline: route heavy EU bytes users to codfw [dns] - 10https://gerrit.wikimedia.org/r/574493 [15:26:31] cpu on db1126 spiked at 88%, something that rarely happens on mysql [15:26:55] load 50 on a 16 thread cpu [15:27:32] it got automatically depooled at 15:13 I think [15:27:39] for some time [15:28:38] should I scale back? [15:28:48] things are ok now [15:28:54] but we should keep monitoring [15:29:07] lets monitor closely, and investigate, and maybe scale back in a bit [15:29:23] 10Operations, 10DC-Ops, 10netops, 10Patch-For-Review: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10ayounsi) From JTAC: > To answer your question “For the list of devices/serials that are decommissioned and we don't own anymore, is there a process so they don't sho... 
[15:29:31] Amir1: I am seeing a huge query, don't know if it is normal [15:29:36] Amir1: let me send it to you [15:30:16] done [15:30:26] in particular https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=slave&fullscreen&panelId=11&from=1582536615469&to=1582558215469 [15:30:43] !log updated component/jdk8 to 8u242-b08-1~deb10u1 (forward port of latest Java 8 security update) [15:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:50] marostegui: can i have a copy too? :) [15:30:57] addshore: sorry!, of course [15:31:19] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:44] technically it's just a large batch, addshore I think we should split large group of entities to smaller batch sizes. What do you think? [15:32:01] sounds like it might be needed [15:32:12] but also we should look at it a bit closer before jumping to that conclusion :) [15:32:42] marostegui: I am checking https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=MariaDB+read+only+s8 and I am thinking, if the current state continues, we may want to rebalance the weights a bit [15:33:28] jynus: yeah, let's try to leave the dust settle for now [15:33:45] sure, just thinking mid-term [15:33:46] there's not much more room for changes there though [15:34:01] we can increase weight for the multiinstance a bit, after all they all now have the same schemas! 
[15:34:04] ^idp1001 is me [15:34:09] marostegui: indeed [15:34:43] marostegui: we could even depool those from other sections in an emergency and give them close-to-full weights [15:34:52] yeah [15:34:53] that too [15:34:58] lets wait for now [15:35:01] something needs terms of 9000 different entities, that's weird [15:35:21] Amir1: sounds like the sort of thing we could log? [15:35:41] Amir1: that looks like cpu increase to me [15:35:50] as in, a good candidate for the bump [15:38:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:38:36] db1126 latency went down again, but I would keep it in the current wait for some time [15:38:45] yeah, let's not touch it for now [15:39:05] DBPerformance channel would log it [15:39:36] Amir1: does that have stacktraces? [15:40:28] yup, this one seems unrelated: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-mediawiki-2020.02.24/mediawiki?id=AXB32jwDh3Uj6x1zH7RT&_g=h@44136fa [15:40:30] in fact, it went a bit up again [15:40:32] but an example [15:41:23] Amir1: yay stacktraces, lets do some digging in mattermost? [15:45:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:45:46] (03PS7) 10Giuseppe Lavagetto: mwdebug: enable envoy-based services proxy [puppet] - 10https://gerrit.wikimedia.org/r/572833 [15:46:12] (03Abandoned) 10SBassett: Revert "Also log authevents channel." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/572895 (owner: 10SBassett) [15:48:16] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (No Need By Date) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10wiki_willy) @herron - is there a specific date that you need these by? We can adjust our priorities and the need by date of this task to meet that... [15:49:47] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (No Need By Date) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10herron) >>! In T240881#5912321, @wiki_willy wrote: > @herron - is there a specific date that you need these by? We can adjust our priorities and t... [15:50:04] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) >>! In T243963#5881470, @wiki_willy wrote: > @Cmjohnson - looks like it's too late to do this one today, since they need 24hrs to depool. Chatted with @Marostegui briefly, so just le... [15:50:46] (03CR) 10RhinosF1: "> Patch Set 4: Code-Review+2" [puppet] - 10https://gerrit.wikimedia.org/r/574240 (https://phabricator.wikimedia.org/T244785) (owner: 10RhinosF1) [15:51:07] Amir1, addshore: wikidata latency p50 seems controlled, but p99 seems high (aka no outage but something to check performance-wise) https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&fullscreen&panelId=40&from=1582548658209&to=1582559458209 [15:52:10] I think we should scale back for now [15:52:19] addshore: what do you think? [15:52:36] just one step, from Q256 -> Q120 [15:52:37] Is it getting better? [15:52:43] Amir1: yes lets half it [15:52:50] while we investigate further [15:52:54] James_F: p99 goes up an down, p55 is stable at normal leves [15:53:01] *p50 [15:53:10] * James_F nods. [15:53:16] Driven by user traffic a fair bit, I imagine. 
[15:53:41] (03PS1) 10Ladsgroup: Revert "Increase the reads for term store for clients for up to Q256K" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574498 [15:53:55] so we are ok mysql-level, leaving to you any change based on user performance desired level [15:54:19] I think there's one or two use case that trigger a pain point [15:54:29] (03CR) 10Ladsgroup: [C: 03+2] Revert "Increase the reads for term store for clients for up to Q256K" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574498 (owner: 10Ladsgroup) [15:55:28] (03Merged) 10jenkins-bot: Revert "Increase the reads for term store for clients for up to Q256K" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574498 (owner: 10Ladsgroup) [15:57:40] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert: [[gerrit:574454|Increase the reads for term store for clients for up to Q256K (T219123)]] (duration: 00m 56s) [15:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:47] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [15:59:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: enable envoy-based services proxy [puppet] - 10https://gerrit.wikimedia.org/r/572833 (owner: 10Giuseppe Lavagetto) [16:00:59] (03PS1) 10Muehlenhoff: Enabled CAS endpoint for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/574499 [16:01:34] (03PS2) 10Muehlenhoff: Enable CAS endpoint for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/574499 [16:02:55] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert: [[gerrit:574454|Increase the reads for term store for clients for up to Q256K (T219123)]], take II (duration: 00m 56s) [16:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:02] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [16:03:14] 10Operations: requesting additional 
production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (RhinosF1) Was this resolved? [16:03:43] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:04:02] load went down to not-saturated levels: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1126&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&fullscreen&panelId=9&from=1582549428384&to=1582560228384 [16:04:17] Operations, ops-eqiad, Wikimedia-Logstash: (No Need By Date) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (wiki_willy) @herron - I'll add next week on as the target date. We're in the middle of a couple other installs, but we can try prioritize this one... [16:04:36] Operations, ops-eqiad, Wikimedia-Logstash: (Need By March 6) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (wiki_willy) [16:04:38] Operations, ops-eqiad, Wikimedia-Logstash: (Need By March 6) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (herron) Thank you! 
[16:04:46] jynus: I go eat something, I think we are narrowing down what's going on [16:05:01] (03PS1) 10Giuseppe Lavagetto: services_proxy::envoy: fix yaml indentation [puppet] - 10https://gerrit.wikimedia.org/r/574500 [16:05:24] (03CR) 10jerkins-bot: [V: 04-1] services_proxy::envoy: fix yaml indentation [puppet] - 10https://gerrit.wikimedia.org/r/574500 (owner: 10Giuseppe Lavagetto) [16:10:14] (03PS1) 10Andrew Bogott: designate.conf: change logging rules from False to false [puppet] - 10https://gerrit.wikimedia.org/r/574501 [16:10:44] jynus: marostegui for now the ticket I filed is https://phabricator.wikimedia.org/T246005 [16:11:00] !log reloading ferm on ms-be2043 DNS query timed out [16:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:04] a client api module is selecting 15k rows, and then returning 50 descriptions.. [16:11:23] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/21013/" [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff) [16:11:43] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:05] (03PS2) 10Giuseppe Lavagetto: services_proxy::envoy: fix yaml indentation [puppet] - 10https://gerrit.wikimedia.org/r/574500 [16:12:27] (03CR) 10Ottomata: [C: 03+1] role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - 10https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:13:35] (03CR) 10Andrew Bogott: [C: 03+2] designate.conf: change logging rules from False to false [puppet] - 10https://gerrit.wikimedia.org/r/574501 (owner: 10Andrew Bogott) [16:14:29] (03CR) 10CRusnov: [C: 03+1] "Looks good" [software/spicerack] - 10https://gerrit.wikimedia.org/r/571997 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [16:15:56] (03PS2) 10Arturo Borrero Gonzalez: codfw: cloudnet: allocate addresses in 
the cloud transport network [dns] - 10https://gerrit.wikimedia.org/r/574452 (https://phabricator.wikimedia.org/T245606) [16:16:20] !log reloading ferm on ms-be2028 DNS query timed out [16:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy::envoy: fix yaml indentation [puppet] - 10https://gerrit.wikimedia.org/r/574500 (owner: 10Giuseppe Lavagetto) [16:18:11] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2028 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:18:11] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:13] !log tools.zppixbot changing logging_level from DEBUG --> INFO to avoid pointless logspam [16:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:44] * RhinosF1 reverts and joins the actual channel [16:24:26] (03PS1) 10Giuseppe Lavagetto: services_proxy::envoy: fix the retry policy entry in the listener [puppet] - 10https://gerrit.wikimedia.org/r/574502 [16:25:13] (03PS3) 10Volans: ganeti: absent to the makevm script [puppet] - 10https://gerrit.wikimedia.org/r/565232 (owner: 10Alexandros Kosiaris) [16:26:34] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] services_proxy::envoy: fix the retry policy entry in the listener [puppet] - 10https://gerrit.wikimedia.org/r/574502 (owner: 10Giuseppe Lavagetto) [16:26:51] <_joe_> puppet is broken, I have no time for CI [16:27:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:27:23] <_joe_> godog: what does this ^^ mean? 
[16:29:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:30:17] _joe_: can I merge a puppet patch or better to wait? [16:30:30] <_joe_> volans: go on [16:30:33] thx [16:30:44] (03CR) 10Volans: [C: 03+2] ganeti: absent to the makevm script [puppet] - 10https://gerrit.wikimedia.org/r/565232 (owner: 10Alexandros Kosiaris) [16:33:00] Amir1: I'd rather wait for the train to do the full scap so that more people are around [16:34:05] volans: noooo I use makevm! [16:34:07] * elukey runs away [16:34:22] elukey: lol, good one :D [16:37:23] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: Unable to access SWAP notebooks using LDAP - https://phabricator.wikimedia.org/T245997 (10Ottomata) [16:40:41] Eurgh, more class@anonymous over-run errors from the Wikibase cache infrastructure. :-( [16:41:03] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:48] 10Operations, 10netops: Prefering AS13030 instead of AS13335 - https://phabricator.wikimedia.org/T245998 (10ayounsi) 05Open→03Declined Indeed, and it can be for many reasons (cost, list saturation, etc...). As long as it's a minority and not the norm it's not an issue. Thanks for reporting it though, it's... 
[16:44:32] (03PS8) 10Krinkle: Use GTIDs for "wait for replica" barriers for external DB clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [16:45:54] (03PS3) 10Jforrester: Beta sessionstore: don't use TLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574034 (https://phabricator.wikimedia.org/T224712) (owner: 10Ppchelko) [16:46:23] (03CR) 10Jforrester: [C: 03+2] Beta sessionstore: don't use TLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574034 (https://phabricator.wikimedia.org/T224712) (owner: 10Ppchelko) [16:47:20] (03Merged) 10jenkins-bot: Beta sessionstore: don't use TLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574034 (https://phabricator.wikimedia.org/T224712) (owner: 10Ppchelko) [16:50:15] _joe_: which part specifically ? [16:50:37] <_joe_> godog: the prometheus reduced availability alert was on citoid [16:50:48] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Degraded RAID on analytics1044 - https://phabricator.wikimedia.org/T245910 (10Milimetric) a:03elukey [16:51:00] <_joe_> does that mean just that we had some pods not responding to the swagger check correctly? [16:53:23] _joe_: from the dashboard it looks like the checker didn't answer (in time), although ATM I'm not 100% clear if it was the checker itself or the check [16:53:38] Krinkle sure [16:53:57] James_F: what are they? Can you point out to a ticket? [16:54:21] I'm on my way [16:54:39] also cc shdubsh re: the swagger checker above ^ [16:57:03] Amir1: Occasional "Cache key contains characters that are not allowed" errors (looks like there's an extra \n somehow in the key), but the main irritation is that it's over-writing the logstash error field somehow, like last week. [16:57:40] Not a UBN like T245062 last week, but irritating. 
[16:57:44] T245062: Double ?uselang= passed to a file results in "Cache key contains characters that are not allowed: `P180_1101514477_fr?uselang=fr_label`" - https://phabricator.wikimedia.org/T245062 [16:58:15] I'm sure we fixed it [16:58:23] You fixed the double-uselang one. [16:58:33] But did you trim() the key before send it? Etc. [16:58:38] godog _joe_: citoid is timing out Feb 24 16:51:51 prometheus1003 prometheus-swagger-exporter[32360]: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='citoid.svc.eqiad.wmnet', port=1970): Read timed out. (read timeout=15.0)",)': [16:58:41] /api?search=http%3A%2F%2Fexample.com%2Fthisurldoesntexist&format=mediawiki [16:58:57] <_joe_> oh heh, all requests or just that one? [16:59:15] <_joe_> because the lvs checks seem to succeed [16:59:31] It should not anything that's not in the list of languages. This has an underlying problem I should check [16:59:40] seems to be in intermittent problem [17:01:31] quick napkin math from prometheus logs indicate 6% of citoid requests are timing out [17:01:40] (03Abandoned) 10EBernhardson: spark-env.sh: Allow overriding python version detection [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/562651 (owner: 10EBernhardson) [17:01:52] Amir1: See T245396 [17:01:52] T245396: SimpleCacheWithBagOStuff shouldnt be so easy to use bad keys with - https://phabricator.wikimedia.org/T245396 [17:01:59] Err, T245396#5912675 [17:02:25] I get to it asap [17:02:59] Amir1: Thank you. As I said, not UBN, just really irritating, but finding out what manages to break the logging infrastructure would be worth fixing itself. [17:03:30] I think Reedy had a fix for that [17:03:39] Oh, awesome, will go mine his CR board. 
[17:03:51] Saw it somewhere [17:05:17] (03CR) 10Jforrester: [C: 03+2] Remove outdated flaggedrevs.php comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574210 (owner: 10Reedy) [17:06:06] Bah, CodeSearch is down? :-( [17:06:13] (03Merged) 10jenkins-bot: Remove outdated flaggedrevs.php comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574210 (owner: 10Reedy) [17:07:18] (03CR) 10Jforrester: [C: 03+2] [BETA] Remove $wgDebugTimestamps override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574450 (owner: 10Krinkle) [17:08:35] (03Merged) 10jenkins-bot: [BETA] Remove $wgDebugTimestamps override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574450 (owner: 10Krinkle) [17:08:45] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 49.54 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:09:17] (03CR) 10Jforrester: "Why move this from CS to IS, where people might think it's OK to vary it by wiki?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574405 (https://phabricator.wikimedia.org/T245983) (owner: 10Brian Wolff) [17:10:12] James_F: it seems it's merged https://phabricator.wikimedia.org/T245280#5885466 [17:10:40] Oh, you think it might be a formatter key conflict? Yeah, that'd explain it. [17:10:40] bblack: anything on traffic that can explain the above? 
[17:10:46] (eqsin traffic) [17:10:52] !log jforrester@deploy1001 Synchronized wmf-config/flaggedrevs.php: Sync doc-only change; should be a no-op (duration: 00m 57s) [17:10:53] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 77.16 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:07] Amir1: But all of that code should be live now. Last merge was on 14 Feb. [17:15:52] yeah, it's not it then. Something else maybe [17:17:08] (03CR) 10Jbond: "no issue from a technical point of view but what is the use case? could a parameter like enable in base::profile preform the same use ca" [puppet] - 10https://gerrit.wikimedia.org/r/574491 (https://phabricator.wikimedia.org/T104939) (owner: 10Muehlenhoff) [17:20:46] volans: not that I know of? 
[17:21:25] ack, I can see a spike first and then a hole after ~30m in the graph [17:21:44] (03PS1) 10Giuseppe Lavagetto: services_proxy::envoy: avoid duplicate declarations [puppet] - 10https://gerrit.wikimedia.org/r/574509 [17:21:46] (03CR) 10Jbond: "one nit, otherwise lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff) [17:22:09] volans: yeah the spike thing seems to confuse the alert [17:22:28] the spike was all 301 responses, so likely something spamming port 80 briefly [17:22:59] k [17:23:39] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Holloway - https://phabricator.wikimedia.org/T246019 (10Mholloway) [17:24:01] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/574491 (https://phabricator.wikimedia.org/T104939) (owner: 10Muehlenhoff) [17:24:54] (03CR) 10jerkins-bot: [V: 04-1] services_proxy::envoy: avoid duplicate declarations [puppet] - 10https://gerrit.wikimedia.org/r/574509 (owner: 10Giuseppe Lavagetto) [17:25:25] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Holloway - https://phabricator.wikimedia.org/T246019 (10Mholloway) Side note: Is it possible to update my shell username simply by changing all instances of it in modules/admin/data/data.yaml, or am I effectively... [17:25:41] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Holloway - https://phabricator.wikimedia.org/T246019 (10Mholloway) Ping @Jhernandez for manager approval. [17:26:35] 10Operations, 10Product-Infrastructure-Team-Backlog, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Holloway - https://phabricator.wikimedia.org/T246019 (10Mholloway) [17:28:03] 10Operations: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (10Capt_Swing) @RhinosF1 I think so? 
I mean, I don't use the same keys in Labs and production, and I have access to both servers. So I'm going to assume everything is fine ;) - J [17:33:51] Operations, Product-Infrastructure-Team-Backlog, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Holloway - https://phabricator.wikimedia.org/T246019 (Jhernandez) Approved from my side [17:35:54] Operations: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (Majavah) Open→Resolved [17:37:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:39:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:41:21] Operations: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (RhinosF1) I’ll resolve then and someone can reopen if they think otherwise. [17:55:05] Operations, Analytics, Analytics-Kanban, LDAP-Access-Requests: Add Fsalutari to nda LDAP group - https://phabricator.wikimedia.org/T245997 (Ottomata) [17:55:44] (PS2) Giuseppe Lavagetto: services_proxy::envoy: avoid duplicate declarations [puppet] - https://gerrit.wikimedia.org/r/574509 [17:56:41] (CR) Brian Wolff: "It made it a bit more elegant to override for beta since the main code is in a callback in CS.php. 
Otherwise we would have to have beta co" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574405 (https://phabricator.wikimedia.org/T245983) (owner: 10Brian Wolff) [17:59:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy::envoy: avoid duplicate declarations [puppet] - 10https://gerrit.wikimedia.org/r/574509 (owner: 10Giuseppe Lavagetto) [17:59:13] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574405 (https://phabricator.wikimedia.org/T245983) (owner: 10Brian Wolff) [18:00:04] gehel and onimisionipe: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200224T1800). [18:01:26] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: Add Fsalutari to nda LDAP group - https://phabricator.wikimedia.org/T245997 (10Ottomata) @Muehlenhoff just double checking: Fsalutari has an NDA, can I just add to `nda` LDAP group? [18:06:05] (03PS2) 10Dwisehaupt: Add IPs for new frack hosts: civi2001, frpm2001 [dns] - 10https://gerrit.wikimedia.org/r/574097 (https://phabricator.wikimedia.org/T242270) [18:07:08] heh, two commit messages in one :P [18:07:17] (03PS3) 10Dwisehaupt: Plumb in frdb2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/573763 (https://phabricator.wikimedia.org/T245566) [18:07:53] (03CR) 10Dwisehaupt: [C: 03+1] Plumb in frdb2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/573763 (https://phabricator.wikimedia.org/T245566) (owner: 10Dwisehaupt) [18:09:23] (03CR) 10Ayounsi: "As said on IRC, I prefer interface names rather than abstract names." 
[dns] - 10https://gerrit.wikimedia.org/r/574452 (https://phabricator.wikimedia.org/T245606) (owner: 10Arturo Borrero Gonzalez) [18:10:11] (03PS8) 10Elukey: role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - 10https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) [18:13:55] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add hdfs RU jobs [puppet] - 10https://gerrit.wikimedia.org/r/574385 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [18:15:47] (03CR) 10Dzahn: [C: 03+2] delete role::repositoryserver, duplicate of role::apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/574104 (owner: 10Dzahn) [18:16:55] (03PS1) 10Giuseppe Lavagetto: services_proxy::envoy: fix cluster names for swift [puppet] - 10https://gerrit.wikimedia.org/r/574522 [18:18:05] (03CR) 10Dzahn: "Greeks - FYI new Wikimedia User Group Greece wiki" [dns] - 10https://gerrit.wikimedia.org/r/574150 (https://phabricator.wikimedia.org/T245911) (owner: 10MarcoAurelio) [18:18:16] (03CR) 10Dzahn: [C: 03+2] wikimedia: Add new records for gr.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/574150 (https://phabricator.wikimedia.org/T245911) (owner: 10MarcoAurelio) [18:18:21] (03PS3) 10Dzahn: wikimedia: Add new records for gr.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/574150 (https://phabricator.wikimedia.org/T245911) (owner: 10MarcoAurelio) [18:23:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy::envoy: fix cluster names for swift [puppet] - 10https://gerrit.wikimedia.org/r/574522 (owner: 10Giuseppe Lavagetto) [18:25:37] (03CR) 10Dzahn: [C: 03+2] Plumb in frdb2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/573763 (https://phabricator.wikimedia.org/T245566) (owner: 10Dwisehaupt) [18:25:42] (03PS4) 10Dzahn: Plumb in frdb2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/573763 (https://phabricator.wikimedia.org/T245566) (owner: 10Dwisehaupt) [18:30:19] PROBLEM - Check systemd state on mwdebug1002 is 
CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:55] (03PS16) 10Effie Mouzeli: mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [18:34:35] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:43] (03PS9) 10Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) [18:42:02] (03CR) 10Dzahn: [C: 03+2] prod_sites: Apache configuration for gr.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/574151 (https://phabricator.wikimedia.org/T245911) (owner: 10MarcoAurelio) [18:43:03] (03PS1) 10BryanDavis: codesearch: Prevent ferm from deleting Docker iptables rules [puppet] - 10https://gerrit.wikimedia.org/r/574524 (https://phabricator.wikimedia.org/T246017) [18:44:51] jouncebot: next [18:44:51] In 0 hour(s) and 15 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200224T1900) [18:45:56] (03CR) 10Herron: WIP mediawiki: send apache logs to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [18:46:41] !log deploying cluster apache config change - adds gr.wikimedia.org vhost and refreshes apache2 [18:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:38] 10Operations, 10DC-Ops, 10netops, 10Patch-For-Review: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10RobH) I can provide a list of sold network gear from before 2017, otherwise all network gear is still in our sites in storage, even decom gear. 
I'm not exactly sure... [18:48:52] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [18:49:50] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [18:54:08] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Dwisehaupt) [18:58:38] (03PS4) 10Jforrester: MWConfigCacheGenerator: Add test suite, fix non-Wikipedia fallbacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573807 [19:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200224T1900). [19:00:04] Pchelolo: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:02:05] (03CR) 10Volans: "Tested on deployment prep, all good, ferm reloaded successfully." [puppet] - 10https://gerrit.wikimedia.org/r/574426 (owner: 10Volans) [19:03:12] Already been deployed [19:03:15] Yup. [19:03:25] I'll steal SWAT for a sec. 
[19:03:27] (03CR) 10Jforrester: [C: 03+2] MWConfigCacheGenerator: Add test suite, fix non-Wikipedia fallbacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573807 (owner: 10Jforrester) [19:04:26] (03Merged) 10jenkins-bot: MWConfigCacheGenerator: Add test suite, fix non-Wikipedia fallbacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573807 (owner: 10Jforrester) [19:05:17] (03PS1) 10Jhedden: toolforge: upgrade elasticsearch and add debian buster support [puppet] - 10https://gerrit.wikimedia.org/r/574527 (https://phabricator.wikimedia.org/T236606) [19:07:52] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Changes here areonly used in tests right now, but keep line numbers sync'ed (duration: 00m 56s) [19:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:06] (03CR) 10Jforrester: "This needs to wait for wmf.21 to be everywhere, per the dependencies, but is otherwise good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [19:11:08] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Miriam) [19:11:27] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Miriam) [19:12:00] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Dwisehaupt) Database cloning over from frdb1003. Will do the restore after the copy is finished. Will take a while as it is 290G. 
[19:13:27] (03PS1) 10Andrew Bogott: nova.conf: update loglevel settings [puppet] - 10https://gerrit.wikimedia.org/r/574533 [19:13:29] (03PS1) 10Andrew Bogott: designate.conf: update loglevel settings [puppet] - 10https://gerrit.wikimedia.org/r/574534 [19:13:31] (03PS1) 10Andrew Bogott: nova: remove a backup file, most likely committed by mistake [puppet] - 10https://gerrit.wikimedia.org/r/574535 [19:17:33] (03PS13) 10Effie Mouzeli: WIP mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [19:18:00] (03PS3) 10Volans: ganeti: Remove makevm.sh [puppet] - 10https://gerrit.wikimedia.org/r/565233 (owner: 10Alexandros Kosiaris) [19:22:02] (03CR) 10Volans: [C: 03+2] ganeti: Remove makevm.sh [puppet] - 10https://gerrit.wikimedia.org/r/565233 (owner: 10Alexandros Kosiaris) [19:25:19] (03PS1) 10Dzahn: DHCP: add apt1002 and apt2001 [puppet] - 10https://gerrit.wikimedia.org/r/574537 (https://phabricator.wikimedia.org/T224576) [19:25:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [19:26:47] (03PS2) 10RLazarus: Split httpbb.py into two modules. [software/httpbb] - 10https://gerrit.wikimedia.org/r/574066 [19:27:16] (03CR) 10Dzahn: [C: 03+2] DHCP: add apt1002 and apt2001 [puppet] - 10https://gerrit.wikimedia.org/r/574537 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [19:27:24] (03PS2) 10Dzahn: DHCP: add apt1002 and apt2001 [puppet] - 10https://gerrit.wikimedia.org/r/574537 (https://phabricator.wikimedia.org/T224576) [19:27:59] (03CR) 10RLazarus: [C: 03+2] Split httpbb.py into two modules. (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/574066 (owner: 10RLazarus) [19:31:20] (03CR) 10RLazarus: [C: 03+2] "Thanks!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574067 (owner: 10RLazarus) [19:32:10] (03CR) 10Jhedden: [C: 03+1] designate.conf: update loglevel settings [puppet] - 10https://gerrit.wikimedia.org/r/574534 (owner: 10Andrew Bogott) [19:32:26] (03CR) 10Jhedden: [C: 03+1] nova.conf: update loglevel settings [puppet] - 10https://gerrit.wikimedia.org/r/574533 (owner: 10Andrew Bogott) [19:37:41] (03Abandoned) 10Jdlrobson: Mobile logo should fall back to PNG if no SVG support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573419 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [19:38:00] (03CR) 10RobH: [C: 03+1] Juniper report: only log warning if S/N missing from Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574425 (https://phabricator.wikimedia.org/T213843) (owner: 10Ayounsi) [19:39:06] (03CR) 10Ayounsi: [C: 03+2] Juniper report: only log warning if S/N missing from Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574425 (https://phabricator.wikimedia.org/T213843) (owner: 10Ayounsi) [19:41:18] (03PS1) 10EBernhardson: airflow: Drop old airflow user/group statement [puppet] - 10https://gerrit.wikimedia.org/r/574538 [19:41:20] (03PS1) 10EBernhardson: airflow: Deploy data necessary to connect to analytics sql replicas [puppet] - 10https://gerrit.wikimedia.org/r/574539 [19:41:26] mutante: okay to puppet-merge yours with mine? [19:41:36] "DHCP: add apt1002 and apt2001 (9027ccde68)" [19:41:43] rlazarus: yes, please do [19:42:03] i was waiting for ssh to puppetmaster.. i think this wifi doesnt like me to use port 22 [19:42:16] * mutante connects via phone again [19:42:19] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:22] thanks, done 👍 [19:42:29] tx [19:45:55] Operations, DC-Ops, netops, Patch-For-Review: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (ayounsi) Ok! From https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States I thought that if a device was not in netbox it was not in our possession anymore. So I... [19:57:25] (CR) Ottomata: [C: +2] airflow: Deploy data necessary to connect to analytics sql replicas [puppet] - https://gerrit.wikimedia.org/r/574539 (owner: EBernhardson) [19:57:33] (PS2) Ottomata: airflow: Deploy data necessary to connect to analytics sql replicas [puppet] - https://gerrit.wikimedia.org/r/574539 (owner: EBernhardson) [19:58:50] (CR) Ottomata: [V: +2 C: +2] airflow: Deploy data necessary to connect to analytics sql replicas [puppet] - https://gerrit.wikimedia.org/r/574539 (owner: EBernhardson) [20:02:39] !log installing OS on new ganeti VMs apt1001 and apt2001.wikimedia.org for buster APT repos [20:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:30] Operations, netops: cr3-esams:fpc1 crash - https://phabricator.wikimedia.org/T245825 (ayounsi) a:ayounsi→RobH From JTAC (over the phone), tl;dr; If the FPC reboot didn't solve the issue, we need to re-seat all linecards and SCBs (should solve the issue in 90% of the cases). And only if that doesn... 
[20:08:54] (CR) Andrew Bogott: [C: +2] nova.conf: update loglevel settings [puppet] - https://gerrit.wikimedia.org/r/574533 (owner: Andrew Bogott) [20:09:08] (CR) Andrew Bogott: [C: +2] designate.conf: update loglevel settings [puppet] - https://gerrit.wikimedia.org/r/574534 (owner: Andrew Bogott) [20:09:26] (CR) Andrew Bogott: [C: +2] nova: remove a backup file, most likely committed by mistake [puppet] - https://gerrit.wikimedia.org/r/574535 (owner: Andrew Bogott) [20:15:51] Operations, Security-Team, Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (chasemp) > At this time the intention is to archive this list. If someone wants to take ownership in the future or resurrect another... [20:16:54] Operations, netops: cr3-esams:fpc1 crash - https://phabricator.wikimedia.org/T245825 (ayounsi) Said one thing over the phone, but sent a RMA email. Gave him the same details than the cr2-esams linecard. 
[20:19:18] (PS4) Jforrester: Scap: update-interwiki-cache for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/446507 (https://phabricator.wikimedia.org/T198844) (owner: Thcipriani) [20:23:30] (PS5) Jforrester: Scap: update-interwiki-cache for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/446507 (https://phabricator.wikimedia.org/T198844) (owner: Thcipriani) [20:24:00] (PS14) Effie Mouzeli: WIP mediawiki: send apache logs to logstash [puppet] - https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [20:24:52] Operations, netops: cr3-esams:fpc1 crash - https://phabricator.wikimedia.org/T245825 (ayounsi) FPC1 only have 2 ports curently: ` et-1/0/0 up up Core: asw2-esams:et-6/0/50 {#20049} et-1/0/1 up up Core: asw2-esams:et-6/0/51 {#20042} ` [20:24:56] Operations, ops-eqiad, serviceops: (Need by: 2020-02-12) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (wiki_willy) [20:25:33] Operations, ops-eqiad, Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (wiki_willy) [20:26:31] Operations, ops-eqiad, netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (wiki_willy) [20:26:34] Operations, Wikimedia-Mailing-lists, Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (MBinder_WMF) @MarkTraceur Know the status of this? [20:27:00] Operations, ops-eqiad, DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (wiki_willy) [20:29:49] (CR) Jforrester: "This "works", but it doesn't change the output from just doing `scap update-interwiki-cache --file=interwiki-labs.php`. 
We also would pres" [mediawiki-config] - https://gerrit.wikimedia.org/r/446507 (https://phabricator.wikimedia.org/T198844) (owner: Thcipriani) [20:32:15] !log load new FW policies on pfw3-eqiad/codfw - T246036 [20:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:04] (PS6) Jforrester: Scap: update-interwiki-cache for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/446507 (https://phabricator.wikimedia.org/T198844) (owner: Thcipriani) [20:34:35] Operations, ops-esams, netops: cr3-esams:fpc1 RMA - https://phabricator.wikimedia.org/T245825 (RobH) [20:44:03] Operations, Traffic: Reporting en.wikipedia is down - https://phabricator.wikimedia.org/T246040 (Reedy) Can you access https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue or https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue and follow the instructions? [20:45:32] Operations, SRE-Access-Requests, User-RhinosF1: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (Capt_Swing) @jbond @RhinosF1 @leila @Nuria thank you! @elukey thanks for the heads-up! If I'm reading the docs right I guess I can use stat1004 or 1005 to access web... [20:45:52] Operations, SRE-Access-Requests, User-RhinosF1: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (Capt_Swing) [20:47:44] Operations, Traffic: Reporting en.wikipedia is down - https://phabricator.wikimedia.org/T246040 (JoeHebda) Microsoft Windows [Version 10.0.18363.657] (c) 2019 Microsoft Corporation. All rights reserved. C:\Users\Admin>tracert en.wikipedia.org Tracing route to en.wikipedia.org [91.198.174.192] over a ma... 
[20:50:49] (CR) Dzahn: "> Patch Set 2: -Code-Review" [puppet] - https://gerrit.wikimedia.org/r/572312 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [20:51:31] Operations, Traffic, netops: Reporting en.wikipedia is down - https://phabricator.wikimedia.org/T246040 (Dzahn) [20:51:59] Operations, SRE-Access-Requests, User-RhinosF1: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (RhinosF1) No problem! Thanks for being part of my first production commit. [20:58:36] (PS1) Ottomata: Enable canary releases for evenstreams chart [deployment-charts] - https://gerrit.wikimedia.org/r/574560 [20:58:44] !log test flowspec BGP config on cr3-knams [20:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] cscott, arlolra, subbu, halfak, and accraze: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200224T2100). 
[21:01:45] Operations, ops-eqiad, cloud-services-team (Hardware): (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (wiki_willy) [21:02:52] (PS2) Ottomata: Enable canary releases for evenstreams chart [deployment-charts] - https://gerrit.wikimedia.org/r/574560 [21:03:05] (CR) Dzahn: [C: +2] site: add apt[12]001.wikimedia.org with role::apt_repo [puppet] - https://gerrit.wikimedia.org/r/572312 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [21:03:21] (PS3) Ottomata: Enable canary releases for evenstreams chart [deployment-charts] - https://gerrit.wikimedia.org/r/574560 [21:05:35] Operations, ops-eqiad, DC-Ops: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (RobH) [21:07:10] (CR) Ottomata: [C: +2] Enable canary releases for evenstreams chart [deployment-charts] - https://gerrit.wikimedia.org/r/574560 (owner: Ottomata) [21:09:33] Operations, ops-eqiad, Analytics, DC-Ops: (Need by: TBD) rack/setup/install an-druid1001 and druid1007 - https://phabricator.wikimedia.org/T245569 (RobH) [21:10:10] Operations, ops-eqiad, DC-Ops: (Need by: ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (RobH) [21:10:35] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (RobH) [21:11:11] Operations, ops-eqiad, fundraising-tech-ops: (Need by: ASAP) rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (RobH) [21:11:16] Operations, ops-eqiad, serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (RobH) [21:11:20] Operations, ops-eqiad, Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1029, 
restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (RobH) [21:11:22] Operations, ops-eqiad, Discovery-Search (Current work): (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (RobH) [21:11:28] Operations, ops-eqiad: (Need by: TBD) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (RobH) [21:12:09] (PS1) Dzahn: Revert "installserver: temp remove new APT servers to avoid cron spam" [puppet] - https://gerrit.wikimedia.org/r/574564 [21:12:16] (PS2) Jhedden: toolforge: upgrade elasticsearch and add debian buster support [puppet] - https://gerrit.wikimedia.org/r/574527 (https://phabricator.wikimedia.org/T236606) [21:12:33] (PS1) EBernhardson: Deploy analytics-privatedata-users group to an-airflow [puppet] - https://gerrit.wikimedia.org/r/574565 [21:17:20] (PS1) Andrew Bogott: nova: remove some obsolete config files [puppet] - https://gerrit.wikimedia.org/r/574566 [21:17:52] (PS2) EBernhardson: Deploy analytics-privatedata-users group to an-airflow [puppet] - https://gerrit.wikimedia.org/r/574565 [21:19:38] Operations, ops-eqiad: (Need by: TBD) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (RobH) Please note that this is being tracked as an old hw refresh. The hardware has actually been onsite for awhile, but the scope of this is fairly large. Any serial cable runs that we custom crimpe... 
[21:20:20] Operations, ops-eqiad: (Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (RobH) [21:21:03] (PS3) EBernhardson: Deploy analytics-privatedata-users group to an-airflow [puppet] - https://gerrit.wikimedia.org/r/574565 [21:22:23] (PS4) EBernhardson: airflow: mysql replica credentials must be owned by service_group [puppet] - https://gerrit.wikimedia.org/r/574565 [21:23:10] (PS1) Ottomata: eventstreams - bump cpu limit to 2000m for benchmarking [deployment-charts] - https://gerrit.wikimedia.org/r/574567 (https://phabricator.wikimedia.org/T238658) [21:23:44] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [21:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:14] (CR) Andrew Bogott: [C: +1] "puppet compiler agrees with me" [puppet] - https://gerrit.wikimedia.org/r/574566 (owner: Andrew Bogott) [21:26:13] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [21:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:31] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . 
[21:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:11] Operations, ops-codfw, DC-Ops: (no date provided) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (Papaul) [21:37:04] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [21:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:46] (PS2) Holger Knust: changeprop: New helmfiles for deployment [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) [21:41:30] Operations, ops-eqiad: (Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (RobH) [21:42:28] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@f87bdd9]: Take service name into account for consumer group name T244387 [21:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:35] T244387: Change-Prop consumer group must respect service name - https://phabricator.wikimedia.org/T244387 [21:43:43] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@f87bdd9]: Take service name into account for consumer group name T244387 (duration: 01m 14s) [21:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:11] (PS3) Dzahn: site: add apt[12]001.wikimedia.org with role::apt_repo [puppet] - https://gerrit.wikimedia.org/r/572312 (https://phabricator.wikimedia.org/T224576) [21:46:10] (CR) Holger Knust: "Addressed all PS1 review comments except for the items that require Alex's input." 
(11 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust) [21:50:06] Operations, ChangeProp, Release Pipeline, Release-Engineering-Team-TODO, and 5 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (Pchelolo) [21:50:39] (CR) Ottomata: [C: +2] airflow: mysql replica credentials must be owned by service_group [puppet] - https://gerrit.wikimedia.org/r/574565 (owner: EBernhardson) [21:52:32] (CR) Ppchelko: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust) [21:53:19] (CR) Dzahn: [C: +2] site: add apt[12]001.wikimedia.org with role::apt_repo [puppet] - https://gerrit.wikimedia.org/r/572312 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [22:00:04] Reedy and sbassett: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200224T2200). Please do the needful. [22:03:23] (CR) Ppchelko: changeprop: New helmfiles for deployment (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust) [22:06:27] (CR) Ppchelko: [C: -1] "This is a primary concern: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/574094/1/helmfile.d/services/codfw/changeprop/v" [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust) [22:07:46] (PS1) RLazarus: httpbb: Tiny script to work around import path issues. 
[puppet] - https://gerrit.wikimedia.org/r/574583 (https://phabricator.wikimedia.org/T236699) [22:08:12] (PS1) Dzahn: aptrepo: add support for buster, install python-apt package [puppet] - https://gerrit.wikimedia.org/r/574584 (https://phabricator.wikimedia.org/T224576) [22:09:13] (CR) jerkins-bot: [V: -1] httpbb: Tiny script to work around import path issues. [puppet] - https://gerrit.wikimedia.org/r/574583 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus) [22:09:33] (CR) Dzahn: [C: +2] aptrepo: add support for buster, install python-apt package [puppet] - https://gerrit.wikimedia.org/r/574584 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [22:09:51] (PS2) RLazarus: httpbb: Tiny script to work around import path issues. [puppet] - https://gerrit.wikimedia.org/r/574583 (https://phabricator.wikimedia.org/T236699) [22:10:48] (CR) jerkins-bot: [V: -1] httpbb: Tiny script to work around import path issues. [puppet] - https://gerrit.wikimedia.org/r/574583 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus) [22:12:01] (CR) Muehlenhoff: "Which tools emits the error message? That would be a bug in the reprepro package, then." [puppet] - https://gerrit.wikimedia.org/r/574584 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [22:12:45] (PS3) RLazarus: httpbb: Tiny script to work around import path issues. 
[puppet] - https://gerrit.wikimedia.org/r/574583 (https://phabricator.wikimedia.org/T236699) [22:13:49] (CR) Dzahn: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/574584 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [22:14:03] (CR) RLazarus: "PCC looks right: https://puppet-compiler.wmflabs.org/compiler1001/21026/" [puppet] - https://gerrit.wikimedia.org/r/574583 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus) [22:17:03] PROBLEM - Check systemd state on apt1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:30] ACKNOWLEDGEMENT - Check systemd state on apt1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn new install, issue with nginx on buster https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:22:24] !log disable transits on cr3-esams [22:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:47] PROBLEM - HTTP on apt1001 is CRITICAL: connect to address 208.80.154.30 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/APT_repository [22:27:00] ACKNOWLEDGEMENT - HTTP on apt1001 is CRITICAL: connect to address 208.80.154.30 and port 80: Connection refused daniel_zahn not in prod, nginx issue https://wikitech.wikimedia.org/wiki/APT_repository [22:30:05] (PS3) Jhedden: toolforge: upgrade elasticsearch and add debian buster support [puppet] - https://gerrit.wikimedia.org/r/574527 (https://phabricator.wikimedia.org/T236606) [22:31:05] ACKNOWLEDGEMENT - Host cr2-esams is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn known and WIP by netops [22:31:05] ACKNOWLEDGEMENT - Host cr2-esams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::3) daniel_zahn known and WIP by netops [22:33:50] (CR) jerkins-bot: [V: -1] toolforge: upgrade elasticsearch and 
add debian buster support [puppet] - https://gerrit.wikimedia.org/r/574527 (https://phabricator.wikimedia.org/T236606) (owner: Jhedden) [22:33:57] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:22] !log stat1007 sudo systemctl reset-failed to clear Icinga alerts about reportupdater-pingback.service [22:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:00] !log redirect ns2 to authdns1001 [22:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:24] (CR) Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/21027/" [puppet] - https://gerrit.wikimedia.org/r/574527 (https://phabricator.wikimedia.org/T236606) (owner: Jhedden) [22:41:05] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 16 probes of 604 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:44:31] (PS1) Andrew Bogott: nova.conf: specify drivername on mysql URI [puppet] - https://gerrit.wikimedia.org/r/574590 [22:44:33] (PS1) Andrew Bogott: neutron: specify drivername on mysql URI [puppet] - https://gerrit.wikimedia.org/r/574591 [22:44:35] (PS1) Andrew Bogott: nova: remove some old nova-network config settings [puppet] - https://gerrit.wikimedia.org/r/574592 [22:44:37] (PS1) Andrew Bogott: nova.conf: replace rabbit_hosts with transport_url [puppet] - https://gerrit.wikimedia.org/r/574593 [22:45:56] (PS2) Andrew Bogott: nova.conf: specify drivername on mysql URI [puppet] - https://gerrit.wikimedia.org/r/574590 [22:45:58] (PS2) Andrew Bogott: neutron: specify drivername on mysql URI [puppet] - https://gerrit.wikimedia.org/r/574591 [22:46:00] (PS2) Andrew Bogott: nova: remove some old 
nova-network config settings [puppet] - https://gerrit.wikimedia.org/r/574592 [22:46:02] (PS2) Andrew Bogott: nova.conf: replace rabbit_hosts with transport_url [puppet] - https://gerrit.wikimedia.org/r/574593 [22:46:26] Operations, Wikimedia-Incident: Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (RLazarus) "On Monday" turned out to be two weeks later -- sorry about that. Conclusions from today's SRE meeting, documented for posterity: - Yes, in... [22:47:34] (CR) Andrew Bogott: [C: +2] nova.conf: specify drivername on mysql URI [puppet] - https://gerrit.wikimedia.org/r/574590 (owner: Andrew Bogott) [22:48:58] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/574583 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus) [22:49:34] (CR) RLazarus: [C: +2] httpbb: Tiny script to work around import path issues. [puppet] - https://gerrit.wikimedia.org/r/574583 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus) [22:52:29] RECOVERY - Check systemd state on apt1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:05] RECOVERY - HTTP on apt1001 is OK: HTTP OK: HTTP/1.1 302 Moved Temporarily - 381 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/APT_repository [22:54:11] (CR) Andrew Bogott: [C: +2] neutron: specify drivername on mysql URI [puppet] - https://gerrit.wikimedia.org/r/574591 (owner: Andrew Bogott) [22:54:25] (CR) Andrew Bogott: [C: +2] nova: remove some old nova-network config settings [puppet] - https://gerrit.wikimedia.org/r/574592 (owner: Andrew Bogott) [22:57:59] (PS1) Dzahn: apt.wikimedia.org: remove (duplicate) OCSP stapling config line for RSA [puppet] - https://gerrit.wikimedia.org/r/574597 (https://phabricator.wikimedia.org/T224576) [22:58:36] !log dzahn@cumin1001 END (PASS) 
- Cookbook sre.ganeti.makevm (exit_code=0) [22:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:44] (PS3) Andrew Bogott: nova.conf: replace rabbit_hosts with transport_url [puppet] - https://gerrit.wikimedia.org/r/574593 [23:01:46] (PS1) Andrew Bogott: nova: rename one rabbit setting, remove a deprecated one [puppet] - https://gerrit.wikimedia.org/r/574598 [23:02:17] (PS1) Dzahn: install_server: update MAC address for apt2001 [puppet] - https://gerrit.wikimedia.org/r/574599 [23:03:54] (CR) Dzahn: [C: +2] install_server: update MAC address for apt2001 [puppet] - https://gerrit.wikimedia.org/r/574599 (owner: Dzahn) [23:06:27] (PS2) Dzahn: apt: remove (duplicate) OCSP stapling config and RSA cert [puppet] - https://gerrit.wikimedia.org/r/574597 (https://phabricator.wikimedia.org/T224576) [23:08:24] (PS2) Dzahn: install_server: update MAC address for apt2001 [puppet] - https://gerrit.wikimedia.org/r/574599 [23:10:01] (CR) Dzahn: [V: +2 C: +2] install_server: update MAC address for apt2001 [puppet] - https://gerrit.wikimedia.org/r/574599 (owner: Dzahn) [23:11:46] (PS1) CRusnov: Add support for getting Device status breakdowns [software/netbox-extras] - https://gerrit.wikimedia.org/r/574600 [23:12:02] (CR) CRusnov: "This change is ready for review." 
[software/netbox-extras] - https://gerrit.wikimedia.org/r/574600 (owner: CRusnov) [23:12:09] (CR) jerkins-bot: [V: -1] Add support for getting Device status breakdowns [software/netbox-extras] - https://gerrit.wikimedia.org/r/574600 (owner: CRusnov) [23:12:37] (PS3) Bstorm: Update wm-bot hostname [puppet] - https://gerrit.wikimedia.org/r/572489 (owner: Lucas Werkmeister) [23:15:06] (PS2) CRusnov: Add support for getting Device status breakdowns [software/netbox-extras] - https://gerrit.wikimedia.org/r/574600 (https://phabricator.wikimedia.org/T243927) [23:16:06] (CR) Bstorm: [C: +2] Update wm-bot hostname [puppet] - https://gerrit.wikimedia.org/r/572489 (owner: Lucas Werkmeister) [23:16:40] (CR) Andrew Bogott: [C: +2] nova: rename one rabbit setting, remove a deprecated one [puppet] - https://gerrit.wikimedia.org/r/574598 (owner: Andrew Bogott) [23:28:15] (PS3) Holger Knust: changeprop: New helmfiles for deployment [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) [23:29:25] (CR) Holger Knust: "Reverted changes and remove all duplicated values as discussed." (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust) [23:30:40] (PS2) CRusnov: reports: Make it so external stuff cannot break things [software/netbox-extras] - https://gerrit.wikimedia.org/r/574057 (https://phabricator.wikimedia.org/T239119) [23:31:33] (CR) CRusnov: [C: +2] "last PS: Just minor change based on testing (lru_cache has to come before property in the decorator chain apparently). Tested good, so bas" [software/netbox-extras] - https://gerrit.wikimedia.org/r/574057 (https://phabricator.wikimedia.org/T239119) (owner: CRusnov) [23:33:10] Operations, Traffic, netops: Reporting en.wikipedia is down - https://phabricator.wikimedia.org/T246040 (JoeHebda) Now connected...running OK. 
Ticket can be closed. Done [23:37:58] (PS1) Andrew Bogott: neutron: turn down log levels for wsgi [puppet] - https://gerrit.wikimedia.org/r/574607 [23:39:14] (CR) Ppchelko: [C: -1] changeprop: New helmfiles for deployment (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust) [23:40:31] (CR) Andrew Bogott: [C: +2] neutron: turn down log levels for wsgi [puppet] - https://gerrit.wikimedia.org/r/574607 (owner: Andrew Bogott) [23:58:17] (CR) Brian Wolff: "> Just explicitly setting it to a value in CS-labs will over-write. Is that not sufficient?" [mediawiki-config] - https://gerrit.wikimedia.org/r/574405 (https://phabricator.wikimedia.org/T245983) (owner: Brian Wolff)