[00:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200221T0000).
[00:00:04] tassu: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:34] my patch is {{done}}, see above
[00:07:23] Operations, Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (Dzahn) Had to fix /etc/network/interfaces again (interface name changed again, ens5 -> ens6 now ens7) and restart to fix networking. Then formatted with ext4 and mounted additional 20G on /srv/lfs. Added t...
[00:13:48] (CR) Jforrester: [C: +1] "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/566410 (owner: Legoktm)
[00:18:50] (CR) Thcipriani: [C: +2] Update delete-project [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/573759 (owner: Paladox)
[00:27:13] (Merged) jenkins-bot: Update delete-project [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/573759 (owner: Paladox)
[00:38:35] PROBLEM - Device not healthy -SMART- on cloudvirt1008 is CRITICAL: cluster=wmcs device=cciss,17 instance=cloudvirt1008:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1008&var-datasource=eqiad+prometheus/ops
[00:55:51] (PS2) Jforrester: scap: Actually pass flake8 [mediawiki-config] - https://gerrit.wikimedia.org/r/566410 (owner: Legoktm)
[00:56:30] (CR) Jforrester: [C: +1] "I guess we should start running tox here, then…" [mediawiki-config] - https://gerrit.wikimedia.org/r/566410 (owner: Legoktm)
[00:57:51] (PS1) Samwilson: Enable password-reset-update on Wikivoyages and Wiktionaries
[mediawiki-config] - https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792)
[00:59:38] (CR) Jforrester: [C: +1] "check experimental" [mediawiki-config] - https://gerrit.wikimedia.org/r/566410 (owner: Legoktm)
[01:01:26] !log andrew@deploy1001 Started deploy [horizon/deploy@13ca90a]: Remove guided puppet config mode; this gets us back to working with latest puppet packages.
[01:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:02] (CR) Jforrester: [C: +2] scap: Actually pass flake8 [mediawiki-config] - https://gerrit.wikimedia.org/r/566410 (owner: Legoktm)
[01:03:48] (PS2) Papaul: DNS: Add mgmt and production DNS for mw236[6-9], mw237[0-6] [dns] - https://gerrit.wikimedia.org/r/573713
[01:04:05] (Merged) jenkins-bot: scap: Actually pass flake8 [mediawiki-config] - https://gerrit.wikimedia.org/r/566410 (owner: Legoktm)
[01:04:12] (CR) jerkins-bot: [V: -1] DNS: Add mgmt and production DNS for mw236[6-9], mw237[0-6] [dns] - https://gerrit.wikimedia.org/r/573713 (owner: Papaul)
[01:04:27] (PS2) Jforrester: scap: Add Python 3 support [mediawiki-config] - https://gerrit.wikimedia.org/r/566411 (owner: Legoktm)
[01:04:35] (PS2) Jforrester: scap: Clean up unused build configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/566412 (owner: Legoktm)
[01:04:58] !log andrew@deploy1001 Finished deploy [horizon/deploy@13ca90a]: Remove guided puppet config mode; this gets us back to working with latest puppet packages.
(duration: 03m 32s)
[01:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:10] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Sync Beta-Cluster-only change to CommonSettings now we're sure we won't revert (duration: 00m 56s)
[01:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:25] (CR) Jforrester: [C: +1] scap: Add Python 3 support [mediawiki-config] - https://gerrit.wikimedia.org/r/566411 (owner: Legoktm)
[01:06:50] (CR) Jforrester: [C: +1] scap: Clean up unused build configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/566412 (owner: Legoktm)
[01:07:15] (PS3) Papaul: DNS: Add mgmt and production DNS for mw236[6-9], mw237[0-6] [dns] - https://gerrit.wikimedia.org/r/573713
[01:11:03] (PS3) Andrew Bogott: cloud eqiad1: configure new puppetmaster to only use new puppet as backend [puppet] - https://gerrit.wikimedia.org/r/573743 (owner: Alex Monk)
[01:12:50] (CR) CDanis: [C: +1] Add configuration for a flowspec controller [puppet] - https://gerrit.wikimedia.org/r/573401 (owner: Ayounsi)
[01:15:29] (CR) Andrew Bogott: [C: +2] cloud eqiad1: configure new puppetmaster to only use new puppet as backend [puppet] - https://gerrit.wikimedia.org/r/573743 (owner: Alex Monk)
[01:17:15] PROBLEM - Check systemd state on boron is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:29] (CR) Krinkle: "SGTM."
[mediawiki-config] - https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: Jforrester)
[02:01:39] PROBLEM - MariaDB read only wikireplica on labsdb1011 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:03:03] PROBLEM - Check systemd state on labsdb1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:49] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[02:04:11] (PS2) Jforrester: Enable password-reset-update on Wikivoyages and Wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792) (owner: Samwilson)
[02:04:13] (PS1) Jforrester: MWConfigCacheGenerator: Add test suite, fix non-Wikipedia fallbacks [mediawiki-config] - https://gerrit.wikimedia.org/r/573807
[02:04:35] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[02:04:35] (CR) jerkins-bot: [V: -1] Enable password-reset-update on Wikivoyages and Wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792) (owner: Samwilson)
[02:04:36] (CR) jerkins-bot: [V: -1] MWConfigCacheGenerator: Add test suite, fix non-Wikipedia fallbacks [mediawiki-config] - https://gerrit.wikimedia.org/r/573807 (owner: Jforrester)
[02:07:25] (PS2) Jforrester: MWConfigCacheGenerator: Add test suite, fix non-Wikipedia fallbacks [mediawiki-config] - https://gerrit.wikimedia.org/r/573807
[02:07:27] (PS3) Jforrester: Enable password-reset-update on Wikivoyages and Wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792) (owner:
Samwilson)
[02:13:49] (PS1) Jforrester: [DNM] Global grab $site and $lang [mediawiki-config] - https://gerrit.wikimedia.org/r/573810
[02:14:59] (CR) jerkins-bot: [V: -1] [DNM] Global grab $site and $lang [mediawiki-config] - https://gerrit.wikimedia.org/r/573810 (owner: Jforrester)
[02:16:16] (CR) Jforrester: "Looks good to deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792) (owner: Samwilson)
[02:22:06] (CR) CDanis: [C: +1] "Looks good! Some nits and one longer-term things." (8 comments) [cookbooks] - https://gerrit.wikimedia.org/r/572262 (owner: Ayounsi)
[02:22:30] (Abandoned) CDanis: depool eqsin [dns] - https://gerrit.wikimedia.org/r/570776 (owner: CDanis)
[02:27:25] !log stopped mariadb on labsdb1011 because it keeps crashing anyway
[02:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:34] ACKNOWLEDGEMENT - Check systemd state on labsdb1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
Bstorm mariadb crashed T245797 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:34] ACKNOWLEDGEMENT - MariaDB read only wikireplica on labsdb1011 is CRITICAL: Could not connect to localhost:3306 Bstorm mariadb crashed T245797 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:29:35] ACKNOWLEDGEMENT - mysqld processes #page on labsdb1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Bstorm mariadb crashed T245797 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:29:57] ohhhh, icinga
[02:32:41] hi
[02:33:10] ohh I didn't even see that was an ack I got
[02:33:24] teach me to read more carefully :) thanks bstorm_
[02:34:05] lol, sorry for the noise from icinga
[02:37:56] (PS1) Bstorm: wikireplicas: depool labsdb1011 and set weights on other cluster [puppet] - https://gerrit.wikimedia.org/r/573813 (https://phabricator.wikimedia.org/T245797)
[02:44:48] (PS2) Bstorm: wikireplicas: depool labsdb1011 and set weights on other cluster [puppet] - https://gerrit.wikimedia.org/r/573813 (https://phabricator.wikimedia.org/T245797)
[02:46:01] (CR) Bstorm: [C: +2] wikireplicas: depool labsdb1011 and set weights on other cluster [puppet] - https://gerrit.wikimedia.org/r/573813 (https://phabricator.wikimedia.org/T245797) (owner: Bstorm)
[02:51:43] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[02:52:23] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[02:53:16] !log depooled labsdb1011 and set weight 10 on labsdb1009 vs 3 on labsdb1010 T245797
[02:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:20] T245797: labsdb1011 mariadb crashed again - https://phabricator.wikimedia.org/T245797
[02:59:11] (PS3) Jforrester: MWConfigCacheGenerator: Add test suite, fix non-Wikipedia fallbacks [mediawiki-config] - https://gerrit.wikimedia.org/r/573807
[02:59:20] (Abandoned) Jforrester: [DNM] Global grab $site and $lang [mediawiki-config] - https://gerrit.wikimedia.org/r/573810 (owner: Jforrester)
[03:15:06] (PS1) BryanDavis: kubernetes: Remove deprecated flag from tcl image [software/tools-webservice] - https://gerrit.wikimedia.org/r/573823
[04:27:01] Operations, netops, Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (Papaul) Reading from librenms on ms1-eqiad * normal operation Total traffic on ge-0/0/14 which is connected to msw-b2-eqiad IN = 96.11MB OUT = 40.92MB * During the outage: 2020-0...
[04:27:58] (CR) Samwilson: "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792) (owner: Samwilson)
[04:38:01] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:50:15] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:55:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:57:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:11:33] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[05:12:15] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[05:18:33] RECOVERY - Device not healthy -SMART- on cloudvirt1008 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1008&var-datasource=eqiad+prometheus/ops
[05:37:51] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[05:44:09] !log Reload haproxy on dbproxy1010, dbproxy1011, dbproxy18 - T245797
[05:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:15] T245797: labsdb1011 mariadb crashed again - https://phabricator.wikimedia.org/T245797
[05:57:42] !log Start MySQL on labsdb1011 without replication - T245797
[05:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:49] ACKNOWLEDGEMENT - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui T245797 https://wikitech.wikimedia.org/wiki/HAProxy
[06:05:49] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui T245797 https://wikitech.wikimedia.org/wiki/HAProxy
[06:05:54] ACKNOWLEDGEMENT - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui T245797 https://wikitech.wikimedia.org/wiki/HAProxy
[06:05:54] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui T245797 https://wikitech.wikimedia.org/wiki/HAProxy
[06:05:54] ACKNOWLEDGEMENT - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui T245797 https://wikitech.wikimedia.org/wiki/HAProxy
[06:05:54] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui T245797
https://wikitech.wikimedia.org/wiki/HAProxy
[06:12:49] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:13:48] PROBLEM - mysqld processes #page on labsdb1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[06:14:03] (PS1) Marostegui: wikireplica_analytics.yaml: Decrease running time [puppet] - https://gerrit.wikimedia.org/r/573855 (https://phabricator.wikimedia.org/T245797)
[06:14:07] ^ going to disable notifications for labsdb1011 and downtime it
[06:14:33] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:15:46] (CR) Marostegui: [C: +2] wikireplica_analytics.yaml: Decrease running time [puppet] - https://gerrit.wikimedia.org/r/573855 (https://phabricator.wikimedia.org/T245797) (owner: Marostegui)
[06:16:03] the router notification is a known maintenance as well
[06:16:25] ACKNOWLEDGEMENT - mysqld processes #page on labsdb1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui T245797 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[06:16:54] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis planned maintenance PWIC105701 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:16:54] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis planned maintenance PWIC105701
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:32:55] PROBLEM - Device not healthy -SMART- on cloudvirt1008 is CRITICAL: cluster=wmcs device=cciss,17 instance=cloudvirt1008:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1008&var-datasource=eqiad+prometheus/ops
[06:32:55] (PS1) Marostegui: mariadb: Productionize es1025 [puppet] - https://gerrit.wikimedia.org/r/573860 (https://phabricator.wikimedia.org/T243052)
[06:34:54] !log Stop mysql on es1024 to clone es1025 - T243052
[06:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:58] T243052: Productionize es1020-es1025, es2020-es2025 - https://phabricator.wikimedia.org/T243052
[06:35:30] (CR) Marostegui: [C: +2] mariadb: Productionize es1025 [puppet] - https://gerrit.wikimedia.org/r/573860 (https://phabricator.wikimedia.org/T243052) (owner: Marostegui)
[06:37:23] (CR) Giuseppe Lavagetto: ">" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[06:51:09] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:11] !log Stop MySQL on labsdb1012 to clone labsdb1011 -
[06:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:21] !log Stop MySQL on labsdb1012 to clone labsdb1011 - T245797
[06:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:01] PROBLEM - MariaDB read only wikireplica
on labsdb1012 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:03:40] PROBLEM - mysqld processes #page on labsdb1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:03:56] damn
[07:03:58] that's me
[07:04:00] sorry
[07:04:02] no worries
[07:16:15] (CR) Alexandros Kosiaris: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/573578 (owner: Alexandros Kosiaris)
[07:16:51] (CR) Giuseppe Lavagetto: [C: +2] profile::lvs: use wmflib::fetch [puppet] - https://gerrit.wikimedia.org/r/572215 (owner: Giuseppe Lavagetto)
[07:35:03] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1001:9501 job=burrow partition={2,3} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-clust
[07:35:03] var-topic=All&var-consumer_group=All
[07:45:35] (PS1) Alexandros Kosiaris: otrs: Remove mod_remoteip [puppet] - https://gerrit.wikimedia.org/r/573877
[07:51:07] (CR) Alexandros Kosiaris: [C: +2] "I 'll disable manually mod_remoteip on the machine."
[puppet] - https://gerrit.wikimedia.org/r/573877 (owner: Alexandros Kosiaris)
[07:53:06] (CR) Alexandros Kosiaris: [C: +2] "Adding dzahn as it seems the same approach was taken on phab+gerrit and since 3a55ec489327a1 it might no longer be required (but care shou" [puppet] - https://gerrit.wikimedia.org/r/573877 (owner: Alexandros Kosiaris)
[08:02:41] !log disable mod_remoteip on otrs host, following merge of https://gerrit.wikimedia.org/r/573877
[08:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:02] (Abandoned) Alexandros Kosiaris: httpd: Switch defaults.conf from file to template [puppet] - https://gerrit.wikimedia.org/r/572708 (owner: Alexandros Kosiaris)
[08:16:41] (PS1) Muehlenhoff: Add system::role for k8s staging roles [puppet] - https://gerrit.wikimedia.org/r/573887
[08:19:45] !log fdans@deploy1001 Started deploy [analytics/refinery@4d56021]: deploying refinery
[08:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:25] (PS1) Muehlenhoff: Remove system role from role::swift::stats_reporter [puppet] - https://gerrit.wikimedia.org/r/573895
[08:34:40] !log fdans@deploy1001 Finished deploy [analytics/refinery@4d56021]: deploying refinery (duration: 14m 55s)
[08:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:42] (PS1) Muehlenhoff: Add a separate class parameter to toggle the auth Icinga check for tendril [puppet] - https://gerrit.wikimedia.org/r/573935
[08:47:34] (CR) Muehlenhoff: [C: +2] Add logstash/kibana IDP service definition [puppet] - https://gerrit.wikimedia.org/r/573560 (owner: Muehlenhoff)
[08:51:06] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/20947/" [puppet] - https://gerrit.wikimedia.org/r/573935 (owner: Muehlenhoff)
[08:54:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to
https://phabricator.wikimedia.org/P10473 and previous config saved to /var/cache/conftool/dbconfig/20200221-085405-marostegui.json
[08:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:11] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702
[08:58:05] (CR) Muehlenhoff: [C: +1] "Sounds good" [puppet] - https://gerrit.wikimedia.org/r/573711 (https://phabricator.wikimedia.org/T224586) (owner: Herron)
[09:02:09] (CR) Muehlenhoff: [C: +1] "Looks good, fermium also has it enabled, albeit not in Puppet." [puppet] - https://gerrit.wikimedia.org/r/573732 (https://phabricator.wikimedia.org/T224586) (owner: Herron)
[09:09:57] (PS1) Filippo Giunchedi: service: fix uwsgi logstash_port_logback [puppet] - https://gerrit.wikimedia.org/r/573936 (https://phabricator.wikimedia.org/T245512)
[09:09:59] (PS1) Filippo Giunchedi: service: logging pipeline support for uwsgi [puppet] - https://gerrit.wikimedia.org/r/573937 (https://phabricator.wikimedia.org/T245512)
[09:13:34] (CR) Filippo Giunchedi: "PCC for this and I3449b226289" [puppet] - https://gerrit.wikimedia.org/r/573937 (https://phabricator.wikimedia.org/T245512) (owner: Filippo Giunchedi)
[09:17:45] (CR) Volans: [C: +1] "LGTM for cumin*" [puppet] - https://gerrit.wikimedia.org/r/572014 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[09:18:01] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid
[09:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:13] !log depool mathoid in eqiad for a test
[09:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:09] Operations, Cloud-VPS, cloud-services-team (Kanban): Ferm rules for labstore1004/1005 NFS hosts - https://phabricator.wikimedia.org/T165136 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→None
[09:22:12] (CR)
Filippo Giunchedi: "PCC for (a sample of) all users of service::uwsgi https://puppet-compiler.wmflabs.org/compiler1001/20949/" [puppet] - https://gerrit.wikimedia.org/r/573937 (https://phabricator.wikimedia.org/T245512) (owner: Filippo Giunchedi)
[09:22:14] Operations, cloud-services-team: Ferm rules for cloudbackup2001/2001 - https://phabricator.wikimedia.org/T245808 (MoritzMuehlenhoff)
[09:22:20] Operations, cloud-services-team: Ferm rules for cloudbackup2001/2001 - https://phabricator.wikimedia.org/T245808 (MoritzMuehlenhoff) p:Triage→High
[09:25:03] (CR) Filippo Giunchedi: [C: +2] Switch bast*/cumin*/scandium to standard Partman recipes [puppet] - https://gerrit.wikimedia.org/r/572014 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[09:25:13] (PS3) Filippo Giunchedi: Switch bast*/cumin*/scandium to standard Partman recipes [puppet] - https://gerrit.wikimedia.org/r/572014 (https://phabricator.wikimedia.org/T156955)
[09:27:41] Operations, User-Elukey: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094 (MoritzMuehlenhoff) Open→Resolved I think this bug can be closed in fact.
[09:28:22] (PS1) Jcrespo: WMFBackup: Remove Popen.wait() from commands with long output [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/573956
[09:30:05] (PS2) Muehlenhoff: Switch dbproxy1021 to standard Partman recipe [puppet] - https://gerrit.wikimedia.org/r/573538 (https://phabricator.wikimedia.org/T156955)
[09:32:03] (CR) Muehlenhoff: [C: +2] Switch dbproxy1021 to standard Partman recipe [puppet] - https://gerrit.wikimedia.org/r/573538 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[09:34:55] (PS1) Jcrespo: mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue [puppet] - https://gerrit.wikimedia.org/r/573961 (https://phabricator.wikimedia.org/T138562)
[09:35:51] (CR) Hnowlan: "I'll deploy this as soon as https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/573333/ is approved." [mediawiki-config] - https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: Hnowlan)
[09:41:00] (CR) Jcrespo: [V: +2 C: +2] WMFBackup: Remove Popen.wait() from commands with long output [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/573956 (owner: Jcrespo)
[09:41:25] (CR) Jcrespo: [C: +2] mariadb-backups: Update code to wmfbackup HEAD to fix stalling issue [puppet] - https://gerrit.wikimedia.org/r/573961 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[09:53:37] (CR) Volans: [C: +1] "LGTM, I don't see the parameter used anywhere in prod, didn't check WMCS though."
[puppet] - https://gerrit.wikimedia.org/r/573936 (https://phabricator.wikimedia.org/T245512) (owner: Filippo Giunchedi)
[09:58:52] (PS1) Alexandros Kosiaris: configmaster: Add DNS Discovery disrepancy check [puppet] - https://gerrit.wikimedia.org/r/573963
[10:00:08] (CR) Alexandros Kosiaris: [C: +2] Add system::role for k8s staging roles [puppet] - https://gerrit.wikimedia.org/r/573887 (owner: Muehlenhoff)
[10:01:50] (CR) jerkins-bot: [V: -1] configmaster: Add DNS Discovery disrepancy check [puppet] - https://gerrit.wikimedia.org/r/573963 (owner: Alexandros Kosiaris)
[10:02:49] (CR) Filippo Giunchedi: [C: +1] Remove system role from role::swift::stats_reporter [puppet] - https://gerrit.wikimedia.org/r/573895 (owner: Muehlenhoff)
[10:04:59] (CR) Filippo Giunchedi: [C: +1] "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/573936 (https://phabricator.wikimedia.org/T245512) (owner: Filippo Giunchedi)
[10:06:10] (CR) Volans: [C: -1] "One tech+policy problem, see inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/573937 (https://phabricator.wikimedia.org/T245512) (owner: Filippo Giunchedi)
[10:09:50] (PS2) Alexandros Kosiaris: configmaster: Add DNS Discovery disrepancy check [puppet] - https://gerrit.wikimedia.org/r/573963
[10:10:51] Operations, SRE-Access-Requests: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (jbond) >>! In T244785#5901477, @Capt_Swing wrote: > @jbond is endorsement from @leila and @Ottomata sufficient here? unfortunately no, as the policy currently stands it has to be Nuria...
[10:13:32] (PS1) Majavah: Disallow crats to (un)assign flow-bot group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/573965 (https://phabricator.wikimedia.org/T245716)
[10:14:56] (CR) Jcrespo: [C: +1] "Please add a too TODO comment there with the pending work to "remove the if"."
[puppet] - https://gerrit.wikimedia.org/r/573935 (owner: Muehlenhoff)
[10:18:13] (CR) Filippo Giunchedi: [C: +1] "I'll merge on Mon, thanks Volans for the review" [puppet] - https://gerrit.wikimedia.org/r/573936 (https://phabricator.wikimedia.org/T245512) (owner: Filippo Giunchedi)
[10:19:59] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 42 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:21:00] (CR) Jbond: [C: +1] "thanks for pcc" [puppet] - https://gerrit.wikimedia.org/r/573401 (owner: Ayounsi)
[10:25:34] (PS2) Muehlenhoff: Add a separate class parameter to toggle the auth Icinga check for tendril [puppet] - https://gerrit.wikimedia.org/r/573935
[10:25:39] (CR) Muehlenhoff: "Sure, amended the patch with a comment in PS2" [puppet] - https://gerrit.wikimedia.org/r/573935 (owner: Muehlenhoff)
[10:25:52] (CR) Filippo Giunchedi: service: logging pipeline support for uwsgi (1 comment) [puppet] - https://gerrit.wikimedia.org/r/573937 (https://phabricator.wikimedia.org/T245512) (owner: Filippo Giunchedi)
[10:26:07] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:26:43] (CR) Jbond: Add a separate class parameter to toggle the auth Icinga check for tendril (1 comment) [puppet] - https://gerrit.wikimedia.org/r/573935 (owner: Muehlenhoff)
[10:47:42] (PS1) Filippo Giunchedi: prometheus: use class logstash for jmx_logstash job [puppet] - https://gerrit.wikimedia.org/r/573968
[10:48:35] (PS1) Matěj Suchánek: Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] -
https://gerrit.wikimedia.org/r/573969
[10:49:53] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 37 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:55:59] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:58:49] (PS2) Elukey: Add an-launcher1001 A/AAAA/PTR records [dns] - https://gerrit.wikimedia.org/r/571479 (https://phabricator.wikimedia.org/T244717)
[11:07:59] (CR) MarcoAurelio: [C: +1] Disallow crats to (un)assign flow-bot group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/573965 (https://phabricator.wikimedia.org/T245716) (owner: Majavah)
[11:12:05] (PS2) Lucas Werkmeister: Update wm-bot hostname [puppet] - https://gerrit.wikimedia.org/r/572489
[11:12:58] (CR) Lucas Werkmeister: "> I was aware that this change of hostname will cause this kind of problems but as I don't know who is actively using this feature, I had " [puppet] - https://gerrit.wikimedia.org/r/572489 (owner: Lucas Werkmeister)
[11:13:48] (PS1) Filippo Giunchedi: netbox: log to logging pipeline [puppet] - https://gerrit.wikimedia.org/r/573976 (https://phabricator.wikimedia.org/T245511)
[11:14:59] !log reboot stat1005 - GPU blocked at 100% after issue with tensorflow
[11:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:30] (PS3) Elukey: Add an-launcher1001 A/AAAA/PTR records [dns] - https://gerrit.wikimedia.org/r/571479 (https://phabricator.wikimedia.org/T244717)
[11:19:25] Operations, Analytics, serviceops, vm-requests, Patch-For-Review: Create a ganeti VM in
eqiad: an-launcher1001 - https://phabricator.wikimedia.org/T244717 (10elukey) [11:20:39] (03PS2) 10Filippo Giunchedi: netbox: log to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/573976 (https://phabricator.wikimedia.org/T245511) [11:20:45] (03PS1) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [11:21:58] !log bounce logstash on logstash1023 - see if can catch up with elastic7 kafka lag [11:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:48] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [11:26:49] is phabricator down? can't open any page :/ [11:26:57] same here for me [11:27:03] but not just phab? [11:27:43] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:27:44] I can't even load Wikipedia in reasonable time zeljkof akosiaris [11:27:45] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:27:51] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:27:51] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:27:53] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:27:57] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:07] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:17] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:19] gerrit seems to work just fine [11:28:23] PROBLEM - NTP peers on dns3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [11:28:31] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:31] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:33] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:35] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:35] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:37] PROBLEM - Ensure traffic_exporter 
binds on port 9122 and responds to HTTP requests on cp3051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:51] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:55] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:55] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:55] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:57] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:57] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:57] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:57] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:57] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3064 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:59] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:28:59] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:01] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:01] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:09] Wikis and Phab endlessly loading [11:29:15] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:15] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:17] <_joe_> hauskatze: known [11:29:17] hauskatze: know, SRE'S on it [11:29:17] PROBLEM - Maps edge esams on upload-lb.esams.wikimedia.org is CRITICAL: /private-info/info.json (private tile service info for osm-intl) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response wa [11:29:17] intl/info.json (tile service info for 
osm-intl) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:29:19] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:21] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:23] ack [11:29:23] PROBLEM - Juniper alarms on cr3-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:29:23] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:23] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:23] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:23] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:25] PROBLEM - Recursive DNS on 2620:0:862:1:91:198:174:62 is CRITICAL: Return code of 255 is out of bounds https://wikitech.wikimedia.org/wiki/DNS [11:29:25] PROBLEM - Recursive DNS on 2620:0:862:1:91:198:174:61 is CRITICAL: Return code of 110 is out of bounds https://wikitech.wikimedia.org/wiki/DNS [11:29:25] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3057 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:25] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:39] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:39] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:45] (03PS1) 10CDanis: depool esams [dns] - 10https://gerrit.wikimedia.org/r/573978 [11:29:47] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:29:54] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:30:09] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:30:15] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:30:27] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:30:37] PROBLEM - NTP peers on dns3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/NTP [11:30:43] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.1 200 Ok - 34858 bytes in 0.516 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:30:45] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.1 200 Ok - 34820 bytes in 3.219 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:03] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.1 200 Ok - 34845 bytes in 1.966 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:03] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3061 is OK: HTTP OK: HTTP/1.1 200 Ok - 31696 bytes in 2.170 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:03] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3053 is OK: HTTP OK: HTTP/1.1 200 Ok - 31691 bytes in 0.565 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:03] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3055 is OK: HTTP OK: HTTP/1.1 200 Ok - 31697 bytes in 0.584 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:03] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 24441 bytes in 7.197 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:05] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3053 is OK: HTTP OK: HTTP/1.0 200 OK - 21920 bytes in 0.370 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:05] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP 
requests on cp3064 is OK: HTTP OK: HTTP/1.1 200 Ok - 31868 bytes in 1.212 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:05] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3059 is OK: HTTP OK: HTTP/1.0 200 OK - 24326 bytes in 0.373 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:05] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.1 200 Ok - 31885 bytes in 1.377 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:05] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.1 200 Ok - 31848 bytes in 3.835 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:05] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.1 200 Ok - 31875 bytes in 1.995 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:06] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 24433 bytes in 0.366 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:06] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3055 is OK: HTTP OK: HTTP/1.1 200 Ok - 34630 bytes in 0.376 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:11] RECOVERY - Maps edge esams on upload-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:31:19] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3051 is OK: HTTP OK: HTTP/1.1 200 Ok - 31664 bytes in 0.435 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:19] RECOVERY - Ensure traffic_exporter binds on 
port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22051 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:23] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3051 is OK: HTTP OK: HTTP/1.1 200 Ok - 34495 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:25] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3061 is OK: HTTP OK: HTTP/1.0 200 OK - 21930 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:27] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 24429 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:27] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3065 is OK: HTTP OK: HTTP/1.0 200 OK - 21939 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:27] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.1 200 Ok - 34850 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:27] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.1 200 Ok - 31889 bytes in 0.464 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:29] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3057 is OK: HTTP OK: HTTP/1.1 200 Ok - 34639 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:29] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3063 is OK: HTTP OK: HTTP/1.1 200 Ok - 31697 bytes in 0.452 second 
response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:35] RECOVERY - Recursive DNS on 2620:0:862:1:91:198:174:62 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [11:31:35] RECOVERY - Recursive DNS on 2620:0:862:1:91:198:174:61 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [11:31:43] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 24436 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:43] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 24434 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:51] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3059 is OK: HTTP OK: HTTP/1.1 200 Ok - 34608 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:31:58] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 14933 bytes in 0.492 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:32:03] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.1 200 Ok - 31866 bytes in 0.459 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:03] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.1 200 Ok - 34672 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:11] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3065 is OK: HTTP OK: HTTP/1.1 200 Ok - 31730 bytes in 0.433 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:11] RECOVERY - Ensure 
traffic_manager binds on 443 and responds to HTTP requests on cp3057 is OK: HTTP OK: HTTP/1.1 200 Ok - 31698 bytes in 0.431 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:13] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22039 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:13] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3053 is OK: HTTP OK: HTTP/1.1 200 Ok - 34598 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:17] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3063 is OK: HTTP OK: HTTP/1.0 200 OK - 21928 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:17] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 24437 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:27] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3059 is OK: HTTP OK: HTTP/1.1 200 Ok - 31703 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:29] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.1 200 Ok - 31876 bytes in 0.436 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:35] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.1 200 Ok - 34824 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:41] RECOVERY - NTP peers on dns3001 is OK: NTP OK: Offset -0.000131 secs https://wikitech.wikimedia.org/wiki/NTP [11:32:41] RECOVERY - NTP 
peers on dns3002 is OK: NTP OK: Offset 0.000229 secs https://wikitech.wikimedia.org/wiki/NTP [11:32:49] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3053 is OK: HTTP OK: HTTP/1.0 200 OK - 24320 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:51] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3055 is OK: HTTP OK: HTTP/1.0 200 OK - 21920 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:51] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3063 is OK: HTTP OK: HTTP/1.1 200 Ok - 34621 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:32:57] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3051 is OK: HTTP OK: HTTP/1.0 200 OK - 24244 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:39:23] !log restart varnishkafka on cp3057 (stuck in timeouts to kafka, analytics alarms raised) [11:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:37] 10Operations, 10Phabricator, 10Security-Team, 10Security: Adjust onboarding/offboarding logic to accommodate changes to #security (now acl*security) - https://phabricator.wikimedia.org/T245771 (10jbond) p:05Triage→03Medium a:03jbond [11:44:23] 10Operations, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Should 'doc' machines (i.e. doc1001) have contint-roots as a group? 
- https://phabricator.wikimedia.org/T245691 (10jbond) p:05Triage→03Medium [11:47:39] !log restart varnishkafka-webrequest on cp3056/cp3058/cp3054/cp3064 (stuck in timeouts to kafka, analytics alarms raised) [11:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:51] !log restart varnishkafka-webrequest on cp3052 (stuck in timeouts to kafka, analytics alarms raised) [11:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:54] sigh [11:49:00] elukey: just stuck? [11:49:10] or still showing connectivity issues/degradation? [11:49:30] just stuck, I've seen this behavior before, restarting makes it work [11:49:39] see https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=esams%20prometheus%2Fops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All&from=now-30m&to=now [11:49:57] ack [11:50:01] it must be a weird corner case of librdkafka [11:52:08] all good now, need to go afk, will check later :) [11:57:10] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) python3-os-ken is not required for openstack pike. The ryu driver is used instead (and is available). When trying to create the bgp speaker I... [11:57:49] !log Set VRRP prio cost to 50 on cr3-esams to make it backup VRRP [11:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:54] !log Disabled Telia transit on cr3-esams [11:57:54] heh was about to [11:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:13] PROBLEM - Check systemd state on boron is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:29] <_joe_> uh, again? 
[12:01:35] <_joe_> well nevermind now [12:01:44] * volans checking [12:02:01] <_joe_> it's probably the cron from this morning [12:02:03] PROBLEM - SSH on boron is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:02:10] <_joe_> uhm no. [12:02:12] I'm not able to ssh.. [12:02:17] now yes [12:02:22] load average: 143.09, 112.13, 71.67 [12:02:27] <_joe_> volans: ok... [12:02:32] <_joe_> I think docker is borked [12:03:05] there is an openjdk build in progress too [12:03:12] <_joe_> oh no someone's building java [12:03:19] that might explain it [12:03:26] moritzm: ^^^ [12:03:29] that's me, checking [12:04:07] RECOVERY - SSH on boron is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:04:26] meh: [12:04:28] [2020-02-21 11:45:41,427] Agent[32]: stdout: # There is insufficient memory for the Java Runtime Environment to continue. [12:04:30] [2020-02-21 11:45:41,427] Agent[32]: stdout: # Native memory allocation (mmap) failed to map 81788928 bytes for committing reserved memory. [12:04:58] !log cr3-esams: request chassis fpc offline slot 1 [12:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:06] boron had 16G RAM, but then we reduced it to 8G due to a shortage of RAM on the eqiad Ganeti cluster :-/ [12:05:42] I think we got a couple of new ganeti hosts, I may be wrong though [12:06:11] those are being racked still I think [12:07:08] ah, no. 
those are racked now: https://phabricator.wikimedia.org/T228924 [12:07:51] but the old 16G puppetdb host was shut down, I'll check whether I can bump boron back to 16G [12:08:43] (03PS3) 10Alexandros Kosiaris: configmaster: Add DNS Discovery disrepancy check [puppet] - 10https://gerrit.wikimedia.org/r/573963 [12:08:56] damn you found the task way faster moritzm [12:09:06] next time I will use google :p [12:09:37] PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:09:56] <_joe_> this is expected right ^^ mark [12:10:01] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:10:24] _joe_: I think so, yes [12:11:43] (03CR) 10jerkins-bot: [V: 04-1] configmaster: Add DNS Discovery disrepancy check [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [12:12:09] not sure... [12:13:17] seems cr3->mr1 goes over asw, which is on FPC 1, which is down [12:13:25] <_joe_> ok [12:13:55] so alert expected, whether that's a good topology I have no opinion of right now :) [12:16:40] (03PS4) 10Alexandros Kosiaris: configmaster: Add DNS Discovery disrepancy check [puppet] - 10https://gerrit.wikimedia.org/r/573963 [12:17:04] (03CR) 10Alexandros Kosiaris: "Sample output is:" [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [12:17:07] 10Operations, 10ops-esams, 10netops: 2*10G optics down on cr2-esams - https://phabricator.wikimedia.org/T245520 (10mark) I am pretty sure there are a bunch of optics (of various kinds) in the "spare" switches, in the bottom of rack OE15. Unfortunately those switches are not powered up, and certainly not conf... 
[12:17:32] !log bumped memory for boron.eqiad.wmnet to 16G [12:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:05] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:20:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:29] !log rebooting boron [12:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:02] (03PS2) 10Alexandros Kosiaris: eventgate-analytics-external: Add k8s token [puppet] - 10https://gerrit.wikimedia.org/r/573602 (https://phabricator.wikimedia.org/T233629) [12:22:07] RECOVERY - Check systemd state on boron is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:41] (03CR) 10Volans: [C: 04-1] "Some minor comment inline. The -1 is just a reminder for myself to setup the http proxy at spicerack/cookbook level, already WIP." 
(036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [12:24:49] (03CR) 10Muehlenhoff: [C: 03+2] Add a separate class parameter to toggle the auth Icinga check for tendril [puppet] - 10https://gerrit.wikimedia.org/r/573935 (owner: 10Muehlenhoff) [12:26:51] (03CR) 10Muehlenhoff: [C: 03+2] Remove system role from role::swift::stats_reporter [puppet] - 10https://gerrit.wikimedia.org/r/573895 (owner: 10Muehlenhoff) [12:27:30] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mathoid [12:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:45] !log repool mathoid at eqiad, test complete [12:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:05] !log cr3-esams: Shutdown GRE tunnels over Telia [12:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:21] RECOVERY - Juniper alarms on cr3-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:33:03] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:34:45] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:34:45] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:44:37] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) >>! In T245606#5906590, @aborrero wrote: > > When trying to create the bgp speaker I found this issue: > > ` > 2020-02-21 11:54:04.537 6951... 
[12:49:15] (03PS2) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:51:18] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) Apparently the config makes sense, this is what the bgp speaker would advertise: ` root@cloudcontrol2001-dev:~# neutron bgp-speaker-advertis... [12:52:02] (03PS1) 10Jbond: admin: add support for system users and groups [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) [12:52:07] (03PS1) 10Jbond: admin: add rerprepo system user [puppet] - 10https://gerrit.wikimedia.org/r/573991 [12:52:09] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [12:54:04] (03PS3) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:54:43] (03CR) 10jerkins-bot: [V: 04-1] admin: add support for system users and groups [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [12:57:11] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [12:59:39] (03PS2) 10Jbond: admin: add support for system users and groups [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) [13:15:43] (03PS3) 10Jbond: admin: add support for system users and groups [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) [13:33:13] (03Abandoned) 10CDanis: depool esams [dns] - 10https://gerrit.wikimedia.org/r/573978 (owner: 10CDanis) [13:33:56] 10Operations, 
10Graphite, 10User-jbond: Add SSO support to graphite - https://phabricator.wikimedia.org/T244861 (10jbond) 05Open→03Resolved a:03jbond [13:34:03] 10Operations, 10netops: cr3-esams:fpc1 crash - https://phabricator.wikimedia.org/T245825 (10ayounsi) p:05Triage→03High [13:35:05] 10Operations, 10netops: cr3-esams:fpc1 crash - https://phabricator.wikimedia.org/T245825 (10ayounsi) Service Request ID 2020-0221-0237 has been created. [13:38:15] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.20/includes/resourceloader/ResourceLoaderSkinModule.php: T245778 T245182 T232140 (duration: 01m 00s) [13:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:24] T232140: Separate out logo handling into square image logos and long text/wordmark banner logos - https://phabricator.wikimedia.org/T232140 [13:38:24] T245182: Use of $wgLogoHD was deprecated in Rename configuration variable to $wgLogos - https://phabricator.wikimedia.org/T245182 [13:38:24] T245778: Spike in "Use of ResourceLoaderSkinModule::getAvailableLogos with $wgLogoHD set instead of $wgLogos was deprecated in MediaWiki 1.35." 
- https://phabricator.wikimedia.org/T245778 [13:43:32] (03PS1) 10Jbond: offboard-user: add acl*security to list of protected phab groups [puppet] - 10https://gerrit.wikimedia.org/r/573996 (https://phabricator.wikimedia.org/T245771) [13:49:04] (03PS1) 10Jbond: admin: add ldap only user - emedina [puppet] - 10https://gerrit.wikimedia.org/r/573997 (https://phabricator.wikimedia.org/T244176) [13:49:40] (03CR) 10Alexandros Kosiaris: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [13:51:20] (03CR) 10Jbond: "PCC is still running all the diffs i have checked are related to the fact we now pass home_dir from admin::hashuser to admin::user" [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [13:55:02] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:14:52] (03CR) 10Effie Mouzeli: [C: 03+1] "A couple of suggestions otherwise looks like a good starting point" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [14:18:49] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/20956/netbox1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/573976 (https://phabricator.wikimedia.org/T245511) (owner: 10Filippo Giunchedi) [14:38:24] (03CR) 10Ottomata: [C: 03+1] Add an-launcher1001 A/AAAA/PTR records [dns] - 10https://gerrit.wikimedia.org/r/571479 (https://phabricator.wikimedia.org/T244717) (owner: 10Elukey) [14:41:58] (03PS4) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [14:43:01] (03CR) 10Elukey: [C: 03+2] Add an-launcher1001 A/AAAA/PTR records [dns] - 
10https://gerrit.wikimedia.org/r/571479 (https://phabricator.wikimedia.org/T244717) (owner: 10Elukey) [14:45:00] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [14:59:04] (03PS5) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [15:00:40] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 140.8 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [15:05:48] (03PS1) 10Alexandros Kosiaris: eventgate-analytics-external: Add k8s tokens [labs/private] - 10https://gerrit.wikimedia.org/r/574004 [15:06:01] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [15:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:10] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] eventgate-analytics-external: Add k8s tokens [labs/private] - 10https://gerrit.wikimedia.org/r/574004 (owner: 10Alexandros Kosiaris) [15:09:43] (03CR) 10Alexandros Kosiaris: configmaster: Add DNS Discovery disrepancy check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [15:10:23] (03PS5) 10Alexandros Kosiaris: configmaster: Add DNS Discovery disrepancy check [puppet] - 10https://gerrit.wikimedia.org/r/573963 [15:11:34] (03CR) 10Ayounsi: Add cookbook to control CF BGP advertisements (0314 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [15:12:02] (03PS5) 10Ayounsi: Add cookbook to control CF BGP advertisements [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 [15:17:07] (03CR) 10Bstorm: [C: 03+1] "That looks like the only one." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/573823 (owner: 10BryanDavis) [15:23:34] (03PS1) 10Jbond: profile::tlsproxy::envoy: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) [15:23:36] (03PS1) 10Jbond: profile::tlsproxy::envoy: add support for acme certs [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) [15:23:38] (03PS1) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574011 (https://phabricator.wikimedia.org/T240941) [15:31:56] (03PS2) 10Jbond: profile::tlsproxy::envoy: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) [15:35:44] (03PS3) 10Jbond: profile::tlsproxy::envoy: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) [15:35:59] (03PS1) 10Andrew Bogott: wmfsink: clean up puppet records associated with legacy domains [puppet] - 10https://gerrit.wikimedia.org/r/574013 [15:39:09] (03CR) 10Ayounsi: [C: 03+2] Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [15:39:40] (03CR) 10Jhedden: [C: 03+1] wmfsink: clean up puppet records associated with legacy domains [puppet] - 10https://gerrit.wikimedia.org/r/574013 (owner: 10Andrew Bogott) [15:40:32] (03PS8) 10Ayounsi: Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 [15:41:00] (03CR) 10Jbond: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/20960" [puppet] - 10https://gerrit.wikimedia.org/r/574009 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [15:41:13] (03PS2) 10Jbond: profile::tlsproxy::envoy: add support for acme certs [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) [15:42:43] (03CR) 10Jbond: "PCC: 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/20961" [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [15:42:58] (03PS2) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574011 (https://phabricator.wikimedia.org/T240941) [15:44:32] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Clean up SSL configuration - https://phabricator.wikimedia.org/T240941 (10jbond) [15:46:33] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Clean up SSL configuration - https://phabricator.wikimedia.org/T240941 (10jbond) SSL validation has been turned on for conftool however client authentication will need to wait untill we migrate to ectd v3 as RBAC is not optimised in v2 [15:49:52] RECOVERY - Host flowspec1001 is UP: PING OK - Packet loss = 0%, RTA = 2.45 ms [15:51:10] (03PS3) 10Jbond: profile::tlsproxy::envoy: add support for acme certs [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) [15:51:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:19] (03PS3) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574011 (https://phabricator.wikimedia.org/T240941) [15:51:41] (03PS2) 10RLazarus: Convert all the apache-fast-test URLs to httpbb tests. 
[puppet] - 10https://gerrit.wikimedia.org/r/573760 [15:55:24] (03PS4) 10Jbond: profile::tlsproxy::envoy: add support for acme certs [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) [15:55:39] (03PS4) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574011 (https://phabricator.wikimedia.org/T240941) [15:55:45] !log add gobgpd to buster-wikimedia repo [15:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:30] (03PS5) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574011 (https://phabricator.wikimedia.org/T240941) [15:57:10] 10Operations, 10Analytics, 10serviceops, 10vm-requests: Create a ganeti VM in eqiad: an-launcher1001 - https://phabricator.wikimedia.org/T244717 (10elukey) 05Stalled→03Open ` Creating new VM named an-launcher1001.eqiad.wmnet in eqiad with row=C vcpu=4 memory=8 gigabytes disk=100 gigabytes link=analytic... [15:58:18] (03Abandoned) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574011 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [15:59:55] XioNoX: <3 gobgpd [16:00:51] (03PS1) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941) [16:02:42] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/20965/idp1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/574011 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [16:03:04] ACKNOWLEDGEMENT - Check systemd state on flowspec1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Ayounsi Working on it https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:56] (03PS2) 10Andrew Bogott: wmfsink: clean up puppet records associated with legacy domains [puppet] - 10https://gerrit.wikimedia.org/r/574013 [16:03:58] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks: include the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574021 [16:04:00] (03PS1) 10Andrew Bogott: wmcs-novastats-puppetleaks: make aware of the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574022 [16:04:02] (03PS1) 10Andrew Bogott: wmcs-novastats-proxyleaks.py: make aware of the new wmcloud.org domain [puppet] - 10https://gerrit.wikimedia.org/r/574023 [16:04:13] (03CR) 10jerkins-bot: [V: 04-1] profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [16:05:08] (03CR) 10jerkins-bot: [V: 04-1] wmcs-novastats-dnsleaks: include the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574021 (owner: 10Andrew Bogott) [16:05:53] (03CR) 10jerkins-bot: [V: 04-1] wmcs-novastats-puppetleaks: make aware of the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574022 (owner: 10Andrew Bogott) [16:07:51] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1014 - https://phabricator.wikimedia.org/T241494 (10Andrew) This host is now drained and ready for maintenance. 
[16:11:42] (03PS1) 10Elukey: Introduce an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574025 (https://phabricator.wikimedia.org/T244717) [16:11:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/573997 (https://phabricator.wikimedia.org/T244176) (owner: 10Jbond) [16:12:01] (03PS2) 10Andrew Bogott: wmcs-novastats-dnsleaks: include the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574021 [16:12:03] (03PS3) 10Andrew Bogott: wmfsink: clean up puppet records associated with legacy domains [puppet] - 10https://gerrit.wikimedia.org/r/574013 [16:12:05] (03PS2) 10Andrew Bogott: wmcs-novastats-puppetleaks: make aware of the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574022 [16:12:07] (03PS2) 10Andrew Bogott: wmcs-novastats-proxyleaks.py: make aware of the new wmcloud.org domain [puppet] - 10https://gerrit.wikimedia.org/r/574023 [16:12:23] (03PS2) 10Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [16:12:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/573996 (https://phabricator.wikimedia.org/T245771) (owner: 10Jbond) [16:13:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_cxserver_cluster_eqiad,swagger_check_eventgate_main_http_cluster_to_delete_eqiad,swagger_check_termbox_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:14:04] (03CR) 10Jbond: [C: 03+2] admin: add ldap only user - emedina [puppet] - 10https://gerrit.wikimedia.org/r/573997 (https://phabricator.wikimedia.org/T244176) (owner: 10Jbond) [16:15:25] (03CR) 10Elukey: [C: 03+2] Introduce an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574025 (https://phabricator.wikimedia.org/T244717) (owner: 
10Elukey) [16:16:41] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Request LDAP access to the WMF group for Edna M - https://phabricator.wikimedia.org/T244176 (10jbond) 05Stalled→03Resolved a:03jbond @Edna you should now have access to wmf, sorry for the delay i missed the update last week. please re-open if t... [16:16:59] (03CR) 10Jbond: [C: 03+2] offboard-user: add acl*security to list of protected phab groups [puppet] - 10https://gerrit.wikimedia.org/r/573996 (https://phabricator.wikimedia.org/T245771) (owner: 10Jbond) [16:17:10] (03PS3) 10Andrew Bogott: wmcs-novastats-dnsleaks: include the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574021 [16:17:12] (03PS4) 10Andrew Bogott: wmfsink: clean up puppet records associated with legacy domains [puppet] - 10https://gerrit.wikimedia.org/r/574013 [16:17:14] (03PS3) 10Andrew Bogott: wmcs-novastats-puppetleaks: make aware of the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574022 [16:17:16] (03PS3) 10Andrew Bogott: wmcs-novastats-proxyleaks.py: make aware of the new wmcloud.org domain [puppet] - 10https://gerrit.wikimedia.org/r/574023 [16:17:18] (03CR) 10jerkins-bot: [V: 04-1] (WIP) mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [16:18:52] 10Operations, 10Phabricator, 10Security-Team, 10Patch-For-Review, 10Security: Adjust onboarding/offboarding logic to accommodate changes to #security (now acl*security) - https://phabricator.wikimedia.org/T245771 (10jbond) 05Open→03Resolved @chasemp thanks for the heads up i have updated the offboard... [16:20:53] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:21:04] 10Operations, 10Release-Engineering-Team, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10jijiki) [16:21:28] 10Operations, 10Phabricator, 10Security-Team, 10Patch-For-Review, 10Security: Adjust onboarding/offboarding logic to accommodate changes to #security (now acl*security) - https://phabricator.wikimedia.org/T245771 (10chasemp) >>! In T245771#5907231, @jbond wrote: > @chasemp thanks for the heads up i have... [16:22:02] 10Operations, 10Release-Engineering-Team, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10jijiki) [16:22:05] RECOVERY - Device not healthy -SMART- on cloudvirt1008 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1008&var-datasource=eqiad+prometheus/ops [16:22:18] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) Maybe related: https://github.com/unbit/uwsgi/issues/844 [16:22:21] (03PS3) 10Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [16:22:59] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) 05Open→03Resolved All of the power ports are documented in netbox and labeled on the pdu towers. The port groups for network hardware have been setup for easy reboot for the entire grou... 
[16:23:01] 10Operations, 10ops-esams: Prepare racks OE14, OE15 and OE16 with new infrastructure - https://phabricator.wikimedia.org/T184064 (10RobH) [16:23:13] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:23:18] (03CR) 10jerkins-bot: [V: 04-1] (WIP) mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [16:23:21] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) [16:24:26] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (10RhinosF1) >>! In T245524#5898361, @LDickinsonWMF wrote: > Hi, @jbond, @Aklapper, @RhinosF1, and @Varnent: This is my Wikimedia account. My username orginally was LDickinson (WM... 
[16:27:10] (03PS1) 10Ayounsi: GoBGP fix configuration [puppet] - 10https://gerrit.wikimedia.org/r/574030 [16:30:32] (03CR) 10Ayounsi: [C: 03+2] GoBGP fix configuration [puppet] - 10https://gerrit.wikimedia.org/r/574030 (owner: 10Ayounsi) [16:34:09] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2023 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:34:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:41:35] (03PS4) 10Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [16:41:55] (03CR) 10Cwhite: [C: 03+1] prometheus: use class logstash for jmx_logstash job [puppet] - 10https://gerrit.wikimedia.org/r/573968 (owner: 10Filippo Giunchedi) [16:42:05] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:42:19] 10Operations, 10Analytics, 10serviceops, 10vm-requests, 10Patch-For-Review: Create a ganeti VM in eqiad: an-launcher1001 - https://phabricator.wikimedia.org/T244717 (10elukey) 05Open→03Resolved a:03elukey [16:43:43] (03CR) 10Herron: [C: 03+2] hieradata: move lists interface alias definitions to host yaml [puppet] - 10https://gerrit.wikimedia.org/r/573711 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [16:44:16] (03PS1) 10Elukey: Add a new Analytics role to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574032 (https://phabricator.wikimedia.org/T243934) [16:46:26] (03PS2) 10Elukey: Add a new Analytics role to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574032 (https://phabricator.wikimedia.org/T243934) [16:48:14] (03CR) 10Herron: [C: 03+1] prometheus: use class logstash for jmx_logstash job [puppet] - 10https://gerrit.wikimedia.org/r/573968 (owner: 10Filippo Giunchedi) [16:50:03] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:46] (03PS1) 10Ppchelko: Beta sessionstore: don't use TLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574034 (https://phabricator.wikimedia.org/T224712) [16:58:00] (03PS5) 10Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [17:00:40] (03PS1) 10Elukey: Add fake keytabs for an-launcher1001 [labs/private] - 10https://gerrit.wikimedia.org/r/574037 [17:01:09] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake keytabs for an-launcher1001 [labs/private] - 10https://gerrit.wikimedia.org/r/574037 (owner: 10Elukey) [17:01:56] (03CR) 10Elukey: [C: 03+2] Add a new Analytics role to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574032 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [17:05:13] 10Operations: enforce-users-groups tries to remove package created gobgp user - https://phabricator.wikimedia.org/T245847 (10ayounsi) p:05Triage→03Medium [17:06:40] (03CR) 10RLazarus: [C: 03+1] "Mostly reviewed for Python style -- I don't know enough to have strong opinions about this tool at a higher level, but it seems like a goo" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [17:07:21] (03PS1) 10Elukey: role::analytics_cluster::launcher: add Hadoop common hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/574038 (https://phabricator.wikimedia.org/T243934) [17:09:38] (03CR) 1020after4: [C: 03+1] Remove support for < Buster from Phabricator class [puppet] - 10https://gerrit.wikimedia.org/r/563469 (owner: 10Muehlenhoff) [17:12:07] (03PS1) 10Jbond: enforce-user-groups: update script to ignore Dynamic users [puppet] - 10https://gerrit.wikimedia.org/r/574039 (https://phabricator.wikimedia.org/T245847) [17:12:09] 10Operations: enforce-users-groups tries to remove package created gobgp user - 
https://phabricator.wikimedia.org/T245847 (10jbond) > it seems that your system has the Dynamic user in the result of getent passwd, this is not the case on my system so it may have changed recently. > could you raise a phab task to... [17:12:32] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10chasemp) @dzahn are there instructions somewhere for archiving a list? [17:14:41] (03CR) 10Volans: "One nit inline to make it more robust." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [17:15:26] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add Hadoop common hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/574038 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [17:15:39] (03CR) 10BryanDavis: [C: 03+1] Update wm-bot hostname [puppet] - 10https://gerrit.wikimedia.org/r/572489 (owner: 10Lucas Werkmeister) [17:24:39] (03Abandoned) 10Jbond: release: ensure reprepo uid/gid is the same on all servers [puppet] - 10https://gerrit.wikimedia.org/r/573282 (https://phabricator.wikimedia.org/T245612) (owner: 10Jbond) [17:29:36] (03CR) 10Ayounsi: Add cookbook to control CF BGP advertisements (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [17:35:45] PROBLEM - Device not healthy -SMART- on cloudvirt1008 is CRITICAL: cluster=wmcs device=cciss,17 instance=cloudvirt1008:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1008&var-datasource=eqiad+prometheus/ops [17:37:30] (03PS2) 10RobH: adding new dell skus [software] - 10https://gerrit.wikimedia.org/r/573611 [17:39:04] (03CR) 10RobH: [C: 03+2] adding new dell skus [software] - 10https://gerrit.wikimedia.org/r/573611 (owner: 10RobH) [17:47:09] (03CR) 10Herron: [C: 03+2] role::lists ensure 
apache mod_cgi enabled [puppet] - 10https://gerrit.wikimedia.org/r/573732 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [17:52:00] (03PS1) 10Elukey: role::analytics_cluster::launcher: add kerberos and base profiles [puppet] - 10https://gerrit.wikimedia.org/r/574042 (https://phabricator.wikimedia.org/T243934) [17:53:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 43 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:54:51] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add kerberos and base profiles [puppet] - 10https://gerrit.wikimedia.org/r/574042 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [18:00:49] jouncebot: now [18:00:50] No deployments scheduled for the next 65 hour(s) and 29 minute(s) [18:01:02] Eurgh, it got prematurely archived. [18:01:14] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10Aklapper) >>! In T230951#5907499, @chasemp wrote: > are there instructions somewhere for archiving a list? https://wikitech.wikimedi... [18:01:24] jouncebot: refresh [18:01:25] I refreshed my knowledge about deployments. [18:01:28] jouncebot: now [18:01:28] For the next 13 hour(s) and 58 minute(s): NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200221T0800) [18:01:32] Success. 
[18:02:35] \o/ [18:06:15] (03PS4) 10Andrew Bogott: wmcs-novastats-dnsleaks: include the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574021 [18:06:18] (03PS5) 10Andrew Bogott: wmfsink: clean up puppet records associated with legacy domains [puppet] - 10https://gerrit.wikimedia.org/r/574013 [18:06:20] (03PS4) 10Andrew Bogott: wmcs-novastats-puppetleaks: make aware of the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574022 [18:06:22] (03PS4) 10Andrew Bogott: wmcs-novastats-proxyleaks.py: make aware of the new wmcloud.org domain [puppet] - 10https://gerrit.wikimedia.org/r/574023 [18:10:07] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks: include the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574021 (owner: 10Andrew Bogott) [18:12:10] (03PS2) 10Dwisehaupt: Plumb in frdb2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/573763 (https://phabricator.wikimedia.org/T245566) [18:14:10] (03PS1) 10Elukey: role::analytics_cluster::launcher: set hive-site.xml in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/574044 (https://phabricator.wikimedia.org/T240880) [18:16:06] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10Dzahn) @chasemp All you have to run is "rmlist " on fermium. By default that will keep the archives but close the list oth... 
[18:24:56] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:33:26] ACKNOWLEDGEMENT - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP Ayounsi https://phabricator.wikimedia.org/T245825 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:34:50] !log re-enable GRE tunnels on cr3-esams - T245825 [18:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:59] 10Operations, 10ops-eqsin: snag asset tags from ulsfo, ship some to eqsin - https://phabricator.wikimedia.org/T245056 (10RobH) Please note the asset tags were mailed to Jin on 2020-02-18, once he receives them and confirms (via hangout message), I'll close this task. [18:34:59] T245825: cr3-esams:fpc1 crash - https://phabricator.wikimedia.org/T245825 [18:35:26] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:52] (03CR) 10Papaul: [C: 03+1] Plumb in frdb2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/573763 (https://phabricator.wikimedia.org/T245566) (owner: 10Dwisehaupt) [18:54:52] RECOVERY - mysqld processes #page on labsdb1012 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:57:17] (03PS1) 10CRusnov: reports: Make it so external stuff cannot break things [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574057 (https://phabricator.wikimedia.org/T239119) [18:59:20] RECOVERY - MariaDB read only wikireplica on labsdb1012 is OK: Version 10.1.43-MariaDB, Uptime 62s, read_only: True, 10.81 QPS, connection latency: 0.001996s, query latency: 0.003713s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:59:37] (03CR) 
10Andrew Bogott: [C: 03+2] wmfsink: clean up puppet records associated with legacy domains [puppet] - 10https://gerrit.wikimedia.org/r/574013 (owner: 10Andrew Bogott) [19:01:14] RECOVERY - mysqld processes #page on labsdb1011 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [19:01:53] (03CR) 10Effie Mouzeli: [C: 03+1] configmaster: Add DNS Discovery disrepancy check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [19:02:10] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [19:02:38] RECOVERY - MariaDB read only wikireplica on labsdb1011 is OK: Version 10.1.43-MariaDB, Uptime 116s, read_only: True, 15.67 QPS, connection latency: 0.004428s, query latency: 0.000815s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [19:03:04] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [19:03:30] RECOVERY - Check systemd state on labsdb1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:40] (03CR) 10Effie Mouzeli: [C: 04-1] "For some reason, in the $routes variable, we get "routes": [{"aliases": ["/eqiad/mw/"], "route": "PoolRoute|eqiad"}, "undef", {"aliases": " [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [19:06:20] (03CR) 10Effie Mouzeli: [C: 04-1] "> For some reason, in the $routes variable, we get "routes":" [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [19:08:06] Any haproxy alert, that's me (will not page) [19:09:36] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:09:47] (03CR) 10Ottomata: [C: 03+1] role::analytics_cluster::launcher: set hive-site.xml in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/574044 (https://phabricator.wikimedia.org/T240880) (owner: 10Elukey) [19:10:34] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [19:11:02] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-puppetleaks: make aware of the new .wikimedia.cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/574022 (owner: 10Andrew Bogott) [19:11:05] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-proxyleaks.py: make aware of the new wmcloud.org domain [puppet] - 10https://gerrit.wikimedia.org/r/574023 (owner: 10Andrew Bogott) [19:11:26] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [19:12:05] ACKNOWLEDGEMENT - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui recovery labsdb1011 https://wikitech.wikimedia.org/wiki/HAProxy [19:12:05] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui recovery labsdb1011 https://wikitech.wikimedia.org/wiki/HAProxy [19:15:54] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:20:08] (03PS1) 10Jhedden: haproxy: update systemd service for buster [puppet] - 10https://gerrit.wikimedia.org/r/574063 [19:24:41] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for mw236[6-9], mw237[0-6] [dns] - 10https://gerrit.wikimedia.org/r/573713 
(owner: 10Papaul) [19:24:51] (03PS4) 10Papaul: DNS: Add mgmt and production DNS for mw236[6-9], mw237[0-6] [dns] - 10https://gerrit.wikimedia.org/r/573713 [19:24:54] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: Add mgmt and production DNS for mw236[6-9], mw237[0-6] [dns] - 10https://gerrit.wikimedia.org/r/573713 (owner: 10Papaul) [19:27:08] (03CR) 10Volans: [C: 04-1] "updated for context" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573937 (https://phabricator.wikimedia.org/T245512) (owner: 10Filippo Giunchedi) [19:28:40] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Dwisehaupt) [19:29:11] (03CR) 10Jhedden: "I'm looking to use this module on Debian Buster and would like to update the systemd configuration to support the new package in Buster." [puppet] - 10https://gerrit.wikimedia.org/r/574063 (owner: 10Jhedden) [19:35:57] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) [19:39:03] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574057 (https://phabricator.wikimedia.org/T239119) (owner: 10CRusnov) [19:39:17] (03PS1) 10RLazarus: Split httpbb.py into two modules. [software/httpbb] - 10https://gerrit.wikimedia.org/r/574066 [19:39:28] 10Operations, 10observability, 10serviceops, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) @herron any ideas how to proceed here? Is there someone who can help? 
[19:39:52] 10Operations, 10Beta-Cluster-Infrastructure, 10observability, 10serviceops: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) [19:41:56] (03PS1) 10RLazarus: Update httpbb symlink for new filename, following https://gerrit.wikimedia.org/r/574066. [puppet] - 10https://gerrit.wikimedia.org/r/574067 [19:42:58] (03PS2) 10RLazarus: httpbb: Update symlink for new filename, following https://gerrit.wikimedia.org/r/574066. [puppet] - 10https://gerrit.wikimedia.org/r/574067 [19:44:06] (03CR) 10jerkins-bot: [V: 04-1] httpbb: Update symlink for new filename, following https://gerrit.wikimedia.org/r/574066. [puppet] - 10https://gerrit.wikimedia.org/r/574067 (owner: 10RLazarus) [19:44:44] (03CR) 10Eevans: [C: 03+1] "Ship it!" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574034 (https://phabricator.wikimedia.org/T224712) (owner: 10Ppchelko) [19:44:59] jerkins 😠 [19:46:11] (03PS3) 10RLazarus: httpbb: Update symlink target, following https://gerrit.wikimedia.org/r/574066. [puppet] - 10https://gerrit.wikimedia.org/r/574067 [19:50:03] (03CR) 10Volans: "Sorry for the lazy review, I didn't look too much in depth but I've two questions, see inline. 
And sorry if those were already discussed a" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [19:54:11] (03PS2) 10Ppchelko: Beta sessionstore: don't use TLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574034 (https://phabricator.wikimedia.org/T224712) [19:54:19] (03CR) 10Ppchelko: Beta sessionstore: don't use TLS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574034 (https://phabricator.wikimedia.org/T224712) (owner: 10Ppchelko) [19:55:29] (03PS1) 10Papaul: DHCP: Add MAC address for mw236[6-9] mw237[0-6] [puppet] - 10https://gerrit.wikimedia.org/r/574072 (https://phabricator.wikimedia.org/T241852) [19:58:40] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for mw236[6-9] mw237[0-6] [puppet] - 10https://gerrit.wikimedia.org/r/574072 (https://phabricator.wikimedia.org/T241852) (owner: 10Papaul) [20:07:51] (03PS1) 10Elukey: role::analytics_cluster::launcher: enable /mnt/hdfs kerberos checks [puppet] - 10https://gerrit.wikimedia.org/r/574076 [20:08:21] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2366.codfw.wmnet ` The log can be found in `/var/log... [20:09:34] 10Operations, 10Beta-Cluster-Infrastructure, 10observability, 10serviceops: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10herron) Looking a bit closer I think this is happening because the nodes in labs are assigned their roles/profiles/etc via the exte... 
[20:11:17] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: enable /mnt/hdfs kerberos checks [puppet] - 10https://gerrit.wikimedia.org/r/574076 (owner: 10Elukey) [20:22:08] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2367.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [20:27:26] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:42] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:25] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2366.codfw.wmnet'] ` and were **ALL** successful. [20:35:16] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2368.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [20:36:08] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:24] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:09] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2367.codfw.wmnet'] ` and were **ALL** successful. 
[20:45:26] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2369.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [20:50:14] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:27] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:41] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:54:57] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:55:51] (03PS1) 10Dzahn: installserver: allow for multiple failover servers at once [puppet] - 10https://gerrit.wikimedia.org/r/574088 (https://phabricator.wikimedia.org/T224576) [20:57:10] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2368.codfw.wmnet'] ` and were **ALL** successful. [20:57:45] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2370.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[20:58:39] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:24] (03CR) 10Dzahn: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/572312 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [21:00:56] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:43] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/20971/" [puppet] - 10https://gerrit.wikimedia.org/r/574088 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [21:02:35] (03CR) 10Dzahn: [C: 04-1] "duplicate declaration because of the ".each" loop. gotta use $title in resource names" [puppet] - 10https://gerrit.wikimedia.org/r/574088 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [21:05:40] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2369.codfw.wmnet'] ` and were **ALL** successful. [21:06:25] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2371.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[21:06:59] (03PS2) 10Dzahn: installserver: allow for multiple failover servers at once [puppet] - 10https://gerrit.wikimedia.org/r/574088 (https://phabricator.wikimedia.org/T224576) [21:08:34] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/20972/install1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/574088 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [21:09:15] (03PS10) 10Effie Mouzeli: WIP mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [21:09:39] (03PS4) 10Effie Mouzeli: WIP hieradata: test streaming apache logs to logstash from mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/572057 (https://phabricator.wikimedia.org/T244472) [21:10:45] (03CR) 10Dzahn: [V: 03+1 C: 03+2] installserver: allow for multiple failover servers at once [puppet] - 10https://gerrit.wikimedia.org/r/574088 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [21:11:01] 10Operations, 10SRE-tools: Cookbook sre.hosts.downtime displayed on tools.wmflabs.org - https://phabricator.wikimedia.org/T245871 (10Etonkovidova) [21:11:01] (03PS3) 10Dzahn: installserver: allow for multiple failover servers at once [puppet] - 10https://gerrit.wikimedia.org/r/574088 (https://phabricator.wikimedia.org/T224576) [21:11:53] (03PS11) 10Effie Mouzeli: WIP mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [21:12:13] (03PS5) 10Effie Mouzeli: WIP hieradata: test streaming apache logs to logstash from mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/572057 (https://phabricator.wikimedia.org/T244472) [21:12:43] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:57] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add dduvall to archiva-deployers - 
https://phabricator.wikimedia.org/T245872 (10thcipriani) [21:15:05] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:07] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add dduvall to archiva-deployers - https://phabricator.wikimedia.org/T245872 (10Dzahn) 05Open→03Resolved a:03Dzahn done! dduvall is already MW deployer and shell admin. added to the LDAP group for archiva deploys [21:19:48] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2370.codfw.wmnet'] ` and were **ALL** successful. [21:20:22] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:12] (03PS1) 10Thcipriani: Gerrit 2.16.16 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/574092 (https://phabricator.wikimedia.org/T200739) [21:22:36] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:30] !log LDAP - added dduvall to archiva-deployers [21:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:41] !log LDAP - added ldickinson to wmf [21:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:21] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2371.codfw.wmnet'] ` and were **ALL** successful.
[21:27:45] (03PS1) 10Dzahn: admins: add Lauren Dickinson to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/574093 (https://phabricator.wikimedia.org/T245524) [21:27:56] 10Operations, 10SRE-tools: Cookbook sre.hosts.downtime displayed on tools.wmflabs.org - https://phabricator.wikimedia.org/T245871 (10Volans) @Etonkovidova what would be the issue? As far as I know that page is just listing the last few items of the [[ https://wikitech.wikimedia.org/wiki/Server_Admin_Log | SA... [21:28:53] 10Operations, 10Beta-Cluster-Infrastructure, 10observability, 10serviceops: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) I have uploaded a patch which I manually tried on beta, this seems to work, but sadly, puppet breaks a bit further down the... [21:29:45] (03PS1) 10Holger Knust: WIP changeprop: New helmfiles for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) [21:29:52] (03CR) 10Dzahn: [C: 03+2] admins: add Lauren Dickinson to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/574093 (https://phabricator.wikimedia.org/T245524) (owner: 10Dzahn) [21:31:35] (03PS2) 10Dzahn: admins: add Lauren Dickinson to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/574093 (https://phabricator.wikimedia.org/T245524) [21:36:04] (03PS1) 10Dwisehaupt: Add IPs for new frack hosts: civi2001, frpm2001 [dns] - 10https://gerrit.wikimedia.org/r/574097 (https://phabricator.wikimedia.org/T242270) [21:41:06] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (10Dzahn) done! i added "ldickinson" to the "wmf" LDAP group and the patch above to go with it. @lookd_up It should work now. The phabricator user name is...
[21:41:27] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (10Dzahn) 05Open→03Resolved [21:41:34] _joe_: i *think* i learned that train deployments touch the parsoid cluster too [21:41:53] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (10RhinosF1) [21:42:19] (03PS2) 10Jhedden: haproxy: update systemd service for buster [puppet] - 10https://gerrit.wikimedia.org/r/574063 (https://phabricator.wikimedia.org/T236606) [21:42:22] _joe_: so the parsoid cluster should have parsoid in the vendor repo already in wmf.20, as a result of https://gerrit.wikimedia.org/r/572047 ? [21:42:29] 10Operations, 10LDAP-Access-Requests: access to Superset for Alex Hollender - https://phabricator.wikimedia.org/T244490 (10Dzahn) a:03MNovotny_WMF [21:42:49] 10Operations, 10LDAP-Access-Requests: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10Dzahn) a:03Moushira [21:43:13] _joe_: so why aren't we getting version conflict on the parsoid cluster between that version of parsoid and the one deployed in /srv/deployment/parsoid/deploy/src ? [21:43:25] (well, those versions are pretty much identical. but still) [21:44:51] Planning to deploy sec patch for T232932 soon. 
[21:45:10] !log andrew@deploy1001 Started deploy [horizon/deploy@13ca90a]: added a warning about the public git history to the hiera edit panel [21:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:21] !log andrew@deploy1001 Finished deploy [horizon/deploy@13ca90a]: added a warning about the public git history to the hiera edit panel (duration: 00m 11s) [21:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:48] (03PS2) 10Dzahn: site: add apt[12]001.wikimedia.org with role::apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/572312 (https://phabricator.wikimedia.org/T224576) [21:47:04] (03CR) 10Dzahn: "this should be fixed now. primary server will have 3 crons to push to all 3 (install2001, apt1001, apt2001) secondary servers and they sho" [puppet] - 10https://gerrit.wikimedia.org/r/572312 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [21:47:29] (03CR) 10Volans: [C: 03+1] "LGTM, see inline for one generic comment. Can be postponed too." (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/574066 (owner: 10RLazarus) [21:48:33] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574067 (owner: 10RLazarus) [21:49:13] 10Operations, 10Beta-Cluster-Infrastructure, 10observability, 10serviceops: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) With a little bit more fiddling, I managed to run puppet on ssh deployment-mediawiki-09.deployment-prep.eqiad.wmflabs! @her... 
[21:49:21] !log andrew@deploy1001 Started deploy [horizon/deploy@a8f2ea9]: added a warning about the public git history to the hiera edit panel -- take two [21:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:02] !log andrew@deploy1001 Finished deploy [horizon/deploy@a8f2ea9]: added a warning about the public git history to the hiera edit panel -- take two (duration: 03m 41s) [21:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:31] 10Operations, 10SRE-tools: Cookbook sre.hosts.downtime displayed on tools.wmflabs.org - https://phabricator.wikimedia.org/T245871 (10RhinosF1) 05Open→03Invalid >>! In T245871#5908323, @Volans wrote: > @Etonkovidova what would be the issue? As far as I know that page is just listing the last few items of th... [21:56:13] !log sbassett@deploy1001 Started scap: Deploy security fix for T232932 [21:56:14] (03PS12) 10Effie Mouzeli: WIP mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [21:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:12] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (No Need By Date Provided) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Andrew) Can I get an update on who's task this is now? The last comment is asking @Jclark-ctr to follow up but on IRC h... [22:01:48] !log sbassett@deploy1001 Finished scap: Deploy security fix for T232932 (duration: 05m 35s) [22:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:11] (03PS1) 10Dzahn: delete role::repositoryserver, duplicate of role::apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/574104 [22:02:47] (03CR) 10Dzahn: "We both did the same thing. I had vague memories you already did that but when i created apt_repo i did not see it." 
[puppet] - 10https://gerrit.wikimedia.org/r/574104 (owner: 10Dzahn) [22:03:13] (03CR) 10Dzahn: "this needed the additional profile for the http server anyways" [puppet] - 10https://gerrit.wikimedia.org/r/574104 (owner: 10Dzahn) [22:05:12] (03PS4) 10Jforrester: apache: Stop aliasing zero.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) [22:06:02] (03CR) 10Dzahn: "duplicate of https://gerrit.wikimedia.org/r/524925 ?" [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [22:08:11] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574057 (https://phabricator.wikimedia.org/T239119) (owner: 10CRusnov) [22:11:56] (03CR) 10Volans: Add cookbook to control CF BGP advertisements (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [22:12:26] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (No Need By Date Provided) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Jclark-ctr) Cabling is finished still being configured by Chris [22:13:28] (03PS1) 10Dzahn: aptrepo/install: move https monitoring to aptrepo profile [puppet] - 10https://gerrit.wikimedia.org/r/574106 (https://phabricator.wikimedia.org/T224576) [22:18:15] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops: Give all members of the Parsing team production deployment access - https://phabricator.wikimedia.org/T245877 (10Jdforrester-WMF) [22:18:37] (03CR) 10Ppchelko: [C: 04-1] "We need staging too." 
(0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [22:19:31] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (10Jdforrester-WMF) [22:25:52] (03PS1) 10Krinkle: Fix Windows-style CR/LF line endings in ngwikimedia.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574107 [22:30:53] (03PS2) 10Krinkle: Fix Windows-style CR/LF line endings in ngwikimedia.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574107 [22:30:53] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/20975/install2002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/574106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [22:31:15] (03CR) 10Jforrester: "Nice spot, thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574107 (owner: 10Krinkle) [22:31:38] (03CR) 10Krinkle: [C: 03+2] Fix Windows-style CR/LF line endings in ngwikimedia.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574107 (owner: 10Krinkle) [22:31:54] James_F: I happen to notice it in a diff locally [22:31:58] ^M^M^M^M :) [22:32:03] Ha. [22:32:12] Yeah, that'd stand out. [22:32:38] (03Merged) 10jenkins-bot: Fix Windows-style CR/LF line endings in ngwikimedia.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574107 (owner: 10Krinkle) [22:32:55] (03CR) 10Dzahn: "this means we avoid having multiple checks in Icinga doing the same thing" [puppet] - 10https://gerrit.wikimedia.org/r/574106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [22:38:08] (03PS3) 10Krinkle: The preprocessorClass property in $wgParserConf doesn't do anything any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) (owner: 10C. 
Scott Ananian) [22:38:17] (03CR) 10Krinkle: [C: 03+1] The preprocessorClass property in $wgParserConf doesn't do anything any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) (owner: 10C. Scott Ananian) [22:40:44] (03CR) 10Jforrester: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [22:43:03] ACKNOWLEDGEMENT - Device not healthy -SMART- on cloudvirt1008 is CRITICAL: cluster=wmcs device=cciss,17 instance=cloudvirt1008:9100 job=node site=eqiad andrew bogott T245815 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1008&var-datasource=eqiad+prometheus/ops [22:50:05] (03PS1) 10Dzahn: site: add apt1001/apt2001 with spare role [puppet] - 10https://gerrit.wikimedia.org/r/574115 [22:51:11] (03PS2) 10Dzahn: site: add apt1001/apt2001 with spare role [puppet] - 10https://gerrit.wikimedia.org/r/574115 [22:51:55] (03CR) 10Jforrester: [C: 03+1] Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573969 (owner: 10Matěj Suchánek) [22:52:09] (03PS3) 10Dzahn: site: add apt1001/apt2001 with spare role [puppet] - 10https://gerrit.wikimedia.org/r/574115 (https://phabricator.wikimedia.org/T224576) [22:52:21] 10Operations, 10SRE-tools: Cookbook sre.hosts.downtime displayed on tools.wmflabs.org - https://phabricator.wikimedia.org/T245871 (10Etonkovidova) >>! In T245871#5908323, @Volans wrote: > @Etonkovidova what would be the issue? As far as I know that page is just listing the last few items of the [[ https://wiki... [22:52:44] (03CR) 10Dzahn: [C: 03+2] site: add apt1001/apt2001 with spare role [puppet] - 10https://gerrit.wikimedia.org/r/574115 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [22:53:43] (03CR) 10Krinkle: "LGTM, but probably needs to be moved instead of removed. 
moved from wwwportal to redirects as simple funnel to www, like we did with m-dot" [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [22:54:54] 10Operations, 10SRE-tools: Cookbook sre.hosts.downtime displayed on tools.wmflabs.org - https://phabricator.wikimedia.org/T245871 (10Dzahn) The FAILs are actually something to look for but they are in the cookbooks being run. So the log works as intended but shows us something went wrong with some specific com... [23:05:10] !log updated (?) wikitech-static to 1.34.0 [23:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:00] 10Operations, 10Core Platform Team, 10DC-Ops, 10serviceops: Rename wtp* servers to parsoid* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (10jijiki) [23:24:51] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [23:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:51] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [23:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:51] (03PS1) 10Dzahn: site: add mw2366-mw2376 with spare role [puppet] - 10https://gerrit.wikimedia.org/r/574124 [23:35:59] (03CR) 10jerkins-bot: [V: 04-1] site: add mw2366-mw2376 with spare role [puppet] - 10https://gerrit.wikimedia.org/r/574124 (owner: 10Dzahn) [23:57:57] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10RobH)