[00:00:12] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:56] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:10] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:20] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:38] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:54] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:54:02] (03PS1) 10Tim Starling: Use a short connect timeout for PoolCounter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605436 (https://phabricator.wikimedia.org/T105378) [03:56:06] 10Operations, 10MediaWiki-General, 10Patch-For-Review: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378 (10tstarling) 05Resolved→03Open Reopening pending deployment of my patch above. 
[04:00:06] (03PS1) 10Tim Starling: Enable PoolCounter fastStale mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605437 [04:06:58] (03PS2) 10KartikMistry: Update cxserver to 2020-06-10-044445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/604657 (https://phabricator.wikimedia.org/T254959) [05:19:06] (03PS1) 10Tim Starling: Set a maximum HTTP client timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605440 (https://phabricator.wikimedia.org/T245170) [05:20:41] (03PS1) 10Elukey: Remove unused Analytics statistics profile [puppet] - 10https://gerrit.wikimedia.org/r/605441 [05:20:55] ACKNOWLEDGEMENT - MariaDB Slave Lag: x1 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 234411.11 seconds Marostegui checking https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [05:20:55] ACKNOWLEDGEMENT - MariaDB Slave SQL: x1 on db2101 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 2 of table wikishared.echo_unread_wikis cannot be converted from type varchar(30) to type varbinary(64) Marostegui checking https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [05:21:56] (03CR) 10Elukey: [C: 03+2] Remove unused Analytics statistics profile [puppet] - 10https://gerrit.wikimedia.org/r/605441 (owner: 10Elukey) [05:28:06] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:32:00] (03CR) 10Elukey: [C: 03+1] hiera: disable hardware monitoring on analytics1049 and thumbor1004 [puppet] - 10https://gerrit.wikimedia.org/r/605270 (owner: 10Cwhite) [05:38:32] RECOVERY - MariaDB Slave SQL: x1 on db2101 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [05:46:46] RECOVERY - MariaDB Slave Lag: x1 on db2101 is OK: OK slave_sql_lag Replication lag: 
0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [05:50:30] the ulsfo <-> codfw link is down due to unexpected/extended Zayo maintenance, they are still working on it [06:01:53] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1020 [puppet] - 10https://gerrit.wikimedia.org/r/605444 (https://phabricator.wikimedia.org/T202367) [06:04:11] (03PS2) 10Marostegui: mariadb: Productionize dbproxy1020 [puppet] - 10https://gerrit.wikimedia.org/r/605444 (https://phabricator.wikimedia.org/T202367) [06:04:50] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1020 [puppet] - 10https://gerrit.wikimedia.org/r/605444 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [06:18:34] (03PS7) 10Elukey: Add support to pull datapoints from Kafka [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/600295 [06:18:36] (03CR) 10Elukey: Add support to pull datapoints from Kafka (0310 comments) [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/600295 (owner: 10Elukey) [06:19:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [06:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:44] (03PS1) 10Marostegui: production-m3.sql: Add dbproxy1020 grants [puppet] - 10https://gerrit.wikimedia.org/r/605524 (https://phabricator.wikimedia.org/T202367) [06:53:08] (03CR) 10Marostegui: [C: 03+2] production-m3.sql: Add dbproxy1020 grants [puppet] - 10https://gerrit.wikimedia.org/r/605524 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [06:59:48] 10Operations, 10netops: ulsfo - codfw Zayo link down - https://phabricator.wikimedia.org/T255393 (10ayounsi) p:05Triage→03Medium [07:00:15] (03CR) 10Muehlenhoff: [C: 03+2] package_builder: Remove support for jessie [puppet] - 
10https://gerrit.wikimedia.org/r/605240 (owner: 10Muehlenhoff) [07:00:52] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T255393 - The acknowledgement expires at: 2020-06-15 10:00:22. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:52] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T255393 - The acknowledgement expires at: 2020-06-15 10:00:22. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:06:09] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10ayounsi) 05Resolved→03Open Alerting for 6 days. Right now says: backup1002 backup1002-array not present in accounting. [07:09:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605271 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [07:09:40] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10ayounsi) >>! In T252185#6221836, @akosiaris wrote: > kubernetes2007 has been reimage successfully,... 
[07:12:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605267 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [07:13:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/603950 (https://phabricator.wikimedia.org/T254818) (owner: 10Jbond) [07:15:45] (03PS1) 10Marostegui: dbproxy1020: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/605525 (https://phabricator.wikimedia.org/T202367) [07:17:09] (03CR) 10Muehlenhoff: cas-icinga: Add an entry point for the external monitoring script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [07:22:21] (03PS1) 10Marostegui: report_users: Add dbproxy1020 [software] - 10https://gerrit.wikimedia.org/r/605526 (https://phabricator.wikimedia.org/T202367) [07:22:47] (03CR) 10Marostegui: [C: 03+2] dbproxy1020: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/605525 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [07:23:01] (03CR) 10Marostegui: [C: 03+2] report_users: Add dbproxy1020 [software] - 10https://gerrit.wikimedia.org/r/605526 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [07:27:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2092', diff saved to https://phabricator.wikimedia.org/P11491 and previous config saved to /var/cache/conftool/dbconfig/20200615-072742-marostegui.json [07:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 for schema change', diff saved to https://phabricator.wikimedia.org/P11492 and previous config saved to /var/cache/conftool/dbconfig/20200615-072835-marostegui.json [07:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:58] (03PS1) 10Marostegui: report_users: Fix typo on dbproxy1020 IP [software] - 
10https://gerrit.wikimedia.org/r/605527 [07:28:59] !log Deploy schema change on db1093 [07:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:49] (03CR) 10Marostegui: [C: 03+2] report_users: Fix typo on dbproxy1020 IP [software] - 10https://gerrit.wikimedia.org/r/605527 (owner: 10Marostegui) [07:33:26] (03PS1) 10Marostegui: install_server: Reimage dbproxy1016 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/605528 (https://phabricator.wikimedia.org/T202367) [07:33:58] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage dbproxy1016 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/605528 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [07:36:43] !log push new pfw firewall policies - T255185 [07:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:46] T255185: Deploy pfw policy 1591901800 for T122104 - https://phabricator.wikimedia.org/T255185 [07:38:07] 10Operations, 10fundraising-tech-ops, 10netops, 10WMF-NDA: Deploy pfw policy 1591901800 for T122104 - https://phabricator.wikimedia.org/T255185 (10ayounsi) 05Open→03Resolved a:03ayounsi Done! 
[07:46:36] !log standardize ae device-count on all routers [07:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [07:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:28] (03CR) 10Ayounsi: [C: 03+2] Chassis: more generic, add ae count [homer/public] - 10https://gerrit.wikimedia.org/r/592251 (owner: 10Ayounsi) [07:55:53] (03Merged) 10jenkins-bot: Chassis: more generic, add ae count [homer/public] - 10https://gerrit.wikimedia.org/r/592251 (owner: 10Ayounsi) [08:02:40] (03PS1) 10Marostegui: wmnet: Failover m3 primary dbproxy [dns] - 10https://gerrit.wikimedia.org/r/605531 (https://phabricator.wikimedia.org/T202367) [08:09:39] !log installing libexif security updates [08:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:29] (03PS1) 10Kormat: admin: Update my configs. [puppet] - 10https://gerrit.wikimedia.org/r/605533 [08:15:06] (03CR) 10Kormat: [C: 03+2] admin: Update my configs. 
[puppet] - 10https://gerrit.wikimedia.org/r/605533 (owner: 10Kormat) [08:17:15] !log Deploy schema change on db1131 (s6 master) - T250066 [08:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:19] T250066: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 [08:21:26] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3 primary dbproxy [dns] - 10https://gerrit.wikimedia.org/r/605531 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [08:22:06] !log Switchover m3-master from dbproxy1008 to dbproxy1016 - T202367 [08:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:12] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [08:25:27] marostegui: Is it OK to deploy cxserver now? [08:27:26] yes [08:27:45] (03CR) 10Gehel: [C: 03+2] [wdqs] add an option to skolemize blank nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/605231 (owner: 10DCausse) [08:27:58] (03CR) 10ArielGlenn: "Just a couple of nits left. Once those are fixed up, I'd like to dry-run test these before they get merged, so let's coordinate on that." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [08:29:52] 10Operations, 10Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10Vgutierrez) Checking against eqsin with `curl --resolve noc.wikimedia.org:443:$(dig +short text-lb.eqsin.wikimedia.org) https://noc.wikimedia.org` I do get a 503 and... 
[08:34:52] !log reimaging cumin2001 T245114 [08:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:55] T245114: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 [08:36:23] marostegui: thanks [08:36:34] 10Operations, 10Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10Vgutierrez) `curl --http1.1 -H 'Host: noc.wikimedia.org' https://mwmaint.discovery.wmnet` from cp5010 returns a HTTP 200 as expected [08:36:46] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-06-10-044445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/604657 (https://phabricator.wikimedia.org/T254959) (owner: 10KartikMistry) [08:37:13] kart_: you are welcome :) [08:37:16] (03Merged) 10jenkins-bot: Update cxserver to 2020-06-10-044445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/604657 (https://phabricator.wikimedia.org/T254959) (owner: 10KartikMistry) [08:38:14] (03PS1) 10Muehlenhoff: Switch cumin2001 to buster [puppet] - 10https://gerrit.wikimedia.org/r/605534 [08:39:06] (03CR) 10Muehlenhoff: [C: 03+2] Switch cumin2001 to buster [puppet] - 10https://gerrit.wikimedia.org/r/605534 (owner: 10Muehlenhoff) [08:39:13] (03PS2) 10Muehlenhoff: Switch cumin2001 to buster [puppet] - 10https://gerrit.wikimedia.org/r/605534 [08:39:48] !log kartik@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [08:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:11] !log kartik@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' . [08:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:22] (03CR) 10Kormat: Add native mysql spicerack module. 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [08:44:48] 10Operations, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10Marostegui) [08:44:51] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Marostegui) [08:46:03] !log kartik@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . [08:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:17] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port exim statistics to Prometheus - https://phabricator.wikimedia.org/T179565 (10fgiunchedi) Hi @Cyrille37, we're currently using mtail (https://github.com/google/mtail) to turn exim logs into metrics. See also https://gerrit.wikimedia.org/r/plu... [08:50:04] (03CR) 10Ayounsi: [C: 03+1] "Compared to authdns1001:/etc/gdnsd/zones/netbox/... LGTM." 
[dns] - 10https://gerrit.wikimedia.org/r/604136 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [08:50:07] !log Updated cxserver to 2020-06-10-044445-production (T246319, T254959) [08:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:12] T254959: cxserver: Breaks All Machine Translation services - https://phabricator.wikimedia.org/T254959 [08:50:12] T246319: Enable Google Translate support in Content Translation for Kinyarwanda, Odia, Tatar, Turkmen and Uyghur - https://phabricator.wikimedia.org/T246319 [08:52:20] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: remove swift-container-sharder unit [puppet] - 10https://gerrit.wikimedia.org/r/604623 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:53:31] (03PS7) 10Jbond: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) [08:53:57] (03CR) 10Jbond: "updated thanks" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [08:55:12] !log Deploy schema change on db2123 (s5 codfw master) - T250066 [08:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:18] T250066: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 [08:58:04] (03PS1) 10DCausse: [wdqs] bump vocabulary and inline URI handler version [puppet] - 10https://gerrit.wikimedia.org/r/605536 (https://phabricator.wikimedia.org/T255399) [08:58:30] 10Operations, 10Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10Vgutierrez) varnish-fe also shows a 503: `vgutierrez@cp5010:~$ sudo -i varnishlog -n frontend -q "ReqHeader:Host eq noc.wikimedia.org" * << Request >> 1066411391 -... 
[08:58:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable thanos upload in ops eqsin/ulsfo/codfw [puppet] - 10https://gerrit.wikimedia.org/r/605177 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [09:00:27] (03CR) 10Jbond: [C: 03+2] profile::icinga: move single line scripts in line [puppet] - 10https://gerrit.wikimedia.org/r/605271 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [09:00:47] yes [09:01:18] (03CR) 10Jbond: [C: 03+2] aptrepo: build single line shell script inline [puppet] - 10https://gerrit.wikimedia.org/r/605267 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [09:02:57] 10Operations, 10Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10Vgutierrez) Filtering by BeReqHeader we can see how varnish-fe apparently gets a 200 from ats-be and returns a 503 cause the "body cannot be fetched": `vgutierrez@cp... [09:03:18] (03CR) 10Jcrespo: "I want to hit reset button here because I cannot really review this." [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [09:05:04] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:32] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime [09:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:53] (03CR) 10Volans: [C: 03+1] "Thanks for all the fixes, looks pretty good. I just have one nit inline on the behaviour if the kafka dependency is missing, but YMMV." 
(0310 comments) [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/600295 (owner: 10Elukey) [09:13:29] (03PS1) 10Jbond: puppetmaster::server: offline 1003 and 2003 for microcode upgrade [puppet] - 10https://gerrit.wikimedia.org/r/605538 (https://phabricator.wikimedia.org/T254990) [09:14:05] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605538 (https://phabricator.wikimedia.org/T254990) (owner: 10Jbond) [09:17:36] !log reduce ae device-count from 10 to 3 on asw2-a/b/c-eqiad [09:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:09] (03PS1) 10Muehlenhoff: Switch spicerack installation to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/605539 [09:20:41] PROBLEM - Thanos swift https on thanos-fe2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.151 second response time https://wikitech.wikimedia.org/wiki/Thanos [09:21:02] (03PS1) 10Elukey: Add druid100[7,8] to the druid term in analytics-in4/6 [homer/public] - 10https://gerrit.wikimedia.org/r/605540 [09:21:12] XioNoX: if you have time --^ [09:21:23] elukey: what's up? 
[09:21:39] !log offlining puppetmaster1003 and 2003 for reboot [09:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:42] (03CR) 10Jbond: [C: 03+2] puppetmaster::server: offline 1003 and 2003 for microcode upgrade [puppet] - 10https://gerrit.wikimedia.org/r/605538 (https://phabricator.wikimedia.org/T254990) (owner: 10Jbond) [09:21:54] XioNoX: ah sorry, https://gerrit.wikimedia.org/r/#/c/operations/homer/public/+/605540/1/templates/cr/firewall.conf [09:22:09] elukey, I /ignore some bots :) [09:23:12] XioNoX: I always forget : [09:23:13] :) [09:24:12] (03CR) 10Ayounsi: [C: 03+1] Add druid100[7,8] to the druid term in analytics-in4/6 [homer/public] - 10https://gerrit.wikimedia.org/r/605540 (owner: 10Elukey) [09:24:23] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605539 (owner: 10Muehlenhoff) [09:24:25] elukey: lgtm! [09:24:30] thanks! [09:24:36] ok if I deploy? [09:25:31] 10Operations, 10Wikidata, 10serviceops: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10Addshore) Is T253673 relevant here? 
[09:25:34] elukey: yep [09:27:47] PROBLEM - Thanos store has high percentage of object storage failures on icinga1001 is CRITICAL: job=thanos-compact https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [09:28:18] that's me ^ [09:28:35] (03CR) 10Elukey: [C: 03+2] Add druid100[7,8] to the druid term in analytics-in4/6 [homer/public] - 10https://gerrit.wikimedia.org/r/605540 (owner: 10Elukey) [09:28:41] PROBLEM - Thanos compact has high percentage of object storage failures on icinga1001 is CRITICAL: job=thanos-compact https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:29:06] !log update analytics-in4/6 filters on cr1-cr2 eqiad to update the Druid term (new nodes added) [09:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:45] 10Operations, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui) [09:30:10] (03CR) 10Muehlenhoff: [C: 03+2] Switch spicerack installation to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/605539 (owner: 10Muehlenhoff) [09:30:21] 10Operations, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui) dbproxy1008 is no longer the active m3-master per https://gerrit.wikimedia.org/r/#/c/operations/dns/+/605531/ Let's give it a few days before starting i... 
[09:31:00] 10Operations, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui) [09:31:51] 10Operations, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui) [09:32:57] (03PS1) 10Kosta Harlan: GrowthExperiments: Switch on guidance feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605543 (https://phabricator.wikimedia.org/T239181) [09:33:33] (03PS2) 10Volans: mgmt: use netbox-generated data for esams mgmt [dns] - 10https://gerrit.wikimedia.org/r/604136 (https://phabricator.wikimedia.org/T233183) [09:33:37] RECOVERY - Thanos store has high percentage of object storage failures on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [09:34:37] RECOVERY - Thanos compact has high percentage of object storage failures on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:35:23] (03PS1) 10Jbond: Revert "puppetmaster::server: offline 1003 and 2003 for microcode upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/605544 [09:35:24] !log volans@cumin1001 START - Cookbook sre.dns.netbox [09:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:35] PROBLEM - Thanos compact has disappeared from Prometheus discovery on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [09:37:43] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:21] (03PS1) 10Muehlenhoff: Readd the spicerack component on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/605545 (https://phabricator.wikimedia.org/T245114) [09:38:33] (03CR) 10jerkins-bot: [V: 04-1] Readd the spicerack component on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/605545 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [09:41:39] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:02] (03CR) 10Volans: [C: 03+2] mgmt: use netbox-generated data for esams mgmt (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/604136 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [09:42:48] !log deploying esams mgmt DNS records automatically generated by Netbox ( operations/dns/+/604136/ ) - T233183 [09:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:52] T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 [09:44:17] (03PS2) 10Muehlenhoff: Readd the spicerack 
component on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/605545 (https://phabricator.wikimedia.org/T245114) [09:44:19] (03PS1) 10Marostegui: install_server: Upgrade dbproxy2* to Buster [puppet] - 10https://gerrit.wikimedia.org/r/605546 (https://phabricator.wikimedia.org/T255408) [09:44:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:45:39] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23214/" [puppet] - 10https://gerrit.wikimedia.org/r/605545 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [09:45:48] (03PS2) 10Marostegui: install_server: Upgrade dbproxy2* to Buster [puppet] - 10https://gerrit.wikimedia.org/r/605546 (https://phabricator.wikimedia.org/T255408) [09:46:39] !log run logstash benchmark on logstash1023 [09:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:51] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Switch on guidance feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605543 (https://phabricator.wikimedia.org/T239181) (owner: 10Kosta Harlan) [09:46:53] (03CR) 10Marostegui: [C: 03+2] install_server: Upgrade dbproxy2* to Buster [puppet] - 10https://gerrit.wikimedia.org/r/605546 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [09:47:10] 10Operations, 10DBA, 10SRE-tools: Add native mysql module to spicerack - https://phabricator.wikimedia.org/T255409 (10Kormat) [09:47:45] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605545 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [09:47:47] (03PS7) 10Kormat: Add native mysql spicerack module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) [09:47:51] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:05] (03CR) 10Muehlenhoff: Readd the spicerack component on Stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605545 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [09:49:13] (03CR) 10Muehlenhoff: [C: 03+2] Readd the spicerack component on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/605545 (https://phabricator.wikimedia.org/T245114) (owner: 10Muehlenhoff) [09:50:38] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [09:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:15] (03CR) 10Jbond: [C: 03+2] Revert "puppetmaster::server: offline 1003 and 2003 for microcode upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/605544 (owner: 10Jbond) [09:52:19] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [09:53:12] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:11] (03PS1) 10Ayounsi: Manage routers k8s BGP config [homer/public] - 10https://gerrit.wikimedia.org/r/605548 [09:59:18] PROBLEM - Check systemd state on dumpsdata1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:36] this was a systemd issue to give debmonitor its session [10:01:11] interesting, same issue we've seen on ms-be hosts sometimes? [10:01:25] I think so, was checking if we got some old logs from those [10:01:27] to compare [10:01:35] https://phabricator.wikimedia.org/T199911 [10:01:39] was just looking myself [10:02:15] yeah, same thing [10:02:55] I'm gonna reset [10:02:58] ack [10:03:05] same conditions, heavily used host [10:03:05] load average: 223.26, 222.69, 222.17 [10:03:53] ack [10:03:58] RECOVERY - Check systemd state on dumpsdata1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:25] thx [10:05:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [10:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:33] (03PS1) 10Hashar: contint: raise egress traffic shaping limit [puppet] - 10https://gerrit.wikimedia.org/r/605550 (https://phabricator.wikimedia.org/T255371) [10:08:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:06] (03PS3) 10Jforrester: [enwikivoyage] Undeploy the Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604670 (https://phabricator.wikimedia.org/T254820) (owner: 10MarcoAurelio) [10:09:20] (03CR) 10Jforrester: [C: 03+2] [enwikivoyage] Undeploy the Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604670 (https://phabricator.wikimedia.org/T254820) (owner: 10MarcoAurelio) [10:10:15] (03Merged) 10jenkins-bot: [enwikivoyage] Undeploy the Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604670 (https://phabricator.wikimedia.org/T254820) (owner: 10MarcoAurelio) [10:10:33] 10Operations, 10DBA, 10SRE-tools, 10Patch-For-Review: Add native mysql module to 
spicerack - https://phabricator.wikimedia.org/T255409 (10jbond) p:05Triage→03Medium [10:11:21] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Continuous-Integration-Config, and 2 others: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10jbond) p:05Triage→03Medium [10:11:40] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10jbond) p:05Triage→03Medium [10:12:24] RECOVERY - Thanos compact has disappeared from Prometheus discovery on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [10:12:42] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T254820 [enwikivoyage] Undeploy the Listings extension (duration: 01m 00s) [10:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:46] T254820: Undeploy Listings extension from English Wikivoyage - https://phabricator.wikimedia.org/T254820 [10:16:19] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single [10:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:26] !log jmm@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) [10:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:28] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605553 (https://phabricator.wikimedia.org/T128546) [10:18:37] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.423e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [10:18:55] 10Operations, 10Research, 
10Wikimedia-Mailing-lists: Admin password reset request for a mailman list: research-wmf - https://phabricator.wikimedia.org/T255326 (10jbond) @leila This has now been reset and I have removed the other admin from the list of admins. You should have been emailed a copy of the new... [10:19:01] 10Operations, 10Research, 10Wikimedia-Mailing-lists: Admin password reset request for a mailman list: research-wmf - https://phabricator.wikimedia.org/T255326 (10jbond) 05Open→03Resolved a:03jbond [10:19:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] install: Switch kubernetes2014 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/605400 (https://phabricator.wikimedia.org/T252185) (owner: 10Alexandros Kosiaris) [10:19:25] (03PS3) 10Alexandros Kosiaris: install: Switch kubernetes2014 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/605400 (https://phabricator.wikimedia.org/T252185) [10:19:28] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] install: Switch kubernetes2014 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/605400 (https://phabricator.wikimedia.org/T252185) (owner: 10Alexandros Kosiaris) [10:22:26] (03PS1) 10Alexandros Kosiaris: services_proxy: Add keepalive, retries [puppet] - 10https://gerrit.wikimedia.org/r/605554 (https://phabricator.wikimedia.org/T255410) [10:26:52] (03PS2) 10Alexandros Kosiaris: services_proxy: Add keepalive, retries [puppet] - 10https://gerrit.wikimedia.org/r/605554 (https://phabricator.wikimedia.org/T255410) [10:27:36] (03CR) 10Alexandros Kosiaris: "Added joe just for posterity's sake, so he is aware of what we did" [puppet] - 10https://gerrit.wikimedia.org/r/605554 (https://phabricator.wikimedia.org/T255410) (owner: 10Alexandros Kosiaris) [10:29:31] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:30:04] jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T1030). [10:30:31] (03PS2) 10Ayounsi: Manage routers k8s BGP config [homer/public] - 10https://gerrit.wikimedia.org/r/605548 [10:30:33] (03PS1) 10Ayounsi: Manage routers' anycast BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/605557 [10:32:11] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:33:33] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605553 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:18] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605553 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:35:46] (03CR) 10JMeybohm: [C: 03+1] services_proxy: Add keepalive, retries [puppet] - 10https://gerrit.wikimedia.org/r/605554 (https://phabricator.wikimedia.org/T255410) (owner: 10Alexandros Kosiaris) [10:36:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] services_proxy: Add keepalive, retries [puppet] - 10https://gerrit.wikimedia.org/r/605554 (https://phabricator.wikimedia.org/T255410) (owner: 10Alexandros Kosiaris) [10:38:05] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:605553| Bumping portals to master (605553)]] (duration: 00m 58s) [10:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:36] !log regenerated restbase2009's cassandra certificates [10:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:04] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:605553| Bumping 
portals to master (605553)]] (duration: 00m 58s) [10:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:26] Hey, there are two small changes in comments in puppet https://gerrit.wikimedia.org/r/c/operations/puppet/+/605382 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/605383 [10:43:37] If that's okay to merge [10:46:14] (03PS1) 10Volans: homer: fix initial clone of private repo [puppet] - 10https://gerrit.wikimedia.org/r/605558 (https://phabricator.wikimedia.org/T245114) [10:51:50] (03CR) 10Volans: "This might work, compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/605558 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [10:54:08] !log imported python-phabricator 0.7.0-2~wmf2 to apt.wikimedia.org/buster-wikimedia T245114 [10:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:12] T245114: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 [10:58:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: 10BryanDavis) [10:59:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/605558 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T1100). [11:00:04] kostajh: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:12] \o [11:03:06] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:03:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:29] jouncebot is officially my favorite, [11:07:32] !log regenerated certificates for restbase2009, restbase101[678], restbase201[012]. Did not roll-restart yet [11:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:38] hi Amir1 / awight / Urbanecm, are any of you around for backport? [11:08:53] sorry meeting atm :( [11:10:01] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime [11:10:01] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:31] I'll BACON [11:12:46] (is that the official name now?) [11:13:20] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Switch on guidance feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605543 (https://phabricator.wikimedia.org/T239181) (owner: 10Kosta Harlan) [11:14:43] PROBLEM - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [11:14:57] seems like we don't have gerrit hashtags for backports anymore? 
[11:15:02] moritzm: ^^ from the last reboot I guess [11:19:34] that's for the homer passphrase, needs someone in the netops group to enter it [11:19:42] XioNoX: ^^^ [11:19:47] ^ XioNoX: can you run "sudo keyholder arm" on cumin2001? [11:21:07] PROBLEM - Check systemd state on cescout1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:32] moritzm, volans, I'm commuting, will look at it when I'm at a destination [11:22:35] 30min or so [11:23:45] (03PS2) 10Gergő Tisza: GrowthExperiments: Switch on guidance feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605543 (https://phabricator.wikimedia.org/T239181) (owner: 10Kosta Harlan) [11:25:45] (03CR) 10Volans: [C: 03+2] homer: fix initial clone of private repo [puppet] - 10https://gerrit.wikimedia.org/r/605558 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [11:26:39] XioNoX: ack, thx [11:27:55] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Switch on guidance feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605543 (https://phabricator.wikimedia.org/T239181) (owner: 10Kosta Harlan) [11:28:18] 10Operations, 10Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10ema) It seems that something might be going wrong at the ats-tls<->varnish-fe level. Hitting varnish-fe directly on cp5007 I constantly get a 200 response with body a... [11:28:54] (03Merged) 10jenkins-bot: GrowthExperiments: Switch on guidance feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605543 (https://phabricator.wikimedia.org/T239181) (owner: 10Kosta Harlan) [11:31:07] (03PS1) 10Volans: homer: enforce resource order [puppet] - 10https://gerrit.wikimedia.org/r/605564 (https://phabricator.wikimedia.org/T245114) [11:33:05] kostajh: sorry, I was fumbling with gerrit. Patch is on mwdebug1001. 
[11:33:49] RECOVERY - Check systemd state on cescout1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:46] tgr: ok looking now [11:37:36] puppet on cumin2001 it's me, WIP on a fix, please ignore for now [11:38:19] RECOVERY - Thanos compact has not run on icinga1001 is OK: (C)24 ge (W)12 ge 0.01651 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [11:38:21] tgr: looks good! [11:40:34] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:605543|GrowthExperiments: Switch on guidance feature (T239181)]] (duration: 00m 57s) [11:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:40] T239181: [EPIC] Growth: Newcomer tasks 1.2 (guidance) - https://phabricator.wikimedia.org/T239181 [11:43:25] !log Reimage dbproxy2003 which points to m3-master.codfw.wmnet (not in use) - T255408 [11:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:29] T255408: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 [11:49:09] (03CR) 10Alexandros Kosiaris: "Many thanks. This is pretty nice, a small comment inline, but otherwise /me likes" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/605548 (owner: 10Ayounsi) [11:49:56] (03CR) 10Hashar: [C: 03+1] "I have cherry picked this on the CI puppet master, ran puppet and tc-setup on all instances using cumin." 
[puppet] - 10https://gerrit.wikimedia.org/r/605550 (https://phabricator.wikimedia.org/T255371) (owner: 10Hashar) [11:50:18] (03PS1) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [11:55:04] jouncebot: now [11:55:05] For the next 0 hour(s) and 4 minute(s): European mid-day backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T1100) [11:55:07] jouncebot: next [11:55:07] In 5 hour(s) and 4 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T1700) [11:57:57] !log reimaging sretest1002 to validate the reimage script on Buster [11:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [11:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:11] moritzm, volans done [11:59:47] ack, thx [11:59:48] (03PS1) 10Muehlenhoff: Switch cumin1001 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/605570 [12:00:16] RECOVERY - Keyholder SSH agent on cumin2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [12:01:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:49] 10Operations, 10Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10ema) >>! In T255368#6223544, @ema wrote: > It seems that something might be going wrong at the ats-tls<->varnish-fe level. Hitting varnish-fe directly on cp5007 I con... 
[12:09:54] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:50] moritzm: https://phabricator.wikimedia.org/T255169 FYI. I pinned the version to the previous one in blubber.yaml, but I don't feel particularly nice about pinning ca-certificates [12:11:31] !log Upgrade db2134 [12:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:42] akosiaris: thanks, that's known. we've been holding off from rolling this update out to production until a fixed version gets published (next 1-2 days) [12:14:22] moritzm: one thing that ticket made me question was the inclusion of buster-updates in those images. Do you have any feelings regarding this? [12:20:26] 10Operations, 10Wikimedia-Logstash, 10observability: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) I ran some benchmarking on logstash1023 (i.e. elk7) to address point 2. I grabbed the logs from the timeframe of the incident from kafka and ran log... [12:21:26] generally I think it makes sense, buster-updates is used for exactly these kinds of non-security-related changes which need to be shipped to stable/oldstable without waiting for the next point release, it's a little unfortunate that this one caused a regression, but I think in general the usefulness outweighs the potential harm [12:24:24] (03PS1) 10Marostegui: install_server: Do not reimage dbproxy2003, dbproxy2004. [puppet] - 10https://gerrit.wikimedia.org/r/605573 (https://phabricator.wikimedia.org/T255408) [12:25:12] ok, cool. 
Let's keep it as is then [12:25:30] (03CR) 10Vgutierrez: [C: 03+2] acme_chief,x509: Provide .crt.key file support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/605237 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:26:50] (03CR) 10Vgutierrez: [C: 03+2] api: Allow acme-chief clients to fetch .crt.key files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/605254 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:29:13] (03PS1) 10Vgutierrez: Release 0.26 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/605577 (https://phabricator.wikimedia.org/T255249) [12:30:16] (03PS1) 10Ema: ATS: stop caching noc.wm.org responses [puppet] - 10https://gerrit.wikimedia.org/r/605578 (https://phabricator.wikimedia.org/T255368) [12:30:58] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 27886800 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:10] (03CR) 10Vgutierrez: [C: 03+1] "nice catch" [puppet] - 10https://gerrit.wikimedia.org/r/605578 (https://phabricator.wikimedia.org/T255368) (owner: 10Ema) [12:32:38] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 372368 and 94 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:50] (03CR) 10Vgutierrez: [C: 03+2] Release 0.26 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/605577 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:33:05] (03CR) 10Ema: [C: 03+2] ATS: stop caching noc.wm.org responses [puppet] - 10https://gerrit.wikimedia.org/r/605578 (https://phabricator.wikimedia.org/T255368) (owner: 10Ema) [12:34:24] !log rolling reboot on the ganeti cluster in eqsin (for security updates and to pick up the network changes to provides instances with a public IP) [12:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:52] 
(03PS1) 10Vgutierrez: acme_chief,x509: Provide .crt.key file support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/605579 (https://phabricator.wikimedia.org/T255249) [12:35:54] (03PS1) 10Vgutierrez: api: Allow acme-chief clients to fetch .crt.key files [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/605580 (https://phabricator.wikimedia.org/T255249) [12:35:56] (03PS1) 10Vgutierrez: Release 0.26 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/605581 (https://phabricator.wikimedia.org/T255249) [12:36:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:37:51] (03PS1) 10Vgutierrez: debian: Add release 0.26 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/605582 (https://phabricator.wikimedia.org/T255249) [12:38:02] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:38:07] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:58] (03CR) 10Vgutierrez: [C: 03+2] acme_chief,x509: Provide .crt.key file support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/605579 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:41:06] (03CR) 10Vgutierrez: [C: 03+2] api: Allow acme-chief clients to fetch .crt.key files [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/605580 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:41:15] (03CR) 10Vgutierrez: [C: 03+2] Release 0.26 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/605581 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:41:53] (03PS1) 10Volans: git::clone: allow to pass environment variables [puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) [12:42:13] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.26 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/605582 (https://phabricator.wikimedia.org/T255249) (owner: 10Vgutierrez) [12:42:32] (03PS3) 10Ayounsi: Manage routers k8s BGP config [homer/public] - 10https://gerrit.wikimedia.org/r/605548 [12:42:34] (03PS2) 10Ayounsi: Manage routers' anycast BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/605557 [12:42:40] (03PS12) 10WMDE-leszek: Wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T254315) [12:42:42] (03PS1) 10WMDE-leszek: test wikidata: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605584 [12:42:44] (03PS1) 10WMDE-leszek: test commons: Use the 
database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605585 [12:43:02] (03PS14) 10WMDE-leszek: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315) [12:43:27] (03CR) 10Ayounsi: Manage routers k8s BGP config (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/605548 (owner: 10Ayounsi) [12:43:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM (Giuseppe won't be able to comment for a couple of weeks). Feel free to deploy!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605436 (https://phabricator.wikimedia.org/T105378) (owner: 10Tim Starling) [12:43:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:06] (03PS11) 10WMDE-leszek: Commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T254315) [12:45:20] (03CR) 10Alexandros Kosiaris: [C: 03+1] Manage routers k8s BGP config [homer/public] - 10https://gerrit.wikimedia.org/r/605548 (owner: 10Ayounsi) [12:46:04] !log upload acme-chief 0.26 to apt.wm.o (buster) - T255249 [12:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:08] T255249: acme-chief: support for generating a concatenated cert/key file - https://phabricator.wikimedia.org/T255249 [12:46:09] (03PS12) 10WMDE-leszek: Commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T254315) [12:47:17] (03CR) 10Ayounsi: [C: 03+2] Manage routers k8s BGP config [homer/public] - 10https://gerrit.wikimedia.org/r/605548 (owner: 10Ayounsi) [12:47:24] (03PS10) 10WMDE-leszek: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) [12:47:45] (03Merged) 10jenkins-bot: Manage routers k8s BGP config [homer/public] - 10https://gerrit.wikimedia.org/r/605548 (owner: 10Ayounsi) [12:49:49] (03PS2) 10Volans: git::clone: allow to pass environment variables [puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) [12:49:51] (03PS2) 10Volans: homer: set the keyholder env variable [puppet] - 10https://gerrit.wikimedia.org/r/605564 (https://phabricator.wikimedia.org/T245114) [12:50:31] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage dbproxy2003, dbproxy2004. [puppet] - 10https://gerrit.wikimedia.org/r/605573 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [12:53:49] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: disable hardware monitoring on analytics1049 and thumbor1004 [puppet] - 10https://gerrit.wikimedia.org/r/605270 (owner: 10Cwhite) [12:55:41] (03CR) 10Volans: "Compiler results at: https://puppet-compiler.wmflabs.org/compiler1003/23218/" [puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [12:57:34] (03PS7) 10WMDE-leszek: Wikidata/Wikibase: use entity source Wikibase setting for all wikibase-enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569261 (https://phabricator.wikimedia.org/T242087) [12:57:51] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:15] (03CR) 10Alexandros Kosiaris: "Note that this define isn't being used anywhere in production and that's on purpose." 
[puppet] - 10https://gerrit.wikimedia.org/r/605343 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [12:58:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2091:3312, db2091:3314 - T253217', diff saved to https://phabricator.wikimedia.org/P11495 and previous config saved to /var/cache/conftool/dbconfig/20200615-125856-marostegui.json [12:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:05] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [12:59:43] (03PS1) 10Marostegui: mariadb: Reimage db2091 [puppet] - 10https://gerrit.wikimedia.org/r/605586 (https://phabricator.wikimedia.org/T253217) [13:00:25] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db2091 [puppet] - 10https://gerrit.wikimedia.org/r/605586 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [13:01:07] (03CR) 10Ayounsi: [C: 03+1] git::clone: allow to pass environment variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:03:56] (03CR) 10Ayounsi: git::clone: allow to pass environment variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:04:08] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) A little note about the last patch merged. There are two main memcached parameters that can influence the distribution of the slab classes' chunk size:... 
[13:04:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:10] (03CR) 10Ayounsi: [C: 03+1] homer: set the keyholder env variable [puppet] - 10https://gerrit.wikimedia.org/r/605564 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:06:40] (03CR) 10Volans: "replies inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:06:45] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Zayo still working on it https://phabricator.wikimedia.org/T255393 - The acknowledgement expires at: 2020-06-16 16:06:14. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:06:45] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Zayo still working on it https://phabricator.wikimedia.org/T255393 - The acknowledgement expires at: 2020-06-16 16:06:14. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:07:50] (03PS3) 10Volans: git::clone: allow to pass environment variables [puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) [13:07:52] (03PS3) 10Volans: homer: set the keyholder env variable [puppet] - 10https://gerrit.wikimedia.org/r/605564 (https://phabricator.wikimedia.org/T245114) [13:10:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:10:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/605564 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:12:09] PROBLEM - ganeti-mond running on ganeti5002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:13:25] that's a monitoring glitch, I'm forcing a puppet run on icinga1001, it didn't realise yet that 5002 is no longer the master node [13:14:52] (03CR) 10Ayounsi: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:15:17] (03CR) 10Volans: [C: 03+2] git::clone: allow to pass environment variables [puppet] - 10https://gerrit.wikimedia.org/r/605583 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:15:37] (03CR) 10Volans: [C: 03+2] "compiler seems happy too: https://puppet-compiler.wmflabs.org/compiler1001/23220/cumin2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/605564 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:15:41] (03CR) 10CDanis: [C: 03+1] Set a maximum HTTP client timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605440 (https://phabricator.wikimedia.org/T245170) (owner: 10Tim Starling) [13:15:48] (03CR) 10CDanis: [C: 03+1] Enable PoolCounter fastStale mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605437 (owner: 10Tim Starling) [13:17:44] (03CR) 10CDanis: [C: 04-1] rsync: move oneline script inline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605275 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [13:19:26] (03PS1) 10Volans: homer: fix git clone URL [puppet] - 10https://gerrit.wikimedia.org/r/605590 (https://phabricator.wikimedia.org/T245114) [13:19:28] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:02] (03CR) 10Volans: [C: 03+2] homer: fix git clone URL [puppet] - 10https://gerrit.wikimedia.org/r/605590 (https://phabricator.wikimedia.org/T245114) (owner: 10Volans) [13:20:19] !log Stopping zuul-merger on contint1001 to rebuild the virtualenv # T255424 [13:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:22] T255424: Zuul deployment fails due to unsupported wheel - https://phabricator.wikimedia.org/T255424 [13:21:27] !log filippo@cumin1001 conftool action : set/pooled=true; selector: 
dnsdisc=thanos-query,name=eqiad [13:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:22] (03PS10) 10Andrew Bogott: Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) [13:23:24] (03PS2) 10Andrew Bogott: Galera: move behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605336 (https://phabricator.wikimedia.org/T242455) [13:23:26] (03PS11) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [13:24:21] PROBLEM - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:24:34] 10Operations, 10Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10ema) Apparently `TE:chunked` is not added only on cache hits, but occasionally on miss/pass too. https://gerrit.wikimedia.org/r/605578 does make sense and it's good w... 
[13:26:07] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:26:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:22] !log Started zuul-merger on contint1001 with newer virtualenv # T255424 [13:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:25] T255424: Zuul deployment fails due to unsupported wheel - https://phabricator.wikimedia.org/T255424 [13:27:59] PROBLEM - ganeti-mond running on ganeti5002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:28:49] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [13:30:21] !log volans@deploy1001 Started deploy [homer/deploy@ac7a4c6]: Release v0.2.3 on cumin2001 now on buster [13:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:25] !log rolling reboot on the ganeti cluster in esams (for kernel security updates and to pick up the network changes to provides instances with a public IP) [13:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:37] !log volans@deploy1001 Finished deploy [homer/deploy@ac7a4c6]: Release v0.2.3 on cumin2001 now on buster (duration: 01m 15s) [13:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:17] (03PS2) 10Filippo Giunchedi: prometheus: enable thanos upload in ops 
eqiad [puppet] - 10https://gerrit.wikimedia.org/r/605178 (https://phabricator.wikimedia.org/T252186) [13:36:19] (03PS1) 10Filippo Giunchedi: swift: optional read affinity proxy setting [puppet] - 10https://gerrit.wikimedia.org/r/605591 (https://phabricator.wikimedia.org/T252186) [13:36:21] (03PS1) 10Filippo Giunchedi: hieradata: enable read affinity for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/605592 (https://phabricator.wikimedia.org/T252186) [13:38:22] !log elukey@cumin2001 START - Cookbook sre.hadoop.roll-restart-workers [13:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:07] volans: --^ [13:39:17] elukey: <3 [13:40:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [13:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:05] (03PS2) 10Filippo Giunchedi: swift: optional read affinity proxy setting [puppet] - 10https://gerrit.wikimedia.org/r/605591 (https://phabricator.wikimedia.org/T252186) [13:42:07] (03PS2) 10Filippo Giunchedi: hieradata: enable read affinity for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/605592 (https://phabricator.wikimedia.org/T252186) [13:43:03] (03CR) 10WMDE-leszek: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605584 (owner: 10WMDE-leszek) [13:43:10] (03CR) 10WMDE-leszek: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605585 (owner: 10WMDE-leszek) [13:43:17] (03CR) 10WMDE-leszek: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [13:43:50] (03PS2) 10WMDE-leszek: test wikidata: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605584 [13:43:59] (03PS2) 10WMDE-leszek: test commons: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605585 [13:44:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:33] (03PS13) 10WMDE-leszek: Wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T254315) [13:46:36] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/23222/" [puppet] - 10https://gerrit.wikimedia.org/r/605592 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:49:34] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:41] !log Upgrade db2133 [13:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:20] jouncebot: next [13:50:20] In 3 hour(s) and 9 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T1700) [13:50:24] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/23223/" [puppet] - 10https://gerrit.wikimedia.org/r/605591 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:51:51] (03PS1) 10Ayounsi: Add k8s stage BGP neighbors [homer/public] - 
10https://gerrit.wikimedia.org/r/605594 [13:54:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:39] !log Deploy schema change on db1100 (s5 master) - T250066 [13:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:42] T250066: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 [13:59:24] (03PS1) 10Jbond: profile::mail::check_mail: new script for checking user emails [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) [14:03:27] PROBLEM - ganeti-mond running on ganeti3002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:06:17] RECOVERY - ganeti-mond running on ganeti5002 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [14:09:02] !log elukey@cumin2001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [14:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:29] (03PS2) 10Ayounsi: Add k8s stage BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/605594 [14:11:47] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:11:48] (03CR) 10Andrew Bogott: [C: 03+2] Initial module and profile for galera + mariadb [puppet] - 10https://gerrit.wikimedia.org/r/604856 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [14:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:10] (03PS3) 10Andrew Bogott: Galera: move behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605336 (https://phabricator.wikimedia.org/T242455) [14:15:12] (03PS12) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 
10https://gerrit.wikimedia.org/r/605315 [14:15:24] (03PS1) 10Andrew Bogott: wmcs: install galera on codfw1dev cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/605597 [14:16:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:15] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: install galera on codfw1dev cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/605597 (owner: 10Andrew Bogott) [14:18:23] RECOVERY - ganeti-mond running on ganeti3002 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [14:18:56] (03PS1) 10Filippo Giunchedi: templates: add ipv6 for thanos-be2* [dns] - 10https://gerrit.wikimedia.org/r/605598 (https://phabricator.wikimedia.org/T252186) [14:20:51] (03CR) 10Filippo Giunchedi: [C: 03+2] templates: add ipv6 for thanos-be2* [dns] - 10https://gerrit.wikimedia.org/r/605598 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:24:23] !log Deploy schema change on db2107 (s2 codfw master) - T250066 [14:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:27] (03PS4) 10Andrew Bogott: Galera: move behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605336 (https://phabricator.wikimedia.org/T242455) [14:24:27] T250066: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 [14:24:29] (03PS13) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [14:24:31] (03PS1) 10Andrew Bogott: codfw1dev: try to get hiera to notice galera settings [puppet] - 10https://gerrit.wikimedia.org/r/605599 [14:24:51] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10MoritzMuehlenhoff) The Ganeti clusters in 
esams and eqsin have also been rebooted, they should also be ready for instances with public IPs now. [14:25:08] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: try to get hiera to notice galera settings [puppet] - 10https://gerrit.wikimedia.org/r/605599 (owner: 10Andrew Bogott) [14:25:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [homer/public] - 10https://gerrit.wikimedia.org/r/605594 (owner: 10Ayounsi) [14:30:27] (03CR) 10Ayounsi: [C: 03+2] "Self +2 as it's a NOOP in prod and similar to existing code." [homer/public] - 10https://gerrit.wikimedia.org/r/605557 (owner: 10Ayounsi) [14:30:42] (03CR) 10Ayounsi: [C: 03+2] Add k8s stage BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/605594 (owner: 10Ayounsi) [14:30:58] (03Merged) 10jenkins-bot: Manage routers' anycast BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/605557 (owner: 10Ayounsi) [14:31:08] (03Merged) 10jenkins-bot: Add k8s stage BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/605594 (owner: 10Ayounsi) [14:35:54] (03PS2) 10Jbond: profile::sre::check_mail: new script for checking user emails [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) [14:39:41] (03PS1) 10Herron: logstash: align number of shards with number of ES indexing hosts [puppet] - 10https://gerrit.wikimedia.org/r/605602 (https://phabricator.wikimedia.org/T255243) [14:39:43] (03PS5) 10Huji: Set $wgCheckUserLogLogins to true for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) [14:40:41] (03CR) 10Huji: [C: 04-1] "Until https://gerrit.wikimedia.org/r/605301/ is merged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) (owner: 10Huji) [14:42:04] (03PS5) 10Andrew Bogott: Galera: move behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605336 (https://phabricator.wikimedia.org/T242455) [14:42:06] (03PS14) 10Andrew Bogott: Add icinga 
monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [14:42:08] (03PS1) 10Andrew Bogott: wmcs galera: fix firewall rules to remove commas from ip lists [puppet] - 10https://gerrit.wikimedia.org/r/605603 [14:43:23] (03CR) 10RhinosF1: "LGTM once CU patch rides the train to fawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) (owner: 10Huji) [14:44:08] (03CR) 10Andrew Bogott: [C: 03+2] wmcs galera: fix firewall rules to remove commas from ip lists [puppet] - 10https://gerrit.wikimedia.org/r/605603 (owner: 10Andrew Bogott) [14:44:26] (03PS2) 10Jbond: rsync: move oneline script inline [puppet] - 10https://gerrit.wikimedia.org/r/605275 (https://phabricator.wikimedia.org/T254480) [14:44:28] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605275 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [14:48:13] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) a:05akosiaris→03Dzahn [14:49:50] (03CR) 10Huji: [C: 04-1] "Thanks. I will be closely following up on this and will schedule its deployment, etc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599492 (https://phabricator.wikimedia.org/T253802) (owner: 10Huji) [14:51:32] (03CR) 10Filippo Giunchedi: "The idea LGTM overall and +1 to run with this to test a fix for T255243." 
[puppet] - 10https://gerrit.wikimedia.org/r/605602 (https://phabricator.wikimedia.org/T255243) (owner: 10Herron) [14:53:12] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: align number of shards with number of ES indexing hosts [puppet] - 10https://gerrit.wikimedia.org/r/605602 (https://phabricator.wikimedia.org/T255243) (owner: 10Herron) [14:54:52] (03PS6) 10Andrew Bogott: Galera: move behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605336 (https://phabricator.wikimedia.org/T242455) [14:54:54] (03PS15) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [14:54:55] 10Operations, 10DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission heka.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248628 (10Jgreen) 05duplicate→03Resolved [14:54:57] (03PS1) 10Andrew Bogott: wmcs galera: rearrange hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/605604 (https://phabricator.wikimedia.org/T242455) [14:55:25] !log delete VCP from msw1-codfw [14:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:05] (03CR) 10Andrew Bogott: [C: 03+2] wmcs galera: rearrange hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/605604 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [14:59:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1144:3314', diff saved to https://phabricator.wikimedia.org/P11496 and previous config saved to /var/cache/conftool/dbconfig/20200615-145914-marostegui.json [14:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:59] !log Deploy schema change on db1144:3314 [15:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P11497 and previous config saved to 
/var/cache/conftool/dbconfig/20200615-150148-marostegui.json [15:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:29] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [15:06:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P11498 and previous config saved to /var/cache/conftool/dbconfig/20200615-150639-marostegui.json [15:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:12] !log Deploy schema change on db1121 (and labs) [15:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:15] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw:fundraising single-cpu misc servers frpig2001,civi2001.pay-lvs200[1-2] - https://phabricator.wikimedia.org/T244950 (10Jgreen) 05Declined→03Resolved [15:09:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121', diff saved to https://phabricator.wikimedia.org/P11499 and previous config saved to /var/cache/conftool/dbconfig/20200615-150908-marostegui.json [15:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:14] (03PS7) 10Andrew Bogott: Galera: move behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605336 (https://phabricator.wikimedia.org/T242455) [15:10:16] (03PS16) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [15:10:18] (03PS1) 10Andrew Bogott: wmcs galera codfw1dev: enable [puppet] - 10https://gerrit.wikimedia.org/r/605607 (https://phabricator.wikimedia.org/T242455) [15:11:13] (03CR) 10Andrew Bogott: [C: 03+2] wmcs galera codfw1dev: enable [puppet] - 10https://gerrit.wikimedia.org/r/605607 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [15:16:51] !log upgrading wtp1025-wtp1027 to PHP 7.2.31 
[15:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:21] (03PS1) 10WMDE-leszek: Wikibase: added a config to simpler configuration of "entity sources" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605608 (https://phabricator.wikimedia.org/T242087) [15:23:23] (03PS1) 10WMDE-leszek: Wikibase: Generate "entity source" config array based on source names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605609 (https://phabricator.wikimedia.org/T242087) [15:23:25] (03PS1) 10WMDE-leszek: Wikibase: Removed no longer used wmgWikibaseEntitySources setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605610 (https://phabricator.wikimedia.org/T242087) [15:25:46] (03CR) 10jerkins-bot: [V: 04-1] Wikibase: Generate "entity source" config array based on source names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605609 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [15:28:41] 10Operations, 10MediaWiki-General, 10Patch-For-Review, 10Sustainability (Incident Prevention): Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378 (10Krinkle) [15:31:14] (03PS2) 10WMDE-leszek: Wikibase: Generate "entity source" config array based on source names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605609 (https://phabricator.wikimedia.org/T242087) [15:32:20] (03CR) 10jerkins-bot: [V: 04-1] Wikibase: Generate "entity source" config array based on source names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605609 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [15:33:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P11501 and previous config saved to /var/cache/conftool/dbconfig/20200615-153344-marostegui.json [15:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:47] (03PS3) 10WMDE-leszek: 
Wikibase: Generate "entity source" config array based on source names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605609 (https://phabricator.wikimedia.org/T242087) [15:35:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1141', diff saved to https://phabricator.wikimedia.org/P11502 and previous config saved to /var/cache/conftool/dbconfig/20200615-153546-marostegui.json [15:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:57] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review, 10Product-Analytics (Kanban): Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10fdans) [15:36:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1142', diff saved to https://phabricator.wikimedia.org/P11503 and previous config saved to /var/cache/conftool/dbconfig/20200615-153630-marostegui.json [15:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:42] (03PS2) 10WMDE-leszek: Wikibase: Removed no longer used wmgWikibaseEntitySources setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605610 (https://phabricator.wikimedia.org/T242087) [15:37:11] !log Deploy schema change on db1142 [15:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1142', diff saved to https://phabricator.wikimedia.org/P11504 and previous config saved to /var/cache/conftool/dbconfig/20200615-153825-marostegui.json [15:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:40] (03CR) 10BryanDavis: [C: 03+2] Remove validation of Kubernetes self-signed API cert [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/598109 (https://phabricator.wikimedia.org/T253412) (owner: 10BryanDavis) [15:49:02] (03CR) 10BryanDavis: [C: 03+2] Disable tool name alias in lighttpd 
config with --canonical [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: 10BryanDavis) [15:51:01] (03PS1) 10Awight: [beta] Don't shuffle answers for demo survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605616 (https://phabricator.wikimedia.org/T253112) [15:52:21] (03PS1) 10Elukey: role::mediawiki::memcached::gutter: change slab distribution [puppet] - 10https://gerrit.wikimedia.org/r/605617 (https://phabricator.wikimedia.org/T252391) [15:52:39] (03Merged) 10jenkins-bot: Remove validation of Kubernetes self-signed API cert [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/598109 (https://phabricator.wikimedia.org/T253412) (owner: 10BryanDavis) [15:52:41] (03Merged) 10jenkins-bot: Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: 10BryanDavis) [16:11:03] (note: I'm deploying a minor config change for beta.) [16:11:31] (03CR) 10Awight: [V: 03+1 C: 03+2] "beta-only. Tested locally." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/605616 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [16:12:17] (03Merged) 10jenkins-bot: [beta] Don't shuffle answers for demo survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605616 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [16:12:39] 10Operations, 10Traffic: ats-backend throttles connections under heavy load - https://phabricator.wikimedia.org/T254714 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [16:22:06] (03PS1) 10Awight: [beta] Fix enum value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605619 (https://phabricator.wikimedia.org/T253112) [16:22:45] (03CR) 10Awight: [V: 03+1 C: 03+2] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605619 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [16:22:50] (03PS8) 10Andrew Bogott: Galera: move behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605336 (https://phabricator.wikimedia.org/T242455) [16:22:52] (03PS17) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [16:22:54] (03PS1) 10Andrew Bogott: wmcs galera: fix mysqld process check [puppet] - 10https://gerrit.wikimedia.org/r/605620 (https://phabricator.wikimedia.org/T242455) [16:23:37] (03Merged) 10jenkins-bot: [beta] Fix enum value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605619 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [16:24:26] (03CR) 10Andrew Bogott: [C: 03+2] wmcs galera: fix mysqld process check [puppet] - 10https://gerrit.wikimedia.org/r/605620 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [16:25:12] (03CR) 10CDanis: [C: 03+1] rsync: move oneline script inline [puppet] - 10https://gerrit.wikimedia.org/r/605275 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [16:34:16] (03CR) 10Andrew Bogott: [C: 03+2] Galera: move behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605336 
(https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [16:37:32] (03PS1) 10Ppchelko: Switch changeprop and changeprop-jobqueue to v0.10.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/605621 (https://phabricator.wikimedia.org/T255278) [16:38:32] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:09] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] [beta] Fix enum value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605619 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [16:44:23] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] [beta] Don't shuffle answers for demo survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605616 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [16:45:37] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10wiki_willy) @ayounsi - I think the alert is being triggered from the Finance spreadsheet: https://docs.google.com/spreadsheets/d/11xbHX7lRzglFYc85kvmtOjssOfm3tCkxH7cYr7ImFbk/edit#gid=0... [16:48:55] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10wiki_willy) Thanks @Papaul - I'm going to paste this link below for future reference when purchasing: https://www.altex.com/6-black-slotted-wall-wiring-duct-3-x-3-w-cover Thanks, Willy [16:50:26] 10Operations, 10WMF-Design, 10Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Prtksxna) 05Open→03Resolved Thanks @Dzahn! So, jekyll had configurations to deal with websites that are in sub-directories and I think everythi... 
[16:51:49] (03PS18) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [16:51:51] (03PS1) 10Andrew Bogott: wmcs galera: move backend port to 23306; 13306 is already occupied by prometheus [puppet] - 10https://gerrit.wikimedia.org/r/605622 (https://phabricator.wikimedia.org/T242455) [16:52:38] (03CR) 10Andrew Bogott: [C: 03+2] wmcs galera: move backend port to 23306; 13306 is already occupied by prometheus [puppet] - 10https://gerrit.wikimedia.org/r/605622 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [16:59:50] (03PS1) 10Volans: Include pip into the built wheels [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/605623 (https://phabricator.wikimedia.org/T245114) [17:00:04] gehel and onimisionipe: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T1700). [17:03:18] (03PS2) 10Volans: Include pip into the built wheels [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/605623 (https://phabricator.wikimedia.org/T245114) [17:10:23] 10Operations, 10DBA, 10SRE-tools, 10Patch-For-Review: Add native mysql module to spicerack - https://phabricator.wikimedia.org/T255409 (10jcrespo) a:03Kormat [17:16:05] (03PS19) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [17:16:07] (03PS1) 10Andrew Bogott: wmcs galera: move codfw1dev mysql port behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605625 (https://phabricator.wikimedia.org/T242455) [17:17:26] (03CR) 10jerkins-bot: [V: 04-1] wmcs galera: move codfw1dev mysql port behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605625 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [17:17:32] 10Puppet, 10Analytics, 10Cloud-VPS: Puppet failing on wikistats.analytics.eqiad.wmflabs: /usr/local/sbin/x509-bundle error - 
https://phabricator.wikimedia.org/T255464 (10bd808) [17:19:22] (03PS2) 10Andrew Bogott: wmcs galera: move codfw1dev mysql port behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605625 (https://phabricator.wikimedia.org/T242455) [17:19:24] (03PS20) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [17:20:59] (03CR) 10Andrew Bogott: [C: 03+2] wmcs galera: move codfw1dev mysql port behind haproxy [puppet] - 10https://gerrit.wikimedia.org/r/605625 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [17:39:18] (03PS9) 10Krinkle: profiler: Add PDO driver for XHGui and enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:39:22] (03PS10) 10Krinkle: profiler: Add PDO driver for XHGui and enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:41:57] (03CR) 10Krinkle: [C: 03+2] profiler: Add PDO driver for XHGui and enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:42:45] (03Merged) 10jenkins-bot: profiler: Add PDO driver for XHGui and enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:43:03] (03CR) 10Cwhite: [C: 03+2] hiera: disable hardware monitoring on analytics1049 and thumbor1004 [puppet] - 10https://gerrit.wikimedia.org/r/605270 (owner: 10Cwhite) [17:44:54] * Krinkle staging on mwdebug1002 [17:46:25] (03CR) 10Cwhite: [C: 03+2] hiera: install mtail 3.0.0~rc35 from component in esams and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/599474 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [17:47:01] (03PS1) 10BryanDavis: d/changelog: prepare for 0.71 release 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/605628 [17:47:30] (03CR) 10BryanDavis: [C: 03+2] d/changelog: prepare for 0.71 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/605628 (owner: 10BryanDavis) [17:48:21] (03Merged) 10jenkins-bot: d/changelog: prepare for 0.71 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/605628 (owner: 10BryanDavis) [17:52:29] !log krinkle@deploy1001 Synchronized lib/: I7721f4018b07dac (duration: 00m 58s) [17:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:04] !log krinkle@deploy1001 Synchronized wmf-config/ProductionServices.php: I7721f4018b07dac (duration: 00m 57s) [17:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:19] !log krinkle@deploy1001 Synchronized wmf-config: I7721f4018b07dac (duration: 00m 58s) [17:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T1800). [18:00:09] (03CR) 10CRusnov: [C: 03+1] "As we've discussed, this looks good. I tested several scenarios on af-netbox and it works as expected. 
I did a run through the code and it" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601877 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:06:33] (03PS1) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [18:11:30] (03CR) 10Herron: [C: 03+2] logstash: align number of shards with number of ES indexing hosts [puppet] - 10https://gerrit.wikimedia.org/r/605602 (https://phabricator.wikimedia.org/T255243) (owner: 10Herron) [18:13:20] PROBLEM - IPMI Sensor Status on thumbor1004 is CRITICAL: NRPE: Command check_check_ipmi_sensor not defined https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:14:58] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1 [18:14:58] qiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [18:15:43] jouncebot: next [18:15:43] In 1 hour(s) and 44 minute(s): Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T2000) [18:15:47] jouncebot: now [18:15:48] For the next 0 hour(s) and 44 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T1800) [18:16:43] James_F: how do you feel about trying to get these entity sources related wikibase patches deployed today? [18:16:54] get them off the silly backlog [18:20:01] addshore: I'm OK with it, and if you want to deploy now please go ahead, but probably best for me to not be the deployer as I'm feeling headachey. 
[18:20:23] ack! I'm going to start with some of the "easier" ones in this next 40 mins then! [18:20:29] (03PS3) 10Addshore: test wikidata: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605584 (owner: 10WMDE-leszek) [18:20:31] (03CR) 10Addshore: [C: 03+2] test wikidata: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605584 (owner: 10WMDE-leszek) [18:22:38] (03PS2) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [18:23:18] (03PS3) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [18:25:00] (03CR) 10Addshore: [C: 03+2] test wikidata: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605584 (owner: 10WMDE-leszek) [18:25:08] gate succeeded, but it didn't merge... 
[18:25:52] (03Merged) 10jenkins-bot: test wikidata: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605584 (owner: 10WMDE-leszek) [18:26:39] (03PS4) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [18:28:32] (03PS3) 10Addshore: test commons: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605585 (owner: 10WMDE-leszek) [18:28:56] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:605584]] T254315 test wikidata: Use the database name in the Wikibase entity source config (duration: 00m 58s) [18:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:00] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [18:29:11] (03CR) 10Addshore: [C: 03+2] test commons: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605585 (owner: 10WMDE-leszek) [18:30:07] (03Merged) 10jenkins-bot: test commons: Use the database name in the Wikibase entity source config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605585 (owner: 10WMDE-leszek) [18:34:42] James_F: are depict statements behaving badly for you on https://test-commons.wikimedia.org/wiki/File:Jakarta_MRT_women_car_sign.jpg on mwdebug1002? [18:35:18] / on any file [18:37:09] Looking. [18:37:31] think I see a JS exception, interesting, not what I was expecting [18:38:16] "Error: Invalid statement: value type mismatch" [18:38:40] hmm O_o that's different to me again.... [18:38:58] Aka yes, broken on mwdebug1002. 
[18:39:27] just a useful "Exception in module-execute in module wikibase.mediainfo.filePageDisplay:" [18:43:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:43:51] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:47:31] https://phabricator.wikimedia.org/P11508 [18:47:50] (03PS1) 10Addshore: Revert "test commons: Use the database name in the Wikibase entity source config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605643 [18:47:59] (03CR) 10Addshore: [C: 03+2] Revert "test commons: Use the database name in the Wikibase entity source config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605643 (owner: 10Addshore) [18:48:18] (03PS5) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [18:48:46] James_F: well... 1 out the door, I'm going to go for a run and check back later and ponder this odd UI issue... [18:48:51] (03Merged) 10jenkins-bot: Revert "test commons: Use the database name in the Wikibase entity source config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605643 (owner: 10Addshore) [18:49:23] (03PS1) 10Addshore: Revert "Revert "test commons: Use the database name in the Wikibase entity source config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605645 [18:49:34] Ack. 
[18:50:10] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@41186c8]: port glent from oozie to airflow [18:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:50] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@41186c8]: port glent from oozie to airflow (duration: 00m 39s) [18:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:33] troubleshooting "too many messages in kafka-logging eqiad" and seeing logs like "PHP Notice: Undefined variable: wmgXhguiDBpassword in \/srv\/mediawiki\/wmf-config\/PhpAutoPrepend.php" look to have started about the same time and are being throttled by logstash https://logstash.wikimedia.org/goto/9ab81c2d50d8816e36854cca1854189d [18:52:45] "PHP Notice: Undefined variable: wmgXhguiDBuser in \/srv\/mediawiki\/wmf-config\/PhpAutoPrepend.php" as well [18:53:05] (03CR) 10CDanis: [C: 03+1] ATS: use X-Cache-Status 'int' for responses without lookup [puppet] - 10https://gerrit.wikimedia.org/r/604710 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [18:53:41] (03PS6) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [18:53:54] (03CR) 10CDanis: [C: 03+1] role::mediawiki::memcached::gutter: change slab distribution [puppet] - 10https://gerrit.wikimedia.org/r/605617 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [18:54:09] (03CR) 10CDanis: [C: 03+1] swift: optional read affinity proxy setting [puppet] - 10https://gerrit.wikimedia.org/r/605591 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [18:54:21] (03CR) 10CDanis: [C: 03+1] hieradata: enable read affinity for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/605592 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [18:57:58] (03CR) 10Bstorm: [C: 04-1] "Fixing some typos." 
[puppet] - 10https://gerrit.wikimedia.org/r/605634 (owner: 10Bstorm) [19:00:35] (03PS7) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [19:01:10] (03PS2) 10Esanders: Install DiscussionTools on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) [19:01:27] (03PS3) 10Esanders: Install DiscussionTools on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) [19:03:13] (03PS1) 10Rush: peek: add hcaptcha to asana backend [puppet] - 10https://gerrit.wikimedia.org/r/605648 [19:03:48] (03CR) 10Rush: [C: 03+2] peek: add hcaptcha to asana backend [puppet] - 10https://gerrit.wikimedia.org/r/605648 (owner: 10Rush) [19:05:11] (03CR) 10RhinosF1: "Task says all wikipedias, commit is for all wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: 10Esanders) [19:09:16] (03PS8) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [19:09:41] 10Operations, 10Research, 10Wikimedia-Mailing-lists: Admin password reset request for a mailman list: research-wmf - https://phabricator.wikimedia.org/T255326 (10leila) @jbond thanks a lot for the very fast turnaround on this task. I just changed the password and confirm that I can access the administrator UI. [19:13:12] (03PS1) 10Hashar: Add basic doc for python-build* images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605649 [19:15:12] Krinkle: could that be an unexpected side effect from https://gerrit.wikimedia.org/r/603546 ? 
[19:16:23] (03CR) 10Ppchelko: [C: 03+1] "exclude_topics:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/604425 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [19:21:19] (03PS9) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [19:25:40] (03PS21) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [19:25:42] (03PS1) 10Andrew Bogott: wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) [19:26:54] (03CR) 10jerkins-bot: [V: 04-1] wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:27:11] (03PS10) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [19:29:13] (03PS2) 10Andrew Bogott: wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) [19:29:15] (03PS22) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [19:29:40] (03PS11) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [19:30:25] (03CR) 10jerkins-bot: [V: 04-1] wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:31:30] (03PS3) 10Andrew Bogott: wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) [19:31:32] (03PS23) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [19:31:51] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 1270 ge 210 
https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [19:32:37] (03CR) 10jerkins-bot: [V: 04-1] wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:33:42] herron: likely [19:33:43] (03CR) 10Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/23233/" [puppet] - 10https://gerrit.wikimedia.org/r/605634 (owner: 10Bstorm) [19:33:47] will fix [19:33:55] kk, thanks [19:34:31] how frequent is it? [19:34:33] I'm in a meeting [19:34:38] feel free to revert it's fine [19:35:08] !log update mtail to 3.0.0~rc35 on mw in eqiad [19:35:10] IT shouldn't cause any problem user-facing given it's null now anyway, [19:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:24] But if the volume is a risk can mergency revert [19:35:29] which I'll fix in an hour otherwise [19:35:38] (03PS4) 10Andrew Bogott: wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) [19:35:40] (03PS24) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [19:36:15] Krinkle: looks like it's only lagging one topic (rsyslog-notice) so I think ok to handle it when time permits [19:36:48] (03CR) 10jerkins-bot: [V: 04-1] wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:36:55] herron: ok, interesting. is that separate from the topic that handles the "important" mediawikii fatal/exception stuff? 
I thought it was the same [19:37:50] yeah, the "udp-localhost" topics that mw uses look ok so far [19:38:57] (03PS5) 10Andrew Bogott: wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) [19:38:59] (03PS25) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [19:39:20] herron: ok, I have more questions, but will ask later. [19:39:29] would conflict with other logs matching "severity":"notice", "facility":"daemon", "program":"php7.2-fpm" [19:39:42] ok [19:40:06] ah, it's the syslog one, not the mw one. The message is sent twice, once through mw, once through fpm [19:40:10] (03CR) 10jerkins-bot: [V: 04-1] wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:40:12] yeah, that one doesn't matter, we don't look at that [19:40:31] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena) I created a helm test and got the integration and functional tests running in minikube. Do... 
[19:41:33] (03CR) 10BPirkle: [C: 03+1] "Approved for self-merge and deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605437 (owner: 10Tim Starling) [19:42:03] (03PS6) 10Andrew Bogott: wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) [19:42:05] (03PS26) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [19:43:12] (03CR) 10jerkins-bot: [V: 04-1] wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:45:55] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [19:45:56] (03PS7) 10Andrew Bogott: wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) [19:45:58] (03PS27) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [19:46:42] (03CR) 10BPirkle: [C: 03+1] "Approved for self-merge and deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605440 (https://phabricator.wikimedia.org/T245170) (owner: 10Tim Starling) [19:48:55] (03CR) 10Andrew Bogott: [C: 03+2] wmcs haproxy: fix up mysql config [puppet] - 10https://gerrit.wikimedia.org/r/605651 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:51:51] (03PS1) 10Ammarpad: Add extended-confirmed group and restriction level for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605652 (https://phabricator.wikimedia.org/T254471) [19:53:47] 10Operations, 10Release Pipeline, 
10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10thcipriani) >>! In T224041#6225405, @jeena wrote: > I created a helm test and got the integration... [19:54:21] (03PS2) 10Ammarpad: Add extended-confirmed group and restriction level for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605652 (https://phabricator.wikimedia.org/T254471) [19:54:31] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10Gilles) [19:55:11] (03PS2) 10Hashar: Add basic doc for python-build* images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605649 [19:55:13] (03PS1) 10Hashar: python-build: allow reuse of existing wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 [19:58:37] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [19:58:40] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10jeena) >>! In T224041#6225462, @thcipriani wrote: > PipelineLib could clean up an image from the r... [19:59:21] (03PS28) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [20:00:04] halfak and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Citoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T2000). [20:02:15] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:06:49] (03PS29) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [20:12:35] (03CR) 10BryanDavis: cloud nfs: allow opt-in soft mounting wherever folks want to try it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) (owner: 10Bstorm) [20:17:00] 10Operations, 10Research, 10Wikimedia-Mailing-lists: Admin password reset request for a mailman list: research-wmf - https://phabricator.wikimedia.org/T255326 (10jbond) @leila great and no problem :) [20:22:07] 10Operations, 10Analytics-Radar, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10RobH) @spatton does have a wikitech account (I checked ldap) but this still needs feedback if https://turnilo.wikimedia.org will meet their needs. There i... 
[20:22:46] (03PS1) 10Ammarpad: Change sidebar upload link destination for tr.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605656 (https://phabricator.wikimedia.org/T253490) [20:23:40] (03CR) 10jerkins-bot: [V: 04-1] Change sidebar upload link destination for tr.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605656 (https://phabricator.wikimedia.org/T253490) (owner: 10Ammarpad) [20:23:42] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for soworu - https://phabricator.wikimedia.org/T252705 (10RobH) @soworu, This has been pending feedback since May 26th regarding: >>! In T252705#6146379, @RLazarus wrote: > Hi Segun, thanks for the clear and comple... [20:26:25] (03PS2) 10Ammarpad: Change sidebar upload link destination for tr.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605656 (https://phabricator.wikimedia.org/T253490) [20:27:49] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review, 10Product-Analytics (Kanban): Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10RobH) Since this is setting a new user group for sudo as that user(s), shou... 
[20:30:45] !log update mtail to 3.0.0~rc35 on wtp in eqiad [20:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:52] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:44:44] !log update mtail to 3.0.0~rc35 on cp nodes in eqiad and esams [20:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:35] (03CR) 10RLazarus: [C: 03+1] role::mediawiki::memcached::gutter: change slab distribution [puppet] - 10https://gerrit.wikimedia.org/r/605617 (https://phabricator.wikimedia.org/T252391) (owner: 10Elukey) [20:55:47] !log update mtail to 3.0.0~rc35 on the rest of the hosts - eqiad and esams [20:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:22] !log ebernhardson@deploy1001 Started deploy [search/airflow@62a024b]: Add pydruid to airflow [20:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:12] !log ebernhardson@deploy1001 Finished deploy [search/airflow@62a024b]: Add pydruid to airflow (duration: 00m 50s) [20:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] Reedy and sbassett: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T2100). [21:05:47] 10Operations, 10Cloud-VPS, 10DNS, 10Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (10bd808) >>! In T161256#6221208, @TheDJ wrote: > For future reference.. 
I think these types of subdomains still require som... [21:12:06] (03PS1) 10Cwhite: hiera: set mtail disable_fsnotify in eqiad and esams [puppet] - 10https://gerrit.wikimedia.org/r/605668 (https://phabricator.wikimedia.org/T251466) [21:12:09] (03CR) 10Bstorm: cloud nfs: allow opt-in soft mounting wherever folks want to try it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) (owner: 10Bstorm) [21:13:50] (03CR) 10Cwhite: [C: 03+2] hiera: set mtail disable_fsnotify in eqiad and esams [puppet] - 10https://gerrit.wikimedia.org/r/605668 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [21:16:27] (03CR) 10Bstorm: cloud nfs: allow opt-in soft mounting wherever folks want to try it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605310 (https://phabricator.wikimedia.org/T127559) (owner: 10Bstorm) [21:18:56] (03PS1) 10Cwhite: hiera: set disable_fsnotify for cache base in eqiad and esams [puppet] - 10https://gerrit.wikimedia.org/r/605669 (https://phabricator.wikimedia.org/T251466) [21:19:10] +/14 [21:19:34] (03CR) 10Cwhite: [C: 03+2] hiera: set disable_fsnotify for cache base in eqiad and esams [puppet] - 10https://gerrit.wikimedia.org/r/605669 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [21:19:50] (03CR) 10Bstorm: [C: 03+2] "Heh, this probably won't even work from a bastion anymore anyway." 
[puppet] - 10https://gerrit.wikimedia.org/r/605345 (https://phabricator.wikimedia.org/T157792) (owner: 10BryanDavis) [21:25:53] 10Operations, 10Performance-Team, 10serviceops, 10Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Krinkle) a:03aaron [21:27:16] 10Operations, 10Traffic, 10Services (watching), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle) [21:27:28] 10Operations, 10Traffic, 10Services (watching), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle) [21:27:49] (03CR) 10Andrew Bogott: [C: 03+1] "this seems fine as long as it isn't applying to a ton of VMs on the same hypervisor." [puppet] - 10https://gerrit.wikimedia.org/r/605550 (https://phabricator.wikimedia.org/T255371) (owner: 10Hashar) [21:43:24] (03CR) 10Andrew Bogott: [C: 03+1] "This looks ok to me, and the pcc results (although weirdly scrambled) also look ok. I few minor questions inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605634 (owner: 10Bstorm) [21:47:38] (03CR) 10Bstorm: labstore: long-overdue refactor of primary profile and role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605634 (owner: 10Bstorm) [21:48:37] (03PS12) 10Bstorm: labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 [21:49:35] (03CR) 10Andrew Bogott: [C: 03+1] labstore: long-overdue refactor of primary profile and role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605634 (owner: 10Bstorm) [21:53:38] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [21:56:23] 10Operations, 10fundraising-tech-ops, 10netops, 10WMF-NDA: Deploy pfw policy 1591901800 for T122104 - https://phabricator.wikimedia.org/T255185 (10Dwisehaupt) Thanks. Runsgood. [21:57:00] (03CR) 10Bstorm: [C: 03+2] labstore: long-overdue refactor of primary profile and role [puppet] - 10https://gerrit.wikimedia.org/r/605634 (owner: 10Bstorm) [22:00:42] (03PS30) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [22:00:55] 10Operations, 10Cloud-VPS, 10DNS, 10Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (10bd808) 05Open→03Resolved a:03Andrew I'm going to call this {{done}}. Maps is covered and we can do something simila... 
[22:02:00] !log downtimed puppet alerts for testing some changes on labstore1004/5 [22:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:59] (03PS31) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [22:11:47] (03PS32) 10Andrew Bogott: Add icinga monitoring for WMCS galera [puppet] - 10https://gerrit.wikimedia.org/r/605315 [22:13:37] (03PS2) 10CRusnov: netbox: Configure for netbox-dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) [22:14:01] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) (owner: 10CRusnov) [22:24:50] (03PS1) 10Bstorm: labstore: fix variable definition for drbd_actual_role [puppet] - 10https://gerrit.wikimedia.org/r/605685 [22:28:04] (03CR) 10Bstorm: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/23244/console" [puppet] - 10https://gerrit.wikimedia.org/r/605685 (owner: 10Bstorm) [22:28:29] (03CR) 10Bstorm: [C: 03+2] labstore: fix variable definition for drbd_actual_role [puppet] - 10https://gerrit.wikimedia.org/r/605685 (owner: 10Bstorm) [22:31:21] (03PS1) 10Cwhite: set disable_fsnotify for all current mtail usage [puppet] - 10https://gerrit.wikimedia.org/r/605688 (https://phabricator.wikimedia.org/T251466) [22:39:56] (03PS3) 10CRusnov: netbox: Configure for netbox-dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) [22:44:58] (03CR) 10Cwhite: "some nits inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) (owner: 10CRusnov) [22:45:49] (03PS1) 10Krinkle: profiler: Fix undefined $wmgXhguiDBuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605690 (https://phabricator.wikimedia.org/T180761) [22:52:40] (03CR) 10Krinkle: [C: 03+2] profiler: Fix undefined $wmgXhguiDBuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605690 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [22:52:42] (03CR) 10Cwhite: [C: 03+1] swift: optional read affinity proxy setting [puppet] - 10https://gerrit.wikimedia.org/r/605591 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [22:52:53] (03CR) 10Cwhite: [C: 03+1] hieradata: enable read affinity for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/605592 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [22:53:27] (03Merged) 10jenkins-bot: profiler: Fix undefined $wmgXhguiDBuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605690 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [22:54:01] (03CR) 10Cwhite: [C: 03+1] prometheus: enable thanos upload in ops eqiad [puppet] - 10https://gerrit.wikimedia.org/r/605178 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [22:57:47] !log krinkle@deploy1001 Synchronized wmf-config/profiler.php: If7e1613cbcf8 (duration: 00m 59s) [22:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:44] !log krinkle@deploy1001 Synchronized wmf-config/PhpAutoPrepend.php: If7e1613cbcf8 (duration: 00m 56s) [22:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening backport window(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200615T2300). 
[23:00:04] Ammarpad: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:06:57] (03PS1) 10Bstorm: drbd: lowercase the output in the custom fact [puppet] - 10https://gerrit.wikimedia.org/r/605695 [23:09:37] (03PS4) 10CRusnov: netbox: Configure for netbox-dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) [23:10:44] (03CR) 10Bstorm: [C: 03+2] "tested in place on the servers this applies to" [puppet] - 10https://gerrit.wikimedia.org/r/605695 (owner: 10Bstorm) [23:11:11] (03CR) 10CRusnov: "Thanks much!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) (owner: 10CRusnov) [23:14:22] (03PS5) 10CRusnov: netbox: Configure for netbox-dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) [23:22:27] (03PS1) 10Bstorm: drbd_role: extract the string we are looking for [puppet] - 10https://gerrit.wikimedia.org/r/605699 [23:23:36] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10MNovotny_WMF) >>! In T254939#6217431, @Nuria wrote: > "let's wait to confirm the nature of internship to see... 
[23:25:37] (03CR) 10Bstorm: [C: 03+2] drbd_role: extract the string we are looking for [puppet] - 10https://gerrit.wikimedia.org/r/605699 (owner: 10Bstorm) [23:30:24] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@eb0ac12]: Ship templatad table names in HivePartitionRangeSensor [23:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:13] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@eb0ac12]: Ship templatad table names in HivePartitionRangeSensor (duration: 00m 49s) [23:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:08] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [23:52:22] (03CR) 10Tim Starling: [C: 03+2] Use a short connect timeout for PoolCounter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605436 (https://phabricator.wikimedia.org/T105378) (owner: 10Tim Starling) [23:53:09] (03Merged) 10jenkins-bot: Use a short connect timeout for PoolCounter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605436 (https://phabricator.wikimedia.org/T105378) (owner: 10Tim Starling) [23:56:55] !log tstarling@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: reducing connect timeout per T105378 (duration: 01m 00s) [23:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:59] T105378: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378