[00:04:26] 10Operations, 10Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664 (10Aklapper) @Verdy_p: This task got resolved 18 months ago. I don't know the reason for the last comment and how it's related to installi... [00:09:43] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [00:11:36] 10Operations, 10Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664 (10Verdy_p) I got a recent update today from this channel. It was sent by "Maintenance_bot removed a project: Patch-For-Review" which just... [00:14:45] 10Operations, 10Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664 (10Aklapper) @Verdy_p: No. This task got closed 18 months ago, see the task history in this very task. Please don't add unclear comments t... [00:21:17] 10Operations, 10Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664 (10Verdy_p) You affirmed "I don't know why" but I explain you the reason. It's a fact that I got notified by Phabricator just a few minute... [00:30:11] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [00:39:09] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [00:44:09] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [00:58:29] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [01:09:05] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [01:14:07] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [01:24:37] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [01:26:27] PROBLEM - High lag on wdqs1009 is CRITICAL: 1.011e+06 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [01:27:50] ACKNOWLEDGEMENT - High lag on wdqs1009 is CRITICAL: 1.011e+06 ge 3600 Stas Malychev reloading in progress https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [01:48:03] (03CR) 10Huji: "Can you rebase please?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/480074 (owner: 10Daimona Eaytoy) [03:07:31] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [03:18:51] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [03:29:42] (03CR) 10BryanDavis: "Untested, but the general logic seems reasonable." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524610 (https://phabricator.wikimedia.org/T221301) (owner: 10Jhedden) [03:51:01] (03PS1) 10Andrew Bogott: wmcs-cold-migrate: provide the --leak switch [puppet] - 10https://gerrit.wikimedia.org/r/524701 [03:53:08] (03PS2) 10Andrew Bogott: wmcs-cold-migrate: provide the --leak switch [puppet] - 10https://gerrit.wikimedia.org/r/524701 [03:54:11] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cold-migrate: provide the --leak switch [puppet] - 10https://gerrit.wikimedia.org/r/524701 (owner: 10Andrew Bogott) [04:17:00] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/CentralAuth/includes/specials/SpecialMultiLock.php: fix UBN bug T227772 (duration: 00m 56s) [04:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:09] T227772: Fix or remove capability to override user rights for the current request - https://phabricator.wikimedia.org/T227772 [04:21:53] PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [04:44:16] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/TorBlock/includes/TorExitNodes.php: fixing UBN T228465 (duration: 00m 56s) [04:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:28] T228465: TorBlock maintenance failures on labweb hosts - https://phabricator.wikimedia.org/T228465 [04:46:07] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/TorBlock/maintenance/loadExitNodes.php: fixing UBN T228465 (duration: 00m 54s) [04:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:52] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/TorBlock/extension.json: fixing UBN T228465 (duration: 00m 54s) [04:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:11] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [04:51:58] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10tstarling) [05:04:21] 10Operations, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10Marostegui) 05Open→03Resolved a:05Marostegui→03jcrespo As spoken, I am going to close this as the scope of the ticket is done. I will create a new... 
[05:08:55] PROBLEM - Device not healthy -SMART- on helium is CRITICAL: cluster=misc device=megaraid,14 instance=helium:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=helium&var-datasource=eqiad+prometheus/ops [05:21:31] (03PS1) 10DannyS712: Revert "Remove "עמוד" namespace from wgFlaggedRevsNamespaces for hewikisource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524704 [05:24:34] !log Compress more tables on labsdb1009 - T222978 [05:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:41] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [05:25:48] (03PS2) 10DannyS712: Revert "Remove "עמוד" namespace from wgFlaggedRevsNamespaces for hewikisource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524704 (https://phabricator.wikimedia.org/T227000) [05:26:07] (03PS3) 10DannyS712: Fix "Remove "עמוד" namespace from wgFlaggedRevsNamespaces for hewikisource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524704 (https://phabricator.wikimedia.org/T227000) [05:47:55] (03PS1) 10Marostegui: db-eqiad.php: More weight to db1109 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524708 [05:51:14] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More weight to db1109 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524708 (owner: 10Marostegui) [05:51:29] (03PS8) 10Marostegui: mariadb: Provision dbproxy2001 into codfw m1 [puppet] - 10https://gerrit.wikimedia.org/r/518251 (https://phabricator.wikimedia.org/T202367) [05:52:13] (03Merged) 10jenkins-bot: db-eqiad.php: More weight to db1109 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524708 (owner: 10Marostegui) [05:52:30] (03CR) 10jenkins-bot: db-eqiad.php: More weight to db1109 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524708 (owner: 10Marostegui) [05:53:38] !log 
marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1109 into API (duration: 00m 58s) [05:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:56] (03PS1) 10Marostegui: db-eqiad.php: Remove db1104 from API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524712 [06:13:09] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Remove db1104 from API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524712 (owner: 10Marostegui) [06:14:00] (03Merged) 10jenkins-bot: db-eqiad.php: Remove db1104 from API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524712 (owner: 10Marostegui) [06:14:15] (03CR) 10jenkins-bot: db-eqiad.php: Remove db1104 from API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524712 (owner: 10Marostegui) [06:15:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1104 from s8 API (duration: 00m 55s) [06:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:11] !log restart hadoop-hdfs-namenode on an-master1002 to verify if out-of-the-ordinary GC activity [06:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision dbproxy2001 into codfw m1 [puppet] - 10https://gerrit.wikimedia.org/r/518251 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [06:29:11] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [06:29:49] PROBLEM - puppet last run on cloudvirt1029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [06:32:39] PROBLEM - puppet last run on ms-be1044 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [06:46:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:47:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [06:47:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:47:15] !log Stop MySQL on db2062 to test dbproxy2001 notification T202367 [06:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:25] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [06:51:02] lovely, a link down [06:51:29] the 50xs seems to be a single spike, I guess traffic failed over [06:53:45] there is a scheduled maintenance in the calendar [06:54:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 
[06:56:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [06:56:15] RECOVERY - puppet last run on cloudvirt1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [06:57:19] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [06:59:21] RECOVERY - puppet last run on ms-be1044 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [07:01:13] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [07:07:50] (03PS1) 10Muehlenhoff: Change email adresse for Chelsy Xie [puppet] - 10https://gerrit.wikimedia.org/r/524714 [07:08:48] (03CR) 10Muehlenhoff: [C: 03+2] Change email adresse for Chelsy Xie [puppet] - 10https://gerrit.wikimedia.org/r/524714 (owner: 10Muehlenhoff) [07:14:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:14:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:47] !log installing openjdk-11 security updates [07:17:53] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:15] (03PS1) 10Marostegui: db-eqiad.php: Depool db1134 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524716 (https://phabricator.wikimedia.org/T226851) [07:25:12] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1134 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524716 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [07:25:18] 10Operations, 10netops: AS63541's session down reported by cr1-eqsin - https://phabricator.wikimedia.org/T228617 (10elukey) [07:26:05] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1134 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524716 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [07:26:20] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1134 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524716 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [07:27:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1134 for schema change T226851 (duration: 00m 56s) [07:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:34] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [07:27:43] !log Drop afl_log_id column from enwiki.abuse_filter_log on db1134 T226851 [07:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:31] (03PS4) 10Elukey: superset: move httpd proxy config to a profile [puppet] - 10https://gerrit.wikimedia.org/r/524531 (https://phabricator.wikimedia.org/T227860) [07:35:16] (03PS1) 10Muehlenhoff: Phabricator: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/524718 [07:36:27] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17535/" [puppet] - 10https://gerrit.wikimedia.org/r/524531 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [07:40:52] !log systemctl 
reset-failed restbase on restbase1007->15 (decommed nodes) [07:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:33] (03PS1) 10Muehlenhoff: Remove support for jessie/wmf-mariadb10 from mariadb::packages_wmf [puppet] - 10https://gerrit.wikimedia.org/r/524721 [07:44:23] dcausse: o/ [07:44:35] o/ [07:44:40] hello :) [07:44:49] hey :) [07:45:03] I was reviewing icinga alerts and I noticed elastic1046 [07:45:20] that is probably due to https://phabricator.wikimedia.org/T228606 [07:45:38] elasticsearch_6@production-search-eqiad fails for Likely root cause: java.nio.file.AccessDeniedException: /var/run/elasticsearch [07:46:01] I am wondering if anything should be done for the host etc.. [07:46:05] not sure what the procedure is [07:47:09] elukey: the node should be depooled for disk replacement no? [07:47:59] dcausse: never done it before so this is why I am asking (geh*el said to ping you or Erik in case something was needed with elastic search IIUC) [07:48:47] as for elastic there is no special procedure, it can be removed/readded from the cluster [07:50:49] dcausse: weird, it seems that /var/run/elasticsearch is not there, but the failed disk is related to the root partition (that is in raid 1, so should keep going) [07:52:06] ls: reading directory '/srv/': Input/output error [07:52:26] that is a very good point, still haven't done it :D [07:52:49] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1134" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524724 [07:53:08] dcausse: ah lovely md1 (where /srv is mounted) is a raid0 [07:53:35] obvious :) [07:54:27] (03CR) 10Marostegui: "Not sure how this would affect any possible jessie hosts in WMCS? Are there any jessie there?" 
[puppet] - 10https://gerrit.wikimedia.org/r/524721 (owner: 10Muehlenhoff) [07:54:48] !log sudo -i depool on elastic1046 - broken disk (srv partition not available) - T228606 [07:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:55] T228606: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 [07:54:57] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1134" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524724 (owner: 10Marostegui) [07:55:10] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10elukey) ` elukey@elastic1046:~$ sudo -i depool Depooling all services on elastic1046.eqiad.wmnet eqiad/elasticsearch/elasticsearch/elastic1046.eqiad.wmnet: pooled changed yes => no eqiad/elasticsearch/elastic... [07:55:32] dcausse: (last question I promise) - anything more to do other than --^ ? [07:55:50] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1134" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524724 (owner: 10Marostegui) [07:56:09] elukey: no, it should be ok [07:56:19] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1134" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524724 (owner: 10Marostegui) [07:56:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1134 after schema change T226851 (duration: 00m 51s) [07:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:07] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [08:04:21] (03PS1) 10Elukey: Exclude notebook1* hosts from SCREEN/tmux checks [puppet] - 10https://gerrit.wikimedia.org/r/524725 [08:05:30] (03CR) 10Elukey: [C: 03+2] Exclude notebook1* hosts from SCREEN/tmux checks [puppet] - 10https://gerrit.wikimedia.org/r/524725 (owner: 10Elukey) [08:06:01] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - 
https://phabricator.wikimedia.org/T228606 (10elukey) p:05Triage→03Normal [08:09:02] 10Operations, 10Machine vision, 10serviceops, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Joe) >>! In T225664#5329541, @Mholloway wrote: > @joe Thanks (belatedly) for the comments. I've... [08:09:15] (03PS2) 10Muehlenhoff: puppetboard: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524487 (https://phabricator.wikimedia.org/T227650) [08:10:37] (03CR) 10Muehlenhoff: [C: 03+2] puppetboard: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524487 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [08:18:04] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Marostegui) [08:20:55] (03PS2) 10Marostegui: mariadb: Promote db1128 as master for m3 [puppet] - 10https://gerrit.wikimedia.org/r/523941 (https://phabricator.wikimedia.org/T228243) [08:27:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/524622 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [08:30:51] (03CR) 10Filippo Giunchedi: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524037 (owner: 10Ayounsi) [08:33:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/524583 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [08:33:40] !log Rename table enwiki.math on db2116 T196055 [08:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:48] T196055: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055 [08:34:04] 10Operations, 10media-storage: refresh swift hardware in codfw/eqiad - 
https://phabricator.wikimedia.org/T148647 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This has happened, resolving [08:36:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/523993 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [08:38:28] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10serviceops, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) >>! In T187147#5343202, @tstarling wrote: > Is this blocking deployment of PHP 7? In my opinion, it should not at this poi... [08:39:04] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [08:43:54] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [09:02:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/524586 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [09:02:45] (03PS2) 10Muehlenhoff: graphite: use ldap-ro, stop using ldap-labs, use Hiera [puppet] - 10https://gerrit.wikimedia.org/r/524586 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [09:04:12] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Joe) Frankly without further information on what happened, how it was debugged, and how it was solved, I'm... 
[09:08:21] (03CR) 10Muehlenhoff: [C: 03+2] graphite: use ldap-ro, stop using ldap-labs, use Hiera [puppet] - 10https://gerrit.wikimedia.org/r/524586 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [09:13:15] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Peachey88) [09:13:31] (03PS5) 10Muehlenhoff: netbox: Read LDAP server from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524494 (https://phabricator.wikimedia.org/T227650) [09:18:48] (03PS1) 10Marostegui: mariadb: Replace dbproxy1006 with dbproxy1012 [puppet] - 10https://gerrit.wikimedia.org/r/524729 (https://phabricator.wikimedia.org/T202367) [09:22:04] (03PS1) 10Elukey: role::analytics_cluster::hadoop::master/standby: tune CMS GC [puppet] - 10https://gerrit.wikimedia.org/r/524730 (https://phabricator.wikimedia.org/T228620) [09:22:10] (03PS2) 10Marostegui: mariadb: Replace dbproxy1006 with dbproxy1014 [puppet] - 10https://gerrit.wikimedia.org/r/524729 (https://phabricator.wikimedia.org/T202367) [09:23:08] (03CR) 10Muehlenhoff: [C: 03+2] netbox: Read LDAP server from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524494 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [09:23:19] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::hadoop::master/standby: tune CMS GC [puppet] - 10https://gerrit.wikimedia.org/r/524730 (https://phabricator.wikimedia.org/T228620) (owner: 10Elukey) [09:23:28] (03PS2) 10Elukey: role::analytics_cluster::hadoop::master/standby: tune CMS GC [puppet] - 10https://gerrit.wikimedia.org/r/524730 (https://phabricator.wikimedia.org/T228620) [09:26:18] 10Operations, 10observability, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10fgiunchedi) >>! 
In T203485#5329077, @Krinkle wrote: > In recent weeks there have been three or four occasions where an SRE kindly informed us about an "on-goi... [09:29:00] (03PS1) 10Muehlenhoff: profile::docker::engine: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/524731 [09:29:12] (03PS3) 10Marostegui: mariadb: Replace dbproxy1006 with dbproxy1014 in m1 [puppet] - 10https://gerrit.wikimedia.org/r/524729 (https://phabricator.wikimedia.org/T202367) [09:32:31] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Core Platform Team (Mainstash Multi-DC), and 3 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) >>! In T212129#5322515, @aaron wrote: > I think (1) is more useful and fills... [09:32:32] !log restart hadoop hdfs namenode on an-master1002 to apply new GC settings - T228620 [09:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:40] T228620: Frequent/Long GC old gen collections for HDFS namenodes on an-master100[1,2] - https://phabricator.wikimedia.org/T228620 [09:37:41] 10Operations, 10Traffic, 10Goal: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 (10ema) [09:37:49] 10Operations, 10Traffic, 10Goal: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 (10ema) p:05Triage→03Normal [09:38:28] (03PS4) 10Marostegui: mariadb: Replace dbproxy1006 with dbproxy1014 in m1 [puppet] - 10https://gerrit.wikimedia.org/r/524729 (https://phabricator.wikimedia.org/T202367) [09:39:54] (03PS2) 10Ema: cache: add role::cache::text_ats [puppet] - 10https://gerrit.wikimedia.org/r/522398 (https://phabricator.wikimedia.org/T227432) [09:39:56] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1002/17539/" [puppet] - 10https://gerrit.wikimedia.org/r/524729 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [09:40:15] !log Deploy grants on 
m1 to allow connections from dbproxy1014 - T202367 [09:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:23] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [09:43:06] (03CR) 10Muehlenhoff: librenms: use ldap-ro, stop using ldap-labs, use Hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523994 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [09:47:58] !log failover + restart of Hadoop HDFS namenode on an-master1001 to apply GC settings - T228620 [09:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:06] T228620: Frequent/Long GC old gen collections for HDFS namenodes on an-master100[1,2] - https://phabricator.wikimedia.org/T228620 [09:48:36] (03PS3) 10Ema: cache: add role::cache::text_ats [puppet] - 10https://gerrit.wikimedia.org/r/522398 (https://phabricator.wikimedia.org/T227432) [09:49:26] (03PS4) 10Muehlenhoff: librenms: use ldap-ro, stop using ldap-labs, use Hiera [puppet] - 10https://gerrit.wikimedia.org/r/523994 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [09:49:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, see nits inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524545 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [09:49:35] (03CR) 10Ema: [C: 03+2] cache: add role::cache::text_ats [puppet] - 10https://gerrit.wikimedia.org/r/522398 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [09:50:49] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add varnishkafka job to prometheus scrapes [puppet] - 10https://gerrit.wikimedia.org/r/524561 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [09:53:21] 10Operations, 10Traffic, 10Patch-For-Review, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Joe) >>! 
In T97972#5335454, @CDanis wrote: > I think we likely want to revisit this. > > * Right now the `guest` user has access to `/ev... [09:55:19] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17541/" [puppet] - 10https://gerrit.wikimedia.org/r/523994 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [09:57:57] (03PS5) 10Muehlenhoff: librenms: use ldap-ro, stop using ldap-labs, use Hiera [puppet] - 10https://gerrit.wikimedia.org/r/523994 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [10:00:40] (03CR) 10Muehlenhoff: [C: 03+2] librenms: use ldap-ro, stop using ldap-labs, use Hiera [puppet] - 10https://gerrit.wikimedia.org/r/523994 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [10:01:02] (03CR) 10Filippo Giunchedi: puppetmaster: Add the ability to have canary backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [10:09:58] PROBLEM - puppet last run on ldap-eqiad-replica02 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [10:13:45] 10Operations, 10serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki) [10:14:01] (03PS1) 10Fsero: helmfile,deploy: coredns needs api host and port [deployment-charts] - 10https://gerrit.wikimedia.org/r/524737 (https://phabricator.wikimedia.org/T226516) [10:15:06] (03PS1) 10Muehlenhoff: Revert "librenms: use ldap-ro, stop using ldap-labs, use Hiera" [puppet] - 10https://gerrit.wikimedia.org/r/524738 [10:15:34] (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile,deploy: coredns needs api host and port [deployment-charts] - 10https://gerrit.wikimedia.org/r/524737 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero) [10:15:36] RECOVERY - puppet last run on ldap-eqiad-replica02 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [10:15:50] (03PS2) 10Muehlenhoff: Revert "librenms: use ldap-ro, stop using ldap-labs, use Hiera" [puppet] - 10https://gerrit.wikimedia.org/r/524738 [10:16:12] (03CR) 10Fsero: [C: 03+1] profile::docker::engine: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/524731 (owner: 10Muehlenhoff) [10:17:21] (03CR) 10Muehlenhoff: [C: 03+2] Revert "librenms: use ldap-ro, stop using ldap-labs, use Hiera" [puppet] - 10https://gerrit.wikimedia.org/r/524738 (owner: 10Muehlenhoff) [10:18:08] (03CR) 10Fsero: zuul: stop zuul-merger gracefully (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524180 (owner: 10Hashar) [10:18:24] (03CR) 10Fsero: zuul: fix systemd Service/TimeoutStopSec (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524174 (https://phabricator.wikimedia.org/T228381) (owner: 10Hashar) [10:20:07] 10Operations, 10Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664 (10Aklapper) 
@Verdy_p: Please don't just add random comments to tasks without having read the task and having checked the status of the ta... [10:20:57] !log deploy coredns in staging T226516 [10:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:05] T226516: deploy CoreDNS as a in-cluster DNS service - https://phabricator.wikimedia.org/T226516 [10:21:40] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [10:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:07] (03PS2) 10Filippo Giunchedi: puppetmaster: blacklist per-host catalogs metrics [puppet] - 10https://gerrit.wikimedia.org/r/524182 (https://phabricator.wikimedia.org/T228395) [10:22:36] (03PS5) 10Effie Mouzeli: profile::mediawiki::jobrunner: Configure php7_only flag [puppet] - 10https://gerrit.wikimedia.org/r/524336 (https://phabricator.wikimedia.org/T219148) [10:22:52] (03CR) 10Filippo Giunchedi: [C: 03+2] puppetmaster: blacklist per-host catalogs metrics [puppet] - 10https://gerrit.wikimedia.org/r/524182 (https://phabricator.wikimedia.org/T228395) (owner: 10Filippo Giunchedi) [10:22:55] (03PS6) 10Effie Mouzeli: profile::mediawiki::jobrunner: Configure php7_only flag [puppet] - 10https://gerrit.wikimedia.org/r/524336 (https://phabricator.wikimedia.org/T219148) [10:23:44] !log Disable puppet on jobrunners for 524336 - T219148 [10:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:51] T219148: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 [10:27:58] !log Depool and pool mw1300 [10:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:17] (03PS1) 10Muehlenhoff: yarn: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524742 (https://phabricator.wikimedia.org/T227650) [10:30:04] jan_drewniak: How many deployers does it take to do Wikimedia 
Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190722T1030). [10:30:11] (03CR) 10jerkins-bot: [V: 04-1] yarn: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524742 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [10:32:23] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/17543/" [puppet] - 10https://gerrit.wikimedia.org/r/524742 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [10:32:57] (03PS2) 10Muehlenhoff: yarn: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524742 (https://phabricator.wikimedia.org/T227650) [10:37:28] (03CR) 10ArielGlenn: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17544/console noop as expected." [puppet] - 10https://gerrit.wikimedia.org/r/524632 (https://phabricator.wikimedia.org/T227742) (owner: 10ArielGlenn) [10:37:38] (03PS2) 10ArielGlenn: replace all hiera clls with lookup() for dumps generation manifests [puppet] - 10https://gerrit.wikimedia.org/r/524632 (https://phabricator.wikimedia.org/T227742) [10:39:21] (03CR) 10ArielGlenn: [C: 03+2] replace all hiera clls with lookup() for dumps generation manifests [puppet] - 10https://gerrit.wikimedia.org/r/524632 (https://phabricator.wikimedia.org/T227742) (owner: 10ArielGlenn) [10:39:41] 10Operations, 10Traffic, 10Patch-For-Review, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Volans) 05Resolved→03Open >>! In T97972#5352851, @Joe wrote: > IIRC we already have an account specialized for accessing only `mwconf... 
[10:39:43] 10Operations, 10Traffic, 10discovery-system, 10services-tooling: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978 (10Volans) [10:39:50] 10Operations, 10Traffic, 10Patch-For-Review, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Volans) p:05High→03Normal [10:40:51] 10Operations, 10Puppet, 10Patch-For-Review: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (10fgiunchedi) Change deployed, the high cardinality metrics should also be deleted. To do that we'll need to pass `--web.enable-admin-api` to prometheus first to be able to de... [10:43:02] 10Operations, 10serviceops, 10PHP 7.2 support: Don't monitor HHVM on PHP7 only servers - https://phabricator.wikimedia.org/T228643 (10jijiki) [10:43:18] 10Operations, 10serviceops, 10PHP 7.2 support: Don't monitor HHVM on PHP7 only servers - https://phabricator.wikimedia.org/T228643 (10jijiki) [10:43:21] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [10:43:29] 10Operations, 10serviceops, 10PHP 7.2 support: Don't monitor HHVM on PHP7 only servers - https://phabricator.wikimedia.org/T228643 (10jijiki) p:05Triage→03Normal a:03jijiki [10:46:33] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/524076 (owner: 10Ayounsi) [10:49:04] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [10:49:58] (03CR) 10Effie Mouzeli: [C: 03+2] profile::mediawiki::jobrunner: Configure php7_only flag [puppet] - 10https://gerrit.wikimedia.org/r/524336 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli) [10:50:13] (03PS7) 10Effie Mouzeli: 
profile::mediawiki::jobrunner: Configure php7_only flag [puppet] - 10https://gerrit.wikimedia.org/r/524336 (https://phabricator.wikimedia.org/T219148) [10:51:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/523993 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [10:52:10] (03PS1) 10ArielGlenn: replace hiera calls with lookup() for dumps distribution manifests [puppet] - 10https://gerrit.wikimedia.org/r/524745 (https://phabricator.wikimedia.org/T227742) [10:55:42] !log Enable puppet on jobrunners [10:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190722T1100). [11:00:04] awight, MatmaRex, and revi: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] hi hi [11:00:54] (03PS2) 10Revi: Enable wgNamespacesWithSubpages on main NS for kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524685 (https://phabricator.wikimedia.org/T228481) [11:03:03] anyone? 
[11:03:20] I can do swat today [11:03:35] ;) [11:04:20] (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/17545/" [puppet] - 10https://gerrit.wikimedia.org/r/522472 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli) [11:05:19] (03PS2) 10Awight: Enable FileImporter source wiki edit and delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523661 (https://phabricator.wikimedia.org/T225617) [11:05:32] (03CR) 10Awight: [C: 03+2] Enable FileImporter source wiki edit and delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523661 (https://phabricator.wikimedia.org/T225617) (owner: 10Awight) [11:06:03] (03PS1) 10ArielGlenn: ukwiki and viwiki are now configured as 'big wikis' for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/524748 (https://phabricator.wikimedia.org/T228558) [11:06:24] MatmaRex: I'll wait until you're here to deploy the scheduled change. [11:06:31] (03Merged) 10jenkins-bot: Enable FileImporter source wiki edit and delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523661 (https://phabricator.wikimedia.org/T225617) (owner: 10Awight) [11:06:38] (03CR) 10ArielGlenn: "Do not merge until just before the August 20th run." 
[puppet] - 10https://gerrit.wikimedia.org/r/524748 (https://phabricator.wikimedia.org/T228558) (owner: 10ArielGlenn) [11:06:44] (03CR) 10ArielGlenn: [C: 04-1] ukwiki and viwiki are now configured as 'big wikis' for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/524748 (https://phabricator.wikimedia.org/T228558) (owner: 10ArielGlenn) [11:06:46] (03CR) 10jenkins-bot: Enable FileImporter source wiki edit and delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523661 (https://phabricator.wikimedia.org/T225617) (owner: 10Awight) [11:07:07] * revi waits patiently [11:07:53] (03PS3) 10Revi: Enable wgNamespacesWithSubpages on main NS for kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524685 (https://phabricator.wikimedia.org/T228481) [11:07:59] well, awight can I get mine done if he does not arrive? (I don't want to wait for full hour) [11:09:00] (03PS1) 10Muehlenhoff: Superset: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524749 (https://phabricator.wikimedia.org/T227650) [11:09:44] revi: For sure! I'm just deploying one other config change first. 
[11:09:49] kk [11:09:51] (03CR) 10jerkins-bot: [V: 04-1] Superset: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524749 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [11:10:39] (03PS1) 10Fsero: k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) [11:11:15] (03PS2) 10Muehlenhoff: Superset: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524749 (https://phabricator.wikimedia.org/T227650) [11:13:05] !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:523661|Enable FileImporter source wiki edit and delete (T225617, T226532)]] (duration: 00m 56s) [11:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:14] T225617: Add the nowcommons template to the source wiki if requested - https://phabricator.wikimedia.org/T225617 [11:13:14] T226532: If rights exist, suggest automatic deletion of the source wiki file - https://phabricator.wikimedia.org/T226532 [11:13:54] (03PS2) 10Fsero: k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) [11:14:36] !log awight@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: SWAT: [[gerrit:523661|Enable FileImporter source wiki edit and delete, (remove labs customizations) (T225617, T226532)]] (duration: 00m 54s) [11:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:45] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524685 (https://phabricator.wikimedia.org/T228481) (owner: 10Revi) [11:16:43] (03Merged) 10jenkins-bot: Enable wgNamespacesWithSubpages on main NS for kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524685 (https://phabricator.wikimedia.org/T228481) 
(owner: 10Revi) [11:16:58] (03CR) 10jenkins-bot: Enable wgNamespacesWithSubpages on main NS for kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524685 (https://phabricator.wikimedia.org/T228481) (owner: 10Revi) [11:17:32] revi: please test on mwdebug1002, when you can [11:17:37] ACK [11:18:01] awight: +2 from mwdebug1002 [11:18:36] (03PS3) 10Fsero: k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) [11:18:41] great! [11:18:54] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17548/" [puppet] - 10https://gerrit.wikimedia.org/r/524749 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [11:20:33] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:524685|Enable wgNamespacesWithSubpages on main NS for kowikiversity (T228481)]] (duration: 00m 54s) [11:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:41] T228481: Activate subpages link in the main namespace of the Korean Wikiversity - https://phabricator.wikimedia.org/T228481 [11:20:45] (03CR) 10Fsero: "PCC seems happy https://puppet-compiler.wmflabs.org/compiler1002/17549/" [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero) [11:21:03] Done :-) [11:21:13] thanks! [11:21:14] seeya [11:21:18] I'll wait around in case MatmaRex or other colleague wants that patch deployed. 
[11:21:21] o/ [11:22:45] (03PS1) 10Muehlenhoff: turnilo: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) [11:22:52] (03PS4) 10Fsero: k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) [11:23:17] (03CR) 10jerkins-bot: [V: 04-1] turnilo: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [11:28:40] (03PS2) 10Muehlenhoff: turnilo: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) [11:29:10] (03CR) 10jerkins-bot: [V: 04-1] turnilo: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [11:40:36] (03PS3) 10Muehlenhoff: turnilo: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) [11:41:37] (03CR) 10jerkins-bot: [V: 04-1] turnilo: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [11:43:46] (03PS4) 10Muehlenhoff: turnilo: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) [11:49:47] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17550/" [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [11:54:36] (03PS10) 10Jbond: puppetmaster: Add the ability to have canary backends [puppet] - 
10https://gerrit.wikimedia.org/r/524287 [11:55:16] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [11:56:51] (03PS11) 10Jbond: puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 [11:57:29] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [12:04:30] (03PS2) 10Muehlenhoff: debmonitor: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524489 (https://phabricator.wikimedia.org/T227650) [12:07:29] (03CR) 10Muehlenhoff: [C: 03+2] debmonitor: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524489 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:08:14] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [12:13:12] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [12:13:39] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: add PSP for nginx-ingress [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [12:13:57] (03CR) 10Muehlenhoff: icinga: Read LDAP servers from Hiera and switch to read-only replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524540 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:22:15] !log installing debian-archive-keyring Stretch update (SUA 164) [12:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:29] (03PS12) 10Jbond: puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 [12:25:55] (03CR) 10Jbond: [C: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [12:28:22] (03CR) 10Jbond: [C: 03+1] puppetmaster: Add the ability to have canary backends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [12:29:04] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:30:32] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 81078 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:30:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This configuration allows me to effectively run the required nginx-ingress pods. 
Please @bstorm review :-)" [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [12:42:44] (03CR) 10Arturo Borrero Gonzalez: "This is probably not enough:" [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [12:44:01] (03PS5) 10Marostegui: mariadb: Replace dbproxy1006 with dbproxy1014 in m1 [puppet] - 10https://gerrit.wikimedia.org/r/524729 (https://phabricator.wikimedia.org/T202367) [12:46:33] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524749 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:47:12] (03CR) 10Marostegui: [C: 03+2] mariadb: Replace dbproxy1006 with dbproxy1014 in m1 [puppet] - 10https://gerrit.wikimedia.org/r/524729 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [12:47:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/524742 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:48:29] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:51:38] (03PS3) 10Muehlenhoff: icinga: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524540 (https://phabricator.wikimedia.org/T227650) [12:52:24] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [12:52:34] (03PS7) 10Daimona Eaytoy: Rename globals and rights in AbuseFilter config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480074 [12:53:04] (03CR) 10Muehlenhoff: [C: 03+2] icinga: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524540 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:53:51] (03CR) 
10Alexandros Kosiaris: [C: 04-1] "Minor comment inline, but otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero) [12:54:04] (03CR) 10Daimona Eaytoy: "> Can you rebase please?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480074 (owner: 10Daimona Eaytoy) [12:54:52] (03PS13) 10Jbond: puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 [12:55:39] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [12:57:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] k8s: introducing cluster-dns flag for coredns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero) [12:58:29] !log Stop MySQL on db1117:3321 to test dbproxy1014 (replacement for dbproxy1006) on m1 - T202367 [12:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:37] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [12:58:44] Going to update cxserver. Ping me if anything is going on that might affect it.
[12:58:50] (03PS5) 10Fsero: k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) [12:59:03] (03PS14) 10Jbond: puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 [12:59:08] (03CR) 10Jbond: "Thanks, updated and ready for review" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [12:59:13] (03PS5) 10Elukey: turnilo: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [12:59:56] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [13:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:03] !log kartik@deploy1001 scap-helm cxserver cluster staging completed [13:00:03] !log kartik@deploy1001 scap-helm cxserver finished [13:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:17] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [13:02:32] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw] [13:02:34] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed [13:02:34] !log kartik@deploy1001 scap-helm cxserver finished [13:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:47] (03CR) 10Elukey: [C: 03+2] turnilo: Read LDAP servers 
from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524752 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [13:04:13] (03PS6) 10Fsero: k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) [13:05:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two nits inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [13:05:51] (03PS3) 10Elukey: yarn: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524742 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [13:06:21] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade puppet master servers - https://phabricator.wikimedia.org/T227587 (10jbond) puppetmaster1003 has now been built and is running puppet 5.5 [13:06:58] ACKNOWLEDGEMENT - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy [13:07:19] (03PS3) 10Muehlenhoff: grafana: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524519 (https://phabricator.wikimedia.org/T227650) [13:07:41] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad] [13:07:42] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed [13:07:42] !log kartik@deploy1001 scap-helm cxserver finished [13:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:03] (03CR) 10Elukey: [C: 03+2] yarn: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 
10https://gerrit.wikimedia.org/r/524742 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [13:09:25] (03PS3) 10Elukey: Superset: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524749 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [13:10:34] (03CR) 10Elukey: [C: 03+2] Superset: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524749 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [13:11:25] PROBLEM - haproxy failover on dbproxy1006 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [13:11:30] ^ that is me [13:11:39] same for dbproxy1001 which will arrive in a bit [13:12:21] !log Updated cxserver to 2019-07-17-074415-production (T227553, T216812) [13:12:23] PROBLEM - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [13:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:29] T227553: Test failure: TypeError: Cannot read property 'item' of undefined - https://phabricator.wikimedia.org/T227553 [13:12:29] T216812: CX2: Template appears as regular text in the source document, propagating as HTML into the translation - https://phabricator.wikimedia.org/T216812 [13:12:55] RECOVERY - haproxy failover on dbproxy1006 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [13:13:31] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [13:13:45] (03PS15) 10Jbond: puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 [13:13:55] RECOVERY - haproxy failover on dbproxy1001 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [13:14:13] (03CR) 
10Jbond: puppetmaster: Add the ability to have canary backends (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [13:15:25] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero) [13:16:20] (03PS7) 10Fsero: k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) [13:16:26] (03CR) 10Marostegui: [C: 03+1] tendril: use ldap-ro, use Hiera, refactor to profile [puppet] - 10https://gerrit.wikimedia.org/r/523993 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [13:17:43] (03PS4) 10Muehlenhoff: grafana: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524519 (https://phabricator.wikimedia.org/T227650) [13:19:40] (03CR) 10Fsero: [C: 03+2] "PCC is happy https://puppet-compiler.wmflabs.org/compiler1001/17557/" [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero) [13:19:53] (03PS8) 10Fsero: k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) [13:20:13] (03PS2) 10Jhedden: toolschecker: check for existing webservice [puppet] - 10https://gerrit.wikimedia.org/r/524610 (https://phabricator.wikimedia.org/T221301) [13:21:11] (03CR) 10Muehlenhoff: [C: 03+2] grafana: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524519 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [13:21:58] (03PS9) 10Fsero: k8s: introducing cluster-dns flag for coredns [puppet] - 10https://gerrit.wikimedia.org/r/524750 (https://phabricator.wikimedia.org/T226516) [13:31:43] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1: -Code-Review" [puppet] - 
10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [13:36:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "toolforge is still on jessie and does use that class." [puppet] - 10https://gerrit.wikimedia.org/r/524731 (owner: 10Muehlenhoff) [13:37:49] (03PS1) 10Elukey: profile::hadoop::master/standby: add CMS old gen monitors [puppet] - 10https://gerrit.wikimedia.org/r/524783 (https://phabricator.wikimedia.org/T228620) [13:39:33] (03CR) 10Alexandros Kosiaris: [C: 03+1] nrpe/icinga: make notes_url a required parameter of nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [13:39:52] (03PS1) 10Ladsgroup: labs: Set Commons to read the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524787 (https://phabricator.wikimedia.org/T226008) [13:40:22] (03CR) 10Ladsgroup: [C: 03+2] labs: Set Commons to read the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524787 (https://phabricator.wikimedia.org/T226008) (owner: 10Ladsgroup) [13:41:22] (03Merged) 10jenkins-bot: labs: Set Commons to read the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524787 (https://phabricator.wikimedia.org/T226008) (owner: 10Ladsgroup) [13:41:39] (03CR) 10jenkins-bot: labs: Set Commons to read the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524787 (https://phabricator.wikimedia.org/T226008) (owner: 10Ladsgroup) [13:42:39] (03PS2) 10Elukey: profile::hadoop::master/standby: add CMS old gen monitors [puppet] - 10https://gerrit.wikimedia.org/r/524783 (https://phabricator.wikimedia.org/T228620) [13:43:48] My patch in beta cluster is rebased on deploy1001 [13:49:04] (03PS1) 10Elukey: role::analytics_cluster::superset: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/524788 (https://phabricator.wikimedia.org/T227860) [13:49:19] RECOVERY - Device not healthy -SMART- on helium 
is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=helium&var-datasource=eqiad+prometheus/ops [13:53:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, although static-rt appears currently non-functional, probably still WIP." [puppet] - 10https://gerrit.wikimedia.org/r/523992 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [13:54:34] (03PS1) 10Ema: ATS: split the cache for beta variant of the mobile site [puppet] - 10https://gerrit.wikimedia.org/r/524789 (https://phabricator.wikimedia.org/T227432) [13:55:13] (03CR) 10Ottomata: "Just curious...why? :)" [puppet] - 10https://gerrit.wikimedia.org/r/524531 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [13:55:25] (03CR) 10jerkins-bot: [V: 04-1] ATS: split the cache for beta variant of the mobile site [puppet] - 10https://gerrit.wikimedia.org/r/524789 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:56:04] (03CR) 10Ottomata: "Oh I see why, so you can include the health_check profile. hm." [puppet] - 10https://gerrit.wikimedia.org/r/524531 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [14:00:08] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17560/" [puppet] - 10https://gerrit.wikimedia.org/r/523991 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [14:00:26] (03PS5) 10Muehlenhoff: microsites/transparency: use ldap-ro, stop using ldap-labs, use Hiera [puppet] - 10https://gerrit.wikimedia.org/r/523991 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [14:00:38] (03CR) 10Elukey: [C: 03+2] "> Oh I see why, so you can include the health_check profile. hm." 
[puppet] - 10https://gerrit.wikimedia.org/r/524531 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [14:01:15] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17559/" [puppet] - 10https://gerrit.wikimedia.org/r/524788 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [14:02:07] (03PS6) 10Muehlenhoff: microsites/transparency: use ldap-ro, stop using ldap-labs, use Hiera [puppet] - 10https://gerrit.wikimedia.org/r/523991 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [14:03:55] (03CR) 10Muehlenhoff: [C: 03+2] microsites/transparency: use ldap-ro, stop using ldap-labs, use Hiera [puppet] - 10https://gerrit.wikimedia.org/r/523991 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [14:09:29] PROBLEM - Check systemd state on analytics-tool1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:09:54] me --^ [14:10:31] PROBLEM - puppet last run on analytics-tool1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Service[nginx] https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:12:09] (03PS1) 10Muehlenhoff: profile::webperf::xhgui: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650) [14:12:35] (03CR) 10jerkins-bot: [V: 04-1] profile::webperf::xhgui: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [14:17:15] (03PS2) 10Muehlenhoff: profile::webperf::xhgui: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650) [14:18:02] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi: TEC6: Logging infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220103 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving as Q is over, things TODO are tracked in subtasks [14:18:06] (03CR) 10jerkins-bot: [V: 04-1] profile::webperf::xhgui: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [14:19:46] 10Operations, 10Wikimedia-Logstash, 10User-herron: logstash1012 lock up caused central logging stuck - https://phabricator.wikimedia.org/T220500 (10fgiunchedi) 05Open→03Invalid Likely the same bug/condition we've are seeing in {T199406} [14:20:34] jouncebot: now [14:20:34] No deployments scheduled for the next 2 hour(s) and 39 minute(s) [14:20:36] jouncebot: next [14:20:36] In 2 hour(s) and 39 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190722T1700) [14:20:41] (03PS3) 10Muehlenhoff: xhgui: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 
10https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650) [14:21:13] liw: i don't know the rules, but sounds like we can promote now [14:21:24] hashar, ack [14:21:28] hashar, thanks [14:21:47] I'm going to be promoting group2 to wmf.14, which was rolled back last week due to problems [14:25:51] (03PS1) 10Lars Wirzenius: all wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524794 [14:25:53] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524794 (owner: 10Lars Wirzenius) [14:26:06] PROBLEM - puppet last run on mc1024 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:26:07] (03CR) 10Hashar: [C: 03+1] all wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524794 (owner: 10Lars Wirzenius) [14:26:14] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17561/" [puppet] - 10https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [14:26:52] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524794 (owner: 10Lars Wirzenius) [14:27:07] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524794 (owner: 10Lars Wirzenius) [14:28:36] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.14 [14:28:37] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Cmjohnson) Received your request to move these servers. Please expect this to be completed NLT than 16 August 2019. 
[14:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:17] (03CR) 10Muehlenhoff: "Given that thus is going away, we could either set a puppet-line override or simply replace labs-ldap with the static name of the new RO c" [puppet] - 10https://gerrit.wikimedia.org/r/523995 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [14:30:21] !log deploying refactored eventgate chart using eventgate-wikimedia image to eventgate-* services - T226668 [14:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:29] T226668: Factor out eventgate-wikimedia factory into its own gerrit repo and use it for deployment pipeline - https://phabricator.wikimedia.org/T226668 [14:30:57] !log otto@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [14:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:56] PROBLEM - HHVM rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:32:06] liw: then if there are no more timeout, I guess we have a confirmation that https://phabricator.wikimedia.org/T228436 got properly fixed [14:32:32] there was also mention of a useless warning/trace message AND apparently some cache backend issue at the time [14:32:45] so lot of moving grounds :) [14:33:10] RECOVERY - HHVM rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 81075 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:33:38] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Cmjohnson) @elukey I received your task about the failed disk. The server is under warranty. I will find time this week to submit a ticket with HPE. This is a busy week with PDU moves. 
[14:34:18] (03PS2) 10Ema: ATS: split the cache for beta variant of the mobile site [puppet] - 10https://gerrit.wikimedia.org/r/524789 (https://phabricator.wikimedia.org/T227432) [14:34:20] (03Abandoned) 10CRusnov: Add sshguard to base module. [puppet] - 10https://gerrit.wikimedia.org/r/498231 (https://phabricator.wikimedia.org/T218956) (owner: 10CRusnov) [14:36:40] 10Operations, 10ops-eqiad, 10DC-Ops: Reallocate dbproxy1020 and dbproxy1021 from row D to row C - https://phabricator.wikimedia.org/T228618 (10Marostegui) @Cmjohnson lovely - thank you! [14:36:59] (03PS1) 10Ottomata: Use eventgate-wikimedia version for eventgate-{main,analytics} codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/524795 (https://phabricator.wikimedia.org/T226668) [14:37:21] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use eventgate-wikimedia version for eventgate-{main,analytics} codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/524795 (https://phabricator.wikimedia.org/T226668) (owner: 10Ottomata) [14:38:45] !log otto@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [14:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:55] 10Operations, 10ops-eqiad: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T228577 (10Cmjohnson) a:03Ottomata @Ottomata Please use the new hardware run book template. https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ and assign the task to me [14:41:55] 10Operations, 10ops-eqiad: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T228577 (10elukey) a:05Ottomata→03elukey @Cmjohnson this is a OOW node part of the Hadoop testing cluster, no need to replace the disk.. will try to silence this alarm :) [14:43:25] 10Operations, 10Traffic: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10BBlack) 05Open→03Resolved These have been in-service for a while now, closing! 
[14:44:00] !log otto@ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [14:44:05] (03CR) 10Nuria: profile::hadoop::master/standby: add CMS old gen monitors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524783 (https://phabricator.wikimedia.org/T228620) (owner: 10Elukey) [14:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:32] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) Please note I've chatted with @fgiunchedi about ms-be systems, and the preferred method of dealing with them in any rack in which we are doing PDU swap... [14:46:09] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10BBlack) cp1076 - Can depool ahead of work and repool later, with the local commands "depool" and "repool" lvs100[123] - Not in use and should be decommed, but this ticket made me realize we haven't made a... [14:48:03] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10RobH) >>! In T226778#5354000, @RobH wrote: > Please note I've chatted with @fgiunchedi about ms-be systems, and the preferred method of dealing with them in any rack in which we are doing PDU swaps is to... [14:48:59] 10Operations, 10ops-eqiad: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T228577 (10Cmjohnson) 05Open→03Declined [14:50:41] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10fgiunchedi) >>! In T227140#5354007, @RobH wrote: >>>! In T226778#5354000, @RobH wrote: >> Please note I've chatted with @fgiunchedi about ms-be systems, and the preferred method of dealing with them in an... 
[14:52:21] 10Operations, 10DC-Ops, 10Traffic, 10decommission: Decommission lvs100[123456] - https://phabricator.wikimedia.org/T228671 (10BBlack) [14:52:57] (03PS1) 10Fsero: k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) [14:53:01] !log otto@ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . [14:53:10] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10Marostegui) db1081 and db1075 are primary masters, so if we are not fully sure no power will be lost, I rather do other racks first Racks on row A that are good to go: A3: has two active dbproxies I cou... [14:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:22] (03CR) 10jerkins-bot: [V: 04-1] k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [14:53:54] RECOVERY - puppet last run on mc1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [14:54:14] (03CR) 10Muehlenhoff: [C: 03+1] puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [14:54:44] 10Operations, 10DC-Ops, 10Traffic, 10decommission: Decommission lvs100[123456] - https://phabricator.wikimedia.org/T228671 (10BBlack) [14:54:48] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:55:59] !log otto@ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . 
[14:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:26] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:57:12] liw: looks like it is all fine [14:57:40] (03CR) 10Bstorm: toolforge: k8s: add PSP for nginx-ingress (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [14:57:47] (03PS2) 10Fsero: k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) [14:58:10] (03CR) 10jerkins-bot: [V: 04-1] k8s: introducing termbox-test.staging.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/524797 (https://phabricator.wikimedia.org/T226814) (owner: 10Fsero) [14:58:58] (03PS1) 10Muehlenhoff: librenms: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524798 (https://phabricator.wikimedia.org/T227650) [15:00:28] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [15:01:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/524791 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [15:01:43] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10Marostegui) [15:01:44] hashar, hoping it is, but will keep an eye on the dashboard [15:01:54] PROBLEM - HHVM rendering on mw2196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:01:54] PROBLEM - HHVM rendering on mw2226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [15:02:46] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10Marostegui) [15:03:24] RECOVERY - HHVM rendering on mw2196 is OK: HTTP OK: HTTP/1.1 200 OK - 81174 bytes in 0.323 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:03:24] RECOVERY - HHVM rendering on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 81174 bytes in 0.324 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:07:54] 10Operations, 10observability: Icinga custom checks should follow our HTTP User-Agent policy - https://phabricator.wikimedia.org/T226508 (10fgiunchedi) >>! In T226508#5300536, @jbond wrote: > If anyone can review ttps://gerrit.wikimedia.org/r/c/operations/puppet/+/519227 and https://gerrit.wikimedia.org/r/c/o... [15:08:29] (03CR) 10ArielGlenn: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17563/console noop." [puppet] - 10https://gerrit.wikimedia.org/r/524745 (https://phabricator.wikimedia.org/T227742) (owner: 10ArielGlenn) [15:08:35] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) [15:08:58] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) OO, when we reimage these, let's use Buster! 
:) [15:11:58] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10Pchelolo) [15:12:19] (03PS1) 10Jhedden: dumps distribution: switch dumps to labstore1006 [dns] - 10https://gerrit.wikimedia.org/r/524802 (https://phabricator.wikimedia.org/T224228) [15:12:52] PROBLEM - Check the Netbox report-s- librenms for fail status. on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:13:05] (03PS1) 10Bstorm: toolforge: put up a default psp to unblock other work [puppet] - 10https://gerrit.wikimedia.org/r/524803 (https://phabricator.wikimedia.org/T227290) [15:14:02] 10Operations, 10observability: Icinga custom checks should follow our HTTP User-Agent policy - https://phabricator.wikimedia.org/T226508 (10jbond) 05Open→03Resolved a:03jbond yes, closing thanks [15:14:49] marostegui: can you change the clinic person please? :D [15:15:01] it's herro.n [15:15:22] (03CR) 10Jbond: [C: 03+2] puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [15:15:30] (03PS16) 10Jbond: puppetmaster: Add the ability to have canary backends [puppet] - 10https://gerrit.wikimedia.org/r/524287 [15:15:33] 10Operations, 10observability: Upgrade grafana to 6.x - https://phabricator.wikimedia.org/T220838 (10CDanis) p:05Triage→03Low so far work on dbctl has taken priority but I should have time for this in another few weeks.
[15:15:57] jijiki: done [15:16:43] :D [15:16:58] (03PS2) 10Bstorm: toolforge: put up a default psp to unblock other work [puppet] - 10https://gerrit.wikimedia.org/r/524803 (https://phabricator.wikimedia.org/T227290) [15:17:26] (03PS1) 10Jhedden: dumps dist: switch active VPS to labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228) [15:17:44] !log Disable puppet on jobrunners to enable php7_only [15:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:38] PROBLEM - puppet last run on ganeti1008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [15:22:31] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] jobrunners: Test php7_only on 6 jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/522472 (https://phabricator.wikimedia.org/T219148) (owner: 10Effie Mouzeli) [15:22:42] (03PS3) 10Effie Mouzeli: jobrunners: Test php7_only on 6 jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/522472 (https://phabricator.wikimedia.org/T219148) [15:22:52] (03PS1) 10Marostegui: wmnet: Failover dbproxy1001 to dbproxy1006 [dns] - 10https://gerrit.wikimedia.org/r/524805 (https://phabricator.wikimedia.org/T227138) [15:23:11] 10Operations, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10akosiaris) p:05Triage→03Normal [15:24:08] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.14/includes/export/XmlDumpWriter.php: T228614 XmlDumpWriter: don't load revision text content unless requested to (duration: 00m 48s) [15:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:15] T228614: stubs dumps broken for wikidatawiki with old revision for an entity redirecting to self; content read for every revision in 
stubs! - https://phabricator.wikimedia.org/T228614 [15:27:55] !log Depool mw1300 and pool back [15:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:49] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10WDoranWMF) [15:34:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: put up a default psp to unblock other work [puppet] - 10https://gerrit.wikimedia.org/r/524803 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [15:35:12] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.14/extensions/Wikibase/lib/WikibaseLib.php: T227814 Wikibase: Define $wgMessagesDirs in WikibaseLib PHP entry point (duration: 00m 48s) [15:35:17] (03PS3) 10Bstorm: toolforge: put up a default psp to unblock other work [puppet] - 10https://gerrit.wikimedia.org/r/524803 (https://phabricator.wikimedia.org/T227290) [15:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:19] T227814: [Regression wmf.13] Wikidata localisation is broken - https://phabricator.wikimedia.org/T227814 [15:35:30] PROBLEM - Check the Netbox report-s- librenms for fail status. 
on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:35:52] (03PS3) 10Elukey: profile::hadoop::master/standby: add CMS old gen monitors [puppet] - 10https://gerrit.wikimedia.org/r/524783 (https://phabricator.wikimedia.org/T228620) [15:36:15] (03CR) 10Bstorm: [C: 03+2] toolforge: put up a default psp to unblock other work [puppet] - 10https://gerrit.wikimedia.org/r/524803 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [15:38:08] !log Stop mysql and power off pc2010 for on-site maintenance - T227552 [15:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:15] T227552: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 [15:38:24] (03CR) 10Andrew Bogott: [C: 03+1] "This looks fine but please hold off until we're out of the migration window" [puppet] - 10https://gerrit.wikimedia.org/r/524287 (owner: 10Jbond) [15:43:21] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Ottomata) [15:43:25] 10Puppet: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10MarcoAurelio) [15:43:32] (03PS1) 10Bstorm: tooforge: actually place the default-psp file on the master server [puppet] - 10https://gerrit.wikimedia.org/r/524809 (https://phabricator.wikimedia.org/T228500) [15:43:54] (03CR) 10Bstorm: [C: 03+1] dumps distribution: switch dumps to labstore1006 [dns] - 10https://gerrit.wikimedia.org/r/524802 (https://phabricator.wikimedia.org/T224228) (owner: 10Jhedden) [15:44:44] (03CR) 10Bstorm: [C: 03+1] "Also don't forget the do_acme value on the host hiera" [dns] - 10https://gerrit.wikimedia.org/r/524802 (https://phabricator.wikimedia.org/T224228) (owner: 10Jhedden) [15:45:54] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Jobs not being executed 
on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10CDanis) 05Resolved→03Open >>! In T226109#5279992, @CDanis wrote: > Am I alone in feeling like this probably deserves an [[ https://wikitech.wikimed... [15:45:56] (03CR) 10Bstorm: [C: 03+1] "when you merge and deploy this, toggle the value on /Users/bstorm/src/wmf/puppet/hieradata/hosts/labstore1006.yaml and /Users/bstorm/src/w" [dns] - 10https://gerrit.wikimedia.org/r/524802 (https://phabricator.wikimedia.org/T224228) (owner: 10Jhedden) [15:46:32] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Write incident report for jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10CDanis) p:05Unbreak!→03Normal [15:46:38] (03CR) 10Bstorm: [C: 03+1] dumps dist: switch active VPS to labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228) (owner: 10Jhedden) [15:46:54] RECOVERY - puppet last run on ganeti1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [15:48:32] (03PS2) 10Jhedden: dumps dist: switch active VPS to labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228) [15:48:35] (03PS1) 10Muehlenhoff: maintain_dbusers: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524810 (https://phabricator.wikimedia.org/T227650) [15:49:15] (03CR) 10jerkins-bot: [V: 04-1] maintain_dbusers: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524810 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [15:49:51] !log Rolling depool and pool of mw1293, mw1294, mw1295, mw1296, mw1299 - T219148 [15:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:58] T219148: Use PHP7 to run all async jobs - 
https://phabricator.wikimedia.org/T219148 [15:50:57] (03PS3) 10Jhedden: dumps dist: switch active VPS to labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228) [15:51:10] PROBLEM - Host pc2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:51:21] ^ expected [15:52:46] (03PS2) 10Muehlenhoff: maintain_dbusers: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524810 (https://phabricator.wikimedia.org/T227650) [15:54:30] (03PS4) 10Elukey: profile::hadoop::master/standby: add CMS old gen monitors [puppet] - 10https://gerrit.wikimedia.org/r/524783 (https://phabricator.wikimedia.org/T228620) [15:55:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] tooforge: actually place the default-psp file on the master server [puppet] - 10https://gerrit.wikimedia.org/r/524809 (https://phabricator.wikimedia.org/T228500) (owner: 10Bstorm) [15:56:05] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/17564/" [puppet] - 10https://gerrit.wikimedia.org/r/524810 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [15:56:54] RECOVERY - Host pc2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.00 ms [15:57:18] (03CR) 10Bstorm: [C: 03+1] dumps dist: switch active VPS to labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/524804 (https://phabricator.wikimedia.org/T224228) (owner: 10Jhedden) [15:57:22] (03CR) 10Elukey: [C: 03+2] profile::hadoop::master/standby: add CMS old gen monitors [puppet] - 10https://gerrit.wikimedia.org/r/524783 (https://phabricator.wikimedia.org/T228620) (owner: 10Elukey) [15:59:08] (03PS2) 10Bstorm: tooforge: actually place the default-psp file on the master server [puppet] - 10https://gerrit.wikimedia.org/r/524809 (https://phabricator.wikimedia.org/T228500) [16:01:04] Krinkle: I'd like your input on https://phabricator.wikimedia.org/T224491 if you can; I hunted for but did not find an error relating to the | 
corruption in logstash for that period... [16:01:25] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/17565/" [puppet] - 10https://gerrit.wikimedia.org/r/524798 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [16:01:37] the other thing, what was done to get around the problem? was something restarted? because I couldn't find anything in SAL... [16:02:59] (03PS5) 10Thcipriani: blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) [16:03:01] (03PS3) 10Thcipriani: Blubberoid: enable policy, bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/522561 [16:03:04] (03PS1) 10Thcipriani: blubberoid: bump helm chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/524813 [16:04:44] apergos: I cleared opcache manually, as we always do for this kind of issue. [16:04:53] I thought I !log'ed that but apparently not [16:04:54] sorry about that [16:04:58] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (10fdans) a:03Nuria [16:05:02] Getting logstash urls now, will add to the task [16:05:06] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (10fdans) p:05Triage→03High [16:05:19] thanks!
and yeah if you say what you restarted there, that would be great too [16:06:07] apergos: php7adm /opcache-free [16:06:11] the standard thing we do for opcache corruption [16:06:14] right [16:06:15] Immediately after that it went away [16:06:35] good to know [16:08:18] 10Operations, 10Analytics, 10Traffic: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Milimetric) @faidon: we don't have any updaters on our end, we just move the databases around and keep backups for historical use. But let us know if you run into... [16:10:08] 10Operations, 10Traffic: Implement GeoDNS smooth repooling in gdnsd - https://phabricator.wikimedia.org/T228678 (10BBlack) p:05Triage→03Normal [16:12:56] 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Papaul) a:05Papaul→03Marostegui DIMM replaced and Firmware upgrade Before Operating System Version Service Tag DHPR0S2 Asset Tag DHR0S2 Express Service Code 29369346482 BIOS Version 1.5.4 Lif... [16:13:00] Krinkle: IIUC we use /usr/local/sbin/restart-php7.2-fpm that depools/restart/pools the mw server [16:14:12] (so a lot more than a simple opcache clean but safer.. this is my understanding, could be wrong :) [16:16:08] elukey: sounds good, first time I see that though. [16:17:25] Krinkle: I found it in https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#PHP7_opcache_health (needs sudo -i) [16:17:34] (03CR) 10Jhedden: [C: 03+1] maintain_dbusers: Read LDAP servers from Hiera and switch to read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/524810 (https://phabricator.wikimedia.org/T227650) (owner: 10Muehlenhoff) [16:20:27] 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui) Thanks Papaul. Memory looking good. ` root@pc2010:~# free -m total used free shared buff/cache available Mem: 257392 674 256516... 
[16:21:23] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 3 others: Write incident report for jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10jijiki) [16:25:48] 10Operations, 10Traffic: Implement GeoDNS smooth repooling in gdnsd - https://phabricator.wikimedia.org/T228678 (10BBlack) [16:25:50] 10Operations, 10Traffic: implement better failure-scenario geoip mapping in gdnsd - https://phabricator.wikimedia.org/T94697 (10BBlack) [16:26:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] tooforge: actually place the default-psp file on the master server [puppet] - 10https://gerrit.wikimedia.org/r/524809 (https://phabricator.wikimedia.org/T228500) (owner: 10Bstorm) [16:26:46] 10Operations, 10Beta-Cluster-Infrastructure, 10Mail, 10MediaWiki-Email, 10Release-Engineering-Team (Other / Uncategorized): [betacluster] Cannot confirm email address - confirmation never received - https://phabricator.wikimedia.org/T227714 (10JTannerWMF) p:05Triage→03High [16:31:03] (03PS2) 10Jhedden: dumps distribution: switch dumps to labstore1006 [dns] - 10https://gerrit.wikimedia.org/r/524802 (https://phabricator.wikimedia.org/T224228) [16:31:18] (03CR) 10Jhedden: [C: 03+2] dumps distribution: switch dumps to labstore1006 [dns] - 10https://gerrit.wikimedia.org/r/524802 (https://phabricator.wikimedia.org/T224228) (owner: 10Jhedden) [16:31:31] (03CR) 10Jhedden: [V: 03+2 C: 03+2] dumps distribution: switch dumps to labstore1006 [dns] - 10https://gerrit.wikimedia.org/r/524802 (https://phabricator.wikimedia.org/T224228) (owner: 10Jhedden) [16:33:42] 10Operations, 10Traffic: Implement GeoDNS smooth repooling in gdnsd - https://phabricator.wikimedia.org/T228678 (10BBlack) [16:36:00] !log redirecting dumps.wikimedia.org dns to labstore1006 T224228 [16:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:24] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during 
deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) Logstash query for the error in question: (03PS1) 10Tarrow: Add termbox-test release [deployment-charts] - 10https://gerrit.wikimedia.org/r/524817 (https://phabricator.wikimedia.org/T226814) [16:37:58] (03CR) 10Tarrow: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/524817 (https://phabricator.wikimedia.org/T226814) (owner: 10Tarrow) [16:39:52] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10elukey) [16:44:20] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10ArielGlenn) From IRC conversation, Krinkle says he ran php7adm /opcache-free and the problem immediately... [16:45:36] (03CR) 10Jforrester: [C: 03+2] extension-list: Load Collection via extension.json directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524556 (owner: 10Jforrester) [16:46:21] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (10Nuria) @Varnent Deb needs to sign an NDA; someone will verify that it has been done, and she can be added to the group that has access to this...
[16:46:34] (03Merged) 10jenkins-bot: extension-list: Load Collection via extension.json directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524556 (owner: 10Jforrester) [16:46:50] (03CR) 10jenkins-bot: extension-list: Load Collection via extension.json directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524556 (owner: 10Jforrester) [16:47:25] 10Operations, 10Beta-Cluster-Infrastructure, 10Mail, 10MediaWiki-Email, 10Release-Engineering-Team (Other / Uncategorized): [betacluster] Cannot confirm email address - confirmation never received - https://phabricator.wikimedia.org/T227714 (10herron) @greg sure, I'm back today from being out of the offi... [16:47:37] (03PS2) 10Jforrester: Load Collection from extension.json directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524557 (https://phabricator.wikimedia.org/T87899) [16:47:46] (03CR) 10Jforrester: [C: 03+2] Load Collection from extension.json directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524557 (https://phabricator.wikimedia.org/T87899) (owner: 10Jforrester) [16:48:06] !log jforrester@deploy1001 Synchronized wmf-config/extension-list: Load Collection i18n via extension.json directly (duration: 00m 47s) [16:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:21] 10Operations, 10TechCom-RFC, 10Traffic, 10Core Platform Team Backlog (Designing), 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906 (10daniel) @EvanProdromou Are you... 
[16:48:44] PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:49:25] (03Merged) 10jenkins-bot: Load Collection from extension.json directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524557 (https://phabricator.wikimedia.org/T87899) (owner: 10Jforrester) [16:49:41] (03CR) 10jenkins-bot: Load Collection from extension.json directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524557 (https://phabricator.wikimedia.org/T87899) (owner: 10Jforrester) [16:51:04] PROBLEM - Host cloudvirt1015 is DOWN: PING CRITICAL - Packet loss = 100% [16:51:29] ouch [16:51:30] expected? [16:51:40] I don't think so [16:51:42] it's on b2 [16:51:47] so I don't think it's part of the power work? [16:52:01] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (10Varnent) @Nuria - the NDA was a part of her onboarding - so she should be all set. :) [16:52:07] probably, but it's running like 52 VMs [16:52:22] so, we didn't schedule downtime for it or something [16:52:50] cloudvirt1015 is up and running [16:52:55] physically [16:52:58] paged [16:53:35] (03CR) 10Volans: "Sorry for the late reply, it skipped my queue" (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [16:53:59] cmjohnson1: I can't ssh to it [16:54:05] (for the cloudvirt host I mean) [16:54:58] oh, it seems the kernel crashed [16:55:25] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T87899 Use wfLoadExtension for Collection rather than deprecated entry point (duration: 00m 47s) [16:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:35] T87899: Convert Collection to use extension registration - https://phabricator.wikimedia.org/T87899 [16:56:46] cmjohnson1: it seems is T220853 again :-( [16:56:47] T220853: VMs 
on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 [16:56:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10aborrero) 05Resolved→03Open The server just died again. I found this in the mgmt console: `lines=10 [4576846.406213]... [16:57:02] ACKNOWLEDGEMENT - SSH on cloudvirt1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds CDanis arturo investigating T220853 https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:57:48] thanks cdanis [16:57:56] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) >>! In T227138#5354060, @Marostegui wrote: > db1081 and db1075 are primary masters, so if we are not fully sure no power will be lost, I rather do othe... [16:58:00] np [17:00:04] gehel and onimisionipe: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190722T1700). [17:02:19] !log enable puppet on all jobrunners [17:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:37] (03CR) 10Bstorm: [C: 03+1] "LGTM. s/hiera/lookup/g? :-D" [puppet] - 10https://gerrit.wikimedia.org/r/524745 (https://phabricator.wikimedia.org/T227742) (owner: 10ArielGlenn) [17:02:41] !log nuria@deploy1001 Started deploy [analytics/refinery@d889893]: deploying refinery jar bump forwebrequest/load jobs [17:02:43] RECOVERY - Host cloudvirt1015 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:02:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) This seems to only fail when under load. 
I've thought it was 'fixed about four times only to have it crash and ca... [17:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:16] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [17:08:29] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [17:17:33] !log nuria@deploy1001 Finished deploy [analytics/refinery@d889893]: deploying refinery jar bump forwebrequest/load jobs (duration: 14m 51s) [17:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:44] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:17:52] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:19:22] 10Operations, 10Beta-Cluster-Infrastructure, 10Mail, 10MediaWiki-Email, 10Release-Engineering-Team (Other / Uncategorized): [betacluster] Cannot confirm email address - confirmation never received - https://phabricator.wikimedia.org/T227714 (10herron) I'm not having luck reproducing this with my own non-... [17:19:32] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:19:36] (03PS3) 10Jbond: lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) [17:19:51] (03CR) 10jerkins-bot: [V: 04-1] lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [17:20:01] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10elukey) Just depooled aqs1004 [17:21:04] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:22:55] !log depooling kafka1001 for PDU work T227140 [17:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:03] T227140: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 [17:24:12] (03PS4) 10Jbond: lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) [17:25:49] (03CR) 10Jbond: "Thanks :)" (032 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [17:30:08] 10Operations, 10ops-codfw, 10netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (10Papaul) [17:35:45] !log depool scb1001 for PDU work T227140 [17:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:52] T227140: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 [17:37:46] PROBLEM - Host restbase1007.mgmt is DOWN: PING 
CRITICAL - Packet loss = 100% [17:38:18] this is due to the a4 work [17:38:30] and 1007 is being decommed IIRC [17:38:32] PROBLEM - Host maps1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:38:46] PROBLEM - Host labstore1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:38:46] PROBLEM - Host db1111.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:38:50] PROBLEM - Host druid1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:39:12] all a4 related [17:39:54] RECOVERY - Host druid1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [17:42:25] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [17:43:28] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1003 is CRITICAL: 297 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [17:43:32] RECOVERY - Host restbase1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [17:43:41] herron: --^ [17:44:12] k [17:44:18] RECOVERY - Host maps1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [17:44:32] RECOVERY - Host db1111.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [17:44:32] RECOVERY - Host labstore1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [17:44:48] PROBLEM - Host kubestage1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:08] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1002 is CRITICAL: 375 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [17:45:20] elukey: bah I didn’t downtime those two alerts [17:45:28] but yeah kafka1001 down so under replicated, no? [17:45:47] yes I agree but it wasn't supposed to happen in theory [17:46:10] PROBLEM - Host netmon1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:17] lovely [17:46:30] what's up? [17:46:40] pdu maintenance ongoing [17:46:51] oic [17:47:12] (03CR) 10Cwhite: [C: 03+2] prometheus: add varnishkafka job to prometheus scrapes [puppet] - 10https://gerrit.wikimedia.org/r/524561 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:47:22] (03PS2) 10Cwhite: prometheus: add varnishkafka job to prometheus scrapes [puppet] - 10https://gerrit.wikimedia.org/r/524561 (https://phabricator.wikimedia.org/T196066) [17:47:33] ok, mgmt is expected as it's single pdu [17:47:38] chaomodus: here’s the parent task https://phabricator.wikimedia.org/T226778 [17:47:46] if a host goes down we are worried [17:47:49] but not if the mgmt flapped [17:49:04] RECOVERY - Host netmon1002 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [17:49:26] RECOVERY - Host kubestage1001 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [17:49:28] robh: yes some hosts went down [17:49:39] bah [17:49:42] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [17:49:52] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [17:49:53] i see scb1001 =[ [17:50:00] elukey: sorry about that =[ [17:50:06] this is a particularly full rack [17:50:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:50:09] thanks :) [17:50:23] robh: completely fine, this is why I was asking to have people around :) (I depooled scb1001 beforehand) [17:50:36] (03PS3) 10Cwhite: prometheus: refresh prometheus-varnishkafka-exporter on configuration change [puppet] - 10https://gerrit.wikimedia.org/r/524545 (https://phabricator.wikimedia.org/T196066) [17:50:49] (03CR) 10Cwhite: prometheus: refresh prometheus-varnishkafka-exporter on configuration change (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/524545 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:53:34] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [17:53:45] herron: I see you're working on kafka 1001, I wanted to deploy something related to kafka but will hold off. Any ETA? 
cc ottomata [17:54:03] (03CR) 10Cwhite: [C: 03+2] prometheus: refresh prometheus-varnishkafka-exporter on configuration change [puppet] - 10https://gerrit.wikimedia.org/r/524545 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:54:12] (03PS4) 10Cwhite: prometheus: refresh prometheus-varnishkafka-exporter on configuration change [puppet] - 10https://gerrit.wikimedia.org/r/524545 (https://phabricator.wikimedia.org/T196066) [17:54:23] Pchelolo: o/ - I think 10/30 mins probably, they are working on the rack in the DC [17:54:37] elukey: g8t, thank you. I'll check back later [17:54:48] I'm not in a rush :) [17:56:42] PROBLEM - Check the Netbox report-s- librenms for fail status. on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:57:32] ok, both new pdus mounted in a4-eqiad [17:57:46] we are routing the main power input for ps1, ps2 is already energized and carrying load [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190722T1800). [18:00:05] RoanKattouw: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:06] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. 
https://wikitech.wikimedia.org/wiki/Keyholder [18:00:26] (03PS5) 10Dzahn: xhgui::app: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523995 (https://phabricator.wikimedia.org/T227650) [18:00:38] !log arm keyholder on netmon1002 after power loss [18:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:06] ok we are cutting over power for ps1-a4, there shouldn't be any noticeable loss since mgmt was on the other side but we shall see [18:01:32] (03CR) 10Dzahn: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/523995 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [18:02:10] I'll do my own SWAT [18:03:24] (03CR) 10Dzahn: [C: 03+2] xhgui::app: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523995 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [18:03:27] (03CR) 10Cwhite: [C: 03+2] hiera: deploy varnishkafka exporter to eqsin [puppet] - 10https://gerrit.wikimedia.org/r/524622 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [18:03:35] (03PS6) 10Dzahn: xhgui::app: use ldap-ro, stop using ldap-labs [puppet] - 10https://gerrit.wikimedia.org/r/523995 (https://phabricator.wikimedia.org/T227650) [18:03:38] (03PS3) 10Cwhite: hiera: deploy varnishkafka exporter to eqsin [puppet] - 10https://gerrit.wikimedia.org/r/524622 (https://phabricator.wikimedia.org/T196066) [18:03:56] (03PS3) 10Catrope: Enable GrowthExperiments help panel on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523354 (https://phabricator.wikimedia.org/T226729) [18:04:03] (03CR) 10Catrope: [C: 03+2] Enable GrowthExperiments help panel on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523354 (https://phabricator.wikimedia.org/T226729) (owner: 10Catrope) [18:04:19] rebase race ..reloading..reloading ;) [18:04:58] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes 
in 0.137 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [18:05:11] heh, I'll pause for a few minutes :) [18:05:46] (03Merged) 10jenkins-bot: Enable GrowthExperiments help panel on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523354 (https://phabricator.wikimedia.org/T226729) (owner: 10Catrope) [18:06:52] shdubsh: hehe, no no. the road is clear now [18:07:03] (03PS1) 10CDanis: noc db.php: include more data & add jsonv2 format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 [18:08:00] PROBLEM - Check the Netbox report-s- librenms for fail status. on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:08:38] ^ due to pdu swap fwiw [18:08:41] (03PS2) 10CDanis: noc db.php: include more data & add jsonv2 format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524825 [18:09:34] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments help panel on arwiki (T226729) (duration: 00m 48s) [18:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:42] T226729: Setup and deploy Help panel on Arabic Wikipedia - https://phabricator.wikimedia.org/T226729 [18:09:44] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:10:00] ^ known bug, fixing it shortly :) [18:11:52] (03PS4) 10Cwhite: hiera: deploy varnishkafka exporter to eqsin [puppet] - 10https://gerrit.wikimedia.org/r/524622 (https://phabricator.wikimedia.org/T196066) [18:13:07] jouncebot: now [18:13:07] For the next 0 hour(s) and 46 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190722T1800) [18:13:29] ok, the pdu swap in a4 is complete [18:13:39] the new pdu software isnt setup, but power interruption should be complete [18:13:42] (03CR) 10Dzahn: "yes, i still have to find a way to automate scraping the HTML the same way you get it when clicking "save page" in Firefox. surprisingly n" [puppet] - 10https://gerrit.wikimedia.org/r/523992 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [18:14:10] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10BBlack) All the traffic cp and lvs nodes are decoms and not in use: T208584 T208586 [18:14:11] RoanKattouw: I need to restart gerrit quickly, I am unsure if you're still swatting things [18:14:56] thcipriani: I'm not right now, I will resume in ~15 mins [18:14:59] k [18:15:05] elukey: wanna repool scb1001? [18:15:10] !log restarting gerrit due to T224448 [18:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:17] robh: Did anything you guys did involve depooling/repooling appservers? 
[18:15:18] T224448: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 [18:15:37] Getting reports of some weird behavior that would be explained by a couple of appservers in the pool not having gotten the latest sync [18:17:39] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Trying the last sync again, because it's appearing inconsistently (duration: 00m 47s) [18:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:56] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [18:18:24] RoanKattouw: nope [18:18:40] Hmm weird. Well I resynced, hopefully that fixes it [18:18:42] unless it was in https://phabricator.wikimedia.org/T227140 it wouldn't have been us [18:18:49] and it has no mw systems [18:19:42] (03CR) 10jenkins-bot: Enable GrowthExperiments help panel on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523354 (https://phabricator.wikimedia.org/T226729) (owner: 10Catrope) [18:24:15] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) [18:26:12] (03PS3) 10Catrope: Enable help panel for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523355 (https://phabricator.wikimedia.org/T226729) [18:26:19] (03CR) 10Catrope: [C: 03+2] Enable help panel for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523355 (https://phabricator.wikimedia.org/T226729) (owner: 10Catrope) [18:27:22] (03Merged) 10jenkins-bot: Enable help panel for 50% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523355 (https://phabricator.wikimedia.org/T226729) (owner: 10Catrope) [18:27:37] (03CR) 10jenkins-bot: Enable help panel for 50% of new users on arwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/523355 (https://phabricator.wikimedia.org/T226729) (owner: 10Catrope) [18:28:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:31:25] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable help panel for 50% of new users on arwiki (T226729) (duration: 00m 47s) [18:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:32] T226729: Setup and deploy Help panel on Arabic Wikipedia - https://phabricator.wikimedia.org/T226729 [18:40:08] (03CR) 10Ottomata: "Hm, I had thought the files were readable by anybody." [puppet] - 10https://gerrit.wikimedia.org/r/524625 (https://phabricator.wikimedia.org/T227364) (owner: 10EBernhardson) [18:41:26] (03CR) 10Dzahn: [C: 03+2] static-rt: LDAP config, use ro, Hiera and new password classes [puppet] - 10https://gerrit.wikimedia.org/r/523992 (https://phabricator.wikimedia.org/T227650) (owner: 10Dzahn) [18:41:36] (03PS6) 10Dzahn: static-rt: LDAP config, use ro, Hiera and new password classes [puppet] - 10https://gerrit.wikimedia.org/r/523992 (https://phabricator.wikimedia.org/T227650) [18:44:18] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10RobH) [18:44:29] jouncebot: next [18:44:29] In 1 hour(s) and 15 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190722T2000) [18:45:40] RECOVERY - Host ps1-a4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms