[00:00:04] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180913T0000). [00:00:08] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@7e5e537]: Deploy Blazegraph & Updater for T202765 and T203646 handling (duration: 23m 45s) [00:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:16] T203646: Wikidata Query Service nodes out of sync - https://phabricator.wikimedia.org/T203646 [00:02:01] Better luck catching someone next time [00:02:41] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Dzahn) Yep, with a /29 netmask and 10.195.0.72 as the network address, .79 is the broadcast address. It looks like .73 is not used though and we c... [00:09:00] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Dzahn) And as Arzhel points out .73 is already the router IP, so afraid this network is already full now and the host can't be added. And changing... [00:13:16] (03CR) 10Dzahn: [C: 04-1] "we can't use .73 either, that's the router IP, so scratch that. this network is already full unfortunately" [dns] - 10https://gerrit.wikimedia.org/r/460127 (owner: 10Papaul) [00:20:44] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10ayounsi) 10.195.0.73 is the router IP, it's missing from DNS, I'll add it (and the other ones). And indeed, it can't be extended to a /28. Short t... 
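The subnet arithmetic discussed in T204079 above can be checked with a minimal sketch using Python's standard `ipaddress` module. The prefix 10.195.0.72/29 and the router address .73 come from the log; the variable names are illustrative only. The log does not say why the network cannot be extended to a /28, so the last line only shows which /28 would contain it.

```python
import ipaddress

# Prefix taken from the log: /29 netmask, 10.195.0.72 as the network address.
net = ipaddress.ip_network("10.195.0.72/29")

# Usable host addresses exclude the network and broadcast addresses.
usable = [str(ip) for ip in net.hosts()]

# The log states that .73 is the router IP, so it is not free for a new host.
router = "10.195.0.73"
assignable = [ip for ip in usable if ip != router]

# The enclosing /28 is aligned on a multiple of 16, so it starts at .64, not .72.
enclosing_28 = net.supernet(new_prefix=28)

print("network:   ", net.network_address)
print("broadcast: ", net.broadcast_address)
print("assignable:", assignable)
print("enclosing /28:", enclosing_28)
```

A /29 holds eight addresses, so once the network address, the broadcast address and the router are taken, only five remain for hosts.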
[00:22:27] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10Dzahn) I added a link to this ticket next to the performance related grafana checks in the Icinga web UI, using notes_url. This was possible after... [00:50:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [00:52:23] (03PS1) 10Ayounsi: Add frack internal routers IPs [dns] - 10https://gerrit.wikimedia.org/r/460179 [00:53:56] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Dzahn) [00:54:01] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Jgreen) >>! In T204079#4579043, @ayounsi wrote: > Short time solution, if this host replaces an existing host in the same subnet, is to re-al... 
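The "40.00% of data above the critical threshold [5000.0]" wording in the alert above suggests a check that counts what share of recent datapoints exceed a threshold. The following is a rough sketch of that idea, not the actual Icinga plugin: the function name `classify`, the `>=` comparison and the handling of missing datapoints are all assumptions. The thresholds 1000.0 and 5000.0 and the 40% figure are taken from the PROBLEM and RECOVERY messages in this log.

```python
def classify(datapoints, warning=1000.0, critical=5000.0, percent=40.0):
    """Classify a series by the share of datapoints above each threshold.

    Hypothetical helper: the real check's exact semantics are not shown
    in the log, so treat the comparison rules here as assumptions.
    """
    # Ignore gaps in the series (assumed to be reported as None).
    values = [v for v in datapoints if v is not None]
    if not values:
        return "UNKNOWN"

    def pct_above(threshold):
        # Percentage of datapoints strictly above the threshold.
        return 100.0 * sum(1 for v in values if v > threshold) / len(values)

    if pct_above(critical) >= percent:
        return "CRITICAL"
    if pct_above(warning) >= percent:
        return "WARNING"
    return "OK"


# Two of five points above 5000.0 is 40%, which this sketch treats as critical.
print(classify([6000, 7000, 100, 200, 300]))
# No point exceeds the warning threshold, so the series is fine.
print(classify([100, 200, 300, 400, 500]))
```

Under this reading, a brief spike can flip the state to CRITICAL and back within minutes, which would match the short PROBLEM/RECOVERY pairs seen in this log.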
[00:54:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [00:58:34] 10Operations, 10Goal: Migrate the hardware inventory from Racktables to Netbox - https://phabricator.wikimedia.org/T199083 (10Dzahn) [00:58:36] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10Dzahn) 05Open>03Resolved a:03Dzahn [00:59:05] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10Dzahn) a:05Dzahn>03Krenair [00:59:57] (03PS2) 10Ayounsi: Add frack internal routers IPs [dns] - 10https://gerrit.wikimedia.org/r/460179 [00:59:59] 10Operations, 10Patch-For-Review, 10Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Dzahn) [01:00:03] 10Operations, 10Patch-For-Review, 10Tor: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) 05stalled>03Open [01:00:23] 10Operations: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) [01:02:26] (03CR) 10Dzahn: "i can't really check the IPs themselves, but thank you for adding them all. 
it's nice to be able to see which IPs are used in DNS includin" [dns] - 10https://gerrit.wikimedia.org/r/460179 (owner: 10Ayounsi) [01:04:37] (03CR) 10Dzahn: [C: 031] "well, i can check this is always the first IP in each subnet, ack" [dns] - 10https://gerrit.wikimedia.org/r/460179 (owner: 10Ayounsi) [01:04:56] (03CR) 10Ayounsi: [C: 032] "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/460179 (owner: 10Ayounsi) [01:14:18] (03CR) 10Dzahn: "the question here is nott technical but whether it's agreeable that hosts defined as "test" would _not_ send notifications (IRC, email) be" [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [01:16:31] (03PS2) 10Dzahn: icinga: disable notifications for hosts using role(test) [puppet] - 10https://gerrit.wikimedia.org/r/460064 [01:17:22] (03PS3) 10Dzahn: icinga: disable notifications for hosts using role(test) [puppet] - 10https://gerrit.wikimedia.org/r/460064 [01:22:02] (03CR) 10Dzahn: [C: 04-1] "another option that has been brought up is that the domain stays with wmf but is removed from our name servers and points to name servers " [dns] - 10https://gerrit.wikimedia.org/r/459835 (https://phabricator.wikimedia.org/T204056) (owner: 10Reedy) [01:36:09] (03PS7) 10Alex Monk: Add make_account CLI script [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 [01:40:18] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) Thanks for additional comments. The old ticket T84200 (currently private because imported from old ticket system (RT) and contained personal email from... 
[01:41:16] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) p:05Triage>03Normal [01:41:19] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [01:43:28] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [02:35:30] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.20) (duration: 14m 03s) [02:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:16] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Sep 13 02:46:16 UTC 2018 (duration 10m 46s) [02:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:14:29] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:28:49] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 847.18 seconds [03:47:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 158.45 seconds [04:02:28] (03PS2) 10Andrew Bogott: Horizon: set a default network ID for neutron VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/460144 [04:02:30] (03PS1) 10Andrew Bogott: Horizon: Add ACTIVE_REGIONS setting [puppet] - 10https://gerrit.wikimedia.org/r/460190 [05:11:19] !log Stop MySQL on db2054 and dbstore2001:3317 to clone db2054 - T204127 [05:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:27] T204127: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 [05:52:17] (03PS4) 10Giuseppe Lavagetto: authdns: fix spec tests [puppet] - 10https://gerrit.wikimedia.org/r/458751 [06:05:20] (03PS1) 10C. Scott Ananian: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 [06:22:36] 10Operations, 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (10jcrespo) why not modify mariadb::packages and use that instead? Was there a blocker for that? Packages is supposed to be mostly for non mw core db ins... [06:23:11] 10Operations, 10Wikimedia-Logstash: logstash group1 dashboard incorrectly shows testwikidatawiki - https://phabricator.wikimedia.org/T184655 (10Krinkle) I could be wrong but afaik we use the same ldap/nda restriction for viewing as for editing of Logstash dashboards. `wiki:testwikidatawiki` is part of the lon... 
[06:23:30] (03PS2) 10Jcrespo: mariadb: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450314 (owner: 10Dzahn) [06:23:48] 10Operations, 10Wikimedia-Logstash: logstash group1 dashboard incorrectly shows testwikidatawiki - https://phabricator.wikimedia.org/T184655 (10Krinkle) 05stalled>03Open [06:25:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:29:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:29:24] 10Operations: SRE quarterly goal: allow MediaWiki requests to be served by PHP7 alongside HHVM - https://phabricator.wikimedia.org/T203959 (10Joe) @Legoktm I'm ok delaying this into next quarter, or even the one after that; but I think php 7.2 is indeed a possibility; there are packages that should be easy to ba... [06:31:08] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/R/update-library.R] [06:41:59] !log reboot stat100[4-6] for kernel upgrades - T203165 [06:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:07] T203165: Reboot Analytics hosts for kernel security upgrades - https://phabricator.wikimedia.org/T203165 [06:45:19] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: make includes explicit in more wikis [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:45:39] PROBLEM - Host stat1006 is DOWN: PING CRITICAL - Packet loss = 100% [06:48:19] RECOVERY - Host stat1006 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [06:48:19] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:51:31] 10Operations: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (10MoritzMuehlenhoff) @RobH Ack, I'll take care of that. [06:56:11] !log reboot notebook100[3,4] for kernel upgrades - T203165 [06:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:19] T203165: Reboot Analytics hosts for kernel security upgrades - https://phabricator.wikimedia.org/T203165 [06:56:38] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:39] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[06:59:10] (03CR) 10Ema: [C: 031] icinga: disable notifications for hosts using role(test) [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [07:01:09] PROBLEM - Host notebook1004 is DOWN: PING CRITICAL - Packet loss = 100% [07:01:29] RECOVERY - Host notebook1004 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [07:03:18] (03PS1) 10Jcrespo: mariadb: Reduce load of db2041, 55, 62 and 63 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460207 [07:04:19] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:04:38] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:05:49] RECOVERY - DPKG on cp1099 is OK: All packages OK [07:12:17] (03PS2) 10Jcrespo: mariadb: Adjust weights to load for perf optimization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460207 [07:12:28] (03CR) 10Muehlenhoff: [C: 031] "In the past there was a wide range of services using the test role (and also included things like ruthenium, the Parsoid test server, for " [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [07:15:53] (03CR) 10Marostegui: [C: 031] mariadb: Adjust weights to load for perf optimization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460207 (owner: 10Jcrespo) [07:17:59] (03PS3) 10Jcrespo: mariadb: Adjust weights to load for perf optimization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460207 [07:21:00] (03CR) 10Jcrespo: [C: 032] mariadb: Adjust weights to load for perf optimization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460207 (owner: 10Jcrespo) [07:22:10] (03Merged) 10jenkins-bot: mariadb: Adjust weights to load for perf optimization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460207 (owner: 10Jcrespo) [07:25:01] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Tune db weights even more (duration: 00m 50s) 
[07:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:28] (03CR) 10jenkins-bot: mariadb: Adjust weights to load for perf optimization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460207 (owner: 10Jcrespo) [07:34:18] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [07:34:29] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [07:46:06] !log execute apt-get autoremove on notebook* to remove old nginx packages (not used anymore) [07:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:50] !log reboot kafka10[12-23] (old analytics cluster) for kernel upgrades [07:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:02] (03CR) 10Muehlenhoff: [C: 04-1] "Actually, there's actually still hosts of that type for which disabling notifications is not adequate, tungsten and ruthenium (previously " [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [07:54:34] (03CR) 10Volans: [C: 032] exceptions: add SpicerackCheckError [software/spicerack] - 10https://gerrit.wikimedia.org/r/459804 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [07:56:12] (03Merged) 10jenkins-bot: exceptions: add SpicerackCheckError [software/spicerack] - 10https://gerrit.wikimedia.org/r/459804 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:01:04] (03PS1) 10Ema: site: convert cache::misc hosts to spares [puppet] - 10https://gerrit.wikimedia.org/r/460217 (https://phabricator.wikimedia.org/T164609) [08:01:07] (03PS1) 10Ema: lvs: remove misc_web and misc_web-https [puppet] - 10https://gerrit.wikimedia.org/r/460218 (https://phabricator.wikimedia.org/T164609) [08:01:07] !log Disconnect replication eqiad -> codfw on s1-s8, x1, es2, es3 - T189107 [08:01:08] (03PS1) 10Ema: Remove cache_misc definitions [puppet] - 10https://gerrit.wikimedia.org/r/460219 
(https://phabricator.wikimedia.org/T164609) [08:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:14] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 [08:01:23] !log rebooting dns recursors in eqsin for kernel security update [08:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:52] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) Replication has been disconnected from eqiad to codfw: ``` root@neodymium:/home/marostegui# for i in db2048 db2035 db2043 db2051 db2052 db2039 db2040 db2045 d... [08:08:51] !log Deploy schema change on s5 eqiad master - T187089 [08:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:59] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [08:16:08] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:16:09] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:16:26] this is me --^ [08:16:32] !log Stop replication on s4 eqiad master (db1068) and deploy a schema change - this will generate lag on s4 eqiad - T144010 [08:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:43] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [08:17:49] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:17:49] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:18:22] (03PS1) 10Ema: Remove maps-lb definitions [dns] - 10https://gerrit.wikimedia.org/r/460274 (https://phabricator.wikimedia.org/T164608) [08:21:09] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1023 is CRITICAL: CRITICAL: 56.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:21:15] !log Deploy schema change on s6 eqiad master (db1061) - T187089 [08:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:23] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [08:25:39] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1023 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:26:11] (03PS1) 10Ema: Remove misc-web [dns] - 10https://gerrit.wikimedia.org/r/460275 (https://phabricator.wikimedia.org/T164609) [08:26:49] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:26:49] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:27:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:29:48] !log installing curl security updates [08:29:53] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:29] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1023 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:35:01] * elukey wishes to decom this cluster sometimes in the future [08:35:38] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:35:44] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) [08:36:08] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:36:09] !log Deploy schema change on s6 eqiad master (db1061) - T89737 [08:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:16] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [08:36:48] (03PS1) 10Alexandros Kosiaris: Introduce matomo1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/460276 (https://phabricator.wikimedia.org/T202963) [08:39:59] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:40:29] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:40:39] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 
50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:41:08] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1023 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [08:41:25] !log removed labvirt1019/labvirt1020 from debmonitor (T204004) [08:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:32] T204004: Rename labvirt1019 and cloudvirt1020 to cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T204004 [08:45:45] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce matomo1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/460276 (https://phabricator.wikimedia.org/T202963) (owner: 10Alexandros Kosiaris) [08:49:42] !log Deploy schema change on s5 eqiad master (db1070) - T89737 [08:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:51] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [08:51:00] !log rebooting dns recursors in ulsfo for kernel security update [08:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:59] (03PS1) 10Ema: cp1099: disable icinga notifications [puppet] - 10https://gerrit.wikimedia.org/r/460279 (https://phabricator.wikimedia.org/T202966) [08:59:43] !log Bumped CI jobs based on Debian Stretch to use Chrome 69 and Firefox 60 | T203902 [08:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:51] T203902: Rebuild quibble images for Chrome 69 and Firefox 60 - https://phabricator.wikimedia.org/T203902 [09:00:24] (03CR) 10Ema: [C: 032] cp1099: disable icinga notifications [puppet] - 10https://gerrit.wikimedia.org/r/460279 (https://phabricator.wikimedia.org/T202966) (owner: 10Ema) [09:01:21] (03CR) 10Ema: "> Actually, there's actually still hosts of that type for 
which" [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [09:01:23] (03CR) 10Alexandros Kosiaris: [C: 032] ci: Allow Docker nodes to use a dedicated /var/lib/docker volume [puppet] - 10https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841) (owner: 10Dduvall) [09:01:30] (03PS7) 10Alexandros Kosiaris: ci: Allow Docker nodes to use a dedicated /var/lib/docker volume [puppet] - 10https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841) (owner: 10Dduvall) [09:01:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ci: Allow Docker nodes to use a dedicated /var/lib/docker volume [puppet] - 10https://gerrit.wikimedia.org/r/459875 (https://phabricator.wikimedia.org/T203841) (owner: 10Dduvall) [09:08:13] !log Deploy schema change on s8 eqiad master (db1071) - T89737 [09:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:21] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [09:15:49] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1023 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [09:15:49] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [09:16:19] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [09:17:09] (03CR) 10Mark Bergsma: [C: 031] Icinga: add check_vcp (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [09:20:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1023 
is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [09:20:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [09:20:48] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [09:23:15] (03PS2) 10Marostegui: wiki replicas: depool labsdb1009 to run view updates [puppet] - 10https://gerrit.wikimedia.org/r/460016 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [09:23:58] (03CR) 10Marostegui: [C: 032] wiki replicas: depool labsdb1009 to run view updates [puppet] - 10https://gerrit.wikimedia.org/r/460016 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [09:24:54] !log Reload haproxy on dbproxy1011 to depool labsdb1009 - https://phabricator.wikimedia.org/T174047 [09:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:25] (03PS1) 10Marostegui: Revert "wiki replicas: depool labsdb1009 to run view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460288 [09:34:38] !log rebooting dns recursors in esams for kernel security update [09:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:01] (03PS4) 10Elukey: mariadb::service: remove old require causing issues on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) [09:41:31] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp2020.codfw.wmnet [09:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:54] 10Operations, 10ops-codfw, 10Parsing-Team: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176 (10fgiunchedi) 
05Open>03Resolved Host is pooled, let's see if it happens again. [09:44:43] !log Enable GTID on eqiad masters - T189107 [09:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:51] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 [09:51:00] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: make includes explicit in more wikis [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) [09:51:10] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) GTID enabled on all eqiad masters but db1071 (s8) and db1068 (s4) as they are currently running a big alter. [09:54:54] !log repair sdd on ms-be2043 - T199198 [09:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:01] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [09:55:38] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) db1068 enabled [09:57:02] !log Deploy schema change on s4 eqiad master (db1068) - T89737 [09:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:10] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [10:03:53] (03PS1) 10Marostegui: db-codfw.php: Slowly repool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460304 (https://phabricator.wikimedia.org/T204127) [10:04:05] (03CR) 10Marostegui: [C: 04-1] "Wait until lag is gone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460304 (https://phabricator.wikimedia.org/T204127) (owner: 10Marostegui) [10:08:02] 10Operations, 10Discovery-Search, 10Datacenter-Switchover-2018: Warn when CirrusSearch is not configured to use local DCfor 
an extended time - https://phabricator.wikimedia.org/T204135 (10dcausse) I think using SiteInfo might be the easiest solution, we could add `wmgCirrusSearchDefaultCluster` besides `wmfM... [10:15:19] RECOVERY - Filesystem available is greater than filesystem size on ms-be2043 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops [10:30:48] (03CR) 10Marostegui: [C: 032] db-codfw.php: Slowly repool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460304 (https://phabricator.wikimedia.org/T204127) (owner: 10Marostegui) [10:32:11] (03Merged) 10jenkins-bot: db-codfw.php: Slowly repool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460304 (https://phabricator.wikimedia.org/T204127) (owner: 10Marostegui) [10:33:01] (03CR) 10jenkins-bot: db-codfw.php: Slowly repool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460304 (https://phabricator.wikimedia.org/T204127) (owner: 10Marostegui) [10:33:24] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Slowly repool db2054 - T204127 (duration: 00m 50s) [10:33:24] !log Deploy schema change on s4 eqiad master (db1068) - T187089 [10:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:32] T204127: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 [10:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:39] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [10:46:07] !log uploaded debdeploy 0.0.99.5-1+deb9u1 to apt.wikimedia.org/stretch-wikimedia [10:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:09] !log shutting down db2068 and dbstore2001:s7 for cloning [10:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] addshore, hashar, aude, 
MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180913T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:00:13] o/ [11:00:25] I'm around for SWAT, but look like there's nothing to do :D [11:13:33] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) [11:16:18] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) [11:19:16] (03CR) 10Alex Monk: cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: 10Arturo Borrero Gonzalez) [11:20:11] (03PS1) 10Muehlenhoff: Switch role for cumin2001 to role::cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/460321 (https://phabricator.wikimedia.org/T177385) [11:20:13] (03PS1) 10Muehlenhoff: Add cumin2001 to network constants and tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/460322 [11:20:15] (03PS1) 10Muehlenhoff: Enable cumin2001 as mysql maintenance client [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) [11:22:48] 10Operations, 10Traffic: puppetize http purging for ATS backends - https://phabricator.wikimedia.org/T204208 (10ema) [11:22:59] 10Operations, 10Traffic: puppetize http purging for ATS backends - https://phabricator.wikimedia.org/T204208 (10ema) p:05Triage>03Normal [11:23:43] (03PS1) 10Arturo Borrero Gonzalez: wmnet: cleanup labnet1001-ext.eqiad.wmnet FQDN [dns] - 10https://gerrit.wikimedia.org/r/460324 [11:24:24] (03CR) 10Arturo 
Borrero Gonzalez: [C: 032] wmnet: cleanup labnet1001-ext.eqiad.wmnet FQDN [dns] - 10https://gerrit.wikimedia.org/r/460324 (owner: 10Arturo Borrero Gonzalez) [11:27:14] (03PS1) 10Jcrespo: mariadb: Reimage db1075 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/460325 (https://phabricator.wikimedia.org/T148507) [11:29:02] (03CR) 10Jcrespo: [C: 032] mariadb: Reimage db1075 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/460325 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [11:33:12] 10Operations, 10Traffic: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (10ema) [11:33:28] 10Operations, 10Traffic: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (10ema) p:05Triage>03Normal [11:34:55] (03PS1) 10Jcrespo: mariadb: Move db1075 socket to default, disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/460327 (https://phabricator.wikimedia.org/T148507) [11:36:22] (03CR) 10Jcrespo: [C: 032] mariadb: Move db1075 socket to default, disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/460327 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [11:38:33] (03PS2) 10Gehel: Temporarily enable debug logging for regex matches [puppet] - 10https://gerrit.wikimedia.org/r/460151 (owner: 10Smalyshev) [11:38:58] !log stopping eqiad s3 replication and shutting down db1075 in preparation for reimage T148507 [11:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:17] (03CR) 10Gehel: [C: 032] Temporarily enable debug logging for regex matches [puppet] - 10https://gerrit.wikimedia.org/r/460151 (owner: 10Smalyshev) [11:41:13] jynus: unmerged patch on puppetmaster, should I merge it? 
[11:41:25] yes, sorry [11:41:32] I thought I had done it already [11:41:37] no problem, merged [11:41:52] gehel: thanks for asking [11:42:12] it looked simple, but I have no idea what the implications are :) [11:42:56] well, it was a master, and I needed to disable puppet first [11:43:14] you did well, thank you, it was my fault for forgetting [11:43:55] I was focused on making no alerts and forgot to merge [11:47:40] !log rebooting dns recursors in codfw for kernel security update [11:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:39] jynus: no problem, not the first time... not the last one! [11:51:35] (03PS5) 10Elukey: mariadb::service: remove old require causing issues on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) [11:51:37] (03CR) 10Gehel: [C: 04-1] "We might not share a common understanding of the use case behind CheckError. Now is a good time to discuss and document it. See inline com" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:52:20] (03CR) 10Elukey: [C: 032] mariadb::service: remove old require causing issues on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/459994 (https://phabricator.wikimedia.org/T204074) (owner: 10Elukey) [11:54:29] (03CR) 10Gehel: Elasticsearch module is coming up. (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [12:00:28] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic: Investigate source of 404 Not Found responses from load.php - https://phabricator.wikimedia.org/T202479 (10ema) >>! In T202479#4575922, @Krinkle wrote: > 2. Hostnames we route to text-lb that Varnish doesn't recognise (receives varnis... 
[12:07:02] (03CR) 10Volans: "replies inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:15:17] 10Operations, 10Traffic: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (10ema) [12:15:24] 10Operations, 10Traffic: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (10ema) p:05Triage>03Normal [12:37:11] (03CR) 10Bstorm: [C: 031] Revert "wiki replicas: depool labsdb1009 to run view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460288 (owner: 10Marostegui) [12:40:57] 10Operations, 10Traffic: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (10ema) [12:41:23] 10Operations, 10Traffic: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (10ema) p:05Triage>03Normal [12:41:49] 10Operations, 10Traffic: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (10ema) [12:43:11] (03PS2) 10Marostegui: Revert "wiki replicas: depool labsdb1009 to run view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460288 [12:45:04] (03CR) 10Marostegui: [C: 032] Revert "wiki replicas: depool labsdb1009 to run view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460288 (owner: 10Marostegui) [12:45:32] !log Reload haproxy on dbproxy1011 to repool labsdb1009 - https://phabricator.wikimedia.org/T174047 [12:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:39] (03PS1) 10Muehlenhoff: Temporarily remove dns1001 for kernel update [puppet] - 10https://gerrit.wikimedia.org/r/460341 [12:49:18] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) db1071 GTID enabled [12:50:05] (03PS1) 10Marostegui: db-codfw.php: Depool db2085:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460342 (https://phabricator.wikimedia.org/T189101) [12:51:51] 
(03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2085:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460342 (https://phabricator.wikimedia.org/T189101) (owner: 10Marostegui) [12:53:10] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2085:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460342 (https://phabricator.wikimedia.org/T189101) (owner: 10Marostegui) [12:54:50] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2085:3318 - T189101 (duration: 00m 49s) [12:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:58] T189101: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101 [12:55:15] !log Stop db1071 (s8 eqiad master) and db2085:3318 in sync - T189101 [12:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:08] (03CR) 10jenkins-bot: db-codfw.php: Depool db2085:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460342 (https://phabricator.wikimedia.org/T189101) (owner: 10Marostegui) [12:58:13] 10Operations: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (10MoritzMuehlenhoff) 05Open>03Resolved Closing this task, actual implementation will happen via T177385 [12:59:23] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2085:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460343 [13:01:06] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2085:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460343 (owner: 10Marostegui) [13:02:22] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2085:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460343 (owner: 10Marostegui) [13:03:32] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2085:3318 - T189101 (duration: 00m 49s) [13:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:40] 
T189101: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101 [13:06:00] (03PS1) 10Marostegui: db-codfw.php: Increase traffic for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460344 [13:07:40] !log Deploy schema change on s8 eqiad master (db1071) - T187089 [13:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:47] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [13:10:27] (03PS2) 10Herron: mtail: restart on change to exim program [puppet] - 10https://gerrit.wikimedia.org/r/459789 [13:11:58] (03CR) 10Herron: [C: 032] mtail: restart on change to exim program [puppet] - 10https://gerrit.wikimedia.org/r/459789 (owner: 10Herron) [13:12:10] (03PS1) 10Volans: setup.py: fix missing comma [software/spicerack] - 10https://gerrit.wikimedia.org/r/460348 (https://phabricator.wikimedia.org/T199079) [13:12:37] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2085:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460343 (owner: 10Marostegui) [13:12:45] (03PS1) 10BPirkle: Add export logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460349 (https://phabricator.wikimedia.org/T203424) [13:15:21] (03CR) 10Marostegui: [C: 032] db-codfw.php: Increase traffic for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460344 (owner: 10Marostegui) [13:15:23] (03PS2) 10Herron: mtail: update exim ciphersuite regex [puppet] - 10https://gerrit.wikimedia.org/r/459842 [13:15:52] (03CR) 10Volans: "Good catch Matt!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/460348 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:17:20] (03CR) 10Herron: [C: 032] mtail: update exim ciphersuite regex [puppet] - 10https://gerrit.wikimedia.org/r/459842 (owner: 10Herron) [13:18:24] (03CR) 10Gehel: [C: 04-1] "Replied inline, feel free to ping me for a more synchronous conversation." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:18:29] (03PS1) 10Alexandros Kosiaris: Introduce matomo1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/460353 (https://phabricator.wikimedia.org/T202963) [13:18:43] !log Deploy schema change on s2 eqiad master (db1066) - T89737 [13:18:50] (03CR) 10Gehel: [C: 031] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/460348 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:53] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [13:19:07] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce matomo1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/460353 (https://phabricator.wikimedia.org/T202963) (owner: 10Alexandros Kosiaris) [13:20:45] !log andrew@deploy1001 Started deploy [horizon/deploy@12aa2d3]: Improvements for VM creation in eqiad1, T167293 [13:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:52] T167293: Nova-network to Neutron migration - https://phabricator.wikimedia.org/T167293 [13:20:56] !log andrew@deploy1001 Finished deploy [horizon/deploy@12aa2d3]: Improvements for VM creation in eqiad1, T167293 (duration: 00m 13s) [13:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:55] (03Merged) 10jenkins-bot: db-codfw.php: Increase traffic for db2054 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/460344 (owner: 10Marostegui) [13:23:41] 10Operations, 10DBA, 10Research, 10Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (10Joe) >>! In T203039#4574768, @Pchelolo wrote: >> don't have libraries and abstractions for accessing MySQL from our nodejs services. Is that correct? > > That's th... [13:23:54] !log andrew@deploy1001 Started deploy [horizon/deploy@c9c7a56]: Improvements for VM creation in eqiad1, T167293 (take two) [13:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:10] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Increase weight for db2054 - T204127 (duration: 00m 49s) [13:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:18] T204127: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 [13:27:19] (03CR) 10jenkins-bot: db-codfw.php: Increase traffic for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460344 (owner: 10Marostegui) [13:28:04] !log andrew@deploy1001 Finished deploy [horizon/deploy@c9c7a56]: Improvements for VM creation in eqiad1, T167293 (take two) (duration: 04m 10s) [13:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:13] T167293: Nova-network to Neutron migration - https://phabricator.wikimedia.org/T167293 [13:29:50] (03PS3) 10Andrew Bogott: Horizon: set a default network ID for neutron VMs. [puppet] - 10https://gerrit.wikimedia.org/r/460144 [13:29:58] (03PS2) 10Andrew Bogott: Horizon: Add ACTIVE_REGIONS setting [puppet] - 10https://gerrit.wikimedia.org/r/460190 [13:31:07] (03CR) 10Andrew Bogott: [C: 032] Horizon: set a default network ID for neutron VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/460144 (owner: 10Andrew Bogott) [13:31:19] (03CR) 10Andrew Bogott: [C: 032] Horizon: Add ACTIVE_REGIONS setting [puppet] - 10https://gerrit.wikimedia.org/r/460190 (owner: 10Andrew Bogott) [13:33:10] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10MW-1.32-release-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), 10Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (10CCicalese_WMF) 05Open>03Resolved a:03CCicalese_WMF I'm going... [13:35:17] (03PS1) 10Andrew Bogott: Horizon local_settings.py.erb: remove an extraneous } [puppet] - 10https://gerrit.wikimedia.org/r/460359 [13:36:03] (03CR) 10Andrew Bogott: [C: 032] Horizon local_settings.py.erb: remove an extraneous } [puppet] - 10https://gerrit.wikimedia.org/r/460359 (owner: 10Andrew Bogott) [13:42:13] (03PS2) 10Muehlenhoff: Temporarily remove dns1001 for kernel update [puppet] - 10https://gerrit.wikimedia.org/r/460341 [13:43:22] (03CR) 10Muehlenhoff: [C: 032] Temporarily remove dns1001 for kernel update [puppet] - 10https://gerrit.wikimedia.org/r/460341 (owner: 10Muehlenhoff) [13:45:30] (03PS1) 10Andrew Bogott: Horizon: disable project 'hhvm' in eqiad and enable in eqiad1. [puppet] - 10https://gerrit.wikimedia.org/r/460360 [13:49:26] 10Operations, 10Cloud-Services, 10Parsing-Team, 10Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 (10Gilles) 05Open>03Resolved a... 
[13:49:53] 10Operations, 10Continuous-Integration-Infrastructure, 10Mail, 10Jenkins, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (10herron) [13:50:02] 10Operations, 10Cloud-Services, 10Parsing-Team, 10Datacenter-Switchover-2018, and 2 others: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438 (10Gilles) a:05Gilles>03Joe [13:50:08] 10Operations, 10Mail, 10Patch-For-Review, 10User-herron: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361 (10herron) [13:51:53] !log rebooting dns1001 for kernel security update [13:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:22] (03PS1) 10Bstorm: wiki replicas: depool labsdb1011 to run view updates [puppet] - 10https://gerrit.wikimedia.org/r/460362 (https://phabricator.wikimedia.org/T174047) [13:58:07] (03CR) 10Mathew.onipe: [C: 031] "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/460348 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:59:23] !log depool mx1001 and relay queued messages to mx2001 for upgrade to stretch T175361 [13:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:32] T175361: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361 [14:01:13] (03CR) 10Volans: [C: 032] setup.py: fix missing comma [software/spicerack] - 10https://gerrit.wikimedia.org/r/460348 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:02:28] (03PS1) 10Muehlenhoff: Revert "Temporarily remove dns1001 for kernel update" [puppet] - 10https://gerrit.wikimedia.org/r/460363 [14:02:31] (03Merged) 10jenkins-bot: setup.py: fix missing comma [software/spicerack] - 10https://gerrit.wikimedia.org/r/460348 (https://phabricator.wikimedia.org/T199079) (owner: 
10Volans) [14:03:21] (03CR) 10Muehlenhoff: [C: 032] Revert "Temporarily remove dns1001 for kernel update" [puppet] - 10https://gerrit.wikimedia.org/r/460363 (owner: 10Muehlenhoff) [14:06:40] (03PS1) 10Elukey: profile::analytics::refinery::job::camus: add parameter to enable monitors [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) [14:07:22] (03CR) 10Alex Monk: [C: 031] "per Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/460360 (owner: 10Andrew Bogott) [14:09:12] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12438/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [14:09:45] (03CR) 10Muehlenhoff: [C: 031] Horizon: disable project 'hhvm' in eqiad and enable in eqiad1. [puppet] - 10https://gerrit.wikimedia.org/r/460360 (owner: 10Andrew Bogott) [14:11:33] (03PS1) 10Muehlenhoff: Temporarily remove dns1002 for reboot [puppet] - 10https://gerrit.wikimedia.org/r/460366 [14:15:38] (03CR) 10Muehlenhoff: [C: 032] Temporarily remove dns1002 for reboot [puppet] - 10https://gerrit.wikimedia.org/r/460366 (owner: 10Muehlenhoff) [14:28:37] (03PS4) 10Herron: install_server: reinstall mx1001 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/429241 (https://phabricator.wikimedia.org/T175361) [14:29:17] (03PS2) 10Andrew Bogott: Horizon: disable project 'hhvm' in eqiad and enable in eqiad1. [puppet] - 10https://gerrit.wikimedia.org/r/460360 [14:29:47] (03CR) 10Herron: [C: 032] install_server: reinstall mx1001 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/429241 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [14:30:15] (03PS3) 10Andrew Bogott: Horizon: disable project 'hhvm' in eqiad and enable in eqiad1. [puppet] - 10https://gerrit.wikimedia.org/r/460360 [14:31:21] (03CR) 10Andrew Bogott: [C: 032] Horizon: disable project 'hhvm' in eqiad and enable in eqiad1. 
[puppet] - 10https://gerrit.wikimedia.org/r/460360 (owner: 10Andrew Bogott) [14:32:22] !log rebooting dns1002 for kernel security update [14:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:13] (03PS2) 10Marostegui: wiki replicas: depool labsdb1011 to run view updates [puppet] - 10https://gerrit.wikimedia.org/r/460362 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [14:35:11] !log switch labstore1004 and labstore1005 to using the cfq scheduler on the DRBD volumes [14:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:25] (03CR) 10Marostegui: [C: 032] wiki replicas: depool labsdb1011 to run view updates [puppet] - 10https://gerrit.wikimedia.org/r/460362 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [14:36:27] (03CR) 10Ottomata: profile::analytics::refinery::job::camus: add parameter to enable monitors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [14:36:32] !log Reload haproxy on dbproxy1010 to depool labsdb1011 - https://phabricator.wikimedia.org/T174047 [14:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:30] (03CR) 10Elukey: profile::analytics::refinery::job::camus: add parameter to enable monitors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [14:39:05] (03PS1) 10Muehlenhoff: Revert "Temporarily remove dns1002 for reboot" [puppet] - 10https://gerrit.wikimedia.org/r/460368 [14:40:49] (03CR) 10Muehlenhoff: [C: 032] Revert "Temporarily remove dns1002 for reboot" [puppet] - 10https://gerrit.wikimedia.org/r/460368 (owner: 10Muehlenhoff) [14:42:00] (03PS1) 10Marostegui: db-codfw.php: Increase weight for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460369 (https://phabricator.wikimedia.org/T204127) [14:44:09] (03CR) 10Marostegui: [C: 032] 
db-codfw.php: Increase weight for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460369 (https://phabricator.wikimedia.org/T204127) (owner: 10Marostegui) [14:46:06] (03Merged) 10jenkins-bot: db-codfw.php: Increase weight for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460369 (https://phabricator.wikimedia.org/T204127) (owner: 10Marostegui) [14:46:32] (03PS2) 10Elukey: profile::analytics::refinery::job::camus: add parameter to enable monitors [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) [14:46:43] PROBLEM - MariaDB Slave IO: s7 on db2068 is CRITICAL: CRITICAL slave_io_state could not connect [14:46:48] PROBLEM - MariaDB Slave SQL: s7 on db2068 is CRITICAL: CRITICAL slave_sql_state could not connect [14:46:53] banyek: ^ that you? [14:47:02] PROBLEM - mysqld processes on db2068 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [14:47:06] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Increase weight for db2054 - T204127 (duration: 00m 50s) [14:47:09] no [14:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:13] T204127: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 [14:47:17] I think the 4 hour downtime passed [14:47:17] banyek: aren't you touching db2068? [14:47:32] it was downtimed for 4 hours [14:47:35] only [14:47:38] banyek: can you downtime it again for maybe 24h? [14:47:44] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag could not connect [14:47:44] sure [14:47:53] * banyek running to silence it [14:48:18] :) [14:48:23] I see everything under control [14:48:29] <_joe_> ok so I can go back? 
[14:48:32] <_joe_> cool [14:48:35] _joe_: yep [14:48:38] sorry for the noise [14:48:41] db2068 is in maintenance [14:48:50] <_joe_> I was going downstairs to buy something, I ran back home :P [14:51:21] (03CR) 10Elukey: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/12439/analytics1003.eqiad.wmnet/change.analytics1003.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [14:53:03] (03CR) 10jenkins-bot: db-codfw.php: Increase weight for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460369 (https://phabricator.wikimedia.org/T204127) (owner: 10Marostegui) [14:55:31] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review: eqiad (1) - VM request for Piwik/Matomo - https://phabricator.wikimedia.org/T202963 (10akosiaris) 05Open>03Resolved @elukey VM is up and running. No role assigned in puppet so you probably want to handle that. Resolving this. [14:56:02] akosiaris: thanks! --^ [14:56:26] (03PS3) 10Elukey: profile::analytics::refinery::job::camus: add parameter to enable monitors [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) [14:58:33] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12440/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [14:59:01] ottomata: --^ [15:02:37] !log upload blubber_0.5.0-1 to apt.wikimedia.org/{stretch,jessie}-wikimedia/main T203121 [15:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:45] T203121: Update Debian package of Blubber (0.5.0-1) - https://phabricator.wikimedia.org/T203121 [15:03:26] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package of Blubber (0.5.0-1) - https://phabricator.wikimedia.org/T203121 (10akosiaris) 05Open>03Resolved a:03akosiaris Packages built and uploaded to 
both stretch-wikimedia and jessie-wikimedia. Resolving,... [15:07:39] (03CR) 10Ottomata: [C: 031] profile::analytics::refinery::job::camus: add parameter to enable monitors [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [15:07:55] (03PS4) 10Elukey: profile::analytics::refinery::job::camus: add parameter to enable monitors [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) [15:11:00] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::camus: add parameter to enable monitors [puppet] - 10https://gerrit.wikimedia.org/r/460365 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [15:13:12] (03PS1) 10Marostegui: Revert "wiki replicas: depool labsdb1011 to run view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460375 [15:13:31] (03PS2) 10Marostegui: Revert "wiki replicas: depool labsdb1011 to run view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460375 [15:14:22] (03CR) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:15:30] !log repool mx1001 — upgrade to stretch complete T175361 [15:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:37] T175361: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361 [15:16:00] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) [15:20:44] (03PS1) 10Marostegui: db-codfw.php: Increase traffic for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460377 [15:22:26] 10Operations, 10Core-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - 
https://phabricator.wikimedia.org/T176370 (10fred) 3.30 (due mid-December) will be the last HHVM release that aims to support PHP: https://hhvm.com/blog/2018/09/12/end-of-... [15:23:05] (03CR) 10Marostegui: [C: 032] db-codfw.php: Increase traffic for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460377 (owner: 10Marostegui) [15:24:32] (03Merged) 10jenkins-bot: db-codfw.php: Increase traffic for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460377 (owner: 10Marostegui) [15:25:48] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Increase weight for db2054 - T204127 (duration: 00m 49s) [15:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:56] T204127: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 [15:26:51] !log restarting again db1075 for proper kernel upgrade [15:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:54] 10Operations, 10Wikimedia-Logstash, 10Goal: Investigate log shipping methods and standardize on them (logstash) - https://phabricator.wikimedia.org/T198757 (10fgiunchedi) As part of the [[ https://docs.google.com/document/d/1Aq-Dhq3SbRCPmQdw6jjaHvH-KaDfuQOeuORu4aCgTyQ/edit# | logging infrastructure design do... [15:30:01] hello!! [15:32:17] (03PS1) 10Elukey: profile::analytics::database::meta: upgrade mariadb version for labs [puppet] - 10https://gerrit.wikimedia.org/r/460380 (https://phabricator.wikimedia.org/T204060) [15:33:44] (03PS2) 10Elukey: profile::analytics::database::meta: upgrade mariadb version for labs [puppet] - 10https://gerrit.wikimedia.org/r/460380 (https://phabricator.wikimedia.org/T204060) [15:34:15] (03CR) 10Anomie: [C: 04-1] "I spotted a few errors." 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [15:35:03] (03CR) 10Elukey: [C: 032] profile::analytics::database::meta: upgrade mariadb version for labs [puppet] - 10https://gerrit.wikimedia.org/r/460380 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [15:36:28] (03CR) 10jenkins-bot: db-codfw.php: Increase traffic for db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460377 (owner: 10Marostegui) [15:38:08] (03PS1) 10Alex Monk: [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 [15:39:02] (03CR) 10Bstorm: [C: 031] Revert "wiki replicas: depool labsdb1011 to run view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460375 (owner: 10Marostegui) [15:39:37] (03PS3) 10Marostegui: Revert "wiki replicas: depool labsdb1011 to run view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460375 [15:41:21] (03CR) 10Marostegui: [C: 032] Revert "wiki replicas: depool labsdb1011 to run view updates" [puppet] - 10https://gerrit.wikimedia.org/r/460375 (owner: 10Marostegui) [15:51:40] (03PS4) 10Andrew Bogott: m5: add grants for designate on cloudservices1003 and 1004 [puppet] - 10https://gerrit.wikimedia.org/r/452997 [15:52:11] 10Operations, 10Traffic, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10ema) [15:52:36] (03CR) 10Andrew Bogott: [C: 032] m5: add grants for designate on cloudservices1003 and 1004 [puppet] - 10https://gerrit.wikimedia.org/r/452997 (owner: 10Andrew Bogott) [15:58:29] (03PS1) 10Elukey: profile::analytics::database::meta: update mariadb's basedir for stretch [puppet] - 10https://gerrit.wikimedia.org/r/460387 (https://phabricator.wikimedia.org/T204060) [15:59:13] (03PS10) 10Bstorm: wiki replicas - prepare for refactored actor storage [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) [15:59:50] 
(03CR) 10Bstorm: "> Patch Set 9: Code-Review-1" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [16:00:05] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180913T1600). Please do the needful. [16:00:06] No GERRIT patches in the queue for this window AFAICS. [16:01:54] (03PS2) 10Alex Monk: [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 [16:02:37] 10Operations, 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (10zhuyifei1999) >>! In T181205#4579461, @jcrespo wrote: > why not modify mariadb::packages and use that instead? Was there a blocker for that? Packages... [16:02:40] (03CR) 10Elukey: [C: 032] profile::analytics::database::meta: update mariadb's basedir for stretch [puppet] - 10https://gerrit.wikimedia.org/r/460387 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [16:04:08] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [16:09:19] (03PS3) 10Alex Monk: [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 [16:09:55] (03CR) 10Anomie: [C: 031] "PS10 seems good. Haven't tested." [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [16:14:17] 10Operations, 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (10jcrespo) > profile 'profile::quarry::database' includes non-profile class mariadb::packages That is probably pedantic, to use class { 'mariadb::packa... 
[16:19:20] (03CR) 10Krinkle: mediawiki::web::prod_sites: convert loginwiki, chapterwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [16:24:31] (03Abandoned) 10Jcrespo: mariadb: Add db2056 to api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460086 (owner: 10Jcrespo) [16:24:43] (03PS5) 10Jcrespo: mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) [16:24:45] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1075 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/460389 (https://phabricator.wikimedia.org/T148507) [16:26:56] (03PS1) 10Jcrespo: mariadb: Repool db2068 with load load after recloning it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460390 (https://phabricator.wikimedia.org/T204127) [16:27:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsing-Team: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10RobH) I'm not sure why this is still pending repair after all this time. In checking on the system, I can see it has both SDA and SDB present. SDA is marked as failed across both md0... 
[16:30:12] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [16:32:31] PROBLEM - Filesystem available is greater than filesystem size on ms-be1041 is CRITICAL: cluster=swift device=/dev/sde1 fstype=xfs instance=ms-be1041:9100 job=node mountpoint=/srv/swift-storage/sde1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [16:41:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsing-Team: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Cmjohnson) Dell denied my request becuase they say the h/w log doesn't show a disk failure We are unable to proceed with this request as the provided log does not reflect any Hard Dri... [16:42:54] 10Operations, 10Release-Engineering-Team: Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10hashar) Thank you @faidon to have managed to import the git history into operations/software/keyholder that will be very helpful in the future I guess :] [16:45:30] !log repair sde on ms-be1041 - T199198 [16:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:39] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [16:55:52] 10Operations, 10SRE-Access-Requests: Access to restbase servers (including sudo) for Imarlier - https://phabricator.wikimedia.org/T202563 (10MoritzMuehlenhoff) This was approved in the SRE meeing on Monday. [16:56:22] RECOVERY - Device not healthy -SMART- on wtp1043 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wtp1043&var-datasource=eqiad%2520prometheus%252Fops [16:56:53] (03CR) 10Hashar: [C: 031] "A bunch of nitpicks that are of no importance :]" (033 comments) [software/keyholder] - 10https://gerrit.wikimedia.org/r/460065 (owner: 10Thcipriani) [16:57:09] (03PS1) 10Muehlenhoff: Add Ian to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/460394 (https://phabricator.wikimedia.org/T202563) [16:57:41] (03CR) 10Hashar: [C: 031] "To clarify: I am fine with that change, I am not +2ing to let someone else do it and happily ignore my nitpickings." [software/keyholder] - 10https://gerrit.wikimedia.org/r/460065 (owner: 10Thcipriani) [16:58:33] (03CR) 10Muehlenhoff: [C: 032] Add Ian to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/460394 (https://phabricator.wikimedia.org/T202563) (owner: 10Muehlenhoff) [16:59:45] 10Operations: setup/install cumin2001.eqiad.wmnet - https://phabricator.wikimedia.org/T204156 (10Papaul) [16:59:48] 10Operations, 10ops-codfw: apply hostname label to cumin2001 / wmf6407 and update visible label field in racktables - https://phabricator.wikimedia.org/T204173 (10Papaul) 05Open>03Resolved Done [17:01:24] (03CR) 10Ayounsi: [C: 032] Create shell account for kharlan and add to researchers [puppet] - 10https://gerrit.wikimedia.org/r/460175 (https://phabricator.wikimedia.org/T203847) (owner: 10Ayounsi) [17:02:01] PROBLEM - Host wtp1043 is DOWN: PING CRITICAL - Packet loss = 100% [17:02:29] (03CR) 10Hashar: [C: 031] Drop legacy SSHv1 support (031 comment) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458227 (owner: 10Faidon Liambotis) [17:02:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to restbase servers (including sudo) for Imarlier - https://phabricator.wikimedia.org/T202563 (10MoritzMuehlenhoff) 05Open>03Resolved I've added Ian to restbase-roots. 
[17:02:44] (03CR) 10jerkins-bot: [V: 04-1] Drop legacy SSHv1 support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458227 (owner: 10Faidon Liambotis) [17:03:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsing-Team: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10RobH) Ok, we're reimaging this in an attempt to troubleshoot the fact it did not see SDA in the OS. Rebooted, and Crhis swapped SDA and SDB. The defective SDA disk is now located in S... [17:04:14] (03PS2) 10Ayounsi: Create shell account for kharlan and add to researchers [puppet] - 10https://gerrit.wikimedia.org/r/460175 (https://phabricator.wikimedia.org/T203847) [17:05:52] RECOVERY - Host wtp1043 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:05:55] 10Operations, 10Datacenter-Switchover-2018, 10Discovery-Search (Current work): Warn when CirrusSearch is not configured to use local DCfor an extended time - https://phabricator.wikimedia.org/T204135 (10debt) [17:06:25] 10Operations, 10Datacenter-Switchover-2018, 10Discovery-Search (Current work): Warn when CirrusSearch is not configured to use local DCfor an extended time - https://phabricator.wikimedia.org/T204135 (10debt) p:05Triage>03High We'll get this fixed before the datacenter switchback in October. 
[17:08:12] PROBLEM - parsoid on wtp1043 is CRITICAL: connect to address 10.64.48.161 and port 8000: Connection refused [17:08:12] PROBLEM - Confd template for /etc/parsoid/config-vars.yaml on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:08:22] PROBLEM - Check whether ferm is active by checking the default input chain on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:08:31] PROBLEM - Check systemd state on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:08:31] PROBLEM - Disk space on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:08:41] PROBLEM - dhclient process on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:08:42] PROBLEM - DPKG on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:08:51] PROBLEM - configured eth on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:08:52] PROBLEM - Check size of conntrack table on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:09:11] PROBLEM - confd service on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:10:01] PROBLEM - puppet last run on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:11:31] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 57.2 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:13:32] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:14:52] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 86.2 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:16:38] (03PS1) 10Alex Monk: [WIP] Check for outdated/expired certs in the main loop [software/certcentral] - 10https://gerrit.wikimedia.org/r/460397 [17:16:48] (03PS1) 10Elukey: profile::analytics::database::meta: use the same prod config in labs [puppet] - 
10https://gerrit.wikimedia.org/r/460399 (https://phabricator.wikimedia.org/T204060) [17:18:16] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12445/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/460399 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [17:18:43] (03CR) 10Hashar: [C: 031] "Sorry actually there is a better pattern I have just remembered about. That comes from Giuseppe and got used on docker-pkg." [software/keyholder] - 10https://gerrit.wikimedia.org/r/460065 (owner: 10Thcipriani) [17:18:52] PROBLEM - configured eth on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:18:52] PROBLEM - Check size of conntrack table on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:19:11] PROBLEM - confd service on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:19:19] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Check for outdated/expired certs in the main loop [software/certcentral] - 10https://gerrit.wikimedia.org/r/460397 (owner: 10Alex Monk) [17:19:21] PROBLEM - Confd template for /etc/parsoid/config-vars.yaml on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:19:32] PROBLEM - Check whether ferm is active by checking the default input chain on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:19:32] PROBLEM - Disk space on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:19:32] PROBLEM - Check systemd state on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:19:42] PROBLEM - dhclient process on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:19:52] PROBLEM - DPKG on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:20:12] PROBLEM - puppet last run on wtp1043 is CRITICAL: Return code of 255 is out of bounds [17:20:56] (03CR) 10Krinkle: [C: 031] Update npm to 6.4.0 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/453666 (https://phabricator.wikimedia.org/T169451) (owner: 10Legoktm) [17:21:02] 
scheduling downtime for that (https://phabricator.wikimedia.org/T196886) [17:21:49] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers for kharlan - https://phabricator.wikimedia.org/T203847 (10ayounsi) 05Open>03Resolved ```notebook1003:~$ id kharlan uid=19582(kharlan) gid=500(wikidev) groups=500(wikidev),714(researchers) ``` You should be all set... [17:21:53] ACKNOWLEDGEMENT - Disk space on wtp1043 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T196886 [17:22:22] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers for kharlan - https://phabricator.wikimedia.org/T203847 (10kostajh) Thank you @ayounsi! [17:25:02] !log disable cr2:xe-4/0/0 (to asw-a) for optics replacement - T203719 [17:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:13] T203719: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 [17:25:42] 10Operations, 10Wikimedia-Logstash, 10Goal: Investigate log shipping methods and standardize on them (logstash) - https://phabricator.wikimedia.org/T198757 (10fgiunchedi) >>! In T198757#4581150, @fgiunchedi wrote: > As part of the [[ https://docs.google.com/document/d/1Aq-Dhq3SbRCPmQdw6jjaHvH-KaDfuQOeuORu4aC... 
[17:29:01] (03PS37) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [17:29:03] (03PS3) 10Alex Monk: [WIP] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [17:29:10] (03PS9) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [17:29:12] PROBLEM - High CPU load on API appserver on mw2203 is CRITICAL: CRITICAL - load average: 78.47, 35.75, 21.13 [17:29:19] !log enable cr2:xe-4/0/0 (to asw-a) for optics replacement - T203719 [17:29:21] PROBLEM - High CPU load on API appserver on mw2205 is CRITICAL: CRITICAL - load average: 89.91, 45.94, 28.95 [17:29:24] !log radium - removing tor package, clearing systemd failed units, to clear Icinga alerts from this host that is to be decomed [17:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:32] PROBLEM - High CPU load on API appserver on mw2207 is CRITICAL: CRITICAL - load average: 69.27, 35.63, 23.20 [17:29:33] PROBLEM - High CPU load on API appserver on mw2138 is CRITICAL: CRITICAL - load average: 68.54, 35.20, 21.44 [17:29:39] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [17:29:57] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [17:30:11] PROBLEM - High CPU load on API appserver on mw2210 is CRITICAL: CRITICAL - load average: 68.44, 35.09, 21.53 [17:30:22] PROBLEM - High CPU load on API appserver on mw2136 is CRITICAL: CRITICAL - load average: 61.86, 35.30, 21.10 [17:31:37] !log disable cr2:xe-4/0/0 (to asw-a) for optics replacement (round 2, 1st one didn't clear the errors, need to do the other 
side) - T203719 [17:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:44] T203719: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 [17:32:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 30 probes of 318 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [17:32:17] load lowering a lot on mw2136 already, looks on the way to recovery [17:32:22] RECOVERY - High CPU load on API appserver on mw2136 is OK: OK - load average: 20.92, 29.38, 20.79 [17:32:31] RECOVERY - Filesystem available is greater than filesystem size on ms-be1041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [17:32:46] (03PS1) 10GTirloni: nfs-mount-manager: Add retry logic to check command [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) [17:33:12] RECOVERY - High CPU load on API appserver on mw2210 is OK: OK - load average: 18.25, 29.66, 22.33 [17:33:30] (03PS2) 10GTirloni: nfs-mount-manager: Add retry logic to check command [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) [17:33:42] RECOVERY - High CPU load on API appserver on mw2207 is OK: OK - load average: 18.55, 30.58, 24.70 [17:34:31] RECOVERY - High CPU load on API appserver on mw2203 is OK: OK - load average: 18.67, 30.67, 24.25 [17:34:51] RECOVERY - High CPU load on API appserver on mw2138 is OK: OK - load average: 16.81, 30.63, 24.57 [17:35:08] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) I talked with @Jgreen on IRC, he said to use the IP address of frmon2001 for frauth2001 for now since frmon2001 is not 
installed yet. [17:37:12] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 22 probes of 318 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [17:38:49] !log enable cr2:xe-4/0/0 (to asw-a) - T203719 [17:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:58] T203719: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 [17:39:32] RECOVERY - High CPU load on API appserver on mw2205 is OK: OK - load average: 12.71, 24.77, 29.04 [17:39:34] what's with the high API loadavg alerts that flapped? [17:39:46] * Krinkle staging on deployment/mwdebug2002 [17:39:52] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 150.6 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [17:39:58] (03CR) 10Dzahn: "looks good code-wise, only nitpick is you are adding some literal tabs on lines 50,51,61" [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [17:40:05] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.20/includes/htmlform/fields/HTMLCheckMatrix.php: I1f92479bf1, T203325 (duration: 00m 51s) [17:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:13] T203325: BotPasswords right selection form shows plain-text html - https://phabricator.wikimedia.org/T203325 [17:41:16] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.20/includes/widget/CheckMatrixWidget.php: I1f92479bf1, T203325 (duration: 00m 49s) [17:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:52] (03CR) 10Bstorm: "Looks good to me. 
I do figure that a grid job running at full steam, and taking up most of the processing power will slow that ls down ba" [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [17:42:18] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.20/resources/src/mediawiki.widgets/mw.widgets.CheckMatrixWidget.js: I1f92479bf1, T203325 (duration: 00m 50s) [17:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsing-Team: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10RobH) a:05Cmjohnson>03MoritzMuehlenhoff Ok, reimage has completed and OS is running with puppet run already done. No errors logged so far, leaving this open and stalled checking fo... [17:44:17] 10Operations, 10ops-eqiad, 10netops: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 (10ayounsi) 05Open>03Resolved Lot better, thanks! [17:45:50] (03PS1) 10Smalyshev: Revert "Temporarily enable debug logging for regex matches" [puppet] - 10https://gerrit.wikimedia.org/r/460402 [17:46:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsing-Team: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Dzahn) this should be merged please once it's been confirmed the host is ok again: https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/460049/ [17:47:07] (03Abandoned) 10Smalyshev: Revert "Temporarily enable debug logging for regex matches" [puppet] - 10https://gerrit.wikimedia.org/r/460402 (owner: 10Smalyshev) [17:47:23] (03CR) 10Dzahn: "current status of this: https://phabricator.wikimedia.org/T196886#4581588" [cookbooks] - 10https://gerrit.wikimedia.org/r/460049 (owner: 10Dzahn) [17:48:42] (03PS2) 10Papaul: DNS: Replace frauth2001 IP address with frmond2001 IP address [dns] - 10https://gerrit.wikimedia.org/r/460127 [17:48:58] (03PS1) 10Smalyshev: Remove debug logging for regex matches [puppet] - 
10https://gerrit.wikimedia.org/r/460403 [17:49:55] (03CR) 10Gehel: Elasticsearch module is coming up. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [17:50:46] (03PS3) 10GTirloni: nfs-mount-manager: Add retry logic to check command [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) [17:51:43] (03CR) 10GTirloni: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [17:53:24] (03CR) 10Volans: [C: 04-1] "Nice work, I think there are couple of things that could be improved and then a bunch of minor nitpicks/style/documentation thing." (0329 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [17:54:35] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476 (10Papaul) [17:58:09] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-eqiad.wikimedia.org recovered from Inbound interface errors [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180913T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:03:04] (03CR) 10Ottomata: [C: 031] "+1 I guess!" [puppet] - 10https://gerrit.wikimedia.org/r/460399 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [18:03:16] (03CR) 10Ottomata: [C: 031] "I don't really even remember why we have 2, if this will work in both." 
[puppet] - 10https://gerrit.wikimedia.org/r/460399 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [18:11:56] PROBLEM - parsoid on wtp1043 is CRITICAL: connect to address 10.64.48.161 and port 8000: Connection refused [18:12:34] (03CR) 10Volans: "replies inline" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [18:13:46] PROBLEM - puppet last run on wtp1043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 25 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[parsoid/deploy] [18:15:30] (03CR) 10Gehel: Elasticsearch module is coming up. (038 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [18:18:16] RECOVERY - parsoid on wtp1043 is OK: HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.089 second response time [18:18:56] RECOVERY - puppet last run on wtp1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:27:17] 10Operations, 10fundraising-tech-ops, 10netops: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 (10ayounsi) p:05Triage>03Normal [18:40:28] !log updating f/w cloudvirt1019 disabled icinga checks [18:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:26] (03CR) 10Gehel: [C: 032] Remove debug logging for regex matches [puppet] - 10https://gerrit.wikimedia.org/r/460403 (owner: 10Smalyshev) [19:08:49] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @andrewbogott and @Bstorm I ran the HP Service pack on this server, several things were updated including the raid card firmware. Please let me know if the problem with the battery not being pre... 
[19:09:48] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) icinga shows it recharging WARNING: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: Recharging [19:24:39] !log moving puppet-compiler.wmlabs.org proxy from puppet3-diffs to puppet-diffs project T191438 [19:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:47] T191438: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 [19:25:16] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) attempted to update bios but the update does not run and the server remains stuck in the "Loading BIOS Drivers" during post [19:29:35] (03PS1) 10Zhuyifei1999: quarry::database: Use class declaration for mariadb::packages [puppet] - 10https://gerrit.wikimedia.org/r/460416 (https://phabricator.wikimedia.org/T181205) [19:29:39] (03PS1) 10Ottomata: Refactor refine_job to use new spark_job and ConfigHelper properties [puppet] - 10https://gerrit.wikimedia.org/r/460417 (https://phabricator.wikimedia.org/T203804) [19:30:24] (03CR) 10jerkins-bot: [V: 04-1] Refactor refine_job to use new spark_job and ConfigHelper properties [puppet] - 10https://gerrit.wikimedia.org/r/460417 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [19:31:42] !log shutting down old puppet compiler instances in puppet3-diffs project T191438 [19:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:50] T191438: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 [19:35:01] (03PS2) 10Ottomata: Refactor refine_job to use new spark_job and ConfigHelper properties [puppet] - 10https://gerrit.wikimedia.org/r/460417 (https://phabricator.wikimedia.org/T203804) [19:35:34] (03CR) 10jerkins-bot: [V: 04-1] Refactor 
refine_job to use new spark_job and ConfigHelper properties [puppet] - 10https://gerrit.wikimedia.org/r/460417 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [19:36:54] (03PS3) 10Ottomata: Refactor refine_job to use new spark_job and ConfigHelper properties [puppet] - 10https://gerrit.wikimedia.org/r/460417 (https://phabricator.wikimedia.org/T203804) [19:37:23] (03PS4) 10Ottomata: Refactor refine_job to use new spark_job and ConfigHelper properties [puppet] - 10https://gerrit.wikimedia.org/r/460417 (https://phabricator.wikimedia.org/T203804) [19:45:34] (03PS5) 10Ottomata: Refactor refine_job to use new spark_job and ConfigHelper properties [puppet] - 10https://gerrit.wikimedia.org/r/460417 (https://phabricator.wikimedia.org/T203804) [19:50:26] (03PS6) 10Ottomata: Refactor refine_job to use new spark_job and ConfigHelper properties [puppet] - 10https://gerrit.wikimedia.org/r/460417 (https://phabricator.wikimedia.org/T203804) [19:52:15] (03PS3) 10Dzahn: DNS: Reuse IP of frmon2001 for frauth2001 [dns] - 10https://gerrit.wikimedia.org/r/460127 (owner: 10Papaul) [19:54:57] (03CR) 10Jgreen: [C: 032] DNS: Reuse IP of frmon2001 for frauth2001 [dns] - 10https://gerrit.wikimedia.org/r/460127 (owner: 10Papaul) [19:55:07] (03PS4) 10Jgreen: DNS: Reuse IP of frmon2001 for frauth2001 [dns] - 10https://gerrit.wikimedia.org/r/460127 (owner: 10Papaul) [19:55:28] (03CR) 10Jgreen: [V: 032 C: 032] DNS: Reuse IP of frmon2001 for frauth2001 [dns] - 10https://gerrit.wikimedia.org/r/460127 (owner: 10Papaul) [19:55:48] papaul: ^ :) [19:55:51] thanks Jeff [19:57:38] (03PS4) 10GTirloni: nfs-mount-manager: Add retry logic to check command [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) [19:57:58] (03PS5) 10GTirloni: nfs-mount-manager: Add retry logic to check command [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) [19:58:37] (03CR) 10Ottomata: "LoOKS GREAT!" 
[puppet] - 10https://gerrit.wikimedia.org/r/460417 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [19:59:14] (03CR) 10Bstorm: nfs-mount-manager: Add retry logic to check command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [20:01:45] (03CR) 10GTirloni: "> Patch Set 5:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [20:03:26] (03CR) 10GTirloni: "> Patch Set 5:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [20:09:26] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:09:37] (03PS1) 10Volans: setup.py: add missing fields [software/spicerack] - 10https://gerrit.wikimedia.org/r/460425 (https://phabricator.wikimedia.org/T199079) [20:09:39] (03PS1) 10Volans: tests: improve prospector tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/460426 (https://phabricator.wikimedia.org/T199079) [20:10:43] (03CR) 10jerkins-bot: [V: 04-1] tests: improve prospector tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/460426 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [20:11:48] 10Operations: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) [20:12:27] 10Operations, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) [20:13:56] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:16:15] (03PS1) 10GTirloni: nfs-mount-manager - Switch from using ls to stat [puppet] - 
10https://gerrit.wikimedia.org/r/460429 (https://phabricator.wikimedia.org/T161898) [20:16:59] (03Abandoned) 10GTirloni: nfs-mount-manager: Add retry logic to check command [puppet] - 10https://gerrit.wikimedia.org/r/460401 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [20:24:41] (03PS1) 10EBernhardson: Add CirrusSearch cluster name to siteinfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460430 (https://phabricator.wikimedia.org/T204135) [20:24:52] 10Operations, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) [20:24:57] (03CR) 10Bstorm: [C: 032] "let's see what happens!" [puppet] - 10https://gerrit.wikimedia.org/r/460429 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [20:25:08] (03PS2) 10Bstorm: nfs-mount-manager - Switch from using ls to stat [puppet] - 10https://gerrit.wikimedia.org/r/460429 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [20:25:29] (03CR) 10GTirloni: [V: 032] nfs-mount-manager - Switch from using ls to stat [puppet] - 10https://gerrit.wikimedia.org/r/460429 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [20:30:31] (03CR) 10DCausse: Elasticsearch module is coming up. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [20:31:45] (03CR) 10DCausse: [C: 031] Add CirrusSearch cluster name to siteinfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460430 (https://phabricator.wikimedia.org/T204135) (owner: 10EBernhardson) [20:32:10] 10Operations, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) a:05Dzahn>03None [20:34:14] 10Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (10Eevans) As an update: These devices [[ https://github.com/torvalds/linux/blob/master/drivers/ata/libata-core.c#L4572 | are still blacklisted ]] AFAICT. 
[20:36:06] zhuyifei1999_: i can merge it if you like, +1.. if you can check on quarry instances afterwards [20:36:21] yes, should be no difference.. same packages [20:37:52] (03PS1) 10Urbanecm: New throttle rule for enwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460432 (https://phabricator.wikimedia.org/T204243) [20:38:15] (03CR) 10Dzahn: [C: 031] "yea, that would work too if you'd rather not repeat the package names and since it's already using that module anyways" [puppet] - 10https://gerrit.wikimedia.org/r/460416 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999) [20:40:16] (03CR) 10Dzahn: [C: 031] "i'll merge it when we are both on IRC just to confirm. but really should be noop" [puppet] - 10https://gerrit.wikimedia.org/r/460416 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999) [20:41:45] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) [20:46:22] (03CR) 10Framawiki: New throttle rule for enwiki event (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460432 (https://phabricator.wikimedia.org/T204243) (owner: 10Urbanecm) [20:46:59] 10Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (10faidon) As I mentioned above in my second-to-last update, they are blacklisted for //queued// TRIM which is suboptimal of course. However, the data corruption issues with //synchronous// TRIM have been lo... 
[20:58:47] mutante: ok thanks :) (and I'm on)
[20:58:56] (PS1) Dzahn: tor_relay: add Bacula backups of Tor keys [puppet] - https://gerrit.wikimedia.org/r/460437
[20:59:58] (PS2) Dzahn: quarry::database: Use class declaration for mariadb::packages [puppet] - https://gerrit.wikimedia.org/r/460416 (https://phabricator.wikimedia.org/T181205) (owner: Zhuyifei1999)
[21:00:07] zhuyifei1999_: doing :)
[21:00:46] (CR) Dzahn: [C: 2] quarry::database: Use class declaration for mariadb::packages [puppet] - https://gerrit.wikimedia.org/r/460416 (https://phabricator.wikimedia.org/T181205) (owner: Zhuyifei1999)
[21:01:54] merged on the master.. how long did it take last time until you saw effects on the VPS?
[21:02:23] (well, we wouldn't see one in this case unless we really overlooked something)
[21:02:47] lol
[21:03:25] nothing interesting indeed
[21:03:31] Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (Eevans) >>! In T89584#4582113, @faidon wrote: > As I mentioned above in my second-to-last update, they are blacklisted for //queued// TRIM which is suboptimal of course. However, the data corruption issue...
[21:03:35] good to know if it does nothing because it's not in sync yet or because the code does nothing :)
[21:06:15] modules/role/manifests/simplelamp.pp: class { '::mysql::server':
[21:06:26] some day we should also convert that ..
[21:07:05] (PS7) Ppchelko: Replace the semver patch version in Accept with 0 [puppet] - https://gerrit.wikimedia.org/r/455036 (https://phabricator.wikimedia.org/T202682)
[21:07:14] but we would have to coordinate with all of: https://tools.wmflabs.org/openstack-browser/puppetclass/role::simplelamp
[21:09:10] (PS2) Volans: tests: improve prospector tests [software/spicerack] - https://gerrit.wikimedia.org/r/460426 (https://phabricator.wikimedia.org/T199079)
[21:09:31] I wouldn't know which one of 'it's not in sync yet' or 'the code does nothing' :P
[21:10:49] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) another place where the mysql module is used is "m" in "role(simplelamp)" `modules/role/manifests/simplelamp.pp: class { '...
[21:11:34] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (zhuyifei1999)
[21:11:37] Operations, Quarry, Patch-For-Review, cloud-services-team (Kanban): Let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (zhuyifei1999) Open>Resolved a:zhuyifei1999
[21:12:20] zhuyifei1999_: well.. at least we know last time, when there was a change to see, it didn't take longer than a few minutes, right
[21:12:34] very nice how that is resolved now
[21:12:48] :)
[21:14:52] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) quarry switched now to mariadb, subtask resolved. updating the list in the ticket description here. thanks to @zhuyifei1999 fo...
[21:15:13] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn)
[21:15:29] Operations, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Alert when elasticsearch has shards larger than a maximum size - https://phabricator.wikimedia.org/T203546 (debt) Open>Resolved
[21:16:57] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn)
[21:17:18] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn)
[21:26:53] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn)
[21:28:02] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn)
[21:30:40] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) i think for the ones in the statistics and wikimetrics we should ask Analytics how they feel about converting that to mariadb....
[21:32:33] Operations, Performance-Team, netops: Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (ayounsi) p:Triage>Normal
[21:33:15] !log running fstrim --all on restbase1007 -- T89584
[21:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:23] T89584: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584
[21:35:10] !log running fstrim --all on restbase1011 -- T89584
[21:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:32] heh :)
[21:37:33] nice!
[21:37:47] urandom: lmk how it goes :)
[21:37:56] :)
[21:38:06] paravoid: Top Gun Movie Clip "Tom Cruise is Dangerous"
[21:38:06] Top Gun Movie Clip "Tom Cruise is Dangerous"
[21:38:08] https://youtu.be/OFkutlswBY0
[21:38:17] yeesh, paste fail
[21:44:09] (PS2) Volans: mediawiki: improve siteinfo checks [software/spicerack] - https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079)
[21:44:20] Operations, Performance-Team, netops: Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (Imarlier) Sounds interesting. Keep Perf in the loop as you start to think about how to do this, and what your target geos might be.
[21:46:05] (CR) Volans: "replies inline" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/460114 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[21:49:35] paravoid: I don't think it's doing anything on the HP hosts, even the ones where we put the controller in HBA mode
[21:50:33] Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (Eevans) First pass on restbase1007 running `fstrim --all` took several seconds to complete (a noteworthy delay, 10 seconds?) and exited 0. When run on restbase1011, the command completed almost instantan...
[21:51:08] blergh
[21:51:12] !log running fstrim --all on restbase1008 -- T89584
[21:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:20] T89584: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584
[21:51:33] (CR) Volans: Add CirrusSearch cluster name to siteinfo (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/460430 (https://phabricator.wikimedia.org/T204135) (owner: EBernhardson)
[21:52:54] Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (Eevans) Here is restbase1008 (first path) using `--verbose` ```lang=shell-session eevans@restbase1008:~$ sudo fstrim --verbose --all /srv/cassandra/instance-data: 24.6 GiB (26378080256 bytes) trimmed /sr...
[21:55:57] !log running fstrim --all on restbase1013 -- T89584
[21:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:19] urandom: lsblk -D would show this
[21:56:38] and indeed, 1011 shows "0"
[21:56:42] not sure which ones are on HBA mode, though
[21:56:59] (PS1) Zhuyifei1999: quarry: Get rid of ::labs_debrepo [puppet] - https://gerrit.wikimedia.org/r/460446 (https://phabricator.wikimedia.org/T153615)
[21:57:28] urandom: more importantly, do you see any differences in terms of I/O throughput, I/O wait etc.?
[21:57:28] 1011 & 1013 & 1015 are examples of HPs that AFAIK are configured as HBA
[21:57:31] no
[21:57:51] but it's low on these machines so far
[21:58:16] i mean, it was relative low to begin with
[21:58:34] still high compared to the Intel-equipped machines, and no change that I can see since running it
[21:58:53] and iowait, i mean
[21:59:30] mutante: mind checking https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460446/ ?
[22:02:36] (CR) Dzahn: [C: 2] "ack, duplicate inclusion of profile base confirmed.
and i know you are not using those packages anymore now" [puppet] - https://gerrit.wikimedia.org/r/460446 (https://phabricator.wikimedia.org/T153615) (owner: Zhuyifei1999)
[22:02:38] urandom: yeah there's very little traffic on those boxes
[22:03:02] (CR) EBernhardson: Add CirrusSearch cluster name to siteinfo (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/460430 (https://phabricator.wikimedia.org/T204135) (owner: EBernhardson)
[22:04:06] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/460430 (https://phabricator.wikimedia.org/T204135) (owner: EBernhardson)
[22:04:15] thanks :)
[22:05:13] zhuyifei1999_: done. i linked that ticket to the "big migration" ticket
[22:10:49] Puppet, Cloud-Services: Retire and remove module labs_debrepo - https://phabricator.wikimedia.org/T153612 (zhuyifei1999)
[22:29:04] paravoid: there is, but there continues to be an unjustifiably high iowait
[22:29:22] like, these SSDs give you a graph to look at even when the load is light :)
[22:30:37] paravoid: https://grafana.wikimedia.org/dashboard/db/cassandra-system?panelId=16&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=restbase&var-server=restbase1007:9100&var-server=restbase1016:9100&var-disk=sda&var-disk=sdb&var-disk=sdc&var-disk=sdd&var-disk=sde&from=now-3h&to=now
[22:31:01] that's one Intel-equipped v one Samsung
[22:31:08] (one that was trimmed)
[22:31:38] which happened at 21:33
[22:31:43] so there might be some change there
[22:32:16] I mean, it's not setting the world on fire, but there haven't been any of those > 2% spikes
[22:41:34] Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (Eevans) Here is 1007 after an hour (along with 1016 which is Intel-equipped, for comparison). It's not worse. :) | {F25830662} | | `fstrim --all` @ 21:33 | We should probably also try a couple of hosts...
[22:55:28] (PS1) Andrew Bogott: wmcs region-migrate: after migration force the puppet cert to regenerate [puppet] - https://gerrit.wikimedia.org/r/460450
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180913T2300).
[23:00:04] Urbanecm: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:04:55] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:09:16] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:11:49] (CR) Andrew Bogott: [C: 2] wmcs region-migrate: after migration force the puppet cert to regenerate [puppet] - https://gerrit.wikimedia.org/r/460450 (owner: Andrew Bogott)
[23:18:45] (PS1) EBernhardson: Mysql client not available on mwdebug* [puppet] - https://gerrit.wikimedia.org/r/460451
[23:19:36] (CR) jerkins-bot: [V: -1] Mysql client not available on mwdebug* [puppet] - https://gerrit.wikimedia.org/r/460451 (owner: EBernhardson)