[00:00:04] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200416T0000). [00:09:42] RECOVERY - PHP opcache health on mw2296 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:34:21] (PS1) VolkerE: Use Wikimedia Design Style Guide colors and use relative font-size [mediawiki-config] - https://gerrit.wikimedia.org/r/589165 [00:51:36] Operations, Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (leila) @bmansurov please fill out the task description for T250335 and add LDAP-Access-Requests as a tag. [00:59:14] PROBLEM - PHP opcache health on mw2291 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:00:25] milimetric: do you know if this rcfeed change would need anything to happen on the EventStreams side ? https://gerrit.wikimedia.org/r/#/c/589166/ [01:00:32] afaik we allow extra properties, right? [01:17:50] RECOVERY - PHP opcache health on mw2291 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:36:26] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [01:47:36] Operations, MediaWiki-Cache, Page Content Service, Product-Infrastructure-Team-Backlog, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (AntiCompositeNumber) https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Edi...
[01:48:16] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [02:30:12] Krinkle, I’m not sure without reading through the code, I’ll add Andrew to the review [04:46:38] (PS1) Vgutierrez: Release 8.0.7-rc0-1wm3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/589182 (https://phabricator.wikimedia.org/T249335) [04:59:33] (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov) [05:30:51] (PS1) Marostegui: mariadb: Decommission dbproxy1011 [puppet] - https://gerrit.wikimedia.org/r/589185 (https://phabricator.wikimedia.org/T249590) [05:31:40] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/589021 (https://phabricator.wikimedia.org/T250259) (owner: Jbond) [05:32:48] (PS1) Marostegui: wmnet: Remove production dns entry for dbproxy1011 [dns] - https://gerrit.wikimedia.org/r/589186 (https://phabricator.wikimedia.org/T249590) [05:33:58] !log restart hadoop-yarn-nodemanager on an-worker108[4,5] - failed after GC OOM events (heavy spark jobs) [05:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:20] RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [05:34:56] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [05:36:44] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users
for Jim Maddock - https://phabricator.wikimedia.org/T249873 (MoritzMuehlenhoff) @fgiunchedi Contractors should be in the cn=wmf group, not cn=nda. [05:39:06] PROBLEM - PHP opcache health on mw2298 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:42:00] (CR) Muehlenhoff: [C: +1] "For the WDQS use case it's just a matter of adapting the docs, instead of manually adding an iptables rule, a respective ferm rule needs" [puppet] - https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: Jbond) [05:43:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reorganize s8 weights a little bit after the addition of the new host db1114', diff saved to https://phabricator.wikimedia.org/P10995 and previous config saved to /var/cache/conftool/dbconfig/20200416-054353-marostegui.json [05:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:11] (PS1) Marostegui: Revert "db-eqiad.php: Emergency pool pc1010" [mediawiki-config] - https://gerrit.wikimedia.org/r/589187 [05:49:21] (CR) jerkins-bot: [V: -1] Revert "db-eqiad.php: Emergency pool pc1010" [mediawiki-config] - https://gerrit.wikimedia.org/r/589187 (owner: Marostegui) [05:49:29] (Abandoned) Marostegui: Revert "db-eqiad.php: Emergency pool pc1010" [mediawiki-config] - https://gerrit.wikimedia.org/r/589187 (owner: Marostegui) [05:50:22] Operations, observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (elukey) Cluster still yellow, 15 unassigned shards :( It seems that 1010 holds way more shards than the other two: ` elukey@logstash1008:~$ curl -s 'http://localhost:9200/_cat/sha...
[05:52:48] (PS1) Marostegui: db-eqiad.php: Restore pc1008 as pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/589188 (https://phabricator.wikimedia.org/T247787) [05:54:32] Operations, DBA, Patch-For-Review, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) Let's move pc1008 back to pc2 master. Also, as pc1010 is replicating still from pc1, its disk was around 88... [05:55:24] Operations, observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (elukey) Explain currently advice some things: * logstash1012-production-logstash-eqiad: `the shard cannot be allocated to the same node on which a copy of the shard already exists... [05:59:24] RECOVERY - PHP opcache health on mw2298 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:12:18] PROBLEM - PHP opcache health on mw2300 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:15:58] !log installing git security updates on jessie [06:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:19] !log installing icu security updates on jessie [06:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:48] RECOVERY - PHP opcache health on mw2300 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:33:30] !log installing apache-log4j1.2 security updates on jessie [06:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:50] Operations, Analytics: kafka-jumbo1003 /srv disk space usage over 90% - https://phabricator.wikimedia.org/T250347 (elukey) p:Triage→High [06:36:45] Operations, Analytics:
kafka-jumbo1003 /srv disk space usage over 90% - https://phabricator.wikimedia.org/T250347 (elukey) One of the issues seems to be: ` 1.1T atskafka_test_webrequest_text-0 1.1T atskafka_test_webrequest_text-1 ` We currently only send traffic to that topic from a single cp3xxx host,... [06:41:48] PROBLEM - PHP opcache health on mw2294 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:44:21] (PS1) Muehlenhoff: udp2log: Switch ferm::rule to the more canonical ferm::service [puppet] - https://gerrit.wikimedia.org/r/589193 [06:52:20] Operations, MediaWiki-Cache, Page Content Service, Product-Infrastructure-Team-Backlog, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (ema) >>! In T249325#6060912, @AntiCompositeNumber wrote: > https://en.wikipedia.org/wiki/... [06:54:07] Operations, ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250293 (fgiunchedi) [06:54:09] Operations, ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (fgiunchedi) [06:57:55] (CR) Ema: [C: +1] Release 8.0.7-rc0-1wm3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/589182 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez) [07:05:58] (CR) Marostegui: WIP - Introduce profile::mariadb::misc::analytics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: Elukey) [07:06:45] (PS1) DannyS712: Turn off direct account creations at Testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/589216 (https://phabricator.wikimedia.org/T250348) [07:06:47] (PS2) Vgutierrez: Release 8.0.7-rc0-1wm3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/589182 (https://phabricator.wikimedia.org/T249335) [07:07:09] Operations,
SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (fgiunchedi) >>! In T249873#6061223, @MoritzMuehlenhoff wrote: > @fgiunchedi Contractors should be in the cn=wmf group, not cn=nda. Thank you, {{done}} [07:07:21] (PS2) DannyS712: Turn off direct account creations at Testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/589216 (https://phabricator.wikimedia.org/T250348) [07:11:28] RECOVERY - PHP opcache health on mw2294 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:15:24] !log volker-e@deploy1001 Started deploy [design/style-guide@2a7cc4a]: Deploy design/style-guide: [07:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:32] !log volker-e@deploy1001 Finished deploy [design/style-guide@2a7cc4a]: Deploy design/style-guide: (duration: 00m 08s) [07:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:54] (CR) Jcrespo: [C: +1] "Ok, but be ready to revert quickly."
[mediawiki-config] - https://gerrit.wikimedia.org/r/589188 (https://phabricator.wikimedia.org/T247787) (owner: Marostegui) [07:16:18] (PS1) Ema: Add metric 'purged_frontend_backlog' [software/purged] - https://gerrit.wikimedia.org/r/589217 [07:16:20] (PS1) Ema: Add metric 'purged_htcp_packets_total' [software/purged] - https://gerrit.wikimedia.org/r/589218 (https://phabricator.wikimedia.org/T249583) [07:19:21] (CR) RhinosF1: [C: +1] Turn off direct account creations at Testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/589216 (https://phabricator.wikimedia.org/T250348) (owner: DannyS712) [07:22:46] (CR) Vgutierrez: Add metric 'purged_frontend_backlog' (2 comments) [software/purged] - https://gerrit.wikimedia.org/r/589217 (owner: Ema) [07:24:07] (CR) Vgutierrez: Add metric 'purged_htcp_packets_total' (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/589218 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [07:26:40] (PS2) Ema: multicast: set read buffer [software/purged] - https://gerrit.wikimedia.org/r/589049 (https://phabricator.wikimedia.org/T249583) [07:27:10] (PS2) Ema: Add metric 'purged_frontend_backlog' [software/purged] - https://gerrit.wikimedia.org/r/589217 (https://phabricator.wikimedia.org/T249583) [07:30:51] (CR) Vgutierrez: [C: +1] multicast: set read buffer [software/purged] - https://gerrit.wikimedia.org/r/589049 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [07:30:53] (PS3) Ema: Add metric 'purged_frontend_backlog' [software/purged] - https://gerrit.wikimedia.org/r/589217 (https://phabricator.wikimedia.org/T249583) [07:31:19] (CR) Vgutierrez: [C: +2] Release 8.0.7-rc0-1wm3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/589182 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez) [07:31:21] (CR) Ema: [V: +2 C: +2] multicast: set read buffer [software/purged] -
https://gerrit.wikimedia.org/r/589049 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [07:48:47] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/21967/mwlog1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/589193 (owner: Muehlenhoff) [07:49:18] !log upload trafficserver 8.0.7-rc0-1wm3 to apt.wm.o (buster) - T249335 [07:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:25] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [07:50:24] PROBLEM - PHP opcache health on mw2295 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:50:39] !log rolling update ats to version 8.0.7-rc0-1wm3 in cp[4026,4032,5006,5012] - T249335 [07:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:04] (CR) Dzahn: "ferm config changes but the actual iptables output stayed identical" [puppet] - https://gerrit.wikimedia.org/r/589193 (owner: Muehlenhoff) [07:54:31] (PS3) Dzahn: ci::firewall: replace ferm::rule with ferm::service [puppet] - https://gerrit.wikimedia.org/r/589038 [07:57:28] (PS4) Ema: Add layer to 'purged_backlog' [software/purged] - https://gerrit.wikimedia.org/r/589217 (https://phabricator.wikimedia.org/T249583) [07:57:33] (CR) jerkins-bot: [V: -1] ci::firewall: replace ferm::rule with ferm::service [puppet] - https://gerrit.wikimedia.org/r/589038 (owner: Dzahn) [07:57:39] (CR) Dzahn: ci::firewall: replace ferm::rule with ferm::service (2 comments) [puppet] - https://gerrit.wikimedia.org/r/589038 (owner: Dzahn) [07:58:10] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:58:20] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:58:37] (CR) Urbanecm: [C: +1] Turn off direct account creations at Testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/589216 (https://phabricator.wikimedia.org/T250348) (owner: DannyS712) [07:58:44] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:58:48] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:59:34] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:59:43] (CR) Ema: Add layer to 'purged_backlog' (2 comments) [software/purged] - https://gerrit.wikimedia.org/r/589217 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [08:00:08] (CR) Kormat: [C: +1] db-eqiad.php: Restore pc1008 as pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/589188 (https://phabricator.wikimedia.org/T247787) (owner: Marostegui) [08:01:09] XioNoX: ^ i can't tell which circuit it is but see BFD alerts above [08:01:31] looks like GTT [08:01:39] they're backup everywhere [08:01:51] ok, ack [08:02:06] there is a planned maintenance email too [08:02:11] (CR) Vgutierrez: [C: +1] Add layer to 'purged_backlog' [software/purged] - https://gerrit.wikimedia.org/r/589217 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [08:02:22] ok, good [08:03:12] Operations, observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (fgiunchedi) I don't think it is just a matter of rebalancing the shards, all three hosts have the same specs and will eventually fill again until they can't allocate shards again....
[08:04:25] (PS2) Ema: Add metric 'purged_htcp_packets_total' [software/purged] - https://gerrit.wikimedia.org/r/589218 (https://phabricator.wikimedia.org/T249583) [08:04:36] !log mw1396 - restarted apache [08:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:55] (PS1) Jcrespo: database-backups: Set backup2002 as a generator of es* host dumps [puppet] - https://gerrit.wikimedia.org/r/589263 (https://phabricator.wikimedia.org/T79922) [08:05:24] (CR) Ema: Add metric 'purged_htcp_packets_total' (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/589218 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [08:06:05] (CR) Vgutierrez: [C: +1] Add metric 'purged_htcp_packets_total' [software/purged] - https://gerrit.wikimedia.org/r/589218 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [08:06:12] thanks for the heads-up though [08:06:20] (PS3) Urbanecm: Remove grants for tboverride and tboverride-account [mediawiki-config] - https://gerrit.wikimedia.org/r/582531 (https://phabricator.wikimedia.org/T241114) (owner: JJMC89) [08:06:26] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/582531 (https://phabricator.wikimedia.org/T241114) (owner: JJMC89) [08:06:34] sure, np [08:06:55] !log mw1396 - restarted php7.2-fpm - was: 503 Service Unavailable - header 'X-Powered-By: PHP/7.'
not found on 'http://en.wikipedia.org:80/wiki/Main_Page' [08:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:53] (CR) Ema: [C: +2] Add layer to 'purged_backlog' [software/purged] - https://gerrit.wikimedia.org/r/589217 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [08:07:59] (CR) Ema: [C: +2] Add metric 'purged_htcp_packets_total' [software/purged] - https://gerrit.wikimedia.org/r/589218 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [08:08:04] (CR) jerkins-bot: [V: -1] database-backups: Set backup2002 as a generator of es* host dumps [puppet] - https://gerrit.wikimedia.org/r/589263 (https://phabricator.wikimedia.org/T79922) (owner: Jcrespo) [08:08:22] RECOVERY - Apache HTTP on mw1396 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:08:22] RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 200 OK - 75875 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:08:54] RECOVERY - PHP opcache health on mw2295 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:10:40] (PS1) Ema: 0.7: add -mcast_bufsize and new metrics [software/purged] - https://gerrit.wikimedia.org/r/589264 (https://phabricator.wikimedia.org/T249583) [08:12:11] Operations, DBA, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) [08:12:46] (PS1) Urbanecm: Remove broken groupOverrides from amwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/589265 (https://phabricator.wikimedia.org/T249585) [08:12:53] (CR) Kormat: [C: +2] db-eqiad.php: Restore pc1008 as pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/589188
(https://phabricator.wikimedia.org/T247787) (owner: Marostegui) [08:13:01] Operations, DBA, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) Open→Resolved a:jcrespo I am going to close this, leave pending work at T238048. [08:13:34] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:13:40] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:13:44] (Merged) jenkins-bot: db-eqiad.php: Restore pc1008 as pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/589188 (https://phabricator.wikimedia.org/T247787) (owner: Marostegui) [08:14:20] kormat: ^ looks merged, let's proceed [08:14:24] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:14:38] (CR) Urbanecm: "@Ladsgroup You actually added it in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/422181/3/wmf-config/InitialiseSettings."
[mediawiki-config] - https://gerrit.wikimedia.org/r/589265 (https://phabricator.wikimedia.org/T249585) (owner: Urbanecm) [08:14:50] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:15:00] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:15:43] (PS1) Jcrespo: mariadb-backups: Move all backup config templates to its own subdir [puppet] - https://gerrit.wikimedia.org/r/589266 (https://phabricator.wikimedia.org/T79922) [08:15:55] (CR) Ema: [C: +2] 0.7: add -mcast_bufsize and new metrics [software/purged] - https://gerrit.wikimedia.org/r/589264 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [08:17:04] (PS2) Jcrespo: database-backups: Set backup2002 as a generator of es* host dumps [puppet] - https://gerrit.wikimedia.org/r/589263 (https://phabricator.wikimedia.org/T79922) [08:17:08] (PS4) Dzahn: ci::firewall: replace ferm::rule with ferm::service [puppet] - https://gerrit.wikimedia.org/r/589038 [08:17:56] (CR) DannyS712: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/589265 (https://phabricator.wikimedia.org/T249585) (owner: Urbanecm) [08:18:23] !log kormat@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1008 as pc2 master T247787 (duration: 01m 08s) [08:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:29] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 [08:21:57] !log Set email for Geraki@grwikimedia (T245911) [08:21:59] !log upload purged 0.7 to buster-wikimedia T249583 [08:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:04] T245911: Create a wiki for Wikimedia Community User Group Greece - https://phabricator.wikimedia.org/T245911
[08:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:10] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 [08:22:30] !log cp3050: upgrade purged to 0.7 T249583 [08:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:54] !log Disconnect pc1008 replication from pc1010 T247787 [08:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:00] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 [08:36:12] Operations, DBA, Patch-For-Review, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Kormat) pc1008 is back as pc2 master and pc1010 will be cleaned up, purged and then back to the spare pool [08:36:40] (CR) Ladsgroup: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/589265 (https://phabricator.wikimedia.org/T249585) (owner: Urbanecm) [08:37:43] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/21971/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/589038 (owner: Dzahn) [08:37:47] (CR) Dzahn: [C: +2] ci::firewall: replace ferm::rule with ferm::service [puppet] - https://gerrit.wikimedia.org/r/589038 (owner: Dzahn) [08:40:03] Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Kormat) [08:41:12] (CR) Dzahn: "[contint1001:~] $ sudo iptables -L | grep localhost" [puppet] - https://gerrit.wikimedia.org/r/589038 (owner: Dzahn) [08:43:14] (PS6) Dzahn: gerrit: remove unused parameter for cache_text_nodes [puppet] - https://gerrit.wikimedia.org/r/588699 [08:44:14] (CR) Marostegui: database-backups: Set backup2002 as a generator of es* host dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/589263
(https://phabricator.wikimedia.org/T79922) (owner: Jcrespo) [08:45:40] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/21972/gerrit1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/588699 (owner: Dzahn) [08:46:07] Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) I confirm I can see Stephen on VictorOps as "Oncall SRE" [08:46:48] (PS3) Jcrespo: database-backups: Set backup2002 as a generator of es* host dumps [puppet] - https://gerrit.wikimedia.org/r/589263 (https://phabricator.wikimedia.org/T79922) [08:47:10] (PS1) Ema: cache: move text@esams to purged [puppet] - https://gerrit.wikimedia.org/r/589270 (https://phabricator.wikimedia.org/T249325) [08:47:19] (CR) Jcrespo: "Done" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/589263 (https://phabricator.wikimedia.org/T79922) (owner: Jcrespo) [08:49:00] Operations, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T250350 (XPvistaseven2020-2) [08:49:43] Operations, SRE-Access-Requests: Requesting access to datasets for XPvistaseven2020 - https://phabricator.wikimedia.org/T250350 (XPvistaseven2020-2) [08:49:46] Operations, SRE-Access-Requests: Requesting access to datasets for XPvistaseven2020 - https://phabricator.wikimedia.org/T250350 (Dzahn) [08:51:22] (CR) Marostegui: [C: +1] database-backups: Set backup2002 as a generator of es* host dumps [puppet] - https://gerrit.wikimedia.org/r/589263 (https://phabricator.wikimedia.org/T79922) (owner: Jcrespo) [08:52:00] (CR) Jcrespo: [C: +2] database-backups: Set backup2002 as a generator of es* host dumps [puppet] - https://gerrit.wikimedia.org/r/589263 (https://phabricator.wikimedia.org/T79922) (owner: Jcrespo) [08:53:39] (PS2) Jcrespo: mariadb-backups: Move all backup config templates to its own subdir [puppet] - https://gerrit.wikimedia.org/r/589266
(https://phabricator.wikimedia.org/T79922) [08:54:16] (CR) Jcrespo: [V: +2 C: +2] mariadb-backups: Move all backup config templates to its own subdir [puppet] - https://gerrit.wikimedia.org/r/589266 (https://phabricator.wikimedia.org/T79922) (owner: Jcrespo) [08:59:06] (CR) Ema: [C: +2] cache: move text@esams to purged [puppet] - https://gerrit.wikimedia.org/r/589270 (https://phabricator.wikimedia.org/T249325) (owner: Ema) [09:00:04] (PS1) Elukey: atskafka::instance: add snappy compression by default [puppet] - https://gerrit.wikimedia.org/r/589271 (https://phabricator.wikimedia.org/T250347) [09:02:56] (PS2) Elukey: atskafka::instance: add snappy compression by default [puppet] - https://gerrit.wikimedia.org/r/589271 (https://phabricator.wikimedia.org/T250347) [09:06:32] PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:28] that's me ^ [09:08:20] RECOVERY - Check systemd state on cp3052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:42] Operations, observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (fgiunchedi) And indeed the last mention of e.g. `udp_localhost-info` in the log-cleaner logs is from Feb 28th ` root@cumin1001:~# cumin 'O:logstash::elasticsearch and logstash1*' '...
[09:08:54] if anyone wants kafka riddles ^ [09:16:08] (PS1) Dzahn: phabricator: fix database query for tasks with due date in the past [puppet] - https://gerrit.wikimedia.org/r/589272 (https://phabricator.wikimedia.org/T249807) [09:16:34] !log starting es backups on backup2002 T79922 [09:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:40] T79922: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 [09:16:58] ^ CC marostegui kormat [09:16:58] (PS2) Dzahn: phabricator: fix database query for tasks with due date in the past [puppet] - https://gerrit.wikimedia.org/r/589272 (https://phabricator.wikimedia.org/T249807) [09:17:04] !log text@esams: stop vhtcpd, start purged T249325 [09:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:09] T249325: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 [09:19:36] (PS3) Dzahn: phabricator: fix database query for tasks with due date in the past [puppet] - https://gerrit.wikimedia.org/r/589272 (https://phabricator.wikimedia.org/T249807) [09:19:39] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/21976/cp3050.esams.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/589271 (https://phabricator.wikimedia.org/T250347) (owner: Elukey) [09:22:36] (CR) Aklapper: [C: +1] "Garrr. I am an idiot and I owe Daniel a beer." [puppet] - https://gerrit.wikimedia.org/r/589272 (https://phabricator.wikimedia.org/T249807) (owner: Dzahn) [09:25:38] (CR) Ema: [C: +1] "Excellent."
[puppet] - https://gerrit.wikimedia.org/r/589271 (https://phabricator.wikimedia.org/T250347) (owner: Elukey) [09:27:42] (PS4) Dzahn: phabricator: fix database query for tasks with due date in the past [puppet] - https://gerrit.wikimedia.org/r/589272 (https://phabricator.wikimedia.org/T249807) [09:30:08] (CR) Kormat: [C: +1] wmnet: Remove production dns entry for dbproxy1011 [dns] - https://gerrit.wikimedia.org/r/589186 (https://phabricator.wikimedia.org/T249590) (owner: Marostegui) [09:30:56] (CR) Elukey: [C: +2] atskafka::instance: add snappy compression by default [puppet] - https://gerrit.wikimedia.org/r/589271 (https://phabricator.wikimedia.org/T250347) (owner: Elukey) [09:32:00] !log cp2027: upgrade varnish to 5.1.3-1wm14 T249810 [09:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:22] !log restart atskafka on cp3050 to pick up snappy compression - T250347 [09:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:29] (CR) Kormat: [C: +1] "Looks good within the limits of my knowledge :)" [puppet] - https://gerrit.wikimedia.org/r/589185 (https://phabricator.wikimedia.org/T249590) (owner: Marostegui) [09:33:29] T250347: kafka-jumbo1003 /srv disk space usage over 90% - https://phabricator.wikimedia.org/T250347 [09:36:44] (PS1) Elukey: atskafka::instance: add missing comma in config file [puppet] - https://gerrit.wikimedia.org/r/589275 (https://phabricator.wikimedia.org/T250347) [09:37:28] (CR) Elukey: [C: +2] atskafka::instance: add missing comma in config file [puppet] - https://gerrit.wikimedia.org/r/589275 (https://phabricator.wikimedia.org/T250347) (owner: Elukey) [09:38:40] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/587301 (https://phabricator.wikimedia.org/T249643) (owner: Huji) [09:52:19] (CR) Dzahn: [C: +2] phabricator: fix database query for tasks with due date in the past
[puppet] - 10https://gerrit.wikimedia.org/r/589272 (https://phabricator.wikimedia.org/T249807) (owner: 10Dzahn) [09:56:17] (03CR) 10Filippo Giunchedi: "Change itself LGTM, although I think a more gradual rollout would be desirable (unless that'd be done at deploy time?). Also IMHO worth st" [puppet] - 10https://gerrit.wikimedia.org/r/589085 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) [09:57:23] PROBLEM - MariaDB Slave Lag: s8 #page on db1092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 161954.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:57:44] expired downtime [09:57:45] Fixin [09:57:49] thanks marostegui [09:58:05] downtimed again [09:58:12] ack, acked on VO [09:58:15] thanks [10:00:48] 10Operations, 10DBA: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10Dzahn) [10:04:42] 10Operations, 10DBA: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10Dzahn) - schedule a short maintenance window for phabricator - change the passwords live - change the passwords in private repo in class passwords::mysql::phabricator - run puppet on phabr... 
[10:06:17] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Christoph Jauera - https://phabricator.wikimedia.org/T250362 (10WMDE-Fisch) [10:09:11] (03CR) 10Hnowlan: [C: 03+2] Update change-prop container version to v0.9.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/589079 (https://phabricator.wikimedia.org/T248677) (owner: 10Ppchelko) [10:09:32] (03Merged) 10jenkins-bot: Update change-prop container version to v0.9.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/589079 (https://phabricator.wikimedia.org/T248677) (owner: 10Ppchelko) [10:20:01] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for awight - https://phabricator.wikimedia.org/T250364 (10awight) [10:20:31] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for awight - https://phabricator.wikimedia.org/T250364 (10awight) [10:21:39] 10Operations, 10Analytics: kafka-jumbo1003 /srv disk space usage over 90% - https://phabricator.wikimedia.org/T250347 (10elukey) Jumbo1002: ` 12G eventlogging_PaintTiming-0 19G eventlogging-valid-mixed-0 19G eventlogging-valid-mixed-1 19G eventlogging-valid-mixed-6 19G eventlogging-valid-mixed-7 19G eventlogg... [10:22:25] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:38] (03PS1) 10Jbond: profile::prometheus::nic_saturation_exporter: pass through ensure param [puppet] - 10https://gerrit.wikimedia.org/r/589277 (https://phabricator.wikimedia.org/T224454) [10:31:49] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[10:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:58] (03PS2) 10Jbond: profile::prometheus::nic_saturation_exporter: pass through ensure param [puppet] - 10https://gerrit.wikimedia.org/r/589277 (https://phabricator.wikimedia.org/T224454) [10:35:06] 10Operations, 10DBA: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10Marostegui) happy to help [10:35:21] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:31] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:49] (03PS3) 10Jbond: profile::prometheus::nic_saturation_exporter: pass through ensure param [puppet] - 10https://gerrit.wikimedia.org/r/589277 (https://phabricator.wikimedia.org/T224454) [10:41:01] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable inbound TLSv1.3 globally [puppet] - 10https://gerrit.wikimedia.org/r/589030 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [10:41:41] 10Operations, 10Analytics: kafka-jumbo1003 /srv disk space usage over 90% - https://phabricator.wikimedia.org/T250347 (10elukey) Overall I have the following ideas: * short term, we could drop the atskafka test topic to free space, and re-create it with more partitions to spread the load among multiple broker... 
[10:44:08] !log rolling restart of ats-tls to enable TLSv1.3 globally and disable the old TLS session cache - T170567 [10:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:15] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 [10:44:25] (03PS4) 10Jbond: profile::prometheus::nic_saturation_exporter: pass through ensure param [puppet] - 10https://gerrit.wikimedia.org/r/589277 (https://phabricator.wikimedia.org/T224454) [10:44:29] !log hnowlan@deploy1001 Started deploy [changeprop/deploy@354ae2d]: Testing rules moved to k8s [10:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:59] !log upgrading ATS to version 8.0.7-rc0-1wm3 - T249335 [10:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:08] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [10:45:33] 10Operations: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367 (10ayounsi) p:05Triage→03Low [10:45:46] !log hnowlan@deploy1001 Finished deploy [changeprop/deploy@354ae2d]: Testing rules moved to k8s (duration: 01m 16s) [10:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:33] (03CR) 10Jbond: "pcc: https://puppet-compiler.wmflabs.org/compiler1001/21980/" [puppet] - 10https://gerrit.wikimedia.org/r/589277 (https://phabricator.wikimedia.org/T224454) (owner: 10Jbond) [10:46:49] (03PS4) 10Jbond: run nic_saturation_exporter on all physical hosts [puppet] - 10https://gerrit.wikimedia.org/r/589085 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) [10:48:57] (03CR) 10Jbond: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589085 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) [10:50:22] (03CR) 10Jbond: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589085 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) 
[10:51:01] (03CR) 10Jbond: [C: 03+2] admin: add basic schema validation via tox [puppet] - 10https://gerrit.wikimedia.org/r/589021 (https://phabricator.wikimedia.org/T250259) (owner: 10Jbond) [10:54:36] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200416T1100). [11:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:54] (03CR) 10Urbanecm: [C: 03+2] Turn off direct account creations at Testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589216 (https://phabricator.wikimedia.org/T250348) (owner: 10DannyS712) [11:01:58] (03Merged) 10jenkins-bot: Turn off direct account creations at Testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589216 (https://phabricator.wikimedia.org/T250348) (owner: 10DannyS712) [11:02:56] (03PS4) 10Urbanecm: Remove grants for tboverride and tboverride-account [mediawiki-config] - 10https://gerrit.wikimedia.org/r/582531 (https://phabricator.wikimedia.org/T241114) (owner: 10JJMC89) [11:03:14] !log urbanecm@deploy1001 sync-file aborted: SWAT: 74ad793: Turn off direct account creations at Testwikidata (duration: 00m 00s) [11:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:06] 10Operations, 10ops-eqsin: eqsin ganeti cable IDs - https://phabricator.wikimedia.org/T250369 (10ayounsi) p:05Triage→03Low [11:04:27] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 74ad793: Turn off direct account creations at 
Testwikidata (T250348) (duration: 01m 06s) [11:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:33] T250348: Turn off direct account creations at Testwikidata, at least for periods of time - https://phabricator.wikimedia.org/T250348 [11:05:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 74ad793: Turn off direct account creations at Testwikidata (T250348; take II) (duration: 01m 04s) [11:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:19] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/582531 (https://phabricator.wikimedia.org/T241114) (owner: 10JJMC89) [11:09:14] (03Merged) 10jenkins-bot: Remove grants for tboverride and tboverride-account [mediawiki-config] - 10https://gerrit.wikimedia.org/r/582531 (https://phabricator.wikimedia.org/T241114) (owner: 10JJMC89) [11:12:19] (03PS2) 10Urbanecm: Remove broken groupOverrides from amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589265 (https://phabricator.wikimedia.org/T249585) [11:12:23] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: 70ee5f6: Remove grants for tboverride and tboverride-account (T241114) (duration: 01m 06s) [11:12:25] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589265 (https://phabricator.wikimedia.org/T249585) (owner: 10Urbanecm) [11:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:29] T241114: tboverride and tboverride-account should be included in a grant - https://phabricator.wikimedia.org/T241114 [11:13:14] (03Merged) 10jenkins-bot: Remove broken groupOverrides from amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589265 (https://phabricator.wikimedia.org/T249585) (owner: 10Urbanecm) [11:14:59] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: a105f38: Remove broken 
groupOverrides from amwikimedia (T249585) (duration: 01m 05s) [11:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:06] T249585: am.wikimedia sysops have the '0' right - https://phabricator.wikimedia.org/T249585 [11:16:50] !log EU SWAT done [11:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:26] !log stop atskafka on cp3050 to re-create the topic atskafka_test_webrequest_text on Kafka Jumbo - T250347 [11:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:32] T250347: kafka-jumbo1003 /srv disk space usage over 90% - https://phabricator.wikimedia.org/T250347 [11:22:15] !log rename/format asw1-eqsin interfaces to match future homer driven format [11:22:18] (03PS1) 10Dzahn: add contint.wikimedia.org service alias for contint machines [dns] - 10https://gerrit.wikimedia.org/r/589285 [11:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:13] (03PS2) 10Dzahn: add contint.wikimedia.org service alias for contint machines [dns] - 10https://gerrit.wikimedia.org/r/589285 (https://phabricator.wikimedia.org/T210411) [11:23:56] (03PS2) 10Dzahn: merge microsites into webserver_misc_apps [puppet] - 10https://gerrit.wikimedia.org/r/587985 (https://phabricator.wikimedia.org/T247650) [11:29:39] !log restart atskafka on cp3050 after maintenance [11:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:07] (03PS6) 10Jbond: ferm: Add status check [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) [11:30:59] 10Operations, 10Analytics: kafka-jumbo1003 /srv disk space usage over 90% - https://phabricator.wikimedia.org/T250347 (10elukey) p:05High→03Medium /srv usage is now around 83% after dropping the ats test topic! 
[11:33:20] (03CR) 10jerkins-bot: [V: 04-1] ferm: Add status check [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [11:33:26] godog: volans: [11:33:28] sorry [11:33:41] * jbond42 is going to rename the go script [11:36:50] lol [11:38:52] (03PS7) 10Jbond: ferm: Add status check [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) [11:39:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] glance image_sync: use primary_glance_image_store to choose the image store [puppet] - 10https://gerrit.wikimedia.org/r/589096 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [11:41:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. I usually introduce some kind of alignment for visual comfort, but that's optional." [puppet] - 10https://gerrit.wikimedia.org/r/589144 (owner: 10Andrew Bogott) [11:43:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I have no idea how to use mcrouter so I cannot review the config introduced here. But overall the change LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [11:58:18] 10Operations, 10Traffic: Implement TTL cap for ats-be - https://phabricator.wikimedia.org/T249627 (10ema) I was under the impression that ATS had no config setting to impose a TTL cap. The reason for this is that the [[https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#prox... 
[12:00:42] PROBLEM - PHP opcache health on mw2299 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:02:47] 10Operations, 10Product-Analytics, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Needs Product Owner Decisions), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10nshahquinn-wmf) @ovasileva, I'm not certain what you'd like #produc... [12:11:56] (03PS1) 10Jbond: sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) [12:13:12] (03CR) 10Jbond: "Please add any pother parties who may be interested" [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [12:13:38] jbond42: haha! [12:14:44] 10Operations, 10MediaWiki-Parser, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Improve PoolCounterWork logic to cover possible raised exceptions - https://phabricator.wikimedia.org/T249531 (10daniel) 05Open→03Resolved [12:14:49] 10Operations, 10MediaWiki-Parser, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (10daniel) [12:15:00] (03CR) 10Jbond: "> Patch Set 5:" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [12:18:31] (03CR) 10Dzahn: [C: 04-1] "needs more Hiera: https://puppet-compiler.wmflabs.org/compiler1003/21981/miscweb1002.eqiad.wmnet/change.miscweb1002.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/587985 (https://phabricator.wikimedia.org/T247650) (owner: 10Dzahn) [12:19:49] (03CR) 10Dzahn: [C: 04-1] "let's use contint.wikimedia.org here 
https://gerrit.wikimedia.org/r/c/operations/dns/+/589285" [puppet] - 10https://gerrit.wikimedia.org/r/588973 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:20:54] RECOVERY - PHP opcache health on mw2299 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:24:44] (03PS1) 10Alex Monk: ATS TLS material: Reload even if do_ocsp is false [puppet] - 10https://gerrit.wikimedia.org/r/589291 [12:27:21] (03CR) 10Volans: "Thanks for the patch, for more context for the other reviewers:" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [12:30:05] (03CR) 10Muehlenhoff: "Two comments inline, approach looks good to me" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [12:30:20] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100% [12:36:11] !! [12:36:27] * vgutierrez checking [12:37:43] !log depool & reboot cp1087 [12:38:15] vgutierrez: Failed to log message to wiki. Somebody should check the error logs. [12:38:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/21982/deploy1001.eqiad.wmnet/ for PCC. This looks good to me! Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [12:38:26] really? 
[12:38:50] (╯°□°)╯︵ ┻━┻ [12:39:04] ┬─┬ノ( º _ ºノ) [12:39:14] nice catch jynus [12:39:14] ;P [12:39:30] ┻━┻︵ \(°□°)/ ︵ ┻━┻ [12:40:28] it's not working since 11:29 [12:40:57] but the bot is working at https://sal.toolforge.org/ [12:41:13] (03CR) 10Volans: sre.wdqs.data-transfer: manage ferm rules required for transfer (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [12:44:14] !log test sal again [12:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:41] ^ vgutierrez it works now [12:44:47] jynus: <3 [12:45:02] (03CR) 10Hashar: "That is nicer indeed :)" [puppet] - 10https://gerrit.wikimedia.org/r/589038 (owner: 10Dzahn) [12:45:41] don't tell me- ask traffic team why wikitech may have failed to load [12:46:02] RECOVERY - Host cp1087 is UP: PING OK - Packet loss = 0%, RTA = 0.12 ms [12:46:12] (03CR) 10Muehlenhoff: sre.wdqs.data-transfer: manage ferm rules required for transfer (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [12:48:38] !log pool cp1087 [12:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:35] (03CR) 10Gehel: [C: 03+2] [wdqs] data-reload allow to reload only categories [cookbooks] - 10https://gerrit.wikimedia.org/r/589033 (owner: 10DCausse) [12:50:50] 10Operations, 10LDAP-Access-Requests, 10Research: LDAP/NDA Access Request for bmansurov - https://phabricator.wikimedia.org/T250335 (10bmansurov) [12:51:34] 10Operations, 10Traffic: Servers freezing across the caching cluster (November 2019) - https://phabricator.wikimedia.org/T238305 (10Vgutierrez) [12:53:07] 10Operations, 10Traffic: Servers freezing across the caching cluster (November 2019) - https://phabricator.wikimedia.org/T238305 (10Vgutierrez) cp1087 crashed a few minutes ago (12:30) showing the same symptoms, running buster and with firmware upgraded according to 
T243167 [12:54:26] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [12:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:37] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [12:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:37] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for awight - https://phabricator.wikimedia.org/T250364 (10fgiunchedi) [12:57:30] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for awight - https://phabricator.wikimedia.org/T250364 (10fgiunchedi) @awight I crossed off most items already since your user is indeed already on file. What we need next is signoff from @Tobi_WMDE_SW and @nuria (analytics access)... [12:58:21] (03CR) 10Jbond: sre.wdqs.data-transfer: manage ferm rules required for transfer (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [12:58:24] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Christoph Jauera - https://phabricator.wikimedia.org/T250362 (10fgiunchedi) [12:59:00] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:59:00] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Christoph Jauera - https://phabricator.wikimedia.org/T250362 (10fgiunchedi) Sounds good @WMDE-Fisch, we'd need sign off from @Tobi_WMDE_SW and @nuria (for analytics access), thanks! [12:59:23] (03PS1) 10DCausse: [wdqs] only download and munge the dump when reloading wikidata [cookbooks] - 10https://gerrit.wikimedia.org/r/589295 [13:00:04] James_F and liw: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200416T1300). [13:00:44] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:01:04] James_F is still asleep, logstash looks calm, so I don't expect to need to do anything to the train in this train slot [13:02:10] (03CR) 10Gehel: [C: 03+2] [wdqs] only download and munge the dump when reloading wikidata [cookbooks] - 10https://gerrit.wikimedia.org/r/589295 (owner: 10DCausse) [13:02:34] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Vgutierrez) 05Open→03Resolved TLSv1.3 is now available on both text and upload clusters :) [13:02:37] 10Operations, 10Traffic, 10HTTPS, 10Security: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298 (10Vgutierrez) [13:02:49] (03CR) 10Gehel: [V: 03+2 C: 03+2] [wdqs] only download and munge the dump when reloading wikidata [cookbooks] - 10https://gerrit.wikimedia.org/r/589295 (owner: 10DCausse) [13:03:30] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [13:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:02] !log hnowlan@deploy1001 Started deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules, again [13:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:31] !log hnowlan@deploy1001 Finished deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules, again (duration: 00m 30s) [13:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:30] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Rebuild helm/helm-diff for 
buster-wikimedia - https://phabricator.wikimedia.org/T249812 (10hashar) @akosiaris has done patches for `helmfile` https... [13:07:08] (03CR) 10Hashar: "CI will not run the debian-glue packaging job since this change does not touch any file under ./debian/ But I think it is fine since the" [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588418 (owner: 10Alexandros Kosiaris) [13:07:12] (03CR) 10Hashar: [C: 03+1] Merge branch 'master' into buster-wikimedia [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588418 (owner: 10Alexandros Kosiaris) [13:10:33] (03PS1) 10DCausse: [wdqs] fix cleanup of dumps [cookbooks] - 10https://gerrit.wikimedia.org/r/589298 [13:12:51] (03CR) 10Hashar: "The build fails because dpkg-source find there is a deviation of the source code compared to the upstream ones (which are under tag upstre" [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 (owner: 10Alexandros Kosiaris) [13:13:23] (03CR) 10CDanis: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/589277 (https://phabricator.wikimedia.org/T224454) (owner: 10Jbond) [13:14:23] cdanis: feel free to merge that ^^ when you merge yours [13:14:47] jbond42: yeah, I will merge yours, rework mine to have a few more stages [13:15:00] ack cheers [13:20:51] (03CR) 10Hashar: "Sorry I was wrong. The build passed in some previous changes I have made to this repository. 
On further inspection, the vendor files are " [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 (owner: 10Alexandros Kosiaris) [13:23:26] jbond42: hmm, PCC doesn't seem to think that the hiera in hieradata/common/profile/prometheus/nic_saturation_exporter.yaml does anything -- it's showing it being enabled on an appserver and a bastion host https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/21984/console https://puppet-compiler.wmflabs.org/compiler1001/21984/ [13:28:12] (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for Rocky upgrade [puppet] - 10https://gerrit.wikimedia.org/r/589299 (https://phabricator.wikimedia.org/T248635) [13:28:14] (03PS1) 10Andrew Bogott: cloud-vps: move eqiad1 from openstack 'queens' to 'rocky' [puppet] - 10https://gerrit.wikimedia.org/r/589300 (https://phabricator.wikimedia.org/T248635) [13:28:16] (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Rocky upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/589301 [13:30:20] (03CR) 10Volans: [C: 03+1] "Code looks good and some tests on failoid2001 (buster) were successful. 
If you want to be extra careful test it also on a jessie/stretch h" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [13:31:41] (03CR) 10Volans: "For context, as discussed offline, the patch needs some refactor to actually create the ferm file on the target host and not the cumin one" [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [13:32:06] !log Restarting CI Jenkins for plugin upgrade T250377 [13:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:08] (03CR) 10Gehel: [C: 03+2] [wdqs] fix cleanup of dumps [cookbooks] - 10https://gerrit.wikimedia.org/r/589298 (owner: 10DCausse) [13:34:35] (03PS2) 10Jbond: sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) [13:34:52] (03PS1) 10Jcrespo: mariadb-backups: Split mariadb::backups into 2 roles for backup2002 [puppet] - 10https://gerrit.wikimedia.org/r/589303 (https://phabricator.wikimedia.org/T79922) [13:36:22] (03CR) 10Jcrespo: "We need to do this split today, because if not, it will try to store the ongoing logical dumps into the backup1001 Databases pool- we don'" [puppet] - 10https://gerrit.wikimedia.org/r/589303 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [13:36:30] !log Optimizing all tables on pc1010 T247787 [13:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:35] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 [13:38:48] 10Operations, 10Analytics: kafka-jumbo1003 /srv disk space usage over 90% - https://phabricator.wikimedia.org/T250347 (10Ottomata) Hm. Moving partitions is a bit annoying but is possible. 
https://docs.cloudera.com/runtime/7.1.0/kafka-managing/topics/kafka-manage-cli-reassign-overview.html I'd also be ok wi... [13:40:58] !log rename/format asw2-esams interfaces to match future homer driven format [13:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:13] (03PS2) 10Jcrespo: mariadb-backups: Split mariadb::backups into 2 roles for backup2002 [puppet] - 10https://gerrit.wikimedia.org/r/589303 (https://phabricator.wikimedia.org/T79922) [13:47:02] (03PS1) 10JMeybohm: * Update for Buster (Bug: T249812) [debs/helm] - 10https://gerrit.wikimedia.org/r/589304 (https://phabricator.wikimedia.org/T249812) [13:47:59] (03CR) 10Volans: [C: 03+1] "nit inline looks good otherwise" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [13:49:24] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::prometheus::nic_saturation_exporter: pass through ensure param [puppet] - 10https://gerrit.wikimedia.org/r/589277 (https://phabricator.wikimedia.org/T224454) (owner: 10Jbond) [13:50:36] 10Operations, 10LDAP-Access-Requests, 10Research: LDAP/NDA Access Request for bmansurov - https://phabricator.wikimedia.org/T250335 (10fgiunchedi) @leila I'm assuming you'll be providing the sign off for this request ? @bmansurov you have shell access already but indeed lack ldap group access. [13:51:49] (03PS1) 10JMeybohm: Add .gitreview [debs/helm] - 10https://gerrit.wikimedia.org/r/589308 [13:52:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/helm] - 10https://gerrit.wikimedia.org/r/589308 (owner: 10JMeybohm) [13:53:40] 10Operations, 10LDAP-Access-Requests, 10Research: LDAP/NDA Access Request for bmansurov - https://phabricator.wikimedia.org/T250335 (10fgiunchedi) Nevermind, no sign off needed since everything is already on file. @bmansurov you are now in 'wmf' ldap group, please check/verify access works. 
[13:58:00] (03PS3) 10Jbond: sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) [13:59:25] 10Operations, 10Wikimedia-Mailing-lists: Mailing list request for WPWP - https://phabricator.wikimedia.org/T250390 (10Wikicology) [14:05:59] (03CR) 10Marostegui: [C: 03+1] mariadb-backups: Split mariadb::backups into 2 roles for backup2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589303 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [14:08:34] (03PS3) 10Jcrespo: mariadb-backups: Split mariadb::backups into 2 roles for backup2002 [puppet] - 10https://gerrit.wikimedia.org/r/589303 (https://phabricator.wikimedia.org/T79922) [14:08:38] (03CR) 10Jhedden: [C: 03+1] Glance profiles: add param types and lookup() calls [puppet] - 10https://gerrit.wikimedia.org/r/589144 (owner: 10Andrew Bogott) [14:09:12] (03CR) 10Jcrespo: "Done" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589303 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [14:10:01] (03CR) 10Elukey: "Completely random question - would it be worth setting up an rsync daemon by default on wdqs nodes for this use case? 
Allowing only wdqs hos" [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:11:14] (03CR) 10Jhedden: [C: 03+1] glance image_sync: use primary_glance_image_store to choose the image store [puppet] - 10https://gerrit.wikimedia.org/r/589096 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:11:55] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Split mariadb::backups into 2 roles for backup2002 [puppet] - 10https://gerrit.wikimedia.org/r/589303 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [14:13:44] !log cache: upgrade varnish to 5.1.3-1wm14 and rolling restart T249810 [14:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:51] (03CR) 10Jhedden: [C: 03+1] cloud-vps hiera: introduce openstack_controllers and keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/589095 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:14:48] !log elastic (search cluster) reindexing commonswiki_content in codfw and ediad (T246882) [14:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:54] T246882: commonswiki shard size grew more than 50G in eqiad and codfw - https://phabricator.wikimedia.org/T246882 [14:15:34] (03CR) 10Jhedden: [C: 03+1] Glance: use keystone_api_fqdn for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/589143 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:16:17] (03CR) 10Jhedden: [C: 03+1] Glance profiles: remove firewall rule for labs_hosts_range [puppet] - 10https://gerrit.wikimedia.org/r/589139 (owner: 10Andrew Bogott) [14:16:44] !log hnowlan@deploy1001 Started deploy [changeprop/deploy@354ae2d]: Enabling rules on k8s, disabling on scb [14:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:50] (03CR) 10Elukey: Designate: replace standalone memcached with a mcrouter cluster (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:17:56] !log hnowlan@deploy1001 Finished deploy [changeprop/deploy@354ae2d]: Enabling rules on k8s, disabling on scb (duration: 01m 12s) [14:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:28] (03CR) 10Hashar: "CI FTBS with:" [debs/helm] - 10https://gerrit.wikimedia.org/r/589304 (https://phabricator.wikimedia.org/T249812) (owner: 10JMeybohm) [14:20:11] (03CR) 10Jhedden: [C: 03+1] glance profiles: remove use of $nova_controller param [puppet] - 10https://gerrit.wikimedia.org/r/589112 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:21:34] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [14:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:54] (03CR) 10Hashar: "Note that the upstream fix at https://github.com/Masterminds/glide/pull/559/commits/be8a502f4bc46823ed17b2f90bc4fb0b8d003aa3 is from 2016" [debs/helm] - 10https://gerrit.wikimedia.org/r/589304 (https://phabricator.wikimedia.org/T249812) (owner: 10JMeybohm) [14:23:54] (03CR) 10Gehel: "> Patch Set 3:" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:30:46] !log holger@mwmaint1002 Starting uppercaseTitlesForUnicodeTransition.php as part of T219279 [14:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:53] T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 [14:32:59] (03PS1) 10Ema: ATS: cap ats-be TTL at 24h [puppet] - 10https://gerrit.wikimedia.org/r/589317 (https://phabricator.wikimedia.org/T249627) [14:36:00] (03CR) 10Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1002/21986/" [puppet] - 10https://gerrit.wikimedia.org/r/589317 
(https://phabricator.wikimedia.org/T249627) (owner: 10Ema) [14:36:26] (03CR) 10Vgutierrez: ATS: cap ats-be TTL at 24h (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589317 (https://phabricator.wikimedia.org/T249627) (owner: 10Ema) [14:38:24] (03PS2) 10Ema: ATS: cap ats-be TTL at 24h [puppet] - 10https://gerrit.wikimedia.org/r/589317 (https://phabricator.wikimedia.org/T249627) [14:42:01] (03PS1) 10Bearloga: Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) [14:42:23] (03CR) 10Vgutierrez: [C: 03+1] ATS: cap ats-be TTL at 24h [puppet] - 10https://gerrit.wikimedia.org/r/589317 (https://phabricator.wikimedia.org/T249627) (owner: 10Ema) [14:43:03] (03CR) 10Elukey: ">" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:45:08] (03CR) 10jerkins-bot: [V: 04-1] Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [14:47:47] (03PS2) 10Bearloga: Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) [14:48:33] (03PS5) 10CDanis: profile::prometheus::nic_saturation_exporter: pass through ensure param [puppet] - 10https://gerrit.wikimedia.org/r/589277 (https://phabricator.wikimedia.org/T224454) (owner: 10Jbond) [14:48:35] (03PS5) 10CDanis: run nic_saturation_exporter on all physical hosts [puppet] - 10https://gerrit.wikimedia.org/r/589085 (https://phabricator.wikimedia.org/T224454) [14:48:37] (03CR) 10Bearloga: "took my best shot at this sort of thing..." 
[puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [14:51:17] !log holger@mwmaint1002 END (Fail) uppercaseTitlesForUnicodeTransition.php as part of T219279 [14:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:24] T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 [14:52:17] (03CR) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [14:52:38] (03PS14) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [14:56:07] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Duplicate "moderator request(s) waiting" emails sent to list admins - https://phabricator.wikimedia.org/T250032 (10bd808) I have a a similar but maybe slightly different mailman bug that is bugging me. I keep getting this message every day, but the UI shows th... [14:56:54] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10DannyS712) So on commons I found watchlist entries for a b... 
[14:57:55] (03PS6) 10Alexandros Kosiaris: Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 [14:58:11] (03PS2) 10RLazarus: maintenance: Migrate update_flaggedrev_stats to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587328 (https://phabricator.wikimedia.org/T211250) [14:58:33] (03CR) 10jerkins-bot: [V: 04-1] Bump to 0.109.0 [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588415 (owner: 10Alexandros Kosiaris) [14:59:07] (03Abandoned) 10Alexandros Kosiaris: Merge branch 'master' into buster-wikimedia [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/588418 (owner: 10Alexandros Kosiaris) [15:01:32] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode for Rocky upgrade [puppet] - 10https://gerrit.wikimedia.org/r/589299 (https://phabricator.wikimedia.org/T248635) (owner: 10Andrew Bogott) [15:01:41] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate update_flaggedrev_stats to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587328 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [15:02:27] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Petri Gyula' '23eki' (T250387) [15:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:33] T250387: Unblock stuck global rename of 23eki - https://phabricator.wikimedia.org/T250387 [15:03:12] (03PS1) 10Hnowlan: changeprop: reenable SSL for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/589324 (https://phabricator.wikimedia.org/T248677) [15:03:34] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: move eqiad1 from openstack 'queens' to 'rocky' [puppet] - 10https://gerrit.wikimedia.org/r/589300 (https://phabricator.wikimedia.org/T248635) (owner: 10Andrew Bogott) [15:08:04] 10Operations, 10ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - 
https://phabricator.wikimedia.org/T249062 (10Jclark-ctr) [15:10:40] (03CR) 10Ema: [C: 03+2] ATS: cap ats-be TTL at 24h [puppet] - 10https://gerrit.wikimedia.org/r/589317 (https://phabricator.wikimedia.org/T249627) (owner: 10Ema) [15:11:25] (03PS6) 10CDanis: nic_saturation_exporter on all physical hosts w/ hiera enabled [puppet] - 10https://gerrit.wikimedia.org/r/589085 (https://phabricator.wikimedia.org/T224454) [15:13:00] (03CR) 10CDanis: [C: 03+2] profile::prometheus::nic_saturation_exporter: pass through ensure param [puppet] - 10https://gerrit.wikimedia.org/r/589277 (https://phabricator.wikimedia.org/T224454) (owner: 10Jbond) [15:15:03] (03CR) 10CDanis: [C: 03+2] nic_saturation_exporter on all physical hosts w/ hiera enabled [puppet] - 10https://gerrit.wikimedia.org/r/589085 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) [15:15:27] 10Operations, 10Wikimedia-Mailing-lists: Mailing list request for WPWP (Wikipedia Pages Wanting Photos) - https://phabricator.wikimedia.org/T250390 (10Aklapper) [15:15:40] PROBLEM - Disk space on ganeti1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%): /tmp 0 MB (0% inode=95%): /var/tmp 0 MB (0% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ganeti1003&var-datasource=eqiad+prometheus/ops [15:18:14] akosiaris: anything ongoing on ganeti hosts? 
[15:18:18] that doesn't look good [15:18:36] * akosiaris looking [15:18:49] akosiaris: /var/log/ganeti is 25G [15:19:19] (03CR) 10Ppchelko: [C: 03+2] changeprop: reenable SSL for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/589324 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:19:43] (03Merged) 10jenkins-bot: changeprop: reenable SSL for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/589324 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:19:46] akosiaris: /var/log/ganeti/monitoring-daemon-error.log [15:19:50] [16/Apr/2020:14:29:47 +0000] got exception in httpAcceptFunc: accept: resource exhausted (Too many open files) [15:19:52] PROBLEM - Check systemd state on ganeti1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:59] yep [15:20:05] what on earth [15:20:55] !log stop ganeti daemons on ganeti1003 [15:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:00] akosiaris: FYI I just checked all other ganeti hosts and that file is 0-sized [15:23:22] (03CR) 10JMeybohm: "> Patch Set 1:" [debs/helm] - 10https://gerrit.wikimedia.org/r/589304 (https://phabricator.wikimedia.org/T249812) (owner: 10JMeybohm) [15:24:35] grep -v 'got exception in httpAcceptFunc: accept: resource exhausted (Too many open files)' monitoring-daemon-error.log | wc -l [15:24:35] 1 [15:24:40] ok, I 'll truncate it [15:24:48] k [15:25:29] weird. I was trying earlier to figure out how to monitor the internal ganeti CA and was looking into those daemons [15:25:40] if I have managed to trigger that... 
man what a bug must it be [15:26:17] lol [15:26:37] !log truncate /var/log/ganeti/monitoring-daemon-error.log on ganeti1003, start again all ganeti daemons [15:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:14] RECOVERY - Check systemd state on ganeti1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:33] (03PS2) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Rocky upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/589301 [15:29:35] (03PS1) 10Andrew Bogott: Openstack Nova: reduce the number of nova-scheduler workers [puppet] - 10https://gerrit.wikimedia.org/r/589340 [15:29:51] (03CR) 10JMeybohm: "> Patch Set 1:" [debs/helm] - 10https://gerrit.wikimedia.org/r/589304 (https://phabricator.wikimedia.org/T249812) (owner: 10JMeybohm) [15:34:37] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add .gitreview [debs/helm] - 10https://gerrit.wikimedia.org/r/589308 (owner: 10JMeybohm) [15:34:39] (03PS2) 10Ema: vcl: introduce wm_admission_policies [puppet] - 10https://gerrit.wikimedia.org/r/588945 (https://phabricator.wikimedia.org/T249809) [15:34:41] (03PS1) 10Ema: vcl: move 'exp' admission policy to wm_admission_policies [puppet] - 10https://gerrit.wikimedia.org/r/589341 (https://phabricator.wikimedia.org/T249809) [15:34:43] (03PS1) 10Ema: vcl: 10M cutoff for the 'exp' admission policy [puppet] - 10https://gerrit.wikimedia.org/r/589342 (https://phabricator.wikimedia.org/T249809) [15:35:32] (03CR) 10jerkins-bot: [V: 04-1] vcl: 10M cutoff for the 'exp' admission policy [puppet] - 10https://gerrit.wikimedia.org/r/589342 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [15:35:40] RECOVERY - Disk space on ganeti1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ganeti1003&var-datasource=eqiad+prometheus/ops [15:35:48] thanks akosiaris! 
[15:36:43] volans: root 2435 0.0 17.6 1073792892 11664976 ? S 2019 52:58 /usr/sbin/ganeti-mond [15:36:49] stray ganeti-mond from 2019 [15:37:12] ok, killing it now [15:37:21] kaboom! [15:39:00] (03PS2) 10Ema: vcl: 10M cutoff for the 'exp' admission policy [puppet] - 10https://gerrit.wikimedia.org/r/589342 (https://phabricator.wikimedia.org/T249809) [15:39:18] (03CR) 10Jbond: "A recent update allows us to manage system, accounts via admin.yaml see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [15:39:35] (03CR) 10jerkins-bot: [V: 04-1] vcl: 10M cutoff for the 'exp' admission policy [puppet] - 10https://gerrit.wikimedia.org/r/589342 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [15:42:32] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) I made this stack overflow post: https://stackoverflow.com/questions/61130651/memory-available-free-pl... [15:43:12] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson restbase1028: A5 U18 WMF4802 port. 21 re... [15:43:52] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Jclark-ctr) [15:45:58] (03PS3) 10Ema: vcl: 10M cutoff for the 'exp' admission policy [puppet] - 10https://gerrit.wikimedia.org/r/589342 (https://phabricator.wikimedia.org/T249809) [15:46:05] 10Operations, 10LDAP-Access-Requests, 10Research: LDAP/NDA Access Request for bmansurov - https://phabricator.wikimedia.org/T250335 (10leila) @fgiunchedi thanks! 
:) [15:47:07] 10Operations, 10ops-eqsin: eqsin ganeti cable IDs - https://phabricator.wikimedia.org/T250369 (10RobH) Acknowledged. We are waiting to have Jin go onsite once the new router arrives, so I'll add this to the list of items for him to tackle when onsite for that! The delay for Jin's visit is due to Covid conce... [15:47:22] (03PS2) 10Andrew Bogott: Openstack Nova: reduce the number of nova-scheduler workers [puppet] - 10https://gerrit.wikimedia.org/r/589340 [15:47:24] (03PS3) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Rocky upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/589301 [15:47:26] (03PS1) 10Andrew Bogott: nova: reduce number of api workers yet again [puppet] - 10https://gerrit.wikimedia.org/r/589344 [15:51:39] (03CR) 10Andrew Bogott: [C: 03+2] nova: reduce number of api workers yet again [puppet] - 10https://gerrit.wikimedia.org/r/589344 (owner: 10Andrew Bogott) [15:51:40] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:51:50] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: reduce the number of nova-scheduler workers [puppet] - 10https://gerrit.wikimedia.org/r/589340 (owner: 10Andrew Bogott) [15:54:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:53] !log restart chi on cloudelastic1001 with -XX:NewRatio=3 - T231517 [15:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:00] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [15:58:48] 10Operations, 10MediaWiki-Cache, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10ema) @Urbanecm, @AntiCompositeNumber: esams has now been running with the new system for... [15:59:28] (03PS2) 10JMeybohm: Update for Buster (Bug: T249812) [debs/helm] - 10https://gerrit.wikimedia.org/r/589304 (https://phabricator.wikimedia.org/T249812) [16:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200416T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:04:42] 10Operations, 10observability: run nic_saturation_exporter on all physical hosts - https://phabricator.wikimedia.org/T250401 (10CDanis) [16:04:59] 10Operations, 10observability, 10serviceops, 10Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10CDanis) I've been abusing this task for the rollout of `nic_saturation_exporter` to other hosts; moving tracking that to T250401 [16:09:51] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/helm] - 10https://gerrit.wikimedia.org/r/589304 (https://phabricator.wikimedia.org/T249812) (owner: 10JMeybohm) [16:10:59] (03PS1) 10CDanis: canary deployment of nic_saturation_exporter [puppet] - 10https://gerrit.wikimedia.org/r/589350 (https://phabricator.wikimedia.org/T250401) [16:11:29] (03CR) 10JMeybohm: [C: 03+2] Update for Buster (Bug: T249812) [debs/helm] - 10https://gerrit.wikimedia.org/r/589304 (https://phabricator.wikimedia.org/T249812) (owner: 10JMeybohm) [16:23:13] (03CR) 10CDanis: "PCC looks correct: https://puppet-compiler.wmflabs.org/compiler1003/21991/" [puppet] - 10https://gerrit.wikimedia.org/r/589350 (https://phabricator.wikimedia.org/T250401) (owner: 10CDanis) [16:24:15] (03CR) 10RLazarus: [C: 03+1] canary deployment of nic_saturation_exporter [puppet] - 10https://gerrit.wikimedia.org/r/589350 (https://phabricator.wikimedia.org/T250401) (owner: 10CDanis) [16:24:31] (03Abandoned) 10Elukey: role::elasticsearch::cloudelastic: add more heap space for young gen [puppet] - 10https://gerrit.wikimedia.org/r/589007 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [16:28:11] (03CR) 10CDanis: [C: 03+2] canary deployment of nic_saturation_exporter [puppet] - 10https://gerrit.wikimedia.org/r/589350 (https://phabricator.wikimedia.org/T250401) (owner: 10CDanis) [16:30:37] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[16:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:39] 10Operations, 10observability, 10Patch-For-Review: run nic_saturation_exporter on all physical hosts - https://phabricator.wikimedia.org/T250401 (10CDanis) [16:35:04] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:53] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [16:39:45] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [16:39:49] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [16:40:34] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:43] the heck [16:43:14] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) p:05Triage→03Low [16:44:22] (03CR) 10Krinkle: [C: 03+1] "LGTM, I've checked code search and did not see any ExtensionRegistry-related references to "Parsoid/PHP" that relied on that name." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583394 (owner: 10C. Scott Ananian) [16:44:24] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10ayounsi) Please also track and add to Netbox the switch mgmt cables. [16:44:26] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) 05Open→03Stalled Please note this will NOT be fixed quickly, as we are limiting on-site visitation due to covid-19 shelter in place restrictions. Since this is just auditing... 
[16:44:55] (03PS1) 10Elukey: Use -XX:NewRatio=3 for cloudelastic-chi instead of setting a specific size [puppet] - 10https://gerrit.wikimedia.org/r/589356 (https://phabricator.wikimedia.org/T231517) [16:44:58] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) [16:45:15] !log kafka-logging eqiad set retention.bytes=500000000000 on topic udp_localhost-info T250133 [16:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:23] T250133: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 [16:47:27] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/21992/" [puppet] - 10https://gerrit.wikimedia.org/r/589356 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [16:48:37] 10Operations, 10observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (10herron) Progress! disk util on logstash1010 before adjusting udp_localhost-info retention.size: `/dev/md1 15T 14T 1.2T 92% /srv` disk util on logstash1010 after adjus... [16:51:23] !log kafka-logging eqiad set retention.bytes=500000000000 on topic udp_localhost-warning T250133 [16:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:30] T250133: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 [16:57:29] (03CR) 10Krinkle: [C: 03+1] "LGTM. I've tested it by applying the changes manually to https://en.wikipedia.org/abc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589165 (owner: 10VolkerE) [16:59:42] (03PS2) 10RLazarus: maintenance: Migrate update_special_pages to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587334 (https://phabricator.wikimedia.org/T211250) [17:00:04] halfak and accraze: It is that lovely time of the day again! 
You are hereby commanded to deploy Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200416T1700). [17:02:07] (03CR) 10RLazarus: [C: 03+2] maintenance: Migrate update_special_pages to periodic_job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587334 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [17:07:12] RECOVERY - cassandra-c service on restbase2014 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:07:36] RECOVERY - cassandra-c SSL 10.192.16.87:7001 on restbase2014 is OK: SSL OK - Certificate restbase2014-c valid until 2020-11-29 09:26:10 +0000 (expires in 226 days) https://phabricator.wikimedia.org/T120662 [17:07:54] RECOVERY - cassandra-c CQL 10.192.16.87:9042 on restbase2014 is OK: TCP OK - 0.036 second response time on 10.192.16.87 port 9042 https://phabricator.wikimedia.org/T93886 [17:15:29] this was me re-enabling puppet, but probably not super good [17:15:45] already alerted urandom [17:18:15] 10Operations, 10homer: Homer: add show support - https://phabricator.wikimedia.org/T250413 (10ayounsi) p:05Triage→03Low [17:22:25] 10Operations, 10homer: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415 (10ayounsi) p:05Triage→03Low [17:29:15] (03PS9) 10Krinkle: Use GTIDs for "wait for replica" barriers for external DB clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [17:30:47] (03PS1) 10RLazarus: maintenance: Migrate purge_abusefilter to periodic_jobo [puppet] - 10https://gerrit.wikimedia.org/r/589369 (https://phabricator.wikimedia.org/T211250) [17:32:03] RECOVERY - MariaDB Slave Lag: s8 #page on db1092 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [17:32:23] yay [17:32:26] good to know? 
[17:32:52] it paged early as its downtime expired [17:32:56] I will repool it tomorrow [17:32:56] rzl: that's the icinga way, if you downtime something that is already in CRITICAL it will page on recovery [17:33:01] yeah [17:33:07] ohh got it [17:33:15] I should've disabled notifications I guess [17:33:15] the only way to silence the recovery is to disable notification, but given the high risk of forgetting those disabled [17:33:19] we suggest to not use that [17:33:21] yeah [17:33:26] yeah that makes sense [17:33:30] unless we're *sure* we'll re-enable them [17:33:34] would love it if we got "silence expiring" notifications though [17:33:41] and also how cool it is to receive a page and realise it is a recovery? [17:33:48] or sorry, "downtime expiring" [17:33:51] mood-booster [17:33:56] :-P [17:34:15] HEY HEY HEY EVERYBODY STOP WHAT YOU'RE DOING AND PAY ATTENTION. things are okay. [17:34:29] thanks for your time [17:34:34] haha [17:34:39] you are welcome rzl [17:35:00] lol [17:37:50] (03CR) 10EBernhardson: [C: 03+1] Use -XX:NewRatio=3 for cloudelastic-chi instead of setting a specific size [puppet] - 10https://gerrit.wikimedia.org/r/589356 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [17:42:26] (03PS1) 10RLazarus: maintenance: Migrate purge_checkuser to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589377 (https://phabricator.wikimedia.org/T211250) [17:43:27] (03PS2) 10RLazarus: maintenance: Migrate purge_abusefilter to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589369 (https://phabricator.wikimedia.org/T211250) [17:48:17] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kaldari) @DannyS712 - I can't parse what you're saying. Wh... 
[17:50:13] (03PS1) 10RLazarus: maintenance: Migrate purged_expired_userrights to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589378 (https://phabricator.wikimedia.org/T211250) [17:52:15] !log rename/format asw-ulsfo interfaces to match future homer driven format [17:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:51] (03PS10) 10Krinkle: Switch "wait for replica" method to use GTIDs for external DB clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [17:59:03] (03PS1) 10RLazarus: maintenance: Migrate purge_old_cx_drafts to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589379 (https://phabricator.wikimedia.org/T211250) [18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200416T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:09:13] (03PS1) 10RLazarus: maintenance: Migrate purge_securepoll to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/589384 (https://phabricator.wikimedia.org/T211250) [18:21:37] PROBLEM - Check systemd state on restbase2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:11] 10Operations, 10observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (10herron) Reducing kafka retention.size to 500G on udp_localhost-info and udp_localhost-warning did the trick in terms of freeing up space, and the eqiad ELK5 ES cluster is green once... 
[18:26:04] (03PS3) 10Bearloga: Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) [18:26:32] (03CR) 10Bearloga: Add analytics-product system user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [18:27:51] (03CR) 10jerkins-bot: [V: 04-1] Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [18:29:27] (03PS4) 10Bearloga: Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) [18:44:16] (03PS1) 10Jhedden: cloudvps: update project service discovery prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/589398 [18:48:51] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10DannyS712) >>! In T219279#6063531, @kaldari wrote: > @Dann... 
[18:49:56] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: update project service discovery prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/589398 (owner: 10Jhedden) [18:50:43] (03PS2) 10Jhedden: cloudvps: update project service discovery prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/589398 (https://phabricator.wikimedia.org/T250206) [18:53:17] (03PS2) 10Krinkle: Use Wikimedia Design Style Guide colors and use relative font-size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589165 (owner: 10VolkerE) [18:54:22] (03PS3) 10Krinkle: errorpages: Update 404 page to follow current Style Guide (colors, font-size) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589165 (owner: 10VolkerE) [18:54:25] (03CR) 10Krinkle: [C: 03+2] errorpages: Update 404 page to follow current Style Guide (colors, font-size) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589165 (owner: 10VolkerE) [18:54:52] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: update project service discovery prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/589398 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [18:55:19] Krinkle: Four minutes before the train? Really? [18:55:24] (03Merged) 10jenkins-bot: errorpages: Update 404 page to follow current Style Guide (colors, font-size) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589165 (owner: 10VolkerE) [18:55:31] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon: put into maintenance mode for Rocky upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/589301 (owner: 10Andrew Bogott) [18:55:35] I'm swatting [18:56:00] Hmm, we really should fix this SWAT window and have this as a sanity break. [18:56:15] Maybe make the SWAT window half an hour long? [18:56:28] I was going to deploy that after the train [18:56:32] sounds good to me. will that allow CI to run for 8 patches though? 
[18:56:42] ;-) [18:57:03] !log krinkle@deploy1001 Synchronized errorpages/404.php: I9fd5c99130c64 (duration: 01m 07s) [18:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:13] (03PS3) 10Jhedden: cloudvps: update project service discovery prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/589398 (https://phabricator.wikimedia.org/T250206) [18:57:14] * Krinkle releases the handle [18:57:20] Krinkle: Six. And it's up to the SWATer. [18:57:30] Also also, this patch wasn't listed. :-PP [18:57:36] (03PS1) 10Ottomata: Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 [18:59:39] PROBLEM - nova-compute proc minimum on cloudvirt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:59:45] PROBLEM - nova-compute proc minimum on cloudvirt1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:01] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:03] PROBLEM - nova-compute proc minimum on cloudvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:04] James_F and liw: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200416T1900). 
[19:00:11] PROBLEM - nova-compute proc minimum on cloudvirt1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:19] PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:27] (03PS1) 10Jforrester: all wikis to 1.35.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589401 [19:00:29] (03CR) 10Jforrester: [C: 03+2] all wikis to 1.35.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589401 (owner: 10Jforrester) [19:00:29] PROBLEM - nova-compute proc minimum on cloudvirt1018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:47] PROBLEM - nova-compute proc minimum on cloudvirt1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:47] PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:48] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:49] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:50] PROBLEM - nova-compute proc 
minimum on cloudvirt1014 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:51] PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:52] PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:53] PROBLEM - nova-compute proc minimum on cloudvirt1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:01] PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:03] PROBLEM - nova-compute proc minimum on cloudvirt1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:05] PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:06] Hmm. WMCS having fun. 
[19:01:07] PROBLEM - nova-compute proc minimum on cloudvirt1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:11] (03CR) 10jerkins-bot: [V: 04-1] Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 (owner: 10Ottomata) [19:01:15] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:17] PROBLEM - nova-compute proc minimum on cloudvirt1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:25] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/589401 (owner: 10Jforrester) [19:01:41] RECOVERY - nova-compute proc minimum on cloudvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:55] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:01:57] RECOVERY - nova-compute proc minimum on cloudvirt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:05] RECOVERY - nova-compute proc minimum on cloudvirt1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:13] RECOVERY - nova-compute proc 
minimum on cloudvirt1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:23] RECOVERY - nova-compute proc minimum on cloudvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:41] RECOVERY - nova-compute proc minimum on cloudvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:43] RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:43] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:44] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:45] RECOVERY - nova-compute proc minimum on cloudvirt1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:46] RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:47] RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:49] RECOVERY - nova-compute proc minimum on cloudvirt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:57] RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:58] RECOVERY - nova-compute proc minimum on cloudvirt1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:03:03] RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:03:04] RECOVERY - nova-compute proc minimum on cloudvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:03:11] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:03:15] RECOVERY - nova-compute proc minimum on cloudvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:03:31] RECOVERY - nova-compute proc minimum on cloudvirt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:03:52] (03PS4) 10Jhedden: cloudvps: update project service discovery prometheus config [puppet] - 
10https://gerrit.wikimedia.org/r/589398 (https://phabricator.wikimedia.org/T250206) [19:06:26] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.28 [19:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:20] 10Operations, 10homer, 10netops: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (10ayounsi) p:05Triage→03Medium [19:07:43] (03PS11) 10Aaron Schulz: Switch "wait for replica" method to use GTIDs for external DB clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 [19:09:34] (03PS7) 10Ayounsi: Netbox driven switch interfaces configuration [homer/public] - 10https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) [19:09:55] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 13.56 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:15:19] (03PS1) 10Ayounsi: WMF sepcific Netbox plugin for interfaces config [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429) [19:15:42] (03PS8) 10Ayounsi: Netbox driven switch interfaces configuration [homer/public] - 10https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) [19:16:37] 10Operations, 10homer, 10netops, 10Patch-For-Review: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (10ayounsi) [19:16:39] 10Operations, 10LDAP-Access-Requests: LDAP/NDA Access Request for mshaver - https://phabricator.wikimedia.org/T250430 (10MNoorWMF) [19:22:54] 10Operations, 10homer, 10netops, 10Patch-For-Review: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (10ayounsi) [19:23:23] 10Operations, 10homer: Homer: add show support - https://phabricator.wikimedia.org/T250413 
(10CDanis) Given that Juniper supports running a single command in the usual way as part of a ssh command line, I’m wondering if a homer/netbox device discovery backend for cumin might work well. [19:23:54] (03Abandoned) 10Ayounsi: Initial netbox interfaces support [software/homer] - 10https://gerrit.wikimedia.org/r/547562 (owner: 10Ayounsi) [19:28:15] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.0125 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:34:59] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:37:12] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kaldari) Oh yes, now I understand what you mean. That's ve... 
[19:38:06] (03PS2) 10Ayounsi: WMF specific Netbox plugin for interfaces config [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429) [19:39:20] (03PS3) 10Ayounsi: WMF specific Netbox plugin for interfaces config [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/589406 (https://phabricator.wikimedia.org/T250429) [19:40:25] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:46:03] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:49:37] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:01:20] !log "beginning deploy of WDQS 0.3.22" [20:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:57] (03CR) 10Krinkle: "So.. this means non-ES use of et DBS (e.g. Flow etc.) currently don't use DBO_TRX on web requests for their questions, but rather auto-com" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525977 (owner: 10Aaron Schulz) [20:12:53] (03PS2) 10Ottomata: Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 [20:14:58] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Traffic, and 2 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (10Krinkle) a:03Krinkle [20:16:49] (03CR) 10jerkins-bot: [V: 04-1] Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 (owner: 10Ottomata) [20:20:31] (03PS1) 10RLazarus: Change all helmfile_log_sal commands from prepare to presync hooks. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/589415 (https://phabricator.wikimedia.org/T248523) [20:21:50] (03PS1) 10Hashar: Merge tag 'debian/1.8.17-1_exp1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 (https://phabricator.wikimedia.org/T242155) [20:22:24] (03CR) 10jerkins-bot: [V: 04-1] Merge tag 'debian/1.8.17-1_exp1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 (https://phabricator.wikimedia.org/T242155) (owner: 10Hashar) [20:22:58] (03PS3) 10Ottomata: Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 [20:28:19] (03CR) 10jerkins-bot: [V: 04-1] Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 (owner: 10Ottomata) [20:29:43] (03PS2) 10Hashar: Merge tag 'debian/1.8.17-1_exp1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 (https://phabricator.wikimedia.org/T242155) [20:30:21] (03PS4) 10Ottomata: Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 [20:30:32] !log mstyles@deploy1001 Started deploy [wdqs/wdqs@1fb52b3]: WDQS version 0.3.22 [20:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:08] (03CR) 10Ottomata: "Looks good I think!" 
[puppet] - 10https://gerrit.wikimedia.org/r/589400 (owner: 10Ottomata) [20:38:35] (03PS5) 10Ottomata: Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 [20:40:51] (03PS6) 10Ottomata: Refactor logstash::input::kafka to DRY ssl_truststore_location logic [puppet] - 10https://gerrit.wikimedia.org/r/589400 [20:42:15] !log mstyles@deploy1001 Finished deploy [wdqs/wdqs@1fb52b3]: WDQS version 0.3.22 (duration: 11m 43s) [20:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:01] (03CR) 10Hashar: "+ Moritz who kindly reviewed the previous backports ( https://gerrit.wikimedia.org/r/#/c/operations/debs/doxygen/+/554942/ )." [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 (https://phabricator.wikimedia.org/T242155) (owner: 10Hashar) [20:43:57] (03PS3) 10Hashar: Merge tag 'debian/1.8.17-1_exp1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 (https://phabricator.wikimedia.org/T250424) [20:44:35] (03CR) 10Hashar: "Better view in Gerrit: https://gerrit.wikimedia.org/r/#/c/operations/debs/doxygen/+/589416/AutoMerge..3" [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/589416 (https://phabricator.wikimedia.org/T250424) (owner: 10Hashar) [20:46:42] (03PS2) 10Alex Monk: ATS TLS material: Reload even if do_ocsp is false [puppet] - 10https://gerrit.wikimedia.org/r/589291 [20:50:19] (03CR) 10Alex Monk: [C: 03+1] cloudvps: update project service discovery prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/589398 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [21:03:10] (03PS1) 10Andrew Bogott: Neutron.conf: change rpc_timeout to 30 [puppet] - 10https://gerrit.wikimedia.org/r/589423 (https://phabricator.wikimedia.org/T205524) [21:05:41] 10Operations, 10observability: production-logstash elastic cluster is yellow state - 
https://phabricator.wikimedia.org/T250133 (10Ottomata) > Set a log.message.timestamp.difference.max.ms value Setting this to the same value of log.retention.ms might make sense. It looks like this has been considered before... [21:07:17] (03CR) 10Andrew Bogott: [C: 03+2] Neutron.conf: change rpc_timeout to 30 [puppet] - 10https://gerrit.wikimedia.org/r/589423 (https://phabricator.wikimedia.org/T205524) (owner: 10Andrew Bogott) [21:12:20] (03CR) 10Jhedden: [C: 03+1] Neutron.conf: change rpc_timeout to 30 [puppet] - 10https://gerrit.wikimedia.org/r/589423 (https://phabricator.wikimedia.org/T205524) (owner: 10Andrew Bogott) [21:23:21] 10Operations, 10MediaWiki-Cache, 10Traffic, 10serviceops, and 3 others: Stop sending purges for `action=history` for linked pages. - https://phabricator.wikimedia.org/T250261 (10daniel) [21:32:24] (03PS1) 10CDanis: nic-saturation-exporter: consistent metric naming across Debian versions [puppet] - 10https://gerrit.wikimedia.org/r/589430 (https://phabricator.wikimedia.org/T250401) [21:34:33] (03CR) 10RLazarus: [C: 03+1] nic-saturation-exporter: consistent metric naming across Debian versions [puppet] - 10https://gerrit.wikimedia.org/r/589430 (https://phabricator.wikimedia.org/T250401) (owner: 10CDanis) [21:35:28] (03CR) 10RLazarus: [C: 03+1] nic-saturation-exporter: consistent metric naming across Debian versions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589430 (https://phabricator.wikimedia.org/T250401) (owner: 10CDanis) [21:36:10] (03CR) 10CDanis: [C: 03+2] nic-saturation-exporter: consistent metric naming across Debian versions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589430 (https://phabricator.wikimedia.org/T250401) (owner: 10CDanis) [21:41:14] (03PS1) 10Mholloway: Bump wikifeeds to 2020-04-16-201212-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/589432 [21:43:18] (03CR) 10Mholloway: [C: 03+2] Bump wikifeeds to 2020-04-16-201212-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/589432 (owner: 10Mholloway) [21:43:35] (03Merged) 10jenkins-bot: Bump wikifeeds to 2020-04-16-201212-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/589432 (owner: 10Mholloway) [21:48:17] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [21:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:08] !log bsitzmann@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [21:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:46] !log bsitzmann@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [21:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:43] 10Operations, 10LDAP-Access-Requests, 10Research: LDAP/NDA Access Request for bmansurov - https://phabricator.wikimedia.org/T250335 (10bmansurov) 05Open→03Resolved Thanks, I can login now. 
[21:59:49] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.28/extensions/FlaggedRevs/: T250439 Don't try to create a Revision with null (duration: 01m 02s) [21:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:58] T250439: Revision.php: $row must be a row object, an associative array, or a RevisionRecord - https://phabricator.wikimedia.org/T250439 [22:10:08] (03CR) 10Jhedden: [C: 03+2] cloudvps: update project service discovery prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/589398 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [22:10:49] !log jforrester@deploy1001 Pruned MediaWiki: 1.35.0-wmf.26 (duration: 05m 26s) [22:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:01] !log reindexing wikis that failed from previous reindex on mwmain1002 [22:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:48] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10Mholloway) [22:43:27] 10Operations, 10Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10bmansurov) @elukey from the logs I see that both 404 and 503 come in pairs. In the recommendation API we ping the MediaWiki API, which sometimes returns 503. We then return a 404 [[ http... 
[22:57:29] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 122 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [22:58:03] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 109.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [22:59:01] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 124.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37 [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200416T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:15:29] 10Operations, 10Toolforge, 10Traffic, 10HTTPS, 10cloud-services-team (Kanban): Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) We have left the POST loophole open for more than a year. Now that we have introduced [[https://wikitech.wikimedia.or... 
[23:20:27] 10Operations, 10Cloud-VPS, 10Traffic, 10HTTPS, 10cloud-services-team (Kanban): Set "https_upgrade" configuration flag for domainproxy to enforce HTTPS upgrade for GET|HEAD requests - https://phabricator.wikimedia.org/T120486 (10bd808) [23:59:49] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37