[00:04:36] (PS1) Subramanya Sastry: Bump wikimedia/parsoid to 0.13.0-a15 [vendor] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/638520 (https://phabricator.wikimedia.org/T262408) [00:08:08] Operations, Performance-Team, serviceops, Patch-For-Review, User-jijiki: MediaWiki to route specific keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (aaron) >>! In T264604#6601576, @jijiki wrote: > @aaron is there a timeline as to when those patches will be merged?... [00:21:17] (PS1) RLazarus: decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. [software/spicerack] - https://gerrit.wikimedia.org/r/638753 [00:24:40] (CR) jerkins-bot: [V: -1] decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. [software/spicerack] - https://gerrit.wikimedia.org/r/638753 (owner: RLazarus) [00:26:22] (PS2) RLazarus: decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. [software/spicerack] - https://gerrit.wikimedia.org/r/638753 [00:36:51] (CR) Huji: "@Ladsgroup given that you are a native Persian speaker, why not add the "fa" translation?" [puppet] - https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: Ladsgroup) [00:42:25] Operations, Wikimedia-Logstash, Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (Krinkle) >>! In T234854#6016076, @Krinkle wrote: > First impressions of the new Logstash/Kibana based on using Firefox 74 for macOS on an idle high-end MacBook Pro using a fas...
[01:52:01] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [01:52:13] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:53:28] PROBLEM - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:53:41] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 6.645 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [01:53:47] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 1.987 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:54:11] here, looking [01:54:23] self-recovered I guess, looking to see if anything still needs to happen [01:54:39] here [01:54:48] capacity issue in codfw because 2002 is broken? 
[01:55:00] RECOVERY - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:55:12] maybe, something happened to tail latency in the last few minutes https://grafana.wikimedia.org/d/000000030/service-kartotherian?orgId=1&refresh=5m&from=now-1h&to=now [01:55:50] it looks like 2001 hiccuped, I'd believe it if that overloaded us because 2002 is broken [01:56:09] I forget what's going on with 2002 exactly, would have to dig back in logs and see what h.nowlan posted [01:56:17] same, checkint [01:56:19] *ing [01:56:43] huh, joined too late for page [01:56:49] but saw its my week of clinic duty... [01:56:53] so i guess i should do that tomorrow. [01:57:31] (I actually am in someone's backyard and have laptop with me, so if not needed im going to logoff). [01:57:45] since cleared, seems ok, so off. [01:58:17] so I think the maps machines in codfw actually maxed out on disk reads? [01:59:03] maps2001 you mean? [01:59:08] or is that the backstory on 2002 [01:59:28] https://i.imgur.com/xD82t6A.png [01:59:47] huh okay [02:00:49] honestly wondering if they finished whatever data rebuilding they were doing and this was, like, a postgres vacuum or a cassandra compaction at the end of it? 
[02:01:02] yeah I was going to say, it looks like a change in workload [02:01:08] yeah, that [02:01:25] they were also definitely copying data amongst one another [02:01:31] if you look at the network per host graphs https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview?orgId=1&var-site=codfw&var-cluster=maps&var-instance=All&var-datasource=thanos&from=now-3h&to=now [02:01:52] yeah just reached the same [02:01:53] ah, and maps2004 came close to saturating on tx [02:02:12] so okay, this sucks but is probably expected [02:02:16] yeah [02:02:25] or at worst, is an unexpected side effect of an expected thing [02:02:34] and once both clusters have 10 machines instead of 4, hopefully sucks less [02:02:44] yep [02:02:59] maps? [02:03:05] lmata: yep [02:03:19] cdanis: h.nowlan is off tomorrow but we can discuss with him Thursday [02:04:04] tail latency looks like it's recovered [02:04:22] sorry left off the graph link https://grafana.wikimedia.org/d/000000030/service-kartotherian?viewPanel=10&orgId=1&refresh=5m&from=now-1h&to=now [02:04:52] did we ever figure out what the units in that graph are btw? [02:04:58] my last best guess was milliseconds [02:05:19] milliseconds sounds right, but I'm at least confident that "up" is "bad" [02:05:25] heh yea [02:06:01] `alias(averageSeries(kartotherian.req.*.*.*.p75), 'p75')` doesn't convey much either [02:07:24] rzl: maybe related? https://phabricator.wikimedia.org/T149889 [02:07:40] haha looks like [02:08:01] > This seems to be a "it would be nice to investigate and sort this out", which doesn't seem to make the cut given that the team is spinning down. As long as we have data for the defined KPIs, digging through the other data does not seem necessary. Accordingly, I am declining this.
[02:08:04] ahahaha [02:08:07] (January 2017) [02:08:18] well, it would be nice to investigate and sort this out [02:08:28] and that was the last engineering effort we spent on Maps once and for all [02:08:48] https://www.youtube.com/watch?v=LpOIPb1_aCU etc etc [02:08:55] well at least its written down 🙂 [02:09:02] okay, none of this is 9 PM work as far as I'm concerned though, now that the latency graph is recovered [02:09:11] have a good evening everyone [02:09:14] cheers rzl [02:09:29] thanks rzl ! [02:09:29] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [02:09:54] oof [02:10:12] that's fine, cassandra on maps2002 has been flapping for days, known thing [02:10:17] not user-facing either [02:10:32] cool then [02:13:48] things do seem fine now, I am also logging off 👋 [02:14:54] :wave [02:15:00] 👋 [02:17:45] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 1.056 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [02:20:55] 10Operations, 10observability, 10serviceops: alert on too many close-to-saturated appservers / apiservers - https://phabricator.wikimedia.org/T267176 (10CDanis) [02:21:55] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:21:59] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:22:47] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [02:28:39] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:28:43] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:29:29] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [04:20:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/556211 (owner: 10Jbond) [04:32:56] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [04:33:39] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [04:39:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [04:45:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) (owner: 10Jbond) [05:03:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:11] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:13] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:15:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:35] (03CR) 10Jcrespo: "I will run puppet compiler and if it is a noop I will just merge. As a reminder, not a lot of work is happening on bacula right now as the" [puppet] - 10https://gerrit.wikimedia.org/r/556211 (owner: 10Jbond) [05:28:13] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:28:17] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:28:49] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [05:50:56] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/26279/" [puppet] - 10https://gerrit.wikimedia.org/r/556211 (owner: 10Jbond) [05:52:37] (03CR) 10Jcrespo: [C: 03+2] backup::director: add type checking and use lookup vs hiera [puppet] - 10https://gerrit.wikimedia.org/r/556211 (owner: 10Jbond) [06:05:50] (03PS1) 10Marostegui: instances.yaml: Add es1026, es1027, es1028 [puppet] - 10https://gerrit.wikimedia.org/r/638882 (https://phabricator.wikimedia.org/T261717) [06:06:59] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add es1026, es1027, es1028 [puppet] - 
10https://gerrit.wikimedia.org/r/638882 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [06:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1011 (re)pooling @ 25%: After cloning es1026 T261717', diff saved to https://phabricator.wikimedia.org/P13152 and previous config saved to /var/cache/conftool/dbconfig/20201104-061339-root.json [06:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:47] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:13:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1012 (re)pooling @ 25%: After cloning es1027 T261717', diff saved to https://phabricator.wikimedia.org/P13153 and previous config saved to /var/cache/conftool/dbconfig/20201104-061355-root.json [06:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1014 (re)pooling @ 25%: After cloning es1028 T261717', diff saved to https://phabricator.wikimedia.org/P13154 and previous config saved to /var/cache/conftool/dbconfig/20201104-061416-root.json [06:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es1026 with minimum weight after recloning T261717', diff saved to https://phabricator.wikimedia.org/P13155 and previous config saved to /var/cache/conftool/dbconfig/20201104-061549-marostegui.json [06:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es1027 with minimum weight after recloning T261717', diff saved to https://phabricator.wikimedia.org/P13156 and previous config saved to /var/cache/conftool/dbconfig/20201104-061829-marostegui.json [06:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:00] (03PS1) 10Marostegui: es1026,es1027,es1028: 
Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/638887 (https://phabricator.wikimedia.org/T261717) [06:20:34] (03CR) 10Marostegui: [C: 03+2] es1026,es1027,es1028: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/638887 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [06:28:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1011 (re)pooling @ 50%: After cloning es1026 T261717', diff saved to https://phabricator.wikimedia.org/P13157 and previous config saved to /var/cache/conftool/dbconfig/20201104-062842-root.json [06:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:50] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:28:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1012 (re)pooling @ 50%: After cloning es1027 T261717', diff saved to https://phabricator.wikimedia.org/P13158 and previous config saved to /var/cache/conftool/dbconfig/20201104-062858-root.json [06:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:17] PROBLEM - ores on ores1005 is CRITICAL: connect to address 10.64.32.14 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:29:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1014 (re)pooling @ 50%: After cloning es1028 T261717', diff saved to https://phabricator.wikimedia.org/P13159 and previous config saved to /var/cache/conftool/dbconfig/20201104-062919-root.json [06:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es1028 with minimum weight after recloning T261717', diff saved to https://phabricator.wikimedia.org/P13160 and previous config saved to /var/cache/conftool/dbconfig/20201104-063028-marostegui.json [06:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:34] (03PS1) 
10Elukey: Revert "Turn off airflow scheduler during db downtime" [puppet] - 10https://gerrit.wikimedia.org/r/638521 [06:36:29] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:58] (03PS1) 10Marostegui: mariadb: Productionize es1029,es1030,es1031 [puppet] - 10https://gerrit.wikimedia.org/r/638890 (https://phabricator.wikimedia.org/T261717) [06:37:42] (03CR) 10Elukey: [C: 03+2] Revert "Turn off airflow scheduler during db downtime" [puppet] - 10https://gerrit.wikimedia.org/r/638521 (owner: 10Elukey) [06:38:26] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es1029,es1030,es1031 [puppet] - 10https://gerrit.wikimedia.org/r/638890 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [06:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1011 (re)pooling @ 75%: After cloning es1026 T261717', diff saved to https://phabricator.wikimedia.org/P13161 and previous config saved to /var/cache/conftool/dbconfig/20201104-064345-root.json [06:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:52] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:44:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1012 (re)pooling @ 75%: After cloning es1027 T261717', diff saved to https://phabricator.wikimedia.org/P13162 and previous config saved to /var/cache/conftool/dbconfig/20201104-064402-root.json [06:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1014 (re)pooling @ 75%: After cloning es1028 T261717', diff saved to https://phabricator.wikimedia.org/P13163 and previous config saved to /var/cache/conftool/dbconfig/20201104-064422-root.json [06:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:32] !log force restart of 
uwsgi-ores on ores1005 - daemon down after reload, max client reached error messages in the logs [06:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:55] RECOVERY - ores on ores1005 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:46:39] RECOVERY - Check systemd state on an-airflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:03] !log set an-presto1004's netbox status as "active" (was: failed) after hw maintenance - T253438 [06:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:10] T253438: an-presto1004 down - https://phabricator.wikimedia.org/T253438 [06:47:37] this should clear the netbox alert in theory --^ [06:51:43] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:09] !log force start of rasdaemon.service on dumpsdata1002 (its auto-restart unit was failing for it) [06:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1011 (re)pooling @ 100%: After cloning es1026 T261717', diff saved to https://phabricator.wikimedia.org/P13164 and previous config saved to /var/cache/conftool/dbconfig/20201104-065849-root.json [06:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:56] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:59:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1012 (re)pooling @ 100%: After cloning es1027 T261717', diff saved to https://phabricator.wikimedia.org/P13165 and previous config saved to /var/cache/conftool/dbconfig/20201104-065905-root.json [06:59:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1014 (re)pooling @ 100%: After cloning es1028 T261717', diff saved to https://phabricator.wikimedia.org/P13166 and previous config saved to /var/cache/conftool/dbconfig/20201104-065926-root.json [06:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: Slowly pool es1026 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13167 and previous config saved to /var/cache/conftool/dbconfig/20201104-065939-root.json [06:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: Slowly pool es1027 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13168 and previous config saved to /var/cache/conftool/dbconfig/20201104-070010-root.json [07:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 10%: Slowly pool es1028 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13169 and previous config saved to /var/cache/conftool/dbconfig/20201104-070020-root.json [07:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:44] !log Stop mysql on es1016, es1013, es1017 to clone es1029, es1030, es1031 T261717 [07:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1016 es1013 es1017 T261717', diff saved to https://phabricator.wikimedia.org/P13170 and previous config saved to /var/cache/conftool/dbconfig/20201104-070121-marostegui.json [07:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:01:35] Operations, ops-eqiad, DC-Ops, serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (elukey) @Cmjohnson icinga reports that mw1267's mgmt is down, can you check? It also reports that PS redundancy is not good :( [07:01:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-test site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:03:27] ongoing postgres puppet failures [07:07:17] I am checking the alert on prometheus, did you maybe recently restart the mariadb process on db1077? [07:07:27] ^ marostegui [07:07:33] what? [07:07:43] I am trying to figure out if it is just maintenance [07:07:53] RECOVERY - Check systemd state on mw1381 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:07:54] jynus: postgres and db1077? [07:08:00] 2 different issues [07:08:06] one postgres [07:08:14] one prometheus scraping on test hosts [07:09:00] can I restart prometheus exporter on db1077? [07:09:05] sure [07:09:10] it has a few errors [07:09:30] maybe a restart will make the alert happy [07:09:33] the problem with db1077 is the disk, not the exporter [07:09:38] !log manual cleanup of mcelog and its wmf-auto-restart (failing) on mw1381 (kernel 4.19, doesn't support mcelog) [07:09:40] oh, how so?
[07:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:44] I will take care of that [07:09:54] oh, I see [07:10:03] ok, then not touching it [07:10:10] leaving it to you [07:10:14] +1 [07:10:25] mystery solved :-D [07:10:46] "Widespread puppet agent failures" is still a think on cloud and maps hosts [07:10:49] *thing [07:11:47] (PS1) Elukey: base::standard_packages: ensure absent mcelog to do auto-cleanup [puppet] - https://gerrit.wikimedia.org/r/638912 [07:12:03] (I pinged art*uro for the cloud failures, maps is still WIP IIUC) [07:12:16] on cloud it is "python3-fusepy" package failing [07:12:30] there is also wdqs [07:12:52] it may just be old errors, it only alerted now for going over a threshold [07:13:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:14:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: Slowly pool es1026 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13171 and previous config saved to /var/cache/conftool/dbconfig/20201104-071443-root.json [07:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:50] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [07:14:57] jynus: I already pinged people for them (see -sre and -discovery, maps is currently WIP IIUC) [07:15:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: Slowly pool es1027 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13172 and previous config saved to /var/cache/conftool/dbconfig/20201104-071513-root.json [07:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:23] !log marostegui@cumin1001 dbctl commit (dc=all):
'es1028 (re)pooling @ 25%: Slowly pool es1028 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13173 and previous config saved to /var/cache/conftool/dbconfig/20201104-071523-root.json [07:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:38] (03PS2) 10Elukey: base::standard_packages: ensure absent mcelog to do auto-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/638912 [07:15:38] thanks, I was worried it was more widespread (base code failing), but it is not [07:29:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: Slowly pool es1026 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13174 and previous config saved to /var/cache/conftool/dbconfig/20201104-072946-root.json [07:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:52] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [07:30:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: Slowly pool es1027 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13175 and previous config saved to /var/cache/conftool/dbconfig/20201104-073017-root.json [07:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: Slowly pool es1028 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13176 and previous config saved to /var/cache/conftool/dbconfig/20201104-073026-root.json [07:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:13] (03PS1) 10Elukey: role::analytics_cluster::coordinator: increase innodb buffer size after RAM expansion [puppet] - 10https://gerrit.wikimedia.org/r/638923 (https://phabricator.wikimedia.org/T257412) [07:32:33] (03CR) 10jerkins-bot: [V: 04-1] 
role::analytics_cluster::coordinator: increase innodb buffer size after RAM expansion [puppet] - 10https://gerrit.wikimedia.org/r/638923 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [07:35:05] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/26281/" [puppet] - 10https://gerrit.wikimedia.org/r/638923 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [07:35:50] (03PS2) 10Elukey: role::analytics_cluster::coordinator: increase innodb buffer size [puppet] - 10https://gerrit.wikimedia.org/r/638923 (https://phabricator.wikimedia.org/T257412) [07:44:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: Slowly pool es1026 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13177 and previous config saved to /var/cache/conftool/dbconfig/20201104-074449-root.json [07:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:57] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [07:45:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: Slowly pool es1027 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13178 and previous config saved to /var/cache/conftool/dbconfig/20201104-074520-root.json [07:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: Slowly pool es1028 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13179 and previous config saved to /var/cache/conftool/dbconfig/20201104-074530-root.json [07:45:31] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:45:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:18] (03PS5) 10Giuseppe Lavagetto: profile::lvs::realserver: add ability to configure poolcounter for pools [puppet] - 10https://gerrit.wikimedia.org/r/635993 (https://phabricator.wikimedia.org/T266055) [07:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: Slowly pool es1026 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13180 and previous config saved to /var/cache/conftool/dbconfig/20201104-075953-root.json [07:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:00] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [08:00:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: Slowly pool es1027 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13181 and previous config saved to /var/cache/conftool/dbconfig/20201104-080024-root.json [08:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: Slowly pool es1028 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13182 and previous config saved to /var/cache/conftool/dbconfig/20201104-080033-root.json [08:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:07] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [08:04:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/638912 (owner: 10Elukey) [08:06:57] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [08:13:41] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [08:33:03] (03CR) 10Elukey: [C: 03+2] base::standard_packages: ensure absent mcelog to do auto-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/638912 (owner: 10Elukey) [08:33:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26283/ this just adds $@ at the end of the call to safe-service-restart as of now." [puppet] - 10https://gerrit.wikimedia.org/r/635993 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [08:35:20] (03CR) 10Elukey: [C: 03+1] spicerack: add requests_session accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/638550 (owner: 10Volans) [08:38:43] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01697 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:40:34] > Duplicate declaration: Package[mcelog] is already declared [08:41:16] elukey: ^ this might be yours [08:41:39] https://puppetboard.wikimedia.org/report/db1119.eqiad.wmnet/8dbe2d4468b72b419d6ecf0e76ab95e471bc6567 [08:41:43] (03PS5) 10Giuseppe Lavagetto: restbase: add poolcounter support to safe-service-restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/635994 (https://phabricator.wikimedia.org/T266055) [08:42:00] kormat: oooff thanks, checking [08:42:33] of course, require_package vs package [08:44:43] ah no I am stupid, it is in the same file [08:44:47] * elukey cries in a corner [08:44:57] !log uploaded freetype 2.5.2+deb8u4+wmf1 to apt.wikimedia.org/jessie-wikimedia [08:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:55] (03PS1) 10Elukey: base::standard_packages: fix duplicate declaration for mcelog [puppet] - 10https://gerrit.wikimedia.org/r/638979 [08:47:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] 
"https://puppet-compiler.wmflabs.org/compiler1003/26284/restbase1025.eqiad.wmnet/fulldiff.html looks correct." [puppet] - 10https://gerrit.wikimedia.org/r/635994 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [08:48:16] moritzm: I am not sure if the above change for mcelog is the right one, can you check if you have time? [08:48:30] looking [08:48:46] I am wondering if we need to ensure mcelog == purged on buster virts as well [08:49:40] kinda, it doesn't get installed on buster in the first place, but the case of upgrades, which are not reimages it can stick around [08:50:13] ok ok [08:50:13] but your patch won't work, as the auto restart config won't handle "purged" as a state [08:50:23] yes right [08:50:47] lemme try another one [08:51:13] absent per se should be fine for the package, it will make sure the package is uninstalled [08:51:22] i.e. in "rc" state in dpkg after Puppet ran [08:53:04] (03PS2) 10Elukey: base::standard_packages: fix duplicate declaration for mcelog [puppet] - 10https://gerrit.wikimedia.org/r/638979 [08:53:40] moritzm: --^ this one should be better, and more along the lines of the original code [08:53:46] checking [08:53:57] (03CR) 10Hashar: gerrit: fix SonarQube report url discovery (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: 10Hashar) [08:54:23] ah no require_package('intel-microcode') needs to be outside the if stretch [08:54:55] (03PS3) 10Elukey: base::standard_packages: fix duplicate declaration for mcelog [puppet] - 10https://gerrit.wikimedia.org/r/638979 [08:55:02] new one --^ [08:55:21] (03CR) 10Muehlenhoff: base::standard_packages: fix duplicate declaration for mcelog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/638979 (owner: 10Elukey) [08:55:49] you are right :( [08:56:13] (03PS4) 10Elukey: base::standard_packages: fix duplicate declaration for mcelog [puppet] - 10https://gerrit.wikimedia.org/r/638979 [08:57:29] 
(03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/638979 (owner: 10Elukey) [08:59:56] (03CR) 10Elukey: [C: 03+2] base::standard_packages: fix duplicate declaration for mcelog [puppet] - 10https://gerrit.wikimedia.org/r/638979 (owner: 10Elukey) [09:00:21] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:00:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:57] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:04] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:01:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:31] kormat: db1119's puppet looks good, the rest of the puppet runs should fix it, sorry and thanks for the heads up [09:01:39] elukey: np :) [09:01:47] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:01:54] (03PS7) 10Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339) [09:01:56] (03PS8) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [09:01:58] (03PS7) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) 
[09:02:28] (03CR) 10jerkins-bot: [V: 04-1] AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [09:02:33] (03CR) 10Ayounsi: Update AssignIPs to handle switch port and cable (037 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [09:11:15] (03PS2) 10Volans: spicerack: add requests_session accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/638550 [09:12:20] (03PS1) 10Filippo Giunchedi: alertmanager: add ack daemon [puppet] - 10https://gerrit.wikimedia.org/r/638998 (https://phabricator.wikimedia.org/T266535) [09:12:23] (03PS1) 10Filippo Giunchedi: alertmanager: enable acks and silences on alerts.w.o [puppet] - 10https://gerrit.wikimedia.org/r/638999 (https://phabricator.wikimedia.org/T266535) [09:12:25] (03PS1) 10Filippo Giunchedi: profile: add prometheus jobs for am acks [puppet] - 10https://gerrit.wikimedia.org/r/639000 (https://phabricator.wikimedia.org/T266535) [09:13:00] (03CR) 10Volans: "Thanks for the patch! Looks good already, but I've proposed some small additions/improvements inline." 
(037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/638753 (owner: 10RLazarus) [09:14:25] (03CR) 10Elukey: [C: 03+1] Use wmflib.requests.http_session everywhere [software/spicerack] - 10https://gerrit.wikimedia.org/r/638551 (owner: 10Volans) [09:17:00] (03PS8) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [09:19:09] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [09:19:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [09:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:46] (03CR) 10Volans: [C: 03+2] spicerack: add requests_session accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/638550 (owner: 10Volans) [09:22:17] (03PS2) 10Volans: Use wmflib.requests.http_session everywhere [software/spicerack] - 10https://gerrit.wikimedia.org/r/638551 [09:24:30] (03CR) 10Volans: [C: 03+1] "LGTM, make sure to test it on netbox-next before merging given the move around of stuff ;)" (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [09:24:50] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:55] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:22] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:25:23] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:45] (03Merged) 10jenkins-bot: spicerack: add requests_session accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/638550 (owner: 10Volans) [09:28:56] (03CR) 10Volans: [C: 03+2] Use wmflib.requests.http_session everywhere [software/spicerack] - 10https://gerrit.wikimedia.org/r/638551 (owner: 10Volans) [09:29:02] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: puppet: Custom type providers - https://phabricator.wikimedia.org/T241160 (10jbond) 05Open→03Resolved Im closing this for now i dont think there is a clear use case [09:31:34] (03CR) 10Jbond: [C: 03+2] standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [09:31:53] (03Merged) 10jenkins-bot: Use wmflib.requests.http_session everywhere [software/spicerack] - 10https://gerrit.wikimedia.org/r/638551 (owner: 10Volans) [09:32:19] (03PS8) 10Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339) [09:32:20] (03PS9) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [09:32:22] (03PS9) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [09:33:15] (03CR) 10Ayounsi: [C: 03+2] Update AssignIPs to handle switch port and cable (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [09:33:54] (03PS1) 10Volans: dependencies: update min version to match Buster [software/spicerack] - 10https://gerrit.wikimedia.org/r/639010 [09:33:56] (03PS1) 10Volans: tests: remove require_* decorators [software/spicerack] - 10https://gerrit.wikimedia.org/r/639011 [09:34:53] 
(03PS3) 10Jbond: standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) [09:36:20] (03CR) 10jerkins-bot: [V: 04-1] standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [09:36:43] (03PS4) 10Jbond: standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) [09:37:25] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001886 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:38:21] (03CR) 10Jbond: [C: 03+2] standard: move none standard class to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617703 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [09:45:34] (03PS1) 10Filippo Giunchedi: swift: split memcached servers and port [puppet] - 10https://gerrit.wikimedia.org/r/639014 [09:47:04] (03PS3) 10Jbond: profile::standard::admin: manage admin groups in profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) [09:49:25] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/26285/" [puppet] - 10https://gerrit.wikimedia.org/r/639014 (owner: 10Filippo Giunchedi) [09:51:56] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [09:54:40] (03PS8) 10Jbond: admin: add tests for system users [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) [09:54:55] (03CR) 10Jbond: [C: 03+2] "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) (owner: 
10Jbond) [09:55:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) (owner: 10Jbond) [10:00:53] (03PS1) 10Giuseppe Lavagetto: Remove stray print [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/639020 [10:00:57] (03PS1) 10Giuseppe Lavagetto: Update changelog [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/639021 [10:02:45] (03CR) 10jerkins-bot: [V: 04-1] Update changelog [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/639021 (owner: 10Giuseppe Lavagetto) [10:03:34] (03CR) 10jerkins-bot: [V: 04-1] Remove stray print [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/639020 (owner: 10Giuseppe Lavagetto) [10:04:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:05:07] <_joe_> !log restarting envoyproxy on restbase20{09,10} to test poolcounter usage by the safe restart scripts [10:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:25] (03PS1) 10Jbond: role::idp: disable X-frame-options for alerts [puppet] - 10https://gerrit.wikimedia.org/r/639046 (https://phabricator.wikimedia.org/T267186) [10:08:55] <_joe_> !log restarting envoyproxy on all of restbase codfw, sending the command in parallel via cumin, to test poolcounter usage by the safe restart scripts [10:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:30] (03PS5) 10Ayounsi: Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) [10:14:39] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020): CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10Elitre) @Trizek-WMF FYI? 
[10:17:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:17:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:27] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:17:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:17:30] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P13184 and previous config saved to /var/cache/conftool/dbconfig/20201104-101729-kormat.json [10:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:28] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session updateVarDumps at mwmaint1002 (wiki=fiwiki; T246539) [10:23:33] (03CR) 10Filippo Giunchedi: "Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/639046 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [10:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:35] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [10:23:42] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. 
T261389', diff saved to https://phabricator.wikimedia.org/P13185 and previous config saved to /var/cache/conftool/dbconfig/20201104-102341-kormat.json [10:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:27:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:31] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:01] (03CR) 10Kosta Harlan: "Thanks again for the fixes, Jeena." [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [10:33:06] (03PS4) 10Jbond: profile::standard::admin: manage admin groups in profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) [10:33:08] (03PS1) 10Jbond: P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) [10:34:41] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:38:35] (03PS5) 10Jbond: profile::standard::admin: manage admin groups in profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) [10:41:29] (03PS2) 10Jbond: P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) [10:44:35] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.033 second response time on 10.192.16.179 port 9042 
https://phabricator.wikimedia.org/T93886 [10:46:45] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:47:47] (03PS3) 10Jbond: P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) [10:48:16] (03PS1) 10Itamar Givon: Revert JS parser commits [extensions/Wikibase] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639035 (https://phabricator.wikimedia.org/T266671) [10:49:04] (03CR) 10jerkins-bot: [V: 04-1] P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:49:46] 10Operations, 10ORES, 10Machine Learning Platform (Current): ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (10fgiunchedi) 05Resolved→03Open Reopening, this is alerting again ` ores.wmflabs.org ORES web node labs ores-web-05 View Extra Service Notes CRITICAL 2020-11-0... [10:53:18] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) >>! In T93049#6597488, @Pchelolo wrote: > So, we have found it: the same exact job has been executed twice. I have... [10:54:09] (03PS4) 10Jbond: P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) [10:58:39] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10User-jijiki: MediaWiki to route specific keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (10jijiki) @aaron mwdebug1001 has the mcrouter configuration we want to roll out when we merge the mediawiki patches,... 
[11:01:32] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:01:33] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:26] (03PS1) 10Ema: varnish: case-insensitive matching for Cache-Control [puppet] - 10https://gerrit.wikimedia.org/r/639062 [11:02:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:02:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:08] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:03:09] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:41] PROBLEM - PHP7 rendering on mw1379 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:04:29] PROBLEM - Apache HTTP on mw1379 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:05:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:05:02] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:19] (03PS5) 10Jbond: P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) [11:07:30] (03CR) 10Ema: "Tests look good:" [puppet] - 10https://gerrit.wikimedia.org/r/639062 (owner: 10Ema) [11:11:58] (03PS1) 10ArielGlenn: add pointer to new revsinfo util for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/639070 (https://phabricator.wikimedia.org/T263319) [11:12:34] (03PS6) 10Jbond: profile::standard::admin: manage admin groups in profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) [11:12:43] (03CR) 10ArielGlenn: [C: 03+2] add pointer to new revsinfo util for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/639070 (https://phabricator.wikimedia.org/T263319) (owner: 10ArielGlenn) [11:13:13] (03CR) 10Effie Mouzeli: "I find it alright, I will not +1 this on the grounds that I am not acquainted with this parts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636094 (https://phabricator.wikimedia.org/T264604) (owner: 10Aaron Schulz) [11:13:30] (03PS7) 10Jbond: profile::standard::admin: manage admin groups in profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617704 
(https://phabricator.wikimedia.org/T247956) [11:13:32] (03PS6) 10Jbond: P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) [11:13:36] (03CR) 10Effie Mouzeli: "I find it alright, I will not +1 this on the grounds that I am not acquainted with this parts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636095 (https://phabricator.wikimedia.org/T264604) (owner: 10Aaron Schulz) [11:14:45] (03PS1) 10Filippo Giunchedi: pontoon: set puppet ca_server during enroll [puppet] - 10https://gerrit.wikimedia.org/r/639072 [11:14:47] (03CR) 10Vgutierrez: [C: 03+1] varnish: case-insensitive matching for Cache-Control [puppet] - 10https://gerrit.wikimedia.org/r/639062 (owner: 10Ema) [11:14:56] (03PS7) 10Jbond: P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) [11:15:11] (03PS8) 10Jbond: P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) [11:15:52] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [11:18:59] (03CR) 10Ema: [C: 03+2] varnish: case-insensitive matching for Cache-Control [puppet] - 10https://gerrit.wikimedia.org/r/639062 (owner: 10Ema) [11:25:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:40] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:12] (03PS1) 10Elukey: profile::hue: parametrize thrift version and fix hive hostname in test [puppet] - 10https://gerrit.wikimedia.org/r/639079 [11:32:16] (03PS2) 10Elukey: profile::hue: parametrize thrift version and fix hive hostname in test [puppet] - 10https://gerrit.wikimedia.org/r/639079 [11:32:42] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:30] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:10] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond) [11:42:43] (03CR) 10Arturo Borrero Gonzalez: "can you please run this change through puppet catalog compiler? I can do it myself if you don't know how to do it." [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [11:43:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "The change LGTM, but please collect +1 from Brooke as well." [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond) [11:44:46] (03PS1) 10Effie Mouzeli: Set debian buster for mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/639089 (https://phabricator.wikimedia.org/T252391) [11:45:37] (03CR) 10Arturo Borrero Gonzalez: "Could you please generate a PCC run for this? Or even better, cherry-pick in codfw1dev to validate if the change is right." 
[puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond) [11:47:31] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:47:33] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:26] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:51:27] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:39] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:52:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:53:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:42] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:54:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:04] jouncebot: refresh just in case [11:55:05] I refreshed my knowledge about deployments. 
[11:56:43] (CR) Effie Mouzeli: [C: +2] Set debian buster for mc1036 [puppet] - https://gerrit.wikimedia.org/r/639089 (https://phabricator.wikimedia.org/T252391) (owner: Effie Mouzeli) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201104T1200). [12:00:05] ItamarWMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:15] Sigh indeed [12:00:16] o/ [12:00:21] but let’s still try to get something deployed [12:00:30] (CR) Lucas Werkmeister (WMDE): [C: +2] Revert JS parser commits [extensions/Wikibase] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639035 (https://phabricator.wikimedia.org/T266671) (owner: Itamar Givon) [12:03:54] Lucas_WMDE: can I sneak in with a config change? [12:04:05] sure [12:04:09] we’re waiting on zuul at the moment [12:05:10] PROBLEM - MariaDB Replica Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3514.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:05:45] kormat: ^ maybe one of the restarts?
[12:05:58] ah, replication isn't running there [12:06:00] starting it there [12:06:16] (PS1) Urbanecm: Add www.irishstatutebook.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/639090 (https://phabricator.wikimedia.org/T267193) [12:06:18] (CR) Urbanecm: [C: +2] Add www.irishstatutebook.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/639090 (https://phabricator.wikimedia.org/T267193) (owner: Urbanecm) [12:07:15] (Merged) jenkins-bot: Add www.irishstatutebook.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/639090 (https://phabricator.wikimedia.org/T267193) (owner: Urbanecm) [12:08:58] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ed3c43dc4488205663e6694b7ddfa991e3f3d4b9: Add www.irishstatutebook.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T267193) (duration: 01m 02s) [12:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:04] T267193: Add www.irishstatutebook.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T267193 [12:09:12] RECOVERY - MariaDB Replica Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:09:36] !log scap-sync file returned `snapshot1010.eqiad.wmnet returned [255]: Host key verification failed.` [12:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:47] ^^ is that an issue?
^^ [12:10:28] o_O [12:11:05] no SAL messages about snapshot1010 since March, when ariel “brought it up to date” [12:11:19] !log Run scap pull at snapshot1010 manually [12:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:50] ftr: asked for sre input in -sre [12:14:57] not deploying anything until this is clarified [12:17:16] marostegui: fyi, I plan to proceed with T253802 later today and enable it everywhere - unless you have any issue with that, of course. I'll create a monitoring task once that's done. [12:17:17] T253802: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 [12:17:24] (PS1) Marostegui: install_server: Do not reimage es1026-es1028. [puppet] - https://gerrit.wikimedia.org/r/639094 [12:17:47] Urbanecm: +1, thanks [12:18:08] (PS1) Urbanecm: Enable wgCheckUserLogLogins at all wikis but loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/639095 (https://phabricator.wikimedia.org/T253802) [12:19:04] (CR) Marostegui: [C: +2] install_server: Do not reimage es1026-es1028. [puppet] - https://gerrit.wikimedia.org/r/639094 (owner: Marostegui) [12:20:26] <_joe_> Lucas_WMDE: someone reinstalled snapshot1010 this morning I think [12:20:31] <_joe_> I saw the host key changing [12:21:33] <_joe_> apergos: did you change anything on snapshot1010 today? [12:21:39] no [12:21:53] no one had better have reinstalled it either [12:22:05] so why does scap fail to deploy there? [12:22:18] ariel@snapshot1010:~$ uptime [12:22:18] 12:22:10 up 114 days, 22 min, 1 user, load average: 30.81, 30.89, 31.35 [12:22:59] no idea [12:23:06] that's standard load and looks normal [12:24:45] host key verification failed? seriously?
all the things in /etc/ssh are from feb 2020 [12:25:21] <_joe_> apergos: I definitely saw a key changing on another machine for snapshot1010 [12:25:44] unless I misunderstand the message, yes, host key verification failed at snapshot1010 https://usercontent.irccloud-cdn.com/file/pWQcuW8S/image.png [12:28:11] (03Merged) 10jenkins-bot: Revert JS parser commits [extensions/Wikibase] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639035 (https://phabricator.wikimedia.org/T266671) (owner: 10Itamar Givon) [12:29:01] I see keys added to known hosts for all of snapshot1007,8,9,10 when looking at puppet logs on another host, I guarantee nothing has actually changed there but I see added these: ecdsa- [12:29:01] sha2-nistp256 [12:29:32] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/637742 (owner: 10Dzahn) [12:29:58] how did it succeed on the rest of the hosts then? [12:30:11] no idea [12:30:17] undeterministic choice of keys? [12:30:21] and again I see nothing in puppet about an actual change to these keys on the hosts themselves [12:32:02] <_joe_> apergos: uhm something's wrong, basically the ssh host key for snapshot1010 is not being collected [12:32:24] <_joe_> apergos: is this machine running puppet at all? [12:33:10] (our Wikibase backport got merged by zuul now btw) [12:33:15] (watching chat to see when it’s safe to sync…) [12:34:11] -rw-r----- 1 root adm 23011 Nov 4 09:49 /var/log/puppet.log [12:34:16] but it doesn't indicate disabled [12:34:23] can someone have disabled it earlier today? 
[12:34:48] <_joe_> apergos: no that would not cause this [12:34:52] <_joe_> it's all very strange [12:34:56] it sure is [12:35:08] (03PS1) 10Jbond: cas: gradle seems to have switch to using implmentation for dependencies [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/639098 (https://phabricator.wikimedia.org/T265857) [12:35:21] <_joe_> jbond42: help :P [12:35:43] * jbond42 looking [12:35:45] Nov 4 12:19:52 snapshot1010 puppet-agent[36949]: Applied catalog in 7.01 seconds [12:35:47] it did run [12:35:47] so [12:35:56] <_joe_> tldr: it seems snapshot1010 is not reporting its ssh host key [12:37:38] a random idea: i see /home/urbanecm/.ssh/known_hosts exists at deploy1001, and https://github.com/wikimedia/scap/blob/master/scap/ssh.py#L39 doesn't indicate where the keys come from - can it come from my home by some weird way? dunno how any key for snapshot1010 would get there, but... [12:38:02] (03PS1) 10Effie Mouzeli: mc1036: Initial memcached 1.5 tuning [puppet] - 10https://gerrit.wikimedia.org/r/639099 (https://phabricator.wikimedia.org/T252391) [12:38:32] Urbanecm: how long is your ~/.ssh/known_hosts? mine only has two lines [12:38:45] (I’m not allowed to read yours) [12:38:45] no entry for snapshot anything in there [12:38:45] 6 lines [12:39:09] <_joe_> that's a red herring [12:39:47] okay, it might be an incorrect idea - I'm not insisting :) [12:39:53] _joe_: apergos: it looks like its caused by me https://gerrit.wikimedia.org/r/c/operations/puppet/+/617703/4/modules/standard/manifests/init.pp and the fact that role::dumps::generation::worker::dumper_monitor pulls standard in and not profile::standard [12:40:05] ahsigh [12:40:07] should I try to sync my / itamarWMDE’s backport and see how it goes? 
[12:40:14] ok but you found it, that's the important thing [12:40:29] <_joe_> Lucas_WMDE: please not now [12:40:29] yes will push a fix in a sec [12:40:33] ok [12:42:33] (03PS1) 10Jbond: standard: switch roles to call profile::standard not standard directly [puppet] - 10https://gerrit.wikimedia.org/r/639104 [12:43:17] (03CR) 10Jbond: [C: 03+2] standard: switch roles to call profile::standard not standard directly [puppet] - 10https://gerrit.wikimedia.org/r/639104 (owner: 10Jbond) [12:45:05] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:45:09] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:45:50] PROBLEM - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:46:13] * volans in interview [12:46:26] <_joe_> I am about to go to lunch [12:46:27] Urbanecm: apergos: Lucas_WMDE: _joe_: i have pushed a fix now and run puppet on snapshot1010 and deploy so the key is on deploy1001. it will take 30mins for the key to rol out everywhere et me know if that needs speeding up on any specific machines [12:46:29] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 2.857 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:46:37] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.006 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:46:39] <_joe_> ok... 
[12:46:44] thanks for the fix [12:46:47] <_joe_> jbond42: deploy1001 is all we need [12:46:48] jbond42: ack. Thanks for the fix [12:46:54] ack thanks [12:46:55] thanks jbond42 [12:47:02] <_joe_> Urbanecm: you can resume your release [12:47:07] thank you! [12:47:18] Lucas_WMDE: the floor is yours then :) [12:47:20] <_joe_> can someone else check what's up with kartotherian please? [12:47:24] (03PS1) 10Joal: Bump AQS druid backend datasource to 2020-10 [puppet] - 10https://gerrit.wikimedia.org/r/639126 [12:47:25] ok [12:47:32] (ok to Urbanecm not _joe_ sorry) [12:47:45] <_joe_> jayme / akosiaris maybe? [12:47:57] _joe_: ill start taking a look but if there is someone with more maps knowlage that would be great [12:48:27] <_joe_> jbond42: I hardly have more. gehel might be able to help [12:48:31] <_joe_> but to be clear [12:48:41] <_joe_> it recovered, now we just have the VO incident still firing [12:48:46] <_joe_> so it's not super urgent [12:48:46] RECOVERY - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:48:50] ack [12:49:36] it looks like there’s nothing to do for our backport, actually – it’s for wmf.16 but that branch doesn’t exist on deploy1001 yet [12:49:41] so I guess this will roll out with the train later today [12:50:02] and the backport should be part of the initial clone / scap prep iiuc [12:50:22] Lucas_WMDE: seems correct to me [12:50:42] (the backport was already in wmf.14 so it’s not like it’s risky or needs testing, IMHO) [12:50:43] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:50:46] ok then I think we’re done [12:50:53] !log EU backport&config done [12:50:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:25] (03CR) 10Kormat: [C: 03+1] pontoon: set puppet ca_server during enroll [puppet] - 10https://gerrit.wikimedia.org/r/639072 (owner: 10Filippo Giunchedi) [12:51:50] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:52:18] (03CR) 10Meno25: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup) [12:53:55] PROBLEM - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:54:55] RECOVERY - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.804 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:55:10] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:55:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:45] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:55:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:52] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [12:57:44] RECOVERY - cassandra CQL 10.192.16.179:9042 on 
maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [12:58:29] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:58:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] hey, got pages [13:00:09] jbond42: I'm back now as well - if I can be of any help...but without any maps knowledge ofc. [13:00:33] i have seen casandra on maps 2002 take up l aload of cpu and then die [13:00:51] there are also errors like "20-11-04T12:56:34tilerator [13:00:51] maps2006 [13:00:52] first worker died during startup, continue startup" [13:00:55] in logstash [13:02:10] cassandra on maps2002 seems errors with java.rmi.ConnectException: Connection refused to host: 10.192.16.179 (this is the ip of maps2002) [13:03:36] hi, I'm here too if help is needed with karto/maps [13:04:50] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: set puppet ca_server during enroll [puppet] - 10https://gerrit.wikimedia.org/r/639072 (owner: 10Filippo Giunchedi) [13:04:56] srv is full on maps2002 [13:05:21] ]: WARN 12:59:35 insufficient space to compact all requested files BigTableReader(path='/srv/cassandra/data/v4/tiles-9cd67630304811e98bea0bfde337e2e2/la-64832-big-Data.db'), BigTableReader(path='/srv/cassandra/data/v4/tiles-9cd67630304811e98bea0bfde337e2e2/la-64831-big-Data.db') [13:05:27] Nov 4 12:59:35 maps2002 cassandra[30332]: ERROR 12:59:35 Exception in thread Thread[CompactionExecutor:1,1,main] [13:05:30] Nov 4 12:59:35 maps2002 cassandra[30332]: java.lang.RuntimeException: Not enough space for compaction, estimated sstables = 1, expected write size = 3554802679 [13:06:12] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:06:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
[13:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:55] so, maybe something something about cassandra removing maps2002 from the cluster? I see tons of traffic in https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview?viewPanel=84&orgId=1&var-site=codfw&var-cluster=maps&var-instance=All&var-datasource=thanos&from=now-1h&to=now [13:07:02] s/removing/kicking out [13:08:46] s, all maps200{1..4} hosts are pretty close to having /srv filled [13:09:05] but the others all have >150G left [13:09:12] yes the cluster seems pretty unbalanced https://phabricator.wikimedia.org/P13186 [13:09:31] jayme: yeah but at >85% and slowing rising [13:10:13] maps[2005-2010].codfw.wmnet seem new so perhaps the have been added to enable the data to gt balanced a bit better? [13:10:14] do you need a hand? [13:10:41] hmm maps2002 is at 100% for a pretty long time now [13:10:42] I just finished the interview and need to grab something for lunch, but can help if needed [13:11:20] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&from=now-7d&to=now&refresh=5m&var-server=maps2002&var-datasource=thanos&var-cluster=maps [13:11:26] sigh [13:12:24] ryankemper: is working on the new servers https://phabricator.wikimedia.org/T260271 may have some insight [13:18:06] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [13:18:31] godog: ? ^^ [13:18:48] gah, oops(.ie) ! fixed [13:18:53] thx :) [13:19:40] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [13:19:49] maybe also hnowlan - as he is/was working in bringing maps20[05-10] to production ? 
[13:20:05] as of https://phabricator.wikimedia.org/T266820 [13:22:05] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:24:12] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:24:13] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:47] (03CR) 10Muehlenhoff: "Good catch" (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/639098 (https://phabricator.wikimedia.org/T265857) (owner: 10Jbond) [13:26:45] (03PS2) 10Jbond: cas: gradle seems to have switch to using implmentation for dependencies [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/639098 (https://phabricator.wikimedia.org/T265857) [13:26:52] (03CR) 10Jbond: "updated thanks" (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/639098 (https://phabricator.wikimedia.org/T265857) (owner: 10Jbond) [13:30:32] ls -la ../ [13:30:57] ls: cannot open directory '../': Permission denied [13:31:14] hehehe [13:35:03] !log restart mysqls at db1095 T266483 [13:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:09] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [13:36:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:36:24] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:36:25] 10Operations, 10MediaWiki-General, 10serviceops, 10MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Aklapper) @holger.knust : Could you please answer the last commen... 
[13:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:11] !log restart mysqls at db1102 T266483 [13:40:13] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Cmjohnson) a:05RobH→03Jclark-ctr John, on Thursday can you swap the motherboard out please. The new one is the flex space. [13:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:18] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [13:40:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me! (also checked the PCC output for all roles)" [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:41:50] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) I will have db1139 down and downtimed for a day by Thursday, unless you tell me not to. [13:41:56] 10Operations, 10ops-eqiad, 10DC-Ops: Update Documentation for dl360 Motherboard Swap - https://phabricator.wikimedia.org/T254272 (10Cmjohnson) John, you can use the db1139 swap to assist with the documentation. 
[13:43:13] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:43:13] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:23] !log restart mysqls at db1116 T266483 [13:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:11] akosiaris: the traffic spike seems not ewlated to 2002 but 2004 sending to 2001,2003 and 2006 [13:47:19] !log restart mysqls at db1139 T266483 [13:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:25] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [13:51:42] !log restart mysqls at db1140 T266483 [13:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:06] !log restart mysqls at db1145 T266483 [13:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:12] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [13:56:35] (03PS1) 10Elukey: role::mediawiki::memcached::gutter: set size as multiple of 1024 [puppet] - 10https://gerrit.wikimedia.org/r/639152 [13:59:18] !log restart mysqls at db1150 T266483 [13:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:24] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [13:59:40] (03PS1) 10Alexandros Kosiaris: kartotherian: Don't page SREs on failure [puppet] - 10https://gerrit.wikimedia.org/r/639154 [14:03:49] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:03:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:07:20] (03CR) 10Elukey: [C: 03+1] P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:09:04] (03CR) 10Ottomata: "Hm, ok I get the reasoning here, but would the more proper thing to do be to make admin::groups global?" [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:09:13] (03CR) 10Ottomata: [C: 03+1] "+1 either way :)" [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:14:59] !log restart mysqls at db209[789],db210[01], db2139, db2141 T266483 [14:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:05] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [14:17:27] !log upload 4.8.0-1+deb10u1 to buster-wikimedia [14:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:33] ahahah [14:17:39] missed "Hue" [14:18:06] * elukey tries to fix the error before kormat notices it [14:18:18] :D [14:19:13] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:19:14] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:24] we have have public record that elukey misses Hue \o/ [14:19:40] * elukey cries in a corner [14:20:04] (03CR) 10Jbond: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:20:34] (03CR) 10Jbond: 
[C: 03+2] profile::standard::admin: manage admin groups in profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/617704 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:20:38] (03CR) 10Jbond: [C: 03+2] P:analytics::jupyterhub: pick up the admin groups from P:standard [puppet] - 10https://gerrit.wikimedia.org/r/639050 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:21:20] (03CR) 10Elukey: [C: 03+2] profile::hue: parametrize thrift version and fix hive hostname in test [puppet] - 10https://gerrit.wikimedia.org/r/639079 (owner: 10Elukey) [14:21:27] (03PS3) 10Elukey: profile::hue: parametrize thrift version and fix hive hostname in test [puppet] - 10https://gerrit.wikimedia.org/r/639079 [14:21:32] 10Operations, 10Traffic, 10serviceops, 10HTTPS: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10Nintendofan885) [14:21:48] 10Operations, 10Cloud-Services, 10Traffic, 10HTTPS, 10cloud-services-team (Kanban): cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829 (10Nintendofan885) [14:22:21] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10HTTPS: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10Nintendofan885) [14:24:34] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se website, 10HTTPS: Set HSTS on wikiba.se (force HTTPS) - https://phabricator.wikimedia.org/T232246 (10Nintendofan885) [14:26:02] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se website, 10HTTPS: Set HSTS on wikiba.se (force HTTPS) - https://phabricator.wikimedia.org/T232246 (10Nintendofan885) [14:26:30] (03CR) 10Muehlenhoff: [C: 03+1] "The problem statement is spot-on and I agree with the actionable" [puppet] - 10https://gerrit.wikimedia.org/r/639154 (owner: 10Alexandros Kosiaris) [14:33:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update to 1.16.15 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/638595 
(https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [14:33:22] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Remove stray print [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/639020 (owner: 10Giuseppe Lavagetto) [14:34:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:34:57] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:28] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:35:29] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:35:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:57] (03PS1) 10Elukey: role::analytics_test_cluster::hadoop::ui: set correct hive service [puppet] - 10https://gerrit.wikimedia.org/r/639163 [14:36:20] (03CR) 10Huji: [C: 03+1] Enable wgCheckUserLogLogins at all wikis but loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639095 (https://phabricator.wikimedia.org/T253802) (owner: 10Urbanecm) [14:36:23] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop::ui: set correct hive service [puppet] - 10https://gerrit.wikimedia.org/r/639163 (owner: 10Elukey) [14:37:55] !log restart mysql at db1133 T266483 [14:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:01] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 
[14:42:06] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Trizek-WMF) [14:42:16] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) [14:42:26] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) 05Open→03Resolved Retro: * It went well. * I had no messages from community members on places I monitored. * We had a g... [14:43:02] (03PS3) 10Ottomata: Migrate ContentTranslationAbuseFilter event stream to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634339 (https://phabricator.wikimedia.org/T259163) [14:43:22] (03PS4) 10Ottomata: Migrate ContentTranslationAbuseFilter event stream to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634339 (https://phabricator.wikimedia.org/T259163) [14:43:24] (03CR) 10Elukey: [C: 03+2] Bump AQS druid backend datasource to 2020-10 [puppet] - 10https://gerrit.wikimedia.org/r/639126 (owner: 10Joal) [14:43:30] (03PS2) 10Elukey: Bump AQS druid backend datasource to 2020-10 [puppet] - 10https://gerrit.wikimedia.org/r/639126 (owner: 10Joal) [14:46:23] (03PS1) 10Muehlenhoff: Don't write out Prometheus config if prometheus actuator is not enabled [puppet] - 10https://gerrit.wikimedia.org/r/639169 [14:46:25] (03CR) 10jerkins-bot: [V: 04-1] Don't write out Prometheus config if prometheus actuator is not enabled [puppet] - 10https://gerrit.wikimedia.org/r/639169 (owner: 10Muehlenhoff) [14:46:43] (03CR) 10Ottomata: [C: 03+2] Migrate ContentTranslationAbuseFilter event stream to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634339 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata) [14:49:33] (03PS2) 10Muehlenhoff: Don't write out Prometheus 
config if prometheus actuator is not enabled [puppet] - 10https://gerrit.wikimedia.org/r/639169 [14:53:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ditto" [puppet] - 10https://gerrit.wikimedia.org/r/639154 (owner: 10Alexandros Kosiaris) [14:54:53] 10Operations, 10DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (10Kormat) [14:56:31] (03PS1) 10Jbond: profile::standard: add default for admn gruops [puppet] - 10https://gerrit.wikimedia.org/r/639175 (https://phabricator.wikimedia.org/T247956) [14:57:06] (03CR) 10Jbond: [C: 03+2] profile::standard: add default for admn gruops [puppet] - 10https://gerrit.wikimedia.org/r/639175 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:57:58] (03PS1) 10Elukey: hive: use the FQDN as metastore_host instead of the DNS CNAME [puppet] - 10https://gerrit.wikimedia.org/r/639176 [15:00:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Reduce reconnectTimeout for etcd to 0.1 seconds [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/631686 (https://phabricator.wikimedia.org/T264362) (owner: 10Giuseppe Lavagetto) [15:01:00] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26295/an-test-coord1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/639176 (owner: 10Elukey) [15:01:12] (03Merged) 10jenkins-bot: Reduce reconnectTimeout for etcd to 0.1 seconds [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/631686 (https://phabricator.wikimedia.org/T264362) (owner: 10Giuseppe Lavagetto) [15:06:58] 10Operations, 10DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (10Kormat) I did a second reboot while attached to the console. It hung at "Loading ramdisk..." for a minute or two, and then finally booted successfully. 
[15:07:12] 10Operations, 10DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (10Kormat) [15:07:15] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Kormat) [15:07:30] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) @jcrespo thanks please have host down will change mainboard tomorrow [15:08:40] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) Thanks, will do and report here when done (will do on my -Europe- morning). [15:09:27] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate ContentTranslationAbuseFilter event stream to EventGate on testwiki - T259163 (duration: 00m 59s) [15:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:34] T259163: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 [15:17:13] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26296/" [puppet] - 10https://gerrit.wikimedia.org/r/639169 (owner: 10Muehlenhoff) [15:17:39] (03PS1) 10Ottomata: Migrate ContentTranslationAbuseFilter event stream to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639183 (https://phabricator.wikimedia.org/T259163) [15:18:21] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:09] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [15:19:09] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:19:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:48] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:20:21] (03CR) 10Effie Mouzeli: [C: 03+2] role::mediawiki::memcached::gutter: set size as multiple of 1024 [puppet] - 10https://gerrit.wikimedia.org/r/639152 (owner: 10Elukey) [15:20:39] (03PS2) 10Effie Mouzeli: role::mediawiki::memcached::gutter: set size as multiple of 1024 [puppet] - 10https://gerrit.wikimedia.org/r/639152 (owner: 10Elukey) [15:21:20] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [15:24:11] (03CR) 10Ottomata: [C: 03+2] Migrate ContentTranslationAbuseFilter event stream to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639183 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata) [15:25:26] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:25:29] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate ContentTranslationAbuseFilter event stream to EventGate on all wikis - T259163 (duration: 00m 58s) [15:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:36] T259163: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 [15:26:04] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/639098 (https://phabricator.wikimedia.org/T265857) (owner: 10Jbond) [15:26:24] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:27:10] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:27:46] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:06] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [15:29:07] (03PS2) 10Effie Mouzeli: mc1036: Initial memcached 1.5.x tuning [puppet] - 10https://gerrit.wikimedia.org/r/639099 (https://phabricator.wikimedia.org/T252391) [15:33:30] (03PS2) 10Jbond: taskgen: add new CI check to ensure hiera keys are valid [puppet] - 10https://gerrit.wikimedia.org/r/580921 (https://phabricator.wikimedia.org/T247956) [15:33:51] (03CR) 10jerkins-bot: [V: 04-1] taskgen: add new CI check to ensure hiera keys are valid [puppet] - 10https://gerrit.wikimedia.org/r/580921 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [15:33:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:00] (03PS1) 10Ottomata: Refine ContentTranslationAbuseFilter using new eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/639187 (https://phabricator.wikimedia.org/T259163) [15:38:00] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/639169 (owner: 10Muehlenhoff) [15:38:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] cas: gradle seems to have switch to using implmentation for dependencies [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/639098 (https://phabricator.wikimedia.org/T265857) (owner: 10Jbond) [15:39:11] !log Reimage mc1036 to buster - T252391 [15:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:17] T252391: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 [15:40:27] wow [15:41:38] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10RhinosF1) Should this task be unstalled? 
[15:42:28] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) 05Stalled→03Open It should, fixing :) [15:42:30] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat) [15:43:03] (03CR) 10Ottomata: [C: 03+2] Refine ContentTranslationAbuseFilter using new eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/639187 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata) [15:44:54] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat) [15:47:23] (03Abandoned) 10JMeybohm: Update to 1.16.15 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/638595 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [15:47:53] (03PS1) 10JMeybohm: Update to 1.16.15 [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/639192 (https://phabricator.wikimedia.org/T266766) [15:50:35] (03CR) 10Herron: [C: 03+1] "agree as well, sgtm" [puppet] - 10https://gerrit.wikimedia.org/r/639154 (owner: 10Alexandros Kosiaris) [15:51:24] (03CR) 10C. Scott Ananian: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a15 [vendor] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/638520 (https://phabricator.wikimedia.org/T262408) (owner: 10Subramanya Sastry) [15:52:54] (03PS1) 10Muehlenhoff: Disable prometheus actuator/JMX for now [puppet] - 10https://gerrit.wikimedia.org/r/639194 [16:00:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is pretty awesome. I 've got 2 minor comments but kudos!" 
(032 comments) [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [16:00:57] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26297/" [puppet] - 10https://gerrit.wikimedia.org/r/639194 (owner: 10Muehlenhoff) [16:02:02] 10Operations, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10RobH) >>! In T267040#6598754, @Dzahn wrote: > @RobH See above, I did these things to verify the user but on vacation from tomorow. Since it's a global root access and I see you ar... [16:09:00] (03PS1) 10Dave Pifke: [WIP] webperf: change navtiming to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639197 [16:10:20] (03PS1) 10RobH: Adding dcaro to ops group [puppet] - 10https://gerrit.wikimedia.org/r/639198 (https://phabricator.wikimedia.org/T267040) [16:10:46] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [16:10:47] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:53] (03CR) 10jerkins-bot: [V: 04-1] Adding dcaro to ops group [puppet] - 10https://gerrit.wikimedia.org/r/639198 (https://phabricator.wikimedia.org/T267040) (owner: 10RobH) [16:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:03] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10RobH) [16:12:02] (03PS2) 10RobH: Adding dcaro to ops group [puppet] - 10https://gerrit.wikimedia.org/r/639198 (https://phabricator.wikimedia.org/T267040) [16:12:13] (03CR) 10JMeybohm: Package binary kubernetes releases (032 comments) [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [16:13:41] (03PS3) 10JMeybohm: 
Package binary kubernetes releases [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 (https://phabricator.wikimedia.org/T266766) [16:13:43] (03PS2) 10JMeybohm: Update to 1.16.15 [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/639192 (https://phabricator.wikimedia.org/T266766) [16:14:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10RobH) a:03faidon I've not had to handle an access request with quite this much scope (global root) since our new policies for approval took effect. I'm as... [16:15:20] 10Operations, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10jijiki) [16:17:28] 10Operations, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10jijiki) @thcipriani we will be upgrading to ICU 63 on the 16th Nov 2020. Since we will be restarting php-fpm across the cluster that day, can we put a note about this on the deployment calendar? [16:17:39] (03PS2) 10Dave Pifke: [WIP] webperf: change navtiming to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639197 [16:22:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10RobH) " Requesting access to GLOBAL ROOT for David Caro https://phabricator.wikimedia.org/T267040 PROCEED, for later changes, follow the ownership of WMCS... [16:23:19] 10Operations, 10Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10elukey) Haven't seen the issue for a while, maybe it is worth closing since there is already an upstream bug opened for Debian Buster. Thoughts? 
[16:24:09] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Patch-For-Review, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) Reporting in here a chat with Chris - the maintenance is postponed to tomorrow (5th) [16:24:28] (03CR) 10RobH: [C: 03+2] Adding dcaro to ops group [puppet] - 10https://gerrit.wikimedia.org/r/639198 (https://phabricator.wikimedia.org/T267040) (owner: 10RobH) [16:25:10] PROBLEM - Host ms-be2057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:25:25] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) >>! In T260445#6504248, @elukey wrote: > @Cmjohnson I checked the items listed in the package slip but I don't see the quantity, only the fa... [16:25:49] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10RobH) a:05faidon→03None [16:25:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10RobH) 05Open→03Resolved @dcaro: Your rights as 'ops' into the global root group have been merged live. Please allow an hour or so for this to propagate... 
[16:26:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to GLOBAL ROOT for David Caro - https://phabricator.wikimedia.org/T267040 (10RobH) [16:26:22] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 2.158e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:27:19] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a15 [vendor] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/638520 (https://phabricator.wikimedia.org/T262408) (owner: 10Subramanya Sastry) [16:28:04] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 9 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:28:07] (03PS3) 10Dave Pifke: [WIP] webperf: change navtiming to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639197 [16:29:23] (03Abandoned) 10Dzahn: systemd::timer: fix TODO of adding type definition for timer job [puppet] - 10https://gerrit.wikimedia.org/r/633853 (owner: 10Dzahn) [16:29:40] ok I am confused, I checked on mw1405 and mc-gp100x got listed as tkoed [16:30:24] 10Operations, 10ops-eqiad: Degraded RAID on an-presto1004 - https://phabricator.wikimedia.org/T267160 (10Cmjohnson) Sent the TSR report to Dell for a new disk [16:31:21] and only from a few hosts, mw140x [16:31:52] (03PS2) 10Dzahn: apache: add 20.wikipedia.org redirect to wikimediafoundation site [puppet] - 10https://gerrit.wikimedia.org/r/636755 (https://phabricator.wikimedia.org/T264367) [16:33:36] is there any maintenance ongoing for mw140x ? [16:35:24] ah all in C3 [16:35:35] Cc: effie: --^ [16:37:30] (03CR) 10C. Scott Ananian: [C: 04-1] "C-1, but the bump could be a follow up patch. 
Usually happens automatically when a patch is cherry-picked to mediawiki-vendor (but see T2" (031 comment) [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/638236 (https://phabricator.wikimedia.org/T263182) (owner: 10TrainBranchBot) [16:39:54] i am on site right now i was in c4 [16:40:07] !log 1.36.0-wmf.16 was branched at f51ccd2ccef8cba0e7d874b6f7cf4b73bcd36636 for T263182 [16:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:12] T263182: 1.36.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T263182 [16:40:29] @eluky is there still a problem in c3? [16:40:53] jclark-ctr: nono it was a weird blip, the no network connectivity to some memcached nodes, that auto-resolved [16:40:53] i don`t see anything off [16:41:04] ok thanks [16:41:06] (03CR) 10Brennen Bearnes: [C: 03+2] Branch commit for wmf/1.36.0-wmf.16 [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/638236 (https://phabricator.wikimedia.org/T263182) (owner: 10TrainBranchBot) [16:41:45] (03CR) 10Brennen Bearnes: [C: 04-2] Branch commit for wmf/1.36.0-wmf.16 [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/638236 (https://phabricator.wikimedia.org/T263182) (owner: 10TrainBranchBot) [16:41:57] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10LGoto) p:05Triage→03Low [16:42:16] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10LGoto) p:05Triage→03Low [16:44:04] elukey: apart from the servers moving around, but I have not checked today [16:44:05] (03CR) 10Brennen Bearnes: [C: 03+2] Branch commit for wmf/1.36.0-wmf.16 [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/638236 (https://phabricator.wikimedia.org/T263182) (owner: 
10TrainBranchBot) [16:48:28] RECOVERY - Host ms-be2057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.00 ms [16:49:07] (03PS1) 10Jbond: P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) [16:49:10] (03PS1) 10Jbond: O:idp_test: Enable CORS on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) [16:50:50] (03CR) 10jerkins-bot: [V: 04-1] P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [16:51:24] jclark-ctr: is it possible that a link between the top of rack switch in c3 got moved and then re-seated? I am seeing weird logs on the switch, I can ping our netops in case [16:51:40] (03PS1) 10Jdlrobson: Disable the search in header A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639203 (https://phabricator.wikimedia.org/T265333) [16:51:44] 10Operations, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10jijiki) [16:55:28] (03PS2) 10Jbond: P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) [16:57:06] (03CR) 10jerkins-bot: [V: 04-1] P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [16:57:39] 10Operations, 10TechCom, 10serviceops, 10Performance Issue, and 3 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Reedy) [17:00:48] (03PS1) 10Dave Pifke: Enable wgImagePreconnect on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639205 (https://phabricator.wikimedia.org/T123582) [17:01:03] (03PS3) 10Jbond: P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 
(https://phabricator.wikimedia.org/T267186) [17:04:10] (03PS2) 10Jbond: O:idp_test: Enable CORS on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) [17:04:53] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.16 [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/638236 (https://phabricator.wikimedia.org/T263182) (owner: 10TrainBranchBot) [17:05:14] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [17:05:40] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [17:06:17] (03CR) 10Jbond: "wrong pcc on the last update and not ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [17:06:33] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1036.eqiad.wmnet... [17:07:08] !log Reimage mc1036 for real this time [17:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:54] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [17:10:16] 10Operations, 10netops: Network blip for mw hosts in rack C3 (eqiad) - https://phabricator.wikimedia.org/T267242 (10elukey) [17:13:43] 10Operations, 10observability, 10serviceops, 10User-jijiki: alert on too many close-to-saturated appservers / apiservers - https://phabricator.wikimedia.org/T267176 (10jijiki) [17:13:46] !log zpapierski@deploy1001 Started deploy [wikimedia/discovery/analytics@8e8d2d4]: Deploying dc switch [17:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:01] !log zpapierski@deploy1001 Finished deploy [wikimedia/discovery/analytics@8e8d2d4]: Deploying dc switch (duration: 01m 15s) [17:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:39] 10Operations, 10observability, 10serviceops, 10User-jijiki: alert on too many close-to-saturated appservers / apiservers - https://phabricator.wikimedia.org/T267176 (10jijiki) a:05CDanis→03jijiki [17:15:44] 10Operations, 10observability, 10serviceops, 10User-jijiki: alert on too many close-to-saturated appservers / apiservers - https://phabricator.wikimedia.org/T267176 (10jijiki) I agree that is a good idea! [17:16:07] 10Operations, 10ops-eqiad, 10netops: Network blip for mw hosts in rack C3 (eqiad) - https://phabricator.wikimedia.org/T267242 (10ayounsi) p:05Triage→03High @Cmjohnson @Jclark-ctr It's about the DAC between asw2-c2:0/48 and asw2-c3:1/0 Can it be that the cable got bumped? Can you check if it's seated co... 
[17:16:21] (03CR) 10Alexandros Kosiaris: Package binary kubernetes releases (032 comments) [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [17:16:51] 10Operations, 10LDAP-Access-Requests: Add msantos to wmf LDAP group - https://phabricator.wikimedia.org/T267125 (10RobH) So the @msantos account is tied to a gmail account, not a Wikimedia account? That seems not normal, so I've not simply just added this user to the wmf group. @msantos: Do you have an accou... [17:20:09] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (10RobH) >>! In T266995#6595204, @gerritbot wrote: > Change 638019 had a related patch set uploade... [17:20:18] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10Trizek-WMF) p:05Triage→03High a:03Trizek-WMF I can handle it. :) > We expect to start the upgrade no earlier than Monday Nov 16,... 
[17:20:20] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [17:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:18] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (10RobH) [17:25:47] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (10RobH) I've gone ahead and updated this request to reflect reality, as it was under #ldap-access... [17:26:01] gehel: heyas, im trying to move along https://phabricator.wikimedia.org/T266995 as part of clinic duty, but seems you assigned to rkemper for something? [17:26:18] i dont wanna just step in and merge, but it seems like you, as amanger, making the patchset is approval to me and Id otherwise merge this . 
[17:26:43] I also moved from ldap request to access request, as your patchset is changing shell access groups not ldap groups [17:29:12] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:29:29] robh: I was going to merge that patch after meetings this morning, so if you want to deploy that now, go ahead [17:30:01] (03CR) 10Gilles: [C: 03+1] Enable wgImagePreconnect on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639205 (https://phabricator.wikimedia.org/T123582) (owner: 10Dave Pifke) [17:30:02] robh: the only open question was whether we needed general sre approval, but if we don't for shell access then we're good to go [17:30:08] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:30:20] (or rather I should say if gehel's approval itself is sufficient) [17:30:22] ryankemper: nope not needed if the manager of the grup in question approves [17:30:28] so yeah, gehels approval is enough afaik [17:30:32] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:30:36] robh: cool, feel free to proceed! 
[17:30:38] ill comment as such on task and merge things [17:30:43] thanks [17:31:34] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 5.564 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:31:35] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1036.eqiad.wmnet'] ` and were **ALL** successful. [17:31:46] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (10RobH) Synced up with Ryan via IRC. So with the SRE access policies, we just need the approval... [17:31:54] (03PS1) 10Brennen Bearnes: vendor: Bump wikimedia/parsoid to 0.13.0-a15 [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639214 (https://phabricator.wikimedia.org/T262408) [17:31:58] (03CR) 10RobH: [C: 03+2] admin: Trey Jones needs access to support Search Platform Airflow jobs [puppet] - 10https://gerrit.wikimedia.org/r/638019 (https://phabricator.wikimedia.org/T266995) (owner: 10Gehel) [17:32:03] (03PS2) 10RobH: admin: Trey Jones needs access to support Search Platform Airflow jobs [puppet] - 10https://gerrit.wikimedia.org/r/638019 (https://phabricator.wikimedia.org/T266995) (owner: 10Gehel) [17:32:45] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (10RobH) [17:33:25] (03CR) 10Brennen Bearnes: "This look correct?" 
[core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639214 (https://phabricator.wikimedia.org/T262408) (owner: 10Brennen Bearnes) [17:33:58] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (10RobH) 05Open→03Resolved Request completed. [17:35:52] PROBLEM - Host ms-be2057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:36:45] (03PS4) 10Ahmon Dancy: openstack: nova: set cpu_model_extra_flags = vmx,pcid [puppet] - 10https://gerrit.wikimedia.org/r/638146 [17:37:09] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova: set cpu_model_extra_flags = vmx,pcid [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [17:37:24] 10Operations, 10ops-eqiad, 10netops: Network blip for mw hosts in rack C3 (eqiad) - https://phabricator.wikimedia.org/T267242 (10Jclark-ctr) @ayounsi I was working in C2 earlier i did remove old pdu possible i bumped a DAC cable then. it would of been 2-3 hours prior to this ticket/ [17:37:27] (03CR) 10Subramanya Sastry: [C: 03+2] vendor: Bump wikimedia/parsoid to 0.13.0-a15 [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639214 (https://phabricator.wikimedia.org/T262408) (owner: 10Brennen Bearnes) [17:38:17] 10Operations, 10LDAP-Access-Requests: Add msantos to wmf LDAP group - https://phabricator.wikimedia.org/T267125 (10Dzahn) @RobH We already have "mbsantos" in admin/data.yaml with shell accesss and there is a @wikimedia.org email address used there. So shouldn't need a Gerrit change in this case. [17:38:24] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10RLazarus) We just met this morning to sort out our timeline -- current plan is to do the do the upgrade on Nov 16. That means the distur... 
[17:39:10] (03CR) 10Bstorm: [C: 03+1] labstore::fileserver::exports: use sudo::safe_wildcard_cmd [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond) [17:40:03] 10Operations, 10serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (10RLazarus) [17:40:07] (03CR) 10Bstorm: [C: 03+2] dumps nfs: remove probably-unused firewall ports and services [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [17:40:15] 10Operations, 10LDAP-Access-Requests: Add msantos to wmf LDAP group - https://phabricator.wikimedia.org/T267125 (10Dzahn) @hnowlan User "mbsantos" is already in the "wmf" LDAP group. This seems a duplicate. Is a login not working? Is it "msantos" vs "mbsantos" ? [17:41:16] 10Operations, 10SRE-Access-Requests: jiawang uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T267246 (10RobH) [17:41:18] RECOVERY - Host ms-be2057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.92 ms [17:41:30] (03PS1) 10Dave Pifke: [WIP] webperf: convert statsv to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639216 [17:41:37] (03PS5) 10Ahmon Dancy: openstack: nova: set cpu_model_extra_flags = vmx,pcid [puppet] - 10https://gerrit.wikimedia.org/r/638146 [17:42:36] 10Operations, 10ops-eqiad, 10netops: Network blip for mw hosts in rack C3 (eqiad) - https://phabricator.wikimedia.org/T267242 (10wiki_willy) a:03Jclark-ctr [17:42:46] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [17:43:08] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Give Trey jones access necessary to support Search Platform Airflow jobs - https://phabricator.wikimedia.org/T266995 (10TJones) Thanks, @RobH! 
[17:43:15] (03PS1) 10RobH: revoke jwang key [puppet] - 10https://gerrit.wikimedia.org/r/639217 (https://phabricator.wikimedia.org/T267246) [17:44:41] !log holger@mwmaint1002 START - Run updateRestrictions.php [17:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:57] (03PS2) 10RobH: revoke jwang key [puppet] - 10https://gerrit.wikimedia.org/r/639217 (https://phabricator.wikimedia.org/T267246) [17:45:43] (03CR) 10Dzahn: [C: 03+1] revoke jwang key [puppet] - 10https://gerrit.wikimedia.org/r/639217 (https://phabricator.wikimedia.org/T267246) (owner: 10RobH) [17:46:03] (03CR) 10RobH: [C: 03+2] revoke jwang key [puppet] - 10https://gerrit.wikimedia.org/r/639217 (https://phabricator.wikimedia.org/T267246) (owner: 10RobH) [17:49:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: jiawang uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T267246 (10RobH) a:05RobH→03jwang @jwang, I've merged live your revokation of the SSH key used on the production cluster. This was due to it also being us... [17:51:19] 10Operations, 10Analytics: Augment NEL reports with a computed timestamp-of-generation - https://phabricator.wikimedia.org/T266886 (10razzi) @Ottomata could you take a look? [17:53:49] 10Operations, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): Reclaim labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10wiki_willy) a:05wiki_willy→03Cmjohnson Hi @Cmjohnson - I think we can keep them around as spares since they're still still a couple yea... [17:54:26] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:55:05] (03CR) 10Ahmon Dancy: [C: 04-1] "Holding. Looks like Andrew Bogott intentionally added the cpu_mode/cpu_model settings within the ceph conditional." 
[puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [17:57:42] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:57:58] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:00:01] (03PS4) 10Jbond: P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) [18:00:28] 10Operations, 10ops-eqiad, 10netops: Network blip for mw hosts in rack C3 (eqiad) - https://phabricator.wikimedia.org/T267242 (10ayounsi) Thanks for the quick turn-around, let's monitor it for a couple days, and close if no new issues. [18:01:26] (03PS3) 10Jbond: O:idp_test: Enable CORS on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) [18:01:41] (03CR) 10Jbond: "now ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [18:01:54] (03CR) 10Elukey: [C: 03+1] mc1036: Initial memcached 1.5.x tuning [puppet] - 10https://gerrit.wikimedia.org/r/639099 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [18:01:58] (03PS3) 10Effie Mouzeli: mc1036: Initial memcached 1.5.x tuning [puppet] - 10https://gerrit.wikimedia.org/r/639099 (https://phabricator.wikimedia.org/T252391) [18:02:05] Unmerged changes on repository puppet on puppetmaster1001 was ,e [18:02:06] fixed [18:02:12] i had it pending my saying 'yes' [18:02:44] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:03:09] (03PS4) 10Effie Mouzeli: mc1036: Initial memcached 1.5.x tuning [puppet] - 10https://gerrit.wikimedia.org/r/639099 (https://phabricator.wikimedia.org/T252391) [18:03:58] 10Operations, 10LDAP-Access-Requests: Add msantos to wmf LDAP group - https://phabricator.wikimedia.org/T267125 (10MSantos) Maybe there is some confusion: mbsantos is my WMF account and msantos my volunteer account. [18:04:15] (03PS1) 10Bstorm: toolsforge bastion: fix an error in the killer script [puppet] - 10https://gerrit.wikimedia.org/r/639224 (https://phabricator.wikimedia.org/T266300) [18:04:38] (03PS5) 10Jbond: P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) [18:04:40] (03CR) 10jerkins-bot: [V: 04-1] toolsforge bastion: fix an error in the killer script [puppet] - 10https://gerrit.wikimedia.org/r/639224 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [18:04:50] (03PS4) 10Jbond: O:idp_test: Enable CORS on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) [18:04:54] (03CR) 10Bstorm: "Verified this works correctly locally on tools-sgebastion-08" [puppet] - 10https://gerrit.wikimedia.org/r/639224 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [18:05:20] 10Operations, 10ops-codfw: codfw: ms-be2057 reading ony 480GB of RAM and not 512GB - https://phabricator.wikimedia.org/T267252 (10Papaul) [18:05:44] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [18:05:51] 10Operations, 10ops-codfw: codfw: ms-be2057 reading ony 480GB of RAM and not 512GB - https://phabricator.wikimedia.org/T267252 (10Papaul) p:05Triage→03Medium [18:06:16] (03Merged) 10jenkins-bot: vendor: Bump wikimedia/parsoid to 0.13.0-a15 [core] (wmf/1.36.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/639214 (https://phabricator.wikimedia.org/T262408) (owner: 10Brennen Bearnes) [18:06:42] (03CR) 10Bstorm: [V: 03+2 C: 03+2] "The jenkins errors is:" [puppet] - 10https://gerrit.wikimedia.org/r/639224 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [18:06:53] (03PS1) 10Andrew Bogott: labspuppetbackend: add an explicit logfile [puppet] - 10https://gerrit.wikimedia.org/r/639225 [18:06:55] (03CR) 10Effie Mouzeli: [C: 03+2] mc1036: Initial memcached 1.5.x tuning [puppet] - 10https://gerrit.wikimedia.org/r/639099 (https://phabricator.wikimedia.org/T252391) (owner: 10Effie Mouzeli) [18:07:26] (03CR) 10jerkins-bot: [V: 04-1] labspuppetbackend: add an explicit logfile [puppet] - 10https://gerrit.wikimedia.org/r/639225 (owner: 10Andrew Bogott) [18:08:26] (03CR) 10Jbond: "Ready" [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [18:08:35] (03CR) 10MSantos: [C: 03+1] "LGTM. Also, can you please write a follow-up task so we (Product Infrastructure currently maintaining maps) know clearly what to do regard" [puppet] - 10https://gerrit.wikimedia.org/r/639154 (owner: 10Alexandros Kosiaris) [18:09:19] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on almost all Wikipedias ("phase 3") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638201 (https://phabricator.wikimedia.org/T266303) [18:09:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/639194 (owner: 10Muehlenhoff) [18:11:11] (03PS2) 10Andrew Bogott: labspuppetbackend: add an explicit logfile [puppet] - 10https://gerrit.wikimedia.org/r/639225 [18:12:10] (03CR) 10Andrew Bogott: [C: 03+2] labspuppetbackend: add an explicit logfile [puppet] - 10https://gerrit.wikimedia.org/r/639225 (owner: 10Andrew Bogott) [18:13:45] 10Operations, 10SRE-Access-Requests: jiawang uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T267246 (10jwang) 
@RobH, Thank you for catching it. Here is my new ssh key. Feel free to let me know if you have any concern/questions. ` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIO/frOG... [18:14:01] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10Emdosis) The [[ https://en.wikipedia.org/wiki/Physics | Physics]] page I downloaded yesterday (and today) was corrupted as well. W... [18:14:46] (03PS6) 10Jbond: P:idp: add paramters to control CORS [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) [18:14:58] (03PS5) 10Jbond: O:idp_test: Enable CORS on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) [18:15:10] 10Operations, 10LDAP-Access-Requests: Add msantos to wmf LDAP group - https://phabricator.wikimedia.org/T267125 (10Dzahn) @MSantos Gotcha, well, your WMF account is already in the WMF group. So everything should work if you use "mbsantos". [18:15:10] !log holger@mwmaint1002 END - Run updateRestrictions.php [18:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:49] (03PS1) 10Herron: kibana: add param to manage kibana.index, and use for ECS instance [puppet] - 10https://gerrit.wikimedia.org/r/639227 [18:19:08] (03CR) 10CDanis: "overall LGTM, two nits and an idle thought" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [18:19:12] (03CR) 10Phuedx: [C: 03+1] Disable the search in header A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639203 (https://phabricator.wikimedia.org/T265333) (owner: 10Jdlrobson) [18:19:38] 10Operations, 10Analytics: Augment NEL reports with a computed timestamp-of-generation - https://phabricator.wikimedia.org/T266886 (10Ottomata) @Cdanis and I need to discuss whether or not these events should ultimately go to Logstash or to Hive. 
I think this would be possible in either, but in Hive you could... [18:20:01] (03PS1) 10RobH: updating kwang prod key [puppet] - 10https://gerrit.wikimedia.org/r/639228 (https://phabricator.wikimedia.org/T267246) [18:20:34] !log restart memcached on mc1036 to pick up new settings (see https://gerrit.wikimedia.org/r/639099) [18:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:57] (03CR) 10RobH: [C: 03+2] updating kwang prod key [puppet] - 10https://gerrit.wikimedia.org/r/639228 (https://phabricator.wikimedia.org/T267246) (owner: 10RobH) [18:22:16] 10Operations, 10SRE-Access-Requests: jiawang uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T267246 (10RobH) [18:22:33] 10Operations, 10SRE-Access-Requests: jiawang uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T267246 (10RobH) 05Open→03Resolved @jwang, Thanks for the fast response, I've gone ahead and uploaded your new public key to the WMF production cluster. Since it is now merged,... 
[18:32:22] (03PS1) 10Andrew Bogott: labspuppetbackend: fix handling of eqiad.wmflabs -> eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/639233 [18:33:44] (03CR) 10Razzi: "@Moritz, all the usage of this profile in puppet have been removed; are there any other places this might be used (perhaps cloud VMs) or c" [puppet] - 10https://gerrit.wikimedia.org/r/638185 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [18:40:51] (03PS2) 10Andrew Bogott: labspuppetbackend: fix handling of eqiad.wmflabs -> eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/639233 [18:40:53] (03PS1) 10Andrew Bogott: labspuppetbackend: fix ownership of logfile [puppet] - 10https://gerrit.wikimedia.org/r/639235 [18:41:22] (03CR) 10jerkins-bot: [V: 04-1] labspuppetbackend: fix ownership of logfile [puppet] - 10https://gerrit.wikimedia.org/r/639235 (owner: 10Andrew Bogott) [18:43:14] (03PS2) 10Andrew Bogott: labspuppetbackend: fix ownership of logfile [puppet] - 10https://gerrit.wikimedia.org/r/639235 [18:43:15] (03PS3) 10Andrew Bogott: labspuppetbackend: fix handling of eqiad.wmflabs -> eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/639233 [18:43:38] (03CR) 10jerkins-bot: [V: 04-1] labspuppetbackend: fix ownership of logfile [puppet] - 10https://gerrit.wikimedia.org/r/639235 (owner: 10Andrew Bogott) [18:43:46] PROBLEM - Host ms-be2057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:44:48] (03PS3) 10Andrew Bogott: labspuppetbackend: fix ownership of logfile [puppet] - 10https://gerrit.wikimedia.org/r/639235 [18:44:50] (03PS4) 10Andrew Bogott: labspuppetbackend: fix handling of eqiad.wmflabs -> eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/639233 [18:45:48] (03CR) 10Andrew Bogott: [C: 03+2] labspuppetbackend: fix ownership of logfile [puppet] - 10https://gerrit.wikimedia.org/r/639235 (owner: 10Andrew Bogott) [18:45:55] (03CR) 10Andrew Bogott: [C: 03+2] labspuppetbackend: fix handling of eqiad.wmflabs -> 
eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/639233 (owner: 10Andrew Bogott) [18:51:55] !log Strip 2FA for Mark83 at SUL (T267257) [18:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:56] !log brennen@deploy1001 Pruned MediaWiki: 1.36.0-wmf.10 (duration: 27m 38s) [18:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:24] (03PS2) 10Dave Pifke: [WIP] webperf: convert statsv to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639216 [18:54:38] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639237 [18:54:39] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639237 (owner: 10Brennen Bearnes) [18:55:18] RECOVERY - Host ms-be2057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.50 ms [18:55:35] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639237 (owner: 10Brennen Bearnes) [18:57:00] !log brennen@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.16 [18:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] brennen and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201104T1900). [19:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201104T1900) [19:00:04] Urbanecm, Jdlrobson, dpifke, and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:12] hi [19:00:16] I can deploy today! 
[19:00:25] i'm in a meeting but i will be here in 5 minutes [19:00:30] ack [19:00:43] Jdlrobson: dpifke: Hello, are you around please? [19:01:04] brennen: ah, I see you just started scap. Since Morning B&C window started, can I start deploying? [19:01:55] My patch can wait until the others are done, it's pretty low priority. [19:02:03] ack :) [19:02:36] here :) [19:02:39] thanks [19:03:38] brennen: ping? [19:03:53] brennen is in a meeting right now (train log triage) [19:04:27] dancy: but they started a scap command, while it's a deploy window [19:05:15] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10Trizek-WMF) I'm going to add something to Tech News, and we could refine it next week. Thank you for finding the old task and the mess... [19:07:32] hey sorry reading backscroll one second [19:07:57] apologies; i'm syncing to testwikis at the moment. [19:08:20] thanks. So, I'm free to deploy once scap completes? [19:08:59] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10Trizek-WMF) * https://meta.wikimedia.org/wiki/Tech/News/2020/46 * https://meta.wikimedia.org/wiki/Tech/News/2020/47 [19:10:00] Urbanecm: i think you can go ahead once scap finishes - this is a bit of an odd situation because we didn't deploy yesterday. [19:10:06] but it's going to be a while. [19:10:38] (03CR) 10Volans: "not really familiar with the script or APIs, did a quick pass, agree with Chris's comments, the rest seems sane." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [19:10:40] :-( [19:10:49] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10Trizek-WMF) [19:10:49] sorry everyone, but I have to wait until brennen's scap completes :/ [19:11:58] No problem, I'll be around for the full window. [19:12:12] sorry folks, normally there's a "sanity window" where we don't do backports before train to avoid exactly this situation [19:12:13] thank you dpifke [19:12:32] weird week meant that I didn't translate that in the deployment schedule correctly [19:13:00] this is on me; i've been blindly following the standard train deploy flow forgetting to take the day into account. [19:13:21] i see [19:13:22] it might be best to cancel this window since scap just started. It normally takes about an hour to get this stuff out. [19:13:31] :-( [19:13:35] I'm OK with cancelling and rolling mine out another day. [19:13:44] let's see how it goes, and cancel if it turns out to be necessary [19:14:22] PROBLEM - Host ms-be2057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:14:49] assuming that you basically started the "train window" now, can we just do the "backport window" afterwards? [19:14:55] (03CR) 10Jeena Huneidi: [C: 03+2] Scaffold improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 (owner: 10Jeena Huneidi) [19:15:33] that depends on what's the plan for the train window that's immediately after this one :-) [19:15:44] thcipriani: ? [19:16:25] (03CR) 10Jeena Huneidi: "> Patch Set 5:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [19:16:54] Urbanecm: we can ping you when done rolling out the train, it'd probably be an hour, would be fine to rollout backport patches then, I think. 
[19:17:01] thank you [19:17:19] sure thing, sorry for the confusion, should have spotted this in scheduling :\ [19:17:25] MatmaRex: Jdlrobson: dpifke: If that's fine with your schedule, I'll be happy to roll out the patches then [19:17:50] (03Merged) 10jenkins-bot: Scaffold improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/637753 (owner: 10Jeena Huneidi) [19:18:44] Works for me. [19:19:00] noted [19:19:16] i'm fine with waiting for an hour (but i don't really want to wait until the next deploy window at 1 am :) ) [19:19:38] RECOVERY - Host ms-be2057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.22 ms [19:21:50] MatmaRex: I don't think you'd need to wait that long :) [19:24:01] Urbanecm: that works for me [19:24:15] thanks, noted :) [19:32:20] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10wiki_willy) I followed up with Dell (during my regular meeting with them) about the status of the PSUs, and they said it was delivered on Nov... [19:33:27] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10wiki_willy) Tracking #935433832396 [19:39:13] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 2, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T263019 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:59:09] !log brennen@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.16 (duration: 62m 44s) [19:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] brennen and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201104T2000). [20:00:32] (03PS1) 10Mforns: Migrate EventLogging NewcomerTask to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639264 (https://phabricator.wikimedia.org/T259163) [20:00:33] 10Operations, 10ops-codfw: codfw: ms-be2057 reading ony 480GB of RAM and not 512GB - https://phabricator.wikimedia.org/T267252 (10Papaul) memory test finished we no error but server is still showing at boot problem on DIMM A2 . I replaced DIMM A2 and DIMM B2 with the 32GB DIMM for now and will return the bad... [20:00:42] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1002 job=burrow partition={0,1} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasourc [20:00:42] ter=logging-eqiad&var-topic=All&var-consumer_group=All [20:02:04] quick question for sysadmins here (Urbanecm maybe?), how many purges per day would be too much for a purge bot (that purges pages/members of categories). this is using mw:API:Purge and *not* recursive purges? [20:02:09] Urbanecm: i've deployed wmf.16 to testwikis; train is otherwise blocked, you're clear for late backports. [20:02:15] thank you brennen [20:02:18] brennen: was about to ask the same [20:02:18] ^ cc: thcipriani [20:02:21] i have a config change i want to deploy [20:02:25] should I just wait til Urbanecm is done? 
[20:02:33] ottomata: please wait, I didn't start with morning B&C yet [20:02:36] yeah, please coordinate with Urbanecm [20:02:37] oh [20:02:49] oh oh oh well then we might have a patch for you [20:02:51] yeah, we're sadly late on schedule due to yesterday being holiday [20:02:53] i'm teaching someone how this process works right now [20:02:57] cool! [20:02:59] mind if i add one to the wiki? [20:03:04] not at all [20:03:05] ty [20:03:17] MatmaRex: Jdlrobson: dpifke: Hello! The window can finally start! [20:03:18] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [20:03:42] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) [20:04:04] proc: that's kinda difficult question, basically it should be asked the other way around (is this minimum number needed for this bot an issue) :-). I suggest to create a Phab task if you want a bot to be reviewed. [20:04:24] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:04:26] I'll start with my own patch until MatmaRex, Jdlrobson or dpifke are back :-) [20:04:33] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:39] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [20:04:48] (03PS2) 10Urbanecm: Enable wgCheckUserLogLogins at all wikis but loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639095 (https://phabricator.wikimedia.org/T253802) [20:04:52] (03CR) 10Urbanecm: [C: 03+2] Enable wgCheckUserLogLogins at all wikis but loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639095 (https://phabricator.wikimedia.org/T253802) (owner: 10Urbanecm) [20:04:58] (03PS1) 10Hashar: Review access change [software/gerrit] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/639121 [20:05:14] ottomata: just to clarify, I should ping you once I'm done, so you can deploy it? [20:05:18] i'm around [20:05:19] (or are you teaching B&C process in general?) [20:05:21] thanks MatmaRex [20:05:49] (03Merged) 10jenkins-bot: Enable wgCheckUserLogLogins at all wikis but loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639095 (https://phabricator.wikimedia.org/T253802) (owner: 10Urbanecm) [20:06:06] (03PS3) 10Urbanecm: Enable DiscussionTools as a beta feature on almost all Wikipedias ("phase 3") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638201 (https://phabricator.wikimedia.org/T266303) (owner: 10Bartosz Dziewoński) [20:06:11] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools as a beta feature on almost all Wikipedias ("phase 3") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638201 (https://phabricator.wikimedia.org/T266303) (owner: 10Bartosz Dziewoński) [20:06:12] im back [20:06:20] Urbanecm: for context: it's in relation to an enwiki BRFA. we used to have a bot that did this, and its tasks did 10k purges/day, but the bot op is gone now. 
my bot is for general approval, though (so cats can be added to a list), so wanted to know a rough maximum I should cap it at so the sysadmins don't need to block my bot ;p [20:06:20] should I ask this on phab, or? [20:06:33] yes, please create a Phab task [20:06:36] what project? [20:06:38] Almost back, finishing my lunch. :) [20:06:48] ack! [20:06:52] I'll ping you when ready [20:06:56] (03PS2) 10Herron: kibana: add param to manage kibana.index, and use for ECS instance [puppet] - 10https://gerrit.wikimedia.org/r/639227 [20:07:03] (03Merged) 10jenkins-bot: Enable DiscussionTools as a beta feature on almost all Wikipedias ("phase 3") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/638201 (https://phabricator.wikimedia.org/T266303) (owner: 10Bartosz Dziewoński) [20:07:20] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10wiki_willy) Hey @Jclark-ctr - can you see if we have a cross-over cable long eno... [20:07:21] syncing my one, after that, I'll fetch MatmaRex's to mwdebug [20:07:26] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [20:08:06] (03PS2) 10Hashar: Change access to a dedicated group [software/gerrit] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/639121 [20:08:16] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fb5c03262c20b5e99b3c2f6e91abb024f12da1f5: Enable wgCheckUserLogLogins at all wikis but loginwiki (T253802) (duration: 01m 08s) [20:08:18] ottomata: unsure whether you saw it, there's a question few lines above :) [20:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:22] T253802: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 [20:08:24] Urbanecm: if you get done before we fill out the backport calendar on the deployments wiki [20:08:26] i'll do it [20:08:40] (03CR) 10Hashar: "The repository being owned by Administrators / Gerrit Managers does not make much sense so I have created a new group :)" [software/gerrit] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/639121 (owner: 10Hashar) [20:08:41] otherwise, wouldn't mind if you would, so mforns can learn how to do it without me :) [20:08:51] well it's technically already over :-) anyway, I'll ping you once done ;) [20:09:11] MatmaRex: can you please test yours at mwdebug1002? [20:09:18] hello mforns! [20:09:27] hey Urbanecm! :] [20:09:28] looking [20:09:31] thanks [20:09:54] dpifke: you're next :) [20:09:56] ok ya Urbanecm so mforns edited the wiki, his patch is last in the list [20:10:03] ack, thanks mforns and ottomata ! 
[20:10:24] (03PS2) 10Urbanecm: Enable wgImagePreconnect on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639205 (https://phabricator.wikimedia.org/T123582) (owner: 10Dave Pifke) [20:10:43] Urbanecm: looks good [20:10:49] thanks, syncing to the fleet [20:11:02] (03CR) 10Urbanecm: [C: 03+2] "B&C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639205 (https://phabricator.wikimedia.org/T123582) (owner: 10Dave Pifke) [20:11:35] Jdlrobson: ping, are you still around by any chance? :-) [20:12:13] (03Merged) 10jenkins-bot: Enable wgImagePreconnect on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639205 (https://phabricator.wikimedia.org/T123582) (owner: 10Dave Pifke) [20:12:26] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d2a57725f8f6fdaa3f40c834e84b43a0260077f2: Enable DiscussionTools as a beta feature on almost all Wikipedias (T266303) (duration: 01m 07s) [20:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:32] MatmaRex: done :) [20:12:33] T266303: Enable Reply Tool as Beta Feature on "Phase 3" wikis - https://phabricator.wikimedia.org/T266303 [20:12:56] thanks [20:12:58] dpifke: hello! Your patch is available at mwdebug1002 for testing, can you have a look please? [20:12:58] (03CR) 10Herron: "Here's a PCC https://puppet-compiler.wmflabs.org/compiler1003/26318/ -- can see the kibana.index config difference in full diff" [puppet] - 10https://gerrit.wikimedia.org/r/639227 (owner: 10Herron) [20:13:03] MatmaRex: no problem :) [20:13:06] Looking now. [20:14:49] LGTM. [20:15:09] thanks, syncing [20:15:21] can someone add mwdebug1xxx to the default filter for mwdebug dashboard? 
[20:15:45] it looks like this https://usercontent.irccloud-cdn.com/file/jlY2UjtM/image.png [20:15:54] (03CR) 10Thcipriani: "> Patch Set 2:" [software/gerrit] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/639121 (owner: 10Hashar) [20:17:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 82579bf9d71bd3c9d97da0132ce8d92a8863da5b: Enable wgImagePreconnect on remaining wikis (T123582) (duration: 01m 06s) [20:17:34] dpifke: should be live [20:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:35] T123582: Use "preconnect" resource hint for thumbnail host - https://phabricator.wikimedia.org/T123582 [20:18:02] ottomata: I'm going to transfer the late backport window to you - as Jdlrobson doesn't seem to be around now. Feel free to scap your stuff! [20:18:05] *mforns's :-) [20:18:15] ok [20:18:19] Urbanecm: you want it to include mwdebug1xxx and exclude mwdebug2xxx ? [20:18:43] thcipriani: yes, I believe that should be done, as we're post-switchover [20:18:48] ok thanks Urbanecm [20:18:52] * thcipriani does [20:18:57] thank you! [20:19:09] Urbrennen it's just my stuff you want me to do, right? [20:19:26] (updated) [20:20:00] Thanks urbanecm! [20:20:30] ottomata: yes (unless Jdlrobson returns -- but feel free to transfer back to me :)) [20:20:34] ok thank you [20:20:42] np [20:21:55] im here [20:21:57] sorry sorry [20:22:00] on it [20:22:50] we have a request to look at https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard#Big_oops (too big thing can't be undeleted) [20:22:55] o/ Urbanecm ottomata [20:23:00] Jdlrobson: ack, either ottomata or me will work on your patch once the current one finishes [20:23:09] ok, it should require minimal testing [20:23:15] it just needs to turn off the a/b test [20:23:17] (03CR) 10Cwhite: [C: 03+1] "LGTM! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/639227 (owner: 10Herron) [20:23:20] so that we stop collecting data [20:23:22] apergos: I can deal with that - is there a link I can refer to? [20:23:27] (I mean, to the request) [20:23:48] it was mentioned in a slack channel with the above link :-/ [20:23:51] that's all I got [20:24:00] ack - thanks. Taking over :-) [20:24:32] ty! [20:24:58] any time :) [20:26:47] (03PS2) 10Ottomata: Migrate EventLogging NewcomerTask to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639264 (https://phabricator.wikimedia.org/T259163) (owner: 10Mforns) [20:27:52] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:28:00] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10colewhite) @jcrespo This issue should be resolved at this point as I now see the `logstash-*` filter pattern on logstash-next. Please let us know if this isn't the case. 
[20:28:00] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:22] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [20:29:20] (03CR) 10Ottomata: [C: 03+2] Migrate EventLogging NewcomerTask to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639264 (https://phabricator.wikimedia.org/T259163) (owner: 10Mforns) [20:30:41] (03PS1) 10Hashar: gerrit: remove obsolete profile::gerrit::java_version [puppet] - 10https://gerrit.wikimedia.org/r/639272 [20:30:43] (03PS1) 10Hashar: gerrit: move java config from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/639273 [20:31:09] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate NewcomerTask event stream to EventGate on testwiki - T259163 (duration: 01m 07s) [20:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:16] T259163: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 [20:31:33] Urbanecm: Jdlrobson done, you can proceed if that's ok [20:31:37] i'm working with mforns on some stuff [20:31:40] absolutely :) [20:31:55] * Urbanecm opening ssh conns again :-) [20:32:32] (03PS2) 10Urbanecm: Disable the search in header A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639203 (https://phabricator.wikimedia.org/T265333) (owner: 10Jdlrobson) [20:32:36] (03CR) 10Urbanecm: [C: 03+2] Disable the search in header A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639203 (https://phabricator.wikimedia.org/T265333) (owner: 10Jdlrobson) [20:32:43] Jdlrobson: I'll ping you when it's ready [20:33:06] (03CR) 10Herron: [C: 03+1] thanos: add query-frontend [puppet] - 10https://gerrit.wikimedia.org/r/638119 
(https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [20:33:22] (03Merged) 10jenkins-bot: Disable the search in header A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639203 (https://phabricator.wikimedia.org/T265333) (owner: 10Jdlrobson) [20:33:50] (03CR) 10Herron: [C: 03+1] prometheus: add thanos query-frontend jobs [puppet] - 10https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [20:33:52] Jdlrobson: available to test at mwdebug1002 :-) [20:33:55] on it [20:34:19] (03CR) 10Herron: [C: 03+1] role: add query_frontend to thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [20:34:20] Urbanecm: confirmed! you can sync [20:34:23] thank you [20:34:37] (03CR) 10Hashar: "I have applied them on the instance via https://horizon.wikimedia.org/project/instances/48c5ff1c-2885-410d-beeb-d5a57a0a91c7/" [puppet] - 10https://gerrit.wikimedia.org/r/639273 (owner: 10Hashar) [20:34:45] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/639273 (owner: 10Hashar) [20:34:46] syncing, thanks Jdlrobson [20:35:43] apergos: fyi, the page is back up [20:36:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ee0ba541fa55f6707276fdc5bd3f032cb9be3e60: Disable the search in header A/B test (T265333) (duration: 01m 06s) [20:36:09] Jdlrobson: and done :) [20:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:12] T265333: End A/B test for search location - https://phabricator.wikimedia.org/T265333 [20:36:16] thank you! [20:36:19] I think we're done? [20:36:20] and hurrah [20:36:22] no problem Jdlrobson ! [20:36:25] yeah I saw (I checked the pedia discussion ;-) and also passed it on to slack already! 
[20:36:31] cool :-) [20:36:59] !log Late B&C Morning window completed, deployment host is clear [20:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:29] you have a heart back from the other side :-) [20:37:43] :) [20:39:33] (03CR) 10Hashar: "> Patch Set 2:" [software/gerrit] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/639121 (owner: 10Hashar) [20:45:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:18] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) [20:49:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:52] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) `mc1036` is happily running buster, after the initial tuning (tx to @elukey), things look ok. We will keep monito... [21:00:04] chrisalbon and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201104T2100). [21:03:59] 10Operations, 10observability, 10serviceops, 10User-jijiki: alert on too many close-to-saturated appservers / apiservers - https://phabricator.wikimedia.org/T267176 (10CDanis) As discussed, here's a start on the query: https://w.wiki/k6F Both thresholds in there need some tuning, but it's a start. This sh... 
[21:04:02] (03PS3) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [21:06:56] (03PS3) 10Dave Pifke: [WIP] webperf: convert statsv to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639216 (https://phabricator.wikimedia.org/T267269) [21:08:05] (03PS4) 10Dave Pifke: [WIP] webperf: change navtiming to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639197 (https://phabricator.wikimedia.org/T267269) [21:10:45] (03PS1) 10Mholloway: Add event stream config for android.user_contributions_screen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639284 (https://phabricator.wikimedia.org/T228179) [21:14:07] (03Abandoned) 10Dave Pifke: webperf: new python-ua-parser navtiming dependency [puppet] - 10https://gerrit.wikimedia.org/r/629436 (https://phabricator.wikimedia.org/T260580) (owner: 10Dave Pifke) [21:14:56] (03CR) 10Mholloway: Add event stream config for android.user_contributions_screen (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639284 (https://phabricator.wikimedia.org/T228179) (owner: 10Mholloway) [21:21:42] 10Operations, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10BPirkle) [21:22:24] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10RobH) [21:22:33] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10RobH) [21:22:56] (03PS4) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [21:27:37] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install 
mwlog2002.codfw.wmnet - https://phabricator.wikimedia.org/T267272 (10RobH) [21:27:53] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog2002.codfw.wmnet - https://phabricator.wikimedia.org/T267272 (10RobH) [21:32:33] (03CR) 10Ottomata: Add event stream config for android.user_contributions_screen (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639284 (https://phabricator.wikimedia.org/T228179) (owner: 10Mholloway) [21:36:46] moritzm: I assume you mean cn=nda at https://phabricator.wikimedia.org/T256367#6604159 [21:38:59] (03PS1) 10Harriet Ayugi: Add tests/selenium/log to .gitignore [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/639293 (https://phabricator.wikimedia.org/T265463) [21:40:22] (03CR) 10Harriet Ayugi: "This awaits review. Thanks" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/639293 (https://phabricator.wikimedia.org/T265463) (owner: 10Harriet Ayugi) [21:44:58] (03PS1) 10Andrew Bogott: labspuppetbackend.py: fix another regsub mishap [puppet] - 10https://gerrit.wikimedia.org/r/639296 [21:45:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:23] (03CR) 10Andrew Bogott: [C: 03+2] labspuppetbackend.py: fix another regsub mishap [puppet] - 10https://gerrit.wikimedia.org/r/639296 (owner: 10Andrew Bogott) [21:47:15] (03PS1) 10Bstorm: cloud-vps: Change NFS mounts to default to false [puppet] - 10https://gerrit.wikimedia.org/r/639297 (https://phabricator.wikimedia.org/T262350) [21:50:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:33] (03CR) 10Bstorm: "I wouldn't want to merge this without adding some hiera keys in horizon for Toolforge and PAWS first as well as some emails and wiki write" [puppet] - 10https://gerrit.wikimedia.org/r/639297 (https://phabricator.wikimedia.org/T262350) (owner: 10Bstorm) [21:55:44] Urbanecm: seems a bit quieter now :) — wondering what phab project I should create the task in? [21:56:33] proc: heh good question. I'd guess performance-team and someone'll triage it further [21:56:57] ottomata: FYI an-launcher1002 has produce_canary_events.service failed/flapping [21:57:41] yeahhhh Hm [21:58:10] volans: this T266573 [21:58:11] T266573: eventgate-analytics-external occasionally seems to fail lookups of dynamic stream config from MW EventStreamConfig API - https://phabricator.wikimedia.org/T266573 [21:58:45] looking [21:58:51] acutally if it is failed for hours it is now [21:58:51] thx [21:58:51] not [21:59:03] ? [21:59:21] * volans PARSE_ERROR :) [22:05:20] Urbanecm: doh, ofc, will correct it with Katie [22:05:37] thanks moritzm :) [22:07:57] ok that is a different problem, but it must be a cache issue; the client is pulling down an old http result from https://schema.wikimedia.org/repositories//secondary/jsonschema/analytics/legacy/contenttranslationabusefilter/latest [22:08:04] i expect it to go away when the cache expires [22:08:07] which is.../ [22:08:10] i will ack icing [22:08:11] a [22:09:01] will check back later [22:09:07] ack, thx [22:12:01] (03CR) 10BryanDavis: "The list of projects that might have random difficulties with this default flipped can be obtained from modules/labstore/templates/nfs-mou" [puppet] - 10https://gerrit.wikimedia.org/r/639297 (https://phabricator.wikimedia.org/T262350) (owner: 10Bstorm) [22:18:49] (03CR) 10Volans: "One main comment inline, LGTM otherwise." 
(032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: 10Ayounsi) [22:22:53] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/639297 (https://phabricator.wikimedia.org/T262350) (owner: 10Bstorm) [22:33:12] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [22:43:14] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [22:44:05] 10Operations, 10Cassandra, 10Services (watching), 10User-Eevans: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590 (10Aklapper) @fgiunchedi, @Eevans: Ping - anyone knows if this is still an issue? If yes, this task should be open. If not,... [22:48:26] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [22:50:04] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 7.183 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [22:54:44] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:54:52] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:02] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:58:06] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:58:14] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:22] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [23:07:12] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [23:12:18] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [23:16:34] PROBLEM - Disk space on Hadoop worker on an-worker1113 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [23:31:11] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10colewhite) >>! In T234854#6016076, @Krinkle wrote: > * It is even slower to load. Just to have the UI appear initially at all now takes 7-8 seconds on `logstash-next` compared... 
[23:31:27] (03PS1) 10Bstorm: wmcs wikireplicas: add a dry_run option [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) [23:32:48] (03CR) 10Volans: wmcs wikireplicas: add a dry_run option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [23:34:52] (03CR) 10Volans: "FYI all remote execution via run_sync and run_async are already dry_run aware, so will not run the command unless you pass the is_safe=Tru" [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [23:36:23] (03CR) 10Bstorm: "> Patch Set 1:" [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [23:38:01] bstorm: to be precise all spicerack modules support dry_run (it's mandatory when adding a new module, unless I forgot to enforce it ;) ) so that no RW action is performed in dry-run [23:38:36] ofc if a command depends on the previous one or stuff like that it might make the cookbook fail [23:40:38] Fair enough [23:41:12] so feel free to run the cookbook in dry-run mode and see how it goes and decide from there how you want to tweak it [23:50:02] (03PS1) 10Andrew Bogott: puppet_ca_server default to '' on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/639322 [23:53:09] (03PS2) 10Andrew Bogott: puppet_ca_server default to '' on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/639322 [23:57:32] (03PS7) 10Razzi: oozie: Add admin groups for authorization [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) [23:58:28] (03PS8) 10Razzi: oozie: Add admin groups for authorization [puppet] - 10https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660)