[00:00:01] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) I'm a bit confused now. I thought that was the question we talked about in today's meeting.
[00:00:55] <Amir1>	 legoktm: I prefer it outside too but that's what's in the handbook https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers
[00:01:23] <legoktm>	 does something rely on that exact format?
[00:01:26] <legoktm>	 I doubt it...
[00:04:50] <Amir1>	 yeah, I think we should change it in both places. thcipriani, would that be okay?
[00:05:01] <Amir1>	 both in the handbook and the scripts
[00:06:40] <icinga-wm>	 PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:28] <icinga-wm>	 PROBLEM - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4005: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[00:08:42] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw2410.codfw.wmnet
[00:08:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:50] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw2411.codfw.wmnet
[00:08:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:54] <icinga-wm>	 RECOVERY - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 193 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[00:10:55] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw2411.codfw.wmnet
[00:10:58] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw2411.codfw.wmnet
[00:11:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:00] <wikibugs>	 (PS1) Legoktm: conftool-data: Document which servers are only pooled as jobrunner/videoscaler [puppet] - https://gerrit.wikimedia.org/r/679022 (https://phabricator.wikimedia.org/T279100)
[00:16:01] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Legoktm) Open→Resolved uh, that's right, my bad >.<  I submitted a documentation patch just...
[00:23:48] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "confirmed with confctl" [puppet] - https://gerrit.wikimedia.org/r/679022 (https://phabricator.wikimedia.org/T279100) (owner: Legoktm)
[00:24:10] <wikibugs>	 (CR) Legoktm: [C: +2] conftool-data: Document which servers are only pooled as jobrunner/videoscaler [puppet] - https://gerrit.wikimedia.org/r/679022 (https://phabricator.wikimedia.org/T279100) (owner: Legoktm)
[00:24:45] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) documentation patch +1, confirmed that's how it is now. Thanks.   I could also be wrong, if w...
[00:27:14] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 56658080 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:06:24] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 37000 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:18:16] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:30:34] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:43:32] <icinga-wm>	 PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cron.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:57:22] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:45:44] <wikibugs>	 (PS1) Andrew Bogott: wmcs-policy-tests.py: add Designate tests [puppet] - https://gerrit.wikimedia.org/r/679083 (https://phabricator.wikimedia.org/T279845)
[02:49:56] <logmsgbot>	 !log andrew@deploy1002 Started deploy [horizon/deploy@ef844a1]: fix for T276963
[02:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:07] <stashbot>	 T276963: Horizon: add doc links and discouragement to the 'server groups' UIs - https://phabricator.wikimedia.org/T276963
[02:50:24] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs-policy-tests.py: add Designate tests [puppet] - https://gerrit.wikimedia.org/r/679083 (https://phabricator.wikimedia.org/T279845) (owner: Andrew Bogott)
[02:54:07] <logmsgbot>	 !log andrew@deploy1002 Finished deploy [horizon/deploy@ef844a1]: fix for T276963 (duration: 04m 10s)
[02:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:36] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.046 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:18:08] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 3.622 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:25:46] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:50:20] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[04:50:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:58] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db1076 [puppet] - https://gerrit.wikimedia.org/r/679138 (https://phabricator.wikimedia.org/T274752)
[05:04:45] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1076.eqiad.wmnet
[05:04:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:45] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[05:07:47] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[05:07:50] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[05:07:51] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[05:07:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:24] <icinga-wm>	 PROBLEM - Check systemd state on mw2265 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:46] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1076.eqiad.wmnet
[05:14:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:04] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (Marostegui) a:Marostegui→wiki_willy
[05:16:08] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:19:48] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db1076 [puppet] - https://gerrit.wikimedia.org/r/679138 (https://phabricator.wikimedia.org/T274752) (owner: Marostegui)
[05:25:42] <wikibugs>	 (PS1) Marostegui: db1177: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/679149 (https://phabricator.wikimedia.org/T275633)
[05:28:54] <wikibugs>	 (CR) Marostegui: [C: +2] db1177: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/679149 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[05:30:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1177 with minimal weight on s8 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15313 and previous config saved to /var/cache/conftool/dbconfig/20210414-052959-marostegui.json
[05:30:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:09] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[06:25:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1177 with minimal weight on s8 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15314 and previous config saved to /var/cache/conftool/dbconfig/20210414-062549-marostegui.json
[06:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:01] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[06:49:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:52:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:58:31] <wikibugs>	 (PS1) Ayounsi: Add DHCP relay support for management routers [homer/public] - https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583)
[06:59:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add DHCP relay support for management routers [homer/public] - https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) (owner: Ayounsi)
[07:02:55] <wikibugs>	 (PS2) Ayounsi: Add DHCP relay support for management routers [homer/public] - https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583)
[07:03:36] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add DHCP relay support for management routers [homer/public] - https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) (owner: Ayounsi)
[07:04:43] <wikibugs>	 (PS3) Ayounsi: Add DHCP relay support for management routers [homer/public] - https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583)
[07:06:14] <icinga-wm>	 PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:54] <wikibugs>	 (CR) Ayounsi: [C: +2] Add DHCP relay support for management routers [homer/public] - https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) (owner: Ayounsi)
[07:07:37] <wikibugs>	 (Merged) jenkins-bot: Add DHCP relay support for management routers [homer/public] - https://gerrit.wikimedia.org/r/679236 (https://phabricator.wikimedia.org/T271583) (owner: Ayounsi)
[07:15:30] <wikibugs>	 SRE, netops, Patch-For-Review, cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (ayounsi)
[07:22:40] <XioNoX>	 !log push pfw policy - T280059
[07:22:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:38] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/677514 (owner: Jbond)
[07:31:42] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:34:14] <wikibugs>	 (PS1) Alexandros Kosiaris: linkrecommendation: Bump memory/cpu limits [deployment-charts] - https://gerrit.wikimedia.org/r/679240 (https://phabricator.wikimedia.org/T279411)
[07:36:49] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] linkrecommendation: Bump memory/cpu limits [deployment-charts] - https://gerrit.wikimedia.org/r/679240 (https://phabricator.wikimedia.org/T279411) (owner: Alexandros Kosiaris)
[07:37:07] <wikibugs>	 SRE, Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (MoritzMuehlenhoff) >>! In T128592#6996726, @Legoktm wrote: > Is it even possible for IRC to be active-active? Doesn't the client have to maintain a connection with a single server, and if that s...
[07:38:31] <wikibugs>	 (Merged) jenkins-bot: linkrecommendation: Bump memory/cpu limits [deployment-charts] - https://gerrit.wikimedia.org/r/679240 (https://phabricator.wikimedia.org/T279411) (owner: Alexandros Kosiaris)
[07:40:50] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[07:40:50] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[07:40:50] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[07:40:51] <godog>	 !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836
[07:40:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:17] <stashbot>	 T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[07:41:33] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[07:41:33] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[07:41:33] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[07:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:01] <jayme>	 !log imported chartmuseum_0.13.1-1 to buster-wikimedia
[07:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:32] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[07:42:33] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[07:42:33] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[07:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:43] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw
[07:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:13] <wikibugs>	 SRE, ops-codfw, User-fgiunchedi: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (fgiunchedi) Open→Resolved Thank you @papaul, all good
[07:51:10] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw
[07:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:29] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad
[07:51:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:09] <gehel>	 !log restarting blazegraph + updater on wdqs1013
[07:55:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:48] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[07:56:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:03] <gehel>	 !log depooling wdqs1013 - catching up on lag
[07:57:05] <gehel>	 ryankemper: ^
[07:57:10] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.087 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:57:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:20] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2001 is CRITICAL: 1.305e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:58:48] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:59:27] <gehel>	 !log depooling wdqs2001 - catching up on lag
[07:59:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:20] <icinga-wm>	 RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:06] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:01:36] <icinga-wm>	 RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:45] <wikibugs>	 (PS1) Muehlenhoff: Remove kraz [puppet] - https://gerrit.wikimedia.org/r/679250
[08:01:47] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[08:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:00] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2001 is CRITICAL: 1.304e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:04:38] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2004 is CRITICAL: 1.307e+05 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:05:51] <wikibugs>	 (PS2) Effie Mouzeli: hieradata: enable onhost memcached socket on all mw clusters [puppet] - https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115)
[08:05:56] <gehel>	 !log depooling wdqs2004 - catching up on lag
[08:06:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:22] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad
[08:06:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:50] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS high update lag on wdqs2001 is CRITICAL: 1.302e+05 ge 4.32e+04 Gehel catching up on lag after data reload https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:06:50] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS high update lag on wdqs2004 is CRITICAL: 1.307e+05 ge 3600 Gehel catching up on lag after data reload https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:07:08] <jayme>	 !log updated chartmuseum to 0.13.1 on chartmuseum1001, chartmuseum2001
[08:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:34] <wikibugs>	 (PS1) Alexandros Kosiaris: conftool: Create a shared jobrunner_videoscaler [puppet] - https://gerrit.wikimedia.org/r/679258 (https://phabricator.wikimedia.org/T279100)
[08:16:15] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=(wtp1033.eqiad.wmnet|wtp1032.eqiad.wmnet)
[08:16:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:25] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=10; selector: cluster=videoscaler,service=apache2,name=mw2395.codfw.wmnet
[08:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:31] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=10; selector: cluster=videoscaler,service=apache2,name=mw2394.codfw.wmnet
[08:20:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:58] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (akosiaris) I've gone a bit overboard and created https://gerrit.wikimedia.org/r/679258 that uses YA...
[08:28:45] <wikibugs>	 (PS1) Jbond: P:debmonitor::client: update debmon-client systemd::timer [puppet] - https://gerrit.wikimedia.org/r/679263
[08:33:39] <wikibugs>	 (PS1) Jbond: P:debmonitor::server: drop systemd-cat from gc job [puppet] - https://gerrit.wikimedia.org/r/679268
[08:34:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:debmonitor::server: drop systemd-cat from gc job [puppet] - https://gerrit.wikimedia.org/r/679268 (owner: Jbond)
[08:35:38] <wikibugs>	 (PS2) Jbond: P:debmonitor::server: drop systemd-cat for gc job [puppet] - https://gerrit.wikimedia.org/r/679268
[08:36:30] <icinga-wm>	 PROBLEM - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4005: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:36:35] <wikibugs>	 (PS3) Jbond: P:debmonitor::server: drop systemd-cat from gc job [puppet] - https://gerrit.wikimedia.org/r/679268
[08:36:48] <wikibugs>	 (CR) Filippo Giunchedi: "nits inline but LGTM otherwise" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/678844 (owner: Jbond)
[08:36:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] check_https_client_auth_puppet: add new icinga check [puppet] - https://gerrit.wikimedia.org/r/678844 (owner: Jbond)
[08:38:52] <icinga-wm>	 RECOVERY - LVS linkrecommendation eqiad port 4005/tcp - Link Recommendation- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 193 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:39:10] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (jijiki) >>! In T279100#6997273, @akosiaris wrote: > I've gone a bit overboard and created https://g...
[08:40:21] <Urbanecm>	 jouncebot: now
[08:40:21] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 19 minute(s)
[08:40:24] <Urbanecm>	 jouncebot: next
[08:40:24] <jouncebot>	 In 2 hour(s) and 19 minute(s): [[Backport windows|European mid-day backport window]] (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1100)
[08:40:31] <Urbanecm>	 !log Staging on mwdebug1002
[08:40:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:35] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/679263 (owner: Jbond)
[08:44:36] <Urbanecm>	 !log Run scap pull on mwdebug1002
[08:44:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:06] <wikibugs>	 SRE, Patch-For-Review: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (MoritzMuehlenhoff) Open→Resolved bast1003 has now fully replaced bast1002. The decom task for bast1002 is T280110
[08:48:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (MoritzMuehlenhoff)
[08:48:12] <wikibugs>	 SRE, Patch-For-Review: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (MoritzMuehlenhoff)
[08:50:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:35] <wikibugs>	 (PS1) Muehlenhoff: Remove bast1002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/679273
[08:53:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:53:27] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.decommission for hosts bast1002.wikimedia.org
[08:53:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:15] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove bast1002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/679273 (owner: Muehlenhoff)
[08:58:26] <wikibugs>	 (PS2) Jbond: P:debmonitor::client: update debmon-client systemd::timer [puppet] - https://gerrit.wikimedia.org/r/679263
[08:58:56] <wikibugs>	 (PS1) Jbond: debmonitor-client: Improve retry logic [software/debmonitor] - https://gerrit.wikimedia.org/r/679275
[09:00:02] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on wtp1033 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[09:02:20] <wikibugs>	 (CR) Volans: "I just noticed that I forgot to add a timeout to those requests here, so we should probably duplicate what's in wmflib to have both behavi" [software/debmonitor] - https://gerrit.wikimedia.org/r/679275 (owner: Jbond)
[09:03:15] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[09:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:54] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast1002.wikimedia.org
[09:04:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:28] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:04:58] <wikibugs>	 (CR) Volans: "> Patch Set 1:" [software/debmonitor] - https://gerrit.wikimedia.org/r/679275 (owner: Jbond)
[09:05:00] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:16] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:05:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] debmonitor-client: Improve retry logic [software/debmonitor] - https://gerrit.wikimedia.org/r/679275 (owner: Jbond)
[09:06:28] <ryankemper>	 !log T267927 depool `wdqs2001` following data transfer (catching up on lag)
[09:06:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:40] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[09:06:41] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679268 (owner: 10Jbond)
[09:06:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM, once fixed the duplicated line, +1." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli)
[09:07:29] <wikibugs>	 (03PS1) 10Urbanecm: Don't allow query and cookie hacks to enable topic subscriptions [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678994 (https://phabricator.wikimedia.org/T280082)
[09:07:46] <wikibugs>	 (03PS3) 10Effie Mouzeli: hieradata: enable onhost memcached socket on all mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115)
[09:07:52] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "train blocker" [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678994 (https://phabricator.wikimedia.org/T280082) (owner: 10Urbanecm)
[09:07:54] <wikibugs>	 (03CR) 10Effie Mouzeli: hieradata: enable onhost memcached socket on all mw clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli)
[09:08:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: drop systemd-catafrom gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268 (owner: 10Jbond)
[09:08:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:debmonitor::client: update debmon-client systemd::timer [puppet] - 10https://gerrit.wikimedia.org/r/679263 (owner: 10Jbond)
[09:09:14] <wikibugs>	 (03PS4) 10Jbond: P:debmonitor::server: drop systemd-catafrom gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268
[09:09:24] <wikibugs>	 (03PS5) 10Jbond: P:debmonitor::server: drop systemd-catafrom gc job [puppet] - 10https://gerrit.wikimedia.org/r/679268
[09:09:34] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.decommission for hosts kraz.wikimedia.org
[09:09:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: enable onhost memcached socket on all mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli)
[09:10:30] <wikibugs>	 (03CR) 10Volans: "question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond)
[09:10:46] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[09:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:54] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:04] <ryankemper>	 !log T267927 depooled `wdqs1004` following data transfer (catching up on lag), current round of data transfers is done so there shouldn't be any left to depool
[09:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:14] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[09:14:18] <gehel>	 ryankemper: do you know why wdqs1003 is complaining?
[09:14:20] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on wtp1032 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[09:14:22] <wikibugs>	 (03CR) 10Jbond: check_https_client_auth_puppet: add new icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond)
[09:14:30] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:14:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=udpmxircecho site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:15:51] <wikibugs>	 (03Merged) 10jenkins-bot: Don't allow query and cookie hacks to enable topic subscriptions [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/678994 (https://phabricator.wikimedia.org/T280082) (owner: 10Urbanecm)
[09:16:10] <ryankemper>	 gehel: not sure about either 1003 or 1010, neither should be related to the transfers
[09:16:25] <gehel>	 !log restarting blazegraph on wdqs1003
[09:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:33] <moritzm>	 ^ the udpmxircecho should be harmless, will have a look soon
[09:19:50] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kraz.wikimedia.org
[09:19:57] <wikibugs>	 10SRE, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review, 10User-notice: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `kraz.wikimedia.org` - kraz.wikimedia.org (**PASS**)   - Dow...
[09:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Update NAT exceptions for kraz -> irc1001/irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/679278
[09:20:28] <wikibugs>	 (03PS1) 10Filippo Giunchedi: admin: add lmeintrup [puppet] - 10https://gerrit.wikimedia.org/r/679279 (https://phabricator.wikimedia.org/T279531)
[09:22:07] <gehel>	 !log depooling wdqs1003 - corrupted data after data reload
[09:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:35] <gehel>	 !log repooling wdqs1013, caught up on lag
[09:23:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:15] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/DiscussionTools/includes/Hooks/HookUtils.php: e4b2d93dcf86a336314ed09fd37844edb16f4f30: Dont allow query and cookie hacks to enable topic subscriptions (T280082) (duration: 01m 24s)
[09:24:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:24] <stashbot>	 T280082: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'mediawikiwiki.discussiontools_subscription' doesn't exist (10.64.16.7) - https://phabricator.wikimedia.org/T280082
[09:24:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:debmonitor::Server: drop absented resource [puppet] - 10https://gerrit.wikimedia.org/r/677514 (owner: 10Jbond)
[09:24:32] <wikibugs>	 (03PS5) 10Jbond: P:debmonitor::Server: drop absented resource [puppet] - 10https://gerrit.wikimedia.org/r/677514
[09:25:15] <wikibugs>	 (03CR) 10Volans: "> Patch Set 1:" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond)
[09:27:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: admin: add hnordeen [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073)
[09:27:40] <effie>	 !log disable puppet on all mediawiki servers to merge 676580
[09:27:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:02] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable onhost memcached socket on all mw clusters [puppet] - 10https://gerrit.wikimedia.org/r/676580 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli)
[09:29:02] <gehel>	 !log depooling wdqs1004 - corrupted data after data reload
[09:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:17] <wikibugs>	 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) >>! In T279100#6997312, @jijiki wrote: >>>! In T279100#6997273, @akosiaris wrote: >> I 'v...
[09:32:25] <wikibugs>	 (03CR) 10Volans: check_https_client_auth_puppet: add new icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond)
[09:33:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1177 with minimal weight on s8 for the first time T275633', diff saved to https://phabricator.wikimedia.org/P15316 and previous config saved to /var/cache/conftool/dbconfig/20210414-093305-marostegui.json
[09:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:15] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[09:36:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 20%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15317 and previous config saved to /var/cache/conftool/dbconfig/20210414-093642-root.json
[09:36:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:45] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[09:37:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:37] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.046 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:38:38] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:38:51] <icinga-wm>	 PROBLEM - Memcached on parse2001 is CRITICAL: connect to address 10.192.0.182 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:38:59] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:39:18] <icinga-wm>	 PROBLEM - Memcached on parse2005 is CRITICAL: connect to address 10.192.0.186 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:39:18] <icinga-wm>	 PROBLEM - Memcached on parse2009 is CRITICAL: connect to address 10.192.16.25 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:39:29] <icinga-wm>	 PROBLEM - Memcached on parse2007 is CRITICAL: connect to address 10.192.16.22 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:40:02] <effie>	 that is me ^
[09:40:14] <effie>	 sorry
[09:40:19] <icinga-wm>	 PROBLEM - Memcached on parse2008 is CRITICAL: connect to address 10.192.16.24 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:40:49] <icinga-wm>	 PROBLEM - Memcached on parse2010 is CRITICAL: connect to address 10.192.16.206 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:40:57] <icinga-wm>	 PROBLEM - Memcached on parse2013 is CRITICAL: connect to address 10.192.32.197 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:41:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2265 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:41:11] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:18] <icinga-wm>	 PROBLEM - Memcached on parse2020 is CRITICAL: connect to address 10.192.48.153 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:42:22] <icinga-wm>	 PROBLEM - Memcached on parse2014 is CRITICAL: connect to address 10.192.32.198 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:42:25] <icinga-wm>	 PROBLEM - Memcached on parse2017 is CRITICAL: connect to address 10.192.48.150 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:42:35] <icinga-wm>	 PROBLEM - Memcached on parse2016 is CRITICAL: connect to address 10.192.48.149 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:42:39] <icinga-wm>	 PROBLEM - Memcached on parse2018 is CRITICAL: connect to address 10.192.48.151 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:42:43] <icinga-wm>	 PROBLEM - Memcached on parse2015 is CRITICAL: connect to address 10.192.32.199 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:42:47] <icinga-wm>	 PROBLEM - Memcached on parse2004 is CRITICAL: connect to address 10.192.0.185 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:42:47] <icinga-wm>	 PROBLEM - Memcached on parse2012 is CRITICAL: connect to address 10.192.32.196 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:43:22] <wikibugs>	 (03PS5) 10David Caro: ceph: add ceph repo and parameter to all client modules [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566)
[09:43:37] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:44:11] <icinga-wm>	 PROBLEM - Memcached on parse2019 is CRITICAL: connect to address 10.192.48.152 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:46:47] <icinga-wm>	 PROBLEM - Memcached on parse2003 is CRITICAL: connect to address 10.192.0.184 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[09:46:51] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:51] <icinga-wm>	 ACKNOWLEDGEMENT - Query Service HTTP Port on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.002 second response time Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:46:52] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.046 second response time Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:46:53] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:54] <icinga-wm>	 ACKNOWLEDGEMENT - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:46:55] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.052 second response time Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:51:45] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: linkrecommendation: Add an internal release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287
[09:51:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 30%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15318 and previous config saved to /var/cache/conftool/dbconfig/20210414-095146-root.json
[09:51:47] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: linkrecommendation: Cleanup production release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288
[09:51:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:55] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[09:54:08] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove kraz [puppet] - 10https://gerrit.wikimedia.org/r/679250
[09:54:33] <wikibugs>	 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff)
[09:57:08] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10fgiunchedi) p:05Triage→03Medium
[10:02:13] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:32] <wikibugs>	 (03PS1) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292
[10:04:34] <wikibugs>	 (03PS1) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293
[10:04:54] <wikibugs>	 (03CR) 10Marostegui: mariadb: Promote db1159 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/678801 (https://phabricator.wikimedia.org/T276448) (owner: 10Marostegui)
[10:04:56] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1159 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/678801 (https://phabricator.wikimedia.org/T276448) (owner: 10Marostegui)
[10:05:19] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:19] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] linkrecommendation: Cleanup production release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris)
[10:05:29] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] ceph.codfw1: enable ceph octopus repo [puppet] - 10https://gerrit.wikimedia.org/r/677583 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro)
[10:05:47] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] linkrecommendation: Add an internal release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris)
[10:06:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[10:06:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond)
[10:06:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove kraz [puppet] - 10https://gerrit.wikimedia.org/r/679250 (owner: 10Muehlenhoff)
[10:06:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 40%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15319 and previous config saved to /var/cache/conftool/dbconfig/20210414-100649-root.json
[10:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:00] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[10:08:14] <wikibugs>	 (03PS2) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292
[10:08:24] <wikibugs>	 (03PS2) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293
[10:09:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29024/console" [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond)
[10:10:18] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris)
[10:11:24] <wikibugs>	 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff)
[10:11:43] <wikibugs>	 (03CR) 10Volans: "I'm not familiar with the send_mail puppetization but +1 for the approach." [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[10:11:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29025/console" [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[10:12:05] <wikibugs>	 10SRE, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review, 10User-notice: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff kraz has been replaced by two Buster instances (irc1001.wikimedia.org and irc2001.wik...
[10:12:50] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Add an internal release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris)
[10:13:15] <wikibugs>	 (03CR) 10Volans: "> Patch Set 2: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[10:14:18] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:18] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel corrupted data after data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:20] <Majavah>	 moritzm: is there a reason not to point irc.wm.o to both 1001 and 2001 instead of having one standby?
[10:15:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro)
[10:18:00] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[10:21:43] <marostegui>	 In 10 minutes we are restarting m1 master (etherpad, librenms, backups, bacula...) T276448
[10:21:44] <stashbot>	 T276448: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448
[10:21:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15320 and previous config saved to /var/cache/conftool/dbconfig/20210414-102153-root.json
[10:22:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:05] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[10:22:22] <wikibugs>	 (03PS1) 10Elukey: Add kafka-logging1001 to term kafka in analytics-in4/6 [homer/public] - 10https://gerrit.wikimedia.org/r/679296
[10:23:26] <elukey>	 XioNoX: around for a quick cr? :)
[10:23:31] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679296
[10:25:25] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: Upgrading ceph to octopus
[10:25:28] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: Upgrading ceph to octopus
[10:25:31] <wikibugs>	 (03PS3) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293
[10:25:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:34] <wikibugs>	 (03PS3) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292
[10:25:36] <wikibugs>	 (03PS1) 10Jbond: check_cumin_aliases: ensure script exits 1 on error [puppet] - 10https://gerrit.wikimedia.org/r/679297
[10:25:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:38] <wikibugs>	 (03PS4) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292
[10:26:40] <wikibugs>	 (03PS2) 10Jbond: check_cumin_aliases: ensure script exits 1 on error [puppet] - 10https://gerrit.wikimedia.org/r/679297
[10:26:42] <wikibugs>	 (03PS4) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293
[10:26:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond)
[10:27:14] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond)
[10:28:41] <wikibugs>	 (03PS3) 10Jbond: check_cumin_aliases: ensure script exits 1 on error [puppet] - 10https://gerrit.wikimedia.org/r/679297
[10:28:44] <wikibugs>	 (03PS5) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292
[10:28:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[10:28:48] <marostegui>	 akosiaris: around for the failover?
[10:29:48] <marostegui>	 jynus kormat ready?
[10:29:53] <jynus>	 I am here
[10:30:15] <kormat>	 here
[10:30:23] <marostegui>	 Good, I am going to go ahead
[10:30:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond)
[10:30:33] <marostegui>	 !log Failover m1 from db1080 to db1159 - T276448
[10:30:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:43] <stashbot>	 T276448: Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC - https://phabricator.wikimedia.org/T276448
[10:31:09] <marostegui>	 done
[10:31:11] <marostegui>	 checking services
[10:31:14] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:23] <marostegui>	 etherpad works for
[10:31:28] <marostegui>	 me
[10:31:29] <wikibugs>	 (03PS6) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292
[10:31:48] <marostegui>	 moritzm jbond42 switchover done, can you check cas/pki?
[10:31:49] <wikibugs>	 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqia...
[10:32:06] <akosiaris>	 marostegui: perfect!
[10:32:17] <akosiaris>	 works for me too btw
[10:32:23] <marostegui>	 \o/
[10:32:34] <marostegui>	 librenms seems to be working too
[10:32:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond)
[10:33:00] <marostegui>	 kormat: orchestrator needs cleaning up to remove the old heartbeat, I can do that later, not urgent
[10:33:07] <kormat>	 ack
[10:33:21] <jynus>	 is there something missing, other than dbbackups?
[10:33:39] <marostegui>	 jynus: I am checking racktables and rt
[10:33:49] <marostegui>	 So only backups I think
[10:34:23] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:34:23] <jynus>	 there is a rebase conflict, let me do it manually
[10:34:29] <marostegui>	 oki
[10:34:51] <wikibugs>	 (03PS7) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292
[10:35:29] <marostegui>	 kormat: orchestrator cleaned up, all good now
[10:35:42] <marostegui>	 kormat: we should probably include this step in the failover checklist, as this will always be needed
[10:35:45] <jynus>	 mmm strange, when I downloaded it didn't conflict
[10:35:48] <moritzm>	 marostegui: CAS/IDP works fine, I just forced a new login with my U2F validation (which is fetched from mysql)
[10:35:48] <kormat>	 marostegui: yeah
[10:35:58] <wikibugs>	 (03PS5) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293
[10:36:04] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Update backup metadata host db1080->db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448)
[10:36:13] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Update backup metadata host db1080->db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448)
[10:36:14] <marostegui>	 moritzm: thanks :*
[10:36:45] <jynus>	 marostegui, can I get a quick +1 up there to double check the new primary db name?
[10:36:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 60%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15321 and previous config saved to /var/cache/conftool/dbconfig/20210414-103659-root.json
[10:37:06] <marostegui>	 sure
[10:37:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:08] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[10:37:36] <jynus>	 I will run a backup and check alerts after merging it
[10:37:41] <marostegui>	 thanks
[10:38:39] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] dbbackups: Update backup metadata host db1080->db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) (owner: 10Jcrespo)
[10:39:26] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:26] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] "What will deploying this do to the currently running linkrecommendation-production-load-datasets-1618390800-np4ch container?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris)
[10:39:28] <wikibugs>	 (03PS6) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293
[10:39:31] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update backup metadata host db1080->db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) (owner: 10Jcrespo)
[10:39:32] <jynus>	 running puppet on alert1001, which will take a bit
[10:40:28] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[10:41:12] <wikibugs>	 (03CR) 10Kosta Harlan: linkrecommendation: Add an internal release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris)
[10:42:36] <moritzm>	 Majavah: see https://phabricator.wikimedia.org/T128592#6996726
[10:43:00] <jynus>	 marostegui, db1080 will be decommissioned?
[10:43:02] <icinga-wm>	 RECOVERY - Check systemd state on mw1386 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:43:06] <wikibugs>	 (03PS1) 10Marostegui: db1080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/679307
[10:43:07] <marostegui>	 jynus: yes, but not today
[10:43:11] <jynus>	 sure
[10:43:13] <marostegui>	 jynus: in a week or more if needed
[10:43:31] <jynus>	 I think we will only be sure the alerts are no longer pointing to db1080 then
[10:43:37] <jynus>	 in case there is some logic bug
[10:44:08] <marostegui>	 I was thinking about waiting a whole week, would that work for you or do you need more?
[10:44:15] <jynus>	 yeah, no rush
[10:44:16] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/679307 (owner: 10Marostegui)
[10:44:24] <jynus>	 I was just commenting on the checks I can do for now
[10:44:31] <marostegui>	 oki
[10:44:35] <jynus>	 I will run a new backup too
[10:44:39] <marostegui>	 thanks
[10:44:55] <jynus>	 but it will take a few hours until it writes the results to the db, so that will have to wait too
[10:45:07] <marostegui>	 no worries
[10:45:17] <jynus>	 so far, everything looks good
[10:45:19] <marostegui>	 once you are happy with it, let me know so I can close the task (or close it yourself)
[10:46:03] <jynus>	 wait, let me run puppet on cumin hosts, as otherwise they will try to write to the read only db
[10:46:10] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission bast1002.wikimedia.org - https://phabricator.wikimedia.org/T280110 (10MoritzMuehlenhoff)
[10:46:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/679279 (https://phabricator.wikimedia.org/T279531) (owner: 10Filippo Giunchedi)
[10:48:19] <wikibugs>	 (03PS4) 10Muehlenhoff: Remove dbmonitor1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/678799 (https://phabricator.wikimedia.org/T224589)
[10:50:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm)
[10:50:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris)
[10:51:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "As the user is asking for access to turnilo they need approval from their manager and Analytics (Andrew Otto)," [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073) (owner: 10Filippo Giunchedi)
[10:51:31] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] admin: add hnordeen [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073) (owner: 10Filippo Giunchedi)
[10:51:34] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[10:52:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 70%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15322 and previous config saved to /var/cache/conftool/dbconfig/20210414-105202-root.json
[10:52:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:13] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[10:53:04] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10jbond) @Ottomata are you able to approve access to Turnilo for HNordeen
[10:54:09] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:29] <icinga-wm>	 PROBLEM - Memcached on mw1315 is CRITICAL: connect to address 10.64.16.196 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:56:35] <icinga-wm>	 PROBLEM - Memcached on mw2401 is CRITICAL: connect to address 10.192.0.65 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:01] <icinga-wm>	 PROBLEM - Memcached on mw1345 is CRITICAL: connect to address 10.64.32.57 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:09] <icinga-wm>	 PROBLEM - Memcached on mw1343 is CRITICAL: connect to address 10.64.32.55 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:11] <icinga-wm>	 PROBLEM - Memcached on mw2402 is CRITICAL: connect to address 10.192.0.66 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:17] <icinga-wm>	 PROBLEM - Memcached on mw1340 is CRITICAL: connect to address 10.64.32.52 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:27] <icinga-wm>	 PROBLEM - Memcached on mw1339 is CRITICAL: connect to address 10.64.32.51 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:37] <icinga-wm>	 PROBLEM - Memcached on mw2405 is CRITICAL: connect to address 10.192.0.70 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:47] <icinga-wm>	 PROBLEM - Memcached on mw1290 is CRITICAL: connect to address 10.64.16.55 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:47] <icinga-wm>	 PROBLEM - Memcached on mw1346 is CRITICAL: connect to address 10.64.32.58 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:49] <icinga-wm>	 PROBLEM - Memcached on mw1344 is CRITICAL: connect to address 10.64.32.56 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:57:49] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:58:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merging, I 'll deploy this once the job running right now is done." [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris)
[10:59:31] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] ceph: add ceph repo and parameter to all client modules [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro)
[10:59:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "I will clean up the grants file" [puppet] - 10https://gerrit.wikimedia.org/r/678799 (https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff)
[10:59:43] <icinga-wm>	 PROBLEM - Memcached on mw1341 is CRITICAL: connect to address 10.64.32.53 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:59:43] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro)
[10:59:50] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1034.eqiad.wmnet with reason: REIMAGE
[10:59:53] <icinga-wm>	 PROBLEM - Memcached on mw1363 is CRITICAL: connect to address 10.64.48.205 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:59:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) [[Backport windows|European mid-day backport window]] deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1100).
[11:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[11:00:23] <icinga-wm>	 PROBLEM - Memcached on mw1356 is CRITICAL: connect to address 10.64.48.198 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[11:00:27] <icinga-wm>	 PROBLEM - Memcached on mw2396 is CRITICAL: connect to address 10.192.0.60 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[11:00:27] <icinga-wm>	 PROBLEM - Memcached on mw2404 is CRITICAL: connect to address 10.192.0.68 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[11:00:35] <icinga-wm>	 PROBLEM - Memcached on mw1362 is CRITICAL: connect to address 10.64.48.204 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[11:00:35] <icinga-wm>	 PROBLEM - Memcached on mw1361 is CRITICAL: connect to address 10.64.48.203 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[11:00:35] <icinga-wm>	 PROBLEM - Memcached on mw1377 is CRITICAL: connect to address 10.64.48.219 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[11:00:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: linkrecommendation: Add an internal release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679287 (owner: 10Alexandros Kosiaris)
[11:00:56] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Cleanup production release [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris)
[11:00:56] <jynus>	 is that a monitoring issue?
[11:01:45] <icinga-wm>	 PROBLEM - Memcached on mw1376 is CRITICAL: connect to address 10.64.48.218 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[11:01:47] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1035.eqiad.wmnet with reason: REIMAGE
[11:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:00] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1034.eqiad.wmnet with reason: REIMAGE
[11:02:01] <Majavah>	 jynus: likely related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/676580
[11:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:19] <jynus>	 ok, thanks
[11:02:29] <Majavah>	 but not sure if expected or not
[11:02:47] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[11:02:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:59] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' .
[11:02:59] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[11:02:59] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[11:03:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:45] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1036.eqiad.wmnet with reason: REIMAGE
[11:03:48] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[11:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:00] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[11:04:00] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' .
[11:04:00] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[11:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:12] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1035.eqiad.wmnet with reason: REIMAGE
[11:04:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:48] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: gridengine: set grid-configurator source files to use new domain name [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm)
[11:04:50] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653)
[11:05:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez)
[11:06:02] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[11:06:02] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[11:06:03] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' .
[11:06:03] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[11:06:07] <wikibugs>	 (03PS4) 10Jbond: check_https_client_auth_puppet: add new icinga check [puppet] - 10https://gerrit.wikimedia.org/r/678844
[11:06:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:12] <wikibugs>	 (03CR) 10Jbond: "updated thanks" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond)
[11:06:14] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1036.eqiad.wmnet with reason: REIMAGE
[11:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:58] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I didn't test it but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/679297 (owner: 10Jbond)
[11:07:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 80%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15323 and previous config saved to /var/cache/conftool/dbconfig/20210414-110706-root.json
[11:07:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:19] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[11:07:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29026/console" [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond)
[11:10:24] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.0.150:7001 on restbase1021 is CRITICAL: SSL CRITICAL - Certificate restbase1021-c valid until 2021-05-14 11:10:13 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:10:27] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.16.123:7001 on restbase1024 is CRITICAL: SSL CRITICAL - Certificate restbase1024-b valid until 2021-05-14 11:10:21 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:10:27] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.149:7001 on restbase1021 is CRITICAL: SSL CRITICAL - Certificate restbase1021-b valid until 2021-05-14 11:10:12 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:10:37] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.148:7001 on restbase1021 is CRITICAL: SSL CRITICAL - Certificate restbase1021-a valid until 2021-05-14 11:10:11 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:10:41] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is CRITICAL: SSL CRITICAL - Certificate restbase1027-c valid until 2021-05-14 11:10:31 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:11:01] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.181:7001 on restbase1026 is CRITICAL: SSL CRITICAL - Certificate restbase1026-b valid until 2021-05-14 11:10:27 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:11:01] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.16.118:7001 on restbase1023 is CRITICAL: SSL CRITICAL - Certificate restbase1023-a valid until 2021-05-14 11:10:17 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:11:04] <wikibugs>	 10SRE, 10Performance-Team, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.1; 2021-04-13), and 2 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10jijiki)
[11:11:05] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.106:7001 on restbase1020 is CRITICAL: SSL CRITICAL - Certificate restbase1020-b valid until 2021-05-14 11:10:09 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:11:05] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.16.119:7001 on restbase1023 is CRITICAL: SSL CRITICAL - Certificate restbase1023-b valid until 2021-05-14 11:10:18 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:11:08] <wikibugs>	 (03CR) 10Muehlenhoff: "That's actually intentional, though. See https://github.com/wikimedia/puppet/commit/c22aeac15940e20af4a6bfdb64ae9e7e1775cc49" [puppet] - 10https://gerrit.wikimedia.org/r/679297 (owner: 10Jbond)
[11:11:21] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.16.114:7001 on restbase1022 is CRITICAL: SSL CRITICAL - Certificate restbase1022-a valid until 2021-05-14 11:10:14 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:11:43] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.128:7001 on restbase1025 is CRITICAL: SSL CRITICAL - Certificate restbase1025-c valid until 2021-05-14 11:10:25 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:11:54] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Migrate onhost memcached to use a unix socket - https://phabricator.wikimedia.org/T273115 (10jijiki) 05Open→03Resolved a:03jijiki
[11:12:03] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.180:7001 on restbase1026 is CRITICAL: SSL CRITICAL - Certificate restbase1026-a valid until 2021-05-14 11:10:26 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:12:04] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.16.120:7001 on restbase1023 is CRITICAL: SSL CRITICAL - Certificate restbase1023-c valid until 2021-05-14 11:10:19 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:12:11] <XioNoX>	 is that for robh ^ ?
[11:12:23] <hnowlan>	 nah, they're routine expiries 
[11:12:28] <hnowlan>	 I'll handle them 
[11:12:37] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.16.124:7001 on restbase1024 is CRITICAL: SSL CRITICAL - Certificate restbase1024-c valid until 2021-05-14 11:10:22 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:12:56] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Is there a task?" [homer/public] - 10https://gerrit.wikimedia.org/r/679296 (owner: 10Elukey)
[11:13:07] <XioNoX>	 ok
[11:13:09] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.105:7001 on restbase1020 is CRITICAL: SSL CRITICAL - Certificate restbase1020-a valid until 2021-05-14 11:10:08 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:13:27] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is CRITICAL: SSL CRITICAL - Certificate restbase1026-c valid until 2021-05-14 11:10:28 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:13:35] <XioNoX>	 hnowlan: thanks, is there a way to not have routine IRC alert flood? :)
[11:13:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, as always check_http is hard to parse so I can't exclude typos, but the logic seems good." [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond)
[11:14:15] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/678799 (https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff)
[11:14:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove dbmonitor1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/678799 (https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff)
[11:14:53] <hnowlan>	 XioNoX: they expire once every 2 years so I'm tempted to say no - but this isn't ideal I agree 
[11:15:04] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.101:7001 on restbase1019 is CRITICAL: SSL CRITICAL - Certificate restbase1019-a valid until 2021-05-14 11:10:05 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:15:27] <volans>	 how long are they in WARNING state?
[11:15:40] <volans>	 maybe we can improve ways to catch that before it becomes critical
[11:15:41] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.16.115:7001 on restbase1022 is CRITICAL: SSL CRITICAL - Certificate restbase1022-b valid until 2021-05-14 11:10:15 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:17:04] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/679297 (owner: 10Jbond)
[11:17:20] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.16.116:7001 on restbase1022 is CRITICAL: SSL CRITICAL - Certificate restbase1022-c valid until 2021-05-14 11:10:16 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:18:14] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] check_https_client_auth_puppet: add new icinga check [puppet] - 10https://gerrit.wikimedia.org/r/678844 (owner: 10Jbond)
[11:20:17] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is CRITICAL: SSL CRITICAL - Certificate restbase1027-a valid until 2021-05-14 11:10:29 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:21:16] <wikibugs>	 (03Abandoned) 10Jbond: Switch debmonitor to Envoy (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/625890 (owner: 10Muehlenhoff)
[11:22:00] <wikibugs>	 (03CR) 10Jbond: "noticed this come up as a merge conflict, superseded now by I584cc371938ed4c0cfd22e7e6e9d1cbefeb0df76 so boldly abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/625890 (owner: 10Muehlenhoff)
[11:22:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 90%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15325 and previous config saved to /var/cache/conftool/dbconfig/20210414-112211-root.json
[11:22:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:21] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[11:25:06] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/625890 (owner: 10Muehlenhoff)
[11:25:31] <icinga-wm>	 RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1002 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 639 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[11:25:38] <hnowlan>	 !log regenerated certificates for restbase1019/restbase102[0-7]
[11:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096 (s5,s6) kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15326 and previous config saved to /var/cache/conftool/dbconfig/20210414-112619-marostegui.json
[11:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:02] <hnowlan>	 volans: I think it's 2 months for WARNING 
[11:27:55] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.126:7001 on restbase1025 is CRITICAL: SSL CRITICAL - Certificate restbase1025-a valid until 2021-05-14 11:10:23 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:28:07] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.16.122:7001 on restbase1024 is CRITICAL: SSL CRITICAL - Certificate restbase1024-a valid until 2021-05-14 11:10:20 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:29:33] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart
[11:29:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:53] <XioNoX>	 volans: already in https://phabricator.wikimedia.org/T225140 :)
[11:30:08] <volans>	 XioNoX: ehehe :D
[11:30:26] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.127:7001 on restbase1025 is CRITICAL: SSL CRITICAL - Certificate restbase1025-b valid until 2021-05-14 11:10:24 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:31:38] <marostegui>	 !log Upgrade kernel on db1096 (s5, s6)
[11:31:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:13] <hnowlan>	 oh, hadn't seen that ticket - having tasks for these in WARNING would be great 
[11:32:56] <hnowlan>	 part of the problem with this spam was that 9 hosts were done at the same time in the distant past, if it was just one host it wouldn't be so spammy
[11:33:53] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Migrate default nework policies (default-network-policy-conf.yaml) to GlobalNetworkPolicies - https://phabricator.wikimedia.org/T280125 (10JMeybohm)
[11:33:55] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:18] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Migrate default nework policies (default-network-policy-conf.yaml) to GlobalNetworkPolicies - https://phabricator.wikimedia.org/T280125 (10JMeybohm) p:05Triage→03Low
[11:34:43] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is CRITICAL: SSL CRITICAL - Certificate restbase1027-b valid until 2021-05-14 11:10:30 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:34:53] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.0.146:7001 on restbase1020 is CRITICAL: SSL CRITICAL - Certificate restbase1020-c valid until 2021-05-14 11:10:10 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662
[11:35:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: Repool db1096:3315 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15327 and previous config saved to /var/cache/conftool/dbconfig/20210414-113557-root.json
[11:36:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: Slowly pool db1177 for the first time in s8 T275633', diff saved to https://phabricator.wikimedia.org/P15328 and previous config saved to /var/cache/conftool/dbconfig/20210414-113714-root.json
[11:37:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:28] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[11:37:51] <wikibugs>	 (03CR) 10Muehlenhoff: P:debmonitor::client: migrate timer::job to use send_mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[11:38:45] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:52] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
[11:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:01] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (10JMeybohm) This is not exactly looking great on the staging clusters as we can see heavy throttling. The current assumption is that this is caused by the very...
[11:41:15] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart
[11:41:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15329 and previous config saved to /var/cache/conftool/dbconfig/20210414-114216-root.json
[11:42:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:52] <wikibugs>	 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10MoritzMuehlenhoff)
[11:43:54] <wikibugs>	 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff)
[11:44:35] <wikibugs>	 10SRE, 10Patch-For-Review: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Tendril and dbtree are now running on a new Buster instance dbmonitor1002.wikimedia.org with PHP 5.6 packages from sury.org (since Tendril...
[11:44:35] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.105:7001 on restbase1020 is OK: SSL OK - Certificate restbase1020-a valid until 2023-04-14 11:20:37 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[11:45:15] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.0.106:7001 on restbase1020 is OK: SSL OK - Certificate restbase1020-b valid until 2023-04-14 11:20:40 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[11:45:53] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.0.146:7001 on restbase1020 is OK: SSL OK - Certificate restbase1020-c valid until 2023-04-14 11:20:42 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[11:47:51] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Specify the old m1 master [puppet] - 10https://gerrit.wikimedia.org/r/679317 (https://phabricator.wikimedia.org/T280121)
[11:47:52] <wikibugs>	 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqiad.wmnet'] `  and were **ALL** successful.
[11:49:19] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Specify the old m1 master [puppet] - 10https://gerrit.wikimedia.org/r/679317 (https://phabricator.wikimedia.org/T280121) (owner: 10Marostegui)
[11:50:18] <wikibugs>	 (03CR) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[11:50:36] <wikibugs>	 (03CR) 10Volans: "Looks sane to me, I'd add the timeout explicitly to all requests calls, also in a different patch if you prefer." (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond)
[11:51:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: Repool db1096:3315 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15330 and previous config saved to /var/cache/conftool/dbconfig/20210414-115101-root.json
[11:51:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:41] <icinga-wm>	 RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 640 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[11:52:57] <wikibugs>	 (03PS1) 10Marostegui: tendril.sql: Remove dbmonitor1001 grants [puppet] - 10https://gerrit.wikimedia.org/r/679318 (https://phabricator.wikimedia.org/T224589)
[11:53:12] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=(wtp1034|wtp1035|wtp1036).eqiad.wmnet
[11:53:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:21] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.148:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-a valid until 2023-04-14 11:20:45 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[11:55:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] tendril.sql: Remove dbmonitor1001 grants [puppet] - 10https://gerrit.wikimedia.org/r/679318 (https://phabricator.wikimedia.org/T224589) (owner: 10Marostegui)
[11:55:19] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.0.149:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-b valid until 2023-04-14 11:20:48 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[11:55:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] P:debmonitor::client: migrate timer::job to use send_mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679293 (owner: 10Jbond)
[11:57:07] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Remove grant for dbmonitor1001 [puppet] - 10https://gerrit.wikimedia.org/r/678800 (https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff)
[11:57:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15331 and previous config saved to /var/cache/conftool/dbconfig/20210414-115720-root.json
[11:57:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:57] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.0.150:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-c valid until 2023-04-14 11:20:51 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:02:42] <wikibugs>	 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['wtp1037.eqiad.wmnet', 'wtp1038.eqiad.wmnet', 'wtp1039.eqia...
[12:03:00] <marostegui>	 !log Upgrade mysql on db1080 T279281
[12:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:10] <stashbot>	 T279281: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281
[12:04:41] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.16.115:7001 on restbase1022 is OK: SSL OK - Certificate restbase1022-b valid until 2023-04-14 11:20:56 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:06:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Repool db1096:3315 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15332 and previous config saved to /var/cache/conftool/dbconfig/20210414-120604-root.json
[12:06:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:23] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.16.116:7001 on restbase1022 is OK: SSL OK - Certificate restbase1022-c valid until 2023-04-14 11:20:59 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:07:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1025 for kernel and mysql upgrade T279281', diff saved to https://phabricator.wikimedia.org/P15333 and previous config saved to /var/cache/conftool/dbconfig/20210414-120724-marostegui.json
[12:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:51] <wikibugs>	 (03PS1) 10Jgreen: rename frmon*.frdev to just frmon* keeping a transitional legacy CNAME [dns] - 10https://gerrit.wikimedia.org/r/679319 (https://phabricator.wikimedia.org/T280034)
[12:10:53] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] rename frmon*.frdev to just frmon* keeping a transitional legacy CNAME [dns] - 10https://gerrit.wikimedia.org/r/679319 (https://phabricator.wikimedia.org/T280034) (owner: 10Jgreen)
[12:11:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add lmeintrup [puppet] - 10https://gerrit.wikimedia.org/r/679279 (https://phabricator.wikimedia.org/T279531) (owner: 10Filippo Giunchedi)
[12:12:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repool db1096:3316 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15334 and previous config saved to /var/cache/conftool/dbconfig/20210414-121223-root.json
[12:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:33] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.16.118:7001 on restbase1023 is OK: SSL OK - Certificate restbase1023-a valid until 2023-04-14 11:21:01 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:14:51] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.16.119:7001 on restbase1023 is OK: SSL OK - Certificate restbase1023-b valid until 2023-04-14 11:21:04 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:15:31] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.16.114:7001 on restbase1022 is OK: SSL OK - Certificate restbase1022-a valid until 2023-04-14 11:20:53 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:17:41] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.16.120:7001 on restbase1023 is OK: SSL OK - Certificate restbase1023-c valid until 2023-04-14 11:21:07 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:21:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repool db1096:3315 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15335 and previous config saved to /var/cache/conftool/dbconfig/20210414-122108-root.json
[12:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:59] <wikibugs>	 (03PS1) 10Gehel: WDQS: Wait for updater to catchup during data transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108)
[12:24:52] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.16.123:7001 on restbase1024 is OK: SSL OK - Certificate restbase1024-b valid until 2023-04-14 11:21:12 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:26:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WDQS: Wait for updater to catchup during data transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) (owner: 10Gehel)
[12:26:42] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.16.124:7001 on restbase1024 is OK: SSL OK - Certificate restbase1024-c valid until 2023-04-14 11:21:14 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:26:56] <wikibugs>	 (03PS2) 10Gehel: WDQS: Wait for updater to catchup during data transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108)
[12:27:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repool db1096:3316 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15336 and previous config saved to /var/cache/conftool/dbconfig/20210414-122727-root.json
[12:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:50] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Chatted on IRC, for wmf ldap membership only we're ok to go ahead" [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073) (owner: 10Filippo Giunchedi)
[12:28:24] <wikibugs>	 (03PS2) 10Filippo Giunchedi: admin: add hnordeen [puppet] - 10https://gerrit.wikimedia.org/r/679280 (https://phabricator.wikimedia.org/T280073)
[12:29:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WDQS: Wait for updater to catchup during data transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) (owner: 10Gehel)
[12:30:42] <wikibugs>	 (03PS3) 10Gehel: WDQS: Wait for updater to catchup during data transfer. [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108)
[12:30:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1037.eqiad.wmnet with reason: REIMAGE
[12:30:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:38] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1038.eqiad.wmnet with reason: REIMAGE
[12:31:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:50] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1037.eqiad.wmnet with reason: REIMAGE
[12:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:52] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.126:7001 on restbase1025 is OK: SSL OK - Certificate restbase1025-a valid until 2023-04-14 11:21:17 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:33:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15337 and previous config saved to /var/cache/conftool/dbconfig/20210414-123357-root.json
[12:34:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:42] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1038.eqiad.wmnet with reason: REIMAGE
[12:34:44] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wtp1039.eqiad.wmnet with reason: REIMAGE
[12:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:33] <wikibugs>	 (03PS1) 10Patriccck: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589)
[12:36:44] <wikibugs>	 (03PS2) 10Jbond: debmonitor-client: Improve retry logic [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275
[12:36:48] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wtp1039.eqiad.wmnet with reason: REIMAGE
[12:36:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:34] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi User added to `wmf` group (chatted on IRC with @jbond), @HNordeenWMF you should have access now!
[12:38:22] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "I think that Keith is working on the new cluster, will follow up later on :)" [homer/public] - 10https://gerrit.wikimedia.org/r/679296 (owner: 10Elukey)
[12:39:13] <elukey>	 !log update kafka term for analytics-in{4,6} on cr{1,2}-eqiad to include kafka-logging1001 - ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679296
[12:39:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:40] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.127:7001 on restbase1025 is OK: SSL OK - Certificate restbase1025-b valid until 2023-04-14 11:21:19 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:41:19] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add Lena Meintrup to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T279531 (10fgiunchedi) 05Open→03Resolved @Lena_WMDE you are now in `nda` and `wmde` groups, please verify access and reopen the task if something is amiss!
[12:42:27] <wikibugs>	 (03CR) 10Zabe: [C: 04-1] Czech Wikimedia / Powered by MediaWiki icons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck)
[12:42:29] <wikibugs>	 (03PS2) 10Patriccck: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589)
[12:42:56] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.128:7001 on restbase1025 is OK: SSL OK - Certificate restbase1025-c valid until 2023-04-14 11:21:22 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:43:57] <wikibugs>	 (03PS3) 10Patriccck: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589)
[12:44:40] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1100 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:44:41] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1100 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T280132 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:44:44] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10ops-monitoring-bot)
[12:47:32] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.16.122:7001 on restbase1024 is OK: SSL OK - Certificate restbase1024-a valid until 2023-04-14 11:21:09 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:49:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15338 and previous config saved to /var/cache/conftool/dbconfig/20210414-124901-root.json
[12:49:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:08] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.180:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-a valid until 2023-04-14 11:21:25 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:51:39] <wikibugs>	 (03PS4) 10Urbanecm: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck)
[12:51:51] <wikibugs>	 (03PS1) 10Seddon: Change HTTP to HTTPS for concept URIs on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679327 (https://phabricator.wikimedia.org/T258590)
[12:52:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck)
[12:53:26] <wikibugs>	 (03PS5) 10Urbanecm: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck)
[12:53:39] <wikibugs>	 (03PS1) 10Jbond: C:aptrepo: add gitlab repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/679328 (https://phabricator.wikimedia.org/T279545)
[12:53:42] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-a valid until 2023-04-14 11:21:33 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:54:52] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-b valid until 2023-04-14 11:21:35 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:54:54] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:09] <wikibugs>	 (03PS6) 10Urbanecm: Czech Wikimedia / Powered by MediaWiki icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck)
[12:57:16] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-c valid until 2023-04-14 11:21:38 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:57:45] <wikibugs>	 (03CR) 10Kosta Harlan: "> Patch Set 1: Code-Review+2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris)
[12:57:52] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.181:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-b valid until 2023-04-14 11:21:27 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:59:56] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-c valid until 2023-04-14 11:21:30 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[13:01:08] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:38] <godog>	 !log extend prometheus global @ codfw by 100G
[13:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:47] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
[13:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:58] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on an-worker1100 is CRITICAL: cluster=analytics device=sat+megaraid,10 instance=an-worker1100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1100&var-datasource=eqiad+prometheus/ops
[13:02:05] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart
[13:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:38] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29027/console" [puppet] - 10https://gerrit.wikimedia.org/r/679328 (https://phabricator.wikimedia.org/T279545) (owner: 10Jbond)
[13:04:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15339 and previous config saved to /var/cache/conftool/dbconfig/20210414-130404-root.json
[13:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:18] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.101:7001 on restbase1019 is OK: SSL OK - Certificate restbase1019-a valid until 2023-04-14 11:20:29 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[13:11:30] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] remove obsolete html files from snapshot manifests for dumps [puppet] - 10https://gerrit.wikimedia.org/r/678719 (https://phabricator.wikimedia.org/T279661) (owner: 10ArielGlenn)
[13:12:07] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
[13:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:49] <moritzm>	 !log installing OpenSSL updates on buster
[13:12:49] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10Ottomata) Should be fine, it'd be nice if this ticket had a little more info about who and why though!
[13:12:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:19] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "The compilation results differences are expected: https://puppet-compiler.wmflabs.org/compiler1001/718/" [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro)
[13:15:56] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] ceph: add ceph repo and parameter to all client modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677911 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro)
[13:17:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:19:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15340 and previous config saved to /var/cache/conftool/dbconfig/20210414-131908-root.json
[13:19:12] <wikibugs>	 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1038.eqiad.wmnet', 'wtp1037.eqiad.wmnet', 'wtp1039.eqiad.wmnet'] `  and were **ALL** successful.
[13:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679288 (owner: 10Alexandros Kosiaris)
[13:19:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:26:42] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:05] <logmsgbot>	 !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@825c60a]: T273847 export queries to relforge dag deployment - schedule change
[13:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:14] <stashbot>	 T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847
[13:28:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete backup of /root on the apt servers [puppet] - 10https://gerrit.wikimedia.org/r/679332
[13:29:02] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "tested and LGTM, see comment inline on py2 support" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond)
[13:29:14] <logmsgbot>	 !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@825c60a]: T273847 export queries to relforge dag deployment - schedule change (duration: 02m 08s)
[13:29:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:47] <wikibugs>	 (03PS1) 10Jbond: cfssl::multirootca: install certs script [puppet] - 10https://gerrit.wikimedia.org/r/679336
[13:34:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: Repool es1025 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15341 and previous config saved to /var/cache/conftool/dbconfig/20210414-133411-root.json
[13:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cfssl::multirootca: install certs script [puppet] - 10https://gerrit.wikimedia.org/r/679336 (owner: 10Jbond)
[13:38:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/679332 (owner: 10Muehlenhoff)
[13:39:22] <wikibugs>	 (03PS3) 10Jbond: debmonitor-client: Improve retry logic [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275
[13:39:32] <wikibugs>	 (03CR) 10Jbond: debmonitor-client: Improve retry logic (036 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond)
[13:43:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/679328 (https://phabricator.wikimedia.org/T279545) (owner: 10Jbond)
[13:43:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es5 master', diff saved to https://phabricator.wikimedia.org/P15342 and previous config saved to /var/cache/conftool/dbconfig/20210414-134331-marostegui.json
[13:43:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:51] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:aptrepo: add gitlab repo mirror [puppet] - 10https://gerrit.wikimedia.org/r/679328 (https://phabricator.wikimedia.org/T279545) (owner: 10Jbond)
[13:46:50] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[13:48:17] <rzl>	 !log disabling puppet on C:mcrouter for cert renewal
[13:48:22] <icinga-wm>	 PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:49] <wikibugs>	 10SRE, 10serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (10RLazarus) a:03RLazarus
[13:53:41] <wikibugs>	 (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/679341
[14:01:50] <icinga-wm>	 RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:04] <wikibugs>	 (03PS1) 10Jbond: reprepo: add gitlab component [puppet] - 10https://gerrit.wikimedia.org/r/679345
[14:06:26] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond)
[14:08:12] <wikibugs>	 10SRE, 10Maps, 10Packaging, 10Product-Infrastructure-Team-Backlog, 10serviceops: Packaging PostGIS 3.1 for the new Maps stack - https://phabricator.wikimedia.org/T277064 (10MSantos)
[14:08:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/679345 (owner: 10Jbond)
[14:08:34] <icinga-wm>	 PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] reprepo: add gitlab component [puppet] - 10https://gerrit.wikimedia.org/r/679345 (owner: 10Jbond)
[14:09:06] <logmsgbot>	 !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@8ae53e3]: T273847 export queries to relforge dag deployment - start date update
[14:09:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:16] <stashbot>	 T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847
[14:11:04] <moritzm>	 !log installing intel-microcode updates on Buster
[14:11:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:20] <logmsgbot>	 !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@8ae53e3]: T273847 export queries to relforge dag deployment - start date update (duration: 02m 14s)
[14:11:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:39] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: linkrecommendation: Use the main_app resources for loaddatasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/679347
[14:13:07] <rzl>	 !log mcrouter cert renewal complete, puppet re-enabled T276029
[14:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:15] <stashbot>	 T276029: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029
[14:14:00] <wikibugs>	 10SRE, 10serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (10RLazarus) 05Open→03Resolved Done -- just re-enabled puppet, so they'll get picked up over the next 30m.
[14:18:08] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "I checked and found no other uses of the fileset." [puppet] - 10https://gerrit.wikimedia.org/r/679332 (owner: 10Muehlenhoff)
[14:22:16] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for HNordeen - https://phabricator.wikimedia.org/T280073 (10HNordeenWMF) Thank you @Ottomata @jbond and @fgiunchedi !  Sorry for the lack of context -- I'm on the online fundraising team, and would like access to Turnilo for monitoring impressions on our A/B b...
[14:31:02] <icinga-wm>	 RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:47] <wikibugs>	 (03PS1) 10Ladsgroup: Disable legacy javascript global variables in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470)
[14:34:52] <wikibugs>	 (03CR) 10Hoo man: Disable legacy javascript global variables in ruwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup)
[14:35:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "do you happen to know why we need this in the first place? It would be better if we could drop entirely that exception." [puppet] - 10https://gerrit.wikimedia.org/r/679278 (owner: 10Muehlenhoff)
[14:35:49] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe) ` joe@wotan:~/Sandbox/mw-on-k8s$ kubectl get pods NAME                              READY   STATUS    RESTARTS   AGE mediawiki-test-6fb67b5f8b-...
[14:38:32] <wikibugs>	 (03CR) 10Ladsgroup: Disable legacy javascript global variables in ruwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup)
[14:39:46] <wikibugs>	 (03PS1) 10Ayounsi: Merge all system.conf templates in one [homer/public] - 10https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345)
[14:40:12] <wikibugs>	 (03CR) 10Hoo man: Disable legacy javascript global variables in ruwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup)
[14:43:22] <wikibugs>	 (03PS2) 10Ladsgroup: Disable legacy javascript global variables in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470)
[14:45:03] <wikibugs>	 (03PS2) 10CDanis: Add a public_cloud bit to X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/679341 (https://phabricator.wikimedia.org/T279380)
[14:48:12] <icinga-wm>	 PROBLEM - Check systemd state on debmonitor2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_nginx.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:12] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission bast1002.wikimedia.org - https://phabricator.wikimedia.org/T280110 (10wiki_willy) a:03Cmjohnson
[14:48:22] <shdubsh>	 !log O:logstash::elasticsearch7 update elasticsearch-curator to 5.8.1
[14:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:12] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on wtp1039 is CRITICAL: Host wtp1039 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:49:18] <icinga-wm>	 PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete backup of /root on the apt servers [puppet] - 10https://gerrit.wikimedia.org/r/679332 (owner: 10Muehlenhoff)
[14:54:42] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:17] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10wiki_willy) a:03Cmjohnson
[14:55:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (10wiki_willy) a:05wiki_willy→03Cmjohnson
[14:56:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] debmonitor-client: Improve retry logic [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679275 (owner: 10Jbond)
[15:00:02] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on wtp1037 is CRITICAL: Host wtp1037 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:00:44] <shdubsh>	 !log run new curator actions on codfw - T274394
[15:00:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:53] <stashbot>	 T274394: ES Curator cron jobs are not cleaned up when output no longer exists - https://phabricator.wikimedia.org/T274394
[15:01:06] <icinga-wm>	 RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:38] <wikibugs>	 (03PS1) 10Elukey: aptrepo: add component libmysql-java to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/679356 (https://phabricator.wikimedia.org/T278424)
[15:08:40] <wikibugs>	 (03CR) 10Ema: [C: 03+1] Move hue.wikimedia.org to the an-tool1009 backend [puppet] - 10https://gerrit.wikimedia.org/r/678861 (https://phabricator.wikimedia.org/T264896) (owner: 10Elukey)
[15:08:49] <wikibugs>	 (03PS1) 10Ppchelko: Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436)
[15:09:10] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on wtp1038 is CRITICAL: Host wtp1038 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:10:05] <wikibugs>	 (03PS1) 10Jbond: Drop python3 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358
[15:10:47] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: use rolemap for template stack [puppet] - 10https://gerrit.wikimedia.org/r/679359 (https://phabricator.wikimedia.org/T280083)
[15:12:59] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] pontoon: use rolemap for template stack [puppet] - 10https://gerrit.wikimedia.org/r/679359 (https://phabricator.wikimedia.org/T280083) (owner: 10Filippo Giunchedi)
[15:13:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use rolemap for template stack [puppet] - 10https://gerrit.wikimedia.org/r/679359 (https://phabricator.wikimedia.org/T280083) (owner: 10Filippo Giunchedi)
[15:14:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/679356 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey)
[15:15:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete backup of /root on the apt servers [puppet] - 10https://gerrit.wikimedia.org/r/679332 (owner: 10Muehlenhoff)
[15:15:23] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove obsolete backup of /root on the apt servers [puppet] - 10https://gerrit.wikimedia.org/r/679332
[15:15:27] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Thx, lgtm, we can probably add 3.8/9 (or just 3.9) in a later patch" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 (owner: 10Jbond)
[15:16:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] aptrepo: add component libmysql-java to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/679356 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey)
[15:18:40] <wikibugs>	 (03PS2) 10Jbond: Drop python3 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358
[15:20:27] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 (owner: 10Jbond)
[15:21:06] <wikibugs>	 (03PS3) 10Jbond: Drop python3 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358
[15:25:22] <icinga-wm>	 RECOVERY - AQS root url on aqs1010 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:25:32] <icinga-wm>	 RECOVERY - AQS root url on aqs1011 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:26:22] <icinga-wm>	 RECOVERY - AQS root url on aqs1012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:27:24] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki/httpd: adapt to kubernetes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/679362
[15:28:29] <elukey>	 new nodes --^
[15:29:10] <icinga-wm>	 RECOVERY - AQS root url on aqs1014 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:31:06] <icinga-wm>	 RECOVERY - AQS root url on aqs1015 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:31:46] <wikibugs>	 (03PS11) 10Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327)
[15:32:55] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10ema) Apparently we do [[https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=97&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3065&from...
[15:32:57] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/679278 (owner: 10Muehlenhoff)
[15:33:54] <wikibugs>	 (03PS1) 10Ema: cache_upload: set nuke_limit to 1000 [puppet] - 10https://gerrit.wikimedia.org/r/679364 (https://phabricator.wikimedia.org/T275809)
[15:33:58] <wikibugs>	 (03PS12) 10Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327)
[15:34:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Helm chart to run MediaWiki (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto)
[15:35:04] <wikibugs>	 (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/679364 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema)
[15:36:12] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: gridengine: set grid-configurator source files to use new domain name [puppet] - 10https://gerrit.wikimedia.org/r/678043 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm)
[15:36:14] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653)
[15:37:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez)
[15:38:27] <wikibugs>	 (03PS1) 10Volans: netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367
[15:39:10] <wikibugs>	 (03PS1) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424)
[15:40:01] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe)
[15:40:46] <wikibugs>	 (03CR) 10Volans: "FYI" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) (owner: 10Gehel)
[15:40:47] <wikibugs>	 10SRE, 10netops: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 (10jbond) proposal seems fine to me however it would put these routes above PEER_INTERNAL which is probably fine but feels wrong  ~~That said I'm also curious why PEERING_ROUTE and PEERING_ROUTE_PRIMARY hav...
[15:43:06] <wikibugs>	 (03PS7) 10Cwhite: logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394)
[15:45:59] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] admin: Introduce the cluster_group concept [deployment-charts] - 10https://gerrit.wikimedia.org/r/678789 (owner: 10Alexandros Kosiaris)
[15:48:13] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov)
[15:48:38] <wikibugs>	 (03PS8) 10Cwhite: logstash: refactor how curator jobs are defined and deployed [puppet] - 10https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394)
[15:50:16] <wikibugs>	 (03CR) 10CRusnov: "LGTM thank you for this" [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans)
[15:50:25] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans)
[15:51:01] <wikibugs>	 (03PS2) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424)
[15:53:17] <wikibugs>	 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10elukey) @crusnov if you have time let's do it this week or the next!
[15:55:09] <wikibugs>	 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10crusnov) >>! In T271136#6999003, @elukey wrote: > @crusnov if you have time let's do it this week or the next!   Yes (thank you for the ping), let's do it first th...
[15:56:44] <wikibugs>	 (03PS3) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424)
[15:57:34] <wikibugs>	 (03PS4) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424)
[16:00:23] <wikibugs>	 (03PS1) 10Ottomata: refine - lowercase eventlogging legeacy table names in include/exclude regexes [puppet] - 10https://gerrit.wikimedia.org/r/679376 (https://phabricator.wikimedia.org/T273789)
[16:03:21] <wikibugs>	 (03CR) 10Muehlenhoff: bigtop::mysql_jdbc: use component/libmysql-java for buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey)
[16:04:08] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:04:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/679278 (owner: 10Muehlenhoff)
[16:06:00] <wikibugs>	 (03PS5) 10Elukey: bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424)
[16:07:27] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29039/console" [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey)
[16:09:10] <wikibugs>	 (03CR) 10Physikerwelt: [C: 03+1] Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko)
[16:10:00] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] bigtop::mysql_jdbc: use component/libmysql-java for buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey)
[16:10:09] <wikibugs>	 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff)
[16:10:36] <wikibugs>	 (03CR) 10Muehlenhoff: bigtop::mysql_jdbc: use component/libmysql-java for buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey)
[16:12:44] <wikibugs>	 (03CR) 10BryanDavis: "> Is this perhaps used by stashbot or other IRC-integrating bot? cc'ing Bryan." [puppet] - 10https://gerrit.wikimedia.org/r/679278 (owner: 10Muehlenhoff)
[16:13:35] <wikibugs>	 (03CR) 10Reedy: "Why has this suddenly become a thing? What does this add other than more complexity?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck)
[16:13:59] <wikibugs>	 (03CR) 10Patriccck: Czech Wikimedia / Powered by MediaWiki icons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679323 (https://phabricator.wikimedia.org/T279589) (owner: 10Patriccck)
[16:20:00] <wikibugs>	 (03PS1) 10Jbond: base::firewall: ass switch to use seperate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414)
[16:20:37] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "The code is not great but we'll likely do a clean up when stretch is gone, so it should be ok for the moment, but lemme know :)" [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey)
[16:21:00] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] bigtop::mysql_jdbc: use component/libmysql-java for buster [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey)
[16:21:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::firewall: ass switch to use seperate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond)
[16:21:52] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29040/console" [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond)
[16:21:58] <elukey>	 jbond42: is it "add" ? :D
[16:22:07] <elukey>	 in the  commit message :D
[16:22:09] <jbond42>	 :D lol yes just noticed that :D
[16:22:43] <wikibugs>	 (03PS2) 10Jbond: base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414)
[16:23:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond)
[16:29:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/679368 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey)
[16:29:54] <wikibugs>	 (03PS1) 10Awight: Temporarily disable some reportupdater jobs [puppet] - 10https://gerrit.wikimedia.org/r/679390 (https://phabricator.wikimedia.org/T279046)
[16:31:06] <wikibugs>	 (03PS3) 10Jbond: base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414)
[16:32:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond)
[16:32:59] <wikibugs>	 (03PS4) 10Jbond: base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414)
[16:34:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond)
[16:37:51] <wikibugs>	 (03PS5) 10Jbond: base::firewall: add switch to use separate log file [puppet] - 10https://gerrit.wikimedia.org/r/679388 (https://phabricator.wikimedia.org/T238414)
[16:40:04] <wikibugs>	 (03PS1) 10Ahmon Dancy: MWScript.php: Add purgeMessageBlobStore.php to the wikiless list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679391
[16:40:38] <wikibugs>	 (03PS1) 10Jbond: hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414)
[16:41:02] <wikibugs>	 (03PS2) 10Ahmon Dancy: MWScript.php: Add purgeMessageBlobStore.php to the wikiless list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679391 (https://phabricator.wikimedia.org/T263872)
[16:41:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond)
[16:42:13] <wikibugs>	 (03PS2) 10Jbond: hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414)
[16:42:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond)
[16:45:20] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] MWScript.php: Add purgeMessageBlobStore.php to the wikiless list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679391 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon Dancy)
[16:45:26] <wikibugs>	 (03PS3) 10Jbond: hiera - sretest: test sending ulog to separate file [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414)
[16:48:35] <wikibugs>	 (03Merged) 10jenkins-bot: MWScript.php: Add purgeMessageBlobStore.php to the wikiless list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679391 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon Dancy)
[16:53:52] <icinga-wm>	 PROBLEM - Check systemd state on debmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_nginx.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:54:13] <wikibugs>	 (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/29042/sretest1001.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/679392 (https://phabricator.wikimedia.org/T238414) (owner: 10Jbond)
[16:55:30] <icinga-wm>	 PROBLEM - Check systemd state on ldap-replica1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:45] <volans>	 jbond42: ^^ looking
[16:56:24] <volans>	 same, 502 proxy error
[16:57:26] <volans>	 and a restart worked just fine
[16:57:54] <icinga-wm>	 RECOVERY - Check systemd state on ldap-replica1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:58:26] <jbond42>	 strange that the switch to apache is causing this issue. when i checked the one for gerrit there was an error getting data from the backend
[16:58:42] <jbond42>	 i.e. apache -> uwsgi
[17:00:07] <jbond42>	 i'll see (tomorrow) if there are some settings i can put on the proxy config to improve the reliability
[17:03:22] <icinga-wm>	 RECOVERY - Check systemd state on debmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:40] <icinga-wm>	 RECOVERY - Check systemd state on debmonitor2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:32] <volans>	 jbond42: ack, thx, lmk if you want another pair of eyes, I'm not looking at it right now
[17:12:32] <wikibugs>	 (03PS1) 10Jbond: P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679399
[17:12:45] <jbond42>	 thanks volans ^^^ seems like a good first step but i'm logging off for now, will pick it back up tomorrow
[17:13:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Why not, let's test this one." [puppet] - 10https://gerrit.wikimedia.org/r/679399 (owner: 10Jbond)
[17:27:38] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] "LGTM. We can make volans' proposed change once the required patch is merged." [cookbooks] - 10https://gerrit.wikimedia.org/r/679320 (https://phabricator.wikimedia.org/T280108) (owner: 10Gehel)
[17:33:10] <wikibugs>	 (03PS1) 10Urbanecm: DatabaseMentorStore: Cache mentor in memcached [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679003 (https://phabricator.wikimedia.org/T279959)
[17:33:53] <wikibugs>	 10SRE, 10DC-Ops, 10SRE-tools, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10Volans) @LSobanski  I'll try to give you some context from the SRE I/F team side of...
[17:34:50] <wikibugs>	 (03CR) 10Volans: [C: 03+2] netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans)
[17:39:04] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/679403 (owner: 10CRusnov)
[17:39:06] <Urbanecm>	 thcipriani: hello, what happened with wmf.2 at T280157 please? 🙂
[17:39:06] <stashbot>	 T280157: 1.37.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T280157
[17:39:23] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "backporting" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679003 (https://phabricator.wikimedia.org/T279959) (owner: 10Urbanecm)
[17:39:37] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "in preparation for B&C window" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679002 (https://phabricator.wikimedia.org/T279957) (owner: 10Urbanecm)
[17:42:27] <wikibugs>	 10ops-eqiad, 10DC-Ops: Add eqiad airport express to Netbox - https://phabricator.wikimedia.org/T278934 (10wiki_willy) a:03Cmjohnson
[17:42:28] <thcipriani>	 Urbanecm: something got strange in the re-numbering following the branch cut, so I ended up having to make a wmf.2 so that the thing that generates the calendars would be happy :\ tl;dr: yak shaving
[17:43:42] <Urbanecm>	 thcipriani: I see. So, there won't be a train next week?
[17:43:46] <Urbanecm>	 or is it just a number skipped?
[17:44:03] <Urbanecm>	 I'm asking because I want to know when my risky patch will be deployed
[17:44:13] <wikibugs>	 10ops-eqiad, 10DC-Ops: Add eqiad airport express to Netbox - https://phabricator.wikimedia.org/T278934 (10wiki_willy) Hi @Cmjohnson - can you add the Airport Express access point into Netbox?  It should be next to (or near) the management router.  Thanks, Willy
[17:44:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans)
[17:44:56] <thcipriani>	 Urbanecm: there will not be a train next week. Sent the email to wikitech-l this morning. Date was on https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar (although I missed that until this morning as well). Your patch will go out the week after if it's in the mainline development branch now.
[17:46:33] <Urbanecm>	 got it, thanks a lot
[17:50:29] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans)
[17:56:36] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/GrowthExperiments/: ce44792: 84107c5: GrowthExperiments backports related to DatabaseMentorStore (T279957; T279959) (duration: 01m 55s)
[17:56:39] * Urbanecm done
[17:56:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:49] <stashbot>	 T279957: DatabaseMentorStore::setMentorForUser needs to be safe to call on GET requests - https://phabricator.wikimedia.org/T279957
[17:56:49] <stashbot>	 T279959: Cache mentor/mentee relationship in memcached - https://phabricator.wikimedia.org/T279959
[17:58:50] <wikibugs>	 (03Merged) 10jenkins-bot: netbox: improve as_dict() [software/spicerack] - 10https://gerrit.wikimedia.org/r/679367 (owner: 10Volans)
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for [[Backport windows|Morning backport window]]. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1800).
[18:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:00:04] <jouncebot>	 longma and marxarelli: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1800).
[18:02:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 31089112 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:04:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:16:29] <wikibugs>	 (CR) Volans: [C: +1] "Looks good." (1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/679403 (owner: CRusnov)
[18:23:51] <wikibugs>	 (PS1) Herron: kafka-logging: migrate broker logstash1011 to kafka-logging1002 [puppet] - https://gerrit.wikimedia.org/r/679411 (https://phabricator.wikimedia.org/T279342)
[18:30:03] <wikibugs>	 (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1001/29045/" [puppet] - https://gerrit.wikimedia.org/r/679411 (https://phabricator.wikimedia.org/T279342) (owner: Herron)
[18:34:38] <wikibugs>	 (CR) Dzahn: [C: +1] "yes, thanks!" [puppet] - https://gerrit.wikimedia.org/r/679279 (https://phabricator.wikimedia.org/T279531) (owner: Filippo Giunchedi)
[18:39:48] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) >>! In T279100#6997273, @akosiaris wrote: > We seem to only have 1 dedicated videoscaler in c...
[18:41:14] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) Alex, your patch looks good but i can also see Effie's point. hmm...
[18:43:51] <wikibugs>	 (PS1) CDanis: Add es_exporter config for NEL events [puppet] - https://gerrit.wikimedia.org/r/679417 (https://phabricator.wikimedia.org/T257527)
[18:49:28] <wikibugs>	 SRE, RESTBase, Traffic, Page-Previews (Tracking), and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534 (Jdlrobson)
[18:51:04] <wikibugs>	 (CR) CDanis: "I *think* this is correct, but haven't actually written one of these before -- please let me know :)" [puppet] - https://gerrit.wikimedia.org/r/679417 (https://phabricator.wikimedia.org/T257527) (owner: CDanis)
[18:55:52] <icinga-wm>	 PROBLEM - Disk space on urldownloader1002 is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=87%): /tmp 340 MB (3% inode=87%): /var/tmp 340 MB (3% inode=87%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader1002&var-datasource=eqiad+prometheus/ops
[18:56:28] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!  I ran the query and the output looks good." [puppet] - https://gerrit.wikimedia.org/r/679417 (https://phabricator.wikimedia.org/T257527) (owner: CDanis)
[18:57:57] <wikibugs>	 SRE, Discovery-Search: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (TJones) I've moved this ticket back to "needs triage" so we can discuss it again in light of the recent problems with  T274200, and decide if we should make it more of a priority, an...
[18:58:38] <mutante>	 !log urldownloader1002 - icinga alerted about disk space, ran 'apt-get clean' which is my usual go to in that case. it reduced usage from 97% to 89%
[18:58:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 longma and marxarelli: (Dis)respected human, time to deploy Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T1900). Please do the needful.
[19:00:42] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Legoktm) >>! In T279100#6997472, @akosiaris wrote: >>>! In T279100#6997312, @jijiki wrote: >> I thin...
[19:01:46] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Dzahn) I like Lego's summary.
[19:02:19] <wikibugs>	 (PS1) Jeena Huneidi: group1 wikis to 1.37.0-wmf.1  refs T278345 [mediawiki-config] - https://gerrit.wikimedia.org/r/679421
[19:02:21] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] group1 wikis to 1.37.0-wmf.1  refs T278345 [mediawiki-config] - https://gerrit.wikimedia.org/r/679421 (owner: Jeena Huneidi)
[19:03:10] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.37.0-wmf.1  refs T278345 [mediawiki-config] - https://gerrit.wikimedia.org/r/679421 (owner: Jeena Huneidi)
[19:04:09] <wikibugs>	 (CR) CDanis: [C: +2] Add es_exporter config for NEL events [puppet] - https://gerrit.wikimedia.org/r/679417 (https://phabricator.wikimedia.org/T257527) (owner: CDanis)
[19:04:41] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.1  refs T278345
[19:04:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:49] <stashbot>	 T278345: 1.37.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T278345
[19:06:45] <logmsgbot>	 !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.1  refs T278345 (duration: 02m 03s)
[19:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (Dzahn)
[19:08:22] <marxarelli>	 longma: o/ hey hey
[19:08:26] <wikibugs>	 SRE, Release-Engineering-Team, SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (Dzahn)
[19:08:34] * marxarelli is log watching
[19:09:15] <longma>	 marxarelli: do you know anything about errors depooling the servers when running train?
[19:09:38] <marxarelli>	 i don't
[19:09:47] <marxarelli>	 did scap throw an error?
[19:09:58] <wikibugs>	 SRE, Release-Engineering-Team, SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (Dzahn)
[19:10:03] <wikibugs>	 SRE, Release-Engineering-Team, SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (Dzahn) Hi @HMonroy,  slightly renamed the ticket, confirmed you already signed L3 and added releng for deployment approval.  others will continue with this soon,...
[19:10:55] <longma>	 well it said success at the end, but it shows some errors depooling some servers, I think because they are disabled instead of enabled. So maybe it's fine?
[19:11:49] <mutante>	 hello! which servers do you see there having issues?
[19:11:52] <mutante>	 wtp* ?
[19:12:14] <mutante>	 or mw*
[19:12:57] <longma>	 jobrunner_443 and videoscaler_443
[19:13:31] <mutante>	 ah, do you see any "mw" host name in there?
[19:13:57] <mutante>	 might be the special ones we defined as "jobrunner only but not videoscaler"
[19:14:14] <mutante>	 because they are depooled from videoscaler pool
[19:14:40] <longma>	 yeah it also says not restarting php7.2-fpm 100 on mw1338
[19:14:45] <mutante>	 mw1335 and mw1336 
[19:14:55] <mutante>	 mw1337 and mw1338 
[19:15:15] <mutante>	     mw1334.eqiad.wmnet: [apache2,nginx]
[19:15:16] <mutante>	     mw1335.eqiad.wmnet: [apache2,nginx] # Only pooled as videoscaler
[19:15:18] <mutante>	     mw1336.eqiad.wmnet: [apache2,nginx] # Only pooled as videoscaler
[19:15:21] <mutante>	     mw1337.eqiad.wmnet: [apache2,nginx] # Only pooled as jobrunner
[19:15:24] <mutante>	     mw1338.eqiad.wmnet: [apache2,nginx] # Only pooled as jobrunner
[19:15:26] <mutante>	 it's this ^
[19:15:55] <mutante>	 if it doesn't actually break anything for you then it's just noise, but we can still think about how to remove that
[19:16:21] <longma>	 so I assume the "1 hosts had failures restarting php-fpm" is mw1338, but it said it wasn't restarting because of "free opcache 362MB Fragmentation is at 33%, nothing to do here"
[19:16:28] <longma>	 okay
[19:16:37] <mutante>	 ah, yea, that last part seems fine
[19:16:56] <longma>	 alright, thanks for the help!
[19:17:14] <icinga-wm>	 RECOVERY - Disk space on urldownloader1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader1002&var-datasource=eqiad+prometheus/ops
[19:17:18] <mutante>	 yw
[19:17:51] <mutante>	 probably it should not call it a 'failure' if it just had nothing to do, ack
[19:18:46] <marxarelli>	 that is odd. i wonder if it's systemctl that exited non-zero?
[19:19:07] <longma>	 yeah, it was unclear to me since the "failure" message at the end didn't say which host
[19:19:34] <marxarelli>	 wmf.1 logs look fairly clean otherwise, just the usual lock wait timeouts
[19:19:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:22:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:25:19] <mutante>	 marxarelli: it's this script  https://gerrit.wikimedia.org/r/c/operations/puppet/+/657398/3/modules/profile/files/mediawiki/php/php-check-and-restart.sh
[19:25:29] <mutante>	 some places there have an explicit "exit 0"
[19:25:29] <marxarelli>	 was just looking at that
[19:25:36] <mutante>	 but the "nothing to do" one does not
[19:25:39] <marxarelli>	 right, but not there
[19:25:43] <mutante>	 yea
[19:25:55] <marxarelli>	 looks like _joe_ or effie might know more
[19:26:03] <mutante>	 indeed
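The point about the missing "exit 0" can be illustrated with a minimal sketch. It does not reproduce php-check-and-restart.sh; the snippets and the NEEDS_RESTART variable are hypothetical stand-ins. It only shows the general shell rule that a script without an explicit exit returns the status of its last command, so a branch ending in a false test reports non-zero even though nothing went wrong. Whether that is what actually happened on mw1338 is not confirmed in the discussion above.

```python
import subprocess


def exit_status(script: str) -> int:
    """Run a shell snippet with `sh -c` and return its exit status."""
    result = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
    return result.returncode


# Hypothetical "nothing to do" branch: the last command is a test that
# evaluates to false, and no explicit exit follows it.
IMPLICIT = (
    'NEEDS_RESTART=no; echo "nothing to do here"; '
    '[ "$NEEDS_RESTART" = "yes" ] && echo "restarting"'
)

# The same branch with an explicit "exit 0" appended.
EXPLICIT = IMPLICIT + "; exit 0"

if __name__ == "__main__":
    print("without explicit exit:", exit_status(IMPLICIT))
    print("with explicit exit:   ", exit_status(EXPLICIT))
```

A caller that treats any non-zero status as a failure would count the first variant as a failed host, which would match the "1 hosts had failures" message if the real script behaves this way.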
[19:27:17] <mutante>	 I will mention it but if you want to comment as well, this is for https://phabricator.wikimedia.org/T279100
[19:28:03] <mutante>	 well, kind of :)
[19:28:36] <marxarelli>	 ack. thanks, mutante :)
[19:28:58] <mutante>	 it popped up because of this special case but also it's a general question about the restart script that scap runs all the time but usually wasn't a problem
[19:29:37] <marxarelli>	 longma: if you still have the scap output, it might help to post it ^
[19:29:41] <marxarelli>	 in that task that is
[19:29:50] <longma>	 yeah, should I put it on the linked task?
[19:30:01] <marxarelli>	 seems like a decent spot to me
[19:30:09] <mutante>	 yes, that would be helpful
[19:30:17] <longma>	 after I figure out how to copy in tmux 😂
[19:30:41] <marxarelli>	 haha
[19:30:44] <mutante>	 if in doubt, screenshot it :p
[19:30:51] <mutante>	 drag into phab comment
[19:30:52] <wikibugs>	 (PS1) Herron: kafka-logging1002: disable notifications during setup [puppet] - https://gerrit.wikimedia.org/r/679424
[19:31:07] <marxarelli>	 i often flail in tmux
[19:31:35] <wikibugs>	 (CR) Herron: [C: +2] kafka-logging1002: disable notifications during setup [puppet] - https://gerrit.wikimedia.org/r/679424 (owner: Herron)
[19:32:57] <wikibugs>	 (PS1) Herron: Revert "kafka-logging1001: disable icinga notifications during setup" [puppet] - https://gerrit.wikimedia.org/r/679446
[19:35:58] <wikibugs>	 (CR) Herron: [C: +2] Revert "kafka-logging1001: disable icinga notifications during setup" [puppet] - https://gerrit.wikimedia.org/r/679446 (owner: Herron)
[19:40:31] <wikibugs>	 SRE, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (jeena) Some "errors" restarting php-fpm and depooling services popped up while running the train tod...
[19:42:04] <herron>	 !log migrating kafka-logging broker logstash1011 to kafka-logging1002 T279342
[19:42:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:13] <stashbot>	 T279342: Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts - https://phabricator.wikimedia.org/T279342
[19:45:21] <wikibugs>	 (CR) Herron: [C: +2] kafka-logging: migrate broker logstash1011 to kafka-logging1002 [puppet] - https://gerrit.wikimedia.org/r/679411 (https://phabricator.wikimedia.org/T279342) (owner: Herron)
[19:47:12] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2411.codfw.wmnet,cluster=jobrunner
[19:47:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:59] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw2394.codfw.wmnet,cluster=videoscaler
[19:49:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:10] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw2395.codfw.wmnet,cluster=videoscaler
[19:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:00] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2410.codfw.wmnet,cluster=videoscaler
[19:50:05] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2411.codfw.wmnet,cluster=videoscaler
[19:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:27] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=20; selector: name=mw2411.codfw.wmnet,cluster=videoscaler
[19:51:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:33] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=20; selector: name=mw2410.codfw.wmnet,cluster=videoscaler
[19:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:28] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=20; selector: name=mw2394.codfw.wmnet,cluster=jobrunner
[19:52:33] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=20; selector: name=mw2395.codfw.wmnet,cluster=jobrunner
[19:52:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:21] <wikibugs>	 (PS1) Dzahn: conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100)
[20:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for [[mw:Services|Services]] – [[mw:Extension:Graph|Graphoid]] / [[ORES]] . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T2000).
[20:00:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: Dzahn)
[20:01:27] <wikibugs>	 (PS2) Dzahn: conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100)
[20:02:10] <wikibugs>	 (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/679432" [puppet] - https://gerrit.wikimedia.org/r/679258 (https://phabricator.wikimedia.org/T279100) (owner: Alexandros Kosiaris)
[20:02:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: Dzahn)
[20:02:55] <wikibugs>	 (PS3) Dzahn: conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100)
[20:08:09] <mutante>	 longma: thanks! looks like your screenshot is pointing out 2 separate issues (the WARNING vs the ERROR parts basically)
[20:09:54] <longma>	 Yes! I thought so too
[20:11:37] <mutante>	 *nod* should both be ignorable for deployment for now, but should have follow-ups
[20:12:15] <mutante>	 (since it's just the 2 special hosts)
[20:12:27] <longma>	 thanks for looking into it :)
[20:15:18] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1037 is CRITICAL: CRITICAL: 524 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[20:15:38] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1038 is CRITICAL: CRITICAL: 524 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[20:17:26] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on wtp1039 is CRITICAL: CRITICAL: 524 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[20:24:23] <wikibugs>	 (CR) Paladox: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/679447 (owner: Paladox)
[20:25:36] <longma>	 mutante: could the version mismatches above be from the train? or some other reason?
[20:25:38] <W13>	 ^ shouldn't that be looked at?
[20:30:48] <RhinosF1>	 longma: won't scap pull fix it? I think some of the wtp* servers had work today
[20:30:57] <RhinosF1>	 https://phabricator.wikimedia.org/T268524
[20:31:13] <longma>	 I can try it
[20:31:45] <RhinosF1>	 longma: I'm sure I've seen it mentioned before with reimaged servers
[20:32:03] <RhinosF1>	 Please be careful though
[20:32:08] <RhinosF1>	 As that's just my memory
[20:32:13] <dancy>	 haha
[20:32:46] <longma>	 💀
[20:32:58] <longma>	 could something bad happen if I do scap pull?
[20:33:58] <RhinosF1>	 Maybe see if the servers are pooled first
[20:35:19] <mutante>	 eh, back, seeing this now
[20:35:22] <bd808>	 longma: `scap pull` should be fully safe. It will just fetch the deploy server state to the server you run it on
[20:35:27] <mutante>	 that is yet another unrelated issue
[20:35:34] <mutante>	 because wtp servers are being reimaged
[20:35:38] <dancy>	 ooh.
[20:35:40] <mutante>	 scap pull should fix it, yes
[20:36:55] <wikibugs>	 SRE, LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (Aklapper) Hi and welcome, please see https://phabricator.wikimedia.org/tag/ldap-access-requests/ for required data and (for future reference) for a template link. Thanks!
[20:37:11] <mutante>	 you dont need to worry. these 3 servers are not getting any traffic
[20:37:17] <longma>	 ah okay
[20:37:22] <mutante>	 but we should still avoid the alerts
[20:38:10] <mutante>	 !log wtp1037, wtp1038, wtp1039 - scap pull
[20:38:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:52] <mutante>	 longma: scap pull should never hurt
[20:39:03] <longma>	 would scap sync-world also work?
[20:39:12] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1039 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[20:39:12] <mutante>	 I think so, just takes longer
[20:39:16] <bd808>	 *as long as the state on the deploy server is not being actively changed
[20:39:30] <mutante>	 I was about to reschedule the icinga checks but there it goes ^
[20:39:36] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 890.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:40:28] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1038 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[20:40:29] <longma>	 since I couldn't find fingerprints to confirm when I tried to log onto wtp1037
[20:40:33] <mutante>	 while the servers are being reimaged they are in pooled=inactive state. this means not being in scap "dsh" groups, so not getting deploys
[20:40:40] <mutante>	 issue is that they still alert
[20:40:47] <mutante>	 or that scap pull wasn't run manually
[20:41:16] <mutante>	 it only becomes a real problem if we never scap pull before repooling them 
[20:41:49] <mutante>	 longma: yea, fingerprint also changed because of reimaging, that is ongoing "upgrade to buster" 
[20:42:12] <mutante>	 only parsoid (wtp)
[20:42:32] <longma>	 ah okay. That was the reason I wanted to run sync world instead :P
[20:45:54] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on wtp1037 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[20:47:21] <wikibugs>	 SRE, serviceops, Parsoid (Tracking), Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (Dzahn) After mw train was deployed we get some Icinga alerts which caused worry among deployers:   ` 20:15 <+icinga-wm> PROBLEM - Ensure local MW versions match ex...
[20:47:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:47:30] <mutante>	 I left some comments about this on a ticket as well ^
[20:49:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:50:24] <RhinosF1>	 Ty mutante longma
[20:52:52] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp1037 is CRITICAL: Host wtp1037 is not in mediawiki-installation dsh group daniel_zahn T268524 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[20:52:52] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp1038 is CRITICAL: Host wtp1038 is not in mediawiki-installation dsh group daniel_zahn T268524 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[20:52:52] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp1039 is CRITICAL: Host wtp1039 is not in mediawiki-installation dsh group daniel_zahn T268524 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[20:54:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wtp[1037-1039].eqiad.wmnet with reason: reimage
[20:54:57] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wtp[1037-1039].eqiad.wmnet with reason: reimage
[20:55:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:22] <mutante>	 downtimes expired, i silenced them for 24 hours
[21:05:58] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m2 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:09:42] <wikibugs>	 (CR) Cwhite: [C: +2] "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/29033/" [puppet] - https://gerrit.wikimedia.org/r/677593 (https://phabricator.wikimedia.org/T274394) (owner: Cwhite)
[21:15:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:20:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:20:56] <DannyS712>	 in the context of trying to test something else, I made an edit via the api (https://www.mediawiki.org/w/index.php?title=User:DannyS712/sandbox&diff=4529972&oldid=4529955) that also changed the content model of the page, but the "content model change" tag was not added to the edit. I reverted the content model back to wikitext, and made another
[21:20:56] <DannyS712>	 edit from the api that again changed the content model, and the second time the tag was properly applied. Any ideas why it wasn't the first time?
[21:22:47] <DannyS712>	 oh, found it - ContentHandler does not support multiple automatic tags
[21:28:08] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:12] <icinga-wm>	 PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is CRITICAL: cpu={1,11,13,15,3,5,7,9} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops
[21:42:44] <wikibugs>	 SRE, Product-Data-Infrastructure, Epic, Goal, Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (CDanis)
[21:44:28] <legoktm>	 !log manually started debmonitor-client.service on ml-serve2004 after 502 Bad gateway error
[21:44:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:10] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:30] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[21:48:40] <wikibugs>	 (PS1) CDanis: prepend esams/knams [homer/public] - https://gerrit.wikimedia.org/r/679494
[21:50:04] <wikibugs>	 (CR) CDanis: [C: +2] prepend esams/knams [homer/public] - https://gerrit.wikimedia.org/r/679494 (owner: CDanis)
[21:52:32] <wikibugs>	 (CR) Dzahn: [C: +1] "talked to Paladox about this. He kindly provided further links below. Upstream changed it from disable to enable and then changed their mi" [puppet] - https://gerrit.wikimedia.org/r/679447 (owner: Paladox)
[21:53:32] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:56:18] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[22:04:14] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[22:05:27] <icinga-wm>	 PROBLEM - fastnetmon is alerting #page on netflow3001 is CRITICAL: CRITICAL: fastnetmon is alerting for 91.198.174.192 https://bit.ly/wmf-fastnetmon https://w.wiki/8oU
[22:05:31] <cdanis>	 indeed
[22:05:40] <rzl>	 here
[22:05:42] <legoktm>	 hi
[22:05:45] <rzl>	 not, like, surprised, but here
[22:06:19] <robh>	 here if ya need dc ops for anything
[22:06:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_80: Servers cp3060.esams.wmnet, cp3064.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:07:47] <icinga-wm>	 RECOVERY - fastnetmon is alerting #page on netflow3001 is OK: OK: no fastnetmon alerts https://bit.ly/wmf-fastnetmon https://w.wiki/8oU
[22:08:54] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[22:08:56] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:16:57] <wikibugs>	 (PS1) RLazarus: depool esams [dns] - https://gerrit.wikimedia.org/r/679502
[22:16:59] <wikibugs>	 (CR) Legoktm: [C: +1] depool esams [dns] - https://gerrit.wikimedia.org/r/679502 (owner: RLazarus)
[22:17:01] <wikibugs>	 (CR) RLazarus: [C: +2] depool esams [dns] - https://gerrit.wikimedia.org/r/679502 (owner: RLazarus)
[22:34:44] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:37:27] <legoktm>	 ^ someone/some people have submitted a lot of stacked patches
[22:39:35] <bd808>	 legoktm: looks like it was Jdlrobson 
[22:40:27] <legoktm>	 ah, I didn't really look, I assumed that the queue would eventually catch up by itself
[22:40:30] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 3.601 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:40:39] <wikibugs>	 (PS1) Ahmon Dancy: Fix error message if MWScript.php is run without arguments [mediawiki-config] - https://gerrit.wikimedia.org/r/679517
[22:41:21] <bd808>	 can the submitted together limit in gerrit be tuned per repo? I wonder if things would be any less prone to self-DOS if only like 3 patchsets were allowed to stack there.
[22:42:28] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 2.396 le 60 Legoktm esams is depooled https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:42:35] <wikibugs>	 (CR) Mstyles: rdf-streaming-updater: create helmfile.d structure (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[22:43:48] <bd808>	 it was more than just Jon though. eileen sent in a huge pile of frtech patches too
[22:44:20] <bd808>	 If we only had infinite vms for jerkins to use up I guess :/
[22:44:50] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:49:26] <wikibugs>	 (PS1) Mstyles: rdf-streaming-updater: enable HA capability [deployment-charts] - https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098)
[22:49:36] <wikibugs>	 (PS9) Mstyles: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006)
[22:49:38] <wikibugs>	 (PS2) Mstyles: rdf-streaming-updater: enable HA capability [deployment-charts] - https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098)
[22:54:24] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[22:56:58] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:57:08] <icinga-wm>	 RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate [[Backport windows|Evening backport window]] deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210414T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:16:49] <wikibugs>	 (03CR) 10Bstorm: "So it's trying to run `qconf -ss` somewhere that isn't expected. That would be querying the submit hosts. The sideeffects are basically th" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez)
[23:23:47] <Amir1>	 sorry, I'm quickly backporting https://gerrit.wikimedia.org/r/c/679350/ while we are in the window
[23:23:53] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Disable legacy javascript global variables in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup)
[23:24:39] <wikibugs>	 (03Merged) 10jenkins-bot: Disable legacy javascript global variables in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679350 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup)
[23:26:24] <Jdlrobson>	 Amir1: nice. I can keep an eye on the logs if you want to enjoy your evening :)
[23:26:56] <Amir1>	 thanks. It takes some time to propagate through cache 
[23:27:10] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:679350|Disable legacy javascript global variables in ruwiki (T72470)]] (duration: 01m 16s)
[23:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:21] <stashbot>	 T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470
[23:27:32] <Amir1>	 IIRC, we had only 700 errors for this in total from all wikis in the last 12 hours
[23:28:08] <Jdlrobson>	 Amir1: yeah, I'm just crunching the numbers now
[23:28:14] <Jdlrobson>	 remember that's 1%
[23:28:44] <Jdlrobson>	 According to https://grafana.wikimedia.org/d/000000037/mw-js-deprecate?orgId=1&viewPanel=7&refresh=1m&from=now-12h&to=now&var-Step=5min, at peak we were seeing just over 600 events in 5 mins (that's 6000 unsampled). A rate of 2000 a minute will trigger an alert.
[23:29:09] <Jdlrobson>	 we could probably get away with deploying them all tomorrow morning and monitoring it through the day
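Editor's note: the arithmetic in Jdlrobson's messages above can be sketched as below. This is only an illustration, not code from the dashboard or the alerting setup. The function names and the `one_in_n` parameter are hypothetical. The log mentions both "1%" and a 600 to 6000 scaling (a factor of 10), so the sampling factor is left as a parameter rather than assumed, and it is also an assumption that the 2000-per-minute alert threshold applies to the unsampled rate.

```python
def estimated_rate_per_minute(sampled_events: int, window_minutes: int, one_in_n: int) -> float:
    """Scale a sampled event count up to an estimated unsampled per-minute rate.

    one_in_n is the assumed sampling factor: 1 out of every one_in_n events is recorded.
    """
    return sampled_events * one_in_n / window_minutes


def would_alert(rate_per_minute: float, threshold_per_minute: float = 2000) -> bool:
    """Return True if the estimated rate reaches the alert threshold (assumed unsampled)."""
    return rate_per_minute >= threshold_per_minute


# Peak figure quoted in the log: just over 600 events in a 5 minute step.
rate_factor_10 = estimated_rate_per_minute(600, 5, 10)    # factor implied by "6000 unsampled"
rate_factor_100 = estimated_rate_per_minute(600, 5, 100)  # factor implied by "1%"

print(rate_factor_10, would_alert(rate_factor_10))
print(rate_factor_100, would_alert(rate_factor_100))
```

Under the factor of 10 the peak stays below the threshold, while a true 1% sample would put it well above, so which sampling factor actually applies matters for the decision to deploy everywhere.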
[23:30:42] <Amir1>	 oh no, lots of those are from code that iterates through the window object
[23:31:03] <Amir1>	 if you look at the smallest variable, there's a baseline for all variables
[23:31:06] <wikibugs>	 10SRE, 10Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) Ack, not missing messages !- active-active. So long as reconnect to the same hostname is expected to work within a reasonable amount of time, I guess we can close this. Requiring a publ...
[23:31:15] <Jdlrobson>	 ru.wikipedia seems quiet, but that might be a false positive since Russia should be asleep now?
[23:31:31] <wikibugs>	 (03PS1) 10Ahmon Dancy: enable delay_messageblobstore_purge feature flag in beta scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/679522 (https://phabricator.wikimedia.org/T263872)
[23:31:46] <wikibugs>	 10SRE, 10Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) 05Open→03Resolved a:03Legoktm
[23:32:07] <Amir1>	 Jdlrobson: it takes at least a couple of hours to propagate through caches, you can cross check the time of my main deployment with the alert to be sure 
[23:32:15] <Amir1>	 (we had one a couple of days ago)
[23:33:22] <Jdlrobson>	 Amir1: I'm wondering about what to do with all the scripts that don't get fixed
[23:33:45] <Jdlrobson>	 do you think there's a case to make to blank any scripts that don't get fixed within a certain time frame?
[23:33:54] <wikibugs>	 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Dzahn) @KFrancis Hi, here is another NDA request (and thanks for T279531#6995697 as well!) -- Daniel
[23:33:55] <Jdlrobson>	 if the scripts are throwing reference errors they are unusable anyway
[23:34:22] <Jdlrobson>	 and we can relatively easily get a list of user script wiki pages which are broken
[23:34:59] <wikibugs>	 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Dzahn) @Manuel Please provide [[ https://www.mediawiki.org/wiki/User:KFrancis_(WMF) | Katie ]] with your email address and it will continue from there.
[23:35:38] <wikibugs>	 (03PS1) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394)
[23:35:42] <Amir1>	 hmm, I honestly don't care. if they want to have/keep a broken script, it's their choice, their playground 
[23:36:07] <Amir1>	 we could simply turn off the logs for that if we care
[23:38:11] <Jdlrobson>	 my concern here though is these scripts generate a lot of noise, and if a script has a problem with code deprecated several years ago, the script is likely rotten to the core and probably contains other errors that are less easy to filter. It's also a bit of a privacy nightmare as these users are throwing errors on every page they visit.
[23:38:58] <wikibugs>	 (03PS2) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394)
[23:39:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[23:40:12] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[23:40:21] <wikibugs>	 (03PS3) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394)
[23:43:05] <wikibugs>	 (03PS4) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394)
[23:46:49] <wikibugs>	 (03PS5) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394)
[23:48:30] <wikibugs>	 (03PS1) 10Cwhite: logstash: clean up apifeatureusage curator job [puppet] - 10https://gerrit.wikimedia.org/r/679525 (https://phabricator.wikimedia.org/T274394)
[23:49:34] <wikibugs>	 (03PS6) 10Cwhite: logstash: provision per-datacenter apifeatureusage jobs [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394)
[23:50:14] <Amir1>	 yeah, my idea: just disable error logging on them
[23:53:14] <wikibugs>	 (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1002/29052/" [puppet] - 10https://gerrit.wikimedia.org/r/679525 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[23:57:39] <wikibugs>	 (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/29053/" [puppet] - 10https://gerrit.wikimedia.org/r/679524 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[23:58:46] <wikibugs>	 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10KFrancis) @Dzahn As soon as I have the email address, I'll forward for processing.  Thanks!
[23:59:35] <wikibugs>	 (03PS1) 10RLazarus: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/679526